Btrfs: fix assertion failure during fsync and use of stale transaction
diff mbox series

Message ID 20190910142649.19808-1-fdmanana@kernel.org
State New
Headers show
Series
  • Btrfs: fix assertion failure during fsync and use of stale transaction
Related show

Commit Message

Filipe Manana Sept. 10, 2019, 2:26 p.m. UTC
From: Filipe Manana <fdmanana@suse.com>

Sometimes when fsync'ing a file we need to log that other inodes exist and
when we need to do that we acquire a reference on the inodes and then drop
that reference using iput() after logging them.

That generally is not a problem except if we end up doing the final iput()
(dropping the last reference) on the inode and that inode has a link count
of 0, which can happen in a very short time window if the logging path
gets a reference on the inode while it's being unlinked.

In that case we end up getting the eviction callback, btrfs_evict_inode(),
invoked through the iput() call chain which needs to drop all of the
inode's items from its subvolume btree, and in order to do that, it needs
to join a transaction at the helper function evict_refill_and_join().
However because the task previously started a transaction at the fsync
handler, btrfs_sync_file(), it has current->journal_info already pointing
to a transaction handle and therefore evict_refill_and_join() will get
that transaction handle from btrfs_join_transaction(). From this point on,
two different problems can happen:

1) evict_refill_and_join() will often change the transaction handle's
   block reserve (->block_rsv) and set its ->bytes_reserved field to a
   value greater than 0. If evict_refill_and_join() never commits the
   transaction, the eviction handler ends up decreasing the reference
   count (->use_count) of the transaction handle through the call to
   btrfs_end_transaction(), and after that point we have a transaction
   handle with a NULL ->block_rsv (which is the value prior to the
   transaction join from evict_refill_and_join()) and a ->bytes_reserved
   value greater than 0. If after the eviction/iput completes the inode
   logging path hits an error or it decides that it must fallback to a
   transaction commit, the btrfs fsync handle, btrfs_sync_file(), gets a
   non-zero value from btrfs_log_dentry_safe(), and because of that
   non-zero value it tries to commit the transaction using a handle with
   a NULL ->block_rsv and a non-zero ->bytes_reserved value. This makes
   the transaction commit hit an assertion failure at
   btrfs_trans_release_metadata() because ->bytes_reserved is not zero but
   the ->block_rsv is NULL. The produced stack trace for that is like the
   following:

   [192922.917158] assertion failed: !trans->bytes_reserved, file: fs/btrfs/transaction.c, line: 816
   [192922.917553] ------------[ cut here ]------------
   [192922.917922] kernel BUG at fs/btrfs/ctree.h:3532!
   [192922.918310] invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC PTI
   [192922.918666] CPU: 2 PID: 883 Comm: fsstress Tainted: G        W         5.1.4-btrfs-next-47 #1
   [192922.919035] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.2-0-gf9626ccb91-prebuilt.qemu-project.org 04/01/2014
   [192922.919801] RIP: 0010:assfail.constprop.25+0x18/0x1a [btrfs]
   (...)
   [192922.920925] RSP: 0018:ffffaebdc8a27da8 EFLAGS: 00010286
   [192922.921315] RAX: 0000000000000051 RBX: ffff95c9c16a41c0 RCX: 0000000000000000
   [192922.921692] RDX: 0000000000000000 RSI: ffff95cab6b16838 RDI: ffff95cab6b16838
   [192922.922066] RBP: ffff95c9c16a41c0 R08: 0000000000000000 R09: 0000000000000000
   [192922.922442] R10: ffffaebdc8a27e70 R11: 0000000000000000 R12: ffff95ca731a0980
   [192922.922820] R13: 0000000000000000 R14: ffff95ca84c73338 R15: ffff95ca731a0ea8
   [192922.923200] FS:  00007f337eda4e80(0000) GS:ffff95cab6b00000(0000) knlGS:0000000000000000
   [192922.923579] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
   [192922.923948] CR2: 00007f337edad000 CR3: 00000001e00f6002 CR4: 00000000003606e0
   [192922.924329] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
   [192922.924711] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
   [192922.925105] Call Trace:
   [192922.925505]  btrfs_trans_release_metadata+0x10c/0x170 [btrfs]
   [192922.925911]  btrfs_commit_transaction+0x3e/0xaf0 [btrfs]
   [192922.926324]  btrfs_sync_file+0x44c/0x490 [btrfs]
   [192922.926731]  do_fsync+0x38/0x60
   [192922.927138]  __x64_sys_fdatasync+0x13/0x20
   [192922.927543]  do_syscall_64+0x60/0x1c0
   [192922.927939]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
   (...)
   [192922.934077] ---[ end trace f00808b12068168f ]---

2) If evict_refill_and_join() decides to commit the transaction, it will
   be able to do it, since the nested transaction join only increments the
   transaction handle's ->use_count reference counter and it does not
   prevent the transaction from getting committed. This means that after
   eviction completes, the fsync logging path will be using a transaction
   handle that refers to an already committed transaction. What happens
   when using such a stale transaction can be unpredictable, we are at
   least having a use-after-free on the transaction handle itself, since
   the transaction commit will call kmem_cache_free() against the handle
   regardless of its ->use_count value, or we can end up silently losing
   all the updates to the log tree after that iput() in the logging path,
   or using a transaction handle that in the meanwhile was allocated to
   another task for a new transaction, etc, pretty much unpredictable
   what can happen.

In order to fix both of them, instead of using iput() during logging, use
btrfs_add_delayed_iput(), so that the logging path of fsync never drops
the last reference on an inode, that step is offloaded to a safe context
(usually the cleaner kthread).

The assertion failure issue was sporadically triggered by the test case
generic/475 from fstests, which loads the dm error target while fsstress
is running, which lead to fsync failing while logging inodes with -EIO
errors and then trying later to commit the transaction, triggering the
assertion failure.

CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 fs/btrfs/tree-log.c | 16 ++++++++--------
 1 file changed, 8 insertions(+), 8 deletions(-)

Comments

Josef Bacik Sept. 11, 2019, 4:04 p.m. UTC | #1
On Tue, Sep 10, 2019 at 03:26:49PM +0100, fdmanana@kernel.org wrote:
> From: Filipe Manana <fdmanana@suse.com>
> 
> Sometimes when fsync'ing a file we need to log that other inodes exist and
> when we need to do that we acquire a reference on the inodes and then drop
> that reference using iput() after logging them.
> 
> That generally is not a problem except if we end up doing the final iput()
> (dropping the last reference) on the inode and that inode has a link count
> of 0, which can happen in a very short time window if the logging path
> gets a reference on the inode while it's being unlinked.
> 
> In that case we end up getting the eviction callback, btrfs_evict_inode(),
> invoked through the iput() call chain which needs to drop all of the
> inode's items from its subvolume btree, and in order to do that, it needs
> to join a transaction at the helper function evict_refill_and_join().
> However because the task previously started a transaction at the fsync
> handler, btrfs_sync_file(), it has current->journal_info already pointing
> to a transaction handle and therefore evict_refill_and_join() will get
> that transaction handle from btrfs_join_transaction(). From this point on,
> two different problems can happen:
> 
> 1) evict_refill_and_join() will often change the transaction handle's
>    block reserve (->block_rsv) and set its ->bytes_reserved field to a
>    value greater than 0. If evict_refill_and_join() never commits the
>    transaction, the eviction handler ends up decreasing the reference
>    count (->use_count) of the transaction handle through the call to
>    btrfs_end_transaction(), and after that point we have a transaction
>    handle with a NULL ->block_rsv (which is the value prior to the
>    transaction join from evict_refill_and_join()) and a ->bytes_reserved
>    value greater than 0. If after the eviction/iput completes the inode
>    logging path hits an error or it decides that it must fallback to a
>    transaction commit, the btrfs fsync handle, btrfs_sync_file(), gets a
>    non-zero value from btrfs_log_dentry_safe(), and because of that
>    non-zero value it tries to commit the transaction using a handle with
>    a NULL ->block_rsv and a non-zero ->bytes_reserved value. This makes
>    the transaction commit hit an assertion failure at
>    btrfs_trans_release_metadata() because ->bytes_reserved is not zero but
>    the ->block_rsv is NULL. The produced stack trace for that is like the
>    following:
> 
>    [192922.917158] assertion failed: !trans->bytes_reserved, file: fs/btrfs/transaction.c, line: 816
>    [192922.917553] ------------[ cut here ]------------
>    [192922.917922] kernel BUG at fs/btrfs/ctree.h:3532!
>    [192922.918310] invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC PTI
>    [192922.918666] CPU: 2 PID: 883 Comm: fsstress Tainted: G        W         5.1.4-btrfs-next-47 #1
>    [192922.919035] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.2-0-gf9626ccb91-prebuilt.qemu-project.org 04/01/2014
>    [192922.919801] RIP: 0010:assfail.constprop.25+0x18/0x1a [btrfs]
>    (...)
>    [192922.920925] RSP: 0018:ffffaebdc8a27da8 EFLAGS: 00010286
>    [192922.921315] RAX: 0000000000000051 RBX: ffff95c9c16a41c0 RCX: 0000000000000000
>    [192922.921692] RDX: 0000000000000000 RSI: ffff95cab6b16838 RDI: ffff95cab6b16838
>    [192922.922066] RBP: ffff95c9c16a41c0 R08: 0000000000000000 R09: 0000000000000000
>    [192922.922442] R10: ffffaebdc8a27e70 R11: 0000000000000000 R12: ffff95ca731a0980
>    [192922.922820] R13: 0000000000000000 R14: ffff95ca84c73338 R15: ffff95ca731a0ea8
>    [192922.923200] FS:  00007f337eda4e80(0000) GS:ffff95cab6b00000(0000) knlGS:0000000000000000
>    [192922.923579] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>    [192922.923948] CR2: 00007f337edad000 CR3: 00000001e00f6002 CR4: 00000000003606e0
>    [192922.924329] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>    [192922.924711] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>    [192922.925105] Call Trace:
>    [192922.925505]  btrfs_trans_release_metadata+0x10c/0x170 [btrfs]
>    [192922.925911]  btrfs_commit_transaction+0x3e/0xaf0 [btrfs]
>    [192922.926324]  btrfs_sync_file+0x44c/0x490 [btrfs]
>    [192922.926731]  do_fsync+0x38/0x60
>    [192922.927138]  __x64_sys_fdatasync+0x13/0x20
>    [192922.927543]  do_syscall_64+0x60/0x1c0
>    [192922.927939]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
>    (...)
>    [192922.934077] ---[ end trace f00808b12068168f ]---
> 
> 2) If evict_refill_and_join() decides to commit the transaction, it will
>    be able to do it, since the nested transaction join only increments the
>    transaction handle's ->use_count reference counter and it does not
>    prevent the transaction from getting committed. This means that after
>    eviction completes, the fsync logging path will be using a transaction
>    handle that refers to an already committed transaction. What happens
>    when using such a stale transaction can be unpredictable, we are at
>    least having a use-after-free on the transaction handle itself, since
>    the transaction commit will call kmem_cache_free() against the handle
>    regardless of its ->use_count value, or we can end up silently losing
>    all the updates to the log tree after that iput() in the logging path,
>    or using a transaction handle that in the meanwhile was allocated to
>    another task for a new transaction, etc, pretty much unpredictable
>    what can happen.
> 
> In order to fix both of them, instead of using iput() during logging, use
> btrfs_add_delayed_iput(), so that the logging path of fsync never drops
> the last reference on an inode, that step is offloaded to a safe context
> (usually the cleaner kthread).
> 
> The assertion failure issue was sporadically triggered by the test case
> generic/475 from fstests, which loads the dm error target while fsstress
> is running, which lead to fsync failing while logging inodes with -EIO
> errors and then trying later to commit the transaction, triggering the
> assertion failure.
> 
> CC: stable@vger.kernel.org # 4.4+
> Signed-off-by: Filipe Manana <fdmanana@suse.com>

Reviewed-by: Josef Bacik <josef@toxicpanda.com>

Thanks,

Josef
Sasha Levin Sept. 12, 2019, 7:30 a.m. UTC | #2
Hi,

[This is an automated email]

This commit has been processed because it contains a -stable tag.
The stable tag indicates that it's relevant for the following trees: 4.4+

The bot has tested the following trees: v5.2.14, v4.19.72, v4.14.143, v4.9.192, v4.4.192.

v5.2.14: Build OK!
v4.19.72: Failed to apply! Possible dependencies:
    6b5fc433a7ad ("Btrfs: fix fsync after succession of renames of different files")
    a3baaf0d786e ("Btrfs: fix fsync after succession of renames and unlink/rmdir")
    b8aa330d2acb ("Btrfs: improve performance on fsync of files with multiple hardlinks")

v4.14.143: Failed to apply! Possible dependencies:
    0d836392cadd ("Btrfs: fix mount failure after fsync due to hard link recreation")
    1f250e929a9c ("Btrfs: fix log replay failure after unlink and link combination")
    6b5fc433a7ad ("Btrfs: fix fsync after succession of renames of different files")
    8d9e220ca084 ("btrfs: simplify IS_ERR/PTR_ERR checks")
    a3baaf0d786e ("Btrfs: fix fsync after succession of renames and unlink/rmdir")
    b8aa330d2acb ("Btrfs: improve performance on fsync of files with multiple hardlinks")

v4.9.192: Failed to apply! Possible dependencies:
    0b246afa62b0 ("btrfs: root->fs_info cleanup, add fs_info convenience variables")
    0d836392cadd ("Btrfs: fix mount failure after fsync due to hard link recreation")
    1f250e929a9c ("Btrfs: fix log replay failure after unlink and link combination")
    4791c8f19c45 ("btrfs: Make btrfs_check_ref_name_override take btrfs_inode")
    6b5fc433a7ad ("Btrfs: fix fsync after succession of renames of different files")
    a3baaf0d786e ("Btrfs: fix fsync after succession of renames and unlink/rmdir")
    cf8cddd38bab ("btrfs: don't abuse REQ_OP_* flags for btrfs_map_block")
    da17066c4047 ("btrfs: pull node/sector/stripe sizes out of root and into fs_info")
    db0a669fb002 ("btrfs: Make btrfs_add_link take btrfs_inode")
    de143792253e ("btrfs: struct btrfsic_state->root should be an fs_info")
    fb456252d3d9 ("btrfs: root->fs_info cleanup, use fs_info->dev_root everywhere")

v4.4.192: Failed to apply! Possible dependencies:
    0132761017e0 ("btrfs: fix string and comment grammatical issues and typos")
    09cbfeaf1a5a ("mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros")
    0b246afa62b0 ("btrfs: root->fs_info cleanup, add fs_info convenience variables")
    0d836392cadd ("Btrfs: fix mount failure after fsync due to hard link recreation")
    0e749e54244e ("dax: increase granularity of dax_clear_blocks() operations")
    1f250e929a9c ("Btrfs: fix log replay failure after unlink and link combination")
    44f714dae50a ("Btrfs: improve performance on fsync against new inode after rename/unlink")
    4791c8f19c45 ("btrfs: Make btrfs_check_ref_name_override take btrfs_inode")
    52db400fcd50 ("pmem, dax: clean up clear_pmem()")
    6b5fc433a7ad ("Btrfs: fix fsync after succession of renames of different files")
    781feef7e6be ("Btrfs: fix lockdep warning about log_mutex")
    a3baaf0d786e ("Btrfs: fix fsync after succession of renames and unlink/rmdir")
    b2e0d1625e19 ("dax: fix lifetime of in-kernel dax mappings with dax_map_atomic()")
    bb7ab3b92e46 ("btrfs: Fix misspellings in comments.")
    cf8cddd38bab ("btrfs: don't abuse REQ_OP_* flags for btrfs_map_block")
    d1a5f2b4d8a1 ("block: use DAX for partition table reads")
    db0a669fb002 ("btrfs: Make btrfs_add_link take btrfs_inode")
    de143792253e ("btrfs: struct btrfsic_state->root should be an fs_info")


NOTE: The patch will not be queued to stable trees until it is upstream.

How should we proceed with this patch?

--
Thanks,
Sasha
Filipe Manana Sept. 12, 2019, 8:02 a.m. UTC | #3
On Thu, Sep 12, 2019 at 8:32 AM Sasha Levin <sashal@kernel.org> wrote:
>
> Hi,
>
> [This is an automated email]
>
> This commit has been processed because it contains a -stable tag.
> The stable tag indicates that it's relevant for the following trees: 4.4+
>
> The bot has tested the following trees: v5.2.14, v4.19.72, v4.14.143, v4.9.192, v4.4.192.
>
> v5.2.14: Build OK!
> v4.19.72: Failed to apply! Possible dependencies:
>     6b5fc433a7ad ("Btrfs: fix fsync after succession of renames of different files")
>     a3baaf0d786e ("Btrfs: fix fsync after succession of renames and unlink/rmdir")
>     b8aa330d2acb ("Btrfs: improve performance on fsync of files with multiple hardlinks")
>
> v4.14.143: Failed to apply! Possible dependencies:
>     0d836392cadd ("Btrfs: fix mount failure after fsync due to hard link recreation")
>     1f250e929a9c ("Btrfs: fix log replay failure after unlink and link combination")
>     6b5fc433a7ad ("Btrfs: fix fsync after succession of renames of different files")
>     8d9e220ca084 ("btrfs: simplify IS_ERR/PTR_ERR checks")
>     a3baaf0d786e ("Btrfs: fix fsync after succession of renames and unlink/rmdir")
>     b8aa330d2acb ("Btrfs: improve performance on fsync of files with multiple hardlinks")
>
> v4.9.192: Failed to apply! Possible dependencies:
>     0b246afa62b0 ("btrfs: root->fs_info cleanup, add fs_info convenience variables")
>     0d836392cadd ("Btrfs: fix mount failure after fsync due to hard link recreation")
>     1f250e929a9c ("Btrfs: fix log replay failure after unlink and link combination")
>     4791c8f19c45 ("btrfs: Make btrfs_check_ref_name_override take btrfs_inode")
>     6b5fc433a7ad ("Btrfs: fix fsync after succession of renames of different files")
>     a3baaf0d786e ("Btrfs: fix fsync after succession of renames and unlink/rmdir")
>     cf8cddd38bab ("btrfs: don't abuse REQ_OP_* flags for btrfs_map_block")
>     da17066c4047 ("btrfs: pull node/sector/stripe sizes out of root and into fs_info")
>     db0a669fb002 ("btrfs: Make btrfs_add_link take btrfs_inode")
>     de143792253e ("btrfs: struct btrfsic_state->root should be an fs_info")
>     fb456252d3d9 ("btrfs: root->fs_info cleanup, use fs_info->dev_root everywhere")
>
> v4.4.192: Failed to apply! Possible dependencies:
>     0132761017e0 ("btrfs: fix string and comment grammatical issues and typos")
>     09cbfeaf1a5a ("mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros")
>     0b246afa62b0 ("btrfs: root->fs_info cleanup, add fs_info convenience variables")
>     0d836392cadd ("Btrfs: fix mount failure after fsync due to hard link recreation")
>     0e749e54244e ("dax: increase granularity of dax_clear_blocks() operations")
>     1f250e929a9c ("Btrfs: fix log replay failure after unlink and link combination")
>     44f714dae50a ("Btrfs: improve performance on fsync against new inode after rename/unlink")
>     4791c8f19c45 ("btrfs: Make btrfs_check_ref_name_override take btrfs_inode")
>     52db400fcd50 ("pmem, dax: clean up clear_pmem()")
>     6b5fc433a7ad ("Btrfs: fix fsync after succession of renames of different files")
>     781feef7e6be ("Btrfs: fix lockdep warning about log_mutex")
>     a3baaf0d786e ("Btrfs: fix fsync after succession of renames and unlink/rmdir")
>     b2e0d1625e19 ("dax: fix lifetime of in-kernel dax mappings with dax_map_atomic()")
>     bb7ab3b92e46 ("btrfs: Fix misspellings in comments.")
>     cf8cddd38bab ("btrfs: don't abuse REQ_OP_* flags for btrfs_map_block")
>     d1a5f2b4d8a1 ("block: use DAX for partition table reads")
>     db0a669fb002 ("btrfs: Make btrfs_add_link take btrfs_inode")
>     de143792253e ("btrfs: struct btrfsic_state->root should be an fs_info")
>
>
> NOTE: The patch will not be queued to stable trees until it is upstream.
>
> How should we proceed with this patch?

After it lands on Linus' tree, I'll try to send patch versions that
apply to different kernel releases.

Thanks.

>
> --
> Thanks,
> Sasha
Nikolay Borisov Sept. 12, 2019, 11:10 a.m. UTC | #4
On 10.09.19 г. 17:26 ч., fdmanana@kernel.org wrote:
> From: Filipe Manana <fdmanana@suse.com>
> 
> Sometimes when fsync'ing a file we need to log that other inodes exist and
> when we need to do that we acquire a reference on the inodes and then drop
> that reference using iput() after logging them.
> 
> That generally is not a problem except if we end up doing the final iput()
> (dropping the last reference) on the inode and that inode has a link count
> of 0, which can happen in a very short time window if the logging path
> gets a reference on the inode while it's being unlinked.
> 
> In that case we end up getting the eviction callback, btrfs_evict_inode(),
> invoked through the iput() call chain which needs to drop all of the
> inode's items from its subvolume btree, and in order to do that, it needs
> to join a transaction at the helper function evict_refill_and_join().
> However because the task previously started a transaction at the fsync
> handler, btrfs_sync_file(), it has current->journal_info already pointing
> to a transaction handle and therefore evict_refill_and_join() will get
> that transaction handle from btrfs_join_transaction(). From this point on,
> two different problems can happen:
> 
> 1) evict_refill_and_join() will often change the transaction handle's
>    block reserve (->block_rsv) and set its ->bytes_reserved field to a
>    value greater than 0. If evict_refill_and_join() never commits the
>    transaction, the eviction handler ends up decreasing the reference
>    count (->use_count) of the transaction handle through the call to
>    btrfs_end_transaction(), and after that point we have a transaction
>    handle with a NULL ->block_rsv (which is the value prior to the
>    transaction join from evict_refill_and_join()) and a ->bytes_reserved
>    value greater than 0. If after the eviction/iput completes the inode
>    logging path hits an error or it decides that it must fallback to a
>    transaction commit, the btrfs fsync handle, btrfs_sync_file(), gets a
>    non-zero value from btrfs_log_dentry_safe(), and because of that
>    non-zero value it tries to commit the transaction using a handle with
>    a NULL ->block_rsv and a non-zero ->bytes_reserved value. This makes
>    the transaction commit hit an assertion failure at
>    btrfs_trans_release_metadata() because ->bytes_reserved is not zero but
>    the ->block_rsv is NULL. The produced stack trace for that is like the
>    following:
> 
>    [192922.917158] assertion failed: !trans->bytes_reserved, file: fs/btrfs/transaction.c, line: 816
>    [192922.917553] ------------[ cut here ]------------
>    [192922.917922] kernel BUG at fs/btrfs/ctree.h:3532!
>    [192922.918310] invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC PTI
>    [192922.918666] CPU: 2 PID: 883 Comm: fsstress Tainted: G        W         5.1.4-btrfs-next-47 #1
>    [192922.919035] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.2-0-gf9626ccb91-prebuilt.qemu-project.org 04/01/2014
>    [192922.919801] RIP: 0010:assfail.constprop.25+0x18/0x1a [btrfs]
>    (...)
>    [192922.920925] RSP: 0018:ffffaebdc8a27da8 EFLAGS: 00010286
>    [192922.921315] RAX: 0000000000000051 RBX: ffff95c9c16a41c0 RCX: 0000000000000000
>    [192922.921692] RDX: 0000000000000000 RSI: ffff95cab6b16838 RDI: ffff95cab6b16838
>    [192922.922066] RBP: ffff95c9c16a41c0 R08: 0000000000000000 R09: 0000000000000000
>    [192922.922442] R10: ffffaebdc8a27e70 R11: 0000000000000000 R12: ffff95ca731a0980
>    [192922.922820] R13: 0000000000000000 R14: ffff95ca84c73338 R15: ffff95ca731a0ea8
>    [192922.923200] FS:  00007f337eda4e80(0000) GS:ffff95cab6b00000(0000) knlGS:0000000000000000
>    [192922.923579] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>    [192922.923948] CR2: 00007f337edad000 CR3: 00000001e00f6002 CR4: 00000000003606e0
>    [192922.924329] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>    [192922.924711] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>    [192922.925105] Call Trace:
>    [192922.925505]  btrfs_trans_release_metadata+0x10c/0x170 [btrfs]
>    [192922.925911]  btrfs_commit_transaction+0x3e/0xaf0 [btrfs]
>    [192922.926324]  btrfs_sync_file+0x44c/0x490 [btrfs]
>    [192922.926731]  do_fsync+0x38/0x60
>    [192922.927138]  __x64_sys_fdatasync+0x13/0x20
>    [192922.927543]  do_syscall_64+0x60/0x1c0
>    [192922.927939]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
>    (...)
>    [192922.934077] ---[ end trace f00808b12068168f ]---
> 
> 2) If evict_refill_and_join() decides to commit the transaction, it will
evict_refill_and_join only ever calls btrfs_join_transaction so it
cannot ever commit the transaction. Let's look into its sole caller,
btrfs_evict_inode, I see 3 cases where transactions handle is involved
in that function:

1. btrfs_commit_inode_delayed_inode - iot calls trans_join +
end_transaction => use_count = 2 -> no commit

2. btrfs_truncate_inode_items -> never calls end/commit transaction

3. btrfs_evict_inode itself only ever calls btrfs_end_transaction on a
handle that's obtained via btrfs_join_transaction, meaning that trans
handle's ->use_count will be 2. This in turn will hit:

 if (refcount_read(&trans->use_count) > 1) {
                  refcount_dec(&trans->use_count);

                  trans->block_rsv = trans->orig_rsv;

                  return 0;

}

in __btrfs_end_transaction. That code is in direct contradiction with
what you describe next? Am I missing something here? If my analysis is
correct this implies case 1) above also cannot happen because
trans->block_rsv is set to NULL in __btrfs_end_transaction only if
use_count == 1.

>    be able to do it, since the nested transaction join only increments the
>    transaction handle's ->use_count reference counter and it does not
>    prevent the transaction from getting committed. This means that after
>    eviction completes, the fsync logging path will be using a transaction
>    handle that refers to an already committed transaction. What happens
>    when using such a stale transaction can be unpredictable, we are at
>    least having a use-after-free on the transaction handle itself, since
>    the transaction commit will call kmem_cache_free() against the handle
>    regardless of its ->use_count value, or we can end up silently losing
>    all the updates to the log tree after that iput() in the logging path,
>    or using a transaction handle that in the meanwhile was allocated to
>    another task for a new transaction, etc, pretty much unpredictable
>    what can happen.
> 
> In order to fix both of them, instead of using iput() during logging, use
> btrfs_add_delayed_iput(), so that the logging path of fsync never drops
> the last reference on an inode, that step is offloaded to a safe context
> (usually the cleaner kthread).
> 
> The assertion failure issue was sporadically triggered by the test case
> generic/475 from fstests, which loads the dm error target while fsstress
> is running, which lead to fsync failing while logging inodes with -EIO
> errors and then trying later to commit the transaction, triggering the
> assertion failure.
> 
> CC: stable@vger.kernel.org # 4.4+
> Signed-off-by: Filipe Manana <fdmanana@suse.com>
> ---
>  fs/btrfs/tree-log.c | 16 ++++++++--------
>  1 file changed, 8 insertions(+), 8 deletions(-)
> 
> diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
> index 6c8297bcfeb7..1bfd7e34f31e 100644
> --- a/fs/btrfs/tree-log.c
> +++ b/fs/btrfs/tree-log.c
> @@ -4985,7 +4985,7 @@ static int log_conflicting_inodes(struct btrfs_trans_handle *trans,
>  						      BTRFS_I(inode),
>  						      LOG_OTHER_INODE_ALL,
>  						      0, LLONG_MAX, ctx);
> -					iput(inode);
> +					btrfs_add_delayed_iput(inode);
>  				}
>  			}
>  			continue;
> @@ -5000,7 +5000,7 @@ static int log_conflicting_inodes(struct btrfs_trans_handle *trans,
>  		ret = btrfs_log_inode(trans, root, BTRFS_I(inode),
>  				      LOG_OTHER_INODE, 0, LLONG_MAX, ctx);
>  		if (ret) {
> -			iput(inode);
> +			btrfs_add_delayed_iput(inode);
>  			continue;
>  		}
>  
> @@ -5009,7 +5009,7 @@ static int log_conflicting_inodes(struct btrfs_trans_handle *trans,
>  		key.offset = 0;
>  		ret = btrfs_search_slot(NULL, root, &key, path, 0, 0);
>  		if (ret < 0) {
> -			iput(inode);
> +			btrfs_add_delayed_iput(inode);
>  			continue;
>  		}
>  
> @@ -5056,7 +5056,7 @@ static int log_conflicting_inodes(struct btrfs_trans_handle *trans,
>  			}
>  			path->slots[0]++;
>  		}
> -		iput(inode);
> +		btrfs_add_delayed_iput(inode);
>  	}
>  
>  	return ret;
> @@ -5689,7 +5689,7 @@ static int log_new_dir_dentries(struct btrfs_trans_handle *trans,
>  			}
>  
>  			if (btrfs_inode_in_log(BTRFS_I(di_inode), trans->transid)) {
> -				iput(di_inode);
> +				btrfs_add_delayed_iput(di_inode);
>  				break;
>  			}
>  
> @@ -5701,7 +5701,7 @@ static int log_new_dir_dentries(struct btrfs_trans_handle *trans,
>  			if (!ret &&
>  			    btrfs_must_commit_transaction(trans, BTRFS_I(di_inode)))
>  				ret = 1;
> -			iput(di_inode);
> +			btrfs_add_delayed_iput(di_inode);
>  			if (ret)
>  				goto next_dir_inode;
>  			if (ctx->log_new_dentries) {
> @@ -5848,7 +5848,7 @@ static int btrfs_log_all_parents(struct btrfs_trans_handle *trans,
>  			if (!ret && ctx && ctx->log_new_dentries)
>  				ret = log_new_dir_dentries(trans, root,
>  						   BTRFS_I(dir_inode), ctx);
> -			iput(dir_inode);
> +			btrfs_add_delayed_iput(dir_inode);
>  			if (ret)
>  				goto out;
>  		}
> @@ -5891,7 +5891,7 @@ static int log_new_ancestors(struct btrfs_trans_handle *trans,
>  			ret = btrfs_log_inode(trans, root, BTRFS_I(inode),
>  					      LOG_INODE_EXISTS,
>  					      0, LLONG_MAX, ctx);
> -		iput(inode);
> +		btrfs_add_delayed_iput(inode);
>  		if (ret)
>  			return ret;
>  
>
Filipe Manana Sept. 12, 2019, 11:24 a.m. UTC | #5
On Thu, Sep 12, 2019 at 12:10 PM Nikolay Borisov <nborisov@suse.com> wrote:
>
>
>
> On 10.09.19 г. 17:26 ч., fdmanana@kernel.org wrote:
> > From: Filipe Manana <fdmanana@suse.com>
> >
> > Sometimes when fsync'ing a file we need to log that other inodes exist and
> > when we need to do that we acquire a reference on the inodes and then drop
> > that reference using iput() after logging them.
> >
> > That generally is not a problem except if we end up doing the final iput()
> > (dropping the last reference) on the inode and that inode has a link count
> > of 0, which can happen in a very short time window if the logging path
> > gets a reference on the inode while it's being unlinked.
> >
> > In that case we end up getting the eviction callback, btrfs_evict_inode(),
> > invoked through the iput() call chain which needs to drop all of the
> > inode's items from its subvolume btree, and in order to do that, it needs
> > to join a transaction at the helper function evict_refill_and_join().
> > However because the task previously started a transaction at the fsync
> > handler, btrfs_sync_file(), it has current->journal_info already pointing
> > to a transaction handle and therefore evict_refill_and_join() will get
> > that transaction handle from btrfs_join_transaction(). From this point on,
> > two different problems can happen:
> >
> > 1) evict_refill_and_join() will often change the transaction handle's
> >    block reserve (->block_rsv) and set its ->bytes_reserved field to a
> >    value greater than 0. If evict_refill_and_join() never commits the
> >    transaction, the eviction handler ends up decreasing the reference
> >    count (->use_count) of the transaction handle through the call to
> >    btrfs_end_transaction(), and after that point we have a transaction
> >    handle with a NULL ->block_rsv (which is the value prior to the
> >    transaction join from evict_refill_and_join()) and a ->bytes_reserved
> >    value greater than 0. If after the eviction/iput completes the inode
> >    logging path hits an error or it decides that it must fallback to a
> >    transaction commit, the btrfs fsync handle, btrfs_sync_file(), gets a
> >    non-zero value from btrfs_log_dentry_safe(), and because of that
> >    non-zero value it tries to commit the transaction using a handle with
> >    a NULL ->block_rsv and a non-zero ->bytes_reserved value. This makes
> >    the transaction commit hit an assertion failure at
> >    btrfs_trans_release_metadata() because ->bytes_reserved is not zero but
> >    the ->block_rsv is NULL. The produced stack trace for that is like the
> >    following:
> >
> >    [192922.917158] assertion failed: !trans->bytes_reserved, file: fs/btrfs/transaction.c, line: 816
> >    [192922.917553] ------------[ cut here ]------------
> >    [192922.917922] kernel BUG at fs/btrfs/ctree.h:3532!
> >    [192922.918310] invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC PTI
> >    [192922.918666] CPU: 2 PID: 883 Comm: fsstress Tainted: G        W         5.1.4-btrfs-next-47 #1
> >    [192922.919035] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.2-0-gf9626ccb91-prebuilt.qemu-project.org 04/01/2014
> >    [192922.919801] RIP: 0010:assfail.constprop.25+0x18/0x1a [btrfs]
> >    (...)
> >    [192922.920925] RSP: 0018:ffffaebdc8a27da8 EFLAGS: 00010286
> >    [192922.921315] RAX: 0000000000000051 RBX: ffff95c9c16a41c0 RCX: 0000000000000000
> >    [192922.921692] RDX: 0000000000000000 RSI: ffff95cab6b16838 RDI: ffff95cab6b16838
> >    [192922.922066] RBP: ffff95c9c16a41c0 R08: 0000000000000000 R09: 0000000000000000
> >    [192922.922442] R10: ffffaebdc8a27e70 R11: 0000000000000000 R12: ffff95ca731a0980
> >    [192922.922820] R13: 0000000000000000 R14: ffff95ca84c73338 R15: ffff95ca731a0ea8
> >    [192922.923200] FS:  00007f337eda4e80(0000) GS:ffff95cab6b00000(0000) knlGS:0000000000000000
> >    [192922.923579] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> >    [192922.923948] CR2: 00007f337edad000 CR3: 00000001e00f6002 CR4: 00000000003606e0
> >    [192922.924329] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> >    [192922.924711] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> >    [192922.925105] Call Trace:
> >    [192922.925505]  btrfs_trans_release_metadata+0x10c/0x170 [btrfs]
> >    [192922.925911]  btrfs_commit_transaction+0x3e/0xaf0 [btrfs]
> >    [192922.926324]  btrfs_sync_file+0x44c/0x490 [btrfs]
> >    [192922.926731]  do_fsync+0x38/0x60
> >    [192922.927138]  __x64_sys_fdatasync+0x13/0x20
> >    [192922.927543]  do_syscall_64+0x60/0x1c0
> >    [192922.927939]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
> >    (...)
> >    [192922.934077] ---[ end trace f00808b12068168f ]---
> >
> > 2) If evict_refill_and_join() decides to commit the transaction, it will
> evict_refill_and_join only ever calls btrfs_join_transaction so it
> cannot ever commit the transaction.

It can:

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/btrfs/inode.c?h=v5.3-rc8#n5399

>  Let's look into its sole caller,
> btrfs_evict_inode, I see 3 cases where transactions handle is involved
> in that function:
>
> 1. btrfs_commit_inode_delayed_inode - iot calls trans_join +
> end_transaction => use_count = 2 -> no commit
>
> 2. btrfs_truncate_inode_items -> never calls end/commit transaction
>
> 3. btrfs_evict_inode itself only ever calls btrfs_end_transaction on a
> handle that's obtained via btrfs_join_transaction, meaning that trans
> handle's ->use_count will be 2. This in turn will hit:
>
>  if (refcount_read(&trans->use_count) > 1) {
>                   refcount_dec(&trans->use_count);
>
>                   trans->block_rsv = trans->orig_rsv;
>
>                   return 0;
>
> }
>
> in __btrfs_end_transaction. That code is in direct contradiction with
> what you describe next?

No, it isn't.

I explicitly mention it in point 1):

"... the eviction handler ends up decreasing the reference
count (->use_count) of the transaction handle through the call to
btrfs_end_transaction(), and after that point we have a transaction
handle with a NULL ->block_rsv (which is the value prior to the
transaction join from evict_refill_and_join()) and a ->bytes_reserved
value greater than 0."


>  Am I missing something here? If my analysis is
> correct this implies case 1) above also cannot happen because
> trans->block_rsv is set to NULL in __btrfs_end_transaction only if
> use_count == 1.

Again, go read the change log and the code more carefully.

Thanks.

>
> >    be able to do it, since the nested transaction join only increments the
> >    transaction handle's ->use_count reference counter and it does not
> >    prevent the transaction from getting committed. This means that after
> >    eviction completes, the fsync logging path will be using a transaction
> >    handle that refers to an already committed transaction. What happens
> >    when using such a stale transaction can be unpredictable, we are at
> >    least having a use-after-free on the transaction handle itself, since
> >    the transaction commit will call kmem_cache_free() against the handle
> >    regardless of its ->use_count value, or we can end up silently losing
> >    all the updates to the log tree after that iput() in the logging path,
> >    or using a transaction handle that in the meanwhile was allocated to
> >    another task for a new transaction, etc, pretty much unpredictable
> >    what can happen.
> >
> > In order to fix both of them, instead of using iput() during logging, use
> > btrfs_add_delayed_iput(), so that the logging path of fsync never drops
> > the last reference on an inode, that step is offloaded to a safe context
> > (usually the cleaner kthread).
> >
> > The assertion failure issue was sporadically triggered by the test case
> > generic/475 from fstests, which loads the dm error target while fsstress
> > is running, which lead to fsync failing while logging inodes with -EIO
> > errors and then trying later to commit the transaction, triggering the
> > assertion failure.
> >
> > CC: stable@vger.kernel.org # 4.4+
> > Signed-off-by: Filipe Manana <fdmanana@suse.com>
> > ---
> >  fs/btrfs/tree-log.c | 16 ++++++++--------
> >  1 file changed, 8 insertions(+), 8 deletions(-)
> >
> > diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
> > index 6c8297bcfeb7..1bfd7e34f31e 100644
> > --- a/fs/btrfs/tree-log.c
> > +++ b/fs/btrfs/tree-log.c
> > @@ -4985,7 +4985,7 @@ static int log_conflicting_inodes(struct btrfs_trans_handle *trans,
> >                                                     BTRFS_I(inode),
> >                                                     LOG_OTHER_INODE_ALL,
> >                                                     0, LLONG_MAX, ctx);
> > -                                     iput(inode);
> > +                                     btrfs_add_delayed_iput(inode);
> >                               }
> >                       }
> >                       continue;
> > @@ -5000,7 +5000,7 @@ static int log_conflicting_inodes(struct btrfs_trans_handle *trans,
> >               ret = btrfs_log_inode(trans, root, BTRFS_I(inode),
> >                                     LOG_OTHER_INODE, 0, LLONG_MAX, ctx);
> >               if (ret) {
> > -                     iput(inode);
> > +                     btrfs_add_delayed_iput(inode);
> >                       continue;
> >               }
> >
> > @@ -5009,7 +5009,7 @@ static int log_conflicting_inodes(struct btrfs_trans_handle *trans,
> >               key.offset = 0;
> >               ret = btrfs_search_slot(NULL, root, &key, path, 0, 0);
> >               if (ret < 0) {
> > -                     iput(inode);
> > +                     btrfs_add_delayed_iput(inode);
> >                       continue;
> >               }
> >
> > @@ -5056,7 +5056,7 @@ static int log_conflicting_inodes(struct btrfs_trans_handle *trans,
> >                       }
> >                       path->slots[0]++;
> >               }
> > -             iput(inode);
> > +             btrfs_add_delayed_iput(inode);
> >       }
> >
> >       return ret;
> > @@ -5689,7 +5689,7 @@ static int log_new_dir_dentries(struct btrfs_trans_handle *trans,
> >                       }
> >
> >                       if (btrfs_inode_in_log(BTRFS_I(di_inode), trans->transid)) {
> > -                             iput(di_inode);
> > +                             btrfs_add_delayed_iput(di_inode);
> >                               break;
> >                       }
> >
> > @@ -5701,7 +5701,7 @@ static int log_new_dir_dentries(struct btrfs_trans_handle *trans,
> >                       if (!ret &&
> >                           btrfs_must_commit_transaction(trans, BTRFS_I(di_inode)))
> >                               ret = 1;
> > -                     iput(di_inode);
> > +                     btrfs_add_delayed_iput(di_inode);
> >                       if (ret)
> >                               goto next_dir_inode;
> >                       if (ctx->log_new_dentries) {
> > @@ -5848,7 +5848,7 @@ static int btrfs_log_all_parents(struct btrfs_trans_handle *trans,
> >                       if (!ret && ctx && ctx->log_new_dentries)
> >                               ret = log_new_dir_dentries(trans, root,
> >                                                  BTRFS_I(dir_inode), ctx);
> > -                     iput(dir_inode);
> > +                     btrfs_add_delayed_iput(dir_inode);
> >                       if (ret)
> >                               goto out;
> >               }
> > @@ -5891,7 +5891,7 @@ static int log_new_ancestors(struct btrfs_trans_handle *trans,
> >                       ret = btrfs_log_inode(trans, root, BTRFS_I(inode),
> >                                             LOG_INODE_EXISTS,
> >                                             0, LLONG_MAX, ctx);
> > -             iput(inode);
> > +             btrfs_add_delayed_iput(inode);
> >               if (ret)
> >                       return ret;
> >
> >
Nikolay Borisov Sept. 12, 2019, 11:35 a.m. UTC | #6
On 12.09.19 г. 14:24 ч., Filipe Manana wrote:
> On Thu, Sep 12, 2019 at 12:10 PM Nikolay Borisov <nborisov@suse.com> wrote:
>>
>>
>>
>> On 10.09.19 г. 17:26 ч., fdmanana@kernel.org wrote:
>>> From: Filipe Manana <fdmanana@suse.com>
>>>
>>> Sometimes when fsync'ing a file we need to log that other inodes exist and
>>> when we need to do that we acquire a reference on the inodes and then drop
>>> that reference using iput() after logging them.
>>>
>>> That generally is not a problem except if we end up doing the final iput()
>>> (dropping the last reference) on the inode and that inode has a link count
>>> of 0, which can happen in a very short time window if the logging path
>>> gets a reference on the inode while it's being unlinked.
>>>
>>> In that case we end up getting the eviction callback, btrfs_evict_inode(),
>>> invoked through the iput() call chain which needs to drop all of the
>>> inode's items from its subvolume btree, and in order to do that, it needs
>>> to join a transaction at the helper function evict_refill_and_join().
>>> However because the task previously started a transaction at the fsync
>>> handler, btrfs_sync_file(), it has current->journal_info already pointing
>>> to a transaction handle and therefore evict_refill_and_join() will get
>>> that transaction handle from btrfs_join_transaction(). From this point on,
>>> two different problems can happen:
>>>
>>> 1) evict_refill_and_join() will often change the transaction handle's
>>>    block reserve (->block_rsv) and set its ->bytes_reserved field to a
>>>    value greater than 0. If evict_refill_and_join() never commits the
>>>    transaction, the eviction handler ends up decreasing the reference
>>>    count (->use_count) of the transaction handle through the call to
>>>    btrfs_end_transaction(), and after that point we have a transaction
>>>    handle with a NULL ->block_rsv (which is the value prior to the
>>>    transaction join from evict_refill_and_join()) and a ->bytes_reserved
>>>    value greater than 0. If after the eviction/iput completes the inode
>>>    logging path hits an error or it decides that it must fallback to a
>>>    transaction commit, the btrfs fsync handle, btrfs_sync_file(), gets a
>>>    non-zero value from btrfs_log_dentry_safe(), and because of that
>>>    non-zero value it tries to commit the transaction using a handle with
>>>    a NULL ->block_rsv and a non-zero ->bytes_reserved value. This makes
>>>    the transaction commit hit an assertion failure at
>>>    btrfs_trans_release_metadata() because ->bytes_reserved is not zero but
>>>    the ->block_rsv is NULL. The produced stack trace for that is like the
>>>    following:
>>>
>>>    [192922.917158] assertion failed: !trans->bytes_reserved, file: fs/btrfs/transaction.c, line: 816
>>>    [192922.917553] ------------[ cut here ]------------
>>>    [192922.917922] kernel BUG at fs/btrfs/ctree.h:3532!
>>>    [192922.918310] invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC PTI
>>>    [192922.918666] CPU: 2 PID: 883 Comm: fsstress Tainted: G        W         5.1.4-btrfs-next-47 #1
>>>    [192922.919035] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.2-0-gf9626ccb91-prebuilt.qemu-project.org 04/01/2014
>>>    [192922.919801] RIP: 0010:assfail.constprop.25+0x18/0x1a [btrfs]
>>>    (...)
>>>    [192922.920925] RSP: 0018:ffffaebdc8a27da8 EFLAGS: 00010286
>>>    [192922.921315] RAX: 0000000000000051 RBX: ffff95c9c16a41c0 RCX: 0000000000000000
>>>    [192922.921692] RDX: 0000000000000000 RSI: ffff95cab6b16838 RDI: ffff95cab6b16838
>>>    [192922.922066] RBP: ffff95c9c16a41c0 R08: 0000000000000000 R09: 0000000000000000
>>>    [192922.922442] R10: ffffaebdc8a27e70 R11: 0000000000000000 R12: ffff95ca731a0980
>>>    [192922.922820] R13: 0000000000000000 R14: ffff95ca84c73338 R15: ffff95ca731a0ea8
>>>    [192922.923200] FS:  00007f337eda4e80(0000) GS:ffff95cab6b00000(0000) knlGS:0000000000000000
>>>    [192922.923579] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>>    [192922.923948] CR2: 00007f337edad000 CR3: 00000001e00f6002 CR4: 00000000003606e0
>>>    [192922.924329] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>>>    [192922.924711] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>>>    [192922.925105] Call Trace:
>>>    [192922.925505]  btrfs_trans_release_metadata+0x10c/0x170 [btrfs]
>>>    [192922.925911]  btrfs_commit_transaction+0x3e/0xaf0 [btrfs]
>>>    [192922.926324]  btrfs_sync_file+0x44c/0x490 [btrfs]
>>>    [192922.926731]  do_fsync+0x38/0x60
>>>    [192922.927138]  __x64_sys_fdatasync+0x13/0x20
>>>    [192922.927543]  do_syscall_64+0x60/0x1c0
>>>    [192922.927939]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
>>>    (...)
>>>    [192922.934077] ---[ end trace f00808b12068168f ]---
>>>
>>> 2) If evict_refill_and_join() decides to commit the transaction, it will
>> evict_refill_and_join only ever calls btrfs_join_transaction so it
>> cannot ever commit the transaction.
> 
> It can:
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/btrfs/inode.c?h=v5.3-rc8#n5399

Right, in 5.3 branch yes but in misc-next evict_and_join looks
completely different, that's where the confusion stems from. Thanks

<snip>
Josef Bacik Sept. 12, 2019, 11:58 a.m. UTC | #7
On Tue, Sep 10, 2019 at 03:26:49PM +0100, fdmanana@kernel.org wrote:
> From: Filipe Manana <fdmanana@suse.com>
> 
> Sometimes when fsync'ing a file we need to log that other inodes exist and
> when we need to do that we acquire a reference on the inodes and then drop
> that reference using iput() after logging them.
> 
> That generally is not a problem except if we end up doing the final iput()
> (dropping the last reference) on the inode and that inode has a link count
> of 0, which can happen in a very short time window if the logging path
> gets a reference on the inode while it's being unlinked.
> 
> In that case we end up getting the eviction callback, btrfs_evict_inode(),
> invoked through the iput() call chain which needs to drop all of the
> inode's items from its subvolume btree, and in order to do that, it needs
> to join a transaction at the helper function evict_refill_and_join().
> However because the task previously started a transaction at the fsync
> handler, btrfs_sync_file(), it has current->journal_info already pointing
> to a transaction handle and therefore evict_refill_and_join() will get
> that transaction handle from btrfs_join_transaction(). From this point on,
> two different problems can happen:
> 
> 1) evict_refill_and_join() will often change the transaction handle's
>    block reserve (->block_rsv) and set its ->bytes_reserved field to a
>    value greater than 0. If evict_refill_and_join() never commits the
>    transaction, the eviction handler ends up decreasing the reference
>    count (->use_count) of the transaction handle through the call to
>    btrfs_end_transaction(), and after that point we have a transaction
>    handle with a NULL ->block_rsv (which is the value prior to the
>    transaction join from evict_refill_and_join()) and a ->bytes_reserved
>    value greater than 0. If after the eviction/iput completes the inode
>    logging path hits an error or it decides that it must fallback to a
>    transaction commit, the btrfs fsync handle, btrfs_sync_file(), gets a
>    non-zero value from btrfs_log_dentry_safe(), and because of that
>    non-zero value it tries to commit the transaction using a handle with
>    a NULL ->block_rsv and a non-zero ->bytes_reserved value. This makes
>    the transaction commit hit an assertion failure at
>    btrfs_trans_release_metadata() because ->bytes_reserved is not zero but
>    the ->block_rsv is NULL. The produced stack trace for that is like the
>    following:
> 
>    [192922.917158] assertion failed: !trans->bytes_reserved, file: fs/btrfs/transaction.c, line: 816
>    [192922.917553] ------------[ cut here ]------------
>    [192922.917922] kernel BUG at fs/btrfs/ctree.h:3532!
>    [192922.918310] invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC PTI
>    [192922.918666] CPU: 2 PID: 883 Comm: fsstress Tainted: G        W         5.1.4-btrfs-next-47 #1
>    [192922.919035] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.2-0-gf9626ccb91-prebuilt.qemu-project.org 04/01/2014
>    [192922.919801] RIP: 0010:assfail.constprop.25+0x18/0x1a [btrfs]
>    (...)
>    [192922.920925] RSP: 0018:ffffaebdc8a27da8 EFLAGS: 00010286
>    [192922.921315] RAX: 0000000000000051 RBX: ffff95c9c16a41c0 RCX: 0000000000000000
>    [192922.921692] RDX: 0000000000000000 RSI: ffff95cab6b16838 RDI: ffff95cab6b16838
>    [192922.922066] RBP: ffff95c9c16a41c0 R08: 0000000000000000 R09: 0000000000000000
>    [192922.922442] R10: ffffaebdc8a27e70 R11: 0000000000000000 R12: ffff95ca731a0980
>    [192922.922820] R13: 0000000000000000 R14: ffff95ca84c73338 R15: ffff95ca731a0ea8
>    [192922.923200] FS:  00007f337eda4e80(0000) GS:ffff95cab6b00000(0000) knlGS:0000000000000000
>    [192922.923579] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>    [192922.923948] CR2: 00007f337edad000 CR3: 00000001e00f6002 CR4: 00000000003606e0
>    [192922.924329] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>    [192922.924711] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>    [192922.925105] Call Trace:
>    [192922.925505]  btrfs_trans_release_metadata+0x10c/0x170 [btrfs]
>    [192922.925911]  btrfs_commit_transaction+0x3e/0xaf0 [btrfs]
>    [192922.926324]  btrfs_sync_file+0x44c/0x490 [btrfs]
>    [192922.926731]  do_fsync+0x38/0x60
>    [192922.927138]  __x64_sys_fdatasync+0x13/0x20
>    [192922.927543]  do_syscall_64+0x60/0x1c0
>    [192922.927939]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
>    (...)
>    [192922.934077] ---[ end trace f00808b12068168f ]---
> 
> 2) If evict_refill_and_join() decides to commit the transaction, it will
>    be able to do it, since the nested transaction join only increments the
>    transaction handle's ->use_count reference counter and it does not
>    prevent the transaction from getting committed. This means that after

This brings up a good point, we should probably not allow the commit in this
case, or add an ASSERT(use_count == 1) or something, cause this would be bad.
Thanks,

Josef
Josef Bacik Sept. 12, 2019, 12:18 p.m. UTC | #8
On Tue, Sep 10, 2019 at 03:26:49PM +0100, fdmanana@kernel.org wrote:
> From: Filipe Manana <fdmanana@suse.com>
> 
> Sometimes when fsync'ing a file we need to log that other inodes exist and
> when we need to do that we acquire a reference on the inodes and then drop
> that reference using iput() after logging them.
> 
> That generally is not a problem except if we end up doing the final iput()
> (dropping the last reference) on the inode and that inode has a link count
> of 0, which can happen in a very short time window if the logging path
> gets a reference on the inode while it's being unlinked.
> 
> In that case we end up getting the eviction callback, btrfs_evict_inode(),
> invoked through the iput() call chain which needs to drop all of the
> inode's items from its subvolume btree, and in order to do that, it needs
> to join a transaction at the helper function evict_refill_and_join().
> However because the task previously started a transaction at the fsync
> handler, btrfs_sync_file(), it has current->journal_info already pointing
> to a transaction handle and therefore evict_refill_and_join() will get
> that transaction handle from btrfs_join_transaction(). From this point on,
> two different problems can happen:
> 
> 1) evict_refill_and_join() will often change the transaction handle's
>    block reserve (->block_rsv) and set its ->bytes_reserved field to a
>    value greater than 0. If evict_refill_and_join() never commits the
>    transaction, the eviction handler ends up decreasing the reference
>    count (->use_count) of the transaction handle through the call to
>    btrfs_end_transaction(), and after that point we have a transaction
>    handle with a NULL ->block_rsv (which is the value prior to the
>    transaction join from evict_refill_and_join()) and a ->bytes_reserved
>    value greater than 0. If after the eviction/iput completes the inode
>    logging path hits an error or it decides that it must fallback to a
>    transaction commit, the btrfs fsync handle, btrfs_sync_file(), gets a
>    non-zero value from btrfs_log_dentry_safe(), and because of that
>    non-zero value it tries to commit the transaction using a handle with
>    a NULL ->block_rsv and a non-zero ->bytes_reserved value. This makes
>    the transaction commit hit an assertion failure at
>    btrfs_trans_release_metadata() because ->bytes_reserved is not zero but
>    the ->block_rsv is NULL. The produced stack trace for that is like the
>    following:
> 
>    [192922.917158] assertion failed: !trans->bytes_reserved, file: fs/btrfs/transaction.c, line: 816
>    [192922.917553] ------------[ cut here ]------------
>    [192922.917922] kernel BUG at fs/btrfs/ctree.h:3532!
>    [192922.918310] invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC PTI
>    [192922.918666] CPU: 2 PID: 883 Comm: fsstress Tainted: G        W         5.1.4-btrfs-next-47 #1
>    [192922.919035] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.2-0-gf9626ccb91-prebuilt.qemu-project.org 04/01/2014
>    [192922.919801] RIP: 0010:assfail.constprop.25+0x18/0x1a [btrfs]
>    (...)
>    [192922.920925] RSP: 0018:ffffaebdc8a27da8 EFLAGS: 00010286
>    [192922.921315] RAX: 0000000000000051 RBX: ffff95c9c16a41c0 RCX: 0000000000000000
>    [192922.921692] RDX: 0000000000000000 RSI: ffff95cab6b16838 RDI: ffff95cab6b16838
>    [192922.922066] RBP: ffff95c9c16a41c0 R08: 0000000000000000 R09: 0000000000000000
>    [192922.922442] R10: ffffaebdc8a27e70 R11: 0000000000000000 R12: ffff95ca731a0980
>    [192922.922820] R13: 0000000000000000 R14: ffff95ca84c73338 R15: ffff95ca731a0ea8
>    [192922.923200] FS:  00007f337eda4e80(0000) GS:ffff95cab6b00000(0000) knlGS:0000000000000000
>    [192922.923579] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>    [192922.923948] CR2: 00007f337edad000 CR3: 00000001e00f6002 CR4: 00000000003606e0
>    [192922.924329] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>    [192922.924711] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>    [192922.925105] Call Trace:
>    [192922.925505]  btrfs_trans_release_metadata+0x10c/0x170 [btrfs]
>    [192922.925911]  btrfs_commit_transaction+0x3e/0xaf0 [btrfs]
>    [192922.926324]  btrfs_sync_file+0x44c/0x490 [btrfs]
>    [192922.926731]  do_fsync+0x38/0x60
>    [192922.927138]  __x64_sys_fdatasync+0x13/0x20
>    [192922.927543]  do_syscall_64+0x60/0x1c0
>    [192922.927939]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
>    (...)
>    [192922.934077] ---[ end trace f00808b12068168f ]---
> 
> 2) If evict_refill_and_join() decides to commit the transaction, it will
>    be able to do it, since the nested transaction join only increments the
>    transaction handle's ->use_count reference counter and it does not
>    prevent the transaction from getting committed. This means that after
>    eviction completes, the fsync logging path will be using a transaction
>    handle that refers to an already committed transaction. What happens
>    when using such a stale transaction can be unpredictable, we are at
>    least having a use-after-free on the transaction handle itself, since
>    the transaction commit will call kmem_cache_free() against the handle
>    regardless of its ->use_count value, or we can end up silently losing
>    all the updates to the log tree after that iput() in the logging path,
>    or using a transaction handle that in the meanwhile was allocated to
>    another task for a new transaction, etc, pretty much unpredictable
>    what can happen.
> 

And talking it over with Nikolay I realized that since we're doing the commit
through the flushing this doesn't actually happen anymore.
may_commit_transaction() returns -EGAIN if we already have a trans handle open,
so hooray I made it safer by accident!  But we definitely should follow up and
add an assert in btrfs_commit_transaction() to catch this case, cause holy shit
that's bad.  Thanks,

Josef
Filipe Manana Sept. 12, 2019, 1:17 p.m. UTC | #9
On Thu, Sep 12, 2019 at 12:59 PM Josef Bacik <josef@toxicpanda.com> wrote:
>
> On Tue, Sep 10, 2019 at 03:26:49PM +0100, fdmanana@kernel.org wrote:
> > From: Filipe Manana <fdmanana@suse.com>
> >
> > Sometimes when fsync'ing a file we need to log that other inodes exist and
> > when we need to do that we acquire a reference on the inodes and then drop
> > that reference using iput() after logging them.
> >
> > That generally is not a problem except if we end up doing the final iput()
> > (dropping the last reference) on the inode and that inode has a link count
> > of 0, which can happen in a very short time window if the logging path
> > gets a reference on the inode while it's being unlinked.
> >
> > In that case we end up getting the eviction callback, btrfs_evict_inode(),
> > invoked through the iput() call chain which needs to drop all of the
> > inode's items from its subvolume btree, and in order to do that, it needs
> > to join a transaction at the helper function evict_refill_and_join().
> > However because the task previously started a transaction at the fsync
> > handler, btrfs_sync_file(), it has current->journal_info already pointing
> > to a transaction handle and therefore evict_refill_and_join() will get
> > that transaction handle from btrfs_join_transaction(). From this point on,
> > two different problems can happen:
> >
> > 1) evict_refill_and_join() will often change the transaction handle's
> >    block reserve (->block_rsv) and set its ->bytes_reserved field to a
> >    value greater than 0. If evict_refill_and_join() never commits the
> >    transaction, the eviction handler ends up decreasing the reference
> >    count (->use_count) of the transaction handle through the call to
> >    btrfs_end_transaction(), and after that point we have a transaction
> >    handle with a NULL ->block_rsv (which is the value prior to the
> >    transaction join from evict_refill_and_join()) and a ->bytes_reserved
> >    value greater than 0. If after the eviction/iput completes the inode
> >    logging path hits an error or it decides that it must fallback to a
> >    transaction commit, the btrfs fsync handle, btrfs_sync_file(), gets a
> >    non-zero value from btrfs_log_dentry_safe(), and because of that
> >    non-zero value it tries to commit the transaction using a handle with
> >    a NULL ->block_rsv and a non-zero ->bytes_reserved value. This makes
> >    the transaction commit hit an assertion failure at
> >    btrfs_trans_release_metadata() because ->bytes_reserved is not zero but
> >    the ->block_rsv is NULL. The produced stack trace for that is like the
> >    following:
> >
> >    [192922.917158] assertion failed: !trans->bytes_reserved, file: fs/btrfs/transaction.c, line: 816
> >    [192922.917553] ------------[ cut here ]------------
> >    [192922.917922] kernel BUG at fs/btrfs/ctree.h:3532!
> >    [192922.918310] invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC PTI
> >    [192922.918666] CPU: 2 PID: 883 Comm: fsstress Tainted: G        W         5.1.4-btrfs-next-47 #1
> >    [192922.919035] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.2-0-gf9626ccb91-prebuilt.qemu-project.org 04/01/2014
> >    [192922.919801] RIP: 0010:assfail.constprop.25+0x18/0x1a [btrfs]
> >    (...)
> >    [192922.920925] RSP: 0018:ffffaebdc8a27da8 EFLAGS: 00010286
> >    [192922.921315] RAX: 0000000000000051 RBX: ffff95c9c16a41c0 RCX: 0000000000000000
> >    [192922.921692] RDX: 0000000000000000 RSI: ffff95cab6b16838 RDI: ffff95cab6b16838
> >    [192922.922066] RBP: ffff95c9c16a41c0 R08: 0000000000000000 R09: 0000000000000000
> >    [192922.922442] R10: ffffaebdc8a27e70 R11: 0000000000000000 R12: ffff95ca731a0980
> >    [192922.922820] R13: 0000000000000000 R14: ffff95ca84c73338 R15: ffff95ca731a0ea8
> >    [192922.923200] FS:  00007f337eda4e80(0000) GS:ffff95cab6b00000(0000) knlGS:0000000000000000
> >    [192922.923579] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> >    [192922.923948] CR2: 00007f337edad000 CR3: 00000001e00f6002 CR4: 00000000003606e0
> >    [192922.924329] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> >    [192922.924711] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> >    [192922.925105] Call Trace:
> >    [192922.925505]  btrfs_trans_release_metadata+0x10c/0x170 [btrfs]
> >    [192922.925911]  btrfs_commit_transaction+0x3e/0xaf0 [btrfs]
> >    [192922.926324]  btrfs_sync_file+0x44c/0x490 [btrfs]
> >    [192922.926731]  do_fsync+0x38/0x60
> >    [192922.927138]  __x64_sys_fdatasync+0x13/0x20
> >    [192922.927543]  do_syscall_64+0x60/0x1c0
> >    [192922.927939]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
> >    (...)
> >    [192922.934077] ---[ end trace f00808b12068168f ]---
> >
> > 2) If evict_refill_and_join() decides to commit the transaction, it will
> >    be able to do it, since the nested transaction join only increments the
> >    transaction handle's ->use_count reference counter and it does not
> >    prevent the transaction from getting committed. This means that after
>
> This brings up a good point, we should probably not allow the commit in this
> case, or add an ASSERT(use_count == 1) or something, cause this would be bad.
> Thanks,

Yes, that should be separate change however.
I had warn_on(trans->use_count > 1) in the commit locally during
testing (didn't trigger in 4 days of fstests).

Thanks.
>
> Josef
Filipe Manana Sept. 12, 2019, 1:19 p.m. UTC | #10
On Thu, Sep 12, 2019 at 1:18 PM Josef Bacik <josef@toxicpanda.com> wrote:
>
> On Tue, Sep 10, 2019 at 03:26:49PM +0100, fdmanana@kernel.org wrote:
> > From: Filipe Manana <fdmanana@suse.com>
> >
> > Sometimes when fsync'ing a file we need to log that other inodes exist and
> > when we need to do that we acquire a reference on the inodes and then drop
> > that reference using iput() after logging them.
> >
> > That generally is not a problem except if we end up doing the final iput()
> > (dropping the last reference) on the inode and that inode has a link count
> > of 0, which can happen in a very short time window if the logging path
> > gets a reference on the inode while it's being unlinked.
> >
> > In that case we end up getting the eviction callback, btrfs_evict_inode(),
> > invoked through the iput() call chain which needs to drop all of the
> > inode's items from its subvolume btree, and in order to do that, it needs
> > to join a transaction at the helper function evict_refill_and_join().
> > However because the task previously started a transaction at the fsync
> > handler, btrfs_sync_file(), it has current->journal_info already pointing
> > to a transaction handle and therefore evict_refill_and_join() will get
> > that transaction handle from btrfs_join_transaction(). From this point on,
> > two different problems can happen:
> >
> > 1) evict_refill_and_join() will often change the transaction handle's
> >    block reserve (->block_rsv) and set its ->bytes_reserved field to a
> >    value greater than 0. If evict_refill_and_join() never commits the
> >    transaction, the eviction handler ends up decreasing the reference
> >    count (->use_count) of the transaction handle through the call to
> >    btrfs_end_transaction(), and after that point we have a transaction
> >    handle with a NULL ->block_rsv (which is the value prior to the
> >    transaction join from evict_refill_and_join()) and a ->bytes_reserved
> >    value greater than 0. If after the eviction/iput completes the inode
> >    logging path hits an error or it decides that it must fallback to a
> >    transaction commit, the btrfs fsync handle, btrfs_sync_file(), gets a
> >    non-zero value from btrfs_log_dentry_safe(), and because of that
> >    non-zero value it tries to commit the transaction using a handle with
> >    a NULL ->block_rsv and a non-zero ->bytes_reserved value. This makes
> >    the transaction commit hit an assertion failure at
> >    btrfs_trans_release_metadata() because ->bytes_reserved is not zero but
> >    the ->block_rsv is NULL. The produced stack trace for that is like the
> >    following:
> >
> >    [192922.917158] assertion failed: !trans->bytes_reserved, file: fs/btrfs/transaction.c, line: 816
> >    [192922.917553] ------------[ cut here ]------------
> >    [192922.917922] kernel BUG at fs/btrfs/ctree.h:3532!
> >    [192922.918310] invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC PTI
> >    [192922.918666] CPU: 2 PID: 883 Comm: fsstress Tainted: G        W         5.1.4-btrfs-next-47 #1
> >    [192922.919035] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.2-0-gf9626ccb91-prebuilt.qemu-project.org 04/01/2014
> >    [192922.919801] RIP: 0010:assfail.constprop.25+0x18/0x1a [btrfs]
> >    (...)
> >    [192922.920925] RSP: 0018:ffffaebdc8a27da8 EFLAGS: 00010286
> >    [192922.921315] RAX: 0000000000000051 RBX: ffff95c9c16a41c0 RCX: 0000000000000000
> >    [192922.921692] RDX: 0000000000000000 RSI: ffff95cab6b16838 RDI: ffff95cab6b16838
> >    [192922.922066] RBP: ffff95c9c16a41c0 R08: 0000000000000000 R09: 0000000000000000
> >    [192922.922442] R10: ffffaebdc8a27e70 R11: 0000000000000000 R12: ffff95ca731a0980
> >    [192922.922820] R13: 0000000000000000 R14: ffff95ca84c73338 R15: ffff95ca731a0ea8
> >    [192922.923200] FS:  00007f337eda4e80(0000) GS:ffff95cab6b00000(0000) knlGS:0000000000000000
> >    [192922.923579] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> >    [192922.923948] CR2: 00007f337edad000 CR3: 00000001e00f6002 CR4: 00000000003606e0
> >    [192922.924329] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> >    [192922.924711] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> >    [192922.925105] Call Trace:
> >    [192922.925505]  btrfs_trans_release_metadata+0x10c/0x170 [btrfs]
> >    [192922.925911]  btrfs_commit_transaction+0x3e/0xaf0 [btrfs]
> >    [192922.926324]  btrfs_sync_file+0x44c/0x490 [btrfs]
> >    [192922.926731]  do_fsync+0x38/0x60
> >    [192922.927138]  __x64_sys_fdatasync+0x13/0x20
> >    [192922.927543]  do_syscall_64+0x60/0x1c0
> >    [192922.927939]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
> >    (...)
> >    [192922.934077] ---[ end trace f00808b12068168f ]---
> >
> > 2) If evict_refill_and_join() decides to commit the transaction, it will
> >    be able to do it, since the nested transaction join only increments the
> >    transaction handle's ->use_count reference counter and it does not
> >    prevent the transaction from getting committed. This means that after
> >    eviction completes, the fsync logging path will be using a transaction
> >    handle that refers to an already committed transaction. What happens
> >    when using such a stale transaction can be unpredictable, we are at
> >    least having a use-after-free on the transaction handle itself, since
> >    the transaction commit will call kmem_cache_free() against the handle
> >    regardless of its ->use_count value, or we can end up silently losing
> >    all the updates to the log tree after that iput() in the logging path,
> >    or using a transaction handle that in the meanwhile was allocated to
> >    another task for a new transaction, etc, pretty much unpredictable
> >    what can happen.
> >
>
> And talking it over with Nikolay I realized that since we're doing the commit
> through the flushing this doesn't actually happen anymore.
> may_commit_transaction() returns -EGAIN if we already have a trans handle open,
> so hooray I made it safer by accident!  But we definitely should follow up and
> add an assert in btrfs_commit_transaction() to catch this case, cause holy shit
> that's bad.  Thanks,

Yep, but problem 1) (assertion failure) is still valid after your
recent changes in misc-next.
Either way, this fix has to be added to stable releases to fix both
problems there.

Thanks.

>
> Josef
Josef Bacik Sept. 12, 2019, 3:14 p.m. UTC | #11
On Thu, Sep 12, 2019 at 02:19:55PM +0100, Filipe Manana wrote:
> On Thu, Sep 12, 2019 at 1:18 PM Josef Bacik <josef@toxicpanda.com> wrote:
> >
> > On Tue, Sep 10, 2019 at 03:26:49PM +0100, fdmanana@kernel.org wrote:
> > > From: Filipe Manana <fdmanana@suse.com>
> > >
> > > Sometimes when fsync'ing a file we need to log that other inodes exist and
> > > when we need to do that we acquire a reference on the inodes and then drop
> > > that reference using iput() after logging them.
> > >
> > > That generally is not a problem except if we end up doing the final iput()
> > > (dropping the last reference) on the inode and that inode has a link count
> > > of 0, which can happen in a very short time window if the logging path
> > > gets a reference on the inode while it's being unlinked.
> > >
> > > In that case we end up getting the eviction callback, btrfs_evict_inode(),
> > > invoked through the iput() call chain which needs to drop all of the
> > > inode's items from its subvolume btree, and in order to do that, it needs
> > > to join a transaction at the helper function evict_refill_and_join().
> > > However because the task previously started a transaction at the fsync
> > > handler, btrfs_sync_file(), it has current->journal_info already pointing
> > > to a transaction handle and therefore evict_refill_and_join() will get
> > > that transaction handle from btrfs_join_transaction(). From this point on,
> > > two different problems can happen:
> > >
> > > 1) evict_refill_and_join() will often change the transaction handle's
> > >    block reserve (->block_rsv) and set its ->bytes_reserved field to a
> > >    value greater than 0. If evict_refill_and_join() never commits the
> > >    transaction, the eviction handler ends up decreasing the reference
> > >    count (->use_count) of the transaction handle through the call to
> > >    btrfs_end_transaction(), and after that point we have a transaction
> > >    handle with a NULL ->block_rsv (which is the value prior to the
> > >    transaction join from evict_refill_and_join()) and a ->bytes_reserved
> > >    value greater than 0. If after the eviction/iput completes the inode
> > >    logging path hits an error or it decides that it must fallback to a
> > >    transaction commit, the btrfs fsync handle, btrfs_sync_file(), gets a
> > >    non-zero value from btrfs_log_dentry_safe(), and because of that
> > >    non-zero value it tries to commit the transaction using a handle with
> > >    a NULL ->block_rsv and a non-zero ->bytes_reserved value. This makes
> > >    the transaction commit hit an assertion failure at
> > >    btrfs_trans_release_metadata() because ->bytes_reserved is not zero but
> > >    the ->block_rsv is NULL. The produced stack trace for that is like the
> > >    following:
> > >
> > >    [192922.917158] assertion failed: !trans->bytes_reserved, file: fs/btrfs/transaction.c, line: 816
> > >    [192922.917553] ------------[ cut here ]------------
> > >    [192922.917922] kernel BUG at fs/btrfs/ctree.h:3532!
> > >    [192922.918310] invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC PTI
> > >    [192922.918666] CPU: 2 PID: 883 Comm: fsstress Tainted: G        W         5.1.4-btrfs-next-47 #1
> > >    [192922.919035] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.2-0-gf9626ccb91-prebuilt.qemu-project.org 04/01/2014
> > >    [192922.919801] RIP: 0010:assfail.constprop.25+0x18/0x1a [btrfs]
> > >    (...)
> > >    [192922.920925] RSP: 0018:ffffaebdc8a27da8 EFLAGS: 00010286
> > >    [192922.921315] RAX: 0000000000000051 RBX: ffff95c9c16a41c0 RCX: 0000000000000000
> > >    [192922.921692] RDX: 0000000000000000 RSI: ffff95cab6b16838 RDI: ffff95cab6b16838
> > >    [192922.922066] RBP: ffff95c9c16a41c0 R08: 0000000000000000 R09: 0000000000000000
> > >    [192922.922442] R10: ffffaebdc8a27e70 R11: 0000000000000000 R12: ffff95ca731a0980
> > >    [192922.922820] R13: 0000000000000000 R14: ffff95ca84c73338 R15: ffff95ca731a0ea8
> > >    [192922.923200] FS:  00007f337eda4e80(0000) GS:ffff95cab6b00000(0000) knlGS:0000000000000000
> > >    [192922.923579] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > >    [192922.923948] CR2: 00007f337edad000 CR3: 00000001e00f6002 CR4: 00000000003606e0
> > >    [192922.924329] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > >    [192922.924711] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> > >    [192922.925105] Call Trace:
> > >    [192922.925505]  btrfs_trans_release_metadata+0x10c/0x170 [btrfs]
> > >    [192922.925911]  btrfs_commit_transaction+0x3e/0xaf0 [btrfs]
> > >    [192922.926324]  btrfs_sync_file+0x44c/0x490 [btrfs]
> > >    [192922.926731]  do_fsync+0x38/0x60
> > >    [192922.927138]  __x64_sys_fdatasync+0x13/0x20
> > >    [192922.927543]  do_syscall_64+0x60/0x1c0
> > >    [192922.927939]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
> > >    (...)
> > >    [192922.934077] ---[ end trace f00808b12068168f ]---
> > >
> > > 2) If evict_refill_and_join() decides to commit the transaction, it will
> > >    be able to do it, since the nested transaction join only increments the
> > >    transaction handle's ->use_count reference counter and it does not
> > >    prevent the transaction from getting committed. This means that after
> > >    eviction completes, the fsync logging path will be using a transaction
> > >    handle that refers to an already committed transaction. What happens
> > >    when using such a stale transaction can be unpredictable, we are at
> > >    least having a use-after-free on the transaction handle itself, since
> > >    the transaction commit will call kmem_cache_free() against the handle
> > >    regardless of its ->use_count value, or we can end up silently losing
> > >    all the updates to the log tree after that iput() in the logging path,
> > >    or using a transaction handle that in the meanwhile was allocated to
> > >    another task for a new transaction, etc, pretty much unpredictable
> > >    what can happen.
> > >
> >
> > And talking it over with Nikolay I realized that since we're doing the commit
> > through the flushing this doesn't actually happen anymore.
> > may_commit_transaction() returns -EGAIN if we already have a trans handle open,
> > so hooray I made it safer by accident!  But we definitely should follow up and
> > add an assert in btrfs_commit_transaction() to catch this case, cause holy shit
> > that's bad.  Thanks,
> 
> Yep, but problem 1) (assertion failure) is still valid after your
> recent changes in misc-next.
> Either way, this fix has to be added to stable releases to fix both
> problems there.
> 

Agreed, just adding context for the current changes.  This patch is fine as it
is, just some follow up stuff to catch transaction commits under nested trans
handles.  Thanks,

Josef
Filipe Manana Sept. 16, 2019, 12:59 p.m. UTC | #12
On Thu, Sep 12, 2019 at 8:32 AM Sasha Levin <sashal@kernel.org> wrote:
>
> Hi,
>
> [This is an automated email]
>
> This commit has been processed because it contains a -stable tag.
> The stable tag indicates that it's relevant for the following trees: 4.4+
>
> The bot has tested the following trees: v5.2.14, v4.19.72, v4.14.143, v4.9.192, v4.4.192.
>
> v5.2.14: Build OK!
> v4.19.72: Failed to apply! Possible dependencies:
>     6b5fc433a7ad ("Btrfs: fix fsync after succession of renames of different files")
>     a3baaf0d786e ("Btrfs: fix fsync after succession of renames and unlink/rmdir")
>     b8aa330d2acb ("Btrfs: improve performance on fsync of files with multiple hardlinks")
>
> v4.14.143: Failed to apply! Possible dependencies:
>     0d836392cadd ("Btrfs: fix mount failure after fsync due to hard link recreation")
>     1f250e929a9c ("Btrfs: fix log replay failure after unlink and link combination")
>     6b5fc433a7ad ("Btrfs: fix fsync after succession of renames of different files")
>     8d9e220ca084 ("btrfs: simplify IS_ERR/PTR_ERR checks")
>     a3baaf0d786e ("Btrfs: fix fsync after succession of renames and unlink/rmdir")
>     b8aa330d2acb ("Btrfs: improve performance on fsync of files with multiple hardlinks")
>
> v4.9.192: Failed to apply! Possible dependencies:
>     0b246afa62b0 ("btrfs: root->fs_info cleanup, add fs_info convenience variables")
>     0d836392cadd ("Btrfs: fix mount failure after fsync due to hard link recreation")
>     1f250e929a9c ("Btrfs: fix log replay failure after unlink and link combination")
>     4791c8f19c45 ("btrfs: Make btrfs_check_ref_name_override take btrfs_inode")
>     6b5fc433a7ad ("Btrfs: fix fsync after succession of renames of different files")
>     a3baaf0d786e ("Btrfs: fix fsync after succession of renames and unlink/rmdir")
>     cf8cddd38bab ("btrfs: don't abuse REQ_OP_* flags for btrfs_map_block")
>     da17066c4047 ("btrfs: pull node/sector/stripe sizes out of root and into fs_info")
>     db0a669fb002 ("btrfs: Make btrfs_add_link take btrfs_inode")
>     de143792253e ("btrfs: struct btrfsic_state->root should be an fs_info")
>     fb456252d3d9 ("btrfs: root->fs_info cleanup, use fs_info->dev_root everywhere")
>
> v4.4.192: Failed to apply! Possible dependencies:
>     0132761017e0 ("btrfs: fix string and comment grammatical issues and typos")
>     09cbfeaf1a5a ("mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros")
>     0b246afa62b0 ("btrfs: root->fs_info cleanup, add fs_info convenience variables")
>     0d836392cadd ("Btrfs: fix mount failure after fsync due to hard link recreation")
>     0e749e54244e ("dax: increase granularity of dax_clear_blocks() operations")
>     1f250e929a9c ("Btrfs: fix log replay failure after unlink and link combination")
>     44f714dae50a ("Btrfs: improve performance on fsync against new inode after rename/unlink")
>     4791c8f19c45 ("btrfs: Make btrfs_check_ref_name_override take btrfs_inode")
>     52db400fcd50 ("pmem, dax: clean up clear_pmem()")
>     6b5fc433a7ad ("Btrfs: fix fsync after succession of renames of different files")
>     781feef7e6be ("Btrfs: fix lockdep warning about log_mutex")
>     a3baaf0d786e ("Btrfs: fix fsync after succession of renames and unlink/rmdir")
>     b2e0d1625e19 ("dax: fix lifetime of in-kernel dax mappings with dax_map_atomic()")
>     bb7ab3b92e46 ("btrfs: Fix misspellings in comments.")
>     cf8cddd38bab ("btrfs: don't abuse REQ_OP_* flags for btrfs_map_block")
>     d1a5f2b4d8a1 ("block: use DAX for partition table reads")
>     db0a669fb002 ("btrfs: Make btrfs_add_link take btrfs_inode")
>     de143792253e ("btrfs: struct btrfsic_state->root should be an fs_info")
>
>
> NOTE: The patch will not be queued to stable trees until it is upstream.
>
> How should we proceed with this patch?

So here follows, as attachments, patches for the following stable
versions: 4.4.193, 4.9.193, 4.14.144 and 4.19.73.

Thanks.

>
> --
> Thanks,
> Sasha
Greg KH Sept. 16, 2019, 3:13 p.m. UTC | #13
On Mon, Sep 16, 2019 at 01:59:47PM +0100, Filipe Manana wrote:
> On Thu, Sep 12, 2019 at 8:32 AM Sasha Levin <sashal@kernel.org> wrote:
> >
> > Hi,
> >
> > [This is an automated email]
> >
> > This commit has been processed because it contains a -stable tag.
> > The stable tag indicates that it's relevant for the following trees: 4.4+
> >
> > The bot has tested the following trees: v5.2.14, v4.19.72, v4.14.143, v4.9.192, v4.4.192.
> >
> > v5.2.14: Build OK!
> > v4.19.72: Failed to apply! Possible dependencies:
> >     6b5fc433a7ad ("Btrfs: fix fsync after succession of renames of different files")
> >     a3baaf0d786e ("Btrfs: fix fsync after succession of renames and unlink/rmdir")
> >     b8aa330d2acb ("Btrfs: improve performance on fsync of files with multiple hardlinks")
> >
> > v4.14.143: Failed to apply! Possible dependencies:
> >     0d836392cadd ("Btrfs: fix mount failure after fsync due to hard link recreation")
> >     1f250e929a9c ("Btrfs: fix log replay failure after unlink and link combination")
> >     6b5fc433a7ad ("Btrfs: fix fsync after succession of renames of different files")
> >     8d9e220ca084 ("btrfs: simplify IS_ERR/PTR_ERR checks")
> >     a3baaf0d786e ("Btrfs: fix fsync after succession of renames and unlink/rmdir")
> >     b8aa330d2acb ("Btrfs: improve performance on fsync of files with multiple hardlinks")
> >
> > v4.9.192: Failed to apply! Possible dependencies:
> >     0b246afa62b0 ("btrfs: root->fs_info cleanup, add fs_info convenience variables")
> >     0d836392cadd ("Btrfs: fix mount failure after fsync due to hard link recreation")
> >     1f250e929a9c ("Btrfs: fix log replay failure after unlink and link combination")
> >     4791c8f19c45 ("btrfs: Make btrfs_check_ref_name_override take btrfs_inode")
> >     6b5fc433a7ad ("Btrfs: fix fsync after succession of renames of different files")
> >     a3baaf0d786e ("Btrfs: fix fsync after succession of renames and unlink/rmdir")
> >     cf8cddd38bab ("btrfs: don't abuse REQ_OP_* flags for btrfs_map_block")
> >     da17066c4047 ("btrfs: pull node/sector/stripe sizes out of root and into fs_info")
> >     db0a669fb002 ("btrfs: Make btrfs_add_link take btrfs_inode")
> >     de143792253e ("btrfs: struct btrfsic_state->root should be an fs_info")
> >     fb456252d3d9 ("btrfs: root->fs_info cleanup, use fs_info->dev_root everywhere")
> >
> > v4.4.192: Failed to apply! Possible dependencies:
> >     0132761017e0 ("btrfs: fix string and comment grammatical issues and typos")
> >     09cbfeaf1a5a ("mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros")
> >     0b246afa62b0 ("btrfs: root->fs_info cleanup, add fs_info convenience variables")
> >     0d836392cadd ("Btrfs: fix mount failure after fsync due to hard link recreation")
> >     0e749e54244e ("dax: increase granularity of dax_clear_blocks() operations")
> >     1f250e929a9c ("Btrfs: fix log replay failure after unlink and link combination")
> >     44f714dae50a ("Btrfs: improve performance on fsync against new inode after rename/unlink")
> >     4791c8f19c45 ("btrfs: Make btrfs_check_ref_name_override take btrfs_inode")
> >     52db400fcd50 ("pmem, dax: clean up clear_pmem()")
> >     6b5fc433a7ad ("Btrfs: fix fsync after succession of renames of different files")
> >     781feef7e6be ("Btrfs: fix lockdep warning about log_mutex")
> >     a3baaf0d786e ("Btrfs: fix fsync after succession of renames and unlink/rmdir")
> >     b2e0d1625e19 ("dax: fix lifetime of in-kernel dax mappings with dax_map_atomic()")
> >     bb7ab3b92e46 ("btrfs: Fix misspellings in comments.")
> >     cf8cddd38bab ("btrfs: don't abuse REQ_OP_* flags for btrfs_map_block")
> >     d1a5f2b4d8a1 ("block: use DAX for partition table reads")
> >     db0a669fb002 ("btrfs: Make btrfs_add_link take btrfs_inode")
> >     de143792253e ("btrfs: struct btrfsic_state->root should be an fs_info")
> >
> >
> > NOTE: The patch will not be queued to stable trees until it is upstream.
> >
> > How should we proceed with this patch?
> 
> So here follows, as attachments, patches for the following stable
> versions: 4.4.193, 4.9.193, 4.14.144 and 4.19.73.

Thanks for these backports, now queued up!

greg k-h

Patch
diff mbox series

diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
index 6c8297bcfeb7..1bfd7e34f31e 100644
--- a/fs/btrfs/tree-log.c
+++ b/fs/btrfs/tree-log.c
@@ -4985,7 +4985,7 @@  static int log_conflicting_inodes(struct btrfs_trans_handle *trans,
 						      BTRFS_I(inode),
 						      LOG_OTHER_INODE_ALL,
 						      0, LLONG_MAX, ctx);
-					iput(inode);
+					btrfs_add_delayed_iput(inode);
 				}
 			}
 			continue;
@@ -5000,7 +5000,7 @@  static int log_conflicting_inodes(struct btrfs_trans_handle *trans,
 		ret = btrfs_log_inode(trans, root, BTRFS_I(inode),
 				      LOG_OTHER_INODE, 0, LLONG_MAX, ctx);
 		if (ret) {
-			iput(inode);
+			btrfs_add_delayed_iput(inode);
 			continue;
 		}
 
@@ -5009,7 +5009,7 @@  static int log_conflicting_inodes(struct btrfs_trans_handle *trans,
 		key.offset = 0;
 		ret = btrfs_search_slot(NULL, root, &key, path, 0, 0);
 		if (ret < 0) {
-			iput(inode);
+			btrfs_add_delayed_iput(inode);
 			continue;
 		}
 
@@ -5056,7 +5056,7 @@  static int log_conflicting_inodes(struct btrfs_trans_handle *trans,
 			}
 			path->slots[0]++;
 		}
-		iput(inode);
+		btrfs_add_delayed_iput(inode);
 	}
 
 	return ret;
@@ -5689,7 +5689,7 @@  static int log_new_dir_dentries(struct btrfs_trans_handle *trans,
 			}
 
 			if (btrfs_inode_in_log(BTRFS_I(di_inode), trans->transid)) {
-				iput(di_inode);
+				btrfs_add_delayed_iput(di_inode);
 				break;
 			}
 
@@ -5701,7 +5701,7 @@  static int log_new_dir_dentries(struct btrfs_trans_handle *trans,
 			if (!ret &&
 			    btrfs_must_commit_transaction(trans, BTRFS_I(di_inode)))
 				ret = 1;
-			iput(di_inode);
+			btrfs_add_delayed_iput(di_inode);
 			if (ret)
 				goto next_dir_inode;
 			if (ctx->log_new_dentries) {
@@ -5848,7 +5848,7 @@  static int btrfs_log_all_parents(struct btrfs_trans_handle *trans,
 			if (!ret && ctx && ctx->log_new_dentries)
 				ret = log_new_dir_dentries(trans, root,
 						   BTRFS_I(dir_inode), ctx);
-			iput(dir_inode);
+			btrfs_add_delayed_iput(dir_inode);
 			if (ret)
 				goto out;
 		}
@@ -5891,7 +5891,7 @@  static int log_new_ancestors(struct btrfs_trans_handle *trans,
 			ret = btrfs_log_inode(trans, root, BTRFS_I(inode),
 					      LOG_INODE_EXISTS,
 					      0, LLONG_MAX, ctx);
-		iput(inode);
+		btrfs_add_delayed_iput(inode);
 		if (ret)
 			return ret;