diff mbox

Btrfs: fix log replay failure after unlink and link combination

Message ID 20180228155610.5828-1-fdmanana@kernel.org (mailing list archive)
State New, archived
Headers show

Commit Message

Filipe Manana Feb. 28, 2018, 3:56 p.m. UTC
From: Filipe Manana <fdmanana@suse.com>

If we have a file with 2 (or more) hard links in the same directory,
remove one of the hard links, create a new file (or link an existing file)
in the same directory with the name of the removed hard link, and then
finally fsync the new file, we end up with a log that fails to replay,
causing a mount failure.

Example:

  $ mkfs.btrfs -f /dev/sdb
  $ mount /dev/sdb /mnt

  $ mkdir /mnt/testdir
  $ touch /mnt/testdir/foo
  $ ln /mnt/testdir/foo /mnt/testdir/bar

  $ sync

  $ unlink /mnt/testdir/bar
  $ touch /mnt/testdir/bar
  $ xfs_io -c "fsync" /mnt/testdir/bar

  <power failure>

  $ mount /dev/sdb /mnt
  mount: mount(2) failed: /mnt: No such file or directory

When replaying the log, for that example, we also see the following in
dmesg/syslog:

  [71813.671307] BTRFS info (device dm-0): failed to delete reference to bar, inode 258 parent 257
  [71813.674204] ------------[ cut here ]------------
  [71813.675694] BTRFS: Transaction aborted (error -2)
  [71813.677236] WARNING: CPU: 1 PID: 13231 at fs/btrfs/inode.c:4128 __btrfs_unlink_inode+0x17b/0x355 [btrfs]
  [71813.679669] Modules linked in: btrfs xfs f2fs dm_flakey dm_mod dax ghash_clmulni_intel ppdev pcbc aesni_intel aes_x86_64 crypto_simd cryptd glue_helper evdev psmouse i2c_piix4 parport_pc i2c_core pcspkr sg serio_raw parport button sunrpc loop autofs4 ext4 crc16 mbcache jbd2 zstd_decompress zstd_compress xxhash raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c crc32c_generic raid1 raid0 multipath linear md_mod ata_generic sd_mod virtio_scsi ata_piix libata virtio_pci virtio_ring crc32c_intel floppy virtio e1000 scsi_mod [last unloaded: btrfs]
  [71813.679669] CPU: 1 PID: 13231 Comm: mount Tainted: G        W        4.15.0-rc9-btrfs-next-56+ #1
  [71813.679669] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.10.2-0-g5f4c7b1-prebuilt.qemu-project.org 04/01/2014
  [71813.679669] RIP: 0010:__btrfs_unlink_inode+0x17b/0x355 [btrfs]
  [71813.679669] RSP: 0018:ffffc90001cef738 EFLAGS: 00010286
  [71813.679669] RAX: 0000000000000025 RBX: ffff880217ce4708 RCX: 0000000000000001
  [71813.679669] RDX: 0000000000000000 RSI: ffffffff81c14bae RDI: 00000000ffffffff
  [71813.679669] RBP: ffffc90001cef7c0 R08: 0000000000000001 R09: 0000000000000001
  [71813.679669] R10: ffffc90001cef5e0 R11: ffffffff8343f007 R12: ffff880217d474c8
  [71813.679669] R13: 00000000fffffffe R14: ffff88021ccf1548 R15: 0000000000000101
  [71813.679669] FS:  00007f7cee84c480(0000) GS:ffff88023fc80000(0000) knlGS:0000000000000000
  [71813.679669] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  [71813.679669] CR2: 00007f7cedc1abf9 CR3: 00000002354b4003 CR4: 00000000001606e0
  [71813.679669] Call Trace:
  [71813.679669]  btrfs_unlink_inode+0x17/0x41 [btrfs]
  [71813.679669]  drop_one_dir_item+0xfa/0x131 [btrfs]
  [71813.679669]  add_inode_ref+0x71e/0x851 [btrfs]
  [71813.679669]  ? __lock_is_held+0x39/0x71
  [71813.679669]  ? replay_one_buffer+0x53/0x53a [btrfs]
  [71813.679669]  replay_one_buffer+0x4a4/0x53a [btrfs]
  [71813.679669]  ? rcu_read_unlock+0x3a/0x57
  [71813.679669]  ? __lock_is_held+0x39/0x71
  [71813.679669]  walk_up_log_tree+0x101/0x1d2 [btrfs]
  [71813.679669]  walk_log_tree+0xad/0x188 [btrfs]
  [71813.679669]  btrfs_recover_log_trees+0x1fa/0x31e [btrfs]
  [71813.679669]  ? replay_one_extent+0x544/0x544 [btrfs]
  [71813.679669]  open_ctree+0x1cf6/0x2209 [btrfs]
  [71813.679669]  btrfs_mount_root+0x368/0x482 [btrfs]
  [71813.679669]  ? trace_hardirqs_on_caller+0x14c/0x1a6
  [71813.679669]  ? __lockdep_init_map+0x176/0x1c2
  [71813.679669]  ? mount_fs+0x64/0x10b
  [71813.679669]  mount_fs+0x64/0x10b
  [71813.679669]  vfs_kern_mount+0x68/0xce
  [71813.679669]  btrfs_mount+0x13e/0x772 [btrfs]
  [71813.679669]  ? trace_hardirqs_on_caller+0x14c/0x1a6
  [71813.679669]  ? __lockdep_init_map+0x176/0x1c2
  [71813.679669]  ? mount_fs+0x64/0x10b
  [71813.679669]  mount_fs+0x64/0x10b
  [71813.679669]  vfs_kern_mount+0x68/0xce
  [71813.679669]  do_mount+0x6e5/0x973
  [71813.679669]  ? memdup_user+0x3e/0x5c
  [71813.679669]  SyS_mount+0x72/0x98
  [71813.679669]  entry_SYSCALL_64_fastpath+0x1e/0x8b
  [71813.679669] RIP: 0033:0x7f7cedf150ba
  [71813.679669] RSP: 002b:00007ffca71da688 EFLAGS: 00000206
  [71813.679669] Code: 7f a0 e8 51 0c fd ff 48 8b 43 50 f0 0f ba a8 30 2c 00 00 02 72 17 41 83 fd fb 74 11 44 89 ee 48 c7 c7 7d 11 7f a0 e8 38 f5 8d e0 <0f> ff 44 89 e9 ba 20 10 00 00 eb 4d 48 8b 4d b0 48 8b 75 88 4c
  [71813.679669] ---[ end trace 83bd473fc5b4663b ]---
  [71813.854764] BTRFS: error (device dm-0) in __btrfs_unlink_inode:4128: errno=-2 No such entry
  [71813.886994] BTRFS: error (device dm-0) in btrfs_replay_log:2307: errno=-2 No such entry (Failed to recover log tree)
  [71813.903357] BTRFS error (device dm-0): cleaner transaction attach returned -30
  [71814.128078] BTRFS error (device dm-0): open_ctree failed

This happens because the log has inode reference items for both inode 258
(the first file we created) and inode 259 (the second file created), and
when processing the reference item for inode 258, we replace the
corresponding item in the subvolume tree (which has two names, "foo" and
"bar") witht he one in the log (which only has one name, "foo") without
removing the corresponding dir index keys from the parent directory.
Later, when processing the inode reference item for inode 259, which has
a name of "bar" associated to it, we notice that dir index entries exist
for that name and for a different inode, so we attempt to unlink that
name, which fails because the inode reference item for inode 258 no longer
has the name "bar" associated to it, making a call to btrfs_unlink_inode()
fail with a -ENOENT error.

Fix this by unlinking all the names in an inode reference item from a
subvolume tree that are not present in the inode reference item found in
the log tree, before overwriting it with the item from the log tree.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 fs/btrfs/ctree.h      |   5 ++-
 fs/btrfs/inode-item.c |  44 ++++++++++++--------
 fs/btrfs/tree-log.c   | 112 ++++++++++++++++++++++++++++++++++++++++++++++++--
 3 files changed, 139 insertions(+), 22 deletions(-)

Comments

Liu Bo March 2, 2018, 6:29 p.m. UTC | #1
On Wed, Feb 28, 2018 at 03:56:10PM +0000, fdmanana@kernel.org wrote:
> From: Filipe Manana <fdmanana@suse.com>
> 
> If we have a file with 2 (or more) hard links in the same directory,
> remove one of the hard links, create a new file (or link an existing file)
> in the same directory with the name of the removed hard link, and then
> finally fsync the new file, we end up with a log that fails to replay,
> causing a mount failure.
> 
> Example:
> 
>   $ mkfs.btrfs -f /dev/sdb
>   $ mount /dev/sdb /mnt
> 
>   $ mkdir /mnt/testdir
>   $ touch /mnt/testdir/foo
>   $ ln /mnt/testdir/foo /mnt/testdir/bar
> 
>   $ sync
> 
>   $ unlink /mnt/testdir/bar
>   $ touch /mnt/testdir/bar
>   $ xfs_io -c "fsync" /mnt/testdir/bar
> 
>   <power failure>
> 
>   $ mount /dev/sdb /mnt
>   mount: mount(2) failed: /mnt: No such file or directory
> 
> When replaying the log, for that example, we also see the following in
> dmesg/syslog:
> 
>   [71813.671307] BTRFS info (device dm-0): failed to delete reference to bar, inode 258 parent 257
>   [71813.674204] ------------[ cut here ]------------
>   [71813.675694] BTRFS: Transaction aborted (error -2)
>   [71813.677236] WARNING: CPU: 1 PID: 13231 at fs/btrfs/inode.c:4128 __btrfs_unlink_inode+0x17b/0x355 [btrfs]
>   [71813.679669] Modules linked in: btrfs xfs f2fs dm_flakey dm_mod dax ghash_clmulni_intel ppdev pcbc aesni_intel aes_x86_64 crypto_simd cryptd glue_helper evdev psmouse i2c_piix4 parport_pc i2c_core pcspkr sg serio_raw parport button sunrpc loop autofs4 ext4 crc16 mbcache jbd2 zstd_decompress zstd_compress xxhash raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c crc32c_generic raid1 raid0 multipath linear md_mod ata_generic sd_mod virtio_scsi ata_piix libata virtio_pci virtio_ring crc32c_intel floppy virtio e1000 scsi_mod [last unloaded: btrfs]
>   [71813.679669] CPU: 1 PID: 13231 Comm: mount Tainted: G        W        4.15.0-rc9-btrfs-next-56+ #1
>   [71813.679669] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.10.2-0-g5f4c7b1-prebuilt.qemu-project.org 04/01/2014
>   [71813.679669] RIP: 0010:__btrfs_unlink_inode+0x17b/0x355 [btrfs]
>   [71813.679669] RSP: 0018:ffffc90001cef738 EFLAGS: 00010286
>   [71813.679669] RAX: 0000000000000025 RBX: ffff880217ce4708 RCX: 0000000000000001
>   [71813.679669] RDX: 0000000000000000 RSI: ffffffff81c14bae RDI: 00000000ffffffff
>   [71813.679669] RBP: ffffc90001cef7c0 R08: 0000000000000001 R09: 0000000000000001
>   [71813.679669] R10: ffffc90001cef5e0 R11: ffffffff8343f007 R12: ffff880217d474c8
>   [71813.679669] R13: 00000000fffffffe R14: ffff88021ccf1548 R15: 0000000000000101
>   [71813.679669] FS:  00007f7cee84c480(0000) GS:ffff88023fc80000(0000) knlGS:0000000000000000
>   [71813.679669] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>   [71813.679669] CR2: 00007f7cedc1abf9 CR3: 00000002354b4003 CR4: 00000000001606e0
>   [71813.679669] Call Trace:
>   [71813.679669]  btrfs_unlink_inode+0x17/0x41 [btrfs]
>   [71813.679669]  drop_one_dir_item+0xfa/0x131 [btrfs]
>   [71813.679669]  add_inode_ref+0x71e/0x851 [btrfs]
>   [71813.679669]  ? __lock_is_held+0x39/0x71
>   [71813.679669]  ? replay_one_buffer+0x53/0x53a [btrfs]
>   [71813.679669]  replay_one_buffer+0x4a4/0x53a [btrfs]
>   [71813.679669]  ? rcu_read_unlock+0x3a/0x57
>   [71813.679669]  ? __lock_is_held+0x39/0x71
>   [71813.679669]  walk_up_log_tree+0x101/0x1d2 [btrfs]
>   [71813.679669]  walk_log_tree+0xad/0x188 [btrfs]
>   [71813.679669]  btrfs_recover_log_trees+0x1fa/0x31e [btrfs]
>   [71813.679669]  ? replay_one_extent+0x544/0x544 [btrfs]
>   [71813.679669]  open_ctree+0x1cf6/0x2209 [btrfs]
>   [71813.679669]  btrfs_mount_root+0x368/0x482 [btrfs]
>   [71813.679669]  ? trace_hardirqs_on_caller+0x14c/0x1a6
>   [71813.679669]  ? __lockdep_init_map+0x176/0x1c2
>   [71813.679669]  ? mount_fs+0x64/0x10b
>   [71813.679669]  mount_fs+0x64/0x10b
>   [71813.679669]  vfs_kern_mount+0x68/0xce
>   [71813.679669]  btrfs_mount+0x13e/0x772 [btrfs]
>   [71813.679669]  ? trace_hardirqs_on_caller+0x14c/0x1a6
>   [71813.679669]  ? __lockdep_init_map+0x176/0x1c2
>   [71813.679669]  ? mount_fs+0x64/0x10b
>   [71813.679669]  mount_fs+0x64/0x10b
>   [71813.679669]  vfs_kern_mount+0x68/0xce
>   [71813.679669]  do_mount+0x6e5/0x973
>   [71813.679669]  ? memdup_user+0x3e/0x5c
>   [71813.679669]  SyS_mount+0x72/0x98
>   [71813.679669]  entry_SYSCALL_64_fastpath+0x1e/0x8b
>   [71813.679669] RIP: 0033:0x7f7cedf150ba
>   [71813.679669] RSP: 002b:00007ffca71da688 EFLAGS: 00000206
>   [71813.679669] Code: 7f a0 e8 51 0c fd ff 48 8b 43 50 f0 0f ba a8 30 2c 00 00 02 72 17 41 83 fd fb 74 11 44 89 ee 48 c7 c7 7d 11 7f a0 e8 38 f5 8d e0 <0f> ff 44 89 e9 ba 20 10 00 00 eb 4d 48 8b 4d b0 48 8b 75 88 4c
>   [71813.679669] ---[ end trace 83bd473fc5b4663b ]---
>   [71813.854764] BTRFS: error (device dm-0) in __btrfs_unlink_inode:4128: errno=-2 No such entry
>   [71813.886994] BTRFS: error (device dm-0) in btrfs_replay_log:2307: errno=-2 No such entry (Failed to recover log tree)
>   [71813.903357] BTRFS error (device dm-0): cleaner transaction attach returned -30
>   [71814.128078] BTRFS error (device dm-0): open_ctree failed
> 
> This happens because the log has inode reference items for both inode 258
> (the first file we created) and inode 259 (the second file created), and
> when processing the reference item for inode 258, we replace the
> corresponding item in the subvolume tree (which has two names, "foo" and
> "bar") witht he one in the log (which only has one name, "foo") without
> removing the corresponding dir index keys from the parent directory.
> Later, when processing the inode reference item for inode 259, which has
> a name of "bar" associated to it, we notice that dir index entries exist
> for that name and for a different inode, so we attempt to unlink that
> name, which fails because the inode reference item for inode 258 no longer
> has the name "bar" associated to it, making a call to btrfs_unlink_inode()
> fail with a -ENOENT error.
> 
> Fix this by unlinking all the names in an inode reference item from a
> subvolume tree that are not present in the inode reference item found in
> the log tree, before overwriting it with the item from the log tree.

Since the order during replaying is INODE_ITEM then DIR_INDEX then
other types, in this particular case, can we fix the problem by also
logging the parent so that we have the correct DIR_INDEX?

With DIR_INDEX, the problem would be fixed simpler, wouldn't it?

Thanks,

-liubo
> 
> Signed-off-by: Filipe Manana <fdmanana@suse.com>
> ---
>  fs/btrfs/ctree.h      |   5 ++-
>  fs/btrfs/inode-item.c |  44 ++++++++++++--------
>  fs/btrfs/tree-log.c   | 112 ++++++++++++++++++++++++++++++++++++++++++++++++--
>  3 files changed, 139 insertions(+), 22 deletions(-)
> 
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index 1a462ab85c49..5d33478bc704 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -3095,7 +3095,10 @@ btrfs_lookup_inode_extref(struct btrfs_trans_handle *trans,
>  			  u64 inode_objectid, u64 ref_objectid, int ins_len,
>  			  int cow);
>  
> -int btrfs_find_name_in_ext_backref(struct btrfs_path *path,
> +int btrfs_find_name_in_backref(struct extent_buffer *leaf, int slot,
> +			       const char *name,
> +			       int name_len, struct btrfs_inode_ref **ref_ret);
> +int btrfs_find_name_in_ext_backref(struct extent_buffer *leaf, int slot,
>  				   u64 ref_objectid, const char *name,
>  				   int name_len,
>  				   struct btrfs_inode_extref **extref_ret);
> diff --git a/fs/btrfs/inode-item.c b/fs/btrfs/inode-item.c
> index 39c968f80157..65e1a76bf755 100644
> --- a/fs/btrfs/inode-item.c
> +++ b/fs/btrfs/inode-item.c
> @@ -22,10 +22,10 @@
>  #include "transaction.h"
>  #include "print-tree.h"
>  
> -static int find_name_in_backref(struct btrfs_path *path, const char *name,
> -			 int name_len, struct btrfs_inode_ref **ref_ret)
> +int btrfs_find_name_in_backref(struct extent_buffer *leaf, int slot,
> +			       const char *name,
> +			       int name_len, struct btrfs_inode_ref **ref_ret)
>  {
> -	struct extent_buffer *leaf;
>  	struct btrfs_inode_ref *ref;
>  	unsigned long ptr;
>  	unsigned long name_ptr;
> @@ -33,9 +33,8 @@ static int find_name_in_backref(struct btrfs_path *path, const char *name,
>  	u32 cur_offset = 0;
>  	int len;
>  
> -	leaf = path->nodes[0];
> -	item_size = btrfs_item_size_nr(leaf, path->slots[0]);
> -	ptr = btrfs_item_ptr_offset(leaf, path->slots[0]);
> +	item_size = btrfs_item_size_nr(leaf, slot);
> +	ptr = btrfs_item_ptr_offset(leaf, slot);
>  	while (cur_offset < item_size) {
>  		ref = (struct btrfs_inode_ref *)(ptr + cur_offset);
>  		len = btrfs_inode_ref_name_len(leaf, ref);
> @@ -44,18 +43,19 @@ static int find_name_in_backref(struct btrfs_path *path, const char *name,
>  		if (len != name_len)
>  			continue;
>  		if (memcmp_extent_buffer(leaf, name, name_ptr, name_len) == 0) {
> -			*ref_ret = ref;
> +			if (ref_ret)
> +				*ref_ret = ref;
>  			return 1;
>  		}
>  	}
>  	return 0;
>  }
>  
> -int btrfs_find_name_in_ext_backref(struct btrfs_path *path, u64 ref_objectid,
> +int btrfs_find_name_in_ext_backref(struct extent_buffer *leaf, int slot,
> +				   u64 ref_objectid,
>  				   const char *name, int name_len,
>  				   struct btrfs_inode_extref **extref_ret)
>  {
> -	struct extent_buffer *leaf;
>  	struct btrfs_inode_extref *extref;
>  	unsigned long ptr;
>  	unsigned long name_ptr;
> @@ -63,9 +63,8 @@ int btrfs_find_name_in_ext_backref(struct btrfs_path *path, u64 ref_objectid,
>  	u32 cur_offset = 0;
>  	int ref_name_len;
>  
> -	leaf = path->nodes[0];
> -	item_size = btrfs_item_size_nr(leaf, path->slots[0]);
> -	ptr = btrfs_item_ptr_offset(leaf, path->slots[0]);
> +	item_size = btrfs_item_size_nr(leaf, slot);
> +	ptr = btrfs_item_ptr_offset(leaf, slot);
>  
>  	/*
>  	 * Search all extended backrefs in this item. We're only
> @@ -113,7 +112,9 @@ btrfs_lookup_inode_extref(struct btrfs_trans_handle *trans,
>  		return ERR_PTR(ret);
>  	if (ret > 0)
>  		return NULL;
> -	if (!btrfs_find_name_in_ext_backref(path, ref_objectid, name, name_len, &extref))
> +	if (!btrfs_find_name_in_ext_backref(path->nodes[0], path->slots[0],
> +					    ref_objectid, name, name_len,
> +					    &extref))
>  		return NULL;
>  	return extref;
>  }
> @@ -155,7 +156,8 @@ static int btrfs_del_inode_extref(struct btrfs_trans_handle *trans,
>  	 * This should always succeed so error here will make the FS
>  	 * readonly.
>  	 */
> -	if (!btrfs_find_name_in_ext_backref(path, ref_objectid,
> +	if (!btrfs_find_name_in_ext_backref(path->nodes[0], path->slots[0],
> +					    ref_objectid,
>  					    name, name_len, &extref)) {
>  		btrfs_handle_fs_error(root->fs_info, -ENOENT, NULL);
>  		ret = -EROFS;
> @@ -225,7 +227,8 @@ int btrfs_del_inode_ref(struct btrfs_trans_handle *trans,
>  	} else if (ret < 0) {
>  		goto out;
>  	}
> -	if (!find_name_in_backref(path, name, name_len, &ref)) {
> +	if (!btrfs_find_name_in_backref(path->nodes[0], path->slots[0],
> +					name, name_len, &ref)) {
>  		ret = -ENOENT;
>  		search_ext_refs = 1;
>  		goto out;
> @@ -293,7 +296,9 @@ static int btrfs_insert_inode_extref(struct btrfs_trans_handle *trans,
>  	ret = btrfs_insert_empty_item(trans, root, path, &key,
>  				      ins_len);
>  	if (ret == -EEXIST) {
> -		if (btrfs_find_name_in_ext_backref(path, ref_objectid,
> +		if (btrfs_find_name_in_ext_backref(path->nodes[0],
> +						   path->slots[0],
> +						   ref_objectid,
>  						   name, name_len, NULL))
>  			goto out;
>  
> @@ -351,7 +356,8 @@ int btrfs_insert_inode_ref(struct btrfs_trans_handle *trans,
>  	if (ret == -EEXIST) {
>  		u32 old_size;
>  
> -		if (find_name_in_backref(path, name, name_len, &ref))
> +		if (btrfs_find_name_in_backref(path->nodes[0], path->slots[0],
> +					       name, name_len, &ref))
>  			goto out;
>  
>  		old_size = btrfs_item_size_nr(path->nodes[0], path->slots[0]);
> @@ -365,7 +371,9 @@ int btrfs_insert_inode_ref(struct btrfs_trans_handle *trans,
>  		ret = 0;
>  	} else if (ret < 0) {
>  		if (ret == -EOVERFLOW) {
> -			if (find_name_in_backref(path, name, name_len, &ref))
> +			if (btrfs_find_name_in_backref(path->nodes[0],
> +						       path->slots[0],
> +						       name, name_len, &ref))
>  				ret = -EEXIST;
>  			else
>  				ret = -EMLINK;
> diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
> index 411a022489e4..fd573816f461 100644
> --- a/fs/btrfs/tree-log.c
> +++ b/fs/btrfs/tree-log.c
> @@ -966,7 +966,9 @@ static noinline int backref_in_log(struct btrfs_root *log,
>  	ptr = btrfs_item_ptr_offset(path->nodes[0], path->slots[0]);
>  
>  	if (key->type == BTRFS_INODE_EXTREF_KEY) {
> -		if (btrfs_find_name_in_ext_backref(path, ref_objectid,
> +		if (btrfs_find_name_in_ext_backref(path->nodes[0],
> +						   path->slots[0],
> +						   ref_objectid,
>  						   name, namelen, NULL))
>  			match = 1;
>  
> @@ -1190,7 +1192,8 @@ static int extref_get_fields(struct extent_buffer *eb, unsigned long ref_ptr,
>  	read_extent_buffer(eb, *name, (unsigned long)&extref->name,
>  			   *namelen);
>  
> -	*index = btrfs_inode_extref_index(eb, extref);
> +	if (index)
> +		*index = btrfs_inode_extref_index(eb, extref);
>  	if (parent_objectid)
>  		*parent_objectid = btrfs_inode_extref_parent(eb, extref);
>  
> @@ -1211,12 +1214,102 @@ static int ref_get_fields(struct extent_buffer *eb, unsigned long ref_ptr,
>  
>  	read_extent_buffer(eb, *name, (unsigned long)(ref + 1), *namelen);
>  
> -	*index = btrfs_inode_ref_index(eb, ref);
> +	if (index)
> +		*index = btrfs_inode_ref_index(eb, ref);
>  
>  	return 0;
>  }
>  
>  /*
> + * Take an inode reference item from the log tree and iterate all names from the
> + * inode reference item in the subvolume tree with the same key (if it exists).
> + * For any name that is not in the inode reference item from the log tree, do a
> + * proper unlink of that name (that is, remove its entry from the inode
> + * reference item and both dir index keys).
> + */
> +static int unlink_old_inode_refs(struct btrfs_trans_handle *trans,
> +				 struct btrfs_root *root,
> +				 struct btrfs_path *path,
> +				 struct btrfs_inode *inode,
> +				 struct extent_buffer *log_eb,
> +				 int log_slot,
> +				 struct btrfs_key *key)
> +{
> +	int ret;
> +	unsigned long ref_ptr;
> +	unsigned long ref_end;
> +	struct extent_buffer *eb;
> +
> +again:
> +	btrfs_release_path(path);
> +	ret = btrfs_search_slot(NULL, root, key, path, 0, 0);
> +	if (ret > 0) {
> +		ret = 0;
> +		goto out;
> +	}
> +	if (ret < 0)
> +		goto out;
> +
> +	eb = path->nodes[0];
> +	ref_ptr = btrfs_item_ptr_offset(eb, path->slots[0]);
> +	ref_end = ref_ptr + btrfs_item_size_nr(eb, path->slots[0]);
> +	while (ref_ptr < ref_end) {
> +		char *name = NULL;
> +		int namelen;
> +		u64 parent_id;
> +
> +		if (key->type == BTRFS_INODE_EXTREF_KEY) {
> +			ret = extref_get_fields(eb, ref_ptr, &namelen, &name,
> +						NULL, &parent_id);
> +		} else {
> +			parent_id = key->offset;
> +			ret = ref_get_fields(eb, ref_ptr, &namelen, &name,
> +					     NULL);
> +		}
> +		if (ret)
> +			goto out;
> +
> +		if (key->type == BTRFS_INODE_EXTREF_KEY)
> +			ret = btrfs_find_name_in_ext_backref(log_eb, log_slot,
> +							     parent_id, name,
> +							     namelen, NULL);
> +		else
> +			ret = btrfs_find_name_in_backref(log_eb, log_slot, name,
> +							 namelen, NULL);
> +
> +		if (!ret) {
> +			struct inode *dir;
> +
> +			btrfs_release_path(path);
> +			dir = read_one_inode(root, parent_id);
> +			if (!dir) {
> +				ret = -ENOENT;
> +				kfree(name);
> +				goto out;
> +			}
> +			ret = btrfs_unlink_inode(trans, root, BTRFS_I(dir),
> +						 inode, name, namelen);
> +			kfree(name);
> +			iput(dir);
> +			if (ret)
> +				goto out;
> +			goto again;
> +		}
> +
> +		kfree(name);
> +		ref_ptr += namelen;
> +		if (key->type == BTRFS_INODE_EXTREF_KEY)
> +			ref_ptr += sizeof(struct btrfs_inode_extref);
> +		else
> +			ref_ptr += sizeof(struct btrfs_inode_ref);
> +	}
> +	ret = 0;
> + out:
> +	btrfs_release_path(path);
> +	return ret;
> +}
> +
> +/*
>   * replay one inode back reference item found in the log tree.
>   * eb, slot and key refer to the buffer and key found in the log tree.
>   * root is the destination we are replaying into, and path is for temp
> @@ -1344,6 +1437,19 @@ static noinline int add_inode_ref(struct btrfs_trans_handle *trans,
>  		}
>  	}
>  
> +	/*
> +	 * Before we overwrite the inode reference item in the subvolume tree
> +	 * with the item from the log tree, we must unlink all names from the
> +	 * parent directory that are in the subvolume's tree inode reference
> +	 * item, otherwise we end up with an inconsistent subvolume tree where
> +	 * dir index entries exist for a name but there is no inode reference
> +	 * item with the same name.
> +	 */
> +	ret = unlink_old_inode_refs(trans, root, path, BTRFS_I(inode), eb, slot,
> +				    key);
> +	if (ret)
> +		goto out;
> +
>  	/* finally write the back reference in the inode */
>  	ret = overwrite_item(trans, root, path, eb, slot, key);
>  out:
> -- 
> 2.11.0
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Liu Bo March 2, 2018, 7 p.m. UTC | #2
On Fri, Mar 02, 2018 at 11:29:33AM -0700, Liu Bo wrote:
> On Wed, Feb 28, 2018 at 03:56:10PM +0000, fdmanana@kernel.org wrote:
> > From: Filipe Manana <fdmanana@suse.com>
> > 
> > If we have a file with 2 (or more) hard links in the same directory,
> > remove one of the hard links, create a new file (or link an existing file)
> > in the same directory with the name of the removed hard link, and then
> > finally fsync the new file, we end up with a log that fails to replay,
> > causing a mount failure.
> > 
> > Example:
> > 
> >   $ mkfs.btrfs -f /dev/sdb
> >   $ mount /dev/sdb /mnt
> > 
> >   $ mkdir /mnt/testdir
> >   $ touch /mnt/testdir/foo
> >   $ ln /mnt/testdir/foo /mnt/testdir/bar
> > 
> >   $ sync
> > 
> >   $ unlink /mnt/testdir/bar
> >   $ touch /mnt/testdir/bar

So here, "unlink & touch" similar to rename, therefore we could also
be conservative and set /mnt/testdir 's last_unlink_trans to force to
commit transaction.

Thanks,

-liubo

> >   $ xfs_io -c "fsync" /mnt/testdir/bar
> > 
> >   <power failure>
> > 
> >   $ mount /dev/sdb /mnt
> >   mount: mount(2) failed: /mnt: No such file or directory
> > 
> > When replaying the log, for that example, we also see the following in
> > dmesg/syslog:
> > 
> >   [71813.671307] BTRFS info (device dm-0): failed to delete reference to bar, inode 258 parent 257
> >   [71813.674204] ------------[ cut here ]------------
> >   [71813.675694] BTRFS: Transaction aborted (error -2)
> >   [71813.677236] WARNING: CPU: 1 PID: 13231 at fs/btrfs/inode.c:4128 __btrfs_unlink_inode+0x17b/0x355 [btrfs]
> >   [71813.679669] Modules linked in: btrfs xfs f2fs dm_flakey dm_mod dax ghash_clmulni_intel ppdev pcbc aesni_intel aes_x86_64 crypto_simd cryptd glue_helper evdev psmouse i2c_piix4 parport_pc i2c_core pcspkr sg serio_raw parport button sunrpc loop autofs4 ext4 crc16 mbcache jbd2 zstd_decompress zstd_compress xxhash raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c crc32c_generic raid1 raid0 multipath linear md_mod ata_generic sd_mod virtio_scsi ata_piix libata virtio_pci virtio_ring crc32c_intel floppy virtio e1000 scsi_mod [last unloaded: btrfs]
> >   [71813.679669] CPU: 1 PID: 13231 Comm: mount Tainted: G        W        4.15.0-rc9-btrfs-next-56+ #1
> >   [71813.679669] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.10.2-0-g5f4c7b1-prebuilt.qemu-project.org 04/01/2014
> >   [71813.679669] RIP: 0010:__btrfs_unlink_inode+0x17b/0x355 [btrfs]
> >   [71813.679669] RSP: 0018:ffffc90001cef738 EFLAGS: 00010286
> >   [71813.679669] RAX: 0000000000000025 RBX: ffff880217ce4708 RCX: 0000000000000001
> >   [71813.679669] RDX: 0000000000000000 RSI: ffffffff81c14bae RDI: 00000000ffffffff
> >   [71813.679669] RBP: ffffc90001cef7c0 R08: 0000000000000001 R09: 0000000000000001
> >   [71813.679669] R10: ffffc90001cef5e0 R11: ffffffff8343f007 R12: ffff880217d474c8
> >   [71813.679669] R13: 00000000fffffffe R14: ffff88021ccf1548 R15: 0000000000000101
> >   [71813.679669] FS:  00007f7cee84c480(0000) GS:ffff88023fc80000(0000) knlGS:0000000000000000
> >   [71813.679669] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> >   [71813.679669] CR2: 00007f7cedc1abf9 CR3: 00000002354b4003 CR4: 00000000001606e0
> >   [71813.679669] Call Trace:
> >   [71813.679669]  btrfs_unlink_inode+0x17/0x41 [btrfs]
> >   [71813.679669]  drop_one_dir_item+0xfa/0x131 [btrfs]
> >   [71813.679669]  add_inode_ref+0x71e/0x851 [btrfs]
> >   [71813.679669]  ? __lock_is_held+0x39/0x71
> >   [71813.679669]  ? replay_one_buffer+0x53/0x53a [btrfs]
> >   [71813.679669]  replay_one_buffer+0x4a4/0x53a [btrfs]
> >   [71813.679669]  ? rcu_read_unlock+0x3a/0x57
> >   [71813.679669]  ? __lock_is_held+0x39/0x71
> >   [71813.679669]  walk_up_log_tree+0x101/0x1d2 [btrfs]
> >   [71813.679669]  walk_log_tree+0xad/0x188 [btrfs]
> >   [71813.679669]  btrfs_recover_log_trees+0x1fa/0x31e [btrfs]
> >   [71813.679669]  ? replay_one_extent+0x544/0x544 [btrfs]
> >   [71813.679669]  open_ctree+0x1cf6/0x2209 [btrfs]
> >   [71813.679669]  btrfs_mount_root+0x368/0x482 [btrfs]
> >   [71813.679669]  ? trace_hardirqs_on_caller+0x14c/0x1a6
> >   [71813.679669]  ? __lockdep_init_map+0x176/0x1c2
> >   [71813.679669]  ? mount_fs+0x64/0x10b
> >   [71813.679669]  mount_fs+0x64/0x10b
> >   [71813.679669]  vfs_kern_mount+0x68/0xce
> >   [71813.679669]  btrfs_mount+0x13e/0x772 [btrfs]
> >   [71813.679669]  ? trace_hardirqs_on_caller+0x14c/0x1a6
> >   [71813.679669]  ? __lockdep_init_map+0x176/0x1c2
> >   [71813.679669]  ? mount_fs+0x64/0x10b
> >   [71813.679669]  mount_fs+0x64/0x10b
> >   [71813.679669]  vfs_kern_mount+0x68/0xce
> >   [71813.679669]  do_mount+0x6e5/0x973
> >   [71813.679669]  ? memdup_user+0x3e/0x5c
> >   [71813.679669]  SyS_mount+0x72/0x98
> >   [71813.679669]  entry_SYSCALL_64_fastpath+0x1e/0x8b
> >   [71813.679669] RIP: 0033:0x7f7cedf150ba
> >   [71813.679669] RSP: 002b:00007ffca71da688 EFLAGS: 00000206
> >   [71813.679669] Code: 7f a0 e8 51 0c fd ff 48 8b 43 50 f0 0f ba a8 30 2c 00 00 02 72 17 41 83 fd fb 74 11 44 89 ee 48 c7 c7 7d 11 7f a0 e8 38 f5 8d e0 <0f> ff 44 89 e9 ba 20 10 00 00 eb 4d 48 8b 4d b0 48 8b 75 88 4c
> >   [71813.679669] ---[ end trace 83bd473fc5b4663b ]---
> >   [71813.854764] BTRFS: error (device dm-0) in __btrfs_unlink_inode:4128: errno=-2 No such entry
> >   [71813.886994] BTRFS: error (device dm-0) in btrfs_replay_log:2307: errno=-2 No such entry (Failed to recover log tree)
> >   [71813.903357] BTRFS error (device dm-0): cleaner transaction attach returned -30
> >   [71814.128078] BTRFS error (device dm-0): open_ctree failed
> > 
> > This happens because the log has inode reference items for both inode 258
> > (the first file we created) and inode 259 (the second file created), and
> > when processing the reference item for inode 258, we replace the
> > corresponding item in the subvolume tree (which has two names, "foo" and
> > "bar") witht he one in the log (which only has one name, "foo") without
> > removing the corresponding dir index keys from the parent directory.
> > Later, when processing the inode reference item for inode 259, which has
> > a name of "bar" associated to it, we notice that dir index entries exist
> > for that name and for a different inode, so we attempt to unlink that
> > name, which fails because the inode reference item for inode 258 no longer
> > has the name "bar" associated to it, making a call to btrfs_unlink_inode()
> > fail with a -ENOENT error.
> > 
> > Fix this by unlinking all the names in an inode reference item from a
> > subvolume tree that are not present in the inode reference item found in
> > the log tree, before overwriting it with the item from the log tree.
> 
> Since the order during replaying is INODE_ITEM then DIR_INDEX then
> other types, in this particular case, can we fix the problem by also
> logging the parent so that we have the correct DIR_INDEX?
> 
> With DIR_INDEX, the problem would be fixed simpler, wouldn't it?
> 
> Thanks,
> 
> -liubo
> > 
> > Signed-off-by: Filipe Manana <fdmanana@suse.com>
> > ---
> >  fs/btrfs/ctree.h      |   5 ++-
> >  fs/btrfs/inode-item.c |  44 ++++++++++++--------
> >  fs/btrfs/tree-log.c   | 112 ++++++++++++++++++++++++++++++++++++++++++++++++--
> >  3 files changed, 139 insertions(+), 22 deletions(-)
> > 
> > diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> > index 1a462ab85c49..5d33478bc704 100644
> > --- a/fs/btrfs/ctree.h
> > +++ b/fs/btrfs/ctree.h
> > @@ -3095,7 +3095,10 @@ btrfs_lookup_inode_extref(struct btrfs_trans_handle *trans,
> >  			  u64 inode_objectid, u64 ref_objectid, int ins_len,
> >  			  int cow);
> >  
> > -int btrfs_find_name_in_ext_backref(struct btrfs_path *path,
> > +int btrfs_find_name_in_backref(struct extent_buffer *leaf, int slot,
> > +			       const char *name,
> > +			       int name_len, struct btrfs_inode_ref **ref_ret);
> > +int btrfs_find_name_in_ext_backref(struct extent_buffer *leaf, int slot,
> >  				   u64 ref_objectid, const char *name,
> >  				   int name_len,
> >  				   struct btrfs_inode_extref **extref_ret);
> > diff --git a/fs/btrfs/inode-item.c b/fs/btrfs/inode-item.c
> > index 39c968f80157..65e1a76bf755 100644
> > --- a/fs/btrfs/inode-item.c
> > +++ b/fs/btrfs/inode-item.c
> > @@ -22,10 +22,10 @@
> >  #include "transaction.h"
> >  #include "print-tree.h"
> >  
> > -static int find_name_in_backref(struct btrfs_path *path, const char *name,
> > -			 int name_len, struct btrfs_inode_ref **ref_ret)
> > +int btrfs_find_name_in_backref(struct extent_buffer *leaf, int slot,
> > +			       const char *name,
> > +			       int name_len, struct btrfs_inode_ref **ref_ret)
> >  {
> > -	struct extent_buffer *leaf;
> >  	struct btrfs_inode_ref *ref;
> >  	unsigned long ptr;
> >  	unsigned long name_ptr;
> > @@ -33,9 +33,8 @@ static int find_name_in_backref(struct btrfs_path *path, const char *name,
> >  	u32 cur_offset = 0;
> >  	int len;
> >  
> > -	leaf = path->nodes[0];
> > -	item_size = btrfs_item_size_nr(leaf, path->slots[0]);
> > -	ptr = btrfs_item_ptr_offset(leaf, path->slots[0]);
> > +	item_size = btrfs_item_size_nr(leaf, slot);
> > +	ptr = btrfs_item_ptr_offset(leaf, slot);
> >  	while (cur_offset < item_size) {
> >  		ref = (struct btrfs_inode_ref *)(ptr + cur_offset);
> >  		len = btrfs_inode_ref_name_len(leaf, ref);
> > @@ -44,18 +43,19 @@ static int find_name_in_backref(struct btrfs_path *path, const char *name,
> >  		if (len != name_len)
> >  			continue;
> >  		if (memcmp_extent_buffer(leaf, name, name_ptr, name_len) == 0) {
> > -			*ref_ret = ref;
> > +			if (ref_ret)
> > +				*ref_ret = ref;
> >  			return 1;
> >  		}
> >  	}
> >  	return 0;
> >  }
> >  
> > -int btrfs_find_name_in_ext_backref(struct btrfs_path *path, u64 ref_objectid,
> > +int btrfs_find_name_in_ext_backref(struct extent_buffer *leaf, int slot,
> > +				   u64 ref_objectid,
> >  				   const char *name, int name_len,
> >  				   struct btrfs_inode_extref **extref_ret)
> >  {
> > -	struct extent_buffer *leaf;
> >  	struct btrfs_inode_extref *extref;
> >  	unsigned long ptr;
> >  	unsigned long name_ptr;
> > @@ -63,9 +63,8 @@ int btrfs_find_name_in_ext_backref(struct btrfs_path *path, u64 ref_objectid,
> >  	u32 cur_offset = 0;
> >  	int ref_name_len;
> >  
> > -	leaf = path->nodes[0];
> > -	item_size = btrfs_item_size_nr(leaf, path->slots[0]);
> > -	ptr = btrfs_item_ptr_offset(leaf, path->slots[0]);
> > +	item_size = btrfs_item_size_nr(leaf, slot);
> > +	ptr = btrfs_item_ptr_offset(leaf, slot);
> >  
> >  	/*
> >  	 * Search all extended backrefs in this item. We're only
> > @@ -113,7 +112,9 @@ btrfs_lookup_inode_extref(struct btrfs_trans_handle *trans,
> >  		return ERR_PTR(ret);
> >  	if (ret > 0)
> >  		return NULL;
> > -	if (!btrfs_find_name_in_ext_backref(path, ref_objectid, name, name_len, &extref))
> > +	if (!btrfs_find_name_in_ext_backref(path->nodes[0], path->slots[0],
> > +					    ref_objectid, name, name_len,
> > +					    &extref))
> >  		return NULL;
> >  	return extref;
> >  }
> > @@ -155,7 +156,8 @@ static int btrfs_del_inode_extref(struct btrfs_trans_handle *trans,
> >  	 * This should always succeed so error here will make the FS
> >  	 * readonly.
> >  	 */
> > -	if (!btrfs_find_name_in_ext_backref(path, ref_objectid,
> > +	if (!btrfs_find_name_in_ext_backref(path->nodes[0], path->slots[0],
> > +					    ref_objectid,
> >  					    name, name_len, &extref)) {
> >  		btrfs_handle_fs_error(root->fs_info, -ENOENT, NULL);
> >  		ret = -EROFS;
> > @@ -225,7 +227,8 @@ int btrfs_del_inode_ref(struct btrfs_trans_handle *trans,
> >  	} else if (ret < 0) {
> >  		goto out;
> >  	}
> > -	if (!find_name_in_backref(path, name, name_len, &ref)) {
> > +	if (!btrfs_find_name_in_backref(path->nodes[0], path->slots[0],
> > +					name, name_len, &ref)) {
> >  		ret = -ENOENT;
> >  		search_ext_refs = 1;
> >  		goto out;
> > @@ -293,7 +296,9 @@ static int btrfs_insert_inode_extref(struct btrfs_trans_handle *trans,
> >  	ret = btrfs_insert_empty_item(trans, root, path, &key,
> >  				      ins_len);
> >  	if (ret == -EEXIST) {
> > -		if (btrfs_find_name_in_ext_backref(path, ref_objectid,
> > +		if (btrfs_find_name_in_ext_backref(path->nodes[0],
> > +						   path->slots[0],
> > +						   ref_objectid,
> >  						   name, name_len, NULL))
> >  			goto out;
> >  
> > @@ -351,7 +356,8 @@ int btrfs_insert_inode_ref(struct btrfs_trans_handle *trans,
> >  	if (ret == -EEXIST) {
> >  		u32 old_size;
> >  
> > -		if (find_name_in_backref(path, name, name_len, &ref))
> > +		if (btrfs_find_name_in_backref(path->nodes[0], path->slots[0],
> > +					       name, name_len, &ref))
> >  			goto out;
> >  
> >  		old_size = btrfs_item_size_nr(path->nodes[0], path->slots[0]);
> > @@ -365,7 +371,9 @@ int btrfs_insert_inode_ref(struct btrfs_trans_handle *trans,
> >  		ret = 0;
> >  	} else if (ret < 0) {
> >  		if (ret == -EOVERFLOW) {
> > -			if (find_name_in_backref(path, name, name_len, &ref))
> > +			if (btrfs_find_name_in_backref(path->nodes[0],
> > +						       path->slots[0],
> > +						       name, name_len, &ref))
> >  				ret = -EEXIST;
> >  			else
> >  				ret = -EMLINK;
> > diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
> > index 411a022489e4..fd573816f461 100644
> > --- a/fs/btrfs/tree-log.c
> > +++ b/fs/btrfs/tree-log.c
> > @@ -966,7 +966,9 @@ static noinline int backref_in_log(struct btrfs_root *log,
> >  	ptr = btrfs_item_ptr_offset(path->nodes[0], path->slots[0]);
> >  
> >  	if (key->type == BTRFS_INODE_EXTREF_KEY) {
> > -		if (btrfs_find_name_in_ext_backref(path, ref_objectid,
> > +		if (btrfs_find_name_in_ext_backref(path->nodes[0],
> > +						   path->slots[0],
> > +						   ref_objectid,
> >  						   name, namelen, NULL))
> >  			match = 1;
> >  
> > @@ -1190,7 +1192,8 @@ static int extref_get_fields(struct extent_buffer *eb, unsigned long ref_ptr,
> >  	read_extent_buffer(eb, *name, (unsigned long)&extref->name,
> >  			   *namelen);
> >  
> > -	*index = btrfs_inode_extref_index(eb, extref);
> > +	if (index)
> > +		*index = btrfs_inode_extref_index(eb, extref);
> >  	if (parent_objectid)
> >  		*parent_objectid = btrfs_inode_extref_parent(eb, extref);
> >  
> > @@ -1211,12 +1214,102 @@ static int ref_get_fields(struct extent_buffer *eb, unsigned long ref_ptr,
> >  
> >  	read_extent_buffer(eb, *name, (unsigned long)(ref + 1), *namelen);
> >  
> > -	*index = btrfs_inode_ref_index(eb, ref);
> > +	if (index)
> > +		*index = btrfs_inode_ref_index(eb, ref);
> >  
> >  	return 0;
> >  }
> >  
> >  /*
> > + * Take an inode reference item from the log tree and iterate all names from the
> > + * inode reference item in the subvolume tree with the same key (if it exists).
> > + * For any name that is not in the inode reference item from the log tree, do a
> > + * proper unlink of that name (that is, remove its entry from the inode
> > + * reference item and both dir index keys).
> > + */
> > +static int unlink_old_inode_refs(struct btrfs_trans_handle *trans,
> > +				 struct btrfs_root *root,
> > +				 struct btrfs_path *path,
> > +				 struct btrfs_inode *inode,
> > +				 struct extent_buffer *log_eb,
> > +				 int log_slot,
> > +				 struct btrfs_key *key)
> > +{
> > +	int ret;
> > +	unsigned long ref_ptr;
> > +	unsigned long ref_end;
> > +	struct extent_buffer *eb;
> > +
> > +again:
> > +	btrfs_release_path(path);
> > +	ret = btrfs_search_slot(NULL, root, key, path, 0, 0);
> > +	if (ret > 0) {
> > +		ret = 0;
> > +		goto out;
> > +	}
> > +	if (ret < 0)
> > +		goto out;
> > +
> > +	eb = path->nodes[0];
> > +	ref_ptr = btrfs_item_ptr_offset(eb, path->slots[0]);
> > +	ref_end = ref_ptr + btrfs_item_size_nr(eb, path->slots[0]);
> > +	while (ref_ptr < ref_end) {
> > +		char *name = NULL;
> > +		int namelen;
> > +		u64 parent_id;
> > +
> > +		if (key->type == BTRFS_INODE_EXTREF_KEY) {
> > +			ret = extref_get_fields(eb, ref_ptr, &namelen, &name,
> > +						NULL, &parent_id);
> > +		} else {
> > +			parent_id = key->offset;
> > +			ret = ref_get_fields(eb, ref_ptr, &namelen, &name,
> > +					     NULL);
> > +		}
> > +		if (ret)
> > +			goto out;
> > +
> > +		if (key->type == BTRFS_INODE_EXTREF_KEY)
> > +			ret = btrfs_find_name_in_ext_backref(log_eb, log_slot,
> > +							     parent_id, name,
> > +							     namelen, NULL);
> > +		else
> > +			ret = btrfs_find_name_in_backref(log_eb, log_slot, name,
> > +							 namelen, NULL);
> > +
> > +		if (!ret) {
> > +			struct inode *dir;
> > +
> > +			btrfs_release_path(path);
> > +			dir = read_one_inode(root, parent_id);
> > +			if (!dir) {
> > +				ret = -ENOENT;
> > +				kfree(name);
> > +				goto out;
> > +			}
> > +			ret = btrfs_unlink_inode(trans, root, BTRFS_I(dir),
> > +						 inode, name, namelen);
> > +			kfree(name);
> > +			iput(dir);
> > +			if (ret)
> > +				goto out;
> > +			goto again;
> > +		}
> > +
> > +		kfree(name);
> > +		ref_ptr += namelen;
> > +		if (key->type == BTRFS_INODE_EXTREF_KEY)
> > +			ref_ptr += sizeof(struct btrfs_inode_extref);
> > +		else
> > +			ref_ptr += sizeof(struct btrfs_inode_ref);
> > +	}
> > +	ret = 0;
> > + out:
> > +	btrfs_release_path(path);
> > +	return ret;
> > +}
> > +
> > +/*
> >   * replay one inode back reference item found in the log tree.
> >   * eb, slot and key refer to the buffer and key found in the log tree.
> >   * root is the destination we are replaying into, and path is for temp
> > @@ -1344,6 +1437,19 @@ static noinline int add_inode_ref(struct btrfs_trans_handle *trans,
> >  		}
> >  	}
> >  
> > +	/*
> > +	 * Before we overwrite the inode reference item in the subvolume tree
> > +	 * with the item from the log tree, we must unlink all names from the
> > +	 * parent directory that are in the subvolume's tree inode reference
> > +	 * item, otherwise we end up with an inconsistent subvolume tree where
> > +	 * dir index entries exist for a name but there is no inode reference
> > +	 * item with the same name.
> > +	 */
> > +	ret = unlink_old_inode_refs(trans, root, path, BTRFS_I(inode), eb, slot,
> > +				    key);
> > +	if (ret)
> > +		goto out;
> > +
> >  	/* finally write the back reference in the inode */
> >  	ret = overwrite_item(trans, root, path, eb, slot, key);
> >  out:
> > -- 
> > 2.11.0
> > 
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Filipe Manana March 2, 2018, 9:02 p.m. UTC | #3
On Fri, Mar 2, 2018 at 6:29 PM, Liu Bo <bo.li.liu@oracle.com> wrote:
> On Wed, Feb 28, 2018 at 03:56:10PM +0000, fdmanana@kernel.org wrote:
>> From: Filipe Manana <fdmanana@suse.com>
>>
>> If we have a file with 2 (or more) hard links in the same directory,
>> remove one of the hard links, create a new file (or link an existing file)
>> in the same directory with the name of the removed hard link, and then
>> finally fsync the new file, we end up with a log that fails to replay,
>> causing a mount failure.
>>
>> Example:
>>
>>   $ mkfs.btrfs -f /dev/sdb
>>   $ mount /dev/sdb /mnt
>>
>>   $ mkdir /mnt/testdir
>>   $ touch /mnt/testdir/foo
>>   $ ln /mnt/testdir/foo /mnt/testdir/bar
>>
>>   $ sync
>>
>>   $ unlink /mnt/testdir/bar
>>   $ touch /mnt/testdir/bar
>>   $ xfs_io -c "fsync" /mnt/testdir/bar
>>
>>   <power failure>
>>
>>   $ mount /dev/sdb /mnt
>>   mount: mount(2) failed: /mnt: No such file or directory
>>
>> When replaying the log, for that example, we also see the following in
>> dmesg/syslog:
>>
>>   [71813.671307] BTRFS info (device dm-0): failed to delete reference to bar, inode 258 parent 257
>>   [71813.674204] ------------[ cut here ]------------
>>   [71813.675694] BTRFS: Transaction aborted (error -2)
>>   [71813.677236] WARNING: CPU: 1 PID: 13231 at fs/btrfs/inode.c:4128 __btrfs_unlink_inode+0x17b/0x355 [btrfs]
>>   [71813.679669] Modules linked in: btrfs xfs f2fs dm_flakey dm_mod dax ghash_clmulni_intel ppdev pcbc aesni_intel aes_x86_64 crypto_simd cryptd glue_helper evdev psmouse i2c_piix4 parport_pc i2c_core pcspkr sg serio_raw parport button sunrpc loop autofs4 ext4 crc16 mbcache jbd2 zstd_decompress zstd_compress xxhash raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c crc32c_generic raid1 raid0 multipath linear md_mod ata_generic sd_mod virtio_scsi ata_piix libata virtio_pci virtio_ring crc32c_intel floppy virtio e1000 scsi_mod [last unloaded: btrfs]
>>   [71813.679669] CPU: 1 PID: 13231 Comm: mount Tainted: G        W        4.15.0-rc9-btrfs-next-56+ #1
>>   [71813.679669] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.10.2-0-g5f4c7b1-prebuilt.qemu-project.org 04/01/2014
>>   [71813.679669] RIP: 0010:__btrfs_unlink_inode+0x17b/0x355 [btrfs]
>>   [71813.679669] RSP: 0018:ffffc90001cef738 EFLAGS: 00010286
>>   [71813.679669] RAX: 0000000000000025 RBX: ffff880217ce4708 RCX: 0000000000000001
>>   [71813.679669] RDX: 0000000000000000 RSI: ffffffff81c14bae RDI: 00000000ffffffff
>>   [71813.679669] RBP: ffffc90001cef7c0 R08: 0000000000000001 R09: 0000000000000001
>>   [71813.679669] R10: ffffc90001cef5e0 R11: ffffffff8343f007 R12: ffff880217d474c8
>>   [71813.679669] R13: 00000000fffffffe R14: ffff88021ccf1548 R15: 0000000000000101
>>   [71813.679669] FS:  00007f7cee84c480(0000) GS:ffff88023fc80000(0000) knlGS:0000000000000000
>>   [71813.679669] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>   [71813.679669] CR2: 00007f7cedc1abf9 CR3: 00000002354b4003 CR4: 00000000001606e0
>>   [71813.679669] Call Trace:
>>   [71813.679669]  btrfs_unlink_inode+0x17/0x41 [btrfs]
>>   [71813.679669]  drop_one_dir_item+0xfa/0x131 [btrfs]
>>   [71813.679669]  add_inode_ref+0x71e/0x851 [btrfs]
>>   [71813.679669]  ? __lock_is_held+0x39/0x71
>>   [71813.679669]  ? replay_one_buffer+0x53/0x53a [btrfs]
>>   [71813.679669]  replay_one_buffer+0x4a4/0x53a [btrfs]
>>   [71813.679669]  ? rcu_read_unlock+0x3a/0x57
>>   [71813.679669]  ? __lock_is_held+0x39/0x71
>>   [71813.679669]  walk_up_log_tree+0x101/0x1d2 [btrfs]
>>   [71813.679669]  walk_log_tree+0xad/0x188 [btrfs]
>>   [71813.679669]  btrfs_recover_log_trees+0x1fa/0x31e [btrfs]
>>   [71813.679669]  ? replay_one_extent+0x544/0x544 [btrfs]
>>   [71813.679669]  open_ctree+0x1cf6/0x2209 [btrfs]
>>   [71813.679669]  btrfs_mount_root+0x368/0x482 [btrfs]
>>   [71813.679669]  ? trace_hardirqs_on_caller+0x14c/0x1a6
>>   [71813.679669]  ? __lockdep_init_map+0x176/0x1c2
>>   [71813.679669]  ? mount_fs+0x64/0x10b
>>   [71813.679669]  mount_fs+0x64/0x10b
>>   [71813.679669]  vfs_kern_mount+0x68/0xce
>>   [71813.679669]  btrfs_mount+0x13e/0x772 [btrfs]
>>   [71813.679669]  ? trace_hardirqs_on_caller+0x14c/0x1a6
>>   [71813.679669]  ? __lockdep_init_map+0x176/0x1c2
>>   [71813.679669]  ? mount_fs+0x64/0x10b
>>   [71813.679669]  mount_fs+0x64/0x10b
>>   [71813.679669]  vfs_kern_mount+0x68/0xce
>>   [71813.679669]  do_mount+0x6e5/0x973
>>   [71813.679669]  ? memdup_user+0x3e/0x5c
>>   [71813.679669]  SyS_mount+0x72/0x98
>>   [71813.679669]  entry_SYSCALL_64_fastpath+0x1e/0x8b
>>   [71813.679669] RIP: 0033:0x7f7cedf150ba
>>   [71813.679669] RSP: 002b:00007ffca71da688 EFLAGS: 00000206
>>   [71813.679669] Code: 7f a0 e8 51 0c fd ff 48 8b 43 50 f0 0f ba a8 30 2c 00 00 02 72 17 41 83 fd fb 74 11 44 89 ee 48 c7 c7 7d 11 7f a0 e8 38 f5 8d e0 <0f> ff 44 89 e9 ba 20 10 00 00 eb 4d 48 8b 4d b0 48 8b 75 88 4c
>>   [71813.679669] ---[ end trace 83bd473fc5b4663b ]---
>>   [71813.854764] BTRFS: error (device dm-0) in __btrfs_unlink_inode:4128: errno=-2 No such entry
>>   [71813.886994] BTRFS: error (device dm-0) in btrfs_replay_log:2307: errno=-2 No such entry (Failed to recover log tree)
>>   [71813.903357] BTRFS error (device dm-0): cleaner transaction attach returned -30
>>   [71814.128078] BTRFS error (device dm-0): open_ctree failed
>>
>> This happens because the log has inode reference items for both inode 258
>> (the first file we created) and inode 259 (the second file created), and
>> when processing the reference item for inode 258, we replace the
>> corresponding item in the subvolume tree (which has two names, "foo" and
>> "bar") witht he one in the log (which only has one name, "foo") without
>> removing the corresponding dir index keys from the parent directory.
>> Later, when processing the inode reference item for inode 259, which has
>> a name of "bar" associated to it, we notice that dir index entries exist
>> for that name and for a different inode, so we attempt to unlink that
>> name, which fails because the inode reference item for inode 258 no longer
>> has the name "bar" associated to it, making a call to btrfs_unlink_inode()
>> fail with a -ENOENT error.
>>
>> Fix this by unlinking all the names in an inode reference item from a
>> subvolume tree that are not present in the inode reference item found in
>> the log tree, before overwriting it with the item from the log tree.
>
> Since the order during replaying is INODE_ITEM then DIR_INDEX then
> other types, in this particular case, can we fix the problem by also
> logging the parent so that we have the correct DIR_INDEX?
>
> With DIR_INDEX, the problem would be fixed simpler, wouldn't it?

It wouldn't work when the parent directory has a higher inode number
then the child.
Plus for large directories, that would take a performance penalty on
fsync'ing any file inside them (existing or new).

thanks

>
> Thanks,
>
> -liubo
>>
>> Signed-off-by: Filipe Manana <fdmanana@suse.com>
>> ---
>>  fs/btrfs/ctree.h      |   5 ++-
>>  fs/btrfs/inode-item.c |  44 ++++++++++++--------
>>  fs/btrfs/tree-log.c   | 112 ++++++++++++++++++++++++++++++++++++++++++++++++--
>>  3 files changed, 139 insertions(+), 22 deletions(-)
>>
>> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
>> index 1a462ab85c49..5d33478bc704 100644
>> --- a/fs/btrfs/ctree.h
>> +++ b/fs/btrfs/ctree.h
>> @@ -3095,7 +3095,10 @@ btrfs_lookup_inode_extref(struct btrfs_trans_handle *trans,
>>                         u64 inode_objectid, u64 ref_objectid, int ins_len,
>>                         int cow);
>>
>> -int btrfs_find_name_in_ext_backref(struct btrfs_path *path,
>> +int btrfs_find_name_in_backref(struct extent_buffer *leaf, int slot,
>> +                            const char *name,
>> +                            int name_len, struct btrfs_inode_ref **ref_ret);
>> +int btrfs_find_name_in_ext_backref(struct extent_buffer *leaf, int slot,
>>                                  u64 ref_objectid, const char *name,
>>                                  int name_len,
>>                                  struct btrfs_inode_extref **extref_ret);
>> diff --git a/fs/btrfs/inode-item.c b/fs/btrfs/inode-item.c
>> index 39c968f80157..65e1a76bf755 100644
>> --- a/fs/btrfs/inode-item.c
>> +++ b/fs/btrfs/inode-item.c
>> @@ -22,10 +22,10 @@
>>  #include "transaction.h"
>>  #include "print-tree.h"
>>
>> -static int find_name_in_backref(struct btrfs_path *path, const char *name,
>> -                      int name_len, struct btrfs_inode_ref **ref_ret)
>> +int btrfs_find_name_in_backref(struct extent_buffer *leaf, int slot,
>> +                            const char *name,
>> +                            int name_len, struct btrfs_inode_ref **ref_ret)
>>  {
>> -     struct extent_buffer *leaf;
>>       struct btrfs_inode_ref *ref;
>>       unsigned long ptr;
>>       unsigned long name_ptr;
>> @@ -33,9 +33,8 @@ static int find_name_in_backref(struct btrfs_path *path, const char *name,
>>       u32 cur_offset = 0;
>>       int len;
>>
>> -     leaf = path->nodes[0];
>> -     item_size = btrfs_item_size_nr(leaf, path->slots[0]);
>> -     ptr = btrfs_item_ptr_offset(leaf, path->slots[0]);
>> +     item_size = btrfs_item_size_nr(leaf, slot);
>> +     ptr = btrfs_item_ptr_offset(leaf, slot);
>>       while (cur_offset < item_size) {
>>               ref = (struct btrfs_inode_ref *)(ptr + cur_offset);
>>               len = btrfs_inode_ref_name_len(leaf, ref);
>> @@ -44,18 +43,19 @@ static int find_name_in_backref(struct btrfs_path *path, const char *name,
>>               if (len != name_len)
>>                       continue;
>>               if (memcmp_extent_buffer(leaf, name, name_ptr, name_len) == 0) {
>> -                     *ref_ret = ref;
>> +                     if (ref_ret)
>> +                             *ref_ret = ref;
>>                       return 1;
>>               }
>>       }
>>       return 0;
>>  }
>>
>> -int btrfs_find_name_in_ext_backref(struct btrfs_path *path, u64 ref_objectid,
>> +int btrfs_find_name_in_ext_backref(struct extent_buffer *leaf, int slot,
>> +                                u64 ref_objectid,
>>                                  const char *name, int name_len,
>>                                  struct btrfs_inode_extref **extref_ret)
>>  {
>> -     struct extent_buffer *leaf;
>>       struct btrfs_inode_extref *extref;
>>       unsigned long ptr;
>>       unsigned long name_ptr;
>> @@ -63,9 +63,8 @@ int btrfs_find_name_in_ext_backref(struct btrfs_path *path, u64 ref_objectid,
>>       u32 cur_offset = 0;
>>       int ref_name_len;
>>
>> -     leaf = path->nodes[0];
>> -     item_size = btrfs_item_size_nr(leaf, path->slots[0]);
>> -     ptr = btrfs_item_ptr_offset(leaf, path->slots[0]);
>> +     item_size = btrfs_item_size_nr(leaf, slot);
>> +     ptr = btrfs_item_ptr_offset(leaf, slot);
>>
>>       /*
>>        * Search all extended backrefs in this item. We're only
>> @@ -113,7 +112,9 @@ btrfs_lookup_inode_extref(struct btrfs_trans_handle *trans,
>>               return ERR_PTR(ret);
>>       if (ret > 0)
>>               return NULL;
>> -     if (!btrfs_find_name_in_ext_backref(path, ref_objectid, name, name_len, &extref))
>> +     if (!btrfs_find_name_in_ext_backref(path->nodes[0], path->slots[0],
>> +                                         ref_objectid, name, name_len,
>> +                                         &extref))
>>               return NULL;
>>       return extref;
>>  }
>> @@ -155,7 +156,8 @@ static int btrfs_del_inode_extref(struct btrfs_trans_handle *trans,
>>        * This should always succeed so error here will make the FS
>>        * readonly.
>>        */
>> -     if (!btrfs_find_name_in_ext_backref(path, ref_objectid,
>> +     if (!btrfs_find_name_in_ext_backref(path->nodes[0], path->slots[0],
>> +                                         ref_objectid,
>>                                           name, name_len, &extref)) {
>>               btrfs_handle_fs_error(root->fs_info, -ENOENT, NULL);
>>               ret = -EROFS;
>> @@ -225,7 +227,8 @@ int btrfs_del_inode_ref(struct btrfs_trans_handle *trans,
>>       } else if (ret < 0) {
>>               goto out;
>>       }
>> -     if (!find_name_in_backref(path, name, name_len, &ref)) {
>> +     if (!btrfs_find_name_in_backref(path->nodes[0], path->slots[0],
>> +                                     name, name_len, &ref)) {
>>               ret = -ENOENT;
>>               search_ext_refs = 1;
>>               goto out;
>> @@ -293,7 +296,9 @@ static int btrfs_insert_inode_extref(struct btrfs_trans_handle *trans,
>>       ret = btrfs_insert_empty_item(trans, root, path, &key,
>>                                     ins_len);
>>       if (ret == -EEXIST) {
>> -             if (btrfs_find_name_in_ext_backref(path, ref_objectid,
>> +             if (btrfs_find_name_in_ext_backref(path->nodes[0],
>> +                                                path->slots[0],
>> +                                                ref_objectid,
>>                                                  name, name_len, NULL))
>>                       goto out;
>>
>> @@ -351,7 +356,8 @@ int btrfs_insert_inode_ref(struct btrfs_trans_handle *trans,
>>       if (ret == -EEXIST) {
>>               u32 old_size;
>>
>> -             if (find_name_in_backref(path, name, name_len, &ref))
>> +             if (btrfs_find_name_in_backref(path->nodes[0], path->slots[0],
>> +                                            name, name_len, &ref))
>>                       goto out;
>>
>>               old_size = btrfs_item_size_nr(path->nodes[0], path->slots[0]);
>> @@ -365,7 +371,9 @@ int btrfs_insert_inode_ref(struct btrfs_trans_handle *trans,
>>               ret = 0;
>>       } else if (ret < 0) {
>>               if (ret == -EOVERFLOW) {
>> -                     if (find_name_in_backref(path, name, name_len, &ref))
>> +                     if (btrfs_find_name_in_backref(path->nodes[0],
>> +                                                    path->slots[0],
>> +                                                    name, name_len, &ref))
>>                               ret = -EEXIST;
>>                       else
>>                               ret = -EMLINK;
>> diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
>> index 411a022489e4..fd573816f461 100644
>> --- a/fs/btrfs/tree-log.c
>> +++ b/fs/btrfs/tree-log.c
>> @@ -966,7 +966,9 @@ static noinline int backref_in_log(struct btrfs_root *log,
>>       ptr = btrfs_item_ptr_offset(path->nodes[0], path->slots[0]);
>>
>>       if (key->type == BTRFS_INODE_EXTREF_KEY) {
>> -             if (btrfs_find_name_in_ext_backref(path, ref_objectid,
>> +             if (btrfs_find_name_in_ext_backref(path->nodes[0],
>> +                                                path->slots[0],
>> +                                                ref_objectid,
>>                                                  name, namelen, NULL))
>>                       match = 1;
>>
>> @@ -1190,7 +1192,8 @@ static int extref_get_fields(struct extent_buffer *eb, unsigned long ref_ptr,
>>       read_extent_buffer(eb, *name, (unsigned long)&extref->name,
>>                          *namelen);
>>
>> -     *index = btrfs_inode_extref_index(eb, extref);
>> +     if (index)
>> +             *index = btrfs_inode_extref_index(eb, extref);
>>       if (parent_objectid)
>>               *parent_objectid = btrfs_inode_extref_parent(eb, extref);
>>
>> @@ -1211,12 +1214,102 @@ static int ref_get_fields(struct extent_buffer *eb, unsigned long ref_ptr,
>>
>>       read_extent_buffer(eb, *name, (unsigned long)(ref + 1), *namelen);
>>
>> -     *index = btrfs_inode_ref_index(eb, ref);
>> +     if (index)
>> +             *index = btrfs_inode_ref_index(eb, ref);
>>
>>       return 0;
>>  }
>>
>>  /*
>> + * Take an inode reference item from the log tree and iterate all names from the
>> + * inode reference item in the subvolume tree with the same key (if it exists).
>> + * For any name that is not in the inode reference item from the log tree, do a
>> + * proper unlink of that name (that is, remove its entry from the inode
>> + * reference item and both dir index keys).
>> + */
>> +static int unlink_old_inode_refs(struct btrfs_trans_handle *trans,
>> +                              struct btrfs_root *root,
>> +                              struct btrfs_path *path,
>> +                              struct btrfs_inode *inode,
>> +                              struct extent_buffer *log_eb,
>> +                              int log_slot,
>> +                              struct btrfs_key *key)
>> +{
>> +     int ret;
>> +     unsigned long ref_ptr;
>> +     unsigned long ref_end;
>> +     struct extent_buffer *eb;
>> +
>> +again:
>> +     btrfs_release_path(path);
>> +     ret = btrfs_search_slot(NULL, root, key, path, 0, 0);
>> +     if (ret > 0) {
>> +             ret = 0;
>> +             goto out;
>> +     }
>> +     if (ret < 0)
>> +             goto out;
>> +
>> +     eb = path->nodes[0];
>> +     ref_ptr = btrfs_item_ptr_offset(eb, path->slots[0]);
>> +     ref_end = ref_ptr + btrfs_item_size_nr(eb, path->slots[0]);
>> +     while (ref_ptr < ref_end) {
>> +             char *name = NULL;
>> +             int namelen;
>> +             u64 parent_id;
>> +
>> +             if (key->type == BTRFS_INODE_EXTREF_KEY) {
>> +                     ret = extref_get_fields(eb, ref_ptr, &namelen, &name,
>> +                                             NULL, &parent_id);
>> +             } else {
>> +                     parent_id = key->offset;
>> +                     ret = ref_get_fields(eb, ref_ptr, &namelen, &name,
>> +                                          NULL);
>> +             }
>> +             if (ret)
>> +                     goto out;
>> +
>> +             if (key->type == BTRFS_INODE_EXTREF_KEY)
>> +                     ret = btrfs_find_name_in_ext_backref(log_eb, log_slot,
>> +                                                          parent_id, name,
>> +                                                          namelen, NULL);
>> +             else
>> +                     ret = btrfs_find_name_in_backref(log_eb, log_slot, name,
>> +                                                      namelen, NULL);
>> +
>> +             if (!ret) {
>> +                     struct inode *dir;
>> +
>> +                     btrfs_release_path(path);
>> +                     dir = read_one_inode(root, parent_id);
>> +                     if (!dir) {
>> +                             ret = -ENOENT;
>> +                             kfree(name);
>> +                             goto out;
>> +                     }
>> +                     ret = btrfs_unlink_inode(trans, root, BTRFS_I(dir),
>> +                                              inode, name, namelen);
>> +                     kfree(name);
>> +                     iput(dir);
>> +                     if (ret)
>> +                             goto out;
>> +                     goto again;
>> +             }
>> +
>> +             kfree(name);
>> +             ref_ptr += namelen;
>> +             if (key->type == BTRFS_INODE_EXTREF_KEY)
>> +                     ref_ptr += sizeof(struct btrfs_inode_extref);
>> +             else
>> +                     ref_ptr += sizeof(struct btrfs_inode_ref);
>> +     }
>> +     ret = 0;
>> + out:
>> +     btrfs_release_path(path);
>> +     return ret;
>> +}
>> +
>> +/*
>>   * replay one inode back reference item found in the log tree.
>>   * eb, slot and key refer to the buffer and key found in the log tree.
>>   * root is the destination we are replaying into, and path is for temp
>> @@ -1344,6 +1437,19 @@ static noinline int add_inode_ref(struct btrfs_trans_handle *trans,
>>               }
>>       }
>>
>> +     /*
>> +      * Before we overwrite the inode reference item in the subvolume tree
>> +      * with the item from the log tree, we must unlink all names from the
>> +      * parent directory that are in the subvolume's tree inode reference
>> +      * item, otherwise we end up with an inconsistent subvolume tree where
>> +      * dir index entries exist for a name but there is no inode reference
>> +      * item with the same name.
>> +      */
>> +     ret = unlink_old_inode_refs(trans, root, path, BTRFS_I(inode), eb, slot,
>> +                                 key);
>> +     if (ret)
>> +             goto out;
>> +
>>       /* finally write the back reference in the inode */
>>       ret = overwrite_item(trans, root, path, eb, slot, key);
>>  out:
>> --
>> 2.11.0
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Filipe Manana March 2, 2018, 9:08 p.m. UTC | #4
On Fri, Mar 2, 2018 at 7:00 PM, Liu Bo <bo.li.liu@oracle.com> wrote:
> On Fri, Mar 02, 2018 at 11:29:33AM -0700, Liu Bo wrote:
>> On Wed, Feb 28, 2018 at 03:56:10PM +0000, fdmanana@kernel.org wrote:
>> > From: Filipe Manana <fdmanana@suse.com>
>> >
>> > If we have a file with 2 (or more) hard links in the same directory,
>> > remove one of the hard links, create a new file (or link an existing file)
>> > in the same directory with the name of the removed hard link, and then
>> > finally fsync the new file, we end up with a log that fails to replay,
>> > causing a mount failure.
>> >
>> > Example:
>> >
>> >   $ mkfs.btrfs -f /dev/sdb
>> >   $ mount /dev/sdb /mnt
>> >
>> >   $ mkdir /mnt/testdir
>> >   $ touch /mnt/testdir/foo
>> >   $ ln /mnt/testdir/foo /mnt/testdir/bar
>> >
>> >   $ sync
>> >
>> >   $ unlink /mnt/testdir/bar
>> >   $ touch /mnt/testdir/bar
>
> So here, "unlink & touch" similar to rename, therefore we could also
> be conservative and set /mnt/testdir 's last_unlink_trans to force to
> commit transaction.

That would cause any fsync on any file inside the directory to result
in a full transaction commit.
We had in the past more "exotic" scenarios with renames and creating
new files with the same old name where the initial solution was just
that, forcing a full commit [1], just to find out later that it caused
big performance drop on real workloads [2]. That's why I'm avoiding
it, even because fsync'ing any file in the directory (existing or new)
isn't uncommon at all.

[1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=56f23fdbb600e6087db7b009775b95ce07cc3195
[2] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=44f714dae50a2e795d3268a6831762aa6fa54f55

thanks

>
> Thanks,
>
> -liubo
>
>> >   $ xfs_io -c "fsync" /mnt/testdir/bar
>> >
>> >   <power failure>
>> >
>> >   $ mount /dev/sdb /mnt
>> >   mount: mount(2) failed: /mnt: No such file or directory
>> >
>> > When replaying the log, for that example, we also see the following in
>> > dmesg/syslog:
>> >
>> >   [71813.671307] BTRFS info (device dm-0): failed to delete reference to bar, inode 258 parent 257
>> >   [71813.674204] ------------[ cut here ]------------
>> >   [71813.675694] BTRFS: Transaction aborted (error -2)
>> >   [71813.677236] WARNING: CPU: 1 PID: 13231 at fs/btrfs/inode.c:4128 __btrfs_unlink_inode+0x17b/0x355 [btrfs]
>> >   [71813.679669] Modules linked in: btrfs xfs f2fs dm_flakey dm_mod dax ghash_clmulni_intel ppdev pcbc aesni_intel aes_x86_64 crypto_simd cryptd glue_helper evdev psmouse i2c_piix4 parport_pc i2c_core pcspkr sg serio_raw parport button sunrpc loop autofs4 ext4 crc16 mbcache jbd2 zstd_decompress zstd_compress xxhash raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c crc32c_generic raid1 raid0 multipath linear md_mod ata_generic sd_mod virtio_scsi ata_piix libata virtio_pci virtio_ring crc32c_intel floppy virtio e1000 scsi_mod [last unloaded: btrfs]
>> >   [71813.679669] CPU: 1 PID: 13231 Comm: mount Tainted: G        W        4.15.0-rc9-btrfs-next-56+ #1
>> >   [71813.679669] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.10.2-0-g5f4c7b1-prebuilt.qemu-project.org 04/01/2014
>> >   [71813.679669] RIP: 0010:__btrfs_unlink_inode+0x17b/0x355 [btrfs]
>> >   [71813.679669] RSP: 0018:ffffc90001cef738 EFLAGS: 00010286
>> >   [71813.679669] RAX: 0000000000000025 RBX: ffff880217ce4708 RCX: 0000000000000001
>> >   [71813.679669] RDX: 0000000000000000 RSI: ffffffff81c14bae RDI: 00000000ffffffff
>> >   [71813.679669] RBP: ffffc90001cef7c0 R08: 0000000000000001 R09: 0000000000000001
>> >   [71813.679669] R10: ffffc90001cef5e0 R11: ffffffff8343f007 R12: ffff880217d474c8
>> >   [71813.679669] R13: 00000000fffffffe R14: ffff88021ccf1548 R15: 0000000000000101
>> >   [71813.679669] FS:  00007f7cee84c480(0000) GS:ffff88023fc80000(0000) knlGS:0000000000000000
>> >   [71813.679669] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> >   [71813.679669] CR2: 00007f7cedc1abf9 CR3: 00000002354b4003 CR4: 00000000001606e0
>> >   [71813.679669] Call Trace:
>> >   [71813.679669]  btrfs_unlink_inode+0x17/0x41 [btrfs]
>> >   [71813.679669]  drop_one_dir_item+0xfa/0x131 [btrfs]
>> >   [71813.679669]  add_inode_ref+0x71e/0x851 [btrfs]
>> >   [71813.679669]  ? __lock_is_held+0x39/0x71
>> >   [71813.679669]  ? replay_one_buffer+0x53/0x53a [btrfs]
>> >   [71813.679669]  replay_one_buffer+0x4a4/0x53a [btrfs]
>> >   [71813.679669]  ? rcu_read_unlock+0x3a/0x57
>> >   [71813.679669]  ? __lock_is_held+0x39/0x71
>> >   [71813.679669]  walk_up_log_tree+0x101/0x1d2 [btrfs]
>> >   [71813.679669]  walk_log_tree+0xad/0x188 [btrfs]
>> >   [71813.679669]  btrfs_recover_log_trees+0x1fa/0x31e [btrfs]
>> >   [71813.679669]  ? replay_one_extent+0x544/0x544 [btrfs]
>> >   [71813.679669]  open_ctree+0x1cf6/0x2209 [btrfs]
>> >   [71813.679669]  btrfs_mount_root+0x368/0x482 [btrfs]
>> >   [71813.679669]  ? trace_hardirqs_on_caller+0x14c/0x1a6
>> >   [71813.679669]  ? __lockdep_init_map+0x176/0x1c2
>> >   [71813.679669]  ? mount_fs+0x64/0x10b
>> >   [71813.679669]  mount_fs+0x64/0x10b
>> >   [71813.679669]  vfs_kern_mount+0x68/0xce
>> >   [71813.679669]  btrfs_mount+0x13e/0x772 [btrfs]
>> >   [71813.679669]  ? trace_hardirqs_on_caller+0x14c/0x1a6
>> >   [71813.679669]  ? __lockdep_init_map+0x176/0x1c2
>> >   [71813.679669]  ? mount_fs+0x64/0x10b
>> >   [71813.679669]  mount_fs+0x64/0x10b
>> >   [71813.679669]  vfs_kern_mount+0x68/0xce
>> >   [71813.679669]  do_mount+0x6e5/0x973
>> >   [71813.679669]  ? memdup_user+0x3e/0x5c
>> >   [71813.679669]  SyS_mount+0x72/0x98
>> >   [71813.679669]  entry_SYSCALL_64_fastpath+0x1e/0x8b
>> >   [71813.679669] RIP: 0033:0x7f7cedf150ba
>> >   [71813.679669] RSP: 002b:00007ffca71da688 EFLAGS: 00000206
>> >   [71813.679669] Code: 7f a0 e8 51 0c fd ff 48 8b 43 50 f0 0f ba a8 30 2c 00 00 02 72 17 41 83 fd fb 74 11 44 89 ee 48 c7 c7 7d 11 7f a0 e8 38 f5 8d e0 <0f> ff 44 89 e9 ba 20 10 00 00 eb 4d 48 8b 4d b0 48 8b 75 88 4c
>> >   [71813.679669] ---[ end trace 83bd473fc5b4663b ]---
>> >   [71813.854764] BTRFS: error (device dm-0) in __btrfs_unlink_inode:4128: errno=-2 No such entry
>> >   [71813.886994] BTRFS: error (device dm-0) in btrfs_replay_log:2307: errno=-2 No such entry (Failed to recover log tree)
>> >   [71813.903357] BTRFS error (device dm-0): cleaner transaction attach returned -30
>> >   [71814.128078] BTRFS error (device dm-0): open_ctree failed
>> >
>> > This happens because the log has inode reference items for both inode 258
>> > (the first file we created) and inode 259 (the second file created), and
>> > when processing the reference item for inode 258, we replace the
>> > corresponding item in the subvolume tree (which has two names, "foo" and
>> > "bar") witht he one in the log (which only has one name, "foo") without
>> > removing the corresponding dir index keys from the parent directory.
>> > Later, when processing the inode reference item for inode 259, which has
>> > a name of "bar" associated to it, we notice that dir index entries exist
>> > for that name and for a different inode, so we attempt to unlink that
>> > name, which fails because the inode reference item for inode 258 no longer
>> > has the name "bar" associated to it, making a call to btrfs_unlink_inode()
>> > fail with a -ENOENT error.
>> >
>> > Fix this by unlinking all the names in an inode reference item from a
>> > subvolume tree that are not present in the inode reference item found in
>> > the log tree, before overwriting it with the item from the log tree.
>>
>> Since the order during replaying is INODE_ITEM then DIR_INDEX then
>> other types, in this particular case, can we fix the problem by also
>> logging the parent so that we have the correct DIR_INDEX?
>>
>> With DIR_INDEX, the problem would be fixed simpler, wouldn't it?
>>
>> Thanks,
>>
>> -liubo
>> >
>> > Signed-off-by: Filipe Manana <fdmanana@suse.com>
>> > ---
>> >  fs/btrfs/ctree.h      |   5 ++-
>> >  fs/btrfs/inode-item.c |  44 ++++++++++++--------
>> >  fs/btrfs/tree-log.c   | 112 ++++++++++++++++++++++++++++++++++++++++++++++++--
>> >  3 files changed, 139 insertions(+), 22 deletions(-)
>> >
>> > diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
>> > index 1a462ab85c49..5d33478bc704 100644
>> > --- a/fs/btrfs/ctree.h
>> > +++ b/fs/btrfs/ctree.h
>> > @@ -3095,7 +3095,10 @@ btrfs_lookup_inode_extref(struct btrfs_trans_handle *trans,
>> >                       u64 inode_objectid, u64 ref_objectid, int ins_len,
>> >                       int cow);
>> >
>> > -int btrfs_find_name_in_ext_backref(struct btrfs_path *path,
>> > +int btrfs_find_name_in_backref(struct extent_buffer *leaf, int slot,
>> > +                          const char *name,
>> > +                          int name_len, struct btrfs_inode_ref **ref_ret);
>> > +int btrfs_find_name_in_ext_backref(struct extent_buffer *leaf, int slot,
>> >                                u64 ref_objectid, const char *name,
>> >                                int name_len,
>> >                                struct btrfs_inode_extref **extref_ret);
>> > diff --git a/fs/btrfs/inode-item.c b/fs/btrfs/inode-item.c
>> > index 39c968f80157..65e1a76bf755 100644
>> > --- a/fs/btrfs/inode-item.c
>> > +++ b/fs/btrfs/inode-item.c
>> > @@ -22,10 +22,10 @@
>> >  #include "transaction.h"
>> >  #include "print-tree.h"
>> >
>> > -static int find_name_in_backref(struct btrfs_path *path, const char *name,
>> > -                    int name_len, struct btrfs_inode_ref **ref_ret)
>> > +int btrfs_find_name_in_backref(struct extent_buffer *leaf, int slot,
>> > +                          const char *name,
>> > +                          int name_len, struct btrfs_inode_ref **ref_ret)
>> >  {
>> > -   struct extent_buffer *leaf;
>> >     struct btrfs_inode_ref *ref;
>> >     unsigned long ptr;
>> >     unsigned long name_ptr;
>> > @@ -33,9 +33,8 @@ static int find_name_in_backref(struct btrfs_path *path, const char *name,
>> >     u32 cur_offset = 0;
>> >     int len;
>> >
>> > -   leaf = path->nodes[0];
>> > -   item_size = btrfs_item_size_nr(leaf, path->slots[0]);
>> > -   ptr = btrfs_item_ptr_offset(leaf, path->slots[0]);
>> > +   item_size = btrfs_item_size_nr(leaf, slot);
>> > +   ptr = btrfs_item_ptr_offset(leaf, slot);
>> >     while (cur_offset < item_size) {
>> >             ref = (struct btrfs_inode_ref *)(ptr + cur_offset);
>> >             len = btrfs_inode_ref_name_len(leaf, ref);
>> > @@ -44,18 +43,19 @@ static int find_name_in_backref(struct btrfs_path *path, const char *name,
>> >             if (len != name_len)
>> >                     continue;
>> >             if (memcmp_extent_buffer(leaf, name, name_ptr, name_len) == 0) {
>> > -                   *ref_ret = ref;
>> > +                   if (ref_ret)
>> > +                           *ref_ret = ref;
>> >                     return 1;
>> >             }
>> >     }
>> >     return 0;
>> >  }
>> >
>> > -int btrfs_find_name_in_ext_backref(struct btrfs_path *path, u64 ref_objectid,
>> > +int btrfs_find_name_in_ext_backref(struct extent_buffer *leaf, int slot,
>> > +                              u64 ref_objectid,
>> >                                const char *name, int name_len,
>> >                                struct btrfs_inode_extref **extref_ret)
>> >  {
>> > -   struct extent_buffer *leaf;
>> >     struct btrfs_inode_extref *extref;
>> >     unsigned long ptr;
>> >     unsigned long name_ptr;
>> > @@ -63,9 +63,8 @@ int btrfs_find_name_in_ext_backref(struct btrfs_path *path, u64 ref_objectid,
>> >     u32 cur_offset = 0;
>> >     int ref_name_len;
>> >
>> > -   leaf = path->nodes[0];
>> > -   item_size = btrfs_item_size_nr(leaf, path->slots[0]);
>> > -   ptr = btrfs_item_ptr_offset(leaf, path->slots[0]);
>> > +   item_size = btrfs_item_size_nr(leaf, slot);
>> > +   ptr = btrfs_item_ptr_offset(leaf, slot);
>> >
>> >     /*
>> >      * Search all extended backrefs in this item. We're only
>> > @@ -113,7 +112,9 @@ btrfs_lookup_inode_extref(struct btrfs_trans_handle *trans,
>> >             return ERR_PTR(ret);
>> >     if (ret > 0)
>> >             return NULL;
>> > -   if (!btrfs_find_name_in_ext_backref(path, ref_objectid, name, name_len, &extref))
>> > +   if (!btrfs_find_name_in_ext_backref(path->nodes[0], path->slots[0],
>> > +                                       ref_objectid, name, name_len,
>> > +                                       &extref))
>> >             return NULL;
>> >     return extref;
>> >  }
>> > @@ -155,7 +156,8 @@ static int btrfs_del_inode_extref(struct btrfs_trans_handle *trans,
>> >      * This should always succeed so error here will make the FS
>> >      * readonly.
>> >      */
>> > -   if (!btrfs_find_name_in_ext_backref(path, ref_objectid,
>> > +   if (!btrfs_find_name_in_ext_backref(path->nodes[0], path->slots[0],
>> > +                                       ref_objectid,
>> >                                         name, name_len, &extref)) {
>> >             btrfs_handle_fs_error(root->fs_info, -ENOENT, NULL);
>> >             ret = -EROFS;
>> > @@ -225,7 +227,8 @@ int btrfs_del_inode_ref(struct btrfs_trans_handle *trans,
>> >     } else if (ret < 0) {
>> >             goto out;
>> >     }
>> > -   if (!find_name_in_backref(path, name, name_len, &ref)) {
>> > +   if (!btrfs_find_name_in_backref(path->nodes[0], path->slots[0],
>> > +                                   name, name_len, &ref)) {
>> >             ret = -ENOENT;
>> >             search_ext_refs = 1;
>> >             goto out;
>> > @@ -293,7 +296,9 @@ static int btrfs_insert_inode_extref(struct btrfs_trans_handle *trans,
>> >     ret = btrfs_insert_empty_item(trans, root, path, &key,
>> >                                   ins_len);
>> >     if (ret == -EEXIST) {
>> > -           if (btrfs_find_name_in_ext_backref(path, ref_objectid,
>> > +           if (btrfs_find_name_in_ext_backref(path->nodes[0],
>> > +                                              path->slots[0],
>> > +                                              ref_objectid,
>> >                                                name, name_len, NULL))
>> >                     goto out;
>> >
>> > @@ -351,7 +356,8 @@ int btrfs_insert_inode_ref(struct btrfs_trans_handle *trans,
>> >     if (ret == -EEXIST) {
>> >             u32 old_size;
>> >
>> > -           if (find_name_in_backref(path, name, name_len, &ref))
>> > +           if (btrfs_find_name_in_backref(path->nodes[0], path->slots[0],
>> > +                                          name, name_len, &ref))
>> >                     goto out;
>> >
>> >             old_size = btrfs_item_size_nr(path->nodes[0], path->slots[0]);
>> > @@ -365,7 +371,9 @@ int btrfs_insert_inode_ref(struct btrfs_trans_handle *trans,
>> >             ret = 0;
>> >     } else if (ret < 0) {
>> >             if (ret == -EOVERFLOW) {
>> > -                   if (find_name_in_backref(path, name, name_len, &ref))
>> > +                   if (btrfs_find_name_in_backref(path->nodes[0],
>> > +                                                  path->slots[0],
>> > +                                                  name, name_len, &ref))
>> >                             ret = -EEXIST;
>> >                     else
>> >                             ret = -EMLINK;
>> > diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
>> > index 411a022489e4..fd573816f461 100644
>> > --- a/fs/btrfs/tree-log.c
>> > +++ b/fs/btrfs/tree-log.c
>> > @@ -966,7 +966,9 @@ static noinline int backref_in_log(struct btrfs_root *log,
>> >     ptr = btrfs_item_ptr_offset(path->nodes[0], path->slots[0]);
>> >
>> >     if (key->type == BTRFS_INODE_EXTREF_KEY) {
>> > -           if (btrfs_find_name_in_ext_backref(path, ref_objectid,
>> > +           if (btrfs_find_name_in_ext_backref(path->nodes[0],
>> > +                                              path->slots[0],
>> > +                                              ref_objectid,
>> >                                                name, namelen, NULL))
>> >                     match = 1;
>> >
>> > @@ -1190,7 +1192,8 @@ static int extref_get_fields(struct extent_buffer *eb, unsigned long ref_ptr,
>> >     read_extent_buffer(eb, *name, (unsigned long)&extref->name,
>> >                        *namelen);
>> >
>> > -   *index = btrfs_inode_extref_index(eb, extref);
>> > +   if (index)
>> > +           *index = btrfs_inode_extref_index(eb, extref);
>> >     if (parent_objectid)
>> >             *parent_objectid = btrfs_inode_extref_parent(eb, extref);
>> >
>> > @@ -1211,12 +1214,102 @@ static int ref_get_fields(struct extent_buffer *eb, unsigned long ref_ptr,
>> >
>> >     read_extent_buffer(eb, *name, (unsigned long)(ref + 1), *namelen);
>> >
>> > -   *index = btrfs_inode_ref_index(eb, ref);
>> > +   if (index)
>> > +           *index = btrfs_inode_ref_index(eb, ref);
>> >
>> >     return 0;
>> >  }
>> >
>> >  /*
>> > + * Take an inode reference item from the log tree and iterate all names from the
>> > + * inode reference item in the subvolume tree with the same key (if it exists).
>> > + * For any name that is not in the inode reference item from the log tree, do a
>> > + * proper unlink of that name (that is, remove its entry from the inode
>> > + * reference item and both dir index keys).
>> > + */
>> > +static int unlink_old_inode_refs(struct btrfs_trans_handle *trans,
>> > +                            struct btrfs_root *root,
>> > +                            struct btrfs_path *path,
>> > +                            struct btrfs_inode *inode,
>> > +                            struct extent_buffer *log_eb,
>> > +                            int log_slot,
>> > +                            struct btrfs_key *key)
>> > +{
>> > +   int ret;
>> > +   unsigned long ref_ptr;
>> > +   unsigned long ref_end;
>> > +   struct extent_buffer *eb;
>> > +
>> > +again:
>> > +   btrfs_release_path(path);
>> > +   ret = btrfs_search_slot(NULL, root, key, path, 0, 0);
>> > +   if (ret > 0) {
>> > +           ret = 0;
>> > +           goto out;
>> > +   }
>> > +   if (ret < 0)
>> > +           goto out;
>> > +
>> > +   eb = path->nodes[0];
>> > +   ref_ptr = btrfs_item_ptr_offset(eb, path->slots[0]);
>> > +   ref_end = ref_ptr + btrfs_item_size_nr(eb, path->slots[0]);
>> > +   while (ref_ptr < ref_end) {
>> > +           char *name = NULL;
>> > +           int namelen;
>> > +           u64 parent_id;
>> > +
>> > +           if (key->type == BTRFS_INODE_EXTREF_KEY) {
>> > +                   ret = extref_get_fields(eb, ref_ptr, &namelen, &name,
>> > +                                           NULL, &parent_id);
>> > +           } else {
>> > +                   parent_id = key->offset;
>> > +                   ret = ref_get_fields(eb, ref_ptr, &namelen, &name,
>> > +                                        NULL);
>> > +           }
>> > +           if (ret)
>> > +                   goto out;
>> > +
>> > +           if (key->type == BTRFS_INODE_EXTREF_KEY)
>> > +                   ret = btrfs_find_name_in_ext_backref(log_eb, log_slot,
>> > +                                                        parent_id, name,
>> > +                                                        namelen, NULL);
>> > +           else
>> > +                   ret = btrfs_find_name_in_backref(log_eb, log_slot, name,
>> > +                                                    namelen, NULL);
>> > +
>> > +           if (!ret) {
>> > +                   struct inode *dir;
>> > +
>> > +                   btrfs_release_path(path);
>> > +                   dir = read_one_inode(root, parent_id);
>> > +                   if (!dir) {
>> > +                           ret = -ENOENT;
>> > +                           kfree(name);
>> > +                           goto out;
>> > +                   }
>> > +                   ret = btrfs_unlink_inode(trans, root, BTRFS_I(dir),
>> > +                                            inode, name, namelen);
>> > +                   kfree(name);
>> > +                   iput(dir);
>> > +                   if (ret)
>> > +                           goto out;
>> > +                   goto again;
>> > +           }
>> > +
>> > +           kfree(name);
>> > +           ref_ptr += namelen;
>> > +           if (key->type == BTRFS_INODE_EXTREF_KEY)
>> > +                   ref_ptr += sizeof(struct btrfs_inode_extref);
>> > +           else
>> > +                   ref_ptr += sizeof(struct btrfs_inode_ref);
>> > +   }
>> > +   ret = 0;
>> > + out:
>> > +   btrfs_release_path(path);
>> > +   return ret;
>> > +}
>> > +
>> > +/*
>> >   * replay one inode back reference item found in the log tree.
>> >   * eb, slot and key refer to the buffer and key found in the log tree.
>> >   * root is the destination we are replaying into, and path is for temp
>> > @@ -1344,6 +1437,19 @@ static noinline int add_inode_ref(struct btrfs_trans_handle *trans,
>> >             }
>> >     }
>> >
>> > +   /*
>> > +    * Before we overwrite the inode reference item in the subvolume tree
>> > +    * with the item from the log tree, we must unlink all names from the
>> > +    * parent directory that are in the subvolume's tree inode reference
>> > +    * item, otherwise we end up with an inconsistent subvolume tree where
>> > +    * dir index entries exist for a name but there is no inode reference
>> > +    * item with the same name.
>> > +    */
>> > +   ret = unlink_old_inode_refs(trans, root, path, BTRFS_I(inode), eb, slot,
>> > +                               key);
>> > +   if (ret)
>> > +           goto out;
>> > +
>> >     /* finally write the back reference in the inode */
>> >     ret = overwrite_item(trans, root, path, eb, slot, key);
>> >  out:
>> > --
>> > 2.11.0
>> >
>> > --
>> > To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
>> > the body of a message to majordomo@vger.kernel.org
>> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 1a462ab85c49..5d33478bc704 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -3095,7 +3095,10 @@  btrfs_lookup_inode_extref(struct btrfs_trans_handle *trans,
 			  u64 inode_objectid, u64 ref_objectid, int ins_len,
 			  int cow);
 
-int btrfs_find_name_in_ext_backref(struct btrfs_path *path,
+int btrfs_find_name_in_backref(struct extent_buffer *leaf, int slot,
+			       const char *name,
+			       int name_len, struct btrfs_inode_ref **ref_ret);
+int btrfs_find_name_in_ext_backref(struct extent_buffer *leaf, int slot,
 				   u64 ref_objectid, const char *name,
 				   int name_len,
 				   struct btrfs_inode_extref **extref_ret);
diff --git a/fs/btrfs/inode-item.c b/fs/btrfs/inode-item.c
index 39c968f80157..65e1a76bf755 100644
--- a/fs/btrfs/inode-item.c
+++ b/fs/btrfs/inode-item.c
@@ -22,10 +22,10 @@ 
 #include "transaction.h"
 #include "print-tree.h"
 
-static int find_name_in_backref(struct btrfs_path *path, const char *name,
-			 int name_len, struct btrfs_inode_ref **ref_ret)
+int btrfs_find_name_in_backref(struct extent_buffer *leaf, int slot,
+			       const char *name,
+			       int name_len, struct btrfs_inode_ref **ref_ret)
 {
-	struct extent_buffer *leaf;
 	struct btrfs_inode_ref *ref;
 	unsigned long ptr;
 	unsigned long name_ptr;
@@ -33,9 +33,8 @@  static int find_name_in_backref(struct btrfs_path *path, const char *name,
 	u32 cur_offset = 0;
 	int len;
 
-	leaf = path->nodes[0];
-	item_size = btrfs_item_size_nr(leaf, path->slots[0]);
-	ptr = btrfs_item_ptr_offset(leaf, path->slots[0]);
+	item_size = btrfs_item_size_nr(leaf, slot);
+	ptr = btrfs_item_ptr_offset(leaf, slot);
 	while (cur_offset < item_size) {
 		ref = (struct btrfs_inode_ref *)(ptr + cur_offset);
 		len = btrfs_inode_ref_name_len(leaf, ref);
@@ -44,18 +43,19 @@  static int find_name_in_backref(struct btrfs_path *path, const char *name,
 		if (len != name_len)
 			continue;
 		if (memcmp_extent_buffer(leaf, name, name_ptr, name_len) == 0) {
-			*ref_ret = ref;
+			if (ref_ret)
+				*ref_ret = ref;
 			return 1;
 		}
 	}
 	return 0;
 }
 
-int btrfs_find_name_in_ext_backref(struct btrfs_path *path, u64 ref_objectid,
+int btrfs_find_name_in_ext_backref(struct extent_buffer *leaf, int slot,
+				   u64 ref_objectid,
 				   const char *name, int name_len,
 				   struct btrfs_inode_extref **extref_ret)
 {
-	struct extent_buffer *leaf;
 	struct btrfs_inode_extref *extref;
 	unsigned long ptr;
 	unsigned long name_ptr;
@@ -63,9 +63,8 @@  int btrfs_find_name_in_ext_backref(struct btrfs_path *path, u64 ref_objectid,
 	u32 cur_offset = 0;
 	int ref_name_len;
 
-	leaf = path->nodes[0];
-	item_size = btrfs_item_size_nr(leaf, path->slots[0]);
-	ptr = btrfs_item_ptr_offset(leaf, path->slots[0]);
+	item_size = btrfs_item_size_nr(leaf, slot);
+	ptr = btrfs_item_ptr_offset(leaf, slot);
 
 	/*
 	 * Search all extended backrefs in this item. We're only
@@ -113,7 +112,9 @@  btrfs_lookup_inode_extref(struct btrfs_trans_handle *trans,
 		return ERR_PTR(ret);
 	if (ret > 0)
 		return NULL;
-	if (!btrfs_find_name_in_ext_backref(path, ref_objectid, name, name_len, &extref))
+	if (!btrfs_find_name_in_ext_backref(path->nodes[0], path->slots[0],
+					    ref_objectid, name, name_len,
+					    &extref))
 		return NULL;
 	return extref;
 }
@@ -155,7 +156,8 @@  static int btrfs_del_inode_extref(struct btrfs_trans_handle *trans,
 	 * This should always succeed so error here will make the FS
 	 * readonly.
 	 */
-	if (!btrfs_find_name_in_ext_backref(path, ref_objectid,
+	if (!btrfs_find_name_in_ext_backref(path->nodes[0], path->slots[0],
+					    ref_objectid,
 					    name, name_len, &extref)) {
 		btrfs_handle_fs_error(root->fs_info, -ENOENT, NULL);
 		ret = -EROFS;
@@ -225,7 +227,8 @@  int btrfs_del_inode_ref(struct btrfs_trans_handle *trans,
 	} else if (ret < 0) {
 		goto out;
 	}
-	if (!find_name_in_backref(path, name, name_len, &ref)) {
+	if (!btrfs_find_name_in_backref(path->nodes[0], path->slots[0],
+					name, name_len, &ref)) {
 		ret = -ENOENT;
 		search_ext_refs = 1;
 		goto out;
@@ -293,7 +296,9 @@  static int btrfs_insert_inode_extref(struct btrfs_trans_handle *trans,
 	ret = btrfs_insert_empty_item(trans, root, path, &key,
 				      ins_len);
 	if (ret == -EEXIST) {
-		if (btrfs_find_name_in_ext_backref(path, ref_objectid,
+		if (btrfs_find_name_in_ext_backref(path->nodes[0],
+						   path->slots[0],
+						   ref_objectid,
 						   name, name_len, NULL))
 			goto out;
 
@@ -351,7 +356,8 @@  int btrfs_insert_inode_ref(struct btrfs_trans_handle *trans,
 	if (ret == -EEXIST) {
 		u32 old_size;
 
-		if (find_name_in_backref(path, name, name_len, &ref))
+		if (btrfs_find_name_in_backref(path->nodes[0], path->slots[0],
+					       name, name_len, &ref))
 			goto out;
 
 		old_size = btrfs_item_size_nr(path->nodes[0], path->slots[0]);
@@ -365,7 +371,9 @@  int btrfs_insert_inode_ref(struct btrfs_trans_handle *trans,
 		ret = 0;
 	} else if (ret < 0) {
 		if (ret == -EOVERFLOW) {
-			if (find_name_in_backref(path, name, name_len, &ref))
+			if (btrfs_find_name_in_backref(path->nodes[0],
+						       path->slots[0],
+						       name, name_len, &ref))
 				ret = -EEXIST;
 			else
 				ret = -EMLINK;
diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
index 411a022489e4..fd573816f461 100644
--- a/fs/btrfs/tree-log.c
+++ b/fs/btrfs/tree-log.c
@@ -966,7 +966,9 @@  static noinline int backref_in_log(struct btrfs_root *log,
 	ptr = btrfs_item_ptr_offset(path->nodes[0], path->slots[0]);
 
 	if (key->type == BTRFS_INODE_EXTREF_KEY) {
-		if (btrfs_find_name_in_ext_backref(path, ref_objectid,
+		if (btrfs_find_name_in_ext_backref(path->nodes[0],
+						   path->slots[0],
+						   ref_objectid,
 						   name, namelen, NULL))
 			match = 1;
 
@@ -1190,7 +1192,8 @@  static int extref_get_fields(struct extent_buffer *eb, unsigned long ref_ptr,
 	read_extent_buffer(eb, *name, (unsigned long)&extref->name,
 			   *namelen);
 
-	*index = btrfs_inode_extref_index(eb, extref);
+	if (index)
+		*index = btrfs_inode_extref_index(eb, extref);
 	if (parent_objectid)
 		*parent_objectid = btrfs_inode_extref_parent(eb, extref);
 
@@ -1211,12 +1214,102 @@  static int ref_get_fields(struct extent_buffer *eb, unsigned long ref_ptr,
 
 	read_extent_buffer(eb, *name, (unsigned long)(ref + 1), *namelen);
 
-	*index = btrfs_inode_ref_index(eb, ref);
+	if (index)
+		*index = btrfs_inode_ref_index(eb, ref);
 
 	return 0;
 }
 
 /*
+ * Take an inode reference item from the log tree and iterate all names from the
+ * inode reference item in the subvolume tree with the same key (if it exists).
+ * For any name that is not in the inode reference item from the log tree, do a
+ * proper unlink of that name (that is, remove its entry from the inode
+ * reference item and both dir index keys).
+ */
+static int unlink_old_inode_refs(struct btrfs_trans_handle *trans,
+				 struct btrfs_root *root,
+				 struct btrfs_path *path,
+				 struct btrfs_inode *inode,
+				 struct extent_buffer *log_eb,
+				 int log_slot,
+				 struct btrfs_key *key)
+{
+	int ret;
+	unsigned long ref_ptr;
+	unsigned long ref_end;
+	struct extent_buffer *eb;
+
+again:
+	btrfs_release_path(path);
+	ret = btrfs_search_slot(NULL, root, key, path, 0, 0);
+	if (ret > 0) {
+		ret = 0;
+		goto out;
+	}
+	if (ret < 0)
+		goto out;
+
+	eb = path->nodes[0];
+	ref_ptr = btrfs_item_ptr_offset(eb, path->slots[0]);
+	ref_end = ref_ptr + btrfs_item_size_nr(eb, path->slots[0]);
+	while (ref_ptr < ref_end) {
+		char *name = NULL;
+		int namelen;
+		u64 parent_id;
+
+		if (key->type == BTRFS_INODE_EXTREF_KEY) {
+			ret = extref_get_fields(eb, ref_ptr, &namelen, &name,
+						NULL, &parent_id);
+		} else {
+			parent_id = key->offset;
+			ret = ref_get_fields(eb, ref_ptr, &namelen, &name,
+					     NULL);
+		}
+		if (ret)
+			goto out;
+
+		if (key->type == BTRFS_INODE_EXTREF_KEY)
+			ret = btrfs_find_name_in_ext_backref(log_eb, log_slot,
+							     parent_id, name,
+							     namelen, NULL);
+		else
+			ret = btrfs_find_name_in_backref(log_eb, log_slot, name,
+							 namelen, NULL);
+
+		if (!ret) {
+			struct inode *dir;
+
+			btrfs_release_path(path);
+			dir = read_one_inode(root, parent_id);
+			if (!dir) {
+				ret = -ENOENT;
+				kfree(name);
+				goto out;
+			}
+			ret = btrfs_unlink_inode(trans, root, BTRFS_I(dir),
+						 inode, name, namelen);
+			kfree(name);
+			iput(dir);
+			if (ret)
+				goto out;
+			goto again;
+		}
+
+		kfree(name);
+		ref_ptr += namelen;
+		if (key->type == BTRFS_INODE_EXTREF_KEY)
+			ref_ptr += sizeof(struct btrfs_inode_extref);
+		else
+			ref_ptr += sizeof(struct btrfs_inode_ref);
+	}
+	ret = 0;
+ out:
+	btrfs_release_path(path);
+	return ret;
+}
+
+/*
  * replay one inode back reference item found in the log tree.
  * eb, slot and key refer to the buffer and key found in the log tree.
  * root is the destination we are replaying into, and path is for temp
@@ -1344,6 +1437,19 @@  static noinline int add_inode_ref(struct btrfs_trans_handle *trans,
 		}
 	}
 
+	/*
+	 * Before we overwrite the inode reference item in the subvolume tree
+	 * with the item from the log tree, we must unlink all names from the
+	 * parent directory that are in the subvolume's tree inode reference
+	 * item, otherwise we end up with an inconsistent subvolume tree where
+	 * dir index entries exist for a name but there is no inode reference
+	 * item with the same name.
+	 */
+	ret = unlink_old_inode_refs(trans, root, path, BTRFS_I(inode), eb, slot,
+				    key);
+	if (ret)
+		goto out;
+
 	/* finally write the back reference in the inode */
 	ret = overwrite_item(trans, root, path, eb, slot, key);
 out: