Message ID | 20190415083018.2224-1-fdmanana@kernel.org (mailing list archive)
---|---
State | New, archived
Series | Btrfs: fix race between send and deduplication that lead to failures and crashes
On Mon, Apr 15, 2019 at 09:30:18AM +0100, fdmanana@kernel.org wrote:
> From: Filipe Manana <fdmanana@suse.com>
>
> Send operates on read-only trees and expects them to never change while it
> is using them. This is part of its initial design, and this expectation is
> due to two different reasons:
>
> 1) When it was introduced, no operations were allowed to modify read-only
> subvolumes/snapshots (including defrag for example).
>
> 2) It keeps send from having an impact on other filesystem operations.
> Namely send does not need to keep locks on the trees nor needs to hold on
> to transaction handles and delay transaction commits. This ends up being
> a consequence of the former reason.
>
> However the deduplication feature was introduced later (in September 2013,
> while send was introduced in July 2012) and it allowed deduplication with
> destination files that belong to read-only trees (subvolumes and
> snapshots).
>
> That means that having a send operation (either full or incremental)
> running in parallel with a deduplication that has the destination inode in
> one of the trees used by the send operation can result in tree nodes and
> leaves getting freed and reused while send is using them. This problem is
> similar to the problem solved for root nodes getting freed and reused when
> a snapshot is made against one tree that is currently being used by a send
> operation, fixed in commits [1] and [2]. These commits explain in detail
> how the problem happens and the explanation is valid for any node or leaf
> that is not the root of a tree as well. This problem was also discussed
> and explained recently in a thread [3].
>
> The problem is very easy to reproduce when using send with large trees
> (snapshots) and just a few concurrent deduplication operations that target
> files in the trees used by send. A stress test case is being sent for
> fstests that triggers the issue easily.
> The most common error to hit is the send ioctl returning -EIO with the
> following messages in dmesg/syslog:
>
> [1631617.204075] BTRFS error (device sdc): did not find backref in send_root. inode=63292, offset=0, disk_byte=5228134400 found extent=5228134400
> [1631633.251754] BTRFS error (device sdc): parent transid verify failed on 32243712 wanted 24 found 27
>
> The first one is very easy to hit while the second one happens much less
> frequently, except for very large trees (in that test case, snapshots
> with 100000 files having large xattrs to get deep and wide trees).
> Less frequently, at least one BUG_ON can be hit:
>
> [1631742.130080] ------------[ cut here ]------------
> [1631742.130625] kernel BUG at fs/btrfs/ctree.c:1806!
> [1631742.131188] invalid opcode: 0000 [#6] SMP DEBUG_PAGEALLOC PTI
> [1631742.131726] CPU: 1 PID: 13394 Comm: btrfs Tainted: G B D W 5.0.0-rc8-btrfs-next-45 #1
> [1631742.132265] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.2-0-gf9626ccb91-prebuilt.qemu-project.org 04/01/2014
> [1631742.133399] RIP: 0010:read_node_slot+0x122/0x130 [btrfs]
> (...)
> [1631742.135061] RSP: 0018:ffffb530021ebaa0 EFLAGS: 00010246
> [1631742.135615] RAX: ffff93ac8912e000 RBX: 000000000000009d RCX: 0000000000000002
> [1631742.136173] RDX: 000000000000009d RSI: ffff93ac564b0d08 RDI: ffff93ad5b48c000
> [1631742.136759] RBP: ffffb530021ebb7d R08: 0000000000000001 R09: ffffb530021ebb7d
> [1631742.137324] R10: ffffb530021eba70 R11: 0000000000000000 R12: ffff93ac87d0a708
> [1631742.137900] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000001
> [1631742.138455] FS: 00007f4cdb1528c0(0000) GS:ffff93ad76a80000(0000) knlGS:0000000000000000
> [1631742.139010] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [1631742.139568] CR2: 00007f5acb3d0420 CR3: 000000012be3e006 CR4: 00000000003606e0
> [1631742.140131] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [1631742.140719] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [1631742.141272] Call Trace:
> [1631742.141826]  ? do_raw_spin_unlock+0x49/0xc0
> [1631742.142390]  tree_advance+0x173/0x1d0 [btrfs]
> [1631742.142948]  btrfs_compare_trees+0x268/0x690 [btrfs]
> [1631742.143533]  ? process_extent+0x1070/0x1070 [btrfs]
> [1631742.144088]  btrfs_ioctl_send+0x1037/0x1270 [btrfs]
> [1631742.144645]  _btrfs_ioctl_send+0x80/0x110 [btrfs]
> [1631742.145161]  ? trace_sched_stick_numa+0xe0/0xe0
> [1631742.145685]  btrfs_ioctl+0x13fe/0x3120 [btrfs]
> [1631742.146179]  ? account_entity_enqueue+0xd3/0x100
> [1631742.146662]  ? reweight_entity+0x154/0x1a0
> [1631742.147135]  ? update_curr+0x20/0x2a0
> [1631742.147593]  ? check_preempt_wakeup+0x103/0x250
> [1631742.148053]  ? do_vfs_ioctl+0xa2/0x6f0
> [1631742.148510]  ? btrfs_ioctl_get_supported_features+0x30/0x30 [btrfs]
> [1631742.148942]  do_vfs_ioctl+0xa2/0x6f0
> [1631742.149361]  ? __fget+0x113/0x200
> [1631742.149767]  ksys_ioctl+0x70/0x80
> [1631742.150159]  __x64_sys_ioctl+0x16/0x20
> [1631742.150543]  do_syscall_64+0x60/0x1b0
> [1631742.150931]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
> [1631742.151326] RIP: 0033:0x7f4cd9f5add7
> (...)
> [1631742.152509] RSP: 002b:00007ffe91017708 EFLAGS: 00000202 ORIG_RAX: 0000000000000010
> [1631742.152892] RAX: ffffffffffffffda RBX: 0000000000000105 RCX: 00007f4cd9f5add7
> [1631742.153268] RDX: 00007ffe91017790 RSI: 0000000040489426 RDI: 0000000000000007
> [1631742.153633] RBP: 0000000000000007 R08: 00007f4cd9e79700 R09: 00007f4cd9e79700
> [1631742.153999] R10: 00007f4cd9e799d0 R11: 0000000000000202 R12: 0000000000000003
> [1631742.154365] R13: 0000555dfae53020 R14: 0000000000000000 R15: 0000000000000001
> (...)
> [1631742.156696] ---[ end trace 5dac9f96dcc3fd6b ]---
>
> That BUG_ON happens because while send is using a node, that node is COWed
> by a concurrent deduplication, gets freed and gets reused as a leaf (because
> a transaction commit happened in between), so when it attempts to read a
> slot from the extent buffer, at ctree.c:read_node_slot(), the extent buffer
> contents were wiped out and it now matches a leaf (which can even belong to
> some other tree now), hitting the BUG_ON(level == 0).
>
> Fix this concurrency issue by not allowing send and deduplication to run
> in parallel if both operate on the same read-only trees, returning EAGAIN
> to user space and logging an explicit warning in dmesg/syslog.
> [1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=be6821f82c3cc36e026f5afd10249988852b35ea
> [2] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=6f2f0b394b54e2b159ef969a0b5274e9bbf82ff2
> [3] https://lore.kernel.org/linux-btrfs/CAL3q7H7iqSEEyFaEtpRZw3cp613y+4k2Q8b4W7mweR3tZA05bQ@mail.gmail.com/
>
> Signed-off-by: Filipe Manana <fdmanana@suse.com>

Reviewed-by: David Sterba <dsterba@suse.com>

> +		btrfs_warn_rl(fs_info,
> +			      "Can not use root %llu for send while a deduplication on it is in progress",
> +			      send_root->root_key.objectid);

> +			"Can not use root %llu for send while a deduplication on it is in progress",
> +			clone_root->root_key.objectid);

> +		btrfs_warn_rl(fs_info,
> +			"Can not use root %llu for send while a deduplication on it is in progress",
> +			sctx->parent_root->root_key.objectid);

I think 3 occurrences of the same string deserve a helper. Also the strings
should be un-indented so the lines fit below 80 columns.

I'm also thinking about the message level; you use warning. The rate
limiting makes sense here, but I'm not sure if KERN_INFO would be
sufficient. Maybe there are different expectations for send and dedupe too.

If dedupe does not work due to send, then I don't rate it as a severe
problem. Some space is not deduped. OTOH, send can be used for backups, so a
failure should be more visible. As we don't have a better mechanism to
report such an event to the user, it's the return code and the syslog
message. We should try to find a reasonable default but can't satisfy all
users.
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 0f4838e00fbc..76edc741a60c 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -1348,6 +1348,12 @@ struct btrfs_root {
 	 * manipulation with the read-only status via SUBVOL_SETFLAGS
 	 */
 	int send_in_progress;
+	/*
+	 * Number of currently running deduplication operations that have a
+	 * destination inode belonging to this root. Protected by the lock
+	 * root_item_lock.
+	 */
+	int dedupe_in_progress;
 	struct btrfs_subvolume_writers *subv_writers;
 	atomic_t will_be_snapshotted;
 	atomic_t snapshot_force_cow;
diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 11550199c6bd..cd83b476078e 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -3260,6 +3260,18 @@ static int btrfs_extent_same(struct inode *src, u64 loff, u64 olen,
 {
 	int ret;
 	u64 i, tail_len, chunk_count;
+	struct btrfs_root *root_dst = BTRFS_I(dst)->root;
+
+	spin_lock(&root_dst->root_item_lock);
+	if (root_dst->send_in_progress) {
+		spin_unlock(&root_dst->root_item_lock);
+		btrfs_warn_rl(root_dst->fs_info,
+			      "Can not deduplicate to root %llu while send is using it",
+			      root_dst->root_key.objectid);
+		return -EAGAIN;
+	}
+	root_dst->dedupe_in_progress++;
+	spin_unlock(&root_dst->root_item_lock);
 
 	tail_len = olen % BTRFS_MAX_DEDUPE_LEN;
 	chunk_count = div_u64(olen, BTRFS_MAX_DEDUPE_LEN);
@@ -3268,7 +3280,7 @@ static int btrfs_extent_same(struct inode *src, u64 loff, u64 olen,
 		ret = btrfs_extent_same_range(src, loff, BTRFS_MAX_DEDUPE_LEN,
 					      dst, dst_loff);
 		if (ret)
-			return ret;
+			goto out;
 
 		loff += BTRFS_MAX_DEDUPE_LEN;
 		dst_loff += BTRFS_MAX_DEDUPE_LEN;
@@ -3277,6 +3289,10 @@ static int btrfs_extent_same(struct inode *src, u64 loff, u64 olen,
 	if (tail_len > 0)
 		ret = btrfs_extent_same_range(src, loff, tail_len,
 					      dst, dst_loff);
+out:
+	spin_lock(&root_dst->root_item_lock);
+	root_dst->dedupe_in_progress--;
+	spin_unlock(&root_dst->root_item_lock);
 
 	return ret;
 }
diff --git a/fs/btrfs/send.c b/fs/btrfs/send.c
index fe700b228b5d..9b987d499481 100644
--- a/fs/btrfs/send.c
+++ b/fs/btrfs/send.c
@@ -6649,6 +6649,13 @@ long btrfs_ioctl_send(struct file *mnt_file, struct btrfs_ioctl_send_args *arg)
 	 * making it RW. This also protects against deletion.
 	 */
 	spin_lock(&send_root->root_item_lock);
+	if (btrfs_root_readonly(send_root) && send_root->dedupe_in_progress) {
+		spin_unlock(&send_root->root_item_lock);
+		btrfs_warn_rl(fs_info,
+			      "Can not use root %llu for send while a deduplication on it is in progress",
+			      send_root->root_key.objectid);
+		return -EAGAIN;
+	}
 	send_root->send_in_progress++;
 	spin_unlock(&send_root->root_item_lock);
 
@@ -6783,6 +6790,15 @@ long btrfs_ioctl_send(struct file *mnt_file, struct btrfs_ioctl_send_args *arg)
 			ret = -EPERM;
 			goto out;
 		}
+		if (clone_root->dedupe_in_progress) {
+			spin_unlock(&clone_root->root_item_lock);
+			srcu_read_unlock(&fs_info->subvol_srcu, index);
+			ret = -EAGAIN;
+			btrfs_warn_rl(fs_info,
+				      "Can not use root %llu for send while a deduplication on it is in progress",
+				      clone_root->root_key.objectid);
+			goto out;
+		}
 		clone_root->send_in_progress++;
 		spin_unlock(&clone_root->root_item_lock);
 		srcu_read_unlock(&fs_info->subvol_srcu, index);
@@ -6817,6 +6833,15 @@ long btrfs_ioctl_send(struct file *mnt_file, struct btrfs_ioctl_send_args *arg)
 			ret = -EPERM;
 			goto out;
 		}
+		if (sctx->parent_root->dedupe_in_progress) {
+			spin_unlock(&sctx->parent_root->root_item_lock);
+			srcu_read_unlock(&fs_info->subvol_srcu, index);
+			ret = -EAGAIN;
+			btrfs_warn_rl(fs_info,
+				      "Can not use root %llu for send while a deduplication on it is in progress",
+				      sctx->parent_root->root_key.objectid);
+			goto out;
+		}
 		spin_unlock(&sctx->parent_root->root_item_lock);
 		srcu_read_unlock(&fs_info->subvol_srcu, index);