[v2] Btrfs: fix race between send and deduplication that lead to failures and crashes

Message ID 20190422154342.11873-1-fdmanana@kernel.org
State New

Commit Message

Filipe Manana April 22, 2019, 3:43 p.m. UTC
From: Filipe Manana <fdmanana@suse.com>

Send operates on read-only trees and expects them to never change while it
is using them. This is part of its initial design, and the expectation is
due to two different reasons:

1) When it was introduced, no operations were allowed to modify read-only
   subvolumes/snapshots (including defrag, for example).

2) It keeps send from having an impact on other filesystem operations.
   Namely, send does not need to keep locks on the trees nor does it need
   to hold on to transaction handles and delay transaction commits. This
   ends up being a consequence of the former reason.

However, the deduplication feature was introduced later (in September 2013,
while send was introduced in July 2012) and it allowed deduplication with
destination files that belong to read-only trees (subvolumes and
snapshots).

That means that having a send operation (either full or incremental) running
in parallel with a deduplication that has the destination inode in one of
the trees used by the send operation can result in tree nodes and leaves
getting freed and reused while send is using them. This problem is similar
to the problem solved for root nodes getting freed and reused when a
snapshot is made against one tree that is currently being used by a send
operation, fixed in commits [1] and [2]. These commits explain in detail
how the problem happens, and the explanation is valid for any node or leaf
that is not the root of a tree as well. This problem was also discussed
and explained recently in a thread [3].

The problem is very easy to reproduce when using send with large trees
(snapshots) and just a few concurrent deduplication operations that target
files in the trees used by send. A stress test case that triggers the issue
easily is being sent for fstests. The most common error to hit is the send
ioctl returning -EIO with the following messages in dmesg/syslog:

 [1631617.204075] BTRFS error (device sdc): did not find backref in send_root. inode=63292, offset=0, disk_byte=5228134400 found extent=5228134400
 [1631633.251754] BTRFS error (device sdc): parent transid verify failed on 32243712 wanted 24 found 27

The first one is very easy to hit while the second one happens much less
frequently, except for very large trees (in that test case, snapshots
with 100000 files having large xattrs to get deep and wide trees).
Less frequently, at least one BUG_ON can be hit:

 [1631742.130080] ------------[ cut here ]------------
 [1631742.130625] kernel BUG at fs/btrfs/ctree.c:1806!
 [1631742.131188] invalid opcode: 0000 [#6] SMP DEBUG_PAGEALLOC PTI
 [1631742.131726] CPU: 1 PID: 13394 Comm: btrfs Tainted: G    B D W         5.0.0-rc8-btrfs-next-45 #1
 [1631742.132265] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.2-0-gf9626ccb91-prebuilt.qemu-project.org 04/01/2014
 [1631742.133399] RIP: 0010:read_node_slot+0x122/0x130 [btrfs]
 (...)
 [1631742.135061] RSP: 0018:ffffb530021ebaa0 EFLAGS: 00010246
 [1631742.135615] RAX: ffff93ac8912e000 RBX: 000000000000009d RCX: 0000000000000002
 [1631742.136173] RDX: 000000000000009d RSI: ffff93ac564b0d08 RDI: ffff93ad5b48c000
 [1631742.136759] RBP: ffffb530021ebb7d R08: 0000000000000001 R09: ffffb530021ebb7d
 [1631742.137324] R10: ffffb530021eba70 R11: 0000000000000000 R12: ffff93ac87d0a708
 [1631742.137900] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000001
 [1631742.138455] FS:  00007f4cdb1528c0(0000) GS:ffff93ad76a80000(0000) knlGS:0000000000000000
 [1631742.139010] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 [1631742.139568] CR2: 00007f5acb3d0420 CR3: 000000012be3e006 CR4: 00000000003606e0
 [1631742.140131] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
 [1631742.140719] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
 [1631742.141272] Call Trace:
 [1631742.141826]  ? do_raw_spin_unlock+0x49/0xc0
 [1631742.142390]  tree_advance+0x173/0x1d0 [btrfs]
 [1631742.142948]  btrfs_compare_trees+0x268/0x690 [btrfs]
 [1631742.143533]  ? process_extent+0x1070/0x1070 [btrfs]
 [1631742.144088]  btrfs_ioctl_send+0x1037/0x1270 [btrfs]
 [1631742.144645]  _btrfs_ioctl_send+0x80/0x110 [btrfs]
 [1631742.145161]  ? trace_sched_stick_numa+0xe0/0xe0
 [1631742.145685]  btrfs_ioctl+0x13fe/0x3120 [btrfs]
 [1631742.146179]  ? account_entity_enqueue+0xd3/0x100
 [1631742.146662]  ? reweight_entity+0x154/0x1a0
 [1631742.147135]  ? update_curr+0x20/0x2a0
 [1631742.147593]  ? check_preempt_wakeup+0x103/0x250
 [1631742.148053]  ? do_vfs_ioctl+0xa2/0x6f0
 [1631742.148510]  ? btrfs_ioctl_get_supported_features+0x30/0x30 [btrfs]
 [1631742.148942]  do_vfs_ioctl+0xa2/0x6f0
 [1631742.149361]  ? __fget+0x113/0x200
 [1631742.149767]  ksys_ioctl+0x70/0x80
 [1631742.150159]  __x64_sys_ioctl+0x16/0x20
 [1631742.150543]  do_syscall_64+0x60/0x1b0
 [1631742.150931]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
 [1631742.151326] RIP: 0033:0x7f4cd9f5add7
 (...)
 [1631742.152509] RSP: 002b:00007ffe91017708 EFLAGS: 00000202 ORIG_RAX: 0000000000000010
 [1631742.152892] RAX: ffffffffffffffda RBX: 0000000000000105 RCX: 00007f4cd9f5add7
 [1631742.153268] RDX: 00007ffe91017790 RSI: 0000000040489426 RDI: 0000000000000007
 [1631742.153633] RBP: 0000000000000007 R08: 00007f4cd9e79700 R09: 00007f4cd9e79700
 [1631742.153999] R10: 00007f4cd9e799d0 R11: 0000000000000202 R12: 0000000000000003
 [1631742.154365] R13: 0000555dfae53020 R14: 0000000000000000 R15: 0000000000000001
 (...)
 [1631742.156696] ---[ end trace 5dac9f96dcc3fd6b ]---

That BUG_ON happens because, while send is using a node, that node is COWed
by a concurrent deduplication, gets freed and gets reused as a leaf (because
a transaction commit happened in between), so when send attempts to read a
slot from the extent buffer, at ctree.c:read_node_slot(), the extent buffer
contents were wiped out and now match a leaf (which can even belong to
some other tree now), hitting the BUG_ON(level == 0).

Fix this concurrency issue by not allowing send and deduplication to run
in parallel if both operate on the same read-only trees, returning EAGAIN
to user space and logging an explicit warning in dmesg/syslog.

[1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=be6821f82c3cc36e026f5afd10249988852b35ea
[2] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=6f2f0b394b54e2b159ef969a0b5274e9bbf82ff2
[3] https://lore.kernel.org/linux-btrfs/CAL3q7H7iqSEEyFaEtpRZw3cp613y+4k2Q8b4W7mweR3tZA05bQ@mail.gmail.com/

Signed-off-by: Filipe Manana <fdmanana@suse.com>
---

V2: Moved the send warning to a helper function. Log the number of send and
    deduplication operations in progress too.

 fs/btrfs/ctree.h |  6 ++++++
 fs/btrfs/ioctl.c | 19 ++++++++++++++++++-
 fs/btrfs/send.c  | 26 ++++++++++++++++++++++++++
 3 files changed, 50 insertions(+), 1 deletion(-)

Comments

David Sterba April 24, 2019, 4:26 p.m. UTC | #1
On Mon, Apr 22, 2019 at 04:43:42PM +0100, fdmanana@kernel.org wrote:
> From: Filipe Manana <fdmanana@suse.com>
>
> [ full commit message and stack trace trimmed; quoted in full above ]

Added to misc-next, thanks.
David Sterba May 13, 2019, 3:56 p.m. UTC | #2
On Mon, Apr 22, 2019 at 04:43:42PM +0100, fdmanana@kernel.org wrote:
> +		btrfs_warn_rl(root_dst->fs_info,
> +"Can not deduplicate to root %llu while send operations are using it (%d in progress)",
> +			      root_dst->root_key.objectid,
> +			      root_dst->send_in_progress);

The test btrfs/187 stresses this code and the logs are flooded by the
messages, even ratelimited.

[ 7276.358431] btrfs_extent_same: 152 callbacks suppressed
[ 7276.358434] BTRFS warning (device vdb): cannot deduplicate to root 263 while send operations are using it (1 in progress)
[ 7276.473264] BTRFS warning (device vdb): cannot deduplicate to root 263 while send operations are using it (1 in progress)
[ 7276.476145] BTRFS warning (device vdb): cannot deduplicate to root 263 while send operations are using it (1 in progress)
[ 7276.481035] BTRFS warning (device vdb): cannot deduplicate to root 263 while send operations are using it (1 in progress)
[ 7276.509456] BTRFS warning (device vdb): cannot deduplicate to root 263 while send operations are using it (1 in progress)
[ 7276.531005] BTRFS warning (device vdb): cannot deduplicate to root 260 while send operations are using it (2 in progress)
[ 7276.654349] BTRFS warning (device vdb): cannot deduplicate to root 260 while send operations are using it (2 in progress)
[ 7276.659043] BTRFS warning (device vdb): cannot deduplicate to root 263 while send operations are using it (1 in progress)
[ 7276.664147] BTRFS warning (device vdb): cannot deduplicate to root 263 while send operations are using it (1 in progress)
[ 7276.667872] BTRFS warning (device vdb): cannot deduplicate to root 263 while send operations are using it (1 in progress)
[ 7281.386168] btrfs_extent_same: 151 callbacks suppressed
[ 7281.386171] BTRFS warning (device vdb): cannot deduplicate to root 263 while send operations are using it (1 in progress)
[ 7281.391417] BTRFS warning (device vdb): cannot deduplicate to root 260 while send operations are using it (2 in progress)
[ 7281.402340] BTRFS warning (device vdb): cannot deduplicate to root 263 while send operations are using it (1 in progress)
[ 7281.424163] BTRFS warning (device vdb): cannot deduplicate to root 263 while send operations are using it (1 in progress)
[ 7281.443574] BTRFS warning (device vdb): cannot deduplicate to root 260 while send operations are using it (2 in progress)
[ 7281.559380] BTRFS warning (device vdb): cannot deduplicate to root 263 while send operations are using it (1 in progress)
[ 7281.565187] BTRFS warning (device vdb): cannot deduplicate to root 260 while send operations are using it (2 in progress)
[ 7281.569269] BTRFS warning (device vdb): cannot deduplicate to root 260 while send operations are using it (2 in progress)
[ 7281.572955] BTRFS warning (device vdb): cannot deduplicate to root 263 while send operations are using it (1 in progress)
[ 7281.603316] BTRFS warning (device vdb): cannot deduplicate to root 260 while send operations are using it (2 in progress)
[ 7286.479641] btrfs_extent_same: 155 callbacks suppressed
[ 7286.479644] BTRFS warning (device vdb): cannot deduplicate to root 260 while send operations are using it (2 in progress)
[ 7286.493800] BTRFS warning (device vdb): cannot deduplicate to root 260 while send operations are using it (2 in progress)
[ 7286.497717] BTRFS warning (device vdb): cannot deduplicate to root 260 while send operations are using it (2 in progress)
[ 7286.508728] BTRFS warning (device vdb): cannot deduplicate to root 263 while send operations are using it (1 in progress)
[ 7286.541218] BTRFS warning (device vdb): cannot deduplicate to root 260 while send operations are using it (2 in progress)
[ 7286.655025] BTRFS warning (device vdb): cannot deduplicate to root 263 while send operations are using it (1 in progress)
[ 7286.659737] BTRFS warning (device vdb): cannot deduplicate to root 263 while send operations are using it (1 in progress)
[ 7286.673313] BTRFS warning (device vdb): cannot deduplicate to root 263 while send operations are using it (1 in progress)
[ 7286.678633] BTRFS warning (device vdb): cannot deduplicate to root 260 while send operations are using it (2 in progress)
[ 7286.699049] BTRFS warning (device vdb): cannot deduplicate to root 260 while send operations are using it (2 in progress)
[ 7291.485963] btrfs_extent_same: 155 callbacks suppressed
[ 7291.485966] BTRFS warning (device vdb): cannot deduplicate to root 260 while send operations are using it (2 in progress)
[ 7291.504093] BTRFS warning (device vdb): cannot deduplicate to root 260 while send operations are using it (2 in progress)
[ 7291.508731] BTRFS warning (device vdb): cannot deduplicate to root 263 while send operations are using it (1 in progress)
[ 7291.524168] BTRFS warning (device vdb): cannot deduplicate to root 263 while send operations are using it (1 in progress)
[ 7291.539922] BTRFS warning (device vdb): cannot deduplicate to root 263 while send operations are using it (1 in progress)
[ 7291.656334] BTRFS warning (device vdb): cannot deduplicate to root 263 while send operations are using it (1 in progress)
[ 7291.661194] BTRFS warning (device vdb): cannot deduplicate to root 260 while send operations are using it (2 in progress)
[ 7291.663548] BTRFS warning (device vdb): cannot deduplicate to root 260 while send operations are using it (2 in progress)
[ 7291.677950] BTRFS warning (device vdb): cannot deduplicate to root 263 while send operations are using it (1 in progress)
[ 7291.690344] BTRFS warning (device vdb): cannot deduplicate to root 260 while send operations are using it (2 in progress)
[ 7296.555685] btrfs_extent_same: 160 callbacks suppressed

I wonder if the test is rather artificial (and that's fine for testing
purposes) or if the number of messages would repeat under normal conditions.

We don't need to print the message each time the dedup tries to access a
snapshot under send, so keeping track of whether the message has been printed
already would be less intrusive and still provide the information.
David Sterba May 13, 2019, 4:07 p.m. UTC | #3
On Mon, May 13, 2019 at 05:56:07PM +0200, David Sterba wrote:
> On Mon, Apr 22, 2019 at 04:43:42PM +0100, fdmanana@kernel.org wrote:
> > +		btrfs_warn_rl(root_dst->fs_info,
> > +"Can not deduplicate to root %llu while send operations are using it (%d in progress)",
> > +			      root_dst->root_key.objectid,
> > +			      root_dst->send_in_progress);
> 
> The test btrfs/187 stresses this code and the logs are flooded by the
> messages, even ratelimited.
> 
> I wonder if the test is rather artificail (and that's fine for the testing
> purposes) or if the number of messages would repeat under normal conditions.
> 
> We don't need to print the message each time the dedup tries to acces a
> snapshot under send, so keeping track if the message has been sent already
> would be less intrusive and still provide the information.

Untested:

--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -1205,6 +1205,8 @@ enum {
        BTRFS_ROOT_DEAD_RELOC_TREE,
        /* Mark dead root stored on device whose cleanup needs to be resumed */
        BTRFS_ROOT_DEAD_TREE,
+       /* Track if dedupe was attempted under a current send */
+       BTRFS_ROOT_NOTIFIED_DEDUPE_DURING_SEND,
 };
 
 /*
diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 6dafa857bbb9..23677cf12afc 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -3263,7 +3263,9 @@ static int btrfs_extent_same(struct inode *src, u64 loff, u64 olen,
 
        spin_lock(&root_dst->root_item_lock);
        if (root_dst->send_in_progress) {
-               btrfs_warn_rl(root_dst->fs_info,
+               if (!test_and_set_bit(BTRFS_ROOT_NOTIFIED_DEDUPE_DURING_SEND,
+                                       &root_dst->state))
+                       btrfs_warn(root_dst->fs_info,
 "cannot deduplicate to root %llu while send operations are using it (%d in progress)",
                              root_dst->root_key.objectid,
                              root_dst->send_in_progress);
diff --git a/fs/btrfs/send.c b/fs/btrfs/send.c
index dd38dfe174df..cc85ae903368 100644
--- a/fs/btrfs/send.c
+++ b/fs/btrfs/send.c
@@ -6637,6 +6637,8 @@ static void btrfs_root_dec_send_in_progress(struct btrfs_root* root)
                btrfs_err(root->fs_info,
                          "send_in_progress unbalanced %d root %llu",
                          root->send_in_progress, root->root_key.objectid);
+       if (root->send_in_progress == 0)
+               clear_bit(BTRFS_ROOT_NOTIFIED_DEDUPE_DURING_SEND, &root->state);
        spin_unlock(&root->root_item_lock);
 }
Filipe Manana May 13, 2019, 4:18 p.m. UTC | #4
On Mon, May 13, 2019 at 5:06 PM David Sterba <dsterba@suse.cz> wrote:
>
> On Mon, May 13, 2019 at 05:56:07PM +0200, David Sterba wrote:
> > On Mon, Apr 22, 2019 at 04:43:42PM +0100, fdmanana@kernel.org wrote:
> > > +           btrfs_warn_rl(root_dst->fs_info,
> > > +"Can not deduplicate to root %llu while send operations are using it (%d in progress)",
> > > +                         root_dst->root_key.objectid,
> > > +                         root_dst->send_in_progress);
> >
> > The test btrfs/187 stresses this code and the logs are flooded by the
> > messages, even ratelimited.
> >
> > I wonder if the test is rather artificial (and that's fine for testing
> > purposes) or if the number of messages would repeat under normal conditions.
> >
> > We don't need to print the message each time the dedup tries to access a
> > snapshot under send, so keeping track of whether the message has been
> > printed already would be less intrusive and still provide the information.
>
> Untested:
>
> [ proposed patch trimmed; quoted in full above ]

I would leave it as it is unless users start to complain. Yes, the
test does this on purpose.
Adding such code/state seems weird to me; instead, I would change the
rate limit state so that the messages would repeat much less
frequently.

David Sterba May 13, 2019, 4:58 p.m. UTC | #5
On Mon, May 13, 2019 at 05:18:37PM +0100, Filipe Manana wrote:
> I would leave it as it is unless users start to complain. Yes, the
> test does this on purpose.
> Adding such code/state seems weird to me, instead I would change the
> rate limit state so that the messages would repeat much less
> frequently.

The difference with the state tracking is that the warning would be
printed repeatedly, which I find unnecessary, and based on past user
feedback there will be somebody asking about that.

The rate limiting can also skip a message that may be for a different
subvolume, which makes it harder to diagnose problems.

The current state is not satisfactory, at least for me, because it hurts
testing: the test now runs for about 2 hours, besides the log bloat. The
number of messages that slipped through ratelimiting is now over 11k,
which is roughly 150k messages printed overall.
Filipe Manana May 13, 2019, 5:05 p.m. UTC | #6
On Mon, May 13, 2019 at 5:57 PM David Sterba <dsterba@suse.cz> wrote:
>
> On Mon, May 13, 2019 at 05:18:37PM +0100, Filipe Manana wrote:
> > I would leave it as it is unless users start to complain. Yes, the
> > test does this on purpose.
> > Adding such code/state seems weird to me, instead I would change the
> > rate limit state so that the messages would repeat much less
> > frequently.
>
> The difference to the state tracking is that the warning would be
> printed repeatedly, which I find unnecessary and based on past user
> feedback, there will be somebody asking about that.
>
> The rate limiting can also skip a message that can be for a different
> subvolume, so this makes it harder to diagnose problems.
>
> Current state is not satisfactory at least for me because it hurts
> testing, the test runs for about 2 hours now, besides the log bloat. The

You mean the test case for fstests (btrfs/187) takes 2 hours for you?
For me it takes under 8 minutes for an unpatched btrfs, while a
patched btrfs takes somewhere between 1 minute and 3 minutes. This is
on VMs, with a debug kernel, average/cheap host hardware, etc.

> number of messages that slipped through ratelimiting is now over 11k,
> which is roughly 150k messages printed overall.
David Sterba May 13, 2019, 7:10 p.m. UTC | #7
On Mon, May 13, 2019 at 06:05:54PM +0100, Filipe Manana wrote:
> On Mon, May 13, 2019 at 5:57 PM David Sterba <dsterba@suse.cz> wrote:
> >
> > [...]
> > Current state is not satisfactory at least for me because it hurts
> > testing, the test runs for about 2 hours now, besides the log bloat. The
> 
> You mean the test case for fstests (btrfs/187) takes 2 hours for you?

This is on a VM with file-backed devices, which I use for initial tests
of patches before they go to other branches. It's a slow setup, but it
helps me identify problems early as I can run a few in parallel. I'd still
like to keep the run time below, say, 5 hours (currently it's 4). Though I
can skip some tests, I'd rather not due to coverage; but if there's no
other way I'll have to.

> For me it takes under 8 minutes for an unpatched btrfs, while a
> patched btrfs takes somewhere between 1 minute and 3 minutes. This is
> on VMs, with a debug kernel, average/cheap host hardware, etc.

On a another host, VM with physical disks it's closer to that time, it
took about 13 minutes which is acceptable.
David Sterba May 13, 2019, 10:06 p.m. UTC | #8
On Mon, May 13, 2019 at 09:10:03PM +0200, David Sterba wrote:
> On Mon, May 13, 2019 at 06:05:54PM +0100, Filipe Manana wrote:
> > [...]
> > 
> > You mean the test case for fstests (btrfs/187) takes 2 hours for you?
> 
> This is on a VM with file-backed devices, that I use for initial tests
> of patches before they go to other branches.

Update: it took about 6 minutes in another round, so the test does work
in that setup.

Patch

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index a1c7effe6de4..753ff68a8e8f 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -1348,6 +1348,12 @@  struct btrfs_root {
 	 * manipulation with the read-only status via SUBVOL_SETFLAGS
 	 */
 	int send_in_progress;
+	/*
+	 * Number of currently running deduplication operations that have a
+	 * destination inode belonging to this root. Protected by the lock
+	 * root_item_lock.
+	 */
+	int dedupe_in_progress;
 	struct btrfs_subvolume_writers *subv_writers;
 	atomic_t will_be_snapshotted;
 	atomic_t snapshot_force_cow;
diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 11550199c6bd..785ef4006d4d 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -3260,6 +3260,19 @@  static int btrfs_extent_same(struct inode *src, u64 loff, u64 olen,
 {
 	int ret;
 	u64 i, tail_len, chunk_count;
+	struct btrfs_root *root_dst = BTRFS_I(dst)->root;
+
+	spin_lock(&root_dst->root_item_lock);
+	if (root_dst->send_in_progress) {
+		btrfs_warn_rl(root_dst->fs_info,
+"Can not deduplicate to root %llu while send operations are using it (%d in progress)",
+			      root_dst->root_key.objectid,
+			      root_dst->send_in_progress);
+		spin_unlock(&root_dst->root_item_lock);
+		return -EAGAIN;
+	}
+	root_dst->dedupe_in_progress++;
+	spin_unlock(&root_dst->root_item_lock);
 
 	tail_len = olen % BTRFS_MAX_DEDUPE_LEN;
 	chunk_count = div_u64(olen, BTRFS_MAX_DEDUPE_LEN);
@@ -3268,7 +3281,7 @@  static int btrfs_extent_same(struct inode *src, u64 loff, u64 olen,
 		ret = btrfs_extent_same_range(src, loff, BTRFS_MAX_DEDUPE_LEN,
 					      dst, dst_loff);
 		if (ret)
-			return ret;
+			goto out;
 
 		loff += BTRFS_MAX_DEDUPE_LEN;
 		dst_loff += BTRFS_MAX_DEDUPE_LEN;
@@ -3277,6 +3290,10 @@  static int btrfs_extent_same(struct inode *src, u64 loff, u64 olen,
 	if (tail_len > 0)
 		ret = btrfs_extent_same_range(src, loff, tail_len, dst,
 					      dst_loff);
+out:
+	spin_lock(&root_dst->root_item_lock);
+	root_dst->dedupe_in_progress--;
+	spin_unlock(&root_dst->root_item_lock);
 
 	return ret;
 }
diff --git a/fs/btrfs/send.c b/fs/btrfs/send.c
index fe700b228b5d..f91f474f14f6 100644
--- a/fs/btrfs/send.c
+++ b/fs/btrfs/send.c
@@ -6626,6 +6626,13 @@  static void btrfs_root_dec_send_in_progress(struct btrfs_root* root)
 	spin_unlock(&root->root_item_lock);
 }
 
+static void dedupe_in_progress_warn(const struct btrfs_root *root)
+{
+	btrfs_warn_rl(root->fs_info,
+"Can not use root %llu for send while deduplications on it are in progress (%d in progress)",
+		      root->root_key.objectid, root->dedupe_in_progress);
+}
+
 long btrfs_ioctl_send(struct file *mnt_file, struct btrfs_ioctl_send_args *arg)
 {
 	int ret = 0;
@@ -6649,6 +6656,11 @@  long btrfs_ioctl_send(struct file *mnt_file, struct btrfs_ioctl_send_args *arg)
 	 * making it RW. This also protects against deletion.
 	 */
 	spin_lock(&send_root->root_item_lock);
+	if (btrfs_root_readonly(send_root) && send_root->dedupe_in_progress) {
+		dedupe_in_progress_warn(send_root);
+		spin_unlock(&send_root->root_item_lock);
+		return -EAGAIN;
+	}
 	send_root->send_in_progress++;
 	spin_unlock(&send_root->root_item_lock);
 
@@ -6783,6 +6795,13 @@  long btrfs_ioctl_send(struct file *mnt_file, struct btrfs_ioctl_send_args *arg)
 				ret = -EPERM;
 				goto out;
 			}
+			if (clone_root->dedupe_in_progress) {
+				dedupe_in_progress_warn(clone_root);
+				spin_unlock(&clone_root->root_item_lock);
+				srcu_read_unlock(&fs_info->subvol_srcu, index);
+				ret = -EAGAIN;
+				goto out;
+			}
 			clone_root->send_in_progress++;
 			spin_unlock(&clone_root->root_item_lock);
 			srcu_read_unlock(&fs_info->subvol_srcu, index);
@@ -6817,6 +6836,13 @@  long btrfs_ioctl_send(struct file *mnt_file, struct btrfs_ioctl_send_args *arg)
 			ret = -EPERM;
 			goto out;
 		}
+		if (sctx->parent_root->dedupe_in_progress) {
+			dedupe_in_progress_warn(sctx->parent_root);
+			spin_unlock(&sctx->parent_root->root_item_lock);
+			srcu_read_unlock(&fs_info->subvol_srcu, index);
+			ret = -EAGAIN;
+			goto out;
+		}
 		spin_unlock(&sctx->parent_root->root_item_lock);
 
 		srcu_read_unlock(&fs_info->subvol_srcu, index);