[1/2] Btrfs: fix race between ranged fsync and writeback of adjacent ranges
diff mbox series

Message ID 20190506154402.20097-1-fdmanana@kernel.org
State New
Headers show
Series
  • [1/2] Btrfs: fix race between ranged fsync and writeback of adjacent ranges
Related show

Commit Message

Filipe Manana May 6, 2019, 3:44 p.m. UTC
From: Filipe Manana <fdmanana@suse.com>

When we do a full fsync (the bit BTRFS_INODE_NEEDS_FULL_SYNC is set in the
inode) that happens to be ranged, which happens during a msync() or writes
for files opened with O_SYNC for example, we can end up with a corrupt log,
due to different file extent items representing ranges that overlap with
each other, or hit some assertion failures.

When doing a ranged fsync we only flush delalloc and wait for ordered
exents within that range. If while we are logging items from our inode
ordered extents for adjacent ranges complete, we end up in a race that can
make us insert the file extent items that overlap with others we logged
previously and the assertion failures.

For example, if tree-log.c:copy_items() receives a leaf that has the
following file extents items, all with a length of 4K and therefore there
is an implicit hole in the range 68K to 72K - 1:

  (257 EXTENT_ITEM 64K), (257 EXTENT_ITEM 72K), (257 EXTENT_ITEM 76K), ...

It copies them to the log tree. However due to the need to detect implicit
holes, it may release the path, in order to look at the previous leaf to
detect an implicit hole, and then later it will search again in the tree
for the first file extent item key, with the goal of locking again the
leaf (which might have changed due to concurrent changes to other inodes).

However when it locks again the leaf containing the first key, the key
corresponding to the extent at offset 72K may not be there anymore since
there is an ordered extent for that range that is finishing (that is,
somewhere in the middle of btrfs_finish_ordered_io()), and it just
removed the file extent item but has not yet replaced it with a new file
extent item, so the part of copy_items() that does hole detection will
decide that there is a hole in the range starting from 68K to 76K - 1,
and therefore insert a file extent item to represent that hole, having
a key offset of 68K. After that we now have a log tree with 2 different
extent items that have overlapping ranges:

 1) The file extent item copied before copy_items() released the path,
    which has a key offset of 72K and a length of 4K, representing the
    file range 72K to 76K - 1.

 2) And a file extent item representing a hole that has a key offset of
    68K and a length of 8K, representing the range 68K to 76K - 1. This
    item was inserted after releasing the path, and overlaps with the
    extent item inserted before.

The overlapping extent items can cause all sorts of unpredictable and
incorrect behaviour, either when replayed or if a fast (non full) fsync
happens later, which can trigger a BUG_ON() when calling
btrfs_set_item_key_safe() through __btrfs_drop_extents(), producing a
trace like the following:

  [61666.783269] ------------[ cut here ]------------
  [61666.783943] kernel BUG at fs/btrfs/ctree.c:3182!
  [61666.784644] invalid opcode: 0000 [#1] PREEMPT SMP
  (...)
  [61666.786253] task: ffff880117b88c40 task.stack: ffffc90008168000
  [61666.786253] RIP: 0010:btrfs_set_item_key_safe+0x7c/0xd2 [btrfs]
  [61666.786253] RSP: 0018:ffffc9000816b958 EFLAGS: 00010246
  [61666.786253] RAX: 0000000000000000 RBX: 000000000000000f RCX: 0000000000030000
  [61666.786253] RDX: 0000000000000000 RSI: ffffc9000816ba4f RDI: ffffc9000816b937
  [61666.786253] RBP: ffffc9000816b998 R08: ffff88011dae2428 R09: 0000000000001000
  [61666.786253] R10: 0000160000000000 R11: 6db6db6db6db6db7 R12: ffff88011dae2418
  [61666.786253] R13: ffffc9000816ba4f R14: ffff8801e10c4118 R15: ffff8801e715c000
  [61666.786253] FS:  00007f6060a18700(0000) GS:ffff88023f5c0000(0000) knlGS:0000000000000000
  [61666.786253] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  [61666.786253] CR2: 00007f6060a28000 CR3: 0000000213e69000 CR4: 00000000000006e0
  [61666.786253] Call Trace:
  [61666.786253]  __btrfs_drop_extents+0x5e3/0xaad [btrfs]
  [61666.786253]  ? time_hardirqs_on+0x9/0x14
  [61666.786253]  btrfs_log_changed_extents+0x294/0x4e0 [btrfs]
  [61666.786253]  ? release_extent_buffer+0x38/0xb4 [btrfs]
  [61666.786253]  btrfs_log_inode+0xb6e/0xcdc [btrfs]
  [61666.786253]  ? lock_acquire+0x131/0x1c5
  [61666.786253]  ? btrfs_log_inode_parent+0xee/0x659 [btrfs]
  [61666.786253]  ? arch_local_irq_save+0x9/0xc
  [61666.786253]  ? btrfs_log_inode_parent+0x1f5/0x659 [btrfs]
  [61666.786253]  btrfs_log_inode_parent+0x223/0x659 [btrfs]
  [61666.786253]  ? arch_local_irq_save+0x9/0xc
  [61666.786253]  ? lockref_get_not_zero+0x2c/0x34
  [61666.786253]  ? rcu_read_unlock+0x3e/0x5d
  [61666.786253]  btrfs_log_dentry_safe+0x60/0x7b [btrfs]
  [61666.786253]  btrfs_sync_file+0x317/0x42c [btrfs]
  [61666.786253]  vfs_fsync_range+0x8c/0x9e
  [61666.786253]  SyS_msync+0x13c/0x1c9
  [61666.786253]  entry_SYSCALL_64_fastpath+0x18/0xad

A sample of a corrupt log tree leaf with overlapping extents I got from
running btrfs/072:

      item 14 key (295 108 200704) itemoff 2599 itemsize 53
              extent data disk bytenr 0 nr 0
              extent data offset 0 nr 458752 ram 458752
      item 15 key (295 108 659456) itemoff 2546 itemsize 53
              extent data disk bytenr 4343541760 nr 770048
              extent data offset 606208 nr 163840 ram 770048
      item 16 key (295 108 663552) itemoff 2493 itemsize 53
              extent data disk bytenr 4343541760 nr 770048
              extent data offset 610304 nr 155648 ram 770048
      item 17 key (295 108 819200) itemoff 2440 itemsize 53
              extent data disk bytenr 4334788608 nr 4096
              extent data offset 0 nr 4096 ram 4096

The file extent item at offset 659456 (item 15) ends at offset 823296
(659456 + 163840) while the next file extent item (item 16) starts at
offset 663552.

Another different problem that the race can trigger is a failure in the
assertions at tree-log.c:copy_items(), which expect that the first file
extent item key we found before releasing the path exists after we have
released path and that the last key we found before releasing the path
also exists after releasing the path:

  $ cat -n fs/btrfs/tree-log.c
  4080          if (need_find_last_extent) {
  4081                  /* btrfs_prev_leaf could return 1 without releasing the path */
  4082                  btrfs_release_path(src_path);
  4083                  ret = btrfs_search_slot(NULL, inode->root, &first_key,
  4084                                  src_path, 0, 0);
  4085                  if (ret < 0)
  4086                          return ret;
  4087                  ASSERT(ret == 0);
  (...)
  4103                  if (i >= btrfs_header_nritems(src_path->nodes[0])) {
  4104                          ret = btrfs_next_leaf(inode->root, src_path);
  4105                          if (ret < 0)
  4106                                  return ret;
  4107                          ASSERT(ret == 0);
  4108                          src = src_path->nodes[0];
  4109                          i = 0;
  4110                          need_find_last_extent = true;
  4111                  }
  (...)

The second assertion implicitly expects that the last key before the path
release still exists, because the surrounding while loop only stops after
we have found that key. When this assertion fails it produces a stack like
this:

  [139590.037075] assertion failed: ret == 0, file: fs/btrfs/tree-log.c, line: 4107
  [139590.037406] ------------[ cut here ]------------
  [139590.037707] kernel BUG at fs/btrfs/ctree.h:3546!
  [139590.038034] invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC PTI
  [139590.038340] CPU: 1 PID: 31841 Comm: fsstress Tainted: G        W         5.0.0-btrfs-next-46 #1
  (...)
  [139590.039354] RIP: 0010:assfail.constprop.24+0x18/0x1a [btrfs]
  (...)
  [139590.040397] RSP: 0018:ffffa27f48f2b9b0 EFLAGS: 00010282
  [139590.040730] RAX: 0000000000000041 RBX: ffff897c635d92c8 RCX: 0000000000000000
  [139590.041105] RDX: 0000000000000000 RSI: ffff897d36a96868 RDI: ffff897d36a96868
  [139590.041470] RBP: ffff897d1b9a0708 R08: 0000000000000000 R09: 0000000000000000
  [139590.041815] R10: 0000000000000008 R11: 0000000000000000 R12: 0000000000000013
  [139590.042159] R13: 0000000000000227 R14: ffff897cffcbba88 R15: 0000000000000001
  [139590.042501] FS:  00007f2efc8dee80(0000) GS:ffff897d36a80000(0000) knlGS:0000000000000000
  [139590.042847] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  [139590.043199] CR2: 00007f8c064935e0 CR3: 0000000232252002 CR4: 00000000003606e0
  [139590.043547] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
  [139590.043899] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
  [139590.044250] Call Trace:
  [139590.044631]  copy_items+0xa3f/0x1000 [btrfs]
  [139590.045009]  ? generic_bin_search.constprop.32+0x61/0x200 [btrfs]
  [139590.045396]  btrfs_log_inode+0x7b3/0xd70 [btrfs]
  [139590.045773]  btrfs_log_inode_parent+0x2b3/0xce0 [btrfs]
  [139590.046143]  ? do_raw_spin_unlock+0x49/0xc0
  [139590.046510]  btrfs_log_dentry_safe+0x4a/0x70 [btrfs]
  [139590.046872]  btrfs_sync_file+0x3b6/0x440 [btrfs]
  [139590.047243]  btrfs_file_write_iter+0x45b/0x5c0 [btrfs]
  [139590.047592]  __vfs_write+0x129/0x1c0
  [139590.047932]  vfs_write+0xc2/0x1b0
  [139590.048270]  ksys_write+0x55/0xc0
  [139590.048608]  do_syscall_64+0x60/0x1b0
  [139590.048946]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
  [139590.049287] RIP: 0033:0x7f2efc4be190
  (...)
  [139590.050342] RSP: 002b:00007ffe743243a8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
  [139590.050701] RAX: ffffffffffffffda RBX: 0000000000008d58 RCX: 00007f2efc4be190
  [139590.051067] RDX: 0000000000008d58 RSI: 00005567eca0f370 RDI: 0000000000000003
  [139590.051459] RBP: 0000000000000024 R08: 0000000000000003 R09: 0000000000008d60
  [139590.051863] R10: 0000000000000078 R11: 0000000000000246 R12: 0000000000000003
  [139590.052252] R13: 00000000003d3507 R14: 00005567eca0f370 R15: 0000000000000000
  (...)
  [139590.055128] ---[ end trace 193f35d0215cdeeb ]---

So fix this race between a full ranged fsync and writeback of adjacent
ranges by flushing all delalloc and waiting for all ordered extents to
complete before logging the inode. This is the simplest way to solve the
problem because currently the full fsync path does not deal with ranges
at all (it assumes a full range from 0 to LLONG_MAX) and it always needs
to look at adjacent ranges for hole detection. For use cases of ranged
fsyncs this can make a few fsyncs slower but on the other hand it can
make some following fsyncs to other ranges do less work or no need to do
anything at all. A full fsync is rare anyway and happens only once after
loading/creating an inode and once after less common operations such as a
shrinking truncate.

This is an issue that exists for a long time, and was often triggered by
generic/127, because it does mmap'ed writes and msync (which triggers a
ranged fsync). Adding support for the tree checker to detect overlapping
extents (next patch in the series) and trigger a WARN() when such cases
are found, and then calling btrfs_check_leaf_full() at the end of
btrfs_insert_file_extent() made the issue much easier to detect. Running
btrfs/072 with that change to the tree checker and making fsstress open
files always with O_SYNC made it much easier to trigger the issue (as
triggering it with generic/127 is very rare).

CC: stable@vger.kernel.org # 3.16+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 fs/btrfs/file.c | 12 ++++++++++++
 1 file changed, 12 insertions(+)

Comments

Sasha Levin May 6, 2019, 5:18 p.m. UTC | #1
Hi,

[This is an automated email]

This commit has been processed because it contains a -stable tag.
The stable tag indicates that it's relevant for the following trees: 3.16+

The bot has tested the following trees: v5.0.13, v4.19.40, v4.14.116, v4.9.173, v4.4.179, v3.18.139.

v5.0.13: Build OK!
v4.19.40: Build OK!
v4.14.116: Build OK!
v4.9.173: Build OK!
v4.4.179: Build OK!
v3.18.139: Failed to apply! Possible dependencies:
    9dcbeed4d7e1 ("btrfs: fix signed overflows in btrfs_sync_file")


How should we proceed with this patch?

--
Thanks,
Sasha
Josef Bacik May 7, 2019, 5:44 p.m. UTC | #2
On Mon, May 06, 2019 at 04:44:02PM +0100, fdmanana@kernel.org wrote:
> From: Filipe Manana <fdmanana@suse.com>
> 
> When we do a full fsync (the bit BTRFS_INODE_NEEDS_FULL_SYNC is set in the
> inode) that happens to be ranged, which happens during a msync() or writes
> for files opened with O_SYNC for example, we can end up with a corrupt log,
> due to different file extent items representing ranges that overlap with
> each other, or hit some assertion failures.
> 
> When doing a ranged fsync we only flush delalloc and wait for ordered
> exents within that range. If while we are logging items from our inode
> ordered extents for adjacent ranges complete, we end up in a race that can
> make us insert the file extent items that overlap with others we logged
> previously and the assertion failures.
> 
> For example, if tree-log.c:copy_items() receives a leaf that has the
> following file extents items, all with a length of 4K and therefore there
> is an implicit hole in the range 68K to 72K - 1:
> 
>   (257 EXTENT_ITEM 64K), (257 EXTENT_ITEM 72K), (257 EXTENT_ITEM 76K), ...
> 
> It copies them to the log tree. However due to the need to detect implicit
> holes, it may release the path, in order to look at the previous leaf to
> detect an implicit hole, and then later it will search again in the tree
> for the first file extent item key, with the goal of locking again the
> leaf (which might have changed due to concurrent changes to other inodes).
> 
> However when it locks again the leaf containing the first key, the key
> corresponding to the extent at offset 72K may not be there anymore since
> there is an ordered extent for that range that is finishing (that is,
> somewhere in the middle of btrfs_finish_ordered_io()), and it just
> removed the file extent item but has not yet replaced it with a new file
> extent item, so the part of copy_items() that does hole detection will
> decide that there is a hole in the range starting from 68K to 76K - 1,
> and therefore insert a file extent item to represent that hole, having
> a key offset of 68K. After that we now have a log tree with 2 different
> extent items that have overlapping ranges:
> 

Well this kind of sucks.  I wonder if we should be locking the extent range
while we're doing this in order to keep this problem from happening.  A fix for
another day though

Reviewed-by: Josef Bacik <josef@toxicpanda.com>

Thanks,

Josef
Filipe Manana May 7, 2019, 7:09 p.m. UTC | #3
On Tue, May 7, 2019 at 6:44 PM Josef Bacik <josef@toxicpanda.com> wrote:
>
> On Mon, May 06, 2019 at 04:44:02PM +0100, fdmanana@kernel.org wrote:
> > From: Filipe Manana <fdmanana@suse.com>
> >
> > When we do a full fsync (the bit BTRFS_INODE_NEEDS_FULL_SYNC is set in the
> > inode) that happens to be ranged, which happens during a msync() or writes
> > for files opened with O_SYNC for example, we can end up with a corrupt log,
> > due to different file extent items representing ranges that overlap with
> > each other, or hit some assertion failures.
> >
> > When doing a ranged fsync we only flush delalloc and wait for ordered
> > exents within that range. If while we are logging items from our inode
> > ordered extents for adjacent ranges complete, we end up in a race that can
> > make us insert the file extent items that overlap with others we logged
> > previously and the assertion failures.
> >
> > For example, if tree-log.c:copy_items() receives a leaf that has the
> > following file extents items, all with a length of 4K and therefore there
> > is an implicit hole in the range 68K to 72K - 1:
> >
> >   (257 EXTENT_ITEM 64K), (257 EXTENT_ITEM 72K), (257 EXTENT_ITEM 76K), ...
> >
> > It copies them to the log tree. However due to the need to detect implicit
> > holes, it may release the path, in order to look at the previous leaf to
> > detect an implicit hole, and then later it will search again in the tree
> > for the first file extent item key, with the goal of locking again the
> > leaf (which might have changed due to concurrent changes to other inodes).
> >
> > However when it locks again the leaf containing the first key, the key
> > corresponding to the extent at offset 72K may not be there anymore since
> > there is an ordered extent for that range that is finishing (that is,
> > somewhere in the middle of btrfs_finish_ordered_io()), and it just
> > removed the file extent item but has not yet replaced it with a new file
> > extent item, so the part of copy_items() that does hole detection will
> > decide that there is a hole in the range starting from 68K to 76K - 1,
> > and therefore insert a file extent item to represent that hole, having
> > a key offset of 68K. After that we now have a log tree with 2 different
> > extent items that have overlapping ranges:
> >
>
> Well this kind of sucks.  I wonder if we should be locking the extent range
> while we're doing this in order to keep this problem from happening.  A fix for
> another day though

The only reason I have not adopted that, is because it would prevent
fsync and readpages() / readpage() from running concurrently (more of
a problem when fsync is full ranged).

Locking the range would avoid any such eventual performance drop on
fsync alone, but it would also allow to drop the inode's dio_sem?
Remember it was giving lockdep warnings before and you moved it to
btrfs_sync_file() from btrfs_log_changed_extents() sometime ago.
However still not enough, as I still get often (random xfstests,
brfs/072 and generic/299 for example) similar lockdep warnings due to
dio_sem:

https://pastebin.com/ar6cLg2Q

Thanks.




>
> Reviewed-by: Josef Bacik <josef@toxicpanda.com>
>
> Thanks,
>
> Josef
Josef Bacik May 7, 2019, 7:31 p.m. UTC | #4
On Tue, May 07, 2019 at 08:09:02PM +0100, Filipe Manana wrote:
> On Tue, May 7, 2019 at 6:44 PM Josef Bacik <josef@toxicpanda.com> wrote:
> >
> > On Mon, May 06, 2019 at 04:44:02PM +0100, fdmanana@kernel.org wrote:
> > > From: Filipe Manana <fdmanana@suse.com>
> > >
> > > When we do a full fsync (the bit BTRFS_INODE_NEEDS_FULL_SYNC is set in the
> > > inode) that happens to be ranged, which happens during a msync() or writes
> > > for files opened with O_SYNC for example, we can end up with a corrupt log,
> > > due to different file extent items representing ranges that overlap with
> > > each other, or hit some assertion failures.
> > >
> > > When doing a ranged fsync we only flush delalloc and wait for ordered
> > > exents within that range. If while we are logging items from our inode
> > > ordered extents for adjacent ranges complete, we end up in a race that can
> > > make us insert the file extent items that overlap with others we logged
> > > previously and the assertion failures.
> > >
> > > For example, if tree-log.c:copy_items() receives a leaf that has the
> > > following file extents items, all with a length of 4K and therefore there
> > > is an implicit hole in the range 68K to 72K - 1:
> > >
> > >   (257 EXTENT_ITEM 64K), (257 EXTENT_ITEM 72K), (257 EXTENT_ITEM 76K), ...
> > >
> > > It copies them to the log tree. However due to the need to detect implicit
> > > holes, it may release the path, in order to look at the previous leaf to
> > > detect an implicit hole, and then later it will search again in the tree
> > > for the first file extent item key, with the goal of locking again the
> > > leaf (which might have changed due to concurrent changes to other inodes).
> > >
> > > However when it locks again the leaf containing the first key, the key
> > > corresponding to the extent at offset 72K may not be there anymore since
> > > there is an ordered extent for that range that is finishing (that is,
> > > somewhere in the middle of btrfs_finish_ordered_io()), and it just
> > > removed the file extent item but has not yet replaced it with a new file
> > > extent item, so the part of copy_items() that does hole detection will
> > > decide that there is a hole in the range starting from 68K to 76K - 1,
> > > and therefore insert a file extent item to represent that hole, having
> > > a key offset of 68K. After that we now have a log tree with 2 different
> > > extent items that have overlapping ranges:
> > >
> >
> > Well this kind of sucks.  I wonder if we should be locking the extent range
> > while we're doing this in order to keep this problem from happening.  A fix for
> > another day though
> 
> The only reason I have not adopted that, is because it would prevent
> fsync and readpages() / readpage() from running concurrently (more of
> a problem when fsync is full ranged).
> 

Hrm good point.  Time to make the extent lock read/write lock!

> Locking the range would avoid any such eventual performance drop on
> fsync alone, but it would also allow to drop the inode's dio_sem?
> Remember it was giving lockdep warnings before and you moved it to
> btrfs_sync_file() from btrfs_log_changed_extents() sometime ago.
> However still not enough, as I still get often (random xfstests,
> brfs/072 and generic/299 for example) similar lockdep warnings due to
> dio_sem:
> 
> https://pastebin.com/ar6cLg2Q

Hmm that's annoying, but not a real problem because we're just destroying a
workqueue that has nothing on it because we raced with somebody else allocating
a new workqueue.  I think the best thing to do is to export sb_init_dio_done_wq
and call it within our direct io before we take all of our locks, that should
shut lockdep up.  That's kind of shitty though, I'll think about it some more.
Thanks,

Josef

Patch
diff mbox series

diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 34fe8a58b0e9..db277bde4894 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -2059,6 +2059,18 @@  int btrfs_sync_file(struct file *file, loff_t start, loff_t end, int datasync)
 	u64 len;
 
 	/*
+	 * If the inode needs a full sync, make sure we use a full range to
+	 * avoid log tree corruption, due to hole detection racing with ordered
+	 * extent completion for adjacent ranges, and assertion failures during
+	 * hole detection.
+	 */
+	if (test_bit(BTRFS_INODE_NEEDS_FULL_SYNC,
+		     &BTRFS_I(inode)->runtime_flags)) {
+		start = 0;
+		end = LLONG_MAX;
+	}
+
+	/*
 	 * The range length can be represented by u64, we have to do the typecasts
 	 * to avoid signed overflow if it's [0, LLONG_MAX] eg. from fsync()
 	 */