diff mbox

[v2] Btrfs: fix file data corruption after cloning a range and fsync

Message ID 20180712003643.16439-1-fdmanana@kernel.org (mailing list archive)
State New, archived
Headers show

Commit Message

Filipe Manana July 12, 2018, 12:36 a.m. UTC
From: Filipe Manana <fdmanana@suse.com>

When we clone a range into a file we can end up dropping existing
extent maps (or trimming them) and replacing them with new ones if the
range to be cloned overlaps with a range in the destination inode.
When that happens we add the new extent maps to the  list of modified
extents in the inode's extent map tree, so that a "fast" fsync (the flag
BTRFS_INODE_NEEDS_FULL_SYNC not set in the inode) will see the extent maps
and log corresponding extent items. However, at the end of range cloning
operation we do truncate all the pages in the affected range (in order to
ensure future reads will not get stale data). Sometimes this truncation
will release the corresponding extent maps besides the pages from the page
cache. If this happens, then a "fast" fsync operation will miss logging
some extent items, because it relies exclusively on the extent maps being
present in the inode's extent tree, leading to data loss/corruption if
the fsync ends up using the same transaction used by the clone operation
(that transaction was not committed in the meanwhile). An extent map is
released through the callback btrfs_invalidatepage(), which gets called by
truncate_inode_pages_range(), and it calls __btrfs_releasepage(). The
later ends up calling try_release_extent_mapping() which will release the
extent map if some conditions are met, like the file size being greater
than 16Mb, gfp flags allow blocking and the range not being locked (which
is the case during the clone operation) nor being the extent map flagged
as pinned (also the case for cloning).

The following example, turned into a test for fstests, reproduces the
issue:

  $ mkfs.btrfs -f /dev/sdb
  $ mount /dev/sdb /mnt

  $ xfs_io -f -c "pwrite -S 0x18 9000K 6908K" /mnt/foo
  $ xfs_io -f -c "pwrite -S 0x20 2572K 156K" /mnt/bar

  $ xfs_io -c "fsync" /mnt/bar
  # reflink destination offset corresponds to the size of file bar,
  # 2728Kb minus 4Kb.
  $ xfs_io -c ""reflink ${SCRATCH_MNT}/foo 0 2724K 15908K" /mnt/bar
  $ xfs_io -c "fsync" /mnt/bar

  $ md5sum /mnt/bar
  95a95813a8c2abc9aa75a6c2914a077e  /mnt/bar

  <power fail>

  $ mount /dev/sdb /mnt
  $ md5sum /mnt/bar
  207fd8d0b161be8a84b945f0df8d5f8d  /mnt/bar
  # digest should be 95a95813a8c2abc9aa75a6c2914a077e like before the
  # power failure

In the above example, the destination offset of the clone operation
corresponds to the size of the "bar" file minus 4Kb. So during the clone
operation, the extent map covering the range from 2572Kb to 2728Kb gets
trimmed so that it ends at offset 2724Kb, and a new extent map covering
the range from 2724Kb to 11724Kb is created. So at the end of the clone
operation when we ask to truncate the pages in the range from 2724Kb to
2724Kb + 15908Kb, the page invalidation callback ends up removing the new
extent map (through try_release_extent_mapping()) when the page at offset
2724Kb is passed to that callback.

Fix this by setting the bit BTRFS_INODE_NEEDS_FULL_SYNC whenever an extent
map is removed at try_release_extent_mapping(), forcing the next fsync to
search for modified extents in the fs/subvolume tree instead of relying on
the presence of extent maps in memory. This way we can continue doing a
"fast" fsync if the destination range of a clone operation does not
overlap with an existing range or if any of the criteria necessary to
remove an extent map at try_release_extent_mapping() is not met (file
size not bigger then 16Mb or gfp flags do not allow blocking).

CC: stable@vger.kernel.org # 3.16+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
---

V2: Added missing "fix" word to subject only.

 fs/btrfs/extent_io.c | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

Comments

David Sterba July 19, 2018, 1:35 p.m. UTC | #1
On Thu, Jul 12, 2018 at 01:36:43AM +0100, fdmanana@kernel.org wrote:
> From: Filipe Manana <fdmanana@suse.com>
> 
> When we clone a range into a file we can end up dropping existing
> extent maps (or trimming them) and replacing them with new ones if the
> range to be cloned overlaps with a range in the destination inode.
> When that happens we add the new extent maps to the  list of modified
> extents in the inode's extent map tree, so that a "fast" fsync (the flag
> BTRFS_INODE_NEEDS_FULL_SYNC not set in the inode) will see the extent maps
> and log corresponding extent items. However, at the end of range cloning
> operation we do truncate all the pages in the affected range (in order to
> ensure future reads will not get stale data). Sometimes this truncation
> will release the corresponding extent maps besides the pages from the page
> cache. If this happens, then a "fast" fsync operation will miss logging
> some extent items, because it relies exclusively on the extent maps being
> present in the inode's extent tree, leading to data loss/corruption if
> the fsync ends up using the same transaction used by the clone operation
> (that transaction was not committed in the meanwhile). An extent map is
> released through the callback btrfs_invalidatepage(), which gets called by
> truncate_inode_pages_range(), and it calls __btrfs_releasepage(). The
> later ends up calling try_release_extent_mapping() which will release the
> extent map if some conditions are met, like the file size being greater
> than 16Mb, gfp flags allow blocking and the range not being locked (which
> is the case during the clone operation) nor being the extent map flagged
> as pinned (also the case for cloning).
> 
> The following example, turned into a test for fstests, reproduces the
> issue:
> 
>   $ mkfs.btrfs -f /dev/sdb
>   $ mount /dev/sdb /mnt
> 
>   $ xfs_io -f -c "pwrite -S 0x18 9000K 6908K" /mnt/foo
>   $ xfs_io -f -c "pwrite -S 0x20 2572K 156K" /mnt/bar
> 
>   $ xfs_io -c "fsync" /mnt/bar
>   # reflink destination offset corresponds to the size of file bar,
>   # 2728Kb minus 4Kb.
>   $ xfs_io -c ""reflink ${SCRATCH_MNT}/foo 0 2724K 15908K" /mnt/bar
>   $ xfs_io -c "fsync" /mnt/bar
> 
>   $ md5sum /mnt/bar
>   95a95813a8c2abc9aa75a6c2914a077e  /mnt/bar
> 
>   <power fail>
> 
>   $ mount /dev/sdb /mnt
>   $ md5sum /mnt/bar
>   207fd8d0b161be8a84b945f0df8d5f8d  /mnt/bar
>   # digest should be 95a95813a8c2abc9aa75a6c2914a077e like before the
>   # power failure
> 
> In the above example, the destination offset of the clone operation
> corresponds to the size of the "bar" file minus 4Kb. So during the clone
> operation, the extent map covering the range from 2572Kb to 2728Kb gets
> trimmed so that it ends at offset 2724Kb, and a new extent map covering
> the range from 2724Kb to 11724Kb is created. So at the end of the clone
> operation when we ask to truncate the pages in the range from 2724Kb to
> 2724Kb + 15908Kb, the page invalidation callback ends up removing the new
> extent map (through try_release_extent_mapping()) when the page at offset
> 2724Kb is passed to that callback.
> 
> Fix this by setting the bit BTRFS_INODE_NEEDS_FULL_SYNC whenever an extent
> map is removed at try_release_extent_mapping(), forcing the next fsync to
> search for modified extents in the fs/subvolume tree instead of relying on
> the presence of extent maps in memory. This way we can continue doing a
> "fast" fsync if the destination range of a clone operation does not
> overlap with an existing range or if any of the criteria necessary to
> remove an extent map at try_release_extent_mapping() is not met (file
> size not bigger then 16Mb or gfp flags do not allow blocking).
> 
> CC: stable@vger.kernel.org # 3.16+
> Signed-off-by: Filipe Manana <fdmanana@suse.com>

I've added this to misc-next and will forward it to 4.18 as the type of
fix qualifies for a late rc state, thanks.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index e55843f536bc..b3e45714d28f 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -4238,8 +4238,9 @@  int try_release_extent_mapping(struct page *page, gfp_t mask)
 	struct extent_map *em;
 	u64 start = page_offset(page);
 	u64 end = start + PAGE_SIZE - 1;
-	struct extent_io_tree *tree = &BTRFS_I(page->mapping->host)->io_tree;
-	struct extent_map_tree *map = &BTRFS_I(page->mapping->host)->extent_tree;
+	struct btrfs_inode *btrfs_inode = BTRFS_I(page->mapping->host);
+	struct extent_io_tree *tree = &btrfs_inode->io_tree;
+	struct extent_map_tree *map = &btrfs_inode->extent_tree;
 
 	if (gfpflags_allow_blocking(mask) &&
 	    page->mapping->host->i_size > SZ_16M) {
@@ -4262,6 +4263,8 @@  int try_release_extent_mapping(struct page *page, gfp_t mask)
 					    extent_map_end(em) - 1,
 					    EXTENT_LOCKED | EXTENT_WRITEBACK,
 					    0, NULL)) {
+				set_bit(BTRFS_INODE_NEEDS_FULL_SYNC,
+					&btrfs_inode->runtime_flags);
 				remove_extent_mapping(map, em);
 				/* once for the rb tree */
 				free_extent_map(em);