Message ID | 20180712003643.16439-1-fdmanana@kernel.org (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
On Thu, Jul 12, 2018 at 01:36:43AM +0100, fdmanana@kernel.org wrote: > From: Filipe Manana <fdmanana@suse.com> > > When we clone a range into a file we can end up dropping existing > extent maps (or trimming them) and replacing them with new ones if the > range to be cloned overlaps with a range in the destination inode. > When that happens we add the new extent maps to the list of modified > extents in the inode's extent map tree, so that a "fast" fsync (the flag > BTRFS_INODE_NEEDS_FULL_SYNC not set in the inode) will see the extent maps > and log corresponding extent items. However, at the end of range cloning > operation we do truncate all the pages in the affected range (in order to > ensure future reads will not get stale data). Sometimes this truncation > will release the corresponding extent maps besides the pages from the page > cache. If this happens, then a "fast" fsync operation will miss logging > some extent items, because it relies exclusively on the extent maps being > present in the inode's extent tree, leading to data loss/corruption if > the fsync ends up using the same transaction used by the clone operation > (that transaction was not committed in the meanwhile). An extent map is > released through the callback btrfs_invalidatepage(), which gets called by > truncate_inode_pages_range(), and it calls __btrfs_releasepage(). The > later ends up calling try_release_extent_mapping() which will release the > extent map if some conditions are met, like the file size being greater > than 16Mb, gfp flags allow blocking and the range not being locked (which > is the case during the clone operation) nor being the extent map flagged > as pinned (also the case for cloning). > > The following example, turned into a test for fstests, reproduces the > issue: > > $ mkfs.btrfs -f /dev/sdb > $ mount /dev/sdb /mnt > > $ xfs_io -f -c "pwrite -S 0x18 9000K 6908K" /mnt/foo > $ xfs_io -f -c "pwrite -S 0x20 2572K 156K" /mnt/bar > > $ xfs_io -c "fsync" /mnt/bar > # reflink destination offset corresponds to the size of file bar, > # 2728Kb minus 4Kb. > $ xfs_io -c ""reflink ${SCRATCH_MNT}/foo 0 2724K 15908K" /mnt/bar > $ xfs_io -c "fsync" /mnt/bar > > $ md5sum /mnt/bar > 95a95813a8c2abc9aa75a6c2914a077e /mnt/bar > > <power fail> > > $ mount /dev/sdb /mnt > $ md5sum /mnt/bar > 207fd8d0b161be8a84b945f0df8d5f8d /mnt/bar > # digest should be 95a95813a8c2abc9aa75a6c2914a077e like before the > # power failure > > In the above example, the destination offset of the clone operation > corresponds to the size of the "bar" file minus 4Kb. So during the clone > operation, the extent map covering the range from 2572Kb to 2728Kb gets > trimmed so that it ends at offset 2724Kb, and a new extent map covering > the range from 2724Kb to 11724Kb is created. So at the end of the clone > operation when we ask to truncate the pages in the range from 2724Kb to > 2724Kb + 15908Kb, the page invalidation callback ends up removing the new > extent map (through try_release_extent_mapping()) when the page at offset > 2724Kb is passed to that callback. > > Fix this by setting the bit BTRFS_INODE_NEEDS_FULL_SYNC whenever an extent > map is removed at try_release_extent_mapping(), forcing the next fsync to > search for modified extents in the fs/subvolume tree instead of relying on > the presence of extent maps in memory. This way we can continue doing a > "fast" fsync if the destination range of a clone operation does not > overlap with an existing range or if any of the criteria necessary to > remove an extent map at try_release_extent_mapping() is not met (file > size not bigger then 16Mb or gfp flags do not allow blocking). > > CC: stable@vger.kernel.org # 3.16+ > Signed-off-by: Filipe Manana <fdmanana@suse.com> I've added this to misc-next and will forward it to 4.18 as the type of fix qualifies for a late rc state, thanks. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c index e55843f536bc..b3e45714d28f 100644 --- a/fs/btrfs/extent_io.c +++ b/fs/btrfs/extent_io.c @@ -4238,8 +4238,9 @@ int try_release_extent_mapping(struct page *page, gfp_t mask) struct extent_map *em; u64 start = page_offset(page); u64 end = start + PAGE_SIZE - 1; - struct extent_io_tree *tree = &BTRFS_I(page->mapping->host)->io_tree; - struct extent_map_tree *map = &BTRFS_I(page->mapping->host)->extent_tree; + struct btrfs_inode *btrfs_inode = BTRFS_I(page->mapping->host); + struct extent_io_tree *tree = &btrfs_inode->io_tree; + struct extent_map_tree *map = &btrfs_inode->extent_tree; if (gfpflags_allow_blocking(mask) && page->mapping->host->i_size > SZ_16M) { @@ -4262,6 +4263,8 @@ int try_release_extent_mapping(struct page *page, gfp_t mask) extent_map_end(em) - 1, EXTENT_LOCKED | EXTENT_WRITEBACK, 0, NULL)) { + set_bit(BTRFS_INODE_NEEDS_FULL_SYNC, + &btrfs_inode->runtime_flags); remove_extent_mapping(map, em); /* once for the rb tree */ free_extent_map(em);