[v3] btrfs: Flush before reflinking any extent to prevent NOCOW write falling back to CoW without data reservation

[BUG]
The following script can cause unexpected fsync failure:

  #!/bin/bash

  dev=/dev/test/test
  mnt=/mnt/btrfs

  mkfs.btrfs -f $dev -b 512M > /dev/null
  mount $dev $mnt -o nospace_cache

  # Prealloc one extent
  xfs_io -f -c "falloc 8k 64m" $mnt/file1
  # Fill the remaining data space
  xfs_io -f -c "pwrite 0 -b 4k 512M" $mnt/padding
  sync

  # Write into the prealloc extent
  xfs_io -c "pwrite 1m 16m" $mnt/file1

  # Reflink then fsync, fsync would fail due to ENOSPC
  xfs_io -c "reflink $mnt/file1 8k 0 4k" -c "fsync" $mnt/file1
  umount $dev

The fsync fails with ENOSPC, and the last page of the buffered write is
lost.

[CAUSE]
This is caused by:
- Btrfs' back reference only has extent level granularity
  So write into shared extent must be CoWed even only part of the extent
  is shared.

So for above script we have:
- fallocate
  Create a preallocated extent where we can do NOCOW write.

- fill all the remaining data and unallocated space

- buffered write into preallocated space
  As we have not enough space available for data and the extent is not
  shared (yet) we fall into NOCOW mode.

- reflink
  Now part of the large preallocated extent is shared, later write
  into that extent must be CoWed.

- fsync triggers writeback
  But now the extent is shared and therefore we must fallback into COW
  mode, which fails with ENOSPC since there's not enough space to
  allocate data extents.

[WORKAROUND]
The workaround is to ensure any buffered write in the related extents
(not just the reflink source range) get flushed before reflink/dedupe,
so that NOCOW writes succeed that happened before reflinking succeed.

The workaround is expensive, we could do it better by only flushing
NOCOW range, but that needs extra accounting for NOCOW range.
For now, fix the possible data loss first.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
changelog:
RFC->v1:
- Use better words for commit message and comment.
- Move the whole inode flushing to btrfs_remap_file_range_prep().
  This also covers dedupe.
- Update the reproducer to fail explicitly.
- Remove false statement on transaction abort.

v1->v2:
- Extra comment and commit message refine.
- Don't wait ordered extent, only flush (starts delalloc)
  Single filemap_flush() should be enough. The async extent part will
  never be NOCOWed, thus no need to worry.
---
 fs/btrfs/ioctl.c | 21 +++++++++++++++++++++
 1 file changed, 21 insertions(+)

Message ID	20190508104958.18363-1-wqu@suse.com (mailing list archive)
State	New, archived
Headers	show Return-Path: <linux-btrfs-owner@kernel.org> From: Qu Wenruo <wqu@suse.com> To: linux-btrfs@vger.kernel.org Subject: [PATCH v3] btrfs: Flush before reflinking any extent to prevent NOCOW write falling back to CoW without data reservation Date: Wed, 8 May 2019 18:49:58 +0800 Message-Id: <20190508104958.18363-1-wqu@suse.com> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Sender: linux-btrfs-owner@vger.kernel.org Precedence: bulk
Series	[v3] btrfs: Flush before reflinking any extent to prevent NOCOW write falling back to CoW without data reservation \| expand [v3] btrfs: Flush before reflinking any extent to prevent NOCOW write falling back to CoW without d…

[v3] btrfs: Flush before reflinking any extent to prevent NOCOW write falling back to CoW without data reservation

Commit Message

Comments

Patch