diff mbox series

[v3] btrfs: Flush before reflinking any extent to prevent NOCOW write falling back to CoW without data reservation

Message ID 20190508104958.18363-1-wqu@suse.com (mailing list archive)
State New, archived
Headers show
Series [v3] btrfs: Flush before reflinking any extent to prevent NOCOW write falling back to CoW without data reservation | expand

Commit Message

Qu Wenruo May 8, 2019, 10:49 a.m. UTC
[BUG]
The following script can cause unexpected fsync failure:

  #!/bin/bash

  dev=/dev/test/test
  mnt=/mnt/btrfs

  mkfs.btrfs -f $dev -b 512M > /dev/null
  mount $dev $mnt -o nospace_cache

  # Prealloc one extent
  xfs_io -f -c "falloc 8k 64m" $mnt/file1
  # Fill the remaining data space
  xfs_io -f -c "pwrite 0 -b 4k 512M" $mnt/padding
  sync

  # Write into the prealloc extent
  xfs_io -c "pwrite 1m 16m" $mnt/file1

  # Reflink then fsync, fsync would fail due to ENOSPC
  xfs_io -c "reflink $mnt/file1 8k 0 4k" -c "fsync" $mnt/file1
  umount $dev

The fsync fails with ENOSPC, and the last page of the buffered write is
lost.

[CAUSE]
This is caused by:
- Btrfs' back reference only has extent level granularity
  So write into shared extent must be CoWed even only part of the extent
  is shared.

So for above script we have:
- fallocate
  Create a preallocated extent where we can do NOCOW write.

- fill all the remaining data and unallocated space

- buffered write into preallocated space
  As we have not enough space available for data and the extent is not
  shared (yet) we fall into NOCOW mode.

- reflink
  Now part of the large preallocated extent is shared, later write
  into that extent must be CoWed.

- fsync triggers writeback
  But now the extent is shared and therefore we must fallback into COW
  mode, which fails with ENOSPC since there's not enough space to
  allocate data extents.

[WORKAROUND]
The workaround is to ensure any buffered write in the related extents
(not just the reflink source range) get flushed before reflink/dedupe,
so that NOCOW writes succeed that happened before reflinking succeed.

The workaround is expensive, we could do it better by only flushing
NOCOW range, but that needs extra accounting for NOCOW range.
For now, fix the possible data loss first.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
changelog:
RFC->v1:
- Use better words for commit message and comment.
- Move the whole inode flushing to btrfs_remap_file_range_prep().
  This also covers dedupe.
- Update the reproducer to fail explicitly.
- Remove false statement on transaction abort.

v1->v2:
- Extra comment and commit message refine.
- Don't wait ordered extent, only flush (starts delalloc)
  Single filemap_flush() should be enough. The async extent part will
  never be NOCOWed, thus no need to worry.
---
 fs/btrfs/ioctl.c | 21 +++++++++++++++++++++
 1 file changed, 21 insertions(+)

Comments

Filipe Manana May 8, 2019, 5:03 p.m. UTC | #1
On Wed, May 8, 2019 at 1:43 PM Qu Wenruo <wqu@suse.com> wrote:
>
> [BUG]
> The following script can cause unexpected fsync failure:
>
>   #!/bin/bash
>
>   dev=/dev/test/test
>   mnt=/mnt/btrfs
>
>   mkfs.btrfs -f $dev -b 512M > /dev/null
>   mount $dev $mnt -o nospace_cache
>
>   # Prealloc one extent
>   xfs_io -f -c "falloc 8k 64m" $mnt/file1
>   # Fill the remaining data space
>   xfs_io -f -c "pwrite 0 -b 4k 512M" $mnt/padding
>   sync
>
>   # Write into the prealloc extent
>   xfs_io -c "pwrite 1m 16m" $mnt/file1
>
>   # Reflink then fsync, fsync would fail due to ENOSPC
>   xfs_io -c "reflink $mnt/file1 8k 0 4k" -c "fsync" $mnt/file1
>   umount $dev
>
> The fsync fails with ENOSPC, and the last page of the buffered write is
> lost.
>
> [CAUSE]
> This is caused by:
> - Btrfs' back reference only has extent level granularity
>   So write into shared extent must be CoWed even only part of the extent
>   is shared.
>
> So for above script we have:
> - fallocate
>   Create a preallocated extent where we can do NOCOW write.
>
> - fill all the remaining data and unallocated space
>
> - buffered write into preallocated space
>   As we have not enough space available for data and the extent is not
>   shared (yet) we fall into NOCOW mode.
>
> - reflink
>   Now part of the large preallocated extent is shared, later write
>   into that extent must be CoWed.
>
> - fsync triggers writeback
>   But now the extent is shared and therefore we must fallback into COW
>   mode, which fails with ENOSPC since there's not enough space to
>   allocate data extents.
>
> [WORKAROUND]
> The workaround is to ensure any buffered write in the related extents
> (not just the reflink source range) get flushed before reflink/dedupe,
> so that NOCOW writes succeed that happened before reflinking succeed.
>
> The workaround is expensive, we could do it better by only flushing
> NOCOW range, but that needs extra accounting for NOCOW range.
> For now, fix the possible data loss first.
>
> Signed-off-by: Qu Wenruo <wqu@suse.com>

Reviewed-by: Filipe Manana <fdmanana@suse.com>

Thanks.

> ---
> changelog:
> RFC->v1:
> - Use better words for commit message and comment.
> - Move the whole inode flushing to btrfs_remap_file_range_prep().
>   This also covers dedupe.
> - Update the reproducer to fail explicitly.
> - Remove false statement on transaction abort.
>
> v1->v2:
> - Extra comment and commit message refine.
> - Don't wait ordered extent, only flush (starts delalloc)
>   Single filemap_flush() should be enough. The async extent part will
>   never be NOCOWed, thus no need to worry.
> ---
>  fs/btrfs/ioctl.c | 21 +++++++++++++++++++++
>  1 file changed, 21 insertions(+)
>
> diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
> index 6dafa857bbb9..0e35bef2ec59 100644
> --- a/fs/btrfs/ioctl.c
> +++ b/fs/btrfs/ioctl.c
> @@ -4001,6 +4001,27 @@ static int btrfs_remap_file_range_prep(struct file *file_in, loff_t pos_in,
>         if (!same_inode)
>                 inode_dio_wait(inode_out);
>
> +       /*
> +        * Workaround to make sure NOCOW buffered write reach disk as NOCOW.
> +        *
> +        * Btrfs' back references do not have a block level granularity, they
> +        * work at the whole extent level.
> +        * NOCOW buffered write without data space reserved may not be able
> +        * to fall back to CoW due to lack of data space, thus could cause
> +        * data loss.
> +        *
> +        * Here we take a shortcut by flushing the whole inode, so that all
> +        * nocow write should reach disk as nocow before we increase the
> +        * reference of the extent. We could do better by only flushing NOCOW
> +        * data, but that needs extra accounting.
> +        *
> +        * Also we don't need to check ASYNC_EXTENT, as async extent will be
> +        * CoWed anyway, not affecting nocow part.
> +        */
> +       ret = filemap_flush(inode_in->i_mapping);
> +       if (ret < 0)
> +               return ret;
> +
>         ret = btrfs_wait_ordered_range(inode_in, ALIGN_DOWN(pos_in, bs),
>                                        wb_len);
>         if (ret < 0)
> --
> 2.21.0
>
David Sterba May 9, 2019, 2:49 p.m. UTC | #2
On Wed, May 08, 2019 at 06:49:58PM +0800, Qu Wenruo wrote:
> [BUG]
> The following script can cause unexpected fsync failure:
> 
>   #!/bin/bash
> 
>   dev=/dev/test/test
>   mnt=/mnt/btrfs
> 
>   mkfs.btrfs -f $dev -b 512M > /dev/null
>   mount $dev $mnt -o nospace_cache
> 
>   # Prealloc one extent
>   xfs_io -f -c "falloc 8k 64m" $mnt/file1
>   # Fill the remaining data space
>   xfs_io -f -c "pwrite 0 -b 4k 512M" $mnt/padding
>   sync
> 
>   # Write into the prealloc extent
>   xfs_io -c "pwrite 1m 16m" $mnt/file1
> 
>   # Reflink then fsync, fsync would fail due to ENOSPC
>   xfs_io -c "reflink $mnt/file1 8k 0 4k" -c "fsync" $mnt/file1
>   umount $dev
> 
> The fsync fails with ENOSPC, and the last page of the buffered write is
> lost.
> 
> [CAUSE]
> This is caused by:
> - Btrfs' back reference only has extent level granularity
>   So write into shared extent must be CoWed even only part of the extent
>   is shared.
> 
> So for above script we have:
> - fallocate
>   Create a preallocated extent where we can do NOCOW write.
> 
> - fill all the remaining data and unallocated space
> 
> - buffered write into preallocated space
>   As we have not enough space available for data and the extent is not
>   shared (yet) we fall into NOCOW mode.
> 
> - reflink
>   Now part of the large preallocated extent is shared, later write
>   into that extent must be CoWed.
> 
> - fsync triggers writeback
>   But now the extent is shared and therefore we must fallback into COW
>   mode, which fails with ENOSPC since there's not enough space to
>   allocate data extents.
> 
> [WORKAROUND]
> The workaround is to ensure any buffered write in the related extents
> (not just the reflink source range) get flushed before reflink/dedupe,
> so that NOCOW writes succeed that happened before reflinking succeed.
> 
> The workaround is expensive

Can you please quantify that, how big the performance drop is going to
be?

If the fsync comes soon after reflink, then it's effectively no change.
In case the buffered writes happen on a different range than reflink and
fsync comes later, the buffered writes will stall reflink, right?

If there are other similar corner cases we'd better know them in advance
and estimate the impact, that'll be something to look for when we get
complaints that reflink is suddenly slow.

> NOCOW range, but that needs extra accounting for NOCOW range.
> For now, fix the possible data loss first.

filemap_flush says

 437 /**
 438  * filemap_flush - mostly a non-blocking flush
 439  * @mapping:    target address_space
 440  *
 441  * This is a mostly non-blocking flush.  Not suitable for data-integrity
 442  * purposes - I/O may not be started against all dirty pages.
 443  *
 444  * Return: %0 on success, negative error code otherwise.
 445  */

so how does this work together with the statement about preventing data
loss?
Qu Wenruo May 9, 2019, 11:28 p.m. UTC | #3
On 2019/5/9 下午10:49, David Sterba wrote:
> On Wed, May 08, 2019 at 06:49:58PM +0800, Qu Wenruo wrote:
>> [BUG]
>> The following script can cause unexpected fsync failure:
>>
>>   #!/bin/bash
>>
>>   dev=/dev/test/test
>>   mnt=/mnt/btrfs
>>
>>   mkfs.btrfs -f $dev -b 512M > /dev/null
>>   mount $dev $mnt -o nospace_cache
>>
>>   # Prealloc one extent
>>   xfs_io -f -c "falloc 8k 64m" $mnt/file1
>>   # Fill the remaining data space
>>   xfs_io -f -c "pwrite 0 -b 4k 512M" $mnt/padding
>>   sync
>>
>>   # Write into the prealloc extent
>>   xfs_io -c "pwrite 1m 16m" $mnt/file1
>>
>>   # Reflink then fsync, fsync would fail due to ENOSPC
>>   xfs_io -c "reflink $mnt/file1 8k 0 4k" -c "fsync" $mnt/file1
>>   umount $dev
>>
>> The fsync fails with ENOSPC, and the last page of the buffered write is
>> lost.
>>
>> [CAUSE]
>> This is caused by:
>> - Btrfs' back reference only has extent level granularity
>>   So write into shared extent must be CoWed even only part of the extent
>>   is shared.
>>
>> So for above script we have:
>> - fallocate
>>   Create a preallocated extent where we can do NOCOW write.
>>
>> - fill all the remaining data and unallocated space
>>
>> - buffered write into preallocated space
>>   As we have not enough space available for data and the extent is not
>>   shared (yet) we fall into NOCOW mode.
>>
>> - reflink
>>   Now part of the large preallocated extent is shared, later write
>>   into that extent must be CoWed.
>>
>> - fsync triggers writeback
>>   But now the extent is shared and therefore we must fallback into COW
>>   mode, which fails with ENOSPC since there's not enough space to
>>   allocate data extents.
>>
>> [WORKAROUND]
>> The workaround is to ensure any buffered write in the related extents
>> (not just the reflink source range) get flushed before reflink/dedupe,
>> so that NOCOW writes succeed that happened before reflinking succeed.
>>
>> The workaround is expensive
> 
> Can you please quantify that, how big the performance drop is going to
> be?

Depends on how many dirty pages there are at the timing of reflink/dedupe.

If there are a lot, then it would be a delay for reflink/dedupe.

> 
> If the fsync comes soon after reflink, then it's effectively no change.
> In case the buffered writes happen on a different range than reflink and
> fsync comes later, the buffered writes will stall reflink, right?

Fsync doesn't make much difference, it mostly depends on how many dirty
pages are.

Thus the most impacted use case is concurrent buffered write with
reflink/dedupe.

> 
> If there are other similar corner cases we'd better know them in advance
> and estimate the impact, that'll be something to look for when we get
> complaints that reflink is suddenly slow.
> 
>> NOCOW range, but that needs extra accounting for NOCOW range.
>> For now, fix the possible data loss first.
> 
> filemap_flush says
> 
>  437 /**
>  438  * filemap_flush - mostly a non-blocking flush
>  439  * @mapping:    target address_space
>  440  *
>  441  * This is a mostly non-blocking flush.  Not suitable for data-integrity
>  442  * purposes - I/O may not be started against all dirty pages.
>  443  *
>  444  * Return: %0 on success, negative error code otherwise.
>  445  */
> 
> so how does this work together with the statement about preventing data
> loss?

The data loss is caused by the fact that we can start buffered write
without reserving data space, but after reflink/dedupe we have to do CoW.
Without enough space, CoW will fail due to ENOSPC.

The fix here is, we ensure all dirties pages start their writeback
(start btrfs_run_delalloc_range()) before reflink.

At btrfs_run_delalloc_range() we determine whether a range goes through
NOCOW or COW, and submit ordered extent to do real write back/csum
calculation/etc.

As long as the whole inode goes through btrfs_run_delalloc_range(), any
NOCOW write will go NOCOW on-disk.
We don't need to wait for the ordered extent to finish, just ensure all
pages goes through delalloc is enough.
Waiting for ordered extent will cause even more latency for reflink.

Thus the filemap_flush() is enough, as the point is to ensure delalloc
is started before reflink.

Thanks,
Qu
Filipe Manana May 15, 2019, 2:56 p.m. UTC | #4
On Fri, May 10, 2019 at 12:30 AM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>
>
>
> On 2019/5/9 下午10:49, David Sterba wrote:
> > On Wed, May 08, 2019 at 06:49:58PM +0800, Qu Wenruo wrote:
> >> [BUG]
> >> The following script can cause unexpected fsync failure:
> >>
> >>   #!/bin/bash
> >>
> >>   dev=/dev/test/test
> >>   mnt=/mnt/btrfs
> >>
> >>   mkfs.btrfs -f $dev -b 512M > /dev/null
> >>   mount $dev $mnt -o nospace_cache
> >>
> >>   # Prealloc one extent
> >>   xfs_io -f -c "falloc 8k 64m" $mnt/file1
> >>   # Fill the remaining data space
> >>   xfs_io -f -c "pwrite 0 -b 4k 512M" $mnt/padding
> >>   sync
> >>
> >>   # Write into the prealloc extent
> >>   xfs_io -c "pwrite 1m 16m" $mnt/file1
> >>
> >>   # Reflink then fsync, fsync would fail due to ENOSPC
> >>   xfs_io -c "reflink $mnt/file1 8k 0 4k" -c "fsync" $mnt/file1
> >>   umount $dev
> >>
> >> The fsync fails with ENOSPC, and the last page of the buffered write is
> >> lost.
> >>
> >> [CAUSE]
> >> This is caused by:
> >> - Btrfs' back reference only has extent level granularity
> >>   So write into shared extent must be CoWed even only part of the extent
> >>   is shared.
> >>
> >> So for above script we have:
> >> - fallocate
> >>   Create a preallocated extent where we can do NOCOW write.
> >>
> >> - fill all the remaining data and unallocated space
> >>
> >> - buffered write into preallocated space
> >>   As we have not enough space available for data and the extent is not
> >>   shared (yet) we fall into NOCOW mode.
> >>
> >> - reflink
> >>   Now part of the large preallocated extent is shared, later write
> >>   into that extent must be CoWed.
> >>
> >> - fsync triggers writeback
> >>   But now the extent is shared and therefore we must fallback into COW
> >>   mode, which fails with ENOSPC since there's not enough space to
> >>   allocate data extents.
> >>
> >> [WORKAROUND]
> >> The workaround is to ensure any buffered write in the related extents
> >> (not just the reflink source range) get flushed before reflink/dedupe,
> >> so that NOCOW writes succeed that happened before reflinking succeed.
> >>
> >> The workaround is expensive
> >
> > Can you please quantify that, how big the performance drop is going to
> > be?
>
> Depends on how many dirty pages there are at the timing of reflink/dedupe.
>
> If there are a lot, then it would be a delay for reflink/dedupe.
>
> >
> > If the fsync comes soon after reflink, then it's effectively no change.
> > In case the buffered writes happen on a different range than reflink and
> > fsync comes later, the buffered writes will stall reflink, right?
>
> Fsync doesn't make much difference, it mostly depends on how many dirty
> pages are.
>
> Thus the most impacted use case is concurrent buffered write with
> reflink/dedupe.
>
> >
> > If there are other similar corner cases we'd better know them in advance
> > and estimate the impact, that'll be something to look for when we get
> > complaints that reflink is suddenly slow.
> >
> >> NOCOW range, but that needs extra accounting for NOCOW range.
> >> For now, fix the possible data loss first.
> >
> > filemap_flush says
> >
> >  437 /**
> >  438  * filemap_flush - mostly a non-blocking flush
> >  439  * @mapping:    target address_space
> >  440  *
> >  441  * This is a mostly non-blocking flush.  Not suitable for data-integrity
> >  442  * purposes - I/O may not be started against all dirty pages.
> >  443  *
> >  444  * Return: %0 on success, negative error code otherwise.
> >  445  */
> >
> > so how does this work together with the statement about preventing data
> > loss?
>
> The data loss is caused by the fact that we can start buffered write
> without reserving data space, but after reflink/dedupe we have to do CoW.
> Without enough space, CoW will fail due to ENOSPC.
>
> The fix here is, we ensure all dirties pages start their writeback
> (start btrfs_run_delalloc_range()) before reflink.
>
> At btrfs_run_delalloc_range() we determine whether a range goes through
> NOCOW or COW, and submit ordered extent to do real write back/csum
> calculation/etc.
>
> As long as the whole inode goes through btrfs_run_delalloc_range(), any
> NOCOW write will go NOCOW on-disk.
> We don't need to wait for the ordered extent to finish, just ensure all
> pages goes through delalloc is enough.
> Waiting for ordered extent will cause even more latency for reflink.
>
> Thus the filemap_flush() is enough, as the point is to ensure delalloc
> is started before reflink.

I believe that David's comment is related to this part of the comment
on filemap_flush():

"I/O may not be started against all dirty pages."

I.e., his concern being that writeback may not be started and
therefore we end up with the data loss due to ENOSPC later, and not to
the technical details of why the ENOSPC failure happens, which is
already described in the changelog and discussed during the review of
previous versions of the patch.

However btrfs has its own writepages() implementation, which even for
WB_SYNC_NONE (used by filemap_flush) starts writeback (and doesn't
wait for it to finish if it started before, which is fine for this use
case).
Anyway, just my interpretation of the doubt/comment.

Thanks.

>
> Thanks,
> Qu
>
Qu Wenruo May 31, 2019, 10:56 a.m. UTC | #5
Gentle ping?

I didn't see this patch included in misc-next.
Anything went wrong?

Thanks,
Qu

On 2019/5/15 下午10:56, Filipe Manana wrote:
> On Fri, May 10, 2019 at 12:30 AM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>>
>>
>>
>> On 2019/5/9 下午10:49, David Sterba wrote:
>>> On Wed, May 08, 2019 at 06:49:58PM +0800, Qu Wenruo wrote:
>>>> [BUG]
>>>> The following script can cause unexpected fsync failure:
>>>>
>>>>   #!/bin/bash
>>>>
>>>>   dev=/dev/test/test
>>>>   mnt=/mnt/btrfs
>>>>
>>>>   mkfs.btrfs -f $dev -b 512M > /dev/null
>>>>   mount $dev $mnt -o nospace_cache
>>>>
>>>>   # Prealloc one extent
>>>>   xfs_io -f -c "falloc 8k 64m" $mnt/file1
>>>>   # Fill the remaining data space
>>>>   xfs_io -f -c "pwrite 0 -b 4k 512M" $mnt/padding
>>>>   sync
>>>>
>>>>   # Write into the prealloc extent
>>>>   xfs_io -c "pwrite 1m 16m" $mnt/file1
>>>>
>>>>   # Reflink then fsync, fsync would fail due to ENOSPC
>>>>   xfs_io -c "reflink $mnt/file1 8k 0 4k" -c "fsync" $mnt/file1
>>>>   umount $dev
>>>>
>>>> The fsync fails with ENOSPC, and the last page of the buffered write is
>>>> lost.
>>>>
>>>> [CAUSE]
>>>> This is caused by:
>>>> - Btrfs' back reference only has extent level granularity
>>>>   So write into shared extent must be CoWed even only part of the extent
>>>>   is shared.
>>>>
>>>> So for above script we have:
>>>> - fallocate
>>>>   Create a preallocated extent where we can do NOCOW write.
>>>>
>>>> - fill all the remaining data and unallocated space
>>>>
>>>> - buffered write into preallocated space
>>>>   As we have not enough space available for data and the extent is not
>>>>   shared (yet) we fall into NOCOW mode.
>>>>
>>>> - reflink
>>>>   Now part of the large preallocated extent is shared, later write
>>>>   into that extent must be CoWed.
>>>>
>>>> - fsync triggers writeback
>>>>   But now the extent is shared and therefore we must fallback into COW
>>>>   mode, which fails with ENOSPC since there's not enough space to
>>>>   allocate data extents.
>>>>
>>>> [WORKAROUND]
>>>> The workaround is to ensure any buffered write in the related extents
>>>> (not just the reflink source range) get flushed before reflink/dedupe,
>>>> so that NOCOW writes succeed that happened before reflinking succeed.
>>>>
>>>> The workaround is expensive
>>>
>>> Can you please quantify that, how big the performance drop is going to
>>> be?
>>
>> Depends on how many dirty pages there are at the timing of reflink/dedupe.
>>
>> If there are a lot, then it would be a delay for reflink/dedupe.
>>
>>>
>>> If the fsync comes soon after reflink, then it's effectively no change.
>>> In case the buffered writes happen on a different range than reflink and
>>> fsync comes later, the buffered writes will stall reflink, right?
>>
>> Fsync doesn't make much difference, it mostly depends on how many dirty
>> pages are.
>>
>> Thus the most impacted use case is concurrent buffered write with
>> reflink/dedupe.
>>
>>>
>>> If there are other similar corner cases we'd better know them in advance
>>> and estimate the impact, that'll be something to look for when we get
>>> complaints that reflink is suddenly slow.
>>>
>>>> NOCOW range, but that needs extra accounting for NOCOW range.
>>>> For now, fix the possible data loss first.
>>>
>>> filemap_flush says
>>>
>>>  437 /**
>>>  438  * filemap_flush - mostly a non-blocking flush
>>>  439  * @mapping:    target address_space
>>>  440  *
>>>  441  * This is a mostly non-blocking flush.  Not suitable for data-integrity
>>>  442  * purposes - I/O may not be started against all dirty pages.
>>>  443  *
>>>  444  * Return: %0 on success, negative error code otherwise.
>>>  445  */
>>>
>>> so how does this work together with the statement about preventing data
>>> loss?
>>
>> The data loss is caused by the fact that we can start buffered write
>> without reserving data space, but after reflink/dedupe we have to do CoW.
>> Without enough space, CoW will fail due to ENOSPC.
>>
>> The fix here is, we ensure all dirties pages start their writeback
>> (start btrfs_run_delalloc_range()) before reflink.
>>
>> At btrfs_run_delalloc_range() we determine whether a range goes through
>> NOCOW or COW, and submit ordered extent to do real write back/csum
>> calculation/etc.
>>
>> As long as the whole inode goes through btrfs_run_delalloc_range(), any
>> NOCOW write will go NOCOW on-disk.
>> We don't need to wait for the ordered extent to finish, just ensure all
>> pages goes through delalloc is enough.
>> Waiting for ordered extent will cause even more latency for reflink.
>>
>> Thus the filemap_flush() is enough, as the point is to ensure delalloc
>> is started before reflink.
> 
> I believe that David's comment is related to this part of the comment
> on filemap_flush():
> 
> "I/O may not be started against all dirty pages."
> 
> I.e., his concern being that writeback may not be started and
> therefore we end up with the data loss due to ENOSPC later, and not to
> the technical details of why the ENOSPC failure happens, which is
> already described in the changelog and discussed during the review of
> previous versions of the patch.
> 
> However btrfs has its own writepages() implementation, which even for
> WB_SYNC_NONE (used by filemap_flush) starts writeback (and doesn't
> wait for it to finish if it started before, which is fine for this use
> case).
> Anyway, just my interpretation of the doubt/comment.
> 
> Thanks.
> 
>>
>> Thanks,
>> Qu
>>
> 
>
David Sterba May 31, 2019, 12:26 p.m. UTC | #6
On Wed, May 15, 2019 at 03:56:31PM +0100, Filipe Manana wrote:
> On Fri, May 10, 2019 at 12:30 AM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
> > On 2019/5/9 下午10:49, David Sterba wrote:
> > > On Wed, May 08, 2019 at 06:49:58PM +0800, Qu Wenruo wrote:
> > >> [BUG]
> > >> The following script can cause unexpected fsync failure:
> > >>
> > >>   #!/bin/bash
> > >>
> > >>   dev=/dev/test/test
> > >>   mnt=/mnt/btrfs
> > >>
> > >>   mkfs.btrfs -f $dev -b 512M > /dev/null
> > >>   mount $dev $mnt -o nospace_cache
> > >>
> > >>   # Prealloc one extent
> > >>   xfs_io -f -c "falloc 8k 64m" $mnt/file1
> > >>   # Fill the remaining data space
> > >>   xfs_io -f -c "pwrite 0 -b 4k 512M" $mnt/padding
> > >>   sync
> > >>
> > >>   # Write into the prealloc extent
> > >>   xfs_io -c "pwrite 1m 16m" $mnt/file1
> > >>
> > >>   # Reflink then fsync, fsync would fail due to ENOSPC
> > >>   xfs_io -c "reflink $mnt/file1 8k 0 4k" -c "fsync" $mnt/file1
> > >>   umount $dev
> > >>
> > >> The fsync fails with ENOSPC, and the last page of the buffered write is
> > >> lost.
> > >>
> > >> [CAUSE]
> > >> This is caused by:
> > >> - Btrfs' back reference only has extent level granularity
> > >>   So write into shared extent must be CoWed even only part of the extent
> > >>   is shared.
> > >>
> > >> So for above script we have:
> > >> - fallocate
> > >>   Create a preallocated extent where we can do NOCOW write.
> > >>
> > >> - fill all the remaining data and unallocated space
> > >>
> > >> - buffered write into preallocated space
> > >>   As we have not enough space available for data and the extent is not
> > >>   shared (yet) we fall into NOCOW mode.
> > >>
> > >> - reflink
> > >>   Now part of the large preallocated extent is shared, later write
> > >>   into that extent must be CoWed.
> > >>
> > >> - fsync triggers writeback
> > >>   But now the extent is shared and therefore we must fallback into COW
> > >>   mode, which fails with ENOSPC since there's not enough space to
> > >>   allocate data extents.
> > >>
> > >> [WORKAROUND]
> > >> The workaround is to ensure any buffered write in the related extents
> > >> (not just the reflink source range) get flushed before reflink/dedupe,
> > >> so that NOCOW writes succeed that happened before reflinking succeed.
> > >>
> > >> The workaround is expensive
> > >
> > > Can you please quantify that, how big the performance drop is going to
> > > be?
> >
> > Depends on how many dirty pages there are at the timing of reflink/dedupe.
> >
> > If there are a lot, then it would be a delay for reflink/dedupe.
> >
> > >
> > > If the fsync comes soon after reflink, then it's effectively no change.
> > > In case the buffered writes happen on a different range than reflink and
> > > fsync comes later, the buffered writes will stall reflink, right?
> >
> > Fsync doesn't make much difference, it mostly depends on how many dirty
> > pages are.
> >
> > Thus the most impacted use case is concurrent buffered write with
> > reflink/dedupe.
> >
> > >
> > > If there are other similar corner cases we'd better know them in advance
> > > and estimate the impact, that'll be something to look for when we get
> > > complaints that reflink is suddenly slow.
> > >
> > >> NOCOW range, but that needs extra accounting for NOCOW range.
> > >> For now, fix the possible data loss first.
> > >
> > > filemap_flush says
> > >
> > >  437 /**
> > >  438  * filemap_flush - mostly a non-blocking flush
> > >  439  * @mapping:    target address_space
> > >  440  *
> > >  441  * This is a mostly non-blocking flush.  Not suitable for data-integrity
> > >  442  * purposes - I/O may not be started against all dirty pages.
> > >  443  *
> > >  444  * Return: %0 on success, negative error code otherwise.
> > >  445  */
> > >
> > > so how does this work together with the statement about preventing data
> > > loss?
> >
> > The data loss is caused by the fact that we can start buffered write
> > without reserving data space, but after reflink/dedupe we have to do CoW.
> > Without enough space, CoW will fail due to ENOSPC.
> >
> > The fix here is, we ensure all dirties pages start their writeback
> > (start btrfs_run_delalloc_range()) before reflink.
> >
> > At btrfs_run_delalloc_range() we determine whether a range goes through
> > NOCOW or COW, and submit ordered extent to do real write back/csum
> > calculation/etc.
> >
> > As long as the whole inode goes through btrfs_run_delalloc_range(), any
> > NOCOW write will go NOCOW on-disk.
> > We don't need to wait for the ordered extent to finish, just ensure all
> > pages goes through delalloc is enough.
> > Waiting for ordered extent will cause even more latency for reflink.
> >
> > Thus the filemap_flush() is enough, as the point is to ensure delalloc
> > is started before reflink.
> 
> I believe that David's comment is related to this part of the comment
> on filemap_flush():
> 
> "I/O may not be started against all dirty pages."
> 
> I.e., his concern being that writeback may not be started and
> therefore we end up with the data loss due to ENOSPC later, and not to
> the technical details of why the ENOSPC failure happens, which is
> already described in the changelog and discussed during the review of
> previous versions of the patch.
> 
> However btrfs has its own writepages() implementation, which even for
> WB_SYNC_NONE (used by filemap_flush) starts writeback (and doesn't
> wait for it to finish if it started before, which is fine for this use
> case).
> Anyway, just my interpretation of the doubt/comment.

That's correct and answers my questions, thanks.
diff mbox series

Patch

diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 6dafa857bbb9..0e35bef2ec59 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -4001,6 +4001,27 @@  static int btrfs_remap_file_range_prep(struct file *file_in, loff_t pos_in,
 	if (!same_inode)
 		inode_dio_wait(inode_out);
 
+	/*
+	 * Workaround to make sure NOCOW buffered write reach disk as NOCOW.
+	 *
+	 * Btrfs' back references do not have a block level granularity, they
+	 * work at the whole extent level.
+	 * NOCOW buffered write without data space reserved may not be able
+	 * to fall back to CoW due to lack of data space, thus could cause
+	 * data loss.
+	 *
+	 * Here we take a shortcut by flushing the whole inode, so that all
+	 * nocow write should reach disk as nocow before we increase the
+	 * reference of the extent. We could do better by only flushing NOCOW
+	 * data, but that needs extra accounting.
+	 *
+	 * Also we don't need to check ASYNC_EXTENT, as async extent will be
+	 * CoWed anyway, not affecting nocow part.
+	 */
+	ret = filemap_flush(inode_in->i_mapping);
+	if (ret < 0)
+		return ret;
+
 	ret = btrfs_wait_ordered_range(inode_in, ALIGN_DOWN(pos_in, bs),
 				       wb_len);
 	if (ret < 0)