[RFC] btrfs: reflink: Flush before reflink any extent to prevent NOCOW write falling back to CoW without data reservation
diff mbox series

Message ID 20190503010852.10342-1-wqu@suse.com
State New
Headers show
Series
  • [RFC] btrfs: reflink: Flush before reflink any extent to prevent NOCOW write falling back to CoW without data reservation
Related show

Commit Message

Qu Wenruo May 3, 2019, 1:08 a.m. UTC
[BUG]
The following command can lead to unexpected data COW:

  #!/bin/bash

  dev=/dev/test/test
  mnt=/mnt/btrfs

  mkfs.btrfs -f $dev -b 1G > /dev/null
  mount $dev $mnt -o nospace_cache

  xfs_io -f -c "falloc 8k 24k" -c "pwrite 12k 8k" $mnt/file1
  xfs_io -c "reflink $mnt/file1 8k 0 4k" $mnt/file1
  umount $dev

The result extent will be

	item 7 key (257 EXTENT_DATA 4096) itemoff 15760 itemsize 53
		generation 6 type 2 (prealloc)
		prealloc data disk byte 13631488 nr 28672
	item 8 key (257 EXTENT_DATA 12288) itemoff 15707 itemsize 53
		generation 6 type 1 (regular)
		extent data disk byte 13660160 nr 12288 <<< COW
	item 9 key (257 EXTENT_DATA 24576) itemoff 15654 itemsize 53
		generation 6 type 2 (prealloc)
		prealloc data disk byte 13631488 nr 28672

Currently we always reserve space even for NOCOW buffered write, thus
under most case it shouldn't cause anything wrong even we fall back to
COW.

However when we're out of data space, we fall back to skip data space if
we can do NOCOW write.

If such behavior happens under that case, we could hit the following
problems:
- data space bytes_may_use underflow
  This will cause kernel warning.

- ENOSPC at delalloc time
  This will lead to transaction abort and fs forced to RO.

[CAUSE]
This is due to the fact that btrfs can only do extent level share check.

Btrfs can only tell if an extent is shared, no matter if only part of the
extent is shared or not.

So for above script we have:
- fallocate
- buffered write
  If we don't have enough data space, we fall back to NOCOW check.
  At this timming, the extent is not shared, we can skip data
  reservation.
- reflink
  Now part of the large preallocated extent is shared.
- delalloc kicks in
  For the NOCOW range, as the preallocated extent is shared, we need
  to fall back to COW.

[WORKAROUND]
The workaround is to ensure any buffered write in the related extents
(not the reflink source range) get flushed before reflink.

However it's pretty expensive to do a comprehensive check.
In the reproducer, the reflink source is just a part of a larger
preallocated extent, we need to flush all buffered write of that extent
before reflink.
Such backward search can be complex and we may not get much benefit from
it.

So this patch will just try to flush the whole inode before reflink.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
Reason for RFC:
Flushing an inode just because it's a reflink source is definitely
overkilling, but I don't have any better way to handle it.

Any comment on this is welcomed.
---
 fs/btrfs/ioctl.c | 22 ++++++++++++++++++++++
 1 file changed, 22 insertions(+)

Comments

Filipe Manana May 3, 2019, 9:21 a.m. UTC | #1
On Fri, May 3, 2019 at 2:46 AM Qu Wenruo <wqu@suse.com> wrote:

What a great subject. The "reflink:" part is unnecessary, since the
rest of the subject already mentions it, that makes it a bit shorter.

>
> [BUG]
> The following command can lead to unexpected data COW:
>
>   #!/bin/bash
>
>   dev=/dev/test/test
>   mnt=/mnt/btrfs
>
>   mkfs.btrfs -f $dev -b 1G > /dev/null
>   mount $dev $mnt -o nospace_cache
>
>   xfs_io -f -c "falloc 8k 24k" -c "pwrite 12k 8k" $mnt/file1
>   xfs_io -c "reflink $mnt/file1 8k 0 4k" $mnt/file1
>   umount $dev
>
> The result extent will be
>
>         item 7 key (257 EXTENT_DATA 4096) itemoff 15760 itemsize 53
>                 generation 6 type 2 (prealloc)
>                 prealloc data disk byte 13631488 nr 28672
>         item 8 key (257 EXTENT_DATA 12288) itemoff 15707 itemsize 53
>                 generation 6 type 1 (regular)
>                 extent data disk byte 13660160 nr 12288 <<< COW
>         item 9 key (257 EXTENT_DATA 24576) itemoff 15654 itemsize 53
>                 generation 6 type 2 (prealloc)
>                 prealloc data disk byte 13631488 nr 28672
>
> Currently we always reserve space even for NOCOW buffered write, thus

I would add 'data' between 'reserve' and 'space', to be clear.

> under most case it shouldn't cause anything wrong even we fall back to
> COW.

even we ... -> even if we fallback to COW when running delalloc /
starting writeback.

>
> However when we're out of data space, we fall back to skip data space if
> we can do NOCOW write.

we fall back to skip data space ... -> we fallback to NOCOW write
without reserving data space.

>
> If such behavior happens under that case, we could hit the following
> problems:

> - data space bytes_may_use underflow
>   This will cause kernel warning.

Ok.

>
> - ENOSPC at delalloc time

at delalloc time - that is an ambiguous term you use through the change log.
You mean when running/starting delalloc, which happens when starting writeback,
but that could be confused with creating delalloc, which happens
during the buffered write path.

So I would always replace 'dealloc time' with 'when running delalloc'
(or when starting writeback).

>   This will lead to transaction abort and fs forced to RO.

Where does that happen exactly?
I don't recall starting transactions when running dealloc, and failed
to see where after a quick glance to cow_file_range()
and run_delalloc_nocow(). I'm assuming that 'at delalloc time' means
when starting writeback.

>
> [CAUSE]
> This is due to the fact that btrfs can only do extent level share check.
>
> Btrfs can only tell if an extent is shared, no matter if only part of the
> extent is shared or not.
>
> So for above script we have:
> - fallocate
> - buffered write
>   If we don't have enough data space, we fall back to NOCOW check.
>   At this timming, the extent is not shared, we can skip data
>   reservation.

But in the above example we don't fall to nocow mode when doing the
buffered write, as there's plenty of data space available (1Gb -
24Kb).
You need to update the example.


> - reflink
>   Now part of the large preallocated extent is shared.
> - delalloc kicks in

writeback kicks in

>   For the NOCOW range, as the preallocated extent is shared, we need
>   to fall back to COW.
>
> [WORKAROUND]
> The workaround is to ensure any buffered write in the related extents
> (not the reflink source range) get flushed before reflink.

not the reflink source range -> not just the reflink source range

>
> However it's pretty expensive to do a comprehensive check.
> In the reproducer, the reflink source is just a part of a larger

Again, the reproducer needs to be fixed (yes, I tested it even if it's
clear by looking at it that it doesn't trigger the nocow case).

> preallocated extent, we need to flush all buffered write of that extent
> before reflink.
> Such backward search can be complex and we may not get much benefit from
> it.
>
> So this patch will just try to flush the whole inode before reflink.


>
> Signed-off-by: Qu Wenruo <wqu@suse.com>
> ---
> Reason for RFC:
> Flushing an inode just because it's a reflink source is definitely
> overkilling, but I don't have any better way to handle it.
>
> Any comment on this is welcomed.
> ---
>  fs/btrfs/ioctl.c | 22 ++++++++++++++++++++++
>  1 file changed, 22 insertions(+)
>
> diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
> index 7755b503b348..8caa0edb6fbf 100644
> --- a/fs/btrfs/ioctl.c
> +++ b/fs/btrfs/ioctl.c
> @@ -3930,6 +3930,28 @@ static noinline int btrfs_clone_files(struct file *file, struct file *file_src,
>                         return ret;
>         }
>
> +       /*
> +        * Workaround to make sure NOCOW buffered write reach disk as NOCOW.
> +        *
> +        * Due to the limit of btrfs extent tree design, we can only have
> +        * extent level share view. Any part of an extent is shared then the

Any -> If any

> +        * whole extent is shared and any write into that extent needs to fall

is -> is considered

> +        * back to COW.

I would add, something like:

That is, btrfs' back references do not have a block level granularity,
they work at the whole extent level.

> +        *
> +        * NOCOW buffered write without data space reserved could to lead to
> +        * either data space bytes_may_use underflow (kernel warning) or ENOSPC
> +        * at delalloc time (transaction abort).

I would omit the warning and transaction abort parts, that can change
any time. And we have that information in the changelog, so it's not
lost.

> +        *
> +        * Here we take a shortcut by flush the whole inode. We could do better
> +        * by finding all extents in that range and flush the space referring
> +        * all those extents.
> +        * But that's too complex for such corner case.
> +        */
> +       filemap_flush(src->i_mapping);
> +       if (test_bit(BTRFS_INODE_HAS_ASYNC_EXTENT,
> +                    &BTRFS_I(src)->runtime_flags))
> +               filemap_flush(src->i_mapping);

So a few comments here:

- why just in the clone part? The dedupe side has the same problem, doesn't it?

- I would move such flushing to btrfs_remap_file_range_prep - this is
where we do the source and target range flush and wait.

Can you turn the reproducer into an fstests case?

Thanks.

> +
>         /*
>          * Lock destination range to serialize with concurrent readpages() and
>          * source range to serialize with relocation.
> --
> 2.21.0
>
Qu Wenruo May 3, 2019, 10:18 a.m. UTC | #2
On 2019/5/3 下午5:21, Filipe Manana wrote:
> On Fri, May 3, 2019 at 2:46 AM Qu Wenruo <wqu@suse.com> wrote:
> 
> What a great subject. The "reflink:" part is unnecessary, since the
> rest of the subject already mentions it, that makes it a bit shorter.
> 
>>
>> [BUG]
>> The following command can lead to unexpected data COW:
>>
>>   #!/bin/bash
>>
>>   dev=/dev/test/test
>>   mnt=/mnt/btrfs
>>
>>   mkfs.btrfs -f $dev -b 1G > /dev/null
>>   mount $dev $mnt -o nospace_cache
>>
>>   xfs_io -f -c "falloc 8k 24k" -c "pwrite 12k 8k" $mnt/file1
>>   xfs_io -c "reflink $mnt/file1 8k 0 4k" $mnt/file1
>>   umount $dev
>>
>> The result extent will be
>>
>>         item 7 key (257 EXTENT_DATA 4096) itemoff 15760 itemsize 53
>>                 generation 6 type 2 (prealloc)
>>                 prealloc data disk byte 13631488 nr 28672
>>         item 8 key (257 EXTENT_DATA 12288) itemoff 15707 itemsize 53
>>                 generation 6 type 1 (regular)
>>                 extent data disk byte 13660160 nr 12288 <<< COW
>>         item 9 key (257 EXTENT_DATA 24576) itemoff 15654 itemsize 53
>>                 generation 6 type 2 (prealloc)
>>                 prealloc data disk byte 13631488 nr 28672
>>
>> Currently we always reserve space even for NOCOW buffered write, thus
> 
> I would add 'data' between 'reserve' and 'space', to be clear.
> 
>> under most case it shouldn't cause anything wrong even we fall back to
>> COW.
> 
> even we ... -> even if we fallback to COW when running delalloc /
> starting writeback.
> 
>>
>> However when we're out of data space, we fall back to skip data space if
>> we can do NOCOW write.
> 
> we fall back to skip data space ... -> we fallback to NOCOW write
> without reserving data space.
> 
>>
>> If such behavior happens under that case, we could hit the following
>> problems:
> 
>> - data space bytes_may_use underflow
>>   This will cause kernel warning.
> 
> Ok.
> 
>>
>> - ENOSPC at delalloc time
> 
> at delalloc time - that is an ambiguous term you use through the change log.

In fact, I have a lot of uncertain terminology through kernel.

Things like flush get referred multiple times in different context (e.g.
filemap flush, flushoncommit, super block flush).

If we have a terminology list, we can't be more happy to follow.

> You mean when running/starting delalloc, which happens when starting writeback,
> but that could be confused with creating delalloc, which happens
> during the buffered write path.

Another confusion due to different terminology.

My understanding of the write path is:
buffered write -> delalloc (start delalloc) -> ordered extent -> finish
ordered io.

Thus I missed the delalloc creating part.

> 
> So I would always replace 'dealloc time' with 'when running delalloc'
> (or when starting writeback).

I will change use running delalloc, with extra comment reference to the
function name (btrfs_run_delalloc_range()).

> 
>>   This will lead to transaction abort and fs forced to RO.
> 
> Where does that happen exactly?
My bad, I got confused with metadata writeback path.

For data writeback, it should only cause sync related failure.

> I don't recall starting transactions when running dealloc, and failed
> to see where after a quick glance to cow_file_range()
> and run_delalloc_nocow(). I'm assuming that 'at delalloc time' means
> when starting writeback.
> 
>>
>> [CAUSE]
>> This is due to the fact that btrfs can only do extent level share check.
>>
>> Btrfs can only tell if an extent is shared, no matter if only part of the
>> extent is shared or not.
>>
>> So for above script we have:
>> - fallocate
>> - buffered write
>>   If we don't have enough data space, we fall back to NOCOW check.
>>   At this timming, the extent is not shared, we can skip data
>>   reservation.
> 
> But in the above example we don't fall to nocow mode when doing the
> buffered write, as there's plenty of data space available (1Gb -
> 24Kb).
> You need to update the example.
I have to admit that the core part is mostly based on the worst case
*assumption*.

I'll try to make the case convincing by making it fail directly.

> 
> 
>> - reflink
>>   Now part of the large preallocated extent is shared.
>> - delalloc kicks in
> 
> writeback kicks in
> 
>>   For the NOCOW range, as the preallocated extent is shared, we need
>>   to fall back to COW.
>>
>> [WORKAROUND]
>> The workaround is to ensure any buffered write in the related extents
>> (not the reflink source range) get flushed before reflink.
> 
> not the reflink source range -> not just the reflink source range
> 
>>
>> However it's pretty expensive to do a comprehensive check.
>> In the reproducer, the reflink source is just a part of a larger
> 
> Again, the reproducer needs to be fixed (yes, I tested it even if it's
> clear by looking at it that it doesn't trigger the nocow case).
> 
>> preallocated extent, we need to flush all buffered write of that extent
>> before reflink.
>> Such backward search can be complex and we may not get much benefit from
>> it.
>>
>> So this patch will just try to flush the whole inode before reflink.
> 
> 
>>
>> Signed-off-by: Qu Wenruo <wqu@suse.com>
>> ---
>> Reason for RFC:
>> Flushing an inode just because it's a reflink source is definitely
>> overkilling, but I don't have any better way to handle it.
>>
>> Any comment on this is welcomed.
>> ---
>>  fs/btrfs/ioctl.c | 22 ++++++++++++++++++++++
>>  1 file changed, 22 insertions(+)
>>
>> diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
>> index 7755b503b348..8caa0edb6fbf 100644
>> --- a/fs/btrfs/ioctl.c
>> +++ b/fs/btrfs/ioctl.c
>> @@ -3930,6 +3930,28 @@ static noinline int btrfs_clone_files(struct file *file, struct file *file_src,
>>                         return ret;
>>         }
>>
>> +       /*
>> +        * Workaround to make sure NOCOW buffered write reach disk as NOCOW.
>> +        *
>> +        * Due to the limit of btrfs extent tree design, we can only have
>> +        * extent level share view. Any part of an extent is shared then the
> 
> Any -> If any
> 
>> +        * whole extent is shared and any write into that extent needs to fall
> 
> is -> is considered
> 
>> +        * back to COW.
> 
> I would add, something like:
> 
> That is, btrfs' back references do not have a block level granularity,
> they work at the whole extent level.
> 
>> +        *
>> +        * NOCOW buffered write without data space reserved could to lead to
>> +        * either data space bytes_may_use underflow (kernel warning) or ENOSPC
>> +        * at delalloc time (transaction abort).
> 
> I would omit the warning and transaction abort parts, that can change
> any time. And we have that information in the changelog, so it's not
> lost.
> 
>> +        *
>> +        * Here we take a shortcut by flush the whole inode. We could do better
>> +        * by finding all extents in that range and flush the space referring
>> +        * all those extents.
>> +        * But that's too complex for such corner case.
>> +        */
>> +       filemap_flush(src->i_mapping);
>> +       if (test_bit(BTRFS_INODE_HAS_ASYNC_EXTENT,
>> +                    &BTRFS_I(src)->runtime_flags))
>> +               filemap_flush(src->i_mapping);
> 
> So a few comments here:
> 
> - why just in the clone part? The dedupe side has the same problem, doesn't it?

Right.

> 
> - I would move such flushing to btrfs_remap_file_range_prep - this is
> where we do the source and target range flush and wait.
> 
> Can you turn the reproducer into an fstests case?

Sure.

Thanks for the info and all the comment,
Qu

> 
> Thanks.
> 
>> +
>>         /*
>>          * Lock destination range to serialize with concurrent readpages() and
>>          * source range to serialize with relocation.
>> --
>> 2.21.0
>>
> 
>
Filipe Manana May 3, 2019, 10:45 a.m. UTC | #3
On Fri, May 3, 2019 at 11:18 AM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>
>
>
> On 2019/5/3 下午5:21, Filipe Manana wrote:
> > On Fri, May 3, 2019 at 2:46 AM Qu Wenruo <wqu@suse.com> wrote:
> >
> > What a great subject. The "reflink:" part is unnecessary, since the
> > rest of the subject already mentions it, that makes it a bit shorter.
> >
> >>
> >> [BUG]
> >> The following command can lead to unexpected data COW:
> >>
> >>   #!/bin/bash
> >>
> >>   dev=/dev/test/test
> >>   mnt=/mnt/btrfs
> >>
> >>   mkfs.btrfs -f $dev -b 1G > /dev/null
> >>   mount $dev $mnt -o nospace_cache
> >>
> >>   xfs_io -f -c "falloc 8k 24k" -c "pwrite 12k 8k" $mnt/file1
> >>   xfs_io -c "reflink $mnt/file1 8k 0 4k" $mnt/file1
> >>   umount $dev
> >>
> >> The result extent will be
> >>
> >>         item 7 key (257 EXTENT_DATA 4096) itemoff 15760 itemsize 53
> >>                 generation 6 type 2 (prealloc)
> >>                 prealloc data disk byte 13631488 nr 28672
> >>         item 8 key (257 EXTENT_DATA 12288) itemoff 15707 itemsize 53
> >>                 generation 6 type 1 (regular)
> >>                 extent data disk byte 13660160 nr 12288 <<< COW
> >>         item 9 key (257 EXTENT_DATA 24576) itemoff 15654 itemsize 53
> >>                 generation 6 type 2 (prealloc)
> >>                 prealloc data disk byte 13631488 nr 28672
> >>
> >> Currently we always reserve space even for NOCOW buffered write, thus
> >
> > I would add 'data' between 'reserve' and 'space', to be clear.
> >
> >> under most case it shouldn't cause anything wrong even we fall back to
> >> COW.
> >
> > even we ... -> even if we fallback to COW when running delalloc /
> > starting writeback.
> >
> >>
> >> However when we're out of data space, we fall back to skip data space if
> >> we can do NOCOW write.
> >
> > we fall back to skip data space ... -> we fallback to NOCOW write
> > without reserving data space.
> >
> >>
> >> If such behavior happens under that case, we could hit the following
> >> problems:
> >
> >> - data space bytes_may_use underflow
> >>   This will cause kernel warning.
> >
> > Ok.
> >
> >>
> >> - ENOSPC at delalloc time
> >
> > at delalloc time - that is an ambiguous term you use through the change log.
>
> In fact, I have a lot of uncertain terminology through kernel.
>
> Things like flush get referred multiple times in different context (e.g.
> filemap flush, flushoncommit, super block flush).
>
> If we have a terminology list, we can't be more happy to follow.

So, some is kernel wide while others are btrfs specific.

A buffered write creates dealloc - copies data to pages, marks the
pages as dirty and tags the range in the extent io tree as dellaloc.
Running delalloc, flushes writeback (starts IO for the dirty pages and
tags them as under writeback) and does other necessary things (like
reserving extents).
When writeback finishes, we add a task to a work queue to run
btrfs_finish_ordered_io - after that happens we say that the ordered
extent completed.

It can get ambiguous very often.

>
> > You mean when running/starting delalloc, which happens when starting writeback,
> > but that could be confused with creating delalloc, which happens
> > during the buffered write path.
>
> Another confusion due to different terminology.
>
> My understanding of the write path is:
> buffered write -> delalloc (start delalloc) -> ordered extent -> finish
> ordered io.
>
> Thus I missed the delalloc creating part.
>
> >
> > So I would always replace 'dealloc time' with 'when running delalloc'
> > (or when starting writeback).
>
> I will change use running delalloc, with extra comment reference to the
> function name (btrfs_run_delalloc_range()).
>
> >
> >>   This will lead to transaction abort and fs forced to RO.
> >
> > Where does that happen exactly?
> My bad, I got confused with metadata writeback path.
>
> For data writeback, it should only cause sync related failure.

Ok, so please remove the transaction abort comments for next iteration.
By sync related failure, you mean running dealloc fails with ENOSPC,
since when it tries to reserve a data extent it fails as it can't find
any free extent.

>
> > I don't recall starting transactions when running dealloc, and failed
> > to see where after a quick glance to cow_file_range()
> > and run_delalloc_nocow(). I'm assuming that 'at delalloc time' means
> > when starting writeback.
> >
> >>
> >> [CAUSE]
> >> This is due to the fact that btrfs can only do extent level share check.
> >>
> >> Btrfs can only tell if an extent is shared, no matter if only part of the
> >> extent is shared or not.
> >>
> >> So for above script we have:
> >> - fallocate
> >> - buffered write
> >>   If we don't have enough data space, we fall back to NOCOW check.
> >>   At this timming, the extent is not shared, we can skip data
> >>   reservation.
> >
> > But in the above example we don't fall to nocow mode when doing the
> > buffered write, as there's plenty of data space available (1Gb -
> > 24Kb).
> > You need to update the example.
> I have to admit that the core part is mostly based on the worst case
> *assumption*.
>
> I'll try to make the case convincing by making it fail directly.

Great, thanks.

>
> >
> >
> >> - reflink
> >>   Now part of the large preallocated extent is shared.
> >> - delalloc kicks in
> >
> > writeback kicks in
> >
> >>   For the NOCOW range, as the preallocated extent is shared, we need
> >>   to fall back to COW.
> >>
> >> [WORKAROUND]
> >> The workaround is to ensure any buffered write in the related extents
> >> (not the reflink source range) get flushed before reflink.
> >
> > not the reflink source range -> not just the reflink source range
> >
> >>
> >> However it's pretty expensive to do a comprehensive check.
> >> In the reproducer, the reflink source is just a part of a larger
> >
> > Again, the reproducer needs to be fixed (yes, I tested it even if it's
> > clear by looking at it that it doesn't trigger the nocow case).
> >
> >> preallocated extent, we need to flush all buffered write of that extent
> >> before reflink.
> >> Such backward search can be complex and we may not get much benefit from
> >> it.
> >>
> >> So this patch will just try to flush the whole inode before reflink.
> >
> >
> >>
> >> Signed-off-by: Qu Wenruo <wqu@suse.com>
> >> ---
> >> Reason for RFC:
> >> Flushing an inode just because it's a reflink source is definitely
> >> overkilling, but I don't have any better way to handle it.
> >>
> >> Any comment on this is welcomed.
> >> ---
> >>  fs/btrfs/ioctl.c | 22 ++++++++++++++++++++++
> >>  1 file changed, 22 insertions(+)
> >>
> >> diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
> >> index 7755b503b348..8caa0edb6fbf 100644
> >> --- a/fs/btrfs/ioctl.c
> >> +++ b/fs/btrfs/ioctl.c
> >> @@ -3930,6 +3930,28 @@ static noinline int btrfs_clone_files(struct file *file, struct file *file_src,
> >>                         return ret;
> >>         }
> >>
> >> +       /*
> >> +        * Workaround to make sure NOCOW buffered write reach disk as NOCOW.
> >> +        *
> >> +        * Due to the limit of btrfs extent tree design, we can only have
> >> +        * extent level share view. Any part of an extent is shared then the
> >
> > Any -> If any
> >
> >> +        * whole extent is shared and any write into that extent needs to fall
> >
> > is -> is considered
> >
> >> +        * back to COW.
> >
> > I would add, something like:
> >
> > That is, btrfs' back references do not have a block level granularity,
> > they work at the whole extent level.
> >
> >> +        *
> >> +        * NOCOW buffered write without data space reserved could to lead to
> >> +        * either data space bytes_may_use underflow (kernel warning) or ENOSPC
> >> +        * at delalloc time (transaction abort).
> >
> > I would omit the warning and transaction abort parts, that can change
> > any time. And we have that information in the changelog, so it's not
> > lost.
> >
> >> +        *
> >> +        * Here we take a shortcut by flush the whole inode. We could do better
> >> +        * by finding all extents in that range and flush the space referring
> >> +        * all those extents.
> >> +        * But that's too complex for such corner case.
> >> +        */
> >> +       filemap_flush(src->i_mapping);
> >> +       if (test_bit(BTRFS_INODE_HAS_ASYNC_EXTENT,
> >> +                    &BTRFS_I(src)->runtime_flags))
> >> +               filemap_flush(src->i_mapping);
> >
> > So a few comments here:
> >
> > - why just in the clone part? The dedupe side has the same problem, doesn't it?
>
> Right.
>
> >
> > - I would move such flushing to btrfs_remap_file_range_prep - this is
> > where we do the source and target range flush and wait.
> >
> > Can you turn the reproducer into an fstests case?
>
> Sure.
>
> Thanks for the info and all the comment,
> Qu
>
> >
> > Thanks.
> >
> >> +
> >>         /*
> >>          * Lock destination range to serialize with concurrent readpages() and
> >>          * source range to serialize with relocation.
> >> --
> >> 2.21.0
> >>
> >
> >
>
Zygo Blaxell May 3, 2019, 9:56 p.m. UTC | #4
On Fri, May 03, 2019 at 09:08:52AM +0800, Qu Wenruo wrote:
> [BUG]
> The following command can lead to unexpected data COW:
> 
>   #!/bin/bash
> 
>   dev=/dev/test/test
>   mnt=/mnt/btrfs
> 
>   mkfs.btrfs -f $dev -b 1G > /dev/null
>   mount $dev $mnt -o nospace_cache
> 
>   xfs_io -f -c "falloc 8k 24k" -c "pwrite 12k 8k" $mnt/file1
>   xfs_io -c "reflink $mnt/file1 8k 0 4k" $mnt/file1
>   umount $dev
> 
> The result extent will be
> 
> 	item 7 key (257 EXTENT_DATA 4096) itemoff 15760 itemsize 53
> 		generation 6 type 2 (prealloc)
> 		prealloc data disk byte 13631488 nr 28672
> 	item 8 key (257 EXTENT_DATA 12288) itemoff 15707 itemsize 53
> 		generation 6 type 1 (regular)
> 		extent data disk byte 13660160 nr 12288 <<< COW
> 	item 9 key (257 EXTENT_DATA 24576) itemoff 15654 itemsize 53
> 		generation 6 type 2 (prealloc)
> 		prealloc data disk byte 13631488 nr 28672
> 
> Currently we always reserve space even for NOCOW buffered write, thus
> under most case it shouldn't cause anything wrong even we fall back to
> COW.
> 
> However when we're out of data space, we fall back to skip data space if
> we can do NOCOW write.
> 
> If such behavior happens under that case, we could hit the following
> problems:
> - data space bytes_may_use underflow
>   This will cause kernel warning.
> 
> - ENOSPC at delalloc time
>   This will lead to transaction abort and fs forced to RO.
> 
> [CAUSE]
> This is due to the fact that btrfs can only do extent level share check.
> 
> Btrfs can only tell if an extent is shared, no matter if only part of the
> extent is shared or not.
> 
> So for above script we have:
> - fallocate
> - buffered write
>   If we don't have enough data space, we fall back to NOCOW check.
>   At this timming, the extent is not shared, we can skip data
>   reservation.
> - reflink
>   Now part of the large preallocated extent is shared.
> - delalloc kicks in
>   For the NOCOW range, as the preallocated extent is shared, we need
>   to fall back to COW.
> 
> [WORKAROUND]
> The workaround is to ensure any buffered write in the related extents
> (not the reflink source range) get flushed before reflink.
> 
> However it's pretty expensive to do a comprehensive check.
> In the reproducer, the reflink source is just a part of a larger
> preallocated extent, we need to flush all buffered write of that extent
> before reflink.
> Such backward search can be complex and we may not get much benefit from
> it.
> 
> So this patch will just try to flush the whole inode before reflink.

Does that mean that if a large file is being written and deduped
simultaneously, that the dedupes would now trigger flushes over the
entire file?  That seems like it could be slow.

e.g. if the file is a big VM image file and it is used src and for dedupe
(which is quite common in VM image files), we'd be hammering the disk
with writes similar to hitting it with fsync() in a tight loop?

> Signed-off-by: Qu Wenruo <wqu@suse.com>
> ---
> Reason for RFC:
> Flushing an inode just because it's a reflink source is definitely
> overkilling, but I don't have any better way to handle it.
> 
> Any comment on this is welcomed.
> ---
>  fs/btrfs/ioctl.c | 22 ++++++++++++++++++++++
>  1 file changed, 22 insertions(+)
> 
> diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
> index 7755b503b348..8caa0edb6fbf 100644
> --- a/fs/btrfs/ioctl.c
> +++ b/fs/btrfs/ioctl.c
> @@ -3930,6 +3930,28 @@ static noinline int btrfs_clone_files(struct file *file, struct file *file_src,
>  			return ret;
>  	}
>  
> +	/*
> +	 * Workaround to make sure NOCOW buffered write reach disk as NOCOW.
> +	 *
> +	 * Due to the limit of btrfs extent tree design, we can only have
> +	 * extent level share view. Any part of an extent is shared then the
> +	 * whole extent is shared and any write into that extent needs to fall
> +	 * back to COW.
> +	 *
> +	 * NOCOW buffered write without data space reserved could to lead to
> +	 * either data space bytes_may_use underflow (kernel warning) or ENOSPC
> +	 * at delalloc time (transaction abort).
> +	 *
> +	 * Here we take a shortcut by flush the whole inode. We could do better
> +	 * by finding all extents in that range and flush the space referring
> +	 * all those extents.
> +	 * But that's too complex for such corner case.
> +	 */
> +	filemap_flush(src->i_mapping);
> +	if (test_bit(BTRFS_INODE_HAS_ASYNC_EXTENT,
> +		     &BTRFS_I(src)->runtime_flags))
> +		filemap_flush(src->i_mapping);
> +
>  	/*
>  	 * Lock destination range to serialize with concurrent readpages() and
>  	 * source range to serialize with relocation.
> -- 
> 2.21.0
>
Qu Wenruo May 4, 2019, 12:32 a.m. UTC | #5
On 2019/5/4 上午5:56, Zygo Blaxell wrote:
> On Fri, May 03, 2019 at 09:08:52AM +0800, Qu Wenruo wrote:
>> [BUG]
>> The following command can lead to unexpected data COW:
>>
>>   #!/bin/bash
>>
>>   dev=/dev/test/test
>>   mnt=/mnt/btrfs
>>
>>   mkfs.btrfs -f $dev -b 1G > /dev/null
>>   mount $dev $mnt -o nospace_cache
>>
>>   xfs_io -f -c "falloc 8k 24k" -c "pwrite 12k 8k" $mnt/file1
>>   xfs_io -c "reflink $mnt/file1 8k 0 4k" $mnt/file1
>>   umount $dev
>>
>> The result extent will be
>>
>> 	item 7 key (257 EXTENT_DATA 4096) itemoff 15760 itemsize 53
>> 		generation 6 type 2 (prealloc)
>> 		prealloc data disk byte 13631488 nr 28672
>> 	item 8 key (257 EXTENT_DATA 12288) itemoff 15707 itemsize 53
>> 		generation 6 type 1 (regular)
>> 		extent data disk byte 13660160 nr 12288 <<< COW
>> 	item 9 key (257 EXTENT_DATA 24576) itemoff 15654 itemsize 53
>> 		generation 6 type 2 (prealloc)
>> 		prealloc data disk byte 13631488 nr 28672
>>
>> Currently we always reserve space even for NOCOW buffered write, thus
>> under most case it shouldn't cause anything wrong even we fall back to
>> COW.
>>
>> However when we're out of data space, we fall back to skip data space if
>> we can do NOCOW write.
>>
>> If such behavior happens under that case, we could hit the following
>> problems:
>> - data space bytes_may_use underflow
>>   This will cause kernel warning.
>>
>> - ENOSPC at delalloc time
>>   This will lead to transaction abort and fs forced to RO.
>>
>> [CAUSE]
>> This is due to the fact that btrfs can only do extent level share check.
>>
>> Btrfs can only tell if an extent is shared, no matter if only part of the
>> extent is shared or not.
>>
>> So for above script we have:
>> - fallocate
>> - buffered write
>>   If we don't have enough data space, we fall back to NOCOW check.
>>   At this timming, the extent is not shared, we can skip data
>>   reservation.
>> - reflink
>>   Now part of the large preallocated extent is shared.
>> - delalloc kicks in
>>   For the NOCOW range, as the preallocated extent is shared, we need
>>   to fall back to COW.
>>
>> [WORKAROUND]
>> The workaround is to ensure any buffered write in the related extents
>> (not the reflink source range) get flushed before reflink.
>>
>> However it's pretty expensive to do a comprehensive check.
>> In the reproducer, the reflink source is just a part of a larger
>> preallocated extent, we need to flush all buffered write of that extent
>> before reflink.
>> Such backward search can be complex and we may not get much benefit from
>> it.
>>
>> So this patch will just try to flush the whole inode before reflink.
> 
> Does that mean that if a large file is being written and deduped
> simultaneously, that the dedupes would now trigger flushes over the
> entire file?  That seems like it could be slow.

Yes, also my reason for RFC.

But it shouldn't be that heavy, as after the first dedupe/reflink, most
IO should be flushed back, later dedupe has much less work to do.


> 
> e.g. if the file is a big VM image file and it is used src and for dedupe
> (which is quite common in VM image files), we'd be hammering the disk
> with writes similar to hitting it with fsync() in a tight loop?

The original behavior also flush the target and source range, so we're
not completely creating some new overhead.

Thanks,
Qu

> 
>> Signed-off-by: Qu Wenruo <wqu@suse.com>
>> ---
>> Reason for RFC:
>> Flushing an inode just because it's a reflink source is definitely
>> overkilling, but I don't have any better way to handle it.
>>
>> Any comment on this is welcomed.
>> ---
>>  fs/btrfs/ioctl.c | 22 ++++++++++++++++++++++
>>  1 file changed, 22 insertions(+)
>>
>> diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
>> index 7755b503b348..8caa0edb6fbf 100644
>> --- a/fs/btrfs/ioctl.c
>> +++ b/fs/btrfs/ioctl.c
>> @@ -3930,6 +3930,28 @@ static noinline int btrfs_clone_files(struct file *file, struct file *file_src,
>>  			return ret;
>>  	}
>>  
>> +	/*
>> +	 * Workaround to make sure NOCOW buffered write reach disk as NOCOW.
>> +	 *
>> +	 * Due to the limit of btrfs extent tree design, we can only have
>> +	 * extent level share view. Any part of an extent is shared then the
>> +	 * whole extent is shared and any write into that extent needs to fall
>> +	 * back to COW.
>> +	 *
>> +	 * NOCOW buffered write without data space reserved could to lead to
>> +	 * either data space bytes_may_use underflow (kernel warning) or ENOSPC
>> +	 * at delalloc time (transaction abort).
>> +	 *
>> +	 * Here we take a shortcut by flush the whole inode. We could do better
>> +	 * by finding all extents in that range and flush the space referring
>> +	 * all those extents.
>> +	 * But that's too complex for such corner case.
>> +	 */
>> +	filemap_flush(src->i_mapping);
>> +	if (test_bit(BTRFS_INODE_HAS_ASYNC_EXTENT,
>> +		     &BTRFS_I(src)->runtime_flags))
>> +		filemap_flush(src->i_mapping);
>> +
>>  	/*
>>  	 * Lock destination range to serialize with concurrent readpages() and
>>  	 * source range to serialize with relocation.
>> -- 
>> 2.21.0
>>
Nikolay Borisov May 4, 2019, 8:29 a.m. UTC | #6
On 3.05.19 г. 13:45 ч., Filipe Manana wrote:
> On Fri, May 3, 2019 at 11:18 AM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>>
>>
>>
>> On 2019/5/3 下午5:21, Filipe Manana wrote:
>>> On Fri, May 3, 2019 at 2:46 AM Qu Wenruo <wqu@suse.com> wrote:
>>>
>>> What a great subject. The "reflink:" part is unnecessary, since the
>>> rest of the subject already mentions it, that makes it a bit shorter.
>>>
>>>>
>>>> [BUG]
>>>> The following command can lead to unexpected data COW:
>>>>
>>>>   #!/bin/bash
>>>>
>>>>   dev=/dev/test/test
>>>>   mnt=/mnt/btrfs
>>>>
>>>>   mkfs.btrfs -f $dev -b 1G > /dev/null
>>>>   mount $dev $mnt -o nospace_cache
>>>>
>>>>   xfs_io -f -c "falloc 8k 24k" -c "pwrite 12k 8k" $mnt/file1
>>>>   xfs_io -c "reflink $mnt/file1 8k 0 4k" $mnt/file1
>>>>   umount $dev
>>>>
>>>> The result extent will be
>>>>
>>>>         item 7 key (257 EXTENT_DATA 4096) itemoff 15760 itemsize 53
>>>>                 generation 6 type 2 (prealloc)
>>>>                 prealloc data disk byte 13631488 nr 28672
>>>>         item 8 key (257 EXTENT_DATA 12288) itemoff 15707 itemsize 53
>>>>                 generation 6 type 1 (regular)
>>>>                 extent data disk byte 13660160 nr 12288 <<< COW
>>>>         item 9 key (257 EXTENT_DATA 24576) itemoff 15654 itemsize 53
>>>>                 generation 6 type 2 (prealloc)
>>>>                 prealloc data disk byte 13631488 nr 28672
>>>>
>>>> Currently we always reserve space even for NOCOW buffered write, thus
>>>
>>> I would add 'data' between 'reserve' and 'space', to be clear.
>>>
>>>> under most case it shouldn't cause anything wrong even we fall back to
>>>> COW.
>>>
>>> even we ... -> even if we fallback to COW when running delalloc /
>>> starting writeback.
>>>
>>>>
>>>> However when we're out of data space, we fall back to skip data space if
>>>> we can do NOCOW write.
>>>
>>> we fall back to skip data space ... -> we fallback to NOCOW write
>>> without reserving data space.
>>>
>>>>
>>>> If such behavior happens under that case, we could hit the following
>>>> problems:
>>>
>>>> - data space bytes_may_use underflow
>>>>   This will cause kernel warning.
>>>
>>> Ok.
>>>
>>>>
>>>> - ENOSPC at delalloc time
>>>
>>> at delalloc time - that is an ambiguous term you use through the change log.
>>
>> In fact, I have a lot of uncertain terminology through kernel.
>>
>> Things like flush get referred multiple times in different context (e.g.
>> filemap flush, flushoncommit, super block flush).
>>
>> If we have a terminology list, we can't be more happy to follow.
> 
> So, some is kernel wide while others are btrfs specific.
> 
> A buffered write creates dealloc - copies data to pages, marks the
> pages as dirty and tags the range in the extent io tree as dellaloc.
> Running delalloc, flushes writeback (starts IO for the dirty pages and
> tags them as under writeback) and does other necessary things (like
> reserving extents).
> When writeback finishes, we add a task to a work queue to run
> btrfs_finish_ordered_io - after that happens we say that the ordered
> extent completed.
> 
> It can get ambiguous very often.


That's why I have created the following document which (tries) to
explain this:

https://github.com/btrfs/btrfs-dev-docs/blob/master/delalloc.txt

It's not perfect but it's better than nothing, feel free to contribute
improvements.

< snip>
Zygo Blaxell May 5, 2019, 3:07 p.m. UTC | #7
On Sat, May 04, 2019 at 08:32:11AM +0800, Qu Wenruo wrote:
> 
> 
> On 2019/5/4 上午5:56, Zygo Blaxell wrote:
> > On Fri, May 03, 2019 at 09:08:52AM +0800, Qu Wenruo wrote:
> >> [BUG]
> >> The following command can lead to unexpected data COW:
> >>
> >>   #!/bin/bash
> >>
> >>   dev=/dev/test/test
> >>   mnt=/mnt/btrfs
> >>
> >>   mkfs.btrfs -f $dev -b 1G > /dev/null
> >>   mount $dev $mnt -o nospace_cache
> >>
> >>   xfs_io -f -c "falloc 8k 24k" -c "pwrite 12k 8k" $mnt/file1
> >>   xfs_io -c "reflink $mnt/file1 8k 0 4k" $mnt/file1
> >>   umount $dev
> >>
> >> The result extent will be
> >>
> >> 	item 7 key (257 EXTENT_DATA 4096) itemoff 15760 itemsize 53
> >> 		generation 6 type 2 (prealloc)
> >> 		prealloc data disk byte 13631488 nr 28672
> >> 	item 8 key (257 EXTENT_DATA 12288) itemoff 15707 itemsize 53
> >> 		generation 6 type 1 (regular)
> >> 		extent data disk byte 13660160 nr 12288 <<< COW
> >> 	item 9 key (257 EXTENT_DATA 24576) itemoff 15654 itemsize 53
> >> 		generation 6 type 2 (prealloc)
> >> 		prealloc data disk byte 13631488 nr 28672
> >>
> >> Currently we always reserve space even for NOCOW buffered write, thus
> >> under most case it shouldn't cause anything wrong even we fall back to
> >> COW.
> >>
> >> However when we're out of data space, we fall back to skip data space if
> >> we can do NOCOW write.
> >>
> >> If such behavior happens under that case, we could hit the following
> >> problems:
> >> - data space bytes_may_use underflow
> >>   This will cause kernel warning.
> >>
> >> - ENOSPC at delalloc time
> >>   This will lead to transaction abort and fs forced to RO.
> >>
> >> [CAUSE]
> >> This is due to the fact that btrfs can only do extent level share check.
> >>
> >> Btrfs can only tell if an extent is shared, no matter if only part of the
> >> extent is shared or not.
> >>
> >> So for above script we have:
> >> - fallocate
> >> - buffered write
> >>   If we don't have enough data space, we fall back to NOCOW check.
> >>   At this timming, the extent is not shared, we can skip data
> >>   reservation.
> >> - reflink
> >>   Now part of the large preallocated extent is shared.
> >> - delalloc kicks in
> >>   For the NOCOW range, as the preallocated extent is shared, we need
> >>   to fall back to COW.
> >>
> >> [WORKAROUND]
> >> The workaround is to ensure any buffered write in the related extents
> >> (not the reflink source range) get flushed before reflink.
> >>
> >> However it's pretty expensive to do a comprehensive check.
> >> In the reproducer, the reflink source is just a part of a larger
> >> preallocated extent, we need to flush all buffered write of that extent
> >> before reflink.
> >> Such backward search can be complex and we may not get much benefit from
> >> it.
> >>
> >> So this patch will just try to flush the whole inode before reflink.
> > 
> > Does that mean that if a large file is being written and deduped
> > simultaneously, that the dedupes would now trigger flushes over the
> > entire file?  That seems like it could be slow.
> 
> Yes, also my reason for RFC.
> 
> But it shouldn't be that heavy, as after the first dedupe/reflink, most
> IO should be flushed back, later dedupe has much less work to do.

Sure, but if writes are continuously happening (e.g. writes at offset
10GB, dedupe at 1GB), these will get flushed out on the next dedupe.
I'm thinking of scenarious where both writes and dedupes are happening
continuously, e.g. a host with multiple VM images running out of raw
image files that are deduped on the host filesystem.

I'm not sure what all the conditions for this flush are.  From the list
above, it sounds like this only occurs _after_ the filesystem has found
there is no free space.  If that's true, then the extra overhead is
negligible since it rarely happens (assuming that having no free space
is a rare condition for filesystems).


> > e.g. if the file is a big VM image file and it is used src and for dedupe
> > (which is quite common in VM image files), we'd be hammering the disk
> > with writes similar to hitting it with fsync() in a tight loop?
> 
> The original behavior also flush the target and source range, so we're
> not completely creating some new overhead.
> 
> Thanks,
> Qu
> 
> > 
> >> Signed-off-by: Qu Wenruo <wqu@suse.com>
> >> ---
> >> Reason for RFC:
> >> Flushing an inode just because it's a reflink source is definitely
> >> overkilling, but I don't have any better way to handle it.
> >>
> >> Any comment on this is welcomed.
> >> ---
> >>  fs/btrfs/ioctl.c | 22 ++++++++++++++++++++++
> >>  1 file changed, 22 insertions(+)
> >>
> >> diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
> >> index 7755b503b348..8caa0edb6fbf 100644
> >> --- a/fs/btrfs/ioctl.c
> >> +++ b/fs/btrfs/ioctl.c
> >> @@ -3930,6 +3930,28 @@ static noinline int btrfs_clone_files(struct file *file, struct file *file_src,
> >>  			return ret;
> >>  	}
> >>  
> >> +	/*
> >> +	 * Workaround to make sure NOCOW buffered write reach disk as NOCOW.
> >> +	 *
> >> +	 * Due to the limit of btrfs extent tree design, we can only have
> >> +	 * extent level share view. Any part of an extent is shared then the
> >> +	 * whole extent is shared and any write into that extent needs to fall
> >> +	 * back to COW.
> >> +	 *
> >> +	 * NOCOW buffered write without data space reserved could to lead to
> >> +	 * either data space bytes_may_use underflow (kernel warning) or ENOSPC
> >> +	 * at delalloc time (transaction abort).
> >> +	 *
> >> +	 * Here we take a shortcut by flush the whole inode. We could do better
> >> +	 * by finding all extents in that range and flush the space referring
> >> +	 * all those extents.
> >> +	 * But that's too complex for such corner case.
> >> +	 */
> >> +	filemap_flush(src->i_mapping);
> >> +	if (test_bit(BTRFS_INODE_HAS_ASYNC_EXTENT,
> >> +		     &BTRFS_I(src)->runtime_flags))
> >> +		filemap_flush(src->i_mapping);
> >> +
> >>  	/*
> >>  	 * Lock destination range to serialize with concurrent readpages() and
> >>  	 * source range to serialize with relocation.
> >> -- 
> >> 2.21.0
> >>
>
Filipe Manana May 5, 2019, 4:24 p.m. UTC | #8
On Sun, May 5, 2019 at 4:33 PM Zygo Blaxell
<ce3g8jdj@umail.furryterror.org> wrote:
>
> On Sat, May 04, 2019 at 08:32:11AM +0800, Qu Wenruo wrote:
> >
> >
> > On 2019/5/4 上午5:56, Zygo Blaxell wrote:
> > > On Fri, May 03, 2019 at 09:08:52AM +0800, Qu Wenruo wrote:
> > >> [BUG]
> > >> The following command can lead to unexpected data COW:
> > >>
> > >>   #!/bin/bash
> > >>
> > >>   dev=/dev/test/test
> > >>   mnt=/mnt/btrfs
> > >>
> > >>   mkfs.btrfs -f $dev -b 1G > /dev/null
> > >>   mount $dev $mnt -o nospace_cache
> > >>
> > >>   xfs_io -f -c "falloc 8k 24k" -c "pwrite 12k 8k" $mnt/file1
> > >>   xfs_io -c "reflink $mnt/file1 8k 0 4k" $mnt/file1
> > >>   umount $dev
> > >>
> > >> The result extent will be
> > >>
> > >>    item 7 key (257 EXTENT_DATA 4096) itemoff 15760 itemsize 53
> > >>            generation 6 type 2 (prealloc)
> > >>            prealloc data disk byte 13631488 nr 28672
> > >>    item 8 key (257 EXTENT_DATA 12288) itemoff 15707 itemsize 53
> > >>            generation 6 type 1 (regular)
> > >>            extent data disk byte 13660160 nr 12288 <<< COW
> > >>    item 9 key (257 EXTENT_DATA 24576) itemoff 15654 itemsize 53
> > >>            generation 6 type 2 (prealloc)
> > >>            prealloc data disk byte 13631488 nr 28672
> > >>
> > >> Currently we always reserve space even for NOCOW buffered write, thus
> > >> under most case it shouldn't cause anything wrong even we fall back to
> > >> COW.
> > >>
> > >> However when we're out of data space, we fall back to skip data space if
> > >> we can do NOCOW write.
> > >>
> > >> If such behavior happens under that case, we could hit the following
> > >> problems:
> > >> - data space bytes_may_use underflow
> > >>   This will cause kernel warning.
> > >>
> > >> - ENOSPC at delalloc time
> > >>   This will lead to transaction abort and fs forced to RO.
> > >>
> > >> [CAUSE]
> > >> This is due to the fact that btrfs can only do extent level share check.
> > >>
> > >> Btrfs can only tell if an extent is shared, no matter if only part of the
> > >> extent is shared or not.
> > >>
> > >> So for above script we have:
> > >> - fallocate
> > >> - buffered write
> > >>   If we don't have enough data space, we fall back to NOCOW check.
> > >>   At this timming, the extent is not shared, we can skip data
> > >>   reservation.
> > >> - reflink
> > >>   Now part of the large preallocated extent is shared.
> > >> - delalloc kicks in
> > >>   For the NOCOW range, as the preallocated extent is shared, we need
> > >>   to fall back to COW.
> > >>
> > >> [WORKAROUND]
> > >> The workaround is to ensure any buffered write in the related extents
> > >> (not the reflink source range) get flushed before reflink.
> > >>
> > >> However it's pretty expensive to do a comprehensive check.
> > >> In the reproducer, the reflink source is just a part of a larger
> > >> preallocated extent, we need to flush all buffered write of that extent
> > >> before reflink.
> > >> Such backward search can be complex and we may not get much benefit from
> > >> it.
> > >>
> > >> So this patch will just try to flush the whole inode before reflink.
> > >
> > > Does that mean that if a large file is being written and deduped
> > > simultaneously, that the dedupes would now trigger flushes over the
> > > entire file?  That seems like it could be slow.
> >
> > Yes, also my reason for RFC.
> >
> > But it shouldn't be that heavy, as after the first dedupe/reflink, most
> > IO should be flushed back, later dedupe has much less work to do.
>
> Sure, but if writes are continuously happening (e.g. writes at offset
> 10GB, dedupe at 1GB), these will get flushed out on the next dedupe.
> I'm thinking of scenarious where both writes and dedupes are happening
> continuously, e.g. a host with multiple VM images running out of raw
> image files that are deduped on the host filesystem.
>
> I'm not sure what all the conditions for this flush are.  From the list
> above, it sounds like this only occurs _after_ the filesystem has found
> there is no free space.  If that's true, then the extra overhead is
> negligible since it rarely happens (assuming that having no free space
> is a rare condition for filesystems).

The problem is not that flush is done only when low on available space.
The flush would always happen on the entire source file before
reflinking, so that buffered writes that happened before the
clone/dedupe operation and "entered" nodatacow mode (because at the
time there was not enough available data space) will not fail when
their writeback starts - which would happen after the reflinking -
that's why the entire range is flushed.

Even if btrfs' reference counts are tracked per extent and not per
block, here we could maybe do something like check each reference,
extract the inode number, root number and offset. Then use that to
find the respective file extent items, and using those extract their
length and determine exactly which parts (blocks) of an extent are
shared. That would be a lot of work to do, and would always be racy
for checks for inodes that are not the inode we have locked for the
reflink operation. Very impractical.

So it's one more big inconvenience from the extent based back
references, other then the already known space wasting inconvenience
(even if only 1 block of an extent is really referenced, the rest of
the extent is unavailable for allocation, considered used space).



>
>
> > > e.g. if the file is a big VM image file and it is used src and for dedupe
> > > (which is quite common in VM image files), we'd be hammering the disk
> > > with writes similar to hitting it with fsync() in a tight loop?
> >
> > The original behavior also flush the target and source range, so we're
> > not completely creating some new overhead.
> >
> > Thanks,
> > Qu
> >
> > >
> > >> Signed-off-by: Qu Wenruo <wqu@suse.com>
> > >> ---
> > >> Reason for RFC:
> > >> Flushing an inode just because it's a reflink source is definitely
> > >> overkilling, but I don't have any better way to handle it.
> > >>
> > >> Any comment on this is welcomed.
> > >> ---
> > >>  fs/btrfs/ioctl.c | 22 ++++++++++++++++++++++
> > >>  1 file changed, 22 insertions(+)
> > >>
> > >> diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
> > >> index 7755b503b348..8caa0edb6fbf 100644
> > >> --- a/fs/btrfs/ioctl.c
> > >> +++ b/fs/btrfs/ioctl.c
> > >> @@ -3930,6 +3930,28 @@ static noinline int btrfs_clone_files(struct file *file, struct file *file_src,
> > >>                    return ret;
> > >>    }
> > >>
> > >> +  /*
> > >> +   * Workaround to make sure NOCOW buffered write reach disk as NOCOW.
> > >> +   *
> > >> +   * Due to the limit of btrfs extent tree design, we can only have
> > >> +   * extent level share view. Any part of an extent is shared then the
> > >> +   * whole extent is shared and any write into that extent needs to fall
> > >> +   * back to COW.
> > >> +   *
> > >> +   * NOCOW buffered write without data space reserved could to lead to
> > >> +   * either data space bytes_may_use underflow (kernel warning) or ENOSPC
> > >> +   * at delalloc time (transaction abort).
> > >> +   *
> > >> +   * Here we take a shortcut by flush the whole inode. We could do better
> > >> +   * by finding all extents in that range and flush the space referring
> > >> +   * all those extents.
> > >> +   * But that's too complex for such corner case.
> > >> +   */
> > >> +  filemap_flush(src->i_mapping);
> > >> +  if (test_bit(BTRFS_INODE_HAS_ASYNC_EXTENT,
> > >> +               &BTRFS_I(src)->runtime_flags))
> > >> +          filemap_flush(src->i_mapping);
> > >> +
> > >>    /*
> > >>     * Lock destination range to serialize with concurrent readpages() and
> > >>     * source range to serialize with relocation.
> > >> --
> > >> 2.21.0
> > >>
> >
>
>
>
Qu Wenruo May 6, 2019, 12:06 a.m. UTC | #9
On 2019/5/6 上午12:24, Filipe Manana wrote:
[snip]
>>>
>>> Yes, also my reason for RFC.
>>>
>>> But it shouldn't be that heavy, as after the first dedupe/reflink, most
>>> IO should be flushed back, later dedupe has much less work to do.
>>
>> Sure, but if writes are continuously happening (e.g. writes at offset
>> 10GB, dedupe at 1GB), these will get flushed out on the next dedupe.
>> I'm thinking of scenarious where both writes and dedupes are happening
>> continuously, e.g. a host with multiple VM images running out of raw
>> image files that are deduped on the host filesystem.
>>
>> I'm not sure what all the conditions for this flush are.  From the list
>> above, it sounds like this only occurs _after_ the filesystem has found
>> there is no free space.  If that's true, then the extra overhead is
>> negligible since it rarely happens (assuming that having no free space
>> is a rare condition for filesystems).
> 
> The problem is not that flush is done only when low on available space.
> The flush would always happen on the entire source file before
> reflinking, so that buffered writes that happened before the
> clone/dedupe operation and "entered" nodatacow mode (because at the
> time there was not enough available data space) will not fail when
> their writeback starts - which would happen after the reflinking -
> that's why the entire range is flushed.
> 
> Even if btrfs' reference counts are tracked per extent and not per
> block, here we could maybe do something like check each reference,
> extract the inode number, root number and offset. Then use that to
> find the respective file extent items, and using those extract their
> length and determine exactly which parts (blocks) of an extent are
> shared. That would be a lot of work to do, and would always be racy
> for checks for inodes that are not the inode we have locked for the
> reflink operation. Very impractical.

To add my idea on better backref (block level), it's more impractical
than I thought.

From extent double/triple split, to how to handle old extents in old
snapshot, it's way too expensive from the developer's respect.

> 
> So it's one more big inconvenience from the extent based back
> references, other then the already known space wasting inconvenience
> (even if only 1 block of an extent is really referenced, the rest of
> the extent is unavailable for allocation, considered used space).

Currently this patch is only to be a workaround.

There is an idea of introducing new extent io tree bit, NODATACOW for
this case. Buffered write with NODATACOW will set the bit, and for the
specific problem described here, we only need to flush the range with
NODATACOW bit set.

And that bit can also be used to detect unexpected COW with no data
space reserved.

But that need extra work/testing (especially with my special sauce of
doing NODATACOW at buffered write time).
At least we have some idea on how to reduce the overhead.

Thanks,
Qu

> 
> 
> 
>>
>>
>>>> e.g. if the file is a big VM image file and it is used src and for dedupe
>>>> (which is quite common in VM image files), we'd be hammering the disk
>>>> with writes similar to hitting it with fsync() in a tight loop?
>>>
>>> The original behavior also flush the target and source range, so we're
>>> not completely creating some new overhead.
>>>
>>> Thanks,
>>> Qu
>>>
>>>>
>>>>> Signed-off-by: Qu Wenruo <wqu@suse.com>
>>>>> ---
>>>>> Reason for RFC:
>>>>> Flushing an inode just because it's a reflink source is definitely
>>>>> overkilling, but I don't have any better way to handle it.
>>>>>
>>>>> Any comment on this is welcomed.
>>>>> ---
>>>>>  fs/btrfs/ioctl.c | 22 ++++++++++++++++++++++
>>>>>  1 file changed, 22 insertions(+)
>>>>>
>>>>> diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
>>>>> index 7755b503b348..8caa0edb6fbf 100644
>>>>> --- a/fs/btrfs/ioctl.c
>>>>> +++ b/fs/btrfs/ioctl.c
>>>>> @@ -3930,6 +3930,28 @@ static noinline int btrfs_clone_files(struct file *file, struct file *file_src,
>>>>>                    return ret;
>>>>>    }
>>>>>
>>>>> +  /*
>>>>> +   * Workaround to make sure NOCOW buffered write reach disk as NOCOW.
>>>>> +   *
>>>>> +   * Due to the limit of btrfs extent tree design, we can only have
>>>>> +   * extent level share view. Any part of an extent is shared then the
>>>>> +   * whole extent is shared and any write into that extent needs to fall
>>>>> +   * back to COW.
>>>>> +   *
>>>>> +   * NOCOW buffered write without data space reserved could to lead to
>>>>> +   * either data space bytes_may_use underflow (kernel warning) or ENOSPC
>>>>> +   * at delalloc time (transaction abort).
>>>>> +   *
>>>>> +   * Here we take a shortcut by flush the whole inode. We could do better
>>>>> +   * by finding all extents in that range and flush the space referring
>>>>> +   * all those extents.
>>>>> +   * But that's too complex for such corner case.
>>>>> +   */
>>>>> +  filemap_flush(src->i_mapping);
>>>>> +  if (test_bit(BTRFS_INODE_HAS_ASYNC_EXTENT,
>>>>> +               &BTRFS_I(src)->runtime_flags))
>>>>> +          filemap_flush(src->i_mapping);
>>>>> +
>>>>>    /*
>>>>>     * Lock destination range to serialize with concurrent readpages() and
>>>>>     * source range to serialize with relocation.
>>>>> --
>>>>> 2.21.0
>>>>>
>>>
>>
>>
>>
> 
>
Qu Wenruo May 6, 2019, 2:04 a.m. UTC | #10
[snip]
>>
>> For data writeback, it should only cause sync related failure.
> 
> Ok, so please remove the transaction abort comments for next iteration.
> By sync related failure, you mean running dealloc fails with ENOSPC,
> since when it tries to reserve a data extent it fails as it can't find
> any free extent.

Well, btrfs has some hidden way to fix such problem by itself already.

At buffered write time, we have the following call chain:
btrfs_buffered_write()
|- btrfs_check_data_free_space()
   |- btrfs_alloc_data_chunk_ondemand()
      |- need_commit = 2; /* We have 2 chance to retry, 1 to flush */
      |- do_chunk_alloc() /* we have no data space available */
      |- if (ret < 0 && ret == -ENOSPC)
      |-     goto commit_trans;  /* try to free some space */
      |- commit_trans:
      |-     need_commit--;
      |-     if (need_commit > 0) {
      |-         btrfs_start_delalloc_roots();
      |-         btrfs_wait_ordered_roots();
      |-     }

This means, as long as we hit ENOSPC for data space, we will start write
back, thus NODATACOW data will still hit disk as NODATACOW.

Such hidden behavior along with always-reserve-data-space solves the
problem pretty well.
We either:
- reserve data space
  Then no matter how it ends, we're OK, although it may end as CoW.

- Failed to reserve data space
  Writeback will be triggered anyway, no way to screw things around.

Thus this workaround has nothing to fix, but only make certain NODATACOW
reach disk as NODATACOW.

It makes some NODATACOW behaves more correctly but won't fix any obvious
bug.

My personal take is to fix any strange behavior even it won't cause any
problem, but the full inode writeback can be performance heavy.

So my question is, do we really need this anyway?

Thanks,
Qu

> 
>>
>>> I don't recall starting transactions when running dealloc, and failed
>>> to see where after a quick glance to cow_file_range()
>>> and run_delalloc_nocow(). I'm assuming that 'at delalloc time' means
>>> when starting writeback.
>>>
>>>>
>>>> [CAUSE]
>>>> This is due to the fact that btrfs can only do extent level share check.
>>>>
>>>> Btrfs can only tell if an extent is shared, no matter if only part of the
>>>> extent is shared or not.
>>>>
>>>> So for above script we have:
>>>> - fallocate
>>>> - buffered write
>>>>   If we don't have enough data space, we fall back to NOCOW check.
>>>>   At this timming, the extent is not shared, we can skip data
>>>>   reservation.
>>>
>>> But in the above example we don't fall to nocow mode when doing the
>>> buffered write, as there's plenty of data space available (1Gb -
>>> 24Kb).
>>> You need to update the example.
>> I have to admit that the core part is mostly based on the worst case
>> *assumption*.
>>
>> I'll try to make the case convincing by making it fail directly.
> 
> Great, thanks.
> 
>>
>>>
>>>
>>>> - reflink
>>>>   Now part of the large preallocated extent is shared.
>>>> - delalloc kicks in
>>>
>>> writeback kicks in
>>>
>>>>   For the NOCOW range, as the preallocated extent is shared, we need
>>>>   to fall back to COW.
>>>>
>>>> [WORKAROUND]
>>>> The workaround is to ensure any buffered write in the related extents
>>>> (not the reflink source range) get flushed before reflink.
>>>
>>> not the reflink source range -> not just the reflink source range
>>>
>>>>
>>>> However it's pretty expensive to do a comprehensive check.
>>>> In the reproducer, the reflink source is just a part of a larger
>>>
>>> Again, the reproducer needs to be fixed (yes, I tested it even if it's
>>> clear by looking at it that it doesn't trigger the nocow case).
>>>
>>>> preallocated extent, we need to flush all buffered write of that extent
>>>> before reflink.
>>>> Such backward search can be complex and we may not get much benefit from
>>>> it.
>>>>
>>>> So this patch will just try to flush the whole inode before reflink.
>>>
>>>
>>>>
>>>> Signed-off-by: Qu Wenruo <wqu@suse.com>
>>>> ---
>>>> Reason for RFC:
>>>> Flushing an inode just because it's a reflink source is definitely
>>>> overkilling, but I don't have any better way to handle it.
>>>>
>>>> Any comment on this is welcomed.
>>>> ---
>>>>  fs/btrfs/ioctl.c | 22 ++++++++++++++++++++++
>>>>  1 file changed, 22 insertions(+)
>>>>
>>>> diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
>>>> index 7755b503b348..8caa0edb6fbf 100644
>>>> --- a/fs/btrfs/ioctl.c
>>>> +++ b/fs/btrfs/ioctl.c
>>>> @@ -3930,6 +3930,28 @@ static noinline int btrfs_clone_files(struct file *file, struct file *file_src,
>>>>                         return ret;
>>>>         }
>>>>
>>>> +       /*
>>>> +        * Workaround to make sure NOCOW buffered write reach disk as NOCOW.
>>>> +        *
>>>> +        * Due to the limit of btrfs extent tree design, we can only have
>>>> +        * extent level share view. Any part of an extent is shared then the
>>>
>>> Any -> If any
>>>
>>>> +        * whole extent is shared and any write into that extent needs to fall
>>>
>>> is -> is considered
>>>
>>>> +        * back to COW.
>>>
>>> I would add, something like:
>>>
>>> That is, btrfs' back references do not have a block level granularity,
>>> they work at the whole extent level.
>>>
>>>> +        *
>>>> +        * NOCOW buffered write without data space reserved could to lead to
>>>> +        * either data space bytes_may_use underflow (kernel warning) or ENOSPC
>>>> +        * at delalloc time (transaction abort).
>>>
>>> I would omit the warning and transaction abort parts, that can change
>>> any time. And we have that information in the changelog, so it's not
>>> lost.
>>>
>>>> +        *
>>>> +        * Here we take a shortcut by flush the whole inode. We could do better
>>>> +        * by finding all extents in that range and flush the space referring
>>>> +        * all those extents.
>>>> +        * But that's too complex for such corner case.
>>>> +        */
>>>> +       filemap_flush(src->i_mapping);
>>>> +       if (test_bit(BTRFS_INODE_HAS_ASYNC_EXTENT,
>>>> +                    &BTRFS_I(src)->runtime_flags))
>>>> +               filemap_flush(src->i_mapping);
>>>
>>> So a few comments here:
>>>
>>> - why just in the clone part? The dedupe side has the same problem, doesn't it?
>>
>> Right.
>>
>>>
>>> - I would move such flushing to btrfs_remap_file_range_prep - this is
>>> where we do the source and target range flush and wait.
>>>
>>> Can you turn the reproducer into an fstests case?
>>
>> Sure.
>>
>> Thanks for the info and all the comment,
>> Qu
>>
>>>
>>> Thanks.
>>>
>>>> +
>>>>         /*
>>>>          * Lock destination range to serialize with concurrent readpages() and
>>>>          * source range to serialize with relocation.
>>>> --
>>>> 2.21.0
>>>>
>>>
>>>
>>
> 
>
Nikolay Borisov May 7, 2019, 7:49 a.m. UTC | #11
On 6.05.19 г. 5:04 ч., Qu Wenruo wrote:
> [snip]
>>>
>>> For data writeback, it should only cause sync related failure.
>>
>> Ok, so please remove the transaction abort comments for next iteration.
>> By sync related failure, you mean running dealloc fails with ENOSPC,
>> since when it tries to reserve a data extent it fails as it can't find
>> any free extent.
> 
> Well, btrfs has some hidden way to fix such problem by itself already.
> 
> At buffered write time, we have the following call chain:
> btrfs_buffered_write()
> |- btrfs_check_data_free_space()
>    |- btrfs_alloc_data_chunk_ondemand()
>       |- need_commit = 2; /* We have 2 chance to retry, 1 to flush */
>       |- do_chunk_alloc() /* we have no data space available */
>       |- if (ret < 0 && ret == -ENOSPC)
>       |-     goto commit_trans;  /* try to free some space */
>       |- commit_trans:
>       |-     need_commit--;
>       |-     if (need_commit > 0) {
>       |-         btrfs_start_delalloc_roots();
>       |-         btrfs_wait_ordered_roots();
>       |-     }
> 
> This means, as long as we hit ENOSPC for data space, we will start write
> back, thus NODATACOW data will still hit disk as NODATACOW.

I'm lost for words at expressing how subtle and despicable that code is
... Is there a way to factor that out and make it more explicit ? I
don't think we should rely on such subtleties...

> 
> Such hidden behavior along with always-reserve-data-space solves the
> problem pretty well.
> We either:
> - reserve data space
>   Then no matter how it ends, we're OK, although it may end as CoW.
> 
> - Failed to reserve data space
>   Writeback will be triggered anyway, no way to screw things around.
> 
> Thus this workaround has nothing to fix, but only make certain NODATACOW
> reach disk as NODATACOW.
> 
> It makes some NODATACOW behaves more correctly but won't fix any obvious
> bug.
> 
> My personal take is to fix any strange behavior even it won't cause any
> problem, but the full inode writeback can be performance heavy.
> 
> So my question is, do we really need this anyway?
> 
> Thanks,
> Qu
> 
>>
>>>
>>>> I don't recall starting transactions when running dealloc, and failed
>>>> to see where after a quick glance to cow_file_range()
>>>> and run_delalloc_nocow(). I'm assuming that 'at delalloc time' means
>>>> when starting writeback.
>>>>
>>>>>
>>>>> [CAUSE]
>>>>> This is due to the fact that btrfs can only do extent level share check.
>>>>>
>>>>> Btrfs can only tell if an extent is shared, no matter if only part of the
>>>>> extent is shared or not.
>>>>>
>>>>> So for above script we have:
>>>>> - fallocate
>>>>> - buffered write
>>>>>   If we don't have enough data space, we fall back to NOCOW check.
>>>>>   At this timming, the extent is not shared, we can skip data
>>>>>   reservation.
>>>>
>>>> But in the above example we don't fall to nocow mode when doing the
>>>> buffered write, as there's plenty of data space available (1Gb -
>>>> 24Kb).
>>>> You need to update the example.
>>> I have to admit that the core part is mostly based on the worst case
>>> *assumption*.
>>>
>>> I'll try to make the case convincing by making it fail directly.
>>
>> Great, thanks.
>>
>>>
>>>>
>>>>
>>>>> - reflink
>>>>>   Now part of the large preallocated extent is shared.
>>>>> - delalloc kicks in
>>>>
>>>> writeback kicks in
>>>>
>>>>>   For the NOCOW range, as the preallocated extent is shared, we need
>>>>>   to fall back to COW.
>>>>>
>>>>> [WORKAROUND]
>>>>> The workaround is to ensure any buffered write in the related extents
>>>>> (not the reflink source range) get flushed before reflink.
>>>>
>>>> not the reflink source range -> not just the reflink source range
>>>>
>>>>>
>>>>> However it's pretty expensive to do a comprehensive check.
>>>>> In the reproducer, the reflink source is just a part of a larger
>>>>
>>>> Again, the reproducer needs to be fixed (yes, I tested it even if it's
>>>> clear by looking at it that it doesn't trigger the nocow case).
>>>>
>>>>> preallocated extent, we need to flush all buffered write of that extent
>>>>> before reflink.
>>>>> Such backward search can be complex and we may not get much benefit from
>>>>> it.
>>>>>
>>>>> So this patch will just try to flush the whole inode before reflink.
>>>>
>>>>
>>>>>
>>>>> Signed-off-by: Qu Wenruo <wqu@suse.com>
>>>>> ---
>>>>> Reason for RFC:
>>>>> Flushing an inode just because it's a reflink source is definitely
>>>>> overkilling, but I don't have any better way to handle it.
>>>>>
>>>>> Any comment on this is welcomed.
>>>>> ---
>>>>>  fs/btrfs/ioctl.c | 22 ++++++++++++++++++++++
>>>>>  1 file changed, 22 insertions(+)
>>>>>
>>>>> diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
>>>>> index 7755b503b348..8caa0edb6fbf 100644
>>>>> --- a/fs/btrfs/ioctl.c
>>>>> +++ b/fs/btrfs/ioctl.c
>>>>> @@ -3930,6 +3930,28 @@ static noinline int btrfs_clone_files(struct file *file, struct file *file_src,
>>>>>                         return ret;
>>>>>         }
>>>>>
>>>>> +       /*
>>>>> +        * Workaround to make sure NOCOW buffered write reach disk as NOCOW.
>>>>> +        *
>>>>> +        * Due to the limit of btrfs extent tree design, we can only have
>>>>> +        * extent level share view. Any part of an extent is shared then the
>>>>
>>>> Any -> If any
>>>>
>>>>> +        * whole extent is shared and any write into that extent needs to fall
>>>>
>>>> is -> is considered
>>>>
>>>>> +        * back to COW.
>>>>
>>>> I would add, something like:
>>>>
>>>> That is, btrfs' back references do not have a block level granularity,
>>>> they work at the whole extent level.
>>>>
>>>>> +        *
>>>>> +        * NOCOW buffered write without data space reserved could to lead to
>>>>> +        * either data space bytes_may_use underflow (kernel warning) or ENOSPC
>>>>> +        * at delalloc time (transaction abort).
>>>>
>>>> I would omit the warning and transaction abort parts, that can change
>>>> any time. And we have that information in the changelog, so it's not
>>>> lost.
>>>>
>>>>> +        *
>>>>> +        * Here we take a shortcut by flush the whole inode. We could do better
>>>>> +        * by finding all extents in that range and flush the space referring
>>>>> +        * all those extents.
>>>>> +        * But that's too complex for such corner case.
>>>>> +        */
>>>>> +       filemap_flush(src->i_mapping);
>>>>> +       if (test_bit(BTRFS_INODE_HAS_ASYNC_EXTENT,
>>>>> +                    &BTRFS_I(src)->runtime_flags))
>>>>> +               filemap_flush(src->i_mapping);
>>>>
>>>> So a few comments here:
>>>>
>>>> - why just in the clone part? The dedupe side has the same problem, doesn't it?
>>>
>>> Right.
>>>
>>>>
>>>> - I would move such flushing to btrfs_remap_file_range_prep - this is
>>>> where we do the source and target range flush and wait.
>>>>
>>>> Can you turn the reproducer into an fstests case?
>>>
>>> Sure.
>>>
>>> Thanks for the info and all the comment,
>>> Qu
>>>
>>>>
>>>> Thanks.
>>>>
>>>>> +
>>>>>         /*
>>>>>          * Lock destination range to serialize with concurrent readpages() and
>>>>>          * source range to serialize with relocation.
>>>>> --
>>>>> 2.21.0
>>>>>
>>>>
>>>>
>>>
>>
>>
>
Filipe Manana May 7, 2019, 8:56 a.m. UTC | #12
On Mon, May 6, 2019 at 3:04 AM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>
> [snip]
> >>
> >> For data writeback, it should only cause sync related failure.
> >
> > Ok, so please remove the transaction abort comments for next iteration.
> > By sync related failure, you mean running dealloc fails with ENOSPC,
> > since when it tries to reserve a data extent it fails as it can't find
> > any free extent.
>
> Well, btrfs has some hidden way to fix such problem by itself already.
>
> At buffered write time, we have the following call chain:
> btrfs_buffered_write()
> |- btrfs_check_data_free_space()
>    |- btrfs_alloc_data_chunk_ondemand()
>       |- need_commit = 2; /* We have 2 chance to retry, 1 to flush */
>       |- do_chunk_alloc() /* we have no data space available */
>       |- if (ret < 0 && ret == -ENOSPC)
>       |-     goto commit_trans;  /* try to free some space */
>       |- commit_trans:
>       |-     need_commit--;
>       |-     if (need_commit > 0) {
>       |-         btrfs_start_delalloc_roots();
>       |-         btrfs_wait_ordered_roots();
>       |-     }
>
> This means, as long as we hit ENOSPC for data space, we will start write
> back, thus NODATACOW data will still hit disk as NODATACOW.
>
> Such hidden behavior along with always-reserve-data-space solves the
> problem pretty well.

It doesn't solve the problem you reported in the rfc patch.

What happens:

1) We have a file with a prealloc extent, that isn't shared

2) We have 0 bytes of available data space (or any amount less then
that of the buffered write size)

3) A buffered write happens that falls within a subrange of the prealloc extent.
    We can't reserve space, we do all those things at
btrfs_alloc_data_chunk_ondemand(), but we can't get any data space
released, since it's all allocated.
    Therefore we fall back to nodatacow mode. We dirty the pages, mark
the range as dealloc, etc.

4) The reflink happens, for a subrange of the prealloc extent that
does not overlap the range of the buffered write.

5) Some time after the reflink, writeback starts for the inode.
    During the writeback we fallback to COW mode, because the prealloc
extent is shared, even if the subrange of the buffered write does not
overlap the reflinked subrange.
    Now the write silently fails with -ENOSPC, and a user doesn't know
about it unless it does an fsync after that writeback, which will
report the error via filemap_check_wb_err().

> We either:
> - reserve data space
>   Then no matter how it ends, we're OK, although it may end as CoW.
>
> - Failed to reserve data space
>   Writeback will be triggered anyway, no way to screw things around.
>
> Thus this workaround has nothing to fix, but only make certain NODATACOW
> reach disk as NODATACOW.
>
> It makes some NODATACOW behaves more correctly but won't fix any obvious
> bug.
>
> My personal take is to fix any strange behavior even it won't cause any
> problem, but the full inode writeback can be performance heavy.
>
> So my question is, do we really need this anyway?

Do we need what? Your patch, that logic at
btrfs_alloc_data_chunk_ondemand(), something else?

Thanks.

>
> Thanks,
> Qu
>
> >
> >>
> >>> I don't recall starting transactions when running dealloc, and failed
> >>> to see where after a quick glance to cow_file_range()
> >>> and run_delalloc_nocow(). I'm assuming that 'at delalloc time' means
> >>> when starting writeback.
> >>>
> >>>>
> >>>> [CAUSE]
> >>>> This is due to the fact that btrfs can only do extent level share check.
> >>>>
> >>>> Btrfs can only tell if an extent is shared, no matter if only part of the
> >>>> extent is shared or not.
> >>>>
> >>>> So for above script we have:
> >>>> - fallocate
> >>>> - buffered write
> >>>>   If we don't have enough data space, we fall back to NOCOW check.
> >>>>   At this timming, the extent is not shared, we can skip data
> >>>>   reservation.
> >>>
> >>> But in the above example we don't fall to nocow mode when doing the
> >>> buffered write, as there's plenty of data space available (1Gb -
> >>> 24Kb).
> >>> You need to update the example.
> >> I have to admit that the core part is mostly based on the worst case
> >> *assumption*.
> >>
> >> I'll try to make the case convincing by making it fail directly.
> >
> > Great, thanks.
> >
> >>
> >>>
> >>>
> >>>> - reflink
> >>>>   Now part of the large preallocated extent is shared.
> >>>> - delalloc kicks in
> >>>
> >>> writeback kicks in
> >>>
> >>>>   For the NOCOW range, as the preallocated extent is shared, we need
> >>>>   to fall back to COW.
> >>>>
> >>>> [WORKAROUND]
> >>>> The workaround is to ensure any buffered write in the related extents
> >>>> (not the reflink source range) get flushed before reflink.
> >>>
> >>> not the reflink source range -> not just the reflink source range
> >>>
> >>>>
> >>>> However it's pretty expensive to do a comprehensive check.
> >>>> In the reproducer, the reflink source is just a part of a larger
> >>>
> >>> Again, the reproducer needs to be fixed (yes, I tested it even if it's
> >>> clear by looking at it that it doesn't trigger the nocow case).
> >>>
> >>>> preallocated extent, we need to flush all buffered write of that extent
> >>>> before reflink.
> >>>> Such backward search can be complex and we may not get much benefit from
> >>>> it.
> >>>>
> >>>> So this patch will just try to flush the whole inode before reflink.
> >>>
> >>>
> >>>>
> >>>> Signed-off-by: Qu Wenruo <wqu@suse.com>
> >>>> ---
> >>>> Reason for RFC:
> >>>> Flushing an inode just because it's a reflink source is definitely
> >>>> overkilling, but I don't have any better way to handle it.
> >>>>
> >>>> Any comment on this is welcomed.
> >>>> ---
> >>>>  fs/btrfs/ioctl.c | 22 ++++++++++++++++++++++
> >>>>  1 file changed, 22 insertions(+)
> >>>>
> >>>> diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
> >>>> index 7755b503b348..8caa0edb6fbf 100644
> >>>> --- a/fs/btrfs/ioctl.c
> >>>> +++ b/fs/btrfs/ioctl.c
> >>>> @@ -3930,6 +3930,28 @@ static noinline int btrfs_clone_files(struct file *file, struct file *file_src,
> >>>>                         return ret;
> >>>>         }
> >>>>
> >>>> +       /*
> >>>> +        * Workaround to make sure NOCOW buffered write reach disk as NOCOW.
> >>>> +        *
> >>>> +        * Due to the limit of btrfs extent tree design, we can only have
> >>>> +        * extent level share view. Any part of an extent is shared then the
> >>>
> >>> Any -> If any
> >>>
> >>>> +        * whole extent is shared and any write into that extent needs to fall
> >>>
> >>> is -> is considered
> >>>
> >>>> +        * back to COW.
> >>>
> >>> I would add, something like:
> >>>
> >>> That is, btrfs' back references do not have a block level granularity,
> >>> they work at the whole extent level.
> >>>
> >>>> +        *
> >>>> +        * NOCOW buffered write without data space reserved could to lead to
> >>>> +        * either data space bytes_may_use underflow (kernel warning) or ENOSPC
> >>>> +        * at delalloc time (transaction abort).
> >>>
> >>> I would omit the warning and transaction abort parts, that can change
> >>> any time. And we have that information in the changelog, so it's not
> >>> lost.
> >>>
> >>>> +        *
> >>>> +        * Here we take a shortcut by flush the whole inode. We could do better
> >>>> +        * by finding all extents in that range and flush the space referring
> >>>> +        * all those extents.
> >>>> +        * But that's too complex for such corner case.
> >>>> +        */
> >>>> +       filemap_flush(src->i_mapping);
> >>>> +       if (test_bit(BTRFS_INODE_HAS_ASYNC_EXTENT,
> >>>> +                    &BTRFS_I(src)->runtime_flags))
> >>>> +               filemap_flush(src->i_mapping);
> >>>
> >>> So a few comments here:
> >>>
> >>> - why just in the clone part? The dedupe side has the same problem, doesn't it?
> >>
> >> Right.
> >>
> >>>
> >>> - I would move such flushing to btrfs_remap_file_range_prep - this is
> >>> where we do the source and target range flush and wait.
> >>>
> >>> Can you turn the reproducer into an fstests case?
> >>
> >> Sure.
> >>
> >> Thanks for the info and all the comment,
> >> Qu
> >>
> >>>
> >>> Thanks.
> >>>
> >>>> +
> >>>>         /*
> >>>>          * Lock destination range to serialize with concurrent readpages() and
> >>>>          * source range to serialize with relocation.
> >>>> --
> >>>> 2.21.0
> >>>>
> >>>
> >>>
> >>
> >
> >
>
Qu Wenruo May 7, 2019, 11:13 a.m. UTC | #13
On 2019/5/7 下午4:56, Filipe Manana wrote:
> On Mon, May 6, 2019 at 3:04 AM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>>
>> [snip]
>>>>
>>>> For data writeback, it should only cause sync related failure.
>>>
>>> Ok, so please remove the transaction abort comments for next iteration.
>>> By sync related failure, you mean running dealloc fails with ENOSPC,
>>> since when it tries to reserve a data extent it fails as it can't find
>>> any free extent.
>>
>> Well, btrfs has some hidden way to fix such problem by itself already.
>>
>> At buffered write time, we have the following call chain:
>> btrfs_buffered_write()
>> |- btrfs_check_data_free_space()
>>    |- btrfs_alloc_data_chunk_ondemand()
>>       |- need_commit = 2; /* We have 2 chance to retry, 1 to flush */
>>       |- do_chunk_alloc() /* we have no data space available */
>>       |- if (ret < 0 && ret == -ENOSPC)
>>       |-     goto commit_trans;  /* try to free some space */
>>       |- commit_trans:
>>       |-     need_commit--;
>>       |-     if (need_commit > 0) {
>>       |-         btrfs_start_delalloc_roots();
>>       |-         btrfs_wait_ordered_roots();
>>       |-     }
>>
>> This means, as long as we hit ENOSPC for data space, we will start write
>> back, thus NODATACOW data will still hit disk as NODATACOW.
>>
>> Such hidden behavior along with always-reserve-data-space solves the
>> problem pretty well.
> 
> It doesn't solve the problem you reported in the rfc patch.

You're right, it doesn't solve the problem at all.
In fact, another bug caused my test script to pass even with some dirty
pages unable to be flushed back.

But it at least make sure all other pages reach disk as NODATACOW except
the last page.

> 
> What happens:
> 
> 1) We have a file with a prealloc extent, that isn't shared
> 
> 2) We have 0 bytes of available data space (or any amount less then
> that of the buffered write size)
> 
> 3) A buffered write happens that falls within a subrange of the prealloc extent.
>     We can't reserve space, we do all those things at
> btrfs_alloc_data_chunk_ondemand(), but we can't get any data space
> released, since it's all allocated.

At that time, we're already flushing all previously buffered write data.

E.g. if we're writing into one 1M preallocated extent.
The first 4K, we have no data space reserved, dirtied the page, prepare
all delalloc.

Then the 2nd 4K, we call btrfs_check_data_free_space(), as we're at low
data free space already, we flush all inodes, including the previous 4K
we just dirtied.
Then the first 4K get written to disk NODATACOW, as expected.

This loop happens until we reach the last page.

>     Therefore we fall back to nodatacow mode. We dirty the pages, mark
> the range as dealloc, etc.
> 
> 4) The reflink happens, for a subrange of the prealloc extent that
> does not overlap the range of the buffered write.

Just before the reflink, we only have 1 dirty page (the last page of
that buffered write) doesn't reach disk yet.

For the final page, we have no choice but do COW, and it fails with -ENOSPC.

However due to some other problem, the -ENOSPC doesn't reach user space
at all.


> 
> 5) Some time after the reflink, writeback starts for the inode.
>     During the writeback we fallback to COW mode, because the prealloc
> extent is shared, even if the subrange of the buffered write does not
> overlap the reflinked subrange.
>     Now the write silently fails with -ENOSPC, and a user doesn't know
> about it unless it does an fsync after that writeback, which will
> report the error via filemap_check_wb_err().
> 
>> We either:
>> - reserve data space
>>   Then no matter how it ends, we're OK, although it may end as CoW.
>>
>> - Failed to reserve data space
>>   Writeback will be triggered anyway, no way to screw things around.
>>
>> Thus this workaround has nothing to fix, but only make certain NODATACOW
>> reach disk as NODATACOW.
>>
>> It makes some NODATACOW behaves more correctly but won't fix any obvious
>> bug.
>>
>> My personal take is to fix any strange behavior even it won't cause any
>> problem, but the full inode writeback can be performance heavy.
>>
>> So my question is, do we really need this anyway?
> 
> Do we need what? Your patch, that logic at
> btrfs_alloc_data_chunk_ondemand(), something else?

I meant the patch, but the deeper I dig into the problem, more problem I
found.

The patch is still needed, but there is a more important bug, that
btrfs_run_delalloc_range() failure won't be reported in sync.

The script here I'm using is:
------
#!/bin/bash

dev=/dev/test/test
mnt=/mnt/btrfs

#mkfs.btrfs -f $dev -b 1G > /dev/null
#mount $dev $mnt -o nospace_cache

umount $mnt &> /dev/null
umount $dev &> /dev/null

dmesg -C
mkfs.btrfs -f $dev -b 512M > /dev/null

mount $dev $mnt -o nospace_cache

xfs_io -f -c "falloc 8k 64m" $mnt/file1
xfs_io -f -c "pwrite 0 -b 4k 370M" $mnt/padding

sync
btrfs fi df $mnt --raw

xfs_io -c "pwrite 1m 16m" $mnt/file1
echo "nodatacow write finished" >> /dev/kmsg
xfs_io -c "reflink $mnt/file1 8k 0 4k" $mnt/file1
echo "reflink finished" >> /dev/kmsg
sync
echo "sync finished ret=$?" >> /dev/kmsg
umount $dev
------

As describe, the last write at 17821696 (17M - 4K) will fail due to ENOSPC.
But the sync succeeded without reporting any problem.

Thanks,
Qu

> 
> Thanks.
> 
>>
>> Thanks,
>> Qu
>>
>>>
>>>>
>>>>> I don't recall starting transactions when running dealloc, and failed
>>>>> to see where after a quick glance to cow_file_range()
>>>>> and run_delalloc_nocow(). I'm assuming that 'at delalloc time' means
>>>>> when starting writeback.
>>>>>
>>>>>>
>>>>>> [CAUSE]
>>>>>> This is due to the fact that btrfs can only do extent level share check.
>>>>>>
>>>>>> Btrfs can only tell if an extent is shared, no matter if only part of the
>>>>>> extent is shared or not.
>>>>>>
>>>>>> So for above script we have:
>>>>>> - fallocate
>>>>>> - buffered write
>>>>>>   If we don't have enough data space, we fall back to NOCOW check.
>>>>>>   At this timming, the extent is not shared, we can skip data
>>>>>>   reservation.
>>>>>
>>>>> But in the above example we don't fall to nocow mode when doing the
>>>>> buffered write, as there's plenty of data space available (1Gb -
>>>>> 24Kb).
>>>>> You need to update the example.
>>>> I have to admit that the core part is mostly based on the worst case
>>>> *assumption*.
>>>>
>>>> I'll try to make the case convincing by making it fail directly.
>>>
>>> Great, thanks.
>>>
>>>>
>>>>>
>>>>>
>>>>>> - reflink
>>>>>>   Now part of the large preallocated extent is shared.
>>>>>> - delalloc kicks in
>>>>>
>>>>> writeback kicks in
>>>>>
>>>>>>   For the NOCOW range, as the preallocated extent is shared, we need
>>>>>>   to fall back to COW.
>>>>>>
>>>>>> [WORKAROUND]
>>>>>> The workaround is to ensure any buffered write in the related extents
>>>>>> (not the reflink source range) get flushed before reflink.
>>>>>
>>>>> not the reflink source range -> not just the reflink source range
>>>>>
>>>>>>
>>>>>> However it's pretty expensive to do a comprehensive check.
>>>>>> In the reproducer, the reflink source is just a part of a larger
>>>>>
>>>>> Again, the reproducer needs to be fixed (yes, I tested it even if it's
>>>>> clear by looking at it that it doesn't trigger the nocow case).
>>>>>
>>>>>> preallocated extent, we need to flush all buffered write of that extent
>>>>>> before reflink.
>>>>>> Such backward search can be complex and we may not get much benefit from
>>>>>> it.
>>>>>>
>>>>>> So this patch will just try to flush the whole inode before reflink.
>>>>>
>>>>>
>>>>>>
>>>>>> Signed-off-by: Qu Wenruo <wqu@suse.com>
>>>>>> ---
>>>>>> Reason for RFC:
>>>>>> Flushing an inode just because it's a reflink source is definitely
>>>>>> overkilling, but I don't have any better way to handle it.
>>>>>>
>>>>>> Any comment on this is welcomed.
>>>>>> ---
>>>>>>  fs/btrfs/ioctl.c | 22 ++++++++++++++++++++++
>>>>>>  1 file changed, 22 insertions(+)
>>>>>>
>>>>>> diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
>>>>>> index 7755b503b348..8caa0edb6fbf 100644
>>>>>> --- a/fs/btrfs/ioctl.c
>>>>>> +++ b/fs/btrfs/ioctl.c
>>>>>> @@ -3930,6 +3930,28 @@ static noinline int btrfs_clone_files(struct file *file, struct file *file_src,
>>>>>>                         return ret;
>>>>>>         }
>>>>>>
>>>>>> +       /*
>>>>>> +        * Workaround to make sure NOCOW buffered write reach disk as NOCOW.
>>>>>> +        *
>>>>>> +        * Due to the limit of btrfs extent tree design, we can only have
>>>>>> +        * extent level share view. Any part of an extent is shared then the
>>>>>
>>>>> Any -> If any
>>>>>
>>>>>> +        * whole extent is shared and any write into that extent needs to fall
>>>>>
>>>>> is -> is considered
>>>>>
>>>>>> +        * back to COW.
>>>>>
>>>>> I would add, something like:
>>>>>
>>>>> That is, btrfs' back references do not have a block level granularity,
>>>>> they work at the whole extent level.
>>>>>
>>>>>> +        *
>>>>>> +        * NOCOW buffered write without data space reserved could to lead to
>>>>>> +        * either data space bytes_may_use underflow (kernel warning) or ENOSPC
>>>>>> +        * at delalloc time (transaction abort).
>>>>>
>>>>> I would omit the warning and transaction abort parts, that can change
>>>>> any time. And we have that information in the changelog, so it's not
>>>>> lost.
>>>>>
>>>>>> +        *
>>>>>> +        * Here we take a shortcut by flush the whole inode. We could do better
>>>>>> +        * by finding all extents in that range and flush the space referring
>>>>>> +        * all those extents.
>>>>>> +        * But that's too complex for such corner case.
>>>>>> +        */
>>>>>> +       filemap_flush(src->i_mapping);
>>>>>> +       if (test_bit(BTRFS_INODE_HAS_ASYNC_EXTENT,
>>>>>> +                    &BTRFS_I(src)->runtime_flags))
>>>>>> +               filemap_flush(src->i_mapping);
>>>>>
>>>>> So a few comments here:
>>>>>
>>>>> - why just in the clone part? The dedupe side has the same problem, doesn't it?
>>>>
>>>> Right.
>>>>
>>>>>
>>>>> - I would move such flushing to btrfs_remap_file_range_prep - this is
>>>>> where we do the source and target range flush and wait.
>>>>>
>>>>> Can you turn the reproducer into an fstests case?
>>>>
>>>> Sure.
>>>>
>>>> Thanks for the info and all the comment,
>>>> Qu
>>>>
>>>>>
>>>>> Thanks.
>>>>>
>>>>>> +
>>>>>>         /*
>>>>>>          * Lock destination range to serialize with concurrent readpages() and
>>>>>>          * source range to serialize with relocation.
>>>>>> --
>>>>>> 2.21.0
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>>
>>
> 
>
Filipe Manana May 7, 2019, 11:36 a.m. UTC | #14
On Tue, May 7, 2019 at 12:13 PM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>
>
>
> On 2019/5/7 下午4:56, Filipe Manana wrote:
> > On Mon, May 6, 2019 at 3:04 AM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
> >>
> >> [snip]
> >>>>
> >>>> For data writeback, it should only cause sync related failure.
> >>>
> >>> Ok, so please remove the transaction abort comments for next iteration.
> >>> By sync related failure, you mean running dealloc fails with ENOSPC,
> >>> since when it tries to reserve a data extent it fails as it can't find
> >>> any free extent.
> >>
> >> Well, btrfs has some hidden way to fix such problem by itself already.
> >>
> >> At buffered write time, we have the following call chain:
> >> btrfs_buffered_write()
> >> |- btrfs_check_data_free_space()
> >>    |- btrfs_alloc_data_chunk_ondemand()
> >>       |- need_commit = 2; /* We have 2 chance to retry, 1 to flush */
> >>       |- do_chunk_alloc() /* we have no data space available */
> >>       |- if (ret < 0 && ret == -ENOSPC)
> >>       |-     goto commit_trans;  /* try to free some space */
> >>       |- commit_trans:
> >>       |-     need_commit--;
> >>       |-     if (need_commit > 0) {
> >>       |-         btrfs_start_delalloc_roots();
> >>       |-         btrfs_wait_ordered_roots();
> >>       |-     }
> >>
> >> This means, as long as we hit ENOSPC for data space, we will start write
> >> back, thus NODATACOW data will still hit disk as NODATACOW.
> >>
> >> Such hidden behavior along with always-reserve-data-space solves the
> >> problem pretty well.
> >
> > It doesn't solve the problem you reported in the rfc patch.
>
> You're right, it doesn't solve the problem at all.
> In fact, another bug caused my test script to pass even with some dirty
> pages unable to be flushed back.
>
> But it at least make sure all other pages reach disk as NODATACOW except
> the last page.
>
> >
> > What happens:
> >
> > 1) We have a file with a prealloc extent, that isn't shared
> >
> > 2) We have 0 bytes of available data space (or any amount less then
> > that of the buffered write size)
> >
> > 3) A buffered write happens that falls within a subrange of the prealloc extent.
> >     We can't reserve space, we do all those things at
> > btrfs_alloc_data_chunk_ondemand(), but we can't get any data space
> > released, since it's all allocated.
>
> At that time, we're already flushing all previously buffered write data.
>
> E.g. if we're writing into one 1M preallocated extent.
> The first 4K, we have no data space reserved, dirtied the page, prepare
> all delalloc.
>
> Then the 2nd 4K, we call btrfs_check_data_free_space(), as we're at low
> data free space already, we flush all inodes, including the previous 4K
> we just dirtied.
> Then the first 4K get written to disk NODATACOW, as expected.
>
> This loop happens until we reach the last page.

We fragment a buffered write into several chunks depending on its
size, the size of a page, the size of a pointer, etc.
We will hardly iterate for each page alone.
If your point is that btrfs_check_data_free_space() can be called
multiple times for a single buffered write, yes, that's true if it's
large than a certain threshold, but that will hardly be for each
individual page, typically it's much more than that.

>
> >     Therefore we fall back to nodatacow mode. We dirty the pages, mark
> > the range as dealloc, etc.
> >
> > 4) The reflink happens, for a subrange of the prealloc extent that
> > does not overlap the range of the buffered write.
>
> Just before the reflink, we only have 1 dirty page (the last page of
> that buffered write) doesn't reach disk yet.
>
> For the final page, we have no choice but do COW, and it fails with -ENOSPC.

Yes, isn't that what I said?

>
> However due to some other problem, the -ENOSPC doesn't reach user space
> at all.

Yes, isn't that what I said?

Yes... Same as the snapshotting case Robbie fixed. No news here.
Same as many other silent data loss issues I fixed with fsync a few
times sometime ago as well.

>
>
> >
> > 5) Some time after the reflink, writeback starts for the inode.
> >     During the writeback we fallback to COW mode, because the prealloc
> > extent is shared, even if the subrange of the buffered write does not
> > overlap the reflinked subrange.
> >     Now the write silently fails with -ENOSPC, and a user doesn't know
> > about it unless it does an fsync after that writeback, which will
> > report the error via filemap_check_wb_err().
> >
> >> We either:
> >> - reserve data space
> >>   Then no matter how it ends, we're OK, although it may end as CoW.
> >>
> >> - Failed to reserve data space
> >>   Writeback will be triggered anyway, no way to screw things around.
> >>
> >> Thus this workaround has nothing to fix, but only make certain NODATACOW
> >> reach disk as NODATACOW.
> >>
> >> It makes some NODATACOW behaves more correctly but won't fix any obvious
> >> bug.
> >>
> >> My personal take is to fix any strange behavior even it won't cause any
> >> problem, but the full inode writeback can be performance heavy.
> >>
> >> So my question is, do we really need this anyway?
> >
> > Do we need what? Your patch, that logic at
> > btrfs_alloc_data_chunk_ondemand(), something else?
>
> I meant the patch, but the deeper I dig into the problem, more problem I
> found.
>
> The patch is still needed, but there is a more important bug, that
> btrfs_run_delalloc_range() failure won't be reported in sync.

How could it report a failure?

sync calls sync(2), which returns void, meaning it always succeeds and
has no way to report errors.
See https://linux.die.net/man/2/sync

Error tracking and reporting was precisely one of main goals of
filemap_check_wb_err() and all the related work Jeff Layton did
sometime ago.

>
> The script here I'm using is:
> ------
> #!/bin/bash
>
> dev=/dev/test/test
> mnt=/mnt/btrfs
>
> #mkfs.btrfs -f $dev -b 1G > /dev/null
> #mount $dev $mnt -o nospace_cache
>
> umount $mnt &> /dev/null
> umount $dev &> /dev/null
>
> dmesg -C
> mkfs.btrfs -f $dev -b 512M > /dev/null
>
> mount $dev $mnt -o nospace_cache
>
> xfs_io -f -c "falloc 8k 64m" $mnt/file1
> xfs_io -f -c "pwrite 0 -b 4k 370M" $mnt/padding
>
> sync
> btrfs fi df $mnt --raw
>
> xfs_io -c "pwrite 1m 16m" $mnt/file1
> echo "nodatacow write finished" >> /dev/kmsg
> xfs_io -c "reflink $mnt/file1 8k 0 4k" $mnt/file1
> echo "reflink finished" >> /dev/kmsg
> sync
> echo "sync finished ret=$?" >> /dev/kmsg
> umount $dev
> ------
>
> As describe, the last write at 17821696 (17M - 4K) will fail due to ENOSPC.
> But the sync succeeded without reporting any problem.
>
> Thanks,
> Qu
>
> >
> > Thanks.
> >
> >>
> >> Thanks,
> >> Qu
> >>
> >>>
> >>>>
> >>>>> I don't recall starting transactions when running dealloc, and failed
> >>>>> to see where after a quick glance to cow_file_range()
> >>>>> and run_delalloc_nocow(). I'm assuming that 'at delalloc time' means
> >>>>> when starting writeback.
> >>>>>
> >>>>>>
> >>>>>> [CAUSE]
> >>>>>> This is due to the fact that btrfs can only do extent level share check.
> >>>>>>
> >>>>>> Btrfs can only tell if an extent is shared, no matter if only part of the
> >>>>>> extent is shared or not.
> >>>>>>
> >>>>>> So for above script we have:
> >>>>>> - fallocate
> >>>>>> - buffered write
> >>>>>>   If we don't have enough data space, we fall back to NOCOW check.
> >>>>>>   At this timming, the extent is not shared, we can skip data
> >>>>>>   reservation.
> >>>>>
> >>>>> But in the above example we don't fall to nocow mode when doing the
> >>>>> buffered write, as there's plenty of data space available (1Gb -
> >>>>> 24Kb).
> >>>>> You need to update the example.
> >>>> I have to admit that the core part is mostly based on the worst case
> >>>> *assumption*.
> >>>>
> >>>> I'll try to make the case convincing by making it fail directly.
> >>>
> >>> Great, thanks.
> >>>
> >>>>
> >>>>>
> >>>>>
> >>>>>> - reflink
> >>>>>>   Now part of the large preallocated extent is shared.
> >>>>>> - delalloc kicks in
> >>>>>
> >>>>> writeback kicks in
> >>>>>
> >>>>>>   For the NOCOW range, as the preallocated extent is shared, we need
> >>>>>>   to fall back to COW.
> >>>>>>
> >>>>>> [WORKAROUND]
> >>>>>> The workaround is to ensure any buffered write in the related extents
> >>>>>> (not the reflink source range) get flushed before reflink.
> >>>>>
> >>>>> not the reflink source range -> not just the reflink source range
> >>>>>
> >>>>>>
> >>>>>> However it's pretty expensive to do a comprehensive check.
> >>>>>> In the reproducer, the reflink source is just a part of a larger
> >>>>>
> >>>>> Again, the reproducer needs to be fixed (yes, I tested it even if it's
> >>>>> clear by looking at it that it doesn't trigger the nocow case).
> >>>>>
> >>>>>> preallocated extent, we need to flush all buffered write of that extent
> >>>>>> before reflink.
> >>>>>> Such backward search can be complex and we may not get much benefit from
> >>>>>> it.
> >>>>>>
> >>>>>> So this patch will just try to flush the whole inode before reflink.
> >>>>>
> >>>>>
> >>>>>>
> >>>>>> Signed-off-by: Qu Wenruo <wqu@suse.com>
> >>>>>> ---
> >>>>>> Reason for RFC:
> >>>>>> Flushing an inode just because it's a reflink source is definitely
> >>>>>> overkilling, but I don't have any better way to handle it.
> >>>>>>
> >>>>>> Any comment on this is welcomed.
> >>>>>> ---
> >>>>>>  fs/btrfs/ioctl.c | 22 ++++++++++++++++++++++
> >>>>>>  1 file changed, 22 insertions(+)
> >>>>>>
> >>>>>> diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
> >>>>>> index 7755b503b348..8caa0edb6fbf 100644
> >>>>>> --- a/fs/btrfs/ioctl.c
> >>>>>> +++ b/fs/btrfs/ioctl.c
> >>>>>> @@ -3930,6 +3930,28 @@ static noinline int btrfs_clone_files(struct file *file, struct file *file_src,
> >>>>>>                         return ret;
> >>>>>>         }
> >>>>>>
> >>>>>> +       /*
> >>>>>> +        * Workaround to make sure NOCOW buffered write reach disk as NOCOW.
> >>>>>> +        *
> >>>>>> +        * Due to the limit of btrfs extent tree design, we can only have
> >>>>>> +        * extent level share view. Any part of an extent is shared then the
> >>>>>
> >>>>> Any -> If any
> >>>>>
> >>>>>> +        * whole extent is shared and any write into that extent needs to fall
> >>>>>
> >>>>> is -> is considered
> >>>>>
> >>>>>> +        * back to COW.
> >>>>>
> >>>>> I would add, something like:
> >>>>>
> >>>>> That is, btrfs' back references do not have a block level granularity,
> >>>>> they work at the whole extent level.
> >>>>>
> >>>>>> +        *
> >>>>>> +        * NOCOW buffered write without data space reserved could to lead to
> >>>>>> +        * either data space bytes_may_use underflow (kernel warning) or ENOSPC
> >>>>>> +        * at delalloc time (transaction abort).
> >>>>>
> >>>>> I would omit the warning and transaction abort parts, that can change
> >>>>> any time. And we have that information in the changelog, so it's not
> >>>>> lost.
> >>>>>
> >>>>>> +        *
> >>>>>> +        * Here we take a shortcut by flush the whole inode. We could do better
> >>>>>> +        * by finding all extents in that range and flush the space referring
> >>>>>> +        * all those extents.
> >>>>>> +        * But that's too complex for such corner case.
> >>>>>> +        */
> >>>>>> +       filemap_flush(src->i_mapping);
> >>>>>> +       if (test_bit(BTRFS_INODE_HAS_ASYNC_EXTENT,
> >>>>>> +                    &BTRFS_I(src)->runtime_flags))
> >>>>>> +               filemap_flush(src->i_mapping);
> >>>>>
> >>>>> So a few comments here:
> >>>>>
> >>>>> - why just in the clone part? The dedupe side has the same problem, doesn't it?
> >>>>
> >>>> Right.
> >>>>
> >>>>>
> >>>>> - I would move such flushing to btrfs_remap_file_range_prep - this is
> >>>>> where we do the source and target range flush and wait.
> >>>>>
> >>>>> Can you turn the reproducer into an fstests case?
> >>>>
> >>>> Sure.
> >>>>
> >>>> Thanks for the info and all the comment,
> >>>> Qu
> >>>>
> >>>>>
> >>>>> Thanks.
> >>>>>
> >>>>>> +
> >>>>>>         /*
> >>>>>>          * Lock destination range to serialize with concurrent readpages() and
> >>>>>>          * source range to serialize with relocation.
> >>>>>> --
> >>>>>> 2.21.0
> >>>>>>
> >>>>>
> >>>>>
> >>>>
> >>>
> >>>
> >>
> >
> >
>
Josef Bacik May 7, 2019, 5:36 p.m. UTC | #15
On Fri, May 03, 2019 at 09:08:52AM +0800, Qu Wenruo wrote:
> [BUG]
> The following command can lead to unexpected data COW:
> 
>   #!/bin/bash
> 
>   dev=/dev/test/test
>   mnt=/mnt/btrfs
> 
>   mkfs.btrfs -f $dev -b 1G > /dev/null
>   mount $dev $mnt -o nospace_cache
> 
>   xfs_io -f -c "falloc 8k 24k" -c "pwrite 12k 8k" $mnt/file1
>   xfs_io -c "reflink $mnt/file1 8k 0 4k" $mnt/file1
>   umount $dev
> 
> The result extent will be
> 
> 	item 7 key (257 EXTENT_DATA 4096) itemoff 15760 itemsize 53
> 		generation 6 type 2 (prealloc)
> 		prealloc data disk byte 13631488 nr 28672
> 	item 8 key (257 EXTENT_DATA 12288) itemoff 15707 itemsize 53
> 		generation 6 type 1 (regular)
> 		extent data disk byte 13660160 nr 12288 <<< COW
> 	item 9 key (257 EXTENT_DATA 24576) itemoff 15654 itemsize 53
> 		generation 6 type 2 (prealloc)
> 		prealloc data disk byte 13631488 nr 28672
> 
> Currently we always reserve space even for NOCOW buffered write, thus
> under most case it shouldn't cause anything wrong even we fall back to
> COW.
> 
> However when we're out of data space, we fall back to skip data space if
> we can do NOCOW write.
> 
> If such behavior happens under that case, we could hit the following
> problems:
> - data space bytes_may_use underflow
>   This will cause kernel warning.
> 

This can be fixed, I laid out a few ways it could be fixed.

> - ENOSPC at delalloc time
>   This will lead to transaction abort and fs forced to RO.
> 

How?  The metadata and data reservations are separate.  If we can't make the
metadata reservation we fail out, the only thing we allow is skipping the data
reservation.  So if we fall back to cow_file_range() at run_delalloc_nocow()
time all we'll do is get an ENOSPC outside of a transaction, so we can just
mark the inode as having failed its writeout with ENOSPC so fsync() returns the
appropriate error and carry on.  We shouldn't be aborting a transaction here at
all.  Thanks,

Josef

Patch
diff mbox series

diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 7755b503b348..8caa0edb6fbf 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -3930,6 +3930,28 @@  static noinline int btrfs_clone_files(struct file *file, struct file *file_src,
 			return ret;
 	}
 
+	/*
+	 * Workaround to make sure NOCOW buffered write reach disk as NOCOW.
+	 *
+	 * Due to the limit of btrfs extent tree design, we can only have
+	 * extent level share view. Any part of an extent is shared then the
+	 * whole extent is shared and any write into that extent needs to fall
+	 * back to COW.
+	 *
+	 * NOCOW buffered write without data space reserved could to lead to
+	 * either data space bytes_may_use underflow (kernel warning) or ENOSPC
+	 * at delalloc time (transaction abort).
+	 *
+	 * Here we take a shortcut by flush the whole inode. We could do better
+	 * by finding all extents in that range and flush the space referring
+	 * all those extents.
+	 * But that's too complex for such corner case.
+	 */
+	filemap_flush(src->i_mapping);
+	if (test_bit(BTRFS_INODE_HAS_ASYNC_EXTENT,
+		     &BTRFS_I(src)->runtime_flags))
+		filemap_flush(src->i_mapping);
+
 	/*
 	 * Lock destination range to serialize with concurrent readpages() and
 	 * source range to serialize with relocation.