
btrfs: defrag: add under utilized extent to defrag target list

Message ID 2c2fac36b67c97c9955eb24a97c6f3c09d21c7ff.1704440000.git.wqu@suse.com (mailing list archive)
State New, archived
Series btrfs: defrag: add under utilized extent to defrag target list

Commit Message

Qu Wenruo Jan. 5, 2024, 7:33 a.m. UTC
[BUG]
The following script can lead to a very under utilized extent and we
have no way to use defrag to properly reclaim its wasted space:

  # mkfs.btrfs -f $dev
  # mount $dev $mnt
  # xfs_io -f -c "pwrite 0 128M" $mnt/foobar
  # sync
  # btrfs filesystem defrag $mnt/foobar
  # sync

After the above operations, the file "foobar" is still utilizing the
whole 128M:

        item 4 key (257 INODE_ITEM 0) itemoff 15883 itemsize 160
                generation 7 transid 8 size 4096 nbytes 4096
                block group 0 mode 100600 links 1 uid 0 gid 0 rdev 0
                sequence 32770 flags 0x0(none)
        item 5 key (257 INODE_REF 256) itemoff 15869 itemsize 14
                index 2 namelen 4 name: file
        item 6 key (257 EXTENT_DATA 0) itemoff 15816 itemsize 53
                generation 7 type 1 (regular)
                extent data disk byte 298844160 nr 134217728 <<<
                extent data offset 0 nr 4096 ram 134217728
                extent compression 0 (none)

This means the expected defrag-based way to reclaim the space is not working.

[CAUSE]
The file extent has no adjacent extent at all, thus all existing defrag
code considers it a perfectly good file extent, even if it is only
utilizing a very tiny amount of space.

[FIX]
Add special handling for under utilized extents; currently the threshold
ratio is 6.25% (1/16).

This allows us to add such extents to the defrag target list, so that
they get properly defragged.

Reported-by: Christoph Anton Mitterer <calestyo@scientia.org>
Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/defrag.c | 11 +++++++++++
 1 file changed, 11 insertions(+)

Comments

Andrei Borzenkov Jan. 5, 2024, 4:45 p.m. UTC | #1
On 05.01.2024 10:33, Qu Wenruo wrote:
> [BUG]
> The following script can lead to a very under utilized extent and we
> have no way to use defrag to properly reclaim its wasted space:
> 
>    # mkfs.btrfs -f $dev
>    # mount $dev $mnt
>    # xfs_io -f -c "pwrite 0 128M" $mnt/foobar
>    # sync
>    # btrfs filesystem defrag $mnt/foobar
>    # sync
> 
> After the above operations, the file "foobar" is still utilizing the
> whole 128M:
> 
>          item 4 key (257 INODE_ITEM 0) itemoff 15883 itemsize 160
>                  generation 7 transid 8 size 4096 nbytes 4096
>                  block group 0 mode 100600 links 1 uid 0 gid 0 rdev 0
>                  sequence 32770 flags 0x0(none)
>          item 5 key (257 INODE_REF 256) itemoff 15869 itemsize 14
>                  index 2 namelen 4 name: file
>          item 6 key (257 EXTENT_DATA 0) itemoff 15816 itemsize 53
>                  generation 7 type 1 (regular)
>                  extent data disk byte 298844160 nr 134217728 <<<
>                  extent data offset 0 nr 4096 ram 134217728
>                  extent compression 0 (none)
> 
> Meaning the expected defrag way to reclaim the space is not working.
> 
> [CAUSE]
> The file extent has no adjacent extent at all, thus all existing defrag
> code consider it a perfectly good file extent, even if it's only
> utilizing a very tiny amount of space.
> 
> [FIX]
> Add a special handling for under utilized extents, currently the ratio
> is 6.25% (1/16).
> 
> This would allow us to add such extent to our defrag target list,
> resulting it to be properly defragged.
> 

This sounds like the wrong thing to do unconditionally. You have no
knowledge of the application's I/O pattern and do not know whether the
application is going to fill this extent or not. Instead of
defragmenting you are effectively fragmenting.

This sounds more like a target for a different operation, which might be
called "garbage collection". But it should be an explicit operation not
tied to defragmentation (or at least an explicit opt-in for defragmentation).

> Reported-by: Christoph Anton Mitterer <calestyo@scientia.org>
> Signed-off-by: Qu Wenruo <wqu@suse.com>
> ---
>   fs/btrfs/defrag.c | 11 +++++++++++
>   1 file changed, 11 insertions(+)
> 
> diff --git a/fs/btrfs/defrag.c b/fs/btrfs/defrag.c
> index c276b136ab63..cc319190b6fb 100644
> --- a/fs/btrfs/defrag.c
> +++ b/fs/btrfs/defrag.c
> @@ -1070,6 +1070,17 @@ static int defrag_collect_targets(struct btrfs_inode *inode,
>   		if (!next_mergeable) {
>   			struct defrag_target_range *last;
>   
> +			/*
> +			 * Special entry point utilization ratio under 1/16 (only
> +			 * referring 1/16 of an on-disk extent).
> +			 * This can happen for a truncated large extent.
> +			 * If we don't add them, then for a truncated file
> +			 * (may be the last 4K of a 128M extent) it will never
> +			 * be defraged.
> +			 */
> +			if (em->ram_bytes < em->orig_block_len / 16)
> +				goto add;
> +
>   			/* Empty target list, no way to merge with last entry */
>   			if (list_empty(target_list))
>   				goto next;
Qu Wenruo Jan. 5, 2024, 8:11 p.m. UTC | #2
On 2024/1/6 03:15, Andrei Borzenkov wrote:
> On 05.01.2024 10:33, Qu Wenruo wrote:
>> [BUG]
>> The following script can lead to a very under utilized extent and we
>> have no way to use defrag to properly reclaim its wasted space:
>>
>>    # mkfs.btrfs -f $dev
>>    # mount $dev $mnt
>>    # xfs_io -f -c "pwrite 0 128M" $mnt/foobar
>>    # sync
>>    # btrfs filesystem defrag $mnt/foobar
>>    # sync
>>
>> After the above operations, the file "foobar" is still utilizing the
>> whole 128M:
>>
>>          item 4 key (257 INODE_ITEM 0) itemoff 15883 itemsize 160
>>                  generation 7 transid 8 size 4096 nbytes 4096
>>                  block group 0 mode 100600 links 1 uid 0 gid 0 rdev 0
>>                  sequence 32770 flags 0x0(none)
>>          item 5 key (257 INODE_REF 256) itemoff 15869 itemsize 14
>>                  index 2 namelen 4 name: file
>>          item 6 key (257 EXTENT_DATA 0) itemoff 15816 itemsize 53
>>                  generation 7 type 1 (regular)
>>                  extent data disk byte 298844160 nr 134217728 <<<
>>                  extent data offset 0 nr 4096 ram 134217728
>>                  extent compression 0 (none)
>>
>> Meaning the expected defrag way to reclaim the space is not working.
>>
>> [CAUSE]
>> The file extent has no adjacent extent at all, thus all existing defrag
>> code consider it a perfectly good file extent, even if it's only
>> utilizing a very tiny amount of space.
>>
>> [FIX]
>> Add a special handling for under utilized extents, currently the ratio
>> is 6.25% (1/16).
>>
>> This would allow us to add such extent to our defrag target list,
>> resulting it to be properly defragged.
>>
> 
> This sounds like the very wrong thing to do unconditionally. You have no 
> knowledge of application I/O pattern and do not know whether application 
> is going to fill this extent or not. Instead of de-fragmenting you are 
> effectively fragmenting.

This is not exactly true.

If you have one file extent referring to only 1/16 of a larger extent,
it already means the extent itself is fragmented.

That happens either through truncation (the case we're hitting here) or
through tons of later COW operations leading to this situation (which
would also be caught by the existing defrag conditions).

If you want to prove it wrong, please give me a counter example.

Thanks,
Qu
> 
> This sounds more like a target for a different operation, which may be 
> called "garbage collection". But it should be explicit operation not 
> tied to defragmentation (or at least explicit opt-in for defragmentation).
> 
>> Reported-by: Christoph Anton Mitterer <calestyo@scientia.org>
>> Signed-off-by: Qu Wenruo <wqu@suse.com>
>> ---
>>   fs/btrfs/defrag.c | 11 +++++++++++
>>   1 file changed, 11 insertions(+)
>>
>> diff --git a/fs/btrfs/defrag.c b/fs/btrfs/defrag.c
>> index c276b136ab63..cc319190b6fb 100644
>> --- a/fs/btrfs/defrag.c
>> +++ b/fs/btrfs/defrag.c
>> @@ -1070,6 +1070,17 @@ static int defrag_collect_targets(struct 
>> btrfs_inode *inode,
>>           if (!next_mergeable) {
>>               struct defrag_target_range *last;
>> +            /*
>> +             * Special entry point utilization ratio under 1/16 (only
>> +             * referring 1/16 of an on-disk extent).
>> +             * This can happen for a truncated large extent.
>> +             * If we don't add them, then for a truncated file
>> +             * (may be the last 4K of a 128M extent) it will never
>> +             * be defraged.
>> +             */
>> +            if (em->ram_bytes < em->orig_block_len / 16)
>> +                goto add;
>> +
>>               /* Empty target list, no way to merge with last entry */
>>               if (list_empty(target_list))
>>                   goto next;
>
Filipe Manana Jan. 9, 2024, 2:55 p.m. UTC | #3
On Fri, Jan 5, 2024 at 7:34 AM Qu Wenruo <wqu@suse.com> wrote:
>
> [BUG]
> The following script can lead to a very under utilized extent and we
> have no way to use defrag to properly reclaim its wasted space:
>
>   # mkfs.btrfs -f $dev
>   # mount $dev $mnt
>   # xfs_io -f -c "pwrite 0 128M" $mnt/foobar
>   # sync
>   # btrfs filesystem defrag $mnt/foobar
>   # sync

There's a missing truncate in the example.

>
> After the above operations, the file "foobar" is still utilizing the
> whole 128M:
>
>         item 4 key (257 INODE_ITEM 0) itemoff 15883 itemsize 160
>                 generation 7 transid 8 size 4096 nbytes 4096
>                 block group 0 mode 100600 links 1 uid 0 gid 0 rdev 0
>                 sequence 32770 flags 0x0(none)
>         item 5 key (257 INODE_REF 256) itemoff 15869 itemsize 14
>                 index 2 namelen 4 name: file
>         item 6 key (257 EXTENT_DATA 0) itemoff 15816 itemsize 53
>                 generation 7 type 1 (regular)
>                 extent data disk byte 298844160 nr 134217728 <<<
>                 extent data offset 0 nr 4096 ram 134217728
>                 extent compression 0 (none)
>
> Meaning the expected defrag way to reclaim the space is not working.
>
> [CAUSE]
> The file extent has no adjacent extent at all, thus all existing defrag
> code consider it a perfectly good file extent, even if it's only
> utilizing a very tiny amount of space.
>
> [FIX]
> Add a special handling for under utilized extents, currently the ratio
> is 6.25% (1/16).
>
> This would allow us to add such extent to our defrag target list,
> resulting it to be properly defragged.
>
> Reported-by: Christoph Anton Mitterer <calestyo@scientia.org>
> Signed-off-by: Qu Wenruo <wqu@suse.com>
> ---
>  fs/btrfs/defrag.c | 11 +++++++++++
>  1 file changed, 11 insertions(+)
>
> diff --git a/fs/btrfs/defrag.c b/fs/btrfs/defrag.c
> index c276b136ab63..cc319190b6fb 100644
> --- a/fs/btrfs/defrag.c
> +++ b/fs/btrfs/defrag.c
> @@ -1070,6 +1070,17 @@ static int defrag_collect_targets(struct btrfs_inode *inode,
>                 if (!next_mergeable) {
>                         struct defrag_target_range *last;
>
> +                       /*
> +                        * Special entry point utilization ratio under 1/16 (only
> +                        * referring 1/16 of an on-disk extent).
> +                        * This can happen for a truncated large extent.
> +                        * If we don't add them, then for a truncated file
> +                        * (may be the last 4K of a 128M extent) it will never

may be -> maybe

> +                        * be defraged.

defraged -> defragged

> +                        */
> +                       if (em->ram_bytes < em->orig_block_len / 16)

Why 1/16?
For a 128M extent, for example, even 1/2 (64M) is a lot of wasted space.
So I think a better condition is needed, probably one that also
considers the absolute amount of wasted/unused space.

And this should use em->len and not em->ram_bytes. The latter is
preserved when splitting an extent map.
You can even notice this in the tree dump example from the change log
- the file extent item's ram bytes is 128M, not 4K.

Thanks.

> +                               goto add;
> +
>                         /* Empty target list, no way to merge with last entry */
>                         if (list_empty(target_list))
>                                 goto next;
> --
> 2.43.0
>
>
Filipe Manana Jan. 9, 2024, 4:12 p.m. UTC | #4
On Tue, Jan 9, 2024 at 2:55 PM Filipe Manana <fdmanana@kernel.org> wrote:
>
> On Fri, Jan 5, 2024 at 7:34 AM Qu Wenruo <wqu@suse.com> wrote:
> >
> > [BUG]
> > The following script can lead to a very under utilized extent and we
> > have no way to use defrag to properly reclaim its wasted space:
> >
> >   # mkfs.btrfs -f $dev
> >   # mount $dev $mnt
> >   # xfs_io -f -c "pwrite 0 128M" $mnt/foobar
> >   # sync
> >   # btrfs filesystem defrag $mnt/foobar
> >   # sync
>
> There's a missing truncate in the example.
>
> >
> > After the above operations, the file "foobar" is still utilizing the
> > whole 128M:
> >
> >         item 4 key (257 INODE_ITEM 0) itemoff 15883 itemsize 160
> >                 generation 7 transid 8 size 4096 nbytes 4096
> >                 block group 0 mode 100600 links 1 uid 0 gid 0 rdev 0
> >                 sequence 32770 flags 0x0(none)
> >         item 5 key (257 INODE_REF 256) itemoff 15869 itemsize 14
> >                 index 2 namelen 4 name: file
> >         item 6 key (257 EXTENT_DATA 0) itemoff 15816 itemsize 53
> >                 generation 7 type 1 (regular)
> >                 extent data disk byte 298844160 nr 134217728 <<<
> >                 extent data offset 0 nr 4096 ram 134217728
> >                 extent compression 0 (none)
> >
> > Meaning the expected defrag way to reclaim the space is not working.
> >
> > [CAUSE]
> > The file extent has no adjacent extent at all, thus all existing defrag
> > code consider it a perfectly good file extent, even if it's only
> > utilizing a very tiny amount of space.
> >
> > [FIX]
> > Add a special handling for under utilized extents, currently the ratio
> > is 6.25% (1/16).
> >
> > This would allow us to add such extent to our defrag target list,
> > resulting it to be properly defragged.
> >
> > Reported-by: Christoph Anton Mitterer <calestyo@scientia.org>
> > Signed-off-by: Qu Wenruo <wqu@suse.com>

Also don't forget to add a Link tag pointing to the thread at lore.
Thanks.

> > ---
> >  fs/btrfs/defrag.c | 11 +++++++++++
> >  1 file changed, 11 insertions(+)
> >
> > diff --git a/fs/btrfs/defrag.c b/fs/btrfs/defrag.c
> > index c276b136ab63..cc319190b6fb 100644
> > --- a/fs/btrfs/defrag.c
> > +++ b/fs/btrfs/defrag.c
> > @@ -1070,6 +1070,17 @@ static int defrag_collect_targets(struct btrfs_inode *inode,
> >                 if (!next_mergeable) {
> >                         struct defrag_target_range *last;
> >
> > +                       /*
> > +                        * Special entry point utilization ratio under 1/16 (only
> > +                        * referring 1/16 of an on-disk extent).
> > +                        * This can happen for a truncated large extent.
> > +                        * If we don't add them, then for a truncated file
> > +                        * (may be the last 4K of a 128M extent) it will never
>
> may be -> maybe
>
> > +                        * be defraged.
>
> defraged -> defragged
>
> > +                        */
> > +                       if (em->ram_bytes < em->orig_block_len / 16)
>
> Why 1 / 16?
> For a 128M extent for example, even 1 / 2 (64M) is a lot of wasted space.
> So I think a better condition is needed, probably to consider the
> absolute value of wasted/unused space too.
>
> And this should use em->len and not em->ram_bytes. The latter is
> preserved when splitting an extent map.
> You can even notice this in the tree dump example from the change log
> - the file extent's items ram bytes is 128M, not 4K.
>
> Thanks.
>
> > +                               goto add;
> > +
> >                         /* Empty target list, no way to merge with last entry */
> >                         if (list_empty(target_list))
> >                                 goto next;
> > --
> > 2.43.0
> >
> >
Qu Wenruo Jan. 9, 2024, 9:04 p.m. UTC | #5
On 2024/1/10 01:25, Filipe Manana wrote:
> On Fri, Jan 5, 2024 at 7:34 AM Qu Wenruo <wqu@suse.com> wrote:
>>
>> [BUG]
>> The following script can lead to a very under utilized extent and we
>> have no way to use defrag to properly reclaim its wasted space:
>>
>>    # mkfs.btrfs -f $dev
>>    # mount $dev $mnt
>>    # xfs_io -f -c "pwrite 0 128M" $mnt/foobar
>>    # sync
>>    # btrfs filesystem defrag $mnt/foobar
>>    # sync
>
> There's a missing truncate in the example.
>
>>
>> After the above operations, the file "foobar" is still utilizing the
>> whole 128M:
>>
>>          item 4 key (257 INODE_ITEM 0) itemoff 15883 itemsize 160
>>                  generation 7 transid 8 size 4096 nbytes 4096
>>                  block group 0 mode 100600 links 1 uid 0 gid 0 rdev 0
>>                  sequence 32770 flags 0x0(none)
>>          item 5 key (257 INODE_REF 256) itemoff 15869 itemsize 14
>>                  index 2 namelen 4 name: file
>>          item 6 key (257 EXTENT_DATA 0) itemoff 15816 itemsize 53
>>                  generation 7 type 1 (regular)
>>                  extent data disk byte 298844160 nr 134217728 <<<
>>                  extent data offset 0 nr 4096 ram 134217728
>>                  extent compression 0 (none)
>>
>> Meaning the expected defrag way to reclaim the space is not working.
>>
>> [CAUSE]
>> The file extent has no adjacent extent at all, thus all existing defrag
>> code consider it a perfectly good file extent, even if it's only
>> utilizing a very tiny amount of space.
>>
>> [FIX]
>> Add a special handling for under utilized extents, currently the ratio
>> is 6.25% (1/16).
>>
>> This would allow us to add such extent to our defrag target list,
>> resulting it to be properly defragged.
>>
>> Reported-by: Christoph Anton Mitterer <calestyo@scientia.org>
>> Signed-off-by: Qu Wenruo <wqu@suse.com>
>> ---
>>   fs/btrfs/defrag.c | 11 +++++++++++
>>   1 file changed, 11 insertions(+)
>>
>> diff --git a/fs/btrfs/defrag.c b/fs/btrfs/defrag.c
>> index c276b136ab63..cc319190b6fb 100644
>> --- a/fs/btrfs/defrag.c
>> +++ b/fs/btrfs/defrag.c
>> @@ -1070,6 +1070,17 @@ static int defrag_collect_targets(struct btrfs_inode *inode,
>>                  if (!next_mergeable) {
>>                          struct defrag_target_range *last;
>>
>> +                       /*
>> +                        * Special entry point utilization ratio under 1/16 (only
>> +                        * referring 1/16 of an on-disk extent).
>> +                        * This can happen for a truncated large extent.
>> +                        * If we don't add them, then for a truncated file
>> +                        * (may be the last 4K of a 128M extent) it will never
>
> may be -> maybe
>
>> +                        * be defraged.
>
> defraged -> defragged
>
>> +                        */
>> +                       if (em->ram_bytes < em->orig_block_len / 16)
>
> Why 1 / 16?
> For a 128M extent for example, even 1 / 2 (64M) is a lot of wasted space.
> So I think a better condition is needed, probably to consider the
> absolute value of wasted/unused space too.

The 1/16 is chosen as a trade-off between wasted space and possible
unnecessary defrag.

E.g. if the file extent is only referring to 4K of a 32K extent, doing a
defrag would not save many bytes.

I can definitely go with both (ratio and absolute wasted space), but
that also means I need to find a new threshold (for wasted bytes), and
I'm not confident enough to come up with another one.

>
> And this should use em->len and not em->ram_bytes. The latter is
> preserved when splitting an extent map.

Oh indeed, would definitely fix it.

Thanks,
Qu


> You can even notice this in the tree dump example from the change log
> - the file extent's items ram bytes is 128M, not 4K.
>
> Thanks.
>
>> +                               goto add;
>> +
>>                          /* Empty target list, no way to merge with last entry */
>>                          if (list_empty(target_list))
>>                                  goto next;
>> --
>> 2.43.0
>>
>>
>
Christoph Anton Mitterer Jan. 9, 2024, 9:57 p.m. UTC | #6
On Wed, 2024-01-10 at 07:34 +1030, Qu Wenruo wrote:
> The 1/16 is chosen as a trade-off between wasted space and possible
> unnecessary defrag.

Filipe's point doesn't seem so invalid... wouldn't it be possible to
make the trade-off configurable, so that the user can choose via some
command line option?

Cheers,
Chris.
Qu Wenruo Jan. 9, 2024, 10:17 p.m. UTC | #7
On 2024/1/10 08:27, Christoph Anton Mitterer wrote:
> On Wed, 2024-01-10 at 07:34 +1030, Qu Wenruo wrote:
>> The 1/16 is chosen as a trade-off between wasted space and possible
>> unnecessary defrag.
>
> Felipe's point doesn't seem so invalid.... wouldn't it be possible to
> make the trade-off somehow configurable, so that the user can choose
> via some command line option?

We have 16 bytes of extra space in the ioctl args; I think we can use
two u32 values to configure both the ratio and the wasted space if needed.

Unfortunately we don't have any existing checks on
btrfs_ioctl_defrag_range_args::flags, which means we have no way to
detect if the kernel supports new flags.

I'll start by adding such a rejection first; then we can start adding
new btrfs_ioctl_defrag_range_args flags/options to support a
configurable ratio/wasted-bytes threshold.

Thanks,
Qu

>
> Cheers,
> Chris.
David Sterba Jan. 10, 2024, 5:09 p.m. UTC | #8
On Fri, Jan 05, 2024 at 06:03:40PM +1030, Qu Wenruo wrote:
> [BUG]
> The following script can lead to a very under utilized extent and we
> have no way to use defrag to properly reclaim its wasted space:
> 
>   # mkfs.btrfs -f $dev
>   # mount $dev $mnt
>   # xfs_io -f -c "pwrite 0 128M" $mnt/foobar
>   # sync
>   # btrfs filesystem defrag $mnt/foobar
>   # sync

I don't see what's wrong with this example, as Filipe noted there's a
truncate missing, but still this should be explained better.

Is this the problem when an overwritten and shared extent is partially
overwritten but still occupying the whole range, aka. bookend extent?
If yes, defrag was never meant to deal with that, though we could use
the interface for that.

As Andrei pointed out, this is more like garbage collection: getting rid
of an extent that is partially unreachable. Detecting such an extent
requires looking for the unreferenced part of the extent, while defragmentation
deals with live data. This could be a new ioctl entirely too. But first
I'd like to know if we're talking about the same thing.
Qu Wenruo Jan. 11, 2024, 6:24 a.m. UTC | #9
On 2024/1/11 03:39, David Sterba wrote:
> On Fri, Jan 05, 2024 at 06:03:40PM +1030, Qu Wenruo wrote:
>> [BUG]
>> The following script can lead to a very under utilized extent and we
>> have no way to use defrag to properly reclaim its wasted space:
>>
>>    # mkfs.btrfs -f $dev
>>    # mount $dev $mnt
>>    # xfs_io -f -c "pwrite 0 128M" $mnt/foobar
>>    # sync
>>    # btrfs filesystem defrag $mnt/foobar
>>    # sync
> 
> I don't see what's wrong with this example, as Filipe noted there's a
> truncate missing, but still this should be explained better.

Sorry, the full explanation looks like this:

After the above truncation, we get the following file extent layout:

	item 6 key (257 EXTENT_DATA 0) itemoff 15813 itemsize 53
		generation 7 type 1 (regular)
		extent data disk byte 298844160 nr 134217728
		extent data offset 0 nr 4096 ram 134217728
		extent compression 0 (none)

That leaves the last 4K referring to that 128M extent, i.e. (128M - 4K)
bytes wasted, or 99.997% of the extent.

Normally we expect defrag to properly re-dirty the extent so that we can
free that 128M extent.
But defrag won't touch it at all, mostly because there is no next extent
to merge with.

> 
> Is this the problem when an overwritten and shared extent is partially
> overwritten but still occupying the whole range, aka. bookend extent?
> If yes, defrag was never meant to deal with that, though we could use
> the interface for that.

If we don't go through defrag, there is really no good way to do it safely.
Sure, you can copy the file to another non-btrfs location or dd it,
but that's not safe if there is still some process accessing it, etc.
But that's not safe if there is still some process accessing it etc.

> 
> As Andrei pointed out, this is more like a garbage collection, get rid
> of extent that is partially unreachable. Detecting such extent requires
> looking for the unreferenced part of the extent while defragmentation
> deals with live data. This could be a new ioctl entirely too. But first
> I'd like to know if we're talking about the same thing.

Yes, we're talking about the bookend problem.
I would expect defrag to free most, if not all, such bookend extents.
(And that's exactly what I recommended to the initial reporter.)

Thanks,
Qu
David Sterba Jan. 12, 2024, 3:58 p.m. UTC | #10
On Thu, Jan 11, 2024 at 04:54:47PM +1030, Qu Wenruo wrote:
> 
> 
> On 2024/1/11 03:39, David Sterba wrote:
> > On Fri, Jan 05, 2024 at 06:03:40PM +1030, Qu Wenruo wrote:
> >> [BUG]
> >> The following script can lead to a very under utilized extent and we
> >> have no way to use defrag to properly reclaim its wasted space:
> >>
> >>    # mkfs.btrfs -f $dev
> >>    # mount $dev $mnt
> >>    # xfs_io -f -c "pwrite 0 128M" $mnt/foobar
> >>    # sync
> >>    # btrfs filesystem defrag $mnt/foobar
> >>    # sync
> > 
> > I don't see what's wrong with this example, as Filipe noted there's a
> > truncate missing, but still this should be explained better.
> 
> Sorry, the full explanation looks like this:
> 
> After above truncation, we will got the following file extent layout:
> 
> 	item 6 key (257 EXTENT_DATA 0) itemoff 15813 itemsize 53
> 		generation 7 type 1 (regular)
> 		extent data disk byte 298844160 nr 134217728
> 		extent data offset 0 nr 4096 ram 134217728
> 		extent compression 0 (none)
> 
> That would be the last 4K referring that 128M extent, aka, wasted 
> (128M-4K) bytes, or 99.997% of the extent.

Ok, so it's the known issue.

> Normally we expect defrag to properly re-dirty the extent so that we can 
> free that 128M extent.
> But defrag won't touch it at all, mostly due to there is no next extent 
> to merge.
> 
> > Is this the problem when an overwritten and shared extent is partially
> > overwritten but still occupying the whole range, aka. bookend extent?
> > If yes, defrag was never meant to deal with that, though we could use
> > the interface for that.
> 
> If we don't go defrag, there is really no good way to do it safely.
> 
> Sure you can copy the file to another non-btrfs location or dd it.
> But that's not safe if there is still some process accessing it etc.
> 
> > As Andrei pointed out, this is more like a garbage collection, get rid
> > of extent that is partially unreachable. Detecting such extent requires
> > looking for the unreferenced part of the extent while defragmentation
> > deals with live data. This could be a new ioctl entirely too. But first
> > I'd like to know if we're talking about the same thing.
> 
> Yes, we're talking about the bookend problem.
> As I would expect defrag to free most, if not all, such bookend extents.
> (And that's exactly what I recommend to the initial report)

Here defrag can mean two things: the interface (ioctl and command)
and the implementation. As defrag tries to merge adjacent extents or
coalesce small extents and move them to a new location, this may not
always be necessary just to get rid of the unreachable extent parts.

From the interface side, we can add a mode that does only the garbage
collection, effectively just looking up the unreachable parts, trimming
the extents but leaving the live data intact.

The modes of operation:

- current defrag, move if filters and conditons allow that
- defrag + garbage collect extents
- just garbage collect extents

The third mode is for the use case of running it on the whole filesystem
without trying to rewrite everything.

I'm not sure how it would affect send/receive; read-only subvolumes
should not be touched.
Qu Wenruo Jan. 13, 2024, 3:17 a.m. UTC | #11
On 2024/1/13 02:28, David Sterba wrote:
> On Thu, Jan 11, 2024 at 04:54:47PM +1030, Qu Wenruo wrote:
>>
>>
>> On 2024/1/11 03:39, David Sterba wrote:
>>> On Fri, Jan 05, 2024 at 06:03:40PM +1030, Qu Wenruo wrote:
>>>> [BUG]
>>>> The following script can lead to a very under utilized extent and we
>>>> have no way to use defrag to properly reclaim its wasted space:
>>>>
>>>>     # mkfs.btrfs -f $dev
>>>>     # mount $dev $mnt
>>>>     # xfs_io -f -c "pwrite 0 128M" $mnt/foobar
>>>>     # sync
>>>>     # btrfs filesystem defrag $mnt/foobar
>>>>     # sync
>>>
>>> I don't see what's wrong with this example, as Filipe noted there's a
>>> truncate missing, but still this should be explained better.
>>
>> Sorry, the full explanation looks like this:
>>
>> After above truncation, we will got the following file extent layout:
>>
>> 	item 6 key (257 EXTENT_DATA 0) itemoff 15813 itemsize 53
>> 		generation 7 type 1 (regular)
>> 		extent data disk byte 298844160 nr 134217728
>> 		extent data offset 0 nr 4096 ram 134217728
>> 		extent compression 0 (none)
>>
>> That would be the last 4K referring that 128M extent, aka, wasted
>> (128M-4K) bytes, or 99.997% of the extent.
>
> Ok, so it's the known issue.
>
>> Normally we expect defrag to properly re-dirty the extent so that we can
>> free that 128M extent.
>> But defrag won't touch it at all, mostly due to there is no next extent
>> to merge.
>>
>>> Is this the problem when an overwritten and shared extent is partially
>>> overwritten but still occupying the whole range, aka. bookend extent?
>>> If yes, defrag was never meant to deal with that, though we could use
>>> the interface for that.
>>
>> If we don't go defrag, there is really no good way to do it safely.
>>
>> Sure you can copy the file to another non-btrfs location or dd it.
>> But that's not safe if there is still some process accessing it etc.
>>
>>> As Andrei pointed out, this is more like a garbage collection, get rid
>>> of extent that is partially unreachable. Detecting such extent requires
>>> looking for the unreferenced part of the extent while defragmentation
>>> deals with live data. This could be a new ioctl entirely too. But first
>>> I'd like to know if we're talking about the same thing.
>>
>> Yes, we're talking about the bookend problem.
>> As I would expect defrag to free most, if not all, such bookend extents.
>> (And that's exactly what I recommend to the initial report)
>
> Here the defrag can mean two things, the interface (ioctl and command)
> and the implementation. As defrag tries to merge adjacent extents or
> coalesce small extents and move it to a new location, this may not be
> always necessary just to get rid of the unreachable extent parts.

To me, defrag just means re-dirtying the file range.
Whether it results in a contiguous extent or leads to more fragments is
not guaranteed.
(E.g. when defragging a fragmented file the fs itself may also be super
fragmented, or very high memory pressure may force us to do writeback
very often.)

>
>  From the interface side, we can add a mode that does only the garbage
> collection, effectively just looking up the unreachable parts, trimming
> the extents but leaving the live data intact.

Another thing is that the same bookend problem leads to very different
behavior depending on whether the file extent has an adjacent extent.

To me, the most basic defrag just re-dirties all extents, no matter
what (and I believe some older filesystems do exactly that?).

It's us doing extra checks to avoid wasting IO on extents that are
already perfectly good.

>
> The modes of operation:
>
> - current defrag, move if filters and conditons allow that
> - defrag + garbage collect extents
> - just garbage collect extents

Thus I don't really think there is any different between garbage
collection or "defrag".

They are the same thing: just re-dirty the file extents.
(And as stated above, whether the result will be better is never
guaranteed.)

What we really want to do is just add extra filters to allow end users
to re-dirty the last file extent.

Thanks,
Qu
>
> The third mode is for the use case of running it on the whole filesystem
> without trying to rewrite everything.
>
> I'm not sure how it would affect send/receive; read-only subvolumes
> should not be touched.
>
Christoph Anton Mitterer Jan. 13, 2024, 3:47 a.m. UTC | #12
On Fri, 2024-01-12 at 16:58 +0100, David Sterba wrote:
> I'm not sure how it would affect send/receive; read-only subvolumes
> should not be touched.

This of course means that if the IO pattern that causes this issue
occurs, and if snapshots are used on such a fs, the supposed cure would
make the problem even worse.

Guess either some documentation should then be added somewhere,
describing the whole thing... or perhaps it should be (re-)considered
whether one could do something so that the wasted parts get
automatically freed upon truncation. Maybe some attribute that can be
set on a directory, so that files/dirs created beneath it get the
special treatment.


Cheers,
Chris.
Andrei Borzenkov Jan. 13, 2024, 8:05 a.m. UTC | #13
On 13.01.2024 06:17, Qu Wenruo wrote:
> 
> What we really want to do is just add extra filters to allow end users
> to re-dirty the last file extent.
> 

It is possible to punch holes in the middle of the file, so partial 
extents can happen anywhere.
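
Such a mid-file bookend is easy to produce with fallocate(2). A minimal sketch, with a hypothetical helper name and illustrative sizes: it writes one contiguous 12K range, then punches out the middle 4K; on btrfs the head and tail then both bookend-reference the original on-disk extent.

```c
#define _GNU_SOURCE	/* for fallocate() and FALLOC_FL_* */
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Write three 4K blocks, then punch a hole over the middle one.
 * Returns 0 on success, -1 on error (e.g. fs without hole punching). */
int punch_mid_hole(const char *path)
{
	char buf[4096];
	int fd, i;

	fd = open(path, O_CREAT | O_RDWR | O_TRUNC, 0600);
	if (fd < 0)
		return -1;

	memset(buf, 0xaa, sizeof(buf));
	for (i = 0; i < 3; i++)		/* 12K of contiguous data */
		if (write(fd, buf, sizeof(buf)) != sizeof(buf))
			goto fail;
	if (fsync(fd) < 0)
		goto fail;

	/* PUNCH_HOLE must be combined with KEEP_SIZE. */
	if (fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
		      4096, 4096) < 0)
		goto fail;

	return close(fd);
fail:
	close(fd);
	return -1;
}
```

The punched range reads back as zeros on any filesystem that supports hole punching; the bookend interpretation of the remaining head/tail references is btrfs-specific.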
Qu Wenruo Jan. 13, 2024, 8:32 a.m. UTC | #14
On 2024/1/13 18:35, Andrei Borzenkov wrote:
> On 13.01.2024 06:17, Qu Wenruo wrote:
>>
>> What we really want to do is just add extra filters to allow end users
>> to re-dirty the last file extent.
>>
>
> It is possible to punch holes in the middle of the file, so partial
> extents can happen anywhere.
>

That would change the content of a file.

Defrag (re-dirtying files) should not change the content.

If you mean zero detection, then I'd say it's a little beyond our simple
(not that simple though) defrag implementation.

Thanks,
Qu
Christoph Anton Mitterer Feb. 5, 2024, 5:39 a.m. UTC | #15
Hey there.

Is that patch still under consideration?

The filesystems on my 6.1 kernel regularly run full because of these IO
patterns... and AFAICS the patch hasn't been merged to master or
backported to the stables yet.


Any ideas how to proceed?
I mean if btrfs just breaks with that specific kind of IO pattern and
there will be no fix,... fine,... but then please tell me so that I can
switch to something else for that use case :-)


Cheers,
Chris.
Qu Wenruo Feb. 5, 2024, 5:42 a.m. UTC | #16
On 2024/2/5 16:09, Christoph Anton Mitterer wrote:
> Hey there.
>
> Is that patch still under consideration?
>
> The filesystems on my 6.1 kernel regularly run full because of these IO
> patterns... and AFAICS the patch hasn't been merged to master or
> backported to the stables yet.
>
>
> Any ideas how to proceed?
> I mean if btrfs just breaks with that specific kind of IO pattern and
> there will be no fix,... fine,... but then please tell me so that I can
> switch to something else for that use case :-)

Thanks for reminding me to update the patch.

I'll try to address the comments in the next version and hope we can
push the next version into the kernel.

Thanks,
Qu
>
>
> Cheers,
> Chris.
>
Skirnir Torvaldsson April 20, 2024, 4:30 a.m. UTC | #17
On 05.02.2024 10:42, Qu Wenruo wrote:
>
>
> On 2024/2/5 16:09, Christoph Anton Mitterer wrote:
>> Hey there.
>>
>> Is that patch still under consideration?
>>
>> The filesystems on my 6.1 kernel regularly run full because of these IO
>> patterns... and AFAICS the patch hasn't been merged to master or
>> backported to the stables yet.
>>
>>
>> Any ideas how to proceed?
>> I mean if btrfs just breaks with that specific kind of IO pattern and
>> there will be no fix,... fine,... but then please tell me so that I can
>> switch to something else for that use case :-)
>
> Thanks for reminding me to update the patch.
>
> I'll try to address the comments in the next version and hope we can
> push the next version into the kernel.
>
> Thanks,
> Qu
>>
>>
>> Cheers,
>> Chris.
>>
>
Dear developers,

Could you please update us on the status of this patch? Any hope that
this out-of-space issue will be resolved? If for some reason it cannot
be (or you are not planning to work on it in the foreseeable future),
kindly let us know.

Patch

diff --git a/fs/btrfs/defrag.c b/fs/btrfs/defrag.c
index c276b136ab63..cc319190b6fb 100644
--- a/fs/btrfs/defrag.c
+++ b/fs/btrfs/defrag.c
@@ -1070,6 +1070,17 @@  static int defrag_collect_targets(struct btrfs_inode *inode,
 		if (!next_mergeable) {
 			struct defrag_target_range *last;
 
+			/*
+			 * Specially handle extents with a utilization ratio
+			 * under 1/16 (referring to less than 1/16 of an
+			 * on-disk extent). This can happen for a truncated
+			 * large extent. If we don't add them, a truncated
+			 * file (maybe the last 4K of a 128M extent) would
+			 * never be defragged.
+			 */
+			if (em->ram_bytes < em->orig_block_len / 16)
+				goto add;
+
 			/* Empty target list, no way to merge with last entry */
 			if (list_empty(target_list))
 				goto next;
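
The threshold comparison in the hunk can be exercised in isolation. A minimal userspace sketch: the helper name is ours, and the interpretation of the two lengths as "referenced bytes" vs. "on-disk extent length" follows the patch's comment, not verified kernel `extent_map` semantics.

```c
#include <stdbool.h>
#include <stdint.h>

/* Mirror of the patch's check: an extent counts as under-utilized when
 * the referenced bytes fall below 1/16 (6.25%) of the on-disk extent. */
static bool is_under_utilized(uint64_t referenced_bytes,
			      uint64_t on_disk_len)
{
	return referenced_bytes < on_disk_len / 16;
}
```

With these semantics, the last 4K of a 128M extent (the reported case) is flagged as a defrag target, while an extent referenced in full, or at exactly 1/16, is not.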