Message ID | 2c2fac36b67c97c9955eb24a97c6f3c09d21c7ff.1704440000.git.wqu@suse.com
---|---
State | New, archived
Series | btrfs: defrag: add under utilized extent to defrag target list
On 05.01.2024 10:33, Qu Wenruo wrote:
> [BUG]
> The following script can lead to a very under utilized extent and we
> have no way to use defrag to properly reclaim its wasted space:
>
>   # mkfs.btrfs -f $dev
>   # mount $dev $mnt
>   # xfs_io -f -c "pwrite 0 128M" $mnt/foobar
>   # sync
>   # btrfs filesystem defrag $mnt/foobar
>   # sync
>
> After the above operations, the file "foobar" is still utilizing the
> whole 128M:
>
>   item 4 key (257 INODE_ITEM 0) itemoff 15883 itemsize 160
>           generation 7 transid 8 size 4096 nbytes 4096
>           block group 0 mode 100600 links 1 uid 0 gid 0 rdev 0
>           sequence 32770 flags 0x0(none)
>   item 5 key (257 INODE_REF 256) itemoff 15869 itemsize 14
>           index 2 namelen 4 name: file
>   item 6 key (257 EXTENT_DATA 0) itemoff 15816 itemsize 53
>           generation 7 type 1 (regular)
>           extent data disk byte 298844160 nr 134217728 <<<
>           extent data offset 0 nr 4096 ram 134217728
>           extent compression 0 (none)
>
> Meaning the expected defrag way to reclaim the space is not working.
>
> [CAUSE]
> The file extent has no adjacent extent at all, thus all existing defrag
> code consider it a perfectly good file extent, even if it's only
> utilizing a very tiny amount of space.
>
> [FIX]
> Add a special handling for under utilized extents, currently the ratio
> is 6.25% (1/16).
>
> This would allow us to add such extent to our defrag target list,
> resulting it to be properly defragged.

This sounds like the very wrong thing to do unconditionally. You have no
knowledge of the application I/O pattern and do not know whether the
application is going to fill this extent or not. Instead of
de-fragmenting you are effectively fragmenting.

This sounds more like a target for a different operation, which may be
called "garbage collection". But it should be an explicit operation not
tied to defragmentation (or at least an explicit opt-in for
defragmentation).

> Reported-by: Christoph Anton Mitterer <calestyo@scientia.org>
> Signed-off-by: Qu Wenruo <wqu@suse.com>
> ---
>  fs/btrfs/defrag.c | 11 +++++++++++
>  1 file changed, 11 insertions(+)
>
> diff --git a/fs/btrfs/defrag.c b/fs/btrfs/defrag.c
> index c276b136ab63..cc319190b6fb 100644
> --- a/fs/btrfs/defrag.c
> +++ b/fs/btrfs/defrag.c
> @@ -1070,6 +1070,17 @@ static int defrag_collect_targets(struct btrfs_inode *inode,
>  	if (!next_mergeable) {
>  		struct defrag_target_range *last;
>
> +		/*
> +		 * Special entry point utilization ratio under 1/16 (only
> +		 * referring 1/16 of an on-disk extent).
> +		 * This can happen for a truncated large extent.
> +		 * If we don't add them, then for a truncated file
> +		 * (may be the last 4K of a 128M extent) it will never
> +		 * be defraged.
> +		 */
> +		if (em->ram_bytes < em->orig_block_len / 16)
> +			goto add;
> +
>  		/* Empty target list, no way to merge with last entry */
>  		if (list_empty(target_list))
>  			goto next;
On 2024/1/6 03:15, Andrei Borzenkov wrote:
> On 05.01.2024 10:33, Qu Wenruo wrote:
>> [BUG]
>> The following script can lead to a very under utilized extent and we
>> have no way to use defrag to properly reclaim its wasted space:
>>
>>   # mkfs.btrfs -f $dev
>>   # mount $dev $mnt
>>   # xfs_io -f -c "pwrite 0 128M" $mnt/foobar
>>   # sync
>>   # btrfs filesystem defrag $mnt/foobar
>>   # sync
>>
>> After the above operations, the file "foobar" is still utilizing the
>> whole 128M:
>>
>>   item 4 key (257 INODE_ITEM 0) itemoff 15883 itemsize 160
>>           generation 7 transid 8 size 4096 nbytes 4096
>>           block group 0 mode 100600 links 1 uid 0 gid 0 rdev 0
>>           sequence 32770 flags 0x0(none)
>>   item 5 key (257 INODE_REF 256) itemoff 15869 itemsize 14
>>           index 2 namelen 4 name: file
>>   item 6 key (257 EXTENT_DATA 0) itemoff 15816 itemsize 53
>>           generation 7 type 1 (regular)
>>           extent data disk byte 298844160 nr 134217728 <<<
>>           extent data offset 0 nr 4096 ram 134217728
>>           extent compression 0 (none)
>>
>> Meaning the expected defrag way to reclaim the space is not working.
>>
>> [CAUSE]
>> The file extent has no adjacent extent at all, thus all existing defrag
>> code consider it a perfectly good file extent, even if it's only
>> utilizing a very tiny amount of space.
>>
>> [FIX]
>> Add a special handling for under utilized extents, currently the ratio
>> is 6.25% (1/16).
>>
>> This would allow us to add such extent to our defrag target list,
>> resulting it to be properly defragged.
>
> This sounds like the very wrong thing to do unconditionally. You have no
> knowledge of application I/O pattern and do not know whether application
> is going to fill this extent or not. Instead of de-fragmenting you are
> effectively fragmenting.

This is not exactly true.

If you have one file extent only referring to 1/16 of a larger extent, it
already means the extent itself is fragmented.

Either through truncate (the case we're getting into), or tons of later
COW leading to this situation (which would also be defragged by the
existing defrag conditions).

If you want to prove it wrong, please give me a counterexample.

Thanks,
Qu

>
> This sounds more like a target for a different operation, which may be
> called "garbage collection". But it should be explicit operation not
> tied to defragmentation (or at least explicit opt-in for defragmentation).
>
>> Reported-by: Christoph Anton Mitterer <calestyo@scientia.org>
>> Signed-off-by: Qu Wenruo <wqu@suse.com>
>> ---
>>  fs/btrfs/defrag.c | 11 +++++++++++
>>  1 file changed, 11 insertions(+)
>>
>> diff --git a/fs/btrfs/defrag.c b/fs/btrfs/defrag.c
>> index c276b136ab63..cc319190b6fb 100644
>> --- a/fs/btrfs/defrag.c
>> +++ b/fs/btrfs/defrag.c
>> @@ -1070,6 +1070,17 @@ static int defrag_collect_targets(struct btrfs_inode *inode,
>>  	if (!next_mergeable) {
>>  		struct defrag_target_range *last;
>>
>> +		/*
>> +		 * Special entry point utilization ratio under 1/16 (only
>> +		 * referring 1/16 of an on-disk extent).
>> +		 * This can happen for a truncated large extent.
>> +		 * If we don't add them, then for a truncated file
>> +		 * (may be the last 4K of a 128M extent) it will never
>> +		 * be defraged.
>> +		 */
>> +		if (em->ram_bytes < em->orig_block_len / 16)
>> +			goto add;
>> +
>>  		/* Empty target list, no way to merge with last entry */
>>  		if (list_empty(target_list))
>>  			goto next;
>
On Fri, Jan 5, 2024 at 7:34 AM Qu Wenruo <wqu@suse.com> wrote:
>
> [BUG]
> The following script can lead to a very under utilized extent and we
> have no way to use defrag to properly reclaim its wasted space:
>
>   # mkfs.btrfs -f $dev
>   # mount $dev $mnt
>   # xfs_io -f -c "pwrite 0 128M" $mnt/foobar
>   # sync
>   # btrfs filesystem defrag $mnt/foobar
>   # sync

There's a missing truncate in the example.

>
> After the above operations, the file "foobar" is still utilizing the
> whole 128M:
>
>   item 4 key (257 INODE_ITEM 0) itemoff 15883 itemsize 160
>           generation 7 transid 8 size 4096 nbytes 4096
>           block group 0 mode 100600 links 1 uid 0 gid 0 rdev 0
>           sequence 32770 flags 0x0(none)
>   item 5 key (257 INODE_REF 256) itemoff 15869 itemsize 14
>           index 2 namelen 4 name: file
>   item 6 key (257 EXTENT_DATA 0) itemoff 15816 itemsize 53
>           generation 7 type 1 (regular)
>           extent data disk byte 298844160 nr 134217728 <<<
>           extent data offset 0 nr 4096 ram 134217728
>           extent compression 0 (none)
>
> Meaning the expected defrag way to reclaim the space is not working.
>
> [CAUSE]
> The file extent has no adjacent extent at all, thus all existing defrag
> code consider it a perfectly good file extent, even if it's only
> utilizing a very tiny amount of space.
>
> [FIX]
> Add a special handling for under utilized extents, currently the ratio
> is 6.25% (1/16).
>
> This would allow us to add such extent to our defrag target list,
> resulting it to be properly defragged.
>
> Reported-by: Christoph Anton Mitterer <calestyo@scientia.org>
> Signed-off-by: Qu Wenruo <wqu@suse.com>
> ---
>  fs/btrfs/defrag.c | 11 +++++++++++
>  1 file changed, 11 insertions(+)
>
> diff --git a/fs/btrfs/defrag.c b/fs/btrfs/defrag.c
> index c276b136ab63..cc319190b6fb 100644
> --- a/fs/btrfs/defrag.c
> +++ b/fs/btrfs/defrag.c
> @@ -1070,6 +1070,17 @@ static int defrag_collect_targets(struct btrfs_inode *inode,
>  	if (!next_mergeable) {
>  		struct defrag_target_range *last;
>
> +		/*
> +		 * Special entry point utilization ratio under 1/16 (only
> +		 * referring 1/16 of an on-disk extent).
> +		 * This can happen for a truncated large extent.
> +		 * If we don't add them, then for a truncated file
> +		 * (may be the last 4K of a 128M extent) it will never

may be -> maybe

> +		 * be defraged.

defraged -> defragged

> +		 */
> +		if (em->ram_bytes < em->orig_block_len / 16)

Why 1 / 16?
For a 128M extent for example, even 1 / 2 (64M) is a lot of wasted space.
So I think a better condition is needed, probably to consider the
absolute value of wasted/unused space too.

And this should use em->len and not em->ram_bytes. The latter is
preserved when splitting an extent map.
You can even notice this in the tree dump example from the change log
- the file extent item's ram bytes is 128M, not 4K.

Thanks.

> +			goto add;
> +
>  		/* Empty target list, no way to merge with last entry */
>  		if (list_empty(target_list))
>  			goto next;
> --
> 2.43.0
>
>
On Tue, Jan 9, 2024 at 2:55 PM Filipe Manana <fdmanana@kernel.org> wrote:
>
> On Fri, Jan 5, 2024 at 7:34 AM Qu Wenruo <wqu@suse.com> wrote:
> >
> > [BUG]
> > The following script can lead to a very under utilized extent and we
> > have no way to use defrag to properly reclaim its wasted space:
> >
> >   # mkfs.btrfs -f $dev
> >   # mount $dev $mnt
> >   # xfs_io -f -c "pwrite 0 128M" $mnt/foobar
> >   # sync
> >   # btrfs filesystem defrag $mnt/foobar
> >   # sync
>
> There's a missing truncate in the example.
>
> >
> > After the above operations, the file "foobar" is still utilizing the
> > whole 128M:
> >
> >   item 4 key (257 INODE_ITEM 0) itemoff 15883 itemsize 160
> >           generation 7 transid 8 size 4096 nbytes 4096
> >           block group 0 mode 100600 links 1 uid 0 gid 0 rdev 0
> >           sequence 32770 flags 0x0(none)
> >   item 5 key (257 INODE_REF 256) itemoff 15869 itemsize 14
> >           index 2 namelen 4 name: file
> >   item 6 key (257 EXTENT_DATA 0) itemoff 15816 itemsize 53
> >           generation 7 type 1 (regular)
> >           extent data disk byte 298844160 nr 134217728 <<<
> >           extent data offset 0 nr 4096 ram 134217728
> >           extent compression 0 (none)
> >
> > Meaning the expected defrag way to reclaim the space is not working.
> >
> > [CAUSE]
> > The file extent has no adjacent extent at all, thus all existing defrag
> > code consider it a perfectly good file extent, even if it's only
> > utilizing a very tiny amount of space.
> >
> > [FIX]
> > Add a special handling for under utilized extents, currently the ratio
> > is 6.25% (1/16).
> >
> > This would allow us to add such extent to our defrag target list,
> > resulting it to be properly defragged.
> >
> > Reported-by: Christoph Anton Mitterer <calestyo@scientia.org>
> > Signed-off-by: Qu Wenruo <wqu@suse.com>

Also don't forget to add a Link tag pointing to the thread at lore.

Thanks.

> > ---
> >  fs/btrfs/defrag.c | 11 +++++++++++
> >  1 file changed, 11 insertions(+)
> >
> > diff --git a/fs/btrfs/defrag.c b/fs/btrfs/defrag.c
> > index c276b136ab63..cc319190b6fb 100644
> > --- a/fs/btrfs/defrag.c
> > +++ b/fs/btrfs/defrag.c
> > @@ -1070,6 +1070,17 @@ static int defrag_collect_targets(struct btrfs_inode *inode,
> >  	if (!next_mergeable) {
> >  		struct defrag_target_range *last;
> >
> > +		/*
> > +		 * Special entry point utilization ratio under 1/16 (only
> > +		 * referring 1/16 of an on-disk extent).
> > +		 * This can happen for a truncated large extent.
> > +		 * If we don't add them, then for a truncated file
> > +		 * (may be the last 4K of a 128M extent) it will never
>
> may be -> maybe
>
> > +		 * be defraged.
>
> defraged -> defragged
>
> > +		 */
> > +		if (em->ram_bytes < em->orig_block_len / 16)
>
> Why 1 / 16?
> For a 128M extent for example, even 1 / 2 (64M) is a lot of wasted space.
> So I think a better condition is needed, probably to consider the
> absolute value of wasted/unused space too.
>
> And this should use em->len and not em->ram_bytes. The latter is
> preserved when spitting an extent map.
> You can even notice this in the tree dump example from the change log
> - the file extent's items ram bytes is 128M, not 4K.
>
> Thanks.
>
> > +			goto add;
> > +
> >  		/* Empty target list, no way to merge with last entry */
> >  		if (list_empty(target_list))
> >  			goto next;
> > --
> > 2.43.0
> >
> >
On 2024/1/10 01:25, Filipe Manana wrote:
> On Fri, Jan 5, 2024 at 7:34 AM Qu Wenruo <wqu@suse.com> wrote:
>>
>> [BUG]
>> The following script can lead to a very under utilized extent and we
>> have no way to use defrag to properly reclaim its wasted space:
>>
>>   # mkfs.btrfs -f $dev
>>   # mount $dev $mnt
>>   # xfs_io -f -c "pwrite 0 128M" $mnt/foobar
>>   # sync
>>   # btrfs filesystem defrag $mnt/foobar
>>   # sync
>
> There's a missing truncate in the example.
>
>>
>> After the above operations, the file "foobar" is still utilizing the
>> whole 128M:
>>
>>   item 4 key (257 INODE_ITEM 0) itemoff 15883 itemsize 160
>>           generation 7 transid 8 size 4096 nbytes 4096
>>           block group 0 mode 100600 links 1 uid 0 gid 0 rdev 0
>>           sequence 32770 flags 0x0(none)
>>   item 5 key (257 INODE_REF 256) itemoff 15869 itemsize 14
>>           index 2 namelen 4 name: file
>>   item 6 key (257 EXTENT_DATA 0) itemoff 15816 itemsize 53
>>           generation 7 type 1 (regular)
>>           extent data disk byte 298844160 nr 134217728 <<<
>>           extent data offset 0 nr 4096 ram 134217728
>>           extent compression 0 (none)
>>
>> Meaning the expected defrag way to reclaim the space is not working.
>>
>> [CAUSE]
>> The file extent has no adjacent extent at all, thus all existing defrag
>> code consider it a perfectly good file extent, even if it's only
>> utilizing a very tiny amount of space.
>>
>> [FIX]
>> Add a special handling for under utilized extents, currently the ratio
>> is 6.25% (1/16).
>>
>> This would allow us to add such extent to our defrag target list,
>> resulting it to be properly defragged.
>>
>> Reported-by: Christoph Anton Mitterer <calestyo@scientia.org>
>> Signed-off-by: Qu Wenruo <wqu@suse.com>
>> ---
>>  fs/btrfs/defrag.c | 11 +++++++++++
>>  1 file changed, 11 insertions(+)
>>
>> diff --git a/fs/btrfs/defrag.c b/fs/btrfs/defrag.c
>> index c276b136ab63..cc319190b6fb 100644
>> --- a/fs/btrfs/defrag.c
>> +++ b/fs/btrfs/defrag.c
>> @@ -1070,6 +1070,17 @@ static int defrag_collect_targets(struct btrfs_inode *inode,
>>  	if (!next_mergeable) {
>>  		struct defrag_target_range *last;
>>
>> +		/*
>> +		 * Special entry point utilization ratio under 1/16 (only
>> +		 * referring 1/16 of an on-disk extent).
>> +		 * This can happen for a truncated large extent.
>> +		 * If we don't add them, then for a truncated file
>> +		 * (may be the last 4K of a 128M extent) it will never
>
> may be -> maybe
>
>> +		 * be defraged.
>
> defraged -> defragged
>
>> +		 */
>> +		if (em->ram_bytes < em->orig_block_len / 16)
>
> Why 1 / 16?
> For a 128M extent for example, even 1 / 2 (64M) is a lot of wasted space.
> So I think a better condition is needed, probably to consider the
> absolute value of wasted/unused space too.

The 1/16 is chosen as a trade-off between wasted space and possible
unnecessary defrag.

E.g. the file extent is only referring to 4K of a 32K extent.
Doing a defrag would not save much space.

I can definitely go with both (ratio and absolute wasted space), but that
also means I need to find a new threshold (for wasted bytes), and I'm not
confident enough to come up with another one.

>
> And this should use em->len and not em->ram_bytes. The latter is
> preserved when spitting an extent map.

Oh indeed, would definitely fix it.

Thanks,
Qu

> You can even notice this in the tree dump example from the change log
> - the file extent's items ram bytes is 128M, not 4K.
>
> Thanks.
>
>> +			goto add;
>> +
>>  		/* Empty target list, no way to merge with last entry */
>>  		if (list_empty(target_list))
>>  			goto next;
>> --
>> 2.43.0
>>
>>
>
On Wed, 2024-01-10 at 07:34 +1030, Qu Wenruo wrote:
> The 1/16 is chosen as a trade-off between wasted space and possible
> unnecessary defrag.

Filipe's point doesn't seem so invalid... wouldn't it be possible to
make the trade-off somehow configurable, so that the user can choose it
via some command line option?

Cheers,
Chris.
On 2024/1/10 08:27, Christoph Anton Mitterer wrote:
> On Wed, 2024-01-10 at 07:34 +1030, Qu Wenruo wrote:
>> The 1/16 is chosen as a trade-off between wasted space and possible
>> unnecessary defrag.
>
> Filipe's point doesn't seem so invalid... wouldn't it be possible to
> make the trade-off somehow configurable, so that the user can choose
> via some command line option?

We have 16 bytes of extra space in the ioctl; I think we can use two u32s
to configure both the ratio and the wasted space if needed.

Unfortunately we don't have any existing checks on
btrfs_ioctl_defrag_range_args::flags, which means we have no way to
detect if the kernel supports new flags.

I'll start by adding such rejection first, then we can start adding new
btrfs_ioctl_defrag_range_args flags/options to support a configurable
ratio/wasted bytes.

Thanks,
Qu

>
> Cheers,
> Chris.
On Fri, Jan 05, 2024 at 06:03:40PM +1030, Qu Wenruo wrote:
> [BUG]
> The following script can lead to a very under utilized extent and we
> have no way to use defrag to properly reclaim its wasted space:
>
>   # mkfs.btrfs -f $dev
>   # mount $dev $mnt
>   # xfs_io -f -c "pwrite 0 128M" $mnt/foobar
>   # sync
>   # btrfs filesystem defrag $mnt/foobar
>   # sync

I don't see what's wrong with this example; as Filipe noted there's a
truncate missing, but still this should be explained better.

Is this the problem when an overwritten and shared extent is partially
overwritten but still occupying the whole range, aka. a bookend extent?
If yes, defrag was never meant to deal with that, though we could use
the interface for that.

As Andrei pointed out, this is more like garbage collection, getting rid
of an extent that is partially unreachable. Detecting such an extent
requires looking for the unreferenced part of the extent, while
defragmentation deals with live data. This could be a new ioctl entirely
too. But first I'd like to know if we're talking about the same thing.
On 2024/1/11 03:39, David Sterba wrote:
> On Fri, Jan 05, 2024 at 06:03:40PM +1030, Qu Wenruo wrote:
>> [BUG]
>> The following script can lead to a very under utilized extent and we
>> have no way to use defrag to properly reclaim its wasted space:
>>
>>   # mkfs.btrfs -f $dev
>>   # mount $dev $mnt
>>   # xfs_io -f -c "pwrite 0 128M" $mnt/foobar
>>   # sync
>>   # btrfs filesystem defrag $mnt/foobar
>>   # sync
>
> I don't see what's wrong with this example, as Filipe noted there's a
> truncate missing, but still this should be explained better.

Sorry, the full explanation looks like this:

After the above truncation, we get the following file extent layout:

  item 6 key (257 EXTENT_DATA 0) itemoff 15813 itemsize 53
          generation 7 type 1 (regular)
          extent data disk byte 298844160 nr 134217728
          extent data offset 0 nr 4096 ram 134217728
          extent compression 0 (none)

That would be the last 4K referring to that 128M extent, aka wasting
(128M - 4K) bytes, or 99.99695% of the extent.

Normally we expect defrag to properly re-dirty the extent so that we can
free that 128M extent.
But defrag won't touch it at all, mostly because there is no next extent
to merge.

> Is this the problem when an overwritten and shared extent is partially
> overwritten but still occupying the whole range, aka. bookend extent?
> If yes, defrag was never meant to deal with that, though we could use
> the interface for that.

If we don't go through defrag, there is really no good way to do it
safely.

Sure, you can copy the file to another non-btrfs location or dd it.
But that's not safe if there is still some process accessing it, etc.

> As Andrei pointed out, this is more like a garbage collection, get rid
> of extent that is partially unreachable. Detecting such extent requires
> looking for the unreferenced part of the extent while defragmentation
> deals with live data. This could be a new ioctl entirely too. But first
> I'd like to know if we're talking about the same thing.

Yes, we're talking about the bookend problem.
As I would expect defrag to free most, if not all, such bookend extents.
(And that's exactly what I recommended to the initial reporter.)

Thanks,
Qu
On Thu, Jan 11, 2024 at 04:54:47PM +1030, Qu Wenruo wrote:
>
> On 2024/1/11 03:39, David Sterba wrote:
> > On Fri, Jan 05, 2024 at 06:03:40PM +1030, Qu Wenruo wrote:
> >> [BUG]
> >> The following script can lead to a very under utilized extent and we
> >> have no way to use defrag to properly reclaim its wasted space:
> >>
> >>   # mkfs.btrfs -f $dev
> >>   # mount $dev $mnt
> >>   # xfs_io -f -c "pwrite 0 128M" $mnt/foobar
> >>   # sync
> >>   # btrfs filesystem defrag $mnt/foobar
> >>   # sync
> >
> > I don't see what's wrong with this example, as Filipe noted there's a
> > truncate missing, but still this should be explained better.
>
> Sorry, the full explanation looks like this:
>
> After above truncation, we will got the following file extent layout:
>
>   item 6 key (257 EXTENT_DATA 0) itemoff 15813 itemsize 53
>           generation 7 type 1 (regular)
>           extent data disk byte 298844160 nr 134217728
>           extent data offset 0 nr 4096 ram 134217728
>           extent compression 0 (none)
>
> That would be the last 4K referring that 128M extent, aka, wasted
> (128M-4K) bytes, or 99.695% of the extent.

Ok, so it's the known issue.

> Normally we expect defrag to properly re-dirty the extent so that we can
> free that 128M extent.
> But defrag won't touch it at all, mostly due to there is no next extent
> to merge.
>
> > Is this the problem when an overwritten and shared extent is partially
> > overwritten but still occupying the whole range, aka. bookend extent?
> > If yes, defrag was never meant to deal with that, though we could use
> > the interface for that.
>
> If we don't go defrag, there is really no good way to do it safely.
>
> Sure you can copy the file to another non-btrfs location or dd it.
> But that's not safe if there is still some process accessing it etc.
>
> > As Andrei pointed out, this is more like a garbage collection, get rid
> > of extent that is partially unreachable. Detecting such extent requires
> > looking for the unreferenced part of the extent while defragmentation
> > deals with live data. This could be a new ioctl entirely too. But first
> > I'd like to know if we're talking about the same thing.
>
> Yes, we're talking about the bookend problem.
> As I would expect defrag to free most, if not all, such bookend extents.
> (And that's exactly what I recommend to the initial report)

Here defrag can mean two things, the interface (ioctl and command)
and the implementation. As defrag tries to merge adjacent extents or
coalesce small extents and move them to a new location, this may not
always be necessary just to get rid of the unreachable extent parts.

From the interface side, we can add a mode that does only the garbage
collection, effectively just looking up the unreachable parts, trimming
the extents but leaving the live data intact.

The modes of operation:

- current defrag, move if filters and conditions allow that
- defrag + garbage collect extents
- just garbage collect extents

The third mode is for the use case of letting it run on the whole
filesystem but not trying to rewrite everything.

I'm not sure how it would affect send/receive; read-only subvolumes
should not be touched.
On 2024/1/13 02:28, David Sterba wrote:
> On Thu, Jan 11, 2024 at 04:54:47PM +1030, Qu Wenruo wrote:
>>
>> On 2024/1/11 03:39, David Sterba wrote:
>>> On Fri, Jan 05, 2024 at 06:03:40PM +1030, Qu Wenruo wrote:
>>>> [BUG]
>>>> The following script can lead to a very under utilized extent and we
>>>> have no way to use defrag to properly reclaim its wasted space:
>>>>
>>>>   # mkfs.btrfs -f $dev
>>>>   # mount $dev $mnt
>>>>   # xfs_io -f -c "pwrite 0 128M" $mnt/foobar
>>>>   # sync
>>>>   # btrfs filesystem defrag $mnt/foobar
>>>>   # sync
>>>
>>> I don't see what's wrong with this example, as Filipe noted there's a
>>> truncate missing, but still this should be explained better.
>>
>> Sorry, the full explanation looks like this:
>>
>> After above truncation, we will got the following file extent layout:
>>
>>   item 6 key (257 EXTENT_DATA 0) itemoff 15813 itemsize 53
>>           generation 7 type 1 (regular)
>>           extent data disk byte 298844160 nr 134217728
>>           extent data offset 0 nr 4096 ram 134217728
>>           extent compression 0 (none)
>>
>> That would be the last 4K referring that 128M extent, aka, wasted
>> (128M-4K) bytes, or 99.695% of the extent.
>
> Ok, so it's the known issue.
>
>> Normally we expect defrag to properly re-dirty the extent so that we can
>> free that 128M extent.
>> But defrag won't touch it at all, mostly due to there is no next extent
>> to merge.
>>
>>> Is this the problem when an overwritten and shared extent is partially
>>> overwritten but still occupying the whole range, aka. bookend extent?
>>> If yes, defrag was never meant to deal with that, though we could use
>>> the interface for that.
>>
>> If we don't go defrag, there is really no good way to do it safely.
>>
>> Sure you can copy the file to another non-btrfs location or dd it.
>> But that's not safe if there is still some process accessing it etc.
>>
>>> As Andrei pointed out, this is more like a garbage collection, get rid
>>> of extent that is partially unreachable. Detecting such extent requires
>>> looking for the unreferenced part of the extent while defragmentation
>>> deals with live data. This could be a new ioctl entirely too. But first
>>> I'd like to know if we're talking about the same thing.
>>
>> Yes, we're talking about the bookend problem.
>> As I would expect defrag to free most, if not all, such bookend extents.
>> (And that's exactly what I recommend to the initial report)
>
> Here the defrag can mean two things, the interface (ioctl and command)
> and the implementation. As defrag tries to merge adjacent extents or
> coalesce small extents and move it to a new location, this may not be
> always necessary just to get rid of the unreachable extent parts.

To me, defrag just means re-dirtying the file range.

Whether it would result in a contiguous extent or lead to more fragments
is not ensured. (E.g. defragging a fragmented file while the fs itself is
also super fragmented, or having to do writeback very often due to very
high memory pressure.)

>
> From the interface side, we can add a mode that does only the garbage
> collection, effectively just looking up the unreachable parts, trimming
> the extents but leaving the live data intact.

Another thing is, the same bookend problem would lead to very different
behavior, based on whether the file extent has an adjacent extent.

To me, the very basic defrag is just re-dirtying all extents, no matter
what (and I believe some older fs is doing exactly that?).
It's us doing extra checks to avoid wasting IO on already very good
extents.

>
> The modes of operation:
>
> - current defrag, move if filters and conditons allow that
> - defrag + garbage collect extents
> - just garbage collect extents

Thus I don't really think there is any difference between garbage
collection and "defrag". They are the same thing, just re-dirtying the
file extents. (And as stated above, whether the result would be better is
never ensured.)

What we really want to do is just add extra filters to allow end users
to re-dirty the last file extent.

Thanks,
Qu

>
> The third mode is for use case to let it run on the whole filesystem but
> not try to rewrite everything.
>
> I'm not sure how would it affect send/receive, read-only subvolumes
> should not be touched.
>
On Fri, 2024-01-12 at 16:58 +0100, David Sterba wrote:
> I'm not sure how would it affect send/receive, read-only subvolumes
> should not be touched.

This of course means that if the IO pattern that causes this issue
happens, and if snapshots are used on such a fs, the supposed cure would
make the problem even worse.

I guess either some documentation should then be added somewhere,
describing the whole thing... or perhaps it should be (re-)considered
whether one could do something so that the wasted parts get automatically
freed upon truncation.

Maybe some attribute that can be set on a directory, with files/dirs
created beneath it getting the special treatment.

Cheers,
Chris.
On 13.01.2024 06:17, Qu Wenruo wrote:
>
> What we really want to do is, just add extra filters to allow end users
> to re-dirty the last file extent.
>

It is possible to punch holes in the middle of the file, so partial
extents can happen anywhere.
On 2024/1/13 18:35, Andrei Borzenkov wrote:
> On 13.01.2024 06:17, Qu Wenruo wrote:
>>
>> What we really want to do is, just add extra filters to allow end users
>> to re-dirty the last file extent.
>>
>
> It is possible to punch holes in the middle of the file, so partial
> extents can happen anywhere.
>

That would change the content of a file.
Defrag (re-dirtying files) should not change the content.

If you mean zero detection, then I'd say it's a little beyond our simple
(not that simple though) defrag implementation.

Thanks,
Qu
Hey there.

Is that patch still under consideration?

The filesystems on my 6.1 kernel regularly run full because of these IO
patterns... and AFAICS the patch hasn't been merged to master or
backported to the stables yet.

Any ideas how to proceed?
I mean if btrfs just breaks with that specific kind of IO pattern and
there will be no fix,... fine,... but then please say so, so that I can
switch to something else for that use case :-)

Cheers,
Chris.
On 2024/2/5 16:09, Christoph Anton Mitterer wrote:
> Hey there.
>
> Is that patch still under consideration?
>
> The filesystems on my 6.1 kernel regularly run full because of these IO
> patterns... and AFAICS the patch hasn't been merged to master or
> backported to the stables yet.
>
> Any ideas how to proceed?
> I mean if btrfs just break with that specific kind of IO patterns and
> there will be no fix,... fine,... but then please tell so that I can
> witch to something else for that use case :-)

Thanks for reminding me to update the patch.

I'll try to address the comments in the next version, and hopefully we
can push the next version into the kernel.

Thanks,
Qu

>
> Cheers,
> Chris.
>
On 05.02.2024 10:42, Qu Wenruo wrote:
>
> On 2024/2/5 16:09, Christoph Anton Mitterer wrote:
>> Hey there.
>>
>> Is that patch still under consideration?
>>
>> The filesystems on my 6.1 kernel regularly run full because of these IO
>> patterns... and AFAICS the patch hasn't been merged to master or
>> backported to the stables yet.
>>
>> Any ideas how to proceed?
>> I mean if btrfs just break with that specific kind of IO patterns and
>> there will be no fix,... fine,... but then please tell so that I can
>> witch to something else for that use case :-)
>
> Thanks for reminding me to update the patch.
>
> I'll try to address the comments in the next version and hopes we can
> push the next version into the kernel.
>
> Thanks,
> Qu

Dear developers,

Could you please update us on the status of this patch? Any hope this
out-of-space issue will be resolved?

If for some reason it cannot be (or you are not planning to work on it
in the foreseeable future), kindly let us know.
diff --git a/fs/btrfs/defrag.c b/fs/btrfs/defrag.c
index c276b136ab63..cc319190b6fb 100644
--- a/fs/btrfs/defrag.c
+++ b/fs/btrfs/defrag.c
@@ -1070,6 +1070,17 @@ static int defrag_collect_targets(struct btrfs_inode *inode,
 	if (!next_mergeable) {
 		struct defrag_target_range *last;
 
+		/*
+		 * Special entry point utilization ratio under 1/16 (only
+		 * referring 1/16 of an on-disk extent).
+		 * This can happen for a truncated large extent.
+		 * If we don't add them, then for a truncated file
+		 * (may be the last 4K of a 128M extent) it will never
+		 * be defraged.
+		 */
+		if (em->ram_bytes < em->orig_block_len / 16)
+			goto add;
+
 		/* Empty target list, no way to merge with last entry */
 		if (list_empty(target_list))
 			goto next;
[BUG]
The following script can lead to a very under utilized extent and we
have no way to use defrag to properly reclaim its wasted space:

  # mkfs.btrfs -f $dev
  # mount $dev $mnt
  # xfs_io -f -c "pwrite 0 128M" $mnt/foobar
  # sync
  # btrfs filesystem defrag $mnt/foobar
  # sync

After the above operations, the file "foobar" is still utilizing the
whole 128M:

  item 4 key (257 INODE_ITEM 0) itemoff 15883 itemsize 160
          generation 7 transid 8 size 4096 nbytes 4096
          block group 0 mode 100600 links 1 uid 0 gid 0 rdev 0
          sequence 32770 flags 0x0(none)
  item 5 key (257 INODE_REF 256) itemoff 15869 itemsize 14
          index 2 namelen 4 name: file
  item 6 key (257 EXTENT_DATA 0) itemoff 15816 itemsize 53
          generation 7 type 1 (regular)
          extent data disk byte 298844160 nr 134217728 <<<
          extent data offset 0 nr 4096 ram 134217728
          extent compression 0 (none)

Meaning the expected defrag way to reclaim the space is not working.

[CAUSE]
The file extent has no adjacent extent at all, thus all existing defrag
code consider it a perfectly good file extent, even if it's only
utilizing a very tiny amount of space.

[FIX]
Add a special handling for under utilized extents, currently the ratio
is 6.25% (1/16).

This would allow us to add such extent to our defrag target list,
resulting it to be properly defragged.

Reported-by: Christoph Anton Mitterer <calestyo@scientia.org>
Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/defrag.c | 11 +++++++++++
 1 file changed, 11 insertions(+)