mbox series

[0/9] introduce defrag to xfs_spaceman

Message ID 20240709191028.2329-1-wen.gang.wang@oracle.com (mailing list archive)
Headers show
Series introduce defrag to xfs_spaceman | expand

Message

Wengang Wang July 9, 2024, 7:10 p.m. UTC
This patch set introduces defrag to xfs_spaceman command. It has the functionality and
features below (also subject to be added to man page, so please review):

       defrag [-f free_space] [-i idle_time] [-s segment_size] [-n] [-a]
              defrag defragments the specified XFS file online non-exclusively. The target XFS
              doesn't have to (and must not) be unmunted.  When defragmentation is in progress, file
              IOs are served 'in parallel'.  reflink feature must be enabled in the XFS.

              Defragmentation and file IOs

              The target file is virtually devided into many small segments. Segments are the
              smallest units for defragmentation. Each segment is defragmented one by one in a
              lock->defragment->unlock->idle manner. File IOs are blocked when the target file is
              locked and are served during the defragmentation idle time (file is unlocked). Though
              the file IOs can't really go in parallel, they are not blocked long. The locking time
              basically depends on the segment size. Smaller segments usually take less locking time
              and thus IOs are blocked shorterly, bigger segments usually need more locking time and
              IOs are blocked longer. Check -s and -i options to balance the defragmentation and IO
              service.

              Temporary file

              A temporary file is used for the defragmentation. The temporary file is created in the
              same directory as the target file is and is named ".xfsdefrag_<pid>". It is a sparse
              file and contains a defragmentation segment at a time. The temporary file is removed
              automatically when defragmentation is done or is cancelled by ctrl-c. It remains in
              case kernel crashes when defragmentation is going on. In that case, the temporary file
              has to be removed manaully.

              Free blocks consumption

              Defragmenation works by (trying) allocating new (contiguous) blocks, copying data and
              then freeing old (non-contig) blocks. Usually the number of old blocks to free equals
              to the number the newly allocated blocks. As a finally result, defragmentation doesn't
              consume free blocks. Well, that is true if the target file is not sharing blocks with
              other files.  In case the target file contains shared blocks, those shared blocks won't
              be freed back to filesystem as they are still owned by other files. So defragmenation
              allocates more blocks than it frees.  For existing XFS, free blocks might be over-
              committed when reflink snapshots were created. To avoid causing the XFS running into
              low free blocks state, this defragmentation excludes (partially) shared segments when
              the file system free blocks reaches a shreshold. Check the -f option.

              Safty and consistency

              The defragmentation file is guanrantted safe and data consistent for ctrl-c and kernel
              crash.

              First extent share

              Current kernel has routine for each segment defragmentation detecting if the file is
              sharing blocks. It takes long in case the target file contains huge number of extents
              and the shared ones, if there is, are at the end. The First extent share feature works
              around above issue by making the first serveral blocks shared. Seeing the first blocks
              are shared, the kernel routine ends quickly. The side effect is that the "share" flag
              would remain traget file. This feature is enabled by default and can be disabled by -n
              option.

              extsize and cowextsize

              According to kernel implementation, extsize and cowextsize could have following impacts
              to defragmentation: 1) non-zero extsize causes separated block allocations for each
              extent in the segment and those blocks are not contiguous. The segment remains same
              number of extents after defragmention (no effect).  2) When extsize and/or cowextsize
              are too big, a lot of pre-allocated blocks remain in memory for a while. When new IO
              comes to whose pre-allocated blocks  Copy on Write happens and causes the file
              fragmented.

              Readahead

              Readahead tries to fetch the data blocks for next segment with less locking in
              backgroud during idle time. This feature is disabled by default, use -a to enable it.

              The command takes the following options:
                 -f free_space
                     The shreshold of XFS free blocks in MiB. When free blocks are less than this
                     number, (partially) shared segments are excluded from defragmentation. Default
                     number is 1024

                 -i idle_time
                     The time in milliseconds, defragmentation enters idle state for this long after
                     defragmenting a segment and before handing the next. Default number is TOBEDONE.

                 -s segment_size
                     The size limitation in bytes of segments. Minimium number is 4MiB, default
                     number is 16MiB.

                 -n  Disable the First extent share feature. Enabled by default.

                 -a  Enable readahead feature, disabled by default.

We tested with real customer metadump with some different 'idle_time's and found 250ms is good pratice
sleep time. Here comes some number of the test:

Test: running of defrag on the image file which is used for the back end of a block device in a
      virtual machine. At the same time, fio is running at the same time inside virtual machine
      on that block device.
block device type:   NVME
File size:           200GiB
paramters to defrag: free_space: 1024 idle_time: 250 First_extent_share: enabled readahead: disabled
Defrag run time:     223 minutes
Number of extents:   6745489(before) -> 203571(after)
Fio read latency:    15.72ms(without defrag) -> 14.53ms(during defrag)
Fio write latency:   32.21ms(without defrag) -> 20.03ms(during defrag)


Wengang Wang (9):
  xfsprogs: introduce defrag command to spaceman
  spaceman/defrag: pick up segments from target file
  spaceman/defrag: defrag segments
  spaceman/defrag: ctrl-c handler
  spaceman/defrag: exclude shared segments on low free space
  spaceman/defrag: workaround kernel xfs_reflink_try_clear_inode_flag()
  spaceman/defrag: sleeps between segments
  spaceman/defrag: readahead for better performance
  spaceman/defrag: warn on extsize

 spaceman/Makefile |   2 +-
 spaceman/defrag.c | 788 ++++++++++++++++++++++++++++++++++++++++++++++
 spaceman/init.c   |   1 +
 spaceman/space.h  |   1 +
 4 files changed, 791 insertions(+), 1 deletion(-)
 create mode 100644 spaceman/defrag.c

Comments

Dave Chinner July 15, 2024, 11:03 p.m. UTC | #1
[ Please keep documentation text to 80 columns. ] 

[ Please run documentation through a spell checker - there are too
many typos in this document to point them all out... ]

On Tue, Jul 09, 2024 at 12:10:19PM -0700, Wengang Wang wrote:
> This patch set introduces defrag to xfs_spaceman command. It has the functionality and
> features below (also subject to be added to man page, so please review):

What's the use case for this?

>        defrag [-f free_space] [-i idle_time] [-s segment_size] [-n] [-a]
>               defrag defragments the specified XFS file online non-exclusively. The target XFS

What's "non-exclusively" mean? How is this different to what xfs_fsr
does?

>               doesn't have to (and must not) be unmunted.  When defragmentation is in progress, file
>               IOs are served 'in parallel'.  reflink feature must be enabled in the XFS.

xfs_fsr allows IO to occur in parallel to defrag.

>               Defragmentation and file IOs
> 
>               The target file is virtually devided into many small segments. Segments are the
>               smallest units for defragmentation. Each segment is defragmented one by one in a
>               lock->defragment->unlock->idle manner.

Userspace can't easily lock the file to prevent concurrent access.
So I'mnot sure what you are refering to here.

>               File IOs are blocked when the target file is locked and are served during the
>               defragmentation idle time (file is unlocked).

What file IOs are being served in parallel? The defragmentation IO?
something else?

>               Though
>               the file IOs can't really go in parallel, they are not blocked long. The locking time
>               basically depends on the segment size. Smaller segments usually take less locking time
>               and thus IOs are blocked shorterly, bigger segments usually need more locking time and
>               IOs are blocked longer. Check -s and -i options to balance the defragmentation and IO
>               service.

How is a user supposed to know what the correct values are for their
storage, files, and workload? Algorithms should auto tune, not
require users and administrators to use trial and error to find the
best numbers to feed a given operation.

>               Temporary file
> 
>               A temporary file is used for the defragmentation. The temporary file is created in the
>               same directory as the target file is and is named ".xfsdefrag_<pid>". It is a sparse
>               file and contains a defragmentation segment at a time. The temporary file is removed
>               automatically when defragmentation is done or is cancelled by ctrl-c. It remains in
>               case kernel crashes when defragmentation is going on. In that case, the temporary file
>               has to be removed manaully.

O_TMPFILE, as Darrick has already requested.

> 
>               Free blocks consumption
> 
>               Defragmenation works by (trying) allocating new (contiguous) blocks, copying data and
>               then freeing old (non-contig) blocks. Usually the number of old blocks to free equals
>               to the number the newly allocated blocks. As a finally result, defragmentation doesn't
>               consume free blocks.  Well, that is true if the target file is not sharing blocks with
>               other files.

This is really hard to read. Defragmentation will -always- consume
free space while it is progress. It will always release the
temporary space it consumes when it completes.

>               In case the target file contains shared blocks, those shared blocks won't
>               be freed back to filesystem as they are still owned by other files. So defragmenation
>               allocates more blocks than it frees.

So this is doing an unshare operation as well as defrag? That seems
... suboptimal. The whole point of sharing blocks is to minimise
disk usage for duplicated data.

>               For existing XFS, free blocks might be over-
>               committed when reflink snapshots were created. To avoid causing the XFS running into
>               low free blocks state, this defragmentation excludes (partially) shared segments when
>               the file system free blocks reaches a shreshold. Check the -f option.

Again, how is the user supposed to know when they need to do this?
If the answer is "they should always avoid defrag on low free
space", then why is this an option?

>               Safty and consistency
> 
>               The defragmentation file is guanrantted safe and data consistent for ctrl-c and kernel
>               crash.

Which file is the "defragmentation file"? The source or the temp
file?

>               First extent share
> 
>               Current kernel has routine for each segment defragmentation detecting if the file is
>               sharing blocks.

I have no idea what this means, or what interface this refers to.

>               It takes long in case the target file contains huge number of extents
>               and the shared ones, if there is, are at the end. The First extent share feature works
>               around above issue by making the first serveral blocks shared. Seeing the first blocks
>               are shared, the kernel routine ends quickly. The side effect is that the "share" flag
>               would remain traget file. This feature is enabled by default and can be disabled by -n
>               option.

And from this description, I have no idea what this is doing, what
problem it is trying to work around, or why we'd want to share
blocks out of a file to speed up detection of whether there are
shared blocks in the file. This description doesn't make any sense
to me because I don't know what interface you are actually having
performance issues with. Please reference the kernel code that is
problematic, and explain why the existing kernel code is problematic
and cannot be fixed.

>               extsize and cowextsize
> 
>               According to kernel implementation, extsize and cowextsize could have following impacts
>               to defragmentation: 1) non-zero extsize causes separated block allocations for each
>               extent in the segment and those blocks are not contiguous.

Extent size hints do no such thing. The simply provide extent
alignment guidelines and do not affect things like contiguous or
multi-block allocation lengths.

>               The segment remains same
>               number of extents after defragmention (no effect).  2) When extsize and/or cowextsize
>               are too big, a lot of pre-allocated blocks remain in memory for a while. When new IO
>               comes to whose pre-allocated blocks  Copy on Write happens and causes the file
>               fragmented.

extsize based unwritten extents won't cause COW or cause
fragmentation because they aren't shared and they are contiguous.
I suspect that your definition of "fragmented" isn't taking into
account that unwritten-written-unwritten over a contiguous range
is *not* fragmentation. It's just a contiguous extent in different
states, and this should really not be touched/changed by
defragmentation.

check out xfs_fsr: it ensures that the pattern of unwritten/written
blocks in the defragmented file is identical to the source. i.e. it
preserves preallocation because the application/fs config wants it
to be there....

>               Readahead
> 
>               Readahead tries to fetch the data blocks for next segment with less locking in
>               backgroud during idle time. This feature is disabled by default, use -a to enable it.

What are you reading ahead into? Kernel page cache or user buffers?
Either way, it's hardly what I'd call "idle time" if the defrag
process is using it to issue lots of read IO...


>               The command takes the following options:
>                  -f free_space
>                      The shreshold of XFS free blocks in MiB. When free blocks are less than this
>                      number, (partially) shared segments are excluded from defragmentation. Default
>                      number is 1024

When you are down to 4MB of free space in the filesystem, you
shouldn't even be trying to run defrag because all the free space
that will be left in the filesystem is single blocks. I would have
expected this sort of number to be in a percentage of capacity,
defaulting to something like 5% (which is where we start running low
space algorithms in the kernel).

>                  -i idle_time
>                      The time in milliseconds, defragmentation enters idle state for this long after
>                      defragmenting a segment and before handing the next. Default number is TOBEDONE.

Yeah, I don't think this is something anyonce whould be expected to
use or tune. If an idle time is needed, the defrag application
should be selecting this itself.
> 
>                  -s segment_size
>                      The size limitation in bytes of segments. Minimium number is 4MiB, default
>                      number is 16MiB.

Why were these numbers chosen? What happens if the file has ~32MB
sized extents and the user wants the file to be returned to a single
large contiguous extent it possible? i.e. how is the user supposed
to know how to set this for any given file without first having
examined the exact pattern of fragmentations in the file?

>                  -n  Disable the First extent share feature. Enabled by default.

So confusing.  Is the "feature disable flag" enabled by default, or
is the feature enabled by default?

>                  -a  Enable readahead feature, disabled by default.

Same confusion, but opposite logic.

I would highly recommend that you get a native english speaker to
review, spell and grammar check the documentation before the next
time you post it.

> We tested with real customer metadump with some different 'idle_time's and found 250ms is good pratice
> sleep time. Here comes some number of the test:
> 
> Test: running of defrag on the image file which is used for the back end of a block device in a
>       virtual machine. At the same time, fio is running at the same time inside virtual machine
>       on that block device.
> block device type:   NVME
> File size:           200GiB
> paramters to defrag: free_space: 1024 idle_time: 250 First_extent_share: enabled readahead: disabled
> Defrag run time:     223 minutes
> Number of extents:   6745489(before) -> 203571(after)

So and average extent size of ~32kB before, 100MB after? How much of
these are shared extents?

Runtime is 13380secs, so if we copied 200GiB in that time, the
defrag ran at 16MB/s. That's not very fast.

What's the CPU utilisation of the defrag task and kernel side
processing? What is the difference between "first_extent_share"
enabled and disabled (both performance numbers and CPU usage)?

> Fio read latency:    15.72ms(without defrag) -> 14.53ms(during defrag)
> Fio write latency:   32.21ms(without defrag) -> 20.03ms(during defrag)

So the IO latency is *lower* when defrag is running? That doesn't
make any sense, unless the fio throughput is massively reduced while
defrag is running.  What's the throughput change in the fio
workload? What's the change in worst case latency for the fio
workload? i.e. post the actual fio results so we can see the whole
picture of the behaviour, not just a single cherry-picked number.

Really, though, I have to ask: why is this an xfs_spaceman command
and not something built into the existing online defrag program
we have (xfs_fsr)?

I'm sure I'll hav emore questions as I go through the code - I'll
start at the userspace IO engine part of the patchset so I have some
idea of what the defrag algorithm actually is...

-Dave.
Wengang Wang July 16, 2024, 7:45 p.m. UTC | #2
> On Jul 15, 2024, at 4:03 PM, Dave Chinner <david@fromorbit.com> wrote:
> 
> [ Please keep documentation text to 80 columns. ] 
> 

Yes. This is not a patch. I copied it from the man 8 output.
It will be limited to 80 columns when sent as a patch.

> [ Please run documentation through a spell checker - there are too
> many typos in this document to point them all out... ]

OK.

> 
> On Tue, Jul 09, 2024 at 12:10:19PM -0700, Wengang Wang wrote:
>> This patch set introduces defrag to xfs_spaceman command. It has the functionality and
>> features below (also subject to be added to man page, so please review):
> 
> What's the use case for this?

This is the user space defrag as you suggested previously.

Please see the previous conversation for your reference: 
https://patchwork.kernel.org/project/xfs/cover/20231214170530.8664-1-wen.gang.wang@oracle.com/

COPY STARTS —————————————> 
I am copying your last comment there:

On Tue, Dec 19, 2023 at 09:17:31PM +0000, Wengang Wang wrote:
> Hi Dave,
> Yes, the user space defrag works and satisfies my requirement (almost no change from your example code).

That's good to know :)

> Let me know if you want it in xfsprog.

Yes, i think adding it as an xfs_spaceman command would be a good
way for this defrag feature to be maintained for anyone who has need
for it.

-Dave.
<———————————————— COPY ENDS

> 
>>       defrag [-f free_space] [-i idle_time] [-s segment_size] [-n] [-a]
>>              defrag defragments the specified XFS file online non-exclusively. The target XFS
> 
> What's "non-exclusively" mean? How is this different to what xfs_fsr
> does?
> 

I think you have seen the difference when you reviewing more of this set.
Well, if I read xfs_fsr code correctly, though xfs_fsr allow parallel writes, it looks have a problem(?)
As I read the code, Xfs_fsr do the followings to defrag one file:
1) preallocating blocks to a temporary file hoping the temporary get same number of blocks as the
    file under defrag with with less extents.
2) copy data blocks from the file under defrag to the temporary file.
3) switch the extents between the two files.

For stage 2, it’s NOT copying data blocks in atomic manner. Take an example: there need two
Read->write pair to complete the data copy, that is
    Copy range 1 (read range 1 from the file under defrag to the temporary file)
    Copy range 2

If a new write come to the file (range 1) under defrag after copying range1 is done. After the defrag
(xfs_fsr) finished, will the new write lose?

I didn’t look into the extents-switch code, don’t know if that check if the two files has same data contents.
But even it does, it would be pretty slow with file locked.    


>>              doesn't have to (and must not) be unmunted.  When defragmentation is in progress, file
>>              IOs are served 'in parallel'.  reflink feature must be enabled in the XFS.
> 
> xfs_fsr allows IO to occur in parallel to defrag.

Pls see my concern above.

> 
>>              Defragmentation and file IOs
>> 
>>              The target file is virtually devided into many small segments. Segments are the
>>              smallest units for defragmentation. Each segment is defragmented one by one in a
>>              lock->defragment->unlock->idle manner.
> 
> Userspace can't easily lock the file to prevent concurrent access.
> So I'mnot sure what you are refering to here.

The manner is not simply meant what is done at user space, but a whole thing in both user space
and kernel space.  The tool defrag a file segment by segment. The lock->defragment->unlock
Is done by kernel in responding to the FALLOC_FL_UNSHARE_RANGE request from user space.

> 
>>              File IOs are blocked when the target file is locked and are served during the
>>              defragmentation idle time (file is unlocked).
> 
> What file IOs are being served in parallel? The defragmentation IO?
> something else?

Here the file IOs means the IOs request from user space applications including virtual machine
Engine.

> 
>>              Though
>>              the file IOs can't really go in parallel, they are not blocked long. The locking time
>>              basically depends on the segment size. Smaller segments usually take less locking time
>>              and thus IOs are blocked shorterly, bigger segments usually need more locking time and
>>              IOs are blocked longer. Check -s and -i options to balance the defragmentation and IO
>>              service.
> 
> How is a user supposed to know what the correct values are for their
> storage, files, and workload? Algorithms should auto tune, not
> require users and administrators to use trial and error to find the
> best numbers to feed a given operation.

In my option, user would need a way to control this according to their use case.
Any algorithms will restrict what user want to do.
Say, user want the defrag done as quick as possible regardless the resources it takes (CPU, IO and so on)
when the production system is in a maintenance window. But when the production system is busy
User want the defrag use less resources.
Another example, kernel (algorithms) never knows the maximum IO latency the user applications tolerate.
But if you have some algorithms, please share.

And we provide default numbers for the options, they come from test practice though user might need to
Change them for their own use case.

> 
>>              Temporary file
>> 
>>              A temporary file is used for the defragmentation. The temporary file is created in the
>>              same directory as the target file is and is named ".xfsdefrag_<pid>". It is a sparse
>>              file and contains a defragmentation segment at a time. The temporary file is removed
>>              automatically when defragmentation is done or is cancelled by ctrl-c. It remains in
>>              case kernel crashes when defragmentation is going on. In that case, the temporary file
>>              has to be removed manaully.
> 
> O_TMPFILE, as Darrick has already requested.

OK. Will be it.
> 
>> 
>>              Free blocks consumption
>> 
>>              Defragmenation works by (trying) allocating new (contiguous) blocks, copying data and
>>              then freeing old (non-contig) blocks. Usually the number of old blocks to free equals
>>              to the number the newly allocated blocks. As a finally result, defragmentation doesn't
>>              consume free blocks.  Well, that is true if the target file is not sharing blocks with
>>              other files.
> 
> This is really hard to read. Defragmentation will -always- consume
> free space while it is progress. It will always release the
> temporary space it consumes when it completes.

I don’t think it’s always free blocks when it releases the temporary file. When the blocks were
Original shared before defrag, the blocks won’t be freed.

> 
>>              In case the target file contains shared blocks, those shared blocks won't
>>              be freed back to filesystem as they are still owned by other files. So defragmenation
>>              allocates more blocks than it frees.
> 
> So this is doing an unshare operation as well as defrag? That seems
> ... suboptimal. The whole point of sharing blocks is to minimise
> disk usage for duplicated data.

That depends on user's need. If users think defrag is the first priority, it is. If users don’t think the disk
saving is the most important, it is not. No matter what developers think.
What’s more, reflink (or sharing blocks) is not only used to minimize disk usage. Sometimes it’s
Used as way to take snapshots. And those snapshots might won’t stay long.

And what’s more is that, the unshare operation is what you suggested :D   


> 
>>              For existing XFS, free blocks might be over-
>>              committed when reflink snapshots were created. To avoid causing the XFS running into
>>              low free blocks state, this defragmentation excludes (partially) shared segments when
>>              the file system free blocks reaches a shreshold. Check the -f option.
> 
> Again, how is the user supposed to know when they need to do this?
> If the answer is "they should always avoid defrag on low free
> space", then why is this an option?

I didn’t say "they should always avoid defrag on low free space”. And even we can’t say how low is
Not tolerated by user, that depends on user use case. Though it’s an option, it has the default value
Of 1GB. If users don’t set this option, that is "always avoid defrag on low free space”.


> 
>>              Safty and consistency
>> 
>>              The defragmentation file is guanrantted safe and data consistent for ctrl-c and kernel
>>              crash.
> 
> Which file is the "defragmentation file"? The source or the temp
> file?

I don’t think there is "source concept" here. There is no data copy between files.
“The defragmentation file” means the file under defrag, I will change it to “The file under defrag”.
I don’t think users care about the temporary file at all.


> 
>>              First extent share
>> 
>>              Current kernel has routine for each segment defragmentation detecting if the file is
>>              sharing blocks.
> 
> I have no idea what this means, or what interface this refers to.
> 
>>              It takes long in case the target file contains huge number of extents
>>              and the shared ones, if there is, are at the end. The First extent share feature works
>>              around above issue by making the first serveral blocks shared. Seeing the first blocks
>>              are shared, the kernel routine ends quickly. The side effect is that the "share" flag
>>              would remain traget file. This feature is enabled by default and can be disabled by -n
>>              option.
> 
> And from this description, I have no idea what this is doing, what
> problem it is trying to work around, or why we'd want to share
> blocks out of a file to speed up detection of whether there are
> shared blocks in the file. This description doesn't make any sense
> to me because I don't know what interface you are actually having
> performance issues with. Please reference the kernel code that is
> problematic, and explain why the existing kernel code is problematic
> and cannot be fixed.

I mentioned the kernel function name in patch 6. It is xfs_reflink_try_clear_inode_flag().

> 
>>              extsize and cowextsize
>> 
>>              According to kernel implementation, extsize and cowextsize could have following impacts
>>              to defragmentation: 1) non-zero extsize causes separated block allocations for each
>>              extent in the segment and those blocks are not contiguous.
> 
> Extent size hints do no such thing. The simply provide extent
> alignment guidelines and do not affect things like contiguous or
> multi-block allocation lengths.

Extsize really make alignment on the number of blocks to allocate. But it affects more than that.
When extsize is set, the allocations is not delayed allocation.
xfs_reflink_unshare() does one allocation each extent. For a defrag segment containing
N extents, there are N allocations.

> 
>>              The segment remains same
>>              number of extents after defragmention (no effect).  2) When extsize and/or cowextsize
>>              are too big, a lot of pre-allocated blocks remain in memory for a while. When new IO
>>              comes to whose pre-allocated blocks  Copy on Write happens and causes the file
>>              fragmented.
> 
> extsize based unwritten extents won't cause COW or cause
> fragmentation because they aren't shared and they are contiguous.
> I suspect that your definition of "fragmented" isn't taking into
> account that unwritten-written-unwritten over a contiguous range
> is *not* fragmentation. It's just a contiguous extent in different
> states, and this should really not be touched/changed by
> defragmentation.

Are you sure about that? In my option, take the buffer’s write as example,
During writeback, when the target block is found in Cow fork, Copy on Write just happens no matter
If the block is really shared or not.  Let’s see this simple example:
1) a file contains 4 blocks. file block 0, 1 and 2 are shared and block 3 is not share. 
    Extsize on this file is 4 blocks.
2) a writeback come to file block 0, 1 and 2.
3) On seeing those 3 blocks are shared, kernel pre-allocate blocks in Cow fork.
    Extsize being 4 blocks, after alignment, 4 blocks (unwritten) are allocated in Cow fork.
4) data is written to 3 of the blocks in Cow fork. In IO done callback, those 3 blocks in Cow fork
    Is moved to data fork, the original 3 blocks in data fork are freed.

The Copy on Write is done, right?
But remember, there is 1 unwritten block left in the Cow fork.
In case now a new writeback come to file block 3, the kernel see there is a file block 3 in Cow fork,
A new Copy on Write happens.
 

> 
> check out xfs_fsr: it ensures that the pattern of unwritten/written
> blocks in the defragmented file is identical to the source. i.e. it
> preserves preallocation because the application/fs config wants it
> to be there....
> 
>>              Readahead
>> 
>>              Readahead tries to fetch the data blocks for next segment with less locking in
>>              backgroud during idle time. This feature is disabled by default, use -a to enable it.
> 
> What are you reading ahead into? Kernel page cache or user buffers?
Kernel page cache.
> Either way, it's hardly what I'd call "idle time" if the defrag
> process is using it to issue lots of read IO...
> 

During the “idle time”, the file is not (IOLOCK) locked though disk fetching might be happening. 

> 
>>              The command takes the following options:
>>                 -f free_space
>>                     The shreshold of XFS free blocks in MiB. When free blocks are less than this
>>                     number, (partially) shared segments are excluded from defragmentation. Default
>>                     number is 1024
> 
> When you are down to 4MB of free space in the filesystem, you
> shouldn't even be trying to run defrag because all the free space
> that will be left in the filesystem is single blocks. I would have
> expected this sort of number to be in a percentage of capacity,
> defaulting to something like 5% (which is where we start running low
> space algorithms in the kernel).

I would like leave this to user. When user is doing defrag on low free space system, it won’t cause
Problem to file system its self. At most the defrag fails during unshare when allocating blocks.
You can’t prevent user from writing to new file when system is low free space either.

I don’t think a percentage is a good idea,  say, for a 10TiB filesystem, 5% is 512GB.  512GB is pretty
Enough to do things. And for a small one, say a 512 MB filesystem, 5% that’s 25MB, that’s too less.
In above cases, limiting by a percentage would ether prevent user doing something that can be done
Without any problem, or allow user to do something that might cause problem.
I’d think specifying a fixed safe size is better.


> 
>>                 -i idle_time
>>                     The time in milliseconds, defragmentation enters idle state for this long after
>>                     defragmenting a segment and before handing the next. Default number is TOBEDONE.
> 
> Yeah, I don't think this is something anyonce whould be expected to
> use or tune. If an idle time is needed, the defrag application
> should be selecting this itself.

I don’t think so, see my explain above.

>> 
>>                 -s segment_size
>>                     The size limitation in bytes of segments. Minimium number is 4MiB, default
>>                     number is 16MiB.
> 
> Why were these numbers chosen? What happens if the file has ~32MB
> sized extents and the user wants the file to be returned to a single
> large contiguous extent it possible? i.e. how is the user supposed
> to know how to set this for any given file without first having
> examined the exact pattern of fragmentations in the file?

Why customer want the file to be returned to a single large contiguous extent?
A 32MB extent is pretty good to me.  I didn’t here any customer complain about 32MB extents…
And you know, whether we can defrag extents to a large one depends on not only the tool it’s self.
It’s depends on the status of the filesystem, say if the filesystem is very fragmented too, say the AG size..

The 16MB is selected according our tests basing on a customer metadump. With 16MB segment size,
The the defrag result is very good and the IO latency is acceptable too.  With the default 16MB segment
Size, 32MB extent is excluded from defrag.

If you have better default size, we can use that.

> 
>>                 -n  Disable the First extent share feature. Enabled by default.
> 
> So confusing.  Is the "feature disable flag" enabled by default, or
> is the feature enabled by default?

Will change it to the following if it’s clear:
The "First extent share “ feature is enabled to default. User -n to disable it.

> 
>>                 -a  Enable readahead feature, disabled by default.
> 
> Same confusion, but opposite logic.
> 
> I would highly recommend that you get a native english speaker to
> review, spell and grammar check the documentation before the next
> time you post it.

OK, will try to do so.

> 
>> We tested with real customer metadump with some different 'idle_time's and found 250ms is good pratice
>> sleep time. Here comes some number of the test:
>> 
>> Test: running of defrag on the image file which is used for the back end of a block device in a
>>      virtual machine. At the same time, fio is running at the same time inside virtual machine
>>      on that block device.
>> block device type:   NVME
>> File size:           200GiB
>> paramters to defrag: free_space: 1024 idle_time: 250 First_extent_share: enabled readahead: disabled
>> Defrag run time:     223 minutes
>> Number of extents:   6745489(before) -> 203571(after)
> 
> So and average extent size of ~32kB before, 100MB after? How much of
> these are shared extents?

Zero shared extents, but there are some unwritten ones.
A similar run stats is like this:
Pre-defrag 6654460 extents detected, 112228 are "unwritten",0 are "shared” 
Tried to defragment 6393352 extents (181000359936 bytes) in 26032 segments Time stats(ms): max clone: 31, max unshare: 300, max punch_hole: 66
Post-defrag 282659 extents detected

> 
> Runtime is 13380secs, so if we copied 200GiB in that time, the
> defrag ran at 16MB/s. That's not very fast.
> 

We are chasing the balance of defrag and parallel IO latency.

> What's the CPU utilisation of the defrag task and kernel side
> processing? What is the difference between "first_extent_share"
> enabled and disabled (both performance numbers and CPU usage)?

On my test VM (spindle based disk I think). CPU usage is about 6% for
The defrag command. Kernel processes much lower.
I didn’t pay much attention to the CPU usage when “first_extent_share” is disabled. But think
That caused very high CPU usages.

> 
>> Fio read latency:    15.72ms(without defrag) -> 14.53ms(during defrag)
>> Fio write latency:   32.21ms(without defrag) -> 20.03ms(during defrag)
> 
> So the IO latency is *lower* when defrag is running? That doesn't
> make any sense, unless the fio throughput is massively reduced while
> defrag is running.  

That’s reasonable. For the segments that defrag is done, the page cache remains.
 

> What's the throughput change in the fio
> workload? What's the change in worst case latency for the fio
> workload? i.e. post the actual fio results so we can see the whole
> picture of the behaviour, not just a single cherry-picked number.

Let me see if we have that saved.

> 
> Really, though, I have to ask: why is this an xfs_spaceman command
> and not something built into the existing online defrag program
> we have (xfs_fsr)?
> 

Quotation from previous conversation:
“”””" 
> Let me know if you want it in xfsprog.

Yes, i think adding it as an xfs_spaceman command would be a good
way for this defrag feature to be maintained for anyone who has need
for it.

-Dave.
“”””””

Thanks,
Wengang

> I'm sure I'll hav emore questions as I go through the code - I'll
> start at the userspace IO engine part of the patchset so I have some
> idea of what the defrag algorithm actually is...
> 
> -Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
Dave Chinner July 31, 2024, 2:51 a.m. UTC | #3
On Tue, Jul 16, 2024 at 07:45:37PM +0000, Wengang Wang wrote:
> 
> 
> > On Jul 15, 2024, at 4:03 PM, Dave Chinner <david@fromorbit.com> wrote:
> > 
> > [ Please keep documentation text to 80 columns. ] 
> > 
> 
> Yes. This is not a patch. I copied it from the man 8 output.
> It will be limited to 80 columns when sent as a patch.
> 
> > [ Please run documentation through a spell checker - there are too
> > many typos in this document to point them all out... ]
> 
> OK.
> 
> > 
> > On Tue, Jul 09, 2024 at 12:10:19PM -0700, Wengang Wang wrote:
> >> This patch set introduces defrag to xfs_spaceman command. It has the functionality and
> >> features below (also subject to be added to man page, so please review):
> > 
> > What's the use case for this?
> 
> This is the user space defrag as you suggested previously.
> 
> Please see the previous conversation for your reference: 
> https://patchwork.kernel.org/project/xfs/cover/20231214170530.8664-1-wen.gang.wang@oracle.com/

That's exactly what you should have put in the cover letter!

The cover letter is not for documenting the user interface of a new
tool - that's what the patch in the patch set for the new man page
should be doing.

The cover letter should contain references to past patch sets and
discussions on the topic. The cover letter shoudl also contain a
changelog that documents what is different in this new version of
the patch set so reviewers know what you've changed since they last
looked at it.

IOWs, the cover letter for explaining the use case, why the
functionality is needed, important design/implementation decisions
and the history of the patchset. It's meant to inform and remind
readers of what has already happened to get to this point.

> COPY STARTS —————————————> 
> I am copying your last comment there:
> 
> On Tue, Dec 19, 2023 at 09:17:31PM +0000, Wengang Wang wrote:
> > Hi Dave,
> > Yes, the user space defrag works and satisfies my requirement (almost no change from your example code).
> 
> That's good to know :)
> 
> > Let me know if you want it in xfsprog.
> 
> Yes, i think adding it as an xfs_spaceman command would be a good
> way for this defrag feature to be maintained for anyone who has need
> for it.

Sure, I might have said that 6 months ago. When presented with a
completely new implementation in a new context months later, I might
see things differently.  Everyone is allowed to change their mind,
opinions and theories as circumstances, evidence and contexts
change.

Indeed when I look at this:

> >>       defrag [-f free_space] [-i idle_time] [-s segment_size] [-n] [-a]
> >>              defrag defragments the specified XFS file online non-exclusively. The target XFS

I didn't expect anything nearly as complex and baroque as this. All
I was expecting was something like this to defrag a single range of
a file:

	xfs_spaceman -c "defrag <offset> <length>" <file>

As the control command, and then functionality for
automated/periodic scanning and defrag would still end up being
co-ordinated by the existing xfs_fsr code.

> > What's "non-exclusively" mean? How is this different to what xfs_fsr
> > does?
> > 
> 
> I think you have seen the difference when you reviewing more of this set.
> Well, if I read xfs_fsr code correctly, though xfs_fsr allow parallel writes, it looks have a problem(?)
> As I read the code, Xfs_fsr do the followings to defrag one file:
> 1) preallocating blocks to a temporary file hoping the temporary get same number of blocks as the
>     file under defrag with with less extents.
> 2) copy data blocks from the file under defrag to the temporary file.
> 3) switch the extents between the two files.
> 
> For stage 2, it’s NOT copying data blocks in atomic manner. Take an example: there need two
> Read->write pair to complete the data copy, that is
>     Copy range 1 (read range 1 from the file under defrag to the temporary file)
>     Copy range 2

I wasn't asking you to explain to me how the xfs_fsr algorithm
works. What I was asking for was a definition of what
"non-exclusively" means.

What xfs_fsr currently does meets my definition of "non-exclusive" - it does
not rely on or require exclusive access to the file being
defragmented except for the atomic extent swap at the end. However,
using FICLONE/UNSHARE does require exclusive access to the file be
defragmented for the entirity of those operations, so I don't have
any real idea of why this new algorithm is explicitly described as
"non-exclusive".

Defining terms so everyone has a common understanding is important.

Indeed, Given that we now have XFS_IOC_EXCHANGE_RANGE, I'm
definitely starting to wonder if clone/unshare is actually the best
way to do this now.  I think we could make xfs_fsr do iterative
small file region defrag using XFS_IOC_EXCHANGE_RANGE instead of
'whole file at once' as it does now. If we were also to make fsr
aware of shared extents 

> > 
> >>              Defragmentation and file IOs
> >> 
> >>              The target file is virtually devided into many small segments. Segments are the
> >>              smallest units for defragmentation. Each segment is defragmented one by one in a
> >>              lock->defragment->unlock->idle manner.
> > 
> > Userspace can't easily lock the file to prevent concurrent access.
> > So I'mnot sure what you are refering to here.
> 
> The manner is not simply meant what is done at user space, but a whole thing in both user space
> and kernel space.  The tool defrag a file segment by segment. The lock->defragment->unlock
> Is done by kernel in responding to the FALLOC_FL_UNSHARE_RANGE request from user space.

I'm still not sure what locking you are trying to describe. There
are multiple layers of locking in the kernel, and we use them
differently. Indeed, the algorithm you have described is actually

	FICLONERANGE
	IOLOCK shared
	ILOCK exclusive
	remap_file_range()
	IUNLOCK exclusive
	IOUNLOCK shared

	.....

	UNSHARE_RANGE
	IOLOCK exclusive
	MMAPLOCK exclusive
	<drain DIO in flight>
	ILOCK exclusive
	unshare_range()
	IUNLOCK exclusive
	MMAPUNLOCK exclusive
	IOUNLOCK shared

And so there isn't a single "lock -> defrag -> unlock" context
occurring - there are multiple independent operations that have
different kernel side locking contexts and there are no userspace
side file locking contexts, either.

> > 
> >>              File IOs are blocked when the target file is locked and are served during the
> >>              defragmentation idle time (file is unlocked).
> > 
> > What file IOs are being served in parallel? The defragmentation IO?
> > something else?
> 
> Here the file IOs means the IOs request from user space applications including virtual machine
> Engine.
> 
> > 
> >>              Though
> >>              the file IOs can't really go in parallel, they are not blocked long. The locking time
> >>              basically depends on the segment size. Smaller segments usually take less locking time
> >>              and thus IOs are blocked shorterly, bigger segments usually need more locking time and
> >>              IOs are blocked longer. Check -s and -i options to balance the defragmentation and IO
> >>              service.
> > 
> > How is a user supposed to know what the correct values are for their
> > storage, files, and workload? Algorithms should auto tune, not
> > require users and administrators to use trial and error to find the
> > best numbers to feed a given operation.
> 
> In my option, user would need a way to control this according to their use case.
> Any algorithms will restrict what user want to do.
> Say, user want the defrag done as quick as possible regardless the resources it takes (CPU, IO and so on)
> when the production system is in a maintenance window. But when the production system is busy
> User want the defrag use less resources.

That's not for the defrag program to implement That's what we use
resource control groups for. Things like memcgs, block IO cgroups,
scheduler cgroups, etc. Administrators are used to restricting the
resources used by applications with generic admin tools; asking them
to learn how some random admin tool does it's own resrouce
utilisation restriction that requires careful hand tuning for -one
off admin events- is not the right way to solve this problem.

We should be making the admin tool go as fast as possible and
consume as much resources as are available. This makes it fast out
of the box, and lets the admins restrict the IO rate, CPU and memory
usage to bring it down to an acceptible resource usage level for
admin tasks on their systems.

> Another example, kernel (algorithms) never knows the maximum IO latency the user applications tolerate.
> But if you have some algorithms, please share.

As I said - make it as fast and low latency as reasonably possible.
If you have less than 10ms IO latency SLAs, the application isn't
going to be running on sparse, software defined storage that may
require hundreds of milliseconds of IO pauses during admin tasks.
Hence design to a max fixed IO latency (say 100ms) and make the
funcitonality run as fast as possible within that latency window.

If people need lower latency SLAs, then they shouldn't be running
that application on sparse, COW based VM images. This is not a
problem a defrag utility should be trying to solve.

> >>              Free blocks consumption
> >> 
> >>              Defragmenation works by (trying) allocating new (contiguous) blocks, copying data and
> >>              then freeing old (non-contig) blocks. Usually the number of old blocks to free equals
> >>              to the number the newly allocated blocks. As a finally result, defragmentation doesn't
> >>              consume free blocks.  Well, that is true if the target file is not sharing blocks with
> >>              other files.
> > 
> > This is really hard to read. Defragmentation will -always- consume
> > free space while it is progress. It will always release the
> > temporary space it consumes when it completes.
> 
> I don’t think it’s always free blocks when it releases the temporary file. When the blocks were
> Original shared before defrag, the blocks won’t be freed.

I didn't make myself clear. If the blocks shared to the temp file
are owned exclusively by the source file (i.e. they were COW'd from
shared extents at some time in the past), then that is space
that is temporarily required by the defragmentation process. UNSHARE
creates a second, permanent copy of those blocks in the source file
and closing of the temp file them makes the original exclusively
owned blocks go away.

IOWs, defrag can temporarily consume an entire extra file's worth of
space between the UNSHARE starting and the freeing of the temporary
file when we are done with it. Freeing the temp file -always-
releases this extra space, though I note that the implementation is
to hole-punch it away after each segment has been processed.

> > 
> >>              In case the target file contains shared blocks, those shared blocks won't
> >>              be freed back to filesystem as they are still owned by other files. So defragmenation
> >>              allocates more blocks than it frees.
> > 
> > So this is doing an unshare operation as well as defrag? That seems
> > ... suboptimal. The whole point of sharing blocks is to minimise
> > disk usage for duplicated data.
> 
> That depends on user's need. If users think defrag is the first
> priority, it is.  If users don’t think the disk
> saving is the most important, it is not. No matter what developers think.
                                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

That's pretty ... dismissive.

I mean, you're flat out wrong. You make the assumption that a user
knows exactly how every file that every application in their system
has been created and knows exactly how best to defragment it.

That's just .... wrong.

Users and admins do not have intimate knowledge of how their
applications do their stuff, and a lot of them don't even know
that their systems are using file clones (i.e. reflink copies)
instead of data copies extensively these days.

That is completely the wrong way to approach administration
tools. 

Our defragmentation policy for xfs_fsr is to leave the structure of
the file as intact as possible. That means we replicate unwritten
regions in the defragmented file. We actually -defragment unwritten
extents- in xfs_fsr, not just written extents, and we do that
because we have to assume that the unwritten extents exist for a
good reason.

We don't expect the admin to make a decision as to whether unwritten
extents should be replicated or defragged - we make the assumption
that either the application or the admin has asked for them to exist
in the first place.

It is similar for defragmenting files that are largely made up of shared
extents. That layout exists for a reason, and it's not the place of
the defragmentation operation to unilaterally decide layout policies
for the admin and/or application that is using files with shared
extents.

Hence the defrag operation must preserve the *observed intent* of
the source file layout as much as possible and not require the admin
or user to be sufficiently informed to make the right decision one
way or another. We must attempt to preserve the status quo.

Hence if the file is largely shared, we must not unshare the entire
file to defragment it unless that is the only way to reduce the
fragmentation (e.g. resolve small interleaved shared and unshared
extents). If there are reasonable sized shared extents, we should be
leaving them alone and not unsharing them just to reduce the extent
count by a handful of extents.

> What’s more, reflink (or sharing blocks) is not only used to minimize disk usage. Sometimes it’s
> Used as way to take snapshots. And those snapshots might won’t stay long.

Yes, I know this. It doesn't change anything to do with how we
defragment a file that contains shared blocks.

If you don't want the snapshot(s) to affect defragmentation, then
don't run defrag while the snapshots are present. Otherwise, we
want defrag to retain as much sharing between the snapshots and
the source file because *minimising the space used by snapshots* is
the whole point of using file clones for snapshots in the first
place!

> And what’s more is that, the unshare operation is what you suggested :D   

I suggested it as a mechanism to defrag regions of shared files with
excessive fragmentation. I was not suggesting that "defrag ==
unshare".

> >>              For existing XFS, free blocks might be over-
> >>              committed when reflink snapshots were created. To avoid causing the XFS running into
> >>              low free blocks state, this defragmentation excludes (partially) shared segments when
> >>              the file system free blocks reaches a shreshold. Check the -f option.
> > 
> > Again, how is the user supposed to know when they need to do this?
> > If the answer is "they should always avoid defrag on low free
> > space", then why is this an option?
> 
> I didn’t say "they should always avoid defrag on low free space”. And even we can’t say how low is
> Not tolerated by user, that depends on user use case. Though it’s an option, it has the default value
> Of 1GB. If users don’t set this option, that is "always avoid defrag on low free space”.

You didn't answer my question: how is the user supposed to know
when they should set this?

And, again, the followup question is: why does this need to be
built into the defrag tool?

From a policy perspective, caring about the amount of free space in
the filesystem isn't the job of a defragmentation operation. It
should simply abort if it gets an ENOSPC error or fails to improve
the layout of the file in question. Indeed, if it is obvious that
there may not be enough free space in the filesystem to begin with
thendon't run the defrag operation at all.

This is how xfs_fsr works - it tries to preallocate all the space it
will need before it starts moving data. If it fails to preallocate
all the space, it aborts. If it fails to find large enough
contiguous free spaces to improve the layout of the file, it aborts.

IOWs, xfs_fsr policy is that it doesn't care about the amount of
free space in the filesystem, it just cares if the result will
improve the layout of the file.  That's basically how any online
background defrag operation should work - if the new
layout is worse than the existing layout, or there isn't space for
the new layout to be allocated, just abort.


> >>              Safty and consistency
> >> 
> >>              The defragmentation file is guanrantted safe and data consistent for ctrl-c and kernel
> >>              crash.
> > 
> > Which file is the "defragmentation file"? The source or the temp
> > file?
> 
> I don’t think there is "source concept" here. There is no data copy between files.
> “The defragmentation file” means the file under defrag, I will change it to “The file under defrag”.
> I don’t think users care about the temporary file at all.

Define the terms you use rather than assuming the reader
understands both the terminology you are using and the context in
which you are using them.

.....

> > 
> >>              The command takes the following options:
> >>                 -f free_space
> >>                     The shreshold of XFS free blocks in MiB. When free blocks are less than this
> >>                     number, (partially) shared segments are excluded from defragmentation. Default
> >>                     number is 1024
> > 
> > When you are down to 4MB of free space in the filesystem, you
> > shouldn't even be trying to run defrag because all the free space
> > that will be left in the filesystem is single blocks. I would have
> > expected this sort of number to be in a percentage of capacity,
> > defaulting to something like 5% (which is where we start running low
> > space algorithms in the kernel).
> 
> I would like leave this to user.

Again: How is the user going to know what to set this to? What
problem is this avoiding that requires the user to change this in
any way.

> When user is doing defrag on low free space system, it won’t cause
> Problem to file system its self. At most the defrag fails during unshare when allocating blocks.

Why would we even allow a user to run defrag near ENOSPC? It is a
well known problem that finding contiguous free space when we are close
to ENOSPC is difficult and so defrag often is unable to improve the
situation when we are within a few percent of the filesysetm being
full.

It is also a well known problem that defragmentation at low free
space trades off contiguous free space for fragmented free space.
Hence when we are at low free space, defrag makes the free space
fragmetnation worse, which then results in all allocation in the
filesystem getting worse and more fragmented. This is something we
absolutely should be trying to avoid.

This is one of the reasons xfs_fsr tries to layout the entire
file before doing any IO - when about 95% full, it's common for the
new layout to be worse than the original file's layout because there
isn't sufficient contiguous free space to improve the layout.

IOWs, running defragmentation when we are above 95% full is actively
harmful to the longevity of the filesystem. Hence, on a fundamental
level, having a low space threshold in a defragmentation tool is
simply wrong - defragmentation should simply not be run when the
filesystem is anywhere near full.

.....

> >> 
> >>                 -s segment_size
> >>                     The size limitation in bytes of segments. Minimium number is 4MiB, default
> >>                     number is 16MiB.
> > 
> > Why were these numbers chosen? What happens if the file has ~32MB
> > sized extents and the user wants the file to be returned to a single
> > large contiguous extent it possible? i.e. how is the user supposed
> > to know how to set this for any given file without first having
> > examined the exact pattern of fragmentations in the file?
> 
> Why customer want the file to be returned to a single large contiguous extent?
> A 32MB extent is pretty good to me.  I didn’t here any customer
> complain about 32MB extents…

There's a much wider world out there than just Oracle customers.
Just because you aren't aware of other use cases that exist, it
doesn't mean they don't exist. I know they exist, hence my question.

For example, extent size hints are used to guarantee that the data
is aligned to the underlying storage correctly, and very large
contiguous extents are required to avoid excessive seeks during
sequential reads that result in critical SLA failures. Hence if a
file is poorly laid out in this situation, defrag needs to return it
to as few, maximally sized extents as it can. How does a user know
what they'd need to set this segment size field to and so acheive
the result they need?

> And you know, whether we can defrag extents to a large one depends on not only the tool it’s self.
> It’s depends on the status of the filesystem, say if the filesystem is very fragmented too, say the AG size..
> 
> The 16MB is selected according our tests basing on a customer metadump. With 16MB segment size,
> The the defrag result is very good and the IO latency is acceptable too.  With the default 16MB segment
> Size, 32MB extent is excluded from defrag.

Exactly my point: you have written a solution that works for a
single filesystem in a single environment.  However, the solution is
so specific to the single problem you need to solve that it is not
clear whether that functionality or defaults are valid outside of
the specific problem case you've written it for and tested it on.

> If you have better default size, we can use that.

I'm not convinced that fixed size "segments" is even the right way
to approach this problem. What needs to be done is dependent on the
extent layout of the file, not how extents fit over some arbitrary
fixed segment map....

> >> We tested with real customer metadump with some different 'idle_time's and found 250ms is good pratice
> >> sleep time. Here comes some number of the test:
> >> 
> >> Test: running of defrag on the image file which is used for the back end of a block device in a
> >>      virtual machine. At the same time, fio is running at the same time inside virtual machine
> >>      on that block device.
> >> block device type:   NVME
> >> File size:           200GiB
> >> paramters to defrag: free_space: 1024 idle_time: 250 First_extent_share: enabled readahead: disabled
> >> Defrag run time:     223 minutes
> >> Number of extents:   6745489(before) -> 203571(after)
> > 
> > So and average extent size of ~32kB before, 100MB after? How much of
> > these are shared extents?
> 
> Zero shared extents, but there are some unwritten ones.
> A similar run stats is like this:
> Pre-defrag 6654460 extents detected, 112228 are "unwritten",0 are "shared” 
> Tried to defragment 6393352 extents (181000359936 bytes) in 26032 segments Time stats(ms): max clone: 31, max unshare: 300, max punch_hole: 66
> Post-defrag 282659 extents detected
> 
> > 
> > Runtime is 13380secs, so if we copied 200GiB in that time, the
> > defrag ran at 16MB/s. That's not very fast.
> > 
> 
> We are chasing the balance of defrag and parallel IO latency.

My point is that stuff like CLONE and UNSHARE should be able to run
much, much faster than this, even if some of the time is left idle
for other IO.

i.e. we can clone extents at about 100,000/s. We can copy data
through the page cache at 7-8GB/s on NVMe devices.

A full clone of the 6.6 million extents should only take about
a minute.

A full page cache copy of the 200GB cloned file (i.e. via read/write
syscalls) should easily run at >1GB/s, and so only take a couple of
minutes to run.

IOWs, the actual IO and metadata modification side of things is
really only about 5 minutes worth of CPU and IO.

Hence this defrag operation is roughly 100x slower than we should be
able to run it at.  We should be able to run it at close to those
speeds whilst still allowing concurrent read access to the file.

If an admin then wants it to run at 16MB/s, it can be throttled
to that speed using cgroups, ionice, etc.

i.e. I think you are trying to solve too many unnecessary problems
here and not addressing the one thing it should do: defrag a file as
fast and efficiently as possible.

-Dave.
Wengang Wang Aug. 2, 2024, 6:14 p.m. UTC | #4
> On Jul 30, 2024, at 7:51 PM, Dave Chinner <david@fromorbit.com> wrote:
> 
> On Tue, Jul 16, 2024 at 07:45:37PM +0000, Wengang Wang wrote:
>> 
>> 
>>> On Jul 15, 2024, at 4:03 PM, Dave Chinner <david@fromorbit.com> wrote:
>>> 
>>> [ Please keep documentation text to 80 columns. ] 
>>> 
>> 
>> Yes. This is not a patch. I copied it from the man 8 output.
>> It will be limited to 80 columns when sent as a patch.
>> 
>>> [ Please run documentation through a spell checker - there are too
>>> many typos in this document to point them all out... ]
>> 
>> OK.
>> 
>>> 
>>> On Tue, Jul 09, 2024 at 12:10:19PM -0700, Wengang Wang wrote:
>>>> This patch set introduces defrag to xfs_spaceman command. It has the functionality and
>>>> features below (also subject to be added to man page, so please review):
>>> 
>>> What's the use case for this?
>> 
>> This is the user space defrag as you suggested previously.
>> 
>> Please see the previous conversation for your reference: 
>> https://patchwork.kernel.org/project/xfs/cover/20231214170530.8664-1-wen.gang.wang@oracle.com/
> 
> That's exactly what you should have put in the cover letter!
> 
> The cover letter is not for documenting the user interface of a new
> tool - that's what the patch in the patch set for the new man page
> should be doing.
> 
> The cover letter should contain references to past patch sets and
> discussions on the topic. The cover letter shoudl also contain a
> changelog that documents what is different in this new version of
> the patch set so reviewers know what you've changed since they last
> looked at it.
> 
> IOWs, the cover letter for explaining the use case, why the
> functionality is needed, important design/implementation decisions
> and the history of the patchset. It's meant to inform and remind
> readers of what has already happened to get to this point.
> 
>> COPY STARTS —————————————> 
>> I am copying your last comment there:
>> 
>> On Tue, Dec 19, 2023 at 09:17:31PM +0000, Wengang Wang wrote:
>>> Hi Dave,
>>> Yes, the user space defrag works and satisfies my requirement (almost no change from your example code).
>> 
>> That's good to know :)
>> 
>>> Let me know if you want it in xfsprog.
>> 
>> Yes, i think adding it as an xfs_spaceman command would be a good
>> way for this defrag feature to be maintained for anyone who has need
>> for it.
> 
> Sure, I might have said that 6 months ago. When presented with a
> completely new implementation in a new context months later, I might
> see things differently.  Everyone is allowed to change their mind,
> opinions and theories as circumstances, evidence and contexts
> change.
> 
> Indeed when I look at this:
> 
>>>>      defrag [-f free_space] [-i idle_time] [-s segment_size] [-n] [-a]
>>>>             defrag defragments the specified XFS file online non-exclusively. The target XFS
> 
> I didn't expect anything nearly as complex and baroque as this. All
> I was expecting was something like this to defrag a single range of
> a file:
> 
> xfs_spaceman -c "defrag <offset> <length>" <file>
> 
> As the control command, and then functionality for
> automated/periodic scanning and defrag would still end up being
> co-ordinated by the existing xfs_fsr code.
> 
>>> What's "non-exclusively" mean? How is this different to what xfs_fsr
>>> does?
>>> 
>> 
>> I think you have seen the difference when you reviewing more of this set.
>> Well, if I read xfs_fsr code correctly, though xfs_fsr allow parallel writes, it looks have a problem(?)
>> As I read the code, Xfs_fsr do the followings to defrag one file:
>> 1) preallocating blocks to a temporary file hoping the temporary get same number of blocks as the
>>    file under defrag with with less extents.
>> 2) copy data blocks from the file under defrag to the temporary file.
>> 3) switch the extents between the two files.
>> 
>> For stage 2, it’s NOT copying data blocks in atomic manner. Take an example: there need two
>> Read->write pair to complete the data copy, that is
>>    Copy range 1 (read range 1 from the file under defrag to the temporary file)
>>    Copy range 2
> 
> I wasn't asking you to explain to me how the xfs_fsr algorithm
> works. What I was asking for was a definition of what
> "non-exclusively" means.
> 
> What xfs_fsr currently does meets my definition of "non-exclusive" - it does
> not rely on or require exclusive access to the file being
> defragmented except for the atomic extent swap at the end. However,
> using FICLONE/UNSHARE does require exclusive access to the file be
> defragmented for the entirity of those operations, so I don't have
> any real idea of why this new algorithm is explicitly described as
> "non-exclusive".
> 
> Defining terms so everyone has a common understanding is important.
> 
> Indeed, Given that we now have XFS_IOC_EXCHANGE_RANGE, I'm
> definitely starting to wonder if clone/unshare is actually the best
> way to do this now.  I think we could make xfs_fsr do iterative
> small file region defrag using XFS_IOC_EXCHANGE_RANGE instead of
> 'whole file at once' as it does now. If we were also to make fsr
> aware of shared extents 
> 
>>> 
>>>>             Defragmentation and file IOs
>>>> 
>>>>             The target file is virtually devided into many small segments. Segments are the
>>>>             smallest units for defragmentation. Each segment is defragmented one by one in a
>>>>             lock->defragment->unlock->idle manner.
>>> 
>>> Userspace can't easily lock the file to prevent concurrent access.
>>> So I'mnot sure what you are refering to here.
>> 
>> The manner is not simply meant what is done at user space, but a whole thing in both user space
>> and kernel space.  The tool defrag a file segment by segment. The lock->defragment->unlock
>> Is done by kernel in responding to the FALLOC_FL_UNSHARE_RANGE request from user space.
> 
> I'm still not sure what locking you are trying to describe. There
> are multiple layers of locking in the kernel, and we use them
> differently. Indeed, the algorithm you have described is actually
> 
> FICLONERANGE
> IOLOCK shared
> ILOCK exclusive
> remap_file_range()
> IUNLOCK exclusive
> IOUNLOCK shared
> 
> .....
> 
> UNSHARE_RANGE
> IOLOCK exclusive
> MMAPLOCK exclusive
> <drain DIO in flight>
> ILOCK exclusive
> unshare_range()
> IUNLOCK exclusive
> MMAPUNLOCK exclusive
> IOUNLOCK shared
> 
> And so there isn't a single "lock -> defrag -> unlock" context
> occurring - there are multiple independent operations that have
> different kernel side locking contexts and there are no userspace
> side file locking contexts, either.
> 
>>> 
>>>>             File IOs are blocked when the target file is locked and are served during the
>>>>             defragmentation idle time (file is unlocked).
>>> 
>>> What file IOs are being served in parallel? The defragmentation IO?
>>> something else?
>> 
>> Here the file IOs means the IOs request from user space applications including virtual machine
>> Engine.
>> 
>>> 
>>>>             Though
>>>>             the file IOs can't really go in parallel, they are not blocked long. The locking time
>>>>             basically depends on the segment size. Smaller segments usually take less locking time
>>>>             and thus IOs are blocked shorterly, bigger segments usually need more locking time and
>>>>             IOs are blocked longer. Check -s and -i options to balance the defragmentation and IO
>>>>             service.
>>> 
>>> How is a user supposed to know what the correct values are for their
>>> storage, files, and workload? Algorithms should auto tune, not
>>> require users and administrators to use trial and error to find the
>>> best numbers to feed a given operation.
>> 
>> In my option, user would need a way to control this according to their use case.
>> Any algorithms will restrict what user want to do.
>> Say, user want the defrag done as quick as possible regardless the resources it takes (CPU, IO and so on)
>> when the production system is in a maintenance window. But when the production system is busy
>> User want the defrag use less resources.
> 
> That's not for the defrag program to implement That's what we use
> resource control groups for. Things like memcgs, block IO cgroups,
> scheduler cgroups, etc. Administrators are used to restricting the
> resources used by applications with generic admin tools; asking them
> to learn how some random admin tool does it's own resrouce
> utilisation restriction that requires careful hand tuning for -one
> off admin events- is not the right way to solve this problem.
> 
> We should be making the admin tool go as fast as possible and
> consume as much resources as are available. This makes it fast out
> of the box, and lets the admins restrict the IO rate, CPU and memory
> usage to bring it down to an acceptible resource usage level for
> admin tasks on their systems.
> 
>> Another example, kernel (algorithms) never knows the maximum IO latency the user applications tolerate.
>> But if you have some algorithms, please share.
> 
> As I said - make it as fast and low latency as reasonably possible.
> If you have less than 10ms IO latency SLAs, the application isn't
> going to be running on sparse, software defined storage that may
> require hundreds of milliseconds of IO pauses during admin tasks.
> Hence design to a max fixed IO latency (say 100ms) and make the
> funcitonality run as fast as possible within that latency window.
> 
> If people need lower latency SLAs, then they shouldn't be running
> that application on sparse, COW based VM images. This is not a
> problem a defrag utility should be trying to solve.
> 
>>>>             Free blocks consumption
>>>> 
>>>>             Defragmenation works by (trying) allocating new (contiguous) blocks, copying data and
>>>>             then freeing old (non-contig) blocks. Usually the number of old blocks to free equals
>>>>             to the number the newly allocated blocks. As a finally result, defragmentation doesn't
>>>>             consume free blocks.  Well, that is true if the target file is not sharing blocks with
>>>>             other files.
>>> 
>>> This is really hard to read. Defragmentation will -always- consume
>>> free space while it is progress. It will always release the
>>> temporary space it consumes when it completes.
>> 
>> I don’t think it’s always free blocks when it releases the temporary file. When the blocks were
>> Original shared before defrag, the blocks won’t be freed.
> 
> I didn't make myself clear. If the blocks shared to the temp file
> are owned exclusively by the source file (i.e. they were COW'd from
> shared extents at some time in the past), then that is space
> that is temporarily required by the defragmentation process. UNSHARE
> creates a second, permanent copy of those blocks in the source file
> and closing of the temp file them makes the original exclusively
> owned blocks go away.
> 
> IOWs, defrag can temporarily consume an entire extra file's worth of
> space between the UNSHARE starting and the freeing of the temporary
> file when we are done with it. Freeing the temp file -always-
> releases this extra space, though I note that the implementation is
> to hole-punch it away after each segment has been processed.
> 
>>> 
>>>>             In case the target file contains shared blocks, those shared blocks won't
>>>>             be freed back to filesystem as they are still owned by other files. So defragmenation
>>>>             allocates more blocks than it frees.
>>> 
>>> So this is doing an unshare operation as well as defrag? That seems
>>> ... suboptimal. The whole point of sharing blocks is to minimise
>>> disk usage for duplicated data.
>> 
>> That depends on user's need. If users think defrag is the first
>> priority, it is.  If users don’t think the disk
>> saving is the most important, it is not. No matter what developers think.
>                                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> 
> That's pretty ... dismissive.
> 
> I mean, you're flat out wrong. You make the assumption that a user
> knows exactly how every file that every application in their system
> has been created and knows exactly how best to defragment it.
> 
> That's just .... wrong.
> 
> Users and admins do not have intimate knowledge of how their
> applications do their stuff, and a lot of them don't even know
> that their systems are using file clones (i.e. reflink copies)
> instead of data copies extensively these days.
> 
> That is completely the wrong way to approach administration
> tools. 
> 
> Our defragmentation policy for xfs_fsr is to leave the structure of
> the file as intact as possible. That means we replicate unwritten
> regions in the defragmented file. We actually -defragment unwritten
> extents- in xfs_fsr, not just written extents, and we do that
> because we have to assume that the unwritten extents exist for a
> good reason.
> 
> We don't expect the admin to make a decision as to whether unwritten
> extents should be replicated or defragged - we make the assumption
> that either the application or the admin has asked for them to exist
> in the first place.
> 
> It is similar for defragmenting files that are largely made up of shared
> extents. That layout exists for a reason, and it's not the place of
> the defragmentation operation to unilaterally decide layout policies
> for the admin and/or application that is using files with shared
> extents.
> 
> Hence the defrag operation must preserve the *observed intent* of
> the source file layout as much as possible and not require the admin
> or user to be sufficiently informed to make the right decision one
> way or another. We must attempt to preserve the status quo.
> 
> Hence if the file is largely shared, we must not unshare the entire
> file to defragment it unless that is the only way to reduce the
> fragmentation (e.g. resolve small interleaved shared and unshared
> extents). If there are reasonable sized shared extents, we should be
> leaving them alone and not unsharing them just to reduce the extent
> count by a handful of extents.
> 
>> What’s more, reflink (or sharing blocks) is not only used to minimize disk usage. Sometimes it’s
>> Used as way to take snapshots. And those snapshots might won’t stay long.
> 
> Yes, I know this. It doesn't change anything to do with how we
> defragment a file that contains shared blocks.
> 
> If you don't want the snapshot(s) to affect defragmentation, then
> don't run defrag while the snapshots are present. Otherwise, we
> want defrag to retain as much sharing between the snapshots and
> the source file because *minimising the space used by snapshots* is
> the whole point of using file clones for snapshots in the first
> place!
> 
>> And what’s more is that, the unshare operation is what you suggested :D   
> 
> I suggested it as a mechanism to defrag regions of shared files with
> excessive fragmentation. I was not suggesting that "defrag ==
> unshare".
> 
>>>>             For existing XFS, free blocks might be over-
>>>>             committed when reflink snapshots were created. To avoid causing the XFS running into
>>>>             low free blocks state, this defragmentation excludes (partially) shared segments when
>>>>             the file system free blocks reaches a shreshold. Check the -f option.
>>> 
>>> Again, how is the user supposed to know when they need to do this?
>>> If the answer is "they should always avoid defrag on low free
>>> space", then why is this an option?
>> 
>> I didn’t say "they should always avoid defrag on low free space”. And even we can’t say how low is
>> Not tolerated by user, that depends on user use case. Though it’s an option, it has the default value
>> Of 1GB. If users don’t set this option, that is "always avoid defrag on low free space”.
> 
> You didn't answer my question: how is the user supposed to know
> when they should set this?
> 
> And, again, the followup question is: why does this need to be
> built into the defrag tool?
> 
> From a policy perspective, caring about the amount of free space in
> the filesystem isn't the job of a defragmentation operation. It
> should simply abort if it gets an ENOSPC error or fails to improve
> the layout of the file in question. Indeed, if it is obvious that
> there may not be enough free space in the filesystem to begin with
> thendon't run the defrag operation at all.
> 
> This is how xfs_fsr works - it tries to preallocate all the space it
> will need before it starts moving data. If it fails to preallocate
> all the space, it aborts. If it fails to find large enough
> contiguous free spaces to improve the layout of the file, it aborts.
> 
> IOWs, xfs_fsr policy is that it doesn't care about the amount of
> free space in the filesystem, it just cares if the result will
> improve the layout of the file.  That's basically how any online
> background defrag operation should work - if the new
> layout is worse than the existing layout, or there isn't space for
> the new layout to be allocated, just abort.
> 
> 
>>>>             Safty and consistency
>>>> 
>>>>             The defragmentation file is guanrantted safe and data consistent for ctrl-c and kernel
>>>>             crash.
>>> 
>>> Which file is the "defragmentation file"? The source or the temp
>>> file?
>> 
>> I don’t think there is "source concept" here. There is no data copy between files.
>> “The defragmentation file” means the file under defrag, I will change it to “The file under defrag”.
>> I don’t think users care about the temporary file at all.
> 
> Define the terms you use rather than assuming the reader
> understands both the terminology you are using and the context in
> which you are using them.
> 
> .....
> 
>>> 
>>>>             The command takes the following options:
>>>>                -f free_space
>>>>                    The shreshold of XFS free blocks in MiB. When free blocks are less than this
>>>>                    number, (partially) shared segments are excluded from defragmentation. Default
>>>>                    number is 1024
>>> 
>>> When you are down to 4MB of free space in the filesystem, you
>>> shouldn't even be trying to run defrag because all the free space
>>> that will be left in the filesystem is single blocks. I would have
>>> expected this sort of number to be in a percentage of capacity,
>>> defaulting to something like 5% (which is where we start running low
>>> space algorithms in the kernel).
>> 
>> I would like leave this to user.
> 
> Again: How is the user going to know what to set this to? What
> problem is this avoiding that requires the user to change this in
> any way.
> 
>> When user is doing defrag on low free space system, it won’t cause
>> Problem to file system its self. At most the defrag fails during unshare when allocating blocks.
> 
> Why would we even allow a user to run defrag near ENOSPC? It is a
> well known problem that finding contiguous free space when we are close
> to ENOSPC is difficult and so defrag often is unable to improve the
> situation when we are within a few percent of the filesysetm being
> full.
> 
> It is also a well known problem that defragmentation at low free
> space trades off contiguous free space for fragmented free space.
> Hence when we are at low free space, defrag makes the free space
> fragmetnation worse, which then results in all allocation in the
> filesystem getting worse and more fragmented. This is something we
> absolutely should be trying to avoid.
> 
> This is one of the reasons xfs_fsr tries to layout the entire
> file before doing any IO - when about 95% full, it's common for the
> new layout to be worse than the original file's layout because there
> isn't sufficient contiguous free space to improve the layout.
> 
> IOWs, running defragmentation when we are above 95% full is actively
> harmful to the longevity of the filesystem. Hence, on a fundamental
> level, having a low space threshold in a defragmentation tool is
> simply wrong - defragmentation should simply not be run when the
> filesystem is anywhere near full.
> 
> .....
> 
>>>> 
>>>>                -s segment_size
>>>>                    The size limitation in bytes of segments. Minimium number is 4MiB, default
>>>>                    number is 16MiB.
>>> 
>>> Why were these numbers chosen? What happens if the file has ~32MB
>>> sized extents and the user wants the file to be returned to a single
>>> large contiguous extent it possible? i.e. how is the user supposed
>>> to know how to set this for any given file without first having
>>> examined the exact pattern of fragmentations in the file?
>> 
>> Why customer want the file to be returned to a single large contiguous extent?
>> A 32MB extent is pretty good to me.  I didn’t here any customer
>> complain about 32MB extents…
> 
> There's a much wider world out there than just Oracle customers.
> Just because you aren't aware of other use cases that exist, it
> doesn't mean they don't exist. I know they exist, hence my question.
> 
> For example, extent size hints are used to guarantee that the data
> is aligned to the underlying storage correctly, and very large
> contiguous extents are required to avoid excessive seeks during
> sequential reads that result in critical SLA failures. Hence if a
> file is poorly laid out in this situation, defrag needs to return it
> to as few, maximally sized extents as it can. How does a user know
> what they'd need to set this segment size field to and so acheive
> the result they need?
> 
>> And you know, whether we can defrag extents to a large one depends on not only the tool it’s self.
>> It’s depends on the status of the filesystem, say if the filesystem is very fragmented too, say the AG size..
>> 
>> The 16MB is selected according our tests basing on a customer metadump. With 16MB segment size,
>> The the defrag result is very good and the IO latency is acceptable too.  With the default 16MB segment
>> Size, 32MB extent is excluded from defrag.
> 
> Exactly my point: you have written a solution that works for a
> single filesystem in a single environment.  However, the solution is
> so specific to the single problem you need to solve that it is not
> clear whether that functionality or defaults are valid outside of
> the specific problem case you've written it for and tested it on.
> 
>> If you have better default size, we can use that.
> 
> I'm not convinced that fixed size "segments" is even the right way
> to approach this problem. What needs to be done is dependent on the
> extent layout of the file, not how extents fit over some arbitrary
> fixed segment map....
> 
>>>> We tested with real customer metadump with some different 'idle_time's and found 250ms is good pratice
>>>> sleep time. Here comes some number of the test:
>>>> 
>>>> Test: running of defrag on the image file which is used for the back end of a block device in a
>>>>     virtual machine. At the same time, fio is running at the same time inside virtual machine
>>>>     on that block device.
>>>> block device type:   NVME
>>>> File size:           200GiB
>>>> paramters to defrag: free_space: 1024 idle_time: 250 First_extent_share: enabled readahead: disabled
>>>> Defrag run time:     223 minutes
>>>> Number of extents:   6745489(before) -> 203571(after)
>>> 
>>> So and average extent size of ~32kB before, 100MB after? How much of
>>> these are shared extents?
>> 
>> Zero shared extents, but there are some unwritten ones.
>> A similar run stats is like this:
>> Pre-defrag 6654460 extents detected, 112228 are "unwritten",0 are "shared” 
>> Tried to defragment 6393352 extents (181000359936 bytes) in 26032 segments Time stats(ms): max clone: 31, max unshare: 300, max punch_hole: 66
>> Post-defrag 282659 extents detected
>> 
>>> 
>>> Runtime is 13380secs, so if we copied 200GiB in that time, the
>>> defrag ran at 16MB/s. That's not very fast.
>>> 
>> 
>> We are chasing the balance of defrag and parallel IO latency.
> 
> My point is that stuff like CLONE and UNSHARE should be able to run
> much, much faster than this, even if some of the time is left idle
> for other IO.
> 
> i.e. we can clone extents at about 100,000/s. We can copy data
> through the page cache at 7-8GB/s on NVMe devices.
> 
> A full clone of the 6.6 million extents should only take about
> a minute.
> 
> A full page cache copy of the 200GB cloned file (i.e. via read/write
> syscalls) should easily run at >1GB/s, and so only take a couple of
> minutes to run.
> 
> IOWs, the actual IO and metadata modification side of things is
> really only about 5 minutes worth of CPU and IO.
> 
> Hence this defrag operation is roughly 100x slower than we should be
> able to run it at.  We should be able to run it at close to those
> speeds whilst still allowing concurrent read access to the file.
> 
> If an admin then wants it to run at 16MB/s, it can be throttled
> to that speed using cgroups, ionice, etc.
> 
> i.e. I think you are trying to solve too many unnecessary problems
> here and not addressing the one thing it should do: defrag a file as
> fast and efficiently as possible.
> 

Thanks for all above replies.
For the performance, I am still thinking that the bottle neck is at the synchronous page by page disk reading.

Yes, I have to address/workaround something in kernel.
Let’s expect the related kernel fixings and then simplify the user space defrag code.