
[0/9] xfs file non-exclusive online defragment

Message ID: 20231214170530.8664-1-wen.gang.wang@oracle.com

Wengang Wang Dec. 14, 2023, 5:05 p.m. UTC
Background:
We have the existing xfs_fsr tool, which defragments files. It works as follows:
1. Defragmentation is implemented by copying the file.
2. The copy (to a temporary file) is exclusive: the source file is locked
   for the duration of the copy, and all IO requests are blocked until the
   copy is done.
3. The copy can take a long time for huge files, with IO blocked throughout.
4. The copy requires as many free blocks as the source file occupies.
   If the source is huge, say 1 TiB, it is hard to require the file system
   to have another 1 TiB free.

The use case of concern is XFS files used as image files for virtual machines:
1. The image files are huge; they can reach hundreds of GiB or even TiB.
2. Backups are made via reflink copies, and the resulting CoW leaves the files
   badly fragmented.
3. Fragmentation makes reflink copies very slow.
4. During a reflink copy, all IO requests to the file are blocked for a very
   long time. That causes timeouts in the VM, and those timeouts lead to disaster.

This feature aims to:
1. reduce file fragmentation, making future reflink copies (much) faster, and
2. at the same time, defragment in a non-exclusive manner that does not block
   file IO for long.

Non-exclusive defragment
Here we introduce a non-exclusive way to defragment a file, especially a huge
file, without blocking IO to it for long. Non-exclusive defragmentation divides
the whole file into small pieces. For each piece, we lock the file, defragment
the piece and unlock the file. Defragmenting a small piece doesn't take long,
so file IO requests are served between pieces rather than being blocked for a
long time. We also insert a (user-adjustable) idle time between defragmenting
two consecutive pieces to balance defragmentation against file IO. So although
the defragmentation can take longer than xfs_fsr, it balances defragmentation
and file IO.

Operation target
The operation targets are files in an XFS filesystem.

User interface
A brand-new command, xfs_defrag, is provided. The user can
start/stop/suspend/resume/get-status the defragmentation of a file.
With the xfs_defrag command the user can specify:
1. the target extent size: extents smaller than this are defragmentation targets.
2. the piece size: the whole file is divided into pieces according to this size.
3. the idle time: the time to idle between defragmenting two adjacent pieces.

Piece
A piece is the smallest unit on which we do defragmentation. A piece covers a
range of contiguous file blocks and may contain one or more extents.

Target Extent Size
This is a configuration value, in blocks, indicating which extents are
defragmentation targets. Extents smaller than this value are Target Extents.
When a piece contains two or more Target Extents, the piece is a Target Piece.
Defragmenting a piece requires at least 2 x TES contiguous free file system
blocks. If TES is set too big, the defragmentation can fail to allocate that
many contiguous file system blocks. The default is 64 blocks.

Piece Size
This is a configuration value indicating the maximum size of a piece in blocks;
a piece is no larger than this size. Defragmenting a piece requires up to PS
contiguous free filesystem blocks. If PS is set too big, the defragmentation
can fail to allocate that many contiguous file system blocks. The default is
4096 blocks, which is also the maximum.

Error reporting
When the defragmentation fails (usually due to a file system block allocation
failure), the error is returned to the user application when it fetches the
defragmentation status.

Idle Time
The idle time is a configuration value: the time the defragmentation idles
between defragmenting two adjacent pieces. There is no limit on IT.

Some test results:
50 GiB file with 2013990 extents, an average of 6.5 blocks per extent.
A reflink copy took 40s (the reflink copy was then removed before the following tests).
The above file was used as a block device in a VM, with XFS v5 created on that
VM block device.
Mount and kernel build from the VM (buffered writes + fsync to the backing
image file) without defrag:   13m39.497s
Kernel build from the VM (buffered writes + sync) with defrag (target extent = 256,
piece size = 4096, idle time = 1000 ms):   15m1.183s
Defrag itself took: 123m27.354s

Wengang Wang (9):
  xfs: defrag: introduce strucutures and numbers.
  xfs: defrag: initialization and cleanup
  xfs: defrag implement stop/suspend/resume/status
  xfs: defrag: allocate/cleanup defragmentation
  xfs: defrag: process some cases in xfs_defrag_process
  xfs: defrag: piece picking up
  xfs: defrag: guarantee contigurous blocks in cow fork
  xfs: defrag: copy data from old blocks to new blocks
  xfs: defrag: map new blocks

 fs/xfs/Makefile        |    1 +
 fs/xfs/libxfs/xfs_fs.h |    1 +
 fs/xfs/xfs_bmap_util.c |    2 +-
 fs/xfs/xfs_defrag.c    | 1074 ++++++++++++++++++++++++++++++++++++++++
 fs/xfs/xfs_defrag.h    |   11 +
 fs/xfs/xfs_inode.c     |    4 +
 fs/xfs/xfs_inode.h     |    1 +
 fs/xfs/xfs_ioctl.c     |   17 +
 fs/xfs/xfs_iomap.c     |    2 +-
 fs/xfs/xfs_mount.c     |    3 +
 fs/xfs/xfs_mount.h     |   37 ++
 fs/xfs/xfs_reflink.c   |    7 +-
 fs/xfs/xfs_reflink.h   |    3 +-
 fs/xfs/xfs_super.c     |    3 +
 include/linux/fs.h     |    5 +
 15 files changed, 1165 insertions(+), 6 deletions(-)
 create mode 100644 fs/xfs/xfs_defrag.c
 create mode 100644 fs/xfs/xfs_defrag.h

Comments

Darrick J. Wong Dec. 14, 2023, 9:35 p.m. UTC | #1
On Thu, Dec 14, 2023 at 09:05:21AM -0800, Wengang Wang wrote:
> [snip]
> 
> Non-exclusive defragment
> Here we are introducing the non-exclusive manner to defragment a file,
> especially for huge files, without blocking IO to it long. Non-exclusive
> defragmentation divides the whole file into small pieces. For each piece,
> we lock the file, defragment the piece and unlock the file. Defragmenting
> the small piece doesn’t take long. File IO requests can get served between
> pieces before blocked long.  Also we put (user adjustable) idle time between
> defragmenting two consecutive pieces to balance the defragmentation and file IOs.
> So though the defragmentation could take longer than xfs_fsr,  it balances
> defragmentation and file IOs.

I'm kinda surprised you don't just turn on alwayscow mode, use an
iomap_funshare-like function to read in and dirty pagecache (which will
hopefully create a new large cow fork mapping) and then flush it all
back out with writeback.  Then you don't need all this state tracking,
kthreads management, and copying file data through the buffer cache.
Wouldn't that be a lot simpler?

--D

> [snip]
Dave Chinner Dec. 15, 2023, 3:15 a.m. UTC | #2
On Thu, Dec 14, 2023 at 01:35:02PM -0800, Darrick J. Wong wrote:
> On Thu, Dec 14, 2023 at 09:05:21AM -0800, Wengang Wang wrote:
> > [snip]
> 
> I'm kinda surprised you don't just turn on alwayscow mode, use an
> iomap_funshare-like function to read in and dirty pagecache (which will
> hopefully create a new large cow fork mapping) and then flush it all
> back out with writeback.  Then you don't need all this state tracking,
> kthreads management, and copying file data through the buffer cache.
> Wouldn't that be a lot simpler?

Hmmm. I don't think it needs any kernel code to be written at all.
I think we can do atomic section-by-section, crash-safe active file
defrag from userspace like this:

	scratch_fd = open(O_TMPFILE);
	defrag_fd = open("file-to-be-dfragged");

	while (offset < target_size) {

		/*
		 * share a range of the file to be defragged into
		 * the scratch file.
		 */
		args.src_fd = defrag_fd;
		args.src_offset = offset;
		args.src_len = length;
		args.dst_offset = offset;
		ioctl(scratch_fd, FICLONERANGE, args);

		/*
		 * Force the shared range to be unshared via a
		 * copy-on-write operation in the file to be
		 * defragged. This causes the file needing to be
		 * defragged to have new extents allocated and the
		 * data to be copied over and written out.
		 */
		fallocate(defrag_fd, FALLOC_FL_UNSHARE_RANGE, offset, length);
		fdatasync(defrag_fd);

		/*
		 * Punch out the original extents we shared to the
		 * scratch file so they are returned to free space.
		 */
		fallocate(scratch_fd, FALLOC_FL_PUNCH, offset, length);

		/* move onto next region */
		offset += length;
	};

As long as the length is large enough for the unshare to create a
large contiguous delalloc region for the COW, I think this would
likely achieve the desired "non-exclusive" defrag requirement.

If we were to implement this as, say, an xfs_spaceman operation,
then all the user-controlled policy bits (like inter-chunk delays,
chunk sizes, etc.) just become command line parameters for the
defrag command...

Cheers,

Dave.
Christoph Hellwig Dec. 15, 2023, 4:06 a.m. UTC | #3
On Thu, Dec 14, 2023 at 01:35:02PM -0800, Darrick J. Wong wrote:
> I'm kinda surprised you don't just turn on alwayscow mode, use an
> iomap_funshare-like function to read in and dirty pagecache (which will
> hopefully create a new large cow fork mapping) and then flush it all
> back out with writeback.  Then you don't need all this state tracking,
> kthreads management, and copying file data through the buffer cache.
> Wouldn't that be a lot simpler?

Yes, although with a caveat or two.

We did that for the zoned XFS project, where a 'defragmentation' like
this which we call garbage collection is an essential part of the
operation to free entire zones.  I ended up initially implementing it
using iomap_file_unshare as that is page cache coherent and a nicely
available library function.  But it turns out iomap_file_unshare sucks
badly as it will read the data synchronously one block at a time.

I ended up coming up with my own duplication of iomap_file_unshare
that doesn't do that, which is a bit hacky but solves this problem.
I'd love to eventually merge it back into iomap_file_unshare, for
which we really need to work on our writeback iterators.

The relevant commit for the new helper is here:


   http://git.infradead.org/users/hch/xfs.git/commitdiff/cc4a639e3052fefb385f63a0db5dfe07db4e9d58

which also needs a hacky readahead helper:

    http://git.infradead.org/users/hch/xfs.git/commitdiff/f6d545fc00300ddfd3e297d17e4f229ad2f15c3e

The code using this for zoned GC is here:

    http://git.infradead.org/users/hch/xfs.git/blob/refs/heads/xfs-zoned:/fs/xfs/xfs_zone_alloc.c#l764

It probably would make sense to also be able to use this for a regular
fs for the online defrag use case, although the wire-up would be a bit
different.
Wengang Wang Dec. 15, 2023, 4:48 p.m. UTC | #4
Thanks Darrick and Christoph for such a quick look at this!

Yes, iomap_funshare sounds more interesting. I will look into it.

Wengang

> On Dec 14, 2023, at 8:06 PM, Christoph Hellwig <hch@infradead.org> wrote:
> 
> [snip]
Wengang Wang Dec. 15, 2023, 5:07 p.m. UTC | #5
> On Dec 14, 2023, at 7:15 PM, Dave Chinner <david@fromorbit.com> wrote:
> 
> On Thu, Dec 14, 2023 at 01:35:02PM -0800, Darrick J. Wong wrote:
>> On Thu, Dec 14, 2023 at 09:05:21AM -0800, Wengang Wang wrote:
>>> [snip]
>> 
>> I'm kinda surprised you don't just turn on alwayscow mode, use an
>> iomap_funshare-like function to read in and dirty pagecache (which will
>> hopefully create a new large cow fork mapping) and then flush it all
>> back out with writeback.  Then you don't need all this state tracking,
>> kthreads management, and copying file data through the buffer cache.
>> Wouldn't that be a lot simpler?
> 
> Hmmm. I don't think it needs any kernel code to be written at all.
> I think we can do atomic section-by-section, crash-safe active file
> defrag from userspace like this:
> 
> 	scratch_fd = open(O_TMPFILE);
> 	defrag_fd = open("file-to-be-dfragged");
> 
> 	while (offset < target_size) {
> 
> 		/*
> 		 * share a range of the file to be defragged into
> 		 * the scratch file.
> 		 */
> 		args.src_fd = defrag_fd;
> 		args.src_offset = offset;
> 		args.src_len = length;
> 		args.dst_offset = offset;
> 		ioctl(scratch_fd, FICLONERANGE, args);
> 
> 		/*
> 		 * Force the shared range to be unshared via a
> 		 * copy-on-write operation in the file to be
> 		 * defragged. This causes the file needing to be
> 		 * defragged to have new extents allocated and the
> 		 * data to be copied over and written out.
> 		 */
> 		fallocate(defrag_fd, FALLOC_FL_UNSHARE_RANGE, offset, length);
> 		fdatasync(defrag_fd);
> 
> 		/*
> 		 * Punch out the original extents we shared to the
> 		 * scratch file so they are returned to free space.
> 		 */
> 		fallocate(scratch_fd, FALLOC_FL_PUNCH, offset, length);
> 
> 		/* move onto next region */
> 		offset += length;
> 	};
> 
> As long as the length is large enough for the unshare to create a
> large contiguous delalloc region for the COW, I think this would
> likely acheive the desired "non-exclusive" defrag requirement.
> 
> If we were to implement this as, say, and xfs_spaceman operation
> then all the user controlled policy bits (like inter chunk delays,
> chunk sizes, etc) then just becomes command line parameters for the
> defrag command...


Ha, the idea from user space is very interesting!
So far I have the following thoughts:
1) Does FICLONERANGE/FALLOC_FL_UNSHARE_RANGE/FALLOC_FL_PUNCH work on a FS
   without reflink enabled?
2) What if there is a big hole in the file to be defragmented? Will it cause
   block allocation and writing blocks of zeroes?
3) In case a big range of the file is good (not much fragmented), the ‘defrag’
   on that range is not necessary.
4) The user space defrag can’t use a try-lock mode to give IO requests priority.
   I am not sure whether this is very important.

Maybe we can work with xfs_bmap to get extent info and skip good extents and
holes, which would help cases 2) and 3).

I will figure the above out.
Again, the idea is so amazing; I hadn’t realized it.

Thanks,
Wengang
Darrick J. Wong Dec. 15, 2023, 5:30 p.m. UTC | #6
On Fri, Dec 15, 2023 at 05:07:36PM +0000, Wengang Wang wrote:
> 
> 
> > On Dec 14, 2023, at 7:15 PM, Dave Chinner <david@fromorbit.com> wrote:
> > 
> > On Thu, Dec 14, 2023 at 01:35:02PM -0800, Darrick J. Wong wrote:
> >> On Thu, Dec 14, 2023 at 09:05:21AM -0800, Wengang Wang wrote:
> >>> [snip]
> >> 
> >> I'm kinda surprised you don't just turn on alwayscow mode, use an
> >> iomap_funshare-like function to read in and dirty pagecache (which will
> >> hopefully create a new large cow fork mapping) and then flush it all
> >> back out with writeback.  Then you don't need all this state tracking,
> >> kthreads management, and copying file data through the buffer cache.
> >> Wouldn't that be a lot simpler?
> > 
> > Hmmm. I don't think it needs any kernel code to be written at all.
> > I think we can do atomic section-by-section, crash-safe active file
> > defrag from userspace like this:
> > 
> > 	scratch_fd = open(O_TMPFILE);
> > 	defrag_fd = open("file-to-be-dfragged");
> > 
> > 	while (offset < target_size) {
> > 
> > 		/*
> > 		 * share a range of the file to be defragged into
> > 		 * the scratch file.
> > 		 */
> > 		args.src_fd = defrag_fd;
> > 		args.src_offset = offset;
> > 		args.src_len = length;
> > 		args.dst_offset = offset;
> > 		ioctl(scratch_fd, FICLONERANGE, args);
> > 
> > 		/*
> > 		 * Force the shared range to be unshared via a
> > 		 * copy-on-write operation in the file to be
> > 		 * defragged. This causes the file needing to be
> > 		 * defragged to have new extents allocated and the
> > 		 * data to be copied over and written out.
> > 		 */
> > 		fallocate(defrag_fd, FALLOC_FL_UNSHARE_RANGE, offset, length);
> > 		fdatasync(defrag_fd);
> > 
> > 		/*
> > 		 * Punch out the original extents we shared to the
> > 		 * scratch file so they are returned to free space.
> > 		 */
> > 		fallocate(scratch_fd, FALLOC_FL_PUNCH, offset, length);

You could even set args.dst_offset = 0 and ftruncate here.

But yes, this is a better suggestion than adding more kernel code.

> > 		/* move onto next region */
> > 		offset += length;
> > 	};
> > 
> > As long as the length is large enough for the unshare to create a
> > large contiguous delalloc region for the COW, I think this would
> > likely acheive the desired "non-exclusive" defrag requirement.
> > 
> > If we were to implement this as, say, and xfs_spaceman operation
> > then all the user controlled policy bits (like inter chunk delays,
> > chunk sizes, etc) then just becomes command line parameters for the
> > defrag command...
> 
> 
> Ha, the idea from user space is very interesting!
> So far I have the following thoughts:
> 1). If the FICLONERANGE/FALLOC_FL_UNSHARE_RANGE/FALLOC_FL_PUNCH works
> on a FS without reflink enabled.

It does not.

That said, for your usecase (reflinked vm disk images that fragment over
time) that won't be an issue.  For non-reflink filesystems, there are
fewer chances of extreme fragmentation due to the lack of COW.

> 2). What if there is a big hole in the file to be defragmented? Will
> it cause block allocation and writing blocks with zeroes.

FUNSHARE ignores holes.

> 3). In case a big range of the file is good (not much fragmented), the
> ‘defrag’ on that range is not necessary.

Yep, so you'd have to check the bmap/fiemap output first to identify
areas that are more fragmented than you'd like.

> 4). The use space defrag can’t use a try-lock mode to make IO requests
> have priorities. I am not sure if this is very important.
> 
> Maybe we can work with xfs_bmap to get extents info and skip good
> extents and holes to help case 2) and 3).

Yeah, that sounds necessary.

--D

> I will figure above out.
> Again, the idea is so amazing, I didn’t reallize it.
> 
> Thanks,
> Wengang
>
Wengang Wang Dec. 15, 2023, 8:03 p.m. UTC | #7
> On Dec 15, 2023, at 9:30 AM, Darrick J. Wong <djwong@kernel.org> wrote:
> 
> On Fri, Dec 15, 2023 at 05:07:36PM +0000, Wengang Wang wrote:
>> 
>> 
>>> On Dec 14, 2023, at 7:15 PM, Dave Chinner <david@fromorbit.com> wrote:
>>> 
>>> On Thu, Dec 14, 2023 at 01:35:02PM -0800, Darrick J. Wong wrote:
>>>> On Thu, Dec 14, 2023 at 09:05:21AM -0800, Wengang Wang wrote:
>>>>> Background:
>>>>> We have the existing xfs_fsr tool which do defragment for files. It has the
>>>>> following features:
>>>>> 1. Defragment is implemented by file copying.
>>>>> 2. The copy (to a temporary file) is exclusive. The source file is locked
>>>>>  during the copy (to a temporary file) and all IO requests are blocked
>>>>>  before the copy is done.
>>>>> 3. The copy could take long time for huge files with IO blocked.
>>>>> 4. The copy requires as many free blocks as the source file has.
>>>>>  If the source is huge, say it’s 1TiB,  it’s hard to require the file
>>>>>  system to have another 1TiB free.
>>>>> 
>>>>> The use case in concern is that the XFS files are used as images files for
>>>>> Virtual Machines.
>>>>> 1. The image files are huge, they can reach hundreds of GiB and even to TiB.
>>>>> 2. Backups are made via reflink copies, and CoW makes the files badly fragmented.
>>>>> 3. fragmentation make reflink copies super slow.
>>>>> 4. during the reflink copy, all IO requests to the file are blocked for super
>>>>>  long time. That makes timeout in VM and the timeout lead to disaster.
>>>>> 
>>>>> This feature aims to:
>>>>> 1. reduce file fragmentation, making future reflinks (much) faster, and
>>>>> 2. at the same time, defragment in a non-exclusive manner that doesn’t
>>>>>  block file IO for long.
>>>>> 
>>>>> Non-exclusive defragment
>>>>> Here we introduce a non-exclusive manner of defragmenting a file,
>>>>> especially a huge file, without blocking IO to it for long. Non-exclusive
>>>>> defragmentation divides the whole file into small pieces. For each piece,
>>>>> we lock the file, defragment the piece and unlock the file. Defragmenting
>>>>> a small piece doesn’t take long, so file IO requests get served between
>>>>> pieces instead of being blocked for a long time. We also insert a (user
>>>>> adjustable) idle time between two consecutive pieces to balance
>>>>> defragmentation against file IO. So although the defragmentation can take
>>>>> longer than with xfs_fsr, it balances defragmentation and file IO.
>>>> 
>>>> I'm kinda surprised you don't just turn on alwayscow mode, use an
>>>> iomap_funshare-like function to read in and dirty pagecache (which will
>>>> hopefully create a new large cow fork mapping) and then flush it all
>>>> back out with writeback.  Then you don't need all this state tracking,
>>>> kthreads management, and copying file data through the buffer cache.
>>>> Wouldn't that be a lot simpler?
>>> 
>>> Hmmm. I don't think it needs any kernel code to be written at all.
>>> I think we can do atomic section-by-section, crash-safe active file
>>> defrag from userspace like this:
>>> 
>>> scratch_fd = open(".", O_TMPFILE | O_RDWR, 0600);
>>> defrag_fd = open("file-to-be-defragged", O_RDWR);
>>> 
>>> while (offset < target_size) {
>>> 
>>> /*
>>> * share a range of the file to be defragged into
>>> * the scratch file.
>>> */
>>> args.src_fd = defrag_fd;
>>> args.src_offset = offset;
>>> args.src_length = length;
>>> args.dest_offset = offset;
>>> ioctl(scratch_fd, FICLONERANGE, &args);
>>> 
>>> /*
>>> * Force the shared range to be unshared via a
>>> * copy-on-write operation in the file to be
>>> * defragged. This causes the file being
>>> * defragged to have new extents allocated and the
>>> * data to be copied over and written out.
>>> */
>>> fallocate(defrag_fd, FALLOC_FL_UNSHARE_RANGE, offset, length);
>>> fdatasync(defrag_fd);
>>> 
>>> /*
>>> * Punch out the original extents we shared to the
>>> * scratch file so they are returned to free space.
>>> */
>>> fallocate(scratch_fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
>>>           offset, length);
> 
> You could even set args.dest_offset = 0 and ftruncate here.
> 
> But yes, this is a better suggestion than adding more kernel code.
> 
>>> /* move onto next region */
>>> offset += length;
>>> };
>>> 
>>> As long as the length is large enough for the unshare to create a
>>> large contiguous delalloc region for the COW, I think this would
>>> likely achieve the desired "non-exclusive" defrag requirement.
>>> 
>>> If we were to implement this as, say, an xfs_spaceman operation,
>>> then all the user-controlled policy bits (like inter-chunk delays,
>>> chunk sizes, etc) just become command line parameters for the
>>> defrag command...
>> 
>> 
>> Ha, the idea from user space is very interesting!
>> So far I have the following thoughts:
>> 1). Whether FICLONERANGE/FALLOC_FL_UNSHARE_RANGE/FALLOC_FL_PUNCH work
>> on an FS without reflink enabled.
> 
> It does not.
> 
> That said, for your usecase (reflinked vm disk images that fragment over
> time) that won't be an issue.  For non-reflink filesystems, there's
> fewer chances for extreme fragmentation due to the lack of COW.
> 
>> 2). What if there is a big hole in the file to be defragmented? Will
>> it cause block allocation and writing blocks of zeroes?
> 
> FUNSHARE ignores holes.
> 
>> 3). If a big range of the file is already good (not much fragmented),
>> defragging that range is unnecessary.
> 
> Yep, so you'd have to check the bmap/fiemap output first to identify
> areas that are more fragmented than you'd like.
> 
>> 4). The user space defrag can’t use a try-lock mode to give IO requests
>> priority. I am not sure if this is very important.
>> 
>> Maybe we can work with xfs_bmap to get extent info and skip good
>> extents and holes to help cases 2) and 3).
> 
> Yeah, that sounds necessary.
> 

Thanks for answering!
Wengang
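The reflink/unshare/fdatasync/punch cycle discussed in this thread can be collected into a small self-contained helper. This is an editorial sketch only, not code from the thread: `piece_len` and `defrag_piece` are illustrative names, error handling is minimal, and the struct field names follow `struct file_clone_range` from `<linux/fs.h>`.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <linux/falloc.h>  /* FALLOC_FL_* flags */
#include <linux/fs.h>      /* FICLONERANGE, struct file_clone_range */
#include <sys/ioctl.h>
#include <unistd.h>

/* Clamp a piece so it never runs past end-of-file. */
static long long piece_len(long long offset, long long piece, long long fsize)
{
	if (offset >= fsize)
		return 0;
	return (fsize - offset < piece) ? fsize - offset : piece;
}

/* One pass of the reflink/unshare/punch cycle over [offset, offset+length). */
static int defrag_piece(int defrag_fd, int scratch_fd,
			long long offset, long long length)
{
	struct file_clone_range args = {
		.src_fd = defrag_fd,
		.src_offset = (unsigned long long)offset,
		.src_length = (unsigned long long)length,
		.dest_offset = (unsigned long long)offset,
	};

	/* Share the range into the scratch file. */
	if (ioctl(scratch_fd, FICLONERANGE, &args) < 0)
		return -1;

	/* Unshare triggers COW: new, contiguous extents in the source. */
	if (fallocate(defrag_fd, FALLOC_FL_UNSHARE_RANGE, offset, length) < 0)
		return -1;
	if (fdatasync(defrag_fd) < 0)
		return -1;

	/* Return the old extents to free space. */
	return fallocate(scratch_fd,
			 FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
			 offset, length);
}
```

Running `defrag_piece` for real requires a reflink-enabled XFS filesystem; `piece_len` is pure arithmetic and usable anywhere.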
Dave Chinner Dec. 15, 2023, 8:20 p.m. UTC | #8
On Fri, Dec 15, 2023 at 05:07:36PM +0000, Wengang Wang wrote:
> > On Dec 14, 2023, at 7:15 PM, Dave Chinner <david@fromorbit.com> wrote:
> > If we were to implement this as, say, an xfs_spaceman operation,
> > then all the user-controlled policy bits (like inter-chunk delays,
> > chunk sizes, etc) just become command line parameters for the
> > defrag command...
> 
> 
> Ha, the idea from user space is very interesting!
> So far I have the following thoughts:
> 1). Whether FICLONERANGE/FALLOC_FL_UNSHARE_RANGE/FALLOC_FL_PUNCH work on an FS without reflink
>      enabled.

Personally, I don't care if reflink is not enabled. It's the default
for new filesystems, and it's cost free for anyone who is not
using reflink so there is no reason for anyone to turn it off.

What I'm saying is "don't compromise the design of the functionality
required just because someone might choose to disable that
functionality".

> 2). What if there is a big hole in the file to be defragmented? Will it cause block allocation and writing blocks of
>     zeroes?

Unshare skips holes.

> 3). If a big range of the file is already good (not much fragmented), defragging that range is unnecessary.

xfs_fsr already deals with this - it uses XFS_IOC_GETBMAPX to scan
the extent list to determine what to defrag, to replicate unwritten
regions and to skip holes. Having to scan the extent list is kinda
expected for a defrag utility.

> 4). The user space defrag can’t use a try-lock mode to give IO requests priority. I am not sure if this is very important.

As long as the individual operations aren't holding locks for a long
time, I doubt it matters. And you can use ionice to make sure the IO
being issued has background priority in the block scheduler...

Cheers,

Dave.
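The extent-list scan Dave mentions (XFS_IOC_GETBMAPX, as used by xfs_fsr) ultimately feeds a policy decision: is this range fragmented enough to bother with? A minimal editorial sketch of that decision, assuming a hypothetical `min_extent_blocks` threshold and extent sizes already gathered from the bmap scan (holes and unwritten regions would simply not be passed in):

```c
#include <stddef.h>

/*
 * A range is worth defragmenting when the average mapped extent in it
 * is smaller than min_extent_blocks. extent_blocks holds the length,
 * in blocks, of each mapped extent in the range.
 */
static int range_needs_defrag(const long long *extent_blocks, size_t nextents,
			      long long min_extent_blocks)
{
	long long total = 0;
	size_t i;

	if (nextents == 0)
		return 0;	/* hole or empty range: nothing to do */
	for (i = 0; i < nextents; i++)
		total += extent_blocks[i];
	return total / (long long)nextents < min_extent_blocks;
}
```

A real tool would also want to cap how much data it touches per piece; the threshold itself becomes one of the command-line policy knobs discussed above.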
Wengang Wang Dec. 18, 2023, 4:27 p.m. UTC | #9
> On Dec 15, 2023, at 12:20 PM, Dave Chinner <david@fromorbit.com> wrote:
> 
> On Fri, Dec 15, 2023 at 05:07:36PM +0000, Wengang Wang wrote:
>>> On Dec 14, 2023, at 7:15 PM, Dave Chinner <david@fromorbit.com> wrote:
>>> If we were to implement this as, say, an xfs_spaceman operation,
>>> then all the user-controlled policy bits (like inter-chunk delays,
>>> chunk sizes, etc) just become command line parameters for the
>>> defrag command...
>> 
>> 
>> Ha, the idea from user space is very interesting!
>> So far I have the following thoughts:
>>> 1). Whether FICLONERANGE/FALLOC_FL_UNSHARE_RANGE/FALLOC_FL_PUNCH work on an FS without reflink
>>>     enabled.
> 
> Personally, I don't care if reflink is not enabled. It's the default
> for new filesystems, and it's cost free for anyone who is not
> using reflink so there is no reason for anyone to turn it off.
> 
> What I'm saying is "don't compromise the design of the functionality
> required just because someone might choose to disable that
> functionality".
> 
>>> 2). What if there is a big hole in the file to be defragmented? Will it cause block allocation and writing blocks of
>>>    zeroes?
> 
> Unshare skips holes.
> 
>>> 3). If a big range of the file is already good (not much fragmented), defragging that range is unnecessary.
> 
> xfs_fsr already deals with this - it uses XFS_IOC_GETBMAPX to scan
> the extent list to determine what to defrag, to replicate unwritten
> regions and to skip holes. Having to scan the extent list is kinda
> expected for a defrag utility.
> 
>> 4). The user space defrag can’t use a try-lock mode to give IO requests priority. I am not sure if this is very important.
> 
> As long as the individual operations aren't holding locks for a long
> time, I doubt it matters. And you can use ionice to make sure the IO
> being issued has background priority in the block scheduler...
> 

Yes, thanks for the answers.
I will try it out.

Thanks,
Wengang
>
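The "inter chunk delays" policy knob quoted in this thread reduces to simple duty-cycle arithmetic: sleep long enough between pieces that defrag consumes only a chosen fraction of wall time, leaving the rest for foreground IO. An editorial sketch; the percentage knob (`duty_pct`) is an assumed parameter for illustration, not an existing xfs_spaceman option:

```c
/*
 * Given how long the last defrag piece took (microseconds) and the
 * fraction of wall time defrag may consume (1..100), return how long
 * to sleep before the next piece.
 */
static long long idle_usecs(long long busy_usecs, int duty_pct)
{
	if (duty_pct <= 0)
		return -1;	/* invalid knob */
	if (duty_pct >= 100)
		return 0;	/* no throttling */
	return busy_usecs * (100 - duty_pct) / duty_pct;
}
```

For example, at a 25% duty cycle a piece that took 1 ms of work is followed by 3 ms of idle time, so defrag stays at roughly a quarter of wall time regardless of how slow the storage is.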
Wengang Wang Dec. 19, 2023, 9:17 p.m. UTC | #10
Hi Dave,
Yes, the user space defrag works and satisfies my requirement (almost no change from your example code).
Let me know if you want it in xfsprogs.

Thanks,
Wengang
 
> On Dec 18, 2023, at 8:27 AM, Wengang Wang <wen.gang.wang@oracle.com> wrote:
> 
> 
> 
>> On Dec 15, 2023, at 12:20 PM, Dave Chinner <david@fromorbit.com> wrote:
>> 
>> On Fri, Dec 15, 2023 at 05:07:36PM +0000, Wengang Wang wrote:
>>>> On Dec 14, 2023, at 7:15 PM, Dave Chinner <david@fromorbit.com> wrote:
>>>> If we were to implement this as, say, an xfs_spaceman operation,
>>>> then all the user-controlled policy bits (like inter-chunk delays,
>>>> chunk sizes, etc) just become command line parameters for the
>>>> defrag command...
>>> 
>>> 
>>> Ha, the idea from user space is very interesting!
>>> So far I have the following thoughts:
>>> 1). Whether FICLONERANGE/FALLOC_FL_UNSHARE_RANGE/FALLOC_FL_PUNCH work on an FS without reflink
>>>    enabled.
>> 
>> Personally, I don't care if reflink is not enabled. It's the default
>> for new filesystems, and it's cost free for anyone who is not
>> using reflink so there is no reason for anyone to turn it off.
>> 
>> What I'm saying is "don't compromise the design of the functionality
>> required just because someone might choose to disable that
>> functionality".
>> 
>>> 2). What if there is a big hole in the file to be defragmented? Will it cause block allocation and writing blocks of
>>>   zeroes?
>> 
>> Unshare skips holes.
>> 
>>> 3). If a big range of the file is already good (not much fragmented), defragging that range is unnecessary.
>> 
>> xfs_fsr already deals with this - it uses XFS_IOC_GETBMAPX to scan
>> the extent list to determine what to defrag, to replicate unwritten
>> regions and to skip holes. Having to scan the extent list is kinda
>> expected for a defrag utility.
>> 
>>> 4). The user space defrag can’t use a try-lock mode to give IO requests priority. I am not sure if this is very important.
>> 
>> As long as the individual operations aren't holding locks for a long
>> time, I doubt it matters. And you can use ionice to make sure the IO
>> being issued has background priority in the block scheduler...
>> 
> 
> Yes, thanks for the answers.
> I will try it out.
> 
> Thanks,
> Wengang
Dave Chinner Dec. 19, 2023, 9:29 p.m. UTC | #11
On Tue, Dec 19, 2023 at 09:17:31PM +0000, Wengang Wang wrote:
> Hi Dave,
> Yes, the user space defrag works and satisfies my requirement (almost no change from your example code).

That's good to know :)

> Let me know if you want it in xfsprogs.

Yes, I think adding it as an xfs_spaceman command would be a good
way for this defrag feature to be maintained for anyone who has need
for it.

-Dave.
Wengang Wang Dec. 19, 2023, 10:23 p.m. UTC | #12
> On Dec 19, 2023, at 1:29 PM, Dave Chinner <david@fromorbit.com> wrote:
> 
> On Tue, Dec 19, 2023 at 09:17:31PM +0000, Wengang Wang wrote:
>> Hi Dave,
>> Yes, the user space defrag works and satisfies my requirement (almost no change from your example code).
> 
> That's good to know :)
> 
>> Let me know if you want it in xfsprogs.
> 
> Yes, I think adding it as an xfs_spaceman command would be a good
> way for this defrag feature to be maintained for anyone who has need
> for it.
> 

Got it. Will try it.

Thanks,
Wengang