[14/14] xfs: document future directions of online fsck

Message ID 167243825360.682859.5189751153452545448.stgit@magnolia (mailing list archive)
State New, archived
Series xfs: design documentation for online fsck

Commit Message

Darrick J. Wong Dec. 30, 2022, 10:10 p.m. UTC
From: Darrick J. Wong <djwong@kernel.org>

Add the seventh and final chapter of the online fsck documentation,
where we talk about future functionality that can tie in with the
functionality provided by the online fsck patchset.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 .../filesystems/xfs-online-fsck-design.rst         |  155 ++++++++++++++++++++
 1 file changed, 155 insertions(+)

Comments

Allison Henderson March 1, 2023, 5:37 a.m. UTC | #1
On Fri, 2022-12-30 at 14:10 -0800, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
> 
> Add the seventh and final chapter of the online fsck documentation,
> where we talk about future functionality that can tie in with the
> functionality provided by the online fsck patchset.
> 
> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> ---
>  .../filesystems/xfs-online-fsck-design.rst         |  155
> ++++++++++++++++++++
>  1 file changed, 155 insertions(+)
> 
> 
> diff --git a/Documentation/filesystems/xfs-online-fsck-design.rst
> b/Documentation/filesystems/xfs-online-fsck-design.rst
> index 05b9411fac7f..41291edb02b9 100644
> --- a/Documentation/filesystems/xfs-online-fsck-design.rst
> +++ b/Documentation/filesystems/xfs-online-fsck-design.rst
> @@ -4067,6 +4067,8 @@ The extra flexibility enables several new use
> cases:
>    (``FIEXCHANGE_RANGE``) to exchange the file contents, thereby
> committing all
>    of the updates to the original file, or none of them.
>  
> +.. _swapext_if_unchanged:
> +
>  - **Transactional file updates**: The same mechanism as above, but
> the caller
>    only wants the commit to occur if the original file's contents
> have not
>    changed.
> @@ -4818,3 +4820,156 @@ and report what has been lost.
>  For media errors in blocks owned by files, the lack of parent
> pointers means
>  that the entire filesystem must be walked to report the file paths
> and offsets
>  corresponding to the media error.
> +
> +7. Conclusion and Future Work
> +=============================
> +
> +It is hoped that the reader of this document has followed the
> designs laid out
> +in this document and now has some familiarity with how XFS performs
> online
> +rebuilding of its metadata indices, and how filesystem users can
> interact with
> +that functionality.
> +Although the scope of this work is daunting, it is hoped that this
> guide will
> +make it easier for code readers to understand what has been built,
> for whom it
> +has been built, and why.
> +Please feel free to contact the XFS mailing list with questions.
> +
> +FIEXCHANGE_RANGE
> +----------------
> +
> +As discussed earlier, a second frontend to the atomic extent swap
> mechanism is
> +a new ioctl call that userspace programs can use to commit updates
> to files
> +atomically.
> +This frontend has been out for review for several years now, though
> the
> +necessary refinements to online repair and lack of customer demand
> mean that
> +the proposal has not been pushed very hard.
> +
> +Vectorized Scrub
> +----------------
> +
> +As it turns out, the :ref:`refactoring <scrubrepair>` of repair
> items mentioned
> +earlier was a catalyst for enabling a vectorized scrub system call.
> +Since 2018, the cost of making a kernel call has increased
> considerably on some
> +systems to mitigate the effects of speculative execution attacks.
> +This incentivizes program authors to make as few system calls as
> possible to
> +reduce the number of times an execution path crosses a security
> boundary.
> +
> +With vectorized scrub, userspace pushes to the kernel the identity
> of a
> +filesystem object, a list of scrub types to run against that object,
> and a
> +simple representation of the data dependencies between the selected
> scrub
> +types.
> +The kernel executes as much of the caller's plan as it can until it
> hits a
> +dependency that cannot be satisfied due to a corruption, and tells
> userspace
> +how much was accomplished.
> +It is hoped that ``io_uring`` will pick up enough of this
> functionality that
> +online fsck can use that instead of adding a separate vectored scrub
> system
> +call to XFS.
> +
> +The relevant patchsets are the
> +`kernel vectorized scrub
> +<
> https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/
> log/?h=vectorized-scrub>`_
> +and
> +`userspace vectorized scrub
> +<
> https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.g
> it/log/?h=vectorized-scrub>`_
> +series.
> +
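
A minimal sketch of the plan execution described above, assuming a
hypothetical in-memory representation of the scrub vector.  ScrubStep,
run_scrub_vector and the scrub type names are illustrative only and are
not taken from the patchset.  Each step lists the scrub types it depends
on; execution stops at the first step whose dependency was not run or
was found corrupt, and the caller learns how many steps were executed.

```python
from dataclasses import dataclass, field


@dataclass
class ScrubStep:
    """One entry in a hypothetical scrub vector."""
    scrub_type: str
    depends_on: list = field(default_factory=list)


def run_scrub_vector(steps, check):
    """Run steps in order until a dependency cannot be satisfied.

    check(scrub_type) returns True if that metadata is clean and False
    if it is corrupt.  Returns (number of steps executed, results).
    """
    results = {}
    for done, step in enumerate(steps):
        # A dependency is unmet if it never ran or if it found corruption.
        if any(not results.get(dep, False) for dep in step.depends_on):
            return done, results
        results[step.scrub_type] = check(step.scrub_type)
    return len(steps), results
```

With a three-step plan where the second type is corrupt, the third step
is never attempted and the return value tells the caller that two steps
ran.
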
> +Quality of Service Targets for Scrub
> +------------------------------------
> +
> +One serious shortcoming of the online fsck code is that the amount
> of time that
> +it can spend in the kernel holding resource locks is basically
> unbounded.
> +Userspace is allowed to send a fatal signal to the process which
> will cause
> +``xfs_scrub`` to exit when it reaches a good stopping point, but
> there's no way
> +for userspace to provide a time budget to the kernel.
> +Given that the scrub codebase has helpers to detect fatal signals,
> it shouldn't
> +be too much work to allow userspace to specify a timeout for a
> scrub/repair
> +operation and abort the operation if it exceeds budget.
> +However, most repair functions have the property that once they
> begin to touch
> +ondisk metadata, the operation cannot be cancelled cleanly, after
> which a QoS
> +timeout is no longer useful.
> +
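
To illustrate the constraint in the last sentence, here is a rough
sketch, assuming a repair that first stages its new records and then
commits them.  repair_with_budget and its parameters are hypothetical
names, not kernel interfaces.  The time budget is only checked while
staging; once the commit begins there is no safe point to abort, so no
further checks are made.

```python
import time


def repair_with_budget(gather_steps, commit, budget_s, clock=time.monotonic):
    """Stage a repair under a time budget, then commit it.

    gather_steps: callables that collect records without modifying
    ondisk metadata.  commit: callable that writes the staged records.
    Returns (committed, result); committed is False if the budget ran
    out before the commit started.
    """
    deadline = clock() + budget_s
    staged = []
    for step in gather_steps:
        # Still safe to abort: nothing has been written yet.
        if clock() > deadline:
            return False, None
        staged.append(step())
    # Past this point the operation cannot be cancelled cleanly, so the
    # deadline is deliberately not consulted again.
    return True, commit(staged)
```

The clock is injectable only so that the behaviour can be exercised
deterministically.
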
> +Defragmenting Free Space
> +------------------------
> +
> +Over the years, many XFS users have requested the creation of a
> program to
> +clear a portion of the physical storage underlying a filesystem so
> that it
> +becomes a contiguous chunk of free space.
> +Call this free space defragmenter ``clearspace`` for short.
> +
> +The first piece the ``clearspace`` program needs is the ability to
> read the
> +reverse mapping index from userspace.
> +This already exists in the form of the ``FS_IOC_GETFSMAP`` ioctl.
> +The second piece it needs is a new fallocate mode
> +(``FALLOC_FL_MAP_FREE_SPACE``) that allocates the free space in a
> region and
> +maps it to a file.
> +Call this file the "space collector" file.
> +The third piece is the ability to force an online repair.
> +
> +To clear all the metadata out of a portion of physical storage,
> clearspace
> +uses the new fallocate map-freespace call to map any free space in
> that region
> +to the space collector file.
> +Next, clearspace finds all metadata blocks in that region by way of
> +``GETFSMAP`` and issues forced repair requests on the data
> structure.
> +This often results in the metadata being rebuilt somewhere that is
> not being
> +cleared.
> +After each relocation, clearspace calls the "map free space"
> function again to
> +collect any newly freed space in the region being cleared.
> +
> +To clear all the file data out of a portion of the physical storage,
> clearspace
> +uses the FSMAP information to find relevant file data blocks.
> +Having identified a good target, it uses the ``FICLONERANGE`` call
> on that part
> +of the file to try to share the physical space with a dummy file.
> +Cloning the extent means that the original owners cannot overwrite
> the
> +contents; any changes will be written somewhere else via copy-on-
> write.
> +Clearspace makes its own copy of the frozen extent in an area that
> is not being
> +cleared, and uses ``FIDEDUPERANGE`` (or the :ref:`atomic extent swap
> +<swapext_if_unchanged>` feature) to change the target file's data
> extent
> +mapping away from the area being cleared.
> +When all other mappings have been moved, clearspace reflinks the
> space into the
> +space collector file so that it becomes unavailable.
> +
> +There are further optimizations that could apply to the above
> algorithm.
> +To clear a piece of physical storage that has a high sharing factor,
> it is
> +strongly desirable to retain this sharing factor.
> +In fact, these extents should be moved first to maximize sharing
> factor after
> +the operation completes.
> +To make this work smoothly, clearspace needs a new ioctl
> +(``FS_IOC_GETREFCOUNTS``) to report reference count information to
> userspace.
> +With the refcount information exposed, clearspace can quickly find
> the longest,
> +most shared data extents in the filesystem, and target them first.
> +
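
A small sketch of the "most shared first" ordering, assuming the
proposed ``FS_IOC_GETREFCOUNTS`` call yields records carrying a physical
start, a length and a reference count.  The record layout and the names
SharedExtent and evacuation_order are assumptions made for illustration;
the patch does not describe the ioctl's output format.  Sorting by
reference count and then by length is one possible reading of "the
longest, most shared data extents".

```python
from typing import NamedTuple


class SharedExtent(NamedTuple):
    """Hypothetical refcount record for one physical extent."""
    physical: int   # starting block of the extent
    length: int     # length in blocks
    refcount: int   # number of owners sharing the extent


def evacuation_order(extents):
    """Return extents with the most shared, then longest, first."""
    return sorted(extents, key=lambda e: (e.refcount, e.length), reverse=True)
```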


> +**Question**: How might the filesystem move inode chunks?
> +
> +*Answer*: 
"In order to move inode chunks.."

> Dave Chinner has a prototype that creates a new file with the old
> +contents and then locklessly runs around the filesystem updating
> directory
> +entries.
> +The operation cannot complete if the filesystem goes down.
> +That problem isn't totally insurmountable: create an inode remapping
> table
> +hidden behind a jump label, and a log item that tracks the kernel
> walking the
> +filesystem to update directory entries.
> +The trouble is, the kernel can't do anything about open files, since
> it cannot
> +revoke them.
> +


> +**Question**: Can static keys be used to add a revoke bailout return
> to
> +*every* code path coming in from userspace?
> +
> +*Answer*: In principle, yes.
> +This 

"It is also possible to use static keys to add a revoke bailout return
to each code path coming in from userspace.  This..."

> would eliminate the overhead of the check until a revocation happens.
> +It's not clear what we do to a revoked file after all the callers
> are finished
> +with it, however.
> +
> +The relevant patchsets are the
> +`kernel freespace defrag
> +<
> https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/
> log/?h=defrag-freespace>`_
> +and
> +`userspace freespace defrag
> +<
> https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.g
> it/log/?h=defrag-freespace>`_
> +series.

I guess since these are just future ideas, light documentation is
fine.  Other than cleaning out the Q&As, I think it looks pretty
good.

Allison

> +
> +Shrinking Filesystems
> +---------------------
> +
> +Removing the end of the filesystem ought to be a simple matter of
> evacuating
> +the data and metadata at the end of the filesystem, and handing the
> freed space
> +to the shrink code.
> +That requires an evacuation of the space at the end of the filesystem,
> which is a
> +use of free space defragmentation!
>
Darrick J. Wong March 2, 2023, 12:39 a.m. UTC | #2
On Wed, Mar 01, 2023 at 05:37:19AM +0000, Allison Henderson wrote:
> On Fri, 2022-12-30 at 14:10 -0800, Darrick J. Wong wrote:
> > From: Darrick J. Wong <djwong@kernel.org>
> > 
> > Add the seventh and final chapter of the online fsck documentation,
> > where we talk about future functionality that can tie in with the
> > functionality provided by the online fsck patchset.
> > 
> > Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> > ---
> >  .../filesystems/xfs-online-fsck-design.rst         |  155
> > ++++++++++++++++++++
> >  1 file changed, 155 insertions(+)
> > 
> > 
> > diff --git a/Documentation/filesystems/xfs-online-fsck-design.rst
> > b/Documentation/filesystems/xfs-online-fsck-design.rst
> > index 05b9411fac7f..41291edb02b9 100644
> > --- a/Documentation/filesystems/xfs-online-fsck-design.rst
> > +++ b/Documentation/filesystems/xfs-online-fsck-design.rst
> > @@ -4067,6 +4067,8 @@ The extra flexibility enables several new use
> > cases:
> >    (``FIEXCHANGE_RANGE``) to exchange the file contents, thereby
> > committing all
> >    of the updates to the original file, or none of them.
> >  
> > +.. _swapext_if_unchanged:
> > +
> >  - **Transactional file updates**: The same mechanism as above, but
> > the caller
> >    only wants the commit to occur if the original file's contents
> > have not
> >    changed.
> > @@ -4818,3 +4820,156 @@ and report what has been lost.
> >  For media errors in blocks owned by files, the lack of parent
> > pointers means
> >  that the entire filesystem must be walked to report the file paths
> > and offsets
> >  corresponding to the media error.
> > +
> > +7. Conclusion and Future Work
> > +=============================
> > +
> > +It is hoped that the reader of this document has followed the
> > designs laid out
> > +in this document and now has some familiarity with how XFS performs
> > online
> > +rebuilding of its metadata indices, and how filesystem users can
> > interact with
> > +that functionality.
> > +Although the scope of this work is daunting, it is hoped that this
> > guide will
> > +make it easier for code readers to understand what has been built,
> > for whom it
> > +has been built, and why.
> > +Please feel free to contact the XFS mailing list with questions.
> > +
> > +FIEXCHANGE_RANGE
> > +----------------
> > +
> > +As discussed earlier, a second frontend to the atomic extent swap
> > mechanism is
> > +a new ioctl call that userspace programs can use to commit updates
> > to files
> > +atomically.
> > +This frontend has been out for review for several years now, though
> > the
> > +necessary refinements to online repair and lack of customer demand
> > mean that
> > +the proposal has not been pushed very hard.

Note: The "Extent Swapping with Regular User Files" section has moved
here.

> > +Vectorized Scrub
> > +----------------
> > +
> > +As it turns out, the :ref:`refactoring <scrubrepair>` of repair
> > items mentioned
> > +earlier was a catalyst for enabling a vectorized scrub system call.
> > +Since 2018, the cost of making a kernel call has increased
> > considerably on some
> > +systems to mitigate the effects of speculative execution attacks.
> > +This incentivizes program authors to make as few system calls as
> > possible to
> > +reduce the number of times an execution path crosses a security
> > boundary.
> > +
> > +With vectorized scrub, userspace pushes to the kernel the identity
> > of a
> > +filesystem object, a list of scrub types to run against that object,
> > and a
> > +simple representation of the data dependencies between the selected
> > scrub
> > +types.
> > +The kernel executes as much of the caller's plan as it can until it
> > hits a
> > +dependency that cannot be satisfied due to a corruption, and tells
> > userspace
> > +how much was accomplished.
> > +It is hoped that ``io_uring`` will pick up enough of this
> > functionality that
> > +online fsck can use that instead of adding a separate vectored scrub
> > system
> > +call to XFS.
> > +
> > +The relevant patchsets are the
> > +`kernel vectorized scrub
> > +<
> > https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/
> > log/?h=vectorized-scrub>`_
> > +and
> > +`userspace vectorized scrub
> > +<
> > https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.g
> > it/log/?h=vectorized-scrub>`_
> > +series.
> > +
> > +Quality of Service Targets for Scrub
> > +------------------------------------
> > +
> > +One serious shortcoming of the online fsck code is that the amount
> > of time that
> > +it can spend in the kernel holding resource locks is basically
> > unbounded.
> > +Userspace is allowed to send a fatal signal to the process which
> > will cause
> > +``xfs_scrub`` to exit when it reaches a good stopping point, but
> > there's no way
> > +for userspace to provide a time budget to the kernel.
> > +Given that the scrub codebase has helpers to detect fatal signals,
> > it shouldn't
> > +be too much work to allow userspace to specify a timeout for a
> > scrub/repair
> > +operation and abort the operation if it exceeds budget.
> > +However, most repair functions have the property that once they
> > begin to touch
> > +ondisk metadata, the operation cannot be cancelled cleanly, after
> > which a QoS
> > +timeout is no longer useful.
> > +
> > +Defragmenting Free Space
> > +------------------------
> > +
> > +Over the years, many XFS users have requested the creation of a
> > program to
> > +clear a portion of the physical storage underlying a filesystem so
> > that it
> > +becomes a contiguous chunk of free space.
> > +Call this free space defragmenter ``clearspace`` for short.
> > +
> > +The first piece the ``clearspace`` program needs is the ability to
> > read the
> > +reverse mapping index from userspace.
> > +This already exists in the form of the ``FS_IOC_GETFSMAP`` ioctl.
> > +The second piece it needs is a new fallocate mode
> > +(``FALLOC_FL_MAP_FREE_SPACE``) that allocates the free space in a
> > region and
> > +maps it to a file.
> > +Call this file the "space collector" file.
> > +The third piece is the ability to force an online repair.
> > +
> > +To clear all the metadata out of a portion of physical storage,
> > clearspace
> > +uses the new fallocate map-freespace call to map any free space in
> > that region
> > +to the space collector file.
> > +Next, clearspace finds all metadata blocks in that region by way of
> > +``GETFSMAP`` and issues forced repair requests on the data
> > structure.
> > +This often results in the metadata being rebuilt somewhere that is
> > not being
> > +cleared.
> > +After each relocation, clearspace calls the "map free space"
> > function again to
> > +collect any newly freed space in the region being cleared.
> > +
> > +To clear all the file data out of a portion of the physical storage,
> > clearspace
> > +uses the FSMAP information to find relevant file data blocks.
> > +Having identified a good target, it uses the ``FICLONERANGE`` call
> > on that part
> > +of the file to try to share the physical space with a dummy file.
> > +Cloning the extent means that the original owners cannot overwrite
> > the
> > +contents; any changes will be written somewhere else via copy-on-
> > write.
> > +Clearspace makes its own copy of the frozen extent in an area that
> > is not being
> > +cleared, and uses ``FIDEDUPERANGE`` (or the :ref:`atomic extent swap
> > +<swapext_if_unchanged>` feature) to change the target file's data
> > extent
> > +mapping away from the area being cleared.
> > +When all other mappings have been moved, clearspace reflinks the
> > space into the
> > +space collector file so that it becomes unavailable.
> > +
> > +There are further optimizations that could apply to the above
> > algorithm.
> > +To clear a piece of physical storage that has a high sharing factor,
> > it is
> > +strongly desirable to retain this sharing factor.
> > +In fact, these extents should be moved first to maximize sharing
> > factor after
> > +the operation completes.
> > +To make this work smoothly, clearspace needs a new ioctl
> > +(``FS_IOC_GETREFCOUNTS``) to report reference count information to
> > userspace.
> > +With the refcount information exposed, clearspace can quickly find
> > the longest,
> > +most shared data extents in the filesystem, and target them first.
> > +
> 
> 
> > +**Question**: How might the filesystem move inode chunks?
> > +
> > +*Answer*: 
> "In order to move inode chunks.."

Done.

> > Dave Chinner has a prototype that creates a new file with the old
> > +contents and then locklessly runs around the filesystem updating
> > directory
> > +entries.
> > +The operation cannot complete if the filesystem goes down.
> > +That problem isn't totally insurmountable: create an inode remapping
> > table
> > +hidden behind a jump label, and a log item that tracks the kernel
> > walking the
> > +filesystem to update directory entries.
> > +The trouble is, the kernel can't do anything about open files, since
> > it cannot
> > +revoke them.
> > +
> 
> 
> > +**Question**: Can static keys be used to add a revoke bailout return
> > to
> > +*every* code path coming in from userspace?
> > +
> > +*Answer*: In principle, yes.
> > +This 
> 
> "It is also possible to use static keys to add a revoke bailout return
> to each code path coming in from userspace.  This..."

I think this change would make the answer redundant with the question.

"Can static keys be used to minimize the runtime cost of supporting
``revoke()`` on XFS files?"

"Yes.  Until the first revocation, the bailout code need not be in the
call path at all."

> > would eliminate the overhead of the check until a revocation happens.
> > +It's not clear what we do to a revoked file after all the callers
> > are finished
> > +with it, however.
> > +
> > +The relevant patchsets are the
> > +`kernel freespace defrag
> > +<
> > https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/
> > log/?h=defrag-freespace>`_
> > +and
> > +`userspace freespace defrag
> > +<
> > https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.g
> > it/log/?h=defrag-freespace>`_
> > +series.
> 
> I guess since they're just future ideas just light documentation is
> fine.  Other than cleaning out the Q & A's, I think it looks pretty
> good.

Ok.  Thank you x100000000 for being the first person to publicly comment
on the entire document!

--D

> Allison
> 
> > +
> > +Shrinking Filesystems
> > +---------------------
> > +
> > +Removing the end of the filesystem ought to be a simple matter of
> > evacuating
> > +the data and metadata at the end of the filesystem, and handing the
> > freed space
> > +to the shrink code.
> > +That requires an evacuation of the space at end of the filesystem,
> > which is a
> > +use of free space defragmentation!
> > 
>
Allison Henderson March 3, 2023, 11:51 p.m. UTC | #3
On Wed, 2023-03-01 at 16:39 -0800, Darrick J. Wong wrote:
> On Wed, Mar 01, 2023 at 05:37:19AM +0000, Allison Henderson wrote:
> > On Fri, 2022-12-30 at 14:10 -0800, Darrick J. Wong wrote:
> > > From: Darrick J. Wong <djwong@kernel.org>
> > > 
> > > Add the seventh and final chapter of the online fsck
> > > documentation,
> > > where we talk about future functionality that can tie in with the
> > > functionality provided by the online fsck patchset.
> > > 
> > > Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> > > ---
> > >  .../filesystems/xfs-online-fsck-design.rst         |  155
> > > ++++++++++++++++++++
> > >  1 file changed, 155 insertions(+)
> > > 
> > > 
> > > diff --git a/Documentation/filesystems/xfs-online-fsck-design.rst
> > > b/Documentation/filesystems/xfs-online-fsck-design.rst
> > > index 05b9411fac7f..41291edb02b9 100644
> > > --- a/Documentation/filesystems/xfs-online-fsck-design.rst
> > > +++ b/Documentation/filesystems/xfs-online-fsck-design.rst
> > > @@ -4067,6 +4067,8 @@ The extra flexibility enables several new
> > > use
> > > cases:
> > >    (``FIEXCHANGE_RANGE``) to exchange the file contents, thereby
> > > committing all
> > >    of the updates to the original file, or none of them.
> > >  
> > > +.. _swapext_if_unchanged:
> > > +
> > >  - **Transactional file updates**: The same mechanism as above,
> > > but
> > > the caller
> > >    only wants the commit to occur if the original file's contents
> > > have not
> > >    changed.
> > > @@ -4818,3 +4820,156 @@ and report what has been lost.
> > >  For media errors in blocks owned by files, the lack of parent
> > > pointers means
> > >  that the entire filesystem must be walked to report the file
> > > paths
> > > and offsets
> > >  corresponding to the media error.
> > > +
> > > +7. Conclusion and Future Work
> > > +=============================
> > > +
> > > +It is hoped that the reader of this document has followed the
> > > designs laid out
> > > +in this document and now has some familiarity with how XFS
> > > performs
> > > online
> > > +rebuilding of its metadata indices, and how filesystem users can
> > > interact with
> > > +that functionality.
> > > +Although the scope of this work is daunting, it is hoped that
> > > this
> > > guide will
> > > +make it easier for code readers to understand what has been
> > > built,
> > > for whom it
> > > +has been built, and why.
> > > +Please feel free to contact the XFS mailing list with questions.
> > > +
> > > +FIEXCHANGE_RANGE
> > > +----------------
> > > +
> > > +As discussed earlier, a second frontend to the atomic extent
> > > swap
> > > mechanism is
> > > +a new ioctl call that userspace programs can use to commit
> > > updates
> > > to files
> > > +atomically.
> > > +This frontend has been out for review for several years now,
> > > though
> > > the
> > > +necessary refinements to online repair and lack of customer
> > > demand
> > > mean that
> > > +the proposal has not been pushed very hard.
> 
> Note: The "Extent Swapping with Regular User Files" section has moved
> here.
> 
> > > +Vectorized Scrub
> > > +----------------
> > > +
> > > +As it turns out, the :ref:`refactoring <scrubrepair>` of repair
> > > items mentioned
> > > +earlier was a catalyst for enabling a vectorized scrub system
> > > call.
> > > +Since 2018, the cost of making a kernel call has increased
> > > considerably on some
> > > +systems to mitigate the effects of speculative execution
> > > attacks.
> > > +This incentivizes program authors to make as few system calls as
> > > possible to
> > > +reduce the number of times an execution path crosses a security
> > > boundary.
> > > +
> > > +With vectorized scrub, userspace pushes to the kernel the
> > > identity
> > > of a
> > > +filesystem object, a list of scrub types to run against that
> > > object,
> > > and a
> > > +simple representation of the data dependencies between the
> > > selected
> > > scrub
> > > +types.
> > > +The kernel executes as much of the caller's plan as it can until
> > > it
> > > hits a
> > > +dependency that cannot be satisfied due to a corruption, and
> > > tells
> > > userspace
> > > +how much was accomplished.
> > > +It is hoped that ``io_uring`` will pick up enough of this
> > > functionality that
> > > +online fsck can use that instead of adding a separate vectored
> > > scrub
> > > system
> > > +call to XFS.
> > > +
> > > +The relevant patchsets are the
> > > +`kernel vectorized scrub
> > > +<
> > > https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/
> > > log/?h=vectorized-scrub>`_
> > > +and
> > > +`userspace vectorized scrub
> > > +<
> > > https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.g
> > > it/log/?h=vectorized-scrub>`_
> > > +series.
> > > +
> > > +Quality of Service Targets for Scrub
> > > +------------------------------------
> > > +
> > > +One serious shortcoming of the online fsck code is that the
> > > amount
> > > of time that
> > > +it can spend in the kernel holding resource locks is basically
> > > unbounded.
> > > +Userspace is allowed to send a fatal signal to the process which
> > > will cause
> > > +``xfs_scrub`` to exit when it reaches a good stopping point, but
> > > there's no way
> > > +for userspace to provide a time budget to the kernel.
> > > +Given that the scrub codebase has helpers to detect fatal
> > > signals,
> > > it shouldn't
> > > +be too much work to allow userspace to specify a timeout for a
> > > scrub/repair
> > > +operation and abort the operation if it exceeds budget.
> > > +However, most repair functions have the property that once they
> > > begin to touch
> > > +ondisk metadata, the operation cannot be cancelled cleanly,
> > > after
> > > which a QoS
> > > +timeout is no longer useful.
> > > +
> > > +Defragmenting Free Space
> > > +------------------------
> > > +
> > > +Over the years, many XFS users have requested the creation of a
> > > program to
> > > +clear a portion of the physical storage underlying a filesystem
> > > so
> > > that it
> > > +becomes a contiguous chunk of free space.
> > > +Call this free space defragmenter ``clearspace`` for short.
> > > +
> > > +The first piece the ``clearspace`` program needs is the ability
> > > to
> > > read the
> > > +reverse mapping index from userspace.
> > > +This already exists in the form of the ``FS_IOC_GETFSMAP``
> > > ioctl.
> > > +The second piece it needs is a new fallocate mode
> > > +(``FALLOC_FL_MAP_FREE_SPACE``) that allocates the free space in
> > > a
> > > region and
> > > +maps it to a file.
> > > +Call this file the "space collector" file.
> > > +The third piece is the ability to force an online repair.
> > > +
> > > +To clear all the metadata out of a portion of physical storage,
> > > clearspace
> > > +uses the new fallocate map-freespace call to map any free space
> > > in
> > > that region
> > > +to the space collector file.
> > > +Next, clearspace finds all metadata blocks in that region by way
> > > of
> > > +``GETFSMAP`` and issues forced repair requests on the data
> > > structure.
> > > +This often results in the metadata being rebuilt somewhere that
> > > is
> > > not being
> > > +cleared.
> > > +After each relocation, clearspace calls the "map free space"
> > > function again to
> > > +collect any newly freed space in the region being cleared.
> > > +
> > > +To clear all the file data out of a portion of the physical
> > > storage,
> > > clearspace
> > > +uses the FSMAP information to find relevant file data blocks.
> > > +Having identified a good target, it uses the ``FICLONERANGE``
> > > call
> > > on that part
> > > +of the file to try to share the physical space with a dummy
> > > file.
> > > +Cloning the extent means that the original owners cannot
> > > overwrite
> > > the
> > > +contents; any changes will be written somewhere else via copy-
> > > on-
> > > write.
> > > +Clearspace makes its own copy of the frozen extent in an area
> > > that
> > > is not being
> > > > +cleared, and uses ``FIDEDUPERANGE`` (or the :ref:`atomic extent
> > > swap
> > > +<swapext_if_unchanged>` feature) to change the target file's
> > > data
> > > extent
> > > +mapping away from the area being cleared.
> > > +When all other mappings have been moved, clearspace reflinks the
> > > space into the
> > > +space collector file so that it becomes unavailable.
> > > +
> > > +There are further optimizations that could apply to the above
> > > algorithm.
> > > +To clear a piece of physical storage that has a high sharing
> > > factor,
> > > it is
> > > +strongly desirable to retain this sharing factor.
> > > +In fact, these extents should be moved first to maximize sharing
> > > factor after
> > > +the operation completes.
> > > +To make this work smoothly, clearspace needs a new ioctl
> > > +(``FS_IOC_GETREFCOUNTS``) to report reference count information
> > > to
> > > userspace.
> > > +With the refcount information exposed, clearspace can quickly
> > > find
> > > the longest,
> > > +most shared data extents in the filesystem, and target them
> > > first.
> > > +
> > 
> > 
> > > +**Question**: How might the filesystem move inode chunks?
> > > +
> > > +*Answer*: 
> > "In order to move inode chunks.."
> 
> Done.
> 
> > > Dave Chinner has a prototype that creates a new file with the old
> > > +contents and then locklessly runs around the filesystem updating
> > > directory
> > > +entries.
> > > +The operation cannot complete if the filesystem goes down.
> > > +That problem isn't totally insurmountable: create an inode
> > > remapping
> > > table
> > > +hidden behind a jump label, and a log item that tracks the
> > > kernel
> > > walking the
> > > +filesystem to update directory entries.
> > > +The trouble is, the kernel can't do anything about open files,
> > > since
> > > it cannot
> > > +revoke them.
> > > +
> > 
> > 
> > > +**Question**: Can static keys be used to add a revoke bailout
> > > return
> > > to
> > > +*every* code path coming in from userspace?
> > > +
> > > +*Answer*: In principle, yes.
> > > +This 
> > 
> > "It is also possible to use static keys to add a revoke bailout
> > return
> > to each code path coming in from userspace.  This..."
> 
> I think this change would make the answer redundant with the
> question.
Sorry, I meant for the quotations to replace everything between the
line breaks.  So from Q through the answer, just to break out of the
Q&A format.

I sort of feel like if a document leaves the reader with questions that
they didn't have before they started reading, then ideally we should
just incorporate the answer in the document.  It makes for an easier
read, imho.

> 
> "Can static keys be used to minimize the runtime cost of supporting
> ``revoke()`` on XFS files?"
> 
> "Yes.  Until the first revocation, the bailout code need not be in
> the
> call path at all."

That's an implied Q&A format, but I suppose it's not a big deal either
way though.

> 
> > > would eliminate the overhead of the check until a revocation
> > > happens.
> > > +It's not clear what we do to a revoked file after all the
> > > callers
> > > are finished
> > > +with it, however.
> > > +
> > > +The relevant patchsets are the
> > > +`kernel freespace defrag
> > > +<
> > > https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/
> > > log/?h=defrag-freespace>`_
> > > +and
> > > +`userspace freespace defrag
> > > +<
> > > https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.g
> > > it/log/?h=defrag-freespace>`_
> > > +series.
> > 
> > I guess since they're just future ideas just light documentation is
> > fine.  Other than cleaning out the Q & A's, I think it looks pretty
> > good.
> 
> Ok.  Thank you x100000000 for being the first person to publicly
> comment
> on the entire document!

Sure, glad to help!  :-)

Allison

> 
> --D
> 
> > Allison
> > 
> > > +
> > > +Shrinking Filesystems
> > > +---------------------
> > > +
> > > +Removing the end of the filesystem ought to be a simple matter
> > > of
> > > evacuating
> > > +the data and metadata at the end of the filesystem, and handing
> > > the
> > > freed space
> > > +to the shrink code.
> > > +That requires an evacuation of the space at end of the
> > > filesystem,
> > > which is a
> > > +use of free space defragmentation!
> > > 
> >
Darrick J. Wong March 4, 2023, 2:28 a.m. UTC | #4
On Fri, Mar 03, 2023 at 11:51:05PM +0000, Allison Henderson wrote:
> On Wed, 2023-03-01 at 16:39 -0800, Darrick J. Wong wrote:
> > On Wed, Mar 01, 2023 at 05:37:19AM +0000, Allison Henderson wrote:
> > > On Fri, 2022-12-30 at 14:10 -0800, Darrick J. Wong wrote:
> > > > From: Darrick J. Wong <djwong@kernel.org>
> > > > 
> > > > Add the seventh and final chapter of the online fsck
> > > > documentation,
> > > > where we talk about future functionality that can tie in with the
> > > > functionality provided by the online fsck patchset.
> > > > 
> > > > Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> > > > ---
> > > >  .../filesystems/xfs-online-fsck-design.rst         |  155
> > > > ++++++++++++++++++++
> > > >  1 file changed, 155 insertions(+)
> > > > 
> > > > 
> > > > diff --git a/Documentation/filesystems/xfs-online-fsck-design.rst
> > > > b/Documentation/filesystems/xfs-online-fsck-design.rst
> > > > index 05b9411fac7f..41291edb02b9 100644
> > > > --- a/Documentation/filesystems/xfs-online-fsck-design.rst
> > > > +++ b/Documentation/filesystems/xfs-online-fsck-design.rst
> > > > @@ -4067,6 +4067,8 @@ The extra flexibility enables several new
> > > > use
> > > > cases:
> > > >    (``FIEXCHANGE_RANGE``) to exchange the file contents, thereby
> > > > committing all
> > > >    of the updates to the original file, or none of them.
> > > >  
> > > > +.. _swapext_if_unchanged:
> > > > +
> > > >  - **Transactional file updates**: The same mechanism as above,
> > > > but
> > > > the caller
> > > >    only wants the commit to occur if the original file's contents
> > > > have not
> > > >    changed.
> > > > @@ -4818,3 +4820,156 @@ and report what has been lost.
> > > >  For media errors in blocks owned by files, the lack of parent
> > > > pointers means
> > > >  that the entire filesystem must be walked to report the file
> > > > paths
> > > > and offsets
> > > >  corresponding to the media error.
> > > > +
> > > > +7. Conclusion and Future Work
> > > > +=============================
> > > > +
> > > > +It is hoped that the reader of this document has followed the
> > > > designs laid out
> > > > +in this document and now has some familiarity with how XFS
> > > > performs
> > > > online
> > > > +rebuilding of its metadata indices, and how filesystem users can
> > > > interact with
> > > > +that functionality.
> > > > +Although the scope of this work is daunting, it is hoped that
> > > > this
> > > > guide will
> > > > +make it easier for code readers to understand what has been
> > > > built,
> > > > for whom it
> > > > +has been built, and why.
> > > > +Please feel free to contact the XFS mailing list with questions.
> > > > +
> > > > +FIEXCHANGE_RANGE
> > > > +----------------
> > > > +
> > > > +As discussed earlier, a second frontend to the atomic extent
> > > > swap
> > > > mechanism is
> > > > +a new ioctl call that userspace programs can use to commit
> > > > updates
> > > > to files
> > > > +atomically.
> > > > +This frontend has been out for review for several years now,
> > > > though
> > > > the
> > > > +necessary refinements to online repair and lack of customer
> > > > demand
> > > > mean that
> > > > +the proposal has not been pushed very hard.
> > 
> > Note: The "Extent Swapping with Regular User Files" section has moved
> > here.
> > 
> > > > +Vectorized Scrub
> > > > +----------------
> > > > +
> > > > +As it turns out, the :ref:`refactoring <scrubrepair>` of repair
> > > > items mentioned
> > > > +earlier was a catalyst for enabling a vectorized scrub system
> > > > call.
> > > > +Since 2018, the cost of making a kernel call has increased
> > > > considerably on some
> > > > +systems to mitigate the effects of speculative execution
> > > > attacks.
> > > > +This incentivizes program authors to make as few system calls as
> > > > possible to
> > > > +reduce the number of times an execution path crosses a security
> > > > boundary.
> > > > +
> > > > +With vectorized scrub, userspace pushes to the kernel the
> > > > identity
> > > > of a
> > > > +filesystem object, a list of scrub types to run against that
> > > > object,
> > > > and a
> > > > +simple representation of the data dependencies between the
> > > > selected
> > > > scrub
> > > > +types.
> > > > +The kernel executes as much of the caller's plan as it can until
> > > > it
> > > > hits a
> > > > +dependency that cannot be satisfied due to a corruption, and
> > > > tells
> > > > userspace
> > > > +how much was accomplished.
> > > > +It is hoped that ``io_uring`` will pick up enough of this
> > > > functionality that
> > > > +online fsck can use that instead of adding a separate vectored
> > > > scrub
> > > > system
> > > > +call to XFS.
> > > > +
> > > > +The relevant patchsets are the
> > > > +`kernel vectorized scrub
> > > > +<
> > > > https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/
> > > > log/?h=vectorized-scrub>`_
> > > > +and
> > > > +`userspace vectorized scrub
> > > > +<
> > > > https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.g
> > > > it/log/?h=vectorized-scrub>`_
> > > > +series.
> > > > +
> > > > +Quality of Service Targets for Scrub
> > > > +------------------------------------
> > > > +
> > > > +One serious shortcoming of the online fsck code is that the
> > > > amount
> > > > of time that
> > > > +it can spend in the kernel holding resource locks is basically
> > > > unbounded.
> > > > +Userspace is allowed to send a fatal signal to the process which
> > > > will cause
> > > > +``xfs_scrub`` to exit when it reaches a good stopping point, but
> > > > there's no way
> > > > +for userspace to provide a time budget to the kernel.
> > > > +Given that the scrub codebase has helpers to detect fatal
> > > > signals,
> > > > it shouldn't
> > > > +be too much work to allow userspace to specify a timeout for a
> > > > scrub/repair
> > > > +operation and abort the operation if it exceeds budget.
> > > > +However, most repair functions have the property that once they
> > > > begin to touch
> > > > +ondisk metadata, the operation cannot be cancelled cleanly,
> > > > after
> > > > which a QoS
> > > > +timeout is no longer useful.
> > > > +
> > > > +Defragmenting Free Space
> > > > +------------------------
> > > > +
> > > > +Over the years, many XFS users have requested the creation of a
> > > > program to
> > > > +clear a portion of the physical storage underlying a filesystem
> > > > so
> > > > that it
> > > > +becomes a contiguous chunk of free space.
> > > > +Call this free space defragmenter ``clearspace`` for short.
> > > > +
> > > > +The first piece the ``clearspace`` program needs is the ability
> > > > to
> > > > read the
> > > > +reverse mapping index from userspace.
> > > > +This already exists in the form of the ``FS_IOC_GETFSMAP``
> > > > ioctl.
> > > > +The second piece it needs is a new fallocate mode
> > > > +(``FALLOC_FL_MAP_FREE_SPACE``) that allocates the free space in
> > > > a
> > > > region and
> > > > +maps it to a file.
> > > > +Call this file the "space collector" file.
> > > > +The third piece is the ability to force an online repair.
> > > > +
> > > > +To clear all the metadata out of a portion of physical storage,
> > > > clearspace
> > > > +uses the new fallocate map-freespace call to map any free space
> > > > in
> > > > that region
> > > > +to the space collector file.
> > > > +Next, clearspace finds all metadata blocks in that region by way
> > > > of
> > > > +``GETFSMAP`` and issues forced repair requests on the data
> > > > structure.
> > > > +This often results in the metadata being rebuilt somewhere that
> > > > is
> > > > not being
> > > > +cleared.
> > > > +After each relocation, clearspace calls the "map free space"
> > > > function again to
> > > > +collect any newly freed space in the region being cleared.
> > > > +
> > > > +To clear all the file data out of a portion of the physical
> > > > storage,
> > > > clearspace
> > > > +uses the FSMAP information to find relevant file data blocks.
> > > > +Having identified a good target, it uses the ``FICLONERANGE``
> > > > call
> > > > on that part
> > > > +of the file to try to share the physical space with a dummy
> > > > file.
> > > > +Cloning the extent means that the original owners cannot
> > > > overwrite
> > > > the
> > > > +contents; any changes will be written somewhere else via copy-
> > > > on-
> > > > write.
> > > > +Clearspace makes its own copy of the frozen extent in an area
> > > > that
> > > > is not being
> > > > > +cleared, and uses ``FIDEDUPERANGE`` (or the :ref:`atomic extent
> > > > swap
> > > > +<swapext_if_unchanged>` feature) to change the target file's
> > > > data
> > > > extent
> > > > +mapping away from the area being cleared.
> > > > +When all other mappings have been moved, clearspace reflinks the
> > > > space into the
> > > > +space collector file so that it becomes unavailable.
> > > > +
> > > > +There are further optimizations that could apply to the above
> > > > algorithm.
> > > > +To clear a piece of physical storage that has a high sharing
> > > > factor,
> > > > it is
> > > > +strongly desirable to retain this sharing factor.
> > > > +In fact, these extents should be moved first to maximize sharing
> > > > factor after
> > > > +the operation completes.
> > > > +To make this work smoothly, clearspace needs a new ioctl
> > > > +(``FS_IOC_GETREFCOUNTS``) to report reference count information
> > > > to
> > > > userspace.
> > > > +With the refcount information exposed, clearspace can quickly
> > > > find
> > > > the longest,
> > > > +most shared data extents in the filesystem, and target them
> > > > first.
> > > > +
> > > 
> > > 
> > > > +**Question**: How might the filesystem move inode chunks?
> > > > +
> > > > +*Answer*: 
> > > "In order to move inode chunks.."
> > 
> > Done.
> > 
> > > > Dave Chinner has a prototype that creates a new file with the old
> > > > +contents and then locklessly runs around the filesystem updating
> > > > directory
> > > > +entries.
> > > > +The operation cannot complete if the filesystem goes down.
> > > > +That problem isn't totally insurmountable: create an inode
> > > > remapping
> > > > table
> > > > +hidden behind a jump label, and a log item that tracks the
> > > > kernel
> > > > walking the
> > > > +filesystem to update directory entries.
> > > > +The trouble is, the kernel can't do anything about open files,
> > > > since
> > > > it cannot
> > > > +revoke them.
> > > > +
> > > 
> > > 
> > > > +**Question**: Can static keys be used to add a revoke bailout
> > > > return
> > > > to
> > > > +*every* code path coming in from userspace?
> > > > +
> > > > +*Answer*: In principle, yes.
> > > > +This 
> > > 
> > > "It is also possible to use static keys to add a revoke bailout
> > > return
> > > to each code path coming in from userspace.  This..."
> > 
> > I think this change would make the answer redundant with the
> > question.
> Sorry, I meant for the quotations to replace everything between the
> line breaks.  So from Q through the answer, just to break out of the
> Q&A format.
> 
> I sort of feel like if a document leaves the reader with questions that
> they didn't have before they started reading, then ideally we should
> simply just incorporate the answer in the document.  Just makes the
> read easier imho.

Oh, I see.  Let me think about that over the weekend.  These are all
highly speculative questions about prototype code that nobody's really
worked through yet, so they need to make it clear that we're not talking
about anything close to future features.

--D

> > 
> > "Can static keys be used to minimize the runtime cost of supporting
> > ``revoke()`` on XFS files?"
> > 
> > "Yes.  Until the first revocation, the bailout code need not be in
> > the
> > call path at all."
> 
> That's an implied Q&A format, but I suppose it's not a big deal either
> way though.
> 
> > 
> > > > would eliminate the overhead of the check until a revocation
> > > > happens.
> > > > +It's not clear what we do to a revoked file after all the
> > > > callers
> > > > are finished
> > > > +with it, however.
> > > > +
> > > > +The relevant patchsets are the
> > > > +`kernel freespace defrag
> > > > +<
> > > > https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/
> > > > log/?h=defrag-freespace>`_
> > > > +and
> > > > +`userspace freespace defrag
> > > > +<
> > > > https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.g
> > > > it/log/?h=defrag-freespace>`_
> > > > +series.
> > > 
> > > I guess since they're just future ideas just light documentation is
> > > fine.  Other than cleaning out the Q & A's, I think it looks pretty
> > > good.
> > 
> > Ok.  Thank you x100000000 for being the first person to publicly
> > comment
> > on the entire document!
> 
> Sure, glad to help!  :-)
> 
> Allison
> 
> > 
> > --D
> > 
> > > Allison
> > > 
> > > > +
> > > > +Shrinking Filesystems
> > > > +---------------------
> > > > +
> > > > +Removing the end of the filesystem ought to be a simple matter
> > > > of
> > > > evacuating
> > > > +the data and metadata at the end of the filesystem, and handing
> > > > the
> > > > freed space
> > > > +to the shrink code.
> > > > +That requires an evacuation of the space at end of the
> > > > filesystem,
> > > > which is a
> > > > +use of free space defragmentation!
> > > > 
> > > 
>
Patch

diff --git a/Documentation/filesystems/xfs-online-fsck-design.rst b/Documentation/filesystems/xfs-online-fsck-design.rst
index 05b9411fac7f..41291edb02b9 100644
--- a/Documentation/filesystems/xfs-online-fsck-design.rst
+++ b/Documentation/filesystems/xfs-online-fsck-design.rst
@@ -4067,6 +4067,8 @@  The extra flexibility enables several new use cases:
   (``FIEXCHANGE_RANGE``) to exchange the file contents, thereby committing all
   of the updates to the original file, or none of them.
 
+.. _swapext_if_unchanged:
+
 - **Transactional file updates**: The same mechanism as above, but the caller
   only wants the commit to occur if the original file's contents have not
   changed.
@@ -4818,3 +4820,156 @@  and report what has been lost.
 For media errors in blocks owned by files, the lack of parent pointers means
 that the entire filesystem must be walked to report the file paths and offsets
 corresponding to the media error.
+
+7. Conclusion and Future Work
+=============================
+
+It is hoped that the reader has followed the designs laid out in this document
+and now has some familiarity with how XFS performs online rebuilding of its
+metadata indices, and how filesystem users can interact with that
+functionality.
+Although the scope of this work is daunting, it is hoped that this guide will
+make it easier for code readers to understand what has been built, for whom it
+has been built, and why.
+Please feel free to contact the XFS mailing list with questions.
+
+FIEXCHANGE_RANGE
+----------------
+
+As discussed earlier, a second frontend to the atomic extent swap mechanism is
+a new ioctl call that userspace programs can use to commit updates to files
+atomically.
+This frontend has been out for review for several years now, though the
+necessary refinements to online repair and lack of customer demand mean that
+the proposal has not been pushed very hard.
+
+Vectorized Scrub
+----------------
+
+As it turns out, the :ref:`refactoring <scrubrepair>` of repair items mentioned
+earlier was a catalyst for enabling a vectorized scrub system call.
+Since 2018, the cost of making a kernel call has increased considerably on some
+systems because of mitigations for speculative execution attacks.
+This incentivizes program authors to make as few system calls as possible to
+reduce the number of times an execution path crosses a security boundary.
+
+With vectorized scrub, userspace pushes to the kernel the identity of a
+filesystem object, a list of scrub types to run against that object, and a
+simple representation of the data dependencies between the selected scrub
+types.
+The kernel executes as much of the caller's plan as it can until it hits a
+dependency that cannot be satisfied due to a corruption, and tells userspace
+how much was accomplished.
+It is hoped that ``io_uring`` will pick up enough of this functionality that
+online fsck can use that instead of adding a separate vectorized scrub system
+call to XFS.
+
+The relevant patchsets are the
+`kernel vectorized scrub
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=vectorized-scrub>`_
+and
+`userspace vectorized scrub
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=vectorized-scrub>`_
+series.
+
+Quality of Service Targets for Scrub
+------------------------------------
+
+One serious shortcoming of the online fsck code is that the amount of time that
+it can spend in the kernel holding resource locks is basically unbounded.
+Userspace is allowed to send a fatal signal to the process, which will cause
+``xfs_scrub`` to exit when it reaches a good stopping point, but there's no way
+for userspace to provide a time budget to the kernel.
+Given that the scrub codebase has helpers to detect fatal signals, it shouldn't
+be too much work to allow userspace to specify a timeout for a scrub/repair
+operation and abort the operation if it exceeds budget.
+However, most repair functions have the property that once they begin to touch
+ondisk metadata, the operation cannot be cancelled cleanly, after which a QoS
+timeout is no longer useful.
+
+Defragmenting Free Space
+------------------------
+
+Over the years, many XFS users have requested the creation of a program to
+clear a portion of the physical storage underlying a filesystem so that it
+becomes a contiguous chunk of free space.
+Call this free space defragmenter ``clearspace`` for short.
+
+The first piece the ``clearspace`` program needs is the ability to read the
+reverse mapping index from userspace.
+This already exists in the form of the ``FS_IOC_GETFSMAP`` ioctl.
+The second piece it needs is a new fallocate mode
+(``FALLOC_FL_MAP_FREE_SPACE``) that allocates the free space in a region and
+maps it to a file.
+Call this file the "space collector" file.
+The third piece is the ability to force an online repair.
+
+To clear all the metadata out of a portion of physical storage, clearspace
+uses the new fallocate map-freespace call to map any free space in that region
+to the space collector file.
+Next, clearspace finds all metadata blocks in that region by way of
+``GETFSMAP`` and issues forced repair requests on the data structure.
+This often results in the metadata being rebuilt somewhere that is not being
+cleared.
+After each relocation, clearspace calls the "map free space" function again to
+collect any newly freed space in the region being cleared.
+
+To clear all the file data out of a portion of the physical storage, clearspace
+uses the FSMAP information to find relevant file data blocks.
+Having identified a good target, it uses the ``FICLONERANGE`` call on that part
+of the file to try to share the physical space with a dummy file.
+Cloning the extent means that the original owners cannot overwrite the
+contents; any changes will be written somewhere else via copy-on-write.
+Clearspace makes its own copy of the frozen extent in an area that is not being
+cleared, and uses ``FIDEDUPERANGE`` (or the :ref:`atomic extent swap
+<swapext_if_unchanged>` feature) to change the target file's data extent
+mapping away from the area being cleared.
+When all other mappings have been moved, clearspace reflinks the space into the
+space collector file so that it becomes unavailable.
+
+There are further optimizations that could apply to the above algorithm.
+To clear a piece of physical storage that has a high sharing factor, it is
+strongly desirable to retain this sharing factor.
+In fact, these extents should be moved first to maximize sharing factor after
+the operation completes.
+To make this work smoothly, clearspace needs a new ioctl
+(``FS_IOC_GETREFCOUNTS``) to report reference count information to userspace.
+With the refcount information exposed, clearspace can quickly find the longest,
+most shared data extents in the filesystem, and target them first.
+
+**Question**: How might the filesystem move inode chunks?
+
+*Answer*: Dave Chinner has a prototype that creates a new file with the old
+contents and then locklessly runs around the filesystem updating directory
+entries.
+The operation cannot complete if the filesystem goes down.
+That problem isn't totally insurmountable: create an inode remapping table
+hidden behind a jump label, and a log item that tracks the kernel walking the
+filesystem to update directory entries.
+The trouble is, the kernel can't do anything about open files, since it cannot
+revoke them.
+
+**Question**: Can static keys be used to add a revoke bailout return to
+*every* code path coming in from userspace?
+
+*Answer*: In principle, yes.
+This would eliminate the overhead of the check until a revocation happens.
+It's not clear what we do to a revoked file after all the callers are finished
+with it, however.
+
+The relevant patchsets are the
+`kernel freespace defrag
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=defrag-freespace>`_
+and
+`userspace freespace defrag
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=defrag-freespace>`_
+series.
+
+Shrinking Filesystems
+---------------------
+
+Removing the end of the filesystem ought to be a simple matter of evacuating
+the data and metadata at the end of the filesystem, and handing the freed space
+to the shrink code.
+That requires an evacuation of the space at the end of the filesystem, which is
+a use of free space defragmentation!
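
As a rough illustration of the evacuation step described above, the following
Python sketch picks out the extents that sit in the region a shrink would
remove. It is not taken from the patchset: the ``Extent`` record, its
``refcount`` field, and ``extents_to_evacuate`` are hypothetical stand-ins for
what a real tool might read via ``FS_IOC_GETFSMAP`` and the proposed
``FS_IOC_GETREFCOUNTS``. The ordering (most shared first, then longest) is one
reading of the optimization suggested for ``clearspace``.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Extent:
    """Hypothetical stand-in for one reverse-mapping record."""
    start: int      # physical start, in filesystem blocks
    length: int     # length, in filesystem blocks
    refcount: int   # number of owners sharing the extent (1 = unshared)


def extents_to_evacuate(records, new_size):
    """Return the parts of `records` at or beyond block `new_size`.

    Extents that straddle the boundary are clipped so that only the portion
    inside the region being removed is reported.  The result is ordered
    most-shared first, then longest first, then by physical start.
    """
    doomed = []
    for rec in records:
        end = rec.start + rec.length
        if end <= new_size:
            continue  # lies entirely within the part that is kept
        start = max(rec.start, new_size)
        doomed.append(Extent(start, end - start, rec.refcount))
    return sorted(doomed, key=lambda e: (-e.refcount, -e.length, e.start))


# Made-up sample data: shrink a filesystem down to 1000 blocks.
records = [
    Extent(start=0, length=100, refcount=1),
    Extent(start=900, length=200, refcount=3),
    Extent(start=1100, length=50, refcount=1),
    Extent(start=1150, length=300, refcount=3),
]
plan = extents_to_evacuate(records, new_size=1000)
for ext in plan:
    print(ext)
```

With the sample records and a new size of 1000 blocks, the plan lists the
shared 300-block extent first, then the clipped shared extent starting at
block 1000, and finally the unshared 50-block extent.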