diff mbox series

[06/14] xfs: document how online fsck deals with eventual consistency

Message ID 167243825245.682859.4827095718073568782.stgit@magnolia (mailing list archive)
State New, archived
Headers show
Series xfs: design documentation for online fsck | expand

Commit Message

Darrick J. Wong Dec. 30, 2022, 10:10 p.m. UTC
From: Darrick J. Wong <djwong@kernel.org>

Writes to an XFS filesystem employ an eventual consistency update model
to break up complex multistep metadata updates into small chained
transactions.  This is generally good for performance and scalability
because XFS doesn't need to prepare for enormous transactions, but it
also means that online fsck must be careful not to attempt a fsck action
unless it can be shown that there are no other threads processing a
transaction chain.  This part of the design documentation covers the
thinking behind the consistency model and how scrub deals with it.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 .../filesystems/xfs-online-fsck-design.rst         |  303 ++++++++++++++++++++
 1 file changed, 303 insertions(+)

Comments

Amir Goldstein Jan. 5, 2023, 9:08 a.m. UTC | #1
On Sat, Dec 31, 2022 at 12:32 AM Darrick J. Wong <djwong@kernel.org> wrote:
>
> From: Darrick J. Wong <djwong@kernel.org>
>
> Writes to an XFS filesystem employ an eventual consistency update model
> to break up complex multistep metadata updates into small chained
> transactions.  This is generally good for performance and scalability
> because XFS doesn't need to prepare for enormous transactions, but it
> also means that online fsck must be careful not to attempt a fsck action
> unless it can be shown that there are no other threads processing a
> transaction chain.  This part of the design documentation covers the
> thinking behind the consistency model and how scrub deals with it.
>
> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> ---
>  .../filesystems/xfs-online-fsck-design.rst         |  303 ++++++++++++++++++++
>  1 file changed, 303 insertions(+)
>
>
> diff --git a/Documentation/filesystems/xfs-online-fsck-design.rst b/Documentation/filesystems/xfs-online-fsck-design.rst
> index f45bf97fa9c4..419eb54ee200 100644
> --- a/Documentation/filesystems/xfs-online-fsck-design.rst
> +++ b/Documentation/filesystems/xfs-online-fsck-design.rst
> @@ -1443,3 +1443,306 @@ This step is critical for enabling system administrator to monitor the status
>  of the filesystem and the progress of any repairs.
>  For developers, it is a useful means to judge the efficacy of error detection
>  and correction in the online and offline checking tools.
> +
> +Eventual Consistency vs. Online Fsck
> +------------------------------------
> +
> +Midway through the development of online scrubbing, the fsstress tests
> +uncovered a misinteraction between online fsck and compound transaction chains
> +created by other writer threads that resulted in false reports of metadata
> +inconsistency.
> +The root cause of these reports is the eventual consistency model introduced by
> +the expansion of deferred work items and compound transaction chains when
> +reverse mapping and reflink were introduced.
> +
> +Originally, transaction chains were added to XFS to avoid deadlocks when
> +unmapping space from files.
> +Deadlock avoidance rules require that AGs only be locked in increasing order,
> +which makes it impossible (say) to use a single transaction to free a space
> +extent in AG 7 and then try to free a now superfluous block mapping btree block
> +in AG 3.
> +To avoid these kinds of deadlocks, XFS creates Extent Freeing Intent (EFI) log
> +items to commit to freeing some space in one transaction while deferring the
> +actual metadata updates to a fresh transaction.
> +The transaction sequence looks like this:
> +
> +1. The first transaction contains a physical update to the file's block mapping
> +   structures to remove the mapping from the btree blocks.
> +   It then attaches to the in-memory transaction an action item to schedule
> +   deferred freeing of space.
> +   Concretely, each transaction maintains a list of ``struct
> +   xfs_defer_pending`` objects, each of which maintains a list of ``struct
> +   xfs_extent_free_item`` objects.
> +   Returning to the example above, the action item tracks the freeing of both
> +   the unmapped space from AG 7 and the block mapping btree (BMBT) block from
> +   AG 3.
> +   Deferred frees recorded in this manner are committed in the log by creating
> +   an EFI log item from the ``struct xfs_extent_free_item`` object and
> +   attaching the log item to the transaction.
> +   When the log is persisted to disk, the EFI item is written into the ondisk
> +   transaction record.
> +   EFIs can list up to 16 extents to free, all sorted in AG order.
> +
> +2. The second transaction contains a physical update to the free space btrees
> +   of AG 3 to release the former BMBT block and a second physical update to the
> +   free space btrees of AG 7 to release the unmapped file space.
> +   Observe that the the physical updates are resequenced in the correct order
> +   when possible.
> +   Attached to the transaction is a an extent free done (EFD) log item.
> +   The EFD contains a pointer to the EFI logged in transaction #1 so that log
> +   recovery can tell if the EFI needs to be replayed.
> +
> +If the system goes down after transaction #1 is written back to the filesystem
> +but before #2 is committed, a scan of the filesystem metadata would show
> +inconsistent filesystem metadata because there would not appear to be any owner
> +of the unmapped space.
> +Happily, log recovery corrects this inconsistency for us -- when recovery finds
> +an intent log item but does not find a corresponding intent done item, it will
> +reconstruct the incore state of the intent item and finish it.
> +In the example above, the log must replay both frees described in the recovered
> +EFI to complete the recovery phase.
> +
> +There are two subtleties to XFS' transaction chaining strategy to consider.
> +The first is that log items must be added to a transaction in the correct order
> +to prevent conflicts with principal objects that are not held by the
> +transaction.
> +In other words, all per-AG metadata updates for an unmapped block must be
> +completed before the last update to free the extent, and extents should not
> +be reallocated until that last update commits to the log.
> +The second subtlety comes from the fact that AG header buffers are (usually)
> +released between each transaction in a chain.
> +This means that other threads can observe an AG in an intermediate state,
> +but as long as the first subtlety is handled, this should not affect the
> +correctness of filesystem operations.
> +Unmounting the filesystem flushes all pending work to disk, which means that
> +offline fsck never sees the temporary inconsistencies caused by deferred work
> +item processing.
> +In this manner, XFS employs a form of eventual consistency to avoid deadlocks
> +and increase parallelism.
> +
> +During the design phase of the reverse mapping and reflink features, it was
> +decided that it was impractical to cram all the reverse mapping updates for a
> +single filesystem change into a single transaction because a single file
> +mapping operation can explode into many small updates:
> +
> +* The block mapping update itself
> +* A reverse mapping update for the block mapping update
> +* Fixing the freelist
> +* A reverse mapping update for the freelist fix
> +
> +* A shape change to the block mapping btree
> +* A reverse mapping update for the btree update
> +* Fixing the freelist (again)
> +* A reverse mapping update for the freelist fix
> +
> +* An update to the reference counting information
> +* A reverse mapping update for the refcount update
> +* Fixing the freelist (a third time)
> +* A reverse mapping update for the freelist fix
> +
> +* Freeing any space that was unmapped and not owned by any other file
> +* Fixing the freelist (a fourth time)
> +* A reverse mapping update for the freelist fix
> +
> +* Freeing the space used by the block mapping btree
> +* Fixing the freelist (a fifth time)
> +* A reverse mapping update for the freelist fix
> +
> +Free list fixups are not usually needed more than once per AG per transaction
> +chain, but it is theoretically possible if space is very tight.
> +For copy-on-write updates this is even worse, because this must be done once to
> +remove the space from a staging area and again to map it into the file!
> +
> +To deal with this explosion in a calm manner, XFS expands its use of deferred
> +work items to cover most reverse mapping updates and all refcount updates.
> +This reduces the worst case size of transaction reservations by breaking the
> +work into a long chain of small updates, which increases the degree of eventual
> +consistency in the system.
> +Again, this generally isn't a problem because XFS orders its deferred work
> +items carefully to avoid resource reuse conflicts between unsuspecting threads.
> +
> +However, online fsck changes the rules -- remember that although physical
> +updates to per-AG structures are coordinated by locking the buffers for AG
> +headers, buffer locks are dropped between transactions.
> +Once scrub acquires resources and takes locks for a data structure, it must do
> +all the validation work without releasing the lock.
> +If the main lock for a space btree is an AG header buffer lock, scrub may have
> +interrupted another thread that is midway through finishing a chain.
> +For example, if a thread performing a copy-on-write has completed a reverse
> +mapping update but not the corresponding refcount update, the two AG btrees
> +will appear inconsistent to scrub and an observation of corruption will be
> +recorded.  This observation will not be correct.
> +If a repair is attempted in this state, the results will be catastrophic!
> +
> +Several solutions to this problem were evaluated upon discovery of this flaw:
> +
> +1. Add a higher level lock to allocation groups and require writer threads to
> +   acquire the higher level lock in AG order before making any changes.
> +   This would be very difficult to implement in practice because it is
> +   difficult to determine which locks need to be obtained, and in what order,
> +   without simulating the entire operation.
> +   Performing a dry run of a file operation to discover necessary locks would
> +   make the filesystem very slow.
> +
> +2. Make the deferred work coordinator code aware of consecutive intent items
> +   targeting the same AG and have it hold the AG header buffers locked across
> +   the transaction roll between updates.
> +   This would introduce a lot of complexity into the coordinator since it is
> +   only loosely coupled with the actual deferred work items.
> +   It would also fail to solve the problem because deferred work items can
> +   generate new deferred subtasks, but all subtasks must be complete before
> +   work can start on a new sibling task.
> +
> +3. Teach online fsck to walk all transactions waiting for whichever lock(s)
> +   protect the data structure being scrubbed to look for pending operations.
> +   The checking and repair operations must factor these pending operations into
> +   the evaluations being performed.
> +   This solution is a nonstarter because it is *extremely* invasive to the main
> +   filesystem.
> +
> +4. Recognize that only online fsck has this requirement of total consistency
> +   of AG metadata, and that online fsck should be relatively rare as compared
> +   to filesystem change operations.
> +   For each AG, maintain a count of intent items targetting that AG.
> +   When online fsck wants to examine an AG, it should lock the AG header
> +   buffers to quiesce all transaction chains that want to modify that AG, and
> +   only proceed with the scrub if the count is zero.
> +   In other words, scrub only proceeds if it can lock the AG header buffers and
> +   there can't possibly be any intents in progress.
> +   This may lead to fairness and starvation issues, but regular filesystem
> +   updates take precedence over online fsck activity.
> +

Is there any guarantee that some silly real life regular filesystem workload
won't starve online fsck forever?
IOW, is forward progress of online fsck guaranteed?

Good luck with landing online fsck before the 2024 NYE deluge ;)

Thanks,
Amir.
Darrick J. Wong Jan. 5, 2023, 7:40 p.m. UTC | #2
On Thu, Jan 05, 2023 at 11:08:51AM +0200, Amir Goldstein wrote:
> On Sat, Dec 31, 2022 at 12:32 AM Darrick J. Wong <djwong@kernel.org> wrote:
> >
> > From: Darrick J. Wong <djwong@kernel.org>
> >
> > Writes to an XFS filesystem employ an eventual consistency update model
> > to break up complex multistep metadata updates into small chained
> > transactions.  This is generally good for performance and scalability
> > because XFS doesn't need to prepare for enormous transactions, but it
> > also means that online fsck must be careful not to attempt a fsck action
> > unless it can be shown that there are no other threads processing a
> > transaction chain.  This part of the design documentation covers the
> > thinking behind the consistency model and how scrub deals with it.
> >
> > Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> > ---
> >  .../filesystems/xfs-online-fsck-design.rst         |  303 ++++++++++++++++++++
> >  1 file changed, 303 insertions(+)
> >
> >
> > diff --git a/Documentation/filesystems/xfs-online-fsck-design.rst b/Documentation/filesystems/xfs-online-fsck-design.rst
> > index f45bf97fa9c4..419eb54ee200 100644
> > --- a/Documentation/filesystems/xfs-online-fsck-design.rst
> > +++ b/Documentation/filesystems/xfs-online-fsck-design.rst
> > @@ -1443,3 +1443,306 @@ This step is critical for enabling system administrator to monitor the status
> >  of the filesystem and the progress of any repairs.
> >  For developers, it is a useful means to judge the efficacy of error detection
> >  and correction in the online and offline checking tools.
> > +
> > +Eventual Consistency vs. Online Fsck
> > +------------------------------------
> > +
> > +Midway through the development of online scrubbing, the fsstress tests
> > +uncovered a misinteraction between online fsck and compound transaction chains
> > +created by other writer threads that resulted in false reports of metadata
> > +inconsistency.
> > +The root cause of these reports is the eventual consistency model introduced by
> > +the expansion of deferred work items and compound transaction chains when
> > +reverse mapping and reflink were introduced.
> > +
> > +Originally, transaction chains were added to XFS to avoid deadlocks when
> > +unmapping space from files.
> > +Deadlock avoidance rules require that AGs only be locked in increasing order,
> > +which makes it impossible (say) to use a single transaction to free a space
> > +extent in AG 7 and then try to free a now superfluous block mapping btree block
> > +in AG 3.
> > +To avoid these kinds of deadlocks, XFS creates Extent Freeing Intent (EFI) log
> > +items to commit to freeing some space in one transaction while deferring the
> > +actual metadata updates to a fresh transaction.
> > +The transaction sequence looks like this:
> > +
> > +1. The first transaction contains a physical update to the file's block mapping
> > +   structures to remove the mapping from the btree blocks.
> > +   It then attaches to the in-memory transaction an action item to schedule
> > +   deferred freeing of space.
> > +   Concretely, each transaction maintains a list of ``struct
> > +   xfs_defer_pending`` objects, each of which maintains a list of ``struct
> > +   xfs_extent_free_item`` objects.
> > +   Returning to the example above, the action item tracks the freeing of both
> > +   the unmapped space from AG 7 and the block mapping btree (BMBT) block from
> > +   AG 3.
> > +   Deferred frees recorded in this manner are committed in the log by creating
> > +   an EFI log item from the ``struct xfs_extent_free_item`` object and
> > +   attaching the log item to the transaction.
> > +   When the log is persisted to disk, the EFI item is written into the ondisk
> > +   transaction record.
> > +   EFIs can list up to 16 extents to free, all sorted in AG order.
> > +
> > +2. The second transaction contains a physical update to the free space btrees
> > +   of AG 3 to release the former BMBT block and a second physical update to the
> > +   free space btrees of AG 7 to release the unmapped file space.
> > +   Observe that the the physical updates are resequenced in the correct order
> > +   when possible.
> > +   Attached to the transaction is a an extent free done (EFD) log item.
> > +   The EFD contains a pointer to the EFI logged in transaction #1 so that log
> > +   recovery can tell if the EFI needs to be replayed.
> > +
> > +If the system goes down after transaction #1 is written back to the filesystem
> > +but before #2 is committed, a scan of the filesystem metadata would show
> > +inconsistent filesystem metadata because there would not appear to be any owner
> > +of the unmapped space.
> > +Happily, log recovery corrects this inconsistency for us -- when recovery finds
> > +an intent log item but does not find a corresponding intent done item, it will
> > +reconstruct the incore state of the intent item and finish it.
> > +In the example above, the log must replay both frees described in the recovered
> > +EFI to complete the recovery phase.
> > +
> > +There are two subtleties to XFS' transaction chaining strategy to consider.
> > +The first is that log items must be added to a transaction in the correct order
> > +to prevent conflicts with principal objects that are not held by the
> > +transaction.
> > +In other words, all per-AG metadata updates for an unmapped block must be
> > +completed before the last update to free the extent, and extents should not
> > +be reallocated until that last update commits to the log.
> > +The second subtlety comes from the fact that AG header buffers are (usually)
> > +released between each transaction in a chain.
> > +This means that other threads can observe an AG in an intermediate state,
> > +but as long as the first subtlety is handled, this should not affect the
> > +correctness of filesystem operations.
> > +Unmounting the filesystem flushes all pending work to disk, which means that
> > +offline fsck never sees the temporary inconsistencies caused by deferred work
> > +item processing.
> > +In this manner, XFS employs a form of eventual consistency to avoid deadlocks
> > +and increase parallelism.
> > +
> > +During the design phase of the reverse mapping and reflink features, it was
> > +decided that it was impractical to cram all the reverse mapping updates for a
> > +single filesystem change into a single transaction because a single file
> > +mapping operation can explode into many small updates:
> > +
> > +* The block mapping update itself
> > +* A reverse mapping update for the block mapping update
> > +* Fixing the freelist
> > +* A reverse mapping update for the freelist fix
> > +
> > +* A shape change to the block mapping btree
> > +* A reverse mapping update for the btree update
> > +* Fixing the freelist (again)
> > +* A reverse mapping update for the freelist fix
> > +
> > +* An update to the reference counting information
> > +* A reverse mapping update for the refcount update
> > +* Fixing the freelist (a third time)
> > +* A reverse mapping update for the freelist fix
> > +
> > +* Freeing any space that was unmapped and not owned by any other file
> > +* Fixing the freelist (a fourth time)
> > +* A reverse mapping update for the freelist fix
> > +
> > +* Freeing the space used by the block mapping btree
> > +* Fixing the freelist (a fifth time)
> > +* A reverse mapping update for the freelist fix
> > +
> > +Free list fixups are not usually needed more than once per AG per transaction
> > +chain, but it is theoretically possible if space is very tight.
> > +For copy-on-write updates this is even worse, because this must be done once to
> > +remove the space from a staging area and again to map it into the file!
> > +
> > +To deal with this explosion in a calm manner, XFS expands its use of deferred
> > +work items to cover most reverse mapping updates and all refcount updates.
> > +This reduces the worst case size of transaction reservations by breaking the
> > +work into a long chain of small updates, which increases the degree of eventual
> > +consistency in the system.
> > +Again, this generally isn't a problem because XFS orders its deferred work
> > +items carefully to avoid resource reuse conflicts between unsuspecting threads.
> > +
> > +However, online fsck changes the rules -- remember that although physical
> > +updates to per-AG structures are coordinated by locking the buffers for AG
> > +headers, buffer locks are dropped between transactions.
> > +Once scrub acquires resources and takes locks for a data structure, it must do
> > +all the validation work without releasing the lock.
> > +If the main lock for a space btree is an AG header buffer lock, scrub may have
> > +interrupted another thread that is midway through finishing a chain.
> > +For example, if a thread performing a copy-on-write has completed a reverse
> > +mapping update but not the corresponding refcount update, the two AG btrees
> > +will appear inconsistent to scrub and an observation of corruption will be
> > +recorded.  This observation will not be correct.
> > +If a repair is attempted in this state, the results will be catastrophic!
> > +
> > +Several solutions to this problem were evaluated upon discovery of this flaw:
> > +
> > +1. Add a higher level lock to allocation groups and require writer threads to
> > +   acquire the higher level lock in AG order before making any changes.
> > +   This would be very difficult to implement in practice because it is
> > +   difficult to determine which locks need to be obtained, and in what order,
> > +   without simulating the entire operation.
> > +   Performing a dry run of a file operation to discover necessary locks would
> > +   make the filesystem very slow.
> > +
> > +2. Make the deferred work coordinator code aware of consecutive intent items
> > +   targeting the same AG and have it hold the AG header buffers locked across
> > +   the transaction roll between updates.
> > +   This would introduce a lot of complexity into the coordinator since it is
> > +   only loosely coupled with the actual deferred work items.
> > +   It would also fail to solve the problem because deferred work items can
> > +   generate new deferred subtasks, but all subtasks must be complete before
> > +   work can start on a new sibling task.
> > +
> > +3. Teach online fsck to walk all transactions waiting for whichever lock(s)
> > +   protect the data structure being scrubbed to look for pending operations.
> > +   The checking and repair operations must factor these pending operations into
> > +   the evaluations being performed.
> > +   This solution is a nonstarter because it is *extremely* invasive to the main
> > +   filesystem.
> > +
> > +4. Recognize that only online fsck has this requirement of total consistency
> > +   of AG metadata, and that online fsck should be relatively rare as compared
> > +   to filesystem change operations.
> > +   For each AG, maintain a count of intent items targetting that AG.
> > +   When online fsck wants to examine an AG, it should lock the AG header
> > +   buffers to quiesce all transaction chains that want to modify that AG, and
> > +   only proceed with the scrub if the count is zero.
> > +   In other words, scrub only proceeds if it can lock the AG header buffers and
> > +   there can't possibly be any intents in progress.
> > +   This may lead to fairness and starvation issues, but regular filesystem
> > +   updates take precedence over online fsck activity.
> > +
> 
> Is there any guarantee that some silly real life regular filesystem workload
> won't starve online fsck forever?
> IOW, is forward progress of online fsck guaranteed?

Nope, forward progress isn't guaranteed.  The kernel checks for fatal
signals every time it backs off a scrub so at least we don't end up with
unkillable processes.  At one point I added a timeout field to the ioctl
interface so that the kernel could time out an operation if it took too
long to acquire the necessary resources.  So far, the "race fsstress and
xfs_scrub" tests have not shown scrub failing to make any forward
progress.

That said, I have /not/ yet had a chance to try it out any of these
massive 1000-core systems with an according workload.

> Good luck with landing online fsck before the 2024 NYE deluge ;)

Thank *you* for reading this chapter of the design document!! :)

--D

> Thanks,
> Amir.
Amir Goldstein Jan. 6, 2023, 3:33 a.m. UTC | #3
On Thu, Jan 5, 2023 at 9:40 PM Darrick J. Wong <djwong@kernel.org> wrote:
>
> On Thu, Jan 05, 2023 at 11:08:51AM +0200, Amir Goldstein wrote:
> > On Sat, Dec 31, 2022 at 12:32 AM Darrick J. Wong <djwong@kernel.org> wrote:
> > >
> > > From: Darrick J. Wong <djwong@kernel.org>
> > >
> > > Writes to an XFS filesystem employ an eventual consistency update model
> > > to break up complex multistep metadata updates into small chained
> > > transactions.  This is generally good for performance and scalability
> > > because XFS doesn't need to prepare for enormous transactions, but it
> > > also means that online fsck must be careful not to attempt a fsck action
> > > unless it can be shown that there are no other threads processing a
> > > transaction chain.  This part of the design documentation covers the
> > > thinking behind the consistency model and how scrub deals with it.
> > >
> > > Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> > > ---
> > >  .../filesystems/xfs-online-fsck-design.rst         |  303 ++++++++++++++++++++
> > >  1 file changed, 303 insertions(+)
> > >
> > >
> > > diff --git a/Documentation/filesystems/xfs-online-fsck-design.rst b/Documentation/filesystems/xfs-online-fsck-design.rst
> > > index f45bf97fa9c4..419eb54ee200 100644
> > > --- a/Documentation/filesystems/xfs-online-fsck-design.rst
> > > +++ b/Documentation/filesystems/xfs-online-fsck-design.rst
> > > @@ -1443,3 +1443,306 @@ This step is critical for enabling system administrator to monitor the status
> > >  of the filesystem and the progress of any repairs.
> > >  For developers, it is a useful means to judge the efficacy of error detection
> > >  and correction in the online and offline checking tools.
> > > +
> > > +Eventual Consistency vs. Online Fsck
> > > +------------------------------------
> > > +
> > > +Midway through the development of online scrubbing, the fsstress tests
> > > +uncovered a misinteraction between online fsck and compound transaction chains
> > > +created by other writer threads that resulted in false reports of metadata
> > > +inconsistency.
> > > +The root cause of these reports is the eventual consistency model introduced by
> > > +the expansion of deferred work items and compound transaction chains when
> > > +reverse mapping and reflink were introduced.
> > > +
> > > +Originally, transaction chains were added to XFS to avoid deadlocks when
> > > +unmapping space from files.
> > > +Deadlock avoidance rules require that AGs only be locked in increasing order,
> > > +which makes it impossible (say) to use a single transaction to free a space
> > > +extent in AG 7 and then try to free a now superfluous block mapping btree block
> > > +in AG 3.
> > > +To avoid these kinds of deadlocks, XFS creates Extent Freeing Intent (EFI) log
> > > +items to commit to freeing some space in one transaction while deferring the
> > > +actual metadata updates to a fresh transaction.
> > > +The transaction sequence looks like this:
> > > +
> > > +1. The first transaction contains a physical update to the file's block mapping
> > > +   structures to remove the mapping from the btree blocks.
> > > +   It then attaches to the in-memory transaction an action item to schedule
> > > +   deferred freeing of space.
> > > +   Concretely, each transaction maintains a list of ``struct
> > > +   xfs_defer_pending`` objects, each of which maintains a list of ``struct
> > > +   xfs_extent_free_item`` objects.
> > > +   Returning to the example above, the action item tracks the freeing of both
> > > +   the unmapped space from AG 7 and the block mapping btree (BMBT) block from
> > > +   AG 3.
> > > +   Deferred frees recorded in this manner are committed in the log by creating
> > > +   an EFI log item from the ``struct xfs_extent_free_item`` object and
> > > +   attaching the log item to the transaction.
> > > +   When the log is persisted to disk, the EFI item is written into the ondisk
> > > +   transaction record.
> > > +   EFIs can list up to 16 extents to free, all sorted in AG order.
> > > +
> > > +2. The second transaction contains a physical update to the free space btrees
> > > +   of AG 3 to release the former BMBT block and a second physical update to the
> > > +   free space btrees of AG 7 to release the unmapped file space.
> > > +   Observe that the the physical updates are resequenced in the correct order
> > > +   when possible.
> > > +   Attached to the transaction is a an extent free done (EFD) log item.
> > > +   The EFD contains a pointer to the EFI logged in transaction #1 so that log
> > > +   recovery can tell if the EFI needs to be replayed.
> > > +
> > > +If the system goes down after transaction #1 is written back to the filesystem
> > > +but before #2 is committed, a scan of the filesystem metadata would show
> > > +inconsistent filesystem metadata because there would not appear to be any owner
> > > +of the unmapped space.
> > > +Happily, log recovery corrects this inconsistency for us -- when recovery finds
> > > +an intent log item but does not find a corresponding intent done item, it will
> > > +reconstruct the incore state of the intent item and finish it.
> > > +In the example above, the log must replay both frees described in the recovered
> > > +EFI to complete the recovery phase.
> > > +
> > > +There are two subtleties to XFS' transaction chaining strategy to consider.
> > > +The first is that log items must be added to a transaction in the correct order
> > > +to prevent conflicts with principal objects that are not held by the
> > > +transaction.
> > > +In other words, all per-AG metadata updates for an unmapped block must be
> > > +completed before the last update to free the extent, and extents should not
> > > +be reallocated until that last update commits to the log.
> > > +The second subtlety comes from the fact that AG header buffers are (usually)
> > > +released between each transaction in a chain.
> > > +This means that other threads can observe an AG in an intermediate state,
> > > +but as long as the first subtlety is handled, this should not affect the
> > > +correctness of filesystem operations.
> > > +Unmounting the filesystem flushes all pending work to disk, which means that
> > > +offline fsck never sees the temporary inconsistencies caused by deferred work
> > > +item processing.
> > > +In this manner, XFS employs a form of eventual consistency to avoid deadlocks
> > > +and increase parallelism.
> > > +
> > > +During the design phase of the reverse mapping and reflink features, it was
> > > +decided that it was impractical to cram all the reverse mapping updates for a
> > > +single filesystem change into a single transaction because a single file
> > > +mapping operation can explode into many small updates:
> > > +
> > > +* The block mapping update itself
> > > +* A reverse mapping update for the block mapping update
> > > +* Fixing the freelist
> > > +* A reverse mapping update for the freelist fix
> > > +
> > > +* A shape change to the block mapping btree
> > > +* A reverse mapping update for the btree update
> > > +* Fixing the freelist (again)
> > > +* A reverse mapping update for the freelist fix
> > > +
> > > +* An update to the reference counting information
> > > +* A reverse mapping update for the refcount update
> > > +* Fixing the freelist (a third time)
> > > +* A reverse mapping update for the freelist fix
> > > +
> > > +* Freeing any space that was unmapped and not owned by any other file
> > > +* Fixing the freelist (a fourth time)
> > > +* A reverse mapping update for the freelist fix
> > > +
> > > +* Freeing the space used by the block mapping btree
> > > +* Fixing the freelist (a fifth time)
> > > +* A reverse mapping update for the freelist fix
> > > +
> > > +Free list fixups are not usually needed more than once per AG per transaction
> > > +chain, but it is theoretically possible if space is very tight.
> > > +For copy-on-write updates this is even worse, because this must be done once to
> > > +remove the space from a staging area and again to map it into the file!
> > > +
> > > +To deal with this explosion in a calm manner, XFS expands its use of deferred
> > > +work items to cover most reverse mapping updates and all refcount updates.
> > > +This reduces the worst case size of transaction reservations by breaking the
> > > +work into a long chain of small updates, which increases the degree of eventual
> > > +consistency in the system.
> > > +Again, this generally isn't a problem because XFS orders its deferred work
> > > +items carefully to avoid resource reuse conflicts between unsuspecting threads.
> > > +
> > > +However, online fsck changes the rules -- remember that although physical
> > > +updates to per-AG structures are coordinated by locking the buffers for AG
> > > +headers, buffer locks are dropped between transactions.
> > > +Once scrub acquires resources and takes locks for a data structure, it must do
> > > +all the validation work without releasing the lock.
> > > +If the main lock for a space btree is an AG header buffer lock, scrub may have
> > > +interrupted another thread that is midway through finishing a chain.
> > > +For example, if a thread performing a copy-on-write has completed a reverse
> > > +mapping update but not the corresponding refcount update, the two AG btrees
> > > +will appear inconsistent to scrub and an observation of corruption will be
> > > +recorded.  This observation will not be correct.
> > > +If a repair is attempted in this state, the results will be catastrophic!
> > > +
> > > +Several solutions to this problem were evaluated upon discovery of this flaw:
> > > +
> > > +1. Add a higher level lock to allocation groups and require writer threads to
> > > +   acquire the higher level lock in AG order before making any changes.
> > > +   This would be very difficult to implement in practice because it is
> > > +   difficult to determine which locks need to be obtained, and in what order,
> > > +   without simulating the entire operation.
> > > +   Performing a dry run of a file operation to discover necessary locks would
> > > +   make the filesystem very slow.
> > > +
> > > +2. Make the deferred work coordinator code aware of consecutive intent items
> > > +   targeting the same AG and have it hold the AG header buffers locked across
> > > +   the transaction roll between updates.
> > > +   This would introduce a lot of complexity into the coordinator since it is
> > > +   only loosely coupled with the actual deferred work items.
> > > +   It would also fail to solve the problem because deferred work items can
> > > +   generate new deferred subtasks, but all subtasks must be complete before
> > > +   work can start on a new sibling task.
> > > +
> > > +3. Teach online fsck to walk all transactions waiting for whichever lock(s)
> > > +   protect the data structure being scrubbed to look for pending operations.
> > > +   The checking and repair operations must factor these pending operations into
> > > +   the evaluations being performed.
> > > +   This solution is a nonstarter because it is *extremely* invasive to the main
> > > +   filesystem.
> > > +
> > > +4. Recognize that only online fsck has this requirement of total consistency
> > > +   of AG metadata, and that online fsck should be relatively rare as compared
> > > +   to filesystem change operations.
> > > +   For each AG, maintain a count of intent items targetting that AG.
> > > +   When online fsck wants to examine an AG, it should lock the AG header
> > > +   buffers to quiesce all transaction chains that want to modify that AG, and
> > > +   only proceed with the scrub if the count is zero.
> > > +   In other words, scrub only proceeds if it can lock the AG header buffers and
> > > +   there can't possibly be any intents in progress.
> > > +   This may lead to fairness and starvation issues, but regular filesystem
> > > +   updates take precedence over online fsck activity.
> > > +
> >
> > Is there any guarantee that some silly real life regular filesystem workload
> > won't starve online fsck forever?
> > IOW, is forward progress of online fsck guaranteed?
>
> Nope, forward progress isn't guaranteed.

That sounds like a problem.

> The kernel checks for fatal
> signals every time it backs off a scrub so at least we don't end up with
> unkillable processes.  At one point I added a timeout field to the ioctl
> interface so that the kernel could time out an operation if it took too
> long to acquire the necessary resources.  So far, the "race fsstress and
> xfs_scrub" tests have not shown scrub failing to make any forward
> progress.
>
> That said, I have /not/ yet had a chance to try it out any of these
> massive 1000-core systems with an according workload.
>

Don't know if fsstress is the best way to check the worst case scenario.

Can you think of a workload, say several threads creating and deleting
temp files, with deferred parent pointer items preventing the queue from
ever draining?

Considering that a "full journal" scenario is always going to be a possible
worst case incident, how bad would it be to block new transactions
instead of the possibility of starving scrub consistency checks forever?

Wouldn't the consistency checks be much faster than freeing journal
space would be in a "full journal" situation?

I don't know if there is a "mission statement" for online fsck, but
I think it would say "minimal user interference" not "no user interference".
It sounds like the interference we are trying to avoid is light years away
from the downtime of offline fsck, so online fsck would still be a huge win.
online fsck that never ends OTOH... maybe less so.

> > Good luck with landing online fsck before the 2024 NYE deluge ;)
>
> Thank *you* for reading this chapter of the design document!! :)
>

Oh I read them all at the summer submission, but it took me so long
that I forgot to follow up..

My other question was regarding memory usage control.
I have horrid memories from e2fsck unpredictable memory usage
and unpredictable runtime due to swapping.

xfs_repair -m was a huge improvement compared to e2fsck.
I don't remember reading about memory usage limits for online repair,
so I was concerned about unpredictable memory usage and swapping.
Can you say something to ease those concerns?

Thanks,
Amir.
Darrick J. Wong Jan. 11, 2023, 5:54 p.m. UTC | #4
On Fri, Jan 06, 2023 at 05:33:00AM +0200, Amir Goldstein wrote:
> On Thu, Jan 5, 2023 at 9:40 PM Darrick J. Wong <djwong@kernel.org> wrote:
> >
> > On Thu, Jan 05, 2023 at 11:08:51AM +0200, Amir Goldstein wrote:
> > > On Sat, Dec 31, 2022 at 12:32 AM Darrick J. Wong <djwong@kernel.org> wrote:
> > > >
> > > > From: Darrick J. Wong <djwong@kernel.org>
> > > >
> > > > Writes to an XFS filesystem employ an eventual consistency update model
> > > > to break up complex multistep metadata updates into small chained
> > > > transactions.  This is generally good for performance and scalability
> > > > because XFS doesn't need to prepare for enormous transactions, but it
> > > > also means that online fsck must be careful not to attempt a fsck action
> > > > unless it can be shown that there are no other threads processing a
> > > > transaction chain.  This part of the design documentation covers the
> > > > thinking behind the consistency model and how scrub deals with it.
> > > >
> > > > Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> > > > ---
> > > >  .../filesystems/xfs-online-fsck-design.rst         |  303 ++++++++++++++++++++
> > > >  1 file changed, 303 insertions(+)
> > > >
> > > >
> > > > diff --git a/Documentation/filesystems/xfs-online-fsck-design.rst b/Documentation/filesystems/xfs-online-fsck-design.rst
> > > > index f45bf97fa9c4..419eb54ee200 100644
> > > > --- a/Documentation/filesystems/xfs-online-fsck-design.rst
> > > > +++ b/Documentation/filesystems/xfs-online-fsck-design.rst
> > > > @@ -1443,3 +1443,306 @@ This step is critical for enabling system administrator to monitor the status
> > > >  of the filesystem and the progress of any repairs.
> > > >  For developers, it is a useful means to judge the efficacy of error detection
> > > >  and correction in the online and offline checking tools.
> > > > +
> > > > +Eventual Consistency vs. Online Fsck
> > > > +------------------------------------
> > > > +
> > > > +Midway through the development of online scrubbing, the fsstress tests
> > > > +uncovered a misinteraction between online fsck and compound transaction chains
> > > > +created by other writer threads that resulted in false reports of metadata
> > > > +inconsistency.
> > > > +The root cause of these reports is the eventual consistency model introduced by
> > > > +the expansion of deferred work items and compound transaction chains when
> > > > +reverse mapping and reflink were introduced.
> > > > +
> > > > +Originally, transaction chains were added to XFS to avoid deadlocks when
> > > > +unmapping space from files.
> > > > +Deadlock avoidance rules require that AGs only be locked in increasing order,
> > > > +which makes it impossible (say) to use a single transaction to free a space
> > > > +extent in AG 7 and then try to free a now superfluous block mapping btree block
> > > > +in AG 3.
> > > > +To avoid these kinds of deadlocks, XFS creates Extent Freeing Intent (EFI) log
> > > > +items to commit to freeing some space in one transaction while deferring the
> > > > +actual metadata updates to a fresh transaction.
> > > > +The transaction sequence looks like this:
> > > > +
> > > > +1. The first transaction contains a physical update to the file's block mapping
> > > > +   structures to remove the mapping from the btree blocks.
> > > > +   It then attaches to the in-memory transaction an action item to schedule
> > > > +   deferred freeing of space.
> > > > +   Concretely, each transaction maintains a list of ``struct
> > > > +   xfs_defer_pending`` objects, each of which maintains a list of ``struct
> > > > +   xfs_extent_free_item`` objects.
> > > > +   Returning to the example above, the action item tracks the freeing of both
> > > > +   the unmapped space from AG 7 and the block mapping btree (BMBT) block from
> > > > +   AG 3.
> > > > +   Deferred frees recorded in this manner are committed in the log by creating
> > > > +   an EFI log item from the ``struct xfs_extent_free_item`` object and
> > > > +   attaching the log item to the transaction.
> > > > +   When the log is persisted to disk, the EFI item is written into the ondisk
> > > > +   transaction record.
> > > > +   EFIs can list up to 16 extents to free, all sorted in AG order.
> > > > +
> > > > +2. The second transaction contains a physical update to the free space btrees
> > > > +   of AG 3 to release the former BMBT block and a second physical update to the
> > > > +   free space btrees of AG 7 to release the unmapped file space.
> > > > +   Observe that the the physical updates are resequenced in the correct order
> > > > +   when possible.
> > > > +   Attached to the transaction is a an extent free done (EFD) log item.
> > > > +   The EFD contains a pointer to the EFI logged in transaction #1 so that log
> > > > +   recovery can tell if the EFI needs to be replayed.
> > > > +
> > > > +If the system goes down after transaction #1 is written back to the filesystem
> > > > +but before #2 is committed, a scan of the filesystem metadata would show
> > > > +inconsistent filesystem metadata because there would not appear to be any owner
> > > > +of the unmapped space.
> > > > +Happily, log recovery corrects this inconsistency for us -- when recovery finds
> > > > +an intent log item but does not find a corresponding intent done item, it will
> > > > +reconstruct the incore state of the intent item and finish it.
> > > > +In the example above, the log must replay both frees described in the recovered
> > > > +EFI to complete the recovery phase.
> > > > +
> > > > +There are two subtleties to XFS' transaction chaining strategy to consider.
> > > > +The first is that log items must be added to a transaction in the correct order
> > > > +to prevent conflicts with principal objects that are not held by the
> > > > +transaction.
> > > > +In other words, all per-AG metadata updates for an unmapped block must be
> > > > +completed before the last update to free the extent, and extents should not
> > > > +be reallocated until that last update commits to the log.
> > > > +The second subtlety comes from the fact that AG header buffers are (usually)
> > > > +released between each transaction in a chain.
> > > > +This means that other threads can observe an AG in an intermediate state,
> > > > +but as long as the first subtlety is handled, this should not affect the
> > > > +correctness of filesystem operations.
> > > > +Unmounting the filesystem flushes all pending work to disk, which means that
> > > > +offline fsck never sees the temporary inconsistencies caused by deferred work
> > > > +item processing.
> > > > +In this manner, XFS employs a form of eventual consistency to avoid deadlocks
> > > > +and increase parallelism.
> > > > +
> > > > +During the design phase of the reverse mapping and reflink features, it was
> > > > +decided that it was impractical to cram all the reverse mapping updates for a
> > > > +single filesystem change into a single transaction because a single file
> > > > +mapping operation can explode into many small updates:
> > > > +
> > > > +* The block mapping update itself
> > > > +* A reverse mapping update for the block mapping update
> > > > +* Fixing the freelist
> > > > +* A reverse mapping update for the freelist fix
> > > > +
> > > > +* A shape change to the block mapping btree
> > > > +* A reverse mapping update for the btree update
> > > > +* Fixing the freelist (again)
> > > > +* A reverse mapping update for the freelist fix
> > > > +
> > > > +* An update to the reference counting information
> > > > +* A reverse mapping update for the refcount update
> > > > +* Fixing the freelist (a third time)
> > > > +* A reverse mapping update for the freelist fix
> > > > +
> > > > +* Freeing any space that was unmapped and not owned by any other file
> > > > +* Fixing the freelist (a fourth time)
> > > > +* A reverse mapping update for the freelist fix
> > > > +
> > > > +* Freeing the space used by the block mapping btree
> > > > +* Fixing the freelist (a fifth time)
> > > > +* A reverse mapping update for the freelist fix
> > > > +
> > > > +Free list fixups are not usually needed more than once per AG per transaction
> > > > +chain, but it is theoretically possible if space is very tight.
> > > > +For copy-on-write updates this is even worse, because this must be done once to
> > > > +remove the space from a staging area and again to map it into the file!
> > > > +
> > > > +To deal with this explosion in a calm manner, XFS expands its use of deferred
> > > > +work items to cover most reverse mapping updates and all refcount updates.
> > > > +This reduces the worst case size of transaction reservations by breaking the
> > > > +work into a long chain of small updates, which increases the degree of eventual
> > > > +consistency in the system.
> > > > +Again, this generally isn't a problem because XFS orders its deferred work
> > > > +items carefully to avoid resource reuse conflicts between unsuspecting threads.
> > > > +
> > > > +However, online fsck changes the rules -- remember that although physical
> > > > +updates to per-AG structures are coordinated by locking the buffers for AG
> > > > +headers, buffer locks are dropped between transactions.
> > > > +Once scrub acquires resources and takes locks for a data structure, it must do
> > > > +all the validation work without releasing the lock.
> > > > +If the main lock for a space btree is an AG header buffer lock, scrub may have
> > > > +interrupted another thread that is midway through finishing a chain.
> > > > +For example, if a thread performing a copy-on-write has completed a reverse
> > > > +mapping update but not the corresponding refcount update, the two AG btrees
> > > > +will appear inconsistent to scrub and an observation of corruption will be
> > > > +recorded.  This observation will not be correct.
> > > > +If a repair is attempted in this state, the results will be catastrophic!
> > > > +
> > > > +Several solutions to this problem were evaluated upon discovery of this flaw:
> > > > +
> > > > +1. Add a higher level lock to allocation groups and require writer threads to
> > > > +   acquire the higher level lock in AG order before making any changes.
> > > > +   This would be very difficult to implement in practice because it is
> > > > +   difficult to determine which locks need to be obtained, and in what order,
> > > > +   without simulating the entire operation.
> > > > +   Performing a dry run of a file operation to discover necessary locks would
> > > > +   make the filesystem very slow.
> > > > +
> > > > +2. Make the deferred work coordinator code aware of consecutive intent items
> > > > +   targeting the same AG and have it hold the AG header buffers locked across
> > > > +   the transaction roll between updates.
> > > > +   This would introduce a lot of complexity into the coordinator since it is
> > > > +   only loosely coupled with the actual deferred work items.
> > > > +   It would also fail to solve the problem because deferred work items can
> > > > +   generate new deferred subtasks, but all subtasks must be complete before
> > > > +   work can start on a new sibling task.
> > > > +
> > > > +3. Teach online fsck to walk all transactions waiting for whichever lock(s)
> > > > +   protect the data structure being scrubbed to look for pending operations.
> > > > +   The checking and repair operations must factor these pending operations into
> > > > +   the evaluations being performed.
> > > > +   This solution is a nonstarter because it is *extremely* invasive to the main
> > > > +   filesystem.
> > > > +
> > > > +4. Recognize that only online fsck has this requirement of total consistency
> > > > +   of AG metadata, and that online fsck should be relatively rare as compared
> > > > +   to filesystem change operations.
> > > > +   For each AG, maintain a count of intent items targetting that AG.
> > > > +   When online fsck wants to examine an AG, it should lock the AG header
> > > > +   buffers to quiesce all transaction chains that want to modify that AG, and
> > > > +   only proceed with the scrub if the count is zero.
> > > > +   In other words, scrub only proceeds if it can lock the AG header buffers and
> > > > +   there can't possibly be any intents in progress.
> > > > +   This may lead to fairness and starvation issues, but regular filesystem
> > > > +   updates take precedence over online fsck activity.
> > > > +
> > >
> > > Is there any guarantee that some silly real life regular filesystem workload
> > > won't starve online fsck forever?
> > > IOW, is forward progress of online fsck guaranteed?
> >
> > Nope, forward progress isn't guaranteed.
> 
> That sounds like a problem.

So far it hasn't been.  I prefer to sacrifice performance of the
background fsck service for the sake of foreground tasks.  The fsstress
and fsx fstests haven't shown any particularly serious issues.  I've
also kicked off xfs_scrub on the same VM hosts that are running the fuzz
test suite (~52 VMs per host) and scrub can still finish the filesystem
in a couple of hours.

Things get markedly worse on spinning rust with a lot of parallel
unwritten extent conversions and allocations going on (aka the disk
backup systems).  Normally a backup from flash to rust takes about an
hour; with scrub and backup contending for the head actuator, it'll go
up to about 2-3 hours, but both tasks can make (verrrry slow) forward
progress.

That said -- the backup program spends a lot of iowait time waiting for
file data blocks to read in or get written back, so the contention is on
the storage hardware, not the filesystem locks.

> > The kernel checks for fatal
> > signals every time it backs off a scrub so at least we don't end up with
> > unkillable processes.  At one point I added a timeout field to the ioctl
> > interface so that the kernel could time out an operation if it took too
> > long to acquire the necessary resources.  So far, the "race fsstress and
> > xfs_scrub" tests have not shown scrub failing to make any forward
> > progress.
> >
> > That said, I have /not/ yet had a chance to try it out any of these
> > massive 1000-core systems with an according workload.
> >
> 
> Don't know if fsstress is the best way to check the worst case scenario.
> 
> Can you think of a workload, say several threads creating and deleting
> temp files, with deferred parent pointer items preventing the queue from
> ever draining?

The worst workload would be one that is entirely metadata based -- a
giant directory tree full of empty files with all information being
stored as extended attributes.

> Considering that a "full journal" scenario is always going to be a possible
> worst case incident, how bad would it be to block new transactions
> instead of the possibility of starving scrub consistency checks forever?

First of all, scrub has already allocated a transaction by the time it
gets to the intent drain step.  There's no good way to block new
transactions once we've reached this stage, nor should there be.
Blocking transactions stalls xfs garbage collection and memory reclaim.

> Wouldn't the consistency checks be much faster than freeing journal
> space would be in a "full journal" situation?

I haven't investigated this in depth, but yes, scrub should be faster
than forcing the log and checkpointing the log to move the log tail
forward to empty out the journal.

> I don't know if there is a "mission statement" for online fsck, but
> I think it would say "minimal user interference" not "no user interference".

Yes.  The section about eventual consistency states that "...regular
filesystem updates take precedence over online fsck activity".

> It sounds like the interference we are trying to avoid is light years away
> from the downtime of offline fsck, so online fsck would still be a huge win.
> online fsck that never ends OTOH... maybe less so.

Well you /can/ just kill the xfs_scrub processes if they are taking too
much time.  One of the nastier papercuts of the background scrub is that
the fs cannot be unmounted while it's running, and systemd doesn't have
a good mechanism for "kill this service before stopping this mount".  Or
maybe it does and I haven't yet found it?

(The cronjob variant definitely suffers from that...)

> > > Good luck with landing online fsck before the 2024 NYE deluge ;)
> >
> > Thank *you* for reading this chapter of the design document!! :)
> >
> 
> Oh I read them all at the summer submission, but it took me so long
> that I forgot to follow up..

Yeah, that seems to be a common problem with large new features. :/

> My other question was regarding memory usage control.
> I have horrid memories from e2fsck unpredictable memory usage
> and unpredictable runtime due to swapping.
> 
> xfs_repair -m was a huge improvement compared to e2fsck.
> I don't remember reading about memory usage limits for online repair,
> so I was concerned about unpredictable memory usage and swapping.
> Can you say something to ease those concerns?

Both e2fsck and xfs_repair have to be capable of repairing the entire
filesystem all at once, which means that they allocate many many of
incore objects from which all of the ondisk space metadata (ag btrees in
the case of xfs, bitmaps for e2fsck) is regenerated.  Since the fs is
offline, it's considered advantageous to perform *one* scan and rebuild
everything all at once, even if the memory cost is high.

xfs_scrub scans and repairs each metadata object individually, which
means that it only needs to allocate as much (kernel/xfile) memory as
needed to scan a single btree/inode record/quota record/bitmap.  For
scans the memory requirements are usually minimal since it creates a
bunch of btree cursors and cross-references records.

For repairs, the memory requirements are on the order of the size of the
new data structure that will be written out.  We scan the fs to build
the new recordset in memory, compute the size of the new btree, allocate
some blocks, and format the records into the blocks before committing
the btree root.

For summary data (e.g. link counts, dquots) we build a shadow copy in
memory, so the memory requirements are on the order of the number of
files in the fs and the number of uid/gid/projid in the filesystem,
respectively.

Most of the intermediate structures are stuffed into a tmpfs file, which
means they can be paged out to disk.  If there's really no memory
available, scrub can abort all the way out to userspace provided it
hasn't committed anything to disk yet.

IOWs, online fsck generally only requires enough memory to build a new
copy of whichever objects it happens to be scanning at any given moment.
The background service runs single-threaded to avoid consuming a lot of
CPU or memory.

--D

> Thanks,
> Amir.
Allison Henderson Jan. 31, 2023, 6:11 a.m. UTC | #5
On Fri, 2022-12-30 at 14:10 -0800, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
> 
> Writes to an XFS filesystem employ an eventual consistency update
> model
> to break up complex multistep metadata updates into small chained
> transactions.  This is generally good for performance and scalability
> because XFS doesn't need to prepare for enormous transactions, but it
> also means that online fsck must be careful not to attempt a fsck
> action
> unless it can be shown that there are no other threads processing a
> transaction chain.  This part of the design documentation covers the
> thinking behind the consistency model and how scrub deals with it.
> 
> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> ---
>  .../filesystems/xfs-online-fsck-design.rst         |  303
> ++++++++++++++++++++
>  1 file changed, 303 insertions(+)
> 
> 
> diff --git a/Documentation/filesystems/xfs-online-fsck-design.rst
> b/Documentation/filesystems/xfs-online-fsck-design.rst
> index f45bf97fa9c4..419eb54ee200 100644
> --- a/Documentation/filesystems/xfs-online-fsck-design.rst
> +++ b/Documentation/filesystems/xfs-online-fsck-design.rst
> @@ -1443,3 +1443,306 @@ This step is critical for enabling system
> administrator to monitor the status
>  of the filesystem and the progress of any repairs.
>  For developers, it is a useful means to judge the efficacy of error
> detection
>  and correction in the online and offline checking tools.
> +
> +Eventual Consistency vs. Online Fsck
> +------------------------------------
> +
> +Midway through the development of online scrubbing, the fsstress
> tests
> +uncovered a misinteraction between online fsck and compound
> transaction chains
> +created by other writer threads that resulted in false reports of
> metadata
> +inconsistency.
> +The root cause of these reports is the eventual consistency model
> introduced by
> +the expansion of deferred work items and compound transaction chains
> when
> +reverse mapping and reflink were introduced.



> +
> +Originally, transaction chains were added to XFS to avoid deadlocks
> when
> +unmapping space from files.
> +Deadlock avoidance rules require that AGs only be locked in
> increasing order,
> +which makes it impossible (say) to use a single transaction to free
> a space
> +extent in AG 7 and then try to free a now superfluous block mapping
> btree block
> +in AG 3.
> +To avoid these kinds of deadlocks, XFS creates Extent Freeing Intent
> (EFI) log
> +items to commit to freeing some space in one transaction while
> deferring the
> +actual metadata updates to a fresh transaction.
> +The transaction sequence looks like this:
> +
> +1. The first transaction contains a physical update to the file's
> block mapping
> +   structures to remove the mapping from the btree blocks.
> +   It then attaches to the in-memory transaction an action item to
> schedule
> +   deferred freeing of space.
> +   Concretely, each transaction maintains a list of ``struct
> +   xfs_defer_pending`` objects, each of which maintains a list of
> ``struct
> +   xfs_extent_free_item`` objects.
> +   Returning to the example above, the action item tracks the
> freeing of both
> +   the unmapped space from AG 7 and the block mapping btree (BMBT)
> block from
> +   AG 3.
> +   Deferred frees recorded in this manner are committed in the log
> by creating
> +   an EFI log item from the ``struct xfs_extent_free_item`` object
> and
> +   attaching the log item to the transaction.
> +   When the log is persisted to disk, the EFI item is written into
> the ondisk
> +   transaction record.
> +   EFIs can list up to 16 extents to free, all sorted in AG order.
> +
> +2. The second transaction contains a physical update to the free
> space btrees
> +   of AG 3 to release the former BMBT block and a second physical
> update to the
> +   free space btrees of AG 7 to release the unmapped file space.
> +   Observe that the the physical updates are resequenced in the
> correct order
> +   when possible.
> +   Attached to the transaction is a an extent free done (EFD) log
> item.
> +   The EFD contains a pointer to the EFI logged in transaction #1 so
> that log
> +   recovery can tell if the EFI needs to be replayed.
> +
> +If the system goes down after transaction #1 is written back to the
> filesystem
> +but before #2 is committed, a scan of the filesystem metadata would
> show
> +inconsistent filesystem metadata because there would not appear to
> be any owner
> +of the unmapped space.
> +Happily, log recovery corrects this inconsistency for us -- when
> recovery finds
> +an intent log item but does not find a corresponding intent done
> item, it will
> +reconstruct the incore state of the intent item and finish it.
> +In the example above, the log must replay both frees described in
> the recovered
> +EFI to complete the recovery phase.
> +
> +There are two subtleties to XFS' transaction chaining strategy to
> consider.
> +The first is that log items must be added to a transaction in the
> correct order
> +to prevent conflicts with principal objects that are not held by the
> +transaction.
> +In other words, all per-AG metadata updates for an unmapped block
> must be
> +completed before the last update to free the extent, and extents
> should not
> +be reallocated until that last update commits to the log.
> +The second subtlety comes from the fact that AG header buffers are
> (usually)
> +released between each transaction in a chain.
> +This means that other threads can observe an AG in an intermediate
> state,
> +but as long as the first subtlety is handled, this should not affect
> the
> +correctness of filesystem operations.
> +Unmounting the filesystem flushes all pending work to disk, which
> means that
> +offline fsck never sees the temporary inconsistencies caused by
> deferred work
> +item processing.
> +In this manner, XFS employs a form of eventual consistency to avoid
> deadlocks
> +and increase parallelism.
> +
> +During the design phase of the reverse mapping and reflink features,
> it was
> +decided that it was impractical to cram all the reverse mapping
> updates for a
> +single filesystem change into a single transaction because a single
> file
> +mapping operation can explode into many small updates:
> +
> +* The block mapping update itself
> +* A reverse mapping update for the block mapping update
> +* Fixing the freelist
> +* A reverse mapping update for the freelist fix
> +
> +* A shape change to the block mapping btree
> +* A reverse mapping update for the btree update
> +* Fixing the freelist (again)
> +* A reverse mapping update for the freelist fix
> +
> +* An update to the reference counting information
> +* A reverse mapping update for the refcount update
> +* Fixing the freelist (a third time)
> +* A reverse mapping update for the freelist fix
> +
> +* Freeing any space that was unmapped and not owned by any other
> file
> +* Fixing the freelist (a fourth time)
> +* A reverse mapping update for the freelist fix
> +
> +* Freeing the space used by the block mapping btree
> +* Fixing the freelist (a fifth time)
> +* A reverse mapping update for the freelist fix
> +
> +Free list fixups are not usually needed more than once per AG per
> transaction
> +chain, but it is theoretically possible if space is very tight.
> +For copy-on-write updates this is even worse, because this must be
> done once to
> +remove the space from a staging area and again to map it into the
> file!
> +
> +To deal with this explosion in a calm manner, XFS expands its use of
> deferred
> +work items to cover most reverse mapping updates and all refcount
> updates.
> +This reduces the worst case size of transaction reservations by
> breaking the
> +work into a long chain of small updates, which increases the degree
> of eventual
> +consistency in the system.
> +Again, this generally isn't a problem because XFS orders its
> deferred work
> +items carefully to avoid resource reuse conflicts between
> unsuspecting threads.
> +
> +However, online fsck changes the rules -- remember that although
> physical
> +updates to per-AG structures are coordinated by locking the buffers
> for AG
> +headers, buffer locks are dropped between transactions.
> +Once scrub acquires resources and takes locks for a data structure,
> it must do
> +all the validation work without releasing the lock.
> +If the main lock for a space btree is an AG header buffer lock,
> scrub may have
> +interrupted another thread that is midway through finishing a chain.
> +For example, if a thread performing a copy-on-write has completed a
> reverse
> +mapping update but not the corresponding refcount update, the two AG
> btrees
> +will appear inconsistent to scrub and an observation of corruption
> will be
> +recorded.  This observation will not be correct.
> +If a repair is attempted in this state, the results will be
> catastrophic!
> +
> +Several solutions to this problem were evaluated upon discovery of
> this flaw:


Hmm, so while having a really in depth efi example is insightful, I
wonder if it would be more oranized to put it in a separate document
somewhere and just reference it.  As far as ofsck is concerned, I think
a lighter sumary would do:


"Complex operations that modify multiple AGs are performed through a
series of transactions which are logged to a journal that an offline
fsck can either replay or discard.  Online fsck however, must be able
to deal with these operations while they are still in progress.  This
presents a unique challenge for ofsck since a partially completed
transaction chain may present the appearance of inconsistencies, even
though the operations are functioning as intended. (For a more detailed
example, see <cite document here...>)  

The challenge then becomes how to avoid incorrectly repairing these
non-issues as doing so would cause more harm than help."

> +
> +1. Add a higher level lock to allocation groups and require writer
> threads to
> +   acquire the higher level lock in AG order before making any
> changes.
> +   This would be very difficult to implement in practice because it
> is
> +   difficult to determine which locks need to be obtained, and in
> what order,
> +   without simulating the entire operation.
> +   Performing a dry run of a file operation to discover necessary
> locks would
> +   make the filesystem very slow.
> +
> +2. Make the deferred work coordinator code aware of consecutive
> intent items
> +   targeting the same AG and have it hold the AG header buffers
> locked across
> +   the transaction roll between updates.
> +   This would introduce a lot of complexity into the coordinator
> since it is
> +   only loosely coupled with the actual deferred work items.
> +   It would also fail to solve the problem because deferred work
> items can
> +   generate new deferred subtasks, but all subtasks must be complete
> before
> +   work can start on a new sibling task.
Hmm, that one doesnt seem like it's really an option then :-(

> +
> +3. Teach online fsck to walk all transactions waiting for whichever
> lock(s)
> +   protect the data structure being scrubbed to look for pending
> operations.
> +   The checking and repair operations must factor these pending
> operations into
> +   the evaluations being performed.
> +   This solution is a nonstarter because it is *extremely* invasive
> to the main
> +   filesystem.
> +
> +4. Recognize that only online fsck has this requirement of total
> consistency
> +   of AG metadata, and that online fsck should be relatively rare as
> compared
> +   to filesystem change operations.
> +   For each AG, maintain a count of intent items targetting that AG.
> +   When online fsck wants to examine an AG, it should lock the AG
> header
> +   buffers to quiesce all transaction chains that want to modify
> that AG, and
> +   only proceed with the scrub if the count is zero.
> +   In other words, scrub only proceeds if it can lock the AG header
> buffers and
> +   there can't possibly be any intents in progress.
> +   This may lead to fairness and starvation issues, but regular
> filesystem
> +   updates take precedence over online fsck activity.
So basically it sounds like 4 is the only reasonable option?  If the
discussion concerning the other options have died down, I would clean
them out.  They're great for brain storming and invitations for
collaboration, but ideally the goal of any of that should be to narrow
down an agreed upon plan of action.  And the goal of your document
should make clear what that plan is.  So if no one has any objections
by now, maybe just tie it right into the last line:

"The challenge then becomes how to avoid incorrectly repairing these
non-issues as doing so would cause more harm than help. 
Fortunately only online fsck has this requirement of total
consistency..."

> +
> +Intent Drains
> +`````````````
> +
> +The fourth solution is implemented in the current iteration of
This solution is implemented...

> online fsck,
> +with atomic_t providing the active intent counter.
> +
> +There are two key properties to the drain mechanism.
> +First, the counter is incremented when a deferred work item is
> *queued* to a
> +transaction, and it is decremented after the associated intent done
> log item is
> +*committed* to another transaction.
> +The second property is that deferred work can be added to a
> transaction without
> +holding an AG header lock, but per-AG work items cannot be marked
> done without
> +locking that AG header buffer to log the physical updates and the
> intent done
> +log item.
> +The first property enables scrub to yield to running transaction
> chains, which
> +is an explicit deprioritization of online fsck to benefit file
> operations.
> +The second property of the drain is key to the correct coordination
> of scrub,
> +since scrub will always be able to decide if a conflict is possible.
> +
> +For regular filesystem code, the drain works as follows:
> +
> +1. Call the appropriate subsystem function to add a deferred work
> item to a
> +   transaction.
> +
> +2. The function calls ``xfs_drain_bump`` to increase the counter.
> +
> +3. When the deferred item manager wants to finish the deferred work
> item, it
> +   calls ``->finish_item`` to complete it.
> +
> +4. The ``->finish_item`` implementation logs some changes and calls
> +   ``xfs_drain_drop`` to decrease the sloppy counter and wake up any
> threads
> +   waiting on the drain.
> +
> +5. The subtransaction commits, which unlocks the resource associated
> with the
> +   intent item.
> +
> +For scrub, the drain works as follows:
> +
> +1. Lock the resource(s) associated with the metadata being scrubbed.
> +   For example, a scan of the refcount btree would lock the AGI and
> AGF header
> +   buffers.
> +
> +2. If the counter is zero (``xfs_drain_busy`` returns false), there
> are no
> +   chains in progress and the operation may proceed.
> +
> +3. Otherwise, release the resources grabbed in step 1.
> +
> +4. Wait for the intent counter to reach zero
> (``xfs_drain_intents``), then go
> +   back to step 1 unless a signal has been caught.
> +
> +To avoid polling in step 4, the drain provides a waitqueue for scrub
> threads to
> +be woken up whenever the intent count drops to zero.
I think all that makes sense

> +
> +The proposed patchset is the
> +`scrub intent drain series
> +<
> https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/
> log/?h=scrub-drain-intents>`_.
> +
> +.. _jump_labels:
> +
> +Static Keys (aka Jump Label Patching)
> +`````````````````````````````````````
> +
> +Online fsck for XFS separates the regular filesystem from the
> checking and
> +repair code as much as possible.
> +However, there are a few parts of online fsck (such as the intent
> drains, and
> +later, live update hooks) where it is useful for the online fsck
> code to know
> +what's going on in the rest of the filesystem.
> +Since it is not expected that online fsck will be constantly running
> in the
> +background, it is very important to minimize the runtime overhead
> imposed by
> +these hooks when online fsck is compiled into the kernel but not
> actively
> +running on behalf of userspace.
> +Taking locks in the hot path of a writer thread to access a data
> structure only
> +to find that no further action is necessary is expensive -- on the
> author's
> +computer, this have an overhead of 40-50ns per access.
> +Fortunately, the kernel supports dynamic code patching, which
> enables XFS to
> +replace a static branch to hook code with ``nop`` sleds when online
> fsck isn't
> +running.
> +This sled has an overhead of however long it takes the instruction
> decoder to
> +skip past the sled, which seems to be on the order of less than 1ns
> and
> +does not access memory outside of instruction fetching.
> +
> +When online fsck enables the static key, the sled is replaced with
> an
> +unconditional branch to call the hook code.
> +The switchover is quite expensive (~22000ns) but is paid entirely by
> the
> +program that invoked online fsck, and can be amortized if multiple
> threads
> +enter online fsck at the same time, or if multiple filesystems are
> being
> +checked at the same time.
> +Changing the branch direction requires taking the CPU hotplug lock,
> and since
> +CPU initialization requires memory allocation, online fsck must be
> careful not
> +to change a static key while holding any locks or resources that
> could be
> +accessed in the memory reclaim paths.
> +To minimize contention on the CPU hotplug lock, care should be taken
> not to
> +enable or disable static keys unnecessarily.
> +
> +Because static keys are intended to minimize hook overhead for
> regular
> +filesystem operations when xfs_scrub is not running, the intended
> usage
> +patterns are as follows:
> +
> +- The hooked part of XFS should declare a static-scoped static key
> that
> +  defaults to false.
> +  The ``DEFINE_STATIC_KEY_FALSE`` macro takes care of this.
> +  The static key itself should be declared as a ``static`` variable.
> +
> +- When deciding to invoke code that's only used by scrub, the
> regular
> +  filesystem should call the ``static_branch_unlikely`` predicate to
> avoid the
> +  scrub-only hook code if the static key is not enabled.
> +
> +- The regular filesystem should export helper functions that call
> +  ``static_branch_inc`` to enable and ``static_branch_dec`` to
> disable the
> +  static key.
> +  Wrapper functions make it easy to compile out the relevant code if
> the kernel
> +  distributor turns off online fsck at build time.
> +
> +- Scrub functions wanting to turn on scrub-only XFS functionality
> should call
> +  the ``xchk_fshooks_enable`` from the setup function to enable a
> specific
> +  hook.
> +  This must be done before obtaining any resources that are used by
> memory
> +  reclaim.
> +  Callers had better be sure they really need the functionality
> gated by the
> +  static key; the ``TRY_HARDER`` flag is useful here.
> +
> +Online scrub has resource acquisition helpers (e.g.
> ``xchk_perag_lock``) to
> +handle locking AGI and AGF buffers for all scrubber functions.
> +If it detects a conflict between scrub and the running transactions,
> it will
> +try to wait for intents to complete.
> +If the caller of the helper has not enabled the static key, the
> helper will
> +return -EDEADLOCK, which should result in the scrub being restarted
> with the
> +``TRY_HARDER`` flag set.
> +The scrub setup function should detect that flag, enable the static
> key, and
> +try the scrub again.
> +Scrub teardown disables all static keys obtained by
> ``xchk_fshooks_enable``.

Ok, this part here seems pretty well documented.  Organizing nits aside
I think it looks good.

Allison

> +
> +For more information, please see the kernel documentation of
> +Documentation/staging/static-keys.rst.
>
Darrick J. Wong Feb. 2, 2023, 7:55 p.m. UTC | #6
On Tue, Jan 31, 2023 at 06:11:30AM +0000, Allison Henderson wrote:
> On Fri, 2022-12-30 at 14:10 -0800, Darrick J. Wong wrote:
> > From: Darrick J. Wong <djwong@kernel.org>
> > 
> > Writes to an XFS filesystem employ an eventual consistency update
> > model
> > to break up complex multistep metadata updates into small chained
> > transactions.  This is generally good for performance and scalability
> > because XFS doesn't need to prepare for enormous transactions, but it
> > also means that online fsck must be careful not to attempt a fsck
> > action
> > unless it can be shown that there are no other threads processing a
> > transaction chain.  This part of the design documentation covers the
> > thinking behind the consistency model and how scrub deals with it.
> > 
> > Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> > ---
> >  .../filesystems/xfs-online-fsck-design.rst         |  303
> > ++++++++++++++++++++
> >  1 file changed, 303 insertions(+)
> > 
> > 
> > diff --git a/Documentation/filesystems/xfs-online-fsck-design.rst
> > b/Documentation/filesystems/xfs-online-fsck-design.rst
> > index f45bf97fa9c4..419eb54ee200 100644
> > --- a/Documentation/filesystems/xfs-online-fsck-design.rst
> > +++ b/Documentation/filesystems/xfs-online-fsck-design.rst
> > @@ -1443,3 +1443,306 @@ This step is critical for enabling system
> > administrator to monitor the status
> >  of the filesystem and the progress of any repairs.
> >  For developers, it is a useful means to judge the efficacy of error
> > detection
> >  and correction in the online and offline checking tools.
> > +
> > +Eventual Consistency vs. Online Fsck
> > +------------------------------------
> > +
> > +Midway through the development of online scrubbing, the fsstress
> > tests
> > +uncovered a misinteraction between online fsck and compound
> > transaction chains
> > +created by other writer threads that resulted in false reports of
> > metadata
> > +inconsistency.
> > +The root cause of these reports is the eventual consistency model
> > introduced by
> > +the expansion of deferred work items and compound transaction chains
> > when
> > +reverse mapping and reflink were introduced.
> 
> 
> 

Was there supposed to be a comment here?

> > +
> > +Originally, transaction chains were added to XFS to avoid deadlocks
> > when
> > +unmapping space from files.
> > +Deadlock avoidance rules require that AGs only be locked in
> > increasing order,
> > +which makes it impossible (say) to use a single transaction to free
> > a space
> > +extent in AG 7 and then try to free a now superfluous block mapping
> > btree block
> > +in AG 3.
> > +To avoid these kinds of deadlocks, XFS creates Extent Freeing Intent
> > (EFI) log
> > +items to commit to freeing some space in one transaction while
> > deferring the
> > +actual metadata updates to a fresh transaction.
> > +The transaction sequence looks like this:
> > +
> > +1. The first transaction contains a physical update to the file's
> > block mapping
> > +   structures to remove the mapping from the btree blocks.
> > +   It then attaches to the in-memory transaction an action item to
> > schedule
> > +   deferred freeing of space.
> > +   Concretely, each transaction maintains a list of ``struct
> > +   xfs_defer_pending`` objects, each of which maintains a list of
> > ``struct
> > +   xfs_extent_free_item`` objects.
> > +   Returning to the example above, the action item tracks the
> > freeing of both
> > +   the unmapped space from AG 7 and the block mapping btree (BMBT)
> > block from
> > +   AG 3.
> > +   Deferred frees recorded in this manner are committed in the log
> > by creating
> > +   an EFI log item from the ``struct xfs_extent_free_item`` object
> > and
> > +   attaching the log item to the transaction.
> > +   When the log is persisted to disk, the EFI item is written into
> > the ondisk
> > +   transaction record.
> > +   EFIs can list up to 16 extents to free, all sorted in AG order.
> > +
> > +2. The second transaction contains a physical update to the free
> > space btrees
> > +   of AG 3 to release the former BMBT block and a second physical
> > update to the
> > +   free space btrees of AG 7 to release the unmapped file space.
> > +   Observe that the the physical updates are resequenced in the
> > correct order
> > +   when possible.
> > +   Attached to the transaction is a an extent free done (EFD) log
> > item.
> > +   The EFD contains a pointer to the EFI logged in transaction #1 so
> > that log
> > +   recovery can tell if the EFI needs to be replayed.
> > +
> > +If the system goes down after transaction #1 is written back to the
> > filesystem
> > +but before #2 is committed, a scan of the filesystem metadata would
> > show
> > +inconsistent filesystem metadata because there would not appear to
> > be any owner
> > +of the unmapped space.
> > +Happily, log recovery corrects this inconsistency for us -- when
> > recovery finds
> > +an intent log item but does not find a corresponding intent done
> > item, it will
> > +reconstruct the incore state of the intent item and finish it.
> > +In the example above, the log must replay both frees described in
> > the recovered
> > +EFI to complete the recovery phase.
> > +
> > +There are two subtleties to XFS' transaction chaining strategy to
> > consider.
> > +The first is that log items must be added to a transaction in the
> > correct order
> > +to prevent conflicts with principal objects that are not held by the
> > +transaction.
> > +In other words, all per-AG metadata updates for an unmapped block
> > must be
> > +completed before the last update to free the extent, and extents
> > should not
> > +be reallocated until that last update commits to the log.
> > +The second subtlety comes from the fact that AG header buffers are
> > (usually)
> > +released between each transaction in a chain.
> > +This means that other threads can observe an AG in an intermediate
> > state,
> > +but as long as the first subtlety is handled, this should not affect
> > the
> > +correctness of filesystem operations.
> > +Unmounting the filesystem flushes all pending work to disk, which
> > means that
> > +offline fsck never sees the temporary inconsistencies caused by
> > deferred work
> > +item processing.
> > +In this manner, XFS employs a form of eventual consistency to avoid
> > deadlocks
> > +and increase parallelism.
> > +
> > +During the design phase of the reverse mapping and reflink features,
> > it was
> > +decided that it was impractical to cram all the reverse mapping
> > updates for a
> > +single filesystem change into a single transaction because a single
> > file
> > +mapping operation can explode into many small updates:
> > +
> > +* The block mapping update itself
> > +* A reverse mapping update for the block mapping update
> > +* Fixing the freelist
> > +* A reverse mapping update for the freelist fix
> > +
> > +* A shape change to the block mapping btree
> > +* A reverse mapping update for the btree update
> > +* Fixing the freelist (again)
> > +* A reverse mapping update for the freelist fix
> > +
> > +* An update to the reference counting information
> > +* A reverse mapping update for the refcount update
> > +* Fixing the freelist (a third time)
> > +* A reverse mapping update for the freelist fix
> > +
> > +* Freeing any space that was unmapped and not owned by any other
> > file
> > +* Fixing the freelist (a fourth time)
> > +* A reverse mapping update for the freelist fix
> > +
> > +* Freeing the space used by the block mapping btree
> > +* Fixing the freelist (a fifth time)
> > +* A reverse mapping update for the freelist fix
> > +
> > +Free list fixups are not usually needed more than once per AG per
> > transaction
> > +chain, but it is theoretically possible if space is very tight.
> > +For copy-on-write updates this is even worse, because this must be
> > done once to
> > +remove the space from a staging area and again to map it into the
> > file!
> > +
> > +To deal with this explosion in a calm manner, XFS expands its use of
> > deferred
> > +work items to cover most reverse mapping updates and all refcount
> > updates.
> > +This reduces the worst case size of transaction reservations by
> > breaking the
> > +work into a long chain of small updates, which increases the degree
> > of eventual
> > +consistency in the system.
> > +Again, this generally isn't a problem because XFS orders its
> > deferred work
> > +items carefully to avoid resource reuse conflicts between
> > unsuspecting threads.
> > +
> > +However, online fsck changes the rules -- remember that although
> > physical
> > +updates to per-AG structures are coordinated by locking the buffers
> > for AG
> > +headers, buffer locks are dropped between transactions.
> > +Once scrub acquires resources and takes locks for a data structure,
> > it must do
> > +all the validation work without releasing the lock.
> > +If the main lock for a space btree is an AG header buffer lock,
> > scrub may have
> > +interrupted another thread that is midway through finishing a chain.
> > +For example, if a thread performing a copy-on-write has completed a
> > reverse
> > +mapping update but not the corresponding refcount update, the two AG
> > btrees
> > +will appear inconsistent to scrub and an observation of corruption
> > will be
> > +recorded.  This observation will not be correct.
> > +If a repair is attempted in this state, the results will be
> > catastrophic!
> > +
> > +Several solutions to this problem were evaluated upon discovery of
> > this flaw:
> 
> 
> Hmm, so while having a really in depth efi example is insightful, I
> wonder if it would be more oranized to put it in a separate document
> somewhere and just reference it.  As far as ofsck is concerned, I think
> a lighter sumary would do:
> 
> 
> "Complex operations that modify multiple AGs are performed through a
> series of transactions which are logged to a journal that an offline
> fsck can either replay or discard.  Online fsck however, must be able
> to deal with these operations while they are still in progress.  This
> presents a unique challenge for ofsck since a partially completed
> transaction chain may present the appearance of inconsistencies, even
> though the operations are functioning as intended. (For a more detailed
> example, see <cite document here...>)  
> 
> The challenge then becomes how to avoid incorrectly repairing these
> non-issues as doing so would cause more harm than help."

I agree that this topic needs a much shorter introduction before moving
on to the gory details.  How does this strike you?

"Complex operations can make modifications to multiple per-AG data
structures with a chain of transactions.  These chains, once committed
to the log, are restarted during log recovery if the system crashes
while processing the chain.  Because the AG header buffers are unlocked
between transactions within a chain, online checking must coordinate
with chained operations that are in progress to avoid incorrectly
detecting inconsistencies due to pending chains.  Furthermore, online
repair must not run when operations are pending because the metadata are
temporarily inconsistent with each other, and rebuilding is not
possible."

"Only online fsck has this requirement of total consistency of AG
metadata, and should be relatively rare as compared to filesystem change
operations.  Online fsck coordinates with transaction chains as follows:

* "For each AG, maintain a count of intent items targetting that AG.
  The count should be bumped whenever a new item is added to the chain.
  The count should be dropped when the filesystem has locked the AG
  header buffers and finished the work.

* "When online fsck wants to examine an AG, it should lock the AG header
  buffers to quiesce all transaction chains that want to modify that AG.
  If the count is zero, proceed with the checking operation.  If it is
  nonzero, cycle the buffer locks to allow the chain to make forward
  progress.

"This may lead to online fsck taking a long time to complete, but
regular filesystem updates take precedence over background checking
activity.  Details about the discovery of this situation are presented
in the <next section>, and details about the solution are presented
<after that>."

These gory details of how I recognized the problem are a subsection of
the main heading, and anyone who wants to know them can read it.
Readers who'd rather move on to the solution can jump directly to the
"Intent Drains" section.  The <bracketed> text are hyperlinks.

> > +
> > +1. Add a higher level lock to allocation groups and require writer
> > threads to
> > +   acquire the higher level lock in AG order before making any
> > changes.
> > +   This would be very difficult to implement in practice because it
> > is
> > +   difficult to determine which locks need to be obtained, and in
> > what order,
> > +   without simulating the entire operation.
> > +   Performing a dry run of a file operation to discover necessary
> > locks would
> > +   make the filesystem very slow.
> > +
> > +2. Make the deferred work coordinator code aware of consecutive
> > intent items
> > +   targeting the same AG and have it hold the AG header buffers
> > locked across
> > +   the transaction roll between updates.
> > +   This would introduce a lot of complexity into the coordinator
> > since it is
> > +   only loosely coupled with the actual deferred work items.
> > +   It would also fail to solve the problem because deferred work
> > items can
> > +   generate new deferred subtasks, but all subtasks must be complete
> > before
> > +   work can start on a new sibling task.
> Hmm, that one doesnt seem like it's really an option then :-(

Right.  Now that this section has become its own gory details
subsection, the sentence preceeding the numbered list becomes:

"Several other solutions to this problem were evaluated upon discovery
of this flaw and rejected:"

> > +
> > +3. Teach online fsck to walk all transactions waiting for whichever
> > lock(s)
> > +   protect the data structure being scrubbed to look for pending
> > operations.
> > +   The checking and repair operations must factor these pending
> > operations into
> > +   the evaluations being performed.
> > +   This solution is a nonstarter because it is *extremely* invasive
> > to the main
> > +   filesystem.
> > +
> > +4. Recognize that only online fsck has this requirement of total
> > consistency
> > +   of AG metadata, and that online fsck should be relatively rare as
> > compared
> > +   to filesystem change operations.
> > +   For each AG, maintain a count of intent items targetting that AG.
> > +   When online fsck wants to examine an AG, it should lock the AG
> > header
> > +   buffers to quiesce all transaction chains that want to modify
> > that AG, and
> > +   only proceed with the scrub if the count is zero.
> > +   In other words, scrub only proceeds if it can lock the AG header
> > buffers and
> > +   there can't possibly be any intents in progress.
> > +   This may lead to fairness and starvation issues, but regular
> > filesystem
> > +   updates take precedence over online fsck activity.
> So basically it sounds like 4 is the only reasonable option?

Yes.

> If the discussion concerning the other options have died down, I would
> clean them out.

That's just the problem -- I've sent this and the code patches to the
list several times now, and mostly haven't heard any solid replies.  So
it's a bit premature to take it out, and again it might be useful to
capture the roads not taken.

> They're great for brain storming and invitations for
> collaboration, but ideally the goal of any of that should be to narrow
> down an agreed upon plan of action.  And the goal of your document
> should make clear what that plan is.  So if no one has any objections
> by now, maybe just tie it right into the last line:
> 
> "The challenge then becomes how to avoid incorrectly repairing these
> non-issues as doing so would cause more harm than help. 
> Fortunately only online fsck has this requirement of total
> consistency..."

> > +
> > +Intent Drains
> > +`````````````
> > +
> > +The fourth solution is implemented in the current iteration of
> This solution is implemented...

"Online fsck uses an atomic intent item counter and lock cycling to
coordinate with transaction chains.  There are two key properties to the
drain mechanism..."

> > online fsck,
> > +with atomic_t providing the active intent counter.
> > +
> > +There are two key properties to the drain mechanism.
> > +First, the counter is incremented when a deferred work item is
> > *queued* to a
> > +transaction, and it is decremented after the associated intent done
> > log item is
> > +*committed* to another transaction.
> > +The second property is that deferred work can be added to a
> > transaction without
> > +holding an AG header lock, but per-AG work items cannot be marked
> > done without
> > +locking that AG header buffer to log the physical updates and the
> > intent done
> > +log item.
> > +The first property enables scrub to yield to running transaction
> > chains, which
> > +is an explicit deprioritization of online fsck to benefit file
> > operations.
> > +The second property of the drain is key to the correct coordination
> > of scrub,
> > +since scrub will always be able to decide if a conflict is possible.
> > +
> > +For regular filesystem code, the drain works as follows:
> > +
> > +1. Call the appropriate subsystem function to add a deferred work
> > item to a
> > +   transaction.
> > +
> > +2. The function calls ``xfs_drain_bump`` to increase the counter.
> > +
> > +3. When the deferred item manager wants to finish the deferred work
> > item, it
> > +   calls ``->finish_item`` to complete it.
> > +
> > +4. The ``->finish_item`` implementation logs some changes and calls
> > +   ``xfs_drain_drop`` to decrease the sloppy counter and wake up any
> > threads
> > +   waiting on the drain.
> > +
> > +5. The subtransaction commits, which unlocks the resource associated
> > with the
> > +   intent item.
> > +
> > +For scrub, the drain works as follows:
> > +
> > +1. Lock the resource(s) associated with the metadata being scrubbed.
> > +   For example, a scan of the refcount btree would lock the AGI and
> > AGF header
> > +   buffers.
> > +
> > +2. If the counter is zero (``xfs_drain_busy`` returns false), there
> > are no
> > +   chains in progress and the operation may proceed.
> > +
> > +3. Otherwise, release the resources grabbed in step 1.
> > +
> > +4. Wait for the intent counter to reach zero
> > (``xfs_drain_intents``), then go
> > +   back to step 1 unless a signal has been caught.
> > +
> > +To avoid polling in step 4, the drain provides a waitqueue for scrub
> > threads to
> > +be woken up whenever the intent count drops to zero.
> I think all that makes sense

Good! :)

> > +
> > +The proposed patchset is the
> > +`scrub intent drain series
> > +<
> > https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/
> > log/?h=scrub-drain-intents>`_.
> > +
> > +.. _jump_labels:
> > +
> > +Static Keys (aka Jump Label Patching)
> > +`````````````````````````````````````
> > +
> > +Online fsck for XFS separates the regular filesystem from the
> > checking and
> > +repair code as much as possible.
> > +However, there are a few parts of online fsck (such as the intent
> > drains, and
> > +later, live update hooks) where it is useful for the online fsck
> > code to know
> > +what's going on in the rest of the filesystem.
> > +Since it is not expected that online fsck will be constantly running
> > in the
> > +background, it is very important to minimize the runtime overhead
> > imposed by
> > +these hooks when online fsck is compiled into the kernel but not
> > actively
> > +running on behalf of userspace.
> > +Taking locks in the hot path of a writer thread to access a data
> > structure only
> > +to find that no further action is necessary is expensive -- on the
> > author's
> > +computer, this have an overhead of 40-50ns per access.
> > +Fortunately, the kernel supports dynamic code patching, which
> > enables XFS to
> > +replace a static branch to hook code with ``nop`` sleds when online
> > fsck isn't
> > +running.
> > +This sled has an overhead of however long it takes the instruction
> > decoder to
> > +skip past the sled, which seems to be on the order of less than 1ns
> > and
> > +does not access memory outside of instruction fetching.
> > +
> > +When online fsck enables the static key, the sled is replaced with
> > an
> > +unconditional branch to call the hook code.
> > +The switchover is quite expensive (~22000ns) but is paid entirely by
> > the
> > +program that invoked online fsck, and can be amortized if multiple
> > threads
> > +enter online fsck at the same time, or if multiple filesystems are
> > being
> > +checked at the same time.
> > +Changing the branch direction requires taking the CPU hotplug lock,
> > and since
> > +CPU initialization requires memory allocation, online fsck must be
> > careful not
> > +to change a static key while holding any locks or resources that
> > could be
> > +accessed in the memory reclaim paths.
> > +To minimize contention on the CPU hotplug lock, care should be taken
> > not to
> > +enable or disable static keys unnecessarily.
> > +
> > +Because static keys are intended to minimize hook overhead for
> > regular
> > +filesystem operations when xfs_scrub is not running, the intended
> > usage
> > +patterns are as follows:
> > +
> > +- The hooked part of XFS should declare a static-scoped static key
> > that
> > +  defaults to false.
> > +  The ``DEFINE_STATIC_KEY_FALSE`` macro takes care of this.
> > +  The static key itself should be declared as a ``static`` variable.
> > +
> > +- When deciding to invoke code that's only used by scrub, the
> > regular
> > +  filesystem should call the ``static_branch_unlikely`` predicate to
> > avoid the
> > +  scrub-only hook code if the static key is not enabled.
> > +
> > +- The regular filesystem should export helper functions that call
> > +  ``static_branch_inc`` to enable and ``static_branch_dec`` to
> > disable the
> > +  static key.
> > +  Wrapper functions make it easy to compile out the relevant code if
> > the kernel
> > +  distributor turns off online fsck at build time.
> > +
> > +- Scrub functions wanting to turn on scrub-only XFS functionality
> > should call
> > +  the ``xchk_fshooks_enable`` from the setup function to enable a
> > specific
> > +  hook.
> > +  This must be done before obtaining any resources that are used by
> > memory
> > +  reclaim.
> > +  Callers had better be sure they really need the functionality
> > gated by the
> > +  static key; the ``TRY_HARDER`` flag is useful here.
> > +
> > +Online scrub has resource acquisition helpers (e.g.
> > ``xchk_perag_lock``) to
> > +handle locking AGI and AGF buffers for all scrubber functions.
> > +If it detects a conflict between scrub and the running transactions,
> > it will
> > +try to wait for intents to complete.
> > +If the caller of the helper has not enabled the static key, the
> > helper will
> > +return -EDEADLOCK, which should result in the scrub being restarted
> > with the
> > +``TRY_HARDER`` flag set.
> > +The scrub setup function should detect that flag, enable the static
> > key, and
> > +try the scrub again.
> > +Scrub teardown disables all static keys obtained by
> > ``xchk_fshooks_enable``.
> 
> Ok, this part here seems pretty well documented.  Organizing nits aside
> I think it looks good.

Thanks for digging into all of this!

--D

> Allison
> 
> > +
> > +For more information, please see the kernel documentation of
> > +Documentation/staging/static-keys.rst.
> > 
>
Allison Henderson Feb. 9, 2023, 5:41 a.m. UTC | #7
On Thu, 2023-02-02 at 11:55 -0800, Darrick J. Wong wrote:
> On Tue, Jan 31, 2023 at 06:11:30AM +0000, Allison Henderson wrote:
> > On Fri, 2022-12-30 at 14:10 -0800, Darrick J. Wong wrote:
> > > From: Darrick J. Wong <djwong@kernel.org>
> > > 
> > > Writes to an XFS filesystem employ an eventual consistency update
> > > model
> > > to break up complex multistep metadata updates into small chained
> > > transactions.  This is generally good for performance and
> > > scalability
> > > because XFS doesn't need to prepare for enormous transactions,
> > > but it
> > > also means that online fsck must be careful not to attempt a fsck
> > > action
> > > unless it can be shown that there are no other threads processing
> > > a
> > > transaction chain.  This part of the design documentation covers
> > > the
> > > thinking behind the consistency model and how scrub deals with
> > > it.
> > > 
> > > Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> > > ---
> > >  .../filesystems/xfs-online-fsck-design.rst         |  303
> > > ++++++++++++++++++++
> > >  1 file changed, 303 insertions(+)
> > > 
> > > 
> > > diff --git a/Documentation/filesystems/xfs-online-fsck-design.rst
> > > b/Documentation/filesystems/xfs-online-fsck-design.rst
> > > index f45bf97fa9c4..419eb54ee200 100644
> > > --- a/Documentation/filesystems/xfs-online-fsck-design.rst
> > > +++ b/Documentation/filesystems/xfs-online-fsck-design.rst
> > > @@ -1443,3 +1443,306 @@ This step is critical for enabling system
> > > administrator to monitor the status
> > >  of the filesystem and the progress of any repairs.
> > >  For developers, it is a useful means to judge the efficacy of
> > > error
> > > detection
> > >  and correction in the online and offline checking tools.
> > > +
> > > +Eventual Consistency vs. Online Fsck
> > > +------------------------------------
> > > +
> > > +Midway through the development of online scrubbing, the fsstress
> > > tests
> > > +uncovered a misinteraction between online fsck and compound
> > > transaction chains
> > > +created by other writer threads that resulted in false reports
> > > of
> > > metadata
> > > +inconsistency.
> > > +The root cause of these reports is the eventual consistency
> > > model
> > > introduced by
> > > +the expansion of deferred work items and compound transaction
> > > chains
> > > when
> > > +reverse mapping and reflink were introduced.
> > 
> > 
> > 
> 
> Was there supposed to be a comment here?
No, sometimes I'll fiddle with paraphrasing, but if it's not enough of
an improvement, I'll scrap it, but I think evolution leaves the white
space

> 
> > > +
> > > +Originally, transaction chains were added to XFS to avoid
> > > deadlocks
> > > when
> > > +unmapping space from files.
> > > +Deadlock avoidance rules require that AGs only be locked in
> > > increasing order,
> > > +which makes it impossible (say) to use a single transaction to
> > > free
> > > a space
> > > +extent in AG 7 and then try to free a now superfluous block
> > > mapping
> > > btree block
> > > +in AG 3.
> > > +To avoid these kinds of deadlocks, XFS creates Extent Freeing
> > > Intent
> > > (EFI) log
> > > +items to commit to freeing some space in one transaction while
> > > deferring the
> > > +actual metadata updates to a fresh transaction.
> > > +The transaction sequence looks like this:
> > > +
> > > +1. The first transaction contains a physical update to the
> > > file's
> > > block mapping
> > > +   structures to remove the mapping from the btree blocks.
> > > +   It then attaches to the in-memory transaction an action item
> > > to
> > > schedule
> > > +   deferred freeing of space.
> > > +   Concretely, each transaction maintains a list of ``struct
> > > +   xfs_defer_pending`` objects, each of which maintains a list
> > > of
> > > ``struct
> > > +   xfs_extent_free_item`` objects.
> > > +   Returning to the example above, the action item tracks the
> > > freeing of both
> > > +   the unmapped space from AG 7 and the block mapping btree
> > > (BMBT)
> > > block from
> > > +   AG 3.
> > > +   Deferred frees recorded in this manner are committed in the
> > > log
> > > by creating
> > > +   an EFI log item from the ``struct xfs_extent_free_item``
> > > object
> > > and
> > > +   attaching the log item to the transaction.
> > > +   When the log is persisted to disk, the EFI item is written
> > > into
> > > the ondisk
> > > +   transaction record.
> > > +   EFIs can list up to 16 extents to free, all sorted in AG
> > > order.
> > > +
> > > +2. The second transaction contains a physical update to the free
> > > space btrees
> > > +   of AG 3 to release the former BMBT block and a second
> > > physical
> > > update to the
> > > +   free space btrees of AG 7 to release the unmapped file space.
> > > +   Observe that the the physical updates are resequenced in the
> > > correct order
> > > +   when possible.
> > > +   Attached to the transaction is a an extent free done (EFD)
> > > log
> > > item.
> > > +   The EFD contains a pointer to the EFI logged in transaction
> > > #1 so
> > > that log
> > > +   recovery can tell if the EFI needs to be replayed.
> > > +
> > > +If the system goes down after transaction #1 is written back to
> > > the
> > > filesystem
> > > +but before #2 is committed, a scan of the filesystem metadata
> > > would
> > > show
> > > +inconsistent filesystem metadata because there would not appear
> > > to
> > > be any owner
> > > +of the unmapped space.
> > > +Happily, log recovery corrects this inconsistency for us -- when
> > > recovery finds
> > > +an intent log item but does not find a corresponding intent done
> > > item, it will
> > > +reconstruct the incore state of the intent item and finish it.
> > > +In the example above, the log must replay both frees described
> > > in
> > > the recovered
> > > +EFI to complete the recovery phase.
> > > +
> > > +There are two subtleties to XFS' transaction chaining strategy
> > > to
> > > consider.
> > > +The first is that log items must be added to a transaction in
> > > the
> > > correct order
> > > +to prevent conflicts with principal objects that are not held by
> > > the
> > > +transaction.
> > > +In other words, all per-AG metadata updates for an unmapped
> > > block
> > > must be
> > > +completed before the last update to free the extent, and extents
> > > should not
> > > +be reallocated until that last update commits to the log.
> > > +The second subtlety comes from the fact that AG header buffers
> > > are
> > > (usually)
> > > +released between each transaction in a chain.
> > > +This means that other threads can observe an AG in an
> > > intermediate
> > > state,
> > > +but as long as the first subtlety is handled, this should not
> > > affect
> > > the
> > > +correctness of filesystem operations.
> > > +Unmounting the filesystem flushes all pending work to disk,
> > > which
> > > means that
> > > +offline fsck never sees the temporary inconsistencies caused by
> > > deferred work
> > > +item processing.
> > > +In this manner, XFS employs a form of eventual consistency to
> > > avoid
> > > deadlocks
> > > +and increase parallelism.
> > > +
> > > +During the design phase of the reverse mapping and reflink
> > > features,
> > > it was
> > > +decided that it was impractical to cram all the reverse mapping
> > > updates for a
> > > +single filesystem change into a single transaction because a
> > > single
> > > file
> > > +mapping operation can explode into many small updates:
> > > +
> > > +* The block mapping update itself
> > > +* A reverse mapping update for the block mapping update
> > > +* Fixing the freelist
> > > +* A reverse mapping update for the freelist fix
> > > +
> > > +* A shape change to the block mapping btree
> > > +* A reverse mapping update for the btree update
> > > +* Fixing the freelist (again)
> > > +* A reverse mapping update for the freelist fix
> > > +
> > > +* An update to the reference counting information
> > > +* A reverse mapping update for the refcount update
> > > +* Fixing the freelist (a third time)
> > > +* A reverse mapping update for the freelist fix
> > > +
> > > +* Freeing any space that was unmapped and not owned by any other
> > > file
> > > +* Fixing the freelist (a fourth time)
> > > +* A reverse mapping update for the freelist fix
> > > +
> > > +* Freeing the space used by the block mapping btree
> > > +* Fixing the freelist (a fifth time)
> > > +* A reverse mapping update for the freelist fix
> > > +
> > > +Free list fixups are not usually needed more than once per AG
> > > per
> > > transaction
> > > +chain, but it is theoretically possible if space is very tight.
> > > +For copy-on-write updates this is even worse, because this must
> > > be
> > > done once to
> > > +remove the space from a staging area and again to map it into
> > > the
> > > file!
> > > +
> > > +To deal with this explosion in a calm manner, XFS expands its
> > > use of
> > > deferred
> > > +work items to cover most reverse mapping updates and all
> > > refcount
> > > updates.
> > > +This reduces the worst case size of transaction reservations by
> > > breaking the
> > > +work into a long chain of small updates, which increases the
> > > degree
> > > of eventual
> > > +consistency in the system.
> > > +Again, this generally isn't a problem because XFS orders its
> > > deferred work
> > > +items carefully to avoid resource reuse conflicts between
> > > unsuspecting threads.
> > > +
> > > +However, online fsck changes the rules -- remember that although
> > > physical
> > > +updates to per-AG structures are coordinated by locking the
> > > buffers
> > > for AG
> > > +headers, buffer locks are dropped between transactions.
> > > +Once scrub acquires resources and takes locks for a data
> > > structure,
> > > it must do
> > > +all the validation work without releasing the lock.
> > > +If the main lock for a space btree is an AG header buffer lock,
> > > scrub may have
> > > +interrupted another thread that is midway through finishing a
> > > chain.
> > > +For example, if a thread performing a copy-on-write has
> > > completed a
> > > reverse
> > > +mapping update but not the corresponding refcount update, the
> > > two AG
> > > btrees
> > > +will appear inconsistent to scrub and an observation of
> > > corruption
> > > will be
> > > +recorded.  This observation will not be correct.
> > > +If a repair is attempted in this state, the results will be
> > > catastrophic!
> > > +
> > > +Several solutions to this problem were evaluated upon discovery
> > > of
> > > this flaw:
> > 
> > 
> > Hmm, so while having a really in depth efi example is insightful, I
> > wonder if it would be more oranized to put it in a separate
> > document
> > somewhere and just reference it.  As far as ofsck is concerned, I
> > think
> > a lighter sumary would do:
> > 
> > 
> > "Complex operations that modify multiple AGs are performed through
> > a
> > series of transactions which are logged to a journal that an
> > offline
> > fsck can either replay or discard.  Online fsck however, must be
> > able
> > to deal with these operations while they are still in progress. 
> > This
> > presents a unique challenge for ofsck since a partially completed
> > transaction chain may present the appearance of inconsistencies,
> > even
> > though the operations are functioning as intended. (For a more
> > detailed
> > example, see <cite document here...>)  
> > 
> > The challenge then becomes how to avoid incorrectly repairing these
> > non-issues as doing so would cause more harm than help."
> 
> I agree that this topic needs a much shorter introduction before
> moving
> on to the gory details.  How does this strike you?
> 
> "Complex operations can make modifications to multiple per-AG data
> structures with a chain of transactions.  These chains, once
> committed
> to the log, are restarted during log recovery if the system crashes
> while processing the chain.  Because the AG header buffers are
> unlocked
> between transactions within a chain, online checking must coordinate
> with chained operations that are in progress to avoid incorrectly
> detecting inconsistencies due to pending chains.  Furthermore, online
> repair must not run when operations are pending because the metadata
> are
> temporarily inconsistent with each other, and rebuilding is not
> possible."
> 
> "Only online fsck has this requirement of total consistency of AG
> metadata, and should be relatively rare as compared to filesystem
> change
> operations.  Online fsck coordinates with transaction chains as
> follows:
> 
> * "For each AG, maintain a count of intent items targetting that AG.
>   The count should be bumped whenever a new item is added to the
> chain.
>   The count should be dropped when the filesystem has locked the AG
>   header buffers and finished the work.
> 
> * "When online fsck wants to examine an AG, it should lock the AG
> header
>   buffers to quiesce all transaction chains that want to modify that
> AG.
>   If the count is zero, proceed with the checking operation.  If it
> is
>   nonzero, cycle the buffer locks to allow the chain to make forward
>   progress.
> 
> "This may lead to online fsck taking a long time to complete, but
> regular filesystem updates take precedence over background checking
> activity.  Details about the discovery of this situation are
> presented
> in the <next section>, and details about the solution are presented
> <after that>."
> 
> These gory details of how I recognized the problem are a subsection
> of
> the main heading, and anyone who wants to know them can read it.
> Readers who'd rather move on to the solution can jump directly to the
> "Intent Drains" section.  The <bracketed> text are hyperlinks.
Ok, I think that works.  Much lighter, and more to the point for ofsck
> 
> > > +
> > > +1. Add a higher level lock to allocation groups and require
> > > writer
> > > threads to
> > > +   acquire the higher level lock in AG order before making any
> > > changes.
> > > +   This would be very difficult to implement in practice because
> > > it
> > > is
> > > +   difficult to determine which locks need to be obtained, and
> > > in
> > > what order,
> > > +   without simulating the entire operation.
> > > +   Performing a dry run of a file operation to discover
> > > necessary
> > > locks would
> > > +   make the filesystem very slow.
> > > +
> > > +2. Make the deferred work coordinator code aware of consecutive
> > > intent items
> > > +   targeting the same AG and have it hold the AG header buffers
> > > locked across
> > > +   the transaction roll between updates.
> > > +   This would introduce a lot of complexity into the coordinator
> > > since it is
> > > +   only loosely coupled with the actual deferred work items.
> > > +   It would also fail to solve the problem because deferred work
> > > items can
> > > +   generate new deferred subtasks, but all subtasks must be
> > > complete
> > > before
> > > +   work can start on a new sibling task.
> > Hmm, that one doesnt seem like it's really an option then :-(
> 
> Right.  Now that this section has become its own gory details
> subsection, the sentence preceeding the numbered list becomes:
> 
> "Several other solutions to this problem were evaluated upon
> discovery
> of this flaw and rejected:"
Ok

> 
> > > +
> > > +3. Teach online fsck to walk all transactions waiting for
> > > whichever
> > > lock(s)
> > > +   protect the data structure being scrubbed to look for pending
> > > operations.
> > > +   The checking and repair operations must factor these pending
> > > operations into
> > > +   the evaluations being performed.
> > > +   This solution is a nonstarter because it is *extremely*
> > > invasive
> > > to the main
> > > +   filesystem.
> > > +
> > > +4. Recognize that only online fsck has this requirement of total
> > > consistency
> > > +   of AG metadata, and that online fsck should be relatively
> > > rare as
> > > compared
> > > +   to filesystem change operations.
> > > +   For each AG, maintain a count of intent items targetting that
> > > AG.
> > > +   When online fsck wants to examine an AG, it should lock the
> > > AG
> > > header
> > > +   buffers to quiesce all transaction chains that want to modify
> > > that AG, and
> > > +   only proceed with the scrub if the count is zero.
> > > +   In other words, scrub only proceeds if it can lock the AG
> > > header
> > > buffers and
> > > +   there can't possibly be any intents in progress.
> > > +   This may lead to fairness and starvation issues, but regular
> > > filesystem
> > > +   updates take precedence over online fsck activity.
> > So basically it sounds like 4 is the only reasonable option?
> 
> Yes.
> 
> > If the discussion concerning the other options have died down, I
> > would
> > clean them out.
> 
> That's just the problem -- I've sent this and the code patches to the
> list several times now, and mostly haven't heard any solid replies. 
> So
> it's a bit premature to take it out, and again it might be useful to
> capture the roads not taken.
> 
> > They're great for brain storming and invitations for
> > collaboration, but ideally the goal of any of that should be to
> > narrow
> > down an agreed upon plan of action.  And the goal of your document
> > should make clear what that plan is.  So if no one has any
> > objections
> > by now, maybe just tie it right into the last line:
> > 
> > "The challenge then becomes how to avoid incorrectly repairing
> > these
> > non-issues as doing so would cause more harm than help. 
> > Fortunately only online fsck has this requirement of total
> > consistency..."
> 
> > > +
> > > +Intent Drains
> > > +`````````````
> > > +
> > > +The fourth solution is implemented in the current iteration of
> > This solution is implemented...
> 
> "Online fsck uses an atomic intent item counter and lock cycling to
> coordinate with transaction chains.  There are two key properties to
> the
> drain mechanism..."
Ok, sounds fine
> 
> > > online fsck,
> > > +with atomic_t providing the active intent counter.
> > > +
> > > +There are two key properties to the drain mechanism.
> > > +First, the counter is incremented when a deferred work item is
> > > *queued* to a
> > > +transaction, and it is decremented after the associated intent
> > > done
> > > log item is
> > > +*committed* to another transaction.
> > > +The second property is that deferred work can be added to a
> > > transaction without
> > > +holding an AG header lock, but per-AG work items cannot be
> > > marked
> > > done without
> > > +locking that AG header buffer to log the physical updates and
> > > the
> > > intent done
> > > +log item.
> > > +The first property enables scrub to yield to running transaction
> > > chains, which
> > > +is an explicit deprioritization of online fsck to benefit file
> > > operations.
> > > +The second property of the drain is key to the correct
> > > coordination
> > > of scrub,
> > > +since scrub will always be able to decide if a conflict is
> > > possible.
> > > +
> > > +For regular filesystem code, the drain works as follows:
> > > +
> > > +1. Call the appropriate subsystem function to add a deferred
> > > work
> > > item to a
> > > +   transaction.
> > > +
> > > +2. The function calls ``xfs_drain_bump`` to increase the
> > > counter.
> > > +
> > > +3. When the deferred item manager wants to finish the deferred
> > > work
> > > item, it
> > > +   calls ``->finish_item`` to complete it.
> > > +
> > > +4. The ``->finish_item`` implementation logs some changes and
> > > calls
> > > +   ``xfs_drain_drop`` to decrease the sloppy counter and wake up
> > > any
> > > threads
> > > +   waiting on the drain.
> > > +
> > > +5. The subtransaction commits, which unlocks the resource
> > > associated
> > > with the
> > > +   intent item.
> > > +
> > > +For scrub, the drain works as follows:
> > > +
> > > +1. Lock the resource(s) associated with the metadata being
> > > scrubbed.
> > > +   For example, a scan of the refcount btree would lock the AGI
> > > and
> > > AGF header
> > > +   buffers.
> > > +
> > > +2. If the counter is zero (``xfs_drain_busy`` returns false),
> > > there
> > > are no
> > > +   chains in progress and the operation may proceed.
> > > +
> > > +3. Otherwise, release the resources grabbed in step 1.
> > > +
> > > +4. Wait for the intent counter to reach zero
> > > (``xfs_drain_intents``), then go
> > > +   back to step 1 unless a signal has been caught.
> > > +
> > > +To avoid polling in step 4, the drain provides a waitqueue for
> > > scrub
> > > threads to
> > > +be woken up whenever the intent count drops to zero.
> > I think all that makes sense
> 
> Good! :)
> 
> > > +
> > > +The proposed patchset is the
> > > +`scrub intent drain series
> > > +<
> > > https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/
> > > log/?h=scrub-drain-intents>`_.
> > > +
> > > +.. _jump_labels:
> > > +
> > > +Static Keys (aka Jump Label Patching)
> > > +`````````````````````````````````````
> > > +
> > > +Online fsck for XFS separates the regular filesystem from the
> > > checking and
> > > +repair code as much as possible.
> > > +However, there are a few parts of online fsck (such as the
> > > intent
> > > drains, and
> > > +later, live update hooks) where it is useful for the online fsck
> > > code to know
> > > +what's going on in the rest of the filesystem.
> > > +Since it is not expected that online fsck will be constantly
> > > running
> > > in the
> > > +background, it is very important to minimize the runtime
> > > overhead
> > > imposed by
> > > +these hooks when online fsck is compiled into the kernel but not
> > > actively
> > > +running on behalf of userspace.
> > > +Taking locks in the hot path of a writer thread to access a data
> > > structure only
> > > +to find that no further action is necessary is expensive -- on
> > > the
> > > author's
> > > +computer, this have an overhead of 40-50ns per access.
> > > +Fortunately, the kernel supports dynamic code patching, which
> > > enables XFS to
> > > +replace a static branch to hook code with ``nop`` sleds when
> > > online
> > > fsck isn't
> > > +running.
> > > +This sled has an overhead of however long it takes the
> > > instruction
> > > decoder to
> > > +skip past the sled, which seems to be on the order of less than
> > > 1ns
> > > and
> > > +does not access memory outside of instruction fetching.
> > > +
> > > +When online fsck enables the static key, the sled is replaced
> > > with
> > > an
> > > +unconditional branch to call the hook code.
> > > +The switchover is quite expensive (~22000ns) but is paid
> > > entirely by
> > > the
> > > +program that invoked online fsck, and can be amortized if
> > > multiple
> > > threads
> > > +enter online fsck at the same time, or if multiple filesystems
> > > are
> > > being
> > > +checked at the same time.
> > > +Changing the branch direction requires taking the CPU hotplug
> > > lock,
> > > and since
> > > +CPU initialization requires memory allocation, online fsck must
> > > be
> > > careful not
> > > +to change a static key while holding any locks or resources that
> > > could be
> > > +accessed in the memory reclaim paths.
> > > +To minimize contention on the CPU hotplug lock, care should be
> > > taken
> > > not to
> > > +enable or disable static keys unnecessarily.
> > > +
> > > +Because static keys are intended to minimize hook overhead for
> > > regular
> > > +filesystem operations when xfs_scrub is not running, the
> > > intended
> > > usage
> > > +patterns are as follows:
> > > +
> > > +- The hooked part of XFS should declare a static-scoped static
> > > key
> > > that
> > > +  defaults to false.
> > > +  The ``DEFINE_STATIC_KEY_FALSE`` macro takes care of this.
> > > +  The static key itself should be declared as a ``static``
> > > variable.
> > > +
> > > +- When deciding to invoke code that's only used by scrub, the
> > > regular
> > > +  filesystem should call the ``static_branch_unlikely``
> > > predicate to
> > > avoid the
> > > +  scrub-only hook code if the static key is not enabled.
> > > +
> > > +- The regular filesystem should export helper functions that
> > > call
> > > +  ``static_branch_inc`` to enable and ``static_branch_dec`` to
> > > disable the
> > > +  static key.
> > > +  Wrapper functions make it easy to compile out the relevant
> > > code if
> > > the kernel
> > > +  distributor turns off online fsck at build time.
> > > +
> > > +- Scrub functions wanting to turn on scrub-only XFS
> > > functionality
> > > should call
> > > +  the ``xchk_fshooks_enable`` from the setup function to enable
> > > a
> > > specific
> > > +  hook.
> > > +  This must be done before obtaining any resources that are used
> > > by
> > > memory
> > > +  reclaim.
> > > +  Callers had better be sure they really need the functionality
> > > gated by the
> > > +  static key; the ``TRY_HARDER`` flag is useful here.
> > > +
> > > +Online scrub has resource acquisition helpers (e.g.
> > > ``xchk_perag_lock``) to
> > > +handle locking AGI and AGF buffers for all scrubber functions.
> > > +If it detects a conflict between scrub and the running
> > > transactions,
> > > it will
> > > +try to wait for intents to complete.
> > > +If the caller of the helper has not enabled the static key, the
> > > helper will
> > > +return -EDEADLOCK, which should result in the scrub being
> > > restarted
> > > with the
> > > +``TRY_HARDER`` flag set.
> > > +The scrub setup function should detect that flag, enable the
> > > static
> > > key, and
> > > +try the scrub again.
> > > +Scrub teardown disables all static keys obtained by
> > > ``xchk_fshooks_enable``.
> > 
> > Ok, this part here seems pretty well documented.  Organizing nits
> > aside
> > I think it looks good.
> 
> Thanks for digging into all of this!
> 
> --D
> 
> > Allison
> > 
> > > +
> > > +For more information, please see the kernel documentation of
> > > +Documentation/staging/static-keys.rst.
> > > 
> >
diff mbox series

Patch

diff --git a/Documentation/filesystems/xfs-online-fsck-design.rst b/Documentation/filesystems/xfs-online-fsck-design.rst
index f45bf97fa9c4..419eb54ee200 100644
--- a/Documentation/filesystems/xfs-online-fsck-design.rst
+++ b/Documentation/filesystems/xfs-online-fsck-design.rst
@@ -1443,3 +1443,306 @@  This step is critical for enabling system administrator to monitor the status
 of the filesystem and the progress of any repairs.
 For developers, it is a useful means to judge the efficacy of error detection
 and correction in the online and offline checking tools.
+
+Eventual Consistency vs. Online Fsck
+------------------------------------
+
+Midway through the development of online scrubbing, the fsstress tests
+uncovered a misinteraction between online fsck and compound transaction chains
+created by other writer threads that resulted in false reports of metadata
+inconsistency.
+The root cause of these reports is the eventual consistency model introduced by
+the expansion of deferred work items and compound transaction chains when
+reverse mapping and reflink were introduced.
+
+Originally, transaction chains were added to XFS to avoid deadlocks when
+unmapping space from files.
+Deadlock avoidance rules require that AGs only be locked in increasing order,
+which makes it impossible (say) to use a single transaction to free a space
+extent in AG 7 and then try to free a now superfluous block mapping btree block
+in AG 3.
+To avoid these kinds of deadlocks, XFS creates Extent Freeing Intent (EFI) log
+items to commit to freeing some space in one transaction while deferring the
+actual metadata updates to a fresh transaction.
+The transaction sequence looks like this:
+
+1. The first transaction contains a physical update to the file's block mapping
+   structures to remove the mapping from the btree blocks.
+   It then attaches to the in-memory transaction an action item to schedule
+   deferred freeing of space.
+   Concretely, each transaction maintains a list of ``struct
+   xfs_defer_pending`` objects, each of which maintains a list of ``struct
+   xfs_extent_free_item`` objects.
+   Returning to the example above, the action item tracks the freeing of both
+   the unmapped space from AG 7 and the block mapping btree (BMBT) block from
+   AG 3.
+   Deferred frees recorded in this manner are committed in the log by creating
+   an EFI log item from the ``struct xfs_extent_free_item`` object and
+   attaching the log item to the transaction.
+   When the log is persisted to disk, the EFI item is written into the ondisk
+   transaction record.
+   EFIs can list up to 16 extents to free, all sorted in AG order.
+
+2. The second transaction contains a physical update to the free space btrees
+   of AG 3 to release the former BMBT block and a second physical update to the
+   free space btrees of AG 7 to release the unmapped file space.
+   Observe that the the physical updates are resequenced in the correct order
+   when possible.
+   Attached to the transaction is a an extent free done (EFD) log item.
+   The EFD contains a pointer to the EFI logged in transaction #1 so that log
+   recovery can tell if the EFI needs to be replayed.
+
+If the system goes down after transaction #1 is written back to the filesystem
+but before #2 is committed, a scan of the filesystem metadata would show
+inconsistent filesystem metadata because there would not appear to be any owner
+of the unmapped space.
+Happily, log recovery corrects this inconsistency for us -- when recovery finds
+an intent log item but does not find a corresponding intent done item, it will
+reconstruct the incore state of the intent item and finish it.
+In the example above, the log must replay both frees described in the recovered
+EFI to complete the recovery phase.
+
+There are two subtleties to XFS' transaction chaining strategy to consider.
+The first is that log items must be added to a transaction in the correct order
+to prevent conflicts with principal objects that are not held by the
+transaction.
+In other words, all per-AG metadata updates for an unmapped block must be
+completed before the last update to free the extent, and extents should not
+be reallocated until that last update commits to the log.
+The second subtlety comes from the fact that AG header buffers are (usually)
+released between each transaction in a chain.
+This means that other threads can observe an AG in an intermediate state,
+but as long as the first subtlety is handled, this should not affect the
+correctness of filesystem operations.
+Unmounting the filesystem flushes all pending work to disk, which means that
+offline fsck never sees the temporary inconsistencies caused by deferred work
+item processing.
+In this manner, XFS employs a form of eventual consistency to avoid deadlocks
+and increase parallelism.
+
+During the design phase of the reverse mapping and reflink features, it was
+decided that it was impractical to cram all the reverse mapping updates for a
+single filesystem change into a single transaction because a single file
+mapping operation can explode into many small updates:
+
+* The block mapping update itself
+* A reverse mapping update for the block mapping update
+* Fixing the freelist
+* A reverse mapping update for the freelist fix
+
+* A shape change to the block mapping btree
+* A reverse mapping update for the btree update
+* Fixing the freelist (again)
+* A reverse mapping update for the freelist fix
+
+* An update to the reference counting information
+* A reverse mapping update for the refcount update
+* Fixing the freelist (a third time)
+* A reverse mapping update for the freelist fix
+
+* Freeing any space that was unmapped and not owned by any other file
+* Fixing the freelist (a fourth time)
+* A reverse mapping update for the freelist fix
+
+* Freeing the space used by the block mapping btree
+* Fixing the freelist (a fifth time)
+* A reverse mapping update for the freelist fix
+
+Free list fixups are not usually needed more than once per AG per transaction
+chain, but it is theoretically possible if space is very tight.
+For copy-on-write updates this is even worse, because this must be done once to
+remove the space from a staging area and again to map it into the file!
+
+To deal with this explosion in a calm manner, XFS expands its use of deferred
+work items to cover most reverse mapping updates and all refcount updates.
+This reduces the worst case size of transaction reservations by breaking the
+work into a long chain of small updates, which increases the degree of eventual
+consistency in the system.
+Again, this generally isn't a problem because XFS orders its deferred work
+items carefully to avoid resource reuse conflicts between unsuspecting threads.
+
+However, online fsck changes the rules -- remember that although physical
+updates to per-AG structures are coordinated by locking the buffers for AG
+headers, buffer locks are dropped between transactions.
+Once scrub acquires resources and takes locks for a data structure, it must do
+all the validation work without releasing the lock.
+If the main lock for a space btree is an AG header buffer lock, scrub may have
+interrupted another thread that is midway through finishing a chain.
+For example, if a thread performing a copy-on-write has completed a reverse
+mapping update but not the corresponding refcount update, the two AG btrees
+will appear inconsistent to scrub and an observation of corruption will be
+recorded.  This observation will not be correct.
+If a repair is attempted in this state, the results will be catastrophic!
+
+Several solutions to this problem were evaluated upon discovery of this flaw:
+
+1. Add a higher level lock to allocation groups and require writer threads to
+   acquire the higher level lock in AG order before making any changes.
+   This would be very difficult to implement in practice because it is
+   difficult to determine which locks need to be obtained, and in what order,
+   without simulating the entire operation.
+   Performing a dry run of a file operation to discover necessary locks would
+   make the filesystem very slow.
+
+2. Make the deferred work coordinator code aware of consecutive intent items
+   targeting the same AG and have it hold the AG header buffers locked across
+   the transaction roll between updates.
+   This would introduce a lot of complexity into the coordinator since it is
+   only loosely coupled with the actual deferred work items.
+   It would also fail to solve the problem because deferred work items can
+   generate new deferred subtasks, but all subtasks must be complete before
+   work can start on a new sibling task.
+
+3. Teach online fsck to walk all transactions waiting for whichever lock(s)
+   protect the data structure being scrubbed to look for pending operations.
+   The checking and repair operations must factor these pending operations into
+   the evaluations being performed.
+   This solution is a nonstarter because it is *extremely* invasive to the main
+   filesystem.
+
+4. Recognize that only online fsck has this requirement of total consistency
+   of AG metadata, and that online fsck should be relatively rare as compared
+   to filesystem change operations.
+   For each AG, maintain a count of intent items targetting that AG.
+   When online fsck wants to examine an AG, it should lock the AG header
+   buffers to quiesce all transaction chains that want to modify that AG, and
+   only proceed with the scrub if the count is zero.
+   In other words, scrub only proceeds if it can lock the AG header buffers and
+   there can't possibly be any intents in progress.
+   This may lead to fairness and starvation issues, but regular filesystem
+   updates take precedence over online fsck activity.
+
+Intent Drains
+`````````````
+
+The fourth solution is implemented in the current iteration of online fsck,
+with atomic_t providing the active intent counter.
+
+There are two key properties to the drain mechanism.
+First, the counter is incremented when a deferred work item is *queued* to a
+transaction, and it is decremented after the associated intent done log item is
+*committed* to another transaction.
+The second property is that deferred work can be added to a transaction without
+holding an AG header lock, but per-AG work items cannot be marked done without
+locking that AG header buffer to log the physical updates and the intent done
+log item.
+The first property enables scrub to yield to running transaction chains, which
+is an explicit deprioritization of online fsck to benefit file operations.
+The second property of the drain is key to the correct coordination of scrub,
+since scrub will always be able to decide if a conflict is possible.
+
+For regular filesystem code, the drain works as follows:
+
+1. Call the appropriate subsystem function to add a deferred work item to a
+   transaction.
+
+2. The function calls ``xfs_drain_bump`` to increase the counter.
+
+3. When the deferred item manager wants to finish the deferred work item, it
+   calls ``->finish_item`` to complete it.
+
+4. The ``->finish_item`` implementation logs some changes and calls
+   ``xfs_drain_drop`` to decrease the sloppy counter and wake up any threads
+   waiting on the drain.
+
+5. The subtransaction commits, which unlocks the resource associated with the
+   intent item.
+
+For scrub, the drain works as follows:
+
+1. Lock the resource(s) associated with the metadata being scrubbed.
+   For example, a scan of the refcount btree would lock the AGI and AGF header
+   buffers.
+
+2. If the counter is zero (``xfs_drain_busy`` returns false), there are no
+   chains in progress and the operation may proceed.
+
+3. Otherwise, release the resources grabbed in step 1.
+
+4. Wait for the intent counter to reach zero (``xfs_drain_intents``), then go
+   back to step 1 unless a signal has been caught.
+
+To avoid polling in step 4, the drain provides a waitqueue for scrub threads to
+be woken up whenever the intent count drops to zero.
+
+The proposed patchset is the
+`scrub intent drain series
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-drain-intents>`_.
+
+.. _jump_labels:
+
+Static Keys (aka Jump Label Patching)
+`````````````````````````````````````
+
+Online fsck for XFS separates the regular filesystem from the checking and
+repair code as much as possible.
+However, there are a few parts of online fsck (such as the intent drains, and
+later, live update hooks) where it is useful for the online fsck code to know
+what's going on in the rest of the filesystem.
+Since it is not expected that online fsck will be constantly running in the
+background, it is very important to minimize the runtime overhead imposed by
+these hooks when online fsck is compiled into the kernel but not actively
+running on behalf of userspace.
+Taking locks in the hot path of a writer thread to access a data structure only
+to find that no further action is necessary is expensive -- on the author's
+computer, this have an overhead of 40-50ns per access.
+Fortunately, the kernel supports dynamic code patching, which enables XFS to
+replace a static branch to hook code with ``nop`` sleds when online fsck isn't
+running.
+This sled has an overhead of however long it takes the instruction decoder to
+skip past the sled, which seems to be on the order of less than 1ns and
+does not access memory outside of instruction fetching.
+
+When online fsck enables the static key, the sled is replaced with an
+unconditional branch to call the hook code.
+The switchover is quite expensive (~22000ns) but is paid entirely by the
+program that invoked online fsck, and can be amortized if multiple threads
+enter online fsck at the same time, or if multiple filesystems are being
+checked at the same time.
+Changing the branch direction requires taking the CPU hotplug lock, and since
+CPU initialization requires memory allocation, online fsck must be careful not
+to change a static key while holding any locks or resources that could be
+accessed in the memory reclaim paths.
+To minimize contention on the CPU hotplug lock, care should be taken not to
+enable or disable static keys unnecessarily.
+
+Because static keys are intended to minimize hook overhead for regular
+filesystem operations when xfs_scrub is not running, the intended usage
+patterns are as follows:
+
+- The hooked part of XFS should declare a static-scoped static key that
+  defaults to false.
+  The ``DEFINE_STATIC_KEY_FALSE`` macro takes care of this.
+  The static key itself should be declared as a ``static`` variable.
+
+- When deciding to invoke code that's only used by scrub, the regular
+  filesystem should call the ``static_branch_unlikely`` predicate to avoid the
+  scrub-only hook code if the static key is not enabled.
+
+- The regular filesystem should export helper functions that call
+  ``static_branch_inc`` to enable and ``static_branch_dec`` to disable the
+  static key.
+  Wrapper functions make it easy to compile out the relevant code if the kernel
+  distributor turns off online fsck at build time.
+
+- Scrub functions wanting to turn on scrub-only XFS functionality should call
+  the ``xchk_fshooks_enable`` from the setup function to enable a specific
+  hook.
+  This must be done before obtaining any resources that are used by memory
+  reclaim.
+  Callers had better be sure they really need the functionality gated by the
+  static key; the ``TRY_HARDER`` flag is useful here.
+
+Online scrub has resource acquisition helpers (e.g. ``xchk_perag_lock``) to
+handle locking AGI and AGF buffers for all scrubber functions.
+If it detects a conflict between scrub and the running transactions, it will
+try to wait for intents to complete.
+If the caller of the helper has not enabled the static key, the helper will
+return -EDEADLOCK, which should result in the scrub being restarted with the
+``TRY_HARDER`` flag set.
+The scrub setup function should detect that flag, enable the static key, and
+try the scrub again.
+Scrub teardown disables all static keys obtained by ``xchk_fshooks_enable``.
+
+For more information, please see the kernel documentation of
+Documentation/staging/static-keys.rst.