diff mbox series

[09/14] xfs: document online file metadata repair code

Message ID 166473478486.1082796.11670617428892270355.stgit@magnolia (mailing list archive)
State New, archived
Headers show
Series xfs: design documentation for online fsck | expand

Commit Message

Darrick J. Wong Oct. 2, 2022, 6:19 p.m. UTC
From: Darrick J. Wong <djwong@kernel.org>

Add to the fifth chapter of the online fsck design documentation, where
we discuss the details of the data structures and algorithms used by the
kernel to repair file metadata.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 .../filesystems/xfs-online-fsck-design.rst         |  150 ++++++++++++++++++++
 1 file changed, 150 insertions(+)

Comments

Allison Henderson Feb. 16, 2023, 3:46 p.m. UTC | #1
On Sun, 2022-10-02 at 11:19 -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
> 
> Add to the fifth chapter of the online fsck design documentation,
> where
> we discuss the details of the data structures and algorithms used by
> the
> kernel to repair file metadata.
> 
> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> ---
>  .../filesystems/xfs-online-fsck-design.rst         |  150
> ++++++++++++++++++++
>  1 file changed, 150 insertions(+)
> 
> 
> diff --git a/Documentation/filesystems/xfs-online-fsck-design.rst
> b/Documentation/filesystems/xfs-online-fsck-design.rst
> index c41f089549a0..10709dc74dcb 100644
> --- a/Documentation/filesystems/xfs-online-fsck-design.rst
> +++ b/Documentation/filesystems/xfs-online-fsck-design.rst
> @@ -2872,3 +2872,153 @@ The allocation group free block list (AGFL)
> is repaired as follows:
>  4. Once the AGFL is full, reap any blocks leftover.
>  
>  5. The next operation to fix the freelist will right-size the list.
> +
> +Inode Record Repairs
> +--------------------
> +
> +Inode records must be handled carefully, because they have both
> ondisk records
> +("dinodes") and an in-memory ("cached") representation.
> +There is a very high potential for cache coherency issues if online
> fsck is not
> +careful to access the ondisk metadata *only* when the ondisk
> metadata is so
> +badly damaged that the filesystem cannot load the in-memory
> representation.
> +When online fsck wants to open a damaged file for scrubbing, it must
> use
> +specialized resource acquisition functions that return either the
> in-memory
> +representation *or* a lock on whichever object is necessary to
> prevent any
> +update to the ondisk location.
> +
> +The only repairs that should be made to the ondisk inode buffers are
> whatever
> +is necessary to get the in-core structure loaded.
> +This means fixing whatever is caught by the inode cluster buffer and
> inode fork
> +verifiers, and retrying the ``iget`` operation.
> +If the second ``iget`` fails, the repair has failed.
> +
> +Once the in-memory representation is loaded, repair can lock the
> inode and can
> +subject it to comprehensive checks, repairs, and optimizations.
> +Most inode attributes are easy to check and constrain, or are user-
> controlled
> +arbitrary bit patterns; these are both easy to fix.
> +Dealing with the data and attr fork extent counts and the file block
> counts is
> +more complicated, because computing the correct value requires
> traversing the
> +forks, or if that fails, leaving the fields invalid and waiting for
> the fork
> +fsck functions to run.
> +
> +The proposed patchset is the
> +`inode
> +<
> https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/
> log/?h=repair-inodes>`_
> +repair series.
> +
> +Quota Record Repairs
> +--------------------
> +
> +Similar to inodes, quota records ("dquots") also have both ondisk
> records and
> +an in-memory representation, and hence are subject to the same cache
> coherency
> +issues.
> +Somewhat confusingly, both are known as dquots in the XFS codebase.
> +
> +The only repairs that should be made to the ondisk quota record
> buffers are
> +whatever is necessary to get the in-core structure loaded.
> +Once the in-memory representation is loaded, the only attributes
> needing
> +checking are obviously bad limits and timer values.
> +
> +Quota usage counters are checked, repaired, and discussed separately
> in the
> +section about :ref:`live quotacheck <quotacheck>`.
> +
> +The proposed patchset is the
> +`quota
> +<
> https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/
> log/?h=repair-quota>`_
> +repair series.
> +
> +.. _fscounters:
> +
> +Freezing to Fix Summary Counters
> +--------------------------------
> +
> +Filesystem summary counters track availability of filesystem
> resources such
> +as free blocks, free inodes, and allocated inodes.
> +This information could be compiled by walking the free space and
> inode indexes,
> +but this is a slow process, so XFS maintains a copy in the ondisk
> superblock
> +that should reflect the ondisk metadata, at least when the
> filesystem has been
> +unmounted cleanly.
> +For performance reasons, XFS also maintains incore copies of those
> counters,
> +which are key to enabling resource reservations for active
> transactions.
> +Writer threads reserve the worst-case quantities of resources from
> the
> +incore counter and give back whatever they don't use at commit time.
> +It is therefore only necessary to serialize on the superblock when
> the
> +superblock is being committed to disk.
> +
> +The lazy superblock counter feature introduced in XFS v5 took this
> even further
> +by training log recovery to recompute the summary counters from the
> AG headers,
> +which eliminated the need for most transactions even to touch the
> superblock.
> +The only time XFS commits the summary counters is at filesystem
> unmount.
> +To reduce contention even further, the incore counter is implemented
> as a
> +percpu counter, which means that each CPU is allocated a batch of
> blocks from a
> +global incore counter and can satisfy small allocations from the
> local batch.
> +
> +The high-performance nature of the summary counters makes it
> difficult for
> +online fsck to check them, since there is no way to quiesce a percpu
> counter
> +while the system is running.
> +Although online fsck can read the filesystem metadata to compute the
> correct
> +values of the summary counters, there's no way to hold the value of
> a percpu
> +counter stable, so it's quite possible that the counter will be out
> of date by
> +the time the walk is complete.
> +Earlier versions of online scrub would return to userspace with an
> incomplete
> +scan flag, but this is not a satisfying outcome for a system
> administrator.
> +For repairs, the in-memory counters must be stabilize while walking
> the
nit: stablilized

> +filesystem metadata to get an accurate reading and install it in the
> percpu
> +counter.
> +
> +To satisfy this requirement, online fsck must prevent other programs
> in the
> +system from initiating new writes to the filesystem, it must disable
> background
> +garbage collection threads, and it must wait for existing writer
> programs to
> +exit the kernel.
> +Once that has been established, scrub can walk the AG free space
> indexes, the
> +inode btrees, and the realtime bitmap to compute the correct value
> of all
> +four summary counters.
> +This is very similar to a filesystem freeze.
> +
> +The initial implementation used the actual VFS filesystem freeze
> mechanism to
> +quiesce filesystem activity.


> +With the filesystem frozen, it is possible to resolve the counter
> values with
> +exact precision, but there are many problems with calling the VFS
> methods
> +directly:
> +
> +- Other programs can unfreeze the filesystem without our knowledge.
> +  This leads to incorrect scan results and incorrect repairs.
> +
> +- Adding an extra lock to prevent others from thawing the filesystem
> required
> +  the addition of a ``->freeze_super`` function to wrap
> ``freeze_fs()``.
> +  This in turn caused other subtle problems because it turns out
> that the VFS
> +  ``freeze_super`` and ``thaw_super`` functions can drop the last
> reference to
> +  the VFS superblock, and any subsequent access becomes a UAF bug!
> +  This can happen if the filesystem is unmounted while the
> underlying block
> +  device has frozen the filesystem.
> +  This problem could be solved by grabbing extra references to the
> superblock,
> +  but it felt suboptimal given the other inadequacies of this
> approach:
> +
> +- The log need not be quiesced to check the summary counters, but a
> VFS freeze
> +  initiates one anyway.
> +  This adds unnecessary runtime to live fscounter fsck operations.
> +
> +- Quiescing the log means that XFS flushes the (possibly incorrect)
> counters to
> +  disk as part of cleaning the log.
> +
> +- A bug in the VFS meant that freeze could complete even when
> sync_filesystem
> +  fails to flush the filesystem and returns an error.
> +  This bug was fixed in Linux 5.17.
> +
> +The author 

I think the above is a candidate for culling or maybe the side bar if
you really feel the need to keep it.  Aside from the fact that it was
simply bug prone, it makes sense to avoid an entire freeze if we can
help it.  I think it would be fine to just pop into the current
proposition:

"However, this approach was found to be both costly and unstable. 
Later optimizations of ofsck established ..."
> established that the only component of online fsck that requires the
> +ability to freeze the filesystem is the fscounter scrubber, so the
> code for
> +this could be localized to that source file.
> +fscounter freeze behaves the same as the VFS freeze method, except:
> +
> +- The final freeze state is set one higher than
> ``SB_FREEZE_COMPLETE`` to
> +  prevent other threads from thawing the filesystem.
> +
> +- It does not quiesce the log.
> +
> +With this code in place, it is now possible to pause the filesystem
> for just
> +long enough to check and correct the summary counters.
> +
> +The proposed patchset is the
> +`summary counter cleanup
> +<
> https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/
> log/?h=repair-fscounters>`_
> +series.
> 

Otherwise looks good.  :-)
Allison
Darrick J. Wong Feb. 16, 2023, 11:05 p.m. UTC | #2
On Thu, Feb 16, 2023 at 03:46:30PM +0000, Allison Henderson wrote:
> On Sun, 2022-10-02 at 11:19 -0700, Darrick J. Wong wrote:

Not sure why this reply is to the earlier submission of the design doc,
but oh well, it didn't change much between October and December of 2022.

> > From: Darrick J. Wong <djwong@kernel.org>
> > 
> > Add to the fifth chapter of the online fsck design documentation,
> > where
> > we discuss the details of the data structures and algorithms used by
> > the
> > kernel to repair file metadata.
> > 
> > Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> > ---
> >  .../filesystems/xfs-online-fsck-design.rst         |  150
> > ++++++++++++++++++++
> >  1 file changed, 150 insertions(+)
> > 
> > 
> > diff --git a/Documentation/filesystems/xfs-online-fsck-design.rst
> > b/Documentation/filesystems/xfs-online-fsck-design.rst
> > index c41f089549a0..10709dc74dcb 100644
> > --- a/Documentation/filesystems/xfs-online-fsck-design.rst
> > +++ b/Documentation/filesystems/xfs-online-fsck-design.rst
> > @@ -2872,3 +2872,153 @@ The allocation group free block list (AGFL)
> > is repaired as follows:
> >  4. Once the AGFL is full, reap any blocks leftover.
> >  
> >  5. The next operation to fix the freelist will right-size the list.
> > +
> > +Inode Record Repairs
> > +--------------------
> > +
> > +Inode records must be handled carefully, because they have both
> > ondisk records
> > +("dinodes") and an in-memory ("cached") representation.
> > +There is a very high potential for cache coherency issues if online
> > fsck is not
> > +careful to access the ondisk metadata *only* when the ondisk
> > metadata is so
> > +badly damaged that the filesystem cannot load the in-memory
> > representation.
> > +When online fsck wants to open a damaged file for scrubbing, it must
> > use
> > +specialized resource acquisition functions that return either the
> > in-memory
> > +representation *or* a lock on whichever object is necessary to
> > prevent any
> > +update to the ondisk location.
> > +
> > +The only repairs that should be made to the ondisk inode buffers are
> > whatever
> > +is necessary to get the in-core structure loaded.
> > +This means fixing whatever is caught by the inode cluster buffer and
> > inode fork
> > +verifiers, and retrying the ``iget`` operation.
> > +If the second ``iget`` fails, the repair has failed.
> > +
> > +Once the in-memory representation is loaded, repair can lock the
> > inode and can
> > +subject it to comprehensive checks, repairs, and optimizations.
> > +Most inode attributes are easy to check and constrain, or are user-
> > controlled
> > +arbitrary bit patterns; these are both easy to fix.
> > +Dealing with the data and attr fork extent counts and the file block
> > counts is
> > +more complicated, because computing the correct value requires
> > traversing the
> > +forks, or if that fails, leaving the fields invalid and waiting for
> > the fork
> > +fsck functions to run.
> > +
> > +The proposed patchset is the
> > +`inode
> > +<
> > https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/
> > log/?h=repair-inodes>`_
> > +repair series.
> > +
> > +Quota Record Repairs
> > +--------------------
> > +
> > +Similar to inodes, quota records ("dquots") also have both ondisk
> > records and
> > +an in-memory representation, and hence are subject to the same cache
> > coherency
> > +issues.
> > +Somewhat confusingly, both are known as dquots in the XFS codebase.
> > +
> > +The only repairs that should be made to the ondisk quota record
> > buffers are
> > +whatever is necessary to get the in-core structure loaded.
> > +Once the in-memory representation is loaded, the only attributes
> > needing
> > +checking are obviously bad limits and timer values.
> > +
> > +Quota usage counters are checked, repaired, and discussed separately
> > in the
> > +section about :ref:`live quotacheck <quotacheck>`.
> > +
> > +The proposed patchset is the
> > +`quota
> > +<
> > https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/
> > log/?h=repair-quota>`_
> > +repair series.
> > +
> > +.. _fscounters:
> > +
> > +Freezing to Fix Summary Counters
> > +--------------------------------
> > +
> > +Filesystem summary counters track availability of filesystem
> > resources such
> > +as free blocks, free inodes, and allocated inodes.
> > +This information could be compiled by walking the free space and
> > inode indexes,
> > +but this is a slow process, so XFS maintains a copy in the ondisk
> > superblock
> > +that should reflect the ondisk metadata, at least when the
> > filesystem has been
> > +unmounted cleanly.
> > +For performance reasons, XFS also maintains incore copies of those
> > counters,
> > +which are key to enabling resource reservations for active
> > transactions.
> > +Writer threads reserve the worst-case quantities of resources from
> > the
> > +incore counter and give back whatever they don't use at commit time.
> > +It is therefore only necessary to serialize on the superblock when
> > the
> > +superblock is being committed to disk.
> > +
> > +The lazy superblock counter feature introduced in XFS v5 took this
> > even further
> > +by training log recovery to recompute the summary counters from the
> > AG headers,
> > +which eliminated the need for most transactions even to touch the
> > superblock.
> > +The only time XFS commits the summary counters is at filesystem
> > unmount.
> > +To reduce contention even further, the incore counter is implemented
> > as a
> > +percpu counter, which means that each CPU is allocated a batch of
> > blocks from a
> > +global incore counter and can satisfy small allocations from the
> > local batch.
> > +
> > +The high-performance nature of the summary counters makes it
> > difficult for
> > +online fsck to check them, since there is no way to quiesce a percpu
> > counter
> > +while the system is running.
> > +Although online fsck can read the filesystem metadata to compute the
> > correct
> > +values of the summary counters, there's no way to hold the value of
> > a percpu
> > +counter stable, so it's quite possible that the counter will be out
> > of date by
> > +the time the walk is complete.
> > +Earlier versions of online scrub would return to userspace with an
> > incomplete
> > +scan flag, but this is not a satisfying outcome for a system
> > administrator.
> > +For repairs, the in-memory counters must be stabilize while walking
> > the
> nit: stablilized

Fixed, thank you.

> > +filesystem metadata to get an accurate reading and install it in the
> > percpu
> > +counter.
> > +
> > +To satisfy this requirement, online fsck must prevent other programs
> > in the
> > +system from initiating new writes to the filesystem, it must disable
> > background
> > +garbage collection threads, and it must wait for existing writer
> > programs to
> > +exit the kernel.
> > +Once that has been established, scrub can walk the AG free space
> > indexes, the
> > +inode btrees, and the realtime bitmap to compute the correct value
> > of all
> > +four summary counters.
> > +This is very similar to a filesystem freeze.
> > +
> > +The initial implementation used the actual VFS filesystem freeze
> > mechanism to
> > +quiesce filesystem activity.
> 
> 
> > +With the filesystem frozen, it is possible to resolve the counter
> > values with
> > +exact precision, but there are many problems with calling the VFS
> > methods
> > +directly:
> > +
> > +- Other programs can unfreeze the filesystem without our knowledge.
> > +  This leads to incorrect scan results and incorrect repairs.
> > +
> > +- Adding an extra lock to prevent others from thawing the filesystem
> > required
> > +  the addition of a ``->freeze_super`` function to wrap
> > ``freeze_fs()``.
> > +  This in turn caused other subtle problems because it turns out
> > that the VFS
> > +  ``freeze_super`` and ``thaw_super`` functions can drop the last
> > reference to
> > +  the VFS superblock, and any subsequent access becomes a UAF bug!
> > +  This can happen if the filesystem is unmounted while the
> > underlying block
> > +  device has frozen the filesystem.
> > +  This problem could be solved by grabbing extra references to the
> > superblock,
> > +  but it felt suboptimal given the other inadequacies of this
> > approach:
> > +
> > +- The log need not be quiesced to check the summary counters, but a
> > VFS freeze
> > +  initiates one anyway.
> > +  This adds unnecessary runtime to live fscounter fsck operations.
> > +
> > +- Quiescing the log means that XFS flushes the (possibly incorrect)
> > counters to
> > +  disk as part of cleaning the log.
> > +
> > +- A bug in the VFS meant that freeze could complete even when
> > sync_filesystem
> > +  fails to flush the filesystem and returns an error.
> > +  This bug was fixed in Linux 5.17.
> > +
> > +The author 
> 
> I think the above is a candidate for culling or maybe the side bar if
> you really feel the need to keep it.

I'll turn it into a sidebar, since it took me a while to figure all that
out, and maybe some day I'll need to refer to what amounts to historic
notes.

> Aside from the fact that it was
> simply bug prone, it makes sense to avoid an entire freeze if we can
> help it.  I think it would be fine to just pop into the current
> proposition:
> 
> "However, this approach was found to be both costly and unstable. 
> Later optimizations of ofsck established ..."

How about:

"...Once that has been established, scrub can walk the AG free space
indexes, the inode btrees, and the realtime bitmap to compute the
correct value of all four summary counters.  This is very similar to a
filesystem freeze, though not all of the pieces are necessary:

- "The final freeze state is set one higher than ``SB_FREEZE_COMPLETE``
  to prevent other threads from thawing the filesystem, or other scrub
  threads from initiating another fscounters freeze.

- IIt does not quiesce the log.

"With this code in place, it is now possible to pause the filesystem for
just long enough to check and correct the summary counters."

<giant historic sidebar here>

?

> > established that the only component of online fsck that requires the
> > +ability to freeze the filesystem is the fscounter scrubber, so the
> > code for
> > +this could be localized to that source file.
> > +fscounter freeze behaves the same as the VFS freeze method, except:
> > +
> > +- The final freeze state is set one higher than
> > ``SB_FREEZE_COMPLETE`` to
> > +  prevent other threads from thawing the filesystem.
> > +
> > +- It does not quiesce the log.
> > +
> > +With this code in place, it is now possible to pause the filesystem
> > for just
> > +long enough to check and correct the summary counters.
> > +
> > +The proposed patchset is the
> > +`summary counter cleanup
> > +<
> > https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/
> > log/?h=repair-fscounters>`_
> > +series.
> > 
> 
> Otherwise looks good.  :-)

Thanks!

--D

> Allison
>
Allison Henderson Feb. 25, 2023, 7:32 a.m. UTC | #3
On Thu, 2023-02-16 at 15:05 -0800, Darrick J. Wong wrote:
> On Thu, Feb 16, 2023 at 03:46:30PM +0000, Allison Henderson wrote:
> > On Sun, 2022-10-02 at 11:19 -0700, Darrick J. Wong wrote:
> 
> Not sure why this reply is to the earlier submission of the design
> doc,
> but oh well, it didn't change much between October and December of
> 2022.
oops, it just because I search the inbox for the next patch in the
series and inadvertently clicked the earlier one
> 
> > > From: Darrick J. Wong <djwong@kernel.org>
> > > 
> > > Add to the fifth chapter of the online fsck design documentation,
> > > where
> > > we discuss the details of the data structures and algorithms used
> > > by
> > > the
> > > kernel to repair file metadata.
> > > 
> > > Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> > > ---
> > >  .../filesystems/xfs-online-fsck-design.rst         |  150
> > > ++++++++++++++++++++
> > >  1 file changed, 150 insertions(+)
> > > 
> > > 
> > > diff --git a/Documentation/filesystems/xfs-online-fsck-design.rst
> > > b/Documentation/filesystems/xfs-online-fsck-design.rst
> > > index c41f089549a0..10709dc74dcb 100644
> > > --- a/Documentation/filesystems/xfs-online-fsck-design.rst
> > > +++ b/Documentation/filesystems/xfs-online-fsck-design.rst
> > > @@ -2872,3 +2872,153 @@ The allocation group free block list
> > > (AGFL)
> > > is repaired as follows:
> > >  4. Once the AGFL is full, reap any blocks leftover.
> > >  
> > >  5. The next operation to fix the freelist will right-size the
> > > list.
> > > +
> > > +Inode Record Repairs
> > > +--------------------
> > > +
> > > +Inode records must be handled carefully, because they have both
> > > ondisk records
> > > +("dinodes") and an in-memory ("cached") representation.
> > > +There is a very high potential for cache coherency issues if
> > > online
> > > fsck is not
> > > +careful to access the ondisk metadata *only* when the ondisk
> > > metadata is so
> > > +badly damaged that the filesystem cannot load the in-memory
> > > representation.
> > > +When online fsck wants to open a damaged file for scrubbing, it
> > > must
> > > use
> > > +specialized resource acquisition functions that return either
> > > the
> > > in-memory
> > > +representation *or* a lock on whichever object is necessary to
> > > prevent any
> > > +update to the ondisk location.
> > > +
> > > +The only repairs that should be made to the ondisk inode buffers
> > > are
> > > whatever
> > > +is necessary to get the in-core structure loaded.
> > > +This means fixing whatever is caught by the inode cluster buffer
> > > and
> > > inode fork
> > > +verifiers, and retrying the ``iget`` operation.
> > > +If the second ``iget`` fails, the repair has failed.
> > > +
> > > +Once the in-memory representation is loaded, repair can lock the
> > > inode and can
> > > +subject it to comprehensive checks, repairs, and optimizations.
> > > +Most inode attributes are easy to check and constrain, or are
> > > user-
> > > controlled
> > > +arbitrary bit patterns; these are both easy to fix.
> > > +Dealing with the data and attr fork extent counts and the file
> > > block
> > > counts is
> > > +more complicated, because computing the correct value requires
> > > traversing the
> > > +forks, or if that fails, leaving the fields invalid and waiting
> > > for
> > > the fork
> > > +fsck functions to run.
> > > +
> > > +The proposed patchset is the
> > > +`inode
> > > +<
> > > https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/
> > > log/?h=repair-inodes>`_
> > > +repair series.
> > > +
> > > +Quota Record Repairs
> > > +--------------------
> > > +
> > > +Similar to inodes, quota records ("dquots") also have both
> > > ondisk
> > > records and
> > > +an in-memory representation, and hence are subject to the same
> > > cache
> > > coherency
> > > +issues.
> > > +Somewhat confusingly, both are known as dquots in the XFS
> > > codebase.
> > > +
> > > +The only repairs that should be made to the ondisk quota record
> > > buffers are
> > > +whatever is necessary to get the in-core structure loaded.
> > > +Once the in-memory representation is loaded, the only attributes
> > > needing
> > > +checking are obviously bad limits and timer values.
> > > +
> > > +Quota usage counters are checked, repaired, and discussed
> > > separately
> > > in the
> > > +section about :ref:`live quotacheck <quotacheck>`.
> > > +
> > > +The proposed patchset is the
> > > +`quota
> > > +<
> > > https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/
> > > log/?h=repair-quota>`_
> > > +repair series.
> > > +
> > > +.. _fscounters:
> > > +
> > > +Freezing to Fix Summary Counters
> > > +--------------------------------
> > > +
> > > +Filesystem summary counters track availability of filesystem
> > > resources such
> > > +as free blocks, free inodes, and allocated inodes.
> > > +This information could be compiled by walking the free space and
> > > inode indexes,
> > > +but this is a slow process, so XFS maintains a copy in the
> > > ondisk
> > > superblock
> > > +that should reflect the ondisk metadata, at least when the
> > > filesystem has been
> > > +unmounted cleanly.
> > > +For performance reasons, XFS also maintains incore copies of
> > > those
> > > counters,
> > > +which are key to enabling resource reservations for active
> > > transactions.
> > > +Writer threads reserve the worst-case quantities of resources
> > > from
> > > the
> > > +incore counter and give back whatever they don't use at commit
> > > time.
> > > +It is therefore only necessary to serialize on the superblock
> > > when
> > > the
> > > +superblock is being committed to disk.
> > > +
> > > +The lazy superblock counter feature introduced in XFS v5 took
> > > this
> > > even further
> > > +by training log recovery to recompute the summary counters from
> > > the
> > > AG headers,
> > > +which eliminated the need for most transactions even to touch
> > > the
> > > superblock.
> > > +The only time XFS commits the summary counters is at filesystem
> > > unmount.
> > > +To reduce contention even further, the incore counter is
> > > implemented
> > > as a
> > > +percpu counter, which means that each CPU is allocated a batch
> > > of
> > > blocks from a
> > > +global incore counter and can satisfy small allocations from the
> > > local batch.
> > > +
> > > +The high-performance nature of the summary counters makes it
> > > difficult for
> > > +online fsck to check them, since there is no way to quiesce a
> > > percpu
> > > counter
> > > +while the system is running.
> > > +Although online fsck can read the filesystem metadata to compute
> > > the
> > > correct
> > > +values of the summary counters, there's no way to hold the value
> > > of
> > > a percpu
> > > +counter stable, so it's quite possible that the counter will be
> > > out
> > > of date by
> > > +the time the walk is complete.
> > > +Earlier versions of online scrub would return to userspace with
> > > an
> > > incomplete
> > > +scan flag, but this is not a satisfying outcome for a system
> > > administrator.
> > > +For repairs, the in-memory counters must be stabilize while
> > > walking
> > > the
> > nit: stablilized
> 
> Fixed, thank you.
> 
> > > +filesystem metadata to get an accurate reading and install it in
> > > the
> > > percpu
> > > +counter.
> > > +
> > > +To satisfy this requirement, online fsck must prevent other
> > > programs
> > > in the
> > > +system from initiating new writes to the filesystem, it must
> > > disable
> > > background
> > > +garbage collection threads, and it must wait for existing writer
> > > programs to
> > > +exit the kernel.
> > > +Once that has been established, scrub can walk the AG free space
> > > indexes, the
> > > +inode btrees, and the realtime bitmap to compute the correct
> > > value
> > > of all
> > > +four summary counters.
> > > +This is very similar to a filesystem freeze.
> > > +
> > > +The initial implementation used the actual VFS filesystem freeze
> > > mechanism to
> > > +quiesce filesystem activity.
> > 
> > 
> > > +With the filesystem frozen, it is possible to resolve the
> > > counter
> > > values with
> > > +exact precision, but there are many problems with calling the
> > > VFS
> > > methods
> > > +directly:
> > > +
> > > +- Other programs can unfreeze the filesystem without our
> > > knowledge.
> > > +  This leads to incorrect scan results and incorrect repairs.
> > > +
> > > +- Adding an extra lock to prevent others from thawing the
> > > filesystem
> > > required
> > > +  the addition of a ``->freeze_super`` function to wrap
> > > ``freeze_fs()``.
> > > +  This in turn caused other subtle problems because it turns out
> > > that the VFS
> > > +  ``freeze_super`` and ``thaw_super`` functions can drop the
> > > last
> > > reference to
> > > +  the VFS superblock, and any subsequent access becomes a UAF
> > > bug!
> > > +  This can happen if the filesystem is unmounted while the
> > > underlying block
> > > +  device has frozen the filesystem.
> > > +  This problem could be solved by grabbing extra references to
> > > the
> > > superblock,
> > > +  but it felt suboptimal given the other inadequacies of this
> > > approach:
> > > +
> > > +- The log need not be quiesced to check the summary counters,
> > > but a
> > > VFS freeze
> > > +  initiates one anyway.
> > > +  This adds unnecessary runtime to live fscounter fsck
> > > operations.
> > > +
> > > +- Quiescing the log means that XFS flushes the (possibly
> > > incorrect)
> > > counters to
> > > +  disk as part of cleaning the log.
> > > +
> > > +- A bug in the VFS meant that freeze could complete even when
> > > sync_filesystem
> > > +  fails to flush the filesystem and returns an error.
> > > +  This bug was fixed in Linux 5.17.
> > > +
> > > +The author 
> > 
> > I think the above is a candidate for culling or maybe the side bar
> > if
> > you really feel the need to keep it.
> 
> I'll turn it into a sidebar, since it took me a while to figure all
> that
> out, and maybe some day I'll need to refer to what amounts to
> historic
> notes.
> 
> > Aside from the fact that it was
> > simply bug prone, it makes sense to avoid an entire freeze if we
> > can
> > help it.  I think it would be fine to just pop into the current
> > proposition:
> > 
> > "However, this approach was found to be both costly and unstable. 
> > Later optimizations of ofsck established ..."
> 
> How about:
> 
> "...Once that has been established, scrub can walk the AG free space
> indexes, the inode btrees, and the realtime bitmap to compute the
> correct value of all four summary counters.  This is very similar to
> a
> filesystem freeze, though not all of the pieces are necessary:
> 
> - "The final freeze state is set one higher than
> ``SB_FREEZE_COMPLETE``
>   to prevent other threads from thawing the filesystem, or other
> scrub
>   threads from initiating another fscounters freeze.
> 
> - IIt does not quiesce the log.
> 
> "With this code in place, it is now possible to pause the filesystem
> for
> just long enough to check and correct the summary counters."
> 
> <giant historic sidebar here>
Ok, I think that's a fair compromise
> 
> ?
> 
> > > established that the only component of online fsck that requires
> > > the
> > > +ability to freeze the filesystem is the fscounter scrubber, so
> > > the
> > > code for
> > > +this could be localized to that source file.
> > > +fscounter freeze behaves the same as the VFS freeze method,
> > > except:
> > > +
> > > +- The final freeze state is set one higher than
> > > ``SB_FREEZE_COMPLETE`` to
> > > +  prevent other threads from thawing the filesystem.
> > > +
> > > +- It does not quiesce the log.
> > > +
> > > +With this code in place, it is now possible to pause the
> > > filesystem
> > > for just
> > > +long enough to check and correct the summary counters.
> > > +
> > > +The proposed patchset is the
> > > +`summary counter cleanup
> > > +<
> > > https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/
> > > log/?h=repair-fscounters>`_
> > > +series.
> > > 
> > 
> > Otherwise looks good.  :-)
> 
> Thanks!
> 
> --D
> 
> > Allison
> >
diff mbox series

Patch

diff --git a/Documentation/filesystems/xfs-online-fsck-design.rst b/Documentation/filesystems/xfs-online-fsck-design.rst
index c41f089549a0..10709dc74dcb 100644
--- a/Documentation/filesystems/xfs-online-fsck-design.rst
+++ b/Documentation/filesystems/xfs-online-fsck-design.rst
@@ -2872,3 +2872,153 @@  The allocation group free block list (AGFL) is repaired as follows:
 4. Once the AGFL is full, reap any blocks leftover.
 
 5. The next operation to fix the freelist will right-size the list.
+
+Inode Record Repairs
+--------------------
+
+Inode records must be handled carefully, because they have both ondisk records
+("dinodes") and an in-memory ("cached") representation.
+There is a very high potential for cache coherency issues if online fsck is not
+careful to access the ondisk metadata *only* when the ondisk metadata is so
+badly damaged that the filesystem cannot load the in-memory representation.
+When online fsck wants to open a damaged file for scrubbing, it must use
+specialized resource acquisition functions that return either the in-memory
+representation *or* a lock on whichever object is necessary to prevent any
+update to the ondisk location.
+
+The only repairs that should be made to the ondisk inode buffers are whatever
+is necessary to get the in-core structure loaded.
+This means fixing whatever is caught by the inode cluster buffer and inode fork
+verifiers, and retrying the ``iget`` operation.
+If the second ``iget`` fails, the repair has failed.
+
+Once the in-memory representation is loaded, repair can lock the inode and can
+subject it to comprehensive checks, repairs, and optimizations.
+Most inode attributes are easy to check and constrain, or are user-controlled
+arbitrary bit patterns; these are both easy to fix.
+Dealing with the data and attr fork extent counts and the file block counts is
+more complicated, because computing the correct value requires traversing the
+forks, or if that fails, leaving the fields invalid and waiting for the fork
+fsck functions to run.
+
+The proposed patchset is the
+`inode
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-inodes>`_
+repair series.
+
+Quota Record Repairs
+--------------------
+
+Similar to inodes, quota records ("dquots") also have both ondisk records and
+an in-memory representation, and hence are subject to the same cache coherency
+issues.
+Somewhat confusingly, both are known as dquots in the XFS codebase.
+
+The only repairs that should be made to the ondisk quota record buffers are
+whatever is necessary to get the in-core structure loaded.
+Once the in-memory representation is loaded, the only attributes needing
+checking are obviously bad limits and timer values.
+
+Quota usage counters are checked, repaired, and discussed separately in the
+section about :ref:`live quotacheck <quotacheck>`.
+
+The proposed patchset is the
+`quota
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-quota>`_
+repair series.
+
+.. _fscounters:
+
+Freezing to Fix Summary Counters
+--------------------------------
+
+Filesystem summary counters track availability of filesystem resources such
+as free blocks, free inodes, and allocated inodes.
+This information could be compiled by walking the free space and inode indexes,
+but this is a slow process, so XFS maintains a copy in the ondisk superblock
+that should reflect the ondisk metadata, at least when the filesystem has been
+unmounted cleanly.
+For performance reasons, XFS also maintains incore copies of those counters,
+which are key to enabling resource reservations for active transactions.
+Writer threads reserve the worst-case quantities of resources from the
+incore counter and give back whatever they don't use at commit time.
+It is therefore only necessary to serialize on the superblock when the
+superblock is being committed to disk.
+
+The lazy superblock counter feature introduced in XFS v5 took this even further
+by training log recovery to recompute the summary counters from the AG headers,
+which eliminated the need for most transactions even to touch the superblock.
+The only time XFS commits the summary counters is at filesystem unmount.
+To reduce contention even further, the incore counter is implemented as a
+percpu counter, which means that each CPU is allocated a batch of blocks from a
+global incore counter and can satisfy small allocations from the local batch.
+
+The high-performance nature of the summary counters makes it difficult for
+online fsck to check them, since there is no way to quiesce a percpu counter
+while the system is running.
+Although online fsck can read the filesystem metadata to compute the correct
+values of the summary counters, there's no way to hold the value of a percpu
+counter stable, so it's quite possible that the counter will be out of date by
+the time the walk is complete.
+Earlier versions of online scrub would return to userspace with an incomplete
+scan flag, but this is not a satisfying outcome for a system administrator.
+For repairs, the in-memory counters must be stabilize while walking the
+filesystem metadata to get an accurate reading and install it in the percpu
+counter.
+
+To satisfy this requirement, online fsck must prevent other programs in the
+system from initiating new writes to the filesystem, it must disable background
+garbage collection threads, and it must wait for existing writer programs to
+exit the kernel.
+Once that has been established, scrub can walk the AG free space indexes, the
+inode btrees, and the realtime bitmap to compute the correct value of all
+four summary counters.
+This is very similar to a filesystem freeze.
+
+The initial implementation used the actual VFS filesystem freeze mechanism to
+quiesce filesystem activity.
+With the filesystem frozen, it is possible to resolve the counter values with
+exact precision, but there are many problems with calling the VFS methods
+directly:
+
+- Other programs can unfreeze the filesystem without our knowledge.
+  This leads to incorrect scan results and incorrect repairs.
+
+- Adding an extra lock to prevent others from thawing the filesystem required
+  the addition of a ``->freeze_super`` function to wrap ``freeze_fs()``.
+  This in turn caused other subtle problems because it turns out that the VFS
+  ``freeze_super`` and ``thaw_super`` functions can drop the last reference to
+  the VFS superblock, and any subsequent access becomes a UAF bug!
+  This can happen if the filesystem is unmounted while the underlying block
+  device has frozen the filesystem.
+  This problem could be solved by grabbing extra references to the superblock,
+  but it felt suboptimal given the other inadequacies of this approach:
+
+- The log need not be quiesced to check the summary counters, but a VFS freeze
+  initiates one anyway.
+  This adds unnecessary runtime to live fscounter fsck operations.
+
+- Quiescing the log means that XFS flushes the (possibly incorrect) counters to
+  disk as part of cleaning the log.
+
+- A bug in the VFS meant that freeze could complete even when sync_filesystem
+  fails to flush the filesystem and returns an error.
+  This bug was fixed in Linux 5.17.
+
+The author established that the only component of online fsck that requires the
+ability to freeze the filesystem is the fscounter scrubber, so the code for
+this could be localized to that source file.
+fscounter freeze behaves the same as the VFS freeze method, except:
+
+- The final freeze state is set one higher than ``SB_FREEZE_COMPLETE`` to
+  prevent other threads from thawing the filesystem.
+
+- It does not quiesce the log.
+
+With this code in place, it is now possible to pause the filesystem for just
+long enough to check and correct the summary counters.
+
+The proposed patchset is the
+`summary counter cleanup
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-fscounters>`_
+series.