[09/14] xfs: document online file metadata repair code

Return-Path: <linux-fsdevel-owner@kernel.org>
Subject: [PATCH 09/14] xfs: document online file metadata repair code
From: "Darrick J. Wong" <djwong@kernel.org>
To: djwong@kernel.org
Cc: linux-xfs@vger.kernel.org, willy@infradead.org, chandan.babu@oracle.com, allison.henderson@oracle.com, linux-fsdevel@vger.kernel.org, hch@infradead.org, catherine.hoang@oracle.com
Date: Sun, 07 Aug 2022 11:30:56 -0700
Message-ID: <165989705595.2495930.3099979524542778581.stgit@magnolia>
In-Reply-To: <165989700514.2495930.13997256907290563223.stgit@magnolia>
References: <165989700514.2495930.13997256907290563223.stgit@magnolia>
User-Agent: StGit/0.19
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: 7bit
Precedence: bulk
Series: xfs: design documentation for online fsck

[PATCHSET,v2,00/14] xfs: design documentation for online fsck
[01/14] xfs: document the motivation for online fsck design
[02/14] xfs: document the general theory underlying online fsck design
[03/14] xfs: document the testing plan for online fsck
[04/14] xfs: document the user interface for online fsck
[05/14] xfs: document the filesystem metadata checking strategy
[06/14] xfs: document how online fsck deals with eventual consistency
[07/14] xfs: document pageable kernel memory
[08/14] xfs: document btree bulk loading
[09/14] xfs: document online file metadata repair code
[10/14] xfs: document full filesystem scans for online fsck
[11/14] xfs: document metadata file repair
[12/14] xfs: document directory tree repairs
[13/14] xfs: document the userspace fsck driver program
[14/14] xfs: document future directions of online fsck

diff --git a/Documentation/filesystems/xfs-online-fsck-design.rst b/Documentation/filesystems/xfs-online-fsck-design.rst
index c41f089549a0..10709dc74dcb 100644
--- a/Documentation/filesystems/xfs-online-fsck-design.rst
+++ b/Documentation/filesystems/xfs-online-fsck-design.rst
@@ -2872,3 +2872,153 @@ The allocation group free block list (AGFL) is repaired as follows:
 4. Once the AGFL is full, reap any blocks leftover.
 
 5. The next operation to fix the freelist will right-size the list.
+
+Inode Record Repairs
+--------------------
+
+Inode records must be handled carefully, because they have both ondisk records
+("dinodes") and an in-memory ("cached") representation.
+There is a very high potential for cache coherency issues if online fsck is not
+careful to access the ondisk metadata *only* when the ondisk metadata is so
+badly damaged that the filesystem cannot load the in-memory representation.
+When online fsck wants to open a damaged file for scrubbing, it must use
+specialized resource acquisition functions that return either the in-memory
+representation *or* a lock on whichever object is necessary to prevent any
+update to the ondisk location.
+
+The only repairs that should be made to the ondisk inode buffers are whatever
+is necessary to get the in-core structure loaded.
+This means fixing whatever is caught by the inode cluster buffer and inode fork
+verifiers, and retrying the ``iget`` operation.
+If the second ``iget`` fails, the repair has failed.
+
+Once the in-memory representation is loaded, repair can lock the inode and can
+subject it to comprehensive checks, repairs, and optimizations.
+Most inode attributes are easy to check and constrain, or are user-controlled
+arbitrary bit patterns; these are both easy to fix.
+Dealing with the data and attr fork extent counts and the file block counts is
+more complicated, because computing the correct value requires traversing the
+forks, or if that fails, leaving the fields invalid and waiting for the fork
+fsck functions to run.
+
+The proposed patchset is the
+`inode
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-inodes>`_
+repair series.
+
+Quota Record Repairs
+--------------------
+
+Similar to inodes, quota records ("dquots") also have both ondisk records and
+an in-memory representation, and hence are subject to the same cache coherency
+issues.
+Somewhat confusingly, both are known as dquots in the XFS codebase.
+
+The only repairs that should be made to the ondisk quota record buffers are
+whatever is necessary to get the in-core structure loaded.
+Once the in-memory representation is loaded, the only attributes needing
+checking are obviously bad limits and timer values.
+
+Quota usage counters are checked, repaired, and discussed separately in the
+section about :ref:`live quotacheck <quotacheck>`.
+
+The proposed patchset is the
+`quota
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-quota>`_
+repair series.
+
+.. _fscounters:
+
+Freezing to Fix Summary Counters
+--------------------------------
+
+Filesystem summary counters track availability of filesystem resources such
+as free blocks, free inodes, and allocated inodes.
+This information could be compiled by walking the free space and inode indexes,
+but this is a slow process, so XFS maintains a copy in the ondisk superblock
+that should reflect the ondisk metadata, at least when the filesystem has been
+unmounted cleanly.
+For performance reasons, XFS also maintains incore copies of those counters,
+which are key to enabling resource reservations for active transactions.
+Writer threads reserve the worst-case quantities of resources from the
+incore counter and give back whatever they don't use at commit time.
+It is therefore only necessary to serialize on the superblock when the
+superblock is being committed to disk.
+
+The lazy superblock counter feature introduced in XFS v5 took this even further
+by training log recovery to recompute the summary counters from the AG headers,
+which eliminated the need for most transactions even to touch the superblock.
+The only time XFS commits the summary counters is at filesystem unmount.
+To reduce contention even further, the incore counter is implemented as a
+percpu counter, which means that each CPU is allocated a batch of blocks from a
+global incore counter and can satisfy small allocations from the local batch.
+
+The high-performance nature of the summary counters makes it difficult for
+online fsck to check them, since there is no way to quiesce a percpu counter
+while the system is running.
+Although online fsck can read the filesystem metadata to compute the correct
+values of the summary counters, there's no way to hold the value of a percpu
+counter stable, so it's quite possible that the counter will be out of date by
+the time the walk is complete.
+Earlier versions of online scrub would return to userspace with an incomplete
+scan flag, but this is not a satisfying outcome for a system administrator.
+For repairs, the in-memory counters must be stabilized while walking the
+filesystem metadata to get an accurate reading and install it in the percpu
+counter.
+
+To satisfy this requirement, online fsck must prevent other programs in the
+system from initiating new writes to the filesystem, it must disable background
+garbage collection threads, and it must wait for existing writer programs to
+exit the kernel.
+Once that has been established, scrub can walk the AG free space indexes, the
+inode btrees, and the realtime bitmap to compute the correct value of all
+four summary counters.
+This is very similar to a filesystem freeze.
+
+The initial implementation used the actual VFS filesystem freeze mechanism to
+quiesce filesystem activity.
+With the filesystem frozen, it is possible to resolve the counter values with
+exact precision, but there are many problems with calling the VFS methods
+directly:
+
+- Other programs can unfreeze the filesystem without our knowledge.
+  This leads to incorrect scan results and incorrect repairs.
+
+- Adding an extra lock to prevent others from thawing the filesystem required
+  the addition of a ``->freeze_super`` function to wrap ``freeze_fs()``.
+  This in turn caused other subtle problems because it turns out that the VFS
+  ``freeze_super`` and ``thaw_super`` functions can drop the last reference to
+  the VFS superblock, and any subsequent access becomes a UAF bug!
+  This can happen if the filesystem is unmounted while the underlying block
+  device has frozen the filesystem.
+  This problem could be solved by grabbing extra references to the superblock,
+  but it felt suboptimal given the other inadequacies of this approach:
+
+- The log need not be quiesced to check the summary counters, but a VFS freeze
+  initiates one anyway.
+  This adds unnecessary runtime to live fscounter fsck operations.
+
+- Quiescing the log means that XFS flushes the (possibly incorrect) counters to
+  disk as part of cleaning the log.
+
+- A bug in the VFS meant that freeze could complete even when sync_filesystem
+  fails to flush the filesystem and returns an error.
+  This bug was fixed in Linux 5.17.
+
+The author established that the only component of online fsck that requires the
+ability to freeze the filesystem is the fscounter scrubber, so the code for
+this could be localized to that source file.
+fscounter freeze behaves the same as the VFS freeze method, except:
+
+- The final freeze state is set one higher than ``SB_FREEZE_COMPLETE`` to
+  prevent other threads from thawing the filesystem.
+
+- It does not quiesce the log.
+
+With this code in place, it is now possible to pause the filesystem for just
+long enough to check and correct the summary counters.
+
+The proposed patchset is the
+`summary counter cleanup
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-fscounters>`_
+series.
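
Illustrative note, not part of the patch: the fscounters section describes a global incore counter that hands out batches to each CPU. The toy model below is only a sketch of that description, written in Python for brevity. ``BatchedCounter`` and its methods are hypothetical names invented here; they are not the kernel's percpu counter interface and not XFS code. It shows why a cheap read of the global value can be stale, and why an exact sum is only trustworthy while writers are paused, which is the property the fscounter freeze is meant to provide.

```python
from dataclasses import dataclass, field


@dataclass
class BatchedCounter:
    """Toy free-block counter: a global pool plus per-CPU local batches.

    Hypothetical model of the scheme described in the text, not kernel code.
    """
    global_count: int
    batch: int
    local: dict = field(default_factory=dict)  # cpu id -> blocks held locally

    def reserve(self, cpu: int, nblocks: int) -> bool:
        """Reserve nblocks for a writer on `cpu`, refilling its batch if needed."""
        have = self.local.get(cpu, 0)
        if have < nblocks:
            # Pull at least one batch (or the shortfall) from the global pool.
            want = max(self.batch, nblocks - have)
            grab = min(want, self.global_count)
            self.global_count -= grab
            have += grab
        if have < nblocks:
            self.local[cpu] = have  # keep what was grabbed; reservation fails
            return False
        self.local[cpu] = have - nblocks
        return True

    def give_back(self, cpu: int, nblocks: int) -> None:
        """Return unused blocks to the local batch at commit time."""
        self.local[cpu] = self.local.get(cpu, 0) + nblocks

    def approximate(self) -> int:
        """Cheap read of the global pool; ignores blocks parked in local batches."""
        return self.global_count

    def exact(self) -> int:
        """Full sum; only meaningful while no reserve()/give_back() can run."""
        return self.global_count + sum(self.local.values())


if __name__ == "__main__":
    c = BatchedCounter(global_count=1000, batch=32)
    c.reserve(0, 5)    # CPU 0 takes a batch of 32 and uses 5
    c.reserve(1, 10)   # CPU 1 takes a batch of 32 and uses 10
    print(c.approximate())  # 936: two batches are missing from the global pool
    print(c.exact())        # 985: 1000 minus the 15 blocks actually reserved
```

In this model ``exact()`` is correct only because nothing else runs between the reservations and the sum. On a live system that guarantee has to come from pausing writers for the duration of the walk.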
