diff mbox series

[07/14] xfs: document pageable kernel memory

Message ID 167243825260.682859.10235142095680268936.stgit@magnolia (mailing list archive)
State New, archived
Series xfs: design documentation for online fsck

Commit Message

Darrick J. Wong Dec. 30, 2022, 10:10 p.m. UTC
From: Darrick J. Wong <djwong@kernel.org>

Add a discussion of pageable kernel memory, since online fsck needs
quite a bit more memory than most other parts of the filesystem to stage
records and other information.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 .../filesystems/xfs-online-fsck-design.rst         |  490 ++++++++++++++++++++
 1 file changed, 490 insertions(+)

Comments

Allison Henderson Feb. 2, 2023, 7:14 a.m. UTC | #1
On Fri, 2022-12-30 at 14:10 -0800, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
> 
> Add a discussion of pageable kernel memory, since online fsck needs
> quite a bit more memory than most other parts of the filesystem to
> stage
> records and other information.
> 
> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> ---
>  .../filesystems/xfs-online-fsck-design.rst         |  490
> ++++++++++++++++++++
>  1 file changed, 490 insertions(+)
> 
> 
> diff --git a/Documentation/filesystems/xfs-online-fsck-design.rst
> b/Documentation/filesystems/xfs-online-fsck-design.rst
> index 419eb54ee200..9d7a2ef1d0dd 100644
> --- a/Documentation/filesystems/xfs-online-fsck-design.rst
> +++ b/Documentation/filesystems/xfs-online-fsck-design.rst
> @@ -383,6 +383,8 @@ Algorithms") of Srinivasan.
>  However, any data structure builder that maintains a resource lock
> for the
>  duration of the repair is *always* an offline algorithm.
>  
> +.. _secondary_metadata:
> +
>  Secondary Metadata
>  ``````````````````
>  
> @@ -1746,3 +1748,491 @@ Scrub teardown disables all static keys
> obtained by ``xchk_fshooks_enable``.
>  
>  For more information, please see the kernel documentation of
>  Documentation/staging/static-keys.rst.
> +
> +.. _xfile:
> +
> +Pageable Kernel Memory
> +----------------------
> +
> +Demonstrations of the first few prototypes of online repair revealed
> new
> +technical requirements that were not originally identified.
> +For the first demonstration, the code walked whatever filesystem
> +metadata it needed to synthesize new records and inserted records
> into a new
> +btree as it found them.
> +This was subpar since any additional corruption or runtime errors
> encountered
> +during the walk would shut down the filesystem.
> +After remount, the blocks containing the half-rebuilt data structure
> would not
> +be accessible until another repair was attempted.
> +Solving the problem of half-rebuilt data structures will be
> discussed in the
> +next section.
> +
> +For the second demonstration, the synthesized records were instead
> stored in
> +kernel slab memory.
> +Doing so enabled online repair to abort without writing to the
> filesystem if
> +the metadata walk failed, which prevented online fsck from making
> things worse.
> +However, even this approach needed improvement.
> +
> +There are four reasons why traditional Linux kernel memory
> management isn't
> +suitable for storing large datasets:
> +
> +1. Although it is tempting to allocate a contiguous block of memory
> to create a
> +   C array, this cannot easily be done in the kernel because it
> cannot be
> +   relied upon to allocate multiple contiguous memory pages.
> +
> +2. While disparate physical pages can be virtually mapped together,
> installed
> +   memory might still not be large enough to stage the entire record
> set in
> +   memory while constructing a new btree.
> +
> +3. To overcome these two difficulties, the implementation was
> adjusted to use
> +   doubly linked lists, which means every record object needed two
> 64-bit list
> +   head pointers, which is a lot of overhead.
> +
> +4. Kernel memory is pinned, which can drive the system out of
> memory, leading
> +   to OOM kills of unrelated processes.
> +
I think I might just jump to whatever the current plan is instead of
trying to keep a record of the dev history in the document.  I'm sure
we're not done yet, dev really never is, so in order for the
documentation to be maintained, it would just get bigger and bigger if
we keep documenting it this way.  It's not that the above isn't
valuable, but maybe it belongs in a different kind of document.


> +For the third iteration, attention swung back to the possibility of
> using

Due to the large volume of metadata that needs to be processed, ofsck
uses...

> +byte-indexed array-like storage to reduce the overhead of in-memory
> records.
> +At any given time, online repair does not need to keep the entire
> record set in
> +memory, which means that individual records can be paged out.
> +Creating new temporary files in the XFS filesystem to store
> intermediate data
> +was explored and rejected for some types of repairs because a
> filesystem with
> +compromised space and inode metadata should never be used to fix
> compromised
> +space or inode metadata.
> +However, the kernel already has a facility for byte-addressable and
> pageable
> +storage: shmfs.
> +In-kernel graphics drivers (most notably i915) take advantage of
> shmfs files
> +to store intermediate data that doesn't need to be in memory at all
> times, so
> +that usage precedent is already established.
> +Hence, the ``xfile`` was born!
> +
> +xfile Access Models
> +```````````````````
> +
> +A survey of the intended uses of xfiles suggested these use cases:
> +
> +1. Arrays of fixed-sized records (space management btrees, directory
> and
> +   extended attribute entries)
> +
> +2. Sparse arrays of fixed-sized records (quotas and link counts)
> +
> +3. Large binary objects (BLOBs) of variable sizes (directory and
> extended
> +   attribute names and values)
> +
> +4. Staging btrees in memory (reverse mapping btrees)
> +
> +5. Arbitrary contents (realtime space management)
> +
> +To support the first four use cases, high level data structures wrap
> the xfile
> +to share functionality between online fsck functions.
> +The rest of this section discusses the interfaces that the xfile
> presents to
> +four of those five higher level data structures.
> +The fifth use case is discussed in the :ref:`realtime summary
> <rtsummary>` case
> +study.
> +
> +The most general storage interface supported by the xfile enables
> the reading
> +and writing of arbitrary quantities of data at arbitrary offsets in
> the xfile.
> +This capability is provided by ``xfile_pread`` and ``xfile_pwrite``
> functions,
> +which behave similarly to their userspace counterparts.
> +XFS is very record-based, which suggests that the ability to load
> and store
> +complete records is important.
> +To support these cases, a pair of ``xfile_obj_load`` and
> ``xfile_obj_store``
> +functions are provided to read and persist objects into an xfile.
> +They are internally the same as pread and pwrite, except that they
> treat any
> +error as an out of memory error.
> +For online repair, squashing error conditions in this manner is an
> acceptable
> +behavior because the only reaction is to abort the operation back to
> userspace.
> +All five xfile use cases can be serviced by these four functions.
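> +
> +A minimal sketch of the load/store idiom follows; the record type and
> +the exact signatures here are illustrative, not authoritative:
> +
> +.. code-block:: c
> +
> +       struct rec      rec;    /* hypothetical record type */
> +       int             error;
> +
> +       /* persist one record at a byte offset in the xfile */
> +       error = xfile_obj_store(xf, &rec, sizeof(rec), pos);
> +       if (error)
> +               return error;   /* any failure is reported as ENOMEM */
> +
> +       /* ...later, read the record back from the same offset */
> +       error = xfile_obj_load(xf, &rec, sizeof(rec), pos);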
> +
> +However, no discussion of file access idioms is complete without
> answering the
> +question, "But what about mmap?"
I actually wouldn't spend too much time discussing solutions that
didn't work for whatever reason, unless someone's really asking for it.
I think this section would read just fine if we trim off the last
paragraph here.
 
> +It would be *much* more convenient if kernel code could access
> pageable kernel
> +memory with pointers, just like userspace code does with regular
> memory.
> +Like any other filesystem that uses the page cache, reads and writes
> of xfile
> +data lock the cache page and map it into the kernel address space
> for the
> +duration of the operation.
> +Unfortunately, shmfs can only write a file page to the swap device
> if the page
> +is unmapped and unlocked, which means the xfile risks causing OOM
> problems
> +unless it is careful not to pin too many pages.
> +Therefore, the xfile steers most of its users towards programmatic
> access so
> +that backing pages are not kept locked in memory for longer than is
> necessary.
> +However, for callers performing quick linear scans of xfile data,
> +``xfile_get_page`` and ``xfile_put_page`` functions are provided to
> pin a page
> +in memory.
> +So far, the only code to use these functions is the xfarray
> :ref:`sorting
> +<xfarray_sort>` algorithms.
> +
> +xfile Access Coordination
> +`````````````````````````
> +
> +For security reasons, xfiles must be owned privately by the kernel.
> +They are marked ``S_PRIVATE`` to prevent interference from the
> security system,
> +must never be mapped into process file descriptor tables, and their
> pages must
> +never be mapped into userspace processes.
> +
> +To avoid locking recursion issues with the VFS, all accesses to the
> shmfs file
> +are performed by manipulating the page cache directly.
> +xfile writes call the ``->write_begin`` and ``->write_end``
> functions of the
> +xfile's address space to grab writable pages, copy the caller's
> buffer into the
> +page, and release the pages.
> +xfile reads call ``shmem_read_mapping_page_gfp`` to grab pages
> +directly before
> +copying the contents into the caller's buffer.

xfile readers

> +In other words, xfiles ignore the VFS read and write code paths to
> avoid
> +having to create a dummy ``struct kiocb`` and to avoid taking inode
> and
> +freeze locks.
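> +
> +A rough sketch of that read path, assuming the copy does not cross a
> +page boundary (error handling and locking details are simplified):
> +
> +.. code-block:: c
> +
> +       struct page     *page;
> +       void            *kaddr;
> +
> +       /* grab the shmfs page directly, bypassing the VFS read path */
> +       page = shmem_read_mapping_page_gfp(mapping, pos >> PAGE_SHIFT,
> +                       GFP_NOFS);
> +       if (IS_ERR(page))
> +               return PTR_ERR(page);
> +
> +       /* map the page into kernel address space and copy out of it */
> +       kaddr = kmap_local_page(page);
> +       memcpy(buf, kaddr + offset_in_page(pos), len);
> +       kunmap_local(kaddr);
> +       put_page(page);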
> +
> +If an xfile is shared between threads to stage repairs, the caller
> must provide
> +its own locks to coordinate access.
Ofsck threads that share an xfile between stage repairs will use their
own locks to coordinate access with each other.

?
> +
> +.. _xfarray:
> +
> +Arrays of Fixed-Sized Records
> +`````````````````````````````
> +
> +In XFS, each type of indexed space metadata (free space, inodes,
> reference
> +counts, file fork space, and reverse mappings) consists of a set of
> fixed-size
> +records indexed with a classic B+ tree.
> +Directories have a set of fixed-size dirent records that point to
> the names,
> +and extended attributes have a set of fixed-size attribute keys that
> point to
> +names and values.
> +Quota counters and file link counters index records with numbers.
> +During a repair, scrub needs to stage new records during the
> gathering step and
> +retrieve them during the btree building step.
> +
> +Although this requirement can be satisfied by calling the read and
> write
> +methods of the xfile directly, it is simpler for callers if there is
> +a
> +higher level abstraction to take care of computing array offsets, to
> provide
> +iterator functions, and to deal with sparse records and sorting.
> +The ``xfarray`` abstraction presents a linear array for fixed-size
> records atop
> +the byte-accessible xfile.
> +
> +.. _xfarray_access_patterns:
> +
> +Array Access Patterns
> +^^^^^^^^^^^^^^^^^^^^^
> +
> +Array access patterns in online fsck tend to fall into three
> categories.
> +Iteration of records is assumed to be necessary for all cases and
> will be
> +covered in the next section.
> +
> +The first type of caller handles records that are indexed by
> position.
> +Gaps may exist between records, and a record may be updated multiple
> times
> +during the collection step.
> +In other words, these callers want a sparse linearly addressed table
> file.
> +Typical use cases are quota records and file link count records.
> +Access to array elements is performed programmatically via
> ``xfarray_load`` and
> +``xfarray_store`` functions, which wrap the similarly-named xfile
> functions to
> +provide loading and storing of array elements at arbitrary array
> indices.
> +Gaps are defined to be null records, and null records are defined to
> be a
> +sequence of all zero bytes.
> +Null records are detected by calling ``xfarray_element_is_null``.
> +They are created either by calling ``xfarray_unset`` to null out an
> existing
> +record or by never storing anything to an array index.
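> +
> +A short sketch of the sparse table idiom; the record type shown is
> +hypothetical and the signatures are illustrative:
> +
> +.. code-block:: c
> +
> +       struct rec      rec;    /* hypothetical record type */
> +       int             error;
> +
> +       error = xfarray_load(array, idx, &rec);
> +       if (error)
> +               return error;
> +
> +       if (xfarray_element_is_null(array, &rec)) {
> +               /* a gap: nothing was ever stored at this index */
> +       } else {
> +               rec.count++;    /* update the record in place */
> +               error = xfarray_store(array, idx, &rec);
> +       }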
> +
> +The second type of caller handles records that are not indexed by
> position
> +and do not require multiple updates to a record.
> +The typical use case here is rebuilding space btrees and key/value
> btrees.
> +These callers can add records to the array without caring about
> array indices
> +via the ``xfarray_append`` function, which stores a record at the
> end of the
> +array.
> +For callers that require records to be presentable in a specific
> order (e.g.
> +rebuilding btree data), the ``xfarray_sort`` function can sort the
> +records; this function will be covered later.
> +
> +The third type of caller is a bag, which is useful for counting
> records.
> +The typical use case here is constructing space extent reference
> counts from
> +reverse mapping information.
> +Records can be put in the bag in any order, they can be removed from
> the bag
> +at any time, and uniqueness of records is left to callers.
> +The ``xfarray_store_anywhere`` function is used to insert a record
> in any
> +null record slot in the bag; and the ``xfarray_unset`` function
> removes a
> +record from the bag.
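> +
> +Sketches of the append and bag idioms, with illustrative signatures:
> +
> +.. code-block:: c
> +
> +       /* second type: append; ordering is deferred to xfarray_sort */
> +       error = xfarray_append(array, &rec);
> +
> +       /* third type: a bag; insert anywhere, remove by array index */
> +       error = xfarray_store_anywhere(bag, &rec);
> +       /* ...count the record, then drop it... */
> +       error = xfarray_unset(bag, idx);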
> +
> +The proposed patchset is the
> +`big in-memory array
> +<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=big-array>`_.
> +
> +Iterating Array Elements
> +^^^^^^^^^^^^^^^^^^^^^^^^
> +
> +Most users of the xfarray require the ability to iterate the records
> stored in
> +the array.
> +Callers can probe every possible array index with the following:
> +
> +.. code-block:: c
> +
> +       xfarray_idx_t i;
> +       foreach_xfarray_idx(array, i) {
> +           xfarray_load(array, i, &rec);
> +
> +           /* do something with rec */
> +       }
> +
> +All users of this idiom must be prepared to handle null records or
> must already
> +know that there aren't any.
> +
> +For xfarray users that want to iterate a sparse array, the
> ``xfarray_iter``
> +function skips indices in the xfarray that have never been written
> +to, by
> +calling ``xfile_seek_data`` (which internally uses ``SEEK_DATA``) to
> skip areas
> +of the array that are not populated with memory pages.
> +Once it finds a page, it will skip the zeroed areas of the page.
> +
> +.. code-block:: c
> +
> +       xfarray_idx_t i = XFARRAY_CURSOR_INIT;
> +       while ((ret = xfarray_iter(array, &i, &rec)) == 1) {
> +           /* do something with rec */
> +       }
> +
> +.. _xfarray_sort:
> +
> +Sorting Array Elements
> +^^^^^^^^^^^^^^^^^^^^^^
> +
> +During the fourth demonstration of online repair, a community
> reviewer remarked
> +that for performance reasons, online repair ought to load batches of
> records
> +into btree record blocks instead of inserting records into a new
> btree one at a
> +time.
> +The btree insertion code in XFS is responsible for maintaining
> correct ordering
> +of the records, so naturally the xfarray must also support sorting
> the record
> +set prior to bulk loading.
> +
> +The sorting algorithm used in the xfarray is actually a combination
> of adaptive
> +quicksort and a heapsort subalgorithm in the spirit of
> +`Sedgewick <https://algs4.cs.princeton.edu/23quicksort/>`_ and
> +`pdqsort <https://github.com/orlp/pdqsort>`_, with customizations
> for the Linux
> +kernel.
> +To sort records in a reasonably short amount of time, ``xfarray``
> takes
> +advantage of the binary subpartitioning offered by quicksort, but it
> also uses
> +heapsort to hedge against performance collapse if the chosen
> quicksort pivots
> +are poor.
> +Both algorithms are (in general) O(n * lg(n)), but there is a wide
> performance
> +gulf between the two implementations.
> +
> +The Linux kernel already contains a reasonably fast implementation
> of heapsort.
> +It only operates on regular C arrays, which limits the scope of its
> usefulness.
> +There are two key places where the xfarray uses it:
> +
> +* Sorting any record subset backed by a single xfile page.
> +
> +* Loading a small number of xfarray records from potentially
> disparate parts
> +  of the xfarray into a memory buffer, and sorting the buffer.
> +
> +In other words, ``xfarray`` uses heapsort to constrain the nested
> recursion of
> +quicksort, thereby mitigating quicksort's worst runtime behavior.
> +
> +Choosing a quicksort pivot is a tricky business.
> +A good pivot splits the set to sort in half, leading to the divide
> and conquer
> +behavior that is crucial to O(n * lg(n)) performance.
> +A poor pivot barely splits the subset at all, leading to O(n\
> :sup:`2`)
> +runtime.
> +The xfarray sort routine tries to avoid picking a bad pivot by
> sampling nine
> +records into a memory buffer and using the kernel heapsort to
> identify the
> +median of the nine.
> +
> +Most modern quicksort implementations employ Tukey's "ninther" to
> select a
> +pivot from a classic C array.
> +Typical ninther implementations pick three unique triads of records,
> sort each
> +of the triads, and then sort the middle value of each triad to
> determine the
> +ninther value.
> +As stated previously, however, xfile accesses are not entirely
> cheap.
> +It turned out to be much more performant to read the nine elements
> into a
> +memory buffer, run the kernel's in-memory heapsort on the buffer,
> and choose
> +the 4th element of that buffer as the pivot.
> +Tukey's ninthers are described in J. W. Tukey, `The ninther, a
> technique for
> +low-effort robust (resistant) location in large samples`, in
> *Contributions to
> +Survey Sampling and Applied Statistics*, edited by H. David,
> (Academic Press,
> +1978), pp. 251–257.
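> +
> +A sketch of the pivot selection, where ``xfarray_sort_load`` stands in
> +for whatever helper reads one record into the sample buffer (the names
> +and signatures here are illustrative) and ``sort`` is the kernel
> +heapsort from ``linux/sort.h``:
> +
> +.. code-block:: c
> +
> +       char    samples[9][RECORD_SIZE];
> +       int     i;
> +
> +       /* sample nine records spread evenly across the subset */
> +       for (i = 0; i < 9; i++)
> +               xfarray_sort_load(si, lo + (i * (hi - lo) / 8),
> +                               samples[i]);
> +
> +       /* heapsort the samples; index 4 is the median of the nine */
> +       sort(samples, 9, RECORD_SIZE, cmp_fn, NULL);
> +       memcpy(pivot, samples[4], RECORD_SIZE);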
> +
> +The partitioning of quicksort is fairly textbook -- rearrange the
> record
> +subset around the pivot, then set up the current and next stack
> frames to
> +sort with the larger and the smaller halves of the pivot,
> respectively.
> +This keeps the stack space requirements to log2(record count).
> +
> +As a final performance optimization, the hi and lo scanning phase of
> quicksort
> +keeps examined xfile pages mapped in the kernel for as long as
> possible to
> +reduce map/unmap cycles.
> +Surprisingly, this reduces overall sort runtime by nearly half again
> after
> +accounting for the application of heapsort directly onto xfile
> pages.
This sorting section is insightful, but I think I'd be ok without it
too.  Or maybe save it for later in the document as an "implementation
details" section, or something similar.  It seems like there's still a
lot to cover about how ofsck works in general before we start drilling
into things like the runtime complexity of the sorting algorithm it
uses.

> +
> +Blob Storage
> +````````````
> +
> +Extended attributes and directories add an additional requirement
> for staging
> +records: arbitrary byte sequences of finite length.
> +Each directory entry record needs to store the entry name,
> +and each extended attribute needs to store both the attribute name
> and value.
> +The names, keys, and values can consume a large amount of memory, so
> the
> +``xfblob`` abstraction was created to simplify management of these
> blobs
> +atop an xfile.
> +
> +Blob arrays provide ``xfblob_load`` and ``xfblob_store`` functions
> to retrieve
> +and persist objects.
> +The store function returns a magic cookie for every object that it
> persists.
> +Later, callers provide this cookie to ``xfblob_load`` to recall
> the object.
> +The ``xfblob_free`` function frees a specific blob, and the
> ``xfblob_truncate``
> +function frees them all because compaction is not needed.
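> +
> +The cookie round trip looks roughly like this; the cookie type and the
> +signatures are illustrative:
> +
> +.. code-block:: c
> +
> +       xfblob_cookie   cookie;
> +       int             error;
> +
> +       /* persist a name and remember where it was stored */
> +       error = xfblob_store(blobs, &cookie, name, name_len);
> +       if (error)
> +               return error;
> +
> +       /* ...later, use the cookie to recall the name */
> +       error = xfblob_load(blobs, cookie, name, name_len);
> +
> +       /* free this one blob; xfblob_truncate would free them all */
> +       error = xfblob_free(blobs, cookie);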
> +
> +The details of repairing directories and extended attributes will be
> discussed
> +in a subsequent section about atomic extent swapping.
> +However, it should be noted that these repair functions only use
> blob storage
> +to cache a small number of entries before adding them to a temporary
> ondisk
> +file, which is why compaction is not required.
> +
> +The proposed patchset is at the start of the
> +`extended attribute repair
> +<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-xattrs>`_ series.
> +
> +.. _xfbtree:
> +
> +In-Memory B+Trees
> +`````````````````
> +
> +The chapter about :ref:`secondary metadata<secondary_metadata>`
> mentioned that
> +checking and repairing of secondary metadata commonly requires
> coordination
> +between a live metadata scan of the filesystem and writer threads
> that are
> +updating that metadata.
> +Keeping the scan data up to date requires the ability to
> propagate
> +metadata updates from the filesystem into the data being collected
> by the scan.
> +This *can* be done by appending concurrent updates into a separate
> log file and
> +applying them before writing the new metadata to disk, but this
> leads to
> +unbounded memory consumption if the rest of the system is very busy.
> +Another option is to skip the side-log and commit live updates from
> the
> +filesystem directly into the scan data, which trades more overhead
> for a lower
> +maximum memory requirement.
> +In both cases, the data structure holding the scan results must
> support indexed
> +access to perform well.
> +
> +Given that indexed lookups of scan data are required for both
> strategies, online
> +fsck employs the second strategy of committing live updates directly
> into
> +scan data.
> +Because xfarrays are not indexed and do not enforce record ordering,
> they
> +are not suitable for this task.
> +Conveniently, however, XFS has a library to create and maintain
> ordered reverse
> +mapping records: the existing rmap btree code!
> +If only there were a means to create one in memory.
> +
> +Recall that the :ref:`xfile <xfile>` abstraction represents memory
> pages as a
> +regular file, which means that the kernel can create byte or block
> addressable
> +virtual address spaces at will.
> +The XFS buffer cache specializes in abstracting IO to block-
> oriented address
> +spaces, which means that adaptation of the buffer cache to interface
> with
> +xfiles enables reuse of the entire btree library.
> +Btrees built atop an xfile are collectively known as ``xfbtrees``.
> +The next few sections describe how they actually work.
> +
> +The proposed patchset is the
> +`in-memory btree
> +<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=in-memory-btrees>`_
> +series.
> +
> +Using xfiles as a Buffer Cache Target
> +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> +
> +Two modifications are necessary to support xfiles as a buffer cache
> target.
> +The first is to make it possible for the ``struct xfs_buftarg``
> structure to
> +host the ``struct xfs_buf`` rhashtable, because normally those are
> held by a
> +per-AG structure.
> +The second change is to modify the buffer ``ioapply`` function to
> "read" cached
> +pages from the xfile and "write" cached pages back to the xfile.
> +Concurrent access to individual buffers is controlled by the
> ``xfs_buf`` lock,
> +since the xfile does not provide any locking on its own.
> +With this adaptation in place, users of the xfile-backed buffer
> cache use
> +exactly the same APIs as users of the disk-backed buffer cache.
> +The separation between xfile and buffer cache implies higher memory
> usage since
> +they do not share pages, but this property could some day enable
> transactional
> +updates to an in-memory btree.
> +Today, however, it simply eliminates the need for new code.
> +
> +Space Management with an xfbtree
> +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> +
> +Space management for an xfile is very simple -- each btree block is
> one memory
> +page in size.
> +These blocks use the same header format as an on-disk btree, but the
> in-memory
> +block verifiers ignore the checksums, assuming that xfile memory is
> no more
> +corruption-prone than regular DRAM.
> +Reusing existing code here is more important than absolute memory
> efficiency.
> +
> +The very first block of an xfile backing an xfbtree is a
> +header block.
> +The header describes the owner, height, and the block number of the
> root
> +xfbtree block.
> +
> +To allocate a btree block, use ``xfile_seek_data`` to find a gap in
> the file.
> +If there are no gaps, create one by extending the length of the
> xfile.
> +Preallocate space for the block with ``xfile_prealloc``, and hand
> back the
> +location.
> +To free an xfbtree block, use ``xfile_discard`` (which internally
> uses
> +``FALLOC_FL_PUNCH_HOLE``) to remove the memory page from the xfile.
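> +
> +In rough pseudocode, with a hypothetical gap-finding helper standing
> +in for the ``xfile_seek_data`` logic described above:
> +
> +.. code-block:: c
> +
> +       /* allocate: find (or create) a hole, then reserve the space */
> +       loff_t  pos = find_free_xfile_block(xf);  /* hypothetical */
> +       error = xfile_prealloc(xf, pos, PAGE_SIZE);
> +
> +       /* free: punch the backing page out of the xfile */
> +       error = xfile_discard(xf, pos, PAGE_SIZE);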
> +
> +Populating an xfbtree
> +^^^^^^^^^^^^^^^^^^^^^
> +
> +An online fsck function that wants to create an xfbtree should
> proceed as
> +follows:
> +
> +1. Call ``xfile_create`` to create an xfile.
> +
> +2. Call ``xfs_alloc_memory_buftarg`` to create a buffer cache target
> structure
> +   pointing to the xfile.
> +
> +3. Pass the buffer cache target, buffer ops, and other information
> to
> +   ``xfbtree_create`` to write an initial tree header and root block
> to the
> +   xfile.
> +   Each btree type should define a wrapper that passes necessary
> arguments to
> +   the creation function.
> +   For example, rmap btrees define ``xfs_rmapbt_mem_create`` to take
> care of
> +   all the necessary details for callers.
> +   A ``struct xfbtree`` object will be returned.
> +
> +4. Pass the xfbtree object to the btree cursor creation function for
> the
> +   btree type.
> +   Following the example above, ``xfs_rmapbt_mem_cursor`` takes care
> of this
> +   for callers.
> +
> +5. Pass the btree cursor to the regular btree functions to make
> queries against
> +   and to update the in-memory btree.
> +   For example, a btree cursor for an rmap xfbtree can be passed to
> the
> +   ``xfs_rmap_*`` functions just like any other btree cursor.
> +   See the :ref:`next section<xfbtree_commit>` for information on
> dealing with
> +   xfbtree updates that are logged to a transaction.
> +
> +6. When finished, delete the btree cursor, destroy the xfbtree object,
> +   free the buffer target, and then destroy the xfile to release all
> +   resources, as sketched below.
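> +
> +A condensed sketch of the whole sequence for an rmap xfbtree; error
> +handling is elided and the signatures are illustrative:
> +
> +.. code-block:: c
> +
> +       xf = xfile_create("rmap staging", 0);           /* step 1 */
> +       btp = xfs_alloc_memory_buftarg(mp, xf);         /* step 2 */
> +       xfbt = xfs_rmapbt_mem_create(mp, btp);          /* step 3 */
> +       cur = xfs_rmapbt_mem_cursor(mp, tp, xfbt);      /* step 4 */
> +
> +       /* step 5: use the regular rmap btree functions */
> +       error = xfs_rmap_map_raw(cur, &rmap);
> +
> +       /* step 6: tear everything down again */
> +       xfs_btree_del_cursor(cur, error);
> +       xfbtree_destroy(xfbt);                          /* hypothetical */
> +       xfs_free_buftarg(btp);
> +       xfile_destroy(xf);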
> +
> +.. _xfbtree_commit:
> +
> +Committing Logged xfbtree Buffers
> +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> +
> +Although it is a clever hack to reuse the rmap btree code to handle
> the staging
> +structure, the ephemeral nature of the in-memory btree block storage
> presents
> +some challenges of its own.
> +The XFS transaction manager must not commit buffer log items for
> buffers backed
> +by an xfile because the log format does not understand updates for
> devices
> +other than the data device.
> +An ephemeral xfbtree probably will not exist by the time the AIL
> checkpoints
> +log transactions back into the filesystem, and certainly won't exist
> during
> +log recovery.
> +For these reasons, any code updating an xfbtree in transaction
> context must
> +remove the buffer log items from the transaction and write the
> updates into the
> +backing xfile before committing or cancelling the transaction.
> +
> +The ``xfbtree_trans_commit`` and ``xfbtree_trans_cancel`` functions
> implement
> +this functionality as follows:
> +
> +1. Find each buffer log item whose buffer targets the xfile.
> +
> +2. Record the dirty/ordered status of the log item.
> +
> +3. Detach the log item from the buffer.
> +
> +4. Queue the buffer to a special delwri list.
> +
> +5. Clear the transaction dirty flag if the only dirty log items were
> the ones
> +   that were detached in step 3.
> +
> +6. Submit the delwri list to commit the changes to the xfile, if the
> updates
> +   are being committed.
> +
> +After removing xfile logged buffers from the transaction in this
> manner, the
> +transaction can be committed or cancelled.
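> +
> +A heavily simplified sketch of the commit path; the two helpers marked
> +below are assumptions for illustration, not actual patchset code:
> +
> +.. code-block:: c
> +
> +       struct xfs_log_item     *lip, *n;
> +       LIST_HEAD(buffer_list);
> +       bool                    dirty;
> +
> +       list_for_each_entry_safe(lip, n, &tp->t_items, li_trans) {
> +               /* step 1: only buffers backed by the xfile */
> +               struct xfs_buf  *bp = xfbtree_buf_from_item(lip); /* assumed */
> +
> +               if (!bp)
> +                       continue;
> +
> +               /* step 2: remember whether the log item was dirty */
> +               dirty = test_bit(XFS_LI_DIRTY, &lip->li_flags);
> +
> +               /* steps 3-4: detach the log item, queue the buffer */
> +               xfbtree_detach_bli(bp);                 /* assumed */
> +               if (dirty)
> +                       xfs_buf_delwri_queue(bp, &buffer_list);
> +       }
> +
> +       /*
> +        * steps 5-6: clear the transaction dirty flag if appropriate,
> +        * then write the queued buffers back into the xfile.
> +        */
> +       xfs_buf_delwri_submit(&buffer_list);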
Rest of this looks pretty good, organizing nits aside.

Allison

Darrick J. Wong Feb. 2, 2023, 11:14 p.m. UTC | #2
On Thu, Feb 02, 2023 at 07:14:22AM +0000, Allison Henderson wrote:
> On Fri, 2022-12-30 at 14:10 -0800, Darrick J. Wong wrote:
> > From: Darrick J. Wong <djwong@kernel.org>
> > 
> > Add a discussion of pageable kernel memory, since online fsck needs
> > quite a bit more memory than most other parts of the filesystem to
> > stage
> > records and other information.
> > 
> > Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> > ---
> >  .../filesystems/xfs-online-fsck-design.rst         |  490
> > ++++++++++++++++++++
> >  1 file changed, 490 insertions(+)
> > 
> > 
> > diff --git a/Documentation/filesystems/xfs-online-fsck-design.rst
> > b/Documentation/filesystems/xfs-online-fsck-design.rst
> > index 419eb54ee200..9d7a2ef1d0dd 100644
> > --- a/Documentation/filesystems/xfs-online-fsck-design.rst
> > +++ b/Documentation/filesystems/xfs-online-fsck-design.rst
> > @@ -383,6 +383,8 @@ Algorithms") of Srinivasan.
> >  However, any data structure builder that maintains a resource lock
> > for the
> >  duration of the repair is *always* an offline algorithm.
> >  
> > +.. _secondary_metadata:
> > +
> >  Secondary Metadata
> >  ``````````````````
> >  
> > @@ -1746,3 +1748,491 @@ Scrub teardown disables all static keys
> > obtained by ``xchk_fshooks_enable``.
> >  
> >  For more information, please see the kernel documentation of
> >  Documentation/staging/static-keys.rst.
> > +
> > +.. _xfile:
> > +
> > +Pageable Kernel Memory
> > +----------------------
> > +
> > +Demonstrations of the first few prototypes of online repair revealed
> > new
> > +technical requirements that were not originally identified.
> > +For the first demonstration, the code walked whatever filesystem
> > +metadata it needed to synthesize new records and inserted records
> > into a new
> > +btree as it found them.
> > +This was subpar since any additional corruption or runtime errors
> > encountered
> > +during the walk would shut down the filesystem.
> > +After remount, the blocks containing the half-rebuilt data structure
> > would not
> > +be accessible until another repair was attempted.
> > +Solving the problem of half-rebuilt data structures will be
> > discussed in the
> > +next section.
> > +
> > +For the second demonstration, the synthesized records were instead
> > stored in
> > +kernel slab memory.
> > +Doing so enabled online repair to abort without writing to the
> > filesystem if
> > +the metadata walk failed, which prevented online fsck from making
> > things worse.
> > +However, even this approach needed improvement.
> > +
> > +There are four reasons why traditional Linux kernel memory
> > management isn't
> > +suitable for storing large datasets:
> > +
> > +1. Although it is tempting to allocate a contiguous block of memory
> > to create a
> > +   C array, this cannot easily be done in the kernel because it
> > cannot be
> > +   relied upon to allocate multiple contiguous memory pages.
> > +
> > +2. While disparate physical pages can be virtually mapped together,
> > installed
> > +   memory might still not be large enough to stage the entire record
> > set in
> > +   memory while constructing a new btree.
> > +
> > +3. To overcome these two difficulties, the implementation was
> > adjusted to use
> > +   doubly linked lists, which means every record object needed two
> > 64-bit list
> > +   head pointers, which is a lot of overhead.
> > +
> > +4. Kernel memory is pinned, which can drive the system out of
> > memory, leading
> > +   to OOM kills of unrelated processes.
> > +
> I think I might just jump to whatever the current plan is instead of
> trying to keep a record of the dev history in the document.  I'm sure
> we're not done yet, dev really never is, so in order for the
> documentation to be maintained, it would just get bigger and bigger if
> we keep documenting it this way.  It's not that the above isn't
> valuable, but maybe it belongs in a different kind of document.

OK, I've shortened this introduction to outline the requirements, and
trimmed the historical information to a sidebar:

"Some online checking functions work by scanning the filesystem to build
a shadow copy of an ondisk metadata structure in memory and comparing
the two copies. For online repair to rebuild a metadata structure, it
must compute the record set that will be stored in the new structure
before it can persist that new structure to disk. Ideally, repairs
complete with a single atomic commit that introduces a new data
structure. To meet these goals, the kernel needs to collect a large
amount of information in a place that doesn’t require the correct
operation of the filesystem.

"Kernel memory isn’t suitable because:

*   Allocating a contiguous region of memory to create a C array is very
    difficult, especially on 32-bit systems.

*   Linked lists of records introduce double pointer overhead which is
    very high and eliminate the possibility of indexed lookups.

*   Kernel memory is pinned, which can drive the system into OOM
    conditions.

*   The system might not have sufficient memory to stage all the
    information.

"At any given time, online fsck does not need to keep the entire record
set in memory, which means that individual records can be paged out if
necessary. Continued development of online fsck demonstrated that the
ability to perform indexed data storage would also be very useful.
Fortunately, the Linux kernel already has a facility for
byte-addressable and pageable storage: tmpfs. In-kernel graphics drivers
(most notably i915) take advantage of tmpfs files to store intermediate
data that doesn’t need to be in memory at all times, so that usage
precedent is already established. Hence, the xfile was born!

Historical Sidebar
------------------

"The first edition of online repair inserted records into a new btree as
it found them, which failed because the filesystem could shut down with
a half-rebuilt data structure, which would be live after recovery
finished.

"The second edition solved the half-rebuilt structure problem by storing
everything in memory, but frequently ran the system out of memory.

"The third edition solved the OOM problem by using linked lists, but the
list overhead was extreme."

> 
> 
> > +For the third iteration, attention swung back to the possibility of
> > using
> 
> Due to the large volume of metadata that needs to be processed, ofsck
> uses...
> 
> > +byte-indexed array-like storage to reduce the overhead of in-memory
> > records.
> > +At any given time, online repair does not need to keep the entire
> > record set in
> > +memory, which means that individual records can be paged out.
> > +Creating new temporary files in the XFS filesystem to store
> > intermediate data
> > +was explored and rejected for some types of repairs because a
> > filesystem with
> > +compromised space and inode metadata should never be used to fix
> > compromised
> > +space or inode metadata.
> > +However, the kernel already has a facility for byte-addressable and
> > pageable
> > +storage: shmfs.
> > +In-kernel graphics drivers (most notably i915) take advantage of
> > shmfs files
> > +to store intermediate data that doesn't need to be in memory at all
> > times, so
> > +that usage precedent is already established.
> > +Hence, the ``xfile`` was born!
> > +
> > +xfile Access Models
> > +```````````````````
> > +
> > +A survey of the intended uses of xfiles suggested these use cases:
> > +
> > +1. Arrays of fixed-sized records (space management btrees, directory
> > and
> > +   extended attribute entries)
> > +
> > +2. Sparse arrays of fixed-sized records (quotas and link counts)
> > +
> > +3. Large binary objects (BLOBs) of variable sizes (directory and
> > extended
> > +   attribute names and values)
> > +
> > +4. Staging btrees in memory (reverse mapping btrees)
> > +
> > +5. Arbitrary contents (realtime space management)
> > +
> > +To support the first four use cases, high level data structures wrap
> > the xfile
> > +to share functionality between online fsck functions.
> > +The rest of this section discusses the interfaces that the xfile
> > presents to
> > +four of those five higher level data structures.
> > +The fifth use case is discussed in the :ref:`realtime summary
> > <rtsummary>` case
> > +study.
> > +
> > +The most general storage interface supported by the xfile enables
> > the reading
> > +and writing of arbitrary quantities of data at arbitrary offsets in
> > the xfile.
> > +This capability is provided by ``xfile_pread`` and ``xfile_pwrite``
> > functions,
> > +which behave similarly to their userspace counterparts.
> > +XFS is very record-based, which suggests that the ability to load
> > and store
> > +complete records is important.
> > +To support these cases, a pair of ``xfile_obj_load`` and
> > ``xfile_obj_store``
> > +functions are provided to read and persist objects into an xfile.
> > +They are internally the same as pread and pwrite, except that they
> > treat any
> > +error as an out of memory error.
> > +For online repair, squashing error conditions in this manner is an
> > acceptable
> > +behavior because the only reaction is to abort the operation back to
> > userspace.
> > +All five xfile use cases can be serviced by these four functions.
> > +
> > +However, no discussion of file access idioms is complete without
> > answering the
> > +question, "But what about mmap?"
> I actually wouldn't spend too much time discussing solutions that
> didn't work for whatever reason, unless someone's really asking for it.
> I think this section would read just fine if we trim off the last
> paragraph here.

Since I wrote this, I've been experimenting with wiring up the tmpfs
file page cache folios to the xfs buffer cache.  Pinning the folios in
this manner makes it so that online fsck can (more or less) directly
access the xfile contents.  Much to my surprise, this has actually held
up in testing, so ... it's no longer a solution that "didn't really
work". :)

I also need to s/page/folio/ now that willy has finished that
conversion.  This section has been rewritten as such:

"However, no discussion of file access idioms is complete without
answering the question, “But what about mmap?” It is convenient to
access storage directly with pointers, just like userspace code does
with regular memory. Online fsck must not drive the system into OOM
conditions, which means that xfiles must be responsive to memory
reclamation. tmpfs can only push a pagecache folio to the swap cache if
the folio is neither pinned nor locked, which means the xfile must not
pin too many folios.

"Short term direct access to xfile contents is done by locking the
pagecache folio and mapping it into kernel address space. Programmatic
access (e.g. pread and pwrite) uses this mechanism. Folio locks are not
supposed to be held for long periods of time, so long term direct access
to xfile contents is done by bumping the folio refcount, mapping it into
kernel address space, and dropping the folio lock. These long term users
must be responsive to memory reclaim by hooking into the shrinker
infrastructure to know when to release folios.

"The xfile_get_page and xfile_put_page functions are provided to
retrieve the (locked) folio that backs part of an xfile and to release
it. The only code to use these folio lease functions are the xfarray
sorting algorithms and the in-memory btrees."

> > +It would be *much* more convenient if kernel code could access
> > pageable kernel
> > +memory with pointers, just like userspace code does with regular
> > memory.
> > +Like any other filesystem that uses the page cache, reads and writes
> > of xfile
> > +data lock the cache page and map it into the kernel address space
> > for the
> > +duration of the operation.
> > +Unfortunately, shmfs can only write a file page to the swap device
> > if the page
> > +is unmapped and unlocked, which means the xfile risks causing OOM
> > problems
> > +unless it is careful not to pin too many pages.
> > +Therefore, the xfile steers most of its users towards programmatic
> > access so
> > +that backing pages are not kept locked in memory for longer than is
> > necessary.
> > +However, for callers performing quick linear scans of xfile data,
> > +``xfile_get_page`` and ``xfile_put_page`` functions are provided to
> > pin a page
> > +in memory.
> > +So far, the only code to use these functions is the xfarray
> > :ref:`sorting
> > +<xfarray_sort>` algorithms.
> > +
> > +xfile Access Coordination
> > +`````````````````````````
> > +
> > +For security reasons, xfiles must be owned privately by the kernel.
> > +They are marked ``S_PRIVATE`` to prevent interference from the
> > security system,
> > +must never be mapped into process file descriptor tables, and their
> > pages must
> > +never be mapped into userspace processes.
> > +
> > +To avoid locking recursion issues with the VFS, all accesses to the
> > shmfs file
> > +are performed by manipulating the page cache directly.
> > +xfile writes call the ``->write_begin`` and ``->write_end``
> > functions of the
> > +xfile's address space to grab writable pages, copy the caller's
> > buffer into the
> > +page, and release the pages.
> > +xfile reads call ``shmem_read_mapping_page_gfp`` to grab pages
> xfile readers

OK.

> > directly before
> > +copying the contents into the caller's buffer.
> > +In other words, xfiles ignore the VFS read and write code paths to
> > avoid
> > +having to create a dummy ``struct kiocb`` and to avoid taking inode
> > and
> > +freeze locks.
> > +
> > +If an xfile is shared between threads to stage repairs, the caller
> > must provide
> > +its own locks to coordinate access.
> Ofsck threads that share an xfile between stage repairs will use their
> own locks to coordinate access with each other.
> 
> ?

Hm.  I wonder if there's a misunderstanding here?

Online fsck functions themselves are single-threaded, which is to say
that they themselves neither queue workers nor start kthreads.  However,
an xfile created by a running fsck function can be accessed from other
threads if the fsck function also hooks itself into filesystem code.

The live update section has a nice diagram of how that works:
https://djwong.org/docs/xfs-online-fsck-design/#filesystem-hooks

> > +
> > +.. _xfarray:
> > +
> > +Arrays of Fixed-Sized Records
> > +`````````````````````````````
> > +
> > +In XFS, each type of indexed space metadata (free space, inodes,
> > reference
> > +counts, file fork space, and reverse mappings) consists of a set of
> > fixed-size
> > +records indexed with a classic B+ tree.
> > +Directories have a set of fixed-size dirent records that point to
> > the names,
> > +and extended attributes have a set of fixed-size attribute keys that
> > point to
> > +names and values.
> > +Quota counters and file link counters index records with numbers.
> > +During a repair, scrub needs to stage new records during the
> > gathering step and
> > +retrieve them during the btree building step.
> > +
> > +Although this requirement can be satisfied by calling the read and
> > write
> > +methods of the xfile directly, it is simpler for callers if there is
> > +a
> > +higher level abstraction to take care of computing array offsets, to
> > provide
> > +iterator functions, and to deal with sparse records and sorting.
> > +The ``xfarray`` abstraction presents a linear array for fixed-size
> > records atop
> > +the byte-accessible xfile.
> > +
> > +.. _xfarray_access_patterns:
> > +
> > +Array Access Patterns
> > +^^^^^^^^^^^^^^^^^^^^^
> > +
> > +Array access patterns in online fsck tend to fall into three
> > categories.
> > +Iteration of records is assumed to be necessary for all cases and
> > will be
> > +covered in the next section.
> > +
> > +The first type of caller handles records that are indexed by
> > position.
> > +Gaps may exist between records, and a record may be updated multiple
> > times
> > +during the collection step.
> > +In other words, these callers want a sparse linearly addressed table
> > file.
> > +Typical use cases are quota records and file link count records.
> > +Access to array elements is performed programmatically via
> > ``xfarray_load`` and
> > +``xfarray_store`` functions, which wrap the similarly-named xfile
> > functions to
> > +provide loading and storing of array elements at arbitrary array
> > indices.
> > +Gaps are defined to be null records, and null records are defined to
> > be a
> > +sequence of all zero bytes.
> > +Null records are detected by calling ``xfarray_element_is_null``.
> > +They are created either by calling ``xfarray_unset`` to null out an
> > existing
> > +record or by never storing anything to an array index.
> > +
> > +The second type of caller handles records that are not indexed by
> > position
> > +and do not require multiple updates to a record.
> > +The typical use case here is rebuilding space btrees and key/value
> > btrees.
> > +These callers can add records to the array without caring about
> > array indices
> > +via the ``xfarray_append`` function, which stores a record at the
> > end of the
> > +array.
> > +For callers that require records to be presentable in a specific
> > order (e.g.
> > +rebuilding btree data), the ``xfarray_sort`` function can sort the
> > +records; this function will be covered later.
> > +
> > +The third type of caller is a bag, which is useful for counting
> > records.
> > +The typical use case here is constructing space extent reference
> > counts from
> > +reverse mapping information.
> > +Records can be put in the bag in any order, they can be removed from
> > the bag
> > +at any time, and uniqueness of records is left to callers.
> > +The ``xfarray_store_anywhere`` function is used to insert a record
> > in any
> > +null record slot in the bag; and the ``xfarray_unset`` function
> > removes a
> > +record from the bag.
> > +
> > +The proposed patchset is the
> > +`big in-memory array
> > +<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=big-array>`_.
> > +
> > +Iterating Array Elements
> > +^^^^^^^^^^^^^^^^^^^^^^^^
> > +
> > +Most users of the xfarray require the ability to iterate the records
> > stored in
> > +the array.
> > +Callers can probe every possible array index with the following:
> > +
> > +.. code-block:: c
> > +
> > +       xfarray_idx_t i;
> > +       foreach_xfarray_idx(array, i) {
> > +           xfarray_load(array, i, &rec);
> > +
> > +           /* do something with rec */
> > +       }
> > +
> > +All users of this idiom must be prepared to handle null records or
> > must already
> > +know that there aren't any.
> > +
> > +For xfarray users that want to iterate a sparse array, the
> > ``xfarray_iter``
> > +function skips indices in the xfarray that have never been written
> > +to, by
> > +calling ``xfile_seek_data`` (which internally uses ``SEEK_DATA``) to
> > skip areas
> > +of the array that are not populated with memory pages.
> > +Once it finds a page, it will skip the zeroed areas of the page.
> > +
> > +.. code-block:: c
> > +
> > +       xfarray_idx_t i = XFARRAY_CURSOR_INIT;
> > +       while ((ret = xfarray_iter(array, &i, &rec)) == 1) {
> > +           /* do something with rec */
> > +       }
> > +
> > +.. _xfarray_sort:
> > +
> > +Sorting Array Elements
> > +^^^^^^^^^^^^^^^^^^^^^^
> > +
> > +During the fourth demonstration of online repair, a community
> > reviewer remarked
> > +that for performance reasons, online repair ought to load batches of
> > records
> > +into btree record blocks instead of inserting records into a new
> > btree one at a
> > +time.
> > +The btree insertion code in XFS is responsible for maintaining
> > correct ordering
> > +of the records, so naturally the xfarray must also support sorting
> > the record
> > +set prior to bulk loading.
> > +
> > +The sorting algorithm used in the xfarray is actually a combination
> > of adaptive
> > +quicksort and a heapsort subalgorithm in the spirit of
> > +`Sedgewick <https://algs4.cs.princeton.edu/23quicksort/>`_ and
> > +`pdqsort <https://github.com/orlp/pdqsort>`_, with customizations
> > for the Linux
> > +kernel.
> > +To sort records in a reasonably short amount of time, ``xfarray``
> > takes
> > +advantage of the binary subpartitioning offered by quicksort, but it
> > also uses
> > +heapsort to hedge against performance collapse if the chosen
> > quicksort pivots
> > +are poor.
> > +Both algorithms are (in general) O(n * lg(n)), but there is a wide
> > performance
> > +gulf between the two implementations.
> > +
> > +The Linux kernel already contains a reasonably fast implementation
> > of heapsort.
> > +It only operates on regular C arrays, which limits the scope of its
> > usefulness.
> > +There are two key places where the xfarray uses it:
> > +
> > +* Sorting any record subset backed by a single xfile page.
> > +
> > +* Loading a small number of xfarray records from potentially
> > disparate parts
> > +  of the xfarray into a memory buffer, and sorting the buffer.
> > +
> > +In other words, ``xfarray`` uses heapsort to constrain the nested
> > recursion of
> > +quicksort, thereby mitigating quicksort's worst runtime behavior.
> > +
> > +Choosing a quicksort pivot is a tricky business.
> > +A good pivot splits the set to sort in half, leading to the divide
> > and conquer
> > +behavior that is crucial to O(n * lg(n)) performance.
> > +A poor pivot barely splits the subset at all, leading to O(n\
> > :sup:`2`)
> > +runtime.
> > +The xfarray sort routine tries to avoid picking a bad pivot by
> > sampling nine
> > +records into a memory buffer and using the kernel heapsort to
> > identify the
> > +median of the nine.
> > +
> > +Most modern quicksort implementations employ Tukey's "ninther" to
> > select a
> > +pivot from a classic C array.
> > +Typical ninther implementations pick three unique triads of records,
> > sort each
> > +of the triads, and then sort the middle value of each triad to
> > determine the
> > +ninther value.
> > +As stated previously, however, xfile accesses are not entirely
> > cheap.
> > +It turned out to be much more performant to read the nine elements
> > into a
> > +memory buffer, run the kernel's in-memory heapsort on the buffer,
> > and choose
> > +the 4th element of that buffer as the pivot.
> > +Tukey's ninthers are described in J. W. Tukey, `The ninther, a
> > technique for
> > +low-effort robust (resistant) location in large samples`, in
> > *Contributions to
> > +Survey Sampling and Applied Statistics*, edited by H. David,
> > (Academic Press,
> > +1978), pp. 251–257.
> > +
> > +The partitioning of quicksort is fairly textbook -- rearrange the
> > record
> > +subset around the pivot, then set up the current and next stack
> > frames to
> > +sort with the larger and the smaller halves of the pivot,
> > respectively.
> > +This keeps the stack space requirements to log2(record count).
> > +
> > +As a final performance optimization, the hi and lo scanning phase of
> > quicksort
> > +keeps examined xfile pages mapped in the kernel for as long as
> > possible to
> > +reduce map/unmap cycles.
> > +Surprisingly, this reduces overall sort runtime by nearly half again
> > after
> > +accounting for the application of heapsort directly onto xfile
> > pages.
> This sorting section is insightful, but I think I'd be ok without it
> too.  Or maybe save it for later in the document as an "implementation
> details" section, or something similar.  It seems like there's still a
> lot to cover about how ofsck works in general before we start drilling
> into things like the runtime complexity of the sorting algorithm it
> uses.

How about I demote the details of how sorting works to a case study?

> > +
> > +Blob Storage
> > +````````````
> > +
> > +Extended attributes and directories add an additional requirement
> > for staging
> > +records: arbitrary byte sequences of finite length.
> > +Each directory entry record needs to store entry name,
> > +and each extended attribute needs to store both the attribute name
> > and value.
> > +The names, keys, and values can consume a large amount of memory, so
> > the
> > +``xfblob`` abstraction was created to simplify management of these
> > blobs
> > +atop an xfile.
> > +
> > +Blob arrays provide ``xfblob_load`` and ``xfblob_store`` functions
> > to retrieve
> > +and persist objects.
> > +The store function returns a magic cookie for every object that it
> > persists.
> > +Later, callers provide this cookie to the ``xblob_load`` to recall
> > the object.
> > +The ``xfblob_free`` function frees a specific blob, and the
> > ``xfblob_truncate``
> > +function frees them all because compaction is not needed.
> > +
> > +The details of repairing directories and extended attributes will be
> > discussed
> > +in a subsequent section about atomic extent swapping.
> > +However, it should be noted that these repair functions only use
> > blob storage
> > +to cache a small number of entries before adding them to a temporary
> > ondisk
> > +file, which is why compaction is not required.
> > +
> > +The proposed patchset is at the start of the
> > +`extended attribute repair
> > +<
> > https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/
> > log/?h=repair-xattrs>`_ series.
> > +
> > +.. _xfbtree:
> > +
> > +In-Memory B+Trees
> > +`````````````````
> > +
> > +The chapter about :ref:`secondary metadata<secondary_metadata>`
> > mentioned that
> > +checking and repairing of secondary metadata commonly requires
> > coordination
> > +between a live metadata scan of the filesystem and writer threads
> > that are
> > +updating that metadata.
> > +Keeping the scan data up to date requires requires the ability to
> > propagate
> > +metadata updates from the filesystem into the data being collected
> > by the scan.
> > +This *can* be done by appending concurrent updates into a separate
> > log file and
> > +applying them before writing the new metadata to disk, but this
> > leads to
> > +unbounded memory consumption if the rest of the system is very busy.
> > +Another option is to skip the side-log and commit live updates from
> > the
> > +filesystem directly into the scan data, which trades more overhead
> > for a lower
> > +maximum memory requirement.
> > +In both cases, the data structure holding the scan results must
> > support indexed
> > +access to perform well.
> > +
> > +Given that indexed lookups of scan data is required for both
> > strategies, online
> > +fsck employs the second strategy of committing live updates directly
> > into
> > +scan data.
> > +Because xfarrays are not indexed and do not enforce record ordering,
> > they
> > +are not suitable for this task.
> > +Conveniently, however, XFS has a library to create and maintain
> > ordered reverse
> > +mapping records: the existing rmap btree code!
> > +If only there was a means to create one in memory.
> > +
> > +Recall that the :ref:`xfile <xfile>` abstraction represents memory
> > pages as a
> > +regular file, which means that the kernel can create byte or block
> > addressable
> > +virtual address spaces at will.
> > +The XFS buffer cache specializes in abstracting IO to block-
> > oriented  address
> > +spaces, which means that adaptation of the buffer cache to interface
> > with
> > +xfiles enables reuse of the entire btree library.
> > +Btrees built atop an xfile are collectively known as ``xfbtrees``.
> > +The next few sections describe how they actually work.
> > +
> > +The proposed patchset is the
> > +`in-memory btree
> > +<
> > https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/
> > log/?h=in-memory-btrees>`_
> > +series.
> > +
> > +Using xfiles as a Buffer Cache Target
> > +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> > +
> > +Two modifications are necessary to support xfiles as a buffer cache
> > target.
> > +The first is to make it possible for the ``struct xfs_buftarg``
> > structure to
> > +host the ``struct xfs_buf`` rhashtable, because normally those are
> > held by a
> > +per-AG structure.
> > +The second change is to modify the buffer ``ioapply`` function to
> > "read" cached
> > +pages from the xfile and "write" cached pages back to the xfile.
> > +Multiple access to individual buffers is controlled by the
> > ``xfs_buf`` lock,
> > +since the xfile does not provide any locking on its own.
> > +With this adaptation in place, users of the xfile-backed buffer
> > cache use
> > +exactly the same APIs as users of the disk-backed buffer cache.
> > +The separation between xfile and buffer cache implies higher memory
> > usage since
> > +they do not share pages, but this property could some day enable
> > transactional
> > +updates to an in-memory btree.
> > +Today, however, it simply eliminates the need for new code.
> > +
> > +Space Management with an xfbtree
> > +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> > +
> > +Space management for an xfile is very simple -- each btree block is
> > one memory
> > +page in size.
> > +These blocks use the same header format as an on-disk btree, but the
> > in-memory
> > +block verifiers ignore the checksums, assuming that xfile memory is
> > no more
> > +corruption-prone than regular DRAM.
> > +Reusing existing code here is more important than absolute memory
> > efficiency.
> > +
> > +The very first block of an xfile backing an xfbtree contains a
> > header block.
> > +The header describes the owner, height, and the block number of the
> > root
> > +xfbtree block.
> > +
> > +To allocate a btree block, use ``xfile_seek_data`` to find a gap in
> > the file.
> > +If there are no gaps, create one by extending the length of the
> > xfile.
> > +Preallocate space for the block with ``xfile_prealloc``, and hand
> > back the
> > +location.
> > +To free an xfbtree block, use ``xfile_discard`` (which internally
> > uses
> > +``FALLOC_FL_PUNCH_HOLE``) to remove the memory page from the xfile.
> > +
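A minimal sketch of that allocation scheme follows; every name and
signature below (``xfbtree_find_gap``, the ``xfile`` and ``xf_size``
members) is an assumption for illustration:

.. code-block:: c

        static int
        xfbtree_alloc_block(
                struct xfbtree          *xfbt,
                loff_t                  *pos)
        {
                loff_t                  gap;
                int                     error;

                /*
                 * Look for a hole punched by an earlier block free; if
                 * there are none, put the new block at EOF, which
                 * extends the xfile.
                 */
                gap = xfbtree_find_gap(xfbt);
                if (gap < 0)
                        gap = xfbt->xf_size;

                /* Back the location with a memory page before use. */
                error = xfile_prealloc(xfbt->xfile, gap, PAGE_SIZE);
                if (error)
                        return error;

                *pos = gap;
                return 0;
        }

        static void
        xfbtree_free_block(
                struct xfbtree          *xfbt,
                loff_t                  pos)
        {
                /* Punch the page out so a later allocation can reuse it. */
                xfile_discard(xfbt->xfile, pos, PAGE_SIZE);
        }
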
> > +Populating an xfbtree
> > +^^^^^^^^^^^^^^^^^^^^^
> > +
> > +An online fsck function that wants to create an xfbtree should
> > proceed as
> > +follows:
> > +
> > +1. Call ``xfile_create`` to create an xfile.
> > +
> > +2. Call ``xfs_alloc_memory_buftarg`` to create a buffer cache target
> > structure
> > +   pointing to the xfile.
> > +
> > +3. Pass the buffer cache target, buffer ops, and other information
> > to
> > +   ``xfbtree_create`` to write an initial tree header and root block
> > to the
> > +   xfile.
> > +   Each btree type should define a wrapper that passes necessary
> > arguments to
> > +   the creation function.
> > +   For example, rmap btrees define ``xfs_rmapbt_mem_create`` to take
> > care of
> > +   all the necessary details for callers.
> > +   A ``struct xfbtree`` object will be returned.
> > +
> > +4. Pass the xfbtree object to the btree cursor creation function for
> > the
> > +   btree type.
> > +   Following the example above, ``xfs_rmapbt_mem_cursor`` takes care
> > of this
> > +   for callers.
> > +
> > +5. Pass the btree cursor to the regular btree functions to make
> > queries against
> > +   and to update the in-memory btree.
> > +   For example, a btree cursor for an rmap xfbtree can be passed to
> > the
> > +   ``xfs_rmap_*`` functions just like any other btree cursor.
> > +   See the :ref:`next section<xfbtree_commit>` for information on
> > dealing with
> > +   xfbtree updates that are logged to a transaction.
> > +
> > +6. When finished, delete the btree cursor, destroy the xfbtree
> > object, free the
> > +   buffer target, and then destroy the xfile to release all
> > resources.
> > +
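Strung together, the six steps might look like the sketch below, given a
mount ``mp`` and transaction ``tp``. Only the function names called out
in the steps come from the patchset; the teardown helpers in step 6 and
the exact argument lists are assumptions:

.. code-block:: c

        struct xfile            *xf;
        struct xfs_buftarg      *btp;
        struct xfbtree          *xfbt;
        struct xfs_btree_cur    *cur;
        struct xfs_rmap_irec    rmap;
        int                     error;

        error = xfile_create(mp, "rmap staging", &xf);          /* step 1 */
        if (error)
                return error;

        btp = xfs_alloc_memory_buftarg(mp, xf);                 /* step 2 */
        xfbt = xfs_rmapbt_mem_create(mp, btp);                  /* step 3 */
        cur = xfs_rmapbt_mem_cursor(mp, tp, xfbt);              /* step 4 */

        /* step 5: regular rmap btree updates against the cursor */
        error = xfs_rmap_map_raw(cur, &rmap);

        /* step 6: tear everything down, back to front */
        xfs_btree_del_cursor(cur, error);
        xfbtree_destroy(xfbt);          /* hypothetical teardown helper */
        xfs_free_buftarg(btp);
        xfile_destroy(xf);              /* hypothetical teardown helper */
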
> > +.. _xfbtree_commit:
> > +
> > +Committing Logged xfbtree Buffers
> > +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> > +
> > +Although it is a clever hack to reuse the rmap btree code to handle
> > the staging
> > +structure, the ephemeral nature of the in-memory btree block storage
> > presents
> > +some challenges of its own.
> > +The XFS transaction manager must not commit buffer log items for
> > buffers backed
> > +by an xfile because the log format does not understand updates for
> > devices
> > +other than the data device.
> > +An ephemeral xfbtree probably will not exist by the time the AIL
> > checkpoints
> > +log transactions back into the filesystem, and certainly won't exist
> > during
> > +log recovery.
> > +For these reasons, any code updating an xfbtree in transaction
> > context must
> > +remove the buffer log items from the transaction and write the
> > updates into the
> > +backing xfile before committing or cancelling the transaction.
> > +
> > +The ``xfbtree_trans_commit`` and ``xfbtree_trans_cancel`` functions
> > implement
> > +this functionality as follows:
> > +
> > +1. Find each buffer log item whose buffer targets the xfile.
> > +
> > +2. Record the dirty/ordered status of the log item.
> > +
> > +3. Detach the log item from the buffer.
> > +
> > +4. Queue the buffer to a special delwri list.
> > +
> > +5. Clear the transaction dirty flag if the only dirty log items were
> > the ones
> > +   that were detached in step 3.
> > +
> > +6. Submit the delwri list to commit the changes to the xfile, if the
> > updates
> > +   are being committed.
> > +
> > +After removing xfile logged buffers from the transaction in this
> > manner, the
> > +transaction can be committed or cancelled.
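As a sketch, the commit side might walk the transaction's log items like
this; ``xfbtree_buf_match`` and ``xfs_trans_bdetach`` are illustrative
names for steps 1 and 3, not necessarily the patchset's:

.. code-block:: c

        static int
        xfbtree_trans_commit(
                struct xfbtree          *xfbt,
                struct xfs_trans        *tp)
        {
                struct xfs_log_item     *lip, *n;
                LIST_HEAD(buffer_list);
                bool                    tp_dirty = false;

                list_for_each_entry_safe(lip, n, &tp->t_items, li_trans) {
                        struct xfs_buf  *bp = xfbtree_buf_match(xfbt, lip);

                        if (!bp) {
                                /* items we don't own decide the dirty flag */
                                tp_dirty |= test_bit(XFS_LI_DIRTY,
                                                &lip->li_flags);
                                continue;
                        }

                        /* steps 2-3: record state, detach item from buffer */
                        xfs_trans_bdetach(tp, bp);

                        /* step 4: queue the buffer for xfile writeback */
                        xfs_buf_delwri_queue(bp, &buffer_list);
                        xfs_buf_relse(bp);
                }

                /* step 5: if only our items were dirty, clean the trans */
                if (!tp_dirty)
                        tp->t_flags &= ~XFS_TRANS_DIRTY;

                /* step 6: commit the changes to the backing xfile */
                return xfs_buf_delwri_submit(&buffer_list);
        }

Presumably, ``xfbtree_trans_cancel`` would drain the same list with
``xfs_buf_delwri_cancel`` instead of submitting it.
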
> Rest of this looks pretty good, organizing nits aside.

Cool, thank you!!

--D

> Allison
> 
> >
Allison Henderson Feb. 9, 2023, 5:41 a.m. UTC | #3
On Thu, 2023-02-02 at 15:14 -0800, Darrick J. Wong wrote:
> On Thu, Feb 02, 2023 at 07:14:22AM +0000, Allison Henderson wrote:
> > On Fri, 2022-12-30 at 14:10 -0800, Darrick J. Wong wrote:
> > > From: Darrick J. Wong <djwong@kernel.org>
> > > 
> > > Add a discussion of pageable kernel memory, since online fsck
> > > needs
> > > quite a bit more memory than most other parts of the filesystem
> > > to
> > > stage
> > > records and other information.
> > > 
> > > Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> > > ---
> > >  .../filesystems/xfs-online-fsck-design.rst         |  490
> > > ++++++++++++++++++++
> > >  1 file changed, 490 insertions(+)
> > > 
> > > 
> > > diff --git a/Documentation/filesystems/xfs-online-fsck-design.rst
> > > b/Documentation/filesystems/xfs-online-fsck-design.rst
> > > index 419eb54ee200..9d7a2ef1d0dd 100644
> > > --- a/Documentation/filesystems/xfs-online-fsck-design.rst
> > > +++ b/Documentation/filesystems/xfs-online-fsck-design.rst
> > > @@ -383,6 +383,8 @@ Algorithms") of Srinivasan.
> > >  However, any data structure builder that maintains a resource
> > > lock
> > > for the
> > >  duration of the repair is *always* an offline algorithm.
> > >  
> > > +.. _secondary_metadata:
> > > +
> > >  Secondary Metadata
> > >  ``````````````````
> > >  
> > > @@ -1746,3 +1748,491 @@ Scrub teardown disables all static keys
> > > obtained by ``xchk_fshooks_enable``.
> > >  
> > >  For more information, please see the kernel documentation of
> > >  Documentation/staging/static-keys.rst.
> > > +
> > > +.. _xfile:
> > > +
> > > +Pageable Kernel Memory
> > > +----------------------
> > > +
> > > +Demonstrations of the first few prototypes of online repair
> > > revealed
> > > new
> > > +technical requirements that were not originally identified.
> > > +For the first demonstration, the code walked whatever filesystem
> > > +metadata it needed to synthesize new records and inserted
> > > records
> > > into a new
> > > +btree as it found them.
> > > +This was subpar since any additional corruption or runtime
> > > errors
> > > encountered
> > > +during the walk would shut down the filesystem.
> > > +After remount, the blocks containing the half-rebuilt data
> > > structure
> > > would not
> > > +be accessible until another repair was attempted.
> > > +Solving the problem of half-rebuilt data structures will be
> > > discussed in the
> > > +next section.
> > > +
> > > +For the second demonstration, the synthesized records were
> > > instead
> > > stored in
> > > +kernel slab memory.
> > > +Doing so enabled online repair to abort without writing to the
> > > filesystem if
> > > +the metadata walk failed, which prevented online fsck from
> > > making
> > > things worse.
> > > +However, even this approach needed improving upon.
> > > +
> > > +There are four reasons why traditional Linux kernel memory
> > > management isn't
> > > +suitable for storing large datasets:
> > > +
> > > +1. Although it is tempting to allocate a contiguous block of
> > > memory
> > > to create a
> > > +   C array, this cannot easily be done in the kernel because it
> > > cannot be
> > > +   relied upon to allocate multiple contiguous memory pages.
> > > +
> > > +2. While disparate physical pages can be virtually mapped
> > > together,
> > > installed
> > > +   memory might still not be large enough to stage the entire
> > > record
> > > set in
> > > +   memory while constructing a new btree.
> > > +
> > > +3. To overcome these two difficulties, the implementation was
> > > adjusted to use
> > > +   doubly linked lists, which means every record object needed
> > > two
> > > 64-bit list
> > > +   head pointers, which is a lot of overhead.
> > > +
> > > +4. Kernel memory is pinned, which can drive the system out of
> > > memory, leading
> > > +   to OOM kills of unrelated processes.
> > > +
> > I think I maybe might just jump to whatever the current plan is
> > instead of trying to keep a record of the dev history in the
> > document.
> > I'm sure we're not done yet, dev really never is, so in order for
> > the
> > documentation to be maintained, it would just get bigger and bigger
> > to
> > keep documenting it this way.  It's not that the above isn't
> > valuable,
> > but maybe a different kind of document really.
> 
> OK, I've shortened this introduction to outline the requirements, and
> trimmed the historical information to a sidebar:
> 
> "Some online checking functions work by scanning the filesystem to
> build
> a shadow copy of an ondisk metadata structure in memory and comparing
> the two copies. For online repair to rebuild a metadata structure, it
> must compute the record set that will be stored in the new structure
> before it can persist that new structure to disk. Ideally, repairs
> complete with a single atomic commit that introduces a new data
> structure. To meet these goals, the kernel needs to collect a large
> amount of information in a place that doesn’t require the correct
> operation of the filesystem.
> 
> "Kernel memory isn’t suitable because:
> 
> *   Allocating a contiguous region of memory to create a C array is
> very
>     difficult, especially on 32-bit systems.
> 
> *   Linked lists of records introduce double pointer overhead which
> is
>     very high and eliminate the possibility of indexed lookups.
> 
> *   Kernel memory is pinned, which can drive the system into OOM
>     conditions.
> 
> *   The system might not have sufficient memory to stage all the
>     information.
> 
> "At any given time, online fsck does not need to keep the entire
> record
> set in memory, which means that individual records can be paged out
> if
> necessary. Continued development of online fsck demonstrated that the
> ability to perform indexed data storage would also be very useful.
> Fortunately, the Linux kernel already has a facility for
> byte-addressable and pageable storage: tmpfs. In-kernel graphics
> drivers
> (most notably i915) take advantage of tmpfs files to store
> intermediate
> data that doesn’t need to be in memory at all times, so that usage
> precedent is already established. Hence, the xfile was born!
> 
> Historical Sidebar
> ------------------
> 
> "The first edition of online repair inserted records into a new btree
> as
> it found them, which failed because the filesystem could shut down
> with a half-built data structure, which would be live after recovery
> finished.
> 
> "The second edition solved the half-rebuilt structure problem by
> storing
> everything in memory, but frequently ran the system out of memory.
> 
> "The third edition solved the OOM problem by using linked lists, but
> the
> list overhead was extreme."
Ok, I think that's cleaner

> 
> > 
> > 
> > > +For the third iteration, attention swung back to the possibility
> > > of
> > > using
> > 
> > Due to the large volume of metadata that needs to be processed,
> > ofsck
> > uses...
> > 
> > > +byte-indexed array-like storage to reduce the overhead of in-
> > > memory
> > > records.
> > > +At any given time, online repair does not need to keep the
> > > entire
> > > record set in
> > > +memory, which means that individual records can be paged out.
> > > +Creating new temporary files in the XFS filesystem to store
> > > intermediate data
> > > +was explored and rejected for some types of repairs because a
> > > filesystem with
> > > +compromised space and inode metadata should never be used to fix
> > > compromised
> > > +space or inode metadata.
> > > +However, the kernel already has a facility for byte-addressable
> > > and
> > > pageable
> > > +storage: shmfs.
> > > +In-kernel graphics drivers (most notably i915) take advantage of
> > > shmfs files
> > > +to store intermediate data that doesn't need to be in memory at
> > > all
> > > times, so
> > > +that usage precedent is already established.
> > > +Hence, the ``xfile`` was born!
> > > +
> > > +xfile Access Models
> > > +```````````````````
> > > +
> > > +A survey of the intended uses of xfiles suggested these use
> > > cases:
> > > +
> > > +1. Arrays of fixed-sized records (space management btrees,
> > > directory
> > > and
> > > +   extended attribute entries)
> > > +
> > > +2. Sparse arrays of fixed-sized records (quotas and link counts)
> > > +
> > > +3. Large binary objects (BLOBs) of variable sizes (directory and
> > > extended
> > > +   attribute names and values)
> > > +
> > > +4. Staging btrees in memory (reverse mapping btrees)
> > > +
> > > +5. Arbitrary contents (realtime space management)
> > > +
> > > +To support the first four use cases, high level data structures
> > > wrap
> > > the xfile
> > > +to share functionality between online fsck functions.
> > > +The rest of this section discusses the interfaces that the xfile
> > > presents to
> > > +four of those five higher level data structures.
> > > +The fifth use case is discussed in the :ref:`realtime summary
> > > <rtsummary>` case
> > > +study.
> > > +
> > > +The most general storage interface supported by the xfile
> > > enables
> > > the reading
> > > +and writing of arbitrary quantities of data at arbitrary offsets
> > > in
> > > the xfile.
> > > +This capability is provided by ``xfile_pread`` and
> > > ``xfile_pwrite``
> > > functions,
> > > +which behave similarly to their userspace counterparts.
> > > +XFS is very record-based, which suggests that the ability to
> > > load
> > > and store
> > > +complete records is important.
> > > +To support these cases, a pair of ``xfile_obj_load`` and
> > > ``xfile_obj_store``
> > > +functions are provided to read and persist objects into an
> > > xfile.
> > > +They are internally the same as pread and pwrite, except that
> > > they
> > > treat any
> > > +error as an out of memory error.
> > > +For online repair, squashing error conditions in this manner is
> > > an
> > > acceptable
> > > +behavior because the only reaction is to abort the operation
> > > back to
> > > userspace.
> > > +All five xfile use cases can be serviced by these four functions.
> > > +
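For example, staging a single fixed-size record at index ``idx`` of an
xfile ``xf`` might look like this sketch, which assumes pread/pwrite-style
argument ordering for the object helpers:

.. code-block:: c

        struct xfs_rmap_irec    rec;
        loff_t                  pos = idx * sizeof(rec);
        int                     error;

        /* persist one record; any failure is squashed to -ENOMEM */
        error = xfile_obj_store(xf, &rec, sizeof(rec), pos);
        if (error)
                return error;

        /* ...and recall it later from the same position */
        error = xfile_obj_load(xf, &rec, sizeof(rec), pos);
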
> > > +However, no discussion of file access idioms is complete without
> > > answering the
> > > +question, "But what about mmap?"
> > I actually wouldn't spend too much time discussing solutions that
> > didn't work for what ever reason, unless someones really asking for
> > it.
> >  I think this section would read just fine to trim off the last
> > paragraph here
> 
> Since I wrote this, I've been experimenting with wiring up the tmpfs
> file page cache folios to the xfs buffer cache.  Pinning the folios
> in
> this manner makes it so that online fsck can (more or less) directly
> access the xfile contents.  Much to my surprise, this has actually
> held
> up in testing, so ... it's no longer a solution that "didn't really
> work". :)
> 
> I also need to s/page/folio/ now that willy has finished that
> conversion.  This section has been rewritten as such:
> 
> "However, no discussion of file access idioms is complete without
> answering the question, “But what about mmap?” It is convenient to
> access storage directly with pointers, just like userspace code does
> with regular memory. Online fsck must not drive the system into OOM
> conditions, which means that xfiles must be responsive to memory
> reclamation. tmpfs can only push a pagecache folio to the swap cache
> if
> the folio is neither pinned nor locked, which means the xfile must
> not
> pin too many folios.
> 
> "Short term direct access to xfile contents is done by locking the
> pagecache folio and mapping it into kernel address space.
> Programmatic
> access (e.g. pread and pwrite) uses this mechanism. Folio locks are
> not
> supposed to be held for long periods of time, so long term direct
> access
> to xfile contents is done by bumping the folio refcount, mapping it
> into
> kernel address space, and dropping the folio lock. These long term
> users
> must be responsive to memory reclaim by hooking into the shrinker
> infrastructure to know when to release folios.
> 
> "The xfile_get_page and xfile_put_page functions are provided to
> retrieve the (locked) folio that backs part of an xfile and to
> release
> it. The only users of these folio lease functions are the xfarray
> sorting algorithms and the in-memory btrees."
Alrighty, sounds like a good update then

> 
> > > +It would be *much* more convenient if kernel code could access
> > > pageable kernel
> > > +memory with pointers, just like userspace code does with regular
> > > memory.
> > > +Like any other filesystem that uses the page cache, reads and
> > > writes
> > > of xfile
> > > +data lock the cache page and map it into the kernel address
> > > space
> > > for the
> > > +duration of the operation.
> > > +Unfortunately, shmfs can only write a file page to the swap
> > > device
> > > if the page
> > > +is unmapped and unlocked, which means the xfile risks causing
> > > OOM
> > > problems
> > > +unless it is careful not to pin too many pages.
> > > +Therefore, the xfile steers most of its users towards
> > > programmatic
> > > access so
> > > +that backing pages are not kept locked in memory for longer than
> > > is
> > > necessary.
> > > +However, for callers performing quick linear scans of xfile
> > > data,
> > > +``xfile_get_page`` and ``xfile_put_page`` functions are provided
> > > to
> > > pin a page
> > > +in memory.
> > > +So far, the only users of these functions are the xfarray
> > > :ref:`sorting
> > > +<xfarray_sort>` algorithms.
> > > +
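A quick linear scan with those helpers might look like the following
sketch; the signatures shown are assumptions:

.. code-block:: c

        struct page     *page;
        void            *p;
        int             error;

        /* pin and map one page's worth of xfile contents */
        error = xfile_get_page(xf, pos, PAGE_SIZE, &page);
        if (error)
                return error;

        p = page_address(page);
        /* ... scan the mapped bytes without further xfile calls ... */

        /* unpin the page so shmfs can push it to swap again */
        error = xfile_put_page(xf, pos, page);
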
> > > +xfile Access Coordination
> > > +`````````````````````````
> > > +
> > > +For security reasons, xfiles must be owned privately by the
> > > kernel.
> > > +They are marked ``S_PRIVATE`` to prevent interference from the
> > > security system,
> > > +must never be mapped into process file descriptor tables, and
> > > their
> > > pages must
> > > +never be mapped into userspace processes.
> > > +
> > > +To avoid locking recursion issues with the VFS, all accesses to
> > > the
> > > shmfs file
> > > +are performed by manipulating the page cache directly.
> > > +xfile writes call the ``->write_begin`` and ``->write_end``
> > > functions of the
> > > +xfile's address space to grab writable pages, copy the caller's
> > > buffer into the
> > > +page, and release the pages.
> > > +xfile reads call ``shmem_read_mapping_page_gfp`` to grab pages
> > xfile readers
> 
> OK.
> 
> > > directly before
> > > +copying the contents into the caller's buffer.
> > > +In other words, xfiles ignore the VFS read and write code paths
> > > to
> > > avoid
> > > +having to create a dummy ``struct kiocb`` and to avoid taking
> > > inode
> > > and
> > > +freeze locks.
> > > +
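A sketch of that write path, showing the VFS bypass; this assumes an
``xf->file`` member, handles exactly one aligned page, and omits error
unwinding:

.. code-block:: c

        /* Sketch only: one whole-page write, no error unwinding. */
        static int
        xfile_pwrite_page(
                struct xfile            *xf,
                const void              *buf,
                loff_t                  pos)
        {
                struct address_space    *mapping = xf->file->f_mapping;
                const struct address_space_operations *aops = mapping->a_ops;
                struct page             *page = NULL;
                void                    *fsdata = NULL;
                int                     ret;

                /* grab a writable page without going through the VFS */
                ret = aops->write_begin(NULL, mapping, pos, PAGE_SIZE,
                                &page, &fsdata);
                if (ret)
                        return ret;

                /* copy the caller's buffer into the page... */
                memcpy(page_address(page), buf, PAGE_SIZE);

                /* ...and release it */
                ret = aops->write_end(NULL, mapping, pos, PAGE_SIZE,
                                PAGE_SIZE, page, fsdata);
                return ret < 0 ? ret : 0;
        }
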
> > > +If an xfile is shared between threads to stage repairs, the
> > > caller
> > > must provide
> > > +its own locks to coordinate access.
> > Ofsck threads that share an xfile between stage repairs will use
> > their
> > own locks to coordinate access with each other.
> > 
> > ?
> 
> Hm.  I wonder if there's a misunderstanding here?
> 
> Online fsck functions themselves are single-threaded, which is to say
> that they themselves neither queue workers nor start kthreads. 
> However,
> an xfile created by a running fsck function can be accessed from
> other
> threads if the fsck function also hooks itself into filesystem code.
> 
> The live update section has a nice diagram of how that works:
> https://djwong.org/docs/xfs-online-fsck-design/#filesystem-hooks
> 

Oh ok, I think I got hung up on who the callers were.  How about
"xfiles shared between threads running from hooked filesystem functions
will use their own locks to coordinate access with each other."

> > > +
> > > +.. _xfarray:
> > > +
> > > +Arrays of Fixed-Sized Records
> > > +`````````````````````````````
> > > +
> > > +In XFS, each type of indexed space metadata (free space, inodes,
> > > reference
> > > +counts, file fork space, and reverse mappings) consists of a set
> > > of
> > > fixed-size
> > > +records indexed with a classic B+ tree.
> > > +Directories have a set of fixed-size dirent records that point
> > > to
> > > the names,
> > > +and extended attributes have a set of fixed-size attribute keys
> > > that
> > > point to
> > > +names and values.
> > > +Quota counters and file link counters index records with
> > > numbers.
> > > +During a repair, scrub needs to stage new records during the
> > > gathering step and
> > > +retrieve them during the btree building step.
> > > +
> > > +Although this requirement can be satisfied by calling the read
> > > and
> > > write
> > > +methods of the xfile directly, it is simpler for callers for
> > > there
> > > to be a
> > > +higher level abstraction to take care of computing array
> > > offsets, to
> > > provide
> > > +iterator functions, and to deal with sparse records and sorting.
> > > +The ``xfarray`` abstraction presents a linear array for fixed-
> > > size
> > > records atop
> > > +the byte-accessible xfile.
> > > +
> > > +.. _xfarray_access_patterns:
> > > +
> > > +Array Access Patterns
> > > +^^^^^^^^^^^^^^^^^^^^^
> > > +
> > > +Array access patterns in online fsck tend to fall into three
> > > categories.
> > > +Iteration of records is assumed to be necessary for all cases
> > > and
> > > will be
> > > +covered in the next section.
> > > +
> > > +The first type of caller handles records that are indexed by
> > > position.
> > > +Gaps may exist between records, and a record may be updated
> > > multiple
> > > times
> > > +during the collection step.
> > > +In other words, these callers want a sparse linearly addressed
> > > table
> > > file.
> > > +Typical use cases are quota records or file link count
> > > records.
> > > +Access to array elements is performed programmatically via
> > > ``xfarray_load`` and
> > > +``xfarray_store`` functions, which wrap the similarly-named
> > > xfile
> > > functions to
> > > +provide loading and storing of array elements at arbitrary array
> > > indices.
> > > +Gaps are defined to be null records, and null records are
> > > defined to
> > > be a
> > > +sequence of all zero bytes.
> > > +Null records are detected by calling
> > > ``xfarray_element_is_null``.
> > > +They are created either by calling ``xfarray_unset`` to null out
> > > an
> > > existing
> > > +record or by never storing anything to an array index.
> > > +
> > > +The second type of caller handles records that are not indexed
> > > by
> > > position
> > > +and do not require multiple updates to a record.
> > > +The typical use case here is rebuilding space btrees and
> > > key/value
> > > btrees.
> > > +These callers can add records to the array without caring about
> > > array indices
> > > +via the ``xfarray_append`` function, which stores a record at
> > > the
> > > end of the
> > > +array.
> > > +For callers that require records to be presentable in a specific
> > > order (e.g.
> > > +rebuilding btree data), the ``xfarray_sort`` function can
> > > arrange
> > > the sorted
> > > +records; this function will be covered later.
> > > +
> > > +The third type of caller is a bag, which is useful for counting
> > > records.
> > > +The typical use case here is constructing space extent reference
> > > counts from
> > > +reverse mapping information.
> > > +Records can be put in the bag in any order, they can be removed
> > > from
> > > the bag
> > > +at any time, and uniqueness of records is left to callers.
> > > +The ``xfarray_store_anywhere`` function is used to insert a
> > > record
> > > in any
> > > +null record slot in the bag; and the ``xfarray_unset`` function
> > > removes a
> > > +record from the bag.
> > > +
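Given an ``array`` and some records, the three styles might be exercised
like this sketch (all signatures assumed):

.. code-block:: c

        int             error;

        /* 1: sparse, position-indexed (quota or link count records) */
        error = xfarray_store(array, ino_index, &linkrec);
        error = xfarray_load(array, ino_index, &linkrec);
        if (xfarray_element_is_null(array, &linkrec))
                return 0;       /* gap: nothing was ever stored here */

        /* 2: append now, sort later before bulk loading a new btree */
        error = xfarray_append(array, &rmap_rec);

        /* 3: bag semantics for building refcounts from rmaps */
        error = xfarray_store_anywhere(array, &rmap_rec);       /* put */
        error = xfarray_unset(array, idx);                      /* remove */
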
> > > +The proposed patchset is the
> > > +`big in-memory array
> > > +<
> > > https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/
> > > log/?h=big-array>`_.
> > > +
> > > +Iterating Array Elements
> > > +^^^^^^^^^^^^^^^^^^^^^^^^
> > > +
> > > +Most users of the xfarray require the ability to iterate the
> > > records
> > > stored in
> > > +the array.
> > > +Callers can probe every possible array index with the following:
> > > +
> > > +.. code-block:: c
> > > +
> > > +       xfarray_idx_t i;
> > > +       foreach_xfarray_idx(array, i) {
> > > +           xfarray_load(array, i, &rec);
> > > +
> > > +           /* do something with rec */
> > > +       }
> > > +
> > > +All users of this idiom must be prepared to handle null records
> > > or
> > > must already
> > > +know that there aren't any.
> > > +
> > > +For xfarray users that want to iterate a sparse array, the
> > > ``xfarray_iter``
> > > +function ignores indices in the xfarray that have never been
> > > written
> > > to by
> > > +calling ``xfile_seek_data`` (which internally uses
> > > ``SEEK_DATA``) to
> > > skip areas
> > > +of the array that are not populated with memory pages.
> > > +Once it finds a page, it will skip the zeroed areas of the page.
> > > +
> > > +.. code-block:: c
> > > +
> > > +       xfarray_idx_t i = XFARRAY_CURSOR_INIT;
> > > +       while ((ret = xfarray_iter(array, &i, &rec)) == 1) {
> > > +           /* do something with rec */
> > > +       }
> > > +
> > > +.. _xfarray_sort:
> > > +
> > > +Sorting Array Elements
> > > +^^^^^^^^^^^^^^^^^^^^^^
> > > +
> > > +During the fourth demonstration of online repair, a community
> > > reviewer remarked
> > > +that for performance reasons, online repair ought to load
> > > batches of
> > > records
> > > +into btree record blocks instead of inserting records into a new
> > > btree one at a
> > > +time.
> > > +The btree insertion code in XFS is responsible for maintaining
> > > correct ordering
> > > +of the records, so naturally the xfarray must also support
> > > sorting
> > > the record
> > > +set prior to bulk loading.
> > > +
> > > +The sorting algorithm used in the xfarray is actually a
> > > combination
> > > of adaptive
> > > +quicksort and a heapsort subalgorithm in the spirit of
> > > +`Sedgewick <https://algs4.cs.princeton.edu/23quicksort/>`_ and
> > > +`pdqsort <https://github.com/orlp/pdqsort>`_, with
> > > customizations
> > > for the Linux
> > > +kernel.
> > > +To sort records in a reasonably short amount of time,
> > > ``xfarray``
> > > takes
> > > +advantage of the binary subpartitioning offered by quicksort,
> > > but it
> > > also uses
> > > +heapsort to hedge against performance collapse if the chosen
> > > quicksort pivots
> > > +are poor.
> > > +Both algorithms are (in general) O(n * lg(n)), but there is a
> > > wide
> > > performance
> > > +gulf between the two implementations.
> > > +
> > > +The Linux kernel already contains a reasonably fast
> > > implementation
> > > of heapsort.
> > > +It only operates on regular C arrays, which limits the scope of
> > > its
> > > usefulness.
> > > +There are two key places where the xfarray uses it:
> > > +
> > > +* Sorting any record subset backed by a single xfile page.
> > > +
> > > +* Loading a small number of xfarray records from potentially
> > > disparate parts
> > > +  of the xfarray into a memory buffer, and sorting the buffer.
> > > +
> > > +In other words, ``xfarray`` uses heapsort to constrain the
> > > nested
> > > recursion of
> > > +quicksort, thereby mitigating quicksort's worst runtime
> > > behavior.
> > > +
> > > +Choosing a quicksort pivot is a tricky business.
> > > +A good pivot splits the set to sort in half, leading to the
> > > divide
> > > and conquer
> > > +behavior that is crucial to O(n * lg(n)) performance.
> > > +A poor pivot barely splits the subset at all, leading to O(n\
> > > :sup:`2`)
> > > +runtime.
> > > +The xfarray sort routine tries to avoid picking a bad pivot by
> > > sampling nine
> > > +records into a memory buffer and using the kernel heapsort to
> > > identify the
> > > +median of the nine.
> > > +
> > > +Most modern quicksort implementations employ Tukey's "ninther"
> > > to
> > > select a
> > > +pivot from a classic C array.
> > > +Typical ninther implementations pick three unique triads of
> > > records,
> > > sort each
> > > +of the triads, and then sort the middle value of each triad to
> > > determine the
> > > +ninther value.
> > > +As stated previously, however, xfile accesses are not entirely
> > > cheap.
> > > +It turned out to be much more performant to read the nine
> > > elements
> > > into a
> > > +memory buffer, run the kernel's in-memory heapsort on the
> > > buffer,
> > > and choose
> > > +the 4th element of that buffer as the pivot.
> > > +Tukey's ninthers are described in J. W. Tukey, `The ninther, a
> > > technique for
> > > +low-effort robust (resistant) location in large samples`, in
> > > *Contributions to
> > > +Survey Sampling and Applied Statistics*, edited by H. David,
> > > (Academic Press,
> > > +1978), pp. 251–257.
> > > +
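A sketch of that pivot selection follows; the xfarray member names and
the nine-sample spacing are assumptions, though ``sort`` is the kernel's
real heapsort from lib/sort.c:

.. code-block:: c

        /* Sketch only; xfarray member names are assumptions. */
        static int
        xfarray_qsort_pivot(
                struct xfarray          *array,
                xfarray_idx_t           lo,
                xfarray_idx_t           hi,
                void                    *samples,       /* 9 records */
                void                    *pivot)
        {
                xfarray_idx_t           step = (hi - lo) / 8;
                unsigned int            i;
                int                     error;

                /* read nine records scattered across [lo, hi]... */
                for (i = 0; i < 9; i++) {
                        error = xfarray_load(array, lo + i * step,
                                        samples + i * array->obj_size);
                        if (error)
                                return error;
                }

                /* ...heapsort them in memory... */
                sort(samples, 9, array->obj_size, array->cmp_fn, NULL);

                /* ...and take the median of nine as the pivot */
                memcpy(pivot, samples + 4 * array->obj_size,
                                array->obj_size);
                return 0;
        }
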
> > > +The partitioning of quicksort is fairly textbook -- rearrange
> > > the
> > > record
> > > +subset around the pivot, then set up the current and next stack
> > > frames to
> > > +sort with the larger and the smaller halves of the pivot,
> > > respectively.
> > > +This keeps the stack space requirements to log2(record count).
> > > +
> > > +As a final performance optimization, the hi and lo scanning
> > > phase of
> > > quicksort
> > > +keeps examined xfile pages mapped in the kernel for as long as
> > > possible to
> > > +reduce map/unmap cycles.
> > > +Surprisingly, this reduces overall sort runtime by nearly half
> > > again
> > > after
> > > +accounting for the application of heapsort directly onto xfile
> > > pages.
> > This sorting section is insightful, but I think I'd be ok with out
> > it
> > too.  Or maybe save it for later in the document as an
> > "implementation
> > details" section, or something similar.  It seems like there's
> > still a
> > lot to cover about how ofsck works in general before we start
> > drilling
> > into things like the runtime complexity of the sorting algorithm it
> > uses.  
> 
> How about I demote the details of how sorting works to a case study?
Sure, sounds good
> 
> > > +
> > > +Blob Storage
> > > +````````````
> > > +
> > > +Extended attributes and directories add an additional
> > > requirement
> > > for staging
> > > +records: arbitrary byte sequences of finite length.
> > > +Each directory entry record needs to store the entry name,
> > > +and each extended attribute needs to store both the attribute
> > > name
> > > and value.
> > > +The names, keys, and values can consume a large amount of
> > > memory, so
> > > the
> > > +``xfblob`` abstraction was created to simplify management of
> > > these
> > > blobs
> > > +atop an xfile.
> > > +
> > > +Blob arrays provide ``xfblob_load`` and ``xfblob_store``
> > > functions
> > > to retrieve
> > > +and persist objects.
> > > +The store function returns a magic cookie for every object that
> > > it
> > > persists.
> > > +Later, callers provide this cookie to ``xfblob_load`` to
> > > recall
> > > the object.
> > > +The ``xfblob_free`` function frees a specific blob, and the
> > > ``xfblob_truncate``
> > > +function frees them all because compaction is not needed.
> > > +
> > > +The details of repairing directories and extended attributes
> > > will be
> > > discussed
> > > +in a subsequent section about atomic extent swapping.
> > > +However, it should be noted that these repair functions only use
> > > blob storage
> > > +to cache a small number of entries before adding them to a
> > > temporary
> > > ondisk
> > > +file, which is why compaction is not required.
> > > +
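Typical usage might look like this sketch; the ``xfblob_cookie`` type
name and the argument order are assumptions:

.. code-block:: c

        xfblob_cookie   cookie;
        int             error;

        /* stash a dirent name while scanning the directory */
        error = xfblob_store(blobs, &cookie, name, name_len);
        if (error)
                return error;

        /* later, recall the name while flushing entries to disk */
        error = xfblob_load(blobs, cookie, name_buf, name_len);

        /* drop this one blob; xfblob_truncate would drop them all */
        xfblob_free(blobs, cookie);
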
> > > +The proposed patchset is at the start of the
> > > +`extended attribute repair
> > > +<
> > > https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/
> > > log/?h=repair-xattrs>`_ series.
> > > +
> > > +.. _xfbtree:
> > > +
> > > +In-Memory B+Trees
> > > +`````````````````
> > > +
> > > +The chapter about :ref:`secondary metadata<secondary_metadata>`
> > > mentioned that
> > > +checking and repairing of secondary metadata commonly requires
> > > coordination
> > > +between a live metadata scan of the filesystem and writer
> > > threads
> > > that are
> > > +updating that metadata.
> > > +Keeping the scan data up to date requires the ability
> > > to
> > > propagate
> > > +metadata updates from the filesystem into the data being
> > > collected
> > > by the scan.
> > > +This *can* be done by appending concurrent updates into a
> > > separate
> > > log file and
> > > +applying them before writing the new metadata to disk, but this
> > > leads to
> > > +unbounded memory consumption if the rest of the system is very
> > > busy.
> > > +Another option is to skip the side-log and commit live updates
> > > from
> > > the
> > > +filesystem directly into the scan data, which trades more
> > > overhead
> > > for a lower
> > > +maximum memory requirement.
> > > +In both cases, the data structure holding the scan results must
> > > support indexed
> > > +access to perform well.
> > > +
> > > +Given that indexed lookups of scan data are required for both
> > > strategies, online
> > > +fsck employs the second strategy of committing live updates
> > > directly
> > > into
> > > +scan data.
> > > +Because xfarrays are not indexed and do not enforce record
> > > ordering,
> > > they
> > > +are not suitable for this task.
> > > +Conveniently, however, XFS has a library to create and maintain
> > > ordered reverse
> > > +mapping records: the existing rmap btree code!
> > > +If only there was a means to create one in memory.
> > > +
> > > +Recall that the :ref:`xfile <xfile>` abstraction represents
> > > memory
> > > pages as a
> > > +regular file, which means that the kernel can create byte or
> > > block
> > > addressable
> > > +virtual address spaces at will.
> > > +The XFS buffer cache specializes in abstracting IO to block-
> > > oriented address
> > > +spaces, which means that adaptation of the buffer cache to
> > > interface
> > > with
> > > +xfiles enables reuse of the entire btree library.
> > > +Btrees built atop an xfile are collectively known as
> > > ``xfbtrees``.
> > > +The next few sections describe how they actually work.
> > > +
> > > +The proposed patchset is the
> > > +`in-memory btree
> > > +<
> > > https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/
> > > log/?h=in-memory-btrees>`_
> > > +series.
> > > +
> > > +Using xfiles as a Buffer Cache Target
> > > +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> > > +
> > > +Two modifications are necessary to support xfiles as a buffer
> > > cache
> > > target.
> > > +The first is to make it possible for the ``struct xfs_buftarg``
> > > structure to
> > > +host the ``struct xfs_buf`` rhashtable, because normally those
> > > are
> > > held by a
> > > +per-AG structure.
> > > +The second change is to modify the buffer ``ioapply`` function
> > > to
> > > "read" cached
> > > +pages from the xfile and "write" cached pages back to the xfile.
> > > +Concurrent access to individual buffers is controlled by the
> > > ``xfs_buf`` lock,
> > > +since the xfile does not provide any locking on its own.
> > > +With this adaptation in place, users of the xfile-backed buffer
> > > cache use
> > > +exactly the same APIs as users of the disk-backed buffer cache.
> > > +The separation between xfile and buffer cache implies higher
> > > memory
> > > usage since
> > > +they do not share pages, but this property could some day enable
> > > transactional
> > > +updates to an in-memory btree.
> > > +Today, however, it simply eliminates the need for new code.
> > > +
> > > +Space Management with an xfbtree
> > > +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> > > +
> > > +Space management for an xfile is very simple -- each btree block
> > > is
> > > one memory
> > > +page in size.
> > > +These blocks use the same header format as an on-disk btree, but
> > > the
> > > in-memory
> > > +block verifiers ignore the checksums, assuming that xfile memory
> > > is
> > > no more
> > > +corruption-prone than regular DRAM.
> > > +Reusing existing code here is more important than absolute
> > > memory
> > > efficiency.
> > > +
> > > +The very first block of an xfile backing an xfbtree contains a
> > > header block.
> > > +The header describes the owner, height, and the block number of
> > > the
> > > root
> > > +xfbtree block.
> > > +
> > > +To allocate a btree block, use ``xfile_seek_data`` to find a gap
> > > in
> > > the file.
> > > +If there are no gaps, create one by extending the length of the
> > > xfile.
> > > +Preallocate space for the block with ``xfile_prealloc``, and
> > > hand
> > > back the
> > > +location.
> > > +To free an xfbtree block, use ``xfile_discard`` (which
> > > internally
> > > uses
> > > +``FALLOC_FL_PUNCH_HOLE``) to remove the memory page from the
> > > xfile.
> > > +
> > > +Populating an xfbtree
> > > +^^^^^^^^^^^^^^^^^^^^^
> > > +
> > > +An online fsck function that wants to create an xfbtree should
> > > proceed as
> > > +follows:
> > > +
> > > +1. Call ``xfile_create`` to create an xfile.
> > > +
> > > +2. Call ``xfs_alloc_memory_buftarg`` to create a buffer cache
> > > target
> > > structure
> > > +   pointing to the xfile.
> > > +
> > > +3. Pass the buffer cache target, buffer ops, and other
> > > information
> > > to
> > > +   ``xfbtree_create`` to write an initial tree header and root
> > > block
> > > to the
> > > +   xfile.
> > > +   Each btree type should define a wrapper that passes necessary
> > > arguments to
> > > +   the creation function.
> > > +   For example, rmap btrees define ``xfs_rmapbt_mem_create`` to
> > > take
> > > care of
> > > +   all the necessary details for callers.
> > > +   A ``struct xfbtree`` object will be returned.
> > > +
> > > +4. Pass the xfbtree object to the btree cursor creation function
> > > for
> > > the
> > > +   btree type.
> > > +   Following the example above, ``xfs_rmapbt_mem_cursor`` takes
> > > care
> > > of this
> > > +   for callers.
> > > +
> > > +5. Pass the btree cursor to the regular btree functions to make
> > > queries against
> > > +   and to update the in-memory btree.
> > > +   For example, a btree cursor for an rmap xfbtree can be passed
> > > to
> > > the
> > > +   ``xfs_rmap_*`` functions just like any other btree cursor.
> > > +   See the :ref:`next section<xfbtree_commit>` for information
> > > on
> > > dealing with
> > > +   xfbtree updates that are logged to a transaction.
> > > +
> > > +6. When finished, delete the btree cursor, destroy the xfbtree
> > > object, free the
> > > +   buffer target, and then destroy the xfile to release all
> > > resources.
> > > +
> > > +.. _xfbtree_commit:
> > > +
> > > +Committing Logged xfbtree Buffers
> > > +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> > > +
> > > +Although it is a clever hack to reuse the rmap btree code to
> > > handle
> > > the staging
> > > +structure, the ephemeral nature of the in-memory btree block
> > > storage
> > > presents
> > > +some challenges of its own.
> > > +The XFS transaction manager must not commit buffer log items for
> > > buffers backed
> > > +by an xfile because the log format does not understand updates
> > > for
> > > devices
> > > +other than the data device.
> > > +An ephemeral xfbtree probably will not exist by the time the AIL
> > > checkpoints
> > > +log transactions back into the filesystem, and certainly won't
> > > exist
> > > during
> > > +log recovery.
> > > +For these reasons, any code updating an xfbtree in transaction
> > > context must
> > > +remove the buffer log items from the transaction and write the
> > > updates into the
> > > +backing xfile before committing or cancelling the transaction.
> > > +
> > > +The ``xfbtree_trans_commit`` and ``xfbtree_trans_cancel``
> > > functions
> > > implement
> > > +this functionality as follows:
> > > +
> > > +1. Find each buffer log item whose buffer targets the xfile.
> > > +
> > > +2. Record the dirty/ordered status of the log item.
> > > +
> > > +3. Detach the log item from the buffer.
> > > +
> > > +4. Queue the buffer to a special delwri list.
> > > +
> > > +5. Clear the transaction dirty flag if the only dirty log items
> > > were
> > > the ones
> > > +   that were detached in step 3.
> > > +
> > > +6. Submit the delwri list to commit the changes to the xfile, if
> > > the
> > > updates
> > > +   are being committed.
> > > +
> > > +After removing xfile logged buffers from the transaction in this
> > > manner, the
> > > +transaction can be committed or cancelled.
> > Rest of this looks pretty good, organizing nits aside.
> 
> Cool, thank you!!
> 
> --D
> 
> > Allison
> > 
> > >
Darrick J. Wong Feb. 9, 2023, 11:14 p.m. UTC | #4
On Thu, Feb 09, 2023 at 05:41:22AM +0000, Allison Henderson wrote:
> On Thu, 2023-02-02 at 15:14 -0800, Darrick J. Wong wrote:
> > On Thu, Feb 02, 2023 at 07:14:22AM +0000, Allison Henderson wrote:
> > > On Fri, 2022-12-30 at 14:10 -0800, Darrick J. Wong wrote:
> > > > From: Darrick J. Wong <djwong@kernel.org>
> > > > 
> > > > Add a discussion of pageable kernel memory, since online fsck
> > > > needs
> > > > quite a bit more memory than most other parts of the filesystem
> > > > to
> > > > stage
> > > > records and other information.
> > > > 
> > > > Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> > > > ---
> > > >  .../filesystems/xfs-online-fsck-design.rst         |  490
> > > > ++++++++++++++++++++
> > > >  1 file changed, 490 insertions(+)
> > > > 
> > > > 
> > > > diff --git a/Documentation/filesystems/xfs-online-fsck-design.rst
> > > > b/Documentation/filesystems/xfs-online-fsck-design.rst
> > > > index 419eb54ee200..9d7a2ef1d0dd 100644
> > > > --- a/Documentation/filesystems/xfs-online-fsck-design.rst
> > > > +++ b/Documentation/filesystems/xfs-online-fsck-design.rst
> > > > @@ -383,6 +383,8 @@ Algorithms") of Srinivasan.
> > > >  However, any data structure builder that maintains a resource
> > > > lock
> > > > for the
> > > >  duration of the repair is *always* an offline algorithm.
> > > >  
> > > > +.. _secondary_metadata:
> > > > +
> > > >  Secondary Metadata
> > > >  ``````````````````
> > > >  
> > > > @@ -1746,3 +1748,491 @@ Scrub teardown disables all static keys
> > > > obtained by ``xchk_fshooks_enable``.
> > > >  
> > > >  For more information, please see the kernel documentation of
> > > >  Documentation/staging/static-keys.rst.
> > > > +
> > > > +.. _xfile:
> > > > +
> > > > +Pageable Kernel Memory
> > > > +----------------------
> > > > +
> > > > +Demonstrations of the first few prototypes of online repair
> > > > revealed
> > > > new
> > > > +technical requirements that were not originally identified.
> > > > +For the first demonstration, the code walked whatever filesystem
> > > > +metadata it needed to synthesize new records and inserted
> > > > records
> > > > into a new
> > > > +btree as it found them.
> > > > +This was subpar since any additional corruption or runtime
> > > > errors
> > > > encountered
> > > > +during the walk would shut down the filesystem.
> > > > +After remount, the blocks containing the half-rebuilt data
> > > > structure
> > > > would not
> > > > +be accessible until another repair was attempted.
> > > > +Solving the problem of half-rebuilt data structures will be
> > > > discussed in the
> > > > +next section.
> > > > +
> > > > +For the second demonstration, the synthesized records were
> > > > instead
> > > > stored in
> > > > +kernel slab memory.
> > > > +Doing so enabled online repair to abort without writing to the
> > > > filesystem if
> > > > +the metadata walk failed, which prevented online fsck from
> > > > making
> > > > things worse.
> > > > +However, even this approach needed improving upon.
> > > > +
> > > > +There are four reasons why traditional Linux kernel memory
> > > > management isn't
> > > > +suitable for storing large datasets:
> > > > +
> > > > +1. Although it is tempting to allocate a contiguous block of
> > > > memory
> > > > to create a
> > > > +   C array, this cannot easily be done in the kernel because it
> > > > cannot be
> > > > +   relied upon to allocate multiple contiguous memory pages.
> > > > +
> > > > +2. While disparate physical pages can be virtually mapped
> > > > together,
> > > > installed
> > > > +   memory might still not be large enough to stage the entire
> > > > record
> > > > set in
> > > > +   memory while constructing a new btree.
> > > > +
> > > > +3. To overcome these two difficulties, the implementation was
> > > > adjusted to use
> > > > +   doubly linked lists, which means every record object needed
> > > > two
> > > > 64-bit list
> > > > +   head pointers, which is a lot of overhead.
> > > > +
> > > > +4. Kernel memory is pinned, which can drive the system out of
> > > > memory, leading
> > > > +   to OOM kills of unrelated processes.
> > > > +
> > > I think I maybe might just jump to whatever the current plan is
> > > instead of trying to keep a record of the dev history in the
> > > document.
> > > I'm sure we're not done yet, dev really never is, so in order for
> > > the
> > > documentation to be maintained, it would just get bigger and bigger
> > > to
> > > keep documenting it this way.  It's not that the above isn't
> > > valuable,
> > > but maybe a different kind of document really.
> > 
> > OK, I've shortened this introduction to outline the requirements, and
> > trimmed the historical information to a sidebar:
> > 
> > "Some online checking functions work by scanning the filesystem to
> > build
> > a shadow copy of an ondisk metadata structure in memory and comparing
> > the two copies. For online repair to rebuild a metadata structure, it
> > must compute the record set that will be stored in the new structure
> > before it can persist that new structure to disk. Ideally, repairs
> > complete with a single atomic commit that introduces a new data
> > structure. To meet these goals, the kernel needs to collect a large
> > amount of information in a place that doesn’t require the correct
> > operation of the filesystem.
> > 
> > "Kernel memory isn’t suitable because:
> > 
> > *   Allocating a contiguous region of memory to create a C array is
> > very
> >     difficult, especially on 32-bit systems.
> > 
> > *   Linked lists of records introduce double pointer overhead which
> > is
> >     very high and eliminate the possibility of indexed lookups.
> > 
> > *   Kernel memory is pinned, which can drive the system into OOM
> >     conditions.
> > 
> > *   The system might not have sufficient memory to stage all the
> >     information.
> > 
> > "At any given time, online fsck does not need to keep the entire
> > record
> > set in memory, which means that individual records can be paged out
> > if
> > necessary. Continued development of online fsck demonstrated that the
> > ability to perform indexed data storage would also be very useful.
> > Fortunately, the Linux kernel already has a facility for
> > byte-addressable and pageable storage: tmpfs. In-kernel graphics
> > drivers
> > (most notably i915) take advantage of tmpfs files to store
> > intermediate
> > data that doesn’t need to be in memory at all times, so that usage
> > precedent is already established. Hence, the xfile was born!
> > 
> > Historical Sidebar
> > ------------------
> > 
> > "The first edition of online repair inserted records into a new btree
> > as
> > it found them, which failed because the filesystem could shut down
> > with a half-built data structure, which would be live after recovery
> > finished.
> > 
> > "The second edition solved the half-rebuilt structure problem by
> > storing
> > everything in memory, but frequently ran the system out of memory.
> > 
> > "The third edition solved the OOM problem by using linked lists, but
> > the
> > list overhead was extreme."
> Ok, I think that's cleaner
> 
> > 
> > > 
> > > 
> > > > +For the third iteration, attention swung back to the possibility
> > > > of
> > > > using
> > > 
> > > Due to the large volume of metadata that needs to be processed,
> > > ofsck
> > > uses...
> > > 
> > > > +byte-indexed array-like storage to reduce the overhead of in-
> > > > memory
> > > > records.
> > > > +At any given time, online repair does not need to keep the
> > > > entire
> > > > record set in
> > > > +memory, which means that individual records can be paged out.
> > > > +Creating new temporary files in the XFS filesystem to store
> > > > intermediate data
> > > > +was explored and rejected for some types of repairs because a
> > > > filesystem with
> > > > +compromised space and inode metadata should never be used to fix
> > > > compromised
> > > > +space or inode metadata.
> > > > +However, the kernel already has a facility for byte-addressable
> > > > and
> > > > pageable
> > > > +storage: shmfs.
> > > > +In-kernel graphics drivers (most notably i915) take advantage of
> > > > shmfs files
> > > > +to store intermediate data that doesn't need to be in memory at
> > > > all
> > > > times, so
> > > > +that usage precedent is already established.
> > > > +Hence, the ``xfile`` was born!
> > > > +
> > > > +xfile Access Models
> > > > +```````````````````
> > > > +
> > > > +A survey of the intended uses of xfiles suggested these use
> > > > cases:
> > > > +
> > > > +1. Arrays of fixed-sized records (space management btrees,
> > > > directory
> > > > and
> > > > +   extended attribute entries)
> > > > +
> > > > +2. Sparse arrays of fixed-sized records (quotas and link counts)
> > > > +
> > > > +3. Large binary objects (BLOBs) of variable sizes (directory and
> > > > extended
> > > > +   attribute names and values)
> > > > +
> > > > +4. Staging btrees in memory (reverse mapping btrees)
> > > > +
> > > > +5. Arbitrary contents (realtime space management)
> > > > +
> > > > +To support the first four use cases, high level data structures
> > > > wrap
> > > > the xfile
> > > > +to share functionality between online fsck functions.
> > > > +The rest of this section discusses the interfaces that the xfile
> > > > presents to
> > > > +four of those five higher level data structures.
> > > > +The fifth use case is discussed in the :ref:`realtime summary
> > > > <rtsummary>` case
> > > > +study.
> > > > +
> > > > +The most general storage interface supported by the xfile
> > > > enables
> > > > the reading
> > > > +and writing of arbitrary quantities of data at arbitrary offsets
> > > > in
> > > > the xfile.
> > > > +This capability is provided by ``xfile_pread`` and
> > > > ``xfile_pwrite``
> > > > functions,
> > > > +which behave similarly to their userspace counterparts.
> > > > +XFS is very record-based, which suggests that the ability to
> > > > load
> > > > and store
> > > > +complete records is important.
> > > > +To support these cases, a pair of ``xfile_obj_load`` and
> > > > ``xfile_obj_store``
> > > > +functions are provided to read and persist objects into an
> > > > xfile.
> > > > +They are internally the same as pread and pwrite, except that
> > > > they
> > > > treat any
> > > > +error as an out of memory error.
> > > > +For online repair, squashing error conditions in this manner is
> > > > an
> > > > acceptable
> > > > +behavior because the only reaction is to abort the operation
> > > > back to
> > > > userspace.
> > > > +All five xfile use cases can be serviced by these four functions.
> > > > +
> > > > +However, no discussion of file access idioms is complete without
> > > > answering the
> > > > +question, "But what about mmap?"
> > > I actually wouldn't spend too much time discussing solutions that
> > > didn't work for what ever reason, unless someones really asking for
> > > it.
> > >  I think this section would read just fine to trim off the last
> > > paragraph here
> > 
> > Since I wrote this, I've been experimenting with wiring up the tmpfs
> > file page cache folios to the xfs buffer cache.  Pinning the folios
> > in
> > this manner makes it so that online fsck can (more or less) directly
> > access the xfile contents.  Much to my surprise, this has actually
> > held
> > up in testing, so ... it's no longer a solution that "didn't really
> > work". :)
> > 
> > I also need to s/page/folio/ now that willy has finished that
> > conversion.  This section has been rewritten as such:
> > 
> > "However, no discussion of file access idioms is complete without
> > answering the question, “But what about mmap?” It is convenient to
> > access storage directly with pointers, just like userspace code does
> > with regular memory. Online fsck must not drive the system into OOM
> > conditions, which means that xfiles must be responsive to memory
> > reclamation. tmpfs can only push a pagecache folio to the swap cache
> > if
> > the folio is neither pinned nor locked, which means the xfile must
> > not
> > pin too many folios.
> > 
> > "Short term direct access to xfile contents is done by locking the
> > pagecache folio and mapping it into kernel address space.
> > Programmatic
> > access (e.g. pread and pwrite) uses this mechanism. Folio locks are
> > not
> > supposed to be held for long periods of time, so long term direct
> > access
> > to xfile contents is done by bumping the folio refcount, mapping it
> > into
> > kernel address space, and dropping the folio lock. These long term
> > users
> > must be responsive to memory reclaim by hooking into the shrinker
> > infrastructure to know when to release folios.
> > 
> > "The xfile_get_page and xfile_put_page functions are provided to
> > retrieve the (locked) folio that backs part of an xfile and to
> > release
> > it. The only code to use these folio lease functions are the xfarray
> > sorting algorithms and the in-memory btrees."
> Alrighty, sounds like a good update then
> 
> > 
> > > > +It would be *much* more convenient if kernel code could access
> > > > pageable kernel
> > > > +memory with pointers, just like userspace code does with regular
> > > > memory.
> > > > +Like any other filesystem that uses the page cache, reads and
> > > > writes
> > > > of xfile
> > > > +data lock the cache page and map it into the kernel address
> > > > space
> > > > for the
> > > > +duration of the operation.
> > > > +Unfortunately, shmfs can only write a file page to the swap
> > > > device
> > > > if the page
> > > > +is unmapped and unlocked, which means the xfile risks causing
> > > > OOM
> > > > problems
> > > > +unless it is careful not to pin too many pages.
> > > > +Therefore, the xfile steers most of its users towards
> > > > programmatic
> > > > access so
> > > > +that backing pages are not kept locked in memory for longer than
> > > > is
> > > > necessary.
> > > > +However, for callers performing quick linear scans of xfile
> > > > data,
> > > > +``xfile_get_page`` and ``xfile_put_page`` functions are provided
> > > > to
> > > > pin a page
> > > > +in memory.
> > > > +So far, the only users of these functions are the xfarray
> > > > :ref:`sorting
> > > > +<xfarray_sort>` algorithms.
> > > > +
> > > > +xfile Access Coordination
> > > > +`````````````````````````
> > > > +
> > > > +For security reasons, xfiles must be owned privately by the
> > > > kernel.
> > > > +They are marked ``S_PRIVATE`` to prevent interference from the
> > > > security system,
> > > > +must never be mapped into process file descriptor tables, and
> > > > their
> > > > pages must
> > > > +never be mapped into userspace processes.
> > > > +
> > > > +To avoid locking recursion issues with the VFS, all accesses to
> > > > the
> > > > shmfs file
> > > > +are performed by manipulating the page cache directly.
> > > > +xfile writes call the ``->write_begin`` and ``->write_end``
> > > > functions of the
> > > > +xfile's address space to grab writable pages, copy the caller's
> > > > buffer into the
> > > > +page, and release the pages.
> > > > +xfile reads call ``shmem_read_mapping_page_gfp`` to grab pages
> > > xfile readers
> > 
> > OK.
> > 
> > > > directly before
> > > > +copying the contents into the caller's buffer.
> > > > +In other words, xfiles ignore the VFS read and write code paths
> > > > to
> > > > avoid
> > > > +having to create a dummy ``struct kiocb`` and to avoid taking
> > > > inode
> > > > and
> > > > +freeze locks.
> > > > +
> > > > +If an xfile is shared between threads to stage repairs, the
> > > > caller
> > > > must provide
> > > > +its own locks to coordinate access.
> > > Ofsck threads that share an xfile between stage repairs will use
> > > their
> > > own locks to coordinate access with each other.
> > > 
> > > ?
> > 
> > Hm.  I wonder if there's a misunderstanding here?
> > 
> > Online fsck functions themselves are single-threaded, which is to say
> > that they themselves neither queue workers nor start kthreads. 
> > However,
> > an xfile created by a running fsck function can be accessed from
> > other
> > thread if the fsck function also hooks itself into filesystem code.
> > 
> > The live update section has a nice diagram of how that works:
> > https://djwong.org/docs/xfs-online-fsck-design/#filesystem-hooks
> > 
> 
> Oh ok, I think I got hung up on who the callers were.  How about
> "xfiles shared between threads running from hooked filesystem functions
> will use their own locks to coordinate access with each other."

I don't want to mention filesystem hooks before the chapter that
introduces them.  How about:

"For example, if a scrub function stores scan results in an xfile and
needs other threads to provide updates to the scanned data, the scrub
function must provide a lock for all threads to share."

--D
Allison Henderson Feb. 25, 2023, 7:32 a.m. UTC | #5
On Thu, 2023-02-09 at 15:14 -0800, Darrick J. Wong wrote:
> On Thu, Feb 09, 2023 at 05:41:22AM +0000, Allison Henderson wrote:
> > On Thu, 2023-02-02 at 15:14 -0800, Darrick J. Wong wrote:
> > > On Thu, Feb 02, 2023 at 07:14:22AM +0000, Allison Henderson
> > > wrote:
> > > > On Fri, 2022-12-30 at 14:10 -0800, Darrick J. Wong wrote:
> > > > > From: Darrick J. Wong <djwong@kernel.org>
> > > > > 
> > > > > Add a discussion of pageable kernel memory, since online fsck
> > > > > needs
> > > > > quite a bit more memory than most other parts of the
> > > > > filesystem
> > > > > to
> > > > > stage
> > > > > records and other information.
> > > > > 
> > > > > Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> > > > > ---
> > > > >  .../filesystems/xfs-online-fsck-design.rst         |  490
> > > > > ++++++++++++++++++++
> > > > >  1 file changed, 490 insertions(+)
> > > > > 
> > > > > 
> > > > > diff --git a/Documentation/filesystems/xfs-online-fsck-
> > > > > design.rst
> > > > > b/Documentation/filesystems/xfs-online-fsck-design.rst
> > > > > index 419eb54ee200..9d7a2ef1d0dd 100644
> > > > > --- a/Documentation/filesystems/xfs-online-fsck-design.rst
> > > > > +++ b/Documentation/filesystems/xfs-online-fsck-design.rst
> > > > > @@ -383,6 +383,8 @@ Algorithms") of Srinivasan.
> > > > >  However, any data structure builder that maintains a
> > > > > resource
> > > > > lock
> > > > > for the
> > > > >  duration of the repair is *always* an offline algorithm.
> > > > >  
> > > > > +.. _secondary_metadata:
> > > > > +
> > > > >  Secondary Metadata
> > > > >  ``````````````````
> > > > >  
> > > > > @@ -1746,3 +1748,491 @@ Scrub teardown disables all static
> > > > > keys
> > > > > obtained by ``xchk_fshooks_enable``.
> > > > >  
> > > > >  For more information, please see the kernel documentation of
> > > > >  Documentation/staging/static-keys.rst.
> > > > > +
> > > > > +.. _xfile:
> > > > > +
> > > > > +Pageable Kernel Memory
> > > > > +----------------------
> > > > > +
> > > > > +Demonstrations of the first few prototypes of online repair
> > > > > revealed
> > > > > new
> > > > > +technical requirements that were not originally identified.
> > > > > +For the first demonstration, the code walked whatever
> > > > > filesystem
> > > > > +metadata it needed to synthesize new records and inserted
> > > > > records
> > > > > into a new
> > > > > +btree as it found them.
> > > > > +This was subpar since any additional corruption or runtime
> > > > > errors
> > > > > encountered
> > > > > +during the walk would shut down the filesystem.
> > > > > +After remount, the blocks containing the half-rebuilt data
> > > > > structure
> > > > > would not
> > > > > +be accessible until another repair was attempted.
> > > > > +Solving the problem of half-rebuilt data structures will be
> > > > > discussed in the
> > > > > +next section.
> > > > > +
> > > > > +For the second demonstration, the synthesized records were
> > > > > instead
> > > > > stored in
> > > > > +kernel slab memory.
> > > > > +Doing so enabled online repair to abort without writing to
> > > > > the
> > > > > filesystem if
> > > > > +the metadata walk failed, which prevented online fsck from
> > > > > making
> > > > > things worse.
> > > > > +However, even this approach needed improving upon.
> > > > > +
> > > > > +There are four reasons why traditional Linux kernel memory
> > > > > management isn't
> > > > > +suitable for storing large datasets:
> > > > > +
> > > > > +1. Although it is tempting to allocate a contiguous block of
> > > > > memory
> > > > > to create a
> > > > > +   C array, this cannot easily be done in the kernel because
> > > > > it
> > > > > cannot be
> > > > > +   relied upon to allocate multiple contiguous memory pages.
> > > > > +
> > > > > +2. While disparate physical pages can be virtually mapped
> > > > > together,
> > > > > installed
> > > > > +   memory might still not be large enough to stage the
> > > > > entire
> > > > > record
> > > > > set in
> > > > > +   memory while constructing a new btree.
> > > > > +
> > > > > +3. To overcome these two difficulties, the implementation
> > > > > was
> > > > > adjusted to use
> > > > > +   doubly linked lists, which means every record object
> > > > > needed
> > > > > two
> > > > > 64-bit list
> > > > > +   head pointers, which is a lot of overhead.
> > > > > +
> > > > > +4. Kernel memory is pinned, which can drive the system out
> > > > > of
> > > > > memory, leading
> > > > > +   to OOM kills of unrelated processes.
> > > > > +
> > > > I think I maybe might just jump to what ever the current plan
> > > > is
> > > > instead of trying to keep a record of the dev history in the
> > > > document.
> > > > I'm sure we're not done yet, dev really never is, so in order
> > > > for
> > > > the
> > > > documentation to be maintained, it would just get bigger and
> > > > bigger
> > > > to
> > > > keep documenting it this way.  It's not that the above isnt
> > > > valuable,
> > > > but maybe a different kind of document really.
> > > 
> > > OK, I've shortened this introduction to outline the requirements,
> > > and
> > > trimmed the historical information to a sidebar:
> > > 
> > > "Some online checking functions work by scanning the filesystem
> > > to
> > > build
> > > a shadow copy of an ondisk metadata structure in memory and
> > > comparing
> > > the two copies. For online repair to rebuild a metadata
> > > structure, it
> > > must compute the record set that will be stored in the new
> > > structure
> > > before it can persist that new structure to disk. Ideally,
> > > repairs
> > > complete with a single atomic commit that introduces a new data
> > > structure. To meet these goals, the kernel needs to collect a
> > > large
> > > amount of information in a place that doesn’t require the correct
> > > operation of the filesystem.
> > > 
> > > "Kernel memory isn’t suitable because:
> > > 
> > > *   Allocating a contiguous region of memory to create a C array
> > > is
> > > very
> > >     difficult, especially on 32-bit systems.
> > > 
> > > *   Linked lists of records introduce double pointer overhead
> > > which
> > > is
> > >     very high and eliminate the possibility of indexed lookups.
> > > 
> > > *   Kernel memory is pinned, which can drive the system into OOM
> > >     conditions.
> > > 
> > > *   The system might not have sufficient memory to stage all the
> > >     information.
> > > 
> > > "At any given time, online fsck does not need to keep the entire
> > > record
> > > set in memory, which means that individual records can be paged
> > > out
> > > if
> > > necessary. Continued development of online fsck demonstrated that
> > > the
> > > ability to perform indexed data storage would also be very
> > > useful.
> > > Fortunately, the Linux kernel already has a facility for
> > > byte-addressable and pageable storage: tmpfs. In-kernel graphics
> > > drivers
> > > (most notably i915) take advantage of tmpfs files to store
> > > intermediate
> > > data that doesn’t need to be in memory at all times, so that
> > > usage
> > > precedent is already established. Hence, the xfile was born!
> > > 
> > > Historical Sidebar
> > > ------------------
> > > 
> > > "The first edition of online repair inserted records into a new
> > > btree
> > > as
> > > it found them, which failed because the filesystem could shut down
> > > with a half-built data structure, which would be live after recovery
> > > finished.
> > > 
> > > "The second edition solved the half-rebuilt structure problem by
> > > storing
> > > everything in memory, but frequently ran the system out of
> > > memory.
> > > 
> > > "The third edition solved the OOM problem by using linked lists,
> > > but
> > > the
> > > list overhead was extreme."
> > Ok, I think that's cleaner
> > 
> > > 
> > > > 
> > > > 
> > > > > +For the third iteration, attention swung back to the
> > > > > possibility
> > > > > of
> > > > > using
> > > > 
> > > > Due to the large volume of metadata that needs to be processed,
> > > > ofsck
> > > > uses...
> > > > 
> > > > > +byte-indexed array-like storage to reduce the overhead of
> > > > > in-
> > > > > memory
> > > > > records.
> > > > > +At any given time, online repair does not need to keep the
> > > > > entire
> > > > > record set in
> > > > > +memory, which means that individual records can be paged
> > > > > out.
> > > > > +Creating new temporary files in the XFS filesystem to store
> > > > > intermediate data
> > > > > +was explored and rejected for some types of repairs because
> > > > > a
> > > > > filesystem with
> > > > > +compromised space and inode metadata should never be used to
> > > > > fix
> > > > > compromised
> > > > > +space or inode metadata.
> > > > > +However, the kernel already has a facility for byte-
> > > > > addressable
> > > > > and
> > > > > pageable
> > > > > +storage: shmfs.
> > > > > +In-kernel graphics drivers (most notably i915) take
> > > > > advantage of
> > > > > shmfs files
> > > > > +to store intermediate data that doesn't need to be in memory
> > > > > at
> > > > > all
> > > > > times, so
> > > > > +that usage precedent is already established.
> > > > > +Hence, the ``xfile`` was born!
> > > > > +
> > > > > +xfile Access Models
> > > > > +```````````````````
> > > > > +
> > > > > +A survey of the intended uses of xfiles suggested these use
> > > > > cases:
> > > > > +
> > > > > +1. Arrays of fixed-sized records (space management btrees,
> > > > > directory
> > > > > and
> > > > > +   extended attribute entries)
> > > > > +
> > > > > +2. Sparse arrays of fixed-sized records (quotas and link
> > > > > counts)
> > > > > +
> > > > > +3. Large binary objects (BLOBs) of variable sizes (directory
> > > > > and
> > > > > extended
> > > > > +   attribute names and values)
> > > > > +
> > > > > +4. Staging btrees in memory (reverse mapping btrees)
> > > > > +
> > > > > +5. Arbitrary contents (realtime space management)
> > > > > +
> > > > > +To support the first four use cases, high level data
> > > > > structures
> > > > > wrap
> > > > > the xfile
> > > > > +to share functionality between online fsck functions.
> > > > > +The rest of this section discusses the interfaces that the
> > > > > xfile
> > > > > presents to
> > > > > +four of those five higher level data structures.
> > > > > +The fifth use case is discussed in the :ref:`realtime
> > > > > summary
> > > > > <rtsummary>` case
> > > > > +study.
> > > > > +
> > > > > +The most general storage interface supported by the xfile
> > > > > enables
> > > > > the reading
> > > > > +and writing of arbitrary quantities of data at arbitrary
> > > > > offsets
> > > > > in
> > > > > the xfile.
> > > > > +This capability is provided by ``xfile_pread`` and
> > > > > ``xfile_pwrite``
> > > > > functions,
> > > > > +which behave similarly to their userspace counterparts.
> > > > > +XFS is very record-based, which suggests that the ability to
> > > > > load
> > > > > and store
> > > > > +complete records is important.
> > > > > +To support these cases, a pair of ``xfile_obj_load`` and
> > > > > ``xfile_obj_store``
> > > > > +functions are provided to read and persist objects into an
> > > > > xfile.
> > > > > +They are internally the same as pread and pwrite, except
> > > > > that
> > > > > they
> > > > > treat any
> > > > > +error as an out of memory error.
> > > > > +For online repair, squashing error conditions in this manner
> > > > > is
> > > > > an
> > > > > acceptable
> > > > > +behavior because the only reaction is to abort the operation
> > > > > back to
> > > > > userspace.
> > > > > +All five xfile usecases can be serviced by these four
> > > > > functions.
> > > > > +
> > > > > +However, no discussion of file access idioms is complete
> > > > > without
> > > > > answering the
> > > > > +question, "But what about mmap?"
> > > > I actually wouldn't spend too much time discussing solutions
> > > > that
> > > > didn't work for what ever reason, unless someones really asking
> > > > for
> > > > it.
> > > >  I think this section would read just fine to trim off the last
> > > > paragraph here
> > > 
> > > Since I wrote this, I've been experimenting with wiring up the
> > > tmpfs
> > > file page cache folios to the xfs buffer cache.  Pinning the
> > > folios
> > > in
> > > this manner makes it so that online fsck can (more or less)
> > > directly
> > > access the xfile contents.  Much to my surprise, this has
> > > actually
> > > held
> > > up in testing, so ... it's no longer a solution that "didn't
> > > really
> > > work". :)
> > > 
> > > I also need to s/page/folio/ now that willy has finished that
> > > conversion.  This section has been rewritten as such:
> > > 
> > > "However, no discussion of file access idioms is complete without
> > > answering the question, “But what about mmap?” It is convenient
> > > to
> > > access storage directly with pointers, just like userspace code
> > > does
> > > with regular memory. Online fsck must not drive the system into
> > > OOM
> > > conditions, which means that xfiles must be responsive to memory
> > > reclamation. tmpfs can only push a pagecache folio to the swap
> > > cache
> > > if
> > > the folio is neither pinned nor locked, which means the xfile
> > > must
> > > not
> > > pin too many folios.
> > > 
> > > "Short term direct access to xfile contents is done by locking
> > > the
> > > pagecache folio and mapping it into kernel address space.
> > > Programmatic
> > > access (e.g. pread and pwrite) uses this mechanism. Folio locks
> > > are
> > > not
> > > supposed to be held for long periods of time, so long term direct
> > > access
> > > to xfile contents is done by bumping the folio refcount, mapping
> > > it
> > > into
> > > kernel address space, and dropping the folio lock. These long
> > > term
> > > users
> > > must be responsive to memory reclaim by hooking into the shrinker
> > > infrastructure to know when to release folios.
> > > 
> > > "The xfile_get_page and xfile_put_page functions are provided to
> > > retrieve the (locked) folio that backs part of an xfile and to
> > > release
> > > it. The only code to use these folio lease functions are the
> > > xfarray
> > > sorting algorithms and the in-memory btrees."
> > Alrighty, sounds like a good update then
> > 
> > > 
> > > > > +It would be *much* more convenient if kernel code could
> > > > > access
> > > > > pageable kernel
> > > > > +memory with pointers, just like userspace code does with
> > > > > regular
> > > > > memory.
> > > > > +Like any other filesystem that uses the page cache, reads
> > > > > and
> > > > > writes
> > > > > of xfile
> > > > > +data lock the cache page and map it into the kernel address
> > > > > space
> > > > > for the
> > > > > +duration of the operation.
> > > > > +Unfortunately, shmfs can only write a file page to the swap
> > > > > device
> > > > > if the page
> > > > > +is unmapped and unlocked, which means the xfile risks
> > > > > causing
> > > > > OOM
> > > > > problems
> > > > > +unless it is careful not to pin too many pages.
> > > > > +Therefore, the xfile steers most of its users towards
> > > > > programmatic
> > > > > access so
> > > > > +that backing pages are not kept locked in memory for longer
> > > > > than
> > > > > is
> > > > > necessary.
> > > > > +However, for callers performing quick linear scans of xfile
> > > > > data,
> > > > > +``xfile_get_page`` and ``xfile_put_page`` functions are
> > > > > provided
> > > > > to
> > > > > pin a page
> > > > > +in memory.
> > > > > +So far, the only code to use these functions are the xfarray
> > > > > :ref:`sorting
> > > > > +<xfarray_sort>` algorithms.
> > > > > +
> > > > > +xfile Access Coordination
> > > > > +`````````````````````````
> > > > > +
> > > > > +For security reasons, xfiles must be owned privately by the
> > > > > kernel.
> > > > > +They are marked ``S_PRIVATE`` to prevent interference from
> > > > > the
> > > > > security system,
> > > > > +must never be mapped into process file descriptor tables,
> > > > > and
> > > > > their
> > > > > pages must
> > > > > +never be mapped into userspace processes.
> > > > > +
> > > > > +To avoid locking recursion issues with the VFS, all accesses
> > > > > to
> > > > > the
> > > > > shmfs file
> > > > > +are performed by manipulating the page cache directly.
> > > > > +xfile writes call the ``->write_begin`` and ``->write_end``
> > > > > functions of the
> > > > > +xfile's address space to grab writable pages, copy the
> > > > > caller's
> > > > > buffer into the
> > > > > +page, and release the pages.
> > > > > +xfile reads call ``shmem_read_mapping_page_gfp`` to grab
> > > > > pages
> > > > xfile readers
> > > 
> > > OK.
> > > 
> > > > > directly before
> > > > > +copying the contents into the caller's buffer.
> > > > > +In other words, xfiles ignore the VFS read and write code
> > > > > paths
> > > > > to
> > > > > avoid
> > > > > +having to create a dummy ``struct kiocb`` and to avoid
> > > > > taking
> > > > > inode
> > > > > and
> > > > > +freeze locks.
> > > > > +
> > > > > +If an xfile is shared between threads to stage repairs, the
> > > > > caller
> > > > > must provide
> > > > > +its own locks to coordinate access.
> > > > Ofsck threads that share an xfile between stage repairs will
> > > > use
> > > > their
> > > > own locks to coordinate access with each other.
> > > > 
> > > > ?
> > > 
> > > Hm.  I wonder if there's a misunderstanding here?
> > > 
> > > Online fsck functions themselves are single-threaded, which is to
> > > say
> > > that they themselves neither queue workers nor start kthreads. 
> > > However,
> > > an xfile created by a running fsck function can be accessed from
> > > other
> > > thread if the fsck function also hooks itself into filesystem
> > > code.
> > > 
> > > The live update section has a nice diagram of how that works:
> > > https://djwong.org/docs/xfs-online-fsck-design/#filesystem-hooks
> > > 
> > 
> > Oh ok, I think I got hung up on who the callers were.  How about
> > "xfiles shared between threads running from hooked filesystem
> > functions
> > will use their own locks to coordinate access with each other."
> 
> I don't want to mention filesystem hooks before the chapter that
> introduces them.  How about:
> 
> "For example, if a scrub function stores scan results in an xfile and
> needs other threads to provide updates to the scanned data, the scrub
> function must provide a lock for all threads to share."
Oh, I didn't see this response....

Ok, I think that sounds fine.  Alternatively, I think if patch 10 were to
move up, then it would have sounded fine since we introduce hooks
there, but I think either way works

Allison
> 
> --D
diff mbox series

Patch

diff --git a/Documentation/filesystems/xfs-online-fsck-design.rst b/Documentation/filesystems/xfs-online-fsck-design.rst
index 419eb54ee200..9d7a2ef1d0dd 100644
--- a/Documentation/filesystems/xfs-online-fsck-design.rst
+++ b/Documentation/filesystems/xfs-online-fsck-design.rst
@@ -383,6 +383,8 @@  Algorithms") of Srinivasan.
 However, any data structure builder that maintains a resource lock for the
 duration of the repair is *always* an offline algorithm.
 
+.. _secondary_metadata:
+
 Secondary Metadata
 ``````````````````
 
@@ -1746,3 +1748,491 @@  Scrub teardown disables all static keys obtained by ``xchk_fshooks_enable``.
 
 For more information, please see the kernel documentation of
 Documentation/staging/static-keys.rst.
+
+.. _xfile:
+
+Pageable Kernel Memory
+----------------------
+
+Demonstrations of the first few prototypes of online repair revealed new
+technical requirements that were not originally identified.
+For the first demonstration, the code walked whatever filesystem
+metadata it needed to synthesize new records and inserted records into a new
+btree as it found them.
+This was subpar since any additional corruption or runtime errors encountered
+during the walk would shut down the filesystem.
+After remount, the blocks containing the half-rebuilt data structure would not
+be accessible until another repair was attempted.
+Solving the problem of half-rebuilt data structures will be discussed in the
+next section.
+
+For the second demonstration, the synthesized records were instead stored in
+kernel slab memory.
+Doing so enabled online repair to abort without writing to the filesystem if
+the metadata walk failed, which prevented online fsck from making things worse.
+However, even this approach needed improving upon.
+
+There are four reasons why traditional Linux kernel memory management isn't
+suitable for storing large datasets:
+
+1. Although it is tempting to allocate a contiguous block of memory to create a
+   C array, this cannot easily be done in the kernel because it cannot be
+   relied upon to allocate multiple contiguous memory pages.
+
+2. While disparate physical pages can be virtually mapped together, installed
+   memory might still not be large enough to stage the entire record set in
+   memory while constructing a new btree.
+
+3. To overcome these two difficulties, the implementation was adjusted to use
+   doubly linked lists, which means every record object needed two 64-bit list
+   head pointers, which is a lot of overhead.
+
+4. Kernel memory is pinned, which can drive the system out of memory, leading
+   to OOM kills of unrelated processes.
+
+For the third iteration, attention swung back to the possibility of using
+byte-indexed array-like storage to reduce the overhead of in-memory records.
+At any given time, online repair does not need to keep the entire record set in
+memory, which means that individual records can be paged out.
+Creating new temporary files in the XFS filesystem to store intermediate data
+was explored and rejected for some types of repairs because a filesystem with
+compromised space and inode metadata should never be used to fix compromised
+space or inode metadata.
+However, the kernel already has a facility for byte-addressable and pageable
+storage: shmfs.
+In-kernel graphics drivers (most notably i915) take advantage of shmfs files
+to store intermediate data that doesn't need to be in memory at all times, so
+that usage precedent is already established.
+Hence, the ``xfile`` was born!
+
+xfile Access Models
+```````````````````
+
+A survey of the intended uses of xfiles suggested these use cases:
+
+1. Arrays of fixed-sized records (space management btrees, directory and
+   extended attribute entries)
+
+2. Sparse arrays of fixed-sized records (quotas and link counts)
+
+3. Large binary objects (BLOBs) of variable sizes (directory and extended
+   attribute names and values)
+
+4. Staging btrees in memory (reverse mapping btrees)
+
+5. Arbitrary contents (realtime space management)
+
+To support the first four use cases, high level data structures wrap the xfile
+to share functionality between online fsck functions.
+The rest of this section discusses the interfaces that the xfile presents to
+four of those five higher level data structures.
+The fifth use case is discussed in the :ref:`realtime summary <rtsummary>` case
+study.
+
+The most general storage interface supported by the xfile enables the reading
+and writing of arbitrary quantities of data at arbitrary offsets in the xfile.
+This capability is provided by ``xfile_pread`` and ``xfile_pwrite`` functions,
+which behave similarly to their userspace counterparts.
+XFS is very record-based, which suggests that the ability to load and store
+complete records is important.
+To support these cases, a pair of ``xfile_obj_load`` and ``xfile_obj_store``
+functions are provided to read and persist objects into an xfile.
+They are internally the same as pread and pwrite, except that they treat any
+error as an out of memory error.
+For online repair, squashing error conditions in this manner is an acceptable
+behavior because the only reaction is to abort the operation back to userspace.
+All five xfile use cases can be serviced by these four functions.
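+
+As an example, a repair function that stages fixed-size records might use the
+object load and store functions as follows.
+This is a minimal sketch that assumes a pread/pwrite-style calling convention
+for these functions; the record type is hypothetical:
+
+.. code-block:: c
+
+	struct xrep_rec {
+		xfs_agblock_t	bno;
+		xfs_extlen_t	len;
+	};
+
+	/* Persist one record at the given xfile offset, then reread it. */
+	STATIC int
+	xrep_stage_record(struct xfile *xf, loff_t pos)
+	{
+		struct xrep_rec	rec = { .bno = 12, .len = 42 };
+		int		error;
+
+		/* Failures are reported as out of memory errors. */
+		error = xfile_obj_store(xf, &rec, sizeof(rec), pos);
+		if (error)
+			return error;
+
+		return xfile_obj_load(xf, &rec, sizeof(rec), pos);
+	}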
+
+However, no discussion of file access idioms is complete without answering the
+question, "But what about mmap?"
+It would be *much* more convenient if kernel code could access pageable kernel
+memory with pointers, just like userspace code does with regular memory.
+Like any other filesystem that uses the page cache, reads and writes of xfile
+data lock the cache page and map it into the kernel address space for the
+duration of the operation.
+Unfortunately, shmfs can only write a file page to the swap device if the page
+is unmapped and unlocked, which means the xfile risks causing OOM problems
+unless it is careful not to pin too many pages.
+Therefore, the xfile steers most of its users towards programmatic access so
+that backing pages are not kept locked in memory for longer than is necessary.
+However, for callers performing quick linear scans of xfile data,
+``xfile_get_page`` and ``xfile_put_page`` functions are provided to pin a page
+in memory.
+So far, the only code to use these functions are the xfarray :ref:`sorting
+<xfarray_sort>` algorithms.
+
+xfile Access Coordination
+`````````````````````````
+
+For security reasons, xfiles must be owned privately by the kernel.
+They are marked ``S_PRIVATE`` to prevent interference from the security system,
+must never be mapped into process file descriptor tables, and their pages must
+never be mapped into userspace processes.
+
+To avoid locking recursion issues with the VFS, all accesses to the shmfs file
+are performed by manipulating the page cache directly.
+xfile writes call the ``->write_begin`` and ``->write_end`` functions of the
+xfile's address space to grab writable pages, copy the caller's buffer into the
+page, and release the pages.
+xfile reads call ``shmem_read_mapping_page_gfp`` to grab pages directly before
+copying the contents into the caller's buffer.
+In other words, xfiles ignore the VFS read and write code paths to avoid
+having to create a dummy ``struct kiocb`` and to avoid taking inode and
+freeze locks.
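+
+A sketch of the read path under these constraints might look like the
+following, assuming that an xfile wraps a ``struct file`` and eliding the
+partial-page arithmetic and error handling details:
+
+.. code-block:: c
+
+	/* Read one page of xfile data, bypassing the VFS read path. */
+	STATIC int
+	xfile_read_page(struct xfile *xf, void *buf, pgoff_t index)
+	{
+		struct inode	*inode = file_inode(xf->file);
+		struct page	*page;
+		void		*kaddr;
+
+		page = shmem_read_mapping_page_gfp(inode->i_mapping, index,
+				GFP_KERNEL);
+		if (IS_ERR(page))
+			return PTR_ERR(page);
+
+		/* Map the page just long enough to copy out its contents. */
+		kaddr = kmap_local_page(page);
+		memcpy(buf, kaddr, PAGE_SIZE);
+		kunmap_local(kaddr);
+
+		put_page(page);
+		return 0;
+	}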
+
+If an xfile is shared between threads to stage repairs, the caller must provide
+its own locks to coordinate access.
+
+.. _xfarray:
+
+Arrays of Fixed-Sized Records
+`````````````````````````````
+
+In XFS, each type of indexed space metadata (free space, inodes, reference
+counts, file fork space, and reverse mappings) consists of a set of fixed-size
+records indexed with a classic B+ tree.
+Directories have a set of fixed-size dirent records that point to the names,
+and extended attributes have a set of fixed-size attribute keys that point to
+names and values.
+Quota counters and file link counters index records with numbers.
+During a repair, scrub needs to stage new records during the gathering step and
+retrieve them during the btree building step.
+
+Although this requirement can be satisfied by calling the read and write
+methods of the xfile directly, it is simpler for callers if there is a
+higher level abstraction to take care of computing array offsets, to provide
+iterator functions, and to deal with sparse records and sorting.
+The ``xfarray`` abstraction presents a linear array for fixed-size records atop
+the byte-accessible xfile.
+
+.. _xfarray_access_patterns:
+
+Array Access Patterns
+^^^^^^^^^^^^^^^^^^^^^
+
+Array access patterns in online fsck tend to fall into three categories.
+Iteration of records is assumed to be necessary for all cases and will be
+covered in the next section.
+
+The first type of caller handles records that are indexed by position.
+Gaps may exist between records, and a record may be updated multiple times
+during the collection step.
+In other words, these callers want a sparse linearly addressed table file.
+Typical use cases are quota records or file link count records.
+Access to array elements is performed programmatically via ``xfarray_load`` and
+``xfarray_store`` functions, which wrap the similarly-named xfile functions to
+provide loading and storing of array elements at arbitrary array indices.
+Gaps are defined to be null records, and null records are defined to be a
+sequence of all zero bytes.
+Null records are detected by calling ``xfarray_element_is_null``.
+They are created either by calling ``xfarray_unset`` to null out an existing
+record or by never storing anything to an array index.
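+
+For example, a quota repair function might accumulate usage counts like this.
+This is a sketch with a hypothetical record layout, and it assumes that
+loading a never-written index yields a null (all zeroes) record:
+
+.. code-block:: c
+
+	struct xrep_quota_rec {
+		uint64_t	blocks;
+		uint64_t	inodes;
+	};
+
+	/* Charge one file's space usage to its owner's staged record. */
+	STATIC int
+	xrep_quota_account(struct xfarray *counts, uint32_t id,
+			uint64_t blocks)
+	{
+		struct xrep_quota_rec	rec;
+		int			error;
+
+		error = xfarray_load(counts, id, &rec);
+		if (error)
+			return error;
+
+		rec.blocks += blocks;
+		rec.inodes++;
+		return xfarray_store(counts, id, &rec);
+	}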
+
+The second type of caller handles records that are not indexed by position
+and do not require multiple updates to a record.
+The typical use case here is rebuilding space btrees and key/value btrees.
+These callers can add records to the array without caring about array indices
+via the ``xfarray_append`` function, which stores a record at the end of the
+array.
+For callers that require records to be presentable in a specific order (e.g.
+rebuilding btree data), the ``xfarray_sort`` function can arrange the records
+in sorted order; this function will be covered later.
+
+The third type of caller is a bag, which is useful for counting records.
+The typical use case here is constructing space extent reference counts from
+reverse mapping information.
+Records can be put in the bag in any order, they can be removed from the bag
+at any time, and uniqueness of records is left to callers.
+The ``xfarray_store_anywhere`` function is used to insert a record in any
+null record slot in the bag, and the ``xfarray_unset`` function removes a
+record from the bag.
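+
+In code, the second and third access patterns reduce to thin wrappers around
+these functions.
+The sketch below assumes the signatures from the proposed patchset and uses
+the existing ``struct xfs_rmap_irec`` record type for illustration:
+
+.. code-block:: c
+
+	/* Pattern 2: stage a record for a later sorted bulk load. */
+	STATIC int
+	xrep_stage_rmap(struct xfarray *records,
+			const struct xfs_rmap_irec *rec)
+	{
+		/* Insertion order does not matter; xfarray_sort fixes it. */
+		return xfarray_append(records, rec);
+	}
+
+	/* Pattern 3: put a record into, or remove one from, a bag. */
+	STATIC int
+	xrep_bag_add(struct xfarray *bag, const struct xfs_rmap_irec *rec)
+	{
+		return xfarray_store_anywhere(bag, rec);
+	}
+
+	STATIC int
+	xrep_bag_remove(struct xfarray *bag, xfarray_idx_t idx)
+	{
+		return xfarray_unset(bag, idx);
+	}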
+
+The proposed patchset is the
+`big in-memory array
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=big-array>`_.
+
+Iterating Array Elements
+^^^^^^^^^^^^^^^^^^^^^^^^
+
+Most users of the xfarray require the ability to iterate the records stored in
+the array.
+Callers can probe every possible array index with the following:
+
+.. code-block:: c
+
+	xfarray_idx_t i;
+	foreach_xfarray_idx(array, i) {
+	    xfarray_load(array, i, &rec);
+
+	    /* do something with rec */
+	}
+
+All users of this idiom must be prepared to handle null records or must already
+know that there aren't any.
+
+For xfarray users that want to iterate a sparse array, the ``xfarray_iter``
+function ignores indices in the xfarray that have never been written to by
+calling ``xfile_seek_data`` (which internally uses ``SEEK_DATA``) to skip areas
+of the array that are not populated with memory pages.
+Once it finds a page, it will skip the zeroed areas of the page.
+
+.. code-block:: c
+
+	xfarray_idx_t i = XFARRAY_CURSOR_INIT;
+	while ((ret = xfarray_iter(array, &i, &rec)) == 1) {
+	    /* do something with rec */
+	}
+
+.. _xfarray_sort:
+
+Sorting Array Elements
+^^^^^^^^^^^^^^^^^^^^^^
+
+During the fourth demonstration of online repair, a community reviewer remarked
+that for performance reasons, online repair ought to load batches of records
+into btree record blocks instead of inserting records into a new btree one at a
+time.
+The btree insertion code in XFS is responsible for maintaining correct ordering
+of the records, so naturally the xfarray must also support sorting the record
+set prior to bulk loading.
+
+The sorting algorithm used in the xfarray is actually a combination of adaptive
+quicksort and a heapsort subalgorithm in the spirit of
+`Sedgewick <https://algs4.cs.princeton.edu/23quicksort/>`_ and
+`pdqsort <https://github.com/orlp/pdqsort>`_, with customizations for the Linux
+kernel.
+To sort records in a reasonably short amount of time, ``xfarray`` takes
+advantage of the binary subpartitioning offered by quicksort, but it also uses
+heapsort to hedge against performance collapse if the chosen quicksort pivots
+are poor.
+Both algorithms are (in general) O(n * lg(n)), but there is a wide performance
+gulf between the two implementations.
+
+The Linux kernel already contains a reasonably fast implementation of heapsort.
+It only operates on regular C arrays, which limits the scope of its usefulness.
+There are two key places where the xfarray uses it:
+
+* Sorting any record subset backed by a single xfile page.
+
+* Loading a small number of xfarray records from potentially disparate parts
+  of the xfarray into a memory buffer, and sorting the buffer.
+
+In other words, ``xfarray`` uses heapsort to constrain the nested recursion of
+quicksort, thereby mitigating quicksort's worst runtime behavior.
+
+Choosing a quicksort pivot is a tricky business.
+A good pivot splits the set to sort in half, leading to the divide and conquer
+behavior that is crucial to O(n * lg(n)) performance.
+A poor pivot barely splits the subset at all, leading to O(n\ :sup:`2`)
+runtime.
+The xfarray sort routine tries to avoid picking a bad pivot by sampling nine
+records into a memory buffer and using the kernel heapsort to identify the
+median of the nine.
+
+Most modern quicksort implementations employ Tukey's "ninther" to select a
+pivot from a classic C array.
+Typical ninther implementations pick three unique triads of records, sort each
+of the triads, and then sort the middle value of each triad to determine the
+ninther value.
+As stated previously, however, xfile accesses are not entirely cheap.
+It turned out to be much more performant to read the nine elements into a
+memory buffer, run the kernel's in-memory heapsort on the buffer, and choose
+the 4th element of that buffer as the pivot.
+Tukey's ninthers are described in J. W. Tukey, `The ninther, a technique for
+low-effort robust (resistant) location in large samples`, in *Contributions to
+Survey Sampling and Applied Statistics*, edited by H. David, (Academic Press,
+1978), pp. 251–257.
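+
+A sketch of this pivot selection strategy follows; the function name is
+hypothetical, error handling is elided, and the real routine must load the
+samples through the xfile:
+
+.. code-block:: c
+
+	/*
+	 * Sample nine records spread evenly across the subset, heapsort
+	 * them with the kernel's sort(), and take the middle element of
+	 * the sorted sample as the quicksort pivot.
+	 */
+	STATIC void
+	xfarray_pick_pivot(struct xfarray *array, xfarray_idx_t lo,
+			xfarray_idx_t hi, size_t recsize, void *samples,
+			void *pivot, cmp_func_t cmp)
+	{
+		xfarray_idx_t	gap = (hi - lo) / 8;
+		unsigned int	i;
+
+		for (i = 0; i < 9; i++)
+			xfarray_load(array, lo + i * gap,
+					samples + i * recsize);
+
+		sort(samples, 9, recsize, cmp, NULL);
+		memcpy(pivot, samples + 4 * recsize, recsize);
+	}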
+
+The partitioning of quicksort is fairly textbook -- rearrange the record
+subset around the pivot, then set up the current and next stack frames to
+sort the larger and the smaller halves around the pivot, respectively.
+This keeps the stack space requirements to log2(record count).
+
+As a final performance optimization, the hi and lo scanning phase of quicksort
+keeps examined xfile pages mapped in the kernel for as long as possible to
+reduce map/unmap cycles.
+Surprisingly, this reduces overall sort runtime by nearly half again after
+accounting for the application of heapsort directly onto xfile pages.
+
+Blob Storage
+````````````
+
+Extended attributes and directories add an additional requirement for staging
+records: arbitrary byte sequences of finite length.
+Each directory entry record needs to store the entry name,
+and each extended attribute needs to store both the attribute name and value.
+The names, keys, and values can consume a large amount of memory, so the
+``xfblob`` abstraction was created to simplify management of these blobs
+atop an xfile.
+
+Blob arrays provide ``xfblob_load`` and ``xfblob_store`` functions to retrieve
+and persist objects.
+The store function returns a magic cookie for every object that it persists.
+Later, callers provide this cookie to ``xfblob_load`` to recall the object.
+The ``xfblob_free`` function frees a specific blob, and the ``xfblob_truncate``
+function frees them all because compaction is not needed.
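+
+Usage might look like the following sketch, which assumes that the cookie is
+an opaque 64-bit value as in the proposed patchset:
+
+.. code-block:: c
+
+	/* Stash a directory entry name for later retrieval. */
+	STATIC int
+	xrep_stash_name(struct xfblob *blobs, const char *name,
+			size_t namelen, xfblob_cookie *cookie)
+	{
+		return xfblob_store(blobs, cookie, name, namelen);
+	}
+
+	/* Recall a previously stashed name by its cookie. */
+	STATIC int
+	xrep_recall_name(struct xfblob *blobs, xfblob_cookie cookie,
+			char *buf, size_t namelen)
+	{
+		return xfblob_load(blobs, cookie, buf, namelen);
+	}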
+
+The details of repairing directories and extended attributes will be discussed
+in a subsequent section about atomic extent swapping.
+However, it should be noted that these repair functions only use blob storage
+to cache a small number of entries before adding them to a temporary ondisk
+file, which is why compaction is not required.
+
+The proposed patchset is at the start of the
+`extended attribute repair
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-xattrs>`_ series.
+
+.. _xfbtree:
+
+In-Memory B+Trees
+`````````````````
+
+The chapter about :ref:`secondary metadata<secondary_metadata>` mentioned that
+checking and repairing of secondary metadata commonly requires coordination
+between a live metadata scan of the filesystem and writer threads that are
+updating that metadata.
+Keeping the scan data up to date requires the ability to propagate
+metadata updates from the filesystem into the data being collected by the scan.
+This *can* be done by appending concurrent updates into a separate log file and
+applying them before writing the new metadata to disk, but this leads to
+unbounded memory consumption if the rest of the system is very busy.
+Another option is to skip the side-log and commit live updates from the
+filesystem directly into the scan data, which trades more overhead for a lower
+maximum memory requirement.
+In both cases, the data structure holding the scan results must support indexed
+access to perform well.
+
+Given that indexed lookups of scan data are required for both strategies, online
+fsck employs the second strategy of committing live updates directly into
+scan data.
+Because xfarrays are not indexed and do not enforce record ordering, they
+are not suitable for this task.
+Conveniently, however, XFS has a library to create and maintain ordered reverse
+mapping records: the existing rmap btree code!
+If only there was a means to create one in memory.
+
+Recall that the :ref:`xfile <xfile>` abstraction represents memory pages as a
+regular file, which means that the kernel can create byte or block addressable
+virtual address spaces at will.
+The XFS buffer cache specializes in abstracting IO to block-oriented address
+spaces, which means that adaptation of the buffer cache to interface with
+xfiles enables reuse of the entire btree library.
+Btrees built atop an xfile are collectively known as ``xfbtrees``.
+The next few sections describe how they actually work.
+
+The proposed patchset is the
+`in-memory btree
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=in-memory-btrees>`_
+series.
+
+Using xfiles as a Buffer Cache Target
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Two modifications are necessary to support xfiles as a buffer cache target.
+The first is to make it possible for the ``struct xfs_buftarg`` structure to
+host the ``struct xfs_buf`` rhashtable, because normally those are held by a
+per-AG structure.
+The second change is to modify the buffer ``ioapply`` function to "read" cached
+pages from the xfile and "write" cached pages back to the xfile.
+Concurrent access to individual buffers is controlled by the ``xfs_buf`` lock,
+since the xfile does not provide any locking on its own.
+With this adaptation in place, users of the xfile-backed buffer cache use
+exactly the same APIs as users of the disk-backed buffer cache.
+The separation between xfile and buffer cache implies higher memory usage since
+they do not share pages, but this property could some day enable transactional
+updates to an in-memory btree.
+Today, however, it simply eliminates the need for new code.
+
+Space Management with an xfbtree
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Space management for an xfile is very simple -- each btree block is one memory
+page in size.
+These blocks use the same header format as an on-disk btree, but the in-memory
+block verifiers ignore the checksums, assuming that xfile memory is no more
+corruption-prone than regular DRAM.
+Reusing existing code here is more important than absolute memory efficiency.
+
+The very first block of an xfile backing an xfbtree is a header block.
+The header describes the owner, height, and the block number of the root
+xfbtree block.
+
+To allocate a btree block, use ``xfile_seek_data`` to find a gap in the file.
+If there are no gaps, create one by extending the length of the xfile.
+Preallocate space for the block with ``xfile_prealloc``, and hand back the
+location.
+To free an xfbtree block, use ``xfile_discard`` (which internally uses
+``FALLOC_FL_PUNCH_HOLE``) to remove the memory page from the xfile.
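+
+A sketch of both paths, assuming the helper names above and a hypothetical
+gap-finding helper:
+
+.. code-block:: c
+
+	/* Allocate one btree block (one memory page) from the xfile. */
+	STATIC int
+	xfbtree_alloc_block(struct xfile *xf, loff_t *pos)
+	{
+		/*
+		 * Find a hole via xfile_seek_data, or fall back to the
+		 * current end of file.  (The search itself is elided;
+		 * this helper is hypothetical.)
+		 */
+		*pos = xfbtree_find_free_slot(xf);
+
+		return xfile_prealloc(xf, *pos, PAGE_SIZE);
+	}
+
+	/* Free a btree block by punching out its backing page. */
+	STATIC int
+	xfbtree_free_block(struct xfile *xf, loff_t pos)
+	{
+		return xfile_discard(xf, pos, PAGE_SIZE);
+	}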
+
+Populating an xfbtree
+^^^^^^^^^^^^^^^^^^^^^
+
+An online fsck function that wants to create an xfbtree should proceed as
+follows:
+
+1. Call ``xfile_create`` to create an xfile.
+
+2. Call ``xfs_alloc_memory_buftarg`` to create a buffer cache target structure
+   pointing to the xfile.
+
+3. Pass the buffer cache target, buffer ops, and other information to
+   ``xfbtree_create`` to write an initial tree header and root block to the
+   xfile.
+   Each btree type should define a wrapper that passes necessary arguments to
+   the creation function.
+   For example, rmap btrees define ``xfs_rmapbt_mem_create`` to take care of
+   all the necessary details for callers.
+   A ``struct xfbtree`` object will be returned.
+
+4. Pass the xfbtree object to the btree cursor creation function for the
+   btree type.
+   Following the example above, ``xfs_rmapbt_mem_cursor`` takes care of this
+   for callers.
+
+5. Pass the btree cursor to the regular btree functions to make queries against
+   and to update the in-memory btree.
+   For example, a btree cursor for an rmap xfbtree can be passed to the
+   ``xfs_rmap_*`` functions just like any other btree cursor.
+   See the :ref:`next section<xfbtree_commit>` for information on dealing with
+   xfbtree updates that are logged to a transaction.
+
+6. When finished, delete the btree cursor, destroy the xfbtree object, free
+   the buffer target, and destroy the xfile to release all resources, as
+   shown in the sketch below.
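+
+Condensed into code, that sequence looks roughly like the following.
+Error handling and the teardown helpers are abbreviated, and the exact
+signatures are assumptions about the proposed patchset:
+
+.. code-block:: c
+
+	STATIC int
+	xrep_rmap_xfbtree_example(struct xfs_mount *mp)
+	{
+		struct xfile		*xf;
+		struct xfs_buftarg	*btp;
+		struct xfbtree		*xfbt;
+		struct xfs_btree_cur	*cur;
+		int			error;
+
+		error = xfile_create(mp, "rmap staging", &xf);	/* step 1 */
+		if (error)
+			return error;
+
+		error = xfs_alloc_memory_buftarg(mp, xf, &btp);	/* step 2 */
+		if (error)
+			return error;
+
+		xfbt = xfs_rmapbt_mem_create(mp, btp);		/* step 3 */
+		cur = xfs_rmapbt_mem_cursor(mp, NULL, xfbt);	/* step 4 */
+
+		/* step 5: call xfs_rmap_* functions against cur here */
+
+		xfs_btree_del_cursor(cur, error);		/* step 6... */
+		/* ...then destroy xfbt, free btp, and destroy xf. */
+		return error;
+	}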
+
+.. _xfbtree_commit:
+
+Committing Logged xfbtree Buffers
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Although it is a clever hack to reuse the rmap btree code to handle the staging
+structure, the ephemeral nature of the in-memory btree block storage presents
+some challenges of its own.
+The XFS transaction manager must not commit buffer log items for buffers backed
+by an xfile because the log format does not understand updates for devices
+other than the data device.
+An ephemeral xfbtree probably will not exist by the time the AIL checkpoints
+log transactions back into the filesystem, and certainly won't exist during
+log recovery.
+For these reasons, any code updating an xfbtree in transaction context must
+remove the buffer log items from the transaction and write the updates into the
+backing xfile before committing or cancelling the transaction.
+
+The ``xfbtree_trans_commit`` and ``xfbtree_trans_cancel`` functions implement
+this functionality as follows:
+
+1. Find each buffer log item whose buffer targets the xfile.
+
+2. Record the dirty/ordered status of the log item.
+
+3. Detach the log item from the buffer.
+
+4. Queue the buffer to a special delwri list.
+
+5. Clear the transaction dirty flag if the only dirty log items were the ones
+   that were detached in step 3.
+
+6. Submit the delwri list to commit the changes to the xfile, if the updates
+   are being committed.
+
+After removing xfile logged buffers from the transaction in this manner, the
+transaction can be committed or cancelled.
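+
+A sketch of the commit side follows.
+The iteration is simplified: every log item in the transaction is assumed to
+be a buffer log item, the dirty-flag bookkeeping of step 5 is omitted, and
+``xfs_trans_bdetach`` is a hypothetical helper covering steps 2 and 3:
+
+.. code-block:: c
+
+	STATIC int
+	xfbtree_trans_commit_bufs(struct xfs_buftarg *xfile_btp,
+			struct xfs_trans *tp)
+	{
+		struct xfs_log_item	*lip, *n;
+		LIST_HEAD(buffer_list);
+
+		list_for_each_entry_safe(lip, n, &tp->t_items, li_trans) {
+			struct xfs_buf_log_item	*bli;
+			struct xfs_buf		*bp;
+
+			bli = container_of(lip, struct xfs_buf_log_item,
+					bli_item);
+			bp = bli->bli_buf;
+
+			/* step 1: skip buffers not backed by the xfile */
+			if (bp->b_target != xfile_btp)
+				continue;
+
+			/* steps 2-3: record item state and detach it */
+			xfs_trans_bdetach(tp, bp);
+
+			/* step 4: queue the buffer for writeback */
+			xfs_buf_delwri_queue(bp, &buffer_list);
+		}
+
+		/* step 6: write the queued buffers into the xfile */
+		return xfs_buf_delwri_submit(&buffer_list);
+	}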