From patchwork Thu Oct 26 21:40:58 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Matthew Sakai X-Patchwork-Id: 13437927 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id F05C2C25B72 for ; Thu, 26 Oct 2023 21:42:31 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S232168AbjJZVmb (ORCPT ); Thu, 26 Oct 2023 17:42:31 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:56146 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229785AbjJZVmb (ORCPT ); Thu, 26 Oct 2023 17:42:31 -0400 Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.129.124]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 4837CDC for ; Thu, 26 Oct 2023 14:41:40 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1698356499; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=Q32bIJxOzxM4+A0HUaZyMgM37wJzKP1N2rWA87/QPvM=; b=cQWvrzx9P7tDu1uqCCtEl4C+AryZhaMz7B7GMmwbngpQhZTvKlZoge4LUyJ1rLGqanlLv9 JfvYQ/4/nUsSZsXa2TioA/JizQs5rb/62+OONc+3Yn1wa97BBHFywq1Vi6MD2ZvSVmyK2D jd27P5Fv8wTvXsenYewgelgJmqScxnU= Received: from mimecast-mx02.redhat.com (mx-ext.redhat.com [66.187.233.73]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.3, cipher=TLS_AES_256_GCM_SHA384) id us-mta-637-CwiOfJIOPYer5CnO7Q4S_g-1; Thu, 26 Oct 2023 17:41:37 -0400 X-MC-Unique: CwiOfJIOPYer5CnO7Q4S_g-1 Received: from smtp.corp.redhat.com (int-mx07.intmail.prod.int.rdu2.redhat.com [10.11.54.7]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (2048 bits) server-digest SHA256) (No client certificate requested) by mimecast-mx02.redhat.com (Postfix) with ESMTPS id 5FE0E3C0275E; Thu, 26 Oct 2023 21:41:37 +0000 (UTC) Received: from pbitcolo-build-10.permabit.com (pbitcolo-build-10.permabit.lab.eng.bos.redhat.com [10.19.117.76]) by smtp.corp.redhat.com (Postfix) with ESMTP id 455F01C060AE; Thu, 26 Oct 2023 21:41:37 +0000 (UTC) Received: by pbitcolo-build-10.permabit.com (Postfix, from userid 1138) id BB78B3003B; Thu, 26 Oct 2023 17:41:36 -0400 (EDT) From: Matthew Sakai To: dm-devel@lists.linux.dev, linux-block@vger.kernel.org Cc: Matthew Sakai Subject: [PATCH v4 01/39] dm: add documentation for dm-vdo target Date: Thu, 26 Oct 2023 17:40:58 -0400 Message-Id: <20231026214136.1067410-2-msakai@redhat.com> In-Reply-To: <20231026214136.1067410-1-msakai@redhat.com> References: <20231026214136.1067410-1-msakai@redhat.com> MIME-Version: 1.0 X-Scanned-By: MIMEDefang 3.4.1 on 10.11.54.7 Precedence: bulk List-ID: X-Mailing-List: linux-block@vger.kernel.org This adds the admin-guide documentation for dm-vdo. vdo.rst is the guide to using dm-vdo. vdo-design is an overview of the design of dm-vdo. Co-developed-by: J. corwin Coburn Signed-off-by: J. 
corwin Coburn Signed-off-by: Matthew Sakai --- .../admin-guide/device-mapper/vdo-design.rst | 415 ++++++++++++++++++ .../admin-guide/device-mapper/vdo.rst | 388 ++++++++++++++++ 2 files changed, 803 insertions(+) create mode 100644 Documentation/admin-guide/device-mapper/vdo-design.rst create mode 100644 Documentation/admin-guide/device-mapper/vdo.rst diff --git a/Documentation/admin-guide/device-mapper/vdo-design.rst b/Documentation/admin-guide/device-mapper/vdo-design.rst new file mode 100644 index 000000000000..c82d51071c7d --- /dev/null +++ b/Documentation/admin-guide/device-mapper/vdo-design.rst @@ -0,0 +1,415 @@ +.. SPDX-License-Identifier: GPL-2.0-only + +================ +Design of dm-vdo +================ + +The dm-vdo (virtual data optimizer) target provides inline deduplication, +compression, zero-block elimination, and thin provisioning. A dm-vdo target +can be backed by up to 256TB of storage, and can present a logical size of +up to 4PB. This target was originally developed at Permabit Technology +Corp. starting in 2009. It was first released in 2013 and has been used in +production environments ever since. It was made open-source in 2017 after +Permabit was acquired by Red Hat. This document describes the design of +dm-vdo. For usage, see vdo.rst in the same directory as this file. + +Because deduplication rates fall drastically as the block size increases, a +vdo target has a maximum block size of 4K. However, it can achieve +deduplication rates of 254:1, i.e. up to 254 copies of a given 4K block can +reference a single 4K of actual storage. It can achieve compression rates +of 14:1. All zero blocks consume no storage at all. + +Theory of Operation +=================== + +The design of dm-vdo is based on the idea that deduplication is a two-part +problem. The first is to recognize duplicate data. The second is to avoid +storing multiple copies of those duplicates. Therefore, dm-vdo has two main +parts: a deduplication index (called UDS) that is used to discover +duplicate data, and a data store with a reference counted block map that +maps from logical block addresses to the actual storage location of the +data. + +Zones and Threading +------------------- + +Due to the complexity of data optimization, the number of metadata +structures involved in a single write operation to a vdo target is larger +than most other targets. Furthermore, because vdo must operate on small +block sizes in order to achieve good deduplication rates, acceptable +performance can only be achieved through parallelism. Therefore, vdo's +design attempts to be lock-free. Most of a vdo's main data structures are +designed to be easily divided into "zones" such that any given bio must +only access a single zone of any zoned structure. Safety with minimal +locking is achieved by ensuring that during normal operation, each zone is +assigned to a specific thread, and only that thread will access the portion +of that data structure in that zone. Associated with each thread is a work +queue. Each bio is associated with a request object which can be added to a +work queue when the next phase of its operation requires access to the +structures in the zone associated with that queue. Although each structure +may be divided into zones, this division is not reflected in the on-disk +representation of each data structure. Therefore, the number of zones for +each structure, and hence the number of threads, is configured each time a +vdo target is started. 
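To make the zone idea concrete, here is a minimal sketch of zone selection
and dispatch. This is an illustration only, not dm-vdo's actual internal
API: the structure names, the fixed zone count, and the simple modulo rule
for choosing a zone are all assumptions made for the example::

    /*
     * Illustrative only: a toy model of zoned threading. Each zone is
     * owned by exactly one thread, so work routed to that zone's queue
     * can touch the zone's portion of a structure without locking.
     */
    #include <stdint.h>
    #include <stdio.h>

    #define LOGICAL_ZONES 4u  /* in vdo, chosen each time the target starts */

    struct request {
            uint64_t logical_block;  /* 4K logical block address */
    };

    /* Pick the zone whose owning thread may touch this logical block. */
    static unsigned int select_logical_zone(const struct request *req)
    {
            return (unsigned int)(req->logical_block % LOGICAL_ZONES);
    }

    int main(void)
    {
            struct request req = { .logical_block = 812 };

            /*
             * In vdo, the request object would now be added to the work
             * queue of the thread assigned to this zone, and that thread
             * would carry out the next phase of the operation.
             */
            printf("block %llu -> logical zone %u\n",
                   (unsigned long long)req.logical_block,
                   select_logical_zone(&req));
            return 0;
    }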
+ +The Deduplication Index +----------------------- + +In order to identify duplicate data efficiently, vdo was designed to +leverage some common characteristics of duplicate data. From empirical +observations, we gathered two key insights. The first is that in most data +sets with significant amounts of duplicate data, the duplicates tend to +have temporal locality. When a duplicate appears, it is more likely that +other duplicates will be detected, and that those duplicates will have been +written at about the same time. This is why the index keeps records in +temporal order. The second insight is that new data is more likely to +duplicate recent data than it is to duplicate older data and in general, +there are diminishing returns to looking further back in time. Therefore, +when the index is full, it should cull its oldest records to make space for +new ones. Another important idea behind the design of the index is that the +ultimate goal of deduplication is to reduce storage costs. Since there is a +trade-off between the storage saved and the resources expended to achieve +those savings, vdo does not attempt to find every last duplicate block. It +is sufficient to find and eliminate most of the redundancy. + +Each block of data is hashed to produce a 16-byte block name. An index +record consists of this block name paired with the presumed location of +that data on the underlying storage. However, it is not possible to +guarantee that the index is accurate. Most often, this occurs because it is +too costly to update the index when a block is over-written or discarded. +Doing so would require either storing the block name along with the blocks, +which is difficult to do efficiently in block-based storage, or reading and +rehashing each block before overwriting it. Inaccuracy can also result from +a hash collision where two different blocks have the same name. In +practice, this is extremely unlikely, but because vdo does not use a +cryptographic hash, a malicious workload can be constructed. Because of +these inaccuracies, vdo treats the locations in the index as hints, and +reads each indicated block to verify that it is indeed a duplicate before +sharing the existing block with a new one. + +Records are collected into groups called chapters. New records are added to +the newest chapter, called the open chapter. This chapter is stored in a +format optimized for adding and modifying records, and the content of the +open chapter is not finalized until it runs out of space for new records. +When the open chapter fills up, it is closed and a new open chapter is +created to collect new records. + +Closing a chapter converts it to a different format which is optimized for +writing. The records are written to a series of record pages based on the +order in which they were received. This means that records with temporal +locality should be on a small number of pages, reducing the I/O required to +retrieve them. The chapter also compiles an index that indicates which +record page contains any given name. This index means that a request for a +name can determine exactly which record page may contain that record, +without having to load the entire chapter from storage. This index uses +only a subset of the block name as its key, so it cannot guarantee that an +index entry refers to the desired block name. It can only guarantee that if +there is a record for this name, it will be on the indicated page. 
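As an illustration of why a chapter index hit is only a hint, here is a
minimal sketch of looking up a block name on a record page. The structure
layout, field sizes, and record count here are hypothetical and are not the
actual UDS on-disk format; the point is that the short chapter-index key
only narrows the search to one page, and the full 16-byte name must still
be compared there::

    /* Illustrative sketch only; not the real UDS chapter structures. */
    #include <stdbool.h>
    #include <stdint.h>
    #include <string.h>

    #define BLOCK_NAME_SIZE 16
    #define RECORDS_PER_PAGE 256  /* hypothetical page capacity */

    struct record {
            uint8_t name[BLOCK_NAME_SIZE];  /* the 16-byte block name */
            uint64_t physical_block;        /* presumed data location (a hint) */
    };

    struct record_page {
            struct record records[RECORDS_PER_PAGE];
            unsigned int count;
    };

    /* The chapter index is keyed by only a few bytes of the name. */
    static uint32_t chapter_index_key(const uint8_t name[BLOCK_NAME_SIZE])
    {
            uint32_t key;

            memcpy(&key, name, sizeof(key));
            return key;
    }

    /*
     * Search the one page the chapter index indicated. A miss here is
     * definitive; a hit still only yields a location hint, which the
     * write path verifies by reading and comparing the actual data.
     */
    static bool find_record_on_page(const struct record_page *page,
                                    const uint8_t name[BLOCK_NAME_SIZE],
                                    uint64_t *physical_block)
    {
            for (unsigned int i = 0; i < page->count; i++) {
                    if (memcmp(page->records[i].name, name,
                               BLOCK_NAME_SIZE) == 0) {
                            *physical_block = page->records[i].physical_block;
                            return true;
                    }
            }
            return false;
    }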
The contents of a closed chapter are never altered in any way; these
chapters are read-only structures.

Once enough records have been written to fill up all the available index
space, the oldest chapter is removed to make space for new chapters. Any
time a request finds a matching record in the index, that record is copied
to the open chapter. This ensures that useful block names remain available
in the index, while unreferenced block names are forgotten.

In order to find records in older chapters, the index also maintains a
higher level structure called the volume index, which contains entries
mapping each block name to the chapter containing its newest record. This
mapping is updated as records for the block name are copied or updated,
ensuring that only the newest record for a given block name is findable.
Older records for a block name can no longer be found even though they have
not been deleted. Like the chapter index, the volume index uses only a
subset of the block name as its key and cannot definitively say that a
record exists for a name. It can only say which chapter would contain the
record if a record exists. The volume index is stored entirely in memory
and is saved to storage only when the vdo target is shut down.

From the viewpoint of a request for a particular block name: first, the
request looks up the name in the volume index, which indicates either that
the record is new or which chapter to search. If the latter, the request
looks up its name in the chapter index to determine if the record is new,
or which record page to search. Finally, if not new, the request looks for
its record on the indicated record page. This process may require up to two
page reads per request (one for the chapter index page and one for the
record page). However, recently accessed pages are cached so that these
page reads can be amortized across many block name requests.

The volume index and the chapter indexes are implemented using a
memory-efficient structure called a delta index. Instead of storing the
entire key (the block name) for each entry, the entries are sorted by name
and only the difference between adjacent keys (the delta) is stored.
Because we expect the hashes to be evenly distributed, the size of the
deltas follows an exponential distribution. Because of this distribution,
the deltas are expressed in a Huffman code to take up even less space. The
entire sorted list of keys is called a delta list. This structure allows
the index to use many fewer bytes per entry than a traditional hash table,
but it is slightly more expensive to look up entries, because a request
must read every entry in a delta list to add up the deltas in order to find
the record it needs. The delta index reduces this lookup cost by splitting
its key space into many sub-lists, each starting at a fixed key value, so
that each individual list is short.

The default index size can hold 64 million records, corresponding to about
256GB of data. This means that the index can identify duplicate data if the
original data was written within the last 256GB of writes. This range is
called the deduplication window. If new writes duplicate data that is older
than that, the index will not be able to find it because the records of the
older data have been removed. So when writing a 200 GB file to a vdo
target, and then immediately writing it again, the two copies will
deduplicate perfectly.
Doing the same with a 500 GB file will result in no +deduplication, because the beginning of the file will no longer be in the +index by the time the second write begins (assuming there is no duplication +within the file itself). + +If you anticipate a data workload that will see useful deduplication beyond +the 256GB threshold, vdo can be configured to use a larger index with a +correspondingly larger deduplication window. (This configuration can only +be set when the target is created, not altered later. It is important to +consider the expected workload for a vdo target before configuring it.) +There are two ways to do this. + +One way is to increase the memory size of the index, which also increases +the amount of backing storage required. Doubling the size of the index will +double the length of the deduplication window at the expense of doubling +the storage size and the memory requirements. + +The other way is to enable sparse indexing. Sparse indexing increases the +deduplication window by a factor of 10, at the expense of also increasing +the storage size by a factor of 10. However with sparse indexing, the +memory requirements do not increase; the trade-off is slightly more +computation per request, and a slight decrease in the amount of +deduplication detected. (For workloads with significant amounts of +duplicate data, sparse indexing will detect 97-99% of the deduplication +that a standard, or "dense", index will detect.) + +The Data Store +-------------- + +The data store is implemented by three main data structures, all of which +work in concert to reduce or amortize metadata updates across as many data +writes as possible. + +*The Slab Depot* + +Most of the vdo volume belongs to the slab depot. The depot contains a +collection of slabs. The slabs can be up to 32GB, and are divided into +three sections. Most of a slab consists of a linear sequence of 4K blocks. +These blocks are used either to store data, or to hold portions of the +block map (see below). In addition to the data blocks, each slab has a set +of reference counters, using 1 byte for each data block. Finally each slab +has a journal. Reference updates are written to the slab journal, which is +written out one block at a time as each block fills. A copy of the +reference counters are kept in memory, and are written out a block at a +time, in oldest-dirtied-order whenever there is a need to reclaim slab +journal space. The journal is used both to ensure that the main recovery +journal (see below) can regularly free up space, and also to amortize the +cost of updating individual reference blocks. + +Each slab is independent of every other. They are assigned to "physical +zones" in round-robin fashion. If there are P physical zones, then slab n +is assigned to zone n mod P. + +The slab depot maintains an additional small data structure, the "slab +summary," which is used to reduce the amount of work needed to come back +online after a crash. The slab summary maintains an entry for each slab +indicating whether or not the slab has ever been used, whether it is clean +(i.e. all of its reference count updates have been persisted to storage), +and approximately how full it is. During recovery, each physical zone will +attempt to recover at least one slab, stopping whenever it has recovered a +slab which has some free blocks. Once each zone has some space (or has +determined that none is available), the target can resume normal operation +in a degraded mode. 
Read and write requests can be serviced, perhaps with +degraded performance, while the remainder of the dirty slabs are recovered. + +*The Block Map* + +The block map contains the logical to physical mapping. It can be thought +of as an array with one entry per logical address. Each entry is 5 bytes, +36 bits of which contain the physical block number which holds the data for +the given logical address. The other 4 bits are used to indicate the nature +of the mapping. Of the 16 possible states, one represents a logical address +which is unmapped (i.e. it has never been written, or has been discarded), +one represents an uncompressed block, and the other 14 states are used to +indicate that the mapped data is compressed, and which of the compression +slots in the compressed block this logical address maps to (see below). + +In practice, the array of mapping entries is divided into "block map +pages," each of which fits in a single 4K block. Each block map page +consists of a header, and 812 mapping entries (812 being the number that +fit). Each mapping page is actually a leaf of a radix tree which consists +of block map pages at each level. There are 60 radix trees which are +assigned to "logical zones" in round robin fashion (if there are L logical +zones, tree n will belong to zone n mod L). At each level, the trees are +interleaved, so logical addresses 0-811 belong to tree 0, logical addresses +812-1623 belong to tree 1, and so on. The interleaving is maintained all +the way up the forest. 60 was chosen as the number of trees because it is +highly composite and hence results in an evenly distributed number of trees +per zone for a large number of possible logical zone counts. The storage +for the 60 tree roots is allocated at format time. All other block map +pages are allocated out of the slabs as needed. This flexible allocation +avoids the need to pre-allocate space for the entire set of logical +mappings and also makes growing the logical size of a vdo easy to +implement. + +In operation, the block map maintains two caches. It is prohibitive to keep +the entire leaf level of the trees in memory, so each logical zone +maintains its own cache of leaf pages. The size of this cache is +configurable at target start time. The second cache is allocated at start +time, and is large enough to hold all the non-leaf pages of the entire +block map. This cache is populated as needed. + +*The Recovery Journal* + +The recovery journal is used to amortize updates across the block map and +slab depot. Each write request causes an entry to be made in the journal. +Entries are either "data remappings" or "block map remappings." For a data +remapping, the journal records the logical address affected and its old and +new physical mappings. For a block map remapping, the journal records the +block map page number and the physical block allocated for it (block map +pages are never reclaimed, so the old mapping is always 0). Each journal +entry and the data write it represents must be stable on disk before the +other metadata structures may be updated to reflect the operation. + +*Write Path* + +A write bio is first assigned a "data_vio," the request object which will +operate on behalf of the bio. (A "vio," from Vdo I/O, is vdo's wrapper for +bios; metadata operations use a vio, whereas submitted bios require the +much larger data_vio.) There is a fixed pool of 2048 data_vios. 
This number was chosen both to bound the amount of work that is required to
recover from a crash, and because measurements indicate that increasing it
consumes more resources, but does not improve performance. These
measurements have been, and should continue to be, revisited over time.

Once a data_vio is assigned, the following steps are performed:

1. The bio's data is checked to see if it is all zeros, and copied if not.

2. A lock is obtained on the logical address of the bio. Because
   deduplication involves sharing blocks, it is vital to prevent
   simultaneous modifications of the same block.

3. The block map tree is traversed, loading any non-leaf pages which cover
   the logical address and are not already in memory. If any of these
   pages, or the leaf page which covers the logical address, have not been
   allocated, and the block is not all zeros, they are allocated at this
   time.

4. If the block is a zero block, skip to step 9. Otherwise, an attempt is
   made to allocate a free data block.

5. If an allocation was obtained, the bio is acknowledged.

6. The bio's data is hashed.

7. The data_vio obtains or joins a "hash lock," which represents all of
   the bios currently writing the same data.

8. If the hash lock does not already have a data_vio acting as its agent,
   the current one assumes that role. As the agent:

   a) The index is queried.

   b) If an entry is found, the indicated block is read and compared
      to the data being written.

   c) If the data matches, we have identified duplicate data. As many
      of the data_vios as there are references available for that
      block (including the agent) will share it. If there are more
      data_vios in the hash lock than there are references available,
      one of them becomes the new agent and continues as if no
      duplicate had been found.

   d) If no duplicate was found, and the agent in the hash lock does
      not have an allocation (from step 4), another data_vio in the
      hash lock will become the agent and write the data. If no
      data_vio in the hash lock has an allocation, the data_vios will
      be marked out of space and go to step 13 for cleanup.

      If there is an allocation, the data being written will be
      compressed. If the compressed size is sufficiently small, the
      data_vio will go to the packer where it may be placed in a bin
      along with other data_vios.

   e) Once a bin is full, either because it is out of space, or
      because all 14 of its slots are in use, it is written out.

   f) Each data_vio from the bin just written is the agent of some
      hash lock; it will now proceed to treat the just-written
      compressed block as if it were a duplicate and share it with as
      many other data_vios in its hash lock as possible.

   g) If the agent's data is not compressed, it will attempt to write
      its data to the block it has allocated.

   h) If the data was written, this new block is treated as a
      duplicate and shared as much as possible with any other
      data_vios in the hash lock.

   i) If the agent wrote new data (whether compressed or not), the
      index is updated to reflect the new entry.

9. The block map is queried to determine the previous mapping of the
   logical address.

10. An entry is made in the recovery journal. The data_vio will block in
    the journal until a flush has completed to ensure the data it may have
    written is stable. It must also wait until its journal entry is stable
    on disk. (Journal writes are all issued with the FUA bit set.)

11.
Once the recovery journal entry is stable, the data_vio makes two slab + journal entries: an increment entry for the new mapping, and a + decrement entry for the old mapping, if that mapping was non-zero. For + correctness during recovery, the slab journal entries in any given slab + journal must be in the same order as the corresponding recovery journal + entries. Therefore, if the two entries are in different zones, they are + made concurrently, and if they are in the same zone, the increment is + always made before the decrement in order to avoid underflow. After + each slab journal entry is made in memory, the associated reference + count is also updated in memory. Each of these updates will get written + out as needed. (Slab journal blocks are written out either when they + are full, or when the recovery journal requests they do so in order to + allow the recovery journal to free up space; reference count blocks are + written out whenever the associated slab journal requests they do so in + order to free up slab journal space.) + +12. Once all the reference count updates are done, the block map is updated + and the write is complete. + +13. If the data_vio did not use its allocation, it releases the allocated + block, the hash lock (if it has one), and its logical lock. The + data_vio then returns to the pool. + +*Read Path* + +Reads are much simpler than writes. After a data_vio is assigned to the +bio, and the logical lock is obtained, the block map is queried. If the +block is mapped, the appropriate physical block is read, and if necessary, +decompressed. + +*Recovery* + +When a vdo is restarted after a crash, it will attempt to recover from the +recovery journal. During the pre-resume phase of the next start, the +recovery journal is read. The increment portion of valid entries are played +into the block map. Next, valid entries are played, in order as required, +into the slab journals. Finally, each physical zone attempts to replay at +least one slab journal to reconstruct the reference counts of one slab. +Once each zone has some free space (or has determined that it has none), +the vdo comes back online, while the remainder of the slab journals are +used to reconstruct the rest of the reference counts. + +*Read-only Rebuild* + +If a vdo encounters an unrecoverable error, it will enter read-only mode. +This mode indicates that some previously acknowledged data may have been +lost. The vdo may be instructed to rebuild as best it can in order to +return to a writable state. However, this is never done automatically due +to the likelihood that data has been lost. During a read-only rebuild, the +block map is recovered from the recovery journal as before. However, the +reference counts are not rebuilt from the slab journals. Rather, the +reference counts are zeroed, and then the entire block map is traversed, +and the reference counts are updated from it. While this may lose some +data, it ensures that the block map and reference counts are consistent. diff --git a/Documentation/admin-guide/device-mapper/vdo.rst b/Documentation/admin-guide/device-mapper/vdo.rst new file mode 100644 index 000000000000..a199009979d9 --- /dev/null +++ b/Documentation/admin-guide/device-mapper/vdo.rst @@ -0,0 +1,388 @@ +.. SPDX-License-Identifier: GPL-2.0-only + +dm-vdo +====== + +The dm-vdo (virtual data optimizer) device mapper target provides +block-level deduplication, compression, and thin provisioning. 
As a device mapper target, it can add these features to the storage stack,
compatible with any file system. The vdo target does not protect against
data corruption, relying instead on integrity protection of the storage
below it. It is strongly recommended that lvm be used to manage vdo
volumes. See lvmvdo(7).

Userspace component
===================

Formatting a vdo volume requires the use of the 'vdoformat' tool, available
at:

https://github.com/dm-vdo/vdo/

In most cases, a vdo target will recover from a crash automatically the
next time it is started. In cases where it encountered an unrecoverable
error (either during normal operation or crash recovery) the target will
enter or come up in read-only mode. Because read-only mode is indicative of
data loss, a positive action must be taken to bring vdo out of read-only
mode. The 'vdoforcerebuild' tool, available from the same repo, is used to
prepare a read-only vdo to exit read-only mode. After running this tool,
the vdo target will rebuild its metadata the next time it is started.
Although some data may be lost, the rebuilt vdo's metadata will be
internally consistent and the target will be writable again.

The repo also contains additional userspace tools which can be used to
inspect a vdo target's on-disk metadata. Fortunately, these tools are
rarely needed except by dm-vdo developers.

Target interface
================

Table line
----------

::

    <offset> <logical device size> vdo V4 <storage device>
    <storage device size> <minimum I/O size> <block map cache size>
    <block map era length> [optional arguments]

Required parameters:

    offset:
        The offset, in sectors, at which the vdo volume's logical
        space begins.

    logical device size:
        The size of the device which the vdo volume will service,
        in sectors. Must match the current logical size of the vdo
        volume.

    storage device:
        The device holding the vdo volume's data and metadata.

    storage device size:
        The size of the device holding the vdo volume, as a number
        of 4096-byte blocks. Must match the current size of the vdo
        volume.

    minimum I/O size:
        The minimum I/O size for this vdo volume to accept, in
        bytes. Valid values are 512 or 4096. The recommended value
        is 4096.

    block map cache size:
        The size of the block map cache, as a number of 4096-byte
        blocks. The minimum and recommended value is 32768 blocks.
        If the logical thread count is non-zero, the cache size
        must be at least 4096 blocks per logical thread.

    block map era length:
        The speed with which the block map cache writes out
        modified block map pages. A smaller era length is likely to
        reduce the amount of time spent rebuilding, at the cost of
        increased block map writes during normal operation. The
        maximum and recommended value is 16380; the minimum value
        is 1.

Optional parameters:
--------------------
Some or all of these parameters may be specified as <key> <value> pairs.

Thread related parameters:

Different categories of work are assigned to separate thread groups, and
the number of threads in each group can be configured separately.

If <hash>, <logical>, and <physical> are all set to 0, the work handled by
all three thread types will be handled by a single thread. If any of these
values are non-zero, all of them must be non-zero.

    ack:
        The number of threads used to complete bios. Since
        completing a bio calls an arbitrary completion function
        outside the vdo volume, threads of this type allow the vdo
        volume to continue processing requests even when bio
        completion is slow. The default is 1.

    bio:
        The number of threads used to issue bios to the underlying
        storage.
        Threads of this type allow the vdo volume to continue
        processing requests even when bio submission is slow. The
        default is 4.

    bioRotationInterval:
        The number of bios to enqueue on each bio thread before
        switching to the next thread. The value must be greater
        than 0 and not more than 1024; the default is 64.

    cpu:
        The number of threads used to do CPU-intensive work, such
        as hashing and compression. The default is 1.

    hash:
        The number of threads used to manage data comparisons for
        deduplication based on the hash value of data blocks. The
        default is 0.

    logical:
        The number of threads used to manage caching and locking
        based on the logical address of incoming bios. The default
        is 0; the maximum is 60.

    physical:
        The number of threads used to manage administration of the
        underlying storage device. At format time, a slab size for
        the vdo is chosen; the vdo storage device must be large
        enough to have at least 1 slab per physical thread. The
        default is 0; the maximum is 16.

Miscellaneous parameters:

    maxDiscard:
        The maximum size of discard bios accepted, in 4096-byte
        blocks. I/O requests to a vdo volume are normally split
        into 4096-byte blocks, and processed up to 2048 at a time.
        However, discard requests to a vdo volume can be
        automatically split to a larger size, up to <maxDiscard>
        4096-byte blocks in a single bio, and are limited to 1500
        at a time. Increasing this value may provide better overall
        performance, at the cost of increased latency for the
        individual discard requests. The default and minimum is 1;
        the maximum is UINT_MAX / 4096.

    deduplication:
        Whether deduplication is enabled. The default is 'on'; the
        acceptable values are 'on' and 'off'.

    compression:
        Whether compression is enabled. The default is 'off'; the
        acceptable values are 'on' and 'off'.

Device modification
-------------------

A modified table may be loaded into a running, non-suspended vdo volume.
The modifications will take effect when the device is next resumed. The
modifiable parameters are <logical device size>, <physical device size>,
<maxDiscard>, <compression>, and <deduplication>.

If the logical device size or physical device size is changed, upon
successful resume vdo will store the new values and require them on future
startups. These two parameters may not be decreased. The logical device
size may not exceed 4 PB. The physical device size must increase by at
least 32832 4096-byte blocks if at all, and must not exceed the size of the
underlying storage device. Additionally, when formatting the vdo device, a
slab size is chosen: the physical device size may never increase above the
size which provides 8192 slabs, and each increase must be large enough to
add at least one new slab.

Examples:

Start a previously-formatted vdo volume with 1 GB logical space and 1 GB
physical space, storing to /dev/dm-1 which has more than 1 GB of space.

::

    dmsetup create vdo0 --table \
    "0 2097152 vdo V4 /dev/dm-1 262144 4096 32768 16380"

Grow the logical size to 4 GB.

::

    dmsetup reload vdo0 --table \
    "0 8388608 vdo V4 /dev/dm-1 262144 4096 32768 16380"
    dmsetup resume vdo0

Grow the physical size to 2 GB.

::

    dmsetup reload vdo0 --table \
    "0 8388608 vdo V4 /dev/dm-1 524288 4096 32768 16380"
    dmsetup resume vdo0

Grow the physical size by 1 GB more and increase the maximum discard size.

::

    dmsetup reload vdo0 --table \
    "0 10485760 vdo V4 /dev/dm-1 786432 4096 32768 16380 maxDiscard 8"
    dmsetup resume vdo0

Stop the vdo volume.
::

    dmsetup remove vdo0

Start the vdo volume again. Note that the logical and physical device sizes
must still match, but other parameters can change.

::

    dmsetup create vdo1 --table \
    "0 10485760 vdo V4 /dev/dm-1 786432 512 65550 5000 hash 1 logical 3 physical 2"

Messages
--------
All vdo devices accept messages in the form:

::

    dmsetup message <target-name> 0 <message>

The messages are:

    stats:
        Outputs the current view of the vdo statistics. Mostly used
        by the vdostats userspace program to interpret the output
        buffer.

    dump:
        Dumps many internal structures to the system log. This is
        not always safe to run, so it should only be used to debug
        a hung vdo. Optional parameters to specify structures to
        dump are:

            viopool: The pool of I/O requests for incoming bios
            pools: A synonym of 'viopool'
            vdo: Most of the structures managing on-disk data
            queues: Basic information about each vdo thread
            threads: A synonym of 'queues'
            default: Equivalent to 'queues vdo'
            all: All of the above.

    dump-on-shutdown:
        Perform a default dump next time vdo shuts down.


Status
------

::

    <device> <operating mode> <in recovery> <index state>
    <compression state> <used physical blocks> <total physical blocks>

    device:
        The name of the vdo volume.

    operating mode:
        The current operating mode of the vdo volume; values may be
        'normal', 'recovering' (the volume has detected an issue
        with its metadata and is attempting to repair itself), and
        'read-only' (an error has occurred that forces the vdo
        volume to only support read operations and not writes).

    in recovery:
        Whether the vdo volume is currently in recovery mode;
        values may be 'recovering' or '-', which indicates it is
        not recovering.

    index state:
        The current state of the deduplication index in the vdo
        volume; values may be 'closed', 'closing', 'error',
        'offline', 'online', 'opening', and 'unknown'.

    compression state:
        The current state of compression in the vdo volume; values
        may be 'offline' and 'online'.

    used physical blocks:
        The number of physical blocks in use by the vdo volume.

    total physical blocks:
        The total number of physical blocks the vdo volume may use;
        the difference between this value and the
        <used physical blocks> is the number of blocks the vdo
        volume has left before being full.

Memory Requirements
===================

A vdo target requires a fixed 38 MB of RAM along with the following amounts
that scale with the target:

- 1.15 MB of RAM for each 1 MB of configured block map cache size. The
  block map cache requires a minimum of 150 MB.
- 1.6 MB of RAM for each 1 TB of logical space.
- 268 MB of RAM for each 1 TB of physical storage managed by the volume.

The deduplication index requires additional memory which scales with the
size of the deduplication window. For dense indexes, the index requires 1
GB of RAM per 1 TB of window. For sparse indexes, the index requires 1 GB
of RAM per 10 TB of window. The index configuration is set when the target
is formatted and may not be modified.

Run-time Usage
==============

When using dm-vdo, it is important to be aware of the ways in which its
behavior differs from other storage targets.

- There is no guarantee that over-writes of existing blocks will succeed.
  Because the underlying storage may be multiply referenced, over-writing
  an existing block generally requires a vdo to have a free block
  available.

- When blocks are no longer in use, sending a discard request for those
  blocks lets the vdo release references for those blocks.
  If the vdo is thinly provisioned, discarding unused blocks is essential
  to prevent the target from running out of space. However, due to the
  sharing of duplicate blocks, no discard request for any given logical
  block is guaranteed to reclaim space.

- Assuming the underlying storage properly implements flush requests, vdo
  is resilient against crashes. However, unflushed writes may or may not
  persist after a crash.

- Each write to a vdo target entails a significant amount of processing.
  However, much of the work is parallelizable. Therefore, vdo targets
  achieve better throughput at higher I/O depths, and can support up to
  2048 requests in parallel.

Tuning
======

The vdo device has many options, and it can be difficult to make optimal
choices without perfect knowledge of the workload. Additionally, most
configuration options must be set when a vdo target is started and cannot
be changed while the target is active; changing them requires shutting the
target down completely. Ideally, tuning with simulated workloads should be
performed before deploying vdo in production environments.

The most important value to adjust is the block map cache size. In order to
service a request for any logical address, a vdo must load the portion of
the block map which holds the relevant mapping. These mappings are cached.
Performance will suffer when the working set does not fit in the cache. By
default, a vdo allocates 128 MB of metadata cache in RAM to support
efficient access to 100 GB of logical space at a time. The cache size
should be scaled up proportionally for larger working sets.

The logical and physical thread counts should also be adjusted. A logical
thread controls a disjoint section of the block map, so additional logical
threads increase parallelism and can increase throughput. Physical threads
control a disjoint section of the data blocks, so additional physical
threads can also increase throughput. However, excess threads can waste
resources and increase contention.

Bio submission threads control the parallelism involved in sending I/O to
the underlying storage; fewer threads mean there is more opportunity to
reorder I/O requests for performance benefit, but also that each I/O
request has to wait longer before being submitted.

Bio acknowledgment threads are used for finishing I/O requests. This is
done on dedicated threads since the amount of work required to execute a
bio's callback cannot be controlled by the vdo itself. Usually one thread
is sufficient, but additional threads may be beneficial, particularly when
bios have CPU-heavy callbacks.

CPU threads are used for hashing and for compression; in workloads with
compression enabled, more threads may result in higher throughput.

Hash threads are used to sort active requests by hash and determine whether
they should deduplicate; the most CPU-intensive action done by these
threads is comparing 4096-byte data blocks. In most cases, a single hash
thread is sufficient.
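As a worked example of the cache-sizing guidance above, the following
illustrative helper (not part of any vdo tool; the function and constant
names are invented for this example) scales the default 32768-block
(128 MB) cache, which covers roughly 100 GB of working set, to a larger
working set. A 500 GB working set works out to about 163840 cache blocks,
or 640 MB::

    /* Illustrative only: scale the block map cache with the working set. */
    #include <stdint.h>
    #include <stdio.h>

    #define BASELINE_CACHE_BLOCKS   32768ULL  /* 128 MB of 4096-byte blocks */
    #define BASELINE_WORKING_SET_GB 100ULL    /* covered by the baseline cache */
    #define MIN_CACHE_BLOCKS        32768ULL  /* documented minimum */

    static uint64_t block_map_cache_blocks(uint64_t working_set_gb)
    {
            /* Scale proportionally, rounding up to a whole block. */
            uint64_t blocks = (working_set_gb * BASELINE_CACHE_BLOCKS +
                               BASELINE_WORKING_SET_GB - 1) /
                              BASELINE_WORKING_SET_GB;

            return blocks < MIN_CACHE_BLOCKS ? MIN_CACHE_BLOCKS : blocks;
    }

    int main(void)
    {
            /* 500 GB working set -> 163840 blocks (640 MB of cache). */
            printf("500 GB working set -> %llu cache blocks\n",
                   (unsigned long long)block_map_cache_blocks(500));
            return 0;
    }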
From patchwork Thu Oct 26 21:40:59 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Matthew Sakai X-Patchwork-Id: 13437925 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id D3BA5C25B48 for ; Thu, 26 Oct 2023 21:42:30 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S229621AbjJZVmb (ORCPT ); Thu, 26 Oct 2023 17:42:31 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:56136 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229785AbjJZVma (ORCPT ); Thu, 26 Oct 2023 17:42:30 -0400 Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.129.124]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id DB6911AC for ; Thu, 26 Oct 2023 14:41:41 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1698356501; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=ca6xIkPR5eSAWr1Yll5Pz3QDzCIGAfoTyVKQb+o03tk=; b=D69kDdICYaK4eQWDkw1VRV3xPK70fc1dBCKvJ9swoB3DnlbD9H5temRjBsHB66BWpWcix+ DcwLokVMWgTbKT8uFCxcwscI4cpfPXoToqV0CpN/JFmbLVXvoInMG/ITqsXxOXBse2SqQI JRazmKiaPqHDxzkiyJL/z7c4NwD2ixo= Received: from mimecast-mx02.redhat.com (mimecast-mx02.redhat.com [66.187.233.88]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.3, cipher=TLS_AES_256_GCM_SHA384) id us-mta-675-o8ZyKFt5Mi2Kg8xySFue2w-1; Thu, 26 Oct 2023 17:41:37 -0400 X-MC-Unique: o8ZyKFt5Mi2Kg8xySFue2w-1 Received: from smtp.corp.redhat.com (int-mx05.intmail.prod.int.rdu2.redhat.com [10.11.54.5]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (2048 bits) server-digest SHA256) (No client certificate requested) by mimecast-mx02.redhat.com (Postfix) with ESMTPS id 7EBA280F921; Thu, 26 Oct 2023 21:41:37 +0000 (UTC) Received: from pbitcolo-build-10.permabit.com (pbitcolo-build-10.permabit.lab.eng.bos.redhat.com [10.19.117.76]) by smtp.corp.redhat.com (Postfix) with ESMTP id 7002A2949; Thu, 26 Oct 2023 21:41:37 +0000 (UTC) Received: by pbitcolo-build-10.permabit.com (Postfix, from userid 1138) id EEE8F3003D; Thu, 26 Oct 2023 17:41:36 -0400 (EDT) From: Matthew Sakai To: dm-devel@lists.linux.dev, linux-block@vger.kernel.org Cc: Matthew Sakai Subject: [PATCH v4 02/39] dm vdo: add the MurmurHash3 fast hashing algorithm Date: Thu, 26 Oct 2023 17:40:59 -0400 Message-Id: <20231026214136.1067410-3-msakai@redhat.com> In-Reply-To: <20231026214136.1067410-1-msakai@redhat.com> References: <20231026214136.1067410-1-msakai@redhat.com> MIME-Version: 1.0 X-Scanned-By: MIMEDefang 3.4.1 on 10.11.54.5 Precedence: bulk List-ID: X-Mailing-List: linux-block@vger.kernel.org MurmurHash3 is a fast, non-cryptographic, 128-bit hash. It was originally written by Austin Appleby and placed in the public domain. This version has been modified to produce the same result on both big endian and little endian processors, making it suitable for use in portable persistent data. Co-developed-by: J. corwin Coburn Signed-off-by: J. 
corwin Coburn Co-developed-by: Ken Raeburn Signed-off-by: Ken Raeburn Co-developed-by: John Wiele Signed-off-by: John Wiele Signed-off-by: Matthew Sakai --- drivers/md/dm-vdo/murmurhash3.c | 175 ++++++++++++++++++++++++++++++++ drivers/md/dm-vdo/murmurhash3.h | 15 +++ 2 files changed, 190 insertions(+) create mode 100644 drivers/md/dm-vdo/murmurhash3.c create mode 100644 drivers/md/dm-vdo/murmurhash3.h diff --git a/drivers/md/dm-vdo/murmurhash3.c b/drivers/md/dm-vdo/murmurhash3.c new file mode 100644 index 000000000000..c68d0d153904 --- /dev/null +++ b/drivers/md/dm-vdo/murmurhash3.c @@ -0,0 +1,175 @@ +// SPDX-License-Identifier: LGPL-2.1+ +/* + * MurmurHash3 was written by Austin Appleby, and is placed in the public + * domain. The author hereby disclaims copyright to this source code. + * + * Adapted by John Wiele (jwiele@redhat.com). + */ + +#include "murmurhash3.h" + +static inline u64 rotl64(u64 x, s8 r) +{ + return (x << r) | (x >> (64 - r)); +} + +#define ROTL64(x, y) rotl64(x, y) +static __always_inline u64 getblock64(const u64 *p, int i) +{ +#if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__ + return p[i]; +#elif __BYTE_ORDER__ == __ORDER_BIG_ENDIAN__ + return __builtin_bswap64(p[i]); +#else +#error "can't figure out byte order" +#endif +} + +static __always_inline void putblock64(u64 *p, int i, u64 value) +{ +#if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__ + p[i] = value; +#elif __BYTE_ORDER__ == __ORDER_BIG_ENDIAN__ + p[i] = __builtin_bswap64(value); +#else +#error "can't figure out byte order" +#endif +} + +/* Finalization mix - force all bits of a hash block to avalanche */ + +static __always_inline u64 fmix64(u64 k) +{ + k ^= k >> 33; + k *= 0xff51afd7ed558ccdLLU; + k ^= k >> 33; + k *= 0xc4ceb9fe1a85ec53LLU; + k ^= k >> 33; + + return k; +} + +void murmurhash3_128(const void *key, const int len, const u32 seed, void *out) +{ + const u8 *data = (const u8 *)key; + const int nblocks = len / 16; + + u64 h1 = seed; + u64 h2 = seed; + + const u64 c1 = 0x87c37b91114253d5LLU; + const u64 c2 = 0x4cf5ad432745937fLLU; + + /* body */ + + const u64 *blocks = (const u64 *)(data); + + int i; + + for (i = 0; i < nblocks; i++) { + u64 k1 = getblock64(blocks, i * 2 + 0); + u64 k2 = getblock64(blocks, i * 2 + 1); + + k1 *= c1; + k1 = ROTL64(k1, 31); + k1 *= c2; + h1 ^= k1; + + h1 = ROTL64(h1, 27); + h1 += h2; + h1 = h1 * 5 + 0x52dce729; + + k2 *= c2; + k2 = ROTL64(k2, 33); + k2 *= c1; + h2 ^= k2; + + h2 = ROTL64(h2, 31); + h2 += h1; + h2 = h2 * 5 + 0x38495ab5; + } + + /* tail */ + + { + const u8 *tail = (const u8 *)(data + nblocks * 16); + + u64 k1 = 0; + u64 k2 = 0; + + switch (len & 15) { + case 15: + k2 ^= ((u64)tail[14]) << 48; + fallthrough; + case 14: + k2 ^= ((u64)tail[13]) << 40; + fallthrough; + case 13: + k2 ^= ((u64)tail[12]) << 32; + fallthrough; + case 12: + k2 ^= ((u64)tail[11]) << 24; + fallthrough; + case 11: + k2 ^= ((u64)tail[10]) << 16; + fallthrough; + case 10: + k2 ^= ((u64)tail[9]) << 8; + fallthrough; + case 9: + k2 ^= ((u64)tail[8]) << 0; + k2 *= c2; + k2 = ROTL64(k2, 33); + k2 *= c1; + h2 ^= k2; + fallthrough; + + case 8: + k1 ^= ((u64)tail[7]) << 56; + fallthrough; + case 7: + k1 ^= ((u64)tail[6]) << 48; + fallthrough; + case 6: + k1 ^= ((u64)tail[5]) << 40; + fallthrough; + case 5: + k1 ^= ((u64)tail[4]) << 32; + fallthrough; + case 4: + k1 ^= ((u64)tail[3]) << 24; + fallthrough; + case 3: + k1 ^= ((u64)tail[2]) << 16; + fallthrough; + case 2: + k1 ^= ((u64)tail[1]) << 8; + fallthrough; + case 1: + k1 ^= ((u64)tail[0]) << 0; + k1 *= c1; + k1 = ROTL64(k1, 31); + k1 *= c2; + 
h1 ^= k1; + break; + default: + break; + }; + } + /* finalization */ + + h1 ^= len; + h2 ^= len; + + h1 += h2; + h2 += h1; + + h1 = fmix64(h1); + h2 = fmix64(h2); + + h1 += h2; + h2 += h1; + + putblock64((u64 *)out, 0, h1); + putblock64((u64 *)out, 1, h2); +} diff --git a/drivers/md/dm-vdo/murmurhash3.h b/drivers/md/dm-vdo/murmurhash3.h new file mode 100644 index 000000000000..d84711ddb659 --- /dev/null +++ b/drivers/md/dm-vdo/murmurhash3.h @@ -0,0 +1,15 @@ +/* SPDX-License-Identifier: LGPL-2.1+ */ +/* + * MurmurHash3 was written by Austin Appleby, and is placed in the public + * domain. The author hereby disclaims copyright to this source code. + */ + +#ifndef _MURMURHASH3_H_ +#define _MURMURHASH3_H_ + +#include +#include + +void murmurhash3_128(const void *key, int len, u32 seed, void *out); + +#endif /* _MURMURHASH3_H_ */ From patchwork Thu Oct 26 21:41:00 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Matthew Sakai X-Patchwork-Id: 13437928 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 29861C25B48 for ; Thu, 26 Oct 2023 21:42:36 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S232171AbjJZVmg (ORCPT ); Thu, 26 Oct 2023 17:42:36 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:34950 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229785AbjJZVmf (ORCPT ); Thu, 26 Oct 2023 17:42:35 -0400 Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.129.124]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id C6F8418A for ; Thu, 26 Oct 2023 14:41:41 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1698356501; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=Ms1m0IJ5OOBEwdcY+WtZ9KiJuG2ySmUq4FFZSQjReY4=; b=Am7volFnJAes9y/MVykEBm2RqZZc/uAF5YDRKjPEZPJBostPygA1TvSmarE+ePmFkz0tYz wa223UrYTry7thGBI51vO49/S8zTwjJvvClTMyCKYPjDOEWfcNW9QgPopybnwluvB7Jw1A zTuRT4ZrM58PH5Lu6hnRZafh70uRbm0= Received: from mimecast-mx02.redhat.com (mimecast-mx02.redhat.com [66.187.233.88]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.3, cipher=TLS_AES_256_GCM_SHA384) id us-mta-633-DIYqi9fyP4eDMFH7oKusNQ-1; Thu, 26 Oct 2023 17:41:37 -0400 X-MC-Unique: DIYqi9fyP4eDMFH7oKusNQ-1 Received: from smtp.corp.redhat.com (int-mx05.intmail.prod.int.rdu2.redhat.com [10.11.54.5]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (2048 bits) server-digest SHA256) (No client certificate requested) by mimecast-mx02.redhat.com (Postfix) with ESMTPS id 9AA0681101F; Thu, 26 Oct 2023 21:41:37 +0000 (UTC) Received: from pbitcolo-build-10.permabit.com (pbitcolo-build-10.permabit.lab.eng.bos.redhat.com [10.19.117.76]) by smtp.corp.redhat.com (Postfix) with ESMTP id 8C43A2949; Thu, 26 Oct 2023 21:41:37 +0000 (UTC) Received: by pbitcolo-build-10.permabit.com (Postfix, from userid 1138) id 4D05E3003E; Thu, 26 Oct 2023 17:41:37 -0400 (EDT) From: Matthew Sakai To: dm-devel@lists.linux.dev, linux-block@vger.kernel.org Cc: Matthew Sakai Subject: [PATCH v4 03/39] dm vdo: add memory 
allocation utilities Date: Thu, 26 Oct 2023 17:41:00 -0400 Message-Id: <20231026214136.1067410-4-msakai@redhat.com> In-Reply-To: <20231026214136.1067410-1-msakai@redhat.com> References: <20231026214136.1067410-1-msakai@redhat.com> MIME-Version: 1.0 X-Scanned-By: MIMEDefang 3.4.1 on 10.11.54.5 Precedence: bulk List-ID: X-Mailing-List: linux-block@vger.kernel.org This patch adds standardized allocation macros and memory tracking tools to track and report any allocated memory that is not freed. This makes it easier to ensure that the vdo target does not leak memory. This patch also adds utilities for controlling whether certain threads are allowed to allocate memory, since memory allocation during certain critical code sections can cause the vdo target to deadlock. Co-developed-by: J. corwin Coburn Signed-off-by: J. corwin Coburn Co-developed-by: Michael Sclafani Signed-off-by: Michael Sclafani Co-developed-by: Thomas Jaskiewicz Signed-off-by: Thomas Jaskiewicz Co-developed-by: Ken Raeburn Signed-off-by: Ken Raeburn Signed-off-by: Matthew Sakai --- drivers/md/dm-vdo/memory-alloc.c | 447 +++++++++++++++++++++++++++++++ drivers/md/dm-vdo/memory-alloc.h | 171 ++++++++++++ 2 files changed, 618 insertions(+) create mode 100644 drivers/md/dm-vdo/memory-alloc.c create mode 100644 drivers/md/dm-vdo/memory-alloc.h diff --git a/drivers/md/dm-vdo/memory-alloc.c b/drivers/md/dm-vdo/memory-alloc.c new file mode 100644 index 000000000000..41b15f7a7ba6 --- /dev/null +++ b/drivers/md/dm-vdo/memory-alloc.c @@ -0,0 +1,447 @@ +// SPDX-License-Identifier: GPL-2.0-only +/* + * Copyright 2023 Red Hat + */ + +#include +#include +#include +#include +#include + +#include "logger.h" +#include "memory-alloc.h" +#include "permassert.h" + +/* + * UDS and VDO keep track of which threads are allowed to allocate memory freely, and which threads + * must be careful to not do a memory allocation that does an I/O request. The allocating_threads + * threads_registry and its associated methods implement this tracking. + */ +static struct thread_registry allocating_threads; + +static bool allocations_allowed(void) +{ + const bool *pointer = uds_lookup_thread(&allocating_threads); + + return (pointer != NULL) ? *pointer : false; +} + +/* + * Register the current thread as an allocating thread. + * + * An optional flag location can be supplied indicating whether, at any given point in time, the + * threads associated with that flag should be allocating storage. If the flag is false, a message + * will be logged. + * + * If no flag is supplied, the thread is always allowed to allocate storage without complaint. + * + * @new_thread: registered_thread structure to use for the current thread + * @flag_ptr: Location of the allocation-allowed flag + */ +void uds_register_allocating_thread(struct registered_thread *new_thread, const bool *flag_ptr) +{ + if (flag_ptr == NULL) { + static const bool allocation_always_allowed = true; + + flag_ptr = &allocation_always_allowed; + } + + uds_register_thread(&allocating_threads, new_thread, flag_ptr); +} + +/* Unregister the current thread as an allocating thread. */ +void uds_unregister_allocating_thread(void) +{ + uds_unregister_thread(&allocating_threads); +} + +/* + * We track how much memory has been allocated and freed. When we unload the module, we log an + * error if we have not freed all the memory that we allocated. Nearly all memory allocation and + * freeing is done using this module. 
+ * + * We do not use kernel functions like the kvasprintf() method, which allocate memory indirectly + * using kmalloc. + * + * These data structures and methods are used to track the amount of memory used. + */ + +/* + * We allocate very few large objects, and allocation/deallocation isn't done in a + * performance-critical stage for us, so a linked list should be fine. + */ +struct vmalloc_block_info { + void *ptr; + size_t size; + struct vmalloc_block_info *next; +}; + +static struct { + spinlock_t lock; + size_t kmalloc_blocks; + size_t kmalloc_bytes; + size_t vmalloc_blocks; + size_t vmalloc_bytes; + size_t peak_bytes; + struct vmalloc_block_info *vmalloc_list; +} memory_stats __cacheline_aligned; + +static void update_peak_usage(void) +{ + size_t total_bytes = memory_stats.kmalloc_bytes + memory_stats.vmalloc_bytes; + + if (total_bytes > memory_stats.peak_bytes) + memory_stats.peak_bytes = total_bytes; +} + +static void add_kmalloc_block(size_t size) +{ + unsigned long flags; + + spin_lock_irqsave(&memory_stats.lock, flags); + memory_stats.kmalloc_blocks++; + memory_stats.kmalloc_bytes += size; + update_peak_usage(); + spin_unlock_irqrestore(&memory_stats.lock, flags); +} + +static void remove_kmalloc_block(size_t size) +{ + unsigned long flags; + + spin_lock_irqsave(&memory_stats.lock, flags); + memory_stats.kmalloc_blocks--; + memory_stats.kmalloc_bytes -= size; + spin_unlock_irqrestore(&memory_stats.lock, flags); +} + +static void add_vmalloc_block(struct vmalloc_block_info *block) +{ + unsigned long flags; + + spin_lock_irqsave(&memory_stats.lock, flags); + block->next = memory_stats.vmalloc_list; + memory_stats.vmalloc_list = block; + memory_stats.vmalloc_blocks++; + memory_stats.vmalloc_bytes += block->size; + update_peak_usage(); + spin_unlock_irqrestore(&memory_stats.lock, flags); +} + +static void remove_vmalloc_block(void *ptr) +{ + struct vmalloc_block_info *block; + struct vmalloc_block_info **block_ptr; + unsigned long flags; + + spin_lock_irqsave(&memory_stats.lock, flags); + for (block_ptr = &memory_stats.vmalloc_list; + (block = *block_ptr) != NULL; + block_ptr = &block->next) { + if (block->ptr == ptr) { + *block_ptr = block->next; + memory_stats.vmalloc_blocks--; + memory_stats.vmalloc_bytes -= block->size; + break; + } + } + + spin_unlock_irqrestore(&memory_stats.lock, flags); + if (block != NULL) + UDS_FREE(block); + else + uds_log_info("attempting to remove ptr %px not found in vmalloc list", ptr); +} + +/* + * Determine whether allocating a memory block should use kmalloc or __vmalloc. + * + * vmalloc can allocate any integral number of pages. + * + * kmalloc can allocate any number of bytes up to a configured limit, which defaults to 8 megabytes + * on some systems. kmalloc is especially good when memory is being both allocated and freed, and + * it does this efficiently in a multi CPU environment. + * + * kmalloc usually rounds the size of the block up to the next power of two, so when the requested + * block is bigger than PAGE_SIZE / 2 bytes, kmalloc will never give you less space than the + * corresponding vmalloc allocation. Sometimes vmalloc will use less overhead than kmalloc. + * + * The advantages of kmalloc do not help out UDS or VDO, because we allocate all our memory up + * front and do not free and reallocate it. Sometimes we have problems using kmalloc, because the + * Linux memory page map can become so fragmented that kmalloc will not give us a 32KB chunk. 
We + * have used vmalloc as a backup to kmalloc in the past, and a follow-up vmalloc of 32KB will work. + * But there is no strong case to be made for using kmalloc over vmalloc for these size chunks. + * + * The kmalloc/vmalloc boundary is set at 4KB, and kmalloc gets the 4KB requests. There is no + * strong reason for favoring either kmalloc or vmalloc for 4KB requests, except that tracking + * vmalloc statistics uses a linked list implementation. Using a simple test, this choice of + * boundary results in 132 vmalloc calls. Using vmalloc for requests of exactly 4KB results in an + * additional 6374 vmalloc calls, which is much less efficient for tracking. + * + * @size: How many bytes to allocate + */ +static inline bool use_kmalloc(size_t size) +{ + return size <= PAGE_SIZE; +} + +/* + * Allocate storage based on memory size and alignment, logging an error if the allocation fails. + * The memory will be zeroed. + * + * @size: The size of an object + * @align: The required alignment + * @what: What is being allocated (for error logging) + * @ptr: A pointer to hold the allocated memory + * + * Return: UDS_SUCCESS or an error code + */ +int uds_allocate_memory(size_t size, size_t align, const char *what, void *ptr) +{ + /* + * The __GFP_RETRY_MAYFAIL flag means the VM implementation will retry memory reclaim + * procedures that have previously failed if there is some indication that progress has + * been made elsewhere. It can wait for other tasks to attempt high level approaches to + * freeing memory such as compaction (which removes fragmentation) and page-out. There is + * still a definite limit to the number of retries, but it is a larger limit than with + * __GFP_NORETRY. Allocations with this flag may fail, but only when there is genuinely + * little unused memory. While these allocations do not directly trigger the OOM killer, + * their failure indicates that the system is likely to need to use the OOM killer soon. + * The caller must handle failure, but can reasonably do so by failing a higher-level + * request, or completing it only in a much less efficient manner. + */ + const gfp_t gfp_flags = GFP_KERNEL | __GFP_ZERO | __GFP_RETRY_MAYFAIL; + unsigned int noio_flags; + bool allocations_restricted = !allocations_allowed(); + unsigned long start_time; + void *p = NULL; + + if (ptr == NULL) + return UDS_INVALID_ARGUMENT; + + if (size == 0) { + *((void **) ptr) = NULL; + return UDS_SUCCESS; + } + + if (allocations_restricted) + noio_flags = memalloc_noio_save(); + + start_time = jiffies; + if (use_kmalloc(size) && (align < PAGE_SIZE)) { + p = kmalloc(size, gfp_flags | __GFP_NOWARN); + if (p == NULL) { + /* + * It is possible for kmalloc to fail to allocate memory because there is + * no page available (see VDO-3688). A short sleep may allow the page + * reclaimer to free a page. + */ + fsleep(1000); + p = kmalloc(size, gfp_flags); + } + + if (p != NULL) + add_kmalloc_block(ksize(p)); + } else { + struct vmalloc_block_info *block; + + if (UDS_ALLOCATE(1, struct vmalloc_block_info, __func__, &block) == + UDS_SUCCESS) { + /* + * It is possible for __vmalloc to fail to allocate memory because there + * are no pages available (see VDO-3661). A short sleep may allow the page + * reclaimer to free enough pages for a small allocation. + * + * For larger allocations, the page_alloc code is racing against the page + * reclaimer. If the page reclaimer can stay ahead of page_alloc, the + * __vmalloc will succeed. But if page_alloc overtakes the page reclaimer, + * the allocation fails. 
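 * (Not part of the patch: a worked example of the retry budget in the loop
 * below. Each failed __vmalloc attempt sleeps for 1 ms via fsleep(1000);
 * once jiffies_to_msecs(jiffies - start_time) exceeds 1000 ms, one final
 * attempt is made with warnings enabled, so a caller waits at most roughly
 * one second before the function gives up and returns -ENOMEM.)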
It is possible that more retries will succeed. + */ + for (;;) { + p = __vmalloc(size, gfp_flags | __GFP_NOWARN); + + if (p != NULL) + break; + + if (jiffies_to_msecs(jiffies - start_time) > 1000) { + /* Try one more time, logging a failure for this call. */ + p = __vmalloc(size, gfp_flags); + break; + } + + fsleep(1000); + } + + if (p == NULL) { + UDS_FREE(block); + } else { + block->ptr = p; + block->size = PAGE_ALIGN(size); + add_vmalloc_block(block); + } + } + } + + if (allocations_restricted) + memalloc_noio_restore(noio_flags); + + if (p == NULL) { + unsigned int duration = jiffies_to_msecs(jiffies - start_time); + + uds_log_error("Could not allocate %zu bytes for %s in %u msecs", + size, + what, + duration); + return -ENOMEM; + } + + *((void **) ptr) = p; + return UDS_SUCCESS; +} + +/* + * Allocate storage based on memory size, failing immediately if the required memory is not + * available. The memory will be zeroed. + * + * @size: The size of an object. + * @what: What is being allocated (for error logging) + * + * Return: pointer to the allocated memory, or NULL if the required space is not available. + */ +void *uds_allocate_memory_nowait(size_t size, const char *what __maybe_unused) +{ + void *p = kmalloc(size, GFP_NOWAIT | __GFP_ZERO); + + if (p != NULL) + add_kmalloc_block(ksize(p)); + + return p; +} + +void uds_free_memory(void *ptr) +{ + if (ptr != NULL) { + if (is_vmalloc_addr(ptr)) { + remove_vmalloc_block(ptr); + vfree(ptr); + } else { + remove_kmalloc_block(ksize(ptr)); + kfree(ptr); + } + } +} + +/* + * Reallocate dynamically allocated memory. There are no alignment guarantees for the reallocated + * memory. If the new memory is larger than the old memory, the new space will be zeroed. + * + * @ptr: The memory to reallocate. + * @old_size: The old size of the memory + * @size: The new size to allocate + * @what: What is being allocated (for error logging) + * @new_ptr: A pointer to hold the reallocated pointer + * + * Return: UDS_SUCCESS or an error code + */ +int uds_reallocate_memory(void *ptr, size_t old_size, size_t size, const char *what, void *new_ptr) +{ + int result; + + if (size == 0) { + UDS_FREE(ptr); + *(void **) new_ptr = NULL; + return UDS_SUCCESS; + } + + result = UDS_ALLOCATE(size, char, what, new_ptr); + if (result != UDS_SUCCESS) + return result; + + if (ptr != NULL) { + if (old_size < size) + size = old_size; + + memcpy(*((void **) new_ptr), ptr, size); + UDS_FREE(ptr); + } + + return UDS_SUCCESS; +} + +int uds_duplicate_string(const char *string, const char *what, char **new_string) +{ + int result; + u8 *dup; + + result = UDS_ALLOCATE(strlen(string) + 1, u8, what, &dup); + if (result != UDS_SUCCESS) + return result; + + memcpy(dup, string, strlen(string) + 1); + *new_string = dup; + return UDS_SUCCESS; +} + +void uds_memory_init(void) +{ + spin_lock_init(&memory_stats.lock); + uds_initialize_thread_registry(&allocating_threads); +} + +void uds_memory_exit(void) +{ + ASSERT_LOG_ONLY(memory_stats.kmalloc_bytes == 0, + "kmalloc memory used (%zd bytes in %zd blocks) is returned to the kernel", + memory_stats.kmalloc_bytes, + memory_stats.kmalloc_blocks); + ASSERT_LOG_ONLY(memory_stats.vmalloc_bytes == 0, + "vmalloc memory used (%zd bytes in %zd blocks) is returned to the kernel", + memory_stats.vmalloc_bytes, + memory_stats.vmalloc_blocks); + uds_log_debug("peak usage %zd bytes", memory_stats.peak_bytes); +} + +void uds_get_memory_stats(u64 *bytes_used, u64 *peak_bytes_used) +{ + unsigned long flags; + + spin_lock_irqsave(&memory_stats.lock, flags); + 
*bytes_used = memory_stats.kmalloc_bytes + memory_stats.vmalloc_bytes; + *peak_bytes_used = memory_stats.peak_bytes; + spin_unlock_irqrestore(&memory_stats.lock, flags); +} + +/* + * Report stats on any allocated memory that we're tracking. Not all allocation types are + * guaranteed to be tracked in bytes (e.g., bios). + */ +void uds_report_memory_usage(void) +{ + unsigned long flags; + u64 kmalloc_blocks; + u64 kmalloc_bytes; + u64 vmalloc_blocks; + u64 vmalloc_bytes; + u64 peak_usage; + u64 total_bytes; + + spin_lock_irqsave(&memory_stats.lock, flags); + kmalloc_blocks = memory_stats.kmalloc_blocks; + kmalloc_bytes = memory_stats.kmalloc_bytes; + vmalloc_blocks = memory_stats.vmalloc_blocks; + vmalloc_bytes = memory_stats.vmalloc_bytes; + peak_usage = memory_stats.peak_bytes; + spin_unlock_irqrestore(&memory_stats.lock, flags); + total_bytes = kmalloc_bytes + vmalloc_bytes; + uds_log_info("current module memory tracking (actual allocation sizes, not requested):"); + uds_log_info(" %llu bytes in %llu kmalloc blocks", + (unsigned long long) kmalloc_bytes, + (unsigned long long) kmalloc_blocks); + uds_log_info(" %llu bytes in %llu vmalloc blocks", + (unsigned long long) vmalloc_bytes, + (unsigned long long) vmalloc_blocks); + uds_log_info(" total %llu bytes, peak usage %llu bytes", + (unsigned long long) total_bytes, + (unsigned long long) peak_usage); +} diff --git a/drivers/md/dm-vdo/memory-alloc.h b/drivers/md/dm-vdo/memory-alloc.h new file mode 100644 index 000000000000..f02eadf08984 --- /dev/null +++ b/drivers/md/dm-vdo/memory-alloc.h @@ -0,0 +1,171 @@ +/* SPDX-License-Identifier: GPL-2.0-only */ +/* + * Copyright 2023 Red Hat + */ + +#ifndef UDS_MEMORY_ALLOC_H +#define UDS_MEMORY_ALLOC_H + +#include +#include /* for PAGE_SIZE */ + +#include "permassert.h" +#include "thread-registry.h" + +/* Custom memory allocation functions for UDS that track memory usage */ + +int __must_check uds_allocate_memory(size_t size, size_t align, const char *what, void *ptr); + +void uds_free_memory(void *ptr); + +/* Free memory allocated with UDS_ALLOCATE(). */ +#define UDS_FREE(PTR) uds_free_memory(PTR) + +static inline void *uds_forget(void **ptr_ptr) +{ + void *ptr = *ptr_ptr; + + *ptr_ptr = NULL; + return ptr; +} + +/* + * Null out a pointer and return a copy to it. This macro should be used when passing a pointer to + * a function for which it is not safe to access the pointer once the function returns. + */ +#define UDS_FORGET(ptr) uds_forget((void **) &(ptr)) + +/* + * Allocate storage based on element counts, sizes, and alignment. + * + * This is a generalized form of our allocation use case: It allocates an array of objects, + * optionally preceded by one object of another type (i.e., a struct with trailing variable-length + * array), with the alignment indicated. + * + * Why is this inline? The sizes and alignment will always be constant, when invoked through the + * macros below, and often the count will be a compile-time constant 1 or the number of extra bytes + * will be a compile-time constant 0. So at least some of the arithmetic can usually be optimized + * away, and the run-time selection between allocation functions always can. In many cases, it'll + * boil down to just a function call with a constant size. 
+ * + * @count: The number of objects to allocate + * @size: The size of an object + * @extra: The number of additional bytes to allocate + * @align: The required alignment + * @what: What is being allocated (for error logging) + * @ptr: A pointer to hold the allocated memory + * + * Return: UDS_SUCCESS or an error code + */ +static inline int uds_do_allocation(size_t count, + size_t size, + size_t extra, + size_t align, + const char *what, + void *ptr) +{ + size_t total_size = count * size + extra; + + /* Overflow check: */ + if ((size > 0) && (count > ((SIZE_MAX - extra) / size))) + /* + * This is kind of a hack: We rely on the fact that SIZE_MAX would cover the entire + * address space (minus one byte) and thus the system can never allocate that much + * and the call will always fail. So we can report an overflow as "out of memory" + * by asking for "merely" SIZE_MAX bytes. + */ + total_size = SIZE_MAX; + + return uds_allocate_memory(total_size, align, what, ptr); +} + +int __must_check uds_reallocate_memory(void *ptr, + size_t old_size, + size_t size, + const char *what, + void *new_ptr); + +/* + * Allocate one or more elements of the indicated type, logging an error if the allocation fails. + * The memory will be zeroed. + * + * @COUNT: The number of objects to allocate + * @TYPE: The type of objects to allocate. This type determines the alignment of the allocation. + * @WHAT: What is being allocated (for error logging) + * @PTR: A pointer to hold the allocated memory + * + * Return: UDS_SUCCESS or an error code + */ +#define UDS_ALLOCATE(COUNT, TYPE, WHAT, PTR) \ + uds_do_allocation(COUNT, sizeof(TYPE), 0, __alignof__(TYPE), WHAT, PTR) + +/* + * Allocate one object of an indicated type, followed by one or more elements of a second type, + * logging an error if the allocation fails. The memory will be zeroed. + * + * @TYPE1: The type of the primary object to allocate. This type determines the alignment of the + * allocated memory. + * @COUNT: The number of objects to allocate + * @TYPE2: The type of array objects to allocate + * @WHAT: What is being allocated (for error logging) + * @PTR: A pointer to hold the allocated memory + * + * Return: UDS_SUCCESS or an error code + */ +#define UDS_ALLOCATE_EXTENDED(TYPE1, COUNT, TYPE2, WHAT, PTR) \ + __extension__({ \ + int _result; \ + TYPE1 **_ptr = (PTR); \ + STATIC_ASSERT(__alignof__(TYPE1) >= __alignof__(TYPE2)); \ + _result = uds_do_allocation(COUNT, \ + sizeof(TYPE2), \ + sizeof(TYPE1), \ + __alignof__(TYPE1), \ + WHAT, \ + _ptr); \ + _result; \ + }) + +/* + * Allocate memory starting on a cache line boundary, logging an error if the allocation fails. The + * memory will be zeroed. + * + * @size: The number of bytes to allocate + * @what: What is being allocated (for error logging) + * @ptr: A pointer to hold the allocated memory + * + * Return: UDS_SUCCESS or an error code + */ +static inline int __must_check uds_allocate_cache_aligned(size_t size, const char *what, void *ptr) +{ + return uds_allocate_memory(size, L1_CACHE_BYTES, what, ptr); +} + +void *__must_check uds_allocate_memory_nowait(size_t size, const char *what); + +/* + * Allocate one element of the indicated type immediately, failing if the required memory is not + * immediately available. + * + * @TYPE: The type of objects to allocate + * @WHAT: What is being allocated (for error logging) + * + * Return: pointer to the memory, or NULL if the memory is not available. 
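/*
 * Minimal sketch (not part of the patch) of the header-plus-trailing-array
 * pattern that UDS_ALLOCATE_EXTENDED above is designed for. The struct,
 * field, and function names here are hypothetical.
 */
struct example_table {
        u64 entry_count;
        u64 entries[];
};

static int example_make_table(size_t count, struct example_table **table_ptr)
{
        struct example_table *table;
        int result;

        /* One struct example_table followed by count zeroed u64 entries. */
        result = UDS_ALLOCATE_EXTENDED(struct example_table, count, u64,
                                       __func__, &table);
        if (result != UDS_SUCCESS)
                return result;

        table->entry_count = count;
        *table_ptr = table;
        return UDS_SUCCESS;
}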
+ */ +#define UDS_ALLOCATE_NOWAIT(TYPE, WHAT) uds_allocate_memory_nowait(sizeof(TYPE), WHAT) + +int __must_check uds_duplicate_string(const char *string, const char *what, char **new_string); + +void uds_memory_exit(void); + +void uds_memory_init(void); + +void uds_register_allocating_thread(struct registered_thread *new_thread, const bool *flag_ptr); + +void uds_unregister_allocating_thread(void); + +void uds_get_memory_stats(u64 *bytes_used, u64 *peak_bytes_used); + +void uds_report_memory_usage(void); + +#endif /* UDS_MEMORY_ALLOC_H */ From patchwork Thu Oct 26 21:41:01 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Matthew Sakai X-Patchwork-Id: 13437930 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id B0C93C25B48 for ; Thu, 26 Oct 2023 21:43:14 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S232210AbjJZVnO (ORCPT ); Thu, 26 Oct 2023 17:43:14 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:40942 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S231977AbjJZVnO (ORCPT ); Thu, 26 Oct 2023 17:43:14 -0400 Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.133.124]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 856241B3 for ; Thu, 26 Oct 2023 14:41:42 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1698356501; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=OqIYtQ0iP76N1W9+W/qFYkGNuS2V5HGCOukSmrRjIq4=; b=GUomIhVux4Q5d7gdOBviLRy3bW+a0TGf9mIrUb3qbCgcQLRMPARjg1r6HlQcE8lObhI/wd LwnbHRT6K8p3KoqEtkKqfyYuXCnAM5YcMoorgawi+/UA1eXbCnMQZsLe5agM3Di2fNA6vk GccYR1hpvditZgs4Kz6q57Ls14GphsE= Received: from mimecast-mx02.redhat.com (mimecast-mx02.redhat.com [66.187.233.88]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.3, cipher=TLS_AES_256_GCM_SHA384) id us-mta-524-X_IwkbL-OI24j1bPK9aH0g-1; Thu, 26 Oct 2023 17:41:38 -0400 X-MC-Unique: X_IwkbL-OI24j1bPK9aH0g-1 Received: from smtp.corp.redhat.com (int-mx10.intmail.prod.int.rdu2.redhat.com [10.11.54.10]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (2048 bits) server-digest SHA256) (No client certificate requested) by mimecast-mx02.redhat.com (Postfix) with ESMTPS id DB5A9185A783; Thu, 26 Oct 2023 21:41:37 +0000 (UTC) Received: from pbitcolo-build-10.permabit.com (pbitcolo-build-10.permabit.lab.eng.bos.redhat.com [10.19.117.76]) by smtp.corp.redhat.com (Postfix) with ESMTP id C02D6492BFA; Thu, 26 Oct 2023 21:41:37 +0000 (UTC) Received: by pbitcolo-build-10.permabit.com (Postfix, from userid 1138) id 762513003C; Thu, 26 Oct 2023 17:41:37 -0400 (EDT) From: Matthew Sakai To: dm-devel@lists.linux.dev, linux-block@vger.kernel.org Cc: Matthew Sakai Subject: [PATCH v4 04/39] dm vdo: add basic logging and support utilities Date: Thu, 26 Oct 2023 17:41:01 -0400 Message-Id: <20231026214136.1067410-5-msakai@redhat.com> In-Reply-To: <20231026214136.1067410-1-msakai@redhat.com> References: <20231026214136.1067410-1-msakai@redhat.com> MIME-Version: 1.0 X-Scanned-By: MIMEDefang 3.4.1 on 
10.11.54.10 Precedence: bulk List-ID: X-Mailing-List: linux-block@vger.kernel.org Add various support utilities for the vdo target and deduplication index, including logging utilities, string and time management, and index-specific error codes. Co-developed-by: J. corwin Coburn Signed-off-by: J. corwin Coburn Co-developed-by: Michael Sclafani Signed-off-by: Michael Sclafani Co-developed-by: Thomas Jaskiewicz Signed-off-by: Thomas Jaskiewicz Co-developed-by: Ken Raeburn Signed-off-by: Ken Raeburn Signed-off-by: Matthew Sakai --- drivers/md/dm-vdo/errors.c | 316 +++++++++++++++++++++++++++++++ drivers/md/dm-vdo/errors.h | 83 ++++++++ drivers/md/dm-vdo/logger.c | 304 +++++++++++++++++++++++++++++ drivers/md/dm-vdo/logger.h | 112 +++++++++++ drivers/md/dm-vdo/permassert.c | 35 ++++ drivers/md/dm-vdo/permassert.h | 65 +++++++ drivers/md/dm-vdo/string-utils.c | 28 +++ drivers/md/dm-vdo/string-utils.h | 23 +++ drivers/md/dm-vdo/time-utils.h | 28 +++ 9 files changed, 994 insertions(+) create mode 100644 drivers/md/dm-vdo/errors.c create mode 100644 drivers/md/dm-vdo/errors.h create mode 100644 drivers/md/dm-vdo/logger.c create mode 100644 drivers/md/dm-vdo/logger.h create mode 100644 drivers/md/dm-vdo/permassert.c create mode 100644 drivers/md/dm-vdo/permassert.h create mode 100644 drivers/md/dm-vdo/string-utils.c create mode 100644 drivers/md/dm-vdo/string-utils.h create mode 100644 drivers/md/dm-vdo/time-utils.h diff --git a/drivers/md/dm-vdo/errors.c b/drivers/md/dm-vdo/errors.c new file mode 100644 index 000000000000..223eca020ff4 --- /dev/null +++ b/drivers/md/dm-vdo/errors.c @@ -0,0 +1,316 @@ +// SPDX-License-Identifier: GPL-2.0-only +/* + * Copyright 2023 Red Hat + */ + +#include "errors.h" + +#include +#include + +#include "logger.h" +#include "permassert.h" +#include "string-utils.h" + +static const struct error_info successful = { "UDS_SUCCESS", "Success" }; + +static const char *const message_table[] = { + [EPERM] = "Operation not permitted", + [ENOENT] = "No such file or directory", + [ESRCH] = "No such process", + [EINTR] = "Interrupted system call", + [EIO] = "Input/output error", + [ENXIO] = "No such device or address", + [E2BIG] = "Argument list too long", + [ENOEXEC] = "Exec format error", + [EBADF] = "Bad file descriptor", + [ECHILD] = "No child processes", + [EAGAIN] = "Resource temporarily unavailable", + [ENOMEM] = "Cannot allocate memory", + [EACCES] = "Permission denied", + [EFAULT] = "Bad address", + [ENOTBLK] = "Block device required", + [EBUSY] = "Device or resource busy", + [EEXIST] = "File exists", + [EXDEV] = "Invalid cross-device link", + [ENODEV] = "No such device", + [ENOTDIR] = "Not a directory", + [EISDIR] = "Is a directory", + [EINVAL] = "Invalid argument", + [ENFILE] = "Too many open files in system", + [EMFILE] = "Too many open files", + [ENOTTY] = "Inappropriate ioctl for device", + [ETXTBSY] = "Text file busy", + [EFBIG] = "File too large", + [ENOSPC] = "No space left on device", + [ESPIPE] = "Illegal seek", + [EROFS] = "Read-only file system", + [EMLINK] = "Too many links", + [EPIPE] = "Broken pipe", + [EDOM] = "Numerical argument out of domain", + [ERANGE] = "Numerical result out of range" +}; + +static const struct error_info error_list[] = { + { "UDS_OVERFLOW", "Index overflow" }, + { "UDS_INVALID_ARGUMENT", "Invalid argument passed to internal routine" }, + { "UDS_BAD_STATE", "UDS data structures are in an invalid state" }, + { "UDS_DUPLICATE_NAME", "Attempt to enter the same name into a delta index twice" }, + { "UDS_ASSERTION_FAILED", "Assertion 
failed" }, + { "UDS_QUEUED", "Request queued" }, + { "UDS_BUFFER_ERROR", "Buffer error" }, + { "UDS_NO_DIRECTORY", "Expected directory is missing" }, + { "UDS_ALREADY_REGISTERED", "Error range already registered" }, + { "UDS_OUT_OF_RANGE", "Cannot access data outside specified limits" }, + { "UDS_EMODULE_LOAD", "Could not load modules" }, + { "UDS_DISABLED", "UDS library context is disabled" }, + { "UDS_UNKNOWN_ERROR", "Unknown error" }, + { "UDS_UNSUPPORTED_VERSION", "Unsupported version" }, + { "UDS_CORRUPT_DATA", "Some index structure is corrupt" }, + { "UDS_NO_INDEX", "No index found" }, + { "UDS_INDEX_NOT_SAVED_CLEANLY", "Index not saved cleanly" }, +}; + +struct error_block { + const char *name; + int base; + int last; + int max; + const struct error_info *infos; +}; + +enum { + MAX_ERROR_BLOCKS = 6, +}; + +static struct { + int allocated; + int count; + struct error_block blocks[MAX_ERROR_BLOCKS]; +} registered_errors = { + .allocated = MAX_ERROR_BLOCKS, + .count = 1, + .blocks = { { + .name = "UDS Error", + .base = UDS_ERROR_CODE_BASE, + .last = UDS_ERROR_CODE_LAST, + .max = UDS_ERROR_CODE_BLOCK_END, + .infos = error_list, + } }, +}; + +/* Get the error info for an error number. Also returns the name of the error block, if known. */ +static const char *get_error_info(int errnum, const struct error_info **info_ptr) +{ + struct error_block *block; + + if (errnum == UDS_SUCCESS) { + *info_ptr = &successful; + return NULL; + } + + for (block = registered_errors.blocks; + block < registered_errors.blocks + registered_errors.count; + ++block) { + if ((errnum >= block->base) && (errnum < block->last)) { + *info_ptr = block->infos + (errnum - block->base); + return block->name; + } else if ((errnum >= block->last) && (errnum < block->max)) { + *info_ptr = NULL; + return block->name; + } + } + + return NULL; +} + +/* Return a string describing a system error message. */ +static const char *system_string_error(int errnum, char *buf, size_t buflen) +{ + size_t len; + const char *error_string = NULL; + + if ((errnum > 0) && (errnum < ARRAY_SIZE(message_table))) + error_string = message_table[errnum]; + + len = ((error_string == NULL) ? + snprintf(buf, buflen, "Unknown error %d", errnum) : + snprintf(buf, buflen, "%s", error_string)); + if (len < buflen) + return buf; + + buf[0] = '\0'; + return "System error"; +} + +/* Convert an error code to a descriptive string. */ +const char *uds_string_error(int errnum, char *buf, size_t buflen) +{ + char *buffer = buf; + char *buf_end = buf + buflen; + const struct error_info *info = NULL; + const char *block_name; + + if (buf == NULL) + return NULL; + + if (errnum < 0) + errnum = -errnum; + + block_name = get_error_info(errnum, &info); + if (block_name != NULL) { + if (info != NULL) + buffer = uds_append_to_buffer(buffer, + buf_end, + "%s: %s", + block_name, + info->message); + else + buffer = uds_append_to_buffer(buffer, + buf_end, + "Unknown %s %d", + block_name, + errnum); + } else if (info != NULL) { + buffer = uds_append_to_buffer(buffer, buf_end, "%s", info->message); + } else { + const char *tmp = system_string_error(errnum, buffer, buf_end - buffer); + + if (tmp != buffer) + buffer = uds_append_to_buffer(buffer, buf_end, "%s", tmp); + else + buffer += strlen(tmp); + } + return buf; +} + +/* Convert an error code to its name. 
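/*
 * Minimal usage sketch (not part of the patch): callers of uds_string_error()
 * provide a stack buffer and the error text is formatted into it, so nothing
 * is allocated on the logging path. The function name here is hypothetical.
 */
static void example_log_failure(int result)
{
        char error_message[UDS_MAX_ERROR_MESSAGE_SIZE];

        uds_log_error("index operation failed: %s",
                      uds_string_error(result, error_message,
                                       sizeof(error_message)));
}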
*/ +const char *uds_string_error_name(int errnum, char *buf, size_t buflen) +{ + char *buffer = buf; + char *buf_end = buf + buflen; + const struct error_info *info = NULL; + const char *block_name; + + if (errnum < 0) + errnum = -errnum; + + block_name = get_error_info(errnum, &info); + if (block_name != NULL) { + if (info != NULL) + buffer = uds_append_to_buffer(buffer, buf_end, "%s", info->name); + else + buffer = uds_append_to_buffer(buffer, + buf_end, + "%s %d", + block_name, + errnum); + } else if (info != NULL) { + buffer = uds_append_to_buffer(buffer, buf_end, "%s", info->name); + } else { + const char *tmp; + + tmp = system_string_error(errnum, buffer, buf_end - buffer); + if (tmp != buffer) + buffer = uds_append_to_buffer(buffer, buf_end, "%s", tmp); + else + buffer += strlen(tmp); + } + return buf; +} + +/* + * Translate an error code into a value acceptable to the kernel. The input error code may be a + * system-generated value (such as -EIO), or an internal UDS status code. The result will be a + * negative errno value. + */ +int uds_map_to_system_error(int error) +{ + char error_name[UDS_MAX_ERROR_NAME_SIZE]; + char error_message[UDS_MAX_ERROR_MESSAGE_SIZE]; + + /* 0 is success, and negative values are already system error codes. */ + if (likely(error <= 0)) + return error; + + if (error < 1024) + /* This is probably an errno from userspace. */ + return -error; + + /* Internal UDS errors */ + switch (error) { + case UDS_NO_INDEX: + case UDS_CORRUPT_DATA: + /* The index doesn't exist or can't be recovered. */ + return -ENOENT; + + case UDS_INDEX_NOT_SAVED_CLEANLY: + case UDS_UNSUPPORTED_VERSION: + /* + * The index exists, but can't be loaded. Tell the client it exists so they don't + * destroy it inadvertently. + */ + return -EEXIST; + + case UDS_DISABLED: + /* The session is unusable; only returned by requests. */ + return -EIO; + + default: + /* Translate an unexpected error into something generic. */ + uds_log_info("%s: mapping status code %d (%s: %s) to -EIO", + __func__, + error, + uds_string_error_name(error, error_name, sizeof(error_name)), + uds_string_error(error, error_message, sizeof(error_message))); + return -EIO; + } +} + +/* + * Register a block of error codes. + * + * @block_name: the name of the block of error codes + * @first_error: the first error code in the block + * @next_free_error: one past the highest possible error in the block + * @infos: a pointer to the error info array for the block + * @info_size: the size of the error info array + */ +int uds_register_error_block(const char *block_name, + int first_error, + int next_free_error, + const struct error_info *infos, + size_t info_size) +{ + int result; + struct error_block *block; + struct error_block new_block = { + .name = block_name, + .base = first_error, + .last = first_error + (info_size / sizeof(struct error_info)), + .max = next_free_error, + .infos = infos, + }; + + result = ASSERT(first_error < next_free_error, "well-defined error block range"); + if (result != UDS_SUCCESS) + return result; + + if (registered_errors.count == registered_errors.allocated) + /* This should never happen. */ + return UDS_OVERFLOW; + + for (block = registered_errors.blocks; + block < registered_errors.blocks + registered_errors.count; + ++block) { + if (strcmp(block_name, block->name) == 0) + return UDS_DUPLICATE_NAME; + + /* Ensure error ranges do not overlap. 
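/*
 * Minimal sketch (not part of the patch) of how a client registers its own
 * error block; the names and code range here are hypothetical. A real caller
 * must choose a range that does not collide with any block already
 * registered, which is what the overlap check in uds_register_error_block()
 * enforces. The VDO status codes later in this series are registered the
 * same way.
 */
enum {
        EXAMPLE_ERROR_BASE = 4096,  /* hypothetical, outside any registered block */
        EXAMPLE_BAD_WIDGET = EXAMPLE_ERROR_BASE,
        EXAMPLE_ERROR_BLOCK_END = EXAMPLE_ERROR_BASE + 100,
};

static const struct error_info example_errors[] = {
        { "EXAMPLE_BAD_WIDGET", "Widget is bad" },
};

static int example_register_errors(void)
{
        return uds_register_error_block("Example Error", EXAMPLE_ERROR_BASE,
                                        EXAMPLE_ERROR_BLOCK_END,
                                        example_errors, sizeof(example_errors));
}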
*/ + if ((first_error < block->max) && (next_free_error > block->base)) + return UDS_ALREADY_REGISTERED; + } + + registered_errors.blocks[registered_errors.count++] = new_block; + return UDS_SUCCESS; +} diff --git a/drivers/md/dm-vdo/errors.h b/drivers/md/dm-vdo/errors.h new file mode 100644 index 000000000000..43e0c33bec65 --- /dev/null +++ b/drivers/md/dm-vdo/errors.h @@ -0,0 +1,83 @@ +/* SPDX-License-Identifier: GPL-2.0-only */ +/* + * Copyright 2023 Red Hat + */ + +#ifndef UDS_ERRORS_H +#define UDS_ERRORS_H + +#include +#include + +/* Custom error codes and error-related utilities for UDS */ + +/* Valid status codes for internal UDS functions. */ +enum uds_status_codes { + /* Successful return */ + UDS_SUCCESS = 0, + + /* Used as a base value for reporting internal errors */ + UDS_ERROR_CODE_BASE = 1024, + /* Index overflow */ + UDS_OVERFLOW = UDS_ERROR_CODE_BASE + 0, + /* Invalid argument passed to internal routine */ + UDS_INVALID_ARGUMENT = UDS_ERROR_CODE_BASE + 1, + /* UDS data structures are in an invalid state */ + UDS_BAD_STATE = UDS_ERROR_CODE_BASE + 2, + /* Attempt to enter the same name into an internal structure twice */ + UDS_DUPLICATE_NAME = UDS_ERROR_CODE_BASE + 3, + /* An assertion failed */ + UDS_ASSERTION_FAILED = UDS_ERROR_CODE_BASE + 4, + /* A request has been queued for later processing (not an error) */ + UDS_QUEUED = UDS_ERROR_CODE_BASE + 5, + /* A problem has occurred with a buffer */ + UDS_BUFFER_ERROR = UDS_ERROR_CODE_BASE + 6, + /* No directory was found where one was expected */ + UDS_NO_DIRECTORY = UDS_ERROR_CODE_BASE + 7, + /* This error range has already been registered */ + UDS_ALREADY_REGISTERED = UDS_ERROR_CODE_BASE + 8, + /* Attempt to read or write data outside the valid range */ + UDS_OUT_OF_RANGE = UDS_ERROR_CODE_BASE + 9, + /* Could not load modules */ + UDS_EMODULE_LOAD = UDS_ERROR_CODE_BASE + 10, + /* The index session is disabled */ + UDS_DISABLED = UDS_ERROR_CODE_BASE + 11, + /* Unknown error */ + UDS_UNKNOWN_ERROR = UDS_ERROR_CODE_BASE + 12, + /* The index configuration or volume format is no longer supported */ + UDS_UNSUPPORTED_VERSION = UDS_ERROR_CODE_BASE + 13, + /* Some index structure is corrupt */ + UDS_CORRUPT_DATA = UDS_ERROR_CODE_BASE + 14, + /* No index state found */ + UDS_NO_INDEX = UDS_ERROR_CODE_BASE + 15, + /* Attempt to access incomplete index save data */ + UDS_INDEX_NOT_SAVED_CLEANLY = UDS_ERROR_CODE_BASE + 16, + /* One more than the last UDS_INTERNAL error code */ + UDS_ERROR_CODE_LAST, + /* One more than the last error this block will ever use */ + UDS_ERROR_CODE_BLOCK_END = UDS_ERROR_CODE_BASE + 440, +}; + +enum { + UDS_MAX_ERROR_NAME_SIZE = 80, + UDS_MAX_ERROR_MESSAGE_SIZE = 128, +}; + +struct error_info { + const char *name; + const char *message; +}; + +const char * __must_check uds_string_error(int errnum, char *buf, size_t buflen); + +const char *uds_string_error_name(int errnum, char *buf, size_t buflen); + +int uds_map_to_system_error(int error); + +int uds_register_error_block(const char *block_name, + int first_error, + int last_reserved_error, + const struct error_info *infos, + size_t info_size); + +#endif /* UDS_ERRORS_H */ diff --git a/drivers/md/dm-vdo/logger.c b/drivers/md/dm-vdo/logger.c new file mode 100644 index 000000000000..00acac8affbd --- /dev/null +++ b/drivers/md/dm-vdo/logger.c @@ -0,0 +1,304 @@ +// SPDX-License-Identifier: GPL-2.0-only +/* + * Copyright 2023 Red Hat + */ + +#include "logger.h" + +#include +#include +#include +#include +#include + +#include "thread-device.h" +#include 
"uds-threads.h" + +struct priority_name { + const char *name; + const int priority; +}; + +static const struct priority_name PRIORITIES[] = { + { "ALERT", UDS_LOG_ALERT }, + { "CRITICAL", UDS_LOG_CRIT }, + { "CRIT", UDS_LOG_CRIT }, + { "DEBUG", UDS_LOG_DEBUG }, + { "EMERGENCY", UDS_LOG_EMERG }, + { "EMERG", UDS_LOG_EMERG }, + { "ERROR", UDS_LOG_ERR }, + { "ERR", UDS_LOG_ERR }, + { "INFO", UDS_LOG_INFO }, + { "NOTICE", UDS_LOG_NOTICE }, + { "PANIC", UDS_LOG_EMERG }, + { "WARN", UDS_LOG_WARNING }, + { "WARNING", UDS_LOG_WARNING }, + { NULL, -1 }, +}; + +static const char *const PRIORITY_STRINGS[] = { + "EMERGENCY", + "ALERT", + "CRITICAL", + "ERROR", + "WARN", + "NOTICE", + "INFO", + "DEBUG", +}; + +static int log_level = UDS_LOG_INFO; + +int uds_get_log_level(void) +{ + return log_level; +} + +void uds_set_log_level(int new_log_level) +{ + log_level = new_log_level; +} + +int uds_log_string_to_priority(const char *string) +{ + int i; + + for (i = 0; PRIORITIES[i].name != NULL; i++) + if (strcasecmp(string, PRIORITIES[i].name) == 0) + return PRIORITIES[i].priority; + return UDS_LOG_INFO; +} + +const char *uds_log_priority_to_string(int priority) +{ + if ((priority < 0) || (priority >= (int) ARRAY_SIZE(PRIORITY_STRINGS))) + return "unknown"; + return PRIORITY_STRINGS[priority]; +} + +static const char *get_current_interrupt_type(void) +{ + if (in_nmi()) + return "NMI"; + if (in_irq()) + return "HI"; + if (in_softirq()) + return "SI"; + return "INTR"; +} + +/** + * emit_log_message_to_kernel() - Emit a log message to the kernel at the specified priority. + * + * @priority: The priority at which to log the message + * @fmt: The format string of the message + */ +static void emit_log_message_to_kernel(int priority, const char *fmt, ...) +{ + va_list args; + struct va_format vaf; + + if (priority > uds_get_log_level()) + return; + + va_start(args, fmt); + vaf.fmt = fmt; + vaf.va = &args; + + switch (priority) { + case UDS_LOG_EMERG: + case UDS_LOG_ALERT: + case UDS_LOG_CRIT: + printk(KERN_CRIT "%pV", &vaf); + break; + case UDS_LOG_ERR: + printk(KERN_ERR "%pV", &vaf); + break; + case UDS_LOG_WARNING: + printk(KERN_WARNING "%pV", &vaf); + break; + case UDS_LOG_NOTICE: + printk(KERN_NOTICE "%pV", &vaf); + break; + case UDS_LOG_INFO: + printk(KERN_INFO "%pV", &vaf); + break; + case UDS_LOG_DEBUG: + printk(KERN_DEBUG "%pV", &vaf); + break; + default: + printk(KERN_DEFAULT "%pV", &vaf); + break; + } + + va_end(args); +} + +/** + * emit_log_message() - Emit a log message to the kernel log in a format suited to the current + * thread context. + * + * Context info formats: + * + * interrupt: uds[NMI]: blah + * kvdo thread: kvdo12:foobarQ: blah + * thread w/device id: kvdo12:myprog: blah + * other thread: uds: myprog: blah + * + * Fields: module name, interrupt level, process name, device ID. + * + * @priority: the priority at which to log the message + * @module: The name of the module doing the logging + * @prefix: The prefix of the log message + * @vaf1: The first message format descriptor + * @vaf2: The second message format descriptor + */ +static void emit_log_message(int priority, + const char *module, + const char *prefix, + const struct va_format *vaf1, + const struct va_format *vaf2) +{ + int device_instance; + + /* + * In interrupt context, identify the interrupt type and module. Ignore the process/thread + * since it could be anything. 
+ */ + if (in_interrupt()) { + const char *type = get_current_interrupt_type(); + + emit_log_message_to_kernel(priority, + "%s[%s]: %s%pV%pV\n", + module, type, prefix, vaf1, vaf2); + return; + } + + /* Not at interrupt level; we have a process we can look at, and might have a device ID. */ + device_instance = uds_get_thread_device_id(); + if (device_instance >= 0) { + emit_log_message_to_kernel(priority, + "%s%u:%s: %s%pV%pV\n", + module, device_instance, + current->comm, prefix, vaf1, vaf2); + return; + } + + /* + * If it's a kernel thread and the module name is a prefix of its name, assume it is ours + * and only identify the thread. + */ + if (((current->flags & PF_KTHREAD) != 0) && + (strncmp(module, current->comm, strlen(module)) == 0)) { + emit_log_message_to_kernel(priority, + "%s: %s%pV%pV\n", + current->comm, prefix, vaf1, vaf2); + return; + } + + /* Identify the module and the process. */ + emit_log_message_to_kernel(priority, + "%s: %s: %s%pV%pV\n", + module, current->comm, prefix, vaf1, vaf2); +} + +/* + * uds_log_embedded_message() - Log a message embedded within another message. + * @priority: the priority at which to log the message + * @module: the name of the module doing the logging + * @prefix: optional string prefix to message, may be NULL + * @fmt1: format of message first part (required) + * @args1: arguments for message first part (required) + * @fmt2: format of message second part + */ +void uds_log_embedded_message(int priority, + const char *module, + const char *prefix, + const char *fmt1, + va_list args1, + const char *fmt2, + ...) +{ + va_list args1_copy; + va_list args2; + struct va_format vaf1, vaf2; + + va_start(args2, fmt2); + + if (module == NULL) + module = UDS_LOGGING_MODULE_NAME; + if (prefix == NULL) + prefix = ""; + + /* + * It is implementation dependent whether va_list is defined as an array type that decays + * to a pointer when passed as an argument. Copy args1 and args2 with va_copy so that vaf1 + * and vaf2 get proper va_list pointers irrespective of how va_list is defined. + */ + va_copy(args1_copy, args1); + vaf1.fmt = fmt1; + vaf1.va = &args1_copy; + + vaf2.fmt = fmt2; + vaf2.va = &args2; + + emit_log_message(priority, module, prefix, &vaf1, &vaf2); + + va_end(args1_copy); + va_end(args2); +} + +int uds_vlog_strerror(int priority, + int errnum, + const char *module, + const char *format, + va_list args) +{ + char errbuf[UDS_MAX_ERROR_MESSAGE_SIZE]; + const char *message = uds_string_error(errnum, errbuf, sizeof(errbuf)); + + uds_log_embedded_message(priority, + module, + NULL, + format, + args, + ": %s (%d)", + message, + errnum); + return errnum; +} + +int __uds_log_strerror(int priority, int errnum, const char *module, const char *format, ...) +{ + va_list args; + + va_start(args, format); + uds_vlog_strerror(priority, errnum, module, format, args); + va_end(args); + return errnum; +} + +void uds_log_backtrace(int priority) +{ + if (priority > uds_get_log_level()) + return; + dump_stack(); +} + +void __uds_log_message(int priority, const char *module, const char *format, ...) +{ + va_list args; + + va_start(args, format); + uds_log_embedded_message(priority, module, NULL, format, args, "%s", ""); + va_end(args); +} + +/* + * Sleep or delay a few milliseconds in an attempt to allow the log buffers to be flushed lest they + * be overrun. 
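/*
 * Minimal usage sketch (not part of the patch) of the logging helpers
 * declared in logger.h; the function and variable names here are
 * hypothetical.
 */
static void example_report_read_failure(int result, u64 block_number)
{
        /* Prefix the message with the name and text of the error code. */
        uds_log_error_strerror(result, "could not read block %llu",
                               (unsigned long long) block_number);

        /* Rate-limit a message that may fire from a hot path. */
        uds_log_ratelimit(uds_log_warning, "read failures are being throttled");
}

static void example_apply_log_level(const char *level_name)
{
        /* Unrecognized level names fall back to UDS_LOG_INFO. */
        uds_set_log_level(uds_log_string_to_priority(level_name));
        uds_log_info("log level is now %s",
                     uds_log_priority_to_string(uds_get_log_level()));
}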
+ */ +void uds_pause_for_logger(void) +{ + fsleep(4000); +} diff --git a/drivers/md/dm-vdo/logger.h b/drivers/md/dm-vdo/logger.h new file mode 100644 index 000000000000..7a0ee3ca85ec --- /dev/null +++ b/drivers/md/dm-vdo/logger.h @@ -0,0 +1,112 @@ +/* SPDX-License-Identifier: GPL-2.0-only */ +/* + * Copyright 2023 Red Hat + */ + +#ifndef UDS_LOGGER_H +#define UDS_LOGGER_H + +#include +#include + +/* Custom logging utilities for UDS */ + +#define UDS_LOG_EMERG 0 +#define UDS_LOG_ALERT 1 +#define UDS_LOG_CRIT 2 +#define UDS_LOG_ERR 3 +#define UDS_LOG_WARNING 4 +#define UDS_LOG_NOTICE 5 +#define UDS_LOG_INFO 6 +#define UDS_LOG_DEBUG 7 + +#if defined(MODULE) +#define UDS_LOGGING_MODULE_NAME THIS_MODULE->name +#else /* compiled into the kernel */ +#define UDS_LOGGING_MODULE_NAME "vdo" +#endif + +/* Apply a rate limiter to a log method call. */ +#define uds_log_ratelimit(log_fn, ...) \ + do { \ + static DEFINE_RATELIMIT_STATE(_rs, \ + DEFAULT_RATELIMIT_INTERVAL, \ + DEFAULT_RATELIMIT_BURST); \ + if (__ratelimit(&_rs)) { \ + log_fn(__VA_ARGS__); \ + } \ + } while (0) + +int uds_get_log_level(void); + +void uds_set_log_level(int new_log_level); + +int uds_log_string_to_priority(const char *string); + +const char *uds_log_priority_to_string(int priority); + +void uds_log_embedded_message(int priority, + const char *module, + const char *prefix, + const char *fmt1, + va_list args1, + const char *fmt2, + ...) + __printf(4, 0) __printf(6, 7); + +void uds_log_backtrace(int priority); + +/* All log functions will preserve the caller's value of errno. */ + +#define uds_log_strerror(priority, errnum, ...) \ + __uds_log_strerror(priority, errnum, UDS_LOGGING_MODULE_NAME, __VA_ARGS__) + +int __uds_log_strerror(int priority, int errnum, const char *module, const char *format, ...) + __printf(4, 5); + +int uds_vlog_strerror(int priority, + int errnum, + const char *module, + const char *format, + va_list args) + __printf(4, 0); + +/* Log an error prefixed with the string associated with the errnum. */ +#define uds_log_error_strerror(errnum, ...) \ + uds_log_strerror(UDS_LOG_ERR, errnum, __VA_ARGS__) + +#define uds_log_debug_strerror(errnum, ...) \ + uds_log_strerror(UDS_LOG_DEBUG, errnum, __VA_ARGS__) + +#define uds_log_info_strerror(errnum, ...) \ + uds_log_strerror(UDS_LOG_INFO, errnum, __VA_ARGS__) + +#define uds_log_notice_strerror(errnum, ...) \ + uds_log_strerror(UDS_LOG_NOTICE, errnum, __VA_ARGS__) + +#define uds_log_warning_strerror(errnum, ...) \ + uds_log_strerror(UDS_LOG_WARNING, errnum, __VA_ARGS__) + +#define uds_log_fatal_strerror(errnum, ...) \ + uds_log_strerror(UDS_LOG_CRIT, errnum, __VA_ARGS__) + +#define uds_log_message(priority, ...) \ + __uds_log_message(priority, UDS_LOGGING_MODULE_NAME, __VA_ARGS__) + +void __uds_log_message(int priority, const char *module, const char *format, ...) + __printf(3, 4); + +#define uds_log_debug(...) uds_log_message(UDS_LOG_DEBUG, __VA_ARGS__) + +#define uds_log_info(...) uds_log_message(UDS_LOG_INFO, __VA_ARGS__) + +#define uds_log_notice(...) uds_log_message(UDS_LOG_NOTICE, __VA_ARGS__) + +#define uds_log_warning(...) uds_log_message(UDS_LOG_WARNING, __VA_ARGS__) + +#define uds_log_error(...) uds_log_message(UDS_LOG_ERR, __VA_ARGS__) + +#define uds_log_fatal(...) 
uds_log_message(UDS_LOG_CRIT, __VA_ARGS__) + +void uds_pause_for_logger(void); +#endif /* UDS_LOGGER_H */ diff --git a/drivers/md/dm-vdo/permassert.c b/drivers/md/dm-vdo/permassert.c new file mode 100644 index 000000000000..94ab785c8a0a --- /dev/null +++ b/drivers/md/dm-vdo/permassert.c @@ -0,0 +1,35 @@ +// SPDX-License-Identifier: GPL-2.0-only +/* + * Copyright 2023 Red Hat + */ + +#include "permassert.h" + +#include "errors.h" +#include "logger.h" + +int uds_assertion_failed(const char *expression_string, + const char *file_name, + int line_number, + const char *format, + ...) +{ + va_list args; + + va_start(args, format); + + uds_log_embedded_message(UDS_LOG_ERR, + UDS_LOGGING_MODULE_NAME, + "assertion \"", + format, + args, + "\" (%s) failed at %s:%d", + expression_string, + file_name, + line_number); + uds_log_backtrace(UDS_LOG_ERR); + + va_end(args); + + return UDS_ASSERTION_FAILED; +} diff --git a/drivers/md/dm-vdo/permassert.h b/drivers/md/dm-vdo/permassert.h new file mode 100644 index 000000000000..bc0db0b187b6 --- /dev/null +++ b/drivers/md/dm-vdo/permassert.h @@ -0,0 +1,65 @@ +/* SPDX-License-Identifier: GPL-2.0-only */ +/* + * Copyright 2023 Red Hat + */ + +#ifndef PERMASSERT_H +#define PERMASSERT_H + +#include + +#include "errors.h" + +/* Utilities for asserting that certain conditions are met */ + +#define STRINGIFY(X) #X +#define STRINGIFY_VALUE(X) STRINGIFY(X) + +/* + * A hack to apply the "warn if unused" attribute to an integral expression. + * + * Since GCC doesn't propagate the warn_unused_result attribute to conditional expressions + * incorporating calls to functions with that attribute, this function can be used to wrap such an + * expression. With optimization enabled, this function contributes no additional instructions, but + * the warn_unused_result attribute still applies to the code calling it. + */ +static inline int __must_check uds_must_use(int value) +{ + return value; +} + +/* Assert that an expression is true and return an error if it is not. */ +#define ASSERT(expr, ...) uds_must_use(__UDS_ASSERT(expr, __VA_ARGS__)) + +/* Log a message if the expression is not true. */ +#define ASSERT_LOG_ONLY(expr, ...) __UDS_ASSERT(expr, __VA_ARGS__) + +#define __UDS_ASSERT(expr, ...) \ + (likely(expr) ? UDS_SUCCESS \ + : uds_assertion_failed(STRINGIFY(expr), __FILE__, __LINE__, __VA_ARGS__)) + +/* Log an assertion failure message. */ +int uds_assertion_failed(const char *expression_string, + const char *file_name, + int line_number, + const char *format, + ...) + __printf(4, 5); + +#define STATIC_ASSERT(expr) \ + do { \ + switch (0) { \ + case 0: \ + ; \ + fallthrough; \ + case expr: \ + ; \ + fallthrough; \ + default: \ + break; \ + } \ + } while (0) + +#define STATIC_ASSERT_SIZEOF(type, expected_size) STATIC_ASSERT(sizeof(type) == (expected_size)) + +#endif /* PERMASSERT_H */ diff --git a/drivers/md/dm-vdo/string-utils.c b/drivers/md/dm-vdo/string-utils.c new file mode 100644 index 000000000000..a584b37bb70c --- /dev/null +++ b/drivers/md/dm-vdo/string-utils.c @@ -0,0 +1,28 @@ +// SPDX-License-Identifier: GPL-2.0-only +/* + * Copyright 2023 Red Hat + */ + +#include "string-utils.h" + +#include "errors.h" +#include "logger.h" +#include "memory-alloc.h" +#include "permassert.h" +#include "uds.h" + +char *uds_append_to_buffer(char *buffer, char *buf_end, const char *fmt, ...) 
+{ + va_list args; + size_t n; + + va_start(args, fmt); + n = vsnprintf(buffer, buf_end - buffer, fmt, args); + if (n >= (size_t) (buf_end - buffer)) + buffer = buf_end; + else + buffer += n; + va_end(args); + + return buffer; +} diff --git a/drivers/md/dm-vdo/string-utils.h b/drivers/md/dm-vdo/string-utils.h new file mode 100644 index 000000000000..8275af582cf7 --- /dev/null +++ b/drivers/md/dm-vdo/string-utils.h @@ -0,0 +1,23 @@ +/* SPDX-License-Identifier: GPL-2.0-only */ +/* + * Copyright 2023 Red Hat + */ + +#ifndef UDS_STRING_UTILS_H +#define UDS_STRING_UTILS_H + +#include +#include + +/* Utilities related to string manipulation */ + +static inline const char *uds_bool_to_string(bool value) +{ + return value ? "true" : "false"; +} + +/* Append a formatted string to the end of a buffer. */ +char *uds_append_to_buffer(char *buffer, char *buf_end, const char *fmt, ...) + __printf(3, 4); + +#endif /* UDS_STRING_UTILS_H */ diff --git a/drivers/md/dm-vdo/time-utils.h b/drivers/md/dm-vdo/time-utils.h new file mode 100644 index 000000000000..5f1e850fd826 --- /dev/null +++ b/drivers/md/dm-vdo/time-utils.h @@ -0,0 +1,28 @@ +/* SPDX-License-Identifier: GPL-2.0-only */ +/* + * Copyright 2023 Red Hat + */ + +#ifndef UDS_TIME_UTILS_H +#define UDS_TIME_UTILS_H + +#include +#include +#include + +static inline s64 ktime_to_seconds(ktime_t reltime) +{ + return reltime / NSEC_PER_SEC; +} + +static inline ktime_t current_time_ns(clockid_t clock) +{ + return clock == CLOCK_MONOTONIC ? ktime_get_ns() : ktime_get_real_ns(); +} + +static inline ktime_t current_time_us(void) +{ + return current_time_ns(CLOCK_REALTIME) / NSEC_PER_USEC; +} + +#endif /* UDS_TIME_UTILS_H */ From patchwork Thu Oct 26 21:41:02 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Matthew Sakai X-Patchwork-Id: 13437956 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 0E7E5C27C58 for ; Thu, 26 Oct 2023 21:44:36 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1344904AbjJZVof (ORCPT ); Thu, 26 Oct 2023 17:44:35 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:41540 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S234983AbjJZVoY (ORCPT ); Thu, 26 Oct 2023 17:44:24 -0400 Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.133.124]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id A5C70D5F for ; Thu, 26 Oct 2023 14:41:52 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1698356511; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=Ywt0ygdGQrOnx0Eta6sJLehIkvu2r4ry7HSq0vOBPVw=; b=OqbNK2X/DEj5WwPl0imqX7G/HkQhkAMDR1yDZnyyG2MQx9hJlo7BWy8XB6akrm9d1SyGcf CgD9Cf/MCnhy5JpA9EBIFNLJZkZTptCu4vNMns+1xbGI+plCosQ/9hirZBVUNbSTvBSs54 L8SR25iHo0tE7IpE0vi4jGlga8I50NU= Received: from mimecast-mx02.redhat.com (mimecast-mx02.redhat.com [66.187.233.88]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.3, cipher=TLS_AES_256_GCM_SHA384) id us-mta-368-yJWjIMGiMLmMVXLbbYBh-g-1; Thu, 26 Oct 2023 17:41:38 -0400 X-MC-Unique: yJWjIMGiMLmMVXLbbYBh-g-1 Received: from 
smtp.corp.redhat.com (int-mx04.intmail.prod.int.rdu2.redhat.com [10.11.54.4]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (2048 bits) server-digest SHA256) (No client certificate requested) by mimecast-mx02.redhat.com (Postfix) with ESMTPS id 1436F811E7E; Thu, 26 Oct 2023 21:41:38 +0000 (UTC) Received: from pbitcolo-build-10.permabit.com (pbitcolo-build-10.permabit.lab.eng.bos.redhat.com [10.19.117.76]) by smtp.corp.redhat.com (Postfix) with ESMTP id EB8E72026D4C; Thu, 26 Oct 2023 21:41:37 +0000 (UTC) Received: by pbitcolo-build-10.permabit.com (Postfix, from userid 1138) id 9FDD13003B; Thu, 26 Oct 2023 17:41:37 -0400 (EDT) From: Matthew Sakai To: dm-devel@lists.linux.dev, linux-block@vger.kernel.org Cc: Matthew Sakai Subject: [PATCH v4 05/39] dm vdo: add vdo type declarations, constants, and simple data structures Date: Thu, 26 Oct 2023 17:41:02 -0400 Message-Id: <20231026214136.1067410-6-msakai@redhat.com> In-Reply-To: <20231026214136.1067410-1-msakai@redhat.com> References: <20231026214136.1067410-1-msakai@redhat.com> MIME-Version: 1.0 X-Scanned-By: MIMEDefang 3.4.1 on 10.11.54.4 Precedence: bulk List-ID: X-Mailing-List: linux-block@vger.kernel.org Add definitions of constants defining the fixed parameters of a VDO volume, and the default and maximum values of configurable or dynamic parameters. Add definitions of internal status codes used for internal communication within the module and for logging. Add definitions of types and structs used to manage the processing of an I/O operation. Co-developed-by: J. corwin Coburn Signed-off-by: J. corwin Coburn Co-developed-by: Michael Sclafani Signed-off-by: Michael Sclafani Signed-off-by: Matthew Sakai --- drivers/md/dm-vdo/constants.c | 15 ++ drivers/md/dm-vdo/constants.h | 102 ++++++++ drivers/md/dm-vdo/status-codes.c | 126 ++++++++++ drivers/md/dm-vdo/status-codes.h | 112 +++++++++ drivers/md/dm-vdo/types.h | 393 +++++++++++++++++++++++++++++++ drivers/md/dm-vdo/wait-queue.c | 223 ++++++++++++++++++ drivers/md/dm-vdo/wait-queue.h | 129 ++++++++++ 7 files changed, 1100 insertions(+) create mode 100644 drivers/md/dm-vdo/constants.c create mode 100644 drivers/md/dm-vdo/constants.h create mode 100644 drivers/md/dm-vdo/status-codes.c create mode 100644 drivers/md/dm-vdo/status-codes.h create mode 100644 drivers/md/dm-vdo/types.h create mode 100644 drivers/md/dm-vdo/wait-queue.c create mode 100644 drivers/md/dm-vdo/wait-queue.h diff --git a/drivers/md/dm-vdo/constants.c b/drivers/md/dm-vdo/constants.c new file mode 100644 index 000000000000..14ac047101f8 --- /dev/null +++ b/drivers/md/dm-vdo/constants.c @@ -0,0 +1,15 @@ +// SPDX-License-Identifier: GPL-2.0-only +/* + * Copyright 2023 Red Hat + */ + +#include "types.h" + +/* The maximum logical space is 4 petabytes, which is 1 terablock. */ +const block_count_t MAXIMUM_VDO_LOGICAL_BLOCKS = 1024ULL * 1024 * 1024 * 1024; + +/* The maximum physical space is 256 terabytes, which is 64 gigablocks. 
*/ +const block_count_t MAXIMUM_VDO_PHYSICAL_BLOCKS = 1024ULL * 1024 * 1024 * 64; + +/* unit test minimum */ +const block_count_t MINIMUM_VDO_SLAB_JOURNAL_BLOCKS = 2; diff --git a/drivers/md/dm-vdo/constants.h b/drivers/md/dm-vdo/constants.h new file mode 100644 index 000000000000..9ae3917d00de --- /dev/null +++ b/drivers/md/dm-vdo/constants.h @@ -0,0 +1,102 @@ +/* SPDX-License-Identifier: GPL-2.0-only */ +/* + * Copyright 2023 Red Hat + */ + +#ifndef VDO_CONSTANTS_H +#define VDO_CONSTANTS_H + +#include + +#include "types.h" + +enum { + /* + * The maximum number of contiguous PBNs which will go to a single bio submission queue, + * assuming there is more than one queue. + */ + VDO_BIO_ROTATION_INTERVAL_LIMIT = 1024, + + /** The number of entries on a block map page */ + VDO_BLOCK_MAP_ENTRIES_PER_PAGE = 812, + + /** The origin of the flat portion of the block map */ + VDO_BLOCK_MAP_FLAT_PAGE_ORIGIN = 1, + + /* + * The height of a block map tree. Assuming a root count of 60 and 812 entries per page, + * this is big enough to represent almost 95 PB of logical space. + */ + VDO_BLOCK_MAP_TREE_HEIGHT = 5, + + /** The default number of bio submission queues. */ + DEFAULT_VDO_BIO_SUBMIT_QUEUE_COUNT = 4, + + /** The number of contiguous PBNs to be submitted to a single bio queue. */ + DEFAULT_VDO_BIO_SUBMIT_QUEUE_ROTATE_INTERVAL = 64, + + /** The number of trees in the arboreal block map */ + DEFAULT_VDO_BLOCK_MAP_TREE_ROOT_COUNT = 60, + + /** The default size of the recovery journal, in blocks */ + DEFAULT_VDO_RECOVERY_JOURNAL_SIZE = 32 * 1024, + + /** The default size of each slab journal, in blocks */ + DEFAULT_VDO_SLAB_JOURNAL_SIZE = 224, + + /* + * The initial size of lbn_operations and pbn_operations, which is based upon the expected + * maximum number of outstanding VIOs. This value was chosen to make it highly unlikely + * that the maps would need to be resized. + */ + VDO_LOCK_MAP_CAPACITY = 10000, + + /** The maximum number of logical zones */ + MAX_VDO_LOGICAL_ZONES = 60, + + /** The maximum number of physical zones */ + MAX_VDO_PHYSICAL_ZONES = 16, + + /** The base-2 logarithm of the maximum blocks in one slab */ + MAX_VDO_SLAB_BITS = 23, + + /** The maximum number of slabs the slab depot supports */ + MAX_VDO_SLABS = 8192, + + /* + * The maximum number of block map pages to load simultaneously during recovery or rebuild. + */ + MAXIMUM_SIMULTANEOUS_VDO_BLOCK_MAP_RESTORATION_READS = 1024, + + /** The maximum number of entries in the slab summary */ + MAXIMUM_VDO_SLAB_SUMMARY_ENTRIES = MAX_VDO_SLABS * MAX_VDO_PHYSICAL_ZONES, + + /** The maximum number of total threads in a VDO thread configuration. */ + MAXIMUM_VDO_THREADS = 100, + + /** The maximum number of VIOs in the system at once */ + MAXIMUM_VDO_USER_VIOS = 2048, + + /** The only physical block size supported by VDO */ + VDO_BLOCK_SIZE = 4096, + + /** The number of sectors per block */ + VDO_SECTORS_PER_BLOCK = (VDO_BLOCK_SIZE >> SECTOR_SHIFT), + + /** The size of a sector that will not be torn */ + VDO_SECTOR_SIZE = 512, + + /** The physical block number reserved for storing the zero block */ + VDO_ZERO_BLOCK = 0, +}; + +/** The maximum logical space is 4 petabytes, which is 1 terablock. */ +extern const block_count_t MAXIMUM_VDO_LOGICAL_BLOCKS; + +/** The maximum physical space is 256 terabytes, which is 64 gigablocks. 
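 * (Not part of the patch: the arithmetic behind the two limits, given the
 * 4 KB VDO_BLOCK_SIZE above:
 *   4 PB logical   = 2^52 bytes / 2^12 = 2^40 blocks = 1024ULL * 1024 * 1024 * 1024
 *   256 TB physical = 2^48 bytes / 2^12 = 2^36 blocks = 1024ULL * 1024 * 1024 * 64)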
*/ +extern const block_count_t MAXIMUM_VDO_PHYSICAL_BLOCKS; + +/** unit test minimum */ +extern const block_count_t MINIMUM_VDO_SLAB_JOURNAL_BLOCKS; + +#endif /* VDO_CONSTANTS_H */ diff --git a/drivers/md/dm-vdo/status-codes.c b/drivers/md/dm-vdo/status-codes.c new file mode 100644 index 000000000000..2e886597e995 --- /dev/null +++ b/drivers/md/dm-vdo/status-codes.c @@ -0,0 +1,126 @@ +// SPDX-License-Identifier: GPL-2.0-only +/* + * Copyright 2023 Red Hat + */ + +#include "status-codes.h" + +#include "errors.h" +#include "logger.h" +#include "permassert.h" +#include "uds-threads.h" + +const struct error_info vdo_status_list[] = { + { "VDO_NOT_IMPLEMENTED", "Not implemented" }, + { "VDO_OUT_OF_RANGE", "Out of range" }, + { "VDO_REF_COUNT_INVALID", "Reference count would become invalid" }, + { "VDO_NO_SPACE", "Out of space" }, + { "VDO_UNEXPECTED_EOF", "Unexpected EOF on block read" }, + { "VDO_BAD_CONFIGURATION", "Bad configuration option" }, + { "VDO_SOCKET_ERROR", "Socket error" }, + { "VDO_BAD_ALIGNMENT", "Mis-aligned block reference" }, + { "VDO_COMPONENT_BUSY", "Prior operation still in progress" }, + { "VDO_BAD_PAGE", "Corrupt or incorrect page" }, + { "VDO_UNSUPPORTED_VERSION", "Unsupported component version" }, + { "VDO_INCORRECT_COMPONENT", "Component id mismatch in decoder" }, + { "VDO_PARAMETER_MISMATCH", "Parameters have conflicting values" }, + { "VDO_BLOCK_SIZE_TOO_SMALL", "The block size is too small" }, + { "VDO_UNKNOWN_PARTITION", "No partition exists with a given id" }, + { "VDO_PARTITION_EXISTS", "A partition already exists with a given id" }, + { "VDO_NOT_READ_ONLY", "The device is not in read-only mode" }, + { "VDO_INCREMENT_TOO_SMALL", "Physical block growth of too few blocks" }, + { "VDO_CHECKSUM_MISMATCH", "Incorrect checksum" }, + { "VDO_RECOVERY_JOURNAL_FULL", "The recovery journal is full" }, + { "VDO_LOCK_ERROR", "A lock is held incorrectly" }, + { "VDO_READ_ONLY", "The device is in read-only mode" }, + { "VDO_SHUTTING_DOWN", "The device is shutting down" }, + { "VDO_CORRUPT_JOURNAL", "Recovery journal entries corrupted" }, + { "VDO_TOO_MANY_SLABS", "Exceeds maximum number of slabs supported" }, + { "VDO_INVALID_FRAGMENT", "Compressed block fragment is invalid" }, + { "VDO_RETRY_AFTER_REBUILD", "Retry operation after rebuilding finishes" }, + { "VDO_UNKNOWN_COMMAND", "The extended command is not known" }, + { "VDO_COMMAND_ERROR", "Bad extended command parameters" }, + { "VDO_CANNOT_DETERMINE_SIZE", "Cannot determine config sizes to fit" }, + { "VDO_BAD_MAPPING", "Invalid page mapping" }, + { "VDO_READ_CACHE_BUSY", "Read cache has no free slots" }, + { "VDO_BIO_CREATION_FAILED", "Bio creation failed" }, + { "VDO_BAD_MAGIC", "Bad magic number" }, + { "VDO_BAD_NONCE", "Bad nonce" }, + { "VDO_JOURNAL_OVERFLOW", "Journal sequence number overflow" }, + { "VDO_INVALID_ADMIN_STATE", "Invalid operation for current state" }, + { "VDO_CANT_ADD_SYSFS_NODE", "Failed to add sysfs node" }, +}; + +static atomic_t vdo_status_codes_registered = ATOMIC_INIT(0); +static int status_code_registration_result; + +static void do_status_code_registration(void) +{ + int result; + + STATIC_ASSERT((VDO_STATUS_CODE_LAST - VDO_STATUS_CODE_BASE) == + ARRAY_SIZE(vdo_status_list)); + + result = uds_register_error_block("VDO Status", + VDO_STATUS_CODE_BASE, + VDO_STATUS_CODE_BLOCK_END, + vdo_status_list, + sizeof(vdo_status_list)); + /* + * The following test handles cases where libvdo is statically linked against both the test + * modules and the test driver (because multiple instances of this 
module call their own + * copy of this function once each, resulting in multiple calls to register_error_block + * which is shared in libuds). + */ + if (result == UDS_DUPLICATE_NAME) + result = UDS_SUCCESS; + + status_code_registration_result = (result == UDS_SUCCESS) ? VDO_SUCCESS : result; +} + +/** + * vdo_register_status_codes() - Register the VDO status codes if needed. + * Return: A success or error code. + */ +int vdo_register_status_codes(void) +{ + uds_perform_once(&vdo_status_codes_registered, do_status_code_registration); + return status_code_registration_result; +} + +/** + * vdo_map_to_system_error() - Given an error code, return a value we can return to the OS. + * @error: The error code to convert. + * + * The input error code may be a system-generated value (such as -EIO), an errno macro used in our + * code (such as EIO), or a UDS or VDO status code; the result must be something the rest of the OS + * can consume (negative errno values such as -EIO, in the case of the kernel). + * + * Return: A system error code value. + */ +int vdo_map_to_system_error(int error) +{ + char error_name[UDS_MAX_ERROR_NAME_SIZE]; + char error_message[UDS_MAX_ERROR_MESSAGE_SIZE]; + + /* 0 is success, negative a system error code */ + if (likely(error <= 0)) + return error; + if (error < 1024) + return -error; + + /* VDO or UDS error */ + switch (error) { + case VDO_NO_SPACE: + return -ENOSPC; + case VDO_READ_ONLY: + return -EIO; + default: + uds_log_info("%s: mapping internal status code %d (%s: %s) to EIO", + __func__, + error, + uds_string_error_name(error, error_name, sizeof(error_name)), + uds_string_error(error, error_message, sizeof(error_message))); + return -EIO; + } +} diff --git a/drivers/md/dm-vdo/status-codes.h b/drivers/md/dm-vdo/status-codes.h new file mode 100644 index 000000000000..c5da37ec8b55 --- /dev/null +++ b/drivers/md/dm-vdo/status-codes.h @@ -0,0 +1,112 @@ +/* SPDX-License-Identifier: GPL-2.0-only */ +/* + * Copyright 2023 Red Hat + */ + +#ifndef VDO_STATUS_CODES_H +#define VDO_STATUS_CODES_H + +#include "errors.h" + +enum { + UDS_BLOCK_SIZE = UDS_ERROR_CODE_BLOCK_END - UDS_ERROR_CODE_BASE, + VDO_BLOCK_START = UDS_ERROR_CODE_BLOCK_END, + VDO_BLOCK_END = VDO_BLOCK_START + UDS_BLOCK_SIZE, + PRP_BLOCK_START = VDO_BLOCK_END, + PRP_BLOCK_END = PRP_BLOCK_START + UDS_BLOCK_SIZE, +}; + +/* VDO-specific status codes. 
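+ * These occupy the block of error codes laid out immediately after the UDS block above; only
+ * VDO_SUCCESS, which is zero, falls outside that range.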
*/ +enum vdo_status_codes { + /* successful result */ + VDO_SUCCESS, + /* base of all VDO errors */ + VDO_STATUS_CODE_BASE = VDO_BLOCK_START, + /* we haven't written this yet */ + VDO_NOT_IMPLEMENTED = VDO_STATUS_CODE_BASE, + /* input out of range */ + VDO_OUT_OF_RANGE, + /* an invalid reference count would result */ + VDO_REF_COUNT_INVALID, + /* a free block could not be allocated */ + VDO_NO_SPACE, + /* unexpected EOF on block read */ + VDO_UNEXPECTED_EOF, + /* improper or missing configuration option */ + VDO_BAD_CONFIGURATION, + /* socket opening or binding problem */ + VDO_SOCKET_ERROR, + /* read or write on non-aligned offset */ + VDO_BAD_ALIGNMENT, + /* prior operation still in progress */ + VDO_COMPONENT_BUSY, + /* page contents incorrect or corrupt data */ + VDO_BAD_PAGE, + /* unsupported version of some component */ + VDO_UNSUPPORTED_VERSION, + /* component id mismatch in decoder */ + VDO_INCORRECT_COMPONENT, + /* parameters have conflicting values */ + VDO_PARAMETER_MISMATCH, + /* the block size is too small */ + VDO_BLOCK_SIZE_TOO_SMALL, + /* no partition exists with a given id */ + VDO_UNKNOWN_PARTITION, + /* a partition already exists with a given id */ + VDO_PARTITION_EXISTS, + /* the VDO is not in read-only mode */ + VDO_NOT_READ_ONLY, + /* physical block growth of too few blocks */ + VDO_INCREMENT_TOO_SMALL, + /* incorrect checksum */ + VDO_CHECKSUM_MISMATCH, + /* the recovery journal is full */ + VDO_RECOVERY_JOURNAL_FULL, + /* a lock is held incorrectly */ + VDO_LOCK_ERROR, + /* the VDO is in read-only mode */ + VDO_READ_ONLY, + /* the VDO is shutting down */ + VDO_SHUTTING_DOWN, + /* the recovery journal has corrupt entries */ + VDO_CORRUPT_JOURNAL, + /* exceeds maximum number of slabs supported */ + VDO_TOO_MANY_SLABS, + /* a compressed block fragment is invalid */ + VDO_INVALID_FRAGMENT, + /* action is unsupported while rebuilding */ + VDO_RETRY_AFTER_REBUILD, + /* the extended command is not known */ + VDO_UNKNOWN_COMMAND, + /* bad extended command parameters */ + VDO_COMMAND_ERROR, + /* cannot determine sizes to fit */ + VDO_CANNOT_DETERMINE_SIZE, + /* a block map entry is invalid */ + VDO_BAD_MAPPING, + /* read cache has no free slots */ + VDO_READ_CACHE_BUSY, + /* bio_add_page failed */ + VDO_BIO_CREATION_FAILED, + /* bad magic number */ + VDO_BAD_MAGIC, + /* bad nonce */ + VDO_BAD_NONCE, + /* sequence number overflow */ + VDO_JOURNAL_OVERFLOW, + /* the VDO is not in a state to perform an admin operation */ + VDO_INVALID_ADMIN_STATE, + /* failure adding a sysfs node */ + VDO_CANT_ADD_SYSFS_NODE, + /* one more than last error code */ + VDO_STATUS_CODE_LAST, + VDO_STATUS_CODE_BLOCK_END = VDO_BLOCK_END +}; + +extern const struct error_info vdo_status_list[]; + +int vdo_register_status_codes(void); + +int vdo_map_to_system_error(int error); + +#endif /* VDO_STATUS_CODES_H */ diff --git a/drivers/md/dm-vdo/types.h b/drivers/md/dm-vdo/types.h new file mode 100644 index 000000000000..f2dcce92c796 --- /dev/null +++ b/drivers/md/dm-vdo/types.h @@ -0,0 +1,393 @@ +/* SPDX-License-Identifier: GPL-2.0-only */ +/* + * Copyright 2023 Red Hat + */ + +#ifndef VDO_TYPES_H +#define VDO_TYPES_H + +#include +#include +#include +#include +#include +#include + +#include "funnel-queue.h" + +/* A size type in blocks. */ +typedef u64 block_count_t; + +/* The size of a block. */ +typedef u16 block_size_t; + +/* A counter for data_vios */ +typedef u16 data_vio_count_t; + +/* A height within a tree. */ +typedef u8 height_t; + +/* The logical block number as used by the consumer. 
*/ +typedef u64 logical_block_number_t; + +/* The type of the nonce used to identify instances of VDO. */ +typedef u64 nonce_t; + +/* A size in pages. */ +typedef u32 page_count_t; + +/* A page number. */ +typedef u32 page_number_t; + +/* + * The physical (well, less logical) block number at which the block is found on the underlying + * device. + */ +typedef u64 physical_block_number_t; + +/* A count of tree roots. */ +typedef u8 root_count_t; + +/* A number of sectors. */ +typedef u8 sector_count_t; + +/* A sequence number. */ +typedef u64 sequence_number_t; + +/* The offset of a block within a slab. */ +typedef u32 slab_block_number; + +/* A size type in slabs. */ +typedef u16 slab_count_t; + +/* A slot in a bin or block map page. */ +typedef u16 slot_number_t; + +/* typedef thread_count_t - A thread counter. */ +typedef u8 thread_count_t; + +/* typedef thread_id_t - A thread ID, vdo threads are numbered sequentially from 0. */ +typedef u8 thread_id_t; + +/* A zone counter */ +typedef u8 zone_count_t; + +/* The following enums are persisted on storage, so the values must be preserved. */ + +/* The current operating mode of the VDO. */ +enum vdo_state { + VDO_DIRTY = 0, + VDO_NEW = 1, + VDO_CLEAN = 2, + VDO_READ_ONLY_MODE = 3, + VDO_FORCE_REBUILD = 4, + VDO_RECOVERING = 5, + VDO_REPLAYING = 6, /* VDO_REPLAYING is never set anymore, but retained for upgrade */ + VDO_REBUILD_FOR_UPGRADE = 7, + + /* Keep VDO_STATE_COUNT at the bottom. */ + VDO_STATE_COUNT +}; + +/** + * vdo_state_requires_read_only_rebuild() - Check whether a vdo_state indicates + * that a read-only rebuild is required. + * @state: The vdo_state to check. + * + * Return: true if the state indicates a rebuild is required + */ +static inline bool __must_check vdo_state_requires_read_only_rebuild(enum vdo_state state) +{ + return ((state == VDO_FORCE_REBUILD) || (state == VDO_REBUILD_FOR_UPGRADE)); +} + +/** + * vdo_state_requires_recovery() - Check whether a vdo state indicates that recovery is needed. + * @state: The state to check. + * + * Return: true if the state indicates a recovery is required + */ +static inline bool __must_check vdo_state_requires_recovery(enum vdo_state state) +{ + return ((state == VDO_DIRTY) || (state == VDO_REPLAYING) || (state == VDO_RECOVERING)); +} + +/* + * The current operation on a physical block (from the point of view of the recovery journal, slab + * journals, and reference counts. + */ +enum journal_operation { + VDO_JOURNAL_DATA_REMAPPING = 0, + VDO_JOURNAL_BLOCK_MAP_REMAPPING = 1, +} __packed; + +/* Partition IDs encoded in the volume layout in the super block. */ +enum partition_id { + VDO_BLOCK_MAP_PARTITION = 0, + VDO_SLAB_DEPOT_PARTITION = 1, + VDO_RECOVERY_JOURNAL_PARTITION = 2, + VDO_SLAB_SUMMARY_PARTITION = 3, +} __packed; + +/* Metadata types for the vdo. */ +enum vdo_metadata_type { + VDO_METADATA_RECOVERY_JOURNAL = 1, + VDO_METADATA_SLAB_JOURNAL = 2, + VDO_METADATA_RECOVERY_JOURNAL_2 = 3, +} __packed; + +/* A position in the block map where a block map entry is stored. */ +struct block_map_slot { + physical_block_number_t pbn; + slot_number_t slot; +}; + +/* + * Four bits of each five-byte block map entry contain a mapping state value used to distinguish + * unmapped or trimmed logical blocks (which are treated as mapped to the zero block) from entries + * that have been mapped to a physical block, including the zero block. + * + * FIXME: these should maybe be defines. 
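+ *
+ * For a compressed mapping, the compression slot is (state - VDO_MAPPING_STATE_COMPRESSED_BASE):
+ * states 2 through 15 address slots 0 through 13, giving VDO_MAX_COMPRESSION_SLOTS = 14.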
+ */ +enum block_mapping_state { + VDO_MAPPING_STATE_UNMAPPED = 0, /* Must be zero to be the default value */ + VDO_MAPPING_STATE_UNCOMPRESSED = 1, /* A normal (uncompressed) block */ + VDO_MAPPING_STATE_COMPRESSED_BASE = 2, /* Compressed in slot 0 */ + VDO_MAPPING_STATE_COMPRESSED_MAX = 15, /* Compressed in slot 13 */ +}; + +enum { + VDO_MAX_COMPRESSION_SLOTS = + (VDO_MAPPING_STATE_COMPRESSED_MAX - VDO_MAPPING_STATE_COMPRESSED_BASE + 1), +}; + + +struct data_location { + physical_block_number_t pbn; + enum block_mapping_state state; +}; + +/* The configuration of a single slab derived from the configured block size and slab size. */ +struct slab_config { + /* total number of blocks in the slab */ + block_count_t slab_blocks; + /* number of blocks available for data */ + block_count_t data_blocks; + /* number of blocks for reference counts */ + block_count_t reference_count_blocks; + /* number of blocks for the slab journal */ + block_count_t slab_journal_blocks; + /* + * Number of blocks after which the slab journal starts pushing out a reference_block for + * each new entry it receives. + */ + block_count_t slab_journal_flushing_threshold; + /* + * Number of blocks after which the slab journal pushes out all reference_blocks and makes + * all vios wait. + */ + block_count_t slab_journal_blocking_threshold; + /* Number of blocks after which the slab must be scrubbed before coming online. */ + block_count_t slab_journal_scrubbing_threshold; +} __packed; + +/* + * This structure is memcmp'd for equality. Keep it packed and don't add any fields that are not + * properly set in both extant and parsed configs. + */ +struct thread_count_config { + unsigned int bio_ack_threads; + unsigned int bio_threads; + unsigned int bio_rotation_interval; + unsigned int cpu_threads; + unsigned int logical_zones; + unsigned int physical_zones; + unsigned int hash_zones; +} __packed; + +struct device_config { + struct dm_target *owning_target; + struct dm_dev *owned_device; + struct vdo *vdo; + /* All configs referencing a layer are kept on a list in the layer */ + struct list_head config_list; + char *original_string; + unsigned int version; + char *parent_device_name; + block_count_t physical_blocks; + /* + * This is the number of logical blocks from VDO's internal point of view. It is the number + * of 4K blocks regardless of the value of the logical_block_size parameter below. + */ + block_count_t logical_blocks; + unsigned int logical_block_size; + unsigned int cache_size; + unsigned int block_map_maximum_age; + bool deduplication; + bool compression; + struct thread_count_config thread_counts; + block_count_t max_discard_blocks; +}; + +enum vdo_completion_type { + /* Keep VDO_UNSET_COMPLETION_TYPE at the top. */ + VDO_UNSET_COMPLETION_TYPE, + VDO_ACTION_COMPLETION, + VDO_ADMIN_COMPLETION, + VDO_BLOCK_ALLOCATOR_COMPLETION, + VDO_DATA_VIO_POOL_COMPLETION, + VDO_DECREMENT_COMPLETION, + VDO_FLUSH_COMPLETION, + VDO_FLUSH_NOTIFICATION_COMPLETION, + VDO_GENERATION_FLUSHED_COMPLETION, + VDO_HASH_ZONE_COMPLETION, + VDO_HASH_ZONES_COMPLETION, + VDO_LOCK_COUNTER_COMPLETION, + VDO_PAGE_COMPLETION, + VDO_READ_ONLY_MODE_COMPLETION, + VDO_REPAIR_COMPLETION, + VDO_SYNC_COMPLETION, + VIO_COMPLETION, +} __packed; + +struct vdo_completion; + +/** + * typedef vdo_action - An asynchronous VDO operation. + * @completion: The completion of the operation. 
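+ *
+ * An action is invoked as a completion's callback or error handler on the thread identified by
+ * the completion's callback_thread_id.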
+ */ +typedef void vdo_action(struct vdo_completion *completion); + +enum vdo_completion_priority { + BIO_ACK_Q_ACK_PRIORITY = 0, + BIO_ACK_Q_MAX_PRIORITY = 0, + BIO_Q_COMPRESSED_DATA_PRIORITY = 0, + BIO_Q_DATA_PRIORITY = 0, + BIO_Q_FLUSH_PRIORITY = 2, + BIO_Q_HIGH_PRIORITY = 2, + BIO_Q_METADATA_PRIORITY = 1, + BIO_Q_VERIFY_PRIORITY = 1, + BIO_Q_MAX_PRIORITY = 2, + CPU_Q_COMPLETE_VIO_PRIORITY = 0, + CPU_Q_COMPLETE_READ_PRIORITY = 0, + CPU_Q_COMPRESS_BLOCK_PRIORITY = 0, + CPU_Q_EVENT_REPORTER_PRIORITY = 0, + CPU_Q_HASH_BLOCK_PRIORITY = 0, + CPU_Q_MAX_PRIORITY = 0, + UDS_Q_PRIORITY = 0, + UDS_Q_MAX_PRIORITY = 0, + VDO_DEFAULT_Q_COMPLETION_PRIORITY = 1, + VDO_DEFAULT_Q_FLUSH_PRIORITY = 2, + VDO_DEFAULT_Q_MAP_BIO_PRIORITY = 0, + VDO_DEFAULT_Q_SYNC_PRIORITY = 2, + VDO_DEFAULT_Q_VIO_CALLBACK_PRIORITY = 1, + VDO_DEFAULT_Q_MAX_PRIORITY = 2, + /* The maximum allowable priority */ + VDO_WORK_Q_MAX_PRIORITY = 2, + /* A value which must be out of range for a valid priority */ + VDO_WORK_Q_DEFAULT_PRIORITY = VDO_WORK_Q_MAX_PRIORITY + 1, +}; + +struct vdo_completion { + /* The type of completion this is */ + enum vdo_completion_type type; + + /* + * true once the processing of the operation is complete. This flag should not + * be used by waiters external to the VDO base as it is used to gate calling the callback. + */ + bool complete; + + /* + * If true, queue this completion on the next callback invocation, even if it is already + * running on the correct thread. + */ + bool requeue; + + /* The ID of the thread which should run the next callback */ + thread_id_t callback_thread_id; + + /* The result of the operation */ + int result; + + /* The VDO on which this completion operates */ + struct vdo *vdo; + + /* The callback which will be called once the operation is complete */ + vdo_action *callback; + + /* Callback which, if set, will be called if an error result is set */ + vdo_action *error_handler; + + /* The parent object, if any, that spawned this completion */ + void *parent; + + /* Entry link for lock-free work queue */ + struct funnel_queue_entry work_queue_entry_link; + enum vdo_completion_priority priority; + struct vdo_work_queue *my_queue; +}; + +struct block_allocator; +struct data_vio; +struct vdo; +struct vdo_config; + +/* vio types for statistics and instrumentation. */ +enum vio_type { + VIO_TYPE_UNINITIALIZED = 0, + VIO_TYPE_DATA, + VIO_TYPE_BLOCK_ALLOCATOR, + VIO_TYPE_BLOCK_MAP, + VIO_TYPE_BLOCK_MAP_INTERIOR, + VIO_TYPE_GEOMETRY, + VIO_TYPE_PARTITION_COPY, + VIO_TYPE_RECOVERY_JOURNAL, + VIO_TYPE_SLAB_JOURNAL, + VIO_TYPE_SLAB_SUMMARY, + VIO_TYPE_SUPER_BLOCK, +} __packed; + +/* Priority levels for asynchronous I/O operations performed on a vio. */ +enum vio_priority { + VIO_PRIORITY_LOW = 0, + VIO_PRIORITY_DATA = VIO_PRIORITY_LOW, + VIO_PRIORITY_COMPRESSED_DATA = VIO_PRIORITY_DATA, + VIO_PRIORITY_METADATA, + VIO_PRIORITY_HIGH, +} __packed; + +/* + * A wrapper for a bio. All I/O to the storage below a vdo is conducted via vios. + */ +struct vio { + /* The completion for this vio */ + struct vdo_completion completion; + + /* The bio zone in which I/O should be processed */ + zone_count_t bio_zone; + + /* The queueing priority of the vio operation */ + enum vio_priority priority; + + /* The vio type is used for statistics and instrumentation. */ + enum vio_type type; + + /* The size of this vio in blocks */ + unsigned int block_count; + + /* The data being read or written. 
*/ + char *data; + + /* The VDO-owned bio to use for all IO for this vio */ + struct bio *bio; + + /* + * A list of enqueued bios with consecutive block numbers, stored by vdo_submit_bio() under + * the first-enqueued vio. The other vios are found via their bio entries in this list, and + * are not added to the work queue as separate completions. + */ + struct bio_list bios_merged; +}; + +#endif /* VDO_TYPES_H */ diff --git a/drivers/md/dm-vdo/wait-queue.c b/drivers/md/dm-vdo/wait-queue.c new file mode 100644 index 000000000000..2f889bc32cd4 --- /dev/null +++ b/drivers/md/dm-vdo/wait-queue.c @@ -0,0 +1,223 @@ +// SPDX-License-Identifier: GPL-2.0-only +/* + * Copyright 2023 Red Hat + */ + +#include "wait-queue.h" + +#include + +#include "permassert.h" + +#include "status-codes.h" + +/** + * vdo_enqueue_waiter() - Add a waiter to the tail end of a wait queue. + * @queue: The queue to which to add the waiter. + * @waiter: The waiter to add to the queue. + * + * The waiter must not already be waiting in a queue. + * + * Return: VDO_SUCCESS or an error code. + */ +void vdo_enqueue_waiter(struct wait_queue *queue, struct waiter *waiter) +{ + BUG_ON(waiter->next_waiter != NULL); + + if (queue->last_waiter == NULL) { + /* + * The queue is empty, so form the initial circular list by self-linking the + * initial waiter. + */ + waiter->next_waiter = waiter; + } else { + /* Splice the new waiter in at the end of the queue. */ + waiter->next_waiter = queue->last_waiter->next_waiter; + queue->last_waiter->next_waiter = waiter; + } + + /* In both cases, the waiter we added to the ring becomes the last waiter. */ + queue->last_waiter = waiter; + queue->queue_length += 1; +} + +/** + * vdo_transfer_all_waiters() - Transfer all waiters from one wait queue to a second queue, + * emptying the first queue. + * @from_queue: The queue containing the waiters to move. + * @to_queue: The queue that will receive the waiters from the first queue. + */ +void vdo_transfer_all_waiters(struct wait_queue *from_queue, struct wait_queue *to_queue) +{ + /* If the source queue is empty, there's nothing to do. */ + if (!vdo_has_waiters(from_queue)) + return; + + if (vdo_has_waiters(to_queue)) { + /* + * Both queues are non-empty. Splice the two circular lists together by swapping + * the next (head) pointers in the list tails. + */ + struct waiter *from_head = from_queue->last_waiter->next_waiter; + struct waiter *to_head = to_queue->last_waiter->next_waiter; + + to_queue->last_waiter->next_waiter = from_head; + from_queue->last_waiter->next_waiter = to_head; + } + + to_queue->last_waiter = from_queue->last_waiter; + to_queue->queue_length += from_queue->queue_length; + vdo_initialize_wait_queue(from_queue); +} + +/** + * vdo_notify_all_waiters() - Notify all the entries waiting in a queue. + * @queue: The wait queue containing the waiters to notify. + * @callback: The function to call to notify each waiter, or NULL to invoke the callback field + * registered in each waiter. + * @context: The context to pass to the callback function. + * + * Notifies all the entries waiting in a queue to continue execution by invoking a callback + * function on each of them in turn. The queue is copied and emptied before invoking any callbacks, + * and only the waiters that were in the queue at the start of the call will be notified. 
+ */ +void vdo_notify_all_waiters(struct wait_queue *queue, waiter_callback *callback, void *context) +{ + /* + * Copy and empty the queue first, avoiding the possibility of an infinite loop if entries + * are returned to the queue by the callback function. + */ + struct wait_queue waiters; + + vdo_initialize_wait_queue(&waiters); + vdo_transfer_all_waiters(queue, &waiters); + + /* Drain the copied queue, invoking the callback on every entry. */ + while (vdo_notify_next_waiter(&waiters, callback, context)) + /* All the work is done by the loop condition. */ + ; +} + +/** + * vdo_get_first_waiter() - Return the waiter that is at the head end of a wait queue. + * @queue: The queue from which to get the first waiter. + * + * Return: The first (oldest) waiter in the queue, or NULL if the queue is empty. + */ +struct waiter *vdo_get_first_waiter(const struct wait_queue *queue) +{ + struct waiter *last_waiter = queue->last_waiter; + + if (last_waiter == NULL) + /* There are no waiters, so we're done. */ + return NULL; + + /* The queue is circular, so the last entry links to the head of the queue. */ + return last_waiter->next_waiter; +} + +/** + * vdo_dequeue_matching_waiters() - Remove all waiters that match based on the specified matching + * method and append them to a wait_queue. + * @queue: The wait queue to process. + * @match_method: The method to determine matching. + * @match_context: Contextual info for the match method. + * @matched_queue: A wait_queue to store matches. + */ +void vdo_dequeue_matching_waiters(struct wait_queue *queue, + waiter_match *match_method, + void *match_context, + struct wait_queue *matched_queue) +{ + struct wait_queue matched_waiters, iteration_queue; + + vdo_initialize_wait_queue(&matched_waiters); + + vdo_initialize_wait_queue(&iteration_queue); + vdo_transfer_all_waiters(queue, &iteration_queue); + while (vdo_has_waiters(&iteration_queue)) { + struct waiter *waiter = vdo_dequeue_next_waiter(&iteration_queue); + + vdo_enqueue_waiter((match_method(waiter, match_context) ? + &matched_waiters : + queue), + waiter); + } + + vdo_transfer_all_waiters(&matched_waiters, matched_queue); +} + +/** + * vdo_dequeue_next_waiter() - Remove the first waiter from the head end of a wait queue. + * @queue: The wait queue from which to remove the first entry. + * + * The caller will be responsible for waking the waiter by invoking the correct callback function + * to resume its execution. + * + * Return: The first (oldest) waiter in the queue, or NULL if the queue is empty. + */ +struct waiter *vdo_dequeue_next_waiter(struct wait_queue *queue) +{ + struct waiter *first_waiter = vdo_get_first_waiter(queue); + struct waiter *last_waiter = queue->last_waiter; + + if (first_waiter == NULL) + return NULL; + + if (first_waiter == last_waiter) + /* The queue has a single entry, so just empty it out by nulling the tail. */ + queue->last_waiter = NULL; + else + /* + * The queue has more than one entry, so splice the first waiter out of the + * circular queue. + */ + last_waiter->next_waiter = first_waiter->next_waiter; + + /* The waiter is no longer in a wait queue. */ + first_waiter->next_waiter = NULL; + queue->queue_length -= 1; + return first_waiter; +} + +/** + * vdo_notify_next_waiter() - Notify the next entry waiting in a queue. + * @queue: The wait queue containing the waiter to notify. + * @callback: The function to call to notify the waiter, or NULL to invoke the callback field + * registered in the waiter. + * @context: The context to pass to the callback function. 
+ * + * Notifies the next entry waiting in a queue to continue execution by invoking a callback function + * on it after removing it from the queue. + * + * Return: true if there was a waiter in the queue. + */ +bool vdo_notify_next_waiter(struct wait_queue *queue, waiter_callback *callback, void *context) +{ + struct waiter *waiter = vdo_dequeue_next_waiter(queue); + + if (waiter == NULL) + return false; + + if (callback == NULL) + callback = waiter->callback; + (*callback)(waiter, context); + return true; +} + +/** + * vdo_get_next_waiter() - Get the waiter after this one, for debug iteration. + * @queue: The wait queue. + * @waiter: A waiter. + * + * Return: The next waiter, or NULL. + */ +const struct waiter * +vdo_get_next_waiter(const struct wait_queue *queue, const struct waiter *waiter) +{ + struct waiter *first_waiter = vdo_get_first_waiter(queue); + + if (waiter == NULL) + return first_waiter; + return ((waiter->next_waiter != first_waiter) ? waiter->next_waiter : NULL); +} diff --git a/drivers/md/dm-vdo/wait-queue.h b/drivers/md/dm-vdo/wait-queue.h new file mode 100644 index 000000000000..8620fe9dd28a --- /dev/null +++ b/drivers/md/dm-vdo/wait-queue.h @@ -0,0 +1,129 @@ +/* SPDX-License-Identifier: GPL-2.0-only */ +/* + * Copyright 2023 Red Hat + */ + +#ifndef VDO_WAIT_QUEUE_H +#define VDO_WAIT_QUEUE_H + +#include +#include + +/** + * DOC: Wait queues. + * + * A wait queue is a circular list of entries waiting to be notified of a change in a condition. + * Keeping a circular list allows the queue structure to simply be a pointer to the tail (newest) + * entry in the queue, supporting constant-time enqueue and dequeue operations. A null pointer is + * an empty queue. + * + * An empty queue: + * queue0.last_waiter -> NULL + * + * A singleton queue: + * queue1.last_waiter -> entry1 -> entry1 -> [...] + * + * A three-element queue: + * queue2.last_waiter -> entry3 -> entry1 -> entry2 -> entry3 -> [...] + */ + +struct waiter; + +struct wait_queue { + /* The tail of the queue, the last (most recently added) entry */ + struct waiter *last_waiter; + /* The number of waiters currently in the queue */ + size_t queue_length; +}; + +/** + * typedef waiter_callback - Callback type for functions which will be called to resume processing + * of a waiter after it has been removed from its wait queue. + */ +typedef void waiter_callback(struct waiter *waiter, void *context); + +/** + * typedef waiter_match - Method type for waiter matching methods. + * + * A waiter_match method returns false if the waiter does not match. + */ +typedef bool waiter_match(struct waiter *waiter, void *context); + +/* The queue entry structure for entries in a wait_queue. */ +struct waiter { + /* + * The next waiter in the queue. If this entry is the last waiter, then this is actually a + * pointer back to the head of the queue. + */ + struct waiter *next_waiter; + + /* Optional waiter-specific callback to invoke when waking this waiter. */ + waiter_callback *callback; +}; + +/** + * is_waiting() - Check whether a waiter is waiting. + * @waiter: The waiter to check. + * + * Return: true if the waiter is on some wait_queue. + */ +static inline bool vdo_is_waiting(struct waiter *waiter) +{ + return (waiter->next_waiter != NULL); +} + +/** + * initialize_wait_queue() - Initialize a wait queue. + * @queue: The queue to initialize. 
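+ *
+ * A minimal usage sketch (illustrative only; wake_waiter and ctx are placeholders):
+ *
+ *     struct wait_queue queue;
+ *     struct waiter waiter = { .callback = wake_waiter };
+ *
+ *     vdo_initialize_wait_queue(&queue);
+ *     vdo_enqueue_waiter(&queue, &waiter);
+ *     vdo_notify_all_waiters(&queue, NULL, ctx);   (calls wake_waiter(&waiter, ctx))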
+ */ +static inline void vdo_initialize_wait_queue(struct wait_queue *queue) +{ + *queue = (struct wait_queue) { + .last_waiter = NULL, + .queue_length = 0, + }; +} + +/** + * has_waiters() - Check whether a wait queue has any entries waiting in it. + * @queue: The queue to query. + * + * Return: true if there are any waiters in the queue. + */ +static inline bool __must_check vdo_has_waiters(const struct wait_queue *queue) +{ + return (queue->last_waiter != NULL); +} + +void vdo_enqueue_waiter(struct wait_queue *queue, struct waiter *waiter); + +void vdo_notify_all_waiters(struct wait_queue *queue, waiter_callback *callback, void *context); + +bool vdo_notify_next_waiter(struct wait_queue *queue, waiter_callback *callback, void *context); + +void vdo_transfer_all_waiters(struct wait_queue *from_queue, struct wait_queue *to_queue); + +struct waiter *vdo_get_first_waiter(const struct wait_queue *queue); + +void vdo_dequeue_matching_waiters(struct wait_queue *queue, + waiter_match *match_method, + void *match_context, + struct wait_queue *matched_queue); + +struct waiter *vdo_dequeue_next_waiter(struct wait_queue *queue); + +/** + * count_waiters() - Count the number of waiters in a wait queue. + * @queue: The wait queue to query. + * + * Return: The number of waiters in the queue. + */ +static inline size_t __must_check vdo_count_waiters(const struct wait_queue *queue) +{ + return queue->queue_length; +} + +const struct waiter * __must_check +vdo_get_next_waiter(const struct wait_queue *queue, const struct waiter *waiter); + +#endif /* VDO_WAIT_QUEUE_H */ From patchwork Thu Oct 26 21:41:03 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Matthew Sakai X-Patchwork-Id: 13437926 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 890E3C25B6B for ; Thu, 26 Oct 2023 21:42:30 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S231947AbjJZVma (ORCPT ); Thu, 26 Oct 2023 17:42:30 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:56134 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229621AbjJZVma (ORCPT ); Thu, 26 Oct 2023 17:42:30 -0400 Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.129.124]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id A62B110E for ; Thu, 26 Oct 2023 14:41:40 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1698356499; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=chfx64uLJD/sf1u+eg1ZQv38Q/wef9blrTzhCFG3FOI=; b=XxOPY9JbARIg26DtZvJwj6rgPZvpT3WBXF3uQrFudcmKaQEgag+epMQcWQhabpsQ14gs98 TyoNb9QGIpWlrFFdqhyMSxs9s+xsIyOVYOtMkKPmguuGI2j9rdQE5h1xDR/vea8+3PefR5 VGtNM0zg4K5vns8uhTh7pcnzS/W2VQM= Received: from mimecast-mx02.redhat.com (mimecast-mx02.redhat.com [66.187.233.88]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.3, cipher=TLS_AES_256_GCM_SHA384) id us-mta-317-eNSyWPJAOiimmfsV-KT7dQ-1; Thu, 26 Oct 2023 17:41:38 -0400 X-MC-Unique: eNSyWPJAOiimmfsV-KT7dQ-1 Received: from smtp.corp.redhat.com (int-mx05.intmail.prod.int.rdu2.redhat.com [10.11.54.5]) (using TLSv1.3 with 
cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (2048 bits) server-digest SHA256) (No client certificate requested) by mimecast-mx02.redhat.com (Postfix) with ESMTPS id 25CE2185A784; Thu, 26 Oct 2023 21:41:38 +0000 (UTC) Received: from pbitcolo-build-10.permabit.com (pbitcolo-build-10.permabit.lab.eng.bos.redhat.com [10.19.117.76]) by smtp.corp.redhat.com (Postfix) with ESMTP id 16D0A167AB; Thu, 26 Oct 2023 21:41:38 +0000 (UTC) Received: by pbitcolo-build-10.permabit.com (Postfix, from userid 1138) id CABB03003D; Thu, 26 Oct 2023 17:41:37 -0400 (EDT) From: Matthew Sakai To: dm-devel@lists.linux.dev, linux-block@vger.kernel.org Cc: Matthew Sakai Subject: [PATCH v4 06/39] dm vdo: add thread and synchronization utilities Date: Thu, 26 Oct 2023 17:41:03 -0400 Message-Id: <20231026214136.1067410-7-msakai@redhat.com> In-Reply-To: <20231026214136.1067410-1-msakai@redhat.com> References: <20231026214136.1067410-1-msakai@redhat.com> MIME-Version: 1.0 X-Scanned-By: MIMEDefang 3.4.1 on 10.11.54.5 Precedence: bulk List-ID: X-Mailing-List: linux-block@vger.kernel.org This patch adds utilities for managing and using named threads, as well as several locking and synchronization utilities. These utilities help dm-vdo minimize thread transitions and manage interactions between threads. Co-developed-by: J. corwin Coburn Signed-off-by: J. corwin Coburn Co-developed-by: Michael Sclafani Signed-off-by: Michael Sclafani Co-developed-by: Thomas Jaskiewicz Signed-off-by: Thomas Jaskiewicz Co-developed-by: Bruce Johnston Signed-off-by: Bruce Johnston Co-developed-by: Ken Raeburn Signed-off-by: Ken Raeburn Signed-off-by: Matthew Sakai --- drivers/md/dm-vdo/thread-cond-var.c | 46 +++++++ drivers/md/dm-vdo/thread-device.c | 35 ++++++ drivers/md/dm-vdo/thread-device.h | 19 +++ drivers/md/dm-vdo/thread-registry.c | 93 ++++++++++++++ drivers/md/dm-vdo/thread-registry.h | 33 +++++ drivers/md/dm-vdo/uds-threads.c | 186 ++++++++++++++++++++++++++++ drivers/md/dm-vdo/uds-threads.h | 115 +++++++++++++++++ 7 files changed, 527 insertions(+) create mode 100644 drivers/md/dm-vdo/thread-cond-var.c create mode 100644 drivers/md/dm-vdo/thread-device.c create mode 100644 drivers/md/dm-vdo/thread-device.h create mode 100644 drivers/md/dm-vdo/thread-registry.c create mode 100644 drivers/md/dm-vdo/thread-registry.h create mode 100644 drivers/md/dm-vdo/uds-threads.c create mode 100644 drivers/md/dm-vdo/uds-threads.h diff --git a/drivers/md/dm-vdo/thread-cond-var.c b/drivers/md/dm-vdo/thread-cond-var.c new file mode 100644 index 000000000000..ed7f0b79ca0a --- /dev/null +++ b/drivers/md/dm-vdo/thread-cond-var.c @@ -0,0 +1,46 @@ +// SPDX-License-Identifier: GPL-2.0-only +/* + * Copyright 2023 Red Hat + */ + +#include +#include + +#include "errors.h" +#include "time-utils.h" +#include "uds-threads.h" + +int uds_init_cond(struct cond_var *cv) +{ + init_waitqueue_head(&cv->wait_queue); + return UDS_SUCCESS; +} + +int uds_signal_cond(struct cond_var *cv) +{ + wake_up(&cv->wait_queue); + return UDS_SUCCESS; +} + +int uds_broadcast_cond(struct cond_var *cv) +{ + wake_up_all(&cv->wait_queue); + return UDS_SUCCESS; +} + +int uds_wait_cond(struct cond_var *cv, struct mutex *mutex) +{ + DEFINE_WAIT(__wait); + + prepare_to_wait(&cv->wait_queue, &__wait, TASK_IDLE); + uds_unlock_mutex(mutex); + schedule(); + finish_wait(&cv->wait_queue, &__wait); + uds_lock_mutex(mutex); + return UDS_SUCCESS; +} + +int uds_destroy_cond(struct cond_var *cv) +{ + return UDS_SUCCESS; +} diff --git 
a/drivers/md/dm-vdo/thread-device.c b/drivers/md/dm-vdo/thread-device.c new file mode 100644 index 000000000000..897d8490c3af --- /dev/null +++ b/drivers/md/dm-vdo/thread-device.c @@ -0,0 +1,35 @@ +// SPDX-License-Identifier: GPL-2.0-only +/* + * Copyright 2023 Red Hat + */ + +#include "thread-device.h" + +#include "thread-registry.h" + +/* A registry of threads associated with device id numbers. */ +static struct thread_registry device_id_thread_registry; + +/* Any registered thread must be unregistered. */ +void uds_register_thread_device_id(struct registered_thread *new_thread, unsigned int *id_ptr) +{ + uds_register_thread(&device_id_thread_registry, new_thread, id_ptr); +} + +void uds_unregister_thread_device_id(void) +{ + uds_unregister_thread(&device_id_thread_registry); +} + +int uds_get_thread_device_id(void) +{ + const unsigned int *pointer; + + pointer = uds_lookup_thread(&device_id_thread_registry); + return (pointer != NULL) ? *pointer : -1; +} + +void uds_initialize_thread_device_registry(void) +{ + uds_initialize_thread_registry(&device_id_thread_registry); +} diff --git a/drivers/md/dm-vdo/thread-device.h b/drivers/md/dm-vdo/thread-device.h new file mode 100644 index 000000000000..c1c189f0d248 --- /dev/null +++ b/drivers/md/dm-vdo/thread-device.h @@ -0,0 +1,19 @@ +/* SPDX-License-Identifier: GPL-2.0-only */ +/* + * Copyright 2023 Red Hat + */ + +#ifndef UDS_THREAD_DEVICE_H +#define UDS_THREAD_DEVICE_H + +#include "thread-registry.h" + +void uds_register_thread_device_id(struct registered_thread *new_thread, unsigned int *id_ptr); + +void uds_unregister_thread_device_id(void); + +int uds_get_thread_device_id(void); + +void uds_initialize_thread_device_registry(void); + +#endif /* UDS_THREAD_DEVICE_H */ diff --git a/drivers/md/dm-vdo/thread-registry.c b/drivers/md/dm-vdo/thread-registry.c new file mode 100644 index 000000000000..6e76cf2a93b6 --- /dev/null +++ b/drivers/md/dm-vdo/thread-registry.c @@ -0,0 +1,93 @@ +// SPDX-License-Identifier: GPL-2.0-only +/* + * Copyright 2023 Red Hat + */ + +#include "thread-registry.h" + +#include + +#include "permassert.h" + +/* + * We need to be careful when using other facilities that may use thread registry functions in + * their normal operation. For example, we do not want to invoke the logger while holding a lock. + */ + +void uds_initialize_thread_registry(struct thread_registry *registry) +{ + INIT_LIST_HEAD(®istry->links); + spin_lock_init(®istry->lock); +} + +/* Register the current thread and associate it with a data pointer. */ +void uds_register_thread(struct thread_registry *registry, + struct registered_thread *new_thread, + const void *pointer) +{ + struct registered_thread *thread; + bool found_it = false; + + INIT_LIST_HEAD(&new_thread->links); + new_thread->pointer = pointer; + new_thread->task = current; + + spin_lock(®istry->lock); + list_for_each_entry(thread, ®istry->links, links) { + if (thread->task == current) { + /* There should be no existing entry. */ + list_del_rcu(&thread->links); + found_it = true; + break; + } + } + list_add_tail_rcu(&new_thread->links, ®istry->links); + spin_unlock(®istry->lock); + + ASSERT_LOG_ONLY(!found_it, "new thread not already in registry"); + if (found_it) { + /* Ensure no RCU iterators see it before re-initializing. 
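+ * synchronize_rcu() waits for any reader currently walking the list under rcu_read_lock()
+ * (see uds_lookup_thread()) to finish before the stale entry's links are reset.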
*/ + synchronize_rcu(); + INIT_LIST_HEAD(&thread->links); + } +} + +void uds_unregister_thread(struct thread_registry *registry) +{ + struct registered_thread *thread; + bool found_it = false; + + spin_lock(®istry->lock); + list_for_each_entry(thread, ®istry->links, links) { + if (thread->task == current) { + list_del_rcu(&thread->links); + found_it = true; + break; + } + } + spin_unlock(®istry->lock); + + ASSERT_LOG_ONLY(found_it, "thread found in registry"); + if (found_it) { + /* Ensure no RCU iterators see it before re-initializing. */ + synchronize_rcu(); + INIT_LIST_HEAD(&thread->links); + } +} + +const void *uds_lookup_thread(struct thread_registry *registry) +{ + struct registered_thread *thread; + const void *result = NULL; + + rcu_read_lock(); + list_for_each_entry_rcu(thread, ®istry->links, links) { + if (thread->task == current) { + result = thread->pointer; + break; + } + } + rcu_read_unlock(); + + return result; +} diff --git a/drivers/md/dm-vdo/thread-registry.h b/drivers/md/dm-vdo/thread-registry.h new file mode 100644 index 000000000000..05d82a8211fe --- /dev/null +++ b/drivers/md/dm-vdo/thread-registry.h @@ -0,0 +1,33 @@ +/* SPDX-License-Identifier: GPL-2.0-only */ +/* + * Copyright 2023 Red Hat + */ + +#ifndef UDS_THREAD_REGISTRY_H +#define UDS_THREAD_REGISTRY_H + +#include +#include + +struct thread_registry { + struct list_head links; + spinlock_t lock; +}; + +struct registered_thread { + struct list_head links; + const void *pointer; + struct task_struct *task; +}; + +void uds_initialize_thread_registry(struct thread_registry *registry); + +void uds_register_thread(struct thread_registry *registry, + struct registered_thread *new_thread, + const void *pointer); + +void uds_unregister_thread(struct thread_registry *registry); + +const void *uds_lookup_thread(struct thread_registry *registry); + +#endif /* UDS_THREAD_REGISTRY_H */ diff --git a/drivers/md/dm-vdo/uds-threads.c b/drivers/md/dm-vdo/uds-threads.c new file mode 100644 index 000000000000..17737246d28c --- /dev/null +++ b/drivers/md/dm-vdo/uds-threads.c @@ -0,0 +1,186 @@ +// SPDX-License-Identifier: GPL-2.0-only +/* + * Copyright 2023 Red Hat + */ + +#include "uds-threads.h" + +#include +#include +#include +#include + +#include "errors.h" +#include "logger.h" +#include "memory-alloc.h" + +static struct hlist_head thread_list; +static struct mutex thread_mutex; +static atomic_t thread_once = ATOMIC_INIT(0); + +struct thread { + void (*thread_function)(void *thread_data); + void *thread_data; + struct hlist_node thread_links; + struct task_struct *thread_task; + struct completion thread_done; +}; + +enum { + ONCE_NOT_DONE = 0, + ONCE_IN_PROGRESS = 1, + ONCE_COMPLETE = 2, +}; + +/* Run a function once only, and record that fact in the atomic value. 
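+ *
+ * The winner of the cmpxchg moves the state from ONCE_NOT_DONE to ONCE_IN_PROGRESS, runs the
+ * function, and publishes ONCE_COMPLETE with release semantics; any other caller spins with
+ * cond_resched() until it observes ONCE_COMPLETE.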
*/ +void uds_perform_once(atomic_t *once, void (*function)(void)) +{ + for (;;) { + switch (atomic_cmpxchg(once, ONCE_NOT_DONE, ONCE_IN_PROGRESS)) { + case ONCE_NOT_DONE: + function(); + atomic_set_release(once, ONCE_COMPLETE); + return; + case ONCE_IN_PROGRESS: + cond_resched(); + break; + case ONCE_COMPLETE: + return; + default: + return; + } + } +} + +static void thread_init(void) +{ + mutex_init(&thread_mutex); +} + +static int thread_starter(void *arg) +{ + struct registered_thread allocating_thread; + struct thread *thread = arg; + + thread->thread_task = current; + uds_perform_once(&thread_once, thread_init); + mutex_lock(&thread_mutex); + hlist_add_head(&thread->thread_links, &thread_list); + mutex_unlock(&thread_mutex); + uds_register_allocating_thread(&allocating_thread, NULL); + thread->thread_function(thread->thread_data); + uds_unregister_allocating_thread(); + complete(&thread->thread_done); + return 0; +} + +int uds_create_thread(void (*thread_function)(void *), + void *thread_data, + const char *name, + struct thread **new_thread) +{ + char *name_colon = strchr(name, ':'); + char *my_name_colon = strchr(current->comm, ':'); + struct task_struct *task; + struct thread *thread; + int result; + + result = UDS_ALLOCATE(1, struct thread, __func__, &thread); + if (result != UDS_SUCCESS) { + uds_log_warning("Error allocating memory for %s", name); + return result; + } + + thread->thread_function = thread_function; + thread->thread_data = thread_data; + init_completion(&thread->thread_done); + /* + * Start the thread, with an appropriate thread name. + * + * If the name supplied contains a colon character, use that name. This causes uds module + * threads to have names like "uds:callbackW" and the main test runner thread to be named + * "zub:runtest". + * + * Otherwise if the current thread has a name containing a colon character, prefix the name + * supplied with the name of the current thread up to (and including) the colon character. + * Thus when the "kvdo0:dedupeQ" thread opens an index session, all the threads associated + * with that index will have names like "kvdo0:foo". + * + * Otherwise just use the name supplied. This should be a rare occurrence. 
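+ *
+ * For example (illustrative names), uds_create_thread(fn, data, "packerQ", &thread) called
+ * from a thread named "kvdo0:cpuQ0" produces a kernel thread named "kvdo0:packerQ".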
+ */ + if ((name_colon == NULL) && (my_name_colon != NULL)) + task = kthread_run(thread_starter, + thread, + "%.*s:%s", + (int) (my_name_colon - current->comm), + current->comm, + name); + else + task = kthread_run(thread_starter, thread, "%s", name); + + if (IS_ERR(task)) { + UDS_FREE(thread); + return PTR_ERR(task); + } + + *new_thread = thread; + return UDS_SUCCESS; +} + +int uds_join_threads(struct thread *thread) +{ + while (wait_for_completion_interruptible(&thread->thread_done) != 0) + /* empty loop */ + ; + + mutex_lock(&thread_mutex); + hlist_del(&thread->thread_links); + mutex_unlock(&thread_mutex); + UDS_FREE(thread); + return UDS_SUCCESS; +} + +int uds_initialize_barrier(struct barrier *barrier, unsigned int thread_count) +{ + int result; + + result = uds_initialize_semaphore(&barrier->mutex, 1); + if (result != UDS_SUCCESS) + return result; + + barrier->arrived = 0; + barrier->thread_count = thread_count; + return uds_initialize_semaphore(&barrier->wait, 0); +} + +int uds_destroy_barrier(struct barrier *barrier) +{ + int result; + + result = uds_destroy_semaphore(&barrier->mutex); + if (result != UDS_SUCCESS) + return result; + return uds_destroy_semaphore(&barrier->wait); +} + +int uds_enter_barrier(struct barrier *barrier) +{ + bool last_thread; + + uds_acquire_semaphore(&barrier->mutex); + last_thread = (++barrier->arrived == barrier->thread_count); + if (last_thread) { + int i; + + for (i = 1; i < barrier->thread_count; i++) + uds_release_semaphore(&barrier->wait); + + barrier->arrived = 0; + uds_release_semaphore(&barrier->mutex); + } else { + uds_release_semaphore(&barrier->mutex); + uds_acquire_semaphore(&barrier->wait); + } + + return UDS_SUCCESS; +} diff --git a/drivers/md/dm-vdo/uds-threads.h b/drivers/md/dm-vdo/uds-threads.h new file mode 100644 index 000000000000..9b65bf65c595 --- /dev/null +++ b/drivers/md/dm-vdo/uds-threads.h @@ -0,0 +1,115 @@ +/* SPDX-License-Identifier: GPL-2.0-only */ +/* + * Copyright 2023 Red Hat + */ + +#ifndef UDS_THREADS_H +#define UDS_THREADS_H + +#include +#include +#include +#include +#include +#include + +#include "errors.h" +#include "time-utils.h" + +/* Thread and synchronization utilities for UDS */ + +struct cond_var { + wait_queue_head_t wait_queue; +}; + +struct thread; + +struct barrier { + /* Mutex for this barrier object */ + struct semaphore mutex; + /* Semaphore for threads waiting at the barrier */ + struct semaphore wait; + /* Number of threads which have arrived */ + int arrived; + /* Total number of threads using this barrier */ + int thread_count; +}; + +int __must_check uds_create_thread(void (*thread_function)(void *), + void *thread_data, + const char *name, + struct thread **new_thread); + +void uds_perform_once(atomic_t *once_state, void (*function) (void)); + +int uds_join_threads(struct thread *thread); + +int __must_check uds_initialize_barrier(struct barrier *barrier, unsigned int thread_count); +int uds_destroy_barrier(struct barrier *barrier); +int uds_enter_barrier(struct barrier *barrier); + +int __must_check uds_init_cond(struct cond_var *cond); +int uds_signal_cond(struct cond_var *cond); +int uds_broadcast_cond(struct cond_var *cond); +int uds_wait_cond(struct cond_var *cond, struct mutex *mutex); +int uds_destroy_cond(struct cond_var *cond); + +static inline int __must_check uds_init_mutex(struct mutex *mutex) +{ + mutex_init(mutex); + return UDS_SUCCESS; +} + +static inline int uds_destroy_mutex(struct mutex *mutex) +{ + return UDS_SUCCESS; +} + +static inline void uds_lock_mutex(struct mutex *mutex) 
+{ + mutex_lock(mutex); +} + +static inline void uds_unlock_mutex(struct mutex *mutex) +{ + mutex_unlock(mutex); +} + +static inline int __must_check +uds_initialize_semaphore(struct semaphore *semaphore, unsigned int value) +{ + sema_init(semaphore, value); + return UDS_SUCCESS; +} + +static inline int uds_destroy_semaphore(struct semaphore *semaphore) +{ + return UDS_SUCCESS; +} + +static inline void uds_acquire_semaphore(struct semaphore *semaphore) +{ + /* + * Do not use down(semaphore). Instead use down_interruptible so that + * we do not get 120 second stall messages in kern.log. + */ + while (down_interruptible(semaphore) != 0) + /* + * If we're called from a user-mode process (e.g., "dmsetup + * remove") while waiting for an operation that may take a + * while (e.g., UDS index save), and a signal is sent (SIGINT, + * SIGUSR2), then down_interruptible will not block. If that + * happens, sleep briefly to avoid keeping the CPU locked up in + * this loop. We could just call cond_resched, but then we'd + * still keep consuming CPU time slices and swamp other threads + * trying to do computational work. [VDO-4980] + */ + fsleep(1000); +} + +static inline void uds_release_semaphore(struct semaphore *semaphore) +{ + up(semaphore); +} + +#endif /* UDS_THREADS_H */ From patchwork Thu Oct 26 21:41:04 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Matthew Sakai X-Patchwork-Id: 13437935 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 107EAC27C47 for ; Thu, 26 Oct 2023 21:43:19 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S233035AbjJZVnT (ORCPT ); Thu, 26 Oct 2023 17:43:19 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:40990 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S232235AbjJZVnS (ORCPT ); Thu, 26 Oct 2023 17:43:18 -0400 Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.129.124]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 54534128 for ; Thu, 26 Oct 2023 14:41:41 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1698356500; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=qTVr2bEL2nhED7bKq6qyOh8riwxZ6zyxIqcZ+ZdPwu0=; b=PcsG3+DPeetc2oxZo3umUQCrwK5Acpy9mUD7EPG9aUjV61uCQWFxli4IYdcNE5CTWbKmrC /BWOAXKnnp1Md15HiD0Bc0nfuUnWnom4ez/epgdb8PJT+dwBV/JfSesB9jXkpzUtWWgsdT /F+Qsh4y827KQoJlIbiaRB/+t8HYM4Y= Received: from mimecast-mx02.redhat.com (mx-ext.redhat.com [66.187.233.73]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.3, cipher=TLS_AES_256_GCM_SHA384) id us-mta-689-mjMNYASXM2WtAoI9povrpg-1; Thu, 26 Oct 2023 17:41:38 -0400 X-MC-Unique: mjMNYASXM2WtAoI9povrpg-1 Received: from smtp.corp.redhat.com (int-mx08.intmail.prod.int.rdu2.redhat.com [10.11.54.8]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (2048 bits) server-digest SHA256) (No client certificate requested) by mimecast-mx02.redhat.com (Postfix) with ESMTPS id 64BEF1C04B41; Thu, 26 Oct 2023 21:41:38 +0000 (UTC) Received: from 
pbitcolo-build-10.permabit.com (pbitcolo-build-10.permabit.lab.eng.bos.redhat.com [10.19.117.76]) by smtp.corp.redhat.com (Postfix) with ESMTP id 49E7DC15984; Thu, 26 Oct 2023 21:41:38 +0000 (UTC) Received: by pbitcolo-build-10.permabit.com (Postfix, from userid 1138) id 00BF93003E; Thu, 26 Oct 2023 17:41:37 -0400 (EDT) From: Matthew Sakai To: dm-devel@lists.linux.dev, linux-block@vger.kernel.org Cc: Matthew Sakai Subject: [PATCH v4 07/39] dm vdo: add specialized request queueing functionality Date: Thu, 26 Oct 2023 17:41:04 -0400 Message-Id: <20231026214136.1067410-8-msakai@redhat.com> In-Reply-To: <20231026214136.1067410-1-msakai@redhat.com> References: <20231026214136.1067410-1-msakai@redhat.com> MIME-Version: 1.0 X-Scanned-By: MIMEDefang 3.4.1 on 10.11.54.8 Precedence: bulk List-ID: X-Mailing-List: linux-block@vger.kernel.org This patch adds funnel_queue, a mostly lock-free multi-producer, single-consumer queue. It also adds the request queue used by the dm-vdo deduplication index, and the work_queue used by the dm-vdo data store. Both of these are built on top of funnel queue and are intended to support the dispatching of many short-running tasks. The work_queue also supports priorities. Finally, this patch adds vdo_completion, the structure which is enqueued on work_queues. Co-developed-by: J. corwin Coburn Signed-off-by: J. corwin Coburn Co-developed-by: Michael Sclafani Signed-off-by: Michael Sclafani Co-developed-by: Sweet Tea Dorminy Signed-off-by: Sweet Tea Dorminy Co-developed-by: Ken Raeburn Signed-off-by: Ken Raeburn Signed-off-by: Matthew Sakai --- drivers/md/dm-vdo/completion.c | 141 +++++++ drivers/md/dm-vdo/completion.h | 155 +++++++ drivers/md/dm-vdo/cpu.h | 58 +++ drivers/md/dm-vdo/funnel-queue.c | 169 ++++++++ drivers/md/dm-vdo/funnel-queue.h | 110 +++++ drivers/md/dm-vdo/request-queue.c | 284 +++++++++++++ drivers/md/dm-vdo/request-queue.h | 30 ++ drivers/md/dm-vdo/work-queue.c | 659 ++++++++++++++++++++++++++++++ drivers/md/dm-vdo/work-queue.h | 53 +++ 9 files changed, 1659 insertions(+) create mode 100644 drivers/md/dm-vdo/completion.c create mode 100644 drivers/md/dm-vdo/completion.h create mode 100644 drivers/md/dm-vdo/cpu.h create mode 100644 drivers/md/dm-vdo/funnel-queue.c create mode 100644 drivers/md/dm-vdo/funnel-queue.h create mode 100644 drivers/md/dm-vdo/request-queue.c create mode 100644 drivers/md/dm-vdo/request-queue.h create mode 100644 drivers/md/dm-vdo/work-queue.c create mode 100644 drivers/md/dm-vdo/work-queue.h diff --git a/drivers/md/dm-vdo/completion.c b/drivers/md/dm-vdo/completion.c new file mode 100644 index 000000000000..e4b6ff74d098 --- /dev/null +++ b/drivers/md/dm-vdo/completion.c @@ -0,0 +1,141 @@ +// SPDX-License-Identifier: GPL-2.0-only +/* + * Copyright 2023 Red Hat + */ + +#include "completion.h" + +#include + +#include "logger.h" +#include "permassert.h" + +#include "status-codes.h" +#include "types.h" +#include "vio.h" +#include "vdo.h" + +/** + * DOC: vdo completions. + * + * Most of vdo's data structures are lock free, each either belonging to a single "zone," or + * divided into a number of zones whose accesses to the structure do not overlap. During normal + * operation, at most one thread will be operating in any given zone. Each zone has a + * vdo_work_queue which holds vdo_completions that are to be run in that zone. A completion may + * only be enqueued on one queue or operating in a single zone at a time. 
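+ *
+ * A typical completion is set up with vdo_initialize_completion(), given a callback, error
+ * handler, and thread with vdo_prepare_completion(), and then launched; each callback may
+ * prepare and launch it again for the next step until vdo_finish_completion() or
+ * vdo_fail_completion() ends the operation.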
+ * + * At each step of a multi-threaded operation, the completion performing the operation is given a + * callback, error handler, and thread id for the next step. A completion is "run" when it is + * operating on the correct thread (as specified by its callback_thread_id). If the value of its + * "result" field is an error (i.e. not VDO_SUCCESS), the function in its "error_handler" will be + * invoked. If the error_handler is NULL, or there is no error, the function set as its "callback" + * will be invoked. Generally, a completion will not be run directly, but rather will be + * "launched." In this case, it will check whether it is operating on the correct thread. If it is, + * it will run immediately. Otherwise, it will be enqueue on the vdo_work_queue associated with the + * completion's "callback_thread_id". When it is dequeued, it will be on the correct thread, and + * will get run. In some cases, the completion should get queued instead of running immediately, + * even if it is being launched from the correct thread. This is usually in cases where there is a + * long chain of callbacks, all on the same thread, which could overflow the stack. In such cases, + * the completion's "requeue" field should be set to true. Doing so will skip the current thread + * check and simply enqueue the completion. + * + * A completion may be "finished," in which case its "complete" field will be set to true before it + * is next run. It is a bug to attempt to set the result or re-finish a finished completion. + * Because a completion's fields are not safe to examine from any thread other than the one on + * which the completion is currently operating, this field is used only to aid in detecting + * programming errors. It can not be used for cross-thread checking on the status of an operation. + * A completion must be "reset" before it can be reused after it has been finished. Resetting will + * also clear any error from the result field. + **/ + +void vdo_initialize_completion(struct vdo_completion *completion, + struct vdo *vdo, + enum vdo_completion_type type) +{ + memset(completion, 0, sizeof(*completion)); + completion->vdo = vdo; + completion->type = type; + vdo_reset_completion(completion); +} + +static inline void assert_incomplete(struct vdo_completion *completion) +{ + ASSERT_LOG_ONLY(!completion->complete, "completion is not complete"); +} + +/** + * vdo_set_completion_result() - Set the result of a completion. + * + * Older errors will not be masked. + */ +void vdo_set_completion_result(struct vdo_completion *completion, int result) +{ + assert_incomplete(completion); + if (completion->result == VDO_SUCCESS) + completion->result = result; +} + +/** + * vdo_launch_completion_with_priority() - Run or enqueue a completion. + * @priority: The priority at which to enqueue the completion. + * + * If called on the correct thread (i.e. the one specified in the completion's callback_thread_id + * field) and not marked for requeue, the completion will be run immediately. Otherwise, the + * completion will be enqueued on the specified thread. + */ +void vdo_launch_completion_with_priority(struct vdo_completion *completion, + enum vdo_completion_priority priority) +{ + thread_id_t callback_thread = completion->callback_thread_id; + + if (completion->requeue || (callback_thread != vdo_get_callback_thread_id())) { + vdo_enqueue_completion(completion, priority); + return; + } + + vdo_run_completion(completion); +} + +/** vdo_finish_completion() - Mark a completion as complete and then launch it. 
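+ *
+ * Launching runs either the callback or, if the result is an error and an error handler is
+ * set, the error handler, on the completion's callback thread.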
*/ +void vdo_finish_completion(struct vdo_completion *completion) +{ + assert_incomplete(completion); + completion->complete = true; + if (completion->callback != NULL) + vdo_launch_completion(completion); +} + +void vdo_enqueue_completion(struct vdo_completion *completion, + enum vdo_completion_priority priority) +{ + struct vdo *vdo = completion->vdo; + thread_id_t thread_id = completion->callback_thread_id; + + if (ASSERT(thread_id < vdo->thread_config.thread_count, + "thread_id %u (completion type %d) is less than thread count %u", + thread_id, + completion->type, + vdo->thread_config.thread_count) != UDS_SUCCESS) + BUG(); + + completion->requeue = false; + completion->priority = priority; + completion->my_queue = NULL; + vdo_enqueue_work_queue(vdo->threads[thread_id].queue, completion); +} + +/** + * vdo_requeue_completion_if_needed() - Requeue a completion if not called on the specified thread. + * + * Return: True if the completion was requeued; callers may not access the completion in this case. + */ +bool vdo_requeue_completion_if_needed(struct vdo_completion *completion, + thread_id_t callback_thread_id) +{ + if (vdo_get_callback_thread_id() == callback_thread_id) + return false; + + completion->callback_thread_id = callback_thread_id; + vdo_enqueue_completion(completion, VDO_WORK_Q_DEFAULT_PRIORITY); + return true; +} diff --git a/drivers/md/dm-vdo/completion.h b/drivers/md/dm-vdo/completion.h new file mode 100644 index 000000000000..b81322aea253 --- /dev/null +++ b/drivers/md/dm-vdo/completion.h @@ -0,0 +1,155 @@ +/* SPDX-License-Identifier: GPL-2.0-only */ +/* + * Copyright 2023 Red Hat + */ + +#ifndef VDO_COMPLETION_H +#define VDO_COMPLETION_H + +#include "permassert.h" + +#include "status-codes.h" +#include "types.h" + +/** + * vdo_run_completion() - Run a completion's callback or error handler on the current thread. + * + * Context: This function must be called from the correct callback thread. + */ +static inline void vdo_run_completion(struct vdo_completion *completion) +{ + if ((completion->result != VDO_SUCCESS) && (completion->error_handler != NULL)) { + completion->error_handler(completion); + return; + } + + completion->callback(completion); +} + +void vdo_set_completion_result(struct vdo_completion *completion, int result); + +void vdo_initialize_completion(struct vdo_completion *completion, + struct vdo *vdo, + enum vdo_completion_type type); + +/** + * vdo_reset_completion() - Reset a completion to a clean state, while keeping the type, vdo and + * parent information. + */ +static inline void vdo_reset_completion(struct vdo_completion *completion) +{ + completion->result = VDO_SUCCESS; + completion->complete = false; +} + +void vdo_launch_completion_with_priority(struct vdo_completion *completion, + enum vdo_completion_priority priority); + +/** + * vdo_launch_completion() - Launch a completion with default priority. + */ +static inline void vdo_launch_completion(struct vdo_completion *completion) +{ + vdo_launch_completion_with_priority(completion, VDO_WORK_Q_DEFAULT_PRIORITY); +} + +/** + * vdo_continue_completion() - Continue processing a completion. + * @result: The current result (will not mask older errors). + * + * Continue processing a completion by setting the current result and calling + * vdo_launch_completion(). 
+ */ +static inline void vdo_continue_completion(struct vdo_completion *completion, int result) +{ + vdo_set_completion_result(completion, result); + vdo_launch_completion(completion); +} + +void vdo_finish_completion(struct vdo_completion *completion); + +/** + * vdo_fail_completion() - Set the result of a completion if it does not already have an error, + * then finish it. + */ +static inline void vdo_fail_completion(struct vdo_completion *completion, int result) +{ + vdo_set_completion_result(completion, result); + vdo_finish_completion(completion); +} + +/** + * vdo_assert_completion_type() - Assert that a completion is of the correct type. + * + * Return: VDO_SUCCESS or an error + */ +static inline int +vdo_assert_completion_type(struct vdo_completion *completion, enum vdo_completion_type expected) +{ + return ASSERT(expected == completion->type, + "completion type should be %u, not %u", + expected, + completion->type); +} + +static inline void vdo_set_completion_callback(struct vdo_completion *completion, + vdo_action *callback, + thread_id_t callback_thread_id) +{ + completion->callback = callback; + completion->callback_thread_id = callback_thread_id; +} + +/** + * vdo_launch_completion_callback() - Set the callback for a completion and launch it immediately. + */ +static inline void vdo_launch_completion_callback(struct vdo_completion *completion, + vdo_action *callback, + thread_id_t callback_thread_id) +{ + vdo_set_completion_callback(completion, callback, callback_thread_id); + vdo_launch_completion(completion); +} + +/** + * vdo_prepare_completion() - Prepare a completion for launch. + * + * Resets the completion, and then sets its callback, error handler, callback thread, and parent. + */ +static inline void vdo_prepare_completion(struct vdo_completion *completion, + vdo_action *callback, + vdo_action *error_handler, + thread_id_t callback_thread_id, + void *parent) +{ + vdo_reset_completion(completion); + vdo_set_completion_callback(completion, callback, callback_thread_id); + completion->error_handler = error_handler; + completion->parent = parent; +} + +/** + * vdo_prepare_completion_for_requeue() - Prepare a completion for launch ensuring that it will + * always be requeued. + * + * Resets the completion, and then sets its callback, error handler, callback thread, and parent. + */ +static inline void +vdo_prepare_completion_for_requeue(struct vdo_completion *completion, + vdo_action *callback, + vdo_action *error_handler, + thread_id_t callback_thread_id, + void *parent) +{ + vdo_prepare_completion(completion, callback, error_handler, callback_thread_id, parent); + completion->requeue = true; +} + +void vdo_enqueue_completion(struct vdo_completion *completion, + enum vdo_completion_priority priority); + + +bool vdo_requeue_completion_if_needed(struct vdo_completion *completion, + thread_id_t callback_thread_id); + +#endif /* VDO_COMPLETION_H */ diff --git a/drivers/md/dm-vdo/cpu.h b/drivers/md/dm-vdo/cpu.h new file mode 100644 index 000000000000..a538a4934e4b --- /dev/null +++ b/drivers/md/dm-vdo/cpu.h @@ -0,0 +1,58 @@ +/* SPDX-License-Identifier: GPL-2.0-only */ +/* + * Copyright 2023 Red Hat + */ + +#ifndef UDS_CPU_H +#define UDS_CPU_H + +#include + +/** + * uds_prefetch_address() - Minimize cache-miss latency by attempting to move data into a CPU cache + * before it is accessed. 
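+ * (For example, uds_funnel_queue_poll() in funnel-queue.c prefetches the next queue entry as
+ * soon as the current one is dequeued, since a light-weight work item is likely to be followed
+ * almost immediately by a request for the next one.)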
+ * + * @address: the address to fetch (may be invalid) + * @for_write: must be constant at compile time--false if for reading, true if for writing + */ +static inline void uds_prefetch_address(const void *address, bool for_write) +{ + /* + * for_write won't be a constant if we are compiled with optimization turned off, in which + * case prefetching really doesn't matter. clang can't figure out that if for_write is a + * constant, it can be passed as the second, mandatorily constant argument to prefetch(), + * at least currently on llvm 12. + */ + if (__builtin_constant_p(for_write)) { + if (for_write) + __builtin_prefetch(address, true); + else + __builtin_prefetch(address, false); + } +} + +/** + * uds_prefetch_range() - Minimize cache-miss latency by attempting to move a range of addresses + * into a CPU cache before they are accessed. + * + * @start: the starting address to fetch (may be invalid) + * @size: the number of bytes in the address range + * @for_write: must be constant at compile time--false if for reading, true if for writing + */ +static inline void uds_prefetch_range(const void *start, unsigned int size, bool for_write) +{ + /* + * Count the number of cache lines to fetch, allowing for the address range to span an + * extra cache line boundary due to address alignment. + */ + const char *address = (const char *) start; + unsigned int offset = ((uintptr_t) address % L1_CACHE_BYTES); + unsigned int cache_lines = (1 + ((size + offset) / L1_CACHE_BYTES)); + + while (cache_lines-- > 0) { + uds_prefetch_address(address, for_write); + address += L1_CACHE_BYTES; + } +} + +#endif /* UDS_CPU_H */ diff --git a/drivers/md/dm-vdo/funnel-queue.c b/drivers/md/dm-vdo/funnel-queue.c new file mode 100644 index 000000000000..bf7f0d8bc04d --- /dev/null +++ b/drivers/md/dm-vdo/funnel-queue.c @@ -0,0 +1,169 @@ +// SPDX-License-Identifier: GPL-2.0-only +/* + * Copyright 2023 Red Hat + */ + +#include "funnel-queue.h" + +#include "cpu.h" +#include "memory-alloc.h" +#include "permassert.h" +#include "uds.h" + +int uds_make_funnel_queue(struct funnel_queue **queue_ptr) +{ + int result; + struct funnel_queue *queue; + + result = UDS_ALLOCATE(1, struct funnel_queue, "funnel queue", &queue); + if (result != UDS_SUCCESS) + return result; + + /* + * Initialize the stub entry and put it in the queue, establishing the invariant that + * queue->newest and queue->oldest are never null. + */ + queue->stub.next = NULL; + queue->newest = &queue->stub; + queue->oldest = &queue->stub; + + *queue_ptr = queue; + return UDS_SUCCESS; +} + +void uds_free_funnel_queue(struct funnel_queue *queue) +{ + UDS_FREE(queue); +} + +static struct funnel_queue_entry *get_oldest(struct funnel_queue *queue) +{ + /* + * Barrier requirements: We need a read barrier between reading a "next" field pointer + * value and reading anything it points to. There's an accompanying barrier in + * uds_funnel_queue_put() between its caller setting up the entry and making it visible. + */ + struct funnel_queue_entry *oldest = queue->oldest; + struct funnel_queue_entry *next = READ_ONCE(oldest->next); + + if (oldest == &queue->stub) { + /* + * When the oldest entry is the stub and it has no successor, the queue is + * logically empty. + */ + if (next == NULL) + return NULL; + /* + * The stub entry has a successor, so the stub can be dequeued and ignored without + * breaking the queue invariants. + */ + oldest = next; + queue->oldest = oldest; + next = READ_ONCE(oldest->next); + } + + /* + * We have a non-stub candidate to dequeue. 
If it lacks a successor, we'll need to put the + * stub entry back on the queue first. + */ + if (next == NULL) { + struct funnel_queue_entry *newest = READ_ONCE(queue->newest); + + if (oldest != newest) + /* + * Another thread has already swung queue->newest atomically, but not yet + * assigned previous->next. The queue is really still empty. + */ + return NULL; + + /* + * Put the stub entry back on the queue, ensuring a successor will eventually be + * seen. + */ + uds_funnel_queue_put(queue, &queue->stub); + + /* Check again for a successor. */ + next = READ_ONCE(oldest->next); + if (next == NULL) + /* + * We lost a race with a producer who swapped queue->newest before we did, + * but who hasn't yet updated previous->next. Try again later. + */ + return NULL; + } + + return oldest; +} + +/* + * Poll a queue, removing the oldest entry if the queue is not empty. This function must only be + * called from a single consumer thread. + */ +struct funnel_queue_entry *uds_funnel_queue_poll(struct funnel_queue *queue) +{ + struct funnel_queue_entry *oldest = get_oldest(queue); + + if (oldest == NULL) + return oldest; + + /* + * Dequeue the oldest entry and return it. Only one consumer thread may call this function, + * so no locking, atomic operations, or fences are needed; queue->oldest is owned by the + * consumer and oldest->next is never used by a producer thread after it is swung from NULL + * to non-NULL. + */ + queue->oldest = READ_ONCE(oldest->next); + /* + * Make sure the caller sees the proper stored data for this entry. Since we've already + * fetched the entry pointer we stored in "queue->oldest", this also ensures that on entry + * to the next call we'll properly see the dependent data. + */ + smp_rmb(); + /* + * If "oldest" is a very light-weight work item, we'll be looking for the next one very + * soon, so prefetch it now. + */ + uds_prefetch_address(queue->oldest, true); + WRITE_ONCE(oldest->next, NULL); + return oldest; +} + +/* + * Check whether the funnel queue is empty or not. If the queue is in a transition state with one + * or more entries being added such that the list view is incomplete, this function will report the + * queue as empty. + */ +bool uds_is_funnel_queue_empty(struct funnel_queue *queue) +{ + return get_oldest(queue) == NULL; +} + +/* + * Check whether the funnel queue is idle or not. If the queue has entries available to be + * retrieved, it is not idle. If the queue is in a transition state with one or more entries being + * added such that the list view is incomplete, it may not be possible to retrieve an entry with + * the uds_funnel_queue_poll() function, but the queue will not be considered idle. + */ +bool uds_is_funnel_queue_idle(struct funnel_queue *queue) +{ + /* + * Oldest is not the stub, so there's another entry, though if next is NULL we can't + * retrieve it yet. + */ + if (queue->oldest != &queue->stub) + return false; + + /* + * Oldest is the stub, but newest has been updated by _put(); either there's another, + * retrievable entry in the list, or the list is officially empty but in the intermediate + * state of having an entry added. + * + * Whether anything is retrievable depends on whether stub.next has been updated and become + * visible to us, but for idleness we don't care. And due to memory ordering in _put(), the + * update to newest would be visible to us at the same time or sooner. 
+ */ + if (READ_ONCE(queue->newest) != &queue->stub) + return false; + + return true; +} diff --git a/drivers/md/dm-vdo/funnel-queue.h b/drivers/md/dm-vdo/funnel-queue.h new file mode 100644 index 000000000000..275dd4ac29a4 --- /dev/null +++ b/drivers/md/dm-vdo/funnel-queue.h @@ -0,0 +1,110 @@ +/* SPDX-License-Identifier: GPL-2.0-only */ +/* + * Copyright 2023 Red Hat + */ + +#ifndef UDS_FUNNEL_QUEUE_H +#define UDS_FUNNEL_QUEUE_H + +#include +#include + +/* + * A funnel queue is a simple (almost) lock-free queue that accepts entries from multiple threads + * (multi-producer) and delivers them to a single thread (single-consumer). "Funnel" is an attempt + * to evoke the image of requests from more than one producer being "funneled down" to a single + * consumer. + * + * This is an unsynchronized but thread-safe data structure when used as intended. There is no + * mechanism to ensure that only one thread is consuming from the queue. If more than one thread + * attempts to consume from the queue, the resulting behavior is undefined. Clients must not + * directly access or manipulate the internals of the queue, which are only exposed for the purpose + * of allowing the very simple enqueue operation to be inlined. + * + * The implementation requires that a funnel_queue_entry structure (a link pointer) is embedded in + * the queue entries, and pointers to those structures are used exclusively by the queue. No macros + * are defined to template the queue, so the offset of the funnel_queue_entry in the records placed + * in the queue must all be the same so the client can derive their structure pointer from the + * entry pointer returned by uds_funnel_queue_poll(). + * + * Callers are wholly responsible for allocating and freeing the entries. Entries may be freed as + * soon as they are returned since this queue is not susceptible to the "ABA problem" present in + * many lock-free data structures. The queue is dynamically allocated to ensure cache-line + * alignment, but no other dynamic allocation is used. + * + * The algorithm is not actually 100% lock-free. There is a single point in uds_funnel_queue_put() + * at which a preempted producer will prevent the consumers from seeing items added to the queue by + * later producers, and only if the queue is short enough or the consumer fast enough for it to + * reach what was the end of the queue at the time of the preemption. + * + * The consumer function, uds_funnel_queue_poll(), will return NULL when the queue is empty. To + * wait for data to consume, spin (if safe) or combine the queue with a struct event_count to + * signal the presence of new entries. + */ + +/* This queue link structure must be embedded in client entries. */ +struct funnel_queue_entry { + /* The next (newer) entry in the queue. */ + struct funnel_queue_entry *next; +}; + +/* + * The dynamically allocated queue structure, which is allocated on a cache line boundary so the + * producer and consumer fields in the structure will land on separate cache lines. This should be + * consider opaque but it is exposed here so uds_funnel_queue_put() can be inlined. + */ +struct __aligned(L1_CACHE_BYTES) funnel_queue { + /* + * The producers' end of the queue, an atomically exchanged pointer that will never be + * NULL. + */ + struct funnel_queue_entry *newest; + + /* The consumer's end of the queue, which is owned by the consumer and never NULL. */ + struct funnel_queue_entry *oldest __aligned(L1_CACHE_BYTES); + + /* A dummy entry used to provide the non-NULL invariants above. 
*/ + struct funnel_queue_entry stub; +}; + +int __must_check uds_make_funnel_queue(struct funnel_queue **queue_ptr); + +void uds_free_funnel_queue(struct funnel_queue *queue); + +/* + * Put an entry on the end of the queue. + * + * The entry pointer must be to the struct funnel_queue_entry embedded in the caller's data + * structure. The caller must be able to derive the address of the start of their data structure + * from the pointer that passed in here, so every entry in the queue must have the struct + * funnel_queue_entry at the same offset within the client's structure. + */ +static inline void +uds_funnel_queue_put(struct funnel_queue *queue, struct funnel_queue_entry *entry) +{ + struct funnel_queue_entry *previous; + + /* + * Barrier requirements: All stores relating to the entry ("next" pointer, containing data + * structure fields) must happen before the previous->next store making it visible to the + * consumer. Also, the entry's "next" field initialization to NULL must happen before any + * other producer threads can see the entry (the xchg) and try to update the "next" field. + * + * xchg implements a full barrier. + */ + WRITE_ONCE(entry->next, NULL); + previous = xchg(&queue->newest, entry); + /* + * Preemptions between these two statements hide the rest of the queue from the consumer, + * preventing consumption until the following assignment runs. + */ + WRITE_ONCE(previous->next, entry); +} + +struct funnel_queue_entry *__must_check uds_funnel_queue_poll(struct funnel_queue *queue); + +bool __must_check uds_is_funnel_queue_empty(struct funnel_queue *queue); + +bool __must_check uds_is_funnel_queue_idle(struct funnel_queue *queue); + +#endif /* UDS_FUNNEL_QUEUE_H */ diff --git a/drivers/md/dm-vdo/request-queue.c b/drivers/md/dm-vdo/request-queue.c new file mode 100644 index 000000000000..1c60b09027a8 --- /dev/null +++ b/drivers/md/dm-vdo/request-queue.c @@ -0,0 +1,284 @@ +// SPDX-License-Identifier: GPL-2.0-only +/* + * Copyright 2023 Red Hat + */ + +#include "request-queue.h" + +#include +#include +#include + +#include "funnel-queue.h" +#include "logger.h" +#include "memory-alloc.h" +#include "uds-threads.h" + +/* + * This queue will attempt to handle requests in reasonably sized batches instead of reacting + * immediately to each new request. The wait time between batches is dynamically adjusted up or + * down to try to balance responsiveness against wasted thread run time. + * + * If the wait time becomes long enough, the queue will become dormant and must be explicitly + * awoken when a new request is enqueued. The enqueue operation updates "newest" in the funnel + * queue via xchg (which is a memory barrier), and later checks "dormant" to decide whether to do a + * wakeup of the worker thread. + * + * When deciding to go to sleep, the worker thread sets "dormant" and then examines "newest" to + * decide if the funnel queue is idle. In dormant mode, the last examination of "newest" before + * going to sleep is done inside the wait_event_interruptible() macro, after a point where one or + * more memory barriers have been issued. (Preparing to sleep uses spin locks.) Even if the funnel + * queue's "next" field update isn't visible yet to make the entry accessible, its existence will + * kick the worker thread out of dormant mode and back into timer-based mode. + * + * Unbatched requests are used to communicate between different zone threads and will also cause + * the queue to awaken immediately. 
+ */ + +enum { + NANOSECOND = 1, + MICROSECOND = 1000 * NANOSECOND, + MILLISECOND = 1000 * MICROSECOND, + DEFAULT_WAIT_TIME = 20 * MICROSECOND, + MINIMUM_WAIT_TIME = DEFAULT_WAIT_TIME / 2, + MAXIMUM_WAIT_TIME = MILLISECOND, + MINIMUM_BATCH = 32, + MAXIMUM_BATCH = 64, +}; + +struct uds_request_queue { + /* Wait queue for synchronizing producers and consumer */ + struct wait_queue_head wait_head; + /* Function to process a request */ + uds_request_queue_processor_t *processor; + /* Queue of new incoming requests */ + struct funnel_queue *main_queue; + /* Queue of old requests to retry */ + struct funnel_queue *retry_queue; + /* The thread id of the worker thread */ + struct thread *thread; + /* True if the worker was started */ + bool started; + /* When true, requests can be enqueued */ + bool running; + /* A flag set when the worker is waiting without a timeout */ + atomic_t dormant; +}; + +static inline struct uds_request *poll_queues(struct uds_request_queue *queue) +{ + struct funnel_queue_entry *entry; + + entry = uds_funnel_queue_poll(queue->retry_queue); + if (entry != NULL) + return container_of(entry, struct uds_request, queue_link); + + entry = uds_funnel_queue_poll(queue->main_queue); + if (entry != NULL) + return container_of(entry, struct uds_request, queue_link); + + return NULL; +} + +static inline bool are_queues_idle(struct uds_request_queue *queue) +{ + return uds_is_funnel_queue_idle(queue->retry_queue) && + uds_is_funnel_queue_idle(queue->main_queue); +} + +/* + * Determine if there is a next request to process, and return it if there is. Also return flags + * indicating whether the worker thread can sleep (for the use of wait_event() macros) and whether + * the thread did sleep before returning a new request. + */ +static inline bool dequeue_request(struct uds_request_queue *queue, + struct uds_request **request_ptr, + bool *waited_ptr) +{ + struct uds_request *request = poll_queues(queue); + + if (request != NULL) { + *request_ptr = request; + return true; + } + + if (!READ_ONCE(queue->running)) { + /* Wake the worker thread so it can exit. */ + *request_ptr = NULL; + return true; + } + + *request_ptr = NULL; + *waited_ptr = true; + return false; +} + +static void wait_for_request(struct uds_request_queue *queue, + bool dormant, + unsigned long timeout, + struct uds_request **request, + bool *waited) +{ + if (dormant) { + wait_event_interruptible(queue->wait_head, + (dequeue_request(queue, request, waited) || + !are_queues_idle(queue))); + return; + } + + wait_event_interruptible_hrtimeout(queue->wait_head, + dequeue_request(queue, request, waited), + ns_to_ktime(timeout)); +} + +static void request_queue_worker(void *arg) +{ + struct uds_request_queue *queue = (struct uds_request_queue *) arg; + struct uds_request *request = NULL; + unsigned long time_batch = DEFAULT_WAIT_TIME; + bool dormant = atomic_read(&queue->dormant); + bool waited = false; + long current_batch = 0; + + for (;;) { + wait_for_request(queue, dormant, time_batch, &request, &waited); + if (likely(request != NULL)) { + current_batch++; + queue->processor(request); + } else if (!READ_ONCE(queue->running)) { + break; + } + + if (dormant) { + /* + * The queue has been roused from dormancy. Clear the flag so enqueuers can + * stop broadcasting. No fence is needed for this transition. + */ + atomic_set(&queue->dormant, false); + dormant = false; + time_batch = DEFAULT_WAIT_TIME; + } else if (waited) { + /* + * We waited for this request to show up. Adjust the wait time to smooth + * out the batch size. 
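+ *
+ * (To illustrate the scale of the adjustment with the constants defined above: the wait
+ * starts at 20 microseconds and each undersized batch lengthens it by about 25%, so roughly
+ * eighteen consecutive undersized batches are needed to push it past the one millisecond
+ * ceiling and drop the queue into dormant mode. Oversized batches shorten the wait again,
+ * but never below 10 microseconds.)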
+ */ + if (current_batch < MINIMUM_BATCH) { + /* + * If the last batch of requests was too small, increase the wait + * time. + */ + time_batch += time_batch / 4; + if (time_batch >= MAXIMUM_WAIT_TIME) { + atomic_set(&queue->dormant, true); + dormant = true; + } + } else if (current_batch > MAXIMUM_BATCH) { + /* + * If the last batch of requests was too large, decrease the wait + * time. + */ + time_batch -= time_batch / 4; + if (time_batch < MINIMUM_WAIT_TIME) + time_batch = MINIMUM_WAIT_TIME; + } + current_batch = 0; + } + } + + /* + * Ensure that we process any remaining requests that were enqueued before trying to shut + * down. The corresponding write barrier is in uds_request_queue_finish(). + */ + smp_rmb(); + while ((request = poll_queues(queue)) != NULL) + queue->processor(request); +} + +int uds_make_request_queue(const char *queue_name, + uds_request_queue_processor_t *processor, + struct uds_request_queue **queue_ptr) +{ + int result; + struct uds_request_queue *queue; + + result = UDS_ALLOCATE(1, struct uds_request_queue, __func__, &queue); + if (result != UDS_SUCCESS) + return result; + + queue->processor = processor; + queue->running = true; + atomic_set(&queue->dormant, false); + init_waitqueue_head(&queue->wait_head); + + result = uds_make_funnel_queue(&queue->main_queue); + if (result != UDS_SUCCESS) { + uds_request_queue_finish(queue); + return result; + } + + result = uds_make_funnel_queue(&queue->retry_queue); + if (result != UDS_SUCCESS) { + uds_request_queue_finish(queue); + return result; + } + + result = uds_create_thread(request_queue_worker, queue, queue_name, &queue->thread); + if (result != UDS_SUCCESS) { + uds_request_queue_finish(queue); + return result; + } + + queue->started = true; + *queue_ptr = queue; + return UDS_SUCCESS; +} + +static inline void wake_up_worker(struct uds_request_queue *queue) +{ + if (wq_has_sleeper(&queue->wait_head)) + wake_up(&queue->wait_head); +} + +void uds_request_queue_enqueue(struct uds_request_queue *queue, struct uds_request *request) +{ + struct funnel_queue *sub_queue; + bool unbatched = request->unbatched; + + sub_queue = request->requeued ? queue->retry_queue : queue->main_queue; + uds_funnel_queue_put(sub_queue, &request->queue_link); + + /* + * We must wake the worker thread when it is dormant. A read fence isn't needed here since + * we know the queue operation acts as one. + */ + if (atomic_read(&queue->dormant) || unbatched) + wake_up_worker(queue); +} + +void uds_request_queue_finish(struct uds_request_queue *queue) +{ + int result; + + if (queue == NULL) + return; + + /* + * This memory barrier ensures that any requests we queued will be seen. The point is that + * when dequeue_request() sees the following update to the running flag, it will also be + * able to see any change we made to a next field in the funnel queue entry. The + * corresponding read barrier is in request_queue_worker(). 
+ */ + smp_wmb(); + WRITE_ONCE(queue->running, false); + + if (queue->started) { + wake_up_worker(queue); + result = uds_join_threads(queue->thread); + if (result != UDS_SUCCESS) + uds_log_warning_strerror(result, "Failed to join worker thread"); + } + + uds_free_funnel_queue(queue->main_queue); + uds_free_funnel_queue(queue->retry_queue); + UDS_FREE(queue); +} diff --git a/drivers/md/dm-vdo/request-queue.h b/drivers/md/dm-vdo/request-queue.h new file mode 100644 index 000000000000..818cb14135fd --- /dev/null +++ b/drivers/md/dm-vdo/request-queue.h @@ -0,0 +1,30 @@ +/* SPDX-License-Identifier: GPL-2.0-only */ +/* + * Copyright 2023 Red Hat + */ + +#ifndef UDS_REQUEST_QUEUE_H +#define UDS_REQUEST_QUEUE_H + +#include "uds.h" + +/* + * A simple request queue which will handle new requests in the order in which they are received, + * and will attempt to handle requeued requests before new ones. However, the nature of the + * implementation means that it cannot guarantee this ordering; the prioritization is merely a + * hint. + */ + +struct uds_request_queue; + +typedef void uds_request_queue_processor_t(struct uds_request *); + +int __must_check uds_make_request_queue(const char *queue_name, + uds_request_queue_processor_t *processor, + struct uds_request_queue **queue_ptr); + +void uds_request_queue_enqueue(struct uds_request_queue *queue, struct uds_request *request); + +void uds_request_queue_finish(struct uds_request_queue *queue); + +#endif /* UDS_REQUEST_QUEUE_H */ diff --git a/drivers/md/dm-vdo/work-queue.c b/drivers/md/dm-vdo/work-queue.c new file mode 100644 index 000000000000..6adf07fc74e1 --- /dev/null +++ b/drivers/md/dm-vdo/work-queue.c @@ -0,0 +1,659 @@ +// SPDX-License-Identifier: GPL-2.0-only +/* + * Copyright 2023 Red Hat + */ + +#include "work-queue.h" + +#include +#include +#include +#include +#include +#include + +#include "funnel-queue.h" +#include "logger.h" +#include "memory-alloc.h" +#include "numeric.h" +#include "permassert.h" +#include "string-utils.h" + +#include "completion.h" +#include "status-codes.h" + +static DEFINE_PER_CPU(unsigned int, service_queue_rotor); + +/** + * DOC: Work queue definition. + * + * There are two types of work queues: simple, with one worker thread, and round-robin, which uses + * a group of the former to do the work, and assigns work to them in round-robin fashion (roughly). + * Externally, both are represented via the same common sub-structure, though there's actually not + * a great deal of overlap between the two types internally. + */ +struct vdo_work_queue { + /* Name of just the work queue (e.g., "cpuQ12") */ + char *name; + bool round_robin_mode; + struct vdo_thread *owner; + /* Life cycle functions, etc */ + const struct vdo_work_queue_type *type; +}; + +struct simple_work_queue { + struct vdo_work_queue common; + struct funnel_queue *priority_lists[VDO_WORK_Q_MAX_PRIORITY + 1]; + void *private; + + /* + * The fields above are unchanged after setup but often read, and are good candidates for + * caching -- and if the max priority is 2, just fit in one x86-64 cache line if aligned. + * The fields below are often modified as we sleep and wake, so we want a separate cache + * line for performance. + */ + + /* Any (0 or 1) worker threads waiting for new work to do */ + wait_queue_head_t waiting_worker_threads ____cacheline_aligned; + /* Hack to reduce wakeup calls if the worker thread is running */ + atomic_t idle; + + /* These are infrequently used so in terms of performance we don't care where they land. 
*/ + struct task_struct *thread; + /* Notify creator once worker has initialized */ + struct completion *started; +}; + +struct round_robin_work_queue { + struct vdo_work_queue common; + struct simple_work_queue **service_queues; + unsigned int num_service_queues; +}; + +static inline struct simple_work_queue *as_simple_work_queue(struct vdo_work_queue *queue) +{ + return ((queue == NULL) ? NULL : container_of(queue, struct simple_work_queue, common)); +} + +static inline struct round_robin_work_queue * +as_round_robin_work_queue(struct vdo_work_queue *queue) +{ + return ((queue == NULL) ? + NULL : + container_of(queue, struct round_robin_work_queue, common)); +} + +/* Processing normal completions. */ + +/* + * Dequeue and return the next waiting completion, if any. + * + * We scan the funnel queues from highest priority to lowest, once; there is therefore a race + * condition where a high-priority completion can be enqueued followed by a lower-priority one, and + * we'll grab the latter (but we'll catch the high-priority item on the next call). If strict + * enforcement of priorities becomes necessary, this function will need fixing. + */ +static struct vdo_completion *poll_for_completion(struct simple_work_queue *queue) +{ + int i; + + for (i = queue->common.type->max_priority; i >= 0; i--) { + struct funnel_queue_entry *link = uds_funnel_queue_poll(queue->priority_lists[i]); + + if (link != NULL) + return container_of(link, struct vdo_completion, work_queue_entry_link); + } + + return NULL; +} + +static void +enqueue_work_queue_completion(struct simple_work_queue *queue, struct vdo_completion *completion) +{ + ASSERT_LOG_ONLY(completion->my_queue == NULL, + "completion %px (fn %px) to enqueue (%px) is not already queued (%px)", + completion, + completion->callback, + queue, + completion->my_queue); + if (completion->priority == VDO_WORK_Q_DEFAULT_PRIORITY) + completion->priority = queue->common.type->default_priority; + + if (ASSERT(completion->priority <= queue->common.type->max_priority, + "priority is in range for queue") != VDO_SUCCESS) + completion->priority = 0; + + completion->my_queue = &queue->common; + + /* Funnel queue handles the synchronization for the put. */ + uds_funnel_queue_put(queue->priority_lists[completion->priority], + &completion->work_queue_entry_link); + + /* + * Due to how funnel queue synchronization is handled (just atomic operations), the + * simplest safe implementation here would be to wake-up any waiting threads after + * enqueueing each item. Even if the funnel queue is not empty at the time of adding an + * item to the queue, the consumer thread may not see this since it is not guaranteed to + * have the same view of the queue as a producer thread. + * + * However, the above is wasteful so instead we attempt to minimize the number of thread + * wakeups. Using an idle flag, and careful ordering using memory barriers, we should be + * able to determine when the worker thread might be asleep or going to sleep. We use + * cmpxchg to try to take ownership (vs other producer threads) of the responsibility for + * waking the worker thread, so multiple wakeups aren't tried at once. + * + * This was tuned for some x86 boxes that were handy; it's untested whether doing the read + * first is any better or worse for other platforms, even other x86 configurations. + */ + smp_mb(); + if ((atomic_read(&queue->idle) != 1) || (atomic_cmpxchg(&queue->idle, 1, 0) != 1)) + return; + + /* There's a maximum of one thread in this list. 
*/ + wake_up(&queue->waiting_worker_threads); +} + +static void run_start_hook(struct simple_work_queue *queue) +{ + if (queue->common.type->start != NULL) + queue->common.type->start(queue->private); +} + +static void run_finish_hook(struct simple_work_queue *queue) +{ + if (queue->common.type->finish != NULL) + queue->common.type->finish(queue->private); +} + +/* + * Wait for the next completion to process, or until kthread_should_stop indicates that it's time + * for us to shut down. + * + * If kthread_should_stop says it's time to stop but we have pending completions return a + * completion. + * + * Also update statistics relating to scheduler interactions. + */ +static struct vdo_completion *wait_for_next_completion(struct simple_work_queue *queue) +{ + struct vdo_completion *completion; + DEFINE_WAIT(wait); + + while (true) { + prepare_to_wait(&queue->waiting_worker_threads, &wait, TASK_INTERRUPTIBLE); + /* + * Don't set the idle flag until a wakeup will not be lost. + * + * Force synchronization between setting the idle flag and checking the funnel + * queue; the producer side will do them in the reverse order. (There's still a + * race condition we've chosen to allow, because we've got a timeout below that + * unwedges us if we hit it, but this may narrow the window a little.) + */ + atomic_set(&queue->idle, 1); + smp_mb(); /* store-load barrier between "idle" and funnel queue */ + + completion = poll_for_completion(queue); + if (completion != NULL) + break; + + /* + * We need to check for thread-stop after setting TASK_INTERRUPTIBLE state up + * above. Otherwise, schedule() will put the thread to sleep and might miss a + * wakeup from kthread_stop() call in vdo_finish_work_queue(). + */ + if (kthread_should_stop()) + break; + + schedule(); + + /* + * Most of the time when we wake, it should be because there's work to do. If it + * was a spurious wakeup, continue looping. + */ + completion = poll_for_completion(queue); + if (completion != NULL) + break; + } + + finish_wait(&queue->waiting_worker_threads, &wait); + atomic_set(&queue->idle, 0); + + return completion; +} + +static void process_completion(struct simple_work_queue *queue, struct vdo_completion *completion) +{ + if (ASSERT(completion->my_queue == &queue->common, + "completion %px from queue %px marked as being in this queue (%px)", + completion, + queue, + completion->my_queue) == UDS_SUCCESS) + completion->my_queue = NULL; + + vdo_run_completion(completion); +} + +static void service_work_queue(struct simple_work_queue *queue) +{ + run_start_hook(queue); + + while (true) { + struct vdo_completion *completion = poll_for_completion(queue); + + if (completion == NULL) + completion = wait_for_next_completion(queue); + + if (completion == NULL) + /* No completions but kthread_should_stop() was triggered. */ + break; + + process_completion(queue, completion); + + /* + * Be friendly to a CPU that has other work to do, if the kernel has told us to. + * This speeds up some performance tests; that "other work" might include other VDO + * threads. 
+ */ + if (need_resched()) + cond_resched(); + } + + run_finish_hook(queue); +} + +static int work_queue_runner(void *ptr) +{ + struct simple_work_queue *queue = ptr; + + complete(queue->started); + service_work_queue(queue); + return 0; +} + +/* Creation & teardown */ + +static void free_simple_work_queue(struct simple_work_queue *queue) +{ + unsigned int i; + + for (i = 0; i <= VDO_WORK_Q_MAX_PRIORITY; i++) + uds_free_funnel_queue(queue->priority_lists[i]); + UDS_FREE(queue->common.name); + UDS_FREE(queue); +} + +static void free_round_robin_work_queue(struct round_robin_work_queue *queue) +{ + struct simple_work_queue **queue_table = queue->service_queues; + unsigned int count = queue->num_service_queues; + unsigned int i; + + queue->service_queues = NULL; + + for (i = 0; i < count; i++) + free_simple_work_queue(queue_table[i]); + UDS_FREE(queue_table); + UDS_FREE(queue->common.name); + UDS_FREE(queue); +} + +void vdo_free_work_queue(struct vdo_work_queue *queue) +{ + if (queue == NULL) + return; + + vdo_finish_work_queue(queue); + + if (queue->round_robin_mode) + free_round_robin_work_queue(as_round_robin_work_queue(queue)); + else + free_simple_work_queue(as_simple_work_queue(queue)); +} + +static int make_simple_work_queue(const char *thread_name_prefix, + const char *name, + struct vdo_thread *owner, + void *private, + const struct vdo_work_queue_type *type, + struct simple_work_queue **queue_ptr) +{ + DECLARE_COMPLETION_ONSTACK(started); + struct simple_work_queue *queue; + int i; + struct task_struct *thread = NULL; + int result; + + ASSERT_LOG_ONLY((type->max_priority <= VDO_WORK_Q_MAX_PRIORITY), + "queue priority count %u within limit %u", + type->max_priority, + VDO_WORK_Q_MAX_PRIORITY); + + result = UDS_ALLOCATE(1, struct simple_work_queue, "simple work queue", &queue); + if (result != UDS_SUCCESS) + return result; + + queue->private = private; + queue->started = &started; + queue->common.type = type; + queue->common.owner = owner; + init_waitqueue_head(&queue->waiting_worker_threads); + + result = uds_duplicate_string(name, "queue name", &queue->common.name); + if (result != VDO_SUCCESS) { + UDS_FREE(queue); + return -ENOMEM; + } + + for (i = 0; i <= type->max_priority; i++) { + result = uds_make_funnel_queue(&queue->priority_lists[i]); + if (result != UDS_SUCCESS) { + free_simple_work_queue(queue); + return result; + } + } + + thread = kthread_run(work_queue_runner, + queue, + "%s:%s", + thread_name_prefix, + queue->common.name); + if (IS_ERR(thread)) { + free_simple_work_queue(queue); + return (int) PTR_ERR(thread); + } + + queue->thread = thread; + + /* + * If we don't wait to ensure the thread is running VDO code, a quick kthread_stop (due to + * errors elsewhere) could cause it to never get as far as running VDO, skipping the + * cleanup code. + * + * Eventually we should just make that path safe too, and then we won't need this + * synchronization. + */ + wait_for_completion(&started); + + *queue_ptr = queue; + return UDS_SUCCESS; +} + +/** + * vdo_make_work_queue() - Create a work queue; if multiple threads are requested, completions will + * be distributed to them in round-robin fashion. + * + * Each queue is associated with a struct vdo_thread which has a single vdo thread id. Regardless + * of the actual number of queues and threads allocated here, code outside of the queue + * implementation will treat this as a single zone. 
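+ *
+ * (A purely illustrative call with hypothetical names: a caller might create a four-thread
+ * round-robin queue for CPU-intensive work with
+ *
+ *   result = vdo_make_work_queue("vdo0", "cpu", owner, &cpu_queue_type, 4, NULL, &queue);
+ *
+ * where cpu_queue_type is a struct vdo_work_queue_type supplying the priority range and any
+ * start/finish hooks, and would then submit work with vdo_enqueue_work_queue(queue, completion).)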
+ */ +int vdo_make_work_queue(const char *thread_name_prefix, + const char *name, + struct vdo_thread *owner, + const struct vdo_work_queue_type *type, + unsigned int thread_count, + void *thread_privates[], + struct vdo_work_queue **queue_ptr) +{ + struct round_robin_work_queue *queue; + int result; + char thread_name[TASK_COMM_LEN]; + unsigned int i; + + if (thread_count == 1) { + struct simple_work_queue *simple_queue; + void *context = ((thread_privates != NULL) ? thread_privates[0] : NULL); + + result = make_simple_work_queue(thread_name_prefix, + name, + owner, + context, + type, + &simple_queue); + if (result == VDO_SUCCESS) + *queue_ptr = &simple_queue->common; + return result; + } + + result = UDS_ALLOCATE(1, struct round_robin_work_queue, "round-robin work queue", &queue); + if (result != UDS_SUCCESS) + return result; + + result = UDS_ALLOCATE(thread_count, + struct simple_work_queue *, + "subordinate work queues", + &queue->service_queues); + if (result != UDS_SUCCESS) { + UDS_FREE(queue); + return result; + } + + queue->num_service_queues = thread_count; + queue->common.round_robin_mode = true; + queue->common.owner = owner; + + result = uds_duplicate_string(name, "queue name", &queue->common.name); + if (result != VDO_SUCCESS) { + UDS_FREE(queue->service_queues); + UDS_FREE(queue); + return -ENOMEM; + } + + *queue_ptr = &queue->common; + + for (i = 0; i < thread_count; i++) { + void *context = ((thread_privates != NULL) ? thread_privates[i] : NULL); + + snprintf(thread_name, sizeof(thread_name), "%s%u", name, i); + result = make_simple_work_queue(thread_name_prefix, + thread_name, + owner, + context, + type, + &queue->service_queues[i]); + if (result != VDO_SUCCESS) { + queue->num_service_queues = i; + /* Destroy previously created subordinates. */ + vdo_free_work_queue(UDS_FORGET(*queue_ptr)); + return result; + } + } + + return VDO_SUCCESS; +} + +static void finish_simple_work_queue(struct simple_work_queue *queue) +{ + if (queue->thread == NULL) + return; + + /* Tells the worker thread to shut down and waits for it to exit. */ + kthread_stop(queue->thread); + queue->thread = NULL; +} + +static void finish_round_robin_work_queue(struct round_robin_work_queue *queue) +{ + struct simple_work_queue **queue_table = queue->service_queues; + unsigned int count = queue->num_service_queues; + unsigned int i; + + for (i = 0; i < count; i++) + finish_simple_work_queue(queue_table[i]); +} + +/* No enqueueing of completions should be done once this function is called. */ +void vdo_finish_work_queue(struct vdo_work_queue *queue) +{ + if (queue == NULL) + return; + + if (queue->round_robin_mode) + finish_round_robin_work_queue(as_round_robin_work_queue(queue)); + else + finish_simple_work_queue(as_simple_work_queue(queue)); +} + +/* Debugging dumps */ + +static void dump_simple_work_queue(struct simple_work_queue *queue) +{ + const char *thread_status = "no threads"; + char task_state_report = '-'; + + if (queue->thread != NULL) { + task_state_report = task_state_to_char(queue->thread); + thread_status = atomic_read(&queue->idle) ? "idle" : "running"; + } + + uds_log_info("workQ %px (%s) %s (%c)", + &queue->common, + queue->common.name, + thread_status, + task_state_report); + + /* ->waiting_worker_threads wait queue status? anyone waiting? */ +} + +/* + * Write to the buffer some info about the completion, for logging. Since the common use case is + * dumping info about a lot of completions to syslog all at once, the format favors brevity over + * readability. 
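+ * (The result has the form "queue_name/callback_name"; for instance, a completion queued on
+ * "cpuQ12" with a hypothetical callback finish_read() would be dumped as "cpuQ12/finish_read",
+ * while an unqueued completion with no callback appears as "-/-".)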
+ */ +void vdo_dump_work_queue(struct vdo_work_queue *queue) +{ + if (queue->round_robin_mode) { + struct round_robin_work_queue *round_robin = as_round_robin_work_queue(queue); + unsigned int i; + + for (i = 0; i < round_robin->num_service_queues; i++) + dump_simple_work_queue(round_robin->service_queues[i]); + } else { + dump_simple_work_queue(as_simple_work_queue(queue)); + } +} + +static void get_function_name(void *pointer, char *buffer, size_t buffer_length) +{ + if (pointer == NULL) { + /* + * Format "%ps" logs a null pointer as "(null)" with a bunch of leading spaces. We + * sometimes use this when logging lots of data; don't be so verbose. + */ + strscpy(buffer, "-", buffer_length); + } else { + /* + * Use a pragma to defeat gcc's format checking, which doesn't understand that + * "%ps" actually does support a precision spec in Linux kernel code. + */ + char *space; + +#pragma GCC diagnostic push +#pragma GCC diagnostic ignored "-Wformat" + snprintf(buffer, buffer_length, "%.*ps", buffer_length - 1, pointer); +#pragma GCC diagnostic pop + + space = strchr(buffer, ' '); + if (space != NULL) + *space = '\0'; + } +} + +void vdo_dump_completion_to_buffer(struct vdo_completion *completion, char *buffer, size_t length) +{ + size_t current_length = + scnprintf(buffer, + length, + "%.*s/", + TASK_COMM_LEN, + (completion->my_queue == NULL ? "-" : completion->my_queue->name)); + + if (current_length < length - 1) + get_function_name((void *) completion->callback, + buffer + current_length, + length - current_length); +} + +/* Completion submission */ +/* + * If the completion has a timeout that has already passed, the timeout handler function may be + * invoked by this function. + */ +void vdo_enqueue_work_queue(struct vdo_work_queue *queue, struct vdo_completion *completion) +{ + /* + * Convert the provided generic vdo_work_queue to the simple_work_queue to actually queue + * on. + */ + struct simple_work_queue *simple_queue = NULL; + + if (!queue->round_robin_mode) { + simple_queue = as_simple_work_queue(queue); + } else { + struct round_robin_work_queue *round_robin = as_round_robin_work_queue(queue); + + /* + * It shouldn't be a big deal if the same rotor gets used for multiple work queues. + * Any patterns that might develop are likely to be disrupted by random ordering of + * multiple completions and migration between cores, unless the load is so light as + * to be regular in ordering of tasks and the threads are confined to individual + * cores; with a load that light we won't care. + */ + unsigned int rotor = this_cpu_inc_return(service_queue_rotor); + unsigned int index = rotor % round_robin->num_service_queues; + + simple_queue = round_robin->service_queues[index]; + } + + enqueue_work_queue_completion(simple_queue, completion); +} + +/* Misc */ + +/* + * Return the work queue pointer recorded at initialization time in the work-queue stack handle + * initialized on the stack of the current thread, if any. + */ +static struct simple_work_queue *get_current_thread_work_queue(void) +{ + /* + * In interrupt context, if a vdo thread is what got interrupted, the calls below will find + * the queue for the thread which was interrupted. However, the interrupted thread may have + * been processing a completion, in which case starting to process another would violate + * our concurrency assumptions. + */ + if (in_interrupt()) + return NULL; + if (kthread_func(current) != work_queue_runner) + /* Not a VDO work queue thread. 
*/ + return NULL; + return kthread_data(current); +} + +struct vdo_work_queue *vdo_get_current_work_queue(void) +{ + struct simple_work_queue *queue = get_current_thread_work_queue(); + + return (queue == NULL) ? NULL : &queue->common; +} + +struct vdo_thread *vdo_get_work_queue_owner(struct vdo_work_queue *queue) +{ + return queue->owner; +} + +/** + * vdo_get_work_queue_private_data() - Returns the private data for the current thread's work + * queue, or NULL if none or if the current thread is not a + * work queue thread. + */ +void *vdo_get_work_queue_private_data(void) +{ + struct simple_work_queue *queue = get_current_thread_work_queue(); + + return (queue != NULL) ? queue->private : NULL; +} + +bool vdo_work_queue_type_is(struct vdo_work_queue *queue, const struct vdo_work_queue_type *type) +{ + return (queue->type == type); +} diff --git a/drivers/md/dm-vdo/work-queue.h b/drivers/md/dm-vdo/work-queue.h new file mode 100644 index 000000000000..1d5aa9878b16 --- /dev/null +++ b/drivers/md/dm-vdo/work-queue.h @@ -0,0 +1,53 @@ +/* SPDX-License-Identifier: GPL-2.0-only */ +/* + * Copyright 2023 Red Hat + */ + +#ifndef VDO_WORK_QUEUE_H +#define VDO_WORK_QUEUE_H + +#include /* for TASK_COMM_LEN */ + +#include "types.h" + +enum { + MAX_VDO_WORK_QUEUE_NAME_LEN = TASK_COMM_LEN, +}; + +struct vdo_work_queue_type { + void (*start)(void *context); + void (*finish)(void *context); + enum vdo_completion_priority max_priority; + enum vdo_completion_priority default_priority; +}; + +struct vdo_completion; +struct vdo_thread; +struct vdo_work_queue; + +int vdo_make_work_queue(const char *thread_name_prefix, + const char *name, + struct vdo_thread *owner, + const struct vdo_work_queue_type *type, + unsigned int thread_count, + void *thread_privates[], + struct vdo_work_queue **queue_ptr); + +void vdo_enqueue_work_queue(struct vdo_work_queue *queue, struct vdo_completion *completion); + +void vdo_finish_work_queue(struct vdo_work_queue *queue); + +void vdo_free_work_queue(struct vdo_work_queue *queue); + +void vdo_dump_work_queue(struct vdo_work_queue *queue); + +void vdo_dump_completion_to_buffer(struct vdo_completion *completion, char *buffer, size_t length); + +void *vdo_get_work_queue_private_data(void); +struct vdo_work_queue *vdo_get_current_work_queue(void); +struct vdo_thread *vdo_get_work_queue_owner(struct vdo_work_queue *queue); + +bool __must_check +vdo_work_queue_type_is(struct vdo_work_queue *queue, const struct vdo_work_queue_type *type); + +#endif /* VDO_WORK_QUEUE_H */ From patchwork Thu Oct 26 21:41:05 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Matthew Sakai X-Patchwork-Id: 13437936 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id E58E2C27C46 for ; Thu, 26 Oct 2023 21:43:17 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S232027AbjJZVnR (ORCPT ); Thu, 26 Oct 2023 17:43:17 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:41026 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1344860AbjJZVnQ (ORCPT ); Thu, 26 Oct 2023 17:43:16 -0400 Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.133.124]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 2AEC81B2 for ; Thu, 26 Oct 2023 14:41:42 -0700 (PDT) 
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1698356501; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=c0H5QOMw94uQO3n3nqp8DQ4ZCrgXNFUXky23E0yZu/E=; b=i1VqXTTfu56UJN2btEkIWvvnXIG9tgCA4z6aUHfDlrIjKW4LREB1pRF2CG6FSEOY+m10r4 aS6s4HC9YPB45963CkePQ5Z+Hez+Bu6yJpK+MsW0Qe+dIVv74i8CtYkR9tvMcElpq2Dt+e 5OCudfinsV7BsZuPBDVQz8OKdGRQmoo= Received: from mimecast-mx02.redhat.com (mx-ext.redhat.com [66.187.233.73]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.3, cipher=TLS_AES_256_GCM_SHA384) id us-mta-568-Ucm5F-kWP0SR8VA0UcZUKA-1; Thu, 26 Oct 2023 17:41:38 -0400 X-MC-Unique: Ucm5F-kWP0SR8VA0UcZUKA-1 Received: from smtp.corp.redhat.com (int-mx02.intmail.prod.int.rdu2.redhat.com [10.11.54.2]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (2048 bits) server-digest SHA256) (No client certificate requested) by mimecast-mx02.redhat.com (Postfix) with ESMTPS id 9467E3C02768; Thu, 26 Oct 2023 21:41:38 +0000 (UTC) Received: from pbitcolo-build-10.permabit.com (pbitcolo-build-10.permabit.lab.eng.bos.redhat.com [10.19.117.76]) by smtp.corp.redhat.com (Postfix) with ESMTP id 731C040C6F79; Thu, 26 Oct 2023 21:41:38 +0000 (UTC) Received: by pbitcolo-build-10.permabit.com (Postfix, from userid 1138) id 293DD3003C; Thu, 26 Oct 2023 17:41:38 -0400 (EDT) From: Matthew Sakai To: dm-devel@lists.linux.dev, linux-block@vger.kernel.org Cc: Matthew Sakai Subject: [PATCH v4 08/39] dm vdo: add basic hash map data structures Date: Thu, 26 Oct 2023 17:41:05 -0400 Message-Id: <20231026214136.1067410-9-msakai@redhat.com> In-Reply-To: <20231026214136.1067410-1-msakai@redhat.com> References: <20231026214136.1067410-1-msakai@redhat.com> MIME-Version: 1.0 X-Scanned-By: MIMEDefang 3.4.1 on 10.11.54.2 Precedence: bulk List-ID: X-Mailing-List: linux-block@vger.kernel.org This patch adds two hash maps, one keyed by integers, the other by pointers, and also a priority heap. The integer map is used for locking of logical and physical addresses. The pointer map is used for managing concurrent writes of the same data, ensuring that those writes are deduplicated. The priority heap is used to minimize the search time for free blocks. Co-developed-by: J. corwin Coburn Signed-off-by: J. 
corwin Coburn Co-developed-by: Michael Sclafani Signed-off-by: Michael Sclafani Signed-off-by: Matthew Sakai --- drivers/md/dm-vdo/int-map.c | 710 +++++++++++++++++++++++++++++ drivers/md/dm-vdo/int-map.h | 40 ++ drivers/md/dm-vdo/pointer-map.c | 691 ++++++++++++++++++++++++++++ drivers/md/dm-vdo/pointer-map.h | 81 ++++ drivers/md/dm-vdo/priority-table.c | 226 +++++++++ drivers/md/dm-vdo/priority-table.h | 48 ++ 6 files changed, 1796 insertions(+) create mode 100644 drivers/md/dm-vdo/int-map.c create mode 100644 drivers/md/dm-vdo/int-map.h create mode 100644 drivers/md/dm-vdo/pointer-map.c create mode 100644 drivers/md/dm-vdo/pointer-map.h create mode 100644 drivers/md/dm-vdo/priority-table.c create mode 100644 drivers/md/dm-vdo/priority-table.h diff --git a/drivers/md/dm-vdo/int-map.c b/drivers/md/dm-vdo/int-map.c new file mode 100644 index 000000000000..76431ea64f63 --- /dev/null +++ b/drivers/md/dm-vdo/int-map.c @@ -0,0 +1,710 @@ +// SPDX-License-Identifier: GPL-2.0-only +/* + * Copyright 2023 Red Hat + */ + +/** + * DOC: + * + * Hash table implementation of a map from integers to pointers, implemented using the Hopscotch + * Hashing algorithm by Herlihy, Shavit, and Tzafrir (see + * http://en.wikipedia.org/wiki/Hopscotch_hashing). This implementation does not contain any of the + * locking/concurrency features of the algorithm, just the collision resolution scheme. + * + * Hopscotch Hashing is based on hashing with open addressing and linear probing. All the entries + * are stored in a fixed array of buckets, with no dynamic allocation for collisions. Unlike linear + * probing, all the entries that hash to a given bucket are stored within a fixed neighborhood + * starting at that bucket. Chaining is effectively represented as a bit vector relative to each + * bucket instead of as pointers or explicit offsets. + * + * When an empty bucket cannot be found within a given neighborhood, subsequent neighborhoods are + * searched, and one or more entries will "hop" into those neighborhoods. When this process works, + * an empty bucket will move into the desired neighborhood, allowing the entry to be added. When + * that process fails (typically when the buckets are around 90% full), the table must be resized + * and the all entries rehashed and added to the expanded table. + * + * Unlike linear probing, the number of buckets that must be searched in the worst case has a fixed + * upper bound (the size of the neighborhood). Those entries occupy a small number of memory cache + * lines, leading to improved use of the cache (fewer misses on both successful and unsuccessful + * searches). Hopscotch hashing outperforms linear probing at much higher load factors, so even + * with the increased memory burden for maintaining the hop vectors, less memory is needed to + * achieve that performance. Hopscotch is also immune to "contamination" from deleting entries + * since entries are genuinely removed instead of being replaced by a placeholder. + * + * The published description of the algorithm used a bit vector, but the paper alludes to an offset + * scheme which is used by this implementation. Since the entries in the neighborhood are within N + * entries of the hash bucket at the start of the neighborhood, a pair of small offset fields each + * log2(N) bits wide is all that's needed to maintain the hops as a linked list. In order to encode + * "no next hop" (i.e. NULL) as the natural initial value of zero, the offsets are biased by one + * (i.e. 0 => NULL, 1 => offset=0, 2 => offset=1, etc.) 
We can represent neighborhoods of up to 255 + * entries with just 8+8=16 bits per entry. The hop list is sorted by hop offset so the first entry + * in the list is always the bucket closest to the start of the neighborhood. + * + * While individual accesses tend to be very fast, the table resize operations are very, very + * expensive. If an upper bound on the latency of adding an entry to the table is needed, we either + * need to ensure the table is pre-sized to be large enough so no resize is ever needed, or we'll + * need to develop an approach to incrementally resize the table. + */ + +#include "int-map.h" + +#include + +#include "errors.h" +#include "logger.h" +#include "memory-alloc.h" +#include "numeric.h" +#include "permassert.h" + +enum { + DEFAULT_CAPACITY = 16, /* the number of neighborhoods in a new table */ + NEIGHBORHOOD = 255, /* the number of buckets in each neighborhood */ + MAX_PROBES = 1024, /* limit on the number of probes for a free bucket */ + NULL_HOP_OFFSET = 0, /* the hop offset value terminating the hop list */ + DEFAULT_LOAD = 75 /* a compromise between memory use and performance */ +}; + +/** + * struct bucket - hash bucket + * + * Buckets are packed together to reduce memory usage and improve cache efficiency. It would be + * tempting to encode the hop offsets separately and maintain alignment of key/value pairs, but + * it's crucial to keep the hop fields near the buckets that they use them so they'll tend to share + * cache lines. + */ +struct __packed bucket { + /** + * @first_hop: The biased offset of the first entry in the hop list of the neighborhood + * that hashes to this bucket. + */ + u8 first_hop; + /** @next_hop: The biased offset of the next bucket in the hop list. */ + u8 next_hop; + /** @key: The key stored in this bucket. */ + u64 key; + /** @value: The value stored in this bucket (NULL if empty). */ + void *value; +}; + +/** + * struct int_map - The concrete definition of the opaque int_map type. + * + * To avoid having to wrap the neighborhoods of the last entries back around to the start of the + * bucket array, we allocate a few more buckets at the end of the array instead, which is why + * capacity and bucket_count are different. + */ +struct int_map { + /** @size: The number of entries stored in the map. */ + size_t size; + /** @capacity: The number of neighborhoods in the map. */ + size_t capacity; + /* @bucket_count: The number of buckets in the bucket array. */ + size_t bucket_count; + /** @buckets: The array of hash buckets. */ + struct bucket *buckets; +}; + +/** + * mix() - The Google CityHash 16-byte hash mixing function. + * @input1: The first input value. + * @input2: The second input value. + * + * Return: A hash of the two inputs. + */ +static u64 mix(u64 input1, u64 input2) +{ + static const u64 CITY_MULTIPLIER = 0x9ddfea08eb382d69ULL; + u64 hash = (input1 ^ input2); + + hash *= CITY_MULTIPLIER; + hash ^= (hash >> 47); + hash ^= input2; + hash *= CITY_MULTIPLIER; + hash ^= (hash >> 47); + hash *= CITY_MULTIPLIER; + return hash; +} + +/** + * hash_key() - Calculate a 64-bit non-cryptographic hash value for the provided 64-bit integer + * key. + * @key: The mapping key. + * + * The implementation is based on Google's CityHash, only handling the specific case of an 8-byte + * input. + * + * Return: The hash of the mapping key. + */ +static u64 hash_key(u64 key) +{ + /* + * Aliasing restrictions forbid us from casting pointer types, so use a union to convert a + * single u64 to two u32 values. 
+ */ + union { + u64 u64; + u32 u32[2]; + } pun = {.u64 = key}; + + return mix(sizeof(key) + (((u64) pun.u32[0]) << 3), pun.u32[1]); +} + +/** + * allocate_buckets() - Initialize an int_map. + * @map: The map to initialize. + * @capacity: The initial capacity of the map. + * + * Return: UDS_SUCCESS or an error code. + */ +static int allocate_buckets(struct int_map *map, size_t capacity) +{ + map->size = 0; + map->capacity = capacity; + + /* + * Allocate NEIGHBORHOOD - 1 extra buckets so the last bucket can have a full neighborhood + * without have to wrap back around to element zero. + */ + map->bucket_count = capacity + (NEIGHBORHOOD - 1); + return UDS_ALLOCATE(map->bucket_count, struct bucket, + "struct int_map buckets", &map->buckets); +} + +/** + * vdo_make_int_map() - Allocate and initialize an int_map. + * @initial_capacity: The number of entries the map should initially be capable of holding (zero + * tells the map to use its own small default). + * @initial_load: The load factor of the map, expressed as an integer percentage (typically in the + * range 50 to 90, with zero telling the map to use its own default). + * @map_ptr: Output, a pointer to hold the new int_map. + * + * Return: UDS_SUCCESS or an error code. + */ +int vdo_make_int_map(size_t initial_capacity, unsigned int initial_load, struct int_map **map_ptr) +{ + struct int_map *map; + int result; + size_t capacity; + + /* Use the default initial load if the caller did not specify one. */ + if (initial_load == 0) + initial_load = DEFAULT_LOAD; + if (initial_load > 100) + return UDS_INVALID_ARGUMENT; + + result = UDS_ALLOCATE(1, struct int_map, "struct int_map", &map); + if (result != UDS_SUCCESS) + return result; + + /* Use the default capacity if the caller did not specify one. */ + capacity = (initial_capacity > 0) ? initial_capacity : DEFAULT_CAPACITY; + + /* + * Scale up the capacity by the specified initial load factor. (i.e to hold 1000 entries at + * 80% load we need a capacity of 1250) + */ + capacity = capacity * 100 / initial_load; + + result = allocate_buckets(map, capacity); + if (result != UDS_SUCCESS) { + vdo_free_int_map(UDS_FORGET(map)); + return result; + } + + *map_ptr = map; + return UDS_SUCCESS; +} + +/** + * vdo_free_int_map() - Free an int_map. + * @map: The int_map to free. + * + * NOTE: The map does not own the pointer values stored in the map and they are not freed by this + * call. + */ +void vdo_free_int_map(struct int_map *map) +{ + if (map == NULL) + return; + + UDS_FREE(UDS_FORGET(map->buckets)); + UDS_FREE(UDS_FORGET(map)); +} + +/** + * vdo_int_map_size() - Get the number of entries stored in an int_map. + * @map: The int_map to query. + * + * Return: The number of entries in the map. + */ +size_t vdo_int_map_size(const struct int_map *map) +{ + return map->size; +} + +/** + * dereference_hop() - Convert a biased hop offset within a neighborhood to a pointer to the bucket + * it references. + * @neighborhood: The first bucket in the neighborhood. + * @hop_offset: The biased hop offset to the desired bucket. + * + * Return: NULL if hop_offset is zero, otherwise a pointer to the bucket in the neighborhood at + * hop_offset - 1. + */ +static struct bucket *dereference_hop(struct bucket *neighborhood, unsigned int hop_offset) +{ + if (hop_offset == NULL_HOP_OFFSET) + return NULL; + + STATIC_ASSERT(NULL_HOP_OFFSET == 0); + return &neighborhood[hop_offset - 1]; +} + +/** + * insert_in_hop_list() - Add a bucket into the hop list for the neighborhood. 
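Before the hop-list internals below, a hedged caller sketch of the public interface may help (not part of this patch; int_map_usage_example() is a hypothetical caller). It uses vdo_make_int_map() as just documented, plus vdo_int_map_put(), vdo_int_map_get() and vdo_int_map_remove(), which are defined later in this file and declared in int-map.h.

/*
 * Hypothetical caller, not part of the patch. Stored values must outlive
 * the map; error checking is abbreviated.
 */
static int int_map_usage_example(void)
{
	struct int_map *map;
	void *old = NULL;
	static char first[] = "first";
	static char second[] = "second";
	int result;

	/* Pre-size for about 1000 entries so no resize should ever be needed. */
	result = vdo_make_int_map(1000, 0, &map);
	if (result != UDS_SUCCESS)
		return result;

	/* A new key is always added; there was no old value, so old is NULL. */
	result = vdo_int_map_put(map, 42, first, false, &old);

	/* With update=false the existing mapping is kept, but old reports "first". */
	result = vdo_int_map_put(map, 42, second, false, &old);

	/* With update=true the value is replaced. */
	result = vdo_int_map_put(map, 42, second, true, &old);

	/* vdo_int_map_get(map, 42) would now return second. */
	vdo_int_map_remove(map, 42);

	/* Freeing the map does not free the stored pointers. */
	vdo_free_int_map(map);
	return result;
}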
+ * @neighborhood: The first bucket in the neighborhood. + * @new_bucket: The bucket to add to the hop list. + * + * The bucket is inserted it into the list so the hop list remains sorted by hop offset. + */ +static void insert_in_hop_list(struct bucket *neighborhood, struct bucket *new_bucket) +{ + /* Zero indicates a NULL hop offset, so bias the hop offset by one. */ + int hop_offset = 1 + (new_bucket - neighborhood); + + /* Handle the special case of adding a bucket at the start of the list. */ + int next_hop = neighborhood->first_hop; + + if ((next_hop == NULL_HOP_OFFSET) || (next_hop > hop_offset)) { + new_bucket->next_hop = next_hop; + neighborhood->first_hop = hop_offset; + return; + } + + /* Search the hop list for the insertion point that maintains the sort order. */ + for (;;) { + struct bucket *bucket = dereference_hop(neighborhood, next_hop); + + next_hop = bucket->next_hop; + + if ((next_hop == NULL_HOP_OFFSET) || (next_hop > hop_offset)) { + new_bucket->next_hop = next_hop; + bucket->next_hop = hop_offset; + return; + } + } +} + +/** + * select_bucket() - Select and return the hash bucket for a given search key. + * @map: The map to search. + * @key: The mapping key. + */ +static struct bucket *select_bucket(const struct int_map *map, u64 key) +{ + /* + * Calculate a good hash value for the provided key. We want exactly 32 bits, so mask the + * result. + */ + u64 hash = hash_key(key) & 0xFFFFFFFF; + + /* + * Scale the 32-bit hash to a bucket index by treating it as a binary fraction and + * multiplying that by the capacity. If the hash is uniformly distributed over [0 .. + * 2^32-1], then (hash * capacity / 2^32) should be uniformly distributed over [0 .. + * capacity-1]. The multiply and shift is much faster than a divide (modulus) on X86 CPUs. + */ + return &map->buckets[(hash * map->capacity) >> 32]; +} + +/** + * search_hop_list() - Search the hop list associated with given hash bucket for a given search + * key. + * @map: The map being searched. + * @bucket: The map bucket to search for the key. + * @key: The mapping key. + * @previous_ptr: Output. if not NULL, a pointer in which to store the bucket in the list preceding + * the one that had the matching key + * + * If the key is found, returns a pointer to the entry (bucket or collision), otherwise returns + * NULL. + * + * Return: An entry that matches the key, or NULL if not found. + */ +static struct bucket *search_hop_list(struct int_map *map __always_unused, + struct bucket *bucket, + u64 key, + struct bucket **previous_ptr) +{ + struct bucket *previous = NULL; + unsigned int next_hop = bucket->first_hop; + + while (next_hop != NULL_HOP_OFFSET) { + /* + * Check the neighboring bucket indexed by the offset for the + * desired key. + */ + struct bucket *entry = dereference_hop(bucket, next_hop); + + if ((key == entry->key) && (entry->value != NULL)) { + if (previous_ptr != NULL) + *previous_ptr = previous; + return entry; + } + next_hop = entry->next_hop; + previous = entry; + } + return NULL; +} + +/** + * vdo_int_map_get() - Retrieve the value associated with a given key from the int_map. + * @map: The int_map to query. + * @key: The key to look up. + * + * Return: The value associated with the given key, or NULL if the key is not mapped to any value. + */ +void *vdo_int_map_get(struct int_map *map, u64 key) +{ + struct bucket *match = search_hop_list(map, select_bucket(map, key), key, NULL); + + return ((match != NULL) ? match->value : NULL); +} + +/** + * resize_buckets() - Increase the number of hash buckets. 
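The fixed-point scaling used by select_bucket() above is worth isolating. This is a sketch under the same assumption the comment makes (a hash uniformly distributed over the 32-bit range), not code from the patch.

/*
 * Sketch only, not part of the patch: treating a 32-bit hash h as the
 * fraction h / 2^32, the bucket index is floor(h * capacity / 2^32),
 * computed with one multiply and one shift instead of a modulus.
 */
static inline size_t scale_hash_to_index(u32 hash32, size_t capacity)
{
	/* Uniform over [0 .. capacity - 1] whenever hash32 is uniform. */
	return (size_t) (((u64) hash32 * (u64) capacity) >> 32);
}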
+ * @map: The map to resize. + * + * Resizes and rehashes all the existing entries, storing them in the new buckets. + * + * Return: UDS_SUCCESS or an error code. + */ +static int resize_buckets(struct int_map *map) +{ + int result; + size_t i; + + /* Copy the top-level map data to the stack. */ + struct int_map old_map = *map; + + /* Re-initialize the map to be empty and 50% larger. */ + size_t new_capacity = map->capacity / 2 * 3; + + uds_log_info("%s: attempting resize from %zu to %zu, current size=%zu", + __func__, map->capacity, new_capacity, map->size); + result = allocate_buckets(map, new_capacity); + if (result != UDS_SUCCESS) { + *map = old_map; + return result; + } + + /* Populate the new hash table from the entries in the old bucket array. */ + for (i = 0; i < old_map.bucket_count; i++) { + struct bucket *entry = &old_map.buckets[i]; + + if (entry->value == NULL) + continue; + + result = vdo_int_map_put(map, entry->key, entry->value, true, NULL); + if (result != UDS_SUCCESS) { + /* Destroy the new partial map and restore the map from the stack. */ + UDS_FREE(UDS_FORGET(map->buckets)); + *map = old_map; + return result; + } + } + + /* Destroy the old bucket array. */ + UDS_FREE(UDS_FORGET(old_map.buckets)); + return UDS_SUCCESS; +} + +/** + * find_empty_bucket() - Probe the bucket array starting at the given bucket for the next empty + * bucket, returning a pointer to it. + * @map: The map containing the buckets to search. + * @bucket: The bucket at which to start probing. + * @max_probes: The maximum number of buckets to search. + * + * NULL will be returned if the search reaches the end of the bucket array or if the number of + * linear probes exceeds a specified limit. + * + * Return: The next empty bucket, or NULL if the search failed. + */ +static struct bucket * +find_empty_bucket(struct int_map *map, struct bucket *bucket, unsigned int max_probes) +{ + /* + * Limit the search to either the nearer of the end of the bucket array or a fixed distance + * beyond the initial bucket. + */ + ptrdiff_t remaining = &map->buckets[map->bucket_count] - bucket; + struct bucket *sentinel = &bucket[min_t(ptrdiff_t, remaining, max_probes)]; + struct bucket *entry; + + for (entry = bucket; entry < sentinel; entry++) + if (entry->value == NULL) + return entry; + return NULL; +} + +/** + * move_empty_bucket() - Move an empty bucket closer to the start of the bucket array. + * @map: The map containing the bucket. + * @hole: The empty bucket to fill with an entry that precedes it in one of its enclosing + * neighborhoods. + * + * This searches the neighborhoods that contain the empty bucket for a non-empty bucket closer to + * the start of the array. If such a bucket is found, this swaps the two buckets by moving the + * entry to the empty bucket. + * + * Return: The bucket that was vacated by moving its entry to the provided hole, or NULL if no + * entry could be moved. + */ +static struct bucket *move_empty_bucket(struct int_map *map __always_unused, struct bucket *hole) +{ + /* + * Examine every neighborhood that the empty bucket is part of, starting with the one in + * which it is the last bucket. No boundary check is needed for the negative array + * arithmetic since this function is only called when hole is at least NEIGHBORHOOD cells + * deeper into the array than a valid bucket. 
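One detail of resize_buckets() above that is easy to misread is the growth arithmetic: the divide happens before the multiply. A tiny sketch, not part of the patch:

/*
 * Sketch only, not part of the patch: the "50% larger" rule from
 * resize_buckets(). Dividing first rounds odd capacities down before
 * scaling, e.g. a capacity of 21 grows to 30 rather than 31.
 */
static inline size_t grow_capacity(size_t capacity)
{
	return capacity / 2 * 3;
}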
+ */ + struct bucket *bucket; + + for (bucket = &hole[1 - NEIGHBORHOOD]; bucket < hole; bucket++) { + /* + * Find the entry that is nearest to the bucket, which means it will be nearest to + * the hash bucket whose neighborhood is full. + */ + struct bucket *new_hole = dereference_hop(bucket, bucket->first_hop); + + if (new_hole == NULL) + /* + * There are no buckets in this neighborhood that are in use by this one + * (they must all be owned by overlapping neighborhoods). + */ + continue; + + /* + * Skip this bucket if its first entry is actually further away than the hole that + * we're already trying to fill. + */ + if (hole < new_hole) + continue; + + /* + * We've found an entry in this neighborhood that we can "hop" further away, moving + * the hole closer to the hash bucket, if not all the way into its neighborhood. + */ + + /* + * The entry that will be the new hole is the first bucket in the list, so setting + * first_hop is all that's needed remove it from the list. + */ + bucket->first_hop = new_hole->next_hop; + new_hole->next_hop = NULL_HOP_OFFSET; + + /* Move the entry into the original hole. */ + hole->key = new_hole->key; + hole->value = new_hole->value; + new_hole->value = NULL; + + /* Insert the filled hole into the hop list for the neighborhood. */ + insert_in_hop_list(bucket, hole); + return new_hole; + } + + /* We couldn't find an entry to relocate to the hole. */ + return NULL; +} + +/** + * update_mapping() - Find and update any existing mapping for a given key, returning the value + * associated with the key in the provided pointer. + * @map: The int_map to attempt to modify. + * @neighborhood: The first bucket in the neighborhood that would contain the search key + * @key: The key with which to associate the new value. + * @new_value: The value to be associated with the key. + * @update: Whether to overwrite an existing value. + * @old_value_ptr: a pointer in which to store the old value (unmodified if no mapping was found) + * + * Return: true if the map contains a mapping for the key, false if it does not. + */ +static bool update_mapping(struct int_map *map, + struct bucket *neighborhood, + u64 key, + void *new_value, + bool update, + void **old_value_ptr) +{ + struct bucket *bucket = search_hop_list(map, neighborhood, key, NULL); + + if (bucket == NULL) + /* There is no bucket containing the key in the neighborhood. */ + return false; + + /* + * Return the value of the current mapping (if desired) and update the mapping with the new + * value (if desired). + */ + if (old_value_ptr != NULL) + *old_value_ptr = bucket->value; + if (update) + bucket->value = new_value; + return true; +} + +/** + * find_or_make_vacancy() - Find an empty bucket. + * @map: The int_map to search or modify. + * @neighborhood: The first bucket in the neighborhood in which an empty bucket is needed for a new + * mapping. + * + * Find an empty bucket in a specified neighborhood for a new mapping or attempt to re-arrange + * mappings so there is such a bucket. This operation may fail (returning NULL) if an empty bucket + * is not available or could not be relocated to the neighborhood. + * + * Return: a pointer to an empty bucket in the desired neighborhood, or NULL if a vacancy could not + * be found or arranged. + */ +static struct bucket *find_or_make_vacancy(struct int_map *map, struct bucket *neighborhood) +{ + /* Probe within and beyond the neighborhood for the first empty bucket. 
*/ + struct bucket *hole = find_empty_bucket(map, neighborhood, MAX_PROBES); + + /* + * Keep trying until the empty bucket is in the bucket's neighborhood or we are unable to + * move it any closer by swapping it with a filled bucket. + */ + while (hole != NULL) { + int distance = hole - neighborhood; + + if (distance < NEIGHBORHOOD) + /* + * We've found or relocated an empty bucket close enough to the initial + * hash bucket to be referenced by its hop vector. + */ + return hole; + + /* + * The nearest empty bucket isn't within the neighborhood that must contain the new + * entry, so try to swap it with bucket that is closer. + */ + hole = move_empty_bucket(map, hole); + } + + return NULL; +} + +/** + * vdo_int_map_put() - Try to associate a value with an integer. + * @map: The int_map to attempt to modify. + * @key: The key with which to associate the new value. + * @new_value: The value to be associated with the key. + * @update: Whether to overwrite an existing value. + * @old_value_ptr: A pointer in which to store either the old value (if the key was already mapped) + * or NULL if the map did not contain the key; NULL may be provided if the caller + * does not need to know the old value + * + * Try to associate a value (a pointer) with an integer in an int_map. If the map already contains + * a mapping for the provided key, the old value is only replaced with the specified value if + * update is true. In either case the old value is returned. If the map does not already contain a + * value for the specified key, the new value is added regardless of the value of update. + * + * Return: UDS_SUCCESS or an error code. + */ +int vdo_int_map_put(struct int_map *map, u64 key, void *new_value, bool update, void **old_value_ptr) +{ + struct bucket *neighborhood, *bucket; + + if (new_value == NULL) + return UDS_INVALID_ARGUMENT; + + /* + * Select the bucket at the start of the neighborhood that must contain any entry for the + * provided key. + */ + neighborhood = select_bucket(map, key); + + /* + * Check whether the neighborhood already contains an entry for the key, in which case we + * optionally update it, returning the old value. + */ + if (update_mapping(map, neighborhood, key, new_value, update, old_value_ptr)) + return UDS_SUCCESS; + + /* + * Find an empty bucket in the desired neighborhood for the new entry or re-arrange entries + * in the map so there is such a bucket. This operation will usually succeed; the loop body + * will only be executed on the rare occasions that we have to resize the map. + */ + while ((bucket = find_or_make_vacancy(map, neighborhood)) == NULL) { + int result; + + /* + * There is no empty bucket in which to put the new entry in the current map, so + * we're forced to allocate a new bucket array with a larger capacity, re-hash all + * the entries into those buckets, and try again (a very expensive operation for + * large maps). + */ + result = resize_buckets(map); + if (result != UDS_SUCCESS) + return result; + + /* + * Resizing the map invalidates all pointers to buckets, so recalculate the + * neighborhood pointer. + */ + neighborhood = select_bucket(map, key); + } + + /* Put the new entry in the empty bucket, adding it to the neighborhood. */ + bucket->key = key; + bucket->value = new_value; + insert_in_hop_list(neighborhood, bucket); + map->size += 1; + + /* There was no existing entry, so there was no old value to be returned. 
*/ + if (old_value_ptr != NULL) + *old_value_ptr = NULL; + return UDS_SUCCESS; +} + +/** + * vdo_int_map_remove() - Remove the mapping for a given key from the int_map. + * @map: The int_map from which to remove the mapping. + * @key: The key whose mapping is to be removed. + * + * Return: the value that was associated with the key, or NULL if it was not mapped. + */ +void *vdo_int_map_remove(struct int_map *map, u64 key) +{ + void *value; + + /* Select the bucket to search and search it for an existing entry. */ + struct bucket *bucket = select_bucket(map, key); + struct bucket *previous; + struct bucket *victim = search_hop_list(map, bucket, key, &previous); + + if (victim == NULL) + /* There is no matching entry to remove. */ + return NULL; + + /* + * We found an entry to remove. Save the mapped value to return later and empty the bucket. + */ + map->size -= 1; + value = victim->value; + victim->value = NULL; + victim->key = 0; + + /* The victim bucket is now empty, but it still needs to be spliced out of the hop list. */ + if (previous == NULL) + /* The victim is the head of the list, so swing first_hop. */ + bucket->first_hop = victim->next_hop; + else + previous->next_hop = victim->next_hop; + victim->next_hop = NULL_HOP_OFFSET; + + return value; +} diff --git a/drivers/md/dm-vdo/int-map.h b/drivers/md/dm-vdo/int-map.h new file mode 100644 index 000000000000..e85f12b30edc --- /dev/null +++ b/drivers/md/dm-vdo/int-map.h @@ -0,0 +1,40 @@ +/* SPDX-License-Identifier: GPL-2.0-only */ +/* + * Copyright 2023 Red Hat + */ + +#ifndef VDO_INT_MAP_H +#define VDO_INT_MAP_H + +#include +#include + +/** + * DOC: int_map + * + * An int_map associates pointers (void *) with integer keys (u64). NULL pointer values are + * not supported. + * + * The map is implemented as hash table, which should provide constant-time insert, query, and + * remove operations, although the insert may occasionally grow the table, which is linear in the + * number of entries in the map. The table will grow as needed to hold new entries, but will not + * shrink as entries are removed. + */ + +struct int_map; + +int __must_check +vdo_make_int_map(size_t initial_capacity, unsigned int initial_load, struct int_map **map_ptr); + +void vdo_free_int_map(struct int_map *map); + +size_t vdo_int_map_size(const struct int_map *map); + +void *vdo_int_map_get(struct int_map *map, u64 key); + +int __must_check +vdo_int_map_put(struct int_map *map, u64 key, void *new_value, bool update, void **old_value_ptr); + +void *vdo_int_map_remove(struct int_map *map, u64 key); + +#endif /* VDO_INT_MAP_H */ diff --git a/drivers/md/dm-vdo/pointer-map.c b/drivers/md/dm-vdo/pointer-map.c new file mode 100644 index 000000000000..21ccb40c73d2 --- /dev/null +++ b/drivers/md/dm-vdo/pointer-map.c @@ -0,0 +1,691 @@ +// SPDX-License-Identifier: GPL-2.0-only +/* + * Copyright 2023 Red Hat + */ + +/** + * DOC: + * + * Hash table implementation of a map from integers to pointers, implemented using the Hopscotch + * Hashing algorithm by Herlihy, Shavit, and Tzafrir (see + * http://en.wikipedia.org/wiki/Hopscotch_hashing). This implementation does not contain any of the + * locking/concurrency features of the algorithm, just the collision resolution scheme. + * + * Hopscotch Hashing is based on hashing with open addressing and linear probing. All the entries + * are stored in a fixed array of buckets, with no dynamic allocation for collisions. 
Unlike linear + * probing, all the entries that hash to a given bucket are stored within a fixed neighborhood + * starting at that bucket. Chaining is effectively represented as a bit vector relative to each + * bucket instead of as pointers or explicit offsets. + * + * When an empty bucket cannot be found within a given neighborhood, subsequent neighborhoods are + * searched, and one or more entries will "hop" into those neighborhoods. When this process works, + * an empty bucket will move into the desired neighborhood, allowing the entry to be added. When + * that process fails (typically when the buckets are around 90% full), the table must be resized + * and the all entries rehashed and added to the expanded table. + * + * Unlike linear probing, the number of buckets that must be searched in the worst case has a fixed + * upper bound (the size of the neighborhood). Those entries occupy a small number of memory cache + * lines, leading to improved use of the cache (fewer misses on both successful and unsuccessful + * searches). Hopscotch hashing outperforms linear probing at much higher load factors, so even + * with the increased memory burden for maintaining the hop vectors, less memory is needed to + * achieve that performance. Hopscotch is also immune to "contamination" from deleting entries + * since entries are genuinely removed instead of being replaced by a placeholder. + * + * The published description of the algorithm used a bit vector, but the paper alludes to an offset + * scheme which is used by this implementation. Since the entries in the neighborhood are within N + * entries of the hash bucket at the start of the neighborhood, a pair of small offset fields each + * log2(N) bits wide is all that's needed to maintain the hops as a linked list. In order to encode + * "no next hop" (i.e. NULL) as the natural initial value of zero, the offsets are biased by one + * (i.e. 0 => NULL, 1 => offset=0, 2 => offset=1, etc.) We can represent neighborhoods of up to 255 + * entries with just 8+8=16 bits per entry. The hop list is sorted by hop offset so the first entry + * in the list is always the bucket closest to the start of the neighborhood. + * + * While individual accesses tend to be very fast, the table resize operations are very, very + * expensive. If an upper bound on the latency of adding an entry to the table is needed, we either + * need to ensure the table is pre-sized to be large enough so no resize is ever needed, or we'll + * need to develop an approach to incrementally resize the table. + */ + +#include "pointer-map.h" + +#include + +#include "errors.h" +#include "logger.h" +#include "memory-alloc.h" +#include "numeric.h" +#include "permassert.h" + +enum { + DEFAULT_CAPACITY = 16, /* the number of neighborhoods in a new table */ + NEIGHBORHOOD = 255, /* the number of buckets in each neighborhood */ + MAX_PROBES = 1024, /* limit on the number of probes for a free bucket */ + NULL_HOP_OFFSET = 0, /* the hop offset value terminating the hop list */ + DEFAULT_LOAD = 75 /* a compromise between memory use and performance */ +}; + +/** + * struct bucket - Hash buckets. + * + * Buckets are packed together to reduce memory usage and improve cache efficiency. It would be + * tempting to encode the hop offsets separately and maintain alignment of key/value pairs, but + * it's crucial to keep the hop fields near the buckets that they use them so they'll tend to share + * cache lines. 
+ */ +struct __packed bucket { + /** + * @first_hop: The biased offset of the first entry in the hop list of the neighborhood + * that hashes to this bucket. + */ + u8 first_hop; + /** @next_hop: the biased offset of the next bucket in the hop list. */ + u8 next_hop; + /** @key: The key stored in this bucket. */ + const void *key; + /** @value: The value stored in this bucket (NULL if empty). */ + void *value; +}; + +/** + * struct pointer_map - The concrete definition of the opaque pointer_map type. + * + * To avoid having to wrap the neighborhoods of the last entries back around to the start of the + * bucket array, we allocate a few more buckets at the end of the array instead, which is why + * capacity and bucket_count are different. + */ +struct pointer_map { + /** @size: The number of entries stored in the map. */ + size_t size; + /** @capacity: The number of neighborhoods in the map. */ + size_t capacity; + /** @bucket_count: The number of buckets in the bucket array. */ + size_t bucket_count; + /** @buckets: The array of hash buckets. */ + struct bucket *buckets; + /** @comparator: The function for comparing keys for equality. */ + pointer_key_comparator *comparator; + /** @hasher: The function for getting a hash code from a key. */ + pointer_key_hasher *hasher; +}; + +/** + * allocate_buckets() - Initialize a pointer_map. + * @map: The map to initialize. + * @capacity: The initial capacity of the map. + * + * Return: UDS_SUCCESS or an error code. + */ +static int allocate_buckets(struct pointer_map *map, size_t capacity) +{ + map->size = 0; + map->capacity = capacity; + + /* + * Allocate NEIGHBORHOOD - 1 extra buckets so the last bucket can have a full neighborhood + * without have to wrap back around to element zero. + */ + map->bucket_count = capacity + (NEIGHBORHOOD - 1); + return UDS_ALLOCATE(map->bucket_count, + struct bucket, + "pointer_map buckets", + &map->buckets); +} + +/** + * vdo_make_pointer_map() - Allocate and initialize a pointer_map. + * @initial_capacity: The number of entries the map should initially be capable of holding (zero + * tells the map to use its own small default). + * @initial_load: The load factor of the map, expressed as an integer percentage (typically in the + * range 50 to 90, with zero telling the map to use its own default). + * @comparator: The function to use to compare the referents of two pointer keys for equality. + * @hasher: The function to use obtain the hash code associated with each pointer key + * @map_ptr: A pointer to hold the new pointer_map. + * + * Return: UDS_SUCCESS or an error code. + */ +int vdo_make_pointer_map(size_t initial_capacity, + unsigned int initial_load, + pointer_key_comparator comparator, + pointer_key_hasher hasher, + struct pointer_map **map_ptr) +{ + int result; + struct pointer_map *map; + size_t capacity; + + /* Use the default initial load if the caller did not specify one. */ + if (initial_load == 0) + initial_load = DEFAULT_LOAD; + if (initial_load > 100) + return UDS_INVALID_ARGUMENT; + + result = UDS_ALLOCATE(1, struct pointer_map, "pointer_map", &map); + if (result != UDS_SUCCESS) + return result; + + map->hasher = hasher; + map->comparator = comparator; + + /* Use the default capacity if the caller did not specify one. */ + capacity = (initial_capacity > 0) ? initial_capacity : DEFAULT_CAPACITY; + + /* + * Scale up the capacity by the specified initial load factor. 
(i.e to hold 1000 entries at + * 80% load we need a capacity of 1250) + */ + capacity = capacity * 100 / initial_load; + + result = allocate_buckets(map, capacity); + if (result != UDS_SUCCESS) { + vdo_free_pointer_map(UDS_FORGET(map)); + return result; + } + + *map_ptr = map; + return UDS_SUCCESS; +} + +/** + * vdo_free_pointer_map() - Free a pointer_map. + * @map: The pointer_map to free. + * + * The map does not own the pointer keys and values stored in the map and they are not freed by + * this call. + */ +void vdo_free_pointer_map(struct pointer_map *map) +{ + if (map == NULL) + return; + + UDS_FREE(UDS_FORGET(map->buckets)); + UDS_FREE(UDS_FORGET(map)); +} + +/** + * vdo_pointer_map_size() - Get the number of entries stored in a pointer_map. + * @map: The pointer_map to query. + * + * Return: The number of entries in the map. + */ +size_t vdo_pointer_map_size(const struct pointer_map *map) +{ + return map->size; +} + +/** + * dereference_hop() - Convert a biased hop offset within a neighborhood to a pointer to the bucket + * it references. + * @neighborhood: The first bucket in the neighborhood. + * @hop_offset: The biased hop offset to the desired bucket. + * + * Return: NULL if hop_offset is zero, otherwise a pointer to the bucket in the neighborhood at + * hop_offset - 1. + */ +static struct bucket *dereference_hop(struct bucket *neighborhood, unsigned int hop_offset) +{ + if (hop_offset == NULL_HOP_OFFSET) + return NULL; + + STATIC_ASSERT(NULL_HOP_OFFSET == 0); + return &neighborhood[hop_offset - 1]; +} + +/** + * insert_in_hop_list() - Add a bucket into the hop list for the neighborhood, inserting it into + * the list so the hop list remains sorted by hop offset. + * @neighborhood: The first bucket in the neighborhood. + * @new_bucket: The bucket to add to the hop list. + */ +static void insert_in_hop_list(struct bucket *neighborhood, struct bucket *new_bucket) +{ + /* Zero indicates a NULL hop offset, so bias the hop offset by one. */ + int hop_offset = 1 + (new_bucket - neighborhood); + + /* Handle the special case of adding a bucket at the start of the list. */ + int next_hop = neighborhood->first_hop; + + if ((next_hop == NULL_HOP_OFFSET) || (next_hop > hop_offset)) { + new_bucket->next_hop = next_hop; + neighborhood->first_hop = hop_offset; + return; + } + + /* Search the hop list for the insertion point that maintains the sort order. */ + for (;;) { + struct bucket *bucket = dereference_hop(neighborhood, next_hop); + + next_hop = bucket->next_hop; + + if ((next_hop == NULL_HOP_OFFSET) || (next_hop > hop_offset)) { + new_bucket->next_hop = next_hop; + bucket->next_hop = hop_offset; + return; + } + } +} + +/** + * select_bucket() - Select and return the hash bucket for a given search key. + * @map: The map to search. + * @key: The mapping key. + */ +static struct bucket *select_bucket(const struct pointer_map *map, const void *key) +{ + /* + * Scale the 32-bit hash to a bucket index by treating it as a binary fraction and + * multiplying that by the capacity. If the hash is uniformly distributed over [0 .. + * 2^32-1], then (hash * capacity / 2^32) should be uniformly distributed over [0 .. + * capacity-1]. The multiply and shift is much faster than a divide (modulus) on X86 CPUs. + */ + u64 hash = map->hasher(key); + + return &map->buckets[(hash * map->capacity) >> 32]; +} + +/** + * search_hop_list() - Search the hop list. + * @map: The map being searched. + * @bucket: The map bucket to search for the key. + * @key: The mapping key. 
+ * @previous_ptr: if not NULL, a pointer in which to store the bucket in the list preceding the one + * that had the matching key. + * + * Searches the hop list associated with given hash bucket for a given search key. If the key is + * found, returns a pointer to the entry (bucket or collision), otherwise returns NULL. + * + * Return: an entry that matches the key, or NULL if not found. + */ +static struct bucket *search_hop_list(struct pointer_map *map, + struct bucket *bucket, + const void *key, + struct bucket **previous_ptr) +{ + struct bucket *previous = NULL; + unsigned int next_hop = bucket->first_hop; + + while (next_hop != NULL_HOP_OFFSET) { + /* Check the neighboring bucket indexed by the offset for the desired key. */ + struct bucket *entry = dereference_hop(bucket, next_hop); + + if ((entry->value != NULL) && map->comparator(key, entry->key)) { + if (previous_ptr != NULL) + *previous_ptr = previous; + return entry; + } + next_hop = entry->next_hop; + previous = entry; + } + return NULL; +} + +/** + * vdo_pointer_map_get() - Retrieve the value associated with a given key from the pointer_map. + * @map: The pointer_map to query. + * @key: The key to look up (may be NULL if the comparator and hasher functions support it). + * + * Return: the value associated with the given key, or NULL if the key is not mapped to any value. + */ +void *vdo_pointer_map_get(struct pointer_map *map, const void *key) +{ + struct bucket *match = search_hop_list(map, select_bucket(map, key), key, NULL); + + return ((match != NULL) ? match->value : NULL); +} + +/** + * resize_buckets() - Increase the number of hash buckets and rehash all the existing entries, + * storing them in the new buckets. + * @map: The map to resize. + */ +static int resize_buckets(struct pointer_map *map) +{ + int result; + size_t i; + + /* Copy the top-level map data to the stack. */ + struct pointer_map old_map = *map; + + /* Re-initialize the map to be empty and 50% larger. */ + size_t new_capacity = map->capacity / 2 * 3; + + uds_log_info("%s: attempting resize from %zu to %zu, current size=%zu", + __func__, + map->capacity, + new_capacity, + map->size); + result = allocate_buckets(map, new_capacity); + if (result != UDS_SUCCESS) { + *map = old_map; + return result; + } + + /* Populate the new hash table from the entries in the old bucket array. */ + for (i = 0; i < old_map.bucket_count; i++) { + struct bucket *entry = &old_map.buckets[i]; + + if (entry->value == NULL) + continue; + + result = vdo_pointer_map_put(map, entry->key, entry->value, true, NULL); + if (result != UDS_SUCCESS) { + /* Destroy the new partial map and restore the map from the stack. */ + UDS_FREE(UDS_FORGET(map->buckets)); + *map = old_map; + return result; + } + } + + /* Destroy the old bucket array. */ + UDS_FREE(UDS_FORGET(old_map.buckets)); + return UDS_SUCCESS; +} + +/** + * find_empty_bucket() - Probe the bucket array starting at the given bucket for the next empty + * bucket, returning a pointer to it. + * @map: The map containing the buckets to search. + * @bucket: The bucket at which to start probing. + * @max_probes: The maximum number of buckets to search. + * + * NULL will be returned if the search reaches the end of the bucket array or if the number of + * linear probes exceeds a specified limit. + * + * Return: The next empty bucket, or NULL if the search failed. 
+ */ +static struct bucket * +find_empty_bucket(struct pointer_map *map, struct bucket *bucket, unsigned int max_probes) +{ + /* + * Limit the search to either the nearer of the end of the bucket array or a fixed distance + * beyond the initial bucket. + */ + ptrdiff_t remaining = &map->buckets[map->bucket_count] - bucket; + struct bucket *sentinel = &bucket[min_t(ptrdiff_t, remaining, max_probes)]; + struct bucket *entry; + + for (entry = bucket; entry < sentinel; entry++) + if (entry->value == NULL) + return entry; + return NULL; +} + +/** + * move_empty_bucket() - Move an empty bucket closer to the start of the bucket array. + * @map: The map containing the bucket. + + * @hole: The empty bucket to fill with an entry that precedes it in one of its enclosing + * neighborhoods. + * + * This searches the neighborhoods that contain the empty bucket for a non-empty bucket closer to + * the start of the array. If such a bucket is found, this swaps the two buckets by moving the + * entry to the empty bucket. + * + * Return: The bucket that was vacated by moving its entry to the provided hole, or NULL if no + * entry could be moved. + */ +static struct bucket * +move_empty_bucket(struct pointer_map *map __always_unused, struct bucket *hole) +{ + /* + * Examine every neighborhood that the empty bucket is part of, starting with the one in + * which it is the last bucket. No boundary check is needed for the negative array + * arithmetic since this function is only called when hole is at least NEIGHBORHOOD cells + * deeper into the array than a valid bucket. + */ + struct bucket *bucket; + + for (bucket = &hole[1 - NEIGHBORHOOD]; bucket < hole; bucket++) { + /* + * Find the entry that is nearest to the bucket, which means it will be nearest to + * the hash bucket whose neighborhood is full. + */ + struct bucket *new_hole = dereference_hop(bucket, bucket->first_hop); + + if (new_hole == NULL) + /* + * There are no buckets in this neighborhood that are in use by this one + * (they must all be owned by overlapping neighborhoods). + */ + continue; + + /* + * Skip this bucket if its first entry is actually further away than the hole that + * we're already trying to fill. + */ + if (hole < new_hole) + continue; + + /* + * We've found an entry in this neighborhood that we can "hop" further away, moving + * the hole closer to the hash bucket, if not all the way into its neighborhood. + */ + + /* + * The entry that will be the new hole is the first bucket in the list, so setting + * first_hop is all that's needed remove it from the list. + */ + bucket->first_hop = new_hole->next_hop; + new_hole->next_hop = NULL_HOP_OFFSET; + + /* Move the entry into the original hole. */ + hole->key = new_hole->key; + hole->value = new_hole->value; + new_hole->value = NULL; + + /* Insert the filled hole into the hop list for the neighborhood. */ + insert_in_hop_list(bucket, hole); + return new_hole; + } + + /* We couldn't find an entry to relocate to the hole. */ + return NULL; +} + +/** + * update_mapping() - Find and update any existing mapping for a given key, returning the value + * associated with the key in the provided pointer. + * @map: The pointer_map to attempt to modify. + * @neighborhood: The first bucket in the neighborhood that would contain the search key. + * @key: The key with which to associate the new value. + * @new_value: The value to be associated with the key. + * @update: Whether to overwrite an existing value. 
+ * @old_value_ptr: A pointer in which to store the old value (unmodified if no mapping was found). + * + * Return: true if the map contains a mapping for the key, false if it does not. + */ +static bool update_mapping(struct pointer_map *map, + struct bucket *neighborhood, + const void *key, + void *new_value, + bool update, + void **old_value_ptr) +{ + struct bucket *bucket = search_hop_list(map, neighborhood, key, NULL); + + if (bucket == NULL) + /* There is no bucket containing the key in the neighborhood. */ + return false; + + /* + * Return the value of the current mapping (if desired) and update the mapping with the new + * value (if desired). + */ + if (old_value_ptr != NULL) + *old_value_ptr = bucket->value; + if (update) { + /* + * We're dropping the old key pointer on the floor here, assuming it's a property + * of the value or that it's otherwise safe to just forget. + */ + bucket->key = key; + bucket->value = new_value; + } + return true; +} + +/** + * find_or_make_vacancy() - Find an empty bucket in a specified neighborhood for a new mapping or + * attempt to re-arrange mappings so there is such a bucket. + * @map: The pointer_map to search or modify. + * @neighborhood: The first bucket in the neighborhood in which an empty bucket is needed for a new + * mapping. + * + * This operation may fail (returning NULL) if an empty bucket is not available or could not be + * relocated to the neighborhood. + * + * Return: A pointer to an empty bucket in the desired neighborhood, or NULL if a vacancy could not + * be found or arranged. + */ +static struct bucket *find_or_make_vacancy(struct pointer_map *map, struct bucket *neighborhood) +{ + /* Probe within and beyond the neighborhood for the first empty bucket. */ + struct bucket *hole = find_empty_bucket(map, neighborhood, MAX_PROBES); + + /* + * Keep trying until the empty bucket is in the bucket's neighborhood or we are unable to + * move it any closer by swapping it with a filled bucket. + */ + while (hole != NULL) { + int distance = hole - neighborhood; + + if (distance < NEIGHBORHOOD) + /* + * We've found or relocated an empty bucket close enough to the initial + * hash bucket to be referenced by its hop vector. + */ + return hole; + + /* + * The nearest empty bucket isn't within the neighborhood that must contain the new + * entry, so try to swap it with bucket that is closer. + */ + hole = move_empty_bucket(map, hole); + } + + return NULL; +} + +/** + * vdo_pointer_map_put() - Try to associate a value (a pointer) with an integer in a pointer_map. + * @map: The pointer_map to attempt to modify. + * @key: The key with which to associate the new value (may be NULL if the comparator and hasher + * functions support it). + * @new_value: The value to be associated with the key. + * @update: Whether to overwrite an existing value. + * @old_value_ptr: A pointer in which to store either the old value (if the key was already mapped) + * or NULL if the map did not contain the key; NULL may be provided if the caller + * does not need to know the old value. + * + * If the map already contains a mapping for the provided key, the old value is only replaced with + * the specified value if update is true. In either case the old value is returned. If the map does + * not already contain a value for the specified key, the new value is added regardless of the + * value of update. + * + * If the value stored in the map is updated, then the key stored in the map will also be updated + * with the key provided by this call. 
The old key will not be returned due to the memory + * management assumptions described in the interface header comment. + * + * Return: UDS_SUCCESS or an error code. + */ +int vdo_pointer_map_put(struct pointer_map *map, + const void *key, + void *new_value, + bool update, + void **old_value_ptr) +{ + struct bucket *neighborhood, *bucket; + + if (new_value == NULL) + return UDS_INVALID_ARGUMENT; + + /* + * Select the bucket at the start of the neighborhood that must contain any entry for the + * provided key. + */ + neighborhood = select_bucket(map, key); + + /* + * Check whether the neighborhood already contains an entry for the key, in which case we + * optionally update it, returning the old value. + */ + if (update_mapping(map, neighborhood, key, new_value, update, old_value_ptr)) + return UDS_SUCCESS; + + /* + * Find an empty bucket in the desired neighborhood for the new entry or re-arrange entries + * in the map so there is such a bucket. This operation will usually succeed; the loop body + * will only be executed on the rare occasions that we have to resize the map. + */ + while ((bucket = find_or_make_vacancy(map, neighborhood)) == NULL) { + /* + * There is no empty bucket in which to put the new entry in the current map, so + * we're forced to allocate a new bucket array with a larger capacity, re-hash all + * the entries into those buckets, and try again (a very expensive operation for + * large maps). + */ + int result = resize_buckets(map); + + if (result != UDS_SUCCESS) + return result; + + /* + * Resizing the map invalidates all pointers to buckets, so + * recalculate the neighborhood pointer. + */ + neighborhood = select_bucket(map, key); + } + + /* Put the new entry in the empty bucket, adding it to the neighborhood. */ + bucket->key = key; + bucket->value = new_value; + insert_in_hop_list(neighborhood, bucket); + map->size += 1; + + /* + * There was no existing entry, so there was no old value to be + * returned. + */ + if (old_value_ptr != NULL) + *old_value_ptr = NULL; + return UDS_SUCCESS; +} + +/** + * vdo_pointer_map_remove() - Remove the mapping for a given key from the pointer_map. + * @map: The pointer_map from which to remove the mapping. + * @key: The key whose mapping is to be removed (may be NULL if the comparator and hasher functions + * support it). + * + * Return: the value that was associated with the key, or NULL if it was not mapped. + */ +void *vdo_pointer_map_remove(struct pointer_map *map, const void *key) +{ + void *value; + + /* Select the bucket to search and search it for an existing entry. */ + struct bucket *bucket = select_bucket(map, key); + struct bucket *previous; + struct bucket *victim = search_hop_list(map, bucket, key, &previous); + + if (victim == NULL) + /* There is no matching entry to remove. */ + return NULL; + + /* + * We found an entry to remove. Save the mapped value to return later and empty the bucket. + */ + map->size -= 1; + value = victim->value; + victim->value = NULL; + victim->key = 0; + + /* The victim bucket is now empty, but it still needs to be spliced out of the hop list. */ + if (previous == NULL) + /* The victim is the head of the list, so swing first_hop. 
*/ + bucket->first_hop = victim->next_hop; + else + previous->next_hop = victim->next_hop; + victim->next_hop = NULL_HOP_OFFSET; + + return value; +} diff --git a/drivers/md/dm-vdo/pointer-map.h b/drivers/md/dm-vdo/pointer-map.h new file mode 100644 index 000000000000..83465d01bfa7 --- /dev/null +++ b/drivers/md/dm-vdo/pointer-map.h @@ -0,0 +1,81 @@ +/* SPDX-License-Identifier: GPL-2.0-only */ +/* + * Copyright 2023 Red Hat + */ + +#ifndef VDO_POINTER_MAP_H +#define VDO_POINTER_MAP_H + +#include +#include + +/* + * A pointer_map associates pointer values (void *) with the data referenced by + * pointer keys (void *). NULL pointer values are not supported. A + * NULL key value is supported when the instance's key comparator and hasher functions + * support it. + * + * The map is implemented as hash table, which should provide constant-time insert, query, and + * remove operations, although the insert may occasionally grow the table, which is linear in the + * number of entries in the map. The table will grow as needed to hold new entries, but will not + * shrink as entries are removed. + * + * The key and value pointers passed to the map are retained and used by the map, but are not owned + * by the map. Freeing the map does not attempt to free the pointers. The client is entirely + * responsible for the memory management of the keys and values. The current interface and + * implementation assume that keys will be properties of the values, or that keys will not be + * memory managed, or that keys will not need to be freed as a result of being replaced when a key + * is re-mapped. + */ + +struct pointer_map; + +/** + * typedef pointer_key_comparator - The prototype of functions that compare the referents of two + * pointer keys for equality. + * @this_key: The first element to compare. + * @that_key: The second element to compare. + * + * If two keys are equal, then both keys must have the same the hash code associated with them by + * the hasher function defined below. + * + * Return: true if and only if the referents of the two key pointers are to be treated as the same + * key by the map. + */ +typedef bool pointer_key_comparator(const void *this_key, const void *that_key); + +/** + * typedef pointer_key_hasher - The prototype of functions that get or calculate a hash code + * associated with the referent of pointer key. + * @key: The pointer key to hash. + * + * The hash code must be uniformly distributed over all u32 values. The hash code associated + * with a given key must not change while the key is in the map. If the comparator function says + * two keys are equal, then this function must return the same hash code for both keys. This + * function may be called many times for a key while an entry is stored for it in the map. + * + * Return: The hash code for the key. 
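Given the comparator and hasher contracts described above, a hypothetical string-keyed pointer_map might look like the following sketch (not part of this patch; compare_string_keys(), hash_string_key() and string_map_example() are illustrative, the hash shown is a plain FNV-1a rather than anything used by vdo, and the kernel's strcmp() is assumed to be available).

/*
 * Hypothetical example, not part of the patch: a string-keyed
 * pointer_map. Equal keys compare equal and hash to the same u32 code.
 */
static bool compare_string_keys(const void *this_key, const void *that_key)
{
	return strcmp(this_key, that_key) == 0;
}

static u32 hash_string_key(const void *key)
{
	const unsigned char *s = key;
	u32 hash = 2166136261u;	/* FNV-1a offset basis */

	while (*s != '\0') {
		hash ^= *s++;
		hash *= 16777619u;	/* FNV-1a prime */
	}
	return hash;
}

static int string_map_example(void)
{
	struct pointer_map *map;
	static char key[] = "example";
	static char value[] = "payload";
	int result;

	result = vdo_make_pointer_map(0, 0, compare_string_keys,
				      hash_string_key, &map);
	if (result != UDS_SUCCESS)
		return result;

	result = vdo_pointer_map_put(map, key, value, true, NULL);
	/* A get with any string equal to "example" now returns value. */
	vdo_free_pointer_map(map);
	return result;
}

Because lookups go through the comparator, any pointer whose referent compares equal to the stored key finds the entry, which is the point of a pointer_map over an int_map keyed on the raw pointer value.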
+ */ +typedef u32 pointer_key_hasher(const void *key); + +int __must_check vdo_make_pointer_map(size_t initial_capacity, + unsigned int initial_load, + pointer_key_comparator comparator, + pointer_key_hasher hasher, + struct pointer_map **map_ptr); + +void vdo_free_pointer_map(struct pointer_map *map); + +size_t vdo_pointer_map_size(const struct pointer_map *map); + +void *vdo_pointer_map_get(struct pointer_map *map, const void *key); + +int __must_check vdo_pointer_map_put(struct pointer_map *map, + const void *key, + void *new_value, + bool update, + void **old_value_ptr); + +void *vdo_pointer_map_remove(struct pointer_map *map, const void *key); + +#endif /* VDO_POINTER_MAP_H */ diff --git a/drivers/md/dm-vdo/priority-table.c b/drivers/md/dm-vdo/priority-table.c new file mode 100644 index 000000000000..f8b81d6410c7 --- /dev/null +++ b/drivers/md/dm-vdo/priority-table.c @@ -0,0 +1,226 @@ +// SPDX-License-Identifier: GPL-2.0-only +/* + * Copyright 2023 Red Hat + */ + +#include "priority-table.h" + +#include + +#include "errors.h" +#include "memory-alloc.h" +#include "permassert.h" + +#include "status-codes.h" + +/* We use a single 64-bit search vector, so the maximum priority is 63 */ +enum { + MAX_PRIORITY = 63 +}; + +/* + * All the entries with the same priority are queued in a circular list in a bucket for that + * priority. The table is essentially an array of buckets. + */ +struct bucket { + /* + * The head of a queue of table entries, all having the same priority + */ + struct list_head queue; + /* The priority of all the entries in this bucket */ + unsigned int priority; +}; + +/* + * A priority table is an array of buckets, indexed by priority. New entries are added to the end + * of the queue in the appropriate bucket. The dequeue operation finds the highest-priority + * non-empty bucket by searching a bit vector represented as a single 8-byte word, which is very + * fast with compiler and CPU support. + */ +struct priority_table { + /* The maximum priority of entries that may be stored in this table */ + unsigned int max_priority; + /* A bit vector flagging all buckets that are currently non-empty */ + u64 search_vector; + /* The array of all buckets, indexed by priority */ + struct bucket buckets[]; +}; + +/** + * vdo_make_priority_table() - Allocate and initialize a new priority_table. + * @max_priority: The maximum priority value for table entries. + * @table_ptr: A pointer to hold the new table. + * + * Return: VDO_SUCCESS or an error code. + */ +int vdo_make_priority_table(unsigned int max_priority, struct priority_table **table_ptr) +{ + struct priority_table *table; + int result; + unsigned int priority; + + if (max_priority > MAX_PRIORITY) + return UDS_INVALID_ARGUMENT; + + result = UDS_ALLOCATE_EXTENDED(struct priority_table, max_priority + 1, + struct bucket, __func__, &table); + if (result != VDO_SUCCESS) + return result; + + for (priority = 0; priority <= max_priority; priority++) { + struct bucket *bucket = &table->buckets[priority]; + + bucket->priority = priority; + INIT_LIST_HEAD(&bucket->queue); + } + + table->max_priority = max_priority; + table->search_vector = 0; + + *table_ptr = table; + return VDO_SUCCESS; +} + +/** + * vdo_free_priority_table() - Free a priority_table. + * @table: The table to free. + * + * The table does not own the entries stored in it and they are not freed by this call. 
+ */ +void vdo_free_priority_table(struct priority_table *table) +{ + if (table == NULL) + return; + + /* + * Unlink the buckets from any entries still in the table so the entries won't be left with + * dangling pointers to freed memory. + */ + vdo_reset_priority_table(table); + + UDS_FREE(table); +} + +/** + * vdo_reset_priority_table() - Reset a priority table, leaving it in the same empty state as when + * newly constructed. + * @table: The table to reset. + * + * The table does not own the entries stored in it and they are not freed (or even unlinked from + * each other) by this call. + */ +void vdo_reset_priority_table(struct priority_table *table) +{ + unsigned int priority; + + table->search_vector = 0; + for (priority = 0; priority <= table->max_priority; priority++) + list_del_init(&table->buckets[priority].queue); +} + +/** + * vdo_priority_table_enqueue() - Add a new entry to the priority table, appending it to the queue + * for entries with the specified priority. + * @table: The table in which to store the entry. + * @priority: The priority of the entry. + * @entry: The list_head embedded in the entry to store in the table (the caller must have + * initialized it). + */ +void vdo_priority_table_enqueue(struct priority_table *table, + unsigned int priority, + struct list_head *entry) +{ + ASSERT_LOG_ONLY((priority <= table->max_priority), + "entry priority must be valid for the table"); + + /* Append the entry to the queue in the specified bucket. */ + list_move_tail(entry, &table->buckets[priority].queue); + + /* Flag the bucket in the search vector since it must be non-empty. */ + table->search_vector |= (1ULL << priority); +} + +static inline void mark_bucket_empty(struct priority_table *table, struct bucket *bucket) +{ + table->search_vector &= ~(1ULL << bucket->priority); +} + +/** + * vdo_priority_table_dequeue() - Find the highest-priority entry in the table, remove it from the + * table, and return it. + * @table: The priority table from which to remove an entry. + * + * If there are multiple entries with the same priority, the one that has been in the table with + * that priority the longest will be returned. + * + * Return: The dequeued entry, or NULL if the table is currently empty. + */ +struct list_head *vdo_priority_table_dequeue(struct priority_table *table) +{ + struct bucket *bucket; + struct list_head *entry; + int top_priority; + + if (table->search_vector == 0) + /* All buckets are empty. */ + return NULL; + + /* + * Find the highest priority non-empty bucket by finding the highest-order non-zero bit in + * the search vector. + */ + top_priority = ilog2(table->search_vector); + + /* Dequeue the first entry in the bucket. */ + bucket = &table->buckets[top_priority]; + entry = bucket->queue.next; + list_del_init(entry); + + /* Clear the bit in the search vector if the bucket has been emptied. */ + if (list_empty(&bucket->queue)) + mark_bucket_empty(table, bucket); + + return entry; +} + +/** + * vdo_priority_table_remove() - Remove a specified entry from its priority table. + * @table: The table from which to remove the entry. + * @entry: The entry to remove from the table. + */ +void vdo_priority_table_remove(struct priority_table *table, struct list_head *entry) +{ + struct list_head *next_entry; + + /* + * We can't guard against calls where the entry is on a list for a different table, but + * it's easy to deal with an entry not in any table or list. 
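The search-vector technique used by vdo_priority_table_dequeue() above can be shown in isolation. This sketch is not part of the patch and highest_nonempty_priority_example() is a hypothetical name; ilog2() is the kernel helper from linux/log2.h already used by the dequeue path.

/*
 * Sketch only, not part of the patch: bit p of the u64 search vector is
 * set while priority bucket p is non-empty, so ilog2() of the vector
 * yields the highest non-empty priority in constant time.
 */
static int highest_nonempty_priority_example(void)
{
	u64 search_vector = 0;

	search_vector |= 1ULL << 3;	/* priority 3 becomes non-empty */
	search_vector |= 1ULL << 17;	/* priority 17 becomes non-empty */

	/* ilog2(0x20008) == 17, the highest non-empty priority. */
	return ilog2(search_vector);
}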
+ */ + if (list_empty(entry)) + return; + + /* + * Remove the entry from the bucket list, remembering a pointer to another entry in the + * ring. + */ + next_entry = entry->next; + list_del_init(entry); + + /* + * If the rest of the list is now empty, the next node must be the list head in the bucket + * and we can use it to update the search vector. + */ + if (list_empty(next_entry)) + mark_bucket_empty(table, list_entry(next_entry, struct bucket, queue)); +} + +/** + * vdo_is_priority_table_empty() - Return whether the priority table is empty. + * @table: The table to check. + * + * Return: true if the table is empty. + */ +bool vdo_is_priority_table_empty(struct priority_table *table) +{ + return (table->search_vector == 0); +} diff --git a/drivers/md/dm-vdo/priority-table.h b/drivers/md/dm-vdo/priority-table.h new file mode 100644 index 000000000000..f56460ba0f09 --- /dev/null +++ b/drivers/md/dm-vdo/priority-table.h @@ -0,0 +1,48 @@ +/* SPDX-License-Identifier: GPL-2.0-only */ +/* + * Copyright 2023 Red Hat + */ + +#ifndef VDO_PRIORITY_TABLE_H +#define VDO_PRIORITY_TABLE_H + +#include + +/* + * A priority_table is a simple implementation of a priority queue for entries with priorities that + * are small non-negative integer values. It implements the obvious priority queue operations of + * enqueuing an entry and dequeuing an entry with the maximum priority. It also supports removing + * an arbitrary entry. The priority of an entry already in the table can be changed by removing it + * and re-enqueuing it with a different priority. All operations have O(1) complexity. + * + * The links for the table entries must be embedded in the entries themselves. Lists are used to + * link entries in the table and no wrapper type is declared, so an existing list entry in an + * object can also be used to queue it in a priority_table, assuming the field is not used for + * anything else while so queued. + * + * The table is implemented as an array of queues (circular lists) indexed by priority, along with + * a hint for which queues are non-empty. Steven Skiena calls a very similar structure a "bounded + * height priority queue", but given the resemblance to a hash table, "priority table" seems both + * shorter and more apt, if somewhat novel. 
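As a usage illustration of the interface described above (a hedged sketch, not part of this patch; struct example_item and priority_table_example() are hypothetical), an object embeds a struct list_head and is queued by priority. The highest priority is dequeued first, and entries of equal priority come out in FIFO order.

/* Hypothetical entry type, not part of the patch. */
struct example_item {
	struct list_head entry;	/* link used by the priority_table */
	int payload;
};

static int priority_table_example(void)
{
	struct priority_table *table;
	struct example_item low = { .payload = 1 };
	struct example_item high = { .payload = 2 };
	struct list_head *first;
	int result;

	result = vdo_make_priority_table(63, &table);
	if (result != VDO_SUCCESS)
		return result;

	/* The caller initializes the embedded list heads. */
	INIT_LIST_HEAD(&low.entry);
	INIT_LIST_HEAD(&high.entry);

	vdo_priority_table_enqueue(table, 5, &low.entry);
	vdo_priority_table_enqueue(table, 40, &high.entry);

	/* The priority-40 entry comes out first, so this reads 2. */
	first = vdo_priority_table_dequeue(table);
	result = list_entry(first, struct example_item, entry)->payload;

	/* Freeing the table does not free (or relink) the entries. */
	vdo_free_priority_table(table);
	return result;
}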
+ */ + +struct priority_table; + +int __must_check +vdo_make_priority_table(unsigned int max_priority, struct priority_table **table_ptr); + +void vdo_free_priority_table(struct priority_table *table); + +void vdo_priority_table_enqueue(struct priority_table *table, + unsigned int priority, + struct list_head *entry); + +void vdo_reset_priority_table(struct priority_table *table); + +struct list_head * __must_check vdo_priority_table_dequeue(struct priority_table *table); + +void vdo_priority_table_remove(struct priority_table *table, struct list_head *entry); + +bool __must_check vdo_is_priority_table_empty(struct priority_table *table); + +#endif /* VDO_PRIORITY_TABLE_H */ From patchwork Thu Oct 26 21:41:06 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Matthew Sakai X-Patchwork-Id: 13437932 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 17036C25B70 for ; Thu, 26 Oct 2023 21:43:16 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S232355AbjJZVnP (ORCPT ); Thu, 26 Oct 2023 17:43:15 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:40978 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S232306AbjJZVnP (ORCPT ); Thu, 26 Oct 2023 17:43:15 -0400 Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.133.124]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 1C03C1B6 for ; Thu, 26 Oct 2023 14:41:43 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1698356502; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=gtCR2xC9/+hPvx1vgjQd0jvGK4XHPfWiElTiAnqmQ98=; b=AnLfNzdr9IwY2YSip6vC2O/dtbD6es61bYrsqTFja5EwwPXfIFTgh3gCN2htkOgWm+wOag mnnIRKpkocqguh8it7cjBXCRCofzppC2P8oGZbHeD1wSBOImL76xc94RzWDEwyP3mHwhaS QbHq+Fye4A6FU6VWBdSBU9IavQibjEU= Received: from mimecast-mx02.redhat.com (mimecast-mx02.redhat.com [66.187.233.88]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.3, cipher=TLS_AES_256_GCM_SHA384) id us-mta-416-XBZybvfcOXmgcEa5xYyt4g-1; Thu, 26 Oct 2023 17:41:38 -0400 X-MC-Unique: XBZybvfcOXmgcEa5xYyt4g-1 Received: from smtp.corp.redhat.com (int-mx05.intmail.prod.int.rdu2.redhat.com [10.11.54.5]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (2048 bits) server-digest SHA256) (No client certificate requested) by mimecast-mx02.redhat.com (Postfix) with ESMTPS id A25F61019C83; Thu, 26 Oct 2023 21:41:38 +0000 (UTC) Received: from pbitcolo-build-10.permabit.com (pbitcolo-build-10.permabit.lab.eng.bos.redhat.com [10.19.117.76]) by smtp.corp.redhat.com (Postfix) with ESMTP id 936A12949; Thu, 26 Oct 2023 21:41:38 +0000 (UTC) Received: by pbitcolo-build-10.permabit.com (Postfix, from userid 1138) id 5320D3003B; Thu, 26 Oct 2023 17:41:38 -0400 (EDT) From: Matthew Sakai To: dm-devel@lists.linux.dev, linux-block@vger.kernel.org Cc: Matthew Sakai Subject: [PATCH v4 09/39] dm vdo: add deduplication configuration structures Date: Thu, 26 Oct 2023 17:41:06 -0400 Message-Id: <20231026214136.1067410-10-msakai@redhat.com> In-Reply-To: 
<20231026214136.1067410-1-msakai@redhat.com> References: <20231026214136.1067410-1-msakai@redhat.com> MIME-Version: 1.0 X-Scanned-By: MIMEDefang 3.4.1 on 10.11.54.5 Precedence: bulk List-ID: X-Mailing-List: linux-block@vger.kernel.org Add structures which record the configuration of various deduplication index parameters. This also includes facilities for saving and loading the configuration and validating its integrity. Co-developed-by: J. corwin Coburn Signed-off-by: J. corwin Coburn Co-developed-by: Michael Sclafani Signed-off-by: Michael Sclafani Co-developed-by: Thomas Jaskiewicz Signed-off-by: Thomas Jaskiewicz Co-developed-by: John Wiele Signed-off-by: John Wiele Signed-off-by: Matthew Sakai --- drivers/md/dm-vdo/config.c | 389 +++++++++++++++++++++++++++++++++++ drivers/md/dm-vdo/config.h | 125 +++++++++++ drivers/md/dm-vdo/geometry.c | 205 ++++++++++++++++++ drivers/md/dm-vdo/geometry.h | 137 ++++++++++++ 4 files changed, 856 insertions(+) create mode 100644 drivers/md/dm-vdo/config.c create mode 100644 drivers/md/dm-vdo/config.h create mode 100644 drivers/md/dm-vdo/geometry.c create mode 100644 drivers/md/dm-vdo/geometry.h diff --git a/drivers/md/dm-vdo/config.c b/drivers/md/dm-vdo/config.c new file mode 100644 index 000000000000..6a89f029dc21 --- /dev/null +++ b/drivers/md/dm-vdo/config.c @@ -0,0 +1,389 @@ +// SPDX-License-Identifier: GPL-2.0-only +/* + * Copyright 2023 Red Hat + */ + +#include "config.h" + +#include "logger.h" +#include "memory-alloc.h" +#include "numeric.h" +#include "string-utils.h" +#include "uds-threads.h" + +static const u8 INDEX_CONFIG_MAGIC[] = "ALBIC"; +static const u8 INDEX_CONFIG_VERSION_6_02[] = "06.02"; +static const u8 INDEX_CONFIG_VERSION_8_02[] = "08.02"; + +enum { + DEFAULT_VOLUME_READ_THREADS = 2, + MAX_VOLUME_READ_THREADS = 16, + INDEX_CONFIG_MAGIC_LENGTH = sizeof(INDEX_CONFIG_MAGIC) - 1, + INDEX_CONFIG_VERSION_LENGTH = sizeof(INDEX_CONFIG_VERSION_6_02) - 1, +}; + +static bool is_version(const u8 *version, u8 *buffer) +{ + return memcmp(version, buffer, INDEX_CONFIG_VERSION_LENGTH) == 0; +} + +static bool are_matching_configurations(struct configuration *saved_config, + struct geometry *saved_geometry, + struct configuration *user) +{ + struct geometry *geometry = user->geometry; + bool result = true; + + if (saved_geometry->record_pages_per_chapter != geometry->record_pages_per_chapter) { + uds_log_error("Record pages per chapter (%u) does not match (%u)", + saved_geometry->record_pages_per_chapter, + geometry->record_pages_per_chapter); + result = false; + } + + if (saved_geometry->chapters_per_volume != geometry->chapters_per_volume) { + uds_log_error("Chapter count (%u) does not match (%u)", + saved_geometry->chapters_per_volume, + geometry->chapters_per_volume); + result = false; + } + + if (saved_geometry->sparse_chapters_per_volume != geometry->sparse_chapters_per_volume) { + uds_log_error("Sparse chapter count (%u) does not match (%u)", + saved_geometry->sparse_chapters_per_volume, + geometry->sparse_chapters_per_volume); + result = false; + } + + if (saved_config->cache_chapters != user->cache_chapters) { + uds_log_error("Cache size (%u) does not match (%u)", + saved_config->cache_chapters, + user->cache_chapters); + result = false; + } + + if (saved_config->volume_index_mean_delta != user->volume_index_mean_delta) { + uds_log_error("Volume index mean delta (%u) does not match (%u)", + saved_config->volume_index_mean_delta, + user->volume_index_mean_delta); + result = false; + } + + if (saved_geometry->bytes_per_page != 
geometry->bytes_per_page) { + uds_log_error("Bytes per page value (%zu) does not match (%zu)", + saved_geometry->bytes_per_page, + geometry->bytes_per_page); + result = false; + } + + if (saved_config->sparse_sample_rate != user->sparse_sample_rate) { + uds_log_error("Sparse sample rate (%u) does not match (%u)", + saved_config->sparse_sample_rate, + user->sparse_sample_rate); + result = false; + } + + if (saved_config->nonce != user->nonce) { + uds_log_error("Nonce (%llu) does not match (%llu)", + (unsigned long long) saved_config->nonce, + (unsigned long long) user->nonce); + result = false; + } + + return result; +} + +/* Read the configuration and validate it against the provided one. */ +int uds_validate_config_contents(struct buffered_reader *reader, struct configuration *user_config) +{ + int result; + struct configuration config; + struct geometry geometry; + u8 version_buffer[INDEX_CONFIG_VERSION_LENGTH]; + u32 bytes_per_page; + u8 buffer[sizeof(struct uds_configuration_6_02)]; + size_t offset = 0; + + result = uds_verify_buffered_data(reader, INDEX_CONFIG_MAGIC, INDEX_CONFIG_MAGIC_LENGTH); + if (result != UDS_SUCCESS) + return result; + + result = uds_read_from_buffered_reader(reader, + version_buffer, + INDEX_CONFIG_VERSION_LENGTH); + if (result != UDS_SUCCESS) + return uds_log_error_strerror(result, "cannot read index config version"); + + if (!is_version(INDEX_CONFIG_VERSION_6_02, version_buffer) && + !is_version(INDEX_CONFIG_VERSION_8_02, version_buffer)) { + return uds_log_error_strerror(UDS_CORRUPT_DATA, + "unsupported configuration version: '%.*s'", + INDEX_CONFIG_VERSION_LENGTH, + version_buffer); + } + + result = uds_read_from_buffered_reader(reader, buffer, sizeof(buffer)); + if (result != UDS_SUCCESS) + return uds_log_error_strerror(result, "cannot read config data"); + + decode_u32_le(buffer, &offset, &geometry.record_pages_per_chapter); + decode_u32_le(buffer, &offset, &geometry.chapters_per_volume); + decode_u32_le(buffer, &offset, &geometry.sparse_chapters_per_volume); + decode_u32_le(buffer, &offset, &config.cache_chapters); + offset += sizeof(u32); + decode_u32_le(buffer, &offset, &config.volume_index_mean_delta); + decode_u32_le(buffer, &offset, &bytes_per_page); + geometry.bytes_per_page = bytes_per_page; + decode_u32_le(buffer, &offset, &config.sparse_sample_rate); + decode_u64_le(buffer, &offset, &config.nonce); + + result = ASSERT(offset == sizeof(struct uds_configuration_6_02), + "%zu bytes read but not decoded", + sizeof(struct uds_configuration_6_02) - offset); + if (result != UDS_SUCCESS) + return UDS_CORRUPT_DATA; + + if (is_version(INDEX_CONFIG_VERSION_6_02, version_buffer)) { + user_config->geometry->remapped_virtual = 0; + user_config->geometry->remapped_physical = 0; + } else { + u8 remapping[sizeof(u64) + sizeof(u64)]; + + result = uds_read_from_buffered_reader(reader, remapping, sizeof(remapping)); + if (result != UDS_SUCCESS) + return uds_log_error_strerror(result, "cannot read converted config"); + + offset = 0; + decode_u64_le(remapping, &offset, &user_config->geometry->remapped_virtual); + decode_u64_le(remapping, &offset, &user_config->geometry->remapped_physical); + } + + if (!are_matching_configurations(&config, &geometry, user_config)) { + uds_log_warning("Supplied configuration does not match save"); + return UDS_NO_INDEX; + } + + return UDS_SUCCESS; +} + +/* + * Write the configuration to stable storage. 
If the superblock version is < 4, write the 6.02 + * version; otherwise write the 8.02 version, indicating the configuration is for an index that has + * been reduced by one chapter. + */ +int uds_write_config_contents(struct buffered_writer *writer, + struct configuration *config, + u32 version) +{ + int result; + struct geometry *geometry = config->geometry; + u8 buffer[sizeof(struct uds_configuration_8_02)]; + size_t offset = 0; + + result = uds_write_to_buffered_writer(writer, + INDEX_CONFIG_MAGIC, + INDEX_CONFIG_MAGIC_LENGTH); + if (result != UDS_SUCCESS) + return result; + + /* + * If version is < 4, the index has not been reduced by a chapter so it must be written out + * as version 6.02 so that it is still compatible with older versions of UDS. + */ + if (version >= 4) { + result = uds_write_to_buffered_writer(writer, + INDEX_CONFIG_VERSION_8_02, + INDEX_CONFIG_VERSION_LENGTH); + if (result != UDS_SUCCESS) + return result; + } else { + result = uds_write_to_buffered_writer(writer, + INDEX_CONFIG_VERSION_6_02, + INDEX_CONFIG_VERSION_LENGTH); + if (result != UDS_SUCCESS) + return result; + } + + encode_u32_le(buffer, &offset, geometry->record_pages_per_chapter); + encode_u32_le(buffer, &offset, geometry->chapters_per_volume); + encode_u32_le(buffer, &offset, geometry->sparse_chapters_per_volume); + encode_u32_le(buffer, &offset, config->cache_chapters); + encode_u32_le(buffer, &offset, 0); + encode_u32_le(buffer, &offset, config->volume_index_mean_delta); + encode_u32_le(buffer, &offset, geometry->bytes_per_page); + encode_u32_le(buffer, &offset, config->sparse_sample_rate); + encode_u64_le(buffer, &offset, config->nonce); + + result = ASSERT(offset == sizeof(struct uds_configuration_6_02), + "%zu bytes encoded, of %zu expected", + offset, + sizeof(struct uds_configuration_6_02)); + if (result != UDS_SUCCESS) + return result; + + if (version >= 4) { + encode_u64_le(buffer, &offset, geometry->remapped_virtual); + encode_u64_le(buffer, &offset, geometry->remapped_physical); + } + + return uds_write_to_buffered_writer(writer, buffer, offset); +} + +/* Compute configuration parameters that depend on memory size. 
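+ * For example, a dense 256MB request (UDS_MEMORY_CONFIG_256MB) yields
+ * DEFAULT_CHAPTERS_PER_VOLUME (1024) chapters of
+ * SMALL_RECORD_PAGES_PER_CHAPTER (64) record pages each, while the same
+ * request with sparse indexing enabled yields (19 * 1024) / 2 = 9728 sparse
+ * chapters out of 10 * 1024 = 10240 total chapters.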
*/ +static int compute_memory_sizes(uds_memory_config_size_t mem_gb, + bool sparse, + u32 *chapters_per_volume, + u32 *record_pages_per_chapter, + u32 *sparse_chapters_per_volume) +{ + u32 reduced_chapters = 0; + u32 base_chapters; + + if (mem_gb == UDS_MEMORY_CONFIG_256MB) { + base_chapters = DEFAULT_CHAPTERS_PER_VOLUME; + *record_pages_per_chapter = SMALL_RECORD_PAGES_PER_CHAPTER; + } else if (mem_gb == UDS_MEMORY_CONFIG_512MB) { + base_chapters = DEFAULT_CHAPTERS_PER_VOLUME; + *record_pages_per_chapter = 2 * SMALL_RECORD_PAGES_PER_CHAPTER; + } else if (mem_gb == UDS_MEMORY_CONFIG_768MB) { + base_chapters = DEFAULT_CHAPTERS_PER_VOLUME; + *record_pages_per_chapter = 3 * SMALL_RECORD_PAGES_PER_CHAPTER; + } else if ((mem_gb >= 1) && (mem_gb <= UDS_MEMORY_CONFIG_MAX)) { + base_chapters = mem_gb * DEFAULT_CHAPTERS_PER_VOLUME; + *record_pages_per_chapter = DEFAULT_RECORD_PAGES_PER_CHAPTER; + } else if (mem_gb == UDS_MEMORY_CONFIG_REDUCED_256MB) { + reduced_chapters = 1; + base_chapters = DEFAULT_CHAPTERS_PER_VOLUME; + *record_pages_per_chapter = SMALL_RECORD_PAGES_PER_CHAPTER; + } else if (mem_gb == UDS_MEMORY_CONFIG_REDUCED_512MB) { + reduced_chapters = 1; + base_chapters = DEFAULT_CHAPTERS_PER_VOLUME; + *record_pages_per_chapter = 2 * SMALL_RECORD_PAGES_PER_CHAPTER; + } else if (mem_gb == UDS_MEMORY_CONFIG_REDUCED_768MB) { + reduced_chapters = 1; + base_chapters = DEFAULT_CHAPTERS_PER_VOLUME; + *record_pages_per_chapter = 3 * SMALL_RECORD_PAGES_PER_CHAPTER; + } else if ((mem_gb >= 1 + UDS_MEMORY_CONFIG_REDUCED) && + (mem_gb <= UDS_MEMORY_CONFIG_REDUCED_MAX)) { + reduced_chapters = 1; + base_chapters = ((mem_gb - UDS_MEMORY_CONFIG_REDUCED) * + DEFAULT_CHAPTERS_PER_VOLUME); + *record_pages_per_chapter = DEFAULT_RECORD_PAGES_PER_CHAPTER; + } else { + uds_log_error("received invalid memory size"); + return -EINVAL; + } + + if (sparse) { + /* Make 95% of chapters sparse, allowing 10x more records. */ + *sparse_chapters_per_volume = (19 * base_chapters) / 2; + base_chapters *= 10; + } else { + *sparse_chapters_per_volume = 0; + } + + *chapters_per_volume = base_chapters - reduced_chapters; + return UDS_SUCCESS; +} + +static unsigned int __must_check normalize_zone_count(unsigned int requested) +{ + unsigned int zone_count = requested; + + if (zone_count == 0) + zone_count = num_online_cpus() / 2; + + if (zone_count < 1) + zone_count = 1; + + if (zone_count > MAX_ZONES) + zone_count = MAX_ZONES; + + uds_log_info("Using %u indexing zone%s for concurrency.", + zone_count, + zone_count == 1 ? 
"" : "s"); + return zone_count; +} + +static unsigned int __must_check normalize_read_threads(unsigned int requested) +{ + unsigned int read_threads = requested; + + if (read_threads < 1) + read_threads = DEFAULT_VOLUME_READ_THREADS; + + if (read_threads > MAX_VOLUME_READ_THREADS) + read_threads = MAX_VOLUME_READ_THREADS; + + return read_threads; +} + +int uds_make_configuration(const struct uds_parameters *params, struct configuration **config_ptr) +{ + struct configuration *config; + u32 chapters_per_volume = 0; + u32 record_pages_per_chapter = 0; + u32 sparse_chapters_per_volume = 0; + int result; + + result = compute_memory_sizes(params->memory_size, + params->sparse, + &chapters_per_volume, + &record_pages_per_chapter, + &sparse_chapters_per_volume); + if (result != UDS_SUCCESS) + return result; + + result = UDS_ALLOCATE(1, struct configuration, __func__, &config); + if (result != UDS_SUCCESS) + return result; + + result = uds_make_geometry(DEFAULT_BYTES_PER_PAGE, + record_pages_per_chapter, + chapters_per_volume, + sparse_chapters_per_volume, + 0, + 0, + &config->geometry); + if (result != UDS_SUCCESS) { + uds_free_configuration(config); + return result; + } + + config->zone_count = normalize_zone_count(params->zone_count); + config->read_threads = normalize_read_threads(params->read_threads); + + config->cache_chapters = DEFAULT_CACHE_CHAPTERS; + config->volume_index_mean_delta = DEFAULT_VOLUME_INDEX_MEAN_DELTA; + config->sparse_sample_rate = (params->sparse ? DEFAULT_SPARSE_SAMPLE_RATE : 0); + config->nonce = params->nonce; + config->bdev = params->bdev; + config->offset = params->offset; + config->size = params->size; + + *config_ptr = config; + return UDS_SUCCESS; +} + +void uds_free_configuration(struct configuration *config) +{ + if (config != NULL) { + uds_free_geometry(config->geometry); + UDS_FREE(config); + } +} + +void uds_log_configuration(struct configuration *config) +{ + struct geometry *geometry = config->geometry; + + uds_log_debug("Configuration:"); + uds_log_debug(" Record pages per chapter: %10u", geometry->record_pages_per_chapter); + uds_log_debug(" Chapters per volume: %10u", geometry->chapters_per_volume); + uds_log_debug(" Sparse chapters per volume: %10u", geometry->sparse_chapters_per_volume); + uds_log_debug(" Cache size (chapters): %10u", config->cache_chapters); + uds_log_debug(" Volume index mean delta: %10u", config->volume_index_mean_delta); + uds_log_debug(" Bytes per page: %10zu", geometry->bytes_per_page); + uds_log_debug(" Sparse sample rate: %10u", config->sparse_sample_rate); + uds_log_debug(" Nonce: %llu", (unsigned long long) config->nonce); +} diff --git a/drivers/md/dm-vdo/config.h b/drivers/md/dm-vdo/config.h new file mode 100644 index 000000000000..e62513428b99 --- /dev/null +++ b/drivers/md/dm-vdo/config.h @@ -0,0 +1,125 @@ +/* SPDX-License-Identifier: GPL-2.0-only */ +/* + * Copyright 2023 Red Hat + */ + +#ifndef UDS_CONFIG_H +#define UDS_CONFIG_H + +#include "geometry.h" +#include "io-factory.h" +#include "uds.h" + +/* + * The configuration records a variety of parameters used to configure a new UDS index. Some + * parameters are provided by the client, while others are fixed or derived from user-supplied + * values. It is created when an index is created, and it is recorded in the index metadata. + */ + +enum { + DEFAULT_VOLUME_INDEX_MEAN_DELTA = 4096, + DEFAULT_CACHE_CHAPTERS = 7, + DEFAULT_SPARSE_SAMPLE_RATE = 32, + MAX_ZONES = 16, +}; + +/* A set of configuration parameters for the indexer. 
*/ +struct configuration { + /* Storage device for the index */ + struct block_device *bdev; + + /* The maximum allowable size of the index */ + size_t size; + + /* The offset where the index should start */ + off_t offset; + + /* Parameters for the volume */ + + /* The volume layout */ + struct geometry *geometry; + + /* Index owner's nonce */ + u64 nonce; + + /* The number of threads used to process index requests */ + unsigned int zone_count; + + /* The number of threads used to read volume pages */ + unsigned int read_threads; + + /* Size of the page cache and sparse chapter index cache in chapters */ + u32 cache_chapters; + + /* Parameters for the volume index */ + + /* The mean delta for the volume index */ + u32 volume_index_mean_delta; + + /* Sampling rate for sparse indexing */ + u32 sparse_sample_rate; +}; + +/* On-disk structure of data for a version 8.02 index. */ +struct uds_configuration_8_02 { + /* Smaller (16), Small (64) or large (256) indices */ + u32 record_pages_per_chapter; + /* Total number of chapters per volume */ + u32 chapters_per_volume; + /* Number of sparse chapters per volume */ + u32 sparse_chapters_per_volume; + /* Size of the page cache, in chapters */ + u32 cache_chapters; + /* Unused field */ + u32 unused; + /* The volume index mean delta to use */ + u32 volume_index_mean_delta; + /* Size of a page, used for both record pages and index pages */ + u32 bytes_per_page; + /* Sampling rate for sparse indexing */ + u32 sparse_sample_rate; + /* Index owner's nonce */ + u64 nonce; + /* Virtual chapter remapped from physical chapter 0 */ + u64 remapped_virtual; + /* New physical chapter which remapped chapter was moved to */ + u64 remapped_physical; +} __packed; + +/* On-disk structure of data for a version 6.02 index. */ +struct uds_configuration_6_02 { + /* Smaller (16), Small (64) or large (256) indices */ + u32 record_pages_per_chapter; + /* Total number of chapters per volume */ + u32 chapters_per_volume; + /* Number of sparse chapters per volume */ + u32 sparse_chapters_per_volume; + /* Size of the page cache, in chapters */ + u32 cache_chapters; + /* Unused field */ + u32 unused; + /* The volume index mean delta to use */ + u32 volume_index_mean_delta; + /* Size of a page, used for both record pages and index pages */ + u32 bytes_per_page; + /* Sampling rate for sparse indexing */ + u32 sparse_sample_rate; + /* Index owner's nonce */ + u64 nonce; +} __packed; + +int __must_check +uds_make_configuration(const struct uds_parameters *params, struct configuration **config_ptr); + +void uds_free_configuration(struct configuration *config); + +int __must_check +uds_validate_config_contents(struct buffered_reader *reader, struct configuration *config); + +int __must_check uds_write_config_contents(struct buffered_writer *writer, + struct configuration *config, + u32 version); + +void uds_log_configuration(struct configuration *config); + +#endif /* UDS_CONFIG_H */ diff --git a/drivers/md/dm-vdo/geometry.c b/drivers/md/dm-vdo/geometry.c new file mode 100644 index 000000000000..f4a4bcee5b73 --- /dev/null +++ b/drivers/md/dm-vdo/geometry.c @@ -0,0 +1,205 @@ +// SPDX-License-Identifier: GPL-2.0-only +/* + * Copyright 2023 Red Hat + */ + +#include "geometry.h" + +#include +#include + +#include "delta-index.h" +#include "errors.h" +#include "logger.h" +#include "memory-alloc.h" +#include "permassert.h" +#include "uds.h" + +/* + * An index volume is divided into a fixed number of fixed-size chapters, each consisting of a + * fixed number of fixed-size pages. 
The volume layout is defined by two constants and four + * parameters. The constants are that index records are 32 bytes long (16-byte block name plus + * 16-byte metadata) and that open chapter index hash slots are one byte long. The four parameters + * are the number of bytes in a page, the number of record pages in a chapter, the number of + * chapters in a volume, and the number of chapters that are sparse. From these parameters, we can + * derive the rest of the layout and other index properties. + * + * The index volume is sized by its maximum memory footprint. For a dense index, the persistent + * storage is about 10 times the size of the memory footprint. For a sparse index, the persistent + * storage is about 100 times the size of the memory footprint. + * + * For a small index with a memory footprint less than 1GB, there are three possible memory + * configurations: 0.25GB, 0.5GB and 0.75GB. The default geometry for each is 1024 index records + * per 32 KB page, 1024 chapters per volume, and either 64, 128, or 192 record pages per chapter + * (resulting in 6, 13, or 20 index pages per chapter) depending on the memory configuration. For + * the VDO default of a 0.25 GB index, this yields a deduplication window of 256 GB using about 2.5 + * GB for the persistent storage and 256 MB of RAM. + * + * For a larger index with a memory footprint that is a multiple of 1 GB, the geometry is 1024 + * index records per 32 KB page, 256 record pages per chapter, 26 index pages per chapter, and 1024 + * chapters for every GB of memory footprint. For a 1 GB volume, this yields a deduplication window + * of 1 TB using about 9GB of persistent storage and 1 GB of RAM. + * + * The above numbers hold for volumes which have no sparse chapters. A sparse volume has 10 times + * as many chapters as the corresponding non-sparse volume, which provides 10 times the + * deduplication window while using 10 times as much persistent storage as the equivalent + * non-sparse volume with the same memory footprint. + * + * If the volume has been converted from a non-lvm format to an lvm volume, the number of chapters + * per volume will have been reduced by one by eliminating physical chapter 0, and the virtual + * chapter that formerly mapped to physical chapter 0 may be remapped to another physical chapter. + * This remapping is expressed by storing which virtual chapter was remapped, and which physical + * chapter it was moved to. 
+ */ + +int uds_make_geometry(size_t bytes_per_page, + u32 record_pages_per_chapter, + u32 chapters_per_volume, + u32 sparse_chapters_per_volume, + u64 remapped_virtual, + u64 remapped_physical, + struct geometry **geometry_ptr) +{ + int result; + struct geometry *geometry; + + result = UDS_ALLOCATE(1, struct geometry, "geometry", &geometry); + if (result != UDS_SUCCESS) + return result; + + geometry->bytes_per_page = bytes_per_page; + geometry->record_pages_per_chapter = record_pages_per_chapter; + geometry->chapters_per_volume = chapters_per_volume; + geometry->sparse_chapters_per_volume = sparse_chapters_per_volume; + geometry->dense_chapters_per_volume = chapters_per_volume - sparse_chapters_per_volume; + geometry->remapped_virtual = remapped_virtual; + geometry->remapped_physical = remapped_physical; + + geometry->records_per_page = bytes_per_page / BYTES_PER_RECORD; + geometry->records_per_chapter = geometry->records_per_page * record_pages_per_chapter; + geometry->records_per_volume = (u64) geometry->records_per_chapter * chapters_per_volume; + + geometry->chapter_mean_delta = 1 << DEFAULT_CHAPTER_MEAN_DELTA_BITS; + geometry->chapter_payload_bits = bits_per(record_pages_per_chapter - 1); + /* + * We want 1 delta list for every 64 records in the chapter. + * The "| 077" ensures that the chapter_delta_list_bits computation + * does not underflow. + */ + geometry->chapter_delta_list_bits = + bits_per((geometry->records_per_chapter - 1) | 077) - 6; + geometry->delta_lists_per_chapter = 1 << geometry->chapter_delta_list_bits; + /* We need enough address bits to achieve the desired mean delta. */ + geometry->chapter_address_bits = + (DEFAULT_CHAPTER_MEAN_DELTA_BITS - + geometry->chapter_delta_list_bits + + bits_per(geometry->records_per_chapter - 1)); + geometry->index_pages_per_chapter = + uds_get_delta_index_page_count(geometry->records_per_chapter, + geometry->delta_lists_per_chapter, + geometry->chapter_mean_delta, + geometry->chapter_payload_bits, + bytes_per_page); + + geometry->pages_per_chapter = geometry->index_pages_per_chapter + record_pages_per_chapter; + geometry->pages_per_volume = geometry->pages_per_chapter * chapters_per_volume; + geometry->bytes_per_volume = + bytes_per_page * (geometry->pages_per_volume + HEADER_PAGES_PER_VOLUME); + + *geometry_ptr = geometry; + return UDS_SUCCESS; +} + +int uds_copy_geometry(struct geometry *source, struct geometry **geometry_ptr) +{ + return uds_make_geometry(source->bytes_per_page, + source->record_pages_per_chapter, + source->chapters_per_volume, + source->sparse_chapters_per_volume, + source->remapped_virtual, + source->remapped_physical, + geometry_ptr); +} + +void uds_free_geometry(struct geometry *geometry) +{ + UDS_FREE(geometry); +} + +u32 __must_check uds_map_to_physical_chapter(const struct geometry *geometry, u64 virtual_chapter) +{ + u64 delta; + + if (!uds_is_reduced_geometry(geometry)) + return virtual_chapter % geometry->chapters_per_volume; + + if (likely(virtual_chapter > geometry->remapped_virtual)) { + delta = virtual_chapter - geometry->remapped_virtual; + if (likely(delta > geometry->remapped_physical)) + return delta % geometry->chapters_per_volume; + else + return delta - 1; + } + + if (virtual_chapter == geometry->remapped_virtual) + return geometry->remapped_physical; + + delta = geometry->remapped_virtual - virtual_chapter; + if (delta < geometry->chapters_per_volume) + return geometry->chapters_per_volume - delta; + + /* This chapter is so old the answer doesn't matter. 
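+ * (This case only arises when the requested chapter is at least a full
+ * volume's worth of chapters older than the remapped chapter, so its data
+ * is long gone from the volume in any case.)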
*/ + return 0; +} + +/* Check whether any sparse chapters are in use. */ +bool uds_has_sparse_chapters(const struct geometry *geometry, + u64 oldest_virtual_chapter, + u64 newest_virtual_chapter) +{ + return uds_is_sparse_geometry(geometry) && + ((newest_virtual_chapter - oldest_virtual_chapter + 1) > + geometry->dense_chapters_per_volume); +} + +bool uds_is_chapter_sparse(const struct geometry *geometry, + u64 oldest_virtual_chapter, + u64 newest_virtual_chapter, + u64 virtual_chapter_number) +{ + return uds_has_sparse_chapters(geometry, + oldest_virtual_chapter, + newest_virtual_chapter) && + ((virtual_chapter_number + geometry->dense_chapters_per_volume) <= + newest_virtual_chapter); +} + +/* Calculate how many chapters to expire after opening the newest chapter. */ +u32 uds_chapters_to_expire(const struct geometry *geometry, u64 newest_chapter) +{ + /* If the index isn't full yet, don't expire anything. */ + if (newest_chapter < geometry->chapters_per_volume) + return 0; + + /* If a chapter is out of order... */ + if (geometry->remapped_physical > 0) { + u64 oldest_chapter = newest_chapter - geometry->chapters_per_volume; + + /* + * ... expire an extra chapter when expiring the moved chapter to free physical + * space for the new chapter ... + */ + if (oldest_chapter == geometry->remapped_virtual) + return 2; + + /* + * ... but don't expire anything when the new chapter will use the physical chapter + * freed by expiring the moved chapter. + */ + if (oldest_chapter == (geometry->remapped_virtual + geometry->remapped_physical)) + return 0; + } + + /* Normally, just expire one. */ + return 1; +} diff --git a/drivers/md/dm-vdo/geometry.h b/drivers/md/dm-vdo/geometry.h new file mode 100644 index 000000000000..963dc7629c92 --- /dev/null +++ b/drivers/md/dm-vdo/geometry.h @@ -0,0 +1,137 @@ +/* SPDX-License-Identifier: GPL-2.0-only */ +/* + * Copyright 2023 Red Hat + */ + +#ifndef UDS_GEOMETRY_H +#define UDS_GEOMETRY_H + +#include "uds.h" + +/* + * The geometry records parameters that define the layout of a UDS index volume, and the size and + * shape of various index structures. It is created when the index is created, and is referenced by + * many index sub-components. + */ + +struct geometry { + /* Size of a chapter page, in bytes */ + size_t bytes_per_page; + /* Number of record pages in a chapter */ + u32 record_pages_per_chapter; + /* Total number of chapters in a volume */ + u32 chapters_per_volume; + /* Number of sparsely-indexed chapters in a volume */ + u32 sparse_chapters_per_volume; + /* Number of bits used to determine delta list numbers */ + u8 chapter_delta_list_bits; + /* Virtual chapter remapped from physical chapter 0 */ + u64 remapped_virtual; + /* New physical chapter where the remapped chapter can be found */ + u64 remapped_physical; + + /* + * The following properties are derived from the ones above, but they are computed and + * recorded as fields for convenience. 
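+ * For example, records_per_page is bytes_per_page / BYTES_PER_RECORD and
+ * records_per_chapter is records_per_page * record_pages_per_chapter, as
+ * computed in uds_make_geometry().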
+ */ + /* Total number of pages in a volume, excluding the header */ + u32 pages_per_volume; + /* Total number of bytes in a volume, including the header */ + size_t bytes_per_volume; + /* Number of pages in a chapter */ + u32 pages_per_chapter; + /* Number of index pages in a chapter index */ + u32 index_pages_per_chapter; + /* Number of records that fit on a page */ + u32 records_per_page; + /* Number of records that fit in a chapter */ + u32 records_per_chapter; + /* Number of records that fit in a volume */ + u64 records_per_volume; + /* Number of delta lists per chapter index */ + u32 delta_lists_per_chapter; + /* Mean delta for chapter indexes */ + u32 chapter_mean_delta; + /* Number of bits needed for record page numbers */ + u8 chapter_payload_bits; + /* Number of bits used to compute addresses for chapter delta lists */ + u8 chapter_address_bits; + /* Number of densely-indexed chapters in a volume */ + u32 dense_chapters_per_volume; +}; + +enum { + /* The number of bytes in a record (name + metadata) */ + BYTES_PER_RECORD = (UDS_RECORD_NAME_SIZE + UDS_RECORD_DATA_SIZE), + + /* The default length of a page in a chapter, in bytes */ + DEFAULT_BYTES_PER_PAGE = 1024 * BYTES_PER_RECORD, + + /* The default maximum number of records per page */ + DEFAULT_RECORDS_PER_PAGE = DEFAULT_BYTES_PER_PAGE / BYTES_PER_RECORD, + + /* The default number of record pages in a chapter */ + DEFAULT_RECORD_PAGES_PER_CHAPTER = 256, + + /* The default number of record pages in a chapter for a small index */ + SMALL_RECORD_PAGES_PER_CHAPTER = 64, + + /* The default number of chapters in a volume */ + DEFAULT_CHAPTERS_PER_VOLUME = 1024, + + /* The default number of sparsely-indexed chapters in a volume */ + DEFAULT_SPARSE_CHAPTERS_PER_VOLUME = 0, + + /* The log2 of the default mean delta */ + DEFAULT_CHAPTER_MEAN_DELTA_BITS = 16, + + /* The log2 of the number of delta lists in a large chapter */ + DEFAULT_CHAPTER_DELTA_LIST_BITS = 12, + + /* The log2 of the number of delta lists in a small chapter */ + SMALL_CHAPTER_DELTA_LIST_BITS = 10, + + /* The number of header pages per volume */ + HEADER_PAGES_PER_VOLUME = 1, +}; + +int __must_check uds_make_geometry(size_t bytes_per_page, + u32 record_pages_per_chapter, + u32 chapters_per_volume, + u32 sparse_chapters_per_volume, + u64 remapped_virtual, + u64 remapped_physical, + struct geometry **geometry_ptr); + +int __must_check uds_copy_geometry(struct geometry *source, struct geometry **geometry_ptr); + +void uds_free_geometry(struct geometry *geometry); + +u32 __must_check uds_map_to_physical_chapter(const struct geometry *geometry, u64 virtual_chapter); + +/* + * Check whether this geometry is reduced by a chapter. This will only be true if the volume was + * converted from a non-lvm volume to an lvm volume. 
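+ * (The parity test below works because unreduced geometries are always
+ * created with an even chapter count, a multiple of
+ * DEFAULT_CHAPTERS_PER_VOLUME, so removing exactly one chapter makes the
+ * count odd.)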
+ */ +static inline bool __must_check uds_is_reduced_geometry(const struct geometry *geometry) +{ + return !!(geometry->chapters_per_volume & 1); +} + +static inline bool __must_check uds_is_sparse_geometry(const struct geometry *geometry) +{ + return geometry->sparse_chapters_per_volume > 0; +} + +bool __must_check uds_has_sparse_chapters(const struct geometry *geometry, + u64 oldest_virtual_chapter, + u64 newest_virtual_chapter); + +bool __must_check uds_is_chapter_sparse(const struct geometry *geometry, + u64 oldest_virtual_chapter, + u64 newest_virtual_chapter, + u64 virtual_chapter_number); + +u32 __must_check uds_chapters_to_expire(const struct geometry *geometry, u64 newest_chapter); + +#endif /* UDS_GEOMETRY_H */ From patchwork Thu Oct 26 21:41:07 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Matthew Sakai X-Patchwork-Id: 13437938 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 742CDC25B48 for ; Thu, 26 Oct 2023 21:43:23 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S234789AbjJZVnX (ORCPT ); Thu, 26 Oct 2023 17:43:23 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:41056 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S233234AbjJZVnW (ORCPT ); Thu, 26 Oct 2023 17:43:22 -0400 Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.133.124]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 19E1DD44 for ; Thu, 26 Oct 2023 14:41:43 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1698356503; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=BmpRSjH1g7zodzS9Ws3arMDh+IQrJeyBVaODHyRihpw=; b=HKmDdbO8lPqGYXx6xhIu3cdKpumXHLdw/eDC9KWgChMpe+YxiEVfq2ZGs9koPzieq/b7nB SunMGGKlzNjPD0N8a0yFerqGy3YW/fgBr0eDIKqsQwCnQbmp00yUl20lO1W636MzBiISMa r1KJyHvdblMjnaqijmQARMhoq+e36j0= Received: from mimecast-mx02.redhat.com (mx-ext.redhat.com [66.187.233.73]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.3, cipher=TLS_AES_256_GCM_SHA384) id us-mta-674-PdL2nNq5NhKn7_z-CPsGOA-1; Thu, 26 Oct 2023 17:41:39 -0400 X-MC-Unique: PdL2nNq5NhKn7_z-CPsGOA-1 Received: from smtp.corp.redhat.com (int-mx06.intmail.prod.int.rdu2.redhat.com [10.11.54.6]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (2048 bits) server-digest SHA256) (No client certificate requested) by mimecast-mx02.redhat.com (Postfix) with ESMTPS id E4BDD28237C3; Thu, 26 Oct 2023 21:41:38 +0000 (UTC) Received: from pbitcolo-build-10.permabit.com (pbitcolo-build-10.permabit.lab.eng.bos.redhat.com [10.19.117.76]) by smtp.corp.redhat.com (Postfix) with ESMTP id C6D7B2166B26; Thu, 26 Oct 2023 21:41:38 +0000 (UTC) Received: by pbitcolo-build-10.permabit.com (Postfix, from userid 1138) id 7D0B83003D; Thu, 26 Oct 2023 17:41:38 -0400 (EDT) From: Matthew Sakai To: dm-devel@lists.linux.dev, linux-block@vger.kernel.org Cc: Matthew Sakai Subject: [PATCH v4 10/39] dm vdo: add deduplication index storage interface Date: Thu, 26 Oct 2023 17:41:07 -0400 Message-Id: 
<20231026214136.1067410-11-msakai@redhat.com> In-Reply-To: <20231026214136.1067410-1-msakai@redhat.com> References: <20231026214136.1067410-1-msakai@redhat.com> MIME-Version: 1.0 X-Scanned-By: MIMEDefang 3.4.1 on 10.11.54.6 Precedence: bulk List-ID: X-Mailing-List: linux-block@vger.kernel.org This patch adds infrastructure for managing reads and writes to the underlying storage layer for the deduplication index. The deduplication index uses dm-bufio for all of its reads and writes, so part of this infrastructure is managing the various dm-bufio clients required. It also adds the buffered reader and buffered writer abstractions, which simplify reading and writing metadata structures that span several blocks. This patch also includes structures and utilities for encoding and decoding all of the deduplication index metadata, collectively called the index layout. Co-developed-by: J. corwin Coburn Signed-off-by: J. corwin Coburn Co-developed-by: Michael Sclafani Signed-off-by: Michael Sclafani Co-developed-by: Thomas Jaskiewicz Signed-off-by: Thomas Jaskiewicz Co-developed-by: John Wiele Signed-off-by: John Wiele Signed-off-by: Matthew Sakai --- drivers/md/dm-vdo/index-layout.c | 1775 ++++++++++++++++++++++++++++++ drivers/md/dm-vdo/index-layout.h | 43 + drivers/md/dm-vdo/io-factory.c | 423 +++++++ drivers/md/dm-vdo/io-factory.h | 66 ++ drivers/md/dm-vdo/numeric.h | 78 ++ 5 files changed, 2385 insertions(+) create mode 100644 drivers/md/dm-vdo/index-layout.c create mode 100644 drivers/md/dm-vdo/index-layout.h create mode 100644 drivers/md/dm-vdo/io-factory.c create mode 100644 drivers/md/dm-vdo/io-factory.h create mode 100644 drivers/md/dm-vdo/numeric.h diff --git a/drivers/md/dm-vdo/index-layout.c b/drivers/md/dm-vdo/index-layout.c new file mode 100644 index 000000000000..a3bd56e3ef2d --- /dev/null +++ b/drivers/md/dm-vdo/index-layout.c @@ -0,0 +1,1775 @@ +// SPDX-License-Identifier: GPL-2.0-only +/* + * Copyright 2023 Red Hat + */ + +#include "index-layout.h" + +#include + +#include "config.h" +#include "logger.h" +#include "memory-alloc.h" +#include "murmurhash3.h" +#include "numeric.h" +#include "open-chapter.h" +#include "time-utils.h" +#include "volume-index.h" + +/* + * The UDS layout on storage media is divided into a number of fixed-size regions, the sizes of + * which are computed when the index is created. Every header and region begins on 4K block + * boundary. Save regions are further sub-divided into regions of their own. + * + * Each region has a kind and an instance number. Some kinds only have one instance and therefore + * use RL_SOLE_INSTANCE (-1) as the instance number. The RL_KIND_INDEX used to use instances to + * represent sub-indices; now, however there is only ever one sub-index and therefore one instance. + * The RL_KIND_VOLUME_INDEX uses instances to record which zone is being saved. + * + * Every region header has a type and version. + * + * +-+-+---------+--------+--------+-+ + * | | | I N D E X 0 101, 0 | | + * |H|C+---------+--------+--------+S| + * |D|f| Volume | Save | Save |e| + * |R|g| Region | Region | Region |a| + * | | | 201, -1 | 202, 0 | 202, 1 |l| + * +-+-+--------+---------+--------+-+ + * + * The header contains the encoded region layout table as well as some index configuration data. + * The sub-index region and its subdivisions are maintained in the same table. + * + * There are two save regions to preserve the old state in case saving the new state is incomplete. + * They are used in alternation. Each save region is further divided into sub-regions. 
+ * + * +-+-----+------+------+-----+-----+ + * |H| IPM | MI | MI | | OC | + * |D| | zone | zone | ... | | + * |R| 301 | 302 | 302 | | 303 | + * | | -1 | 0 | 1 | | -1 | + * +-+-----+------+------+-----+-----+ + * + * The header contains the encoded region layout table as well as index state data for that save. + * Each save also has a unique nonce. + */ + +enum { + MAGIC_SIZE = 32, + NONCE_INFO_SIZE = 32, + MAX_SAVES = 2, +}; + +enum region_kind { + RL_KIND_EMPTY = 0, + RL_KIND_HEADER = 1, + RL_KIND_CONFIG = 100, + RL_KIND_INDEX = 101, + RL_KIND_SEAL = 102, + RL_KIND_VOLUME = 201, + RL_KIND_SAVE = 202, + RL_KIND_INDEX_PAGE_MAP = 301, + RL_KIND_VOLUME_INDEX = 302, + RL_KIND_OPEN_CHAPTER = 303, +}; + +/* Some region types are historical and are no longer used. */ +enum region_type { + RH_TYPE_FREE = 0, /* unused */ + RH_TYPE_SUPER = 1, + RH_TYPE_SAVE = 2, + RH_TYPE_CHECKPOINT = 3, /* unused */ + RH_TYPE_UNSAVED = 4, +}; + +enum { + RL_SOLE_INSTANCE = 65535, +}; + +/* + * Super block version 2 is the first released version. + * + * Super block version 3 is the normal version used from RHEL 8.2 onwards. + * + * Super block versions 4 through 6 were incremental development versions and + * are not supported. + * + * Super block version 7 is used for volumes which have been reduced in size by one chapter in + * order to make room to prepend LVM metadata to a volume originally created without lvm. This + * allows the index to retain most its deduplication records. + */ +enum { + SUPER_VERSION_MINIMUM = 3, + SUPER_VERSION_CURRENT = 3, + SUPER_VERSION_MAXIMUM = 7, +}; + +static const u8 LAYOUT_MAGIC[MAGIC_SIZE] = "*ALBIREO*SINGLE*FILE*LAYOUT*001*"; +static const u64 REGION_MAGIC = 0x416c6252676e3031; /* 'AlbRgn01' */ + +struct region_header { + u64 magic; + u64 region_blocks; + u16 type; + /* Currently always version 1 */ + u16 version; + u16 region_count; + u16 payload; +}; + +struct layout_region { + u64 start_block; + u64 block_count; + u32 __unused; + u16 kind; + u16 instance; +}; + +struct region_table { + size_t encoded_size; + struct region_header header; + struct layout_region regions[]; +}; + +struct index_save_data { + u64 timestamp; + u64 nonce; + /* Currently always version 1 */ + u32 version; + u32 unused__; +}; + +struct index_state_version { + s32 signature; + s32 version_id; +}; + +static const struct index_state_version INDEX_STATE_VERSION_301 = { + .signature = -1, + .version_id = 301, +}; + +struct index_state_data301 { + struct index_state_version version; + u64 newest_chapter; + u64 oldest_chapter; + u64 last_save; + u32 unused; + u32 padding; +}; + +struct index_save_layout { + unsigned int zone_count; + struct layout_region index_save; + struct layout_region header; + struct layout_region index_page_map; + struct layout_region free_space; + struct layout_region volume_index_zones[MAX_ZONES]; + struct layout_region open_chapter; + struct index_save_data save_data; + struct index_state_data301 state_data; +}; + +struct sub_index_layout { + u64 nonce; + struct layout_region sub_index; + struct layout_region volume; + struct index_save_layout *saves; +}; + +struct super_block_data { + u8 magic_label[MAGIC_SIZE]; + u8 nonce_info[NONCE_INFO_SIZE]; + u64 nonce; + u32 version; + u32 block_size; + u16 index_count; + u16 max_saves; + /* Padding reflects a blank field on permanent storage */ + u8 padding[4]; + u64 open_chapter_blocks; + u64 page_map_blocks; + u64 volume_offset; + u64 start_offset; +}; + +struct index_layout { + struct io_factory *factory; + size_t factory_size; + off_t 
offset; + struct super_block_data super; + struct layout_region header; + struct layout_region config; + struct sub_index_layout index; + struct layout_region seal; + u64 total_blocks; +}; + +struct save_layout_sizes { + unsigned int save_count; + size_t block_size; + u64 volume_blocks; + u64 volume_index_blocks; + u64 page_map_blocks; + u64 open_chapter_blocks; + u64 save_blocks; + u64 sub_index_blocks; + u64 total_blocks; + size_t total_size; +}; + +static inline bool is_converted_super_block(struct super_block_data *super) +{ + return super->version == 7; +} + +static int __must_check compute_sizes(const struct configuration *config, + struct save_layout_sizes *sls) +{ + int result; + struct geometry *geometry = config->geometry; + + memset(sls, 0, sizeof(*sls)); + sls->save_count = MAX_SAVES; + sls->block_size = UDS_BLOCK_SIZE; + sls->volume_blocks = geometry->bytes_per_volume / sls->block_size; + + result = uds_compute_volume_index_save_blocks(config, + sls->block_size, + &sls->volume_index_blocks); + if (result != UDS_SUCCESS) + return uds_log_error_strerror(result, "cannot compute index save size"); + + sls->page_map_blocks = + DIV_ROUND_UP(uds_compute_index_page_map_save_size(geometry), sls->block_size); + sls->open_chapter_blocks = + DIV_ROUND_UP(uds_compute_saved_open_chapter_size(geometry), sls->block_size); + sls->save_blocks = + 1 + (sls->volume_index_blocks + sls->page_map_blocks + sls->open_chapter_blocks); + sls->sub_index_blocks = sls->volume_blocks + (sls->save_count * sls->save_blocks); + sls->total_blocks = 3 + sls->sub_index_blocks; + sls->total_size = sls->total_blocks * sls->block_size; + + return UDS_SUCCESS; +} + +int uds_compute_index_size(const struct uds_parameters *parameters, u64 *index_size) +{ + int result; + struct configuration *index_config; + struct save_layout_sizes sizes; + + if (index_size == NULL) { + uds_log_error("Missing output size pointer"); + return -EINVAL; + } + + result = uds_make_configuration(parameters, &index_config); + if (result != UDS_SUCCESS) { + uds_log_error_strerror(result, "cannot compute index size"); + return uds_map_to_system_error(result); + } + + result = compute_sizes(index_config, &sizes); + uds_free_configuration(index_config); + if (result != UDS_SUCCESS) + return uds_map_to_system_error(result); + + *index_size = sizes.total_size; + return UDS_SUCCESS; +} + +/* Create unique data using the current time and a pseudorandom number. */ +static void create_unique_nonce_data(u8 *buffer) +{ + ktime_t now = current_time_ns(CLOCK_REALTIME); + u32 rand; + size_t offset = 0; + + get_random_bytes(&rand, sizeof(u32)); + memcpy(buffer + offset, &now, sizeof(now)); + offset += sizeof(now); + memcpy(buffer + offset, &rand, sizeof(rand)); + offset += sizeof(rand); + while (offset < NONCE_INFO_SIZE) { + size_t len = min(NONCE_INFO_SIZE - offset, offset); + + memcpy(buffer + offset, buffer, len); + offset += len; + } +} + +static u64 hash_stuff(u64 start, const void *data, size_t len) +{ + u32 seed = start ^ (start >> 27); + u8 hash_buffer[16]; + + murmurhash3_128(data, len, seed, hash_buffer); + return get_unaligned_le64(hash_buffer + 4); +} + +/* Generate a primary nonce from the provided data. */ +static u64 generate_primary_nonce(const void *data, size_t len) +{ + return hash_stuff(0xa1b1e0fc, data, len); +} + +/* + * Deterministically generate a secondary nonce from an existing nonce and some arbitrary data by + * hashing the original nonce and the data to produce a new nonce. 
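+ * For example, define_sub_index_nonce() below derives the sub-index nonce
+ * from the super block's primary nonce and the sub-index's starting block.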
+ */ +static u64 generate_secondary_nonce(u64 nonce, const void *data, size_t len) +{ + return hash_stuff(nonce + 1, data, len); +} + +static int __must_check open_layout_reader(struct index_layout *layout, + struct layout_region *lr, + off_t offset, + struct buffered_reader **reader_ptr) +{ + return uds_make_buffered_reader(layout->factory, + lr->start_block + offset, + lr->block_count, + reader_ptr); +} + +static int open_region_reader(struct index_layout *layout, + struct layout_region *region, + struct buffered_reader **reader_ptr) +{ + return open_layout_reader(layout, region, -layout->super.start_offset, reader_ptr); +} + +static int __must_check open_layout_writer(struct index_layout *layout, + struct layout_region *lr, + off_t offset, + struct buffered_writer **writer_ptr) +{ + return uds_make_buffered_writer(layout->factory, + lr->start_block + offset, + lr->block_count, + writer_ptr); +} + +static int open_region_writer(struct index_layout *layout, + struct layout_region *region, + struct buffered_writer **writer_ptr) +{ + return open_layout_writer(layout, region, -layout->super.start_offset, writer_ptr); +} + +static void +generate_super_block_data(struct save_layout_sizes *sls, struct super_block_data *super) +{ + memset(super, 0, sizeof(*super)); + memcpy(super->magic_label, LAYOUT_MAGIC, MAGIC_SIZE); + create_unique_nonce_data(super->nonce_info); + + super->nonce = generate_primary_nonce(super->nonce_info, sizeof(super->nonce_info)); + super->version = SUPER_VERSION_CURRENT; + super->block_size = sls->block_size; + super->index_count = 1; + super->max_saves = sls->save_count; + super->open_chapter_blocks = sls->open_chapter_blocks; + super->page_map_blocks = sls->page_map_blocks; + super->volume_offset = 0; + super->start_offset = 0; +} + +static void define_sub_index_nonce(struct index_layout *layout) +{ + struct sub_index_nonce_data { + u64 offset; + u16 index_id; + }; + struct sub_index_layout *sil = &layout->index; + u64 primary_nonce = layout->super.nonce; + u8 buffer[sizeof(struct sub_index_nonce_data)] = { 0 }; + size_t offset = 0; + + encode_u64_le(buffer, &offset, sil->sub_index.start_block); + encode_u16_le(buffer, &offset, 0); + sil->nonce = generate_secondary_nonce(primary_nonce, buffer, sizeof(buffer)); + if (sil->nonce == 0) + sil->nonce = generate_secondary_nonce(~primary_nonce + 1, buffer, sizeof(buffer)); +} + +static void +setup_sub_index(struct index_layout *layout, u64 start_block, struct save_layout_sizes *sls) +{ + struct sub_index_layout *sil = &layout->index; + u64 next_block = start_block; + unsigned int i; + + sil->sub_index = (struct layout_region) { + .start_block = start_block, + .block_count = sls->sub_index_blocks, + .kind = RL_KIND_INDEX, + .instance = 0, + }; + + sil->volume = (struct layout_region) { + .start_block = next_block, + .block_count = sls->volume_blocks, + .kind = RL_KIND_VOLUME, + .instance = RL_SOLE_INSTANCE, + }; + + next_block += sls->volume_blocks; + + for (i = 0; i < sls->save_count; i++) { + sil->saves[i].index_save = (struct layout_region) { + .start_block = next_block, + .block_count = sls->save_blocks, + .kind = RL_KIND_SAVE, + .instance = i, + }; + + next_block += sls->save_blocks; + } + + define_sub_index_nonce(layout); +} + +static void initialize_layout(struct index_layout *layout, struct save_layout_sizes *sls) +{ + u64 next_block = layout->offset / sls->block_size; + + layout->total_blocks = sls->total_blocks; + generate_super_block_data(sls, &layout->super); + layout->header = (struct layout_region) { + 
.start_block = next_block++, + .block_count = 1, + .kind = RL_KIND_HEADER, + .instance = RL_SOLE_INSTANCE, + }; + + layout->config = (struct layout_region) { + .start_block = next_block++, + .block_count = 1, + .kind = RL_KIND_CONFIG, + .instance = RL_SOLE_INSTANCE, + }; + + setup_sub_index(layout, next_block, sls); + next_block += sls->sub_index_blocks; + + layout->seal = (struct layout_region) { + .start_block = next_block, + .block_count = 1, + .kind = RL_KIND_SEAL, + .instance = RL_SOLE_INSTANCE, + }; +} + +static int __must_check +make_index_save_region_table(struct index_save_layout *isl, struct region_table **table_ptr) +{ + int result; + unsigned int z; + struct region_table *table; + struct layout_region *lr; + u16 region_count; + size_t payload; + size_t type; + + if (isl->zone_count > 0) { + /* + * Normal save regions: header, page map, volume index zones, + * open chapter, and possibly free space. + */ + region_count = 3 + isl->zone_count; + if (isl->free_space.block_count > 0) + region_count++; + + payload = sizeof(isl->save_data) + sizeof(isl->state_data); + type = RH_TYPE_SAVE; + } else { + /* Empty save regions: header, page map, free space. */ + region_count = 3; + payload = sizeof(isl->save_data); + type = RH_TYPE_UNSAVED; + } + + result = UDS_ALLOCATE_EXTENDED(struct region_table, + region_count, + struct layout_region, + "layout region table for ISL", + &table); + if (result != UDS_SUCCESS) + return result; + + lr = &table->regions[0]; + *lr++ = isl->header; + *lr++ = isl->index_page_map; + for (z = 0; z < isl->zone_count; z++) + *lr++ = isl->volume_index_zones[z]; + + if (isl->zone_count > 0) + *lr++ = isl->open_chapter; + + if (isl->free_space.block_count > 0) + *lr++ = isl->free_space; + + table->header = (struct region_header) { + .magic = REGION_MAGIC, + .region_blocks = isl->index_save.block_count, + .type = type, + .version = 1, + .region_count = region_count, + .payload = payload, + }; + + table->encoded_size = (sizeof(struct region_header) + payload + + region_count * sizeof(struct layout_region)); + *table_ptr = table; + return UDS_SUCCESS; +} + +static void encode_region_table(u8 *buffer, size_t *offset, struct region_table *table) +{ + unsigned int i; + + encode_u64_le(buffer, offset, REGION_MAGIC); + encode_u64_le(buffer, offset, table->header.region_blocks); + encode_u16_le(buffer, offset, table->header.type); + encode_u16_le(buffer, offset, table->header.version); + encode_u16_le(buffer, offset, table->header.region_count); + encode_u16_le(buffer, offset, table->header.payload); + + for (i = 0; i < table->header.region_count; i++) { + encode_u64_le(buffer, offset, table->regions[i].start_block); + encode_u64_le(buffer, offset, table->regions[i].block_count); + encode_u32_le(buffer, offset, 0); + encode_u16_le(buffer, offset, table->regions[i].kind); + encode_u16_le(buffer, offset, table->regions[i].instance); + } +} + +static int __must_check write_index_save_header(struct index_save_layout *isl, + struct region_table *table, + struct buffered_writer *writer) +{ + int result; + u8 *buffer; + size_t offset = 0; + + result = UDS_ALLOCATE(table->encoded_size, u8, "index save data", &buffer); + if (result != UDS_SUCCESS) + return result; + + encode_region_table(buffer, &offset, table); + encode_u64_le(buffer, &offset, isl->save_data.timestamp); + encode_u64_le(buffer, &offset, isl->save_data.nonce); + encode_u32_le(buffer, &offset, isl->save_data.version); + encode_u32_le(buffer, &offset, 0); + if (isl->zone_count > 0) { + encode_u32_le(buffer, &offset, 
INDEX_STATE_VERSION_301.signature); + encode_u32_le(buffer, &offset, INDEX_STATE_VERSION_301.version_id); + encode_u64_le(buffer, &offset, isl->state_data.newest_chapter); + encode_u64_le(buffer, &offset, isl->state_data.oldest_chapter); + encode_u64_le(buffer, &offset, isl->state_data.last_save); + encode_u64_le(buffer, &offset, 0); + } + + result = uds_write_to_buffered_writer(writer, buffer, offset); + UDS_FREE(buffer); + if (result != UDS_SUCCESS) + return result; + + return uds_flush_buffered_writer(writer); +} + +static int write_index_save_layout(struct index_layout *layout, struct index_save_layout *isl) +{ + int result; + struct region_table *table; + struct buffered_writer *writer; + + result = make_index_save_region_table(isl, &table); + if (result != UDS_SUCCESS) + return result; + + result = open_region_writer(layout, &isl->header, &writer); + if (result != UDS_SUCCESS) { + UDS_FREE(table); + return result; + } + + result = write_index_save_header(isl, table, writer); + UDS_FREE(table); + uds_free_buffered_writer(writer); + + return result; +} + +static void reset_index_save_layout(struct index_save_layout *isl, u64 page_map_blocks) +{ + u64 free_blocks; + u64 next_block = isl->index_save.start_block; + + isl->zone_count = 0; + memset(&isl->save_data, 0, sizeof(isl->save_data)); + + isl->header = (struct layout_region) { + .start_block = next_block++, + .block_count = 1, + .kind = RL_KIND_HEADER, + .instance = RL_SOLE_INSTANCE, + }; + + isl->index_page_map = (struct layout_region) { + .start_block = next_block, + .block_count = page_map_blocks, + .kind = RL_KIND_INDEX_PAGE_MAP, + .instance = RL_SOLE_INSTANCE, + }; + + next_block += page_map_blocks; + + free_blocks = isl->index_save.block_count - page_map_blocks - 1; + isl->free_space = (struct layout_region) { + .start_block = next_block, + .block_count = free_blocks, + .kind = RL_KIND_EMPTY, + .instance = RL_SOLE_INSTANCE, + }; +} + +static int __must_check +invalidate_old_save(struct index_layout *layout, struct index_save_layout *isl) +{ + reset_index_save_layout(isl, layout->super.page_map_blocks); + return write_index_save_layout(layout, isl); +} + +static int discard_index_state_data(struct index_layout *layout) +{ + int result; + int saved_result = UDS_SUCCESS; + unsigned int i; + + for (i = 0; i < layout->super.max_saves; i++) { + result = invalidate_old_save(layout, &layout->index.saves[i]); + if (result != UDS_SUCCESS) + saved_result = result; + } + + if (saved_result != UDS_SUCCESS) + return uds_log_error_strerror(result, + "%s: cannot destroy all index saves", + __func__); + + return UDS_SUCCESS; +} + +static int __must_check +make_layout_region_table(struct index_layout *layout, struct region_table **table_ptr) +{ + int result; + unsigned int i; + /* Regions: header, config, index, volume, saves, seal */ + u16 region_count = 5 + layout->super.max_saves; + u16 payload; + struct region_table *table; + struct layout_region *lr; + + result = UDS_ALLOCATE_EXTENDED(struct region_table, + region_count, + struct layout_region, + "layout region table", + &table); + if (result != UDS_SUCCESS) + return result; + + lr = &table->regions[0]; + *lr++ = layout->header; + *lr++ = layout->config; + *lr++ = layout->index.sub_index; + *lr++ = layout->index.volume; + + for (i = 0; i < layout->super.max_saves; i++) + *lr++ = layout->index.saves[i].index_save; + + *lr++ = layout->seal; + + if (is_converted_super_block(&layout->super)) + payload = sizeof(struct super_block_data); + else + payload = (sizeof(struct super_block_data) - + 
sizeof(layout->super.volume_offset) - + sizeof(layout->super.start_offset)); + + table->header = (struct region_header) { + .magic = REGION_MAGIC, + .region_blocks = layout->total_blocks, + .type = RH_TYPE_SUPER, + .version = 1, + .region_count = region_count, + .payload = payload, + }; + + table->encoded_size = (sizeof(struct region_header) + payload + + region_count * sizeof(struct layout_region)); + *table_ptr = table; + return UDS_SUCCESS; +} + +static int __must_check write_layout_header(struct index_layout *layout, + struct region_table *table, + struct buffered_writer *writer) +{ + int result; + u8 *buffer; + size_t offset = 0; + + result = UDS_ALLOCATE(table->encoded_size, u8, "layout data", &buffer); + if (result != UDS_SUCCESS) + return result; + + encode_region_table(buffer, &offset, table); + memcpy(buffer + offset, &layout->super.magic_label, MAGIC_SIZE); + offset += MAGIC_SIZE; + memcpy(buffer + offset, &layout->super.nonce_info, NONCE_INFO_SIZE); + offset += NONCE_INFO_SIZE; + encode_u64_le(buffer, &offset, layout->super.nonce); + encode_u32_le(buffer, &offset, layout->super.version); + encode_u32_le(buffer, &offset, layout->super.block_size); + encode_u16_le(buffer, &offset, layout->super.index_count); + encode_u16_le(buffer, &offset, layout->super.max_saves); + encode_u32_le(buffer, &offset, 0); + encode_u64_le(buffer, &offset, layout->super.open_chapter_blocks); + encode_u64_le(buffer, &offset, layout->super.page_map_blocks); + + if (is_converted_super_block(&layout->super)) { + encode_u64_le(buffer, &offset, layout->super.volume_offset); + encode_u64_le(buffer, &offset, layout->super.start_offset); + } + + result = uds_write_to_buffered_writer(writer, buffer, offset); + UDS_FREE(buffer); + if (result != UDS_SUCCESS) + return result; + + return uds_flush_buffered_writer(writer); +} + +static int __must_check write_uds_index_config(struct index_layout *layout, + struct configuration *config, + off_t offset) +{ + int result; + struct buffered_writer *writer = NULL; + + result = open_layout_writer(layout, &layout->config, offset, &writer); + if (result != UDS_SUCCESS) + return uds_log_error_strerror(result, "failed to open config region"); + + result = uds_write_config_contents(writer, config, layout->super.version); + if (result != UDS_SUCCESS) { + uds_free_buffered_writer(writer); + return uds_log_error_strerror(result, "failed to write config region"); + } + + result = uds_flush_buffered_writer(writer); + if (result != UDS_SUCCESS) { + uds_free_buffered_writer(writer); + return uds_log_error_strerror(result, "cannot flush config writer"); + } + + uds_free_buffered_writer(writer); + return UDS_SUCCESS; +} + +static int __must_check save_layout(struct index_layout *layout, off_t offset) +{ + int result; + struct buffered_writer *writer = NULL; + struct region_table *table; + + result = make_layout_region_table(layout, &table); + if (result != UDS_SUCCESS) + return result; + + result = open_layout_writer(layout, &layout->header, offset, &writer); + if (result != UDS_SUCCESS) { + UDS_FREE(table); + return result; + } + + result = write_layout_header(layout, table, writer); + UDS_FREE(table); + uds_free_buffered_writer(writer); + + return result; +} + +static int create_index_layout(struct index_layout *layout, struct configuration *config) +{ + int result; + struct save_layout_sizes sizes; + + result = compute_sizes(config, &sizes); + if (result != UDS_SUCCESS) + return result; + + result = UDS_ALLOCATE(sizes.save_count, + struct index_save_layout, + __func__, + 
&layout->index.saves); + if (result != UDS_SUCCESS) + return result; + + initialize_layout(layout, &sizes); + + result = discard_index_state_data(layout); + if (result != UDS_SUCCESS) + return result; + + result = write_uds_index_config(layout, config, 0); + if (result != UDS_SUCCESS) + return result; + + return save_layout(layout, 0); +} + +static u64 generate_index_save_nonce(u64 volume_nonce, struct index_save_layout *isl) +{ + struct save_nonce_data { + struct index_save_data data; + u64 offset; + } nonce_data; + u8 buffer[sizeof(nonce_data)]; + size_t offset = 0; + + encode_u64_le(buffer, &offset, isl->save_data.timestamp); + encode_u64_le(buffer, &offset, 0); + encode_u32_le(buffer, &offset, isl->save_data.version); + encode_u32_le(buffer, &offset, 0U); + encode_u64_le(buffer, &offset, isl->index_save.start_block); + ASSERT_LOG_ONLY(offset == sizeof(nonce_data), + "%zu bytes encoded of %zu expected", + offset, + sizeof(nonce_data)); + return generate_secondary_nonce(volume_nonce, buffer, sizeof(buffer)); +} + +static u64 validate_index_save_layout(struct index_save_layout *isl, u64 volume_nonce) +{ + if ((isl->zone_count == 0) || (isl->save_data.timestamp == 0)) + return 0; + + if (isl->save_data.nonce != generate_index_save_nonce(volume_nonce, isl)) + return 0; + + return isl->save_data.timestamp; +} + +static int +find_latest_uds_index_save_slot(struct index_layout *layout, struct index_save_layout **isl_ptr) +{ + struct index_save_layout *latest = NULL; + struct index_save_layout *isl; + unsigned int i; + u64 save_time = 0; + u64 latest_time = 0; + + for (i = 0; i < layout->super.max_saves; i++) { + isl = &layout->index.saves[i]; + save_time = validate_index_save_layout(isl, layout->index.nonce); + if (save_time > latest_time) { + latest = isl; + latest_time = save_time; + } + } + + if (latest == NULL) { + uds_log_error("No valid index save found"); + return UDS_INDEX_NOT_SAVED_CLEANLY; + } + + *isl_ptr = latest; + return UDS_SUCCESS; +} + +int uds_discard_open_chapter(struct index_layout *layout) +{ + int result; + struct index_save_layout *isl; + struct buffered_writer *writer; + + result = find_latest_uds_index_save_slot(layout, &isl); + if (result != UDS_SUCCESS) + return result; + + result = open_region_writer(layout, &isl->open_chapter, &writer); + if (result != UDS_SUCCESS) + return result; + + result = uds_write_to_buffered_writer(writer, NULL, UDS_BLOCK_SIZE); + if (result != UDS_SUCCESS) { + uds_free_buffered_writer(writer); + return result; + } + + result = uds_flush_buffered_writer(writer); + uds_free_buffered_writer(writer); + return result; +} + +int uds_load_index_state(struct index_layout *layout, struct uds_index *index) +{ + int result; + unsigned int zone; + struct index_save_layout *isl; + struct buffered_reader *readers[MAX_ZONES]; + + result = find_latest_uds_index_save_slot(layout, &isl); + if (result != UDS_SUCCESS) + return result; + + index->newest_virtual_chapter = isl->state_data.newest_chapter; + index->oldest_virtual_chapter = isl->state_data.oldest_chapter; + index->last_save = isl->state_data.last_save; + + result = open_region_reader(layout, &isl->open_chapter, &readers[0]); + if (result != UDS_SUCCESS) + return result; + + result = uds_load_open_chapter(index, readers[0]); + uds_free_buffered_reader(readers[0]); + if (result != UDS_SUCCESS) + return result; + + for (zone = 0; zone < isl->zone_count; zone++) { + result = open_region_reader(layout, + &isl->volume_index_zones[zone], + &readers[zone]); + if (result != UDS_SUCCESS) { + for (; zone > 0; 
zone--) + uds_free_buffered_reader(readers[zone - 1]); + + return result; + } + } + + result = uds_load_volume_index(index->volume_index, readers, isl->zone_count); + for (zone = 0; zone < isl->zone_count; zone++) + uds_free_buffered_reader(readers[zone]); + if (result != UDS_SUCCESS) + return result; + + result = open_region_reader(layout, &isl->index_page_map, &readers[0]); + if (result != UDS_SUCCESS) + return result; + + result = uds_read_index_page_map(index->volume->index_page_map, readers[0]); + uds_free_buffered_reader(readers[0]); + + return result; +} + +static struct index_save_layout *select_oldest_index_save_layout(struct index_layout *layout) +{ + struct index_save_layout *oldest = NULL; + struct index_save_layout *isl; + unsigned int i; + u64 save_time = 0; + u64 oldest_time = 0; + + for (i = 0; i < layout->super.max_saves; i++) { + isl = &layout->index.saves[i]; + save_time = validate_index_save_layout(isl, layout->index.nonce); + if (oldest == NULL || save_time < oldest_time) { + oldest = isl; + oldest_time = save_time; + } + } + + return oldest; +} + +static void instantiate_index_save_layout(struct index_save_layout *isl, + struct super_block_data *super, + u64 volume_nonce, + unsigned int zone_count) +{ + unsigned int z; + u64 next_block; + u64 free_blocks; + u64 volume_index_blocks; + + isl->zone_count = zone_count; + memset(&isl->save_data, 0, sizeof(isl->save_data)); + isl->save_data.timestamp = ktime_to_ms(current_time_ns(CLOCK_REALTIME)); + isl->save_data.version = 1; + isl->save_data.nonce = generate_index_save_nonce(volume_nonce, isl); + + next_block = isl->index_save.start_block; + isl->header = (struct layout_region) { + .start_block = next_block++, + .block_count = 1, + .kind = RL_KIND_HEADER, + .instance = RL_SOLE_INSTANCE, + }; + + isl->index_page_map = (struct layout_region) { + .start_block = next_block, + .block_count = super->page_map_blocks, + .kind = RL_KIND_INDEX_PAGE_MAP, + .instance = RL_SOLE_INSTANCE, + }; + next_block += super->page_map_blocks; + + free_blocks = (isl->index_save.block_count - 1 - + super->page_map_blocks - + super->open_chapter_blocks); + volume_index_blocks = free_blocks / isl->zone_count; + for (z = 0; z < isl->zone_count; z++) { + isl->volume_index_zones[z] = (struct layout_region) { + .start_block = next_block, + .block_count = volume_index_blocks, + .kind = RL_KIND_VOLUME_INDEX, + .instance = z, + }; + + next_block += volume_index_blocks; + free_blocks -= volume_index_blocks; + } + + isl->open_chapter = (struct layout_region) { + .start_block = next_block, + .block_count = super->open_chapter_blocks, + .kind = RL_KIND_OPEN_CHAPTER, + .instance = RL_SOLE_INSTANCE, + }; + + next_block += super->open_chapter_blocks; + + isl->free_space = (struct layout_region) { + .start_block = next_block, + .block_count = free_blocks, + .kind = RL_KIND_EMPTY, + .instance = RL_SOLE_INSTANCE, + }; +} + +static int setup_uds_index_save_slot(struct index_layout *layout, + unsigned int zone_count, + struct index_save_layout **isl_ptr) +{ + int result; + struct index_save_layout *isl; + + isl = select_oldest_index_save_layout(layout); + result = invalidate_old_save(layout, isl); + if (result != UDS_SUCCESS) + return result; + + instantiate_index_save_layout(isl, &layout->super, layout->index.nonce, zone_count); + + *isl_ptr = isl; + return UDS_SUCCESS; +} + +static void cancel_uds_index_save(struct index_save_layout *isl) +{ + memset(&isl->save_data, 0, sizeof(isl->save_data)); + memset(&isl->state_data, 0, sizeof(isl->state_data)); + 
isl->zone_count = 0; +} + +int uds_save_index_state(struct index_layout *layout, struct uds_index *index) +{ + int result; + unsigned int zone; + struct index_save_layout *isl; + struct buffered_writer *writers[MAX_ZONES]; + + result = setup_uds_index_save_slot(layout, index->zone_count, &isl); + if (result != UDS_SUCCESS) + return result; + + isl->state_data = (struct index_state_data301) { + .newest_chapter = index->newest_virtual_chapter, + .oldest_chapter = index->oldest_virtual_chapter, + .last_save = index->last_save, + }; + + result = open_region_writer(layout, &isl->open_chapter, &writers[0]); + if (result != UDS_SUCCESS) { + cancel_uds_index_save(isl); + return result; + } + + result = uds_save_open_chapter(index, writers[0]); + uds_free_buffered_writer(writers[0]); + if (result != UDS_SUCCESS) { + cancel_uds_index_save(isl); + return result; + } + + for (zone = 0; zone < index->zone_count; zone++) { + result = open_region_writer(layout, + &isl->volume_index_zones[zone], + &writers[zone]); + if (result != UDS_SUCCESS) { + for (; zone > 0; zone--) + uds_free_buffered_writer(writers[zone - 1]); + + cancel_uds_index_save(isl); + return result; + } + } + + result = uds_save_volume_index(index->volume_index, writers, index->zone_count); + for (zone = 0; zone < index->zone_count; zone++) + uds_free_buffered_writer(writers[zone]); + if (result != UDS_SUCCESS) { + cancel_uds_index_save(isl); + return result; + } + + result = open_region_writer(layout, &isl->index_page_map, &writers[0]); + if (result != UDS_SUCCESS) { + cancel_uds_index_save(isl); + return result; + } + + result = uds_write_index_page_map(index->volume->index_page_map, writers[0]); + uds_free_buffered_writer(writers[0]); + if (result != UDS_SUCCESS) { + cancel_uds_index_save(isl); + return result; + } + + return write_index_save_layout(layout, isl); +} + +static int __must_check +load_region_table(struct buffered_reader *reader, struct region_table **table_ptr) +{ + int result; + unsigned int i; + struct region_header header; + struct region_table *table; + u8 buffer[sizeof(struct region_header)]; + size_t offset = 0; + + result = uds_read_from_buffered_reader(reader, buffer, sizeof(buffer)); + if (result != UDS_SUCCESS) + return uds_log_error_strerror(result, "cannot read region table header"); + + decode_u64_le(buffer, &offset, &header.magic); + decode_u64_le(buffer, &offset, &header.region_blocks); + decode_u16_le(buffer, &offset, &header.type); + decode_u16_le(buffer, &offset, &header.version); + decode_u16_le(buffer, &offset, &header.region_count); + decode_u16_le(buffer, &offset, &header.payload); + + if (header.magic != REGION_MAGIC) + return UDS_NO_INDEX; + + if (header.version != 1) + return uds_log_error_strerror(UDS_UNSUPPORTED_VERSION, + "unknown region table version %hu", + header.version); + + result = UDS_ALLOCATE_EXTENDED(struct region_table, + header.region_count, + struct layout_region, + "single file layout region table", + &table); + if (result != UDS_SUCCESS) + return result; + + table->header = header; + for (i = 0; i < header.region_count; i++) { + u8 region_buffer[sizeof(struct layout_region)]; + + offset = 0; + result = uds_read_from_buffered_reader(reader, + region_buffer, + sizeof(region_buffer)); + if (result != UDS_SUCCESS) { + UDS_FREE(table); + return uds_log_error_strerror(UDS_CORRUPT_DATA, + "cannot read region table layouts"); + } + + decode_u64_le(region_buffer, &offset, &table->regions[i].start_block); + decode_u64_le(region_buffer, &offset, &table->regions[i].block_count); + offset += 
sizeof(u32); + decode_u16_le(region_buffer, &offset, &table->regions[i].kind); + decode_u16_le(region_buffer, &offset, &table->regions[i].instance); + } + + *table_ptr = table; + return UDS_SUCCESS; +} + +static int __must_check read_super_block_data(struct buffered_reader *reader, + struct index_layout *layout, + size_t saved_size) +{ + int result; + struct super_block_data *super = &layout->super; + u8 *buffer; + size_t offset = 0; + + result = UDS_ALLOCATE(saved_size, u8, "super block data", &buffer); + if (result != UDS_SUCCESS) + return result; + + result = uds_read_from_buffered_reader(reader, buffer, saved_size); + if (result != UDS_SUCCESS) { + UDS_FREE(buffer); + return uds_log_error_strerror(result, "cannot read region table header"); + } + + memcpy(&super->magic_label, buffer, MAGIC_SIZE); + offset += MAGIC_SIZE; + memcpy(&super->nonce_info, buffer + offset, NONCE_INFO_SIZE); + offset += NONCE_INFO_SIZE; + decode_u64_le(buffer, &offset, &super->nonce); + decode_u32_le(buffer, &offset, &super->version); + decode_u32_le(buffer, &offset, &super->block_size); + decode_u16_le(buffer, &offset, &super->index_count); + decode_u16_le(buffer, &offset, &super->max_saves); + offset += sizeof(u32); + decode_u64_le(buffer, &offset, &super->open_chapter_blocks); + decode_u64_le(buffer, &offset, &super->page_map_blocks); + + if (is_converted_super_block(super)) { + decode_u64_le(buffer, &offset, &super->volume_offset); + decode_u64_le(buffer, &offset, &super->start_offset); + } else { + super->volume_offset = 0; + super->start_offset = 0; + } + + UDS_FREE(buffer); + + if (memcmp(super->magic_label, LAYOUT_MAGIC, MAGIC_SIZE) != 0) + return uds_log_error_strerror(UDS_CORRUPT_DATA, "unknown superblock magic label"); + + if ((super->version < SUPER_VERSION_MINIMUM) || + (super->version == 4) || + (super->version == 5) || + (super->version == 6) || + (super->version > SUPER_VERSION_MAXIMUM)) + return uds_log_error_strerror(UDS_UNSUPPORTED_VERSION, + "unknown superblock version number %u", + super->version); + + if (super->volume_offset < super->start_offset) + return uds_log_error_strerror(UDS_CORRUPT_DATA, + "inconsistent offsets (start %llu, volume %llu)", + (unsigned long long) super->start_offset, + (unsigned long long) super->volume_offset); + + /* Sub-indexes are no longer used but the layout retains this field. 
*/ + if (super->index_count != 1) + return uds_log_error_strerror(UDS_CORRUPT_DATA, + "invalid subindex count %u", + super->index_count); + + if (generate_primary_nonce(super->nonce_info, sizeof(super->nonce_info)) != super->nonce) + return uds_log_error_strerror(UDS_CORRUPT_DATA, + "inconsistent superblock nonce"); + + return UDS_SUCCESS; +} + +static int __must_check verify_region(struct layout_region *lr, + u64 start_block, + enum region_kind kind, + unsigned int instance) +{ + if (lr->start_block != start_block) + return uds_log_error_strerror(UDS_CORRUPT_DATA, "incorrect layout region offset"); + + if (lr->kind != kind) + return uds_log_error_strerror(UDS_CORRUPT_DATA, "incorrect layout region kind"); + + if (lr->instance != instance) + return uds_log_error_strerror(UDS_CORRUPT_DATA, + "incorrect layout region instance"); + + return UDS_SUCCESS; +} + +static int __must_check +verify_sub_index(struct index_layout *layout, u64 start_block, struct region_table *table) +{ + int result; + unsigned int i; + struct sub_index_layout *sil = &layout->index; + u64 next_block = start_block; + + sil->sub_index = table->regions[2]; + result = verify_region(&sil->sub_index, next_block, RL_KIND_INDEX, 0); + if (result != UDS_SUCCESS) + return result; + + define_sub_index_nonce(layout); + + sil->volume = table->regions[3]; + result = verify_region(&sil->volume, next_block, RL_KIND_VOLUME, RL_SOLE_INSTANCE); + if (result != UDS_SUCCESS) + return result; + + next_block += sil->volume.block_count + layout->super.volume_offset; + + for (i = 0; i < layout->super.max_saves; i++) { + sil->saves[i].index_save = table->regions[i + 4]; + result = verify_region(&sil->saves[i].index_save, next_block, RL_KIND_SAVE, i); + if (result != UDS_SUCCESS) + return result; + + next_block += sil->saves[i].index_save.block_count; + } + + next_block -= layout->super.volume_offset; + if (next_block != start_block + sil->sub_index.block_count) + return uds_log_error_strerror(UDS_CORRUPT_DATA, + "sub index region does not span all saves"); + + return UDS_SUCCESS; +} + +static int __must_check +reconstitute_layout(struct index_layout *layout, struct region_table *table, u64 first_block) +{ + int result; + u64 next_block = first_block; + + result = UDS_ALLOCATE(layout->super.max_saves, + struct index_save_layout, + __func__, + &layout->index.saves); + if (result != UDS_SUCCESS) + return result; + + layout->total_blocks = table->header.region_blocks; + + layout->header = table->regions[0]; + result = verify_region(&layout->header, next_block++, RL_KIND_HEADER, RL_SOLE_INSTANCE); + if (result != UDS_SUCCESS) + return result; + + layout->config = table->regions[1]; + result = verify_region(&layout->config, next_block++, RL_KIND_CONFIG, RL_SOLE_INSTANCE); + if (result != UDS_SUCCESS) + return result; + + result = verify_sub_index(layout, next_block, table); + if (result != UDS_SUCCESS) + return result; + + next_block += layout->index.sub_index.block_count; + + layout->seal = table->regions[table->header.region_count - 1]; + result = verify_region(&layout->seal, + next_block + layout->super.volume_offset, + RL_KIND_SEAL, + RL_SOLE_INSTANCE); + if (result != UDS_SUCCESS) + return result; + + if (++next_block != (first_block + layout->total_blocks)) + return uds_log_error_strerror(UDS_CORRUPT_DATA, + "layout table does not span total blocks"); + + return UDS_SUCCESS; +} + +static int __must_check load_super_block(struct index_layout *layout, + size_t block_size, + u64 first_block, + struct buffered_reader *reader) +{ + int result; + 
struct region_table *table = NULL; + struct super_block_data *super = &layout->super; + + result = load_region_table(reader, &table); + if (result != UDS_SUCCESS) + return result; + + if (table->header.type != RH_TYPE_SUPER) { + UDS_FREE(table); + return uds_log_error_strerror(UDS_CORRUPT_DATA, "not a superblock region table"); + } + + result = read_super_block_data(reader, layout, table->header.payload); + if (result != UDS_SUCCESS) { + UDS_FREE(table); + return uds_log_error_strerror(result, "unknown superblock format"); + } + + if (super->block_size != block_size) { + UDS_FREE(table); + return uds_log_error_strerror(UDS_CORRUPT_DATA, + "superblock saved block_size %u differs from supplied block_size %zu", + super->block_size, + block_size); + } + + first_block -= (super->volume_offset - super->start_offset); + result = reconstitute_layout(layout, table, first_block); + UDS_FREE(table); + return result; +} + +static int __must_check read_index_save_data(struct buffered_reader *reader, + struct index_save_layout *isl, + size_t saved_size) +{ + int result; + struct index_state_version file_version; + u8 buffer[sizeof(struct index_save_data) + sizeof(struct index_state_data301)]; + size_t offset = 0; + + if (saved_size != sizeof(buffer)) + return uds_log_error_strerror(UDS_CORRUPT_DATA, + "unexpected index save data size %zu", + saved_size); + + result = uds_read_from_buffered_reader(reader, buffer, sizeof(buffer)); + if (result != UDS_SUCCESS) + return uds_log_error_strerror(result, "cannot read index save data"); + + decode_u64_le(buffer, &offset, &isl->save_data.timestamp); + decode_u64_le(buffer, &offset, &isl->save_data.nonce); + decode_u32_le(buffer, &offset, &isl->save_data.version); + offset += sizeof(u32); + + if (isl->save_data.version > 1) + return uds_log_error_strerror(UDS_UNSUPPORTED_VERSION, + "unknown index save version number %u", + isl->save_data.version); + + decode_s32_le(buffer, &offset, &file_version.signature); + decode_s32_le(buffer, &offset, &file_version.version_id); + + if ((file_version.signature != INDEX_STATE_VERSION_301.signature) || + (file_version.version_id != INDEX_STATE_VERSION_301.version_id)) + return uds_log_error_strerror(UDS_UNSUPPORTED_VERSION, + "index state version %d,%d is unsupported", + file_version.signature, + file_version.version_id); + + decode_u64_le(buffer, &offset, &isl->state_data.newest_chapter); + decode_u64_le(buffer, &offset, &isl->state_data.oldest_chapter); + decode_u64_le(buffer, &offset, &isl->state_data.last_save); + /* Skip past some historical fields that are now unused */ + offset += sizeof(u32) + sizeof(u32); + return UDS_SUCCESS; +} + +static int __must_check +reconstruct_index_save(struct index_save_layout *isl, struct region_table *table) +{ + int result; + unsigned int z; + struct layout_region *last_region; + u64 next_block = isl->index_save.start_block; + u64 last_block = next_block + isl->index_save.block_count; + + isl->zone_count = table->header.region_count - 3; + + last_region = &table->regions[table->header.region_count - 1]; + if (last_region->kind == RL_KIND_EMPTY) { + isl->free_space = *last_region; + isl->zone_count--; + } else { + isl->free_space = (struct layout_region) { + .start_block = last_block, + .block_count = 0, + .kind = RL_KIND_EMPTY, + .instance = RL_SOLE_INSTANCE, + }; + } + + isl->header = table->regions[0]; + result = verify_region(&isl->header, next_block++, RL_KIND_HEADER, RL_SOLE_INSTANCE); + if (result != UDS_SUCCESS) + return result; + + isl->index_page_map = table->regions[1]; + result 
= verify_region(&isl->index_page_map, + next_block, + RL_KIND_INDEX_PAGE_MAP, + RL_SOLE_INSTANCE); + if (result != UDS_SUCCESS) + return result; + + next_block += isl->index_page_map.block_count; + + for (z = 0; z < isl->zone_count; z++) { + isl->volume_index_zones[z] = table->regions[z + 2]; + result = verify_region(&isl->volume_index_zones[z], + next_block, + RL_KIND_VOLUME_INDEX, + z); + if (result != UDS_SUCCESS) + return result; + + next_block += isl->volume_index_zones[z].block_count; + } + + isl->open_chapter = table->regions[isl->zone_count + 2]; + result = verify_region(&isl->open_chapter, + next_block, + RL_KIND_OPEN_CHAPTER, + RL_SOLE_INSTANCE); + if (result != UDS_SUCCESS) + return result; + + next_block += isl->open_chapter.block_count; + + result = verify_region(&isl->free_space, next_block, RL_KIND_EMPTY, RL_SOLE_INSTANCE); + if (result != UDS_SUCCESS) + return result; + + next_block += isl->free_space.block_count; + if (next_block != last_block) + return uds_log_error_strerror(UDS_CORRUPT_DATA, + "index save layout table incomplete"); + + return UDS_SUCCESS; +} + +static int __must_check load_index_save(struct index_save_layout *isl, + struct buffered_reader *reader, + unsigned int instance) +{ + int result; + struct region_table *table = NULL; + + result = load_region_table(reader, &table); + if (result != UDS_SUCCESS) + return uds_log_error_strerror(result, + "cannot read index save %u header", + instance); + + if (table->header.region_blocks != isl->index_save.block_count) { + u64 region_blocks = table->header.region_blocks; + + UDS_FREE(table); + return uds_log_error_strerror(UDS_CORRUPT_DATA, + "unexpected index save %u region block count %llu", + instance, + (unsigned long long) region_blocks); + } + + if (table->header.type == RH_TYPE_UNSAVED) { + UDS_FREE(table); + reset_index_save_layout(isl, 0); + return UDS_SUCCESS; + } + + + if (table->header.type != RH_TYPE_SAVE) { + UDS_FREE(table); + return uds_log_error_strerror(UDS_CORRUPT_DATA, + "unexpected index save %u header type %u", + instance, + table->header.type); + } + + result = read_index_save_data(reader, isl, table->header.payload); + if (result != UDS_SUCCESS) { + UDS_FREE(table); + return uds_log_error_strerror(result, + "unknown index save %u data format", + instance); + } + + result = reconstruct_index_save(isl, table); + UDS_FREE(table); + if (result != UDS_SUCCESS) + return uds_log_error_strerror(result, + "cannot reconstruct index save %u", + instance); + + return UDS_SUCCESS; +} + +static int __must_check load_sub_index_regions(struct index_layout *layout) +{ + int result; + unsigned int j; + struct index_save_layout *isl; + struct buffered_reader *reader; + + for (j = 0; j < layout->super.max_saves; ++j) { + isl = &layout->index.saves[j]; + result = open_region_reader(layout, &isl->index_save, &reader); + + if (result != UDS_SUCCESS) { + uds_log_error_strerror(result, "cannot get reader for index 0 save %u", j); + return result; + } + + result = load_index_save(isl, reader, j); + uds_free_buffered_reader(reader); + if (result != UDS_SUCCESS) { + /* Another save slot might be valid. 
*/ + reset_index_save_layout(isl, 0); + continue; + } + } + + return UDS_SUCCESS; +} + +static int __must_check +verify_uds_index_config(struct index_layout *layout, struct configuration *config) +{ + int result; + struct buffered_reader *reader = NULL; + u64 offset; + + offset = layout->super.volume_offset - layout->super.start_offset; + result = open_layout_reader(layout, &layout->config, offset, &reader); + if (result != UDS_SUCCESS) + return uds_log_error_strerror(result, "failed to open config reader"); + + result = uds_validate_config_contents(reader, config); + if (result != UDS_SUCCESS) { + uds_free_buffered_reader(reader); + return uds_log_error_strerror(result, "failed to read config region"); + } + + uds_free_buffered_reader(reader); + return UDS_SUCCESS; +} + +static int load_index_layout(struct index_layout *layout, struct configuration *config) +{ + int result; + struct buffered_reader *reader; + + result = uds_make_buffered_reader(layout->factory, + layout->offset / UDS_BLOCK_SIZE, + 1, + &reader); + if (result != UDS_SUCCESS) + return uds_log_error_strerror(result, "unable to read superblock"); + + result = load_super_block(layout, UDS_BLOCK_SIZE, layout->offset / UDS_BLOCK_SIZE, reader); + uds_free_buffered_reader(reader); + if (result != UDS_SUCCESS) + return result; + + result = verify_uds_index_config(layout, config); + if (result != UDS_SUCCESS) + return result; + + return load_sub_index_regions(layout); +} + +static int create_layout_factory(struct index_layout *layout, const struct configuration *config) +{ + int result; + size_t writable_size; + struct io_factory *factory = NULL; + + result = uds_make_io_factory(config->bdev, &factory); + if (result != UDS_SUCCESS) + return result; + + writable_size = uds_get_writable_size(factory) & -UDS_BLOCK_SIZE; + if (writable_size < config->size + config->offset) { + uds_put_io_factory(factory); + uds_log_error("index storage (%zu) is smaller than the requested size %zu", + writable_size, + config->size + config->offset); + return -ENOSPC; + } + + layout->factory = factory; + layout->factory_size = (config->size > 0) ? 
config->size : writable_size; + layout->offset = config->offset; + return UDS_SUCCESS; +} + +int uds_make_index_layout(struct configuration *config, + bool new_layout, + struct index_layout **layout_ptr) +{ + int result; + struct index_layout *layout = NULL; + struct save_layout_sizes sizes; + + result = compute_sizes(config, &sizes); + if (result != UDS_SUCCESS) + return result; + + result = UDS_ALLOCATE(1, struct index_layout, __func__, &layout); + if (result != UDS_SUCCESS) + return result; + + result = create_layout_factory(layout, config); + if (result != UDS_SUCCESS) { + uds_free_index_layout(layout); + return result; + } + + if (layout->factory_size < sizes.total_size) { + uds_log_error("index storage (%zu) is smaller than the required size %llu", + layout->factory_size, + (unsigned long long) sizes.total_size); + uds_free_index_layout(layout); + return -ENOSPC; + } + + if (new_layout) + result = create_index_layout(layout, config); + else + result = load_index_layout(layout, config); + if (result != UDS_SUCCESS) { + uds_free_index_layout(layout); + return result; + } + + *layout_ptr = layout; + return UDS_SUCCESS; +} + +void uds_free_index_layout(struct index_layout *layout) +{ + if (layout == NULL) + return; + + UDS_FREE(layout->index.saves); + if (layout->factory != NULL) + uds_put_io_factory(layout->factory); + + UDS_FREE(layout); +} + +int uds_replace_index_layout_storage(struct index_layout *layout, struct block_device *bdev) +{ + return uds_replace_storage(layout->factory, bdev); +} + +/* Obtain a dm_bufio_client for the volume region. */ +int uds_open_volume_bufio(struct index_layout *layout, + size_t block_size, + unsigned int reserved_buffers, + struct dm_bufio_client **client_ptr) +{ + off_t offset = (layout->index.volume.start_block + + layout->super.volume_offset - + layout->super.start_offset); + + return uds_make_bufio(layout->factory, offset, block_size, reserved_buffers, client_ptr); +} + +u64 uds_get_volume_nonce(struct index_layout *layout) +{ + return layout->index.nonce; +} diff --git a/drivers/md/dm-vdo/index-layout.h b/drivers/md/dm-vdo/index-layout.h new file mode 100644 index 000000000000..2a32c0bcb00b --- /dev/null +++ b/drivers/md/dm-vdo/index-layout.h @@ -0,0 +1,43 @@ +/* SPDX-License-Identifier: GPL-2.0-only */ +/* + * Copyright 2023 Red Hat + */ + +#ifndef UDS_INDEX_LAYOUT_H +#define UDS_INDEX_LAYOUT_H + +#include "config.h" +#include "io-factory.h" +#include "uds.h" + +/* + * The index layout describes the format of the index on the underlying storage, and is responsible + * for creating those structures when the index is first created. It also validates the index data + * when loading a saved index, and updates it when saving the index. 
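 *
 * A rough usage sketch (illustrative only; error handling is trimmed, and the
 * config, index, and new_layout values are assumed to come from the caller)
 * of how the functions declared below fit together:
 *
 *	struct index_layout *layout;
 *	int result;
 *
 *	result = uds_make_index_layout(config, new_layout, &layout);
 *	if (result != UDS_SUCCESS)
 *		return result;
 *	result = uds_load_index_state(layout, index);
 *	... run the index, calling uds_save_index_state() to save it ...
 *	uds_free_index_layout(layout);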
+ */ + +struct index_layout; + +int __must_check uds_make_index_layout(struct configuration *config, + bool new_layout, + struct index_layout **layout_ptr); + +void uds_free_index_layout(struct index_layout *layout); + +int __must_check uds_replace_index_layout_storage(struct index_layout *layout, + struct block_device *bdev); + +int __must_check uds_load_index_state(struct index_layout *layout, struct uds_index *index); + +int __must_check uds_save_index_state(struct index_layout *layout, struct uds_index *index); + +int __must_check uds_discard_open_chapter(struct index_layout *layout); + +u64 __must_check uds_get_volume_nonce(struct index_layout *layout); + +int __must_check uds_open_volume_bufio(struct index_layout *layout, + size_t block_size, + unsigned int reserved_buffers, + struct dm_bufio_client **client_ptr); + +#endif /* UDS_INDEX_LAYOUT_H */ diff --git a/drivers/md/dm-vdo/io-factory.c b/drivers/md/dm-vdo/io-factory.c new file mode 100644 index 000000000000..9e99d0043fd8 --- /dev/null +++ b/drivers/md/dm-vdo/io-factory.c @@ -0,0 +1,423 @@ +// SPDX-License-Identifier: GPL-2.0-only +/* + * Copyright 2023 Red Hat + */ + +#include "io-factory.h" + +#include +#include +#include +#include + +#include "logger.h" +#include "memory-alloc.h" +#include "numeric.h" + +/* + * The I/O factory object manages access to index storage, which is a contiguous range of blocks on + * a block device. + * + * The factory holds the open device and is responsible for closing it. The factory has methods to + * make helper structures that can be used to access sections of the index. + */ +struct io_factory { + struct block_device *bdev; + atomic_t ref_count; +}; + +/* The buffered reader allows efficient I/O by reading page-sized segments into a buffer. */ +struct buffered_reader { + struct io_factory *factory; + struct dm_bufio_client *client; + struct dm_buffer *buffer; + sector_t limit; + sector_t block_number; + u8 *start; + u8 *end; +}; + +enum { MAX_READ_AHEAD_BLOCKS = 4 }; + +/* + * The buffered writer allows efficient I/O by buffering writes and committing page-sized segments + * to storage. + */ +struct buffered_writer { + struct io_factory *factory; + struct dm_bufio_client *client; + struct dm_buffer *buffer; + sector_t limit; + sector_t block_number; + u8 *start; + u8 *end; + int error; +}; + +static void uds_get_io_factory(struct io_factory *factory) +{ + atomic_inc(&factory->ref_count); +} + +int uds_make_io_factory(struct block_device *bdev, struct io_factory **factory_ptr) +{ + int result; + struct io_factory *factory; + + result = UDS_ALLOCATE(1, struct io_factory, __func__, &factory); + if (result != UDS_SUCCESS) + return result; + + factory->bdev = bdev; + atomic_set_release(&factory->ref_count, 1); + + *factory_ptr = factory; + return UDS_SUCCESS; +} + +int uds_replace_storage(struct io_factory *factory, struct block_device *bdev) +{ + factory->bdev = bdev; + return UDS_SUCCESS; +} + +/* Free an I/O factory once all references have been released. */ +void uds_put_io_factory(struct io_factory *factory) +{ + if (atomic_add_return(-1, &factory->ref_count) <= 0) + UDS_FREE(factory); +} + +size_t uds_get_writable_size(struct io_factory *factory) +{ + return i_size_read(factory->bdev->bd_inode); +} + +/* Create a struct dm_bufio_client for an index region starting at offset. 
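 *
 * An illustrative lifecycle sketch (the block device, offset, and buffer
 * count are assumed to be supplied by the caller); uds_open_volume_bufio()
 * in index-layout.c uses this call for the volume region:
 *
 *	struct io_factory *factory;
 *	struct dm_bufio_client *client;
 *
 *	uds_make_io_factory(bdev, &factory);
 *	uds_make_bufio(factory, block_offset, UDS_BLOCK_SIZE, 1, &client);
 *	... dm-bufio reads and writes ...
 *	dm_bufio_client_destroy(client);
 *	uds_put_io_factory(factory);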
*/ +int uds_make_bufio(struct io_factory *factory, + off_t block_offset, + size_t block_size, + unsigned int reserved_buffers, + struct dm_bufio_client **client_ptr) +{ + struct dm_bufio_client *client; + + client = dm_bufio_client_create(factory->bdev, + block_size, + reserved_buffers, + 0, + NULL, + NULL, + 0); + if (IS_ERR(client)) + return -PTR_ERR(client); + + dm_bufio_set_sector_offset(client, block_offset * SECTORS_PER_BLOCK); + *client_ptr = client; + return UDS_SUCCESS; +} + +static void read_ahead(struct buffered_reader *reader, sector_t block_number) +{ + if (block_number < reader->limit) { + sector_t read_ahead = + min((sector_t) MAX_READ_AHEAD_BLOCKS, reader->limit - block_number); + + dm_bufio_prefetch(reader->client, block_number, read_ahead); + } +} + +void uds_free_buffered_reader(struct buffered_reader *reader) +{ + if (reader == NULL) + return; + + if (reader->buffer != NULL) + dm_bufio_release(reader->buffer); + + dm_bufio_client_destroy(reader->client); + uds_put_io_factory(reader->factory); + UDS_FREE(reader); +} + +/* Create a buffered reader for an index region starting at offset. */ +int uds_make_buffered_reader(struct io_factory *factory, + off_t offset, + u64 block_count, + struct buffered_reader **reader_ptr) +{ + int result; + struct dm_bufio_client *client = NULL; + struct buffered_reader *reader = NULL; + + result = uds_make_bufio(factory, offset, UDS_BLOCK_SIZE, 1, &client); + if (result != UDS_SUCCESS) + return result; + + result = UDS_ALLOCATE(1, struct buffered_reader, "buffered reader", &reader); + if (result != UDS_SUCCESS) { + dm_bufio_client_destroy(client); + return result; + } + + *reader = (struct buffered_reader) { + .factory = factory, + .client = client, + .buffer = NULL, + .limit = block_count, + .block_number = 0, + .start = NULL, + .end = NULL, + }; + + read_ahead(reader, 0); + uds_get_io_factory(factory); + *reader_ptr = reader; + return UDS_SUCCESS; +} + +static int position_reader(struct buffered_reader *reader, sector_t block_number, off_t offset) +{ + struct dm_buffer *buffer = NULL; + void *data; + + if ((reader->end == NULL) || (block_number != reader->block_number)) { + if (block_number >= reader->limit) + return UDS_OUT_OF_RANGE; + + if (reader->buffer != NULL) + dm_bufio_release(UDS_FORGET(reader->buffer)); + + data = dm_bufio_read(reader->client, block_number, &buffer); + if (IS_ERR(data)) + return -PTR_ERR(data); + + reader->buffer = buffer; + reader->start = data; + if (block_number == reader->block_number + 1) + read_ahead(reader, block_number + 1); + } + + reader->block_number = block_number; + reader->end = reader->start + offset; + return UDS_SUCCESS; +} + +static size_t bytes_remaining_in_read_buffer(struct buffered_reader *reader) +{ + return (reader->end == NULL) ? 
0 : reader->start + UDS_BLOCK_SIZE - reader->end; +} + +static int reset_reader(struct buffered_reader *reader) +{ + sector_t block_number; + + if (bytes_remaining_in_read_buffer(reader) > 0) + return UDS_SUCCESS; + + block_number = reader->block_number; + if (reader->end != NULL) + ++block_number; + + return position_reader(reader, block_number, 0); +} + +int uds_read_from_buffered_reader(struct buffered_reader *reader, u8 *data, size_t length) +{ + int result = UDS_SUCCESS; + size_t chunk_size; + + while (length > 0) { + result = reset_reader(reader); + if (result != UDS_SUCCESS) + return result; + + chunk_size = min(length, bytes_remaining_in_read_buffer(reader)); + memcpy(data, reader->end, chunk_size); + length -= chunk_size; + data += chunk_size; + reader->end += chunk_size; + } + + return UDS_SUCCESS; +} + +/* + * Verify that the next data on the reader matches the required value. If the value matches, the + * matching contents are consumed. If the value does not match, the reader state is unchanged. + */ +int uds_verify_buffered_data(struct buffered_reader *reader, const u8 *value, size_t length) +{ + int result = UDS_SUCCESS; + size_t chunk_size; + sector_t start_block_number = reader->block_number; + int start_offset = reader->end - reader->start; + + while (length > 0) { + result = reset_reader(reader); + if (result != UDS_SUCCESS) { + result = UDS_CORRUPT_DATA; + break; + } + + chunk_size = min(length, bytes_remaining_in_read_buffer(reader)); + if (memcmp(value, reader->end, chunk_size) != 0) { + result = UDS_CORRUPT_DATA; + break; + } + + length -= chunk_size; + value += chunk_size; + reader->end += chunk_size; + } + + if (result != UDS_SUCCESS) + position_reader(reader, start_block_number, start_offset); + + return result; +} + +/* Create a buffered writer for an index region starting at offset. 
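 *
 * An illustrative round trip over one region, pairing the writer below with
 * the reader above (error handling omitted; factory, offset, count, data,
 * and length are assumed to be the caller's):
 *
 *	uds_make_buffered_writer(factory, offset, count, &writer);
 *	uds_write_to_buffered_writer(writer, data, length);
 *	uds_flush_buffered_writer(writer);
 *	uds_free_buffered_writer(writer);
 *
 *	uds_make_buffered_reader(factory, offset, count, &reader);
 *	uds_verify_buffered_data(reader, data, length);
 *	uds_free_buffered_reader(reader);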
*/ +int uds_make_buffered_writer(struct io_factory *factory, + off_t offset, + u64 block_count, + struct buffered_writer **writer_ptr) +{ + int result; + struct dm_bufio_client *client = NULL; + struct buffered_writer *writer; + + result = uds_make_bufio(factory, offset, UDS_BLOCK_SIZE, 1, &client); + if (result != UDS_SUCCESS) + return result; + + result = UDS_ALLOCATE(1, struct buffered_writer, "buffered writer", &writer); + if (result != UDS_SUCCESS) { + dm_bufio_client_destroy(client); + return result; + } + + *writer = (struct buffered_writer) { + .factory = factory, + .client = client, + .buffer = NULL, + .limit = block_count, + .start = NULL, + .end = NULL, + .block_number = 0, + .error = UDS_SUCCESS, + }; + + uds_get_io_factory(factory); + *writer_ptr = writer; + return UDS_SUCCESS; +} + +static size_t get_remaining_write_space(struct buffered_writer *writer) +{ + return writer->start + UDS_BLOCK_SIZE - writer->end; +} + +static int __must_check prepare_next_buffer(struct buffered_writer *writer) +{ + struct dm_buffer *buffer = NULL; + void *data; + + if (writer->block_number >= writer->limit) { + writer->error = UDS_OUT_OF_RANGE; + return UDS_OUT_OF_RANGE; + } + + data = dm_bufio_new(writer->client, writer->block_number, &buffer); + if (IS_ERR(data)) { + writer->error = -PTR_ERR(data); + return writer->error; + } + + writer->buffer = buffer; + writer->start = data; + writer->end = data; + return UDS_SUCCESS; +} + +static int flush_previous_buffer(struct buffered_writer *writer) +{ + size_t available; + + if (writer->buffer == NULL) + return writer->error; + + if (writer->error == UDS_SUCCESS) { + available = get_remaining_write_space(writer); + + if (available > 0) + memset(writer->end, 0, available); + + dm_bufio_mark_buffer_dirty(writer->buffer); + } + + dm_bufio_release(writer->buffer); + writer->buffer = NULL; + writer->start = NULL; + writer->end = NULL; + writer->block_number++; + return writer->error; +} + +void uds_free_buffered_writer(struct buffered_writer *writer) +{ + int result; + + if (writer == NULL) + return; + + flush_previous_buffer(writer); + result = -dm_bufio_write_dirty_buffers(writer->client); + if (result != UDS_SUCCESS) + uds_log_warning_strerror(result, "%s: failed to sync storage", __func__); + + dm_bufio_client_destroy(writer->client); + uds_put_io_factory(writer->factory); + UDS_FREE(writer); +} + +/* + * Append data to the buffer, writing as needed. If no data is provided, zeros are written instead. + * If a write error occurs, it is recorded and returned on every subsequent write attempt. 
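 *
 * For example, uds_discard_open_chapter() in index-layout.c zeroes a region
 * simply by passing a NULL data pointer:
 *
 *	uds_write_to_buffered_writer(writer, NULL, UDS_BLOCK_SIZE);
 *
 * which appends a full block of zeros (or returns any previously recorded
 * write error).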
+ */ +int uds_write_to_buffered_writer(struct buffered_writer *writer, const u8 *data, size_t length) +{ + int result = writer->error; + size_t chunk_size; + + while ((length > 0) && (result == UDS_SUCCESS)) { + if (writer->buffer == NULL) { + result = prepare_next_buffer(writer); + continue; + } + + chunk_size = min(length, get_remaining_write_space(writer)); + if (data == NULL) { + memset(writer->end, 0, chunk_size); + } else { + memcpy(writer->end, data, chunk_size); + data += chunk_size; + } + + length -= chunk_size; + writer->end += chunk_size; + + if (get_remaining_write_space(writer) == 0) + result = uds_flush_buffered_writer(writer); + } + + return result; +} + +int uds_flush_buffered_writer(struct buffered_writer *writer) +{ + if (writer->error != UDS_SUCCESS) + return writer->error; + + return flush_previous_buffer(writer); +} diff --git a/drivers/md/dm-vdo/io-factory.h b/drivers/md/dm-vdo/io-factory.h new file mode 100644 index 000000000000..ad6482978159 --- /dev/null +++ b/drivers/md/dm-vdo/io-factory.h @@ -0,0 +1,66 @@ +/* SPDX-License-Identifier: GPL-2.0-only */ +/* + * Copyright 2023 Red Hat + */ + +#ifndef UDS_IO_FACTORY_H +#define UDS_IO_FACTORY_H + +#include + +/* + * The I/O factory manages all low-level I/O operations to the underlying storage device. Its main + * clients are the index layout and the volume. The buffered reader and buffered writer interfaces + * are helpers for accessing data in a contiguous range of storage blocks. + */ + +struct buffered_reader; +struct buffered_writer; + +struct io_factory; + +enum { + UDS_BLOCK_SIZE = 4096, + SECTORS_PER_BLOCK = UDS_BLOCK_SIZE >> SECTOR_SHIFT, +}; + +int __must_check uds_make_io_factory(struct block_device *bdev, struct io_factory **factory_ptr); + +int __must_check uds_replace_storage(struct io_factory *factory, struct block_device *bdev); + +void uds_put_io_factory(struct io_factory *factory); + +size_t __must_check uds_get_writable_size(struct io_factory *factory); + +int __must_check uds_make_bufio(struct io_factory *factory, + off_t block_offset, + size_t block_size, + unsigned int reserved_buffers, + struct dm_bufio_client **client_ptr); + +int __must_check uds_make_buffered_reader(struct io_factory *factory, + off_t offset, + u64 block_count, + struct buffered_reader **reader_ptr); + +void uds_free_buffered_reader(struct buffered_reader *reader); + +int __must_check +uds_read_from_buffered_reader(struct buffered_reader *reader, u8 *data, size_t length); + +int __must_check +uds_verify_buffered_data(struct buffered_reader *reader, const u8 *value, size_t length); + +int __must_check uds_make_buffered_writer(struct io_factory *factory, + off_t offset, + u64 block_count, + struct buffered_writer **writer_ptr); + +void uds_free_buffered_writer(struct buffered_writer *buffer); + +int __must_check +uds_write_to_buffered_writer(struct buffered_writer *writer, const u8 *data, size_t length); + +int __must_check uds_flush_buffered_writer(struct buffered_writer *writer); + +#endif /* UDS_IO_FACTORY_H */ diff --git a/drivers/md/dm-vdo/numeric.h b/drivers/md/dm-vdo/numeric.h new file mode 100644 index 000000000000..dc8c400b21d2 --- /dev/null +++ b/drivers/md/dm-vdo/numeric.h @@ -0,0 +1,78 @@ +/* SPDX-License-Identifier: GPL-2.0-only */ +/* + * Copyright 2023 Red Hat + */ + +#ifndef UDS_NUMERIC_H +#define UDS_NUMERIC_H + +#include +#include +#include + +/* + * These utilities encode or decode a number from an offset in a larger data buffer and then + * advance the offset pointer to the next field in the buffer. 
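 *
 * A small illustrative sketch (the field names here are arbitrary): encoding
 * and decoding share the same running-offset convention, so the same sequence
 * of calls describes both sides of a format:
 *
 *	u8 buffer[sizeof(u64) + sizeof(u32)];
 *	size_t offset = 0;
 *	u64 nonce;
 *	u32 version;
 *
 *	encode_u64_le(buffer, &offset, 0x123456789abcdef0ULL);
 *	encode_u32_le(buffer, &offset, 7);
 *
 *	offset = 0;
 *	decode_u64_le(buffer, &offset, &nonce);
 *	decode_u32_le(buffer, &offset, &version);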
+ */ + +static inline void decode_s64_le(const u8 *buffer, size_t *offset, s64 *decoded) +{ + *decoded = get_unaligned_le64(buffer + *offset); + *offset += sizeof(s64); +} + +static inline void encode_s64_le(u8 *data, size_t *offset, s64 to_encode) +{ + put_unaligned_le64(to_encode, data + *offset); + *offset += sizeof(s64); +} + +static inline void decode_u64_le(const u8 *buffer, size_t *offset, u64 *decoded) +{ + *decoded = get_unaligned_le64(buffer + *offset); + *offset += sizeof(u64); +} + +static inline void encode_u64_le(u8 *data, size_t *offset, u64 to_encode) +{ + put_unaligned_le64(to_encode, data + *offset); + *offset += sizeof(u64); +} + +static inline void decode_s32_le(const u8 *buffer, size_t *offset, s32 *decoded) +{ + *decoded = get_unaligned_le32(buffer + *offset); + *offset += sizeof(s32); +} + +static inline void encode_s32_le(u8 *data, size_t *offset, s32 to_encode) +{ + put_unaligned_le32(to_encode, data + *offset); + *offset += sizeof(s32); +} + +static inline void decode_u32_le(const u8 *buffer, size_t *offset, u32 *decoded) +{ + *decoded = get_unaligned_le32(buffer + *offset); + *offset += sizeof(u32); +} + +static inline void encode_u32_le(u8 *data, size_t *offset, u32 to_encode) +{ + put_unaligned_le32(to_encode, data + *offset); + *offset += sizeof(u32); +} + +static inline void decode_u16_le(const u8 *buffer, size_t *offset, u16 *decoded) +{ + *decoded = get_unaligned_le16(buffer + *offset); + *offset += sizeof(u16); +} + +static inline void encode_u16_le(u8 *data, size_t *offset, u16 to_encode) +{ + put_unaligned_le16(to_encode, data + *offset); + *offset += sizeof(u16); +} + +#endif /* UDS_NUMERIC_H */ From patchwork Thu Oct 26 21:41:08 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Matthew Sakai X-Patchwork-Id: 13437939 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 55186C25B70 for ; Thu, 26 Oct 2023 21:43:24 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S233234AbjJZVnX (ORCPT ); Thu, 26 Oct 2023 17:43:23 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:41042 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S232235AbjJZVnW (ORCPT ); Thu, 26 Oct 2023 17:43:22 -0400 Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.129.124]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 8EA3FD43 for ; Thu, 26 Oct 2023 14:41:43 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1698356502; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=tOGv6osDhIgSNg7zbZ8buxOix8n1KX4PVNYrFT3SimQ=; b=aBoiCHRhGnuORQqlkYK46OvvtBydSiXIRVljPSWwLsc0LV1PMIBAPpfLzPleDhSdXJE9+j p1FVrVjox5YLYXiUnhVhdPtDw7uGKSmiMap2z3bAS5Y0F0icE6w0guBO7PreS1aDb1eTlJ Qkwr/ANYUr/JyrOSAGJh26kMe3w2upQ= Received: from mimecast-mx02.redhat.com (mx-ext.redhat.com [66.187.233.73]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.3, cipher=TLS_AES_256_GCM_SHA384) id us-mta-381-sjC-VDiLPkSAlyVaFcC1kg-1; Thu, 26 Oct 2023 17:41:39 -0400 X-MC-Unique: sjC-VDiLPkSAlyVaFcC1kg-1 Received: from smtp.corp.redhat.com 
(int-mx05.intmail.prod.int.rdu2.redhat.com [10.11.54.5]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (2048 bits) server-digest SHA256) (No client certificate requested) by mimecast-mx02.redhat.com (Postfix) with ESMTPS id 1BEE71C04B73; Thu, 26 Oct 2023 21:41:39 +0000 (UTC) Received: from pbitcolo-build-10.permabit.com (pbitcolo-build-10.permabit.lab.eng.bos.redhat.com [10.19.117.76]) by smtp.corp.redhat.com (Postfix) with ESMTP id F150B2949; Thu, 26 Oct 2023 21:41:38 +0000 (UTC) Received: by pbitcolo-build-10.permabit.com (Postfix, from userid 1138) id A6DF43003B; Thu, 26 Oct 2023 17:41:38 -0400 (EDT) From: Matthew Sakai To: dm-devel@lists.linux.dev, linux-block@vger.kernel.org Cc: Matthew Sakai Subject: [PATCH v4 11/39] dm vdo: implement the delta index Date: Thu, 26 Oct 2023 17:41:08 -0400 Message-Id: <20231026214136.1067410-12-msakai@redhat.com> In-Reply-To: <20231026214136.1067410-1-msakai@redhat.com> References: <20231026214136.1067410-1-msakai@redhat.com> MIME-Version: 1.0 X-Scanned-By: MIMEDefang 3.4.1 on 10.11.54.5 Precedence: bulk List-ID: X-Mailing-List: linux-block@vger.kernel.org The delta index is a space and memory efficient alternative to a hashtable. Instead of storing the entire key for each entry, the entries are sorted by key and only the difference between adjacent keys (the delta) is stored. If the keys are evenly distributed, the size of the deltas follows an exponential distribution, and the deltas can use a Huffman code to take up even less space. This structure allows the index to use many fewer bytes per entry than a traditional hash table, but it is slightly more expensive to look up entries, because a request must read and sum every entry in a list of deltas in order to find a given record. The delta index reduces this lookup cost by splitting its key space into many sub-lists, each starting at a fixed key value, so that each individual list is short. Co-developed-by: J. corwin Coburn Signed-off-by: J. corwin Coburn Co-developed-by: Michael Sclafani Signed-off-by: Michael Sclafani Co-developed-by: Thomas Jaskiewicz Signed-off-by: Thomas Jaskiewicz Signed-off-by: Matthew Sakai --- drivers/md/dm-vdo/delta-index.c | 2018 +++++++++++++++++++++++++++++++ drivers/md/dm-vdo/delta-index.h | 292 +++++ drivers/md/dm-vdo/hash-utils.h | 66 + 3 files changed, 2376 insertions(+) create mode 100644 drivers/md/dm-vdo/delta-index.c create mode 100644 drivers/md/dm-vdo/delta-index.h create mode 100644 drivers/md/dm-vdo/hash-utils.h diff --git a/drivers/md/dm-vdo/delta-index.c b/drivers/md/dm-vdo/delta-index.c new file mode 100644 index 000000000000..2cdcd9d4ea27 --- /dev/null +++ b/drivers/md/dm-vdo/delta-index.c @@ -0,0 +1,2018 @@ +// SPDX-License-Identifier: GPL-2.0-only +/* + * Copyright 2023 Red Hat + */ +#include "delta-index.h" + +#include +#include +#include +#include +#include + +#include "config.h" +#include "cpu.h" +#include "errors.h" +#include "logger.h" +#include "memory-alloc.h" +#include "numeric.h" +#include "permassert.h" +#include "string-utils.h" +#include "time-utils.h" +#include "uds.h" + +/* + * The entries in a delta index could be stored in a single delta list, but to reduce search times + * and update costs it uses multiple delta lists. These lists are stored in a single chunk of + * memory managed by the delta_zone structure. The delta_zone can move the data around within its + * memory, so the location of each delta list is recorded as a bit offset into the memory. 
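 * (A small worked example of the delta scheme described above: if a delta
 * list whose key space starts at 1000 holds the keys 1003, 1017, and 1032,
 * only the deltas 3, 14, and 15 are stored; a lookup of 1032 walks the list
 * and sums 3 + 14 + 15 to reconstruct the key.)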
Because + * the volume index can contain over a million delta lists, we want to be efficient with the size + * of the delta list header information. This information is encoded into 16 bytes per list. The + * volume index delta list memory can easily exceed 4 gigabits, so a 64 bit value is needed to + * address the memory. The volume index delta lists average around 6 kilobits, so 16 bits are + * sufficient to store the size of a delta list. + * + * Each delta list is stored as a bit stream. Within the delta list encoding, bits and bytes are + * numbered in little endian order. Within a byte, bit 0 is the least significant bit (0x1), and + * bit 7 is the most significant bit (0x80). Within a bit stream, bit 7 is the most significant bit + * of byte 0, and bit 8 is the least significant bit of byte 1. Within a byte array, a byte's + * number corresponds to its index in the array. + * + * A standard delta list entry is stored as a fixed length payload (the value) followed by a + * variable length key (the delta). A collision entry is used when two block names have the same + * delta list address. A collision entry always follows a standard entry for the hash with which it + * collides, and is encoded with DELTA == 0 with an additional 256-bit field at the end, + * containing the full block name. An entry with a delta of 0 at the beginning of a delta list + * indicates a normal entry. + * + * The delta in each entry is encoded with a variable-length Huffman code to minimize the memory + * used by small deltas. The Huffman code is specified by three parameters, which can be computed + * from the desired mean delta when the index is full. (See compute_coding_constants() for + * details.) + * + * The bit field utilities used to read and write delta entries assume that it is possible to read + * some bytes beyond the end of the bit field, so a delta_zone memory allocation is guarded by two + * invalid delta lists to prevent reading outside the delta_zone memory. The valid delta lists are + * numbered 1 to N, and the guard lists are numbered 0 and N+1. The functions that decode the bit + * stream include a step that skips over bits set to 0 until the first 1 bit is found. A corrupted + * delta list could cause this step to run off the end of the delta_zone memory, so as extra + * protection against this happening, the tail guard list is set to all ones. + * + * The delta_index supports two different forms. The mutable form is created by + * uds_initialize_delta_index(), and is used for the volume index and for open chapter indexes. The + * immutable form is created by uds_initialize_delta_index_page(), and is used for closed (and + * cached) chapter index pages. The immutable form does not allocate delta list headers or + * temporary offsets, and thus is somewhat more memory efficient. + */ + +/* + * This is the largest field size supported by get_field() and set_field(). Any field that is + * larger is not guaranteed to fit in a single byte-aligned u32. + */ +enum { + MAX_FIELD_BITS = (sizeof(u32) - 1) * BITS_PER_BYTE + 1, +}; + +/* + * This is the largest field size supported by get_big_field() and set_big_field(). Any field that + * is larger is not guaranteed to fit in a single byte-aligned u64. + */ +enum { + MAX_BIG_FIELD_BITS = (sizeof(u64) - 1) * BITS_PER_BYTE + 1, +}; + +/* + * This is the number of guard bytes needed at the end of the memory byte array when using the bit + * utilities.
These utilities call get_big_field() and set_big_field(), which can access up to 7 + * bytes beyond the end of the desired field. The definition is written to make it clear how this + * value is derived. + */ +enum { + POST_FIELD_GUARD_BYTES = sizeof(u64) - 1, +}; + +/* The number of guard bits that are needed in the tail guard list */ +enum { + GUARD_BITS = POST_FIELD_GUARD_BYTES * BITS_PER_BYTE +}; + +/* + * The maximum size of a single delta list in bytes. We count guard bytes in this value because a + * buffer of this size can be used with move_bits(). + */ +enum { + DELTA_LIST_MAX_BYTE_COUNT = + ((U16_MAX + BITS_PER_BYTE) / BITS_PER_BYTE + POST_FIELD_GUARD_BYTES) +}; + +/* The number of extra bytes and bits needed to store a collision entry */ +enum { + COLLISION_BYTES = UDS_RECORD_NAME_SIZE, + COLLISION_BITS = COLLISION_BYTES * BITS_PER_BYTE +}; + +/* + * Immutable delta lists are packed into pages containing a header that encodes the delta list + * information into 19 bits per list (64KB bit offset). + */ + +enum { IMMUTABLE_HEADER_SIZE = 19 }; + +/* + * Constants and structures for the saved delta index. "DI" is for delta_index, and -##### is a + * number to increment when the format of the data changes. + */ + +enum { + MAGIC_SIZE = 8, +}; + +static const char DELTA_INDEX_MAGIC[] = "DI-00002"; + +struct delta_index_header { + char magic[MAGIC_SIZE]; + u32 zone_number; + u32 zone_count; + u32 first_list; + u32 list_count; + u64 record_count; + u64 collision_count; +}; + +/* + * Header data used for immutable delta index pages. This data is followed by the delta list offset + * table. + */ +struct delta_page_header { + /* Externally-defined nonce */ + u64 nonce; + /* The virtual chapter number */ + u64 virtual_chapter_number; + /* Index of the first delta list on the page */ + u16 first_list; + /* Number of delta lists on the page */ + u16 list_count; +} __packed; + +static inline u64 get_delta_list_byte_start(const struct delta_list *delta_list) +{ + return delta_list->start / BITS_PER_BYTE; +} + +static inline u16 get_delta_list_byte_size(const struct delta_list *delta_list) +{ + unsigned int bit_offset = delta_list->start % BITS_PER_BYTE; + + return BITS_TO_BYTES(bit_offset + delta_list->size); +} + +static void rebalance_delta_zone(const struct delta_zone *delta_zone, u32 first, u32 last) +{ + struct delta_list *delta_list; + u64 new_start; + + if (first == last) { + /* Only one list is moving, and we know there is space. */ + delta_list = &delta_zone->delta_lists[first]; + new_start = delta_zone->new_offsets[first]; + if (delta_list->start != new_start) { + u64 source; + u64 destination; + + source = get_delta_list_byte_start(delta_list); + delta_list->start = new_start; + destination = get_delta_list_byte_start(delta_list); + memmove(delta_zone->memory + destination, + delta_zone->memory + source, + get_delta_list_byte_size(delta_list)); + } + } else { + /* + * There is more than one list. Divide the problem in half, and use recursive calls + * to process each half. Note that after this computation, first <= middle, and + * middle < last. + */ + u32 middle = (first + last) / 2; + + delta_list = &delta_zone->delta_lists[middle]; + new_start = delta_zone->new_offsets[middle]; + + /* + * The direction that our middle list is moving determines which half of the + * problem must be processed first. 
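 * For example, when the lists in this range are all shifting toward higher
 * addresses, the upper half is rebalanced first so that the bytes the middle
 * list is about to move into have already been vacated before it is copied.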
+ */ + if (new_start > delta_list->start) { + rebalance_delta_zone(delta_zone, middle + 1, last); + rebalance_delta_zone(delta_zone, first, middle); + } else { + rebalance_delta_zone(delta_zone, first, middle); + rebalance_delta_zone(delta_zone, middle + 1, last); + } + } +} + +static inline size_t get_zone_memory_size(unsigned int zone_count, size_t memory_size) +{ + /* Round up so that each zone is a multiple of 64K in size. */ + enum { + ALLOC_BOUNDARY = 64 * 1024, + }; + + return (memory_size / zone_count + ALLOC_BOUNDARY - 1) & -ALLOC_BOUNDARY; +} + +void uds_reset_delta_index(const struct delta_index *delta_index) +{ + unsigned int z; + + /* + * Initialize all delta lists to be empty. We keep 2 extra delta list descriptors, one + * before the first real entry and one after so that we don't need to bounds check the + * array access when calculating preceding and following gap sizes. + */ + for (z = 0; z < delta_index->zone_count; z++) { + u64 list_bits; + u64 spacing; + u64 offset; + unsigned int i; + struct delta_zone *zone = &delta_index->delta_zones[z]; + struct delta_list *delta_lists = zone->delta_lists; + + /* Zeroing the delta list headers initializes the head guard list correctly. */ + memset(delta_lists, 0, (zone->list_count + 2) * sizeof(struct delta_list)); + + /* Set all the bits in the end guard list. */ + list_bits = (u64) zone->size * BITS_PER_BYTE - GUARD_BITS; + delta_lists[zone->list_count + 1].start = list_bits; + delta_lists[zone->list_count + 1].size = GUARD_BITS; + memset(zone->memory + (list_bits / BITS_PER_BYTE), ~0, POST_FIELD_GUARD_BYTES); + + /* Evenly space out the real delta lists by setting regular offsets. */ + spacing = list_bits / zone->list_count; + offset = spacing / 2; + for (i = 1; i <= zone->list_count; i++) { + delta_lists[i].start = offset; + offset += spacing; + } + + /* Update the statistics. */ + zone->discard_count += zone->record_count; + zone->record_count = 0; + zone->collision_count = 0; + } +} + +/* Compute the Huffman coding parameters for the given mean delta. The Huffman code is specified by + * three parameters: + * + * MINBITS The number of bits in the smallest code + * BASE The number of values coded using a code of length MINBITS + * INCR The number of values coded by using one additional bit + * + * These parameters are related by this equation: + * + * BASE + INCR == 1 << MINBITS + * + * The math for the Huffman code of an exponential distribution says that + * + * INCR = log(2) * MEAN_DELTA + * + * Then use the smallest MINBITS value so that + * + * (1 << MINBITS) > INCR + * + * And then + * + * BASE = (1 << MINBITS) - INCR + * + * Now the index can generate a code such that + * - The first BASE values code using MINBITS bits. + * - The next INCR values code using MINBITS+1 bits. + * - The next INCR values code using MINBITS+2 bits. + * - (and so on). + */ +static void compute_coding_constants(u32 mean_delta, u16 *min_bits, u32 *min_keys, u32 *incr_keys) +{ + /* + * We want to compute the rounded value of log(2) * mean_delta. Since we cannot always use + * floating point, use a really good integer approximation. 
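 *
 * Worked example (illustrative numbers): a mean delta of 1024 gives
 * incr_keys = (836158 * 1024 + 603160) / 1206321 = 710, matching
 * round(log(2) * 1024) = 710. The smallest power of two above 710 is
 * 1 << 10, so min_bits = 10 and min_keys = 1024 - 710 = 314: the first 314
 * delta values get 10-bit codes, the next 710 get 11-bit codes, and so on.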
+ */ + *incr_keys = (836158UL * mean_delta + 603160UL) / 1206321UL; + *min_bits = bits_per(*incr_keys + 1); + *min_keys = (1 << *min_bits) - *incr_keys; +} + +void uds_uninitialize_delta_index(struct delta_index *delta_index) +{ + unsigned int z; + + if (delta_index->delta_zones == NULL) + return; + + for (z = 0; z < delta_index->zone_count; z++) { + UDS_FREE(UDS_FORGET(delta_index->delta_zones[z].new_offsets)); + UDS_FREE(UDS_FORGET(delta_index->delta_zones[z].delta_lists)); + UDS_FREE(UDS_FORGET(delta_index->delta_zones[z].memory)); + } + + UDS_FREE(delta_index->delta_zones); + memset(delta_index, 0, sizeof(struct delta_index)); +} + +static int initialize_delta_zone(struct delta_zone *delta_zone, + size_t size, + u32 first_list, + u32 list_count, + u32 mean_delta, + u32 payload_bits, + u8 tag) +{ + int result; + + result = UDS_ALLOCATE(size, u8, "delta list", &delta_zone->memory); + if (result != UDS_SUCCESS) + return result; + + result = UDS_ALLOCATE(list_count + 2, u64, "delta list temp", &delta_zone->new_offsets); + if (result != UDS_SUCCESS) + return result; + + /* Allocate the delta lists. */ + result = UDS_ALLOCATE(list_count + 2, + struct delta_list, + "delta lists", + &delta_zone->delta_lists); + if (result != UDS_SUCCESS) + return result; + + compute_coding_constants(mean_delta, + &delta_zone->min_bits, + &delta_zone->min_keys, + &delta_zone->incr_keys); + delta_zone->value_bits = payload_bits; + delta_zone->buffered_writer = NULL; + delta_zone->size = size; + delta_zone->rebalance_time = 0; + delta_zone->rebalance_count = 0; + delta_zone->record_count = 0; + delta_zone->collision_count = 0; + delta_zone->discard_count = 0; + delta_zone->overflow_count = 0; + delta_zone->first_list = first_list; + delta_zone->list_count = list_count; + delta_zone->tag = tag; + + return UDS_SUCCESS; +} + +int uds_initialize_delta_index(struct delta_index *delta_index, + unsigned int zone_count, + u32 list_count, + u32 mean_delta, + u32 payload_bits, + size_t memory_size, + u8 tag) +{ + int result; + unsigned int z; + size_t zone_memory; + + result = UDS_ALLOCATE(zone_count, + struct delta_zone, + "Delta Index Zones", + &delta_index->delta_zones); + if (result != UDS_SUCCESS) + return result; + + delta_index->zone_count = zone_count; + delta_index->list_count = list_count; + delta_index->lists_per_zone = DIV_ROUND_UP(list_count, zone_count); + delta_index->memory_size = 0; + delta_index->mutable = true; + delta_index->tag = tag; + + for (z = 0; z < zone_count; z++) { + u32 lists_in_zone = delta_index->lists_per_zone; + u32 first_list_in_zone = z * lists_in_zone; + + if (z == zone_count - 1) { + /* + * The last zone gets fewer lists if zone_count doesn't evenly divide + * list_count. We'll have an underflow if the assertion below doesn't hold. 
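+ * For example, 100 lists in 8 zones gives 13 lists per zone, so the last
+ * zone gets only 9; but 10 lists in 8 zones would give the last zone a
+ * first list of 14, which is past the end and is rejected here.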
+ */ + if (delta_index->list_count <= first_list_in_zone) { + uds_uninitialize_delta_index(delta_index); + return uds_log_error_strerror(UDS_INVALID_ARGUMENT, + "%u delta lists not enough for %u zones", + list_count, + zone_count); + } + lists_in_zone = delta_index->list_count - first_list_in_zone; + } + + zone_memory = get_zone_memory_size(zone_count, memory_size); + result = initialize_delta_zone(&delta_index->delta_zones[z], + zone_memory, + first_list_in_zone, + lists_in_zone, + mean_delta, + payload_bits, + tag); + if (result != UDS_SUCCESS) { + uds_uninitialize_delta_index(delta_index); + return result; + } + + delta_index->memory_size += + (sizeof(struct delta_zone) + zone_memory + + (lists_in_zone + 2) * (sizeof(struct delta_list) + sizeof(u64))); + } + + uds_reset_delta_index(delta_index); + return UDS_SUCCESS; +} + +/* Read a bit field from an arbitrary bit boundary. */ +static inline u32 get_field(const u8 *memory, u64 offset, u8 size) +{ + const void *addr = memory + offset / BITS_PER_BYTE; + + return (get_unaligned_le32(addr) >> (offset % BITS_PER_BYTE)) & ((1 << size) - 1); +} + +/* Write a bit field to an arbitrary bit boundary. */ +static inline void set_field(u32 value, u8 *memory, u64 offset, u8 size) +{ + void *addr = memory + offset / BITS_PER_BYTE; + int shift = offset % BITS_PER_BYTE; + u32 data = get_unaligned_le32(addr); + + data &= ~(((1 << size) - 1) << shift); + data |= value << shift; + put_unaligned_le32(data, addr); +} + +/* Get the bit offset to the immutable delta list header. */ +static inline u32 get_immutable_header_offset(u32 list_number) +{ + return sizeof(struct delta_page_header) * BITS_PER_BYTE + + list_number * IMMUTABLE_HEADER_SIZE; +} + +/* Get the bit offset to the start of the immutable delta list bit stream. */ +static inline u32 get_immutable_start(const u8 *memory, u32 list_number) +{ + return get_field(memory, get_immutable_header_offset(list_number), IMMUTABLE_HEADER_SIZE); +} + +/* Set the bit offset to the start of the immutable delta list bit stream. */ +static inline void set_immutable_start(u8 *memory, u32 list_number, u32 start) +{ + set_field(start, memory, get_immutable_header_offset(list_number), IMMUTABLE_HEADER_SIZE); +} + +static bool verify_delta_index_page(u64 nonce, + u16 list_count, + u64 expected_nonce, + u8 *memory, + size_t memory_size) +{ + unsigned int i; + + /* + * Verify the nonce. A mismatch can happen here during rebuild if we haven't written the + * entire volume at least once. + */ + if (nonce != expected_nonce) + return false; + + /* Verify that the number of delta lists can fit in the page. */ + if (list_count > ((memory_size - sizeof(struct delta_page_header)) * + BITS_PER_BYTE / IMMUTABLE_HEADER_SIZE)) + return false; + + /* + * Verify that the first delta list is immediately after the last delta + * list header. + */ + if (get_immutable_start(memory, 0) != get_immutable_header_offset(list_count + 1)) + return false; + + /* Verify that the lists are in the correct order. */ + for (i = 0; i < list_count; i++) + if (get_immutable_start(memory, i) > get_immutable_start(memory, i + 1)) + return false; + + /* + * Verify that the last list ends on the page, and that there is room + * for the post-field guard bits. + */ + if (get_immutable_start(memory, list_count) > + (memory_size - POST_FIELD_GUARD_BYTES) * BITS_PER_BYTE) + return false; + + /* Verify that the guard bytes are correctly set to all ones. 
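+ * The guard bytes let get_big_field() read up to 7 bytes past the last delta
+ * list without running off the page, and because they are all ones a delta
+ * decode cannot scan past them while looking for a set bit.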
*/ + for (i = 0; i < POST_FIELD_GUARD_BYTES; i++) { + if (memory[memory_size - POST_FIELD_GUARD_BYTES + i] != (u8) ~0) + return false; + } + + /* All verifications passed. */ + return true; +} + +/* Initialize a delta index page to refer to a supplied page. */ +int uds_initialize_delta_index_page(struct delta_index_page *delta_index_page, + u64 expected_nonce, + u32 mean_delta, + u32 payload_bits, + u8 *memory, + size_t memory_size) +{ + u64 nonce; + u64 vcn; + u64 first_list; + u64 list_count; + struct delta_page_header *header = (struct delta_page_header *) memory; + struct delta_zone *delta_zone = &delta_index_page->delta_zone; + const u8 *nonce_addr = (const u8 *) &header->nonce; + const u8 *vcn_addr = (const u8 *) &header->virtual_chapter_number; + const u8 *first_list_addr = (const u8 *) &header->first_list; + const u8 *list_count_addr = (const u8 *) &header->list_count; + + /* First assume that the header is little endian. */ + nonce = get_unaligned_le64(nonce_addr); + vcn = get_unaligned_le64(vcn_addr); + first_list = get_unaligned_le16(first_list_addr); + list_count = get_unaligned_le16(list_count_addr); + if (!verify_delta_index_page(nonce, list_count, expected_nonce, memory, memory_size)) { + /* If that fails, try big endian. */ + nonce = get_unaligned_be64(nonce_addr); + vcn = get_unaligned_be64(vcn_addr); + first_list = get_unaligned_be16(first_list_addr); + list_count = get_unaligned_be16(list_count_addr); + if (!verify_delta_index_page(nonce, + list_count, + expected_nonce, + memory, + memory_size)) + /* + * Both attempts failed. Do not log this as an error, because it can happen + * during a rebuild if we haven't written the entire volume at least once. + */ + return UDS_CORRUPT_DATA; + } + + delta_index_page->delta_index.delta_zones = delta_zone; + delta_index_page->delta_index.zone_count = 1; + delta_index_page->delta_index.list_count = list_count; + delta_index_page->delta_index.lists_per_zone = list_count; + delta_index_page->delta_index.mutable = false; + delta_index_page->delta_index.tag = 'p'; + delta_index_page->virtual_chapter_number = vcn; + delta_index_page->lowest_list_number = first_list; + delta_index_page->highest_list_number = first_list + list_count - 1; + + compute_coding_constants(mean_delta, + &delta_zone->min_bits, + &delta_zone->min_keys, + &delta_zone->incr_keys); + delta_zone->value_bits = payload_bits; + delta_zone->memory = memory; + delta_zone->delta_lists = NULL; + delta_zone->new_offsets = NULL; + delta_zone->buffered_writer = NULL; + delta_zone->size = memory_size; + delta_zone->rebalance_time = 0; + delta_zone->rebalance_count = 0; + delta_zone->record_count = 0; + delta_zone->collision_count = 0; + delta_zone->discard_count = 0; + delta_zone->overflow_count = 0; + delta_zone->first_list = 0; + delta_zone->list_count = list_count; + delta_zone->tag = 'p'; + + return UDS_SUCCESS; +} + +/* Read a large bit field from an arbitrary bit boundary. */ +static inline u64 get_big_field(const u8 *memory, u64 offset, u8 size) +{ + const void *addr = memory + offset / BITS_PER_BYTE; + + return (get_unaligned_le64(addr) >> (offset % BITS_PER_BYTE)) & ((1UL << size) - 1); +} + +/* Write a large bit field to an arbitrary bit boundary. 
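+ * Because the value is accessed through a single 64-bit read and write at the byte containing the
+ * offset, the field size plus the bit shift must fit in 64 bits, so size is limited to 57 bits.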
*/ +static inline void set_big_field(u64 value, u8 *memory, u64 offset, u8 size) +{ + void *addr = memory + offset / BITS_PER_BYTE; + u8 shift = offset % BITS_PER_BYTE; + u64 data = get_unaligned_le64(addr); + + data &= ~(((1UL << size) - 1) << shift); + data |= value << shift; + put_unaligned_le64(data, addr); +} + +/* Set a sequence of bits to all zeros. */ +static inline void set_zero(u8 *memory, u64 offset, u32 size) +{ + if (size > 0) { + u8 *addr = memory + offset / BITS_PER_BYTE; + u8 shift = offset % BITS_PER_BYTE; + u32 count = size + shift > BITS_PER_BYTE ? (u32) BITS_PER_BYTE - shift : size; + + *addr++ &= ~(((1 << count) - 1) << shift); + for (size -= count; size > BITS_PER_BYTE; size -= BITS_PER_BYTE) + *addr++ = 0; + + if (size > 0) + *addr &= 0xFF << size; + } +} + +/* + * Move several bits from a higher to a lower address, moving the lower addressed bits first. The + * size and memory offsets are measured in bits. + */ +static void move_bits_down(const u8 *from, u64 from_offset, u8 *to, u64 to_offset, u32 size) +{ + const u8 *source; + u8 *destination; + u8 offset; + u8 count; + u64 field; + + /* Start by moving one field that ends on a to int boundary. */ + count = (MAX_BIG_FIELD_BITS - ((to_offset + MAX_BIG_FIELD_BITS) % BITS_PER_TYPE(u32))); + field = get_big_field(from, from_offset, count); + set_big_field(field, to, to_offset, count); + from_offset += count; + to_offset += count; + size -= count; + + /* Now do the main loop to copy 32 bit chunks that are int-aligned at the destination. */ + offset = from_offset % BITS_PER_TYPE(u32); + source = from + (from_offset - offset) / BITS_PER_BYTE; + destination = to + to_offset / BITS_PER_BYTE; + while (size > MAX_BIG_FIELD_BITS) { + put_unaligned_le32(get_unaligned_le64(source) >> offset, destination); + source += sizeof(u32); + destination += sizeof(u32); + from_offset += BITS_PER_TYPE(u32); + to_offset += BITS_PER_TYPE(u32); + size -= BITS_PER_TYPE(u32); + } + + /* Finish up by moving any remaining bits. */ + if (size > 0) { + field = get_big_field(from, from_offset, size); + set_big_field(field, to, to_offset, size); + } +} + +/* + * Move several bits from a lower to a higher address, moving the higher addressed bits first. The + * size and memory offsets are measured in bits. + */ +static void move_bits_up(const u8 *from, u64 from_offset, u8 *to, u64 to_offset, u32 size) +{ + const u8 *source; + u8 *destination; + u8 offset; + u8 count; + u64 field; + + /* Start by moving one field that begins on a destination int boundary. */ + count = (to_offset + size) % BITS_PER_TYPE(u32); + if (count > 0) { + size -= count; + field = get_big_field(from, from_offset + size, count); + set_big_field(field, to, to_offset + size, count); + } + + /* Now do the main loop to copy 32 bit chunks that are int-aligned at the destination. */ + offset = (from_offset + size) % BITS_PER_TYPE(u32); + source = from + (from_offset + size - offset) / BITS_PER_BYTE; + destination = to + (to_offset + size) / BITS_PER_BYTE; + while (size > MAX_BIG_FIELD_BITS) { + source -= sizeof(u32); + destination -= sizeof(u32); + size -= BITS_PER_TYPE(u32); + put_unaligned_le32(get_unaligned_le64(source) >> offset, destination); + } + + /* Finish up by moving any remaining bits. */ + if (size > 0) { + field = get_big_field(from, from_offset, size); + set_big_field(field, to, to_offset, size); + } +} + +/* + * Move bits from one field to another. 
When the fields overlap, behave as if we first move all the + * bits from the source to a temporary value, and then move all the bits from the temporary value + * to the destination. The size and memory offsets are measured in bits. + */ +static void move_bits(const u8 *from, u64 from_offset, u8 *to, u64 to_offset, u32 size) +{ + u64 field; + + /* A small move doesn't require special handling. */ + if (size <= MAX_BIG_FIELD_BITS) { + if (size > 0) { + field = get_big_field(from, from_offset, size); + set_big_field(field, to, to_offset, size); + } + + return; + } + + if (from_offset > to_offset) + move_bits_down(from, from_offset, to, to_offset, size); + else + move_bits_up(from, from_offset, to, to_offset, size); +} + +/* + * Pack delta lists from a mutable delta index into an immutable delta index page. A range of delta + * lists (starting with a specified list index) is copied from the mutable delta index into a + * memory page used in the immutable index. The number of lists copied onto the page is returned in + * list_count. + */ +int uds_pack_delta_index_page(const struct delta_index *delta_index, + u64 header_nonce, + u8 *memory, + size_t memory_size, + u64 virtual_chapter_number, + u32 first_list, + u32 *list_count) +{ + const struct delta_zone *delta_zone; + struct delta_list *delta_lists; + u32 max_lists; + u32 n_lists = 0; + u32 offset; + u32 i; + int free_bits; + int bits; + struct delta_page_header *header; + + delta_zone = &delta_index->delta_zones[0]; + delta_lists = &delta_zone->delta_lists[first_list + 1]; + max_lists = delta_index->list_count - first_list; + + /* + * Compute how many lists will fit on the page. Subtract the size of the fixed header, one + * delta list offset, and the guard bytes from the page size to determine how much space is + * available for delta lists. + */ + free_bits = memory_size * BITS_PER_BYTE; + free_bits -= get_immutable_header_offset(1); + free_bits -= GUARD_BITS; + if (free_bits < IMMUTABLE_HEADER_SIZE) + /* This page is too small to store any delta lists. */ + return uds_log_error_strerror(UDS_OVERFLOW, + "Chapter Index Page of %zu bytes is too small", + memory_size); + + while (n_lists < max_lists) { + /* Each list requires a delta list offset and the list data. */ + bits = IMMUTABLE_HEADER_SIZE + delta_lists[n_lists].size; + if (bits > free_bits) + break; + + n_lists++; + free_bits -= bits; + } + + *list_count = n_lists; + + header = (struct delta_page_header *) memory; + put_unaligned_le64(header_nonce, (u8 *) &header->nonce); + put_unaligned_le64(virtual_chapter_number, (u8 *) &header->virtual_chapter_number); + put_unaligned_le16(first_list, (u8 *) &header->first_list); + put_unaligned_le16(n_lists, (u8 *) &header->list_count); + + /* Construct the delta list offset table. */ + offset = get_immutable_header_offset(n_lists + 1); + set_immutable_start(memory, 0, offset); + for (i = 0; i < n_lists; i++) { + offset += delta_lists[i].size; + set_immutable_start(memory, i + 1, offset); + } + + /* Copy the delta list data onto the memory page. */ + for (i = 0; i < n_lists; i++) + move_bits(delta_zone->memory, + delta_lists[i].start, + memory, + get_immutable_start(memory, i), + delta_lists[i].size); + + /* Set all the bits in the guard bytes. */ + memset(memory + memory_size - POST_FIELD_GUARD_BYTES, ~0, POST_FIELD_GUARD_BYTES); + return UDS_SUCCESS; +} + +/* Compute the new offsets of the delta lists. 
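+ * The new layout spreads the unused space evenly, leaving about half a gap before the first list
+ * and a full gap between consecutive lists, and reserves growing_size extra bytes immediately
+ * before the list at growing_index.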
*/ +static void compute_new_list_offsets(struct delta_zone *delta_zone, + u32 growing_index, + size_t growing_size, + size_t used_space) +{ + size_t spacing; + u32 i; + struct delta_list *delta_lists = delta_zone->delta_lists; + u32 tail_guard_index = delta_zone->list_count + 1; + + spacing = (delta_zone->size - used_space) / delta_zone->list_count; + delta_zone->new_offsets[0] = 0; + for (i = 0; i <= delta_zone->list_count; i++) { + delta_zone->new_offsets[i + 1] = + (delta_zone->new_offsets[i] + + get_delta_list_byte_size(&delta_lists[i]) + spacing); + delta_zone->new_offsets[i] *= BITS_PER_BYTE; + delta_zone->new_offsets[i] += delta_lists[i].start % BITS_PER_BYTE; + if (i == 0) + delta_zone->new_offsets[i + 1] -= spacing / 2; + if (i + 1 == growing_index) + delta_zone->new_offsets[i + 1] += growing_size; + } + + delta_zone->new_offsets[tail_guard_index] = + (delta_zone->size * BITS_PER_BYTE - delta_lists[tail_guard_index].size); +} + +static void rebalance_lists(struct delta_zone *delta_zone) +{ + struct delta_list *delta_lists; + u32 i; + size_t used_space = 0; + + /* Extend and balance memory to receive the delta lists */ + delta_lists = delta_zone->delta_lists; + for (i = 0; i <= delta_zone->list_count + 1; i++) + used_space += get_delta_list_byte_size(&delta_lists[i]); + + compute_new_list_offsets(delta_zone, 0, 0, used_space); + for (i = 1; i <= delta_zone->list_count + 1; i++) + delta_lists[i].start = delta_zone->new_offsets[i]; +} + +/* Start restoring a delta index from multiple input streams. */ +int uds_start_restoring_delta_index(struct delta_index *delta_index, + struct buffered_reader **buffered_readers, + unsigned int reader_count) +{ + int result; + unsigned int zone_count = reader_count; + u64 record_count = 0; + u64 collision_count = 0; + u32 first_list[MAX_ZONES]; + u32 list_count[MAX_ZONES]; + unsigned int z; + u32 list_next = 0; + const struct delta_zone *delta_zone; + + /* Read and validate each header. 
*/ + for (z = 0; z < zone_count; z++) { + struct delta_index_header header; + u8 buffer[sizeof(struct delta_index_header)]; + size_t offset = 0; + + result = uds_read_from_buffered_reader(buffered_readers[z], + buffer, + sizeof(buffer)); + if (result != UDS_SUCCESS) + return uds_log_warning_strerror(result, + "failed to read delta index header"); + + memcpy(&header.magic, buffer, MAGIC_SIZE); + offset += MAGIC_SIZE; + decode_u32_le(buffer, &offset, &header.zone_number); + decode_u32_le(buffer, &offset, &header.zone_count); + decode_u32_le(buffer, &offset, &header.first_list); + decode_u32_le(buffer, &offset, &header.list_count); + decode_u64_le(buffer, &offset, &header.record_count); + decode_u64_le(buffer, &offset, &header.collision_count); + + result = ASSERT(offset == sizeof(struct delta_index_header), + "%zu bytes decoded of %zu expected", + offset, + sizeof(struct delta_index_header)); + if (result != UDS_SUCCESS) + return uds_log_warning_strerror(result, + "failed to read delta index header"); + + if (memcmp(header.magic, DELTA_INDEX_MAGIC, MAGIC_SIZE) != 0) + return uds_log_warning_strerror(UDS_CORRUPT_DATA, + "delta index file has bad magic number"); + + if (zone_count != header.zone_count) + return uds_log_warning_strerror(UDS_CORRUPT_DATA, + "delta index files contain mismatched zone counts (%u,%u)", + zone_count, + header.zone_count); + + if (header.zone_number != z) + return uds_log_warning_strerror(UDS_CORRUPT_DATA, + "delta index zone %u found in slot %u", + header.zone_number, + z); + + first_list[z] = header.first_list; + list_count[z] = header.list_count; + record_count += header.record_count; + collision_count += header.collision_count; + + if (first_list[z] != list_next) + return uds_log_warning_strerror(UDS_CORRUPT_DATA, + "delta index file for zone %u starts with list %u instead of list %u", + z, + first_list[z], + list_next); + + list_next += list_count[z]; + } + + if (list_next != delta_index->list_count) + return uds_log_warning_strerror(UDS_CORRUPT_DATA, + "delta index files contain %u delta lists instead of %u delta lists", + list_next, + delta_index->list_count); + + if (collision_count > record_count) + return uds_log_warning_strerror(UDS_CORRUPT_DATA, + "delta index files contain %llu collisions and %llu records", + (unsigned long long) collision_count, + (unsigned long long) record_count); + + uds_reset_delta_index(delta_index); + delta_index->delta_zones[0].record_count = record_count; + delta_index->delta_zones[0].collision_count = collision_count; + + /* Read the delta lists and distribute them to the proper zones. */ + for (z = 0; z < zone_count; z++) { + u32 i; + + delta_index->load_lists[z] = 0; + for (i = 0; i < list_count[z]; i++) { + u16 delta_list_size; + u32 list_number; + unsigned int zone_number; + u8 size_data[sizeof(u16)]; + + result = uds_read_from_buffered_reader(buffered_readers[z], + size_data, + sizeof(size_data)); + if (result != UDS_SUCCESS) + return uds_log_warning_strerror(result, + "failed to read delta index size"); + + delta_list_size = get_unaligned_le16(size_data); + if (delta_list_size > 0) + delta_index->load_lists[z] += 1; + + list_number = first_list[z] + i; + zone_number = list_number / delta_index->lists_per_zone; + delta_zone = &delta_index->delta_zones[zone_number]; + list_number -= delta_zone->first_list; + delta_zone->delta_lists[list_number + 1].size = delta_list_size; + } + } + + /* Prepare each zone to start receiving the delta list data. 
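+ * Now that every list has its restored size, rebalance_lists() lays the still-empty lists out
+ * across the zone memory so that each one is already at its final position when its data arrives.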
*/ + for (z = 0; z < delta_index->zone_count; z++) + rebalance_lists(&delta_index->delta_zones[z]); + return UDS_SUCCESS; +} + +static int restore_delta_list_to_zone(struct delta_zone *delta_zone, + const struct delta_list_save_info *save_info, + const u8 *data) +{ + struct delta_list *delta_list; + u16 bit_count; + u16 byte_count; + u32 list_number = save_info->index - delta_zone->first_list; + + if (list_number >= delta_zone->list_count) + return uds_log_warning_strerror(UDS_CORRUPT_DATA, + "invalid delta list number %u not in range [%u,%u)", + save_info->index, + delta_zone->first_list, + delta_zone->first_list + + delta_zone->list_count); + + delta_list = &delta_zone->delta_lists[list_number + 1]; + if (delta_list->size == 0) + return uds_log_warning_strerror(UDS_CORRUPT_DATA, + "unexpected delta list number %u", + save_info->index); + + bit_count = delta_list->size + save_info->bit_offset; + byte_count = BITS_TO_BYTES(bit_count); + if (save_info->byte_count != byte_count) + return uds_log_warning_strerror(UDS_CORRUPT_DATA, + "unexpected delta list size %u != %u", + save_info->byte_count, + byte_count); + + move_bits(data, + save_info->bit_offset, + delta_zone->memory, + delta_list->start, + delta_list->size); + return UDS_SUCCESS; +} + +static int restore_delta_list_data(struct delta_index *delta_index, + unsigned int load_zone, + struct buffered_reader *buffered_reader, + u8 *data) +{ + int result; + struct delta_list_save_info save_info; + u8 buffer[sizeof(struct delta_list_save_info)]; + unsigned int new_zone; + + result = uds_read_from_buffered_reader(buffered_reader, buffer, sizeof(buffer)); + if (result != UDS_SUCCESS) + return uds_log_warning_strerror(result, "failed to read delta list data"); + + save_info = (struct delta_list_save_info) { + .tag = buffer[0], + .bit_offset = buffer[1], + .byte_count = get_unaligned_le16(&buffer[2]), + .index = get_unaligned_le32(&buffer[4]), + }; + + if ((save_info.bit_offset >= BITS_PER_BYTE) || + (save_info.byte_count > DELTA_LIST_MAX_BYTE_COUNT)) + return uds_log_warning_strerror(UDS_CORRUPT_DATA, "corrupt delta list data"); + + /* Make sure the data is intended for this delta index. */ + if (save_info.tag != delta_index->tag) + return UDS_CORRUPT_DATA; + + if (save_info.index >= delta_index->list_count) + return uds_log_warning_strerror(UDS_CORRUPT_DATA, + "invalid delta list number %u of %u", + save_info.index, + delta_index->list_count); + + result = uds_read_from_buffered_reader(buffered_reader, data, save_info.byte_count); + if (result != UDS_SUCCESS) + return uds_log_warning_strerror(result, + "failed to read delta list data"); + + delta_index->load_lists[load_zone] -= 1; + new_zone = save_info.index / delta_index->lists_per_zone; + return restore_delta_list_to_zone(&delta_index->delta_zones[new_zone], &save_info, data); +} + +/* Restore delta lists from saved data. 
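+ * Each list in the input stream is preceded by a delta_list_save_info record giving its tag, bit
+ * offset, byte count, and list number; reading continues until every zone has received all of the
+ * non-empty lists counted in load_lists.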
*/ +int uds_finish_restoring_delta_index(struct delta_index *delta_index, + struct buffered_reader **buffered_readers, + unsigned int reader_count) +{ + int result; + int saved_result = UDS_SUCCESS; + unsigned int z; + u8 *data; + + result = UDS_ALLOCATE(DELTA_LIST_MAX_BYTE_COUNT, u8, __func__, &data); + if (result != UDS_SUCCESS) + return result; + + for (z = 0; z < reader_count; z++) { + while (delta_index->load_lists[z] > 0) { + result = restore_delta_list_data(delta_index, + z, + buffered_readers[z], + data); + if (result != UDS_SUCCESS) { + saved_result = result; + break; + } + } + } + + UDS_FREE(data); + return saved_result; +} + +int uds_check_guard_delta_lists(struct buffered_reader **buffered_readers, + unsigned int reader_count) +{ + int result; + unsigned int z; + u8 buffer[sizeof(struct delta_list_save_info)]; + + for (z = 0; z < reader_count; z++) { + result = uds_read_from_buffered_reader(buffered_readers[z], + buffer, + sizeof(buffer)); + if (result != UDS_SUCCESS) + return result; + + if (buffer[0] != 'z') + return UDS_CORRUPT_DATA; + } + + return UDS_SUCCESS; +} + +static int flush_delta_list(struct delta_zone *zone, u32 flush_index) +{ + struct delta_list *delta_list; + u8 buffer[sizeof(struct delta_list_save_info)]; + int result; + + delta_list = &zone->delta_lists[flush_index + 1]; + + buffer[0] = zone->tag; + buffer[1] = delta_list->start % BITS_PER_BYTE; + put_unaligned_le16(get_delta_list_byte_size(delta_list), &buffer[2]); + put_unaligned_le32(zone->first_list + flush_index, &buffer[4]); + + result = uds_write_to_buffered_writer(zone->buffered_writer, buffer, sizeof(buffer)); + if (result != UDS_SUCCESS) { + uds_log_warning_strerror(result, "failed to write delta list memory"); + return result; + } + + result = uds_write_to_buffered_writer(zone->buffered_writer, + zone->memory + get_delta_list_byte_start(delta_list), + get_delta_list_byte_size(delta_list)); + if (result != UDS_SUCCESS) + uds_log_warning_strerror(result, "failed to write delta list memory"); + + return result; +} + +/* Start saving a delta index zone to a buffered output stream. 
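+ * This writes the zone header followed by the 16-bit size of each delta list; the list data
+ * itself is written later by uds_finish_saving_delta_index() through the buffered writer saved
+ * here.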
*/ +int uds_start_saving_delta_index(const struct delta_index *delta_index, + unsigned int zone_number, + struct buffered_writer *buffered_writer) +{ + int result; + u32 i; + struct delta_zone *delta_zone; + u8 buffer[sizeof(struct delta_index_header)]; + size_t offset = 0; + + delta_zone = &delta_index->delta_zones[zone_number]; + memcpy(buffer, DELTA_INDEX_MAGIC, MAGIC_SIZE); + offset += MAGIC_SIZE; + encode_u32_le(buffer, &offset, zone_number); + encode_u32_le(buffer, &offset, delta_index->zone_count); + encode_u32_le(buffer, &offset, delta_zone->first_list); + encode_u32_le(buffer, &offset, delta_zone->list_count); + encode_u64_le(buffer, &offset, delta_zone->record_count); + encode_u64_le(buffer, &offset, delta_zone->collision_count); + + result = ASSERT(offset == sizeof(struct delta_index_header), + "%zu bytes encoded of %zu expected", + offset, + sizeof(struct delta_index_header)); + if (result != UDS_SUCCESS) + return result; + + result = uds_write_to_buffered_writer(buffered_writer, buffer, offset); + if (result != UDS_SUCCESS) + return uds_log_warning_strerror(result, "failed to write delta index header"); + + for (i = 0; i < delta_zone->list_count; i++) { + u8 data[sizeof(u16)]; + struct delta_list *delta_list; + + delta_list = &delta_zone->delta_lists[i + 1]; + put_unaligned_le16(delta_list->size, data); + result = uds_write_to_buffered_writer(buffered_writer, data, sizeof(data)); + if (result != UDS_SUCCESS) + return uds_log_warning_strerror(result, "failed to write delta list size"); + } + + delta_zone->buffered_writer = buffered_writer; + return UDS_SUCCESS; +} + +int uds_finish_saving_delta_index(const struct delta_index *delta_index, unsigned int zone_number) +{ + int result; + int first_error = UDS_SUCCESS; + u32 i; + struct delta_zone *delta_zone; + struct delta_list *delta_list; + + delta_zone = &delta_index->delta_zones[zone_number]; + for (i = 0; i < delta_zone->list_count; i++) { + delta_list = &delta_zone->delta_lists[i + 1]; + if (delta_list->size > 0) { + result = flush_delta_list(delta_zone, i); + if ((result != UDS_SUCCESS) && (first_error == UDS_SUCCESS)) + first_error = result; + } + } + + delta_zone->buffered_writer = NULL; + return first_error; +} + +int uds_write_guard_delta_list(struct buffered_writer *buffered_writer) +{ + int result; + u8 buffer[sizeof(struct delta_list_save_info)]; + + memset(buffer, 0, sizeof(struct delta_list_save_info)); + buffer[0] = 'z'; + + result = uds_write_to_buffered_writer(buffered_writer, buffer, sizeof(buffer)); + if (result != UDS_SUCCESS) + uds_log_warning_strerror(result, "failed to write guard delta list"); + + return UDS_SUCCESS; +} + +size_t uds_compute_delta_index_save_bytes(u32 list_count, size_t memory_size) +{ + /* One zone will use at least as much memory as other zone counts. */ + return (sizeof(struct delta_index_header) + + list_count * (sizeof(struct delta_list_save_info) + 1) + + get_zone_memory_size(1, memory_size)); +} + +static int assert_not_at_end(const struct delta_index_entry *delta_entry) +{ + int result = ASSERT(!delta_entry->at_end, + "operation is invalid because the list entry is at the end of the delta list"); + if (result != UDS_SUCCESS) + result = UDS_BAD_STATE; + + return result; +} + +/* + * Prepare to search for an entry in the specified delta list. + * + * This is always the first function to be called when dealing with delta index entries. It is + * always followed by calls to uds_next_delta_index_entry() to iterate through a delta list. 
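+ * A typical lookup (a sketch only, with error handling omitted) looks like:
+ *
+ *   uds_start_delta_index_search(index, list_number, key, &entry);
+ *   do {
+ *           uds_next_delta_index_entry(&entry);
+ *   } while (!entry.at_end && (key > entry.key));
+ *
+ * which is essentially what uds_get_delta_index_entry() does before checking for collisions.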
The + * fields of the delta_index_entry argument will be set up for iteration, but will not contain an + * entry from the list. + */ +int uds_start_delta_index_search(const struct delta_index *delta_index, + u32 list_number, + u32 key, + struct delta_index_entry *delta_entry) +{ + int result; + unsigned int zone_number; + struct delta_zone *delta_zone; + struct delta_list *delta_list; + + result = ASSERT((list_number < delta_index->list_count), + "Delta list number (%u) is out of range (%u)", + list_number, + delta_index->list_count); + if (result != UDS_SUCCESS) + return UDS_CORRUPT_DATA; + + zone_number = list_number / delta_index->lists_per_zone; + delta_zone = &delta_index->delta_zones[zone_number]; + list_number -= delta_zone->first_list; + result = ASSERT((list_number < delta_zone->list_count), + "Delta list number (%u) is out of range (%u) for zone (%u)", + list_number, + delta_zone->list_count, + zone_number); + if (result != UDS_SUCCESS) + return UDS_CORRUPT_DATA; + + if (delta_index->mutable) { + delta_list = &delta_zone->delta_lists[list_number + 1]; + } else { + u32 end_offset; + + /* + * Translate the immutable delta list header into a temporary + * full delta list header. + */ + delta_list = &delta_entry->temp_delta_list; + delta_list->start = get_immutable_start(delta_zone->memory, list_number); + end_offset = get_immutable_start(delta_zone->memory, list_number + 1); + delta_list->size = end_offset - delta_list->start; + delta_list->save_key = 0; + delta_list->save_offset = 0; + } + + if (key > delta_list->save_key) { + delta_entry->key = delta_list->save_key; + delta_entry->offset = delta_list->save_offset; + } else { + delta_entry->key = 0; + delta_entry->offset = 0; + if (key == 0) + /* + * This usually means we're about to walk the entire delta list, so get all + * of it into the CPU cache. + */ + uds_prefetch_range(&delta_zone->memory[delta_list->start / BITS_PER_BYTE], + delta_list->size / BITS_PER_BYTE, + false); + } + + delta_entry->at_end = false; + delta_entry->delta_zone = delta_zone; + delta_entry->delta_list = delta_list; + delta_entry->entry_bits = 0; + delta_entry->is_collision = false; + delta_entry->list_number = list_number; + delta_entry->list_overflow = false; + delta_entry->value_bits = delta_zone->value_bits; + return UDS_SUCCESS; +} + +static inline u64 get_delta_entry_offset(const struct delta_index_entry *delta_entry) +{ + return delta_entry->delta_list->start + delta_entry->offset; +} + +/* + * Decode a delta index entry delta value. The delta_index_entry basically describes the previous + * list entry, and has had its offset field changed to point to the subsequent entry. We decode the + * bit stream and update the delta_list_entry to describe the entry. 
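+ * Each entry is encoded as the value (value_bits bits), then the delta: a field of min_bits bits,
+ * and, when that field is at least min_keys, a run of zero bits terminated by a one bit, with
+ * each zero adding incr_keys to the decoded delta.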
+ */ +static inline void decode_delta(struct delta_index_entry *delta_entry) +{ + int key_bits; + u32 delta; + const struct delta_zone *delta_zone = delta_entry->delta_zone; + const u8 *memory = delta_zone->memory; + u64 delta_offset = get_delta_entry_offset(delta_entry) + delta_entry->value_bits; + const u8 *addr = memory + delta_offset / BITS_PER_BYTE; + int offset = delta_offset % BITS_PER_BYTE; + u32 data = get_unaligned_le32(addr) >> offset; + + addr += sizeof(u32); + key_bits = delta_zone->min_bits; + delta = data & ((1 << key_bits) - 1); + if (delta >= delta_zone->min_keys) { + data >>= key_bits; + if (data == 0) { + key_bits = sizeof(u32) * BITS_PER_BYTE - offset; + while ((data = get_unaligned_le32(addr)) == 0) { + addr += sizeof(u32); + key_bits += sizeof(u32) * BITS_PER_BYTE; + } + } + key_bits += ffs(data); + delta += ((key_bits - delta_zone->min_bits - 1) * delta_zone->incr_keys); + } + delta_entry->delta = delta; + delta_entry->key += delta; + + /* Check for a collision, a delta of zero after the start. */ + if (unlikely((delta == 0) && (delta_entry->offset > 0))) { + delta_entry->is_collision = true; + delta_entry->entry_bits = delta_entry->value_bits + key_bits + COLLISION_BITS; + } else { + delta_entry->is_collision = false; + delta_entry->entry_bits = delta_entry->value_bits + key_bits; + } +} + +noinline int uds_next_delta_index_entry(struct delta_index_entry *delta_entry) +{ + int result; + const struct delta_list *delta_list; + u32 next_offset; + u16 size; + + result = assert_not_at_end(delta_entry); + if (result != UDS_SUCCESS) + return result; + + delta_list = delta_entry->delta_list; + delta_entry->offset += delta_entry->entry_bits; + size = delta_list->size; + if (unlikely(delta_entry->offset >= size)) { + delta_entry->at_end = true; + delta_entry->delta = 0; + delta_entry->is_collision = false; + result = ASSERT((delta_entry->offset == size), + "next offset past end of delta list"); + if (result != UDS_SUCCESS) + result = UDS_CORRUPT_DATA; + + return result; + } + + decode_delta(delta_entry); + + next_offset = delta_entry->offset + delta_entry->entry_bits; + if (next_offset > size) { + /* + * This is not an assertion because uds_validate_chapter_index_page() wants to + * handle this error. 
+ */ + uds_log_warning("Decoded past the end of the delta list"); + return UDS_CORRUPT_DATA; + } + + return UDS_SUCCESS; +} + +int uds_remember_delta_index_offset(const struct delta_index_entry *delta_entry) +{ + int result; + struct delta_list *delta_list = delta_entry->delta_list; + + result = ASSERT(!delta_entry->is_collision, "entry is not a collision"); + if (result != UDS_SUCCESS) + return result; + + delta_list->save_key = delta_entry->key - delta_entry->delta; + delta_list->save_offset = delta_entry->offset; + return UDS_SUCCESS; +} + +static void set_delta(struct delta_index_entry *delta_entry, u32 delta) +{ + const struct delta_zone *delta_zone = delta_entry->delta_zone; + u32 key_bits = (delta_zone->min_bits + + ((delta_zone->incr_keys - + delta_zone->min_keys + delta) / + delta_zone->incr_keys)); + + delta_entry->delta = delta; + delta_entry->entry_bits = delta_entry->value_bits + key_bits; +} + +static void get_collision_name(const struct delta_index_entry *entry, u8 *name) +{ + u64 offset = get_delta_entry_offset(entry) + entry->entry_bits - COLLISION_BITS; + const u8 *addr = entry->delta_zone->memory + offset / BITS_PER_BYTE; + int size = COLLISION_BYTES; + int shift = offset % BITS_PER_BYTE; + + while (--size >= 0) + *name++ = get_unaligned_le16(addr++) >> shift; +} + +static void set_collision_name(const struct delta_index_entry *entry, const u8 *name) +{ + u64 offset = get_delta_entry_offset(entry) + entry->entry_bits - COLLISION_BITS; + u8 *addr = entry->delta_zone->memory + offset / BITS_PER_BYTE; + int size = COLLISION_BYTES; + int shift = offset % BITS_PER_BYTE; + u16 mask = ~((u16) 0xFF << shift); + u16 data; + + while (--size >= 0) { + data = (get_unaligned_le16(addr) & mask) | (*name++ << shift); + put_unaligned_le16(data, addr++); + } +} + +int uds_get_delta_index_entry(const struct delta_index *delta_index, + u32 list_number, + u32 key, + const u8 *name, + struct delta_index_entry *delta_entry) +{ + int result; + + result = uds_start_delta_index_search(delta_index, list_number, key, delta_entry); + if (result != UDS_SUCCESS) + return result; + + do { + result = uds_next_delta_index_entry(delta_entry); + if (result != UDS_SUCCESS) + return result; + } while (!delta_entry->at_end && (key > delta_entry->key)); + + result = uds_remember_delta_index_offset(delta_entry); + if (result != UDS_SUCCESS) + return result; + + if (!delta_entry->at_end && (key == delta_entry->key)) { + struct delta_index_entry collision_entry = *delta_entry; + + for (;;) { + u8 full_name[COLLISION_BYTES]; + + result = uds_next_delta_index_entry(&collision_entry); + if (result != UDS_SUCCESS) + return result; + + if (collision_entry.at_end || !collision_entry.is_collision) + break; + + get_collision_name(&collision_entry, full_name); + if (memcmp(full_name, name, COLLISION_BYTES) == 0) { + *delta_entry = collision_entry; + break; + } + } + } + + return UDS_SUCCESS; +} + +int uds_get_delta_entry_collision(const struct delta_index_entry *delta_entry, u8 *name) +{ + int result; + + result = assert_not_at_end(delta_entry); + if (result != UDS_SUCCESS) + return result; + + result = ASSERT(delta_entry->is_collision, + "Cannot get full block name from a non-collision delta index entry"); + if (result != UDS_SUCCESS) + return UDS_BAD_STATE; + + get_collision_name(delta_entry, name); + return UDS_SUCCESS; +} + +u32 uds_get_delta_entry_value(const struct delta_index_entry *delta_entry) +{ + return get_field(delta_entry->delta_zone->memory, + get_delta_entry_offset(delta_entry), + 
delta_entry->value_bits); +} + +static int assert_mutable_entry(const struct delta_index_entry *delta_entry) +{ + int result = ASSERT((delta_entry->delta_list != &delta_entry->temp_delta_list), + "delta index is mutable"); + if (result != UDS_SUCCESS) + result = UDS_BAD_STATE; + + return result; +} + +int uds_set_delta_entry_value(const struct delta_index_entry *delta_entry, u32 value) +{ + int result; + u32 value_mask = (1 << delta_entry->value_bits) - 1; + + result = assert_mutable_entry(delta_entry); + if (result != UDS_SUCCESS) + return result; + + result = assert_not_at_end(delta_entry); + if (result != UDS_SUCCESS) + return result; + + result = ASSERT((value & value_mask) == value, + "Value (%u) being set in a delta index is too large (must fit in %u bits)", + value, + delta_entry->value_bits); + if (result != UDS_SUCCESS) + return UDS_INVALID_ARGUMENT; + + set_field(value, + delta_entry->delta_zone->memory, + get_delta_entry_offset(delta_entry), + delta_entry->value_bits); + return UDS_SUCCESS; +} + +/* + * Extend the memory used by the delta lists by adding growing_size bytes before the list indicated + * by growing_index, then rebalancing the lists in the new chunk. + */ +static int +extend_delta_zone(struct delta_zone *delta_zone, u32 growing_index, size_t growing_size) +{ + ktime_t start_time; + ktime_t end_time; + struct delta_list *delta_lists; + u32 i; + size_t used_space; + + + /* Calculate the amount of space that is or will be in use. */ + start_time = current_time_ns(CLOCK_MONOTONIC); + delta_lists = delta_zone->delta_lists; + used_space = growing_size; + for (i = 0; i <= delta_zone->list_count + 1; i++) + used_space += get_delta_list_byte_size(&delta_lists[i]); + + if (delta_zone->size < used_space) + return UDS_OVERFLOW; + + /* Compute the new offsets of the delta lists. */ + compute_new_list_offsets(delta_zone, growing_index, growing_size, used_space); + + /* + * When we rebalance the delta list, we will include the end guard list in the rebalancing. + * It contains the end guard data, which must be copied. + */ + rebalance_delta_zone(delta_zone, 1, delta_zone->list_count + 1); + end_time = current_time_ns(CLOCK_MONOTONIC); + delta_zone->rebalance_count++; + delta_zone->rebalance_time += ktime_sub(end_time, start_time); + return UDS_SUCCESS; +} + +static int insert_bits(struct delta_index_entry *delta_entry, u16 size) +{ + u64 free_before; + u64 free_after; + u64 source; + u64 destination; + u32 count; + bool before_flag; + u8 *memory; + struct delta_zone *delta_zone = delta_entry->delta_zone; + struct delta_list *delta_list = delta_entry->delta_list; + /* Compute bits in use before and after the inserted bits. */ + u32 total_size = delta_list->size; + u32 before_size = delta_entry->offset; + u32 after_size = total_size - delta_entry->offset; + + if (total_size + size > U16_MAX) { + delta_entry->list_overflow = true; + delta_zone->overflow_count++; + return UDS_OVERFLOW; + } + + /* Compute bits available before and after the delta list. */ + free_before = (delta_list[0].start - (delta_list[-1].start + delta_list[-1].size)); + free_after = (delta_list[1].start - (delta_list[0].start + delta_list[0].size)); + + if ((size <= free_before) && (size <= free_after)) { + /* + * We have enough space to use either before or after the list. Select the smaller + * amount of data. If it is exactly the same, try to take from the larger amount of + * free space. 
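+ * (before_size and after_size count the existing list bits that would have to move;
+ * free_before and free_after are the gaps to the neighboring lists.)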
+ */ + if (before_size < after_size) + before_flag = true; + else if (after_size < before_size) + before_flag = false; + else + before_flag = free_before > free_after; + } else if (size <= free_before) { + /* There is space before but not after. */ + before_flag = true; + } else if (size <= free_after) { + /* There is space after but not before. */ + before_flag = false; + } else { + /* + * Neither of the surrounding spaces is large enough for this request. Extend + * and/or rebalance the delta list memory choosing to move the least amount of + * data. + */ + int result; + u32 growing_index = delta_entry->list_number + 1; + + before_flag = before_size < after_size; + if (!before_flag) + growing_index++; + result = extend_delta_zone(delta_zone, growing_index, BITS_TO_BYTES(size)); + if (result != UDS_SUCCESS) + return result; + } + + delta_list->size += size; + if (before_flag) { + source = delta_list->start; + destination = source - size; + delta_list->start -= size; + count = before_size; + } else { + source = delta_list->start + delta_entry->offset; + destination = source + size; + count = after_size; + } + + memory = delta_zone->memory; + move_bits(memory, source, memory, destination, count); + return UDS_SUCCESS; +} + +static void encode_delta(const struct delta_index_entry *delta_entry) +{ + u32 temp; + u32 t1; + u32 t2; + u64 offset; + const struct delta_zone *delta_zone = delta_entry->delta_zone; + u8 *memory = delta_zone->memory; + + offset = get_delta_entry_offset(delta_entry) + delta_entry->value_bits; + if (delta_entry->delta < delta_zone->min_keys) { + set_field(delta_entry->delta, memory, offset, delta_zone->min_bits); + return; + } + + temp = delta_entry->delta - delta_zone->min_keys; + t1 = (temp % delta_zone->incr_keys) + delta_zone->min_keys; + t2 = temp / delta_zone->incr_keys; + set_field(t1, memory, offset, delta_zone->min_bits); + set_zero(memory, offset + delta_zone->min_bits, t2); + set_field(1, memory, offset + delta_zone->min_bits + t2, 1); +} + +static void encode_entry(const struct delta_index_entry *delta_entry, u32 value, const u8 *name) +{ + u8 *memory = delta_entry->delta_zone->memory; + u64 offset = get_delta_entry_offset(delta_entry); + + set_field(value, memory, offset, delta_entry->value_bits); + encode_delta(delta_entry); + if (name != NULL) + set_collision_name(delta_entry, name); +} + +/* + * Create a new entry in the delta index. If the entry is a collision, the full 256 bit name must + * be provided. + */ +int uds_put_delta_index_entry(struct delta_index_entry *delta_entry, + u32 key, + u32 value, + const u8 *name) +{ + int result; + struct delta_zone *delta_zone; + + result = assert_mutable_entry(delta_entry); + if (result != UDS_SUCCESS) + return result; + + if (delta_entry->is_collision) + /* + * The caller wants us to insert a collision entry onto a collision entry. This + * happens when we find a collision and attempt to add the name again to the index. + * This is normally a fatal error unless we are replaying a closed chapter while we + * are rebuilding a volume index. + */ + return UDS_DUPLICATE_NAME; + + if (delta_entry->offset < delta_entry->delta_list->save_offset) { + /* + * The saved entry offset is after the new entry and will no longer be valid, so + * replace it with the insertion point. + */ + result = uds_remember_delta_index_offset(delta_entry); + if (result != UDS_SUCCESS) + return result; + } + + if (name != NULL) { + /* Insert a collision entry which is placed after this entry. 
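+ * A collision entry encodes a delta of zero plus the full record name, so it always sits
+ * immediately after the entry whose key it shares.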
*/ + result = assert_not_at_end(delta_entry); + if (result != UDS_SUCCESS) + return result; + + result = ASSERT((key == delta_entry->key), "incorrect key for collision entry"); + if (result != UDS_SUCCESS) + return result; + + delta_entry->offset += delta_entry->entry_bits; + set_delta(delta_entry, 0); + delta_entry->is_collision = true; + delta_entry->entry_bits += COLLISION_BITS; + result = insert_bits(delta_entry, delta_entry->entry_bits); + } else if (delta_entry->at_end) { + /* Insert a new entry at the end of the delta list. */ + result = ASSERT((key >= delta_entry->key), "key past end of list"); + if (result != UDS_SUCCESS) + return result; + + set_delta(delta_entry, key - delta_entry->key); + delta_entry->key = key; + delta_entry->at_end = false; + result = insert_bits(delta_entry, delta_entry->entry_bits); + } else { + u16 old_entry_size; + u16 additional_size; + struct delta_index_entry next_entry; + u32 next_value; + + /* + * Insert a new entry which requires the delta in the following entry to be + * updated. + */ + result = ASSERT((key < delta_entry->key), "key precedes following entry"); + if (result != UDS_SUCCESS) + return result; + + result = ASSERT((key >= delta_entry->key - delta_entry->delta), + "key effects following entry's delta"); + if (result != UDS_SUCCESS) + return result; + + old_entry_size = delta_entry->entry_bits; + next_entry = *delta_entry; + next_value = uds_get_delta_entry_value(&next_entry); + set_delta(delta_entry, key - (delta_entry->key - delta_entry->delta)); + delta_entry->key = key; + set_delta(&next_entry, next_entry.key - key); + next_entry.offset += delta_entry->entry_bits; + /* The two new entries are always bigger than the single entry being replaced. */ + additional_size = (delta_entry->entry_bits + + next_entry.entry_bits - old_entry_size); + result = insert_bits(delta_entry, additional_size); + if (result != UDS_SUCCESS) + return result; + + encode_entry(&next_entry, next_value, NULL); + } + + if (result != UDS_SUCCESS) + return result; + + encode_entry(delta_entry, value, name); + delta_zone = delta_entry->delta_zone; + delta_zone->record_count++; + delta_zone->collision_count += delta_entry->is_collision ? 1 : 0; + return UDS_SUCCESS; +} + +static void delete_bits(const struct delta_index_entry *delta_entry, int size) +{ + u64 source; + u64 destination; + u32 count; + bool before_flag; + struct delta_list *delta_list = delta_entry->delta_list; + u8 *memory = delta_entry->delta_zone->memory; + /* Compute bits retained before and after the deleted bits. */ + u32 total_size = delta_list->size; + u32 before_size = delta_entry->offset; + u32 after_size = total_size - delta_entry->offset - size; + + /* + * Determine whether to add to the available space either before or after the delta list. + * We prefer to move the least amount of data. If it is exactly the same, try to add to the + * smaller amount of free space. 
+ */ + if (before_size < after_size) { + before_flag = true; + } else if (after_size < before_size) { + before_flag = false; + } else { + u64 free_before = + (delta_list[0].start - (delta_list[-1].start + delta_list[-1].size)); + u64 free_after = + (delta_list[1].start - (delta_list[0].start + delta_list[0].size)); + + before_flag = (free_before < free_after); + } + + delta_list->size -= size; + if (before_flag) { + source = delta_list->start; + destination = source + size; + delta_list->start += size; + count = before_size; + } else { + destination = delta_list->start + delta_entry->offset; + source = destination + size; + count = after_size; + } + + move_bits(memory, source, memory, destination, count); +} + +int uds_remove_delta_index_entry(struct delta_index_entry *delta_entry) +{ + int result; + struct delta_index_entry next_entry; + struct delta_zone *delta_zone; + struct delta_list *delta_list; + + result = assert_mutable_entry(delta_entry); + if (result != UDS_SUCCESS) + return result; + + next_entry = *delta_entry; + result = uds_next_delta_index_entry(&next_entry); + if (result != UDS_SUCCESS) + return result; + + delta_zone = delta_entry->delta_zone; + + if (delta_entry->is_collision) { + /* This is a collision entry, so just remove it. */ + delete_bits(delta_entry, delta_entry->entry_bits); + next_entry.offset = delta_entry->offset; + delta_zone->collision_count -= 1; + } else if (next_entry.at_end) { + /* This entry is at the end of the list, so just remove it. */ + delete_bits(delta_entry, delta_entry->entry_bits); + next_entry.key -= delta_entry->delta; + next_entry.offset = delta_entry->offset; + } else { + /* The delta in the next entry needs to be updated. */ + u32 next_value = uds_get_delta_entry_value(&next_entry); + u16 old_size = delta_entry->entry_bits + next_entry.entry_bits; + + if (next_entry.is_collision) { + next_entry.is_collision = false; + delta_zone->collision_count -= 1; + } + + set_delta(&next_entry, delta_entry->delta + next_entry.delta); + next_entry.offset = delta_entry->offset; + /* The one new entry is always smaller than the two entries being replaced. */ + delete_bits(delta_entry, old_size - next_entry.entry_bits); + encode_entry(&next_entry, next_value, NULL); + } + + delta_zone->record_count--; + delta_zone->discard_count++; + *delta_entry = next_entry; + + delta_list = delta_entry->delta_list; + if (delta_entry->offset < delta_list->save_offset) { + /* The saved entry offset is no longer valid. 
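+ * Clear it so that the next search of this list starts over from the beginning.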
*/ + delta_list->save_key = 0; + delta_list->save_offset = 0; + } + + return UDS_SUCCESS; +} + +void uds_get_delta_index_stats(const struct delta_index *delta_index, + struct delta_index_stats *stats) +{ + unsigned int z; + const struct delta_zone *delta_zone; + + memset(stats, 0, sizeof(struct delta_index_stats)); + for (z = 0; z < delta_index->zone_count; z++) { + delta_zone = &delta_index->delta_zones[z]; + stats->rebalance_time += delta_zone->rebalance_time; + stats->rebalance_count += delta_zone->rebalance_count; + stats->record_count += delta_zone->record_count; + stats->collision_count += delta_zone->collision_count; + stats->discard_count += delta_zone->discard_count; + stats->overflow_count += delta_zone->overflow_count; + stats->list_count += delta_zone->list_count; + } +} + +size_t uds_compute_delta_index_size(u32 entry_count, u32 mean_delta, u32 payload_bits) +{ + u16 min_bits; + u32 incr_keys; + u32 min_keys; + + compute_coding_constants(mean_delta, &min_bits, &min_keys, &incr_keys); + /* On average, each delta is encoded into about min_bits + 1.5 bits. */ + return entry_count * (payload_bits + min_bits + 1) + entry_count / 2; +} + +u32 uds_get_delta_index_page_count(u32 entry_count, + u32 list_count, + u32 mean_delta, + u32 payload_bits, + size_t bytes_per_page) +{ + unsigned int bits_per_delta_list; + unsigned int bits_per_page; + size_t bits_per_index; + + /* Compute the expected number of bits needed for all the entries. */ + bits_per_index = uds_compute_delta_index_size(entry_count, mean_delta, payload_bits); + bits_per_delta_list = bits_per_index / list_count; + + /* Add in the immutable delta list headers. */ + bits_per_index += list_count * IMMUTABLE_HEADER_SIZE; + /* Compute the number of usable bits on an immutable index page. */ + bits_per_page = ((bytes_per_page - sizeof(struct delta_page_header)) * BITS_PER_BYTE); + /* + * Reduce the bits per page by one immutable delta list header and one delta list to + * account for internal fragmentation. + */ + bits_per_page -= IMMUTABLE_HEADER_SIZE + bits_per_delta_list; + /* Now compute the number of pages needed. */ + return DIV_ROUND_UP(bits_per_index, bits_per_page); +} + +void uds_log_delta_index_entry(struct delta_index_entry *delta_entry) +{ + uds_log_ratelimit(uds_log_info, + "List 0x%X Key 0x%X Offset 0x%X%s%s List_size 0x%X%s", + delta_entry->list_number, + delta_entry->key, + delta_entry->offset, + delta_entry->at_end ? " end" : "", + delta_entry->is_collision ? " collision" : "", + delta_entry->delta_list->size, + delta_entry->list_overflow ? " overflow" : ""); + delta_entry->list_overflow = false; +} diff --git a/drivers/md/dm-vdo/delta-index.h b/drivers/md/dm-vdo/delta-index.h new file mode 100644 index 000000000000..0b49b87a4c76 --- /dev/null +++ b/drivers/md/dm-vdo/delta-index.h @@ -0,0 +1,292 @@ +/* SPDX-License-Identifier: GPL-2.0-only */ +/* + * Copyright 2023 Red Hat + */ + +#ifndef UDS_DELTA_INDEX_H +#define UDS_DELTA_INDEX_H + +#include + +#include "config.h" +#include "io-factory.h" +#include "numeric.h" +#include "time-utils.h" + +/* + * A delta index is a key-value store, where each entry maps an address (the key) to a payload (the + * value). The entries are sorted by address, and only the delta between successive addresses is + * stored in the entry. The addresses are assumed to be uniformly distributed, and the deltas are + * therefore exponentially distributed. + * + * A delta_index can either be mutable or immutable depending on its expected use. 
The immutable + * form of a delta index is used for the indexes of closed chapters committed to the volume. The + * mutable form of a delta index is used by the volume index, and also by the chapter index in an + * open chapter. Like the index as a whole, each mutable delta index is divided into a number of + * independent zones. + */ + +struct delta_list { + /* The offset of the delta list start, in bits */ + u64 start; + /* The number of bits in the delta list */ + u16 size; + /* Where the last search "found" the key, in bits */ + u16 save_offset; + /* The key for the record just before save_offset */ + u32 save_key; +}; + +struct delta_zone { + /* The delta list memory */ + u8 *memory; + /* The delta list headers */ + struct delta_list *delta_lists; + /* Temporary starts of delta lists */ + u64 *new_offsets; + /* Buffered writer for saving an index */ + struct buffered_writer *buffered_writer; + /* The size of delta list memory */ + size_t size; + /* Nanoseconds spent rebalancing */ + ktime_t rebalance_time; + /* Number of memory rebalances */ + u32 rebalance_count; + /* The number of bits in a stored value */ + u8 value_bits; + /* The number of bits in the minimal key code */ + u16 min_bits; + /* The number of keys used in a minimal code */ + u32 min_keys; + /* The number of keys used for another code bit */ + u32 incr_keys; + /* The number of records in the index */ + u64 record_count; + /* The number of collision records */ + u64 collision_count; + /* The number of records removed */ + u64 discard_count; + /* The number of UDS_OVERFLOW errors detected */ + u64 overflow_count; + /* The index of the first delta list */ + u32 first_list; + /* The number of delta lists */ + u32 list_count; + /* Tag belonging to this delta index */ + u8 tag; +} __aligned(L1_CACHE_BYTES); + +struct delta_list_save_info { + /* Tag identifying which delta index this list is in */ + u8 tag; + /* Bit offset of the start of the list data */ + u8 bit_offset; + /* Number of bytes of list data */ + u16 byte_count; + /* The delta list number within the delta index */ + u32 index; +} __packed; + +struct delta_index { + /* The zones */ + struct delta_zone *delta_zones; + /* The number of zones */ + unsigned int zone_count; + /* The number of delta lists */ + u32 list_count; + /* Maximum lists per zone */ + u32 lists_per_zone; + /* Total memory allocated to this index */ + size_t memory_size; + /* The number of non-empty lists at load time per zone */ + u32 load_lists[MAX_ZONES]; + /* True if this index is mutable */ + bool mutable; + /* Tag belonging to this delta index */ + u8 tag; +}; + +/* + * A delta_index_page describes a single page of a chapter index. The delta_index field allows the + * page to be treated as an immutable delta_index. We use the delta_zone field to treat the chapter + * index page as a single zone index, and without the need to do an additional memory allocation. + */ +struct delta_index_page { + struct delta_index delta_index; + /* These values are loaded from the DeltaPageHeader */ + u32 lowest_list_number; + u32 highest_list_number; + u64 virtual_chapter_number; + /* This structure describes the single zone of a delta index page. */ + struct delta_zone delta_zone; +}; + +/* + * Notes on the delta_index_entries: + * + * The fields documented as "public" can be read by any code that uses a delta_index. The fields + * documented as "private" carry information between delta_index method calls and should not be + * used outside the delta_index module. 
+ * + * (1) The delta_index_entry is used like an iterator when searching a delta list. + * + * (2) It is also the result of a successful search and can be used to refer to the element found + * by the search. + * + * (3) It is also the result of an unsuccessful search and can be used to refer to the insertion + * point for a new record. + * + * (4) If at_end is true, the delta_list entry can only be used as the insertion point for a new + * record at the end of the list. + * + * (5) If at_end is false and is_collision is true, the delta_list entry fields refer to a + * collision entry in the list, and the delta_list entry can be used a a reference to this + * entry. + * + * (6) If at_end is false and is_collision is false, the delta_list entry fields refer to a + * non-collision entry in the list. Such delta_list entries can be used as a reference to a + * found entry, or an insertion point for a non-collision entry before this entry, or an + * insertion point for a collision entry that collides with this entry. + */ +struct delta_index_entry { + /* Public fields */ + /* The key for this entry */ + u32 key; + /* We are after the last list entry */ + bool at_end; + /* This record is a collision */ + bool is_collision; + + /* Private fields */ + /* This delta list overflowed */ + bool list_overflow; + /* The number of bits used for the value */ + u8 value_bits; + /* The number of bits used for the entire entry */ + u16 entry_bits; + /* The delta index zone */ + struct delta_zone *delta_zone; + /* The delta list containing the entry */ + struct delta_list *delta_list; + /* The delta list number */ + u32 list_number; + /* Bit offset of this entry within the list */ + u16 offset; + /* The delta between this and previous entry */ + u32 delta; + /* Temporary delta list for immutable indices */ + struct delta_list temp_delta_list; +}; + +struct delta_index_stats { + /* Number of bytes allocated */ + size_t memory_allocated; + /* Nanoseconds spent rebalancing */ + ktime_t rebalance_time; + /* Number of memory rebalances */ + u32 rebalance_count; + /* The number of records in the index */ + u64 record_count; + /* The number of collision records */ + u64 collision_count; + /* The number of records removed */ + u64 discard_count; + /* The number of UDS_OVERFLOW errors detected */ + u64 overflow_count; + /* The number of delta lists */ + u32 list_count; +}; + +int __must_check uds_initialize_delta_index(struct delta_index *delta_index, + unsigned int zone_count, + u32 list_count, + u32 mean_delta, + u32 payload_bits, + size_t memory_size, + u8 tag); + +int __must_check uds_initialize_delta_index_page(struct delta_index_page *delta_index_page, + u64 expected_nonce, + u32 mean_delta, + u32 payload_bits, + u8 *memory, + size_t memory_size); + +void uds_uninitialize_delta_index(struct delta_index *delta_index); + +void uds_reset_delta_index(const struct delta_index *delta_index); + +int __must_check uds_pack_delta_index_page(const struct delta_index *delta_index, + u64 header_nonce, + u8 *memory, + size_t memory_size, + u64 virtual_chapter_number, + u32 first_list, + u32 *list_count); + +int __must_check uds_start_restoring_delta_index(struct delta_index *delta_index, + struct buffered_reader **buffered_readers, + unsigned int reader_count); + +int __must_check uds_finish_restoring_delta_index(struct delta_index *delta_index, + struct buffered_reader **buffered_readers, + unsigned int reader_count); + +int __must_check +uds_check_guard_delta_lists(struct buffered_reader **buffered_readers, unsigned int 
reader_count); + +int __must_check uds_start_saving_delta_index(const struct delta_index *delta_index, + unsigned int zone_number, + struct buffered_writer *buffered_writer); + +int __must_check +uds_finish_saving_delta_index(const struct delta_index *delta_index, unsigned int zone_number); + +int __must_check uds_write_guard_delta_list(struct buffered_writer *buffered_writer); + +size_t __must_check uds_compute_delta_index_save_bytes(u32 list_count, size_t memory_size); + +int __must_check uds_start_delta_index_search(const struct delta_index *delta_index, + u32 list_number, + u32 key, + struct delta_index_entry *iterator); + +int __must_check uds_next_delta_index_entry(struct delta_index_entry *delta_entry); + +int __must_check uds_remember_delta_index_offset(const struct delta_index_entry *delta_entry); + +int __must_check uds_get_delta_index_entry(const struct delta_index *delta_index, + u32 list_number, + u32 key, + const u8 *name, + struct delta_index_entry *delta_entry); + +int __must_check +uds_get_delta_entry_collision(const struct delta_index_entry *delta_entry, u8 *name); + +u32 __must_check uds_get_delta_entry_value(const struct delta_index_entry *delta_entry); + +int __must_check uds_set_delta_entry_value(const struct delta_index_entry *delta_entry, u32 value); + +int __must_check uds_put_delta_index_entry(struct delta_index_entry *delta_entry, + u32 key, + u32 value, + const u8 *name); + +int __must_check uds_remove_delta_index_entry(struct delta_index_entry *delta_entry); + +void uds_get_delta_index_stats(const struct delta_index *delta_index, + struct delta_index_stats *stats); + +size_t __must_check +uds_compute_delta_index_size(u32 entry_count, u32 mean_delta, u32 payload_bits); + +u32 uds_get_delta_index_page_count(u32 entry_count, + u32 list_count, + u32 mean_delta, + u32 payload_bits, + size_t bytes_per_page); + +void uds_log_delta_index_entry(struct delta_index_entry *delta_entry); + +#endif /* UDS_DELTA_INDEX_H */ diff --git a/drivers/md/dm-vdo/hash-utils.h b/drivers/md/dm-vdo/hash-utils.h new file mode 100644 index 000000000000..f371c7fd07bf --- /dev/null +++ b/drivers/md/dm-vdo/hash-utils.h @@ -0,0 +1,66 @@ +/* SPDX-License-Identifier: GPL-2.0-only */ +/* + * Copyright 2023 Red Hat + */ + +#ifndef UDS_HASH_UTILS_H +#define UDS_HASH_UTILS_H + +#include "geometry.h" +#include "numeric.h" +#include "uds.h" + +/* Utilities for extracting portions of a request name for various uses. */ + +/* How various portions of a record name are apportioned. */ +enum { + VOLUME_INDEX_BYTES_OFFSET = 0, + VOLUME_INDEX_BYTES_COUNT = 8, + CHAPTER_INDEX_BYTES_OFFSET = 8, + CHAPTER_INDEX_BYTES_COUNT = 6, + SAMPLE_BYTES_OFFSET = 14, + SAMPLE_BYTES_COUNT = 2, +}; + +static inline u64 uds_extract_chapter_index_bytes(const struct uds_record_name *name) +{ + const u8 *chapter_bits = &name->name[CHAPTER_INDEX_BYTES_OFFSET]; + u64 bytes = (u64) get_unaligned_be16(chapter_bits) << 32; + + bytes |= get_unaligned_be32(chapter_bits + 2); + return bytes; +} + +static inline u64 uds_extract_volume_index_bytes(const struct uds_record_name *name) +{ + return get_unaligned_be64(&name->name[VOLUME_INDEX_BYTES_OFFSET]); +} + +static inline u32 uds_extract_sampling_bytes(const struct uds_record_name *name) +{ + return get_unaligned_be16(&name->name[SAMPLE_BYTES_OFFSET]); +} + +/* Compute the chapter delta list for a given name. 
*/ +static inline u32 +uds_hash_to_chapter_delta_list(const struct uds_record_name *name, const struct geometry *geometry) +{ + return ((uds_extract_chapter_index_bytes(name) >> geometry->chapter_address_bits) & + ((1 << geometry->chapter_delta_list_bits) - 1)); +} + +/* Compute the chapter delta address for a given name. */ +static inline u32 +uds_hash_to_chapter_delta_address(const struct uds_record_name *name, + const struct geometry *geometry) +{ + return uds_extract_chapter_index_bytes(name) & ((1 << geometry->chapter_address_bits) - 1); +} + +static inline unsigned int +uds_name_to_hash_slot(const struct uds_record_name *name, unsigned int slot_count) +{ + return (unsigned int) (uds_extract_chapter_index_bytes(name) % slot_count); +} + +#endif /* UDS_HASH_UTILS_H */ From patchwork Thu Oct 26 21:41:09 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Matthew Sakai X-Patchwork-Id: 13437933 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 0473DC25B72 for ; Thu, 26 Oct 2023 21:43:17 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1344764AbjJZVnQ (ORCPT ); Thu, 26 Oct 2023 17:43:16 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:40966 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S232235AbjJZVnP (ORCPT ); Thu, 26 Oct 2023 17:43:15 -0400 Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.129.124]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id EBD4D1AD for ; Thu, 26 Oct 2023 14:41:41 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1698356501; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=HBaEgq/LYZ4zTrjjb/aTnzfT2sK+0iLUYph+Q6Fry+0=; b=e3mpgIYvTAMhq8bMbhr4nrri79H4K/ZDfLyFxGVOUoXuPm6r7p72A4+zjk+iEgSg01JmK+ gh8xTxd0KSm7lTpGIGb3LgjhAMOam/WQKV+JKNJtqL8yLBHZZuXkYWoibx+2QuocIqcsUp 8d64ZUxITWrNIxUJaUc+KaTGpYHAqj0= Received: from mimecast-mx02.redhat.com (mimecast-mx02.redhat.com [66.187.233.88]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.3, cipher=TLS_AES_256_GCM_SHA384) id us-mta-617-6F94N9rgNOCnKF2uOkwWxQ-1; Thu, 26 Oct 2023 17:41:39 -0400 X-MC-Unique: 6F94N9rgNOCnKF2uOkwWxQ-1 Received: from smtp.corp.redhat.com (int-mx05.intmail.prod.int.rdu2.redhat.com [10.11.54.5]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (2048 bits) server-digest SHA256) (No client certificate requested) by mimecast-mx02.redhat.com (Postfix) with ESMTPS id 2F54D185A784; Thu, 26 Oct 2023 21:41:39 +0000 (UTC) Received: from pbitcolo-build-10.permabit.com (pbitcolo-build-10.permabit.lab.eng.bos.redhat.com [10.19.117.76]) by smtp.corp.redhat.com (Postfix) with ESMTP id 1BE1E10751; Thu, 26 Oct 2023 21:41:39 +0000 (UTC) Received: by pbitcolo-build-10.permabit.com (Postfix, from userid 1138) id D09FC3003C; Thu, 26 Oct 2023 17:41:38 -0400 (EDT) From: Matthew Sakai To: dm-devel@lists.linux.dev, linux-block@vger.kernel.org Cc: Matthew Sakai Subject: [PATCH v4 12/39] dm vdo: implement the volume index Date: Thu, 26 Oct 2023 17:41:09 -0400 
Message-Id: <20231026214136.1067410-13-msakai@redhat.com> In-Reply-To: <20231026214136.1067410-1-msakai@redhat.com> References: <20231026214136.1067410-1-msakai@redhat.com> MIME-Version: 1.0 X-Scanned-By: MIMEDefang 3.4.1 on 10.11.54.5 Precedence: bulk List-ID: X-Mailing-List: linux-block@vger.kernel.org The volume index is a large delta index that maps each record name to the chapter which contains the newest record for that name. The volume index can contain several million records and is stored entirely in memory while the index is operating, accounting for the majority of the deduplication index's memory budget. The volume index is composed of two subindexes in order to handle sparse hook names separately from regular names. If sparse indexing is not enabled, the sparse hook portion of the volume index is not used or instantiated. Co-developed-by: J. corwin Coburn Signed-off-by: J. corwin Coburn Co-developed-by: Michael Sclafani Signed-off-by: Michael Sclafani Co-developed-by: Thomas Jaskiewicz Signed-off-by: Thomas Jaskiewicz Signed-off-by: Matthew Sakai --- drivers/md/dm-vdo/volume-index.c | 1277 ++++++++++++++++++++++++++++++ drivers/md/dm-vdo/volume-index.h | 192 +++++ 2 files changed, 1469 insertions(+) create mode 100644 drivers/md/dm-vdo/volume-index.c create mode 100644 drivers/md/dm-vdo/volume-index.h diff --git a/drivers/md/dm-vdo/volume-index.c b/drivers/md/dm-vdo/volume-index.c new file mode 100644 index 000000000000..0c91a2765449 --- /dev/null +++ b/drivers/md/dm-vdo/volume-index.c @@ -0,0 +1,1277 @@ +// SPDX-License-Identifier: GPL-2.0-only +/* + * Copyright 2023 Red Hat + */ +#include "volume-index.h" + +#include +#include +#include +#include +#include + +#include "config.h" +#include "errors.h" +#include "geometry.h" +#include "hash-utils.h" +#include "logger.h" +#include "memory-alloc.h" +#include "numeric.h" +#include "permassert.h" +#include "uds.h" +#include "uds-threads.h" + +/* + * The volume index is a combination of two separate subindexes, one containing sparse hook entries + * (retained for all chapters), and one containing the remaining entries (retained only for the + * dense chapters). If there are no sparse chapters, only the non-hook sub index is used, and it + * will contain all records for all chapters. + * + * The volume index is also divided into zones, with one thread operating on each zone. Each + * incoming request is dispatched to the appropriate thread, and then to the appropriate subindex. + * Each delta list is handled by a single zone. To ensure that the distribution of delta lists to + * zones doesn't underflow (leaving some zone with no delta lists), the minimum number of delta + * lists must be the square of the maximum zone count for both subindexes. + * + * Each subindex zone is a delta index where the payload is a chapter number. The volume index can + * compute the delta list number, address, and zone number from the record name in order to + * dispatch record handling to the correct structures. + * + * Most operations that use all the zones take place either before request processing is allowed, + * or after all requests have been flushed in order to shut down. The only multi-threaded operation + * supported during normal operation is the uds_lookup_volume_index_name() method, used to determine + * whether a new chapter should be loaded into the sparse index cache. This operation only uses the + * sparse hook subindex, and the zone mutexes are used to make this operation safe. 
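+ *
+ * As an editor's illustrative sketch (the sizes are invented, not taken from
+ * a real configuration): with 65536 delta lists spread over 4 zones,
+ * lists_per_zone is 16384, so a record name that hashes to delta list 40000
+ * is always dispatched to zone 2 (40000 / 16384), and only that zone's
+ * thread ever touches that list.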
+ * + * There are three ways of expressing chapter numbers in the volume index: virtual, index, and + * rolling. The interface to the volume index uses virtual chapter numbers, which are 64 bits long. + * Internally the subindex stores only the minimal number of bits necessary by masking away the + * high-order bits. When the index needs to deal with ordering of index chapter numbers, as when + * flushing entries from older chapters, it rolls the index chapter number around so that the + * smallest one in use is mapped to 0. See convert_index_to_virtual() or flush_invalid_entries() + * for an example of this technique. + * + * For efficiency, when older chapter numbers become invalid, the index does not immediately remove + * the invalidated entries. Instead it lazily removes them from a given delta list the next time it + * walks that list during normal operation. Because of this, the index size must be increased + * somewhat to accommodate all the invalid entries that have not yet been removed. For the standard + * index sizes, this requires about 4 chapters of old entries per 1024 chapters of valid entries in + * the index. + */ + +struct sub_index_parameters { + /* The number of bits in address mask */ + u8 address_bits; + /* The number of bits in chapter number */ + u8 chapter_bits; + /* The mean delta */ + u32 mean_delta; + /* The number of delta lists */ + u64 list_count; + /* The number of chapters used */ + u32 chapter_count; + /* The number of bits per chapter */ + size_t chapter_size_in_bits; + /* The number of bytes of delta list memory */ + size_t memory_size; + /* The number of bytes the index should keep free at all times */ + size_t target_free_bytes; +}; + +struct split_config { + /* The hook subindex configuration */ + struct configuration hook_config; + struct geometry hook_geometry; + + /* The non-hook subindex configuration */ + struct configuration non_hook_config; + struct geometry non_hook_geometry; +}; + +struct chapter_range { + u32 chapter_start; + u32 chapter_count; +}; + +enum { MAGIC_SIZE = 8 }; +static const char MAGIC_START_5[] = "MI5-0005"; + +struct sub_index_data { + char magic[MAGIC_SIZE]; /* MAGIC_START_5 */ + u64 volume_nonce; + u64 virtual_chapter_low; + u64 virtual_chapter_high; + u32 first_list; + u32 list_count; +}; + +static const char MAGIC_START_6[] = "MI6-0001"; + +struct volume_index_data { + char magic[MAGIC_SIZE]; /* MAGIC_START_6 */ + u32 sparse_sample_rate; +}; + +static inline u32 +extract_address(const struct volume_sub_index *sub_index, const struct uds_record_name *name) +{ + return uds_extract_volume_index_bytes(name) & sub_index->address_mask; +} + +static inline u32 +extract_dlist_num(const struct volume_sub_index *sub_index, const struct uds_record_name *name) +{ + u64 bits = uds_extract_volume_index_bytes(name); + + return (bits >> sub_index->address_bits) % sub_index->list_count; +} + +static inline const struct volume_sub_index_zone * +get_zone_for_record(const struct volume_index_record *record) +{ + return &record->sub_index->zones[record->zone_number]; +} + +static inline u64 +convert_index_to_virtual(const struct volume_index_record *record, u32 index_chapter) +{ + const struct volume_sub_index_zone *volume_index_zone = get_zone_for_record(record); + u32 rolling_chapter = ((index_chapter - volume_index_zone->virtual_chapter_low) & + record->sub_index->chapter_mask); + + return volume_index_zone->virtual_chapter_low + rolling_chapter; +} + +static inline u32 +convert_virtual_to_index(const struct volume_sub_index *sub_index, 
u64 virtual_chapter) +{ + return virtual_chapter & sub_index->chapter_mask; +} + +static inline bool +is_virtual_chapter_indexed(const struct volume_index_record *record, u64 virtual_chapter) +{ + const struct volume_sub_index_zone *volume_index_zone = get_zone_for_record(record); + + return ((virtual_chapter >= volume_index_zone->virtual_chapter_low) && + (virtual_chapter <= volume_index_zone->virtual_chapter_high)); +} + +static inline bool has_sparse(const struct volume_index *volume_index) +{ + return volume_index->sparse_sample_rate > 0; +} + +bool uds_is_volume_index_sample(const struct volume_index *volume_index, + const struct uds_record_name *name) +{ + if (!has_sparse(volume_index)) + return false; + + return (uds_extract_sampling_bytes(name) % volume_index->sparse_sample_rate) == 0; +} + +static inline const struct volume_sub_index * +get_volume_sub_index(const struct volume_index *volume_index, const struct uds_record_name *name) +{ + return (uds_is_volume_index_sample(volume_index, name) ? + &volume_index->vi_hook : + &volume_index->vi_non_hook); +} + +static unsigned int get_volume_sub_index_zone(const struct volume_sub_index *sub_index, + const struct uds_record_name *name) +{ + return extract_dlist_num(sub_index, name) / sub_index->delta_index.lists_per_zone; +} + +unsigned int uds_get_volume_index_zone(const struct volume_index *volume_index, + const struct uds_record_name *name) +{ + return get_volume_sub_index_zone(get_volume_sub_index(volume_index, name), name); +} + +static int compute_volume_sub_index_parameters(const struct configuration *config, + struct sub_index_parameters *params) +{ + enum { DELTA_LIST_SIZE = 256 }; + u64 entries_in_volume_index, address_span; + u32 chapters_in_volume_index, invalid_chapters; + u32 rounded_chapters; + u64 delta_list_records; + u32 address_count; + u64 index_size_in_bits; + size_t expected_index_size; + u64 min_delta_lists = MAX_ZONES * MAX_ZONES; + struct geometry *geometry = config->geometry; + u64 records_per_chapter = geometry->records_per_chapter; + + params->chapter_count = geometry->chapters_per_volume; + /* + * Make sure that the number of delta list records in the volume index does not change when + * the volume is reduced by one chapter. This preserves the mapping from name to volume + * index delta list. 
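+	 * (The delta list number for a record name is computed modulo the
+	 * list count, so keeping the list count the same for both the regular
+	 * and reduced geometries keeps the name-to-list mapping identical.)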
+ */ + rounded_chapters = params->chapter_count; + if (uds_is_reduced_geometry(geometry)) + rounded_chapters += 1; + delta_list_records = records_per_chapter * rounded_chapters; + address_count = config->volume_index_mean_delta * DELTA_LIST_SIZE; + params->list_count = max(delta_list_records / DELTA_LIST_SIZE, min_delta_lists); + params->address_bits = bits_per(address_count - 1); + params->chapter_bits = bits_per(rounded_chapters - 1); + if ((u32) params->list_count != params->list_count) + return uds_log_warning_strerror(UDS_INVALID_ARGUMENT, + "cannot initialize volume index with %llu delta lists", + (unsigned long long) params->list_count); + if (params->address_bits > 31) + return uds_log_warning_strerror(UDS_INVALID_ARGUMENT, + "cannot initialize volume index with %u address bits", + params->address_bits); + + /* + * The probability that a given delta list is not touched during the writing of an entire + * chapter is: + * + * double p_not_touched = pow((double) (params->list_count - 1) / params->list_count, + * records_per_chapter); + * + * For the standard index sizes, about 78% of the delta lists are not touched, and + * therefore contain old index entries that have not been eliminated by the lazy LRU + * processing. Then the number of old index entries that accumulate over the entire index, + * in terms of full chapters worth of entries, is: + * + * double invalid_chapters = p_not_touched / (1.0 - p_not_touched); + * + * For the standard index sizes, the index needs about 3.5 chapters of space for the old + * entries in a 1024 chapter index, so round this up to use 4 chapters per 1024 chapters in + * the index. + */ + invalid_chapters = max(rounded_chapters / 256, 2U); + chapters_in_volume_index = rounded_chapters + invalid_chapters; + entries_in_volume_index = records_per_chapter * chapters_in_volume_index; + + address_span = params->list_count << params->address_bits; + params->mean_delta = address_span / entries_in_volume_index; + + /* + * Compute the expected size of a full index, then set the total memory to be 6% larger + * than that expected size. This number should be large enough that there are not many + * rebalances when the index is full. 
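+	 *
+	 * As an editor's illustrative arithmetic (the figure is invented): an
+	 * expected full-index size of 1 GB yields a memory_size of about
+	 * 1.06 GB and a target_free_bytes of roughly 50 MB (one twentieth of
+	 * the expected size), as computed just below.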
+ */ + params->chapter_size_in_bits = uds_compute_delta_index_size(records_per_chapter, + params->mean_delta, + params->chapter_bits); + index_size_in_bits = params->chapter_size_in_bits * chapters_in_volume_index; + expected_index_size = index_size_in_bits / BITS_PER_BYTE; + params->memory_size = expected_index_size * 106 / 100; + + params->target_free_bytes = expected_index_size / 20; + return UDS_SUCCESS; +} + +static void uninitialize_volume_sub_index(struct volume_sub_index *sub_index) +{ + UDS_FREE(UDS_FORGET(sub_index->flush_chapters)); + UDS_FREE(UDS_FORGET(sub_index->zones)); + uds_uninitialize_delta_index(&sub_index->delta_index); +} + +void uds_free_volume_index(struct volume_index *volume_index) +{ + if (volume_index == NULL) + return; + + if (volume_index->zones != NULL) { + unsigned int zone; + + for (zone = 0; zone < volume_index->zone_count; zone++) + uds_destroy_mutex(&volume_index->zones[zone].hook_mutex); + UDS_FREE(UDS_FORGET(volume_index->zones)); + } + + uninitialize_volume_sub_index(&volume_index->vi_non_hook); + uninitialize_volume_sub_index(&volume_index->vi_hook); + UDS_FREE(volume_index); +} + + +static int +compute_volume_sub_index_save_bytes(const struct configuration *config, size_t *bytes) +{ + struct sub_index_parameters params = { .address_bits = 0 }; + int result; + + result = compute_volume_sub_index_parameters(config, ¶ms); + if (result != UDS_SUCCESS) + return result; + + *bytes = (sizeof(struct sub_index_data) + params.list_count * sizeof(u64) + + uds_compute_delta_index_save_bytes(params.list_count, params.memory_size)); + return UDS_SUCCESS; +} + +/* This function is only useful if the configuration includes sparse chapters. */ +static void split_configuration(const struct configuration *config, struct split_config *split) +{ + u64 sample_rate, sample_records; + u64 dense_chapters, sparse_chapters; + + /* Start with copies of the base configuration. */ + split->hook_config = *config; + split->hook_geometry = *config->geometry; + split->hook_config.geometry = &split->hook_geometry; + split->non_hook_config = *config; + split->non_hook_geometry = *config->geometry; + split->non_hook_config.geometry = &split->non_hook_geometry; + + sample_rate = config->sparse_sample_rate; + sparse_chapters = config->geometry->sparse_chapters_per_volume; + dense_chapters = config->geometry->chapters_per_volume - sparse_chapters; + sample_records = config->geometry->records_per_chapter / sample_rate; + + /* Adjust the number of records indexed for each chapter. */ + split->hook_geometry.records_per_chapter = sample_records; + split->non_hook_geometry.records_per_chapter -= sample_records; + + /* Adjust the number of chapters indexed. 
*/ + split->hook_geometry.sparse_chapters_per_volume = 0; + split->non_hook_geometry.sparse_chapters_per_volume = 0; + split->non_hook_geometry.chapters_per_volume = dense_chapters; +} + +static int compute_volume_index_save_bytes(const struct configuration *config, size_t *bytes) +{ + size_t hook_bytes, non_hook_bytes; + struct split_config split; + int result; + + if (!uds_is_sparse_geometry(config->geometry)) + return compute_volume_sub_index_save_bytes(config, bytes); + + split_configuration(config, &split); + result = compute_volume_sub_index_save_bytes(&split.hook_config, &hook_bytes); + if (result != UDS_SUCCESS) + return result; + result = compute_volume_sub_index_save_bytes(&split.non_hook_config, &non_hook_bytes); + if (result != UDS_SUCCESS) + return result; + + *bytes = sizeof(struct volume_index_data) + hook_bytes + non_hook_bytes; + return UDS_SUCCESS; +} + +int uds_compute_volume_index_save_blocks(const struct configuration *config, + size_t block_size, + u64 *block_count) +{ + size_t bytes; + int result; + + result = compute_volume_index_save_bytes(config, &bytes); + if (result != UDS_SUCCESS) + return result; + + bytes += sizeof(struct delta_list_save_info); + *block_count = DIV_ROUND_UP(bytes, block_size) + MAX_ZONES; + return UDS_SUCCESS; +} + +/* Flush invalid entries while walking the delta list. */ +static inline int flush_invalid_entries(struct volume_index_record *record, + struct chapter_range *flush_range, + u32 *next_chapter_to_invalidate) +{ + int result; + + result = uds_next_delta_index_entry(&record->delta_entry); + if (result != UDS_SUCCESS) + return result; + while (!record->delta_entry.at_end) { + u32 index_chapter = uds_get_delta_entry_value(&record->delta_entry); + u32 relative_chapter = ((index_chapter - flush_range->chapter_start) & + record->sub_index->chapter_mask); + + if (likely(relative_chapter >= flush_range->chapter_count)) { + if (relative_chapter < *next_chapter_to_invalidate) + *next_chapter_to_invalidate = relative_chapter; + break; + } + result = uds_remove_delta_index_entry(&record->delta_entry); + if (result != UDS_SUCCESS) + return result; + } + return UDS_SUCCESS; +} + +/* Find the matching record, or the list offset where the record would go. */ +static int get_volume_index_entry(struct volume_index_record *record, + u32 list_number, + u32 key, + struct chapter_range *flush_range) +{ + struct volume_index_record other_record; + const struct volume_sub_index *sub_index = record->sub_index; + u32 next_chapter_to_invalidate = sub_index->chapter_mask; + int result; + + result = uds_start_delta_index_search(&sub_index->delta_index, + list_number, + 0, + &record->delta_entry); + if (result != UDS_SUCCESS) + return result; + do { + result = flush_invalid_entries(record, flush_range, &next_chapter_to_invalidate); + if (result != UDS_SUCCESS) + return result; + } while (!record->delta_entry.at_end && (key > record->delta_entry.key)); + + result = uds_remember_delta_index_offset(&record->delta_entry); + if (result != UDS_SUCCESS) + return result; + + /* Check any collision records for a more precise match. 
*/ + other_record = *record; + if (!other_record.delta_entry.at_end && (key == other_record.delta_entry.key)) { + for (;;) { + u8 collision_name[UDS_RECORD_NAME_SIZE]; + + result = flush_invalid_entries(&other_record, + flush_range, + &next_chapter_to_invalidate); + if (result != UDS_SUCCESS) + return result; + if (other_record.delta_entry.at_end || + !other_record.delta_entry.is_collision) + break; + result = uds_get_delta_entry_collision(&other_record.delta_entry, + collision_name); + if (result != UDS_SUCCESS) + return result; + if (memcmp(collision_name, record->name, UDS_RECORD_NAME_SIZE) == 0) { + *record = other_record; + break; + } + } + } + while (!other_record.delta_entry.at_end) { + result = flush_invalid_entries(&other_record, + flush_range, + &next_chapter_to_invalidate); + if (result != UDS_SUCCESS) + return result; + } + next_chapter_to_invalidate += flush_range->chapter_start; + next_chapter_to_invalidate &= sub_index->chapter_mask; + flush_range->chapter_start = next_chapter_to_invalidate; + flush_range->chapter_count = 0; + return UDS_SUCCESS; +} + +static int get_volume_sub_index_record(struct volume_sub_index *sub_index, + const struct uds_record_name *name, + struct volume_index_record *record) +{ + int result; + const struct volume_sub_index_zone *volume_index_zone; + u32 address = extract_address(sub_index, name); + u32 delta_list_number = extract_dlist_num(sub_index, name); + u64 flush_chapter = sub_index->flush_chapters[delta_list_number]; + + record->sub_index = sub_index; + record->mutex = NULL; + record->name = name; + record->zone_number = delta_list_number / sub_index->delta_index.lists_per_zone; + volume_index_zone = get_zone_for_record(record); + + if (flush_chapter < volume_index_zone->virtual_chapter_low) { + struct chapter_range range; + u64 flush_count = volume_index_zone->virtual_chapter_low - flush_chapter; + + range.chapter_start = convert_virtual_to_index(sub_index, flush_chapter); + range.chapter_count = (flush_count > sub_index->chapter_mask ? + sub_index->chapter_mask + 1 : + flush_count); + result = get_volume_index_entry(record, delta_list_number, address, &range); + flush_chapter = convert_index_to_virtual(record, range.chapter_start); + if (flush_chapter > volume_index_zone->virtual_chapter_high) + flush_chapter = volume_index_zone->virtual_chapter_high; + sub_index->flush_chapters[delta_list_number] = flush_chapter; + } else { + result = uds_get_delta_index_entry(&sub_index->delta_index, + delta_list_number, + address, + name->name, + &record->delta_entry); + } + if (result != UDS_SUCCESS) + return result; + record->is_found = (!record->delta_entry.at_end && (record->delta_entry.key == address)); + if (record->is_found) { + u32 index_chapter = uds_get_delta_entry_value(&record->delta_entry); + + record->virtual_chapter = convert_index_to_virtual(record, index_chapter); + } + record->is_collision = record->delta_entry.is_collision; + return UDS_SUCCESS; +} + +int uds_get_volume_index_record(struct volume_index *volume_index, + const struct uds_record_name *name, + struct volume_index_record *record) +{ + int result; + + if (uds_is_volume_index_sample(volume_index, name)) { + /* + * Other threads cannot be allowed to call uds_lookup_volume_index_name() while + * this thread is finding the volume index record. Due to the lazy LRU flushing of + * the volume index, uds_get_volume_index_record() is not a read-only operation. 
+ */ + unsigned int zone = get_volume_sub_index_zone(&volume_index->vi_hook, name); + struct mutex *mutex = &volume_index->zones[zone].hook_mutex; + + uds_lock_mutex(mutex); + result = get_volume_sub_index_record(&volume_index->vi_hook, name, record); + uds_unlock_mutex(mutex); + /* Remember the mutex so that other operations on the index record can use it. */ + record->mutex = mutex; + } else { + result = get_volume_sub_index_record(&volume_index->vi_non_hook, name, record); + } + return result; +} + +int uds_put_volume_index_record(struct volume_index_record *record, u64 virtual_chapter) +{ + int result; + u32 address; + const struct volume_sub_index *sub_index = record->sub_index; + + if (!is_virtual_chapter_indexed(record, virtual_chapter)) { + u64 low = get_zone_for_record(record)->virtual_chapter_low; + u64 high = get_zone_for_record(record)->virtual_chapter_high; + + return uds_log_warning_strerror(UDS_INVALID_ARGUMENT, + "cannot put record into chapter number %llu that is out of the valid range %llu to %llu", + (unsigned long long) virtual_chapter, + (unsigned long long) low, + (unsigned long long) high); + } + address = extract_address(sub_index, record->name); + if (unlikely(record->mutex != NULL)) + uds_lock_mutex(record->mutex); + result = uds_put_delta_index_entry(&record->delta_entry, + address, + convert_virtual_to_index(sub_index, virtual_chapter), + record->is_found ? record->name->name : NULL); + if (unlikely(record->mutex != NULL)) + uds_unlock_mutex(record->mutex); + switch (result) { + case UDS_SUCCESS: + record->virtual_chapter = virtual_chapter; + record->is_collision = record->delta_entry.is_collision; + record->is_found = true; + break; + case UDS_OVERFLOW: + uds_log_ratelimit(uds_log_warning_strerror, + UDS_OVERFLOW, + "Volume index entry dropped due to overflow condition"); + uds_log_delta_index_entry(&record->delta_entry); + break; + default: + break; + } + return result; +} + +int uds_remove_volume_index_record(struct volume_index_record *record) +{ + int result; + + if (!record->is_found) + return uds_log_warning_strerror(UDS_BAD_STATE, "illegal operation on new record"); + /* Mark the record so that it cannot be used again */ + record->is_found = false; + if (unlikely(record->mutex != NULL)) + uds_lock_mutex(record->mutex); + result = uds_remove_delta_index_entry(&record->delta_entry); + if (unlikely(record->mutex != NULL)) + uds_unlock_mutex(record->mutex); + return result; +} + +static void set_volume_sub_index_zone_open_chapter(struct volume_sub_index *sub_index, + unsigned int zone_number, + u64 virtual_chapter) +{ + u64 used_bits = 0; + struct volume_sub_index_zone *zone = &sub_index->zones[zone_number]; + struct delta_zone *delta_zone; + u32 i; + + zone->virtual_chapter_low = (virtual_chapter >= sub_index->chapter_count ? + virtual_chapter - sub_index->chapter_count + 1 : + 0); + zone->virtual_chapter_high = virtual_chapter; + + /* Check to see if the new zone data is too large. */ + delta_zone = &sub_index->delta_index.delta_zones[zone_number]; + for (i = 1; i <= delta_zone->list_count; i++) + used_bits += delta_zone->delta_lists[i].size; + + if (used_bits > sub_index->max_zone_bits) { + /* Expire enough chapters to free the desired space. 
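+		 * For example (editor's illustrative numbers): if used_bits
+		 * exceeds max_zone_bits by two and a half chapters' worth of
+		 * bits, the computation below gives an expire_count of 3, so
+		 * three chapters' worth of entries are expired early.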
*/ + u64 expire_count = + 1 + (used_bits - sub_index->max_zone_bits) / sub_index->chapter_zone_bits; + + if (expire_count == 1) { + uds_log_ratelimit(uds_log_info, + "zone %u: At chapter %llu, expiring chapter %llu early", + zone_number, + (unsigned long long) virtual_chapter, + (unsigned long long) zone->virtual_chapter_low); + zone->early_flushes++; + zone->virtual_chapter_low++; + } else { + u64 first_expired = zone->virtual_chapter_low; + + if (first_expired + expire_count < zone->virtual_chapter_high) { + zone->early_flushes += expire_count; + zone->virtual_chapter_low += expire_count; + } else { + zone->early_flushes += + zone->virtual_chapter_high - zone->virtual_chapter_low; + zone->virtual_chapter_low = zone->virtual_chapter_high; + } + uds_log_ratelimit(uds_log_info, + "zone %u: At chapter %llu, expiring chapters %llu to %llu early", + zone_number, + (unsigned long long) virtual_chapter, + (unsigned long long) first_expired, + (unsigned long long) zone->virtual_chapter_low - 1); + } + } +} + +void uds_set_volume_index_zone_open_chapter(struct volume_index *volume_index, + unsigned int zone_number, + u64 virtual_chapter) +{ + struct mutex *mutex = &volume_index->zones[zone_number].hook_mutex; + + set_volume_sub_index_zone_open_chapter(&volume_index->vi_non_hook, + zone_number, + virtual_chapter); + + /* + * Other threads cannot be allowed to call uds_lookup_volume_index_name() while the open + * chapter number is changing. + */ + if (has_sparse(volume_index)) { + uds_lock_mutex(mutex); + set_volume_sub_index_zone_open_chapter(&volume_index->vi_hook, + zone_number, + virtual_chapter); + uds_unlock_mutex(mutex); + } +} + +/* + * Set the newest open chapter number for the index, while also advancing the oldest valid chapter + * number. + */ +void uds_set_volume_index_open_chapter(struct volume_index *volume_index, u64 virtual_chapter) +{ + unsigned int zone; + + for (zone = 0; zone < volume_index->zone_count; zone++) + uds_set_volume_index_zone_open_chapter(volume_index, zone, virtual_chapter); +} + +int uds_set_volume_index_record_chapter(struct volume_index_record *record, u64 virtual_chapter) +{ + const struct volume_sub_index *sub_index = record->sub_index; + int result; + + if (!record->is_found) + return uds_log_warning_strerror(UDS_BAD_STATE, "illegal operation on new record"); + if (!is_virtual_chapter_indexed(record, virtual_chapter)) { + u64 low = get_zone_for_record(record)->virtual_chapter_low; + u64 high = get_zone_for_record(record)->virtual_chapter_high; + + return uds_log_warning_strerror(UDS_INVALID_ARGUMENT, + "cannot set chapter number %llu that is out of the valid range %llu to %llu", + (unsigned long long) virtual_chapter, + (unsigned long long) low, + (unsigned long long) high); + } + if (unlikely(record->mutex != NULL)) + uds_lock_mutex(record->mutex); + result = uds_set_delta_entry_value(&record->delta_entry, + convert_virtual_to_index(sub_index, virtual_chapter)); + if (unlikely(record->mutex != NULL)) + uds_unlock_mutex(record->mutex); + if (result != UDS_SUCCESS) + return result; + record->virtual_chapter = virtual_chapter; + return UDS_SUCCESS; +} + +static u64 lookup_volume_sub_index_name(const struct volume_sub_index *sub_index, + const struct uds_record_name *name) +{ + int result; + u32 address = extract_address(sub_index, name); + u32 delta_list_number = extract_dlist_num(sub_index, name); + unsigned int zone_number = get_volume_sub_index_zone(sub_index, name); + const struct volume_sub_index_zone *zone = &sub_index->zones[zone_number]; + u64 
virtual_chapter; + u32 index_chapter; + u32 rolling_chapter; + struct delta_index_entry delta_entry; + + result = uds_get_delta_index_entry(&sub_index->delta_index, + delta_list_number, + address, + name->name, + &delta_entry); + if (result != UDS_SUCCESS) + return NO_CHAPTER; + + if (delta_entry.at_end || (delta_entry.key != address)) + return NO_CHAPTER; + + index_chapter = uds_get_delta_entry_value(&delta_entry); + rolling_chapter = (index_chapter - zone->virtual_chapter_low) & sub_index->chapter_mask; + + virtual_chapter = zone->virtual_chapter_low + rolling_chapter; + if (virtual_chapter > zone->virtual_chapter_high) + return NO_CHAPTER; + + return virtual_chapter; +} + +/* Do a read-only lookup of the record name for sparse cache management. */ +u64 uds_lookup_volume_index_name(const struct volume_index *volume_index, + const struct uds_record_name *name) +{ + unsigned int zone_number = uds_get_volume_index_zone(volume_index, name); + struct mutex *mutex = &volume_index->zones[zone_number].hook_mutex; + u64 virtual_chapter; + + if (!uds_is_volume_index_sample(volume_index, name)) + return NO_CHAPTER; + + uds_lock_mutex(mutex); + virtual_chapter = lookup_volume_sub_index_name(&volume_index->vi_hook, name); + uds_unlock_mutex(mutex); + + return virtual_chapter; +} + +static void abort_restoring_volume_sub_index(struct volume_sub_index *sub_index) +{ + uds_reset_delta_index(&sub_index->delta_index); +} + +static void abort_restoring_volume_index(struct volume_index *volume_index) +{ + abort_restoring_volume_sub_index(&volume_index->vi_non_hook); + if (has_sparse(volume_index)) + abort_restoring_volume_sub_index(&volume_index->vi_hook); +} + +static int start_restoring_volume_sub_index(struct volume_sub_index *sub_index, + struct buffered_reader **readers, + unsigned int reader_count) +{ + unsigned int z; + int result; + u64 virtual_chapter_low = 0, virtual_chapter_high = 0; + unsigned int i; + + for (i = 0; i < reader_count; i++) { + struct sub_index_data header; + u8 buffer[sizeof(struct sub_index_data)]; + size_t offset = 0; + u32 j; + + result = uds_read_from_buffered_reader(readers[i], buffer, sizeof(buffer)); + if (result != UDS_SUCCESS) + return uds_log_warning_strerror(result, + "failed to read volume index header"); + + memcpy(&header.magic, buffer, MAGIC_SIZE); + offset += MAGIC_SIZE; + decode_u64_le(buffer, &offset, &header.volume_nonce); + decode_u64_le(buffer, &offset, &header.virtual_chapter_low); + decode_u64_le(buffer, &offset, &header.virtual_chapter_high); + decode_u32_le(buffer, &offset, &header.first_list); + decode_u32_le(buffer, &offset, &header.list_count); + + result = ASSERT(offset = sizeof(buffer), + "%zu bytes decoded of %zu expected", + offset, + sizeof(buffer)); + if (result != UDS_SUCCESS) + result = UDS_CORRUPT_DATA; + + if (memcmp(header.magic, MAGIC_START_5, MAGIC_SIZE) != 0) + return uds_log_warning_strerror(UDS_CORRUPT_DATA, + "volume index file had bad magic number"); + + if (sub_index->volume_nonce == 0) + sub_index->volume_nonce = header.volume_nonce; + else if (header.volume_nonce != sub_index->volume_nonce) + return uds_log_warning_strerror(UDS_CORRUPT_DATA, + "volume index volume nonce incorrect"); + + if (i == 0) { + virtual_chapter_low = header.virtual_chapter_low; + virtual_chapter_high = header.virtual_chapter_high; + } else if (virtual_chapter_high != header.virtual_chapter_high) { + u64 low = header.virtual_chapter_low; + u64 high = header.virtual_chapter_high; + + return uds_log_warning_strerror(UDS_CORRUPT_DATA, + "Inconsistent volume 
index zone files: Chapter range is [%llu,%llu], chapter range %d is [%llu,%llu]", + (unsigned long long) virtual_chapter_low, + (unsigned long long) virtual_chapter_high, + i, + (unsigned long long) low, + (unsigned long long) high); + } else if (virtual_chapter_low < header.virtual_chapter_low) { + virtual_chapter_low = header.virtual_chapter_low; + } + + for (j = 0; j < header.list_count; j++) { + u8 decoded[sizeof(u64)]; + + result = uds_read_from_buffered_reader(readers[i], decoded, sizeof(u64)); + if (result != UDS_SUCCESS) + return uds_log_warning_strerror(result, + "failed to read volume index flush ranges"); + + sub_index->flush_chapters[header.first_list + j] = + get_unaligned_le64(decoded); + } + } + + for (z = 0; z < sub_index->zone_count; z++) { + memset(&sub_index->zones[z], 0, sizeof(struct volume_sub_index_zone)); + sub_index->zones[z].virtual_chapter_low = virtual_chapter_low; + sub_index->zones[z].virtual_chapter_high = virtual_chapter_high; + } + + result = uds_start_restoring_delta_index(&sub_index->delta_index, readers, reader_count); + if (result != UDS_SUCCESS) + return uds_log_warning_strerror(result, "restoring delta index failed"); + + return UDS_SUCCESS; +} + +static int start_restoring_volume_index(struct volume_index *volume_index, + struct buffered_reader **buffered_readers, + unsigned int reader_count) +{ + unsigned int i; + int result; + + if (!has_sparse(volume_index)) + return start_restoring_volume_sub_index(&volume_index->vi_non_hook, + buffered_readers, + reader_count); + + for (i = 0; i < reader_count; i++) { + struct volume_index_data header; + u8 buffer[sizeof(struct volume_index_data)]; + size_t offset = 0; + + result = uds_read_from_buffered_reader(buffered_readers[i], buffer, sizeof(buffer)); + if (result != UDS_SUCCESS) + return uds_log_warning_strerror(result, + "failed to read volume index header"); + + memcpy(&header.magic, buffer, MAGIC_SIZE); + offset += MAGIC_SIZE; + decode_u32_le(buffer, &offset, &header.sparse_sample_rate); + + result = ASSERT(offset == sizeof(buffer), + "%zu bytes decoded of %zu expected", + offset, + sizeof(buffer)); + if (result != UDS_SUCCESS) + result = UDS_CORRUPT_DATA; + + if (memcmp(header.magic, MAGIC_START_6, MAGIC_SIZE) != 0) + return uds_log_warning_strerror(UDS_CORRUPT_DATA, + "volume index file had bad magic number"); + + if (i == 0) { + volume_index->sparse_sample_rate = header.sparse_sample_rate; + } else if (volume_index->sparse_sample_rate != header.sparse_sample_rate) { + uds_log_warning_strerror(UDS_CORRUPT_DATA, + "Inconsistent sparse sample rate in delta index zone files: %u vs. 
%u", + volume_index->sparse_sample_rate, + header.sparse_sample_rate); + return UDS_CORRUPT_DATA; + } + } + + result = start_restoring_volume_sub_index(&volume_index->vi_non_hook, + buffered_readers, + reader_count); + if (result != UDS_SUCCESS) + return result; + return start_restoring_volume_sub_index(&volume_index->vi_hook, + buffered_readers, + reader_count); +} + +static int finish_restoring_volume_sub_index(struct volume_sub_index *sub_index, + struct buffered_reader **buffered_readers, + unsigned int reader_count) +{ + return uds_finish_restoring_delta_index(&sub_index->delta_index, + buffered_readers, + reader_count); +} + +static int finish_restoring_volume_index(struct volume_index *volume_index, + struct buffered_reader **buffered_readers, + unsigned int reader_count) +{ + int result; + + result = finish_restoring_volume_sub_index(&volume_index->vi_non_hook, + buffered_readers, + reader_count); + if ((result == UDS_SUCCESS) && has_sparse(volume_index)) + result = finish_restoring_volume_sub_index(&volume_index->vi_hook, + buffered_readers, + reader_count); + return result; +} + +int uds_load_volume_index(struct volume_index *volume_index, + struct buffered_reader **readers, + unsigned int reader_count) +{ + int result; + + /* Start by reading the header section of the stream. */ + result = start_restoring_volume_index(volume_index, readers, reader_count); + if (result != UDS_SUCCESS) + return result; + + result = finish_restoring_volume_index(volume_index, readers, reader_count); + if (result != UDS_SUCCESS) { + abort_restoring_volume_index(volume_index); + return result; + } + + /* Check the final guard lists to make sure there is no extra data. */ + result = uds_check_guard_delta_lists(readers, reader_count); + if (result != UDS_SUCCESS) + abort_restoring_volume_index(volume_index); + + return result; +} + +static int start_saving_volume_sub_index(const struct volume_sub_index *sub_index, + unsigned int zone_number, + struct buffered_writer *buffered_writer) +{ + int result; + struct volume_sub_index_zone *volume_index_zone = &sub_index->zones[zone_number]; + u32 first_list = sub_index->delta_index.delta_zones[zone_number].first_list; + u32 list_count = sub_index->delta_index.delta_zones[zone_number].list_count; + u8 buffer[sizeof(struct sub_index_data)]; + size_t offset = 0; + u32 i; + + memcpy(buffer, MAGIC_START_5, MAGIC_SIZE); + offset += MAGIC_SIZE; + encode_u64_le(buffer, &offset, sub_index->volume_nonce); + encode_u64_le(buffer, &offset, volume_index_zone->virtual_chapter_low); + encode_u64_le(buffer, &offset, volume_index_zone->virtual_chapter_high); + encode_u32_le(buffer, &offset, first_list); + encode_u32_le(buffer, &offset, list_count); + + result = ASSERT(offset == sizeof(struct sub_index_data), + "%zu bytes of config written, of %zu expected", + offset, + sizeof(struct sub_index_data)); + if (result != UDS_SUCCESS) + return result; + + result = uds_write_to_buffered_writer(buffered_writer, buffer, offset); + if (result != UDS_SUCCESS) + return uds_log_warning_strerror(result, "failed to write volume index header"); + + for (i = 0; i < list_count; i++) { + u8 encoded[sizeof(u64)]; + + put_unaligned_le64(sub_index->flush_chapters[first_list + i], &encoded); + result = uds_write_to_buffered_writer(buffered_writer, encoded, sizeof(u64)); + if (result != UDS_SUCCESS) + return uds_log_warning_strerror(result, + "failed to write volume index flush ranges"); + } + + return uds_start_saving_delta_index(&sub_index->delta_index, zone_number, buffered_writer); +} + +static 
int start_saving_volume_index(const struct volume_index *volume_index, + unsigned int zone_number, + struct buffered_writer *writer) +{ + u8 buffer[sizeof(struct volume_index_data)]; + size_t offset = 0; + int result; + + if (!has_sparse(volume_index)) + return start_saving_volume_sub_index(&volume_index->vi_non_hook, + zone_number, + writer); + + memcpy(buffer, MAGIC_START_6, MAGIC_SIZE); + offset += MAGIC_SIZE; + encode_u32_le(buffer, &offset, volume_index->sparse_sample_rate); + result = ASSERT(offset == sizeof(struct volume_index_data), + "%zu bytes of header written, of %zu expected", + offset, + sizeof(struct volume_index_data)); + if (result != UDS_SUCCESS) + return result; + + result = uds_write_to_buffered_writer(writer, buffer, offset); + if (result != UDS_SUCCESS) { + uds_log_warning_strerror(result, "failed to write volume index header"); + return result; + } + + result = start_saving_volume_sub_index(&volume_index->vi_non_hook, zone_number, writer); + if (result != UDS_SUCCESS) + return result; + + return start_saving_volume_sub_index(&volume_index->vi_hook, zone_number, writer); +} + +static int +finish_saving_volume_sub_index(const struct volume_sub_index *sub_index, unsigned int zone_number) +{ + return uds_finish_saving_delta_index(&sub_index->delta_index, zone_number); +} + +static int +finish_saving_volume_index(const struct volume_index *volume_index, unsigned int zone_number) +{ + int result; + + result = finish_saving_volume_sub_index(&volume_index->vi_non_hook, zone_number); + if ((result == UDS_SUCCESS) && has_sparse(volume_index)) + result = finish_saving_volume_sub_index(&volume_index->vi_hook, zone_number); + return result; +} + +int uds_save_volume_index(struct volume_index *volume_index, + struct buffered_writer **writers, + unsigned int writer_count) +{ + int result = UDS_SUCCESS; + unsigned int zone; + + for (zone = 0; zone < writer_count; ++zone) { + result = start_saving_volume_index(volume_index, zone, writers[zone]); + if (result != UDS_SUCCESS) + break; + + result = finish_saving_volume_index(volume_index, zone); + if (result != UDS_SUCCESS) + break; + + result = uds_write_guard_delta_list(writers[zone]); + if (result != UDS_SUCCESS) + break; + + result = uds_flush_buffered_writer(writers[zone]); + if (result != UDS_SUCCESS) + break; + } + + return result; +} + +static void get_volume_sub_index_stats(const struct volume_sub_index *sub_index, + struct volume_index_stats *stats) +{ + struct delta_index_stats dis; + unsigned int z; + + uds_get_delta_index_stats(&sub_index->delta_index, &dis); + stats->rebalance_time = dis.rebalance_time; + stats->rebalance_count = dis.rebalance_count; + stats->record_count = dis.record_count; + stats->collision_count = dis.collision_count; + stats->discard_count = dis.discard_count; + stats->overflow_count = dis.overflow_count; + stats->delta_lists = dis.list_count; + stats->early_flushes = 0; + for (z = 0; z < sub_index->zone_count; z++) + stats->early_flushes += sub_index->zones[z].early_flushes; +} + +void uds_get_volume_index_stats(const struct volume_index *volume_index, + struct volume_index_stats *stats) +{ + struct volume_index_stats sparse_stats; + + get_volume_sub_index_stats(&volume_index->vi_non_hook, stats); + if (!has_sparse(volume_index)) + return; + + get_volume_sub_index_stats(&volume_index->vi_hook, &sparse_stats); + stats->rebalance_time += sparse_stats.rebalance_time; + stats->rebalance_count += sparse_stats.rebalance_count; + stats->record_count += sparse_stats.record_count; + stats->collision_count += 
sparse_stats.collision_count; + stats->discard_count += sparse_stats.discard_count; + stats->overflow_count += sparse_stats.overflow_count; + stats->delta_lists += sparse_stats.delta_lists; + stats->early_flushes += sparse_stats.early_flushes; +} + +static int initialize_volume_sub_index(const struct configuration *config, + u64 volume_nonce, + u8 tag, + struct volume_sub_index *sub_index) +{ + struct sub_index_parameters params = { .address_bits = 0 }; + unsigned int zone_count = config->zone_count; + u64 available_bytes = 0; + unsigned int z; + int result; + + result = compute_volume_sub_index_parameters(config, ¶ms); + if (result != UDS_SUCCESS) + return result; + + sub_index->address_bits = params.address_bits; + sub_index->address_mask = (1u << params.address_bits) - 1; + sub_index->chapter_bits = params.chapter_bits; + sub_index->chapter_mask = (1u << params.chapter_bits) - 1; + sub_index->chapter_count = params.chapter_count; + sub_index->list_count = params.list_count; + sub_index->zone_count = zone_count; + sub_index->chapter_zone_bits = params.chapter_size_in_bits / zone_count; + sub_index->volume_nonce = volume_nonce; + + result = uds_initialize_delta_index(&sub_index->delta_index, + zone_count, + params.list_count, + params.mean_delta, + params.chapter_bits, + params.memory_size, + tag); + if (result != UDS_SUCCESS) + return result; + + for (z = 0; z < sub_index->delta_index.zone_count; z++) + available_bytes += sub_index->delta_index.delta_zones[z].size; + available_bytes -= params.target_free_bytes; + sub_index->max_zone_bits = (available_bytes * BITS_PER_BYTE) / zone_count; + sub_index->memory_size = (sub_index->delta_index.memory_size + + sizeof(struct volume_sub_index) + + (params.list_count * sizeof(u64)) + + (zone_count * sizeof(struct volume_sub_index_zone))); + + /* The following arrays are initialized to all zeros. 
*/ + result = UDS_ALLOCATE(params.list_count, + u64, + "first chapter to flush", + &sub_index->flush_chapters); + if (result != UDS_SUCCESS) + return result; + + return UDS_ALLOCATE(zone_count, + struct volume_sub_index_zone, + "volume index zones", + &sub_index->zones); +} + +int uds_make_volume_index(const struct configuration *config, + u64 volume_nonce, + struct volume_index **volume_index_ptr) +{ + struct split_config split; + unsigned int zone; + struct volume_index *volume_index; + int result; + + result = UDS_ALLOCATE(1, struct volume_index, "volume index", &volume_index); + if (result != UDS_SUCCESS) + return result; + + volume_index->zone_count = config->zone_count; + + if (!uds_is_sparse_geometry(config->geometry)) { + result = initialize_volume_sub_index(config, + volume_nonce, + 'm', + &volume_index->vi_non_hook); + if (result != UDS_SUCCESS) { + uds_free_volume_index(volume_index); + return result; + } + + volume_index->memory_size = volume_index->vi_non_hook.memory_size; + *volume_index_ptr = volume_index; + return UDS_SUCCESS; + } + + volume_index->sparse_sample_rate = config->sparse_sample_rate; + + result = UDS_ALLOCATE(config->zone_count, + struct volume_index_zone, + "volume index zones", + &volume_index->zones); + if (result != UDS_SUCCESS) { + uds_free_volume_index(volume_index); + return result; + } + + for (zone = 0; zone < config->zone_count; zone++) { + result = uds_init_mutex(&volume_index->zones[zone].hook_mutex); + if (result != UDS_SUCCESS) { + uds_free_volume_index(volume_index); + return result; + } + } + + split_configuration(config, &split); + result = initialize_volume_sub_index(&split.non_hook_config, + volume_nonce, + 'd', + &volume_index->vi_non_hook); + if (result != UDS_SUCCESS) { + uds_free_volume_index(volume_index); + return uds_log_error_strerror(result, "Error creating non hook volume index"); + } + + result = initialize_volume_sub_index(&split.hook_config, + volume_nonce, + 's', + &volume_index->vi_hook); + if (result != UDS_SUCCESS) { + uds_free_volume_index(volume_index); + return uds_log_error_strerror(result, "Error creating hook volume index"); + } + + volume_index->memory_size = + volume_index->vi_non_hook.memory_size + volume_index->vi_hook.memory_size; + *volume_index_ptr = volume_index; + return UDS_SUCCESS; +} diff --git a/drivers/md/dm-vdo/volume-index.h b/drivers/md/dm-vdo/volume-index.h new file mode 100644 index 000000000000..248f8ec86957 --- /dev/null +++ b/drivers/md/dm-vdo/volume-index.h @@ -0,0 +1,192 @@ +/* SPDX-License-Identifier: GPL-2.0-only */ +/* + * Copyright 2023 Red Hat + */ + +#ifndef UDS_VOLUME_INDEX_H +#define UDS_VOLUME_INDEX_H + +#include + +#include "config.h" +#include "delta-index.h" +#include "uds.h" +#include "uds-threads.h" + +/* + * The volume index is the primary top-level index for UDS. It contains records which map a record + * name to the chapter where a record with that name is stored. This mapping can definitively say + * when no record exists. However, because we only use a subset of the name for this index, it + * cannot definitively say that a record for the entry does exist. It can only say that if a record + * exists, it will be in a particular chapter. The request can then be dispatched to that chapter + * for further processing. + * + * If the volume_index_record does not actually match the record name, the index can store a more + * specific collision record to disambiguate the new entry from the existing one. Index entries are + * managed with volume_index_record structures. 
+ */ + +static const u64 NO_CHAPTER = U64_MAX; + +struct volume_index_stats { + /* Nanoseconds spent rebalancing */ + ktime_t rebalance_time; + /* Number of memory rebalances */ + u32 rebalance_count; + /* The number of records in the index */ + u64 record_count; + /* The number of collision records */ + u64 collision_count; + /* The number of records removed */ + u64 discard_count; + /* The number of UDS_OVERFLOWs detected */ + u64 overflow_count; + /* The number of delta lists */ + u32 delta_lists; + /* Number of early flushes */ + u64 early_flushes; +}; + +struct volume_sub_index_zone { + u64 virtual_chapter_low; + u64 virtual_chapter_high; + u64 early_flushes; +} __aligned(L1_CACHE_BYTES); + +struct volume_sub_index { + /* The delta index */ + struct delta_index delta_index; + /* The first chapter to be flushed in each zone */ + u64 *flush_chapters; + /* The zones */ + struct volume_sub_index_zone *zones; + /* The volume nonce */ + u64 volume_nonce; + /* Expected size of a chapter (per zone) */ + u64 chapter_zone_bits; + /* Maximum size of the index (per zone) */ + u64 max_zone_bits; + /* The number of bits in address mask */ + u8 address_bits; + /* Mask to get address within delta list */ + u32 address_mask; + /* The number of bits in chapter number */ + u8 chapter_bits; + /* The largest storable chapter number */ + u32 chapter_mask; + /* The number of chapters used */ + u32 chapter_count; + /* The number of delta lists */ + u32 list_count; + /* The number of zones */ + unsigned int zone_count; + /* The amount of memory allocated */ + u64 memory_size; +}; + +struct volume_index_zone { + /* Protects the sampled index in this zone */ + struct mutex hook_mutex; +} __aligned(L1_CACHE_BYTES); + +struct volume_index { + u32 sparse_sample_rate; + unsigned int zone_count; + u64 memory_size; + struct volume_sub_index vi_non_hook; + struct volume_sub_index vi_hook; + struct volume_index_zone *zones; +}; + +/* + * The volume_index_record structure is used to facilitate processing of a record name. A client + * first calls uds_get_volume_index_record() to find the volume index record for a record name. The + * fields of the record can then be examined to determine the state of the record. + * + * If is_found is false, then the index did not find an entry for the record name. Calling + * uds_put_volume_index_record() will insert a new entry for that name at the proper place. + * + * If is_found is true, then we did find an entry for the record name, and the virtual_chapter and + * is_collision fields reflect the entry found. Subsequently, a call to + * uds_remove_volume_index_record() will remove the entry, a call to + * uds_set_volume_index_record_chapter() will update the existing entry, and a call to + * uds_put_volume_index_record() will insert a new collision record after the existing entry. 
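+ *
+ * As a usage sketch (an editor's illustration, not part of the original
+ * patch; error handling is omitted and new_chapter is a hypothetical
+ * variable):
+ *
+ *   struct volume_index_record record;
+ *
+ *   uds_get_volume_index_record(volume_index, name, &record);
+ *   if (record.is_found)
+ *           uds_set_volume_index_record_chapter(&record, new_chapter);
+ *   else
+ *           uds_put_volume_index_record(&record, new_chapter);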
+ */ +struct volume_index_record { + /* Public fields */ + + /* Chapter where the record info is found */ + u64 virtual_chapter; + /* This record is a collision */ + bool is_collision; + /* This record is the requested record */ + bool is_found; + + /* Private fields */ + + /* Zone that contains this name */ + unsigned int zone_number; + /* The volume index */ + struct volume_sub_index *sub_index; + /* Mutex for accessing this delta index entry in the hook index */ + struct mutex *mutex; + /* The record name to which this record refers */ + const struct uds_record_name *name; + /* The delta index entry for this record */ + struct delta_index_entry delta_entry; +}; + +int __must_check uds_make_volume_index(const struct configuration *config, + u64 volume_nonce, + struct volume_index **volume_index); + +void uds_free_volume_index(struct volume_index *volume_index); + +int __must_check uds_compute_volume_index_save_blocks(const struct configuration *config, + size_t block_size, + u64 *block_count); + +unsigned int __must_check +uds_get_volume_index_zone(const struct volume_index *volume_index, + const struct uds_record_name *name); + +bool __must_check uds_is_volume_index_sample(const struct volume_index *volume_index, + const struct uds_record_name *name); + +/* + * This function is only used to manage sparse cache membership. Most requests should use + * uds_get_volume_index_record() to look up index records instead. + */ +u64 __must_check uds_lookup_volume_index_name(const struct volume_index *volume_index, + const struct uds_record_name *name); + +int __must_check uds_get_volume_index_record(struct volume_index *volume_index, + const struct uds_record_name *name, + struct volume_index_record *record); + +int __must_check +uds_put_volume_index_record(struct volume_index_record *record, u64 virtual_chapter); + +int __must_check uds_remove_volume_index_record(struct volume_index_record *record); + +int __must_check +uds_set_volume_index_record_chapter(struct volume_index_record *record, u64 virtual_chapter); + +void uds_set_volume_index_open_chapter(struct volume_index *volume_index, u64 virtual_chapter); + +void uds_set_volume_index_zone_open_chapter(struct volume_index *volume_index, + unsigned int zone_number, + u64 virtual_chapter); + +int __must_check uds_load_volume_index(struct volume_index *volume_index, + struct buffered_reader **readers, + unsigned int reader_count); + +int __must_check uds_save_volume_index(struct volume_index *volume_index, + struct buffered_writer **writers, + unsigned int writer_count); + +void uds_get_volume_index_stats(const struct volume_index *volume_index, + struct volume_index_stats *stats); + +#endif /* UDS_VOLUME_INDEX_H */ From patchwork Thu Oct 26 21:41:10 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Matthew Sakai X-Patchwork-Id: 13437931 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id D91A6C25B6B for ; Thu, 26 Oct 2023 21:43:14 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S231977AbjJZVnO (ORCPT ); Thu, 26 Oct 2023 17:43:14 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:40958 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S232027AbjJZVnO (ORCPT ); Thu, 26 Oct 2023 17:43:14 -0400 Received: from 
us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.129.124]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 201BED47 for ; Thu, 26 Oct 2023 14:41:44 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1698356504; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=ZMxVCYoLfbc+fV0heDgTWWxLG3arUYk2eoet8M+OQ4k=; b=C4eHPbeuEbvpWxbbaUPcTzV+S0REyfFd3U3OGp3qGbcp/ij32Cl4tCIJFoq0bcWgFohdYf 7o7tghp74VpZB9pSy4dvffYA6LW++GBs7gMyb2ICU6p7j2tm5hoJccT5ciBRf2bMZhzOry fGpb2O1I+ARMBYCQoDbUa2afiWB7Fqk= Received: from mimecast-mx02.redhat.com (mimecast-mx02.redhat.com [66.187.233.88]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.3, cipher=TLS_AES_256_GCM_SHA384) id us-mta-625-i8IqB_xOO8--74czHundhg-1; Thu, 26 Oct 2023 17:41:39 -0400 X-MC-Unique: i8IqB_xOO8--74czHundhg-1 Received: from smtp.corp.redhat.com (int-mx10.intmail.prod.int.rdu2.redhat.com [10.11.54.10]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (2048 bits) server-digest SHA256) (No client certificate requested) by mimecast-mx02.redhat.com (Postfix) with ESMTPS id AA3ED811E8F; Thu, 26 Oct 2023 21:41:39 +0000 (UTC) Received: from pbitcolo-build-10.permabit.com (pbitcolo-build-10.permabit.lab.eng.bos.redhat.com [10.19.117.76]) by smtp.corp.redhat.com (Postfix) with ESMTP id 51710492BFA; Thu, 26 Oct 2023 21:41:39 +0000 (UTC) Received: by pbitcolo-build-10.permabit.com (Postfix, from userid 1138) id 0604A3003E; Thu, 26 Oct 2023 17:41:38 -0400 (EDT) From: Matthew Sakai To: dm-devel@lists.linux.dev, linux-block@vger.kernel.org Cc: Matthew Sakai Subject: [PATCH v4 13/39] dm vdo: implement the open chapter and chapter indexes Date: Thu, 26 Oct 2023 17:41:10 -0400 Message-Id: <20231026214136.1067410-14-msakai@redhat.com> In-Reply-To: <20231026214136.1067410-1-msakai@redhat.com> References: <20231026214136.1067410-1-msakai@redhat.com> MIME-Version: 1.0 X-Scanned-By: MIMEDefang 3.4.1 on 10.11.54.10 Precedence: bulk List-ID: X-Mailing-List: linux-block@vger.kernel.org Deduplication records are stored in groups called chapters. New records are collected in a structure called the open chapter, which is optimized for adding, removing, and sorting records. When a chapter fills, it is packed into a read-only structure called a closed chapter, which is optimized for searching and reading. The closed chapter includes a delta index, called the chapter index, which maps each record name to the record page containing the record and allows the index to read at most one record page when looking up a record. Co-developed-by: J. corwin Coburn Signed-off-by: J. 
corwin Coburn Co-developed-by: Michael Sclafani Signed-off-by: Michael Sclafani Co-developed-by: Thomas Jaskiewicz Signed-off-by: Thomas Jaskiewicz Signed-off-by: Matthew Sakai --- drivers/md/dm-vdo/chapter-index.c | 304 +++++++++++++++++++++ drivers/md/dm-vdo/chapter-index.h | 66 +++++ drivers/md/dm-vdo/open-chapter.c | 433 ++++++++++++++++++++++++++++++ drivers/md/dm-vdo/open-chapter.h | 79 ++++++ 4 files changed, 882 insertions(+) create mode 100644 drivers/md/dm-vdo/chapter-index.c create mode 100644 drivers/md/dm-vdo/chapter-index.h create mode 100644 drivers/md/dm-vdo/open-chapter.c create mode 100644 drivers/md/dm-vdo/open-chapter.h diff --git a/drivers/md/dm-vdo/chapter-index.c b/drivers/md/dm-vdo/chapter-index.c new file mode 100644 index 000000000000..29eb7fb8b4fe --- /dev/null +++ b/drivers/md/dm-vdo/chapter-index.c @@ -0,0 +1,304 @@ +// SPDX-License-Identifier: GPL-2.0-only +/* + * Copyright 2023 Red Hat + */ + +#include "chapter-index.h" + +#include "errors.h" +#include "hash-utils.h" +#include "logger.h" +#include "memory-alloc.h" +#include "permassert.h" +#include "uds.h" + +int uds_make_open_chapter_index(struct open_chapter_index **chapter_index, + const struct geometry *geometry, + u64 volume_nonce) +{ + int result; + size_t memory_size; + struct open_chapter_index *index; + + result = UDS_ALLOCATE(1, struct open_chapter_index, "open chapter index", &index); + if (result != UDS_SUCCESS) + return result; + + /* + * The delta index will rebalance delta lists when memory gets tight, + * so give the chapter index one extra page. + */ + memory_size = ((geometry->index_pages_per_chapter + 1) * geometry->bytes_per_page); + index->geometry = geometry; + index->volume_nonce = volume_nonce; + result = uds_initialize_delta_index(&index->delta_index, + 1, + geometry->delta_lists_per_chapter, + geometry->chapter_mean_delta, + geometry->chapter_payload_bits, + memory_size, + 'm'); + if (result != UDS_SUCCESS) { + UDS_FREE(index); + return result; + } + + index->memory_size = index->delta_index.memory_size + sizeof(struct open_chapter_index); + *chapter_index = index; + return UDS_SUCCESS; +} + +void uds_free_open_chapter_index(struct open_chapter_index *chapter_index) +{ + if (chapter_index == NULL) + return; + + uds_uninitialize_delta_index(&chapter_index->delta_index); + UDS_FREE(chapter_index); +} + +/* Re-initialize an open chapter index for a new chapter. */ +void uds_empty_open_chapter_index(struct open_chapter_index *chapter_index, + u64 virtual_chapter_number) +{ + uds_reset_delta_index(&chapter_index->delta_index); + chapter_index->virtual_chapter_number = virtual_chapter_number; +} + +static inline bool was_entry_found(const struct delta_index_entry *entry, u32 address) +{ + return (!entry->at_end) && (entry->key == address); +} + +/* Associate a record name with the record page containing its metadata. 
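Before the individual operations, a rough sketch of the intended lifecycle of an
open chapter index (hypothetical fragment; geometry, volume_nonce, name,
page_buffer, and the page numbering are assumed to come from the surrounding
index code):

        struct open_chapter_index *chapter_index;
        u32 lists_packed;
        int result;

        result = uds_make_open_chapter_index(&chapter_index, geometry, volume_nonce);
        if (result != UDS_SUCCESS)
                return result;

        /* Reset the index for each new chapter, then add each record in it. */
        uds_empty_open_chapter_index(chapter_index, virtual_chapter_number);
        result = uds_put_open_chapter_index_record(chapter_index, name, page_number);

        /* When the chapter closes, pack the delta lists onto index pages. */
        result = uds_pack_open_chapter_index_page(chapter_index, page_buffer,
                                                  first_list, last_page,
                                                  &lists_packed);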
*/ +int uds_put_open_chapter_index_record(struct open_chapter_index *chapter_index, + const struct uds_record_name *name, + u32 page_number) +{ + int result; + struct delta_index_entry entry; + u32 address; + u32 list_number; + const u8 *found_name; + bool found; + const struct geometry *geometry = chapter_index->geometry; + u64 chapter_number = chapter_index->virtual_chapter_number; + u32 record_pages = geometry->record_pages_per_chapter; + + result = ASSERT(page_number < record_pages, + "Page number within chapter (%u) exceeds the maximum value %u", + page_number, + record_pages); + if (result != UDS_SUCCESS) + return UDS_INVALID_ARGUMENT; + + address = uds_hash_to_chapter_delta_address(name, geometry); + list_number = uds_hash_to_chapter_delta_list(name, geometry); + result = uds_get_delta_index_entry(&chapter_index->delta_index, + list_number, + address, + name->name, + &entry); + if (result != UDS_SUCCESS) + return result; + + found = was_entry_found(&entry, address); + result = ASSERT(!(found && entry.is_collision), + "Chunk appears more than once in chapter %llu", + (unsigned long long) chapter_number); + if (result != UDS_SUCCESS) + return UDS_BAD_STATE; + + found_name = (found ? name->name : NULL); + return uds_put_delta_index_entry(&entry, address, page_number, found_name); +} + +/* + * Pack a section of an open chapter index into a chapter index page. A range of delta lists + * (starting with a specified list index) is copied from the open chapter index into a memory page. + * The number of lists copied onto the page is returned to the caller on success. + * + * @chapter_index: The open chapter index + * @memory: The memory page to use + * @first_list: The first delta list number to be copied + * @last_page: If true, this is the last page of the chapter index and all the remaining lists must + * be packed onto this page + * @lists_packed: The number of delta lists that were packed onto this page + */ +int uds_pack_open_chapter_index_page(struct open_chapter_index *chapter_index, + u8 *memory, + u32 first_list, + bool last_page, + u32 *lists_packed) +{ + int result; + struct delta_index *delta_index = &chapter_index->delta_index; + struct delta_index_stats stats; + u64 nonce = chapter_index->volume_nonce; + u64 chapter_number = chapter_index->virtual_chapter_number; + const struct geometry *geometry = chapter_index->geometry; + u32 list_count = geometry->delta_lists_per_chapter; + unsigned int removals = 0; + struct delta_index_entry entry; + u32 next_list; + s32 list_number; + + for (;;) { + result = uds_pack_delta_index_page(delta_index, + nonce, + memory, + geometry->bytes_per_page, + chapter_number, + first_list, + lists_packed); + if (result != UDS_SUCCESS) + return result; + if ((first_list + *lists_packed) == list_count) + /* All lists are packed. */ + break; + else if (*lists_packed == 0) { + /* + * The next delta list does not fit on a page. This delta list will be + * removed. + */ + } else if (last_page) { + /* + * This is the last page and there are lists left unpacked, but all of the + * remaining lists must fit on the page. Find a list that contains entries + * and remove the entire list. Try the first list that does not fit. If it + * is empty, we will select the last list that already fits and has any + * entries. + */ + } else { + /* This page is done. 
*/ + break; + } + + if (removals == 0) { + uds_get_delta_index_stats(delta_index, &stats); + uds_log_warning("The chapter index for chapter %llu contains %llu entries with %llu collisions", + (unsigned long long) chapter_number, + (unsigned long long) stats.record_count, + (unsigned long long) stats.collision_count); + } + + list_number = *lists_packed; + do { + if (list_number < 0) + return UDS_OVERFLOW; + + next_list = first_list + list_number--, + result = uds_start_delta_index_search(delta_index, next_list, 0, &entry); + if (result != UDS_SUCCESS) + return result; + + result = uds_next_delta_index_entry(&entry); + if (result != UDS_SUCCESS) + return result; + } while (entry.at_end); + + do { + result = uds_remove_delta_index_entry(&entry); + if (result != UDS_SUCCESS) + return result; + removals++; + } while (!entry.at_end); + } + + if (removals > 0) + uds_log_warning("To avoid chapter index page overflow in chapter %llu, %u entries were removed from the chapter index", + (unsigned long long) chapter_number, + removals); + + return UDS_SUCCESS; +} + +/* Make a new chapter index page, initializing it with the data from a given index_page buffer. */ +int uds_initialize_chapter_index_page(struct delta_index_page *index_page, + const struct geometry *geometry, + u8 *page_buffer, + u64 volume_nonce) +{ + return uds_initialize_delta_index_page(index_page, + volume_nonce, + geometry->chapter_mean_delta, + geometry->chapter_payload_bits, + page_buffer, + geometry->bytes_per_page); +} + +/* Validate a chapter index page read during rebuild. */ +int uds_validate_chapter_index_page(const struct delta_index_page *index_page, + const struct geometry *geometry) +{ + int result; + const struct delta_index *delta_index = &index_page->delta_index; + u32 first = index_page->lowest_list_number; + u32 last = index_page->highest_list_number; + u32 list_number; + + /* We walk every delta list from start to finish. */ + for (list_number = first; list_number <= last; list_number++) { + struct delta_index_entry entry; + + result = uds_start_delta_index_search(delta_index, list_number - first, 0, &entry); + if (result != UDS_SUCCESS) + return result; + + for (;;) { + result = uds_next_delta_index_entry(&entry); + if (result != UDS_SUCCESS) + /* + * A random bit stream is highly likely to arrive here when we go + * past the end of the delta list. + */ + return result; + + if (entry.at_end) + break; + + /* Also make sure that the record page field contains a plausible value. */ + if (uds_get_delta_entry_value(&entry) >= + geometry->record_pages_per_chapter) + /* + * Do not log this as an error. It happens in normal operation when + * we are doing a rebuild but haven't written the entire volume + * once. + */ + return UDS_CORRUPT_DATA; + } + } + return UDS_SUCCESS; +} + +/* + * Search a chapter index page for a record name, returning the record page number that may contain + * the name. 
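A sketch of the caller's side of this search (hypothetical fragment; index_page,
geometry, and name are assumed to come from the surrounding read path):

        u16 record_page;
        int result;

        result = uds_search_chapter_index_page(index_page, geometry, name,
                                               &record_page);
        if (result != UDS_SUCCESS)
                return result;

        if (record_page == NO_CHAPTER_INDEX_ENTRY) {
                /* No record in this chapter hashes to this name's address. */
        } else {
                /* Read record page 'record_page' and search it for the name. */
        }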
+ */ +int uds_search_chapter_index_page(struct delta_index_page *index_page, + const struct geometry *geometry, + const struct uds_record_name *name, + u16 *record_page_ptr) +{ + int result; + struct delta_index *delta_index = &index_page->delta_index; + u32 address = uds_hash_to_chapter_delta_address(name, geometry); + u32 delta_list_number = uds_hash_to_chapter_delta_list(name, geometry); + u32 sub_list_number = delta_list_number - index_page->lowest_list_number; + struct delta_index_entry entry; + + result = uds_get_delta_index_entry(delta_index, + sub_list_number, + address, + name->name, + &entry); + if (result != UDS_SUCCESS) + return result; + + if (was_entry_found(&entry, address)) + *record_page_ptr = uds_get_delta_entry_value(&entry); + else + *record_page_ptr = NO_CHAPTER_INDEX_ENTRY; + + return UDS_SUCCESS; +} diff --git a/drivers/md/dm-vdo/chapter-index.h b/drivers/md/dm-vdo/chapter-index.h new file mode 100644 index 000000000000..0a9992e5e347 --- /dev/null +++ b/drivers/md/dm-vdo/chapter-index.h @@ -0,0 +1,66 @@ +/* SPDX-License-Identifier: GPL-2.0-only */ +/* + * Copyright 2023 Red Hat + */ + +#ifndef UDS_CHAPTER_INDEX_H +#define UDS_CHAPTER_INDEX_H + +#include + +#include "delta-index.h" +#include "geometry.h" + +/* + * A chapter index for an open chapter is a mutable structure that tracks all the records that have + * been added to the chapter. A chapter index for a closed chapter is similar except that it is + * immutable because the contents of a closed chapter can never change, and the immutable structure + * is more efficient. Both types of chapter index are implemented with a delta index. + */ + +enum { + /* The value returned when no entry is found in the chapter index. */ + NO_CHAPTER_INDEX_ENTRY = U16_MAX, +}; + +struct open_chapter_index { + const struct geometry *geometry; + struct delta_index delta_index; + u64 virtual_chapter_number; + u64 volume_nonce; + size_t memory_size; +}; + +int __must_check uds_make_open_chapter_index(struct open_chapter_index **chapter_index, + const struct geometry *geometry, + u64 volume_nonce); + +void uds_free_open_chapter_index(struct open_chapter_index *chapter_index); + +void uds_empty_open_chapter_index(struct open_chapter_index *chapter_index, + u64 virtual_chapter_number); + +int __must_check uds_put_open_chapter_index_record(struct open_chapter_index *chapter_index, + const struct uds_record_name *name, + u32 page_number); + +int __must_check uds_pack_open_chapter_index_page(struct open_chapter_index *chapter_index, + u8 *memory, + u32 first_list, + bool last_page, + u32 *lists_packed); + +int __must_check uds_initialize_chapter_index_page(struct delta_index_page *index_page, + const struct geometry *geometry, + u8 *page_buffer, + u64 volume_nonce); + +int __must_check uds_validate_chapter_index_page(const struct delta_index_page *index_page, + const struct geometry *geometry); + +int __must_check uds_search_chapter_index_page(struct delta_index_page *index_page, + const struct geometry *geometry, + const struct uds_record_name *name, + u16 *record_page_ptr); + +#endif /* UDS_CHAPTER_INDEX_H */ diff --git a/drivers/md/dm-vdo/open-chapter.c b/drivers/md/dm-vdo/open-chapter.c new file mode 100644 index 000000000000..d54c8fb6f118 --- /dev/null +++ b/drivers/md/dm-vdo/open-chapter.c @@ -0,0 +1,433 @@ +// SPDX-License-Identifier: GPL-2.0-only +/* + * Copyright 2023 Red Hat + */ + +#include "open-chapter.h" + +#include + +#include "config.h" +#include "hash-utils.h" +#include "logger.h" +#include "memory-alloc.h" +#include 
"numeric.h" +#include "permassert.h" + +/* + * Each index zone has a dedicated open chapter zone structure which gets an equal share of the + * open chapter space. Records are assigned to zones based on their record name. Within each zone, + * records are stored in an array in the order they arrive. Additionally, a reference to each + * record is stored in a hash table to help determine if a new record duplicates an existing one. + * If new metadata for an existing name arrives, the record is altered in place. The array of + * records is 1-based so that record number 0 can be used to indicate an unused hash slot. + * + * Deleted records are marked with a flag rather than actually removed to simplify hash table + * management. The array of deleted flags overlays the array of hash slots, but the flags are + * indexed by record number instead of by record name. The number of hash slots will always be a + * power of two that is greater than the number of records to be indexed, guaranteeing that hash + * insertion cannot fail, and that there are sufficient flags for all records. + * + * Once any open chapter zone fills its available space, the chapter is closed. The records from + * each zone are interleaved to attempt to preserve temporal locality and assigned to record pages. + * Empty or deleted records are replaced by copies of a valid record so that the record pages only + * contain valid records. The chapter then constructs a delta index which maps each record name to + * the record page on which that record can be found, which is split into index pages. These + * structures are then passed to the volume to be recorded on storage. + * + * When the index is saved, the open chapter records are saved in a single array, once again + * interleaved to attempt to preserve temporal locality. When the index is reloaded, there may be a + * different number of zones than previously, so the records must be parcelled out to their new + * zones. In addition, depending on the distribution of record names, a new zone may have more + * records than it has space. In this case, the latest records for that zone will be discarded. 
+ */ + +static const u8 OPEN_CHAPTER_MAGIC[] = "ALBOC"; +static const u8 OPEN_CHAPTER_VERSION[] = "02.00"; + +enum { + OPEN_CHAPTER_MAGIC_LENGTH = sizeof(OPEN_CHAPTER_MAGIC) - 1, + OPEN_CHAPTER_VERSION_LENGTH = sizeof(OPEN_CHAPTER_VERSION) - 1, + LOAD_RATIO = 2, +}; + +static inline size_t records_size(const struct open_chapter_zone *open_chapter) +{ + return sizeof(struct uds_volume_record) * (1 + open_chapter->capacity); +} + +static inline size_t slots_size(size_t slot_count) +{ + return sizeof(struct open_chapter_zone_slot) * slot_count; +} + +int uds_make_open_chapter(const struct geometry *geometry, + unsigned int zone_count, + struct open_chapter_zone **open_chapter_ptr) +{ + int result; + struct open_chapter_zone *open_chapter; + size_t capacity = geometry->records_per_chapter / zone_count; + size_t slot_count = (1 << bits_per(capacity * LOAD_RATIO)); + + result = UDS_ALLOCATE_EXTENDED(struct open_chapter_zone, + slot_count, + struct open_chapter_zone_slot, + "open chapter", + &open_chapter); + if (result != UDS_SUCCESS) + return result; + + open_chapter->slot_count = slot_count; + open_chapter->capacity = capacity; + result = uds_allocate_cache_aligned(records_size(open_chapter), + "record pages", + &open_chapter->records); + if (result != UDS_SUCCESS) { + uds_free_open_chapter(open_chapter); + return result; + } + + *open_chapter_ptr = open_chapter; + return UDS_SUCCESS; +} + +void uds_reset_open_chapter(struct open_chapter_zone *open_chapter) +{ + open_chapter->size = 0; + open_chapter->deletions = 0; + + memset(open_chapter->records, 0, records_size(open_chapter)); + memset(open_chapter->slots, 0, slots_size(open_chapter->slot_count)); +} + +static unsigned int +probe_chapter_slots(struct open_chapter_zone *open_chapter, const struct uds_record_name *name) +{ + struct uds_volume_record *record; + unsigned int slot_count = open_chapter->slot_count; + unsigned int slot = uds_name_to_hash_slot(name, slot_count); + unsigned int record_number; + unsigned int attempts = 1; + + while (true) { + record_number = open_chapter->slots[slot].record_number; + + /* + * If the hash slot is empty, we've reached the end of a chain without finding the + * record and should terminate the search. + */ + if (record_number == 0) + return slot; + + /* + * If the name of the record referenced by the slot matches and has not been + * deleted, then we've found the requested name. + */ + record = &open_chapter->records[record_number]; + if ((memcmp(&record->name, name, UDS_RECORD_NAME_SIZE) == 0) && + !open_chapter->slots[record_number].deleted) + return slot; + + /* + * Quadratic probing: advance the probe by 1, 2, 3, etc. and try again. This + * performs better than linear probing and works best for 2^N slots. + */ + slot = (slot + attempts++) % slot_count; + } +} + +void uds_search_open_chapter(struct open_chapter_zone *open_chapter, + const struct uds_record_name *name, + struct uds_record_data *metadata, + bool *found) +{ + unsigned int slot; + unsigned int record_number; + + slot = probe_chapter_slots(open_chapter, name); + record_number = open_chapter->slots[slot].record_number; + if (record_number == 0) { + *found = false; + } else { + *found = true; + *metadata = open_chapter->records[record_number].data; + } +} + +/* Add a record to the open chapter zone and return the remaining space. 
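To make the sizing above concrete (made-up numbers, and assuming bits_per()
returns the number of bits needed to represent its argument): a chapter of
60,000 records split across 4 zones gives each zone a capacity of 15,000
records; with LOAD_RATIO = 2 the hash table gets 1 << bits_per(30,000) = 32,768
slots, so it can never be much more than half full and insertion cannot fail.
A sketch of how an index zone might drive this structure (hypothetical fragment;
name and new_metadata are assumed):

        struct uds_record_data old_metadata;
        bool found;
        int remaining;

        /* A query can check the newest records without touching storage. */
        uds_search_open_chapter(open_chapter, name, &old_metadata, &found);

        /* An update adds the record, or rewrites it in place if the name exists. */
        remaining = uds_put_open_chapter(open_chapter, name, new_metadata);
        if (remaining == 0) {
                /* This zone is out of space, so the whole chapter should be closed. */
        }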
*/ +int uds_put_open_chapter(struct open_chapter_zone *open_chapter, + const struct uds_record_name *name, + const struct uds_record_data *metadata) +{ + unsigned int slot; + unsigned int record_number; + struct uds_volume_record *record; + + if (open_chapter->size >= open_chapter->capacity) + return 0; + + slot = probe_chapter_slots(open_chapter, name); + record_number = open_chapter->slots[slot].record_number; + + if (record_number == 0) { + record_number = ++open_chapter->size; + open_chapter->slots[slot].record_number = record_number; + } + + record = &open_chapter->records[record_number]; + record->name = *name; + record->data = *metadata; + + return open_chapter->capacity - open_chapter->size; +} + +void uds_remove_from_open_chapter(struct open_chapter_zone *open_chapter, + const struct uds_record_name *name) +{ + unsigned int slot; + unsigned int record_number; + + slot = probe_chapter_slots(open_chapter, name); + record_number = open_chapter->slots[slot].record_number; + + if (record_number > 0) { + open_chapter->slots[record_number].deleted = true; + open_chapter->deletions += 1; + } +} + +void uds_free_open_chapter(struct open_chapter_zone *open_chapter) +{ + if (open_chapter != NULL) { + UDS_FREE(open_chapter->records); + UDS_FREE(open_chapter); + } +} + +/* Map each record name to its record page number in the delta chapter index. */ +static int fill_delta_chapter_index(struct open_chapter_zone **chapter_zones, + unsigned int zone_count, + struct open_chapter_index *index, + struct uds_volume_record *collated_records) +{ + int result; + unsigned int records_per_chapter; + unsigned int records_per_page; + unsigned int record_index; + unsigned int records = 0; + u32 page_number; + unsigned int z; + int overflow_count = 0; + struct uds_volume_record *fill_record = NULL; + + /* + * The record pages should not have any empty space, so find a record with which to fill + * the chapter zone if it was closed early, and also to replace any deleted records. The + * last record in any filled zone is guaranteed to not have been deleted, so use one of + * those. + */ + for (z = 0; z < zone_count; z++) { + struct open_chapter_zone *zone = chapter_zones[z]; + + if (zone->size == zone->capacity) { + fill_record = &zone->records[zone->size]; + break; + } + } + + records_per_chapter = index->geometry->records_per_chapter; + records_per_page = index->geometry->records_per_page; + + for (records = 0; records < records_per_chapter; records++) { + struct uds_volume_record *record = &collated_records[records]; + struct open_chapter_zone *open_chapter; + + /* The record arrays in the zones are 1-based. */ + record_index = 1 + (records / zone_count); + page_number = records / records_per_page; + open_chapter = chapter_zones[records % zone_count]; + + /* Use the fill record in place of an unused record. 
*/ + if (record_index > open_chapter->size || + open_chapter->slots[record_index].deleted) { + *record = *fill_record; + continue; + } + + *record = open_chapter->records[record_index]; + result = uds_put_open_chapter_index_record(index, &record->name, page_number); + switch (result) { + case UDS_SUCCESS: + break; + case UDS_OVERFLOW: + overflow_count++; + break; + default: + uds_log_error_strerror(result, "failed to build open chapter index"); + return result; + } + } + + if (overflow_count > 0) + uds_log_warning("Failed to add %d entries to chapter index", overflow_count); + + return UDS_SUCCESS; +} + +int uds_close_open_chapter(struct open_chapter_zone **chapter_zones, + unsigned int zone_count, + struct volume *volume, + struct open_chapter_index *chapter_index, + struct uds_volume_record *collated_records, + u64 virtual_chapter_number) +{ + int result; + + uds_empty_open_chapter_index(chapter_index, virtual_chapter_number); + result = fill_delta_chapter_index(chapter_zones, + zone_count, + chapter_index, + collated_records); + if (result != UDS_SUCCESS) + return result; + + return uds_write_chapter(volume, chapter_index, collated_records); +} + +int uds_save_open_chapter(struct uds_index *index, struct buffered_writer *writer) +{ + int result; + struct open_chapter_zone *open_chapter; + struct uds_volume_record *record; + u8 record_count_data[sizeof(u32)]; + u32 record_count = 0; + unsigned int record_index; + unsigned int z; + + result = uds_write_to_buffered_writer(writer, + OPEN_CHAPTER_MAGIC, + OPEN_CHAPTER_MAGIC_LENGTH); + if (result != UDS_SUCCESS) + return result; + + result = uds_write_to_buffered_writer(writer, + OPEN_CHAPTER_VERSION, + OPEN_CHAPTER_VERSION_LENGTH); + if (result != UDS_SUCCESS) + return result; + + for (z = 0; z < index->zone_count; z++) { + open_chapter = index->zones[z]->open_chapter; + record_count += open_chapter->size - open_chapter->deletions; + } + + put_unaligned_le32(record_count, record_count_data); + result = uds_write_to_buffered_writer(writer, + record_count_data, + sizeof(record_count_data)); + if (result != UDS_SUCCESS) + return result; + + record_index = 1; + while (record_count > 0) { + for (z = 0; z < index->zone_count; z++) { + open_chapter = index->zones[z]->open_chapter; + if (record_index > open_chapter->size) + continue; + + if (open_chapter->slots[record_index].deleted) + continue; + + record = &open_chapter->records[record_index]; + result = uds_write_to_buffered_writer(writer, + (u8 *) record, + sizeof(*record)); + if (result != UDS_SUCCESS) + return result; + + record_count--; + } + + record_index++; + } + + return uds_flush_buffered_writer(writer); +} + +u64 uds_compute_saved_open_chapter_size(struct geometry *geometry) +{ + unsigned int records_per_chapter = geometry->records_per_chapter; + + return OPEN_CHAPTER_MAGIC_LENGTH + OPEN_CHAPTER_VERSION_LENGTH + sizeof(u32) + + records_per_chapter * sizeof(struct uds_volume_record); +} + +static int load_version20(struct uds_index *index, struct buffered_reader *reader) +{ + int result; + u32 record_count; + u8 record_count_data[sizeof(u32)]; + struct uds_volume_record record; + + /* + * Track which zones cannot accept any more records. If the open chapter had a different + * number of zones previously, some new zones may have more records than they have space + * for. These overflow records will be discarded. 
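The region this function parses is written by uds_save_open_chapter() above; the
layout, derived directly from that code, is:

        offset  0:  "ALBOC"         5-byte magic (no terminator)
        offset  5:  "02.00"         5-byte format version
        offset 10:  record count    u32, little-endian
        offset 14:  the records     record_count copies of struct uds_volume_record,
                                    interleaved across zones in arrival order

which matches uds_compute_saved_open_chapter_size(), except that the computed
save size always allows for a full chapter's worth of records.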
+ */ + bool full_flags[MAX_ZONES] = { + false, + }; + + result = uds_read_from_buffered_reader(reader, + (u8 *) &record_count_data, + sizeof(record_count_data)); + if (result != UDS_SUCCESS) + return result; + + record_count = get_unaligned_le32(record_count_data); + while (record_count-- > 0) { + unsigned int zone = 0; + + result = uds_read_from_buffered_reader(reader, (u8 *) &record, sizeof(record)); + if (result != UDS_SUCCESS) + return result; + + if (index->zone_count > 1) + zone = uds_get_volume_index_zone(index->volume_index, &record.name); + + if (!full_flags[zone]) { + struct open_chapter_zone *open_chapter; + unsigned int remaining; + + open_chapter = index->zones[zone]->open_chapter; + remaining = uds_put_open_chapter(open_chapter, &record.name, &record.data); + /* Do not allow any zone to fill completely. */ + full_flags[zone] = (remaining <= 1); + } + } + + return UDS_SUCCESS; +} + +int uds_load_open_chapter(struct uds_index *index, struct buffered_reader *reader) +{ + u8 version[OPEN_CHAPTER_VERSION_LENGTH]; + int result; + + result = uds_verify_buffered_data(reader, OPEN_CHAPTER_MAGIC, OPEN_CHAPTER_MAGIC_LENGTH); + if (result != UDS_SUCCESS) + return result; + + result = uds_read_from_buffered_reader(reader, version, sizeof(version)); + if (result != UDS_SUCCESS) + return result; + + if (memcmp(OPEN_CHAPTER_VERSION, version, sizeof(version)) != 0) + return uds_log_error_strerror(UDS_CORRUPT_DATA, + "Invalid open chapter version: %.*s", + (int) sizeof(version), + version); + + return load_version20(index, reader); +} diff --git a/drivers/md/dm-vdo/open-chapter.h b/drivers/md/dm-vdo/open-chapter.h new file mode 100644 index 000000000000..1fc575d2789a --- /dev/null +++ b/drivers/md/dm-vdo/open-chapter.h @@ -0,0 +1,79 @@ +/* SPDX-License-Identifier: GPL-2.0-only */ +/* + * Copyright 2023 Red Hat + */ + +#ifndef UDS_OPEN_CHAPTER_H +#define UDS_OPEN_CHAPTER_H + +#include "chapter-index.h" +#include "geometry.h" +#include "index.h" +#include "volume.h" + +/* + * The open chapter tracks the newest records in memory. Like the index as a whole, each open + * chapter is divided into a number of independent zones which are interleaved when the chapter is + * committed to the volume. 
+ */ + +enum { + OPEN_CHAPTER_RECORD_NUMBER_BITS = 23, +}; + +struct open_chapter_zone_slot { + /* If non-zero, the record number addressed by this hash slot */ + unsigned int record_number : OPEN_CHAPTER_RECORD_NUMBER_BITS; + /* If true, the record at the index of this hash slot was deleted */ + bool deleted : 1; +} __packed; + +struct open_chapter_zone { + /* The maximum number of records that can be stored */ + unsigned int capacity; + /* The number of records stored */ + unsigned int size; + /* The number of deleted records */ + unsigned int deletions; + /* Array of chunk records, 1-based */ + struct uds_volume_record *records; + /* The number of slots in the hash table */ + unsigned int slot_count; + /* The hash table slots, referencing virtual record numbers */ + struct open_chapter_zone_slot slots[]; +}; + +int __must_check uds_make_open_chapter(const struct geometry *geometry, + unsigned int zone_count, + struct open_chapter_zone **open_chapter_ptr); + +void uds_reset_open_chapter(struct open_chapter_zone *open_chapter); + +void uds_search_open_chapter(struct open_chapter_zone *open_chapter, + const struct uds_record_name *name, + struct uds_record_data *metadata, + bool *found); + +int __must_check uds_put_open_chapter(struct open_chapter_zone *open_chapter, + const struct uds_record_name *name, + const struct uds_record_data *metadata); + +void uds_remove_from_open_chapter(struct open_chapter_zone *open_chapter, + const struct uds_record_name *name); + +void uds_free_open_chapter(struct open_chapter_zone *open_chapter); + +int __must_check uds_close_open_chapter(struct open_chapter_zone **chapter_zones, + unsigned int zone_count, + struct volume *volume, + struct open_chapter_index *chapter_index, + struct uds_volume_record *collated_records, + u64 virtual_chapter_number); + +int __must_check uds_save_open_chapter(struct uds_index *index, struct buffered_writer *writer); + +int __must_check uds_load_open_chapter(struct uds_index *index, struct buffered_reader *reader); + +u64 uds_compute_saved_open_chapter_size(struct geometry *geometry); + +#endif /* UDS_OPEN_CHAPTER_H */ From patchwork Thu Oct 26 21:41:11 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Matthew Sakai X-Patchwork-Id: 13437940 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 4D0CAC25B6B for ; Thu, 26 Oct 2023 21:43:25 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S232235AbjJZVnZ (ORCPT ); Thu, 26 Oct 2023 17:43:25 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:41018 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S234816AbjJZVnY (ORCPT ); Thu, 26 Oct 2023 17:43:24 -0400 Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.133.124]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 2ADE21B1 for ; Thu, 26 Oct 2023 14:41:42 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1698356501; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=ODKufA3yy+JagkMbOskLqCtKGFaENxn//G1r0LhRZ1o=; 
b=eS5brrYTDpZ8bcIbtY5fzJZ2elmVLeunb/m2kl21ogClQwIbb0b8GUuH3bL2buywP1fAAl +3Rg84kEAPbvMrDnZLw+P17C6bXboGta+jhOoaj7NcvpCgdBvs9d0uAbckDigLDOgFI58j QH/kzYUvM7P5ADF/pRQfKb2U94GO01o= Received: from mimecast-mx02.redhat.com (mimecast-mx02.redhat.com [66.187.233.88]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.3, cipher=TLS_AES_256_GCM_SHA384) id us-mta-692-iLxlmcG5NviEPrfXxPyMFg-1; Thu, 26 Oct 2023 17:41:39 -0400 X-MC-Unique: iLxlmcG5NviEPrfXxPyMFg-1 Received: from smtp.corp.redhat.com (int-mx02.intmail.prod.int.rdu2.redhat.com [10.11.54.2]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (2048 bits) server-digest SHA256) (No client certificate requested) by mimecast-mx02.redhat.com (Postfix) with ESMTPS id 6D9B788D184; Thu, 26 Oct 2023 21:41:39 +0000 (UTC) Received: from pbitcolo-build-10.permabit.com (pbitcolo-build-10.permabit.lab.eng.bos.redhat.com [10.19.117.76]) by smtp.corp.redhat.com (Postfix) with ESMTP id 50E4140C6F79; Thu, 26 Oct 2023 21:41:39 +0000 (UTC) Received: by pbitcolo-build-10.permabit.com (Postfix, from userid 1138) id 309CF3003C; Thu, 26 Oct 2023 17:41:39 -0400 (EDT) From: Matthew Sakai To: dm-devel@lists.linux.dev, linux-block@vger.kernel.org Cc: Matthew Sakai Subject: [PATCH v4 14/39] dm vdo: implement the chapter volume store Date: Thu, 26 Oct 2023 17:41:11 -0400 Message-Id: <20231026214136.1067410-15-msakai@redhat.com> In-Reply-To: <20231026214136.1067410-1-msakai@redhat.com> References: <20231026214136.1067410-1-msakai@redhat.com> MIME-Version: 1.0 X-Scanned-By: MIMEDefang 3.4.1 on 10.11.54.2 Precedence: bulk List-ID: X-Mailing-List: linux-block@vger.kernel.org The volume store structures manage the reading and writing of chapter pages. When a chapter is closed, it is packed into a read-only structure, split across several pages, and written to storage. The volume store also contains a cache and specialized queues that sort and batch requests by the page they need, in order to minimize latency and I/O requests when records have to be read from storage. The cache and queues also coordinate with the volume index to ensure that the volume does not waste resources reading pages that are no longer valid. Co-developed-by: J. corwin Coburn Signed-off-by: J. 
corwin Coburn Co-developed-by: Michael Sclafani Signed-off-by: Michael Sclafani Co-developed-by: Thomas Jaskiewicz Signed-off-by: Thomas Jaskiewicz Co-developed-by: John Wiele Signed-off-by: John Wiele Signed-off-by: Matthew Sakai --- drivers/md/dm-vdo/index-page-map.c | 181 +++ drivers/md/dm-vdo/index-page-map.h | 54 + drivers/md/dm-vdo/radix-sort.c | 349 ++++++ drivers/md/dm-vdo/radix-sort.h | 28 + drivers/md/dm-vdo/volume.c | 1793 ++++++++++++++++++++++++++++ drivers/md/dm-vdo/volume.h | 175 +++ 6 files changed, 2580 insertions(+) create mode 100644 drivers/md/dm-vdo/index-page-map.c create mode 100644 drivers/md/dm-vdo/index-page-map.h create mode 100644 drivers/md/dm-vdo/radix-sort.c create mode 100644 drivers/md/dm-vdo/radix-sort.h create mode 100644 drivers/md/dm-vdo/volume.c create mode 100644 drivers/md/dm-vdo/volume.h diff --git a/drivers/md/dm-vdo/index-page-map.c b/drivers/md/dm-vdo/index-page-map.c new file mode 100644 index 000000000000..592d5aff7c77 --- /dev/null +++ b/drivers/md/dm-vdo/index-page-map.c @@ -0,0 +1,181 @@ +// SPDX-License-Identifier: GPL-2.0-only +/* + * Copyright 2023 Red Hat + */ + +#include "index-page-map.h" + +#include "errors.h" +#include "hash-utils.h" +#include "logger.h" +#include "memory-alloc.h" +#include "numeric.h" +#include "permassert.h" +#include "string-utils.h" +#include "uds-threads.h" +#include "uds.h" + +/* + * The index page map is conceptually a two-dimensional array indexed by chapter number and index + * page number within the chapter. Each entry contains the number of the last delta list on that + * index page. In order to save memory, the information for the last page in each chapter is not + * recorded, as it is known from the geometry. + */ + +static const u8 PAGE_MAP_MAGIC[] = "ALBIPM02"; + +enum { + PAGE_MAP_MAGIC_LENGTH = sizeof(PAGE_MAP_MAGIC) - 1, +}; + +static inline u32 get_entry_count(const struct geometry *geometry) +{ + return geometry->chapters_per_volume * (geometry->index_pages_per_chapter - 1); +} + +int uds_make_index_page_map(const struct geometry *geometry, struct index_page_map **map_ptr) +{ + int result; + struct index_page_map *map; + + result = UDS_ALLOCATE(1, struct index_page_map, "page map", &map); + if (result != UDS_SUCCESS) + return result; + + map->geometry = geometry; + map->entries_per_chapter = geometry->index_pages_per_chapter - 1; + result = UDS_ALLOCATE(get_entry_count(geometry), + u16, + "Index Page Map Entries", + &map->entries); + if (result != UDS_SUCCESS) { + uds_free_index_page_map(map); + return result; + } + + *map_ptr = map; + return UDS_SUCCESS; +} + +void uds_free_index_page_map(struct index_page_map *map) +{ + if (map != NULL) { + UDS_FREE(map->entries); + UDS_FREE(map); + } +} + +void uds_update_index_page_map(struct index_page_map *map, + u64 virtual_chapter_number, + u32 chapter_number, + u32 index_page_number, + u32 delta_list_number) +{ + size_t slot; + + map->last_update = virtual_chapter_number; + if (index_page_number == map->entries_per_chapter) + return; + + slot = (chapter_number * map->entries_per_chapter) + index_page_number; + map->entries[slot] = delta_list_number; +} + +u32 uds_find_index_page_number(const struct index_page_map *map, + const struct uds_record_name *name, + u32 chapter_number) +{ + u32 delta_list_number = uds_hash_to_chapter_delta_list(name, map->geometry); + u32 slot = chapter_number * map->entries_per_chapter; + u32 page; + + for (page = 0; page < map->entries_per_chapter; page++) { + if (delta_list_number <= map->entries[slot + page]) + break; + } + + 
return page; +} + +void uds_get_list_number_bounds(const struct index_page_map *map, + u32 chapter_number, + u32 index_page_number, + u32 *lowest_list, + u32 *highest_list) +{ + u32 slot = chapter_number * map->entries_per_chapter; + + *lowest_list = ((index_page_number == 0) ? + 0 : + map->entries[slot + index_page_number - 1] + 1); + *highest_list = ((index_page_number < map->entries_per_chapter) ? + map->entries[slot + index_page_number] : + map->geometry->delta_lists_per_chapter - 1); +} + +u64 uds_compute_index_page_map_save_size(const struct geometry *geometry) +{ + return PAGE_MAP_MAGIC_LENGTH + sizeof(u64) + sizeof(u16) * get_entry_count(geometry); +} + +int uds_write_index_page_map(struct index_page_map *map, struct buffered_writer *writer) +{ + int result; + u8 *buffer; + size_t offset = 0; + u64 saved_size = uds_compute_index_page_map_save_size(map->geometry); + u32 i; + + result = UDS_ALLOCATE(saved_size, u8, "page map data", &buffer); + if (result != UDS_SUCCESS) + return result; + + memcpy(buffer, PAGE_MAP_MAGIC, PAGE_MAP_MAGIC_LENGTH); + offset += PAGE_MAP_MAGIC_LENGTH; + encode_u64_le(buffer, &offset, map->last_update); + for (i = 0; i < get_entry_count(map->geometry); i++) + encode_u16_le(buffer, &offset, map->entries[i]); + + result = uds_write_to_buffered_writer(writer, buffer, offset); + UDS_FREE(buffer); + if (result != UDS_SUCCESS) + return result; + + return uds_flush_buffered_writer(writer); +} + +int uds_read_index_page_map(struct index_page_map *map, struct buffered_reader *reader) +{ + int result; + u8 magic[PAGE_MAP_MAGIC_LENGTH]; + u8 *buffer; + size_t offset = 0; + u64 saved_size = uds_compute_index_page_map_save_size(map->geometry); + u32 i; + + result = UDS_ALLOCATE(saved_size, u8, "page map data", &buffer); + if (result != UDS_SUCCESS) + return result; + + result = uds_read_from_buffered_reader(reader, buffer, saved_size); + if (result != UDS_SUCCESS) { + UDS_FREE(buffer); + return result; + } + + memcpy(&magic, buffer, PAGE_MAP_MAGIC_LENGTH); + offset += PAGE_MAP_MAGIC_LENGTH; + if (memcmp(magic, PAGE_MAP_MAGIC, PAGE_MAP_MAGIC_LENGTH) != 0) { + UDS_FREE(buffer); + return UDS_CORRUPT_DATA; + } + + decode_u64_le(buffer, &offset, &map->last_update); + for (i = 0; i < get_entry_count(map->geometry); i++) + decode_u16_le(buffer, &offset, &map->entries[i]); + + UDS_FREE(buffer); + uds_log_debug("read index page map, last update %llu", + (unsigned long long) map->last_update); + return UDS_SUCCESS; +} diff --git a/drivers/md/dm-vdo/index-page-map.h b/drivers/md/dm-vdo/index-page-map.h new file mode 100644 index 000000000000..14c84051af51 --- /dev/null +++ b/drivers/md/dm-vdo/index-page-map.h @@ -0,0 +1,54 @@ +/* SPDX-License-Identifier: GPL-2.0-only */ +/* + * Copyright 2023 Red Hat + */ + +#ifndef UDS_INDEX_PAGE_MAP_H +#define UDS_INDEX_PAGE_MAP_H + +#include "geometry.h" +#include "io-factory.h" + +/* + * The index maintains a page map which records how the chapter delta lists are distributed among + * the index pages for each chapter, allowing the volume to be efficient about reading only pages + * that it knows it will need. 
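A concrete example with made-up numbers: with 4 index pages per chapter, only 3
entries are stored for each chapter, say {40, 80, 120}, meaning index page 0
ends with delta list 40, page 1 with delta list 80, and page 2 with delta list
120. A name that hashes to delta list 75 is assigned to page 1 by
uds_find_index_page_number(), and uds_get_list_number_bounds() reports that
page 1 covers lists 41 through 80. The final page's range (121 through
delta_lists_per_chapter - 1) never needs to be stored because it is implied by
the geometry.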
+ */ + +struct index_page_map { + const struct geometry *geometry; + u64 last_update; + u32 entries_per_chapter; + u16 *entries; +}; + +int __must_check +uds_make_index_page_map(const struct geometry *geometry, struct index_page_map **map_ptr); + +void uds_free_index_page_map(struct index_page_map *map); + +int __must_check +uds_read_index_page_map(struct index_page_map *map, struct buffered_reader *reader); + +int __must_check +uds_write_index_page_map(struct index_page_map *map, struct buffered_writer *writer); + +void uds_update_index_page_map(struct index_page_map *map, + u64 virtual_chapter_number, + u32 chapter_number, + u32 index_page_number, + u32 delta_list_number); + +u32 __must_check uds_find_index_page_number(const struct index_page_map *map, + const struct uds_record_name *name, + u32 chapter_number); + +void uds_get_list_number_bounds(const struct index_page_map *map, + u32 chapter_number, + u32 index_page_number, + u32 *lowest_list, + u32 *highest_list); + +u64 uds_compute_index_page_map_save_size(const struct geometry *geometry); + +#endif /* UDS_INDEX_PAGE_MAP_H */ diff --git a/drivers/md/dm-vdo/radix-sort.c b/drivers/md/dm-vdo/radix-sort.c new file mode 100644 index 000000000000..7eeb81a6deb6 --- /dev/null +++ b/drivers/md/dm-vdo/radix-sort.c @@ -0,0 +1,349 @@ +// SPDX-License-Identifier: GPL-2.0-only +/* + * Copyright 2023 Red Hat + */ + +#include "radix-sort.h" + +#include +#include + +#include "memory-alloc.h" +#include "string-utils.h" + +/* + * This implementation allocates one large object to do the sorting, which can be reused as many + * times as desired. The amount of memory required is logarithmically proportional to the number of + * keys to be sorted. + */ + +enum { + /* Piles smaller than this are handled with a simple insertion sort. */ + INSERTION_SORT_THRESHOLD = 12, +}; + +/* Sort keys are pointers to immutable fixed-length arrays of bytes. */ +typedef const u8 *sort_key_t; + +/* + * The keys are separated into piles based on the byte in each keys at the current offset, so the + * number of keys with each byte must be counted. + */ +struct histogram { + /* The number of non-empty bins */ + u16 used; + /* The index (key byte) of the first non-empty bin */ + u16 first; + /* The index (key byte) of the last non-empty bin */ + u16 last; + /* The number of occurrences of each specific byte */ + u32 size[256]; +}; + +/* + * Sub-tasks are manually managed on a stack, both for performance and to put a logarithmic bound + * on the stack space needed. + */ +struct task { + /* Pointer to the first key to sort. */ + sort_key_t *first_key; + /* Pointer to the last key to sort. */ + sort_key_t *last_key; + /* The offset into the key at which to continue sorting. */ + u16 offset; + /* The number of bytes remaining in the sort keys. */ + u16 length; +}; + +struct radix_sorter { + unsigned int count; + struct histogram bins; + sort_key_t *pile[256]; + struct task *end_of_stack; + struct task insertion_list[256]; + struct task stack[]; +}; + +/* Compare a segment of two fixed-length keys starting at an offset. */ +static inline int compare(sort_key_t key1, sort_key_t key2, u16 offset, u16 length) +{ + return memcmp(&key1[offset], &key2[offset], length); +} + +/* Insert the next unsorted key into an array of sorted keys. */ +static inline void insert_key(const struct task task, sort_key_t *next) +{ + /* Pull the unsorted key out, freeing up the array slot. 
*/ + sort_key_t unsorted = *next; + + /* Compare the key to the preceding sorted entries, shifting down ones that are larger. */ + while ((--next >= task.first_key) && + (compare(unsorted, next[0], task.offset, task.length) < 0)) + next[1] = next[0]; + + /* Insert the key into the last slot that was cleared, sorting it. */ + next[1] = unsorted; +} + +/* + * Sort a range of key segments using an insertion sort. This simple sort is faster than the + * 256-way radix sort when the number of keys to sort is small. + */ +static inline void insertion_sort(const struct task task) +{ + sort_key_t *next; + + for (next = task.first_key + 1; next <= task.last_key; next++) + insert_key(task, next); +} + +/* Push a sorting task onto a task stack. */ +static inline void push_task(struct task **stack_pointer, + sort_key_t *first_key, + u32 count, + u16 offset, + u16 length) +{ + struct task *task = (*stack_pointer)++; + + task->first_key = first_key; + task->last_key = &first_key[count - 1]; + task->offset = offset; + task->length = length; +} + +static inline void swap_keys(sort_key_t *a, sort_key_t *b) +{ + sort_key_t c = *a; + *a = *b; + *b = c; +} + +/* + * Count the number of times each byte value appears in the arrays of keys to sort at the current + * offset, keeping track of the number of non-empty bins, and the index of the first and last + * non-empty bin. + */ +static inline void measure_bins(const struct task task, struct histogram *bins) +{ + sort_key_t *key_ptr; + + /* + * Subtle invariant: bins->used and bins->size[] are zero because the sorting code clears + * it all out as it goes. Even though this structure is re-used, we don't need to pay to + * zero it before starting a new tally. + */ + bins->first = U8_MAX; + bins->last = 0; + + for (key_ptr = task.first_key; key_ptr <= task.last_key; key_ptr++) { + /* Increment the count for the byte in the key at the current offset. */ + u8 bin = (*key_ptr)[task.offset]; + u32 size = ++bins->size[bin]; + + /* Track non-empty bins. */ + if (size == 1) { + bins->used += 1; + if (bin < bins->first) + bins->first = bin; + + if (bin > bins->last) + bins->last = bin; + } + } +} + +/* + * Convert the bin sizes to pointers to where each pile goes. + * + * pile[0] = first_key + bin->size[0], + * pile[1] = pile[0] + bin->size[1], etc. + * + * After the keys are moved to the appropriate pile, we'll need to sort each of the piles by the + * next radix position. A new task is put on the stack for each pile containing lots of keys, or a + * new task is put on the list for each pile containing few keys. + * + * @stack: pointer the top of the stack + * @end_of_stack: the end of the stack + * @list: pointer the head of the list + * @pile: array for pointers to the end of each pile + * @bins: the histogram of the sizes of each pile + * @first_key: the first key of the stack + * @offset: the next radix position to sort by + * @length: the number of bytes remaining in the sort keys + * + * Return: UDS_SUCCESS or an error code + */ +static inline int push_bins(struct task **stack, + struct task *end_of_stack, + struct task **list, + sort_key_t *pile[], + struct histogram *bins, + sort_key_t *first_key, + u16 offset, + u16 length) +{ + sort_key_t *pile_start = first_key; + int bin; + + for (bin = bins->first; ; bin++) { + u32 size = bins->size[bin]; + + /* Skip empty piles. */ + if (size == 0) + continue; + + /* There's no need to sort empty keys. 
*/ + if (length > 0) { + if (size > INSERTION_SORT_THRESHOLD) { + if (*stack >= end_of_stack) + return UDS_BAD_STATE; + + push_task(stack, pile_start, size, offset, length); + } else if (size > 1) { + push_task(list, pile_start, size, offset, length); + } + } + + pile_start += size; + pile[bin] = pile_start; + if (--bins->used == 0) + break; + } + + return UDS_SUCCESS; +} + +int uds_make_radix_sorter(unsigned int count, struct radix_sorter **sorter) +{ + int result; + unsigned int stack_size = count / INSERTION_SORT_THRESHOLD; + struct radix_sorter *radix_sorter; + + result = UDS_ALLOCATE_EXTENDED(struct radix_sorter, + stack_size, + struct task, + __func__, + &radix_sorter); + if (result != UDS_SUCCESS) + return result; + + radix_sorter->count = count; + radix_sorter->end_of_stack = radix_sorter->stack + stack_size; + *sorter = radix_sorter; + return UDS_SUCCESS; +} + +void uds_free_radix_sorter(struct radix_sorter *sorter) +{ + UDS_FREE(sorter); +} + +/* + * Sort pointers to fixed-length keys (arrays of bytes) using a radix sort. The sort implementation + * is unstable, so the relative ordering of equal keys is not preserved. + */ +int uds_radix_sort(struct radix_sorter *sorter, + const unsigned char *keys[], + unsigned int count, + unsigned short length) +{ + struct task start; + struct histogram *bins = &sorter->bins; + sort_key_t **pile = sorter->pile; + struct task *task_stack = sorter->stack; + + /* All zero-length keys are identical and therefore already sorted. */ + if ((count == 0) || (length == 0)) + return UDS_SUCCESS; + + /* The initial task is to sort the entire length of all the keys. */ + start = (struct task) { + .first_key = keys, + .last_key = &keys[count - 1], + .offset = 0, + .length = length, + }; + + if (count <= INSERTION_SORT_THRESHOLD) { + insertion_sort(start); + return UDS_SUCCESS; + } + + if (count > sorter->count) + return UDS_INVALID_ARGUMENT; + + /* + * Repeatedly consume a sorting task from the stack and process it, pushing new sub-tasks + * onto the stack for each radix-sorted pile. When all tasks and sub-tasks have been + * processed, the stack will be empty and all the keys in the starting task will be fully + * sorted. + */ + for (*task_stack = start; task_stack >= sorter->stack; task_stack--) { + const struct task task = *task_stack; + struct task *insertion_task_list; + int result; + sort_key_t *fence; + sort_key_t *end; + + measure_bins(task, bins); + + /* + * Now that we know how large each bin is, generate pointers for each of the piles + * and push a new task to sort each pile by the next radix byte. + */ + insertion_task_list = sorter->insertion_list; + result = push_bins(&task_stack, + sorter->end_of_stack, + &insertion_task_list, + pile, + bins, + task.first_key, + task.offset + 1, + task.length - 1); + if (result != UDS_SUCCESS) { + memset(bins, 0, sizeof(*bins)); + return result; + } + + /* Now bins->used is zero again. */ + + /* + * Don't bother processing the last pile: when piles 0..N-1 are all in place, then + * pile N must also be in place. + */ + end = task.last_key - bins->size[bins->last]; + bins->size[bins->last] = 0; + + for (fence = task.first_key; fence <= end; ) { + u8 bin; + sort_key_t key = *fence; + + /* + * The radix byte of the key tells us which pile it belongs in. Swap it for + * an unprocessed item just below that pile, and repeat. + */ + while (--pile[bin = key[task.offset]] > fence) + swap_keys(pile[bin], &key); + + /* + * The pile reached the fence. 
Put the key at the bottom of that pile, + * completing it, and advance the fence to the next pile. + */ + *fence = key; + fence += bins->size[bin]; + bins->size[bin] = 0; + } + + /* Now bins->size[] is all zero again. */ + + /* + * When the number of keys in a task gets small enough, it is faster to use an + * insertion sort than to keep subdividing into tiny piles. + */ + while (--insertion_task_list >= sorter->insertion_list) + insertion_sort(*insertion_task_list); + } + + return UDS_SUCCESS; +} diff --git a/drivers/md/dm-vdo/radix-sort.h b/drivers/md/dm-vdo/radix-sort.h new file mode 100644 index 000000000000..f6f0b62be0ee --- /dev/null +++ b/drivers/md/dm-vdo/radix-sort.h @@ -0,0 +1,28 @@ +/* SPDX-License-Identifier: GPL-2.0-only */ +/* + * Copyright 2023 Red Hat + */ + +#ifndef UDS_RADIX_SORT_H +#define UDS_RADIX_SORT_H + +/* + * Radix sort is implemented using an American Flag sort, an unstable, in-place 8-bit radix + * exchange sort. This is adapted from the algorithm in the paper by Peter M. McIlroy, Keith + * Bostic, and M. Douglas McIlroy, "Engineering Radix Sort". + * + * http://www.usenix.org/publications/compsystems/1993/win_mcilroy.pdf + */ + +struct radix_sorter; + +int __must_check uds_make_radix_sorter(unsigned int count, struct radix_sorter **sorter); + +void uds_free_radix_sorter(struct radix_sorter *sorter); + +int __must_check uds_radix_sort(struct radix_sorter *sorter, + const unsigned char *keys[], + unsigned int count, + unsigned short length); + +#endif /* UDS_RADIX_SORT_H */ diff --git a/drivers/md/dm-vdo/volume.c b/drivers/md/dm-vdo/volume.c new file mode 100644 index 000000000000..fef5c25248db --- /dev/null +++ b/drivers/md/dm-vdo/volume.c @@ -0,0 +1,1793 @@ +// SPDX-License-Identifier: GPL-2.0-only +/* + * Copyright 2023 Red Hat + */ + +#include "volume.h" + +#include +#include +#include + +#include "chapter-index.h" +#include "config.h" +#include "errors.h" +#include "geometry.h" +#include "hash-utils.h" +#include "index.h" +#include "logger.h" +#include "memory-alloc.h" +#include "permassert.h" +#include "sparse-cache.h" +#include "string-utils.h" +#include "uds-threads.h" + +/* + * The first block of the volume layout is reserved for the volume header, which is no longer used. + * The remainder of the volume is divided into chapters consisting of several pages of records, and + * several pages of static index to use to find those records. The index pages are recorded first, + * followed by the record pages. The chapters are written in order as they are filled, so the + * volume storage acts as a circular log of the most recent chapters, with each new chapter + * overwriting the oldest saved one. + * + * When a new chapter is filled and closed, the records from that chapter are sorted and + * interleaved in approximate temporal order, and assigned to record pages. Then a static delta + * index is generated to store which record page contains each record. The in-memory index page map + * is also updated to indicate which delta lists fall on each chapter index page. This means that + * when a record is read, the volume only has to load a single index page and a single record page, + * rather than search the entire chapter. These index and record pages are written to storage, and + * the index pages are transferred to the page cache under the theory that the most recently + * written chapter is likely to be accessed again soon. + * + * When reading a record, the volume index will indicate which chapter should contain it. 
The + * volume uses the index page map to determine which chapter index page needs to be loaded, and + * then reads the relevant record page number from the chapter index. Both index and record pages + * are stored in a page cache when read for the common case that subsequent records need the same + * pages. The page cache evicts the least recently accessed entries when caching new pages. In + * addition, the volume uses dm-bufio to manage access to the storage, which may allow for + * additional caching depending on available system resources. + * + * Record requests are handled from cached pages when possible. If a page needs to be read, it is + * placed on a queue along with the request that wants to read it. Any requests for the same page + * that arrive while the read is pending are added to the queue entry. A separate reader thread + * handles the queued reads, adding the page to the cache and updating any requests queued with it + * so they can continue processing. This allows the index zone threads to continue processing new + * requests rather than wait for the storage reads. + * + * When an index rebuild is necessary, the volume reads each stored chapter to determine which + * range of chapters contain valid records, so that those records can be used to reconstruct the + * in-memory volume index. + */ + +enum { + /* The maximum allowable number of contiguous bad chapters */ + MAX_BAD_CHAPTERS = 100, + VOLUME_CACHE_MAX_ENTRIES = (U16_MAX >> 1), + VOLUME_CACHE_QUEUED_FLAG = (1 << 15), + VOLUME_CACHE_MAX_QUEUED_READS = 4096, +}; + +static const u64 BAD_CHAPTER = U64_MAX; + +/* + * The invalidate counter is two 32 bits fields stored together atomically. The low order 32 bits + * are the physical page number of the cached page being read. The high order 32 bits are a + * sequence number. This value is written when the zone that owns it begins or completes a cache + * search. Any other thread will only read the counter in wait_for_pending_searches() while waiting + * to update the cache contents. + */ +union invalidate_counter { + u64 value; + struct { + u32 page; + u32 counter; + }; +}; + +static inline u32 map_to_page_number(struct geometry *geometry, u32 physical_page) +{ + return (physical_page - HEADER_PAGES_PER_VOLUME) % geometry->pages_per_chapter; +} + +static inline u32 map_to_chapter_number(struct geometry *geometry, u32 physical_page) +{ + return (physical_page - HEADER_PAGES_PER_VOLUME) / geometry->pages_per_chapter; +} + +static inline bool is_record_page(struct geometry *geometry, u32 physical_page) +{ + return map_to_page_number(geometry, physical_page) >= geometry->index_pages_per_chapter; +} + +static u32 map_to_physical_page(const struct geometry *geometry, u32 chapter, u32 page) +{ + /* Page zero is the header page, so the first chapter index page is page one. 
*/ + return HEADER_PAGES_PER_VOLUME + (geometry->pages_per_chapter * chapter) + page; +} + +static inline union invalidate_counter +get_invalidate_counter(struct page_cache *cache, unsigned int zone_number) +{ + return (union invalidate_counter) { + .value = READ_ONCE(cache->search_pending_counters[zone_number].atomic_value), + }; +} + +static inline void set_invalidate_counter(struct page_cache *cache, + unsigned int zone_number, + union invalidate_counter invalidate_counter) +{ + WRITE_ONCE(cache->search_pending_counters[zone_number].atomic_value, + invalidate_counter.value); +} + +static inline bool search_pending(union invalidate_counter invalidate_counter) +{ + return (invalidate_counter.counter & 1) != 0; +} + +/* Lock the cache for a zone in order to search for a page. */ +static void +begin_pending_search(struct page_cache *cache, u32 physical_page, unsigned int zone_number) +{ + union invalidate_counter invalidate_counter = get_invalidate_counter(cache, zone_number); + + invalidate_counter.page = physical_page; + invalidate_counter.counter++; + set_invalidate_counter(cache, zone_number, invalidate_counter); + ASSERT_LOG_ONLY(search_pending(invalidate_counter), + "Search is pending for zone %u", + zone_number); + /* + * This memory barrier ensures that the write to the invalidate counter is seen by other + * threads before this thread accesses the cached page. The corresponding read memory + * barrier is in wait_for_pending_searches(). + */ + smp_mb(); +} + +/* Unlock the cache for a zone by clearing its invalidate counter. */ +static void end_pending_search(struct page_cache *cache, unsigned int zone_number) +{ + union invalidate_counter invalidate_counter; + + /* + * This memory barrier ensures that this thread completes reads of the + * cached page before other threads see the write to the invalidate + * counter. + */ + smp_mb(); + + invalidate_counter = get_invalidate_counter(cache, zone_number); + ASSERT_LOG_ONLY(search_pending(invalidate_counter), + "Search is pending for zone %u", + zone_number); + invalidate_counter.counter++; + set_invalidate_counter(cache, zone_number, invalidate_counter); +} + +static void wait_for_pending_searches(struct page_cache *cache, u32 physical_page) +{ + union invalidate_counter initial_counters[MAX_ZONES]; + unsigned int i; + + /* + * We hold the read_threads_mutex. We are waiting for threads that do not hold the + * read_threads_mutex. Those threads have "locked" their targeted page by setting the + * search_pending_counter. The corresponding write memory barrier is in + * begin_pending_search(). + */ + smp_mb(); + + for (i = 0; i < cache->zone_count; i++) + initial_counters[i] = get_invalidate_counter(cache, i); + for (i = 0; i < cache->zone_count; i++) + if (search_pending(initial_counters[i]) && + (initial_counters[i].page == physical_page)) + /* + * There is an active search using the physical page. We need to wait for + * the search to finish. + * + * FIXME: Investigate using wait_event() to wait for the search to finish. + */ + while (initial_counters[i].value == get_invalidate_counter(cache, i).value) + cond_resched(); +} + +static void release_page_buffer(struct cached_page *page) +{ + if (page->buffer != NULL) + dm_bufio_release(UDS_FORGET(page->buffer)); +} + +static void clear_cache_page(struct page_cache *cache, struct cached_page *page) +{ + /* Do not clear read_pending because the read queue relies on it. 
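As an illustrative aside (not part of the patch), the protocol implemented by begin_pending_search(), end_pending_search(), and wait_for_pending_searches() above can be modeled in userspace with C11 atomics. The sketch_* names, sched_yield(), and the plain seq_cst atomics are stand-ins for the kernel primitives (the explicit smp_mb() pairing is covered by the default ordering here):

#include <sched.h>
#include <stdatomic.h>
#include <stdint.h>

struct sketch_slot {
	_Atomic uint64_t value;	/* low 32 bits: page, high 32 bits: counter */
};

/* A zone thread makes its counter odd before searching the cache... */
static void sketch_begin_search(struct sketch_slot *slot, uint32_t page)
{
	uint32_t counter = (uint32_t)(atomic_load(&slot->value) >> 32) + 1;

	atomic_store(&slot->value, ((uint64_t)counter << 32) | page);
}

/* ...and even again once the search is finished. */
static void sketch_end_search(struct sketch_slot *slot)
{
	atomic_store(&slot->value, atomic_load(&slot->value) + ((uint64_t)1 << 32));
}

/* An invalidator waits until no zone has an odd counter naming this page. */
static void sketch_wait_for_searches(struct sketch_slot *slots, int zones, uint32_t page)
{
	for (int i = 0; i < zones; i++) {
		uint64_t initial = atomic_load(&slots[i].value);

		if (!((initial >> 32) & 1) || ((uint32_t)initial != page))
			continue;

		/* Spin until that zone begins or ends another search. */
		while (atomic_load(&slots[i].value) == initial)
			sched_yield();
	}
}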
*/ + release_page_buffer(page); + page->physical_page = cache->indexable_pages; + WRITE_ONCE(page->last_used, 0); +} + +static void make_page_most_recent(struct page_cache *cache, struct cached_page *page) +{ + /* + * ASSERTION: We are either a zone thread holding a search_pending_counter, or we are any + * thread holding the read_threads_mutex. + */ + if (atomic64_read(&cache->clock) != READ_ONCE(page->last_used)) + WRITE_ONCE(page->last_used, atomic64_inc_return(&cache->clock)); +} + +/* Select a page to remove from the cache to make space for a new entry. */ +static struct cached_page *select_victim_in_cache(struct page_cache *cache) +{ + struct cached_page *page; + int oldest_index = 0; + s64 oldest_time = S64_MAX; + s64 last_used; + u16 i; + + /* Find the oldest unclaimed page. We hold the read_threads_mutex. */ + for (i = 0; i < cache->cache_slots; i++) { + /* A page with a pending read must not be replaced. */ + if (cache->cache[i].read_pending) + continue; + + last_used = READ_ONCE(cache->cache[i].last_used); + if (last_used <= oldest_time) { + oldest_time = last_used; + oldest_index = i; + } + } + + page = &cache->cache[oldest_index]; + if (page->physical_page != cache->indexable_pages) { + WRITE_ONCE(cache->index[page->physical_page], cache->cache_slots); + wait_for_pending_searches(cache, page->physical_page); + } + + page->read_pending = true; + clear_cache_page(cache, page); + return page; +} + +/* Make a newly filled cache entry available to other threads. */ +static int +put_page_in_cache(struct page_cache *cache, u32 physical_page, struct cached_page *page) +{ + int result; + + /* We hold the read_threads_mutex. */ + result = ASSERT((page->read_pending), "page to install has a pending read"); + if (result != UDS_SUCCESS) + return result; + + page->physical_page = physical_page; + make_page_most_recent(cache, page); + page->read_pending = false; + + /* + * We hold the read_threads_mutex, but we must have a write memory barrier before making + * the cached_page available to the readers that do not hold the mutex. The corresponding + * read memory barrier is in get_page_and_index(). + */ + smp_wmb(); + + /* This assignment also clears the queued flag. */ + WRITE_ONCE(cache->index[physical_page], page - cache->cache); + return UDS_SUCCESS; +} + +static void cancel_page_in_cache(struct page_cache *cache, + u32 physical_page, + struct cached_page *page) +{ + int result; + + /* We hold the read_threads_mutex. */ + result = ASSERT((page->read_pending), "page to install has a pending read"); + if (result != UDS_SUCCESS) + return; + + clear_cache_page(cache, page); + page->read_pending = false; + + /* Clear the mapping and the queued flag for the new page. */ + WRITE_ONCE(cache->index[physical_page], cache->cache_slots); +} + +static inline u16 next_queue_position(u16 position) +{ + return (position + 1) % VOLUME_CACHE_MAX_QUEUED_READS; +} + +static inline void advance_queue_position(u16 *position) +{ + *position = next_queue_position(*position); +} + +static inline bool read_queue_is_full(struct page_cache *cache) +{ + return cache->read_queue_first == next_queue_position(cache->read_queue_last); +} + +static bool +enqueue_read(struct page_cache *cache, struct uds_request *request, u32 physical_page) +{ + struct queued_read *queue_entry; + u16 last = cache->read_queue_last; + u16 read_queue_index; + + /* We hold the read_threads_mutex. */ + if ((cache->index[physical_page] & VOLUME_CACHE_QUEUED_FLAG) == 0) { + /* This page has no existing entry in the queue. 
*/ + if (read_queue_is_full(cache)) + return false; + + /* Fill in the read queue entry. */ + cache->read_queue[last].physical_page = physical_page; + cache->read_queue[last].invalid = false; + cache->read_queue[last].first_request = NULL; + cache->read_queue[last].last_request = NULL; + + /* Point the cache index to the read queue entry. */ + read_queue_index = last; + WRITE_ONCE(cache->index[physical_page], + read_queue_index | VOLUME_CACHE_QUEUED_FLAG); + + advance_queue_position(&cache->read_queue_last); + } else { + /* It's already queued, so add this request to the existing entry. */ + read_queue_index = cache->index[physical_page] & ~VOLUME_CACHE_QUEUED_FLAG; + } + + request->next_request = NULL; + queue_entry = &cache->read_queue[read_queue_index]; + if (queue_entry->first_request == NULL) + queue_entry->first_request = request; + else + queue_entry->last_request->next_request = request; + queue_entry->last_request = request; + + return true; +} + +static void +enqueue_page_read(struct volume *volume, struct uds_request *request, u32 physical_page) +{ + /* Mark the page as queued, so that chapter invalidation knows to cancel a read. */ + while (!enqueue_read(&volume->page_cache, request, physical_page)) { + uds_log_debug("Read queue full, waiting for reads to finish"); + uds_wait_cond(&volume->read_threads_read_done_cond, &volume->read_threads_mutex); + } + + uds_signal_cond(&volume->read_threads_cond); +} + +/* + * Reserve the next read queue entry for processing, but do not actually remove it from the queue. + * Must be followed by release_queued_requests(). + */ +static struct queued_read *reserve_read_queue_entry(struct page_cache *cache) +{ + /* We hold the read_threads_mutex. */ + struct queued_read *entry; + u16 index_value; + bool queued; + + /* No items to dequeue */ + if (cache->read_queue_next_read == cache->read_queue_last) + return NULL; + + entry = &cache->read_queue[cache->read_queue_next_read]; + index_value = cache->index[entry->physical_page]; + queued = (index_value & VOLUME_CACHE_QUEUED_FLAG) != 0; + /* Check to see if it's still queued before resetting. */ + if (entry->invalid && queued) + WRITE_ONCE(cache->index[entry->physical_page], cache->cache_slots); + + /* + * If a synchronous read has taken this page, set invalid to true so it doesn't get + * overwritten. Requests will just be requeued. 
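As an illustrative aside (not part of the patch), the cache->index[] encoding that enqueue_read() and the lookup paths rely on can be summarized by a hypothetical decoder; because VOLUME_CACHE_MAX_ENTRIES is U16_MAX >> 1, the queued flag bit can never collide with a valid cache slot or queue position:

enum sketch_page_state { SKETCH_ABSENT, SKETCH_CACHED, SKETCH_QUEUED };

/* Hypothetical helper: classify one u16 entry of cache->index[]. */
static enum sketch_page_state
sketch_decode_index_entry(u16 entry, u16 cache_slots, u16 *position)
{
	if (entry & VOLUME_CACHE_QUEUED_FLAG) {
		/* A read for this page is pending at read_queue[*position]. */
		*position = entry & ~VOLUME_CACHE_QUEUED_FLAG;
		return SKETCH_QUEUED;
	}

	if (entry < cache_slots) {
		/* The page is resident in cache->cache[*position]. */
		*position = entry;
		return SKETCH_CACHED;
	}

	/* Entries are initialized to cache_slots, meaning "not present". */
	return SKETCH_ABSENT;
}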
+ */ + if (!queued) + entry->invalid = true; + + entry->reserved = true; + advance_queue_position(&cache->read_queue_next_read); + return entry; +} + +static inline struct queued_read *wait_to_reserve_read_queue_entry(struct volume *volume) +{ + struct queued_read *queue_entry = NULL; + + while (!volume->read_threads_exiting) { + queue_entry = reserve_read_queue_entry(&volume->page_cache); + if (queue_entry != NULL) + break; + + uds_wait_cond(&volume->read_threads_cond, &volume->read_threads_mutex); + } + + return queue_entry; +} + +static int init_chapter_index_page(const struct volume *volume, + u8 *index_page, + u32 chapter, + u32 index_page_number, + struct delta_index_page *chapter_index_page) +{ + u64 ci_virtual; + u32 ci_chapter; + u32 lowest_list; + u32 highest_list; + struct geometry *geometry = volume->geometry; + int result; + + result = uds_initialize_chapter_index_page(chapter_index_page, + geometry, + index_page, + volume->nonce); + if (volume->lookup_mode == LOOKUP_FOR_REBUILD) + return result; + + if (result != UDS_SUCCESS) + return uds_log_error_strerror(result, + "Reading chapter index page for chapter %u page %u", + chapter, + index_page_number); + + uds_get_list_number_bounds(volume->index_page_map, + chapter, + index_page_number, + &lowest_list, + &highest_list); + ci_virtual = chapter_index_page->virtual_chapter_number; + ci_chapter = uds_map_to_physical_chapter(geometry, ci_virtual); + if ((chapter == ci_chapter) && + (lowest_list == chapter_index_page->lowest_list_number) && + (highest_list == chapter_index_page->highest_list_number)) + return UDS_SUCCESS; + + uds_log_warning("Index page map updated to %llu", + (unsigned long long) volume->index_page_map->last_update); + uds_log_warning("Page map expects that chapter %u page %u has range %u to %u, but chapter index page has chapter %llu with range %u to %u", + chapter, + index_page_number, + lowest_list, + highest_list, + (unsigned long long) ci_virtual, + chapter_index_page->lowest_list_number, + chapter_index_page->highest_list_number); + return uds_log_error_strerror(UDS_CORRUPT_DATA, + "index page map mismatch with chapter index"); +} + +static int initialize_index_page(const struct volume *volume, + u32 physical_page, + struct cached_page *page) +{ + u32 chapter = map_to_chapter_number(volume->geometry, physical_page); + u32 index_page_number = map_to_page_number(volume->geometry, physical_page); + + return init_chapter_index_page(volume, + dm_bufio_get_block_data(page->buffer), + chapter, + index_page_number, + &page->index_page); +} + +static bool +search_record_page(const u8 record_page[], + const struct uds_record_name *name, + const struct geometry *geometry, + struct uds_record_data *metadata) +{ + /* + * The array of records is sorted by name and stored as a binary tree in heap order, so the + * root of the tree is the first array element. + */ + u32 node = 0; + const struct uds_volume_record *records = (const struct uds_volume_record *) record_page; + + while (node < geometry->records_per_page) { + int result; + const struct uds_volume_record *record = &records[node]; + + result = memcmp(name, &record->name, UDS_RECORD_NAME_SIZE); + if (result == 0) { + if (metadata != NULL) + *metadata = record->data; + return true; + } + + /* The children of node N are at indexes 2N+1 and 2N+2. */ + node = ((2 * node) + ((result < 0) ? 
1 : 2)); + } + + return false; +} + +/* + * If we've read in a record page, we're going to do an immediate search, to speed up processing by + * avoiding get_record_from_zone(), and to ensure that requests make progress even when queued. If + * we've read in an index page, we save the record page number so we don't have to resolve the + * index page again. We use the location, virtual_chapter, and old_metadata fields in the request + * to allow the index code to know where to begin processing the request again. + */ +static int search_page(struct cached_page *page, + const struct volume *volume, + struct uds_request *request, + u32 physical_page) +{ + int result; + enum uds_index_region location; + u16 record_page_number; + + if (is_record_page(volume->geometry, physical_page)) { + if (search_record_page(dm_bufio_get_block_data(page->buffer), + &request->record_name, + volume->geometry, + &request->old_metadata)) + location = UDS_LOCATION_RECORD_PAGE_LOOKUP; + else + location = UDS_LOCATION_UNAVAILABLE; + } else { + result = uds_search_chapter_index_page(&page->index_page, + volume->geometry, + &request->record_name, + &record_page_number); + if (result != UDS_SUCCESS) + return result; + + if (record_page_number == NO_CHAPTER_INDEX_ENTRY) { + location = UDS_LOCATION_UNAVAILABLE; + } else { + location = UDS_LOCATION_INDEX_PAGE_LOOKUP; + *((u16 *) &request->old_metadata) = record_page_number; + } + } + + request->location = location; + request->found = false; + return UDS_SUCCESS; +} + +static int process_entry(struct volume *volume, struct queued_read *entry) +{ + u32 page_number = entry->physical_page; + struct uds_request *request; + struct cached_page *page = NULL; + u8 *page_data; + int result; + + if (entry->invalid) { + uds_log_debug("Requeuing requests for invalid page"); + return UDS_SUCCESS; + } + + page = select_victim_in_cache(&volume->page_cache); + + uds_unlock_mutex(&volume->read_threads_mutex); + page_data = dm_bufio_read(volume->client, page_number, &page->buffer); + if (IS_ERR(page_data)) { + result = -PTR_ERR(page_data); + uds_log_warning_strerror(result, + "error reading physical page %u from volume", + page_number); + cancel_page_in_cache(&volume->page_cache, page_number, page); + return result; + } + uds_lock_mutex(&volume->read_threads_mutex); + + if (entry->invalid) { + uds_log_warning("Page %u invalidated after read", page_number); + cancel_page_in_cache(&volume->page_cache, page_number, page); + return UDS_SUCCESS; + } + + if (!is_record_page(volume->geometry, page_number)) { + result = initialize_index_page(volume, page_number, page); + if (result != UDS_SUCCESS) { + uds_log_warning("Error initializing chapter index page"); + cancel_page_in_cache(&volume->page_cache, page_number, page); + return result; + } + } + + result = put_page_in_cache(&volume->page_cache, page_number, page); + if (result != UDS_SUCCESS) { + uds_log_warning("Error putting page %u in cache", page_number); + cancel_page_in_cache(&volume->page_cache, page_number, page); + return result; + } + + request = entry->first_request; + while ((request != NULL) && (result == UDS_SUCCESS)) { + result = search_page(page, volume, request, page_number); + request = request->next_request; + } + + return result; +} + +static void release_queued_requests(struct volume *volume, struct queued_read *entry, int result) +{ + struct page_cache *cache = &volume->page_cache; + u16 next_read = cache->read_queue_next_read; + struct uds_request *request; + struct uds_request *next; + + for (request = entry->first_request; 
request != NULL; request = next) { + next = request->next_request; + request->status = result; + request->requeued = true; + uds_enqueue_request(request, STAGE_INDEX); + } + + entry->reserved = false; + + /* Move the read_queue_first pointer as far as we can. */ + while ((cache->read_queue_first != next_read) && + (!cache->read_queue[cache->read_queue_first].reserved)) + advance_queue_position(&cache->read_queue_first); + uds_broadcast_cond(&volume->read_threads_read_done_cond); +} + +static void read_thread_function(void *arg) +{ + struct volume *volume = arg; + + uds_log_debug("reader starting"); + uds_lock_mutex(&volume->read_threads_mutex); + while (true) { + struct queued_read *queue_entry; + int result; + + queue_entry = wait_to_reserve_read_queue_entry(volume); + if (volume->read_threads_exiting) + break; + + result = process_entry(volume, queue_entry); + release_queued_requests(volume, queue_entry, result); + } + uds_unlock_mutex(&volume->read_threads_mutex); + uds_log_debug("reader done"); +} + +static void get_page_and_index(struct page_cache *cache, + u32 physical_page, + int *queue_index, + struct cached_page **page_ptr) +{ + u16 index_value; + u16 index; + bool queued; + + /* + * ASSERTION: We are either a zone thread holding a search_pending_counter, or we are any + * thread holding the read_threads_mutex. + * + * Holding only a search_pending_counter is the most frequent case. + */ + /* + * It would be unlikely for the compiler to turn the usage of index_value into two reads of + * cache->index, but it would be possible and very bad if those reads did not return the + * same bits. + */ + index_value = READ_ONCE(cache->index[physical_page]); + queued = (index_value & VOLUME_CACHE_QUEUED_FLAG) != 0; + index = index_value & ~VOLUME_CACHE_QUEUED_FLAG; + + if (!queued && (index < cache->cache_slots)) { + *page_ptr = &cache->cache[index]; + /* + * We have acquired access to the cached page, but unless we hold the + * read_threads_mutex, we need a read memory barrier now. The corresponding write + * memory barrier is in put_page_in_cache(). + */ + smp_rmb(); + } else { + *page_ptr = NULL; + } + + *queue_index = queued ? index : -1; +} + +static void +get_page_from_cache(struct page_cache *cache, u32 physical_page, struct cached_page **page) +{ + /* + * ASSERTION: We are in a zone thread. + * ASSERTION: We holding a search_pending_counter or the read_threads_mutex. 
+ */ + int queue_index = -1; + + get_page_and_index(cache, physical_page, &queue_index, page); +} + +static int read_page_locked(struct volume *volume, + u32 physical_page, + struct cached_page **page_ptr) +{ + int result = UDS_SUCCESS; + struct cached_page *page = NULL; + u8 *page_data; + + page = select_victim_in_cache(&volume->page_cache); + page_data = dm_bufio_read(volume->client, physical_page, &page->buffer); + if (IS_ERR(page_data)) { + result = -PTR_ERR(page_data); + uds_log_warning_strerror(result, + "error reading physical page %u from volume", + physical_page); + cancel_page_in_cache(&volume->page_cache, physical_page, page); + return result; + } + + if (!is_record_page(volume->geometry, physical_page)) { + result = initialize_index_page(volume, physical_page, page); + if (result != UDS_SUCCESS) { + if (volume->lookup_mode != LOOKUP_FOR_REBUILD) + uds_log_warning("Corrupt index page %u", physical_page); + cancel_page_in_cache(&volume->page_cache, physical_page, page); + return result; + } + } + + result = put_page_in_cache(&volume->page_cache, physical_page, page); + if (result != UDS_SUCCESS) { + uds_log_warning("Error putting page %u in cache", physical_page); + cancel_page_in_cache(&volume->page_cache, physical_page, page); + return result; + } + + *page_ptr = page; + return UDS_SUCCESS; +} + +/* Retrieve a page from the cache while holding the read threads mutex. */ +static int +get_volume_page_locked(struct volume *volume, u32 physical_page, struct cached_page **page_ptr) +{ + int result; + struct cached_page *page = NULL; + + get_page_from_cache(&volume->page_cache, physical_page, &page); + if (page == NULL) { + result = read_page_locked(volume, physical_page, &page); + if (result != UDS_SUCCESS) + return result; + } else { + make_page_most_recent(&volume->page_cache, page); + } + + *page_ptr = page; + return UDS_SUCCESS; +} + +/* Retrieve a page from the cache while holding a search_pending lock. */ +static int +get_volume_page_protected(struct volume *volume, + struct uds_request *request, + u32 physical_page, + struct cached_page **page_ptr) +{ + struct cached_page *page; + + get_page_from_cache(&volume->page_cache, physical_page, &page); + if (page != NULL) { + if (request->zone_number == 0) + /* Only one zone is allowed to update the LRU. */ + make_page_most_recent(&volume->page_cache, page); + + *page_ptr = page; + return UDS_SUCCESS; + } + + /* Prepare to enqueue a read for the page. */ + end_pending_search(&volume->page_cache, request->zone_number); + uds_lock_mutex(&volume->read_threads_mutex); + + /* + * Do the lookup again while holding the read mutex (no longer the fast case so this should + * be fine to repeat). We need to do this because a page may have been added to the cache + * by a reader thread between the time we searched above and the time we went to actually + * try to enqueue it below. This could result in us enqueuing another read for a page which + * is already in the cache, which would mean we end up with two entries in the cache for + * the same page. + */ + get_page_from_cache(&volume->page_cache, physical_page, &page); + if (page == NULL) { + enqueue_page_read(volume, request, physical_page); + /* + * The performance gain from unlocking first, while "search pending" mode is off, + * turns out to be significant in some cases. The page is not available yet so + * the order does not matter for correctness as it does below. 
+ */ + uds_unlock_mutex(&volume->read_threads_mutex); + begin_pending_search(&volume->page_cache, physical_page, request->zone_number); + return UDS_QUEUED; + } + + /* + * Now that the page is loaded, the volume needs to switch to "reader thread unlocked" and + * "search pending" state in careful order so no other thread can mess with the data before + * the caller gets to look at it. + */ + begin_pending_search(&volume->page_cache, physical_page, request->zone_number); + uds_unlock_mutex(&volume->read_threads_mutex); + *page_ptr = page; + return UDS_SUCCESS; +} + +static int get_volume_page(struct volume *volume, + u32 chapter, + u32 page_number, + struct cached_page **page_ptr) +{ + int result; + u32 physical_page = map_to_physical_page(volume->geometry, chapter, page_number); + + uds_lock_mutex(&volume->read_threads_mutex); + result = get_volume_page_locked(volume, physical_page, page_ptr); + uds_unlock_mutex(&volume->read_threads_mutex); + return result; +} + +int uds_get_volume_record_page(struct volume *volume, u32 chapter, u32 page_number, u8 **data_ptr) +{ + int result; + struct cached_page *page = NULL; + + result = get_volume_page(volume, chapter, page_number, &page); + if (result == UDS_SUCCESS) + *data_ptr = dm_bufio_get_block_data(page->buffer); + return result; +} + +int uds_get_volume_index_page(struct volume *volume, + u32 chapter, + u32 page_number, + struct delta_index_page **index_page_ptr) +{ + int result; + struct cached_page *page = NULL; + + result = get_volume_page(volume, chapter, page_number, &page); + if (result == UDS_SUCCESS) + *index_page_ptr = &page->index_page; + return result; +} + +/* + * Find the record page associated with a name in a given index page. This will return UDS_QUEUED + * if the page in question must be read from storage. + */ +static int search_cached_index_page(struct volume *volume, + struct uds_request *request, + u32 chapter, + u32 index_page_number, + u16 *record_page_number) +{ + int result; + struct cached_page *page = NULL; + u32 physical_page = map_to_physical_page(volume->geometry, chapter, index_page_number); + + /* + * Make sure the invalidate counter is updated before we try and read the mapping. This + * prevents this thread from reading a page in the cache which has already been marked for + * invalidation by the reader thread, before the reader thread has noticed that the + * invalidate_counter has been incremented. + */ + begin_pending_search(&volume->page_cache, physical_page, request->zone_number); + + result = get_volume_page_protected(volume, request, physical_page, &page); + if (result != UDS_SUCCESS) { + end_pending_search(&volume->page_cache, request->zone_number); + return result; + } + + result = uds_search_chapter_index_page(&page->index_page, + volume->geometry, + &request->record_name, + record_page_number); + end_pending_search(&volume->page_cache, request->zone_number); + return result; +} + +/* + * Find the metadata associated with a name in a given record page. This will return UDS_QUEUED if + * the page in question must be read from storage. 
+ */ +int uds_search_cached_record_page(struct volume *volume, + struct uds_request *request, + u32 chapter, + u16 record_page_number, + bool *found) +{ + struct cached_page *record_page; + struct geometry *geometry = volume->geometry; + int result; + u32 physical_page, page_number; + + *found = false; + if (record_page_number == NO_CHAPTER_INDEX_ENTRY) + return UDS_SUCCESS; + + result = ASSERT(record_page_number < geometry->record_pages_per_chapter, + "0 <= %d < %u", + record_page_number, + geometry->record_pages_per_chapter); + if (result != UDS_SUCCESS) + return result; + + page_number = geometry->index_pages_per_chapter + record_page_number; + + physical_page = map_to_physical_page(volume->geometry, chapter, page_number); + + /* + * Make sure the invalidate counter is updated before we try and read the mapping. This + * prevents this thread from reading a page in the cache which has already been marked for + * invalidation by the reader thread, before the reader thread has noticed that the + * invalidate_counter has been incremented. + */ + begin_pending_search(&volume->page_cache, physical_page, request->zone_number); + + result = get_volume_page_protected(volume, request, physical_page, &record_page); + if (result != UDS_SUCCESS) { + end_pending_search(&volume->page_cache, request->zone_number); + return result; + } + + if (search_record_page(dm_bufio_get_block_data(record_page->buffer), + &request->record_name, + geometry, + &request->old_metadata)) + *found = true; + + end_pending_search(&volume->page_cache, request->zone_number); + return UDS_SUCCESS; +} + +void uds_prefetch_volume_chapter(const struct volume *volume, u32 chapter) +{ + const struct geometry *geometry = volume->geometry; + u32 physical_page = map_to_physical_page(geometry, chapter, 0); + + dm_bufio_prefetch(volume->client, physical_page, geometry->pages_per_chapter); +} + +int uds_read_chapter_index_from_volume(const struct volume *volume, + u64 virtual_chapter, + struct dm_buffer *volume_buffers[], + struct delta_index_page index_pages[]) +{ + int result; + u32 i; + const struct geometry *geometry = volume->geometry; + u32 physical_chapter = uds_map_to_physical_chapter(geometry, virtual_chapter); + u32 physical_page = map_to_physical_page(geometry, physical_chapter, 0); + + dm_bufio_prefetch(volume->client, physical_page, geometry->index_pages_per_chapter); + for (i = 0; i < geometry->index_pages_per_chapter; i++) { + u8 *index_page; + + index_page = dm_bufio_read(volume->client, physical_page + i, &volume_buffers[i]); + if (IS_ERR(index_page)) { + result = -PTR_ERR(index_page); + uds_log_warning_strerror(result, + "error reading physical page %u", + physical_page); + return result; + } + + result = init_chapter_index_page(volume, + index_page, + physical_chapter, + i, + &index_pages[i]); + if (result != UDS_SUCCESS) + return result; + } + + return UDS_SUCCESS; +} + +int uds_search_volume_page_cache(struct volume *volume, struct uds_request *request, bool *found) +{ + int result; + u32 physical_chapter = + uds_map_to_physical_chapter(volume->geometry, request->virtual_chapter); + u32 index_page_number; + u16 record_page_number; + + index_page_number = uds_find_index_page_number(volume->index_page_map, + &request->record_name, + physical_chapter); + + if (request->location == UDS_LOCATION_INDEX_PAGE_LOOKUP) { + record_page_number = *((u16 *) &request->old_metadata); + } else { + result = search_cached_index_page(volume, + request, + physical_chapter, + index_page_number, + &record_page_number); + if (result != 
UDS_SUCCESS) + return result; + } + + return uds_search_cached_record_page(volume, + request, + physical_chapter, + record_page_number, + found); +} + +int uds_search_volume_page_cache_for_rebuild(struct volume *volume, + const struct uds_record_name *name, + u64 virtual_chapter, + bool *found) +{ + int result; + struct geometry *geometry = volume->geometry; + struct cached_page *page; + u32 physical_chapter = uds_map_to_physical_chapter(geometry, virtual_chapter); + u32 index_page_number; + u16 record_page_number; + u32 page_number; + + *found = false; + index_page_number = + uds_find_index_page_number(volume->index_page_map, name, physical_chapter); + result = get_volume_page(volume, physical_chapter, index_page_number, &page); + if (result != UDS_SUCCESS) + return result; + + result = uds_search_chapter_index_page(&page->index_page, + geometry, + name, + &record_page_number); + if (result != UDS_SUCCESS) + return result; + + if (record_page_number == NO_CHAPTER_INDEX_ENTRY) + return UDS_SUCCESS; + + page_number = geometry->index_pages_per_chapter + record_page_number; + result = get_volume_page(volume, physical_chapter, page_number, &page); + if (result != UDS_SUCCESS) + return result; + + *found = search_record_page(dm_bufio_get_block_data(page->buffer), name, geometry, NULL); + return UDS_SUCCESS; +} + +static void invalidate_page(struct page_cache *cache, u32 physical_page) +{ + struct cached_page *page; + int queue_index = -1; + + /* We hold the read_threads_mutex. */ + get_page_and_index(cache, physical_page, &queue_index, &page); + if (page != NULL) { + WRITE_ONCE(cache->index[page->physical_page], cache->cache_slots); + wait_for_pending_searches(cache, page->physical_page); + clear_cache_page(cache, page); + } else if (queue_index > -1) { + uds_log_debug("setting pending read to invalid"); + cache->read_queue[queue_index].invalid = true; + } +} + +void uds_forget_chapter(struct volume *volume, u64 virtual_chapter) +{ + u32 physical_chapter = uds_map_to_physical_chapter(volume->geometry, virtual_chapter); + u32 first_page = map_to_physical_page(volume->geometry, physical_chapter, 0); + u32 i; + + uds_log_debug("forgetting chapter %llu", (unsigned long long) virtual_chapter); + uds_lock_mutex(&volume->read_threads_mutex); + for (i = 0; i < volume->geometry->pages_per_chapter; i++) + invalidate_page(&volume->page_cache, first_page + i); + uds_unlock_mutex(&volume->read_threads_mutex); +} + +/* + * Donate an index page from a newly written chapter to the page cache since it is likely to be + * used again soon. The caller must already hold the reader thread mutex.
+ */ +static int donate_index_page_locked(struct volume *volume, + u32 physical_chapter, + u32 index_page_number, + struct dm_buffer *page_buffer) +{ + int result; + struct cached_page *page = NULL; + u32 physical_page = + map_to_physical_page(volume->geometry, physical_chapter, index_page_number); + + page = select_victim_in_cache(&volume->page_cache); + page->buffer = page_buffer; + result = init_chapter_index_page(volume, + dm_bufio_get_block_data(page_buffer), + physical_chapter, + index_page_number, + &page->index_page); + if (result != UDS_SUCCESS) { + uds_log_warning("Error initialize chapter index page"); + cancel_page_in_cache(&volume->page_cache, physical_page, page); + return result; + } + + result = put_page_in_cache(&volume->page_cache, physical_page, page); + if (result != UDS_SUCCESS) { + uds_log_warning("Error putting page %u in cache", physical_page); + cancel_page_in_cache(&volume->page_cache, physical_page, page); + return result; + } + + return UDS_SUCCESS; +} + +static int write_index_pages(struct volume *volume, + u32 physical_chapter_number, + struct open_chapter_index *chapter_index) +{ + struct geometry *geometry = volume->geometry; + struct dm_buffer *page_buffer; + u32 first_index_page = map_to_physical_page(geometry, physical_chapter_number, 0); + u32 delta_list_number = 0; + u32 index_page_number; + + for (index_page_number = 0; + index_page_number < geometry->index_pages_per_chapter; + index_page_number++) { + u8 *page_data; + u32 physical_page = first_index_page + index_page_number; + u32 lists_packed; + bool last_page; + int result; + + page_data = dm_bufio_new(volume->client, physical_page, &page_buffer); + if (IS_ERR(page_data)) + return uds_log_warning_strerror(-PTR_ERR(page_data), + "failed to prepare index page"); + + last_page = ((index_page_number + 1) == geometry->index_pages_per_chapter); + result = uds_pack_open_chapter_index_page(chapter_index, + page_data, + delta_list_number, + last_page, + &lists_packed); + if (result != UDS_SUCCESS) { + dm_bufio_release(page_buffer); + return uds_log_warning_strerror(result, "failed to pack index page"); + } + + dm_bufio_mark_buffer_dirty(page_buffer); + + if (lists_packed == 0) + uds_log_debug("no delta lists packed on chapter %u page %u", + physical_chapter_number, + index_page_number); + else + delta_list_number += lists_packed; + + uds_update_index_page_map(volume->index_page_map, + chapter_index->virtual_chapter_number, + physical_chapter_number, + index_page_number, + delta_list_number - 1); + + uds_lock_mutex(&volume->read_threads_mutex); + result = donate_index_page_locked(volume, + physical_chapter_number, + index_page_number, + page_buffer); + uds_unlock_mutex(&volume->read_threads_mutex); + if (result != UDS_SUCCESS) { + dm_bufio_release(page_buffer); + return result; + } + } + + return UDS_SUCCESS; +} + +static u32 encode_tree(u8 record_page[], + const struct uds_volume_record *sorted_pointers[], + u32 next_record, + u32 node, + u32 node_count) +{ + if (node < node_count) { + u32 child = (2 * node) + 1; + + next_record = encode_tree(record_page, + sorted_pointers, + next_record, + child, + node_count); + + /* + * In-order traversal: copy the contents of the next record into the page at the + * node offset. 
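As an illustrative aside (not part of the patch), the heap-ordered layout that encode_tree() produces, and that search_record_page() walks, is easier to see with integers standing in for records; the sketch_* helpers below are hypothetical and mirror the recursion and traversal above:

/*
 * Sorted input:      {1, 2, 3, 4, 5, 6, 7}
 * Heap-order output: {4, 2, 6, 1, 3, 5, 7}
 * The in-order walk of the implicit tree (children of node N at 2N+1 and
 * 2N+2) visits the sorted values, so node 0 ends up holding the median.
 */
static unsigned int sketch_encode(int heap[], const int sorted[], unsigned int next,
				  unsigned int node, unsigned int count)
{
	if (node < count) {
		unsigned int child = (2 * node) + 1;

		next = sketch_encode(heap, sorted, next, child, count);
		heap[node] = sorted[next++];
		next = sketch_encode(heap, sorted, next, child + 1, count);
	}

	return next;
}

static bool sketch_search(const int heap[], unsigned int count, int target)
{
	unsigned int node = 0;

	while (node < count) {
		if (target == heap[node])
			return true;

		/* Smaller targets go left, larger ones go right. */
		node = (2 * node) + ((target < heap[node]) ? 1 : 2);
	}

	return false;
}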
+ */ + memcpy(&record_page[node * BYTES_PER_RECORD], + sorted_pointers[next_record++], + BYTES_PER_RECORD); + + next_record = encode_tree(record_page, + sorted_pointers, + next_record, + child + 1, + node_count); + } + + return next_record; +} + +static int +encode_record_page(const struct volume *volume, + const struct uds_volume_record records[], + u8 record_page[]) +{ + int result; + u32 i; + u32 records_per_page = volume->geometry->records_per_page; + const struct uds_volume_record **record_pointers = volume->record_pointers; + + for (i = 0; i < records_per_page; i++) + record_pointers[i] = &records[i]; + + /* + * Sort the record pointers by using just the names in the records, which is less work than + * sorting the entire record values. + */ + STATIC_ASSERT(offsetof(struct uds_volume_record, name) == 0); + result = uds_radix_sort(volume->radix_sorter, + (const u8 **) record_pointers, + records_per_page, + UDS_RECORD_NAME_SIZE); + if (result != UDS_SUCCESS) + return result; + + encode_tree(record_page, record_pointers, 0, 0, records_per_page); + return UDS_SUCCESS; +} + +static int write_record_pages(struct volume *volume, + u32 physical_chapter_number, + const struct uds_volume_record *records) +{ + u32 record_page_number; + struct geometry *geometry = volume->geometry; + struct dm_buffer *page_buffer; + const struct uds_volume_record *next_record = records; + u32 first_record_page = map_to_physical_page(geometry, + physical_chapter_number, + geometry->index_pages_per_chapter); + + for (record_page_number = 0; + record_page_number < geometry->record_pages_per_chapter; + record_page_number++) { + u8 *page_data; + u32 physical_page = first_record_page + record_page_number; + int result; + + page_data = dm_bufio_new(volume->client, physical_page, &page_buffer); + if (IS_ERR(page_data)) + return uds_log_warning_strerror(-PTR_ERR(page_data), + "failed to prepare record page"); + + result = encode_record_page(volume, next_record, page_data); + if (result != UDS_SUCCESS) { + dm_bufio_release(page_buffer); + return uds_log_warning_strerror(result, + "failed to encode record page %u", + record_page_number); + } + + next_record += geometry->records_per_page; + dm_bufio_mark_buffer_dirty(page_buffer); + dm_bufio_release(page_buffer); + } + + return UDS_SUCCESS; +} + +int uds_write_chapter(struct volume *volume, + struct open_chapter_index *chapter_index, + const struct uds_volume_record *records) +{ + int result; + u32 physical_chapter_number = + uds_map_to_physical_chapter(volume->geometry, + chapter_index->virtual_chapter_number); + + result = write_index_pages(volume, physical_chapter_number, chapter_index); + if (result != UDS_SUCCESS) + return result; + + result = write_record_pages(volume, physical_chapter_number, records); + if (result != UDS_SUCCESS) + return result; + + result = -dm_bufio_write_dirty_buffers(volume->client); + if (result != UDS_SUCCESS) + uds_log_error_strerror(result, "cannot sync chapter to volume"); + + return result; +} + +static void probe_chapter(struct volume *volume, u32 chapter_number, u64 *virtual_chapter_number) +{ + const struct geometry *geometry = volume->geometry; + u32 expected_list_number = 0; + u32 i; + u64 vcn = BAD_CHAPTER; + + *virtual_chapter_number = BAD_CHAPTER; + dm_bufio_prefetch(volume->client, + map_to_physical_page(geometry, chapter_number, 0), + geometry->index_pages_per_chapter); + + for (i = 0; i < geometry->index_pages_per_chapter; ++i) { + struct delta_index_page *page; + int result; + + result = uds_get_volume_index_page(volume, 
chapter_number, i, &page); + if (result != UDS_SUCCESS) + return; + + if (page->virtual_chapter_number == BAD_CHAPTER) { + uds_log_error("corrupt index page in chapter %u", chapter_number); + return; + } + + if (vcn == BAD_CHAPTER) { + vcn = page->virtual_chapter_number; + } else if (page->virtual_chapter_number != vcn) { + uds_log_error("inconsistent chapter %u index page %u: expected vcn %llu, got vcn %llu", + chapter_number, + i, + (unsigned long long) vcn, + (unsigned long long) page->virtual_chapter_number); + return; + } + + if (expected_list_number != page->lowest_list_number) { + uds_log_error("inconsistent chapter %u index page %u: expected list number %u, got list number %u", + chapter_number, + i, + expected_list_number, + page->lowest_list_number); + return; + } + expected_list_number = page->highest_list_number + 1; + + result = uds_validate_chapter_index_page(page, geometry); + if (result != UDS_SUCCESS) + return; + } + + if (chapter_number != uds_map_to_physical_chapter(geometry, vcn)) { + uds_log_error("chapter %u vcn %llu is out of phase (%u)", + chapter_number, + (unsigned long long) vcn, + geometry->chapters_per_volume); + return; + } + + *virtual_chapter_number = vcn; +} + +/* Find the last valid physical chapter in the volume. */ +static void find_real_end_of_volume(struct volume *volume, u32 limit, u32 *limit_ptr) +{ + u32 span = 1; + u32 tries = 0; + + while (limit > 0) { + u32 chapter = (span > limit) ? 0 : limit - span; + u64 vcn = 0; + + probe_chapter(volume, chapter, &vcn); + if (vcn == BAD_CHAPTER) { + limit = chapter; + if (++tries > 1) + span *= 2; + } else { + if (span == 1) + break; + span /= 2; + tries = 0; + } + } + + *limit_ptr = limit; +} + +static int +find_chapter_limits(struct volume *volume, u32 chapter_limit, u64 *lowest_vcn, u64 *highest_vcn) +{ + struct geometry *geometry = volume->geometry; + u64 zero_vcn; + u64 lowest = BAD_CHAPTER; + u64 highest = BAD_CHAPTER; + u64 moved_chapter = BAD_CHAPTER; + u32 left_chapter = 0; + u32 right_chapter = 0; + u32 bad_chapters = 0; + + /* + * This method assumes there is at most one run of contiguous bad chapters caused by + * unflushed writes. Either the bad spot is at the beginning and end, or somewhere in the + * middle. Wherever it is, the highest and lowest VCNs are adjacent to it. Otherwise the + * volume is cleanly saved and somewhere in the middle of it the highest VCN immediately + * precedes the lowest one. + */ + + /* It doesn't matter if this results in a bad spot (BAD_CHAPTER). */ + probe_chapter(volume, 0, &zero_vcn); + + /* + * Binary search for end of the discontinuity in the monotonically increasing virtual + * chapter numbers; bad spots are treated as a span of BAD_CHAPTER values. In effect we're + * searching for the index of the smallest value less than zero_vcn. In the case we go off + * the end it means that chapter 0 has the lowest vcn. + * + * If a virtual chapter is out-of-order, it will be the one moved by conversion. Always + * skip over the moved chapter when searching, adding it to the range at the end if + * necessary. 
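As an illustrative aside (not part of the patch), the same binary search can be shown on a toy snapshot of the circular log, with the moved-chapter and bad-chapter handling omitted. An eight-chapter volume whose physical chapters hold virtual chapter numbers {8, 9, 10, 3, 4, 5, 6, 7} has zero_vcn = 8, so the search lands on physical chapter 3 (vcn 3, the lowest), and the backward scan from there finds physical chapter 2 (vcn 10, the highest):

/* Hypothetical helper: find the physical chapter holding the lowest vcn. */
static unsigned int sketch_find_lowest_chapter(const u64 vcns[], unsigned int count)
{
	unsigned int left = 0;
	unsigned int right = count;
	u64 zero_vcn = vcns[0];

	while (left < right) {
		unsigned int middle = (left + right) / 2;

		if (vcns[middle] >= zero_vcn)
			left = middle + 1;
		else
			right = middle;
	}

	/* If nothing is smaller than zero_vcn, chapter 0 has the lowest vcn. */
	return (left >= count) ? 0 : left;
}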
+ */ + if (geometry->remapped_physical > 0) { + u64 remapped_vcn; + + probe_chapter(volume, geometry->remapped_physical, &remapped_vcn); + if (remapped_vcn == geometry->remapped_virtual) + moved_chapter = geometry->remapped_physical; + } + + left_chapter = 0; + right_chapter = chapter_limit; + + while (left_chapter < right_chapter) { + u64 probe_vcn; + u32 chapter = (left_chapter + right_chapter) / 2; + + if (chapter == moved_chapter) + chapter--; + + probe_chapter(volume, chapter, &probe_vcn); + if (zero_vcn <= probe_vcn) { + left_chapter = chapter + 1; + if (left_chapter == moved_chapter) + left_chapter++; + } else { + right_chapter = chapter; + } + } + + /* If left_chapter goes off the end, chapter 0 has the lowest virtual chapter number.*/ + if (left_chapter >= chapter_limit) + left_chapter = 0; + + /* At this point, left_chapter is the chapter with the lowest virtual chapter number. */ + probe_chapter(volume, left_chapter, &lowest); + + /* The moved chapter might be the lowest in the range. */ + if ((moved_chapter != BAD_CHAPTER) && (lowest == geometry->remapped_virtual + 1)) + lowest = geometry->remapped_virtual; + + /* + * Circularly scan backwards, moving over any bad chapters until encountering a good one, + * which is the chapter with the highest vcn. + */ + while (highest == BAD_CHAPTER) { + right_chapter = (right_chapter + chapter_limit - 1) % chapter_limit; + if (right_chapter == moved_chapter) + continue; + + probe_chapter(volume, right_chapter, &highest); + if (bad_chapters++ >= MAX_BAD_CHAPTERS) { + uds_log_error("too many bad chapters in volume: %u", bad_chapters); + return UDS_CORRUPT_DATA; + } + } + + *lowest_vcn = lowest; + *highest_vcn = highest; + return UDS_SUCCESS; +} + +/* + * Find the highest and lowest contiguous chapters present in the volume and determine their + * virtual chapter numbers. This is used by rebuild. 
+ */ +int uds_find_volume_chapter_boundaries(struct volume *volume, + u64 *lowest_vcn, + u64 *highest_vcn, + bool *is_empty) +{ + u32 chapter_limit = volume->geometry->chapters_per_volume; + + find_real_end_of_volume(volume, chapter_limit, &chapter_limit); + if (chapter_limit == 0) { + *lowest_vcn = 0; + *highest_vcn = 0; + *is_empty = true; + return UDS_SUCCESS; + } + + *is_empty = false; + return find_chapter_limits(volume, chapter_limit, lowest_vcn, highest_vcn); +} + +int __must_check uds_replace_volume_storage(struct volume *volume, + struct index_layout *layout, + struct block_device *bdev) +{ + int result; + u32 i; + + result = uds_replace_index_layout_storage(layout, bdev); + if (result != UDS_SUCCESS) + return result; + + /* Release all outstanding dm_bufio objects */ + for (i = 0; i < volume->page_cache.indexable_pages; i++) + volume->page_cache.index[i] = volume->page_cache.cache_slots; + for (i = 0; i < volume->page_cache.cache_slots; i++) + clear_cache_page(&volume->page_cache, &volume->page_cache.cache[i]); + if (volume->sparse_cache != NULL) + uds_invalidate_sparse_cache(volume->sparse_cache); + if (volume->client != NULL) + dm_bufio_client_destroy(UDS_FORGET(volume->client)); + + return uds_open_volume_bufio(layout, + volume->geometry->bytes_per_page, + volume->reserved_buffers, + &volume->client); +} + +static int __must_check +initialize_page_cache(struct page_cache *cache, + const struct geometry *geometry, + u32 chapters_in_cache, + unsigned int zone_count) +{ + int result; + u32 i; + + cache->indexable_pages = geometry->pages_per_volume + 1; + cache->cache_slots = chapters_in_cache * geometry->record_pages_per_chapter; + cache->zone_count = zone_count; + atomic64_set(&cache->clock, 1); + + result = ASSERT((cache->cache_slots <= VOLUME_CACHE_MAX_ENTRIES), + "requested cache size, %u, within limit %u", + cache->cache_slots, + VOLUME_CACHE_MAX_ENTRIES); + if (result != UDS_SUCCESS) + return result; + + result = UDS_ALLOCATE(VOLUME_CACHE_MAX_QUEUED_READS, + struct queued_read, + "volume read queue", + &cache->read_queue); + if (result != UDS_SUCCESS) + return result; + + result = UDS_ALLOCATE(cache->zone_count, + struct search_pending_counter, + "Volume Cache Zones", + &cache->search_pending_counters); + if (result != UDS_SUCCESS) + return result; + + result = UDS_ALLOCATE(cache->indexable_pages, u16, "page cache index", &cache->index); + if (result != UDS_SUCCESS) + return result; + + result = UDS_ALLOCATE(cache->cache_slots, + struct cached_page, + "page cache cache", + &cache->cache); + if (result != UDS_SUCCESS) + return result; + + /* Initialize index values to invalid values. 
*/ + for (i = 0; i < cache->indexable_pages; i++) + cache->index[i] = cache->cache_slots; + + for (i = 0; i < cache->cache_slots; i++) + clear_cache_page(cache, &cache->cache[i]); + + return UDS_SUCCESS; +} + +int uds_make_volume(const struct configuration *config, + struct index_layout *layout, + struct volume **new_volume) +{ + unsigned int i; + struct volume *volume = NULL; + struct geometry *geometry; + unsigned int reserved_buffers; + int result; + + result = UDS_ALLOCATE(1, struct volume, "volume", &volume); + if (result != UDS_SUCCESS) + return result; + + volume->nonce = uds_get_volume_nonce(layout); + + result = uds_copy_geometry(config->geometry, &volume->geometry); + if (result != UDS_SUCCESS) { + uds_free_volume(volume); + return uds_log_warning_strerror(result, "failed to allocate geometry: error"); + } + geometry = volume->geometry; + + /* + * Reserve a buffer for each entry in the page cache, one for the chapter writer, and one + * for each entry in the sparse cache. + */ + reserved_buffers = config->cache_chapters * geometry->record_pages_per_chapter; + reserved_buffers += 1; + if (uds_is_sparse_geometry(geometry)) + reserved_buffers += (config->cache_chapters * geometry->index_pages_per_chapter); + volume->reserved_buffers = reserved_buffers; + result = uds_open_volume_bufio(layout, + geometry->bytes_per_page, + volume->reserved_buffers, + &volume->client); + if (result != UDS_SUCCESS) { + uds_free_volume(volume); + return result; + } + + result = uds_make_radix_sorter(geometry->records_per_page, &volume->radix_sorter); + if (result != UDS_SUCCESS) { + uds_free_volume(volume); + return result; + } + + result = UDS_ALLOCATE(geometry->records_per_page, + const struct uds_volume_record *, + "record pointers", + &volume->record_pointers); + if (result != UDS_SUCCESS) { + uds_free_volume(volume); + return result; + } + + if (uds_is_sparse_geometry(geometry)) { + size_t page_size = sizeof(struct delta_index_page) + geometry->bytes_per_page; + + result = uds_make_sparse_cache(geometry, + config->cache_chapters, + config->zone_count, + &volume->sparse_cache); + if (result != UDS_SUCCESS) { + uds_free_volume(volume); + return result; + } + + volume->cache_size = + page_size * geometry->index_pages_per_chapter * config->cache_chapters; + } + + result = initialize_page_cache(&volume->page_cache, + geometry, + config->cache_chapters, + config->zone_count); + if (result != UDS_SUCCESS) { + uds_free_volume(volume); + return result; + } + + volume->cache_size += volume->page_cache.cache_slots * sizeof(struct delta_index_page); + result = uds_make_index_page_map(geometry, &volume->index_page_map); + if (result != UDS_SUCCESS) { + uds_free_volume(volume); + return result; + } + + result = uds_init_mutex(&volume->read_threads_mutex); + if (result != UDS_SUCCESS) { + uds_free_volume(volume); + return result; + } + + result = uds_init_cond(&volume->read_threads_read_done_cond); + if (result != UDS_SUCCESS) { + uds_free_volume(volume); + return result; + } + + result = uds_init_cond(&volume->read_threads_cond); + if (result != UDS_SUCCESS) { + uds_free_volume(volume); + return result; + } + + result = UDS_ALLOCATE(config->read_threads, + struct thread *, + "reader threads", + &volume->reader_threads); + if (result != UDS_SUCCESS) { + uds_free_volume(volume); + return result; + } + + for (i = 0; i < config->read_threads; i++) { + result = uds_create_thread(read_thread_function, + (void *) volume, + "reader", + &volume->reader_threads[i]); + if (result != UDS_SUCCESS) { + 
uds_free_volume(volume); + return result; + } + + volume->read_thread_count = i + 1; + } + + *new_volume = volume; + return UDS_SUCCESS; +} + +static void uninitialize_page_cache(struct page_cache *cache) +{ + u16 i; + + if (cache->cache != NULL) { + for (i = 0; i < cache->cache_slots; i++) + release_page_buffer(&cache->cache[i]); + } + UDS_FREE(cache->index); + UDS_FREE(cache->cache); + UDS_FREE(cache->search_pending_counters); + UDS_FREE(cache->read_queue); +} + +void uds_free_volume(struct volume *volume) +{ + if (volume == NULL) + return; + + if (volume->reader_threads != NULL) { + unsigned int i; + + /* This works even if some threads weren't started. */ + uds_lock_mutex(&volume->read_threads_mutex); + volume->read_threads_exiting = true; + uds_broadcast_cond(&volume->read_threads_cond); + uds_unlock_mutex(&volume->read_threads_mutex); + for (i = 0; i < volume->read_thread_count; i++) + uds_join_threads(volume->reader_threads[i]); + UDS_FREE(volume->reader_threads); + volume->reader_threads = NULL; + } + + /* Must destroy the client AFTER freeing the cached pages. */ + uninitialize_page_cache(&volume->page_cache); + uds_free_sparse_cache(volume->sparse_cache); + if (volume->client != NULL) + dm_bufio_client_destroy(UDS_FORGET(volume->client)); + + uds_destroy_cond(&volume->read_threads_cond); + uds_destroy_cond(&volume->read_threads_read_done_cond); + uds_destroy_mutex(&volume->read_threads_mutex); + uds_free_index_page_map(volume->index_page_map); + uds_free_radix_sorter(volume->radix_sorter); + UDS_FREE(volume->geometry); + UDS_FREE(volume->record_pointers); + UDS_FREE(volume); +} diff --git a/drivers/md/dm-vdo/volume.h b/drivers/md/dm-vdo/volume.h new file mode 100644 index 000000000000..4991f41fe5c3 --- /dev/null +++ b/drivers/md/dm-vdo/volume.h @@ -0,0 +1,175 @@ +/* SPDX-License-Identifier: GPL-2.0-only */ +/* + * Copyright 2023 Red Hat + */ + +#ifndef UDS_VOLUME_H +#define UDS_VOLUME_H + +#include +#include +#include +#include + +#include "chapter-index.h" +#include "config.h" +#include "geometry.h" +#include "index-layout.h" +#include "index-page-map.h" +#include "permassert.h" +#include "radix-sort.h" +#include "sparse-cache.h" +#include "uds.h" +#include "uds-threads.h" + +/* + * The volume manages deduplication records on permanent storage. The term "volume" can also refer + * to the region of permanent storage where the records (and the chapters containing them) are + * stored. The volume handles all I/O to this region by reading, caching, and writing chapter pages + * as necessary. 
+ */ + +enum index_lookup_mode { + /* Always do lookups in all chapters normally */ + LOOKUP_NORMAL, + /* Only do a subset of lookups needed when rebuilding an index */ + LOOKUP_FOR_REBUILD, +}; + +struct queued_read { + bool invalid; + bool reserved; + u32 physical_page; + struct uds_request *first_request; + struct uds_request *last_request; +}; + +struct __aligned(L1_CACHE_BYTES) search_pending_counter { + u64 atomic_value; +}; + +struct cached_page { + /* Whether this page is currently being read asynchronously */ + bool read_pending; + /* The physical page stored in this cache entry */ + u32 physical_page; + /* The value of the volume clock when this page was last used */ + s64 last_used; + /* The cached page buffer */ + struct dm_buffer *buffer; + /* The chapter index page, meaningless for record pages */ + struct delta_index_page index_page; +}; + +struct page_cache { + /* The number of zones */ + unsigned int zone_count; + /* The number of volume pages that can be cached */ + u32 indexable_pages; + /* The maximum number of simultaneously cached pages */ + u16 cache_slots; + /* An index for each physical page noting where it is in the cache */ + u16 *index; + /* The array of cached pages */ + struct cached_page *cache; + /* A counter for each zone tracking if a search is occurring there */ + struct search_pending_counter *search_pending_counters; + /* The read queue entries as a circular array */ + struct queued_read *read_queue; + + /* All entries above this point are constant after initialization. */ + + /* + * These values are all indexes into the array of read queue entries. New entries in the + * read queue are enqueued at read_queue_last. To dequeue entries, a reader thread gets the + * lock and then claims the entry pointed to by read_queue_next_read and increments that + * value. After the read is completed, the reader thread calls release_read_queue_entry(), + * which increments read_queue_first until it points to a pending read, or is equal to + * read_queue_next_read. This means that if multiple reads are outstanding, + * read_queue_first might not advance until the last of the reads finishes. 
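As an illustrative aside (not part of the patch), a short trace of the three cursors makes the "might not advance" case concrete; all indices wrap modulo the queue size:

/*
 *   start                   first = next_read = last = 0
 *   enqueue pages A, B, C   first = next_read = 0, last = 3
 *   reader 1 reserves A     next_read = 1  (entry A marked reserved)
 *   reader 2 reserves B     next_read = 2  (entry B marked reserved)
 *   read of B completes     first stays 0  (entry A is still reserved)
 *   read of A completes     first = 2      (advances past both A and B)
 */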
+ */ + u16 read_queue_first; + u16 read_queue_next_read; + u16 read_queue_last; + + atomic64_t clock; +}; + +struct volume { + struct geometry *geometry; + struct dm_bufio_client *client; + u64 nonce; + size_t cache_size; + + /* A single page worth of records, for sorting */ + const struct uds_volume_record **record_pointers; + /* Sorter for sorting records within each page */ + struct radix_sorter *radix_sorter; + + struct sparse_cache *sparse_cache; + struct page_cache page_cache; + struct index_page_map *index_page_map; + + struct mutex read_threads_mutex; + struct cond_var read_threads_cond; + struct cond_var read_threads_read_done_cond; + struct thread **reader_threads; + unsigned int read_thread_count; + bool read_threads_exiting; + + enum index_lookup_mode lookup_mode; + unsigned int reserved_buffers; +}; + +int __must_check uds_make_volume(const struct configuration *config, + struct index_layout *layout, + struct volume **new_volume); + +void uds_free_volume(struct volume *volume); + +int __must_check uds_replace_volume_storage(struct volume *volume, + struct index_layout *layout, + struct block_device *bdev); + +int __must_check uds_find_volume_chapter_boundaries(struct volume *volume, + u64 *lowest_vcn, + u64 *highest_vcn, + bool *is_empty); + +int __must_check +uds_search_volume_page_cache(struct volume *volume, struct uds_request *request, bool *found); + +int __must_check uds_search_volume_page_cache_for_rebuild(struct volume *volume, + const struct uds_record_name *name, + u64 virtual_chapter, + bool *found); + +int __must_check uds_search_cached_record_page(struct volume *volume, + struct uds_request *request, + u32 chapter, + u16 record_page_number, + bool *found); + +void uds_forget_chapter(struct volume *volume, u64 chapter); + +int __must_check uds_write_chapter(struct volume *volume, + struct open_chapter_index *chapter_index, + const struct uds_volume_record records[]); + +void uds_prefetch_volume_chapter(const struct volume *volume, u32 chapter); + +int __must_check +uds_read_chapter_index_from_volume(const struct volume *volume, + u64 virtual_chapter, + struct dm_buffer *volume_buffers[], + struct delta_index_page index_pages[]); + +int __must_check +uds_get_volume_record_page(struct volume *volume, u32 chapter, u32 page_number, u8 **data_ptr); + +int __must_check uds_get_volume_index_page(struct volume *volume, + u32 chapter, + u32 page_number, + struct delta_index_page **page_ptr); + +#endif /* UDS_VOLUME_H */ From patchwork Thu Oct 26 21:41:12 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Matthew Sakai X-Patchwork-Id: 13437934 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 5EF2DC25B6E for ; Thu, 26 Oct 2023 21:43:17 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1344870AbjJZVnR (ORCPT ); Thu, 26 Oct 2023 17:43:17 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:41000 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S232027AbjJZVnP (ORCPT ); Thu, 26 Oct 2023 17:43:15 -0400 Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.133.124]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id DC3EB1B5 for ; Thu, 26 Oct 2023 14:41:42 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; 
From: Matthew Sakai
To: dm-devel@lists.linux.dev, linux-block@vger.kernel.org
Cc: Matthew Sakai
Subject: [PATCH v4 15/39] dm vdo: implement top-level deduplication index
Date: Thu, 26 Oct 2023 17:41:12 -0400
Message-Id: <20231026214136.1067410-16-msakai@redhat.com>
In-Reply-To: <20231026214136.1067410-1-msakai@redhat.com>
References: <20231026214136.1067410-1-msakai@redhat.com>

The top-level deduplication index brings all the earlier components
together. The top-level index creates the separate zone structures that
enable the index to handle several requests in parallel, handles
dispatching requests to the right zones and components, and coordinates
metadata to ensure that it remains consistent. It also coordinates
recovery in the event of an unexpected index failure.

If sparse caching is enabled, the top-level index also handles the
coordination required by the sparse chapter index cache, which (unlike
most index structures) is shared among all zones.

Co-developed-by: J. corwin Coburn
Signed-off-by: J. corwin Coburn
Co-developed-by: Michael Sclafani
Signed-off-by: Michael Sclafani
Co-developed-by: Thomas Jaskiewicz
Signed-off-by: Thomas Jaskiewicz
Co-developed-by: Bruce Johnston
Signed-off-by: Bruce Johnston
Signed-off-by: Matthew Sakai
---
 drivers/md/dm-vdo/index.c        | 1403 ++++++++++++++++++++++++++++++
 drivers/md/dm-vdo/index.h        |   84 ++
 drivers/md/dm-vdo/sparse-cache.c |  595 +++++++++++++
 drivers/md/dm-vdo/sparse-cache.h |   49 ++
 4 files changed, 2131 insertions(+)
 create mode 100644 drivers/md/dm-vdo/index.c
 create mode 100644 drivers/md/dm-vdo/index.h
 create mode 100644 drivers/md/dm-vdo/sparse-cache.c
 create mode 100644 drivers/md/dm-vdo/sparse-cache.h

diff --git a/drivers/md/dm-vdo/index.c b/drivers/md/dm-vdo/index.c
new file mode 100644
index 000000000000..e8859cee2b93
--- /dev/null
+++ b/drivers/md/dm-vdo/index.c
@@ -0,0 +1,1403 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright 2023 Red Hat
+ */
+
+
+#include "index.h"
+
+#include "hash-utils.h"
+#include "logger.h"
+#include "memory-alloc.h"
+#include "request-queue.h"
+#include "sparse-cache.h"
+
+static const u64 NO_LAST_SAVE = U64_MAX;
+
+/*
+ * When searching for deduplication records, the index first searches the volume index, and then
+ * searches the chapter index for the relevant chapter. If the chapter has been fully committed to
+ * storage, the chapter pages are loaded into the page cache. If the chapter has not yet been
+ * committed (either the open chapter or a recently closed one), the index searches the in-memory
+ * representation of the chapter. Finally, if the volume index does not find a record and the index
+ * is sparse, the index will search the sparse cache.
+ *
+ * The index sends two kinds of messages to coordinate between zones: chapter close messages for
+ * the chapter writer, and sparse cache barrier messages for the sparse cache.
+ *
+ * The chapter writer is responsible for committing chapters of records to storage. Since zones can
+ * get different numbers of records, some zones may fall behind others. Each time a zone fills up
+ * its available space in a chapter, it informs the chapter writer that the chapter is complete,
+ * and also informs all other zones that it has closed the chapter. Each other zone will then close
+ * the chapter immediately, regardless of how full it is, in order to minimize skew between zones.
+ * Once every zone has closed the chapter, the chapter writer will commit that chapter to storage.
+ *
+ * The last zone to close the chapter also removes the oldest chapter from the volume index.
+ * Although that chapter is invalid for zones that have moved on, the existence of the open chapter
+ * means that those zones will never ask the volume index about it. No zone is allowed to get more
+ * than one chapter ahead of any other. If a zone is so far ahead that it tries to close another
+ * chapter before the previous one has been closed by all zones, it is forced to wait.
+ *
+ * The sparse cache relies on having the same set of chapter indexes available to all zones. When a
+ * request wants to add a chapter to the sparse cache, it sends a barrier message to each zone
+ * during the triage stage that acts as a rendezvous. Once every zone has reached the barrier and
+ * paused its operations, the cache membership is changed and each zone is then informed that it
+ * can proceed. More details can be found in the sparse cache documentation.
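+ *
+ * As a simplified sketch of the chapter close handshake (the actual logic is
+ * in open_next_chapter(), start_closing_chapter(), and close_chapters()
+ * below):
+ *
+ *	// zone thread, after filling its open chapter
+ *	finished_zones = start_closing_chapter(index, zone->id, zone->writing_chapter);
+ *	if ((finished_zones == 1) && (index->zone_count > 1))
+ *		announce_chapter_closed(zone, closed_chapter);
+ *
+ *	// chapter writer thread, once writer->zones_to_write == zone_count
+ *	uds_close_open_chapter(writer->chapters, index->zone_count, ...);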
+ *
+ * If a sparse cache has only one zone, it will not create a triage queue, but it still needs the
+ * barrier message to change the sparse cache membership, so the index simulates the message by
+ * invoking the handler directly.
+ */
+
+struct chapter_writer {
+	/* The index to which we belong */
+	struct uds_index *index;
+	/* The thread to do the writing */
+	struct thread *thread;
+	/* The lock protecting the following fields */
+	struct mutex mutex;
+	/* The condition signalled on state changes */
+	struct cond_var cond;
+	/* Set to true to stop the thread */
+	bool stop;
+	/* The result from the most recent write */
+	int result;
+	/* The number of bytes allocated by the chapter writer */
+	size_t memory_size;
+	/* The number of zones which have submitted a chapter for writing */
+	unsigned int zones_to_write;
+	/* Open chapter index used by uds_close_open_chapter() */
+	struct open_chapter_index *open_chapter_index;
+	/* Collated records used by uds_close_open_chapter() */
+	struct uds_volume_record *collated_records;
+	/* The chapters to write (one per zone) */
+	struct open_chapter_zone *chapters[];
+};
+
+static bool is_zone_chapter_sparse(const struct index_zone *zone, u64 virtual_chapter)
+{
+	return uds_is_chapter_sparse(zone->index->volume->geometry,
+				     zone->oldest_virtual_chapter,
+				     zone->newest_virtual_chapter,
+				     virtual_chapter);
+}
+
+static int
+launch_zone_message(struct uds_zone_message message, unsigned int zone, struct uds_index *index)
+{
+	int result;
+	struct uds_request *request;
+
+	result = UDS_ALLOCATE(1, struct uds_request, __func__, &request);
+	if (result != UDS_SUCCESS)
+		return result;
+
+	request->index = index;
+	request->unbatched = true;
+	request->zone_number = zone;
+	request->zone_message = message;
+
+	uds_enqueue_request(request, STAGE_MESSAGE);
+	return UDS_SUCCESS;
+}
+
+static void enqueue_barrier_messages(struct uds_index *index, u64 virtual_chapter)
+{
+	struct uds_zone_message message = {
+		.type = UDS_MESSAGE_SPARSE_CACHE_BARRIER,
+		.virtual_chapter = virtual_chapter,
+	};
+	unsigned int zone;
+
+	for (zone = 0; zone < index->zone_count; zone++) {
+		int result = launch_zone_message(message, zone, index);
+
+		ASSERT_LOG_ONLY((result == UDS_SUCCESS), "barrier message allocation");
+	}
+}
+
+/*
+ * Determine whether this request should trigger a sparse cache barrier message to change the
+ * membership of the sparse cache. If a change in membership is desired, the function returns the
+ * chapter number to add.
+ */
+static u64 triage_index_request(struct uds_index *index, struct uds_request *request)
+{
+	u64 virtual_chapter;
+	struct index_zone *zone;
+
+	virtual_chapter = uds_lookup_volume_index_name(index->volume_index, &request->record_name);
+	if (virtual_chapter == NO_CHAPTER)
+		return NO_CHAPTER;
+
+	zone = index->zones[request->zone_number];
+	if (!is_zone_chapter_sparse(zone, virtual_chapter))
+		return NO_CHAPTER;
+
+	/*
+	 * FIXME: Optimize for a common case by remembering the chapter from the most recent
+	 * barrier message and skipping this chapter if it is the same.
+	 */
+
+	return virtual_chapter;
+}
+
+/*
+ * Simulate a message to change the sparse cache membership for a single-zone sparse index. This
+ * allows us to forgo the complicated locking required by a multi-zone sparse index. Any other kind
+ * of index does nothing here.
+ */ +static int +simulate_index_zone_barrier_message(struct index_zone *zone, struct uds_request *request) +{ + u64 sparse_virtual_chapter; + + if ((zone->index->zone_count > 1) || + !uds_is_sparse_geometry(zone->index->volume->geometry)) + return UDS_SUCCESS; + + sparse_virtual_chapter = triage_index_request(zone->index, request); + if (sparse_virtual_chapter == NO_CHAPTER) + return UDS_SUCCESS; + + return uds_update_sparse_cache(zone, sparse_virtual_chapter); +} + +/* This is the request processing function for the triage queue. */ +static void triage_request(struct uds_request *request) +{ + struct uds_index *index = request->index; + u64 sparse_virtual_chapter = triage_index_request(index, request); + + if (sparse_virtual_chapter != NO_CHAPTER) + enqueue_barrier_messages(index, sparse_virtual_chapter); + + uds_enqueue_request(request, STAGE_INDEX); +} + +static int finish_previous_chapter(struct uds_index *index, u64 current_chapter_number) +{ + int result; + struct chapter_writer *writer = index->chapter_writer; + + uds_lock_mutex(&writer->mutex); + while (index->newest_virtual_chapter < current_chapter_number) + uds_wait_cond(&writer->cond, &writer->mutex); + result = writer->result; + uds_unlock_mutex(&writer->mutex); + + if (result != UDS_SUCCESS) + return uds_log_error_strerror(result, "Writing of previous open chapter failed"); + + return UDS_SUCCESS; +} + +static int swap_open_chapter(struct index_zone *zone) +{ + int result; + struct open_chapter_zone *temporary_chapter; + + result = finish_previous_chapter(zone->index, zone->newest_virtual_chapter); + if (result != UDS_SUCCESS) + return result; + + temporary_chapter = zone->open_chapter; + zone->open_chapter = zone->writing_chapter; + zone->writing_chapter = temporary_chapter; + return UDS_SUCCESS; +} + +/* + * Inform the chapter writer that this zone is done with this chapter. The chapter won't start + * writing until all zones have closed it. 
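+ *
+ * In other words, each zone increments writer->zones_to_write under the
+ * writer's mutex; when the count reaches index->zone_count, the writer
+ * thread in close_chapters() commits the chapter and resets the count to
+ * zero before waiting for the next round.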
+ */ +static unsigned int start_closing_chapter(struct uds_index *index, + unsigned int zone_number, + struct open_chapter_zone *chapter) +{ + unsigned int finished_zones; + struct chapter_writer *writer = index->chapter_writer; + + uds_lock_mutex(&writer->mutex); + finished_zones = ++writer->zones_to_write; + writer->chapters[zone_number] = chapter; + uds_broadcast_cond(&writer->cond); + uds_unlock_mutex(&writer->mutex); + + return finished_zones; +} + +static int announce_chapter_closed(struct index_zone *zone, u64 closed_chapter) +{ + int result; + unsigned int i; + struct uds_zone_message zone_message = { + .type = UDS_MESSAGE_ANNOUNCE_CHAPTER_CLOSED, + .virtual_chapter = closed_chapter, + }; + + for (i = 0; i < zone->index->zone_count; i++) { + if (zone->id == i) + continue; + + result = launch_zone_message(zone_message, i, zone->index); + if (result != UDS_SUCCESS) + return result; + } + + return UDS_SUCCESS; +} + +static int open_next_chapter(struct index_zone *zone) +{ + int result; + u64 closed_chapter; + u64 expiring; + unsigned int finished_zones; + u32 expire_chapters; + + uds_log_debug("closing chapter %llu of zone %u after %u entries (%u short)", + (unsigned long long) zone->newest_virtual_chapter, + zone->id, + zone->open_chapter->size, + zone->open_chapter->capacity - zone->open_chapter->size); + + result = swap_open_chapter(zone); + if (result != UDS_SUCCESS) + return result; + + closed_chapter = zone->newest_virtual_chapter++; + uds_set_volume_index_zone_open_chapter(zone->index->volume_index, + zone->id, + zone->newest_virtual_chapter); + uds_reset_open_chapter(zone->open_chapter); + + finished_zones = start_closing_chapter(zone->index, zone->id, zone->writing_chapter); + if ((finished_zones == 1) && (zone->index->zone_count > 1)) { + result = announce_chapter_closed(zone, closed_chapter); + if (result != UDS_SUCCESS) + return result; + } + + expiring = zone->oldest_virtual_chapter; + expire_chapters = uds_chapters_to_expire(zone->index->volume->geometry, + zone->newest_virtual_chapter); + zone->oldest_virtual_chapter += expire_chapters; + + if (finished_zones < zone->index->zone_count) + return UDS_SUCCESS; + + while (expire_chapters-- > 0) + uds_forget_chapter(zone->index->volume, expiring++); + + return UDS_SUCCESS; +} + +static int handle_chapter_closed(struct index_zone *zone, u64 virtual_chapter) +{ + if (zone->newest_virtual_chapter == virtual_chapter) + return open_next_chapter(zone); + + return UDS_SUCCESS; +} + +static int dispatch_index_zone_control_request(struct uds_request *request) +{ + struct uds_zone_message *message = &request->zone_message; + struct index_zone *zone = request->index->zones[request->zone_number]; + + switch (message->type) { + case UDS_MESSAGE_SPARSE_CACHE_BARRIER: + return uds_update_sparse_cache(zone, message->virtual_chapter); + + case UDS_MESSAGE_ANNOUNCE_CHAPTER_CLOSED: + return handle_chapter_closed(zone, message->virtual_chapter); + + default: + uds_log_error("invalid message type: %d", message->type); + return UDS_INVALID_ARGUMENT; + } +} + +static void set_request_location(struct uds_request *request, enum uds_index_region new_location) +{ + request->location = new_location; + request->found = ((new_location == UDS_LOCATION_IN_OPEN_CHAPTER) || + (new_location == UDS_LOCATION_IN_DENSE) || + (new_location == UDS_LOCATION_IN_SPARSE)); +} + +static void set_chapter_location(struct uds_request *request, + const struct index_zone *zone, + u64 virtual_chapter) +{ + request->found = true; + if (virtual_chapter == 
zone->newest_virtual_chapter) + request->location = UDS_LOCATION_IN_OPEN_CHAPTER; + else if (is_zone_chapter_sparse(zone, virtual_chapter)) + request->location = UDS_LOCATION_IN_SPARSE; + else + request->location = UDS_LOCATION_IN_DENSE; +} + +static int search_sparse_cache_in_zone(struct index_zone *zone, + struct uds_request *request, + u64 virtual_chapter, + bool *found) +{ + int result; + struct volume *volume; + u16 record_page_number; + u32 chapter; + + result = uds_search_sparse_cache(zone, + &request->record_name, + &virtual_chapter, + &record_page_number); + if ((result != UDS_SUCCESS) || (virtual_chapter == NO_CHAPTER)) + return result; + + request->virtual_chapter = virtual_chapter; + volume = zone->index->volume; + chapter = uds_map_to_physical_chapter(volume->geometry, virtual_chapter); + return uds_search_cached_record_page(volume, request, chapter, record_page_number, found); +} + +static int get_record_from_zone(struct index_zone *zone, struct uds_request *request, bool *found) +{ + struct volume *volume; + + if (request->location == UDS_LOCATION_RECORD_PAGE_LOOKUP) { + *found = true; + return UDS_SUCCESS; + } else if (request->location == UDS_LOCATION_UNAVAILABLE) { + *found = false; + return UDS_SUCCESS; + } + + if (request->virtual_chapter == zone->newest_virtual_chapter) { + uds_search_open_chapter(zone->open_chapter, + &request->record_name, + &request->old_metadata, + found); + return UDS_SUCCESS; + } + + if ((zone->newest_virtual_chapter > 0) && + (request->virtual_chapter == (zone->newest_virtual_chapter - 1)) && + (zone->writing_chapter->size > 0)) { + uds_search_open_chapter(zone->writing_chapter, + &request->record_name, + &request->old_metadata, + found); + return UDS_SUCCESS; + } + + volume = zone->index->volume; + if (is_zone_chapter_sparse(zone, request->virtual_chapter) && + uds_sparse_cache_contains(volume->sparse_cache, + request->virtual_chapter, + request->zone_number)) + return search_sparse_cache_in_zone(zone, request, request->virtual_chapter, found); + + return uds_search_volume_page_cache(volume, request, found); +} + +static int put_record_in_zone(struct index_zone *zone, + struct uds_request *request, + const struct uds_record_data *metadata) +{ + unsigned int remaining; + + remaining = uds_put_open_chapter(zone->open_chapter, &request->record_name, metadata); + if (remaining == 0) + return open_next_chapter(zone); + + return UDS_SUCCESS; +} + +static int search_index_zone(struct index_zone *zone, struct uds_request *request) +{ + int result; + struct volume_index_record record; + bool overflow_record, found = false; + struct uds_record_data *metadata; + u64 chapter; + + result = uds_get_volume_index_record(zone->index->volume_index, + &request->record_name, + &record); + if (result != UDS_SUCCESS) + return result; + + if (record.is_found) { + if (request->requeued && request->virtual_chapter != record.virtual_chapter) + set_request_location(request, UDS_LOCATION_UNKNOWN); + + request->virtual_chapter = record.virtual_chapter; + result = get_record_from_zone(zone, request, &found); + if (result != UDS_SUCCESS) + return result; + } + + if (found) + set_chapter_location(request, zone, record.virtual_chapter); + + /* + * If a record has overflowed a chapter index in more than one chapter (or overflowed in + * one chapter and collided with an existing record), it will exist as a collision record + * in the volume index, but we won't find it in the volume. This case needs special + * handling. 
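+ *
+ * The three cases handled below can be summarized as:
+ *
+ *	found            -> the record was located via the volume index
+ *	overflow_record  -> a collision entry exists in the volume index, but
+ *	                    the record cannot be found in the volume
+ *	neither          -> no volume index entry; the name may still turn up
+ *	                    in a cached sparse chapter, or it is new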
+ */ + overflow_record = (record.is_found && record.is_collision && !found); + chapter = zone->newest_virtual_chapter; + if (found || overflow_record) { + if ((request->type == UDS_QUERY_NO_UPDATE) || + ((request->type == UDS_QUERY) && overflow_record)) + /* There is nothing left to do. */ + return UDS_SUCCESS; + + if (record.virtual_chapter != chapter) + /* + * Update the volume index to reference the new chapter for the block. If + * the record had been deleted or dropped from the chapter index, it will + * be back. + */ + result = uds_set_volume_index_record_chapter(&record, chapter); + else if (request->type != UDS_UPDATE) + /* The record is already in the open chapter. */ + return UDS_SUCCESS; + } else { + /* + * The record wasn't in the volume index, so check whether the + * name is in a cached sparse chapter. If we found the name on + * a previous search, use that result instead. + */ + if (request->location == UDS_LOCATION_RECORD_PAGE_LOOKUP) { + found = true; + } else if (request->location == UDS_LOCATION_UNAVAILABLE) { + found = false; + } else if (uds_is_sparse_geometry(zone->index->volume->geometry) && + !uds_is_volume_index_sample(zone->index->volume_index, + &request->record_name)) { + result = search_sparse_cache_in_zone(zone, request, NO_CHAPTER, &found); + if (result != UDS_SUCCESS) + return result; + } + + if (found) + set_request_location(request, UDS_LOCATION_IN_SPARSE); + + if ((request->type == UDS_QUERY_NO_UPDATE) || + ((request->type == UDS_QUERY) && !found)) + /* There is nothing left to do. */ + return UDS_SUCCESS; + + /* + * Add a new entry to the volume index referencing the open chapter. This needs to + * be done both for new records, and for records from cached sparse chapters. + */ + result = uds_put_volume_index_record(&record, chapter); + } + + if (result == UDS_OVERFLOW) + /* + * The volume index encountered a delta list overflow. The condition was already + * logged. We will go on without adding the record to the open chapter. + */ + return UDS_SUCCESS; + + if (result != UDS_SUCCESS) + return result; + + if (!found || (request->type == UDS_UPDATE)) + /* This is a new record or we're updating an existing record. */ + metadata = &request->new_metadata; + else + /* Move the existing record to the open chapter. */ + metadata = &request->old_metadata; + + return put_record_in_zone(zone, request, metadata); +} + +static int remove_from_index_zone(struct index_zone *zone, struct uds_request *request) +{ + int result; + struct volume_index_record record; + + result = uds_get_volume_index_record(zone->index->volume_index, + &request->record_name, + &record); + if (result != UDS_SUCCESS) + return result; + + if (!record.is_found) + return UDS_SUCCESS; + + /* If the request was requeued, check whether the saved state is still valid. */ + + if (record.is_collision) { + set_chapter_location(request, zone, record.virtual_chapter); + } else { + /* Non-collision records are hints, so resolve the name in the chapter. */ + bool found; + + if (request->requeued && request->virtual_chapter != record.virtual_chapter) + set_request_location(request, UDS_LOCATION_UNKNOWN); + + request->virtual_chapter = record.virtual_chapter; + result = get_record_from_zone(zone, request, &found); + if (result != UDS_SUCCESS) + return result; + + if (!found) + /* There is no record to remove. */ + return UDS_SUCCESS; + } + + set_chapter_location(request, zone, record.virtual_chapter); + + /* + * Delete the volume index entry for the named record only. 
Note that a later search might + * later return stale advice if there is a colliding name in the same chapter, but it's a + * very rare case (1 in 2^21). + */ + result = uds_remove_volume_index_record(&record); + if (result != UDS_SUCCESS) + return result; + + /* + * If the record is in the open chapter, we must remove it or mark it deleted to avoid + * trouble if the record is added again later. + */ + if (request->location == UDS_LOCATION_IN_OPEN_CHAPTER) + uds_remove_from_open_chapter(zone->open_chapter, &request->record_name); + + return UDS_SUCCESS; +} + +static int dispatch_index_request(struct uds_index *index, struct uds_request *request) +{ + int result; + struct index_zone *zone = index->zones[request->zone_number]; + + if (!request->requeued) { + result = simulate_index_zone_barrier_message(zone, request); + if (result != UDS_SUCCESS) + return result; + } + + switch (request->type) { + case UDS_POST: + case UDS_UPDATE: + case UDS_QUERY: + case UDS_QUERY_NO_UPDATE: + result = search_index_zone(zone, request); + break; + + case UDS_DELETE: + result = remove_from_index_zone(zone, request); + break; + + default: + result = uds_log_warning_strerror(UDS_INVALID_ARGUMENT, + "invalid request type: %d", + request->type); + break; + } + + return result; +} + +/* This is the request processing function invoked by each zone's thread. */ +static void execute_zone_request(struct uds_request *request) +{ + int result; + struct uds_index *index = request->index; + + if (request->zone_message.type != UDS_MESSAGE_NONE) { + result = dispatch_index_zone_control_request(request); + if (result != UDS_SUCCESS) + uds_log_error_strerror(result, + "error executing message: %d", + request->zone_message.type); + + /* Once the message is processed it can be freed. */ + UDS_FREE(UDS_FORGET(request)); + return; + } + + index->need_to_save = true; + if (request->requeued && (request->status != UDS_SUCCESS)) { + set_request_location(request, UDS_LOCATION_UNAVAILABLE); + index->callback(request); + return; + } + + result = dispatch_index_request(index, request); + if (result == UDS_QUEUED) + /* The request has been requeued so don't let it complete. */ + return; + + if (!request->found) + set_request_location(request, UDS_LOCATION_UNAVAILABLE); + + request->status = result; + index->callback(request); +} + +static int initialize_index_queues(struct uds_index *index, const struct geometry *geometry) +{ + int result; + unsigned int i; + + for (i = 0; i < index->zone_count; i++) { + result = uds_make_request_queue("indexW", + &execute_zone_request, + &index->zone_queues[i]); + if (result != UDS_SUCCESS) + return result; + } + + /* The triage queue is only needed for sparse multi-zone indexes. */ + if ((index->zone_count > 1) && uds_is_sparse_geometry(geometry)) { + result = uds_make_request_queue("triageW", &triage_request, &index->triage_queue); + if (result != UDS_SUCCESS) + return result; + } + + return UDS_SUCCESS; +} + +/* This is the driver function for the chapter writer thread. */ +static void close_chapters(void *arg) +{ + int result; + struct chapter_writer *writer = arg; + struct uds_index *index = writer->index; + + uds_log_debug("chapter writer starting"); + uds_lock_mutex(&writer->mutex); + for (;;) { + while (writer->zones_to_write < index->zone_count) { + if (writer->stop && (writer->zones_to_write == 0)) { + /* + * We've been told to stop, and all of the zones are in the same + * open chapter, so we can exit now. 
+ */ + uds_unlock_mutex(&writer->mutex); + uds_log_debug("chapter writer stopping"); + return; + } + uds_wait_cond(&writer->cond, &writer->mutex); + } + + /* + * Release the lock while closing a chapter. We probably don't need to do this, but + * it seems safer in principle. It's OK to access the chapter and chapter_number + * fields without the lock since those aren't allowed to change until we're done. + */ + uds_unlock_mutex(&writer->mutex); + + if (index->has_saved_open_chapter) { + /* + * Remove the saved open chapter the first time we close an open chapter + * after loading from a clean shutdown, or after doing a clean save. The + * lack of the saved open chapter will indicate that a recovery is + * necessary. + */ + index->has_saved_open_chapter = false; + result = uds_discard_open_chapter(index->layout); + if (result == UDS_SUCCESS) + uds_log_debug("Discarding saved open chapter"); + } + + result = uds_close_open_chapter(writer->chapters, + index->zone_count, + index->volume, + writer->open_chapter_index, + writer->collated_records, + index->newest_virtual_chapter); + + uds_lock_mutex(&writer->mutex); + index->newest_virtual_chapter++; + index->oldest_virtual_chapter += + uds_chapters_to_expire(index->volume->geometry, + index->newest_virtual_chapter); + writer->result = result; + writer->zones_to_write = 0; + uds_broadcast_cond(&writer->cond); + } +} + +static void stop_chapter_writer(struct chapter_writer *writer) +{ + struct thread *writer_thread = 0; + + uds_lock_mutex(&writer->mutex); + if (writer->thread != 0) { + writer_thread = writer->thread; + writer->thread = 0; + writer->stop = true; + uds_broadcast_cond(&writer->cond); + } + uds_unlock_mutex(&writer->mutex); + + if (writer_thread != 0) + uds_join_threads(writer_thread); +} + +static void free_chapter_writer(struct chapter_writer *writer) +{ + if (writer == NULL) + return; + + stop_chapter_writer(writer); + uds_destroy_mutex(&writer->mutex); + uds_destroy_cond(&writer->cond); + uds_free_open_chapter_index(writer->open_chapter_index); + UDS_FREE(writer->collated_records); + UDS_FREE(writer); +} + +static int make_chapter_writer(struct uds_index *index, struct chapter_writer **writer_ptr) +{ + int result; + struct chapter_writer *writer; + size_t collated_records_size = + (sizeof(struct uds_volume_record) * index->volume->geometry->records_per_chapter); + + result = UDS_ALLOCATE_EXTENDED(struct chapter_writer, + index->zone_count, + struct open_chapter_zone *, + "Chapter Writer", + &writer); + if (result != UDS_SUCCESS) + return result; + + writer->index = index; + result = uds_init_mutex(&writer->mutex); + if (result != UDS_SUCCESS) { + UDS_FREE(writer); + return result; + } + + result = uds_init_cond(&writer->cond); + if (result != UDS_SUCCESS) { + uds_destroy_mutex(&writer->mutex); + UDS_FREE(writer); + return result; + } + + result = uds_allocate_cache_aligned(collated_records_size, + "collated records", + &writer->collated_records); + if (result != UDS_SUCCESS) { + free_chapter_writer(writer); + return result; + } + + result = uds_make_open_chapter_index(&writer->open_chapter_index, + index->volume->geometry, + index->volume->nonce); + if (result != UDS_SUCCESS) { + free_chapter_writer(writer); + return result; + } + + writer->memory_size = (sizeof(struct chapter_writer) + + index->zone_count * sizeof(struct open_chapter_zone *) + + collated_records_size + + writer->open_chapter_index->memory_size); + + result = uds_create_thread(close_chapters, writer, "writer", &writer->thread); + if (result != UDS_SUCCESS) { + 
free_chapter_writer(writer); + return result; + } + + *writer_ptr = writer; + return UDS_SUCCESS; +} + +static int load_index(struct uds_index *index) +{ + int result; + u64 last_save_chapter; + + result = uds_load_index_state(index->layout, index); + if (result != UDS_SUCCESS) + return UDS_INDEX_NOT_SAVED_CLEANLY; + + last_save_chapter = ((index->last_save != NO_LAST_SAVE) ? index->last_save : 0); + + uds_log_info("loaded index from chapter %llu through chapter %llu", + (unsigned long long) index->oldest_virtual_chapter, + (unsigned long long) last_save_chapter); + + return UDS_SUCCESS; +} + +static int rebuild_index_page_map(struct uds_index *index, u64 vcn) +{ + int result; + struct delta_index_page *chapter_index_page; + struct geometry *geometry = index->volume->geometry; + u32 chapter = uds_map_to_physical_chapter(geometry, vcn); + u32 expected_list_number = 0; + u32 index_page_number; + u32 lowest_delta_list; + u32 highest_delta_list; + + for (index_page_number = 0; + index_page_number < geometry->index_pages_per_chapter; + index_page_number++) { + result = uds_get_volume_index_page(index->volume, + chapter, + index_page_number, + &chapter_index_page); + if (result != UDS_SUCCESS) + return uds_log_error_strerror(result, + "failed to read index page %u in chapter %u", + index_page_number, + chapter); + + lowest_delta_list = chapter_index_page->lowest_list_number; + highest_delta_list = chapter_index_page->highest_list_number; + if (lowest_delta_list != expected_list_number) + return uds_log_error_strerror(UDS_CORRUPT_DATA, + "chapter %u index page %u is corrupt", + chapter, + index_page_number); + + uds_update_index_page_map(index->volume->index_page_map, + vcn, + chapter, + index_page_number, + highest_delta_list); + expected_list_number = highest_delta_list + 1; + } + + return UDS_SUCCESS; +} + +static int replay_record(struct uds_index *index, + const struct uds_record_name *name, + u64 virtual_chapter, + bool will_be_sparse_chapter) +{ + int result; + struct volume_index_record record; + bool update_record; + + if (will_be_sparse_chapter && !uds_is_volume_index_sample(index->volume_index, name)) + /* + * This entry will be in a sparse chapter after the rebuild completes, and it is + * not a sample, so just skip over it. + */ + return UDS_SUCCESS; + + result = uds_get_volume_index_record(index->volume_index, name, &record); + if (result != UDS_SUCCESS) + return result; + + if (record.is_found) { + if (record.is_collision) { + if (record.virtual_chapter == virtual_chapter) + /* The record is already correct. */ + return UDS_SUCCESS; + + update_record = true; + } else if (record.virtual_chapter == virtual_chapter) { + /* + * There is a volume index entry pointing to the current chapter, but we + * don't know if it is for the same name as the one we are currently + * working on or not. For now, we're just going to assume that it isn't. + * This will create one extra collision record if there was a deleted + * record in the current chapter. + */ + update_record = false; + } else { + /* + * If we're rebuilding, we don't normally want to go to disk to see if the + * record exists, since we will likely have just read the record from disk + * (i.e. we know it's there). The exception to this is when we find an + * entry in the volume index that has a different chapter. In this case, we + * need to search that chapter to determine if the volume index entry was + * for the same record or a different one. 
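+ *
+ * For reference, the full set of cases handled by replay_record() is:
+ *
+ *	collision entry, same chapter      -> already correct, nothing to do
+ *	collision entry, different chapter -> update the existing entry
+ *	plain entry, same chapter          -> assume a different name and add
+ *	                                      a new (collision) entry
+ *	plain entry, different chapter     -> search that chapter on disk to
+ *	                                      decide (this case)
+ *	no entry                           -> add a new entry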
+ */ + result = uds_search_volume_page_cache_for_rebuild(index->volume, + name, + record.virtual_chapter, + &update_record); + if (result != UDS_SUCCESS) + return result; + } + } else { + update_record = false; + } + + if (update_record) + /* + * Update the volume index to reference the new chapter for the block. If the + * record had been deleted or dropped from the chapter index, it will be back. + */ + result = uds_set_volume_index_record_chapter(&record, virtual_chapter); + else + /* + * Add a new entry to the volume index referencing the open chapter. This should be + * done regardless of whether we are a brand new record or a sparse record, i.e. + * one that doesn't exist in the index but does on disk, since for a sparse record, + * we would want to un-sparsify if it did exist. + */ + result = uds_put_volume_index_record(&record, virtual_chapter); + + if ((result == UDS_DUPLICATE_NAME) || (result == UDS_OVERFLOW)) + /* The rebuilt index will lose these records. */ + return UDS_SUCCESS; + + return result; +} + +static bool check_for_suspend(struct uds_index *index) +{ + bool closing; + + if (index->load_context == NULL) + return false; + + uds_lock_mutex(&index->load_context->mutex); + if (index->load_context->status != INDEX_SUSPENDING) { + uds_unlock_mutex(&index->load_context->mutex); + return false; + } + + /* Notify that we are suspended and wait for the resume. */ + index->load_context->status = INDEX_SUSPENDED; + uds_broadcast_cond(&index->load_context->cond); + + while ((index->load_context->status != INDEX_OPENING) && + (index->load_context->status != INDEX_FREEING)) + uds_wait_cond(&index->load_context->cond, &index->load_context->mutex); + + closing = (index->load_context->status == INDEX_FREEING); + uds_unlock_mutex(&index->load_context->mutex); + return closing; +} + +static int replay_chapter(struct uds_index *index, u64 virtual, bool sparse) +{ + int result; + u32 i; + u32 j; + const struct geometry *geometry; + u32 physical_chapter; + + if (check_for_suspend(index)) { + uds_log_info("Replay interrupted by index shutdown at chapter %llu", + (unsigned long long) virtual); + return -EBUSY; + } + + geometry = index->volume->geometry; + physical_chapter = uds_map_to_physical_chapter(geometry, virtual); + uds_prefetch_volume_chapter(index->volume, physical_chapter); + uds_set_volume_index_open_chapter(index->volume_index, virtual); + + result = rebuild_index_page_map(index, virtual); + if (result != UDS_SUCCESS) + return uds_log_error_strerror(result, + "could not rebuild index page map for chapter %u", + physical_chapter); + + for (i = 0; i < geometry->record_pages_per_chapter; i++) { + u8 *record_page; + u32 record_page_number; + + record_page_number = geometry->index_pages_per_chapter + i; + result = uds_get_volume_record_page(index->volume, + physical_chapter, + record_page_number, + &record_page); + if (result != UDS_SUCCESS) + return uds_log_error_strerror(result, + "could not get page %d", + record_page_number); + + for (j = 0; j < geometry->records_per_page; j++) { + const u8 *name_bytes; + struct uds_record_name name; + + name_bytes = record_page + (j * BYTES_PER_RECORD); + memcpy(&name.name, name_bytes, UDS_RECORD_NAME_SIZE); + result = replay_record(index, &name, virtual, sparse); + if (result != UDS_SUCCESS) + return result; + } + } + + return UDS_SUCCESS; +} + +static int replay_volume(struct uds_index *index) +{ + int result; + u64 old_map_update; + u64 new_map_update; + u64 virtual; + u64 from_virtual = index->oldest_virtual_chapter; + u64 upto_virtual = 
index->newest_virtual_chapter; + bool will_be_sparse; + + uds_log_info("Replaying volume from chapter %llu through chapter %llu", + (unsigned long long) from_virtual, + (unsigned long long) upto_virtual); + + /* + * The index failed to load, so the volume index is empty. Add records to the volume index + * in order, skipping non-hooks in chapters which will be sparse to save time. + * + * Go through each record page of each chapter and add the records back to the volume + * index. This should not cause anything to be written to either the open chapter or the + * on-disk volume. Also skip the on-disk chapter corresponding to upto_virtual, as this + * would have already been purged from the volume index when the chapter was opened. + * + * Also, go through each index page for each chapter and rebuild the index page map. + */ + old_map_update = index->volume->index_page_map->last_update; + for (virtual = from_virtual; virtual < upto_virtual; ++virtual) { + will_be_sparse = uds_is_chapter_sparse(index->volume->geometry, + from_virtual, + upto_virtual, + virtual); + result = replay_chapter(index, virtual, will_be_sparse); + if (result != UDS_SUCCESS) + return result; + } + + /* Also reap the chapter being replaced by the open chapter. */ + uds_set_volume_index_open_chapter(index->volume_index, upto_virtual); + + new_map_update = index->volume->index_page_map->last_update; + if (new_map_update != old_map_update) + uds_log_info("replay changed index page map update from %llu to %llu", + (unsigned long long) old_map_update, + (unsigned long long) new_map_update); + + return UDS_SUCCESS; +} + +static int rebuild_index(struct uds_index *index) +{ + int result; + u64 lowest; + u64 highest; + bool is_empty = false; + u32 chapters_per_volume = index->volume->geometry->chapters_per_volume; + + index->volume->lookup_mode = LOOKUP_FOR_REBUILD; + result = uds_find_volume_chapter_boundaries(index->volume, &lowest, &highest, &is_empty); + if (result != UDS_SUCCESS) + return uds_log_fatal_strerror(result, + "cannot rebuild index: unknown volume chapter boundaries"); + + if (is_empty) { + index->newest_virtual_chapter = 0; + index->oldest_virtual_chapter = 0; + index->volume->lookup_mode = LOOKUP_NORMAL; + return UDS_SUCCESS; + } + + index->newest_virtual_chapter = highest + 1; + index->oldest_virtual_chapter = lowest; + if (index->newest_virtual_chapter == (index->oldest_virtual_chapter + chapters_per_volume)) + /* Skip the chapter shadowed by the open chapter. 
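+		 * For example, with 1024 chapters per volume, finding chapters
+		 * 5 through 1028 makes the next open chapter 1029, which maps
+		 * to the same physical chapter as virtual chapter 5, so that
+		 * oldest chapter is not worth replaying.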
*/ + index->oldest_virtual_chapter++; + + result = replay_volume(index); + if (result != UDS_SUCCESS) + return result; + + index->volume->lookup_mode = LOOKUP_NORMAL; + return UDS_SUCCESS; +} + +static void free_index_zone(struct index_zone *zone) +{ + if (zone == NULL) + return; + + uds_free_open_chapter(zone->open_chapter); + uds_free_open_chapter(zone->writing_chapter); + UDS_FREE(zone); +} + +static int make_index_zone(struct uds_index *index, unsigned int zone_number) +{ + int result; + struct index_zone *zone; + + result = UDS_ALLOCATE(1, struct index_zone, "index zone", &zone); + if (result != UDS_SUCCESS) + return result; + + result = uds_make_open_chapter(index->volume->geometry, + index->zone_count, + &zone->open_chapter); + if (result != UDS_SUCCESS) { + free_index_zone(zone); + return result; + } + + result = uds_make_open_chapter(index->volume->geometry, + index->zone_count, + &zone->writing_chapter); + if (result != UDS_SUCCESS) { + free_index_zone(zone); + return result; + } + + zone->index = index; + zone->id = zone_number; + index->zones[zone_number] = zone; + + return UDS_SUCCESS; +} + +int uds_make_index(struct configuration *config, + enum uds_open_index_type open_type, + struct index_load_context *load_context, + index_callback_t callback, + struct uds_index **new_index) +{ + int result; + bool loaded = false; + bool new = (open_type == UDS_CREATE); + struct uds_index *index = NULL; + struct index_zone *zone; + u64 nonce; + unsigned int z; + + result = UDS_ALLOCATE_EXTENDED(struct uds_index, + config->zone_count, + struct uds_request_queue *, + "index", + &index); + if (result != UDS_SUCCESS) + return result; + + index->zone_count = config->zone_count; + + result = uds_make_index_layout(config, new, &index->layout); + if (result != UDS_SUCCESS) { + uds_free_index(index); + return result; + } + + result = UDS_ALLOCATE(index->zone_count, struct index_zone *, "zones", &index->zones); + if (result != UDS_SUCCESS) { + uds_free_index(index); + return result; + } + + result = uds_make_volume(config, index->layout, &index->volume); + if (result != UDS_SUCCESS) { + uds_free_index(index); + return result; + } + + index->volume->lookup_mode = LOOKUP_NORMAL; + for (z = 0; z < index->zone_count; z++) { + result = make_index_zone(index, z); + if (result != UDS_SUCCESS) { + uds_free_index(index); + return uds_log_error_strerror(result, "Could not create index zone"); + } + } + + nonce = uds_get_volume_nonce(index->layout); + result = uds_make_volume_index(config, nonce, &index->volume_index); + if (result != UDS_SUCCESS) { + uds_free_index(index); + return uds_log_error_strerror(result, "could not make volume index"); + } + + index->load_context = load_context; + index->callback = callback; + + result = initialize_index_queues(index, config->geometry); + if (result != UDS_SUCCESS) { + uds_free_index(index); + return result; + } + + result = make_chapter_writer(index, &index->chapter_writer); + if (result != UDS_SUCCESS) { + uds_free_index(index); + return result; + } + + if (!new) { + result = load_index(index); + switch (result) { + case UDS_SUCCESS: + loaded = true; + break; + case -ENOMEM: + /* We should not try a rebuild for this error. 
*/ + uds_log_error_strerror(result, "index could not be loaded"); + break; + default: + uds_log_error_strerror(result, "index could not be loaded"); + if (open_type == UDS_LOAD) { + result = rebuild_index(index); + if (result != UDS_SUCCESS) + uds_log_error_strerror(result, + "index could not be rebuilt"); + } + break; + } + } + + if (result != UDS_SUCCESS) { + uds_free_index(index); + return uds_log_error_strerror(result, "fatal error in %s()", __func__); + } + + for (z = 0; z < index->zone_count; z++) { + zone = index->zones[z]; + zone->oldest_virtual_chapter = index->oldest_virtual_chapter; + zone->newest_virtual_chapter = index->newest_virtual_chapter; + } + + if (index->load_context != NULL) { + uds_lock_mutex(&index->load_context->mutex); + index->load_context->status = INDEX_READY; + /* + * If we get here, suspend is meaningless, but notify any thread trying to suspend + * us so it doesn't hang. + */ + uds_broadcast_cond(&index->load_context->cond); + uds_unlock_mutex(&index->load_context->mutex); + } + + index->has_saved_open_chapter = loaded; + index->need_to_save = !loaded; + *new_index = index; + return UDS_SUCCESS; +} + +void uds_free_index(struct uds_index *index) +{ + unsigned int i; + + if (index == NULL) + return; + + uds_request_queue_finish(index->triage_queue); + for (i = 0; i < index->zone_count; i++) + uds_request_queue_finish(index->zone_queues[i]); + + free_chapter_writer(index->chapter_writer); + + uds_free_volume_index(index->volume_index); + if (index->zones != NULL) { + for (i = 0; i < index->zone_count; i++) + free_index_zone(index->zones[i]); + UDS_FREE(index->zones); + } + + uds_free_volume(index->volume); + uds_free_index_layout(UDS_FORGET(index->layout)); + UDS_FREE(index); +} + +/* Wait for the chapter writer to complete any outstanding writes. */ +void uds_wait_for_idle_index(struct uds_index *index) +{ + struct chapter_writer *writer = index->chapter_writer; + + uds_lock_mutex(&writer->mutex); + while (writer->zones_to_write > 0) + uds_wait_cond(&writer->cond, &writer->mutex); + uds_unlock_mutex(&writer->mutex); +} + +/* This function assumes that all requests have been drained. */ +int uds_save_index(struct uds_index *index) +{ + int result; + + if (!index->need_to_save) + return UDS_SUCCESS; + + uds_wait_for_idle_index(index); + index->prev_save = index->last_save; + index->last_save = ((index->newest_virtual_chapter == 0) ? + NO_LAST_SAVE : + index->newest_virtual_chapter - 1); + uds_log_info("beginning save (vcn %llu)", (unsigned long long) index->last_save); + + result = uds_save_index_state(index->layout, index); + if (result != UDS_SUCCESS) { + uds_log_info("save index failed"); + index->last_save = index->prev_save; + } else { + index->has_saved_open_chapter = true; + index->need_to_save = false; + uds_log_info("finished save (vcn %llu)", (unsigned long long) index->last_save); + } + + return result; +} + +int uds_replace_index_storage(struct uds_index *index, struct block_device *bdev) +{ + return uds_replace_volume_storage(index->volume, index->layout, bdev); +} + +/* Accessing statistics should be safe from any thread. 
*/ +void uds_get_index_stats(struct uds_index *index, struct uds_index_stats *counters) +{ + struct volume_index_stats stats; + + uds_get_volume_index_stats(index->volume_index, &stats); + counters->entries_indexed = stats.record_count; + counters->collisions = stats.collision_count; + counters->entries_discarded = stats.discard_count; + + counters->memory_used = (index->volume_index->memory_size + + index->volume->cache_size + + index->chapter_writer->memory_size); +} + +void uds_enqueue_request(struct uds_request *request, enum request_stage stage) +{ + struct uds_index *index = request->index; + struct uds_request_queue *queue; + + switch (stage) { + case STAGE_TRIAGE: + if (index->triage_queue != NULL) { + queue = index->triage_queue; + break; + } + + fallthrough; + + case STAGE_INDEX: + request->zone_number = + uds_get_volume_index_zone(index->volume_index, &request->record_name); + fallthrough; + + case STAGE_MESSAGE: + queue = index->zone_queues[request->zone_number]; + break; + + default: + ASSERT_LOG_ONLY(false, "invalid index stage: %d", stage); + return; + } + + uds_request_queue_enqueue(queue, request); +} diff --git a/drivers/md/dm-vdo/index.h b/drivers/md/dm-vdo/index.h new file mode 100644 index 000000000000..aca61cabc433 --- /dev/null +++ b/drivers/md/dm-vdo/index.h @@ -0,0 +1,84 @@ +/* SPDX-License-Identifier: GPL-2.0-only */ +/* + * Copyright 2023 Red Hat + */ + +#ifndef UDS_INDEX_H +#define UDS_INDEX_H + +#include "index-layout.h" +#include "index-session.h" +#include "open-chapter.h" +#include "volume.h" +#include "volume-index.h" + +/* + * The index is a high-level structure which represents the totality of the UDS index. It manages + * the queues for incoming requests and dispatches them to the appropriate sub-components like the + * volume or the volume index. It also manages administrative tasks such as saving and loading the + * index. + * + * The index is divided into a number of independent zones and assigns each request to a zone based + * on its name. Most sub-components are similarly divided into zones as well so that requests in + * each zone usually operate without interference or coordination between zones. 
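+ *
+ * Zone assignment depends only on the record name, so a given name is always
+ * handled by the same zone. Roughly (see uds_enqueue_request() in index.c):
+ *
+ *	request->zone_number =
+ *		uds_get_volume_index_zone(index->volume_index, &request->record_name);
+ *	uds_request_queue_enqueue(index->zone_queues[request->zone_number], request);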
+ */ + +typedef void (*index_callback_t)(struct uds_request *request); + +struct index_zone { + struct uds_index *index; + struct open_chapter_zone *open_chapter; + struct open_chapter_zone *writing_chapter; + u64 oldest_virtual_chapter; + u64 newest_virtual_chapter; + unsigned int id; +}; + +struct uds_index { + bool has_saved_open_chapter; + bool need_to_save; + struct index_load_context *load_context; + struct index_layout *layout; + struct volume_index *volume_index; + struct volume *volume; + unsigned int zone_count; + struct index_zone **zones; + + u64 oldest_virtual_chapter; + u64 newest_virtual_chapter; + + u64 last_save; + u64 prev_save; + struct chapter_writer *chapter_writer; + + index_callback_t callback; + struct uds_request_queue *triage_queue; + struct uds_request_queue *zone_queues[]; +}; + +enum request_stage { + STAGE_TRIAGE, + STAGE_INDEX, + STAGE_MESSAGE, +}; + +int __must_check uds_make_index(struct configuration *config, + enum uds_open_index_type open_type, + struct index_load_context *load_context, + index_callback_t callback, + struct uds_index **new_index); + +int __must_check uds_save_index(struct uds_index *index); + +void uds_free_index(struct uds_index *index); + +int __must_check uds_replace_index_storage(struct uds_index *index, + struct block_device *bdev); + +void uds_get_index_stats(struct uds_index *index, struct uds_index_stats *counters); + +void uds_enqueue_request(struct uds_request *request, enum request_stage stage); + +void uds_wait_for_idle_index(struct uds_index *index); + +#endif /* UDS_INDEX_H */ diff --git a/drivers/md/dm-vdo/sparse-cache.c b/drivers/md/dm-vdo/sparse-cache.c new file mode 100644 index 000000000000..b13aaa61e3f6 --- /dev/null +++ b/drivers/md/dm-vdo/sparse-cache.c @@ -0,0 +1,595 @@ +// SPDX-License-Identifier: GPL-2.0-only +/* + * Copyright 2023 Red Hat + */ + +#include "sparse-cache.h" + +#include +#include + +#include "chapter-index.h" +#include "config.h" +#include "index.h" +#include "logger.h" +#include "memory-alloc.h" +#include "permassert.h" +#include "uds-threads.h" + +/* + * Since the cache is small, it is implemented as a simple array of cache entries. Searching for a + * specific virtual chapter is implemented as a linear search. The cache replacement policy is + * least-recently-used (LRU). Again, the small size of the cache allows the LRU order to be + * maintained by shifting entries in an array list. + * + * Changing the contents of the cache requires the coordinated participation of all zone threads + * via the careful use of barrier messages sent to all the index zones by the triage queue worker + * thread. The critical invariant for coordination is that the cache membership must not change + * between updates, so that all calls to uds_sparse_cache_contains() from the zone threads must all + * receive the same results for every virtual chapter number. To ensure that critical invariant, + * state changes such as "that virtual chapter is no longer in the volume" and "skip searching that + * chapter because it has had too many cache misses" are represented separately from the cache + * membership information (the virtual chapter number). + * + * As a result of this invariant, we have the guarantee that every zone thread will call + * uds_update_sparse_cache() once and exactly once to request a chapter that is not in the cache, + * and the serialization of the barrier requests from the triage queue ensures they will all + * request the same chapter number. 
This means the only synchronization we need can be provided by
+ * a pair of thread barriers used only in the uds_update_sparse_cache() call, providing a critical
+ * section where a single zone thread can drive the cache update while all the other zone threads
+ * are known to be blocked, waiting in the second barrier. Outside that critical section, all the
+ * zone threads implicitly hold a shared lock. Inside it, the thread for zone zero holds an
+ * exclusive lock. No other threads may access or modify the cache entries.
+ *
+ * Chapter statistics must only be modified by a single thread, which is also the zone zero thread.
+ * All fields that might be frequently updated by that thread are kept in separate cache-aligned
+ * structures so they will not cause cache contention via "false sharing" with the fields that are
+ * frequently accessed by all of the zone threads.
+ *
+ * The LRU order is managed independently by each zone thread, and each zone uses its own list for
+ * searching and cache membership queries. The zone zero list is used to decide which chapter to
+ * evict when the cache is updated, and its search list is copied to the other threads at that
+ * time.
+ *
+ * The virtual chapter number field of the cache entry is the single field indicating whether a
+ * chapter is a member of the cache or not. The value NO_CHAPTER is used to represent a null or
+ * undefined chapter number. When present in the virtual chapter number field of a
+ * cached_chapter_index, it indicates that the cache entry is dead, and all the other fields of
+ * that entry (other than immutable pointers to cache memory) are undefined and irrelevant. Any
+ * cache entry that is not marked as dead is fully defined and a member of the cache, and
+ * uds_sparse_cache_contains() will always return true for any virtual chapter number that appears
+ * in any of the cache entries.
+ *
+ * A chapter index that is a member of the cache may be excluded from searches between calls to
+ * uds_update_sparse_cache() in two different ways. First, when a chapter falls off the end of the
+ * volume, its virtual chapter number will be less than the oldest virtual chapter number. Since
+ * that chapter is no longer part of the volume, there's no point in continuing to search that
+ * chapter index. Once invalidated, that virtual chapter will still be considered a member of the
+ * cache, but it will no longer be searched for matching names.
+ *
+ * The second mechanism is a heuristic based on keeping track of the number of consecutive search
+ * misses in a given chapter index. Once that count exceeds a threshold, the skip_search flag will
+ * be set to true, causing the chapter to be skipped when searching the entire cache, but still
+ * allowing it to be found when searching for a hook in that specific chapter. Finding a hook will
+ * clear the skip_search flag, once again allowing the non-hook searches to use that cache entry.
+ * Again, regardless of the state of the skip_search flag, the virtual chapter must still be
+ * considered to be a member of the cache for uds_sparse_cache_contains().
+ */
+
+enum {
+	SKIP_SEARCH_THRESHOLD = 20000,
+	ZONE_ZERO = 0,
+};
+
+/*
+ * These counters are essentially fields of the struct cached_chapter_index, but are segregated
+ * into this structure because they are frequently modified. They are grouped and aligned to keep
+ * them on different cache lines from the chapter fields that are accessed far more often than they
+ * are updated.
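+ *
+ * (L1_CACHE_BYTES and __aligned() come from the core kernel headers; padding
+ * and aligning each instance to its own cache line is what prevents the
+ * false sharing described above.)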
+ */ +struct __aligned(L1_CACHE_BYTES) cached_index_counters { + u64 consecutive_misses; +}; + +struct __aligned(L1_CACHE_BYTES) cached_chapter_index { + /* + * The virtual chapter number of the cached chapter index. NO_CHAPTER means this cache + * entry is unused. This field must only be modified in the critical section in + * uds_update_sparse_cache(). + */ + u64 virtual_chapter; + + u32 index_pages_count; + + /* + * These pointers are immutable during the life of the cache. The contents of the arrays + * change when the cache entry is replaced. + */ + struct delta_index_page *index_pages; + struct dm_buffer **page_buffers; + + /* + * If set, skip the chapter when searching the entire cache. This flag is just a + * performance optimization. This flag is mutable between cache updates, but it rarely + * changes and is frequently accessed, so it groups with the immutable fields. + */ + bool skip_search; + + /* + * The cache-aligned counters change often and are placed at the end of the structure to + * prevent false sharing with the more stable fields above. + */ + struct cached_index_counters counters; +}; + +/* + * A search_list represents an ordering of the sparse chapter index cache entry array, from most + * recently accessed to least recently accessed, which is the order in which the indexes should be + * searched and the reverse order in which they should be evicted from the cache. + * + * Cache entries that are dead or empty are kept at the end of the list, avoiding the need to even + * iterate over them to search, and ensuring that dead entries are replaced before any live entries + * are evicted. + * + * The search list is instantiated for each zone thread, avoiding any need for synchronization. The + * structure is allocated on a cache boundary to avoid false sharing of memory cache lines between + * zone threads. 
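+ *
+ * For example (purely illustrative), a capacity-4 search list with two live
+ * entries and two dead ones might be ordered:
+ *
+ *	entries:          [chapter 12][chapter 9][dead][dead]
+ *	first_dead_entry:  2
+ *
+ * A search hit on chapter 9 moves it to the front, giving
+ * [chapter 9][chapter 12][dead][dead]; first_dead_entry stays at 2.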
+ */ +struct search_list { + u8 capacity; + u8 first_dead_entry; + struct cached_chapter_index *entries[]; +}; + +struct sparse_cache { + const struct geometry *geometry; + unsigned int capacity; + unsigned int zone_count; + + unsigned int skip_threshold; + struct search_list *search_lists[MAX_ZONES]; + struct cached_chapter_index **scratch_entries; + + struct barrier begin_update_barrier; + struct barrier end_update_barrier; + + struct cached_chapter_index chapters[]; +}; + +static int __must_check +initialize_cached_chapter_index(struct cached_chapter_index *chapter, + const struct geometry *geometry) +{ + int result; + + chapter->virtual_chapter = NO_CHAPTER; + chapter->index_pages_count = geometry->index_pages_per_chapter; + + result = UDS_ALLOCATE(chapter->index_pages_count, + struct delta_index_page, + __func__, + &chapter->index_pages); + if (result != UDS_SUCCESS) + return result; + + return UDS_ALLOCATE(chapter->index_pages_count, + struct dm_buffer *, + "sparse index volume pages", + &chapter->page_buffers); +} + +static int __must_check make_search_list(struct sparse_cache *cache, struct search_list **list_ptr) +{ + struct search_list *list; + unsigned int bytes; + u8 i; + int result; + + bytes = (sizeof(struct search_list) + + (cache->capacity * sizeof(struct cached_chapter_index *))); + result = uds_allocate_cache_aligned(bytes, "search list", &list); + if (result != UDS_SUCCESS) + return result; + + list->capacity = cache->capacity; + list->first_dead_entry = 0; + + for (i = 0; i < list->capacity; i++) + list->entries[i] = &cache->chapters[i]; + + *list_ptr = list; + return UDS_SUCCESS; +} + +int uds_make_sparse_cache(const struct geometry *geometry, + unsigned int capacity, + unsigned int zone_count, + struct sparse_cache **cache_ptr) +{ + int result; + unsigned int i; + struct sparse_cache *cache; + unsigned int bytes; + + bytes = (sizeof(struct sparse_cache) + (capacity * sizeof(struct cached_chapter_index))); + result = uds_allocate_cache_aligned(bytes, "sparse cache", &cache); + if (result != UDS_SUCCESS) + return result; + + cache->geometry = geometry; + cache->capacity = capacity; + cache->zone_count = zone_count; + + /* + * Scale down the skip threshold since the cache only counts cache misses in zone zero, but + * requests are being handled in all zones. + */ + cache->skip_threshold = (SKIP_SEARCH_THRESHOLD / zone_count); + + result = uds_initialize_barrier(&cache->begin_update_barrier, zone_count); + if (result != UDS_SUCCESS) { + uds_free_sparse_cache(cache); + return result; + } + + result = uds_initialize_barrier(&cache->end_update_barrier, zone_count); + if (result != UDS_SUCCESS) { + uds_free_sparse_cache(cache); + return result; + } + + for (i = 0; i < capacity; i++) { + result = initialize_cached_chapter_index(&cache->chapters[i], geometry); + if (result != UDS_SUCCESS) { + uds_free_sparse_cache(cache); + return result; + } + } + + for (i = 0; i < zone_count; i++) { + result = make_search_list(cache, &cache->search_lists[i]); + if (result != UDS_SUCCESS) { + uds_free_sparse_cache(cache); + return result; + } + } + + /* purge_search_list() needs some temporary lists for sorting. 
*/ + result = UDS_ALLOCATE(capacity * 2, + struct cached_chapter_index *, + "scratch entries", + &cache->scratch_entries); + if (result != UDS_SUCCESS) { + uds_free_sparse_cache(cache); + return result; + } + + *cache_ptr = cache; + return UDS_SUCCESS; +} + +static inline void set_skip_search(struct cached_chapter_index *chapter, bool skip_search) +{ + /* Check before setting to reduce cache line contention. */ + if (READ_ONCE(chapter->skip_search) != skip_search) + WRITE_ONCE(chapter->skip_search, skip_search); +} + +static void score_search_hit(struct cached_chapter_index *chapter) +{ + chapter->counters.consecutive_misses = 0; + set_skip_search(chapter, false); +} + +static void score_search_miss(struct sparse_cache *cache, struct cached_chapter_index *chapter) +{ + chapter->counters.consecutive_misses++; + if (chapter->counters.consecutive_misses > cache->skip_threshold) + set_skip_search(chapter, true); +} + +static void release_cached_chapter_index(struct cached_chapter_index *chapter) +{ + unsigned int i; + + chapter->virtual_chapter = NO_CHAPTER; + if (chapter->page_buffers == NULL) + return; + + for (i = 0; i < chapter->index_pages_count; i++) + if (chapter->page_buffers[i] != NULL) + dm_bufio_release(UDS_FORGET(chapter->page_buffers[i])); +} + +void uds_free_sparse_cache(struct sparse_cache *cache) +{ + unsigned int i; + + if (cache == NULL) + return; + + UDS_FREE(cache->scratch_entries); + + for (i = 0; i < cache->zone_count; i++) + UDS_FREE(cache->search_lists[i]); + + for (i = 0; i < cache->capacity; i++) { + release_cached_chapter_index(&cache->chapters[i]); + UDS_FREE(cache->chapters[i].index_pages); + UDS_FREE(cache->chapters[i].page_buffers); + } + + uds_destroy_barrier(&cache->begin_update_barrier); + uds_destroy_barrier(&cache->end_update_barrier); + UDS_FREE(cache); +} + +/* + * Take the indicated element of the search list and move it to the start, pushing the pointers + * previously before it back down the list. + */ +static inline void set_newest_entry(struct search_list *search_list, u8 index) +{ + struct cached_chapter_index *newest; + + if (index > 0) { + newest = search_list->entries[index]; + memmove(&search_list->entries[1], + &search_list->entries[0], + index * sizeof(struct cached_chapter_index *)); + search_list->entries[0] = newest; + } + + /* + * This function may have moved a dead chapter to the front of the list for reuse, in which + * case the set of dead chapters becomes smaller. + */ + if (search_list->first_dead_entry <= index) + search_list->first_dead_entry++; +} + +bool uds_sparse_cache_contains(struct sparse_cache *cache, + u64 virtual_chapter, + unsigned int zone_number) +{ + struct search_list *search_list; + struct cached_chapter_index *chapter; + u8 i; + + /* + * The correctness of the barriers depends on the invariant that between calls to + * uds_update_sparse_cache(), the answers this function returns must never vary: the result + * for a given chapter must be identical across zones. That invariant must be maintained + * even if the chapter falls off the end of the volume, or if searching it is disabled + * because of too many search misses. 
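set_skip_search() above checks the current value before writing so that a flag read by every zone thread is not dirtied when nothing actually changes. A user-space sketch of the same check-before-write idea, using C11 atomics in place of READ_ONCE()/WRITE_ONCE() (illustrative only):

#include <stdatomic.h>
#include <stdbool.h>

static void example_set_flag(atomic_bool *flag, bool value)
{
	/* Only store when the value changes, so readers sharing the cache
	 * line are not invalidated needlessly. */
	if (atomic_load_explicit(flag, memory_order_relaxed) != value)
		atomic_store_explicit(flag, value, memory_order_relaxed);
}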
+ */ + search_list = cache->search_lists[zone_number]; + for (i = 0; i < search_list->first_dead_entry; i++) { + chapter = search_list->entries[i]; + + if (virtual_chapter == chapter->virtual_chapter) { + if (zone_number == ZONE_ZERO) + score_search_hit(chapter); + + set_newest_entry(search_list, i); + return true; + } + } + + return false; +} + +/* + * Re-sort cache entries into three sets (active, skippable, and dead) while maintaining the LRU + * ordering that already existed. This operation must only be called during the critical section in + * uds_update_sparse_cache(). + */ +static void purge_search_list(struct search_list *search_list, + struct sparse_cache *cache, + u64 oldest_virtual_chapter) +{ + struct cached_chapter_index **entries; + struct cached_chapter_index **skipped; + struct cached_chapter_index **dead; + struct cached_chapter_index *chapter; + unsigned int next_alive = 0; + unsigned int next_skipped = 0; + unsigned int next_dead = 0; + unsigned int i; + + entries = &search_list->entries[0]; + skipped = &cache->scratch_entries[0]; + dead = &cache->scratch_entries[search_list->capacity]; + + for (i = 0; i < search_list->first_dead_entry; i++) { + chapter = search_list->entries[i]; + if ((chapter->virtual_chapter < oldest_virtual_chapter) || + (chapter->virtual_chapter == NO_CHAPTER)) + dead[next_dead++] = chapter; + else if (chapter->skip_search) + skipped[next_skipped++] = chapter; + else + entries[next_alive++] = chapter; + } + + memcpy(&entries[next_alive], + skipped, + next_skipped * sizeof(struct cached_chapter_index *)); + memcpy(&entries[next_alive + next_skipped], + dead, + next_dead * sizeof(struct cached_chapter_index *)); + search_list->first_dead_entry = next_alive + next_skipped; +} + +static int __must_check cache_chapter_index(struct cached_chapter_index *chapter, + u64 virtual_chapter, + const struct volume *volume) +{ + int result; + + release_cached_chapter_index(chapter); + + result = uds_read_chapter_index_from_volume(volume, + virtual_chapter, + chapter->page_buffers, + chapter->index_pages); + if (result != UDS_SUCCESS) + return result; + + chapter->counters.consecutive_misses = 0; + chapter->virtual_chapter = virtual_chapter; + chapter->skip_search = false; + + return UDS_SUCCESS; +} + +static inline void copy_search_list(const struct search_list *source, struct search_list *target) +{ + *target = *source; + memcpy(target->entries, + source->entries, + source->capacity * sizeof(struct cached_chapter_index *)); +} + +/* + * Update the sparse cache to contain a chapter index. This function must be called by all the zone + * threads with the same chapter number to correctly enter the thread barriers used to synchronize + * the cache updates. + */ +int uds_update_sparse_cache(struct index_zone *zone, u64 virtual_chapter) +{ + int result = UDS_SUCCESS; + const struct uds_index *index = zone->index; + struct sparse_cache *cache = index->volume->sparse_cache; + + if (uds_sparse_cache_contains(cache, virtual_chapter, zone->id)) + return UDS_SUCCESS; + + /* + * Wait for every zone thread to reach its corresponding barrier request and invoke this + * function before starting to modify the cache. + */ + uds_enter_barrier(&cache->begin_update_barrier); + + /* + * This is the start of the critical section: the zone zero thread is captain, effectively + * holding an exclusive lock on the sparse cache. All the other zone threads must do + * nothing between the two barriers. They will wait at the end_update_barrier again for the + * captain to finish the update. 
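The critical section described above follows a common two-barrier pattern: every zone thread enters the first barrier, one designated thread performs the update, and all threads meet again at the second barrier before resuming lock-free reads. A minimal user-space sketch with POSIX barriers (invented names; the kernel code uses its own barrier wrappers and designates zone zero as the captain). Both barriers are assumed to have been initialized with the zone count:

#include <pthread.h>

struct shared_cache {
	unsigned long generation;
};

static pthread_barrier_t begin_update;
static pthread_barrier_t end_update;

static void example_update(struct shared_cache *cache, unsigned int zone_id)
{
	pthread_barrier_wait(&begin_update);

	/* Only the captain touches the shared structure; the other threads
	 * are parked between the two barriers, so no locking is required. */
	if (zone_id == 0)
		cache->generation++;

	pthread_barrier_wait(&end_update);
}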
+ */ + + if (zone->id == ZONE_ZERO) { + unsigned int z; + struct search_list *list = cache->search_lists[ZONE_ZERO]; + + purge_search_list(list, cache, zone->oldest_virtual_chapter); + + if (virtual_chapter >= index->oldest_virtual_chapter) { + set_newest_entry(list, list->capacity - 1); + result = cache_chapter_index(list->entries[0], + virtual_chapter, + index->volume); + } + + for (z = 1; z < cache->zone_count; z++) + copy_search_list(list, cache->search_lists[z]); + } + + /* + * This is the end of the critical section. All cache invariants must have been restored. + */ + uds_enter_barrier(&cache->end_update_barrier); + return result; +} + +void uds_invalidate_sparse_cache(struct sparse_cache *cache) +{ + unsigned int i; + + for (i = 0; i < cache->capacity; i++) + release_cached_chapter_index(&cache->chapters[i]); +} + +static inline bool should_skip_chapter(struct cached_chapter_index *chapter, + u64 oldest_chapter, + u64 requested_chapter) +{ + if ((chapter->virtual_chapter == NO_CHAPTER) || + (chapter->virtual_chapter < oldest_chapter)) + return true; + + if (requested_chapter != NO_CHAPTER) + return requested_chapter != chapter->virtual_chapter; + else + return READ_ONCE(chapter->skip_search); +} + +static int __must_check +search_cached_chapter_index(struct cached_chapter_index *chapter, + const struct geometry *geometry, + const struct index_page_map *index_page_map, + const struct uds_record_name *name, + u16 *record_page_ptr) +{ + u32 physical_chapter = uds_map_to_physical_chapter(geometry, chapter->virtual_chapter); + u32 index_page_number = uds_find_index_page_number(index_page_map, name, physical_chapter); + struct delta_index_page *index_page = &chapter->index_pages[index_page_number]; + + return uds_search_chapter_index_page(index_page, geometry, name, record_page_ptr); +} + +int uds_search_sparse_cache(struct index_zone *zone, + const struct uds_record_name *name, + u64 *virtual_chapter_ptr, + u16 *record_page_ptr) +{ + int result; + struct volume *volume = zone->index->volume; + struct sparse_cache *cache = volume->sparse_cache; + struct cached_chapter_index *chapter; + struct search_list *search_list; + u8 i; + /* Search the entire cache unless a specific chapter was requested. */ + bool search_one = (*virtual_chapter_ptr != NO_CHAPTER); + + *record_page_ptr = NO_CHAPTER_INDEX_ENTRY; + search_list = cache->search_lists[zone->id]; + for (i = 0; i < search_list->first_dead_entry; i++) { + chapter = search_list->entries[i]; + + if (should_skip_chapter(chapter, + zone->oldest_virtual_chapter, + *virtual_chapter_ptr)) + continue; + + result = search_cached_chapter_index(chapter, + cache->geometry, + volume->index_page_map, + name, + record_page_ptr); + if (result != UDS_SUCCESS) + return result; + + if (*record_page_ptr != NO_CHAPTER_INDEX_ENTRY) { + /* + * In theory, this might be a false match while a true match exists in + * another chapter, but that's a very rare case and not worth the extra + * search complexity. 
+ */ + set_newest_entry(search_list, i); + if (zone->id == ZONE_ZERO) + score_search_hit(chapter); + + *virtual_chapter_ptr = chapter->virtual_chapter; + return UDS_SUCCESS; + } + + if (zone->id == ZONE_ZERO) + score_search_miss(cache, chapter); + + if (search_one) + break; + } + + return UDS_SUCCESS; +} diff --git a/drivers/md/dm-vdo/sparse-cache.h b/drivers/md/dm-vdo/sparse-cache.h new file mode 100644 index 000000000000..891175de762c --- /dev/null +++ b/drivers/md/dm-vdo/sparse-cache.h @@ -0,0 +1,49 @@ +/* SPDX-License-Identifier: GPL-2.0-only */ +/* + * Copyright 2023 Red Hat + */ + +#ifndef UDS_SPARSE_CACHE_H +#define UDS_SPARSE_CACHE_H + +#include "geometry.h" +#include "uds.h" + +/* + * The sparse cache is a cache of entire chapter indexes from sparse chapters used for searching + * for names after all other search paths have failed. It contains only complete chapter indexes; + * record pages from sparse chapters and single index pages used for resolving hooks are kept in + * the regular page cache in the volume. + * + * The most important property of this cache is the absence of synchronization for read operations. + * Safe concurrent access to the cache by the zone threads is controlled by the triage queue and + * the barrier requests it issues to the zone queues. The set of cached chapters does not and must + * not change between the carefully coordinated calls to uds_update_sparse_cache() from the zone + * threads. Outside of updates, every zone will get the same result when calling + * uds_sparse_cache_contains() as every other zone. + */ + +struct index_zone; +struct sparse_cache; + +int __must_check uds_make_sparse_cache(const struct geometry *geometry, + unsigned int capacity, + unsigned int zone_count, + struct sparse_cache **cache_ptr); + +void uds_free_sparse_cache(struct sparse_cache *cache); + +bool uds_sparse_cache_contains(struct sparse_cache *cache, + u64 virtual_chapter, + unsigned int zone_number); + +int __must_check uds_update_sparse_cache(struct index_zone *zone, u64 virtual_chapter); + +void uds_invalidate_sparse_cache(struct sparse_cache *cache); + +int __must_check uds_search_sparse_cache(struct index_zone *zone, + const struct uds_record_name *name, + u64 *virtual_chapter_ptr, + u16 *record_page_ptr); + +#endif /* UDS_SPARSE_CACHE_H */ From patchwork Thu Oct 26 21:41:13 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Matthew Sakai X-Patchwork-Id: 13437950 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id CED32C25B48 for ; Thu, 26 Oct 2023 21:44:29 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1344943AbjJZVo3 (ORCPT ); Thu, 26 Oct 2023 17:44:29 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:40942 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1344898AbjJZVoQ (ORCPT ); Thu, 26 Oct 2023 17:44:16 -0400 Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.133.124]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 3BF871B8 for ; Thu, 26 Oct 2023 14:41:43 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1698356502; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version: 
content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=6uwQsAPUU2XlcaXzxiM4B9xYonnTjsTiD+XSVq65RMI=; b=KWANbUQEovZa5XPbJsMQhsrG89VaMSMHY/ciA5NQtwgrhINtEmExess3Del0pHtPmd3Wek 8JrLm+imgmC0uCiOcX52CgqmW47vwERaN1kGEOZpQ6O83lD4bHCzxQhXMGtZLSXNChBlSh +RPCIu6uBPiSU7tqUvwReoBIxbtL9Ew= Received: from mimecast-mx02.redhat.com (mimecast-mx02.redhat.com [66.187.233.88]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.3, cipher=TLS_AES_256_GCM_SHA384) id us-mta-488-KbOqpmT6PP29tb8GzaZCag-1; Thu, 26 Oct 2023 17:41:40 -0400 X-MC-Unique: KbOqpmT6PP29tb8GzaZCag-1 Received: from smtp.corp.redhat.com (int-mx02.intmail.prod.int.rdu2.redhat.com [10.11.54.2]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (2048 bits) server-digest SHA256) (No client certificate requested) by mimecast-mx02.redhat.com (Postfix) with ESMTPS id E6C5088D186; Thu, 26 Oct 2023 21:41:39 +0000 (UTC) Received: from pbitcolo-build-10.permabit.com (pbitcolo-build-10.permabit.lab.eng.bos.redhat.com [10.19.117.76]) by smtp.corp.redhat.com (Postfix) with ESMTP id CEA5E40C6F79; Thu, 26 Oct 2023 21:41:39 +0000 (UTC) Received: by pbitcolo-build-10.permabit.com (Postfix, from userid 1138) id 846A13003D; Thu, 26 Oct 2023 17:41:39 -0400 (EDT) From: Matthew Sakai To: dm-devel@lists.linux.dev, linux-block@vger.kernel.org Cc: Matthew Sakai Subject: [PATCH v4 16/39] dm vdo: implement external deduplication index interface Date: Thu, 26 Oct 2023 17:41:13 -0400 Message-Id: <20231026214136.1067410-17-msakai@redhat.com> In-Reply-To: <20231026214136.1067410-1-msakai@redhat.com> References: <20231026214136.1067410-1-msakai@redhat.com> MIME-Version: 1.0 X-Scanned-By: MIMEDefang 3.4.1 on 10.11.54.2 Precedence: bulk List-ID: X-Mailing-List: linux-block@vger.kernel.org The deduplication index interface for index clients includes the deduplication request and index session structures. This is the interface that the rest of the vdo target uses to make requests, receive responses, and collect statistics. This patch also adds sysfs nodes for inspecting various index properties at runtime. Co-developed-by: J. corwin Coburn Signed-off-by: J. 
corwin Coburn Co-developed-by: Michael Sclafani Signed-off-by: Michael Sclafani Co-developed-by: Thomas Jaskiewicz Signed-off-by: Thomas Jaskiewicz Co-developed-by: John Wiele Signed-off-by: John Wiele Signed-off-by: Matthew Sakai --- drivers/md/dm-vdo/index-session.c | 746 ++++++++++++++++++++++++++++++ drivers/md/dm-vdo/index-session.h | 84 ++++ drivers/md/dm-vdo/uds-sysfs.c | 185 ++++++++ drivers/md/dm-vdo/uds-sysfs.h | 12 + drivers/md/dm-vdo/uds.h | 328 +++++++++++++ 5 files changed, 1355 insertions(+) create mode 100644 drivers/md/dm-vdo/index-session.c create mode 100644 drivers/md/dm-vdo/index-session.h create mode 100644 drivers/md/dm-vdo/uds-sysfs.c create mode 100644 drivers/md/dm-vdo/uds-sysfs.h create mode 100644 drivers/md/dm-vdo/uds.h diff --git a/drivers/md/dm-vdo/index-session.c b/drivers/md/dm-vdo/index-session.c new file mode 100644 index 000000000000..052ba093fbf1 --- /dev/null +++ b/drivers/md/dm-vdo/index-session.c @@ -0,0 +1,746 @@ +// SPDX-License-Identifier: GPL-2.0-only +/* + * Copyright 2023 Red Hat + */ + +#include "index-session.h" + +#include + +#include "index.h" +#include "index-layout.h" +#include "logger.h" +#include "memory-alloc.h" +#include "request-queue.h" +#include "time-utils.h" + +/* + * The index session contains a lock (the request_mutex) which ensures that only one thread can + * change the state of its index at a time. The state field indicates the current state of the + * index through a set of descriptive flags. The request_mutex must be notified whenever a + * non-transient state flag is cleared. The request_mutex is also used to count the number of + * requests currently in progress so that they can be drained when suspending or closing the index. + * + * If the index session is suspended shortly after opening an index, it may have to suspend during + * a rebuild. Depending on the size of the index, a rebuild may take a significant amount of time, + * so UDS allows the rebuild to be paused in order to suspend the session in a timely manner. When + * the index session is resumed, the rebuild can continue from where it left off. If the index + * session is shut down with a suspended rebuild, the rebuild progress is abandoned and the rebuild + * will start from the beginning the next time the index is loaded. The mutex and status fields in + * the index_load_context are used to record the state of any interrupted rebuild. + */ + +enum index_session_flag_bit { + IS_FLAG_BIT_START = 8, + /* The session has started loading an index but not completed it. */ + IS_FLAG_BIT_LOADING = IS_FLAG_BIT_START, + /* The session has loaded an index, which can handle requests. */ + IS_FLAG_BIT_LOADED, + /* The session's index has been permanently disabled. */ + IS_FLAG_BIT_DISABLED, + /* The session's index is suspended. */ + IS_FLAG_BIT_SUSPENDED, + /* The session is handling some index state change. */ + IS_FLAG_BIT_WAITING, + /* The session's index is closing and draining requests. */ + IS_FLAG_BIT_CLOSING, + /* The session is being destroyed and is draining requests. */ + IS_FLAG_BIT_DESTROYING, +}; + +enum index_session_flag { + IS_FLAG_LOADED = (1 << IS_FLAG_BIT_LOADED), + IS_FLAG_LOADING = (1 << IS_FLAG_BIT_LOADING), + IS_FLAG_DISABLED = (1 << IS_FLAG_BIT_DISABLED), + IS_FLAG_SUSPENDED = (1 << IS_FLAG_BIT_SUSPENDED), + IS_FLAG_WAITING = (1 << IS_FLAG_BIT_WAITING), + IS_FLAG_CLOSING = (1 << IS_FLAG_BIT_CLOSING), + IS_FLAG_DESTROYING = (1 << IS_FLAG_BIT_DESTROYING), +}; + +/* Release a reference to an index session. 
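The request counting described in the comment above amounts to a reference count protected by the request_mutex, with a broadcast when the count drops to zero so that a suspend or close can drain in-flight requests. A user-space sketch of that pattern (invented names; mutex and condition variable initialization elided):

#include <pthread.h>

struct example_session {
	pthread_mutex_t mutex;
	pthread_cond_t cond;
	int request_count;
};

static void example_request_begin(struct example_session *s)
{
	pthread_mutex_lock(&s->mutex);
	s->request_count++;
	pthread_mutex_unlock(&s->mutex);
}

static void example_request_end(struct example_session *s)
{
	pthread_mutex_lock(&s->mutex);
	if (--s->request_count == 0)
		pthread_cond_broadcast(&s->cond);
	pthread_mutex_unlock(&s->mutex);
}

/* Wait for all in-flight requests to finish, as when suspending or closing. */
static void example_drain_requests(struct example_session *s)
{
	pthread_mutex_lock(&s->mutex);
	while (s->request_count > 0)
		pthread_cond_wait(&s->cond, &s->mutex);
	pthread_mutex_unlock(&s->mutex);
}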
*/ +static void release_index_session(struct uds_index_session *index_session) +{ + uds_lock_mutex(&index_session->request_mutex); + if (--index_session->request_count == 0) + uds_broadcast_cond(&index_session->request_cond); + uds_unlock_mutex(&index_session->request_mutex); +} + +/* + * Acquire a reference to the index session for an asynchronous index request. The reference must + * eventually be released with a corresponding call to release_index_session(). + */ +static int get_index_session(struct uds_index_session *index_session) +{ + unsigned int state; + int result = UDS_SUCCESS; + + uds_lock_mutex(&index_session->request_mutex); + index_session->request_count++; + state = index_session->state; + uds_unlock_mutex(&index_session->request_mutex); + + if (state == IS_FLAG_LOADED) { + return UDS_SUCCESS; + } else if (state & IS_FLAG_DISABLED) { + result = UDS_DISABLED; + } else if ((state & IS_FLAG_LOADING) || + (state & IS_FLAG_SUSPENDED) || + (state & IS_FLAG_WAITING)) { + result = -EBUSY; + } else { + result = UDS_NO_INDEX; + } + + release_index_session(index_session); + return result; +} + +int uds_launch_request(struct uds_request *request) +{ + size_t internal_size; + int result; + + if (request->callback == NULL) { + uds_log_error("missing required callback"); + return -EINVAL; + } + + switch (request->type) { + case UDS_DELETE: + case UDS_POST: + case UDS_QUERY: + case UDS_QUERY_NO_UPDATE: + case UDS_UPDATE: + break; + default: + uds_log_error("received invalid callback type"); + return -EINVAL; + } + + /* Reset all internal fields before processing. */ + internal_size = sizeof(struct uds_request) - offsetof(struct uds_request, zone_number); + // FIXME should be using struct_group for this instead + memset((char *) request + sizeof(*request) - internal_size, 0, internal_size); + + result = get_index_session(request->session); + if (result != UDS_SUCCESS) + return result; + + request->found = false; + request->unbatched = false; + request->index = request->session->index; + + uds_enqueue_request(request, STAGE_TRIAGE); + return UDS_SUCCESS; +} + +static void enter_callback_stage(struct uds_request *request) +{ + if (request->status != UDS_SUCCESS) { + /* All request errors are considered unrecoverable */ + uds_lock_mutex(&request->session->request_mutex); + request->session->state |= IS_FLAG_DISABLED; + uds_unlock_mutex(&request->session->request_mutex); + } + + uds_request_queue_enqueue(request->session->callback_queue, request); +} + +static inline void count_once(u64 *count_ptr) +{ + WRITE_ONCE(*count_ptr, READ_ONCE(*count_ptr) + 1); +} + +static void update_session_stats(struct uds_request *request) +{ + struct session_stats *session_stats = &request->session->stats; + + count_once(&session_stats->requests); + + switch (request->type) { + case UDS_POST: + if (request->found) + count_once(&session_stats->posts_found); + else + count_once(&session_stats->posts_not_found); + + if (request->location == UDS_LOCATION_IN_OPEN_CHAPTER) + count_once(&session_stats->posts_found_open_chapter); + else if (request->location == UDS_LOCATION_IN_DENSE) + count_once(&session_stats->posts_found_dense); + else if (request->location == UDS_LOCATION_IN_SPARSE) + count_once(&session_stats->posts_found_sparse); + break; + + case UDS_UPDATE: + if (request->found) + count_once(&session_stats->updates_found); + else + count_once(&session_stats->updates_not_found); + break; + + case UDS_DELETE: + if (request->found) + count_once(&session_stats->deletions_found); + else + 
count_once(&session_stats->deletions_not_found); + break; + + case UDS_QUERY: + case UDS_QUERY_NO_UPDATE: + if (request->found) + count_once(&session_stats->queries_found); + else + count_once(&session_stats->queries_not_found); + break; + + default: + request->status = ASSERT(false, "unknown request type: %d", request->type); + } +} + +static void handle_callbacks(struct uds_request *request) +{ + struct uds_index_session *index_session = request->session; + + if (request->status == UDS_SUCCESS) + update_session_stats(request); + + request->status = uds_map_to_system_error(request->status); + request->callback(request); + release_index_session(index_session); +} + +static int __must_check make_empty_index_session(struct uds_index_session **index_session_ptr) +{ + int result; + struct uds_index_session *session; + + result = UDS_ALLOCATE(1, struct uds_index_session, __func__, &session); + if (result != UDS_SUCCESS) + return result; + + result = uds_init_mutex(&session->request_mutex); + if (result != UDS_SUCCESS) { + UDS_FREE(session); + return result; + } + + result = uds_init_cond(&session->request_cond); + if (result != UDS_SUCCESS) { + uds_destroy_mutex(&session->request_mutex); + UDS_FREE(session); + return result; + } + + result = uds_init_mutex(&session->load_context.mutex); + if (result != UDS_SUCCESS) { + uds_destroy_cond(&session->request_cond); + uds_destroy_mutex(&session->request_mutex); + UDS_FREE(session); + return result; + } + + result = uds_init_cond(&session->load_context.cond); + if (result != UDS_SUCCESS) { + uds_destroy_mutex(&session->load_context.mutex); + uds_destroy_cond(&session->request_cond); + uds_destroy_mutex(&session->request_mutex); + UDS_FREE(session); + return result; + } + + result = uds_make_request_queue("callbackW", &handle_callbacks, &session->callback_queue); + if (result != UDS_SUCCESS) { + uds_destroy_cond(&session->load_context.cond); + uds_destroy_mutex(&session->load_context.mutex); + uds_destroy_cond(&session->request_cond); + uds_destroy_mutex(&session->request_mutex); + UDS_FREE(session); + return result; + } + + *index_session_ptr = session; + return UDS_SUCCESS; +} + +int uds_create_index_session(struct uds_index_session **session) +{ + if (session == NULL) { + uds_log_error("missing session pointer"); + return -EINVAL; + } + + return uds_map_to_system_error(make_empty_index_session(session)); +} + +static int __must_check start_loading_index_session(struct uds_index_session *index_session) +{ + int result; + + uds_lock_mutex(&index_session->request_mutex); + if (index_session->state & IS_FLAG_SUSPENDED) { + uds_log_info("Index session is suspended"); + result = -EBUSY; + } else if (index_session->state != 0) { + uds_log_info("Index is already loaded"); + result = -EBUSY; + } else { + index_session->state |= IS_FLAG_LOADING; + result = UDS_SUCCESS; + } + uds_unlock_mutex(&index_session->request_mutex); + return result; +} + +static void finish_loading_index_session(struct uds_index_session *index_session, int result) +{ + uds_lock_mutex(&index_session->request_mutex); + index_session->state &= ~IS_FLAG_LOADING; + if (result == UDS_SUCCESS) + index_session->state |= IS_FLAG_LOADED; + + uds_broadcast_cond(&index_session->request_cond); + uds_unlock_mutex(&index_session->request_mutex); +} + +static int initialize_index_session(struct uds_index_session *index_session, + enum uds_open_index_type open_type) +{ + int result; + struct configuration *config; + + result = uds_make_configuration(&index_session->parameters, &config); + if (result != 
UDS_SUCCESS) { + uds_log_error_strerror(result, "Failed to allocate config"); + return result; + } + + memset(&index_session->stats, 0, sizeof(index_session->stats)); + result = uds_make_index(config, + open_type, + &index_session->load_context, + enter_callback_stage, + &index_session->index); + if (result != UDS_SUCCESS) + uds_log_error_strerror(result, "Failed to make index"); + else + uds_log_configuration(config); + + uds_free_configuration(config); + return result; +} + +static const char *get_open_type_string(enum uds_open_index_type open_type) +{ + switch (open_type) { + case UDS_CREATE: + return "creating index"; + case UDS_LOAD: + return "loading or rebuilding index"; + case UDS_NO_REBUILD: + return "loading index"; + default: + return "unknown open method"; + } +} + +/* + * Open an index under the given session. This operation will fail if the + * index session is suspended, or if there is already an open index. + */ +int uds_open_index(enum uds_open_index_type open_type, + const struct uds_parameters *parameters, + struct uds_index_session *session) +{ + int result; + char name[BDEVNAME_SIZE]; + + if (parameters == NULL) { + uds_log_error("missing required parameters"); + return -EINVAL; + } + if (parameters->bdev == NULL) { + uds_log_error("missing required block device"); + return -EINVAL; + } + if (session == NULL) { + uds_log_error("missing required session pointer"); + return -EINVAL; + } + + result = start_loading_index_session(session); + if (result != UDS_SUCCESS) + return uds_map_to_system_error(result); + + session->parameters = *parameters; + format_dev_t(name, parameters->bdev->bd_dev); + uds_log_notice("%s: %s", get_open_type_string(open_type), name); + + result = initialize_index_session(session, open_type); + if (result != UDS_SUCCESS) + uds_log_error_strerror(result, "Failed %s", get_open_type_string(open_type)); + + finish_loading_index_session(session, result); + return uds_map_to_system_error(result); +} + +static void wait_for_no_requests_in_progress(struct uds_index_session *index_session) +{ + uds_lock_mutex(&index_session->request_mutex); + while (index_session->request_count > 0) + uds_wait_cond(&index_session->request_cond, &index_session->request_mutex); + uds_unlock_mutex(&index_session->request_mutex); +} + +static int __must_check save_index(struct uds_index_session *index_session) +{ + wait_for_no_requests_in_progress(index_session); + return uds_save_index(index_session->index); +} + +static void suspend_rebuild(struct uds_index_session *session) +{ + uds_lock_mutex(&session->load_context.mutex); + switch (session->load_context.status) { + case INDEX_OPENING: + session->load_context.status = INDEX_SUSPENDING; + + /* Wait until the index indicates that it is not replaying. */ + while ((session->load_context.status != INDEX_SUSPENDED) && + (session->load_context.status != INDEX_READY)) + uds_wait_cond(&session->load_context.cond, + &session->load_context.mutex); + break; + + case INDEX_READY: + /* Index load does not need to be suspended. */ + break; + + case INDEX_SUSPENDED: + case INDEX_SUSPENDING: + case INDEX_FREEING: + default: + /* These cases should not happen. */ + ASSERT_LOG_ONLY(false, "Bad load context state %u", session->load_context.status); + break; + } + uds_unlock_mutex(&session->load_context.mutex); +} + +/* + * Suspend index operation, draining all current index requests and preventing new index requests + * from starting. Optionally saves all index data before returning. 
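As a usage sketch only (not from the patch), the suspend and resume operations implemented here could be used by a client to move an index session onto a different backing device. The new block_device is assumed to be acquired elsewhere, a nonzero return is treated as failure, and all other error handling is reduced to passing the result back:

#include "uds.h"

static int example_switch_storage(struct uds_index_session *session,
				  struct block_device *new_bdev)
{
	int result;

	/* Drain outstanding requests and save the index state. */
	result = uds_suspend_index_session(session, true);
	if (result != 0)
		return result;

	/* Per the comment above, resuming with a different device makes the
	 * index start using the new backing store. */
	return uds_resume_index_session(session, new_bdev);
}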
+ */ +int uds_suspend_index_session(struct uds_index_session *session, bool save) +{ + int result = UDS_SUCCESS; + bool no_work = false; + bool rebuilding = false; + + /* Wait for any current index state change to complete. */ + uds_lock_mutex(&session->request_mutex); + while (session->state & IS_FLAG_CLOSING) + uds_wait_cond(&session->request_cond, &session->request_mutex); + + if ((session->state & IS_FLAG_WAITING) || (session->state & IS_FLAG_DESTROYING)) { + no_work = true; + uds_log_info("Index session is already changing state"); + result = -EBUSY; + } else if (session->state & IS_FLAG_SUSPENDED) { + no_work = true; + } else if (session->state & IS_FLAG_LOADING) { + session->state |= IS_FLAG_WAITING; + rebuilding = true; + } else if (session->state & IS_FLAG_LOADED) { + session->state |= IS_FLAG_WAITING; + } else { + no_work = true; + session->state |= IS_FLAG_SUSPENDED; + uds_broadcast_cond(&session->request_cond); + } + uds_unlock_mutex(&session->request_mutex); + + if (no_work) + return uds_map_to_system_error(result); + + if (rebuilding) + suspend_rebuild(session); + else if (save) + result = save_index(session); + else + result = uds_flush_index_session(session); + + uds_lock_mutex(&session->request_mutex); + session->state &= ~IS_FLAG_WAITING; + session->state |= IS_FLAG_SUSPENDED; + uds_broadcast_cond(&session->request_cond); + uds_unlock_mutex(&session->request_mutex); + return uds_map_to_system_error(result); +} + +static int replace_device(struct uds_index_session *session, struct block_device *bdev) +{ + int result; + + result = uds_replace_index_storage(session->index, bdev); + if (result != UDS_SUCCESS) + return result; + + session->parameters.bdev = bdev; + return UDS_SUCCESS; +} + +/* + * Resume index operation after being suspended. If the index is suspended and the supplied block + * device differs from the current backing store, the index will start using the new backing store. + */ +int uds_resume_index_session(struct uds_index_session *session, struct block_device *bdev) +{ + int result = UDS_SUCCESS; + bool no_work = false; + bool resume_replay = false; + + uds_lock_mutex(&session->request_mutex); + if (session->state & IS_FLAG_WAITING) { + uds_log_info("Index session is already changing state"); + no_work = true; + result = -EBUSY; + } else if (!(session->state & IS_FLAG_SUSPENDED)) { + /* If not suspended, just succeed. */ + no_work = true; + result = UDS_SUCCESS; + } else { + session->state |= IS_FLAG_WAITING; + if (session->state & IS_FLAG_LOADING) + resume_replay = true; + } + uds_unlock_mutex(&session->request_mutex); + + if (no_work) + return result; + + if ((session->index != NULL) && (bdev != session->parameters.bdev)) { + result = replace_device(session, bdev); + if (result != UDS_SUCCESS) { + uds_lock_mutex(&session->request_mutex); + session->state &= ~IS_FLAG_WAITING; + uds_broadcast_cond(&session->request_cond); + uds_unlock_mutex(&session->request_mutex); + return uds_map_to_system_error(result); + } + } + + if (resume_replay) { + uds_lock_mutex(&session->load_context.mutex); + switch (session->load_context.status) { + case INDEX_SUSPENDED: + session->load_context.status = INDEX_OPENING; + /* Notify the index to start replaying again. */ + uds_broadcast_cond(&session->load_context.cond); + break; + + case INDEX_READY: + /* There is no index rebuild to resume. */ + break; + + case INDEX_OPENING: + case INDEX_SUSPENDING: + case INDEX_FREEING: + default: + /* These cases should not happen; do nothing. 
*/ + ASSERT_LOG_ONLY(false, + "Bad load context state %u", + session->load_context.status); + break; + } + uds_unlock_mutex(&session->load_context.mutex); + } + + uds_lock_mutex(&session->request_mutex); + session->state &= ~IS_FLAG_WAITING; + session->state &= ~IS_FLAG_SUSPENDED; + uds_broadcast_cond(&session->request_cond); + uds_unlock_mutex(&session->request_mutex); + return UDS_SUCCESS; +} + +static int save_and_free_index(struct uds_index_session *index_session) +{ + int result = UDS_SUCCESS; + bool suspended; + struct uds_index *index = index_session->index; + + if (index == NULL) + return UDS_SUCCESS; + + uds_lock_mutex(&index_session->request_mutex); + suspended = (index_session->state & IS_FLAG_SUSPENDED); + uds_unlock_mutex(&index_session->request_mutex); + + if (!suspended) { + result = uds_save_index(index); + if (result != UDS_SUCCESS) + uds_log_warning_strerror(result, "ignoring error from save_index"); + } + uds_free_index(index); + index_session->index = NULL; + + /* + * Reset all index state that happens to be in the index + * session, so it doesn't affect any future index. + */ + uds_lock_mutex(&index_session->load_context.mutex); + index_session->load_context.status = INDEX_OPENING; + uds_unlock_mutex(&index_session->load_context.mutex); + + uds_lock_mutex(&index_session->request_mutex); + /* Only the suspend bit will remain relevant. */ + index_session->state &= IS_FLAG_SUSPENDED; + uds_unlock_mutex(&index_session->request_mutex); + + return result; +} + +/* Save and close the current index. */ +int uds_close_index(struct uds_index_session *index_session) +{ + int result = UDS_SUCCESS; + + /* Wait for any current index state change to complete. */ + uds_lock_mutex(&index_session->request_mutex); + while ((index_session->state & IS_FLAG_WAITING) || + (index_session->state & IS_FLAG_CLOSING)) + uds_wait_cond(&index_session->request_cond, &index_session->request_mutex); + + if (index_session->state & IS_FLAG_SUSPENDED) { + uds_log_info("Index session is suspended"); + result = -EBUSY; + } else if ((index_session->state & IS_FLAG_DESTROYING) || + !(index_session->state & IS_FLAG_LOADED)) { + /* The index doesn't exist, hasn't finished loading, or is being destroyed. */ + result = UDS_NO_INDEX; + } else { + index_session->state |= IS_FLAG_CLOSING; + } + uds_unlock_mutex(&index_session->request_mutex); + if (result != UDS_SUCCESS) + return uds_map_to_system_error(result); + + uds_log_debug("Closing index"); + wait_for_no_requests_in_progress(index_session); + result = save_and_free_index(index_session); + uds_log_debug("Closed index"); + + uds_lock_mutex(&index_session->request_mutex); + index_session->state &= ~IS_FLAG_CLOSING; + uds_broadcast_cond(&index_session->request_cond); + uds_unlock_mutex(&index_session->request_mutex); + return uds_map_to_system_error(result); +} + +/* This will save and close an open index before destroying the session. */ +int uds_destroy_index_session(struct uds_index_session *index_session) +{ + int result; + bool load_pending = false; + + uds_log_debug("Destroying index session"); + + /* Wait for any current index state change to complete. 
*/ + uds_lock_mutex(&index_session->request_mutex); + while ((index_session->state & IS_FLAG_WAITING) || + (index_session->state & IS_FLAG_CLOSING)) + uds_wait_cond(&index_session->request_cond, &index_session->request_mutex); + + if (index_session->state & IS_FLAG_DESTROYING) { + uds_unlock_mutex(&index_session->request_mutex); + uds_log_info("Index session is already closing"); + return -EBUSY; + } + + index_session->state |= IS_FLAG_DESTROYING; + load_pending = ((index_session->state & IS_FLAG_LOADING) && + (index_session->state & IS_FLAG_SUSPENDED)); + uds_unlock_mutex(&index_session->request_mutex); + + if (load_pending) { + /* Tell the index to terminate the rebuild. */ + uds_lock_mutex(&index_session->load_context.mutex); + if (index_session->load_context.status == INDEX_SUSPENDED) { + index_session->load_context.status = INDEX_FREEING; + uds_broadcast_cond(&index_session->load_context.cond); + } + uds_unlock_mutex(&index_session->load_context.mutex); + + /* Wait until the load exits before proceeding. */ + uds_lock_mutex(&index_session->request_mutex); + while (index_session->state & IS_FLAG_LOADING) + uds_wait_cond(&index_session->request_cond, &index_session->request_mutex); + uds_unlock_mutex(&index_session->request_mutex); + } + + wait_for_no_requests_in_progress(index_session); + result = save_and_free_index(index_session); + uds_request_queue_finish(index_session->callback_queue); + index_session->callback_queue = NULL; + uds_destroy_cond(&index_session->load_context.cond); + uds_destroy_mutex(&index_session->load_context.mutex); + uds_destroy_cond(&index_session->request_cond); + uds_destroy_mutex(&index_session->request_mutex); + uds_log_debug("Destroyed index session"); + UDS_FREE(index_session); + return uds_map_to_system_error(result); +} + +/* Wait until all callbacks for index operations are complete. */ +int uds_flush_index_session(struct uds_index_session *index_session) +{ + wait_for_no_requests_in_progress(index_session); + uds_wait_for_idle_index(index_session->index); + return UDS_SUCCESS; +} + +/* Statistics collection is intended to be thread-safe. 
*/ +static void +collect_stats(const struct uds_index_session *index_session, struct uds_index_stats *stats) +{ + const struct session_stats *session_stats = &index_session->stats; + + stats->current_time = ktime_to_seconds(current_time_ns(CLOCK_REALTIME)); + stats->posts_found = READ_ONCE(session_stats->posts_found); + stats->in_memory_posts_found = READ_ONCE(session_stats->posts_found_open_chapter); + stats->dense_posts_found = READ_ONCE(session_stats->posts_found_dense); + stats->sparse_posts_found = READ_ONCE(session_stats->posts_found_sparse); + stats->posts_not_found = READ_ONCE(session_stats->posts_not_found); + stats->updates_found = READ_ONCE(session_stats->updates_found); + stats->updates_not_found = READ_ONCE(session_stats->updates_not_found); + stats->deletions_found = READ_ONCE(session_stats->deletions_found); + stats->deletions_not_found = READ_ONCE(session_stats->deletions_not_found); + stats->queries_found = READ_ONCE(session_stats->queries_found); + stats->queries_not_found = READ_ONCE(session_stats->queries_not_found); + stats->requests = READ_ONCE(session_stats->requests); +} + +int uds_get_index_session_stats(struct uds_index_session *index_session, + struct uds_index_stats *stats) +{ + if (stats == NULL) { + uds_log_error("received a NULL index stats pointer"); + return -EINVAL; + } + + collect_stats(index_session, stats); + if (index_session->index != NULL) { + uds_get_index_stats(index_session->index, stats); + } else { + stats->entries_indexed = 0; + stats->memory_used = 0; + stats->collisions = 0; + stats->entries_discarded = 0; + } + + return UDS_SUCCESS; +} diff --git a/drivers/md/dm-vdo/index-session.h b/drivers/md/dm-vdo/index-session.h new file mode 100644 index 000000000000..c77ee021d510 --- /dev/null +++ b/drivers/md/dm-vdo/index-session.h @@ -0,0 +1,84 @@ +/* SPDX-License-Identifier: GPL-2.0-only */ +/* + * Copyright 2023 Red Hat + */ + +#ifndef UDS_INDEX_SESSION_H +#define UDS_INDEX_SESSION_H + +#include +#include + +#include "config.h" +#include "uds-threads.h" +#include "uds.h" + +/* + * The index session mediates all interactions with a UDS index. Once the index session is created, + * it can be used to open, close, suspend, or recreate an index. It implements the majority of the + * functions in the top-level UDS API. + * + * If any deduplication request fails due to an internal error, the index is marked disabled. It + * will not accept any further requests and can only be closed. Closing the index will clear the + * disabled flag, and the index can then be reopened and recovered using the same index session. 
+ */ + +struct __aligned(L1_CACHE_BYTES) session_stats { + /* Post requests that found an entry */ + u64 posts_found; + /* Post requests found in the open chapter */ + u64 posts_found_open_chapter; + /* Post requests found in the dense index */ + u64 posts_found_dense; + /* Post requests found in the sparse index */ + u64 posts_found_sparse; + /* Post requests that did not find an entry */ + u64 posts_not_found; + /* Update requests that found an entry */ + u64 updates_found; + /* Update requests that did not find an entry */ + u64 updates_not_found; + /* Delete requests that found an entry */ + u64 deletions_found; + /* Delete requests that did not find an entry */ + u64 deletions_not_found; + /* Query requests that found an entry */ + u64 queries_found; + /* Query requests that did not find an entry */ + u64 queries_not_found; + /* Total number of requests */ + u64 requests; +}; + +enum index_suspend_status { + /* An index load has started but the index is not ready for use. */ + INDEX_OPENING = 0, + /* The index is able to handle requests. */ + INDEX_READY, + /* The index is attempting to suspend a rebuild. */ + INDEX_SUSPENDING, + /* An index rebuild has been suspended. */ + INDEX_SUSPENDED, + /* An index rebuild is being stopped in order to shut down. */ + INDEX_FREEING, +}; + +struct index_load_context { + struct mutex mutex; + struct cond_var cond; + enum index_suspend_status status; +}; + +struct uds_index_session { + unsigned int state; + struct uds_index *index; + struct uds_request_queue *callback_queue; + struct uds_parameters parameters; + struct index_load_context load_context; + struct mutex request_mutex; + struct cond_var request_cond; + int request_count; + struct session_stats stats; +}; + +#endif /* UDS_INDEX_SESSION_H */ diff --git a/drivers/md/dm-vdo/uds-sysfs.c b/drivers/md/dm-vdo/uds-sysfs.c new file mode 100644 index 000000000000..8700a028a11c --- /dev/null +++ b/drivers/md/dm-vdo/uds-sysfs.c @@ -0,0 +1,185 @@ +// SPDX-License-Identifier: GPL-2.0-only +/* + * Copyright 2023 Red Hat + */ + +#include "uds-sysfs.h" + +#include +#include +#include + +#include "logger.h" +#include "memory-alloc.h" +#include "string-utils.h" +#include "uds.h" + +#define UDS_SYSFS_NAME "uds" + +static struct { + /* /sys/uds */ + struct kobject kobj; + /* /sys/uds/parameter */ + struct kobject parameter_kobj; + + /* These flags are used to ensure a clean shutdown */ + + /* /sys/uds flag */ + bool flag; + /* /sys/uds/parameter flag */ + bool parameter_flag; +} object_root; + +static char *buffer_to_string(const char *buf, size_t length) +{ + char *string; + + if (UDS_ALLOCATE(length + 1, char, __func__, &string) != UDS_SUCCESS) + return NULL; + + memcpy(string, buf, length); + string[length] = '\0'; + if (string[length - 1] == '\n') + string[length - 1] = '\0'; + + return string; +} + +/* + * This is the code for any directory in the /sys/ tree that contains no regular files + * (only subdirectories). 
+ */ + +static void empty_release(struct kobject *kobj) +{ +} + +static ssize_t empty_show(struct kobject *kobj, struct attribute *attr, char *buf) +{ + return 0; +} + +static ssize_t +empty_store(struct kobject *kobj, struct attribute *attr, const char *buf, size_t length) +{ + return length; +} + +static const struct sysfs_ops empty_ops = { + .show = empty_show, + .store = empty_store, +}; + +static struct attribute *empty_attrs[] = { + NULL, +}; +ATTRIBUTE_GROUPS(empty); + +static const struct kobj_type empty_object_type = { + .release = empty_release, + .sysfs_ops = &empty_ops, + .default_groups = empty_groups, +}; + +/* + * This is the code for the /sys//parameter directory. + * /log_level UDS_LOG_LEVEL + */ + +struct parameter_attribute { + struct attribute attr; + const char *(*show_string)(void); + void (*store_string)(const char *string); +}; + +static ssize_t parameter_show(struct kobject *kobj, struct attribute *attr, char *buf) +{ + struct parameter_attribute *pa; + + pa = container_of(attr, struct parameter_attribute, attr); + if (pa->show_string != NULL) + return sprintf(buf, "%s\n", pa->show_string()); + else + return -EINVAL; +} + +static ssize_t +parameter_store(struct kobject *kobj, struct attribute *attr, const char *buf, size_t length) +{ + char *string; + struct parameter_attribute *pa; + + pa = container_of(attr, struct parameter_attribute, attr); + if (pa->store_string == NULL) + return -EINVAL; + string = buffer_to_string(buf, length); + if (string == NULL) + return -ENOMEM; + + pa->store_string(string); + UDS_FREE(string); + return length; +} + +static const char *parameter_show_log_level(void) +{ + return uds_log_priority_to_string(uds_get_log_level()); +} + +static void parameter_store_log_level(const char *string) +{ + uds_set_log_level(uds_log_string_to_priority(string)); +} + +static struct parameter_attribute log_level_attr = { + .attr = { .name = "log_level", .mode = 0600 }, + .show_string = parameter_show_log_level, + .store_string = parameter_store_log_level, +}; + +static struct attribute *parameter_attrs[] = { + &log_level_attr.attr, + NULL, +}; +ATTRIBUTE_GROUPS(parameter); + +static const struct sysfs_ops parameter_ops = { + .show = parameter_show, + .store = parameter_store, +}; + +static const struct kobj_type parameter_object_type = { + .release = empty_release, + .sysfs_ops = &parameter_ops, + .default_groups = parameter_groups, +}; + +int uds_init_sysfs(void) +{ + int result; + + memset(&object_root, 0, sizeof(object_root)); + kobject_init(&object_root.kobj, &empty_object_type); + result = kobject_add(&object_root.kobj, NULL, UDS_SYSFS_NAME); + if (result == 0) { + object_root.flag = true; + kobject_init(&object_root.parameter_kobj, &parameter_object_type); + result = kobject_add(&object_root.parameter_kobj, &object_root.kobj, "parameter"); + if (result == 0) + object_root.parameter_flag = true; + } + + if (result != 0) + uds_put_sysfs(); + + return result; +} + +void uds_put_sysfs(void) +{ + if (object_root.parameter_flag) + kobject_put(&object_root.parameter_kobj); + + if (object_root.flag) + kobject_put(&object_root.kobj); +} diff --git a/drivers/md/dm-vdo/uds-sysfs.h b/drivers/md/dm-vdo/uds-sysfs.h new file mode 100644 index 000000000000..c3d00a7187bd --- /dev/null +++ b/drivers/md/dm-vdo/uds-sysfs.h @@ -0,0 +1,12 @@ +/* SPDX-License-Identifier: GPL-2.0-only */ +/* + * Copyright 2023 Red Hat + */ + +#ifndef UDS_SYSFS_H +#define UDS_SYSFS_H + +int uds_init_sysfs(void); +void uds_put_sysfs(void); + +#endif /* UDS_SYSFS_H */ diff --git
a/drivers/md/dm-vdo/uds.h b/drivers/md/dm-vdo/uds.h new file mode 100644 index 000000000000..a02350165ab6 --- /dev/null +++ b/drivers/md/dm-vdo/uds.h @@ -0,0 +1,328 @@ +/* SPDX-License-Identifier: GPL-2.0-only */ +/* + * Copyright 2023 Red Hat + */ + +#ifndef UDS_H +#define UDS_H + +#include + +#include "funnel-queue.h" + +/* + * UDS public API + * + * The Universal Deduplication System (UDS) is an efficient name-value store. When used for + * deduplicating storage, the names are generally hashes of data blocks and the associated data is + * where that block is located on the underlying storage medium. The stored names are expected to + * be randomly distributed among the space of possible names. If this assumption is violated, the + * UDS index will store fewer names than normal but will otherwise continue to work. The data + * associated with each name can be any 16-byte value. + * + * A client must first create an index session to interact with an index. Once created, the session + * can be shared among multiple threads or users. When a session is destroyed, it will also close + * and save any associated index. + * + * To make a request, a client must allocate a uds_request structure and set the required fields + * before launching it. UDS will invoke the provided callback to complete the request. After the + * callback has been called, the uds_request structure can be freed or reused for a new request. + * There are five types of requests: + * + * A UDS_UPDATE request will associate the provided name with the provided data. Any previous data + * associated with that name will be discarded. + * + * A UDS_QUERY request will return the data associated with the provided name, if any. The entry + * for the name will also be marked as most recent, as if the data had been updated. + * + * A UDS_POST request is a combination of UDS_QUERY and UDS_UPDATE. If there is already data + * associated with the provided name, that data is returned. If there is no existing association, + * the name is associated with the newly provided data. This request is equivalent to a UDS_QUERY + * request followed by a UDS_UPDATE request if no data is found, but it is much more efficient. + * + * A UDS_QUERY_NO_UPDATE request will return the data associated with the provided name, but will + * not change the recency of the entry for the name. This request is primarily useful for testing, + * to determine whether an entry exists without changing the internal state of the index. + * + * A UDS_DELETE request removes any data associated with the provided name. This operation is + * generally not necessary, because the index will automatically discard its oldest entries once it + * becomes full. + */ + +/* General UDS constants and structures */ + +enum uds_request_type { + /* Create or update the mapping for a name, and make the name most recent. */ + UDS_UPDATE, + + /* Return any mapped data for a name, and make the name most recent. */ + UDS_QUERY, + + /* + * Return any mapped data for a name, or map the provided data to the name if there is no + * current data, and make the name most recent. + */ + UDS_POST, + + /* Return any mapped data for a name without updating its recency. */ + UDS_QUERY_NO_UPDATE, + + /* Remove any mapping for a name. */ + UDS_DELETE, + +}; + +enum uds_open_index_type { + /* Create a new index. */ + UDS_CREATE, + + /* Load an existing index and try to recover if necessary. */ + UDS_LOAD, + + /* Load an existing index, but only if it was saved cleanly. 
*/ + UDS_NO_REBUILD, +}; + +enum { + /* The record name size in bytes */ + UDS_RECORD_NAME_SIZE = 16, + /* The maximum record data size in bytes */ + UDS_RECORD_DATA_SIZE = 16, +}; + +/* + * A type representing a UDS memory configuration which is either a positive integer number of + * gigabytes or one of the six special constants for configurations smaller than one gigabyte. + */ +typedef int uds_memory_config_size_t; + +enum { + /* The maximum configurable amount of memory */ + UDS_MEMORY_CONFIG_MAX = 1024, + /* Flag indicating that the index has one less chapter than usual */ + UDS_MEMORY_CONFIG_REDUCED = 0x1000, + UDS_MEMORY_CONFIG_REDUCED_MAX = 1024 + UDS_MEMORY_CONFIG_REDUCED, + /* Special values indicating sizes less than 1 GB */ + UDS_MEMORY_CONFIG_256MB = -256, + UDS_MEMORY_CONFIG_512MB = -512, + UDS_MEMORY_CONFIG_768MB = -768, + UDS_MEMORY_CONFIG_REDUCED_256MB = -1280, + UDS_MEMORY_CONFIG_REDUCED_512MB = -1536, + UDS_MEMORY_CONFIG_REDUCED_768MB = -1792, +}; + +struct uds_record_name { + unsigned char name[UDS_RECORD_NAME_SIZE]; +}; + +struct uds_record_data { + unsigned char data[UDS_RECORD_DATA_SIZE]; +}; + +struct uds_volume_record { + struct uds_record_name name; + struct uds_record_data data; +}; + +struct uds_parameters { + /* The block_device used for storage */ + struct block_device *bdev; + /* The maximum allowable size of the index on storage */ + size_t size; + /* The offset where the index should start */ + off_t offset; + /* The maximum memory allocation, in GB */ + uds_memory_config_size_t memory_size; + /* Whether the index should include sparse chapters */ + bool sparse; + /* A 64-bit nonce to validate the index */ + u64 nonce; + /* The number of threads used to process index requests */ + unsigned int zone_count; + /* The number of threads used to read volume pages */ + unsigned int read_threads; +}; + +/* + * These statistics capture characteristics of the current index, including resource usage and + * requests processed since the index was opened. 
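A brief sketch of filling in the uds_parameters structure defined above; the field values are purely illustrative, the block device is assumed to be provided by the caller, and all unmentioned fields are left zero:

#include "uds.h"

static struct uds_parameters example_parameters(struct block_device *bdev)
{
	struct uds_parameters params = {
		.bdev = bdev,				/* backing storage for the index */
		.memory_size = UDS_MEMORY_CONFIG_256MB,	/* a sub-gigabyte configuration */
		.sparse = false,			/* dense index */
		.zone_count = 1,			/* illustrative thread counts */
		.read_threads = 2,
	};

	return params;
}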
+ */ +struct uds_index_stats { + /* The total number of records stored in the index */ + u64 entries_indexed; + /* An estimate of the index's memory usage, in bytes */ + u64 memory_used; + /* The number of collisions recorded in the volume index */ + u64 collisions; + /* The number of entries discarded from the index since startup */ + u64 entries_discarded; + /* The time at which these statistics were fetched */ + s64 current_time; + /* The number of post calls that found an existing entry */ + u64 posts_found; + /* The number of post calls that added an entry */ + u64 posts_not_found; + /* + * The number of post calls that found an existing entry that is current enough to only + * exist in memory and not have been committed to disk yet + */ + u64 in_memory_posts_found; + /* + * The number of post calls that found an existing entry in the dense portion of the index + */ + u64 dense_posts_found; + /* + * The number of post calls that found an existing entry in the sparse portion of the index + */ + u64 sparse_posts_found; + /* The number of update calls that updated an existing entry */ + u64 updates_found; + /* The number of update calls that added a new entry */ + u64 updates_not_found; + /* The number of delete requests that deleted an existing entry */ + u64 deletions_found; + /* The number of delete requests that did nothing */ + u64 deletions_not_found; + /* The number of query calls that found an existing entry */ + u64 queries_found; + /* The number of query calls that did not find an entry */ + u64 queries_not_found; + /* The total number of requests processed */ + u64 requests; +}; + +enum uds_index_region { + /* No location information has been determined */ + UDS_LOCATION_UNKNOWN = 0, + /* The index page entry has been found */ + UDS_LOCATION_INDEX_PAGE_LOOKUP, + /* The record page entry has been found */ + UDS_LOCATION_RECORD_PAGE_LOOKUP, + /* The record is not in the index */ + UDS_LOCATION_UNAVAILABLE, + /* The record was found in the open chapter */ + UDS_LOCATION_IN_OPEN_CHAPTER, + /* The record was found in the dense part of the index */ + UDS_LOCATION_IN_DENSE, + /* The record was found in the sparse part of the index */ + UDS_LOCATION_IN_SPARSE, +} __packed; + +/* Zone message requests are used to communicate between index zones. */ +enum uds_zone_message_type { + /* A standard request with no message */ + UDS_MESSAGE_NONE = 0, + /* Add a chapter to the sparse chapter index cache */ + UDS_MESSAGE_SPARSE_CACHE_BARRIER, + /* Close a chapter to keep the zone from falling behind */ + UDS_MESSAGE_ANNOUNCE_CHAPTER_CLOSED, +} __packed; + +struct uds_zone_message { + /* The type of message, determining how it will be processed */ + enum uds_zone_message_type type; + /* The virtual chapter number to which the message applies */ + u64 virtual_chapter; +}; + +struct uds_index_session; +struct uds_index; +struct uds_request; + +/* Once this callback has been invoked, the uds_request structure can be reused or freed. */ +typedef void uds_request_callback_t(struct uds_request *request); + +struct uds_request { + /* These input fields must be set before launching a request.
+ */ + + /* The name of the record to look up or create */ + struct uds_record_name record_name; + /* New data to associate with the record name, if applicable */ + struct uds_record_data new_metadata; + /* A callback to invoke when the request is complete */ + uds_request_callback_t *callback; + /* The index session that will manage this request */ + struct uds_index_session *session; + /* The type of operation to perform, as described above */ + enum uds_request_type type; + + /* These output fields are set when a request is complete. */ + + /* The existing data associated with the request name, if any */ + struct uds_record_data old_metadata; + /* Either UDS_SUCCESS or an error code for the request */ + int status; + /* True if the record name had an existing entry in the index */ + bool found; + + /* + * The remaining fields are used internally and should not be altered by clients. The index + * relies on zone_number being the first field in this section. + */ + + /* The number of the zone which will process this request */ + unsigned int zone_number; + /* A link for adding a request to a lock-free queue */ + struct funnel_queue_entry queue_link; + /* A link for adding a request to a standard linked list */ + struct uds_request *next_request; + /* A pointer to the index processing this request */ + struct uds_index *index; + /* Control message for coordinating between zones */ + struct uds_zone_message zone_message; + /* If true, process request immediately by waking the worker thread */ + bool unbatched; + /* If true, continue this request before processing newer requests */ + bool requeued; + /* The virtual chapter containing the record name, if known */ + u64 virtual_chapter; + /* The region of the index containing the record name */ + enum uds_index_region location; +}; + +/* Compute the number of bytes needed to store an index. */ +int __must_check uds_compute_index_size(const struct uds_parameters *parameters, u64 *index_size); + +/* A session is required for most index operations. */ +int __must_check uds_create_index_session(struct uds_index_session **session); + +/* Destroying an index session also closes and saves the associated index. */ +int uds_destroy_index_session(struct uds_index_session *session); + +/* + * Create or open an index with an existing session. This operation fails if the index session is + * suspended, or if there is already an open index. + */ +int __must_check uds_open_index(enum uds_open_index_type open_type, + const struct uds_parameters *parameters, + struct uds_index_session *session); + +/* + * Wait until all callbacks for index operations are complete, and prevent new index operations + * from starting. New index operations will fail with EBUSY until the session is resumed. Also + * optionally saves the index. + */ +int __must_check uds_suspend_index_session(struct uds_index_session *session, bool save); + +/* + * Allow new index operations for an index, whether it was suspended or not. If the index is + * suspended and the supplied block device differs from the current backing store, the index will + * start using the new backing store instead. + */ +int __must_check uds_resume_index_session(struct uds_index_session *session, + struct block_device *bdev); + +/* Wait until all outstanding index operations are complete. */ +int __must_check uds_flush_index_session(struct uds_index_session *session); + +/* Close an index. This operation fails if the index session is suspended.
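Finally, a sketch of the request flow this header describes: a caller fills in the input fields of a uds_request and launches it, and UDS reports completion through the callback. The helper names are invented, the session is assumed to be open already, a zero status is taken to mean success, and the request is heap-allocated because it must remain valid until the callback has run:

#include <linux/slab.h>

#include "uds.h"

/* Invoked once per request; afterwards the request may be reused or freed. */
static void example_callback(struct uds_request *request)
{
	if ((request->status == 0) && request->found) {
		/* request->old_metadata holds the 16 bytes already stored
		 * for this record name. */
	}

	kfree(request);
}

static int example_post(struct uds_index_session *session,
			const struct uds_record_name *name,
			const struct uds_record_data *data)
{
	struct uds_request *request;
	int result;

	request = kzalloc(sizeof(*request), GFP_KERNEL);
	if (request == NULL)
		return -ENOMEM;

	request->record_name = *name;
	request->new_metadata = *data;
	request->callback = example_callback;
	request->session = session;
	request->type = UDS_POST;

	/* Completion is asynchronous and arrives via example_callback(). */
	result = uds_launch_request(request);
	if (result != 0)
		kfree(request);	/* the callback will not run on a failed launch */

	return result;
}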
*/ +int __must_check uds_close_index(struct uds_index_session *session); + +/* Get index statistics since the last time the index was opened. */ +int __must_check +uds_get_index_session_stats(struct uds_index_session *session, struct uds_index_stats *stats); + +/* This function will fail if any required field of the request is not set. */ +int __must_check uds_launch_request(struct uds_request *request); + +#endif /* UDS_H */ From patchwork Thu Oct 26 21:41:14 2023 X-Patchwork-Submitter: Matthew Sakai X-Patchwork-Id: 13437951 From: Matthew Sakai To: dm-devel@lists.linux.dev, linux-block@vger.kernel.org Cc: Matthew Sakai Subject: [PATCH v4 17/39] dm vdo: add administrative state and action manager Date: Thu, 26 Oct 2023 17:41:14 -0400 Message-Id: <20231026214136.1067410-18-msakai@redhat.com> In-Reply-To: <20231026214136.1067410-1-msakai@redhat.com> References: <20231026214136.1067410-1-msakai@redhat.com> List-ID: X-Mailing-List: linux-block@vger.kernel.org This patch adds the admin_state
structures which are used to track the states of individual vdo components for handling of operations like suspend and resume. It also adds the action manager which is used to schedule and manage cross-thread administrative and internal operations. Co-developed-by: J. corwin Coburn Signed-off-by: J. corwin Coburn Signed-off-by: Matthew Sakai --- drivers/md/dm-vdo/action-manager.c | 410 +++++++++++++++++++++++ drivers/md/dm-vdo/action-manager.h | 117 +++++++ drivers/md/dm-vdo/admin-state.c | 512 +++++++++++++++++++++++++++++ drivers/md/dm-vdo/admin-state.h | 180 ++++++++++ 4 files changed, 1219 insertions(+) create mode 100644 drivers/md/dm-vdo/action-manager.c create mode 100644 drivers/md/dm-vdo/action-manager.h create mode 100644 drivers/md/dm-vdo/admin-state.c create mode 100644 drivers/md/dm-vdo/admin-state.h diff --git a/drivers/md/dm-vdo/action-manager.c b/drivers/md/dm-vdo/action-manager.c new file mode 100644 index 000000000000..fb3586a4140c --- /dev/null +++ b/drivers/md/dm-vdo/action-manager.c @@ -0,0 +1,410 @@ +// SPDX-License-Identifier: GPL-2.0-only +/* + * Copyright 2023 Red Hat + */ + +#include "action-manager.h" + +#include "memory-alloc.h" +#include "permassert.h" + +#include "admin-state.h" +#include "completion.h" +#include "status-codes.h" +#include "types.h" +#include "vdo.h" + +/** + * struct action - An action to be performed in each of a set of zones. + * @in_use: Whether this structure is in use. + * @operation: The admin operation associated with this action. + * @preamble: The method to run on the initiator thread before the action is applied to each zone. + * @zone_action: The action to be performed in each zone. + * @conclusion: The method to run on the initiator thread after the action is applied to each zone. + * @parent: The object to notify when the action is complete. + * @context: The action specific context. + * @next: The action to perform after this one. + */ +struct action { + bool in_use; + const struct admin_state_code *operation; + vdo_action_preamble *preamble; + vdo_zone_action *zone_action; + vdo_action_conclusion *conclusion; + struct vdo_completion *parent; + void *context; + struct action *next; +}; + +/** + * struct action_manager - Definition of an action manager. + * @completion: The completion for performing actions. + * @state: The state of this action manager. + * @actions: The two action slots. + * @current_action: The current action slot. + * @zones: The number of zones in which an action is to be applied. + * @Scheduler: A function to schedule a default next action. + * @get_zone_thread_id: A function to get the id of the thread on which to apply an action to a + * zone. + * @initiator_thread_id: The ID of the thread on which actions may be initiated. + * @context: Opaque data associated with this action manager. + * @acting_zone: The zone currently being acted upon. + */ +struct action_manager { + struct vdo_completion completion; + struct admin_state state; + struct action actions[2]; + struct action *current_action; + zone_count_t zones; + vdo_action_scheduler *scheduler; + vdo_zone_thread_getter *get_zone_thread_id; + thread_id_t initiator_thread_id; + void *context; + zone_count_t acting_zone; +}; + +static inline struct action_manager *as_action_manager(struct vdo_completion *completion) +{ + vdo_assert_completion_type(completion, VDO_ACTION_COMPLETION); + return container_of(completion, struct action_manager, completion); +} + +/* Implements vdo_action_scheduler. 
*/ +static bool no_default_action(void *context __always_unused) +{ + return false; +} + +/* Implements vdo_action_preamble. */ +static void no_preamble(void *context __always_unused, + struct vdo_completion *completion) +{ + vdo_finish_completion(completion); +} + +/* Implements vdo_action_conclusion. */ +static int no_conclusion(void *context __always_unused) +{ + return VDO_SUCCESS; +} + +/** + * vdo_make_action_manager() - Make an action manager. + * @zones: The number of zones to which actions will be applied. + * @get_zone_thread_id: A function to get the thread id associated with a zone. + * @initiator_thread_id: The thread on which actions may initiated. + * @context: The object which holds the per-zone context for the action. + * @scheduler: A function to schedule a next action after an action concludes if there is no + * pending action (may be NULL). + * @vdo: The vdo used to initialize completions. + * @manager_ptr: A pointer to hold the new action manager. + * + * Return: VDO_SUCCESS or an error code. + */ +int vdo_make_action_manager(zone_count_t zones, + vdo_zone_thread_getter *get_zone_thread_id, + thread_id_t initiator_thread_id, + void *context, + vdo_action_scheduler *scheduler, + struct vdo *vdo, + struct action_manager **manager_ptr) +{ + struct action_manager *manager; + int result = UDS_ALLOCATE(1, struct action_manager, __func__, &manager); + + if (result != VDO_SUCCESS) + return result; + + *manager = (struct action_manager) { + .zones = zones, + .scheduler = + ((scheduler == NULL) ? no_default_action : scheduler), + .get_zone_thread_id = get_zone_thread_id, + .initiator_thread_id = initiator_thread_id, + .context = context, + }; + + manager->actions[0].next = &manager->actions[1]; + manager->current_action = manager->actions[1].next = + &manager->actions[0]; + vdo_set_admin_state_code(&manager->state, + VDO_ADMIN_STATE_NORMAL_OPERATION); + vdo_initialize_completion(&manager->completion, vdo, + VDO_ACTION_COMPLETION); + *manager_ptr = manager; + return VDO_SUCCESS; +} + +const struct admin_state_code *vdo_get_current_manager_operation(struct action_manager *manager) +{ + return vdo_get_admin_state_code(&manager->state); +} + +void *vdo_get_current_action_context(struct action_manager *manager) +{ + return manager->current_action->in_use ? 
manager->current_action->context : NULL; +} + +static void finish_action_callback(struct vdo_completion *completion); +static void apply_to_zone(struct vdo_completion *completion); + +static thread_id_t get_acting_zone_thread_id(struct action_manager *manager) +{ + return manager->get_zone_thread_id(manager->context, manager->acting_zone); +} + +static void preserve_error(struct vdo_completion *completion) +{ + if (completion->parent != NULL) + vdo_set_completion_result(completion->parent, completion->result); + + vdo_reset_completion(completion); + vdo_run_completion(completion); +} + +static void prepare_for_next_zone(struct action_manager *manager) +{ + vdo_prepare_completion_for_requeue(&manager->completion, + apply_to_zone, + preserve_error, + get_acting_zone_thread_id(manager), + manager->current_action->parent); +} + +static void prepare_for_conclusion(struct action_manager *manager) +{ + vdo_prepare_completion_for_requeue(&manager->completion, + finish_action_callback, + preserve_error, + manager->initiator_thread_id, + manager->current_action->parent); +} + +static void apply_to_zone(struct vdo_completion *completion) +{ + zone_count_t zone; + struct action_manager *manager = as_action_manager(completion); + + ASSERT_LOG_ONLY((vdo_get_callback_thread_id() == get_acting_zone_thread_id(manager)), + "%s() called on acting zones's thread", + __func__); + + zone = manager->acting_zone++; + if (manager->acting_zone == manager->zones) + /* + * We are about to apply to the last zone. Once that is finished, we're done, so go + * back to the initiator thread and finish up. + */ + prepare_for_conclusion(manager); + else + /* Prepare to come back on the next zone */ + prepare_for_next_zone(manager); + + manager->current_action->zone_action(manager->context, zone, completion); +} + +static void handle_preamble_error(struct vdo_completion *completion) +{ + /* Skip the zone actions since the preamble failed. */ + completion->callback = finish_action_callback; + preserve_error(completion); +} + +static void launch_current_action(struct action_manager *manager) +{ + struct action *action = manager->current_action; + int result = vdo_start_operation(&manager->state, action->operation); + + if (result != VDO_SUCCESS) { + if (action->parent != NULL) + vdo_set_completion_result(action->parent, result); + + /* We aren't going to run the preamble, so don't run the conclusion */ + action->conclusion = no_conclusion; + finish_action_callback(&manager->completion); + return; + } + + if (action->zone_action == NULL) { + prepare_for_conclusion(manager); + } else { + manager->acting_zone = 0; + vdo_prepare_completion_for_requeue(&manager->completion, + apply_to_zone, + handle_preamble_error, + get_acting_zone_thread_id(manager), + manager->current_action->parent); + } + + action->preamble(manager->context, &manager->completion); +} + +/** + * vdo_schedule_default_action() - Attempt to schedule the default action. + * @manager: The action manager. + * + * If the manager is not operating normally, the action will not be scheduled. + * + * Return: true if an action was scheduled. + */ +bool vdo_schedule_default_action(struct action_manager *manager) +{ + /* Don't schedule a default action if we are operating or not in normal operation. 
*/ + const struct admin_state_code *code = vdo_get_current_manager_operation(manager); + + return ((code == VDO_ADMIN_STATE_NORMAL_OPERATION) && + manager->scheduler(manager->context)); +} + +static void finish_action_callback(struct vdo_completion *completion) +{ + bool has_next_action; + int result; + struct action_manager *manager = as_action_manager(completion); + struct action action = *(manager->current_action); + + manager->current_action->in_use = false; + manager->current_action = manager->current_action->next; + + /* + * We need to check this now to avoid use-after-free issues if running the conclusion or + * notifying the parent results in the manager being freed. + */ + has_next_action = + (manager->current_action->in_use || vdo_schedule_default_action(manager)); + result = action.conclusion(manager->context); + vdo_finish_operation(&manager->state, VDO_SUCCESS); + if (action.parent != NULL) + vdo_continue_completion(action.parent, result); + + if (has_next_action) + launch_current_action(manager); +} + +/** + * vdo_schedule_action() - Schedule an action to be applied to all zones. + * @manager: The action manager to schedule the action on. + * @preamble: A method to be invoked on the initiator thread once this action is started but before + * applying to each zone; may be NULL. + * @action: The action to apply to each zone; may be NULL. + * @conclusion: A method to be invoked back on the initiator thread once the action has been + * applied to all zones; may be NULL. + * @parent: The object to notify once the action is complete or if the action can not be scheduled; + * may be NULL. + * + * The action will be launched immediately if there is no current action, or as soon as the current + * action completes. If there is already a pending action, this action will not be scheduled, and, + * if it has a parent, that parent will be notified. At least one of the preamble, action, or + * conclusion must not be NULL. + * + * Return: true if the action was scheduled. + */ +bool vdo_schedule_action(struct action_manager *manager, + vdo_action_preamble *preamble, + vdo_zone_action *action, + vdo_action_conclusion *conclusion, + struct vdo_completion *parent) +{ + return vdo_schedule_operation(manager, + VDO_ADMIN_STATE_OPERATING, + preamble, + action, + conclusion, + parent); +} + +/** + * vdo_schedule_operation() - Schedule an operation to be applied to all zones. + * @manager: The action manager to schedule the action on. + * @operation: The operation this action will perform + * @preamble: A method to be invoked on the initiator thread once this action is started but before + * applying to each zone; may be NULL. + * @action: The action to apply to each zone; may be NULL. + * @conclusion: A method to be invoked back on the initiator thread once the action has been + * applied to all zones; may be NULL. + * @parent: The object to notify once the action is complete or if the action can not be scheduled; + * may be NULL. + * + * The operation's action will be launched immediately if there is no current action, or as soon as + * the current action completes. If there is already a pending action, this operation will not be + * scheduled, and, if it has a parent, that parent will be notified. At least one of the preamble, + * action, or conclusion must not be NULL. + * + * Return: true if the action was scheduled. 
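 *
 * For illustration only: a component with per-zone state might be drained by
 * a call such as the one below, where drain_zone (a vdo_zone_action) and
 * finish_drain (a vdo_action_conclusion) are hypothetical helpers. If the
 * manager already has a pending action, the call returns false and parent is
 * notified with VDO_COMPONENT_BUSY.
 *
 *      if (!vdo_schedule_operation(manager, VDO_ADMIN_STATE_SUSPENDING,
 *                                  NULL, drain_zone, finish_drain, parent))
 *              return;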
+ */ +bool vdo_schedule_operation(struct action_manager *manager, + const struct admin_state_code *operation, + vdo_action_preamble *preamble, + vdo_zone_action *action, + vdo_action_conclusion *conclusion, + struct vdo_completion *parent) +{ + return vdo_schedule_operation_with_context(manager, + operation, + preamble, + action, + conclusion, + NULL, + parent); +} + +/** + * vdo_schedule_operation_with_context() - Schedule an operation on all zones. + * @manager: The action manager to schedule the action on. + * @operation: The operation this action will perform. + * @preamble: A method to be invoked on the initiator thread once this action is started but before + * applying to each zone; may be NULL. + * @action: The action to apply to each zone; may be NULL. + * @conclusion: A method to be invoked back on the initiator thread once the action has been + * applied to all zones; may be NULL. + * @context: An action-specific context which may be retrieved via + * vdo_get_current_action_context(); may be NULL. + * @parent: The object to notify once the action is complete or if the action can not be scheduled; + * may be NULL. + * + * The operation's action will be launched immediately if there is no current action, or as soon as + * the current action completes. If there is already a pending action, this operation will not be + * scheduled, and, if it has a parent, that parent will be notified. At least one of the preamble, + * action, or conclusion must not be NULL. + * + * Return: true if the action was scheduled + */ +bool vdo_schedule_operation_with_context(struct action_manager *manager, + const struct admin_state_code *operation, + vdo_action_preamble *preamble, + vdo_zone_action *action, + vdo_action_conclusion *conclusion, + void *context, + struct vdo_completion *parent) +{ + struct action *current_action; + + ASSERT_LOG_ONLY((vdo_get_callback_thread_id() == manager->initiator_thread_id), + "action initiated from correct thread"); + if (!manager->current_action->in_use) { + current_action = manager->current_action; + } else if (!manager->current_action->next->in_use) { + current_action = manager->current_action->next; + } else { + if (parent != NULL) + vdo_continue_completion(parent, VDO_COMPONENT_BUSY); + + return false; + } + + *current_action = (struct action) { + .in_use = true, + .operation = operation, + .preamble = (preamble == NULL) ? no_preamble : preamble, + .zone_action = action, + .conclusion = (conclusion == NULL) ? no_conclusion : conclusion, + .context = context, + .parent = parent, + .next = current_action->next, + }; + + if (current_action == manager->current_action) + launch_current_action(manager); + + return true; +} diff --git a/drivers/md/dm-vdo/action-manager.h b/drivers/md/dm-vdo/action-manager.h new file mode 100644 index 000000000000..b97bb2eec145 --- /dev/null +++ b/drivers/md/dm-vdo/action-manager.h @@ -0,0 +1,117 @@ +/* SPDX-License-Identifier: GPL-2.0-only */ +/* + * Copyright 2023 Red Hat + */ + +#ifndef VDO_ACTION_MANAGER_H +#define VDO_ACTION_MANAGER_H + +#include "admin-state.h" +#include "types.h" + +/* + * An action_manager provides a generic mechanism for applying actions to multi-zone entities (such + * as the block map or slab depot). Each action manager is tied to a specific context for which it + * manages actions. The manager ensures that only one action is active on that context at a time, + * and supports at most one pending action. 
Calls to schedule an action when there is already a + * pending action will result in VDO_COMPONENT_BUSY errors. Actions may only be submitted to the + * action manager from a single thread (which thread is determined when the action manager is + * constructed). + * + * A scheduled action consists of four components: + * + * preamble + * an optional method to be run on the initiator thread before applying the action to all zones + * zone_action + * an optional method to be applied to each of the zones + * conclusion + * an optional method to be run on the initiator thread once the per-zone method has been + * applied to all zones + * parent + * an optional completion to be finished once the conclusion is done + * + * At least one of the three methods must be provided. + */ + +/* + * A function which is to be applied asynchronously to a set of zones. + * @context: The object which holds the per-zone context for the action + * @zone_number: The number of zone to which the action is being applied + * @parent: The object to notify when the action is complete + */ +typedef void vdo_zone_action(void *context, + zone_count_t zone_number, + struct vdo_completion *parent); + +/* + * A function which is to be applied asynchronously on an action manager's initiator thread as the + * preamble of an action. + * @context: The object which holds the per-zone context for the action + * @parent: The object to notify when the action is complete + */ +typedef void vdo_action_preamble(void *context, struct vdo_completion *parent); + +/* + * A function which will run on the action manager's initiator thread as the conclusion of an + * action. + * @context: The object which holds the per-zone context for the action + * + * Return: VDO_SUCCESS or an error + */ +typedef int vdo_action_conclusion(void *context); + +/* + * A function to schedule an action. + * @context: The object which holds the per-zone context for the action + * + * Return: true if an action was scheduled + */ +typedef bool vdo_action_scheduler(void *context); + +/* + * A function to get the id of the thread associated with a given zone. 
+ * @context: The action context + * @zone_number: The number of the zone for which the thread ID is desired + */ +typedef thread_id_t vdo_zone_thread_getter(void *context, zone_count_t zone_number); + +struct action_manager; + +int __must_check +vdo_make_action_manager(zone_count_t zones, + vdo_zone_thread_getter *get_zone_thread_id, + thread_id_t initiator_thread_id, + void *context, + vdo_action_scheduler *scheduler, + struct vdo *vdo, + struct action_manager **manager_ptr); + +const struct admin_state_code *__must_check +vdo_get_current_manager_operation(struct action_manager *manager); + +void * __must_check vdo_get_current_action_context(struct action_manager *manager); + +bool vdo_schedule_default_action(struct action_manager *manager); + +bool vdo_schedule_action(struct action_manager *manager, + vdo_action_preamble *preamble, + vdo_zone_action *action, + vdo_action_conclusion *conclusion, + struct vdo_completion *parent); + +bool vdo_schedule_operation(struct action_manager *manager, + const struct admin_state_code *operation, + vdo_action_preamble *preamble, + vdo_zone_action *action, + vdo_action_conclusion *conclusion, + struct vdo_completion *parent); + +bool vdo_schedule_operation_with_context(struct action_manager *manager, + const struct admin_state_code *operation, + vdo_action_preamble *preamble, + vdo_zone_action *action, + vdo_action_conclusion *conclusion, + void *context, + struct vdo_completion *parent); + +#endif /* VDO_ACTION_MANAGER_H */ diff --git a/drivers/md/dm-vdo/admin-state.c b/drivers/md/dm-vdo/admin-state.c new file mode 100644 index 000000000000..e2ba354508b2 --- /dev/null +++ b/drivers/md/dm-vdo/admin-state.c @@ -0,0 +1,512 @@ +// SPDX-License-Identifier: GPL-2.0-only +/* + * Copyright 2023 Red Hat + */ + +#include "admin-state.h" + +#include "logger.h" +#include "memory-alloc.h" +#include "permassert.h" + +#include "completion.h" +#include "types.h" + +static const struct admin_state_code VDO_CODE_NORMAL_OPERATION = { + .name = "VDO_ADMIN_STATE_NORMAL_OPERATION", + .normal = true, +}; +const struct admin_state_code *VDO_ADMIN_STATE_NORMAL_OPERATION = &VDO_CODE_NORMAL_OPERATION; +static const struct admin_state_code VDO_CODE_OPERATING = { + .name = "VDO_ADMIN_STATE_OPERATING", + .normal = true, + .operating = true, +}; +const struct admin_state_code *VDO_ADMIN_STATE_OPERATING = &VDO_CODE_OPERATING; +static const struct admin_state_code VDO_CODE_FORMATTING = { + .name = "VDO_ADMIN_STATE_FORMATTING", + .operating = true, + .loading = true, +}; +const struct admin_state_code *VDO_ADMIN_STATE_FORMATTING = &VDO_CODE_FORMATTING; +static const struct admin_state_code VDO_CODE_PRE_LOADING = { + .name = "VDO_ADMIN_STATE_PRE_LOADING", + .operating = true, + .loading = true, +}; +const struct admin_state_code *VDO_ADMIN_STATE_PRE_LOADING = &VDO_CODE_PRE_LOADING; +static const struct admin_state_code VDO_CODE_PRE_LOADED = { + .name = "VDO_ADMIN_STATE_PRE_LOADED", +}; +const struct admin_state_code *VDO_ADMIN_STATE_PRE_LOADED = &VDO_CODE_PRE_LOADED; +static const struct admin_state_code VDO_CODE_LOADING = { + .name = "VDO_ADMIN_STATE_LOADING", + .normal = true, + .operating = true, + .loading = true, +}; +const struct admin_state_code *VDO_ADMIN_STATE_LOADING = &VDO_CODE_LOADING; +static const struct admin_state_code VDO_CODE_LOADING_FOR_RECOVERY = { + .name = "VDO_ADMIN_STATE_LOADING_FOR_RECOVERY", + .operating = true, + .loading = true, +}; +const struct admin_state_code *VDO_ADMIN_STATE_LOADING_FOR_RECOVERY = + &VDO_CODE_LOADING_FOR_RECOVERY; +static const struct 
admin_state_code VDO_CODE_LOADING_FOR_REBUILD = { + .name = "VDO_ADMIN_STATE_LOADING_FOR_REBUILD", + .operating = true, + .loading = true, +}; +const struct admin_state_code *VDO_ADMIN_STATE_LOADING_FOR_REBUILD = &VDO_CODE_LOADING_FOR_REBUILD; +static const struct admin_state_code VDO_CODE_WAITING_FOR_RECOVERY = { + .name = "VDO_ADMIN_STATE_WAITING_FOR_RECOVERY", + .operating = true, +}; +const struct admin_state_code *VDO_ADMIN_STATE_WAITING_FOR_RECOVERY = + &VDO_CODE_WAITING_FOR_RECOVERY; +static const struct admin_state_code VDO_CODE_NEW = { + .name = "VDO_ADMIN_STATE_NEW", + .quiescent = true, +}; +const struct admin_state_code *VDO_ADMIN_STATE_NEW = &VDO_CODE_NEW; +static const struct admin_state_code VDO_CODE_INITIALIZED = { + .name = "VDO_ADMIN_STATE_INITIALIZED", +}; +const struct admin_state_code *VDO_ADMIN_STATE_INITIALIZED = &VDO_CODE_INITIALIZED; +static const struct admin_state_code VDO_CODE_RECOVERING = { + .name = "VDO_ADMIN_STATE_RECOVERING", + .draining = true, + .operating = true, +}; +const struct admin_state_code *VDO_ADMIN_STATE_RECOVERING = &VDO_CODE_RECOVERING; +static const struct admin_state_code VDO_CODE_REBUILDING = { + .name = "VDO_ADMIN_STATE_REBUILDING", + .draining = true, + .operating = true, +}; +const struct admin_state_code *VDO_ADMIN_STATE_REBUILDING = &VDO_CODE_REBUILDING; +static const struct admin_state_code VDO_CODE_SAVING = { + .name = "VDO_ADMIN_STATE_SAVING", + .draining = true, + .quiescing = true, + .operating = true, +}; +const struct admin_state_code *VDO_ADMIN_STATE_SAVING = &VDO_CODE_SAVING; +static const struct admin_state_code VDO_CODE_SAVED = { + .name = "VDO_ADMIN_STATE_SAVED", + .quiescent = true, +}; +const struct admin_state_code *VDO_ADMIN_STATE_SAVED = &VDO_CODE_SAVED; +static const struct admin_state_code VDO_CODE_SCRUBBING = { + .name = "VDO_ADMIN_STATE_SCRUBBING", + .draining = true, + .loading = true, + .operating = true, +}; +const struct admin_state_code *VDO_ADMIN_STATE_SCRUBBING = &VDO_CODE_SCRUBBING; +static const struct admin_state_code VDO_CODE_SAVE_FOR_SCRUBBING = { + .name = "VDO_ADMIN_STATE_SAVE_FOR_SCRUBBING", + .draining = true, + .operating = true, +}; +const struct admin_state_code *VDO_ADMIN_STATE_SAVE_FOR_SCRUBBING = &VDO_CODE_SAVE_FOR_SCRUBBING; +static const struct admin_state_code VDO_CODE_STOPPING = { + .name = "VDO_ADMIN_STATE_STOPPING", + .draining = true, + .quiescing = true, + .operating = true, +}; +const struct admin_state_code *VDO_ADMIN_STATE_STOPPING = &VDO_CODE_STOPPING; +static const struct admin_state_code VDO_CODE_STOPPED = { + .name = "VDO_ADMIN_STATE_STOPPED", + .quiescent = true, +}; +const struct admin_state_code *VDO_ADMIN_STATE_STOPPED = &VDO_CODE_STOPPED; +static const struct admin_state_code VDO_CODE_SUSPENDING = { + .name = "VDO_ADMIN_STATE_SUSPENDING", + .draining = true, + .quiescing = true, + .operating = true, +}; +const struct admin_state_code *VDO_ADMIN_STATE_SUSPENDING = &VDO_CODE_SUSPENDING; +static const struct admin_state_code VDO_CODE_SUSPENDED = { + .name = "VDO_ADMIN_STATE_SUSPENDED", + .quiescent = true, +}; +const struct admin_state_code *VDO_ADMIN_STATE_SUSPENDED = &VDO_CODE_SUSPENDED; +static const struct admin_state_code VDO_CODE_SUSPENDED_OPERATION = { + .name = "VDO_ADMIN_STATE_SUSPENDED_OPERATION", + .operating = true, +}; +const struct admin_state_code *VDO_ADMIN_STATE_SUSPENDED_OPERATION = &VDO_CODE_SUSPENDED_OPERATION; +static const struct admin_state_code VDO_CODE_RESUMING = { + .name = "VDO_ADMIN_STATE_RESUMING", + .operating = true, +}; +const struct 
admin_state_code *VDO_ADMIN_STATE_RESUMING = &VDO_CODE_RESUMING; + +/** + * get_next_state() - Determine the state which should be set after a given operation completes + * based on the operation and the current state. + * @operation The operation to be started. + * + * Return: The state to set when the operation completes or NULL if the operation can not be + * started in the current state. + */ +static const struct admin_state_code * +get_next_state(const struct admin_state *state, const struct admin_state_code *operation) +{ + const struct admin_state_code *code = vdo_get_admin_state_code(state); + + if (code->operating) + return NULL; + + if (operation == VDO_ADMIN_STATE_SAVING) + return (code == VDO_ADMIN_STATE_NORMAL_OPERATION ? VDO_ADMIN_STATE_SAVED : NULL); + + if (operation == VDO_ADMIN_STATE_SUSPENDING) + return (code == VDO_ADMIN_STATE_NORMAL_OPERATION + ? VDO_ADMIN_STATE_SUSPENDED + : NULL); + + if (operation == VDO_ADMIN_STATE_STOPPING) + return (code == VDO_ADMIN_STATE_NORMAL_OPERATION ? VDO_ADMIN_STATE_STOPPED : NULL); + + if (operation == VDO_ADMIN_STATE_PRE_LOADING) + return (code == VDO_ADMIN_STATE_INITIALIZED ? VDO_ADMIN_STATE_PRE_LOADED : NULL); + + if (operation == VDO_ADMIN_STATE_SUSPENDED_OPERATION) + return (((code == VDO_ADMIN_STATE_SUSPENDED) || + (code == VDO_ADMIN_STATE_SAVED)) ? code : NULL); + + return VDO_ADMIN_STATE_NORMAL_OPERATION; +} + +/** + * vdo_finish_operation() - Finish the current operation. + * + * Will notify the operation waiter if there is one. This method should be used for operations + * started with vdo_start_operation(). For operations which were started with vdo_start_draining(), + * use vdo_finish_draining() instead. + * + * Return: true if there was an operation to finish. + */ +bool vdo_finish_operation(struct admin_state *state, int result) +{ + if (!vdo_get_admin_state_code(state)->operating) + return false; + + state->complete = state->starting; + if (state->waiter != NULL) + vdo_set_completion_result(state->waiter, result); + + if (!state->starting) { + vdo_set_admin_state_code(state, state->next_state); + if (state->waiter != NULL) + vdo_launch_completion(UDS_FORGET(state->waiter)); + } + + return true; +} + +/** + * begin_operation() - Begin an operation if it may be started given the current state. + * @waiter A completion to notify when the operation is complete; may be NULL. + * @initiator The vdo_admin_initiator to call if the operation may begin; may be NULL. + * + * Return: VDO_SUCCESS or an error. 
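 *
 * Callers outside this file use the vdo_start_*() and vdo_finish_*()
 * wrappers below. For example, the action manager pairs the two halves
 * roughly as:
 *
 *      result = vdo_start_operation(&manager->state, action->operation);
 *      if (result == VDO_SUCCESS) {
 *              ... apply the action in each zone ...
 *              vdo_finish_operation(&manager->state, VDO_SUCCESS);
 *      }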
+ */ +static int __must_check begin_operation(struct admin_state *state, + const struct admin_state_code *operation, + struct vdo_completion *waiter, + vdo_admin_initiator *initiator) +{ + int result; + const struct admin_state_code *next_state = get_next_state(state, operation); + + if (next_state == NULL) { + result = uds_log_error_strerror(VDO_INVALID_ADMIN_STATE, + "Can't start %s from %s", + operation->name, + vdo_get_admin_state_code(state)->name); + } else if (state->waiter != NULL) { + result = uds_log_error_strerror(VDO_COMPONENT_BUSY, + "Can't start %s with extant waiter", + operation->name); + } else { + state->waiter = waiter; + state->next_state = next_state; + vdo_set_admin_state_code(state, operation); + if (initiator != NULL) { + state->starting = true; + initiator(state); + state->starting = false; + if (state->complete) + vdo_finish_operation(state, VDO_SUCCESS); + } + + return VDO_SUCCESS; + } + + if (waiter != NULL) + vdo_continue_completion(waiter, result); + + return result; +} + +/** + * start_operation() - Start an operation if it may be started given the current state. + * @waiter A completion to notify when the operation is complete. + * @initiator The vdo_admin_initiator to call if the operation may begin; may be NULL. + * + * Return: true if the operation was started. + */ +static inline bool __must_check start_operation(struct admin_state *state, + const struct admin_state_code *operation, + struct vdo_completion *waiter, + vdo_admin_initiator *initiator) +{ + return (begin_operation(state, operation, waiter, initiator) == VDO_SUCCESS); +} + +/** + * check_code() - Check the result of a state validation. + * @valid true if the code is of an appropriate type. + * @code The code which failed to be of the correct type. + * @what What the code failed to be, for logging. + * @waiter The completion to notify of the error; may be NULL. + * + * If the result failed, log an invalid state error and, if there is a waiter, notify it. + * + * Return: The result of the check. + */ +static bool check_code(bool valid, + const struct admin_state_code *code, + const char *what, + struct vdo_completion *waiter) +{ + int result; + + if (valid) + return true; + + result = uds_log_error_strerror(VDO_INVALID_ADMIN_STATE, + "%s is not a %s", code->name, what); + if (waiter != NULL) + vdo_continue_completion(waiter, result); + + return false; +} + +/** + * vdo_drain_operation() - Check that an operation is a drain. + * @waiter The completion to finish with an error if the operation is not a drain. + * + * Return: true if the specified operation is a drain. + */ +static bool __must_check +assert_vdo_drain_operation(const struct admin_state_code *operation, struct vdo_completion *waiter) +{ + return check_code(operation->draining, operation, "drain operation", waiter); +} + +/** + * vdo_start_draining() - Initiate a drain operation if the current state permits it. + * @operation The type of drain to initiate. + * @waiter The completion to notify when the drain is complete. + * @initiator The vdo_admin_initiator to call if the operation may begin; may be NULL. + * + * Return: true if the drain was initiated, if not the waiter will be notified. 
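 *
 * A typical pattern (sketch only; component and initiate_drain, a
 * hypothetical vdo_admin_initiator, are assumptions) starts the drain here,
 * stops issuing new work, and reports completion later from the I/O path
 * once the component is quiet:
 *
 *      vdo_start_draining(&component->state, VDO_ADMIN_STATE_SUSPENDING,
 *                         completion, initiate_drain);
 *      ...
 *      vdo_finish_draining(&component->state);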
+ */ +bool vdo_start_draining(struct admin_state *state, + const struct admin_state_code *operation, + struct vdo_completion *waiter, + vdo_admin_initiator *initiator) +{ + const struct admin_state_code *code = vdo_get_admin_state_code(state); + + if (!assert_vdo_drain_operation(operation, waiter)) + return false; + + if (code->quiescent) { + vdo_launch_completion(waiter); + return false; + } + + if (!code->normal) { + uds_log_error_strerror(VDO_INVALID_ADMIN_STATE, + "can't start %s from %s", + operation->name, + code->name); + vdo_continue_completion(waiter, VDO_INVALID_ADMIN_STATE); + return false; + } + + return start_operation(state, operation, waiter, initiator); +} + +/** + * vdo_finish_draining() - Finish a drain operation if one was in progress. + * + * Return: true if the state was draining; will notify the waiter if so. + */ +bool vdo_finish_draining(struct admin_state *state) +{ + return vdo_finish_draining_with_result(state, VDO_SUCCESS); +} + +/** + * vdo_finish_draining_with_result() - Finish a drain operation with a status code. + * + * Return: true if the state was draining; will notify the waiter if so. + */ +bool vdo_finish_draining_with_result(struct admin_state *state, int result) +{ + return (vdo_is_state_draining(state) && vdo_finish_operation(state, result)); +} + +/** + * vdo_assert_load_operation() - Check that an operation is a load. + * @waiter The completion to finish with an error if the operation is not a load. + * + * Return: true if the specified operation is a load. + */ +bool vdo_assert_load_operation(const struct admin_state_code *operation, + struct vdo_completion *waiter) +{ + return check_code(operation->loading, operation, "load operation", waiter); +} + +/** + * vdo_start_loading() - Initiate a load operation if the current state permits it. + * @operation The type of load to initiate. + * @waiter The completion to notify when the load is complete (may be NULL). + * @initiator The vdo_admin_initiator to call if the operation may begin; may be NULL. + * + * Return: true if the load was initiated, if not the waiter will be notified. + */ +bool vdo_start_loading(struct admin_state *state, + const struct admin_state_code *operation, + struct vdo_completion *waiter, + vdo_admin_initiator *initiator) +{ + return (vdo_assert_load_operation(operation, waiter) && + start_operation(state, operation, waiter, initiator)); +} + +/** + * vdo_finish_loading() - Finish a load operation if one was in progress. + * + * Return: true if the state was loading; will notify the waiter if so. + */ +bool vdo_finish_loading(struct admin_state *state) +{ + return vdo_finish_loading_with_result(state, VDO_SUCCESS); +} + +/** + * vdo_finish_loading_with_result() - Finish a load operation with a status code. + * @result The result of the load operation. + * + * Return: true if the state was loading; will notify the waiter if so. + */ +bool vdo_finish_loading_with_result(struct admin_state *state, int result) +{ + return (vdo_is_state_loading(state) && vdo_finish_operation(state, result)); +} + +/** + * assert_vdo_resume_operation() - Check whether an admin_state_code is a resume operation. + * @waiter The completion to notify if the operation is not a resume operation; may be NULL. + * + * Return: true if the code is a resume operation. 
+ */ +static bool __must_check assert_vdo_resume_operation(const struct admin_state_code *operation, + struct vdo_completion *waiter) +{ + return check_code(operation == VDO_ADMIN_STATE_RESUMING, + operation, + "resume operation", + waiter); +} + +/** + * vdo_start_resuming() - Initiate a resume operation if the current state permits it. + * @operation The type of resume to start. + * @waiter The completion to notify when the resume is complete (may be NULL). + * @initiator The vdo_admin_initiator to call if the operation may begin; may be NULL. + * + * Return: true if the resume was initiated, if not the waiter will be notified. + */ +bool vdo_start_resuming(struct admin_state *state, + const struct admin_state_code *operation, + struct vdo_completion *waiter, + vdo_admin_initiator *initiator) +{ + return (assert_vdo_resume_operation(operation, waiter) && + start_operation(state, operation, waiter, initiator)); +} + +/** + * vdo_finish_resuming() - Finish a resume operation if one was in progress. + * + * Return: true if the state was resuming; will notify the waiter if so. + */ +bool vdo_finish_resuming(struct admin_state *state) +{ + return vdo_finish_resuming_with_result(state, VDO_SUCCESS); +} + +/** + * vdo_finish_resuming_with_result() - Finish a resume operation with a status code. + * @result The result of the resume operation. + * + * Return: true if the state was resuming; will notify the waiter if so. + */ +bool vdo_finish_resuming_with_result(struct admin_state *state, int result) +{ + return (vdo_is_state_resuming(state) && vdo_finish_operation(state, result)); +} + +/** + * vdo_resume_if_quiescent() - Change the state to normal operation if the current state is + * quiescent. + * + * Return: VDO_SUCCESS if the state resumed, VDO_INVALID_ADMIN_STATE otherwise. + */ +int vdo_resume_if_quiescent(struct admin_state *state) +{ + if (!vdo_is_state_quiescent(state)) + return VDO_INVALID_ADMIN_STATE; + + vdo_set_admin_state_code(state, VDO_ADMIN_STATE_NORMAL_OPERATION); + return VDO_SUCCESS; +} + +/** + * vdo_start_operation() - Attempt to start an operation. + * + * Return: VDO_SUCCESS if the operation was started, VDO_INVALID_ADMIN_STATE if not + */ +int vdo_start_operation(struct admin_state *state, const struct admin_state_code *operation) +{ + return vdo_start_operation_with_waiter(state, operation, NULL, NULL); +} + +/** + * vdo_start_operation_with_waiter() - Attempt to start an operation. + * @waiter the completion to notify when the operation completes or fails to start; may be NULL. + * @initiator The vdo_admin_initiator to call if the operation may begin; may be NULL. + * + * Return: VDO_SUCCESS if the operation was started, VDO_INVALID_ADMIN_STATE if not + */ +int vdo_start_operation_with_waiter(struct admin_state *state, + const struct admin_state_code *operation, + struct vdo_completion *waiter, + vdo_admin_initiator *initiator) +{ + return (check_code(operation->operating, operation, "operation", waiter) ? 
+ begin_operation(state, operation, waiter, initiator) : + VDO_INVALID_ADMIN_STATE); +} diff --git a/drivers/md/dm-vdo/admin-state.h b/drivers/md/dm-vdo/admin-state.h new file mode 100644 index 000000000000..550e9084662d --- /dev/null +++ b/drivers/md/dm-vdo/admin-state.h @@ -0,0 +1,180 @@ +/* SPDX-License-Identifier: GPL-2.0-only */ +/* + * Copyright 2023 Red Hat + */ + +#ifndef VDO_ADMIN_STATE_H +#define VDO_ADMIN_STATE_H + +#include "completion.h" +#include "types.h" + +struct admin_state_code { + const char *name; + /* Normal operation, data_vios may be active */ + bool normal; + /* I/O is draining, new requests should not start */ + bool draining; + /* This is a startup time operation */ + bool loading; + /* The next state will be quiescent */ + bool quiescing; + /* The VDO is quiescent, there should be no I/O */ + bool quiescent; + /* Whether an operation is in progress and so no other operation may be started */ + bool operating; +}; + +extern const struct admin_state_code *VDO_ADMIN_STATE_NORMAL_OPERATION; +extern const struct admin_state_code *VDO_ADMIN_STATE_OPERATING; +extern const struct admin_state_code *VDO_ADMIN_STATE_FORMATTING; +extern const struct admin_state_code *VDO_ADMIN_STATE_PRE_LOADING; +extern const struct admin_state_code *VDO_ADMIN_STATE_PRE_LOADED; +extern const struct admin_state_code *VDO_ADMIN_STATE_LOADING; +extern const struct admin_state_code *VDO_ADMIN_STATE_LOADING_FOR_RECOVERY; +extern const struct admin_state_code *VDO_ADMIN_STATE_LOADING_FOR_REBUILD; +extern const struct admin_state_code *VDO_ADMIN_STATE_WAITING_FOR_RECOVERY; +extern const struct admin_state_code *VDO_ADMIN_STATE_NEW; +extern const struct admin_state_code *VDO_ADMIN_STATE_INITIALIZED; +extern const struct admin_state_code *VDO_ADMIN_STATE_RECOVERING; +extern const struct admin_state_code *VDO_ADMIN_STATE_REBUILDING; +extern const struct admin_state_code *VDO_ADMIN_STATE_SAVING; +extern const struct admin_state_code *VDO_ADMIN_STATE_SAVED; +extern const struct admin_state_code *VDO_ADMIN_STATE_SCRUBBING; +extern const struct admin_state_code *VDO_ADMIN_STATE_SAVE_FOR_SCRUBBING; +extern const struct admin_state_code *VDO_ADMIN_STATE_STOPPING; +extern const struct admin_state_code *VDO_ADMIN_STATE_STOPPED; +extern const struct admin_state_code *VDO_ADMIN_STATE_SUSPENDING; +extern const struct admin_state_code *VDO_ADMIN_STATE_SUSPENDED; +extern const struct admin_state_code *VDO_ADMIN_STATE_SUSPENDED_OPERATION; +extern const struct admin_state_code *VDO_ADMIN_STATE_RESUMING; + +struct admin_state { + const struct admin_state_code *current_state; + /* The next administrative state (when the current operation finishes) */ + const struct admin_state_code *next_state; + /* A completion waiting on a state change */ + struct vdo_completion *waiter; + /* Whether an operation is being initiated */ + bool starting; + /* Whether an operation has completed in the initiator */ + bool complete; +}; + +/** + * typedef vdo_admin_initiator - A method to be called once an admin operation may be initiated. + */ +typedef void vdo_admin_initiator(struct admin_state *state); + +static inline const struct admin_state_code * __must_check +vdo_get_admin_state_code(const struct admin_state *state) +{ + return READ_ONCE(state->current_state); +} + +/** + * vdo_set_admin_state_code() - Set the current admin state code. + * + * This function should be used primarily for initialization and by adminState internals. Most uses + * should go through the operation interfaces. 
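 *
 * For example, vdo_make_action_manager() seeds its state at construction
 * time with:
 *
 *      vdo_set_admin_state_code(&manager->state,
 *                               VDO_ADMIN_STATE_NORMAL_OPERATION);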
+ */ +static inline void +vdo_set_admin_state_code(struct admin_state *state, const struct admin_state_code *code) +{ + WRITE_ONCE(state->current_state, code); +} + +static inline bool __must_check vdo_is_state_normal(const struct admin_state *state) +{ + return vdo_get_admin_state_code(state)->normal; +} + +static inline bool __must_check vdo_is_state_suspending(const struct admin_state *state) +{ + return (vdo_get_admin_state_code(state) == VDO_ADMIN_STATE_SUSPENDING); +} + +static inline bool __must_check vdo_is_state_saving(const struct admin_state *state) +{ + return (vdo_get_admin_state_code(state) == VDO_ADMIN_STATE_SAVING); +} + +static inline bool __must_check vdo_is_state_saved(const struct admin_state *state) +{ + return (vdo_get_admin_state_code(state) == VDO_ADMIN_STATE_SAVED); +} + +static inline bool __must_check vdo_is_state_draining(const struct admin_state *state) +{ + return vdo_get_admin_state_code(state)->draining; +} + +static inline bool __must_check vdo_is_state_loading(const struct admin_state *state) +{ + return vdo_get_admin_state_code(state)->loading; +} + +static inline bool __must_check vdo_is_state_resuming(const struct admin_state *state) +{ + return (vdo_get_admin_state_code(state) == VDO_ADMIN_STATE_RESUMING); +} + +static inline bool __must_check vdo_is_state_clean_load(const struct admin_state *state) +{ + const struct admin_state_code *code = vdo_get_admin_state_code(state); + + return ((code == VDO_ADMIN_STATE_FORMATTING) || (code == VDO_ADMIN_STATE_LOADING)); +} + +static inline bool __must_check vdo_is_state_quiescing(const struct admin_state *state) +{ + return vdo_get_admin_state_code(state)->quiescing; +} + +static inline bool __must_check vdo_is_state_quiescent(const struct admin_state *state) +{ + return vdo_get_admin_state_code(state)->quiescent; +} + +bool __must_check +vdo_assert_load_operation(const struct admin_state_code *operation, struct vdo_completion *waiter); + +bool vdo_start_loading(struct admin_state *state, + const struct admin_state_code *operation, + struct vdo_completion *waiter, + vdo_admin_initiator *initiator); + +bool vdo_finish_loading(struct admin_state *state); + +bool vdo_finish_loading_with_result(struct admin_state *state, int result); + +bool vdo_start_resuming(struct admin_state *state, + const struct admin_state_code *operation, + struct vdo_completion *waiter, + vdo_admin_initiator *initiator); + +bool vdo_finish_resuming(struct admin_state *state); + +bool vdo_finish_resuming_with_result(struct admin_state *state, int result); + +int vdo_resume_if_quiescent(struct admin_state *state); + +bool vdo_start_draining(struct admin_state *state, + const struct admin_state_code *operation, + struct vdo_completion *waiter, + vdo_admin_initiator *initiator); + +bool vdo_finish_draining(struct admin_state *state); + +bool vdo_finish_draining_with_result(struct admin_state *state, int result); + +int vdo_start_operation(struct admin_state *state, const struct admin_state_code *operation); + +int vdo_start_operation_with_waiter(struct admin_state *state, + const struct admin_state_code *operation, + struct vdo_completion *waiter, + vdo_admin_initiator *initiator); + +bool vdo_finish_operation(struct admin_state *state, int result); + +#endif /* VDO_ADMIN_STATE_H */ From patchwork Thu Oct 26 21:41:15 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Matthew Sakai X-Patchwork-Id: 13437957 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on 
aws-us-west-2-korg-lkml-1.web.codeaurora.org From: Matthew Sakai To: dm-devel@lists.linux.dev, linux-block@vger.kernel.org Cc: Matthew Sakai Subject: [PATCH v4 18/39] dm vdo: add vio, the request object for vdo metadata Date: Thu, 26 Oct 2023 17:41:15 -0400 Message-Id: <20231026214136.1067410-19-msakai@redhat.com> In-Reply-To: <20231026214136.1067410-1-msakai@redhat.com> References: <20231026214136.1067410-1-msakai@redhat.com> List-ID: X-Mailing-List: linux-block@vger.kernel.org Add the data and methods that implement the vio object that is the basic unit of I/O in vdo. Co-developed-by: J. corwin Coburn Signed-off-by: J.
corwin Coburn Co-developed-by: Michael Sclafani Signed-off-by: Michael Sclafani Co-developed-by: Sweet Tea Dorminy Signed-off-by: Sweet Tea Dorminy Co-developed-by: Bruce Johnston Signed-off-by: Bruce Johnston Co-developed-by: Ken Raeburn Signed-off-by: Ken Raeburn Signed-off-by: Matthew Sakai --- drivers/md/dm-vdo/vio.c | 525 ++++++++++++++++++++++++++++++++++++++++ drivers/md/dm-vdo/vio.h | 221 +++++++++++++++++ 2 files changed, 746 insertions(+) create mode 100644 drivers/md/dm-vdo/vio.c create mode 100644 drivers/md/dm-vdo/vio.h diff --git a/drivers/md/dm-vdo/vio.c b/drivers/md/dm-vdo/vio.c new file mode 100644 index 000000000000..634add0be453 --- /dev/null +++ b/drivers/md/dm-vdo/vio.c @@ -0,0 +1,525 @@ +// SPDX-License-Identifier: GPL-2.0-only +/* + * Copyright 2023 Red Hat + */ + +#include "vio.h" + +#include +#include +#include +#include + +#include "logger.h" +#include "memory-alloc.h" +#include "permassert.h" + +#include "constants.h" +#include "io-submitter.h" +#include "vdo.h" + +/* A vio_pool is a collection of preallocated vios. */ +struct vio_pool { + /** The number of objects managed by the pool */ + size_t size; + /** The list of objects which are available */ + struct list_head available; + /** The queue of requestors waiting for objects from the pool */ + struct wait_queue waiting; + /** The number of objects currently in use */ + size_t busy_count; + /** The list of objects which are in use */ + struct list_head busy; + /** The ID of the thread on which this pool may be used */ + thread_id_t thread_id; + /** The buffer backing the pool's vios */ + char *buffer; + /** The pool entries */ + struct pooled_vio vios[]; +}; + +physical_block_number_t pbn_from_vio_bio(struct bio *bio) +{ + struct vio *vio = bio->bi_private; + struct vdo *vdo = vio->completion.vdo; + physical_block_number_t pbn = bio->bi_iter.bi_sector / VDO_SECTORS_PER_BLOCK; + + return ((pbn == VDO_GEOMETRY_BLOCK_LOCATION) ? pbn : pbn + vdo->geometry.bio_offset); +} + +static int create_multi_block_bio(block_count_t size, struct bio **bio_ptr) +{ + struct bio *bio = NULL; + int result; + + result = UDS_ALLOCATE_EXTENDED(struct bio, size + 1, struct bio_vec, "bio", &bio); + if (result != VDO_SUCCESS) + return result; + + *bio_ptr = bio; + return VDO_SUCCESS; +} + +int vdo_create_bio(struct bio **bio_ptr) +{ + return create_multi_block_bio(1, bio_ptr); +} + +void vdo_free_bio(struct bio *bio) +{ + if (bio == NULL) + return; + + bio_uninit(bio); + UDS_FREE(UDS_FORGET(bio)); +} + +int allocate_vio_components(struct vdo *vdo, + enum vio_type vio_type, + enum vio_priority priority, + void *parent, + unsigned int block_count, + char *data, + struct vio *vio) +{ + struct bio *bio; + int result; + + result = ASSERT(block_count <= MAX_BLOCKS_PER_VIO, + "block count %u does not exceed maximum %u", + block_count, + MAX_BLOCKS_PER_VIO); + if (result != VDO_SUCCESS) + return result; + + result = ASSERT(((vio_type != VIO_TYPE_UNINITIALIZED) && (vio_type != VIO_TYPE_DATA)), + "%d is a metadata type", + vio_type); + if (result != VDO_SUCCESS) + return result; + + result = create_multi_block_bio(block_count, &bio); + if (result != VDO_SUCCESS) + return result; + + initialize_vio(vio, bio, block_count, vio_type, priority, vdo); + vio->completion.parent = parent; + vio->data = data; + return VDO_SUCCESS; +} + +/** + * create_multi_block_metadata_vio() - Create a vio. + * @vdo: The vdo on which the vio will operate. + * @vio_type: The type of vio to create. + * @priority: The relative priority to assign to the vio. 
+ * @parent: The parent of the vio. + * @block_count: The size of the vio in blocks. + * @data: The buffer. + * @vio_ptr: A pointer to hold the new vio. + * + * Return: VDO_SUCCESS or an error. + */ +int create_multi_block_metadata_vio(struct vdo *vdo, + enum vio_type vio_type, + enum vio_priority priority, + void *parent, + unsigned int block_count, + char *data, + struct vio **vio_ptr) +{ + struct vio *vio; + int result; + + /* If struct vio grows past 256 bytes, we'll lose benefits of VDOSTORY-176. */ + STATIC_ASSERT(sizeof(struct vio) <= 256); + + /* + * Metadata vios should use direct allocation and not use the buffer pool, which is + * reserved for submissions from the linux block layer. + */ + result = UDS_ALLOCATE(1, struct vio, __func__, &vio); + if (result != VDO_SUCCESS) { + uds_log_error("metadata vio allocation failure %d", result); + return result; + } + + result = allocate_vio_components(vdo, vio_type, priority, parent, block_count, data, vio); + if (result != VDO_SUCCESS) { + UDS_FREE(vio); + return result; + } + + *vio_ptr = vio; + return VDO_SUCCESS; +} + +/** + * free_vio_components() - Free the components of a vio embedded in a larger structure. + * @vio: The vio to destroy + */ +void free_vio_components(struct vio *vio) +{ + if (vio == NULL) + return; + + BUG_ON(is_data_vio(vio)); + vdo_free_bio(UDS_FORGET(vio->bio)); +} + +/** + * free_vio() - Destroy a vio. + * @vio: The vio to destroy. + */ +void free_vio(struct vio *vio) +{ + free_vio_components(vio); + UDS_FREE(vio); +} + +/* Set bio properties for a VDO read or write. */ +void vdo_set_bio_properties(struct bio *bio, + struct vio *vio, + bio_end_io_t callback, + unsigned int bi_opf, + physical_block_number_t pbn) +{ + struct vdo *vdo = vio->completion.vdo; + struct device_config *config = vdo->device_config; + + pbn -= vdo->geometry.bio_offset; + vio->bio_zone = ((pbn / config->thread_counts.bio_rotation_interval) % + config->thread_counts.bio_threads); + + bio->bi_private = vio; + bio->bi_end_io = callback; + bio->bi_opf = bi_opf; + bio->bi_iter.bi_sector = pbn * VDO_SECTORS_PER_BLOCK; +} + +/* + * Prepares the bio to perform IO with the specified buffer. May only be used on a VDO-allocated + * bio, as it assumes the bio wraps a 4k buffer that is 4k aligned, but there does not have to be a + * vio associated with the bio. + */ +int vio_reset_bio(struct vio *vio, + char *data, + bio_end_io_t callback, + unsigned int bi_opf, + physical_block_number_t pbn) +{ + int bvec_count, offset, len, i; + struct bio *bio = vio->bio; + + bio_reset(bio, bio->bi_bdev, bi_opf); + vdo_set_bio_properties(bio, vio, callback, bi_opf, pbn); + if (data == NULL) + return VDO_SUCCESS; + + bio->bi_io_vec = bio->bi_inline_vecs; + bio->bi_max_vecs = vio->block_count + 1; + len = VDO_BLOCK_SIZE * vio->block_count; + offset = offset_in_page(data); + bvec_count = DIV_ROUND_UP(offset + len, PAGE_SIZE); + + /* + * If we knew that data was always on one page, or contiguous pages, we wouldn't need the + * loop. But if we're using vmalloc, it's not impossible that the data is in different + * pages that can't be merged in bio_add_page... + */ + for (i = 0; (i < bvec_count) && (len > 0); i++) { + struct page *page; + int bytes_added; + int bytes = PAGE_SIZE - offset; + + if (bytes > len) + bytes = len; + + page = is_vmalloc_addr(data) ? 
vmalloc_to_page(data) : virt_to_page(data); + bytes_added = bio_add_page(bio, page, bytes, offset); + + if (bytes_added != bytes) + return uds_log_error_strerror(VDO_BIO_CREATION_FAILED, + "Could only add %i bytes to bio", + bytes_added); + + data += bytes; + len -= bytes; + offset = 0; + } + + return VDO_SUCCESS; +} + +/** + * update_vio_error_stats() - Update per-vio error stats and log the error. + * @vio: The vio which got an error. + * @format: The format of the message to log (a printf style format). + */ +void update_vio_error_stats(struct vio *vio, const char *format, ...) +{ + static DEFINE_RATELIMIT_STATE(error_limiter, + DEFAULT_RATELIMIT_INTERVAL, + DEFAULT_RATELIMIT_BURST); + va_list args; + int priority; + struct vdo *vdo = vio->completion.vdo; + + switch (vio->completion.result) { + case VDO_READ_ONLY: + atomic64_inc(&vdo->stats.read_only_error_count); + return; + + case VDO_NO_SPACE: + atomic64_inc(&vdo->stats.no_space_error_count); + priority = UDS_LOG_DEBUG; + break; + + default: + priority = UDS_LOG_ERR; + } + + if (!__ratelimit(&error_limiter)) + return; + + va_start(args, format); + uds_vlog_strerror(priority, vio->completion.result, UDS_LOGGING_MODULE_NAME, format, args); + va_end(args); +} + +void vio_record_metadata_io_error(struct vio *vio) +{ + const char *description; + physical_block_number_t pbn = pbn_from_vio_bio(vio->bio); + + if (bio_op(vio->bio) == REQ_OP_READ) + description = "read"; + else if ((vio->bio->bi_opf & REQ_PREFLUSH) == REQ_PREFLUSH) + description = (((vio->bio->bi_opf & REQ_FUA) == REQ_FUA) ? + "write+preflush+fua" : + "write+preflush"); + else if ((vio->bio->bi_opf & REQ_FUA) == REQ_FUA) + description = "write+fua"; + else + description = "write"; + + update_vio_error_stats(vio, + "Completing %s vio of type %u for physical block %llu with error", + description, + vio->type, + (unsigned long long) pbn); +} + +/** + * make_vio_pool() - Create a new vio pool. + * @vdo: The vdo. + * @pool_size: The number of vios in the pool. + * @thread_id: The ID of the thread using this pool. + * @vio_type: The type of vios in the pool. + * @priority: The priority with which vios from the pool should be enqueued. + * @context: The context that each entry will have. + * @pool_ptr: The resulting pool. + * + * Return: A success or error code. + */ +int make_vio_pool(struct vdo *vdo, + size_t pool_size, + thread_id_t thread_id, + enum vio_type vio_type, + enum vio_priority priority, + void *context, + struct vio_pool **pool_ptr) +{ + struct vio_pool *pool; + char *ptr; + int result; + + result = UDS_ALLOCATE_EXTENDED(struct vio_pool, + pool_size, + struct pooled_vio, + __func__, + &pool); + if (result != VDO_SUCCESS) + return result; + + pool->thread_id = thread_id; + INIT_LIST_HEAD(&pool->available); + INIT_LIST_HEAD(&pool->busy); + + result = UDS_ALLOCATE(pool_size * VDO_BLOCK_SIZE, char, "VIO pool buffer", &pool->buffer); + if (result != VDO_SUCCESS) { + free_vio_pool(pool); + return result; + } + + ptr = pool->buffer; + for (pool->size = 0; pool->size < pool_size; pool->size++, ptr += VDO_BLOCK_SIZE) { + struct pooled_vio *pooled = &pool->vios[pool->size]; + + result = allocate_vio_components(vdo, + vio_type, + priority, + NULL, + 1, + ptr, + &pooled->vio); + if (result != VDO_SUCCESS) { + free_vio_pool(pool); + return result; + } + + pooled->context = context; + list_add_tail(&pooled->pool_entry, &pool->available); + } + + *pool_ptr = pool; + return VDO_SUCCESS; +} + +/** + * free_vio_pool() - Destroy a vio pool. + * @pool: The pool to free. 
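+ *
+ * A rough lifecycle sketch, for illustration only (error handling elided; the
+ * pool size, vio type, priority, and context shown are hypothetical choices,
+ * not taken from any particular caller):
+ *
+ *   struct vio_pool *pool;
+ *   result = make_vio_pool(vdo, 4, thread_id, VIO_TYPE_BLOCK_MAP,
+ *                          VIO_PRIORITY_HIGH, context, &pool);
+ *   then, on thread_id only:
+ *   acquire_vio_from_pool(pool, &waiter);  - the waiter callback receives a pooled_vio
+ *   ... perform metadata I/O with pooled->vio ...
+ *   return_vio_to_pool(pool, pooled);
+ *   and once no entries are busy:
+ *   free_vio_pool(pool);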
+ */ +void free_vio_pool(struct vio_pool *pool) +{ + struct pooled_vio *pooled, *tmp; + + if (pool == NULL) + return; + + /* Remove all available vios from the object pool. */ + ASSERT_LOG_ONLY(!vdo_has_waiters(&pool->waiting), + "VIO pool must not have any waiters when being freed"); + ASSERT_LOG_ONLY((pool->busy_count == 0), + "VIO pool must not have %zu busy entries when being freed", + pool->busy_count); + ASSERT_LOG_ONLY(list_empty(&pool->busy), + "VIO pool must not have busy entries when being freed"); + + list_for_each_entry_safe(pooled, tmp, &pool->available, pool_entry) { + list_del(&pooled->pool_entry); + free_vio_components(&pooled->vio); + pool->size--; + } + + ASSERT_LOG_ONLY(pool->size == 0, + "VIO pool must not have missing entries when being freed"); + + UDS_FREE(UDS_FORGET(pool->buffer)); + UDS_FREE(pool); +} + +/** + * is_vio_pool_busy() - Check whether an vio pool has outstanding entries. + * + * Return: true if the pool is busy. + */ +bool is_vio_pool_busy(struct vio_pool *pool) +{ + return (pool->busy_count != 0); +} + +/** + * acquire_vio_from_pool() - Acquire a vio and buffer from the pool (asynchronous). + * @pool: The vio pool. + * @waiter: Object that is requesting a vio. + */ +void acquire_vio_from_pool(struct vio_pool *pool, struct waiter *waiter) +{ + struct pooled_vio *pooled; + + ASSERT_LOG_ONLY((pool->thread_id == vdo_get_callback_thread_id()), + "acquire from active vio_pool called from correct thread"); + + if (list_empty(&pool->available)) { + vdo_enqueue_waiter(&pool->waiting, waiter); + return; + } + + pooled = list_first_entry(&pool->available, struct pooled_vio, pool_entry); + pool->busy_count++; + list_move_tail(&pooled->pool_entry, &pool->busy); + (*waiter->callback)(waiter, pooled); +} + +/** + * return_vio_to_pool() - Return a vio to the pool + * @pool: The vio pool. + * @vio: The pooled vio to return. + */ +void return_vio_to_pool(struct vio_pool *pool, struct pooled_vio *vio) +{ + ASSERT_LOG_ONLY((pool->thread_id == vdo_get_callback_thread_id()), + "vio pool entry returned on same thread as it was acquired"); + + vio->vio.completion.error_handler = NULL; + vio->vio.completion.parent = NULL; + if (vdo_has_waiters(&pool->waiting)) { + vdo_notify_next_waiter(&pool->waiting, NULL, vio); + return; + } + + list_move_tail(&vio->pool_entry, &pool->available); + --pool->busy_count; +} + +/* + * Various counting functions for statistics. + * These are used for bios coming into VDO, as well as bios generated by VDO. + */ +void vdo_count_bios(struct atomic_bio_stats *bio_stats, struct bio *bio) +{ + if (((bio->bi_opf & REQ_PREFLUSH) != 0) && (bio->bi_iter.bi_size == 0)) { + atomic64_inc(&bio_stats->empty_flush); + atomic64_inc(&bio_stats->flush); + return; + } + + switch (bio_op(bio)) { + case REQ_OP_WRITE: + atomic64_inc(&bio_stats->write); + break; + case REQ_OP_READ: + atomic64_inc(&bio_stats->read); + break; + case REQ_OP_DISCARD: + atomic64_inc(&bio_stats->discard); + break; + /* + * All other operations are filtered out in dmvdo.c, or not created by VDO, so + * shouldn't exist. 
+ */ + default: + ASSERT_LOG_ONLY(0, "Bio operation %d not a write, read, discard, or empty flush", + bio_op(bio)); + } + + if ((bio->bi_opf & REQ_PREFLUSH) != 0) + atomic64_inc(&bio_stats->flush); + if (bio->bi_opf & REQ_FUA) + atomic64_inc(&bio_stats->fua); +} + +static void count_all_bios_completed(struct vio *vio, struct bio *bio) +{ + struct atomic_statistics *stats = &vio->completion.vdo->stats; + + if (is_data_vio(vio)) { + vdo_count_bios(&stats->bios_out_completed, bio); + return; + } + + vdo_count_bios(&stats->bios_meta_completed, bio); + if (vio->type == VIO_TYPE_RECOVERY_JOURNAL) + vdo_count_bios(&stats->bios_journal_completed, bio); + else if (vio->type == VIO_TYPE_BLOCK_MAP) + vdo_count_bios(&stats->bios_page_cache_completed, bio); +} + +void vdo_count_completed_bios(struct bio *bio) +{ + struct vio *vio = (struct vio *) bio->bi_private; + + atomic64_inc(&vio->completion.vdo->stats.bios_completed); + count_all_bios_completed(vio, bio); +} diff --git a/drivers/md/dm-vdo/vio.h b/drivers/md/dm-vdo/vio.h new file mode 100644 index 000000000000..fea9ea847b3d --- /dev/null +++ b/drivers/md/dm-vdo/vio.h @@ -0,0 +1,221 @@ +/* SPDX-License-Identifier: GPL-2.0-only */ +/* + * Copyright 2023 Red Hat + */ + +#ifndef VIO_H +#define VIO_H + +#include +#include +#include +#include +#include + +#include "completion.h" +#include "constants.h" +#include "types.h" +#include "vdo.h" + +enum { + MAX_BLOCKS_PER_VIO = (BIO_MAX_VECS << PAGE_SHIFT) / VDO_BLOCK_SIZE, +}; + +struct pooled_vio { + /* The underlying vio */ + struct vio vio; + /* The list entry for chaining pooled vios together */ + struct list_head list_entry; + /* The context set by the pool */ + void *context; + /* The list entry used by the pool */ + struct list_head pool_entry; +}; + +/** + * as_vio() - Convert a generic vdo_completion to a vio. + * @completion: The completion to convert. + * + * Return: The completion as a vio. + */ +static inline struct vio *as_vio(struct vdo_completion *completion) +{ + vdo_assert_completion_type(completion, VIO_COMPLETION); + return container_of(completion, struct vio, completion); +} + +/** + * get_vio_bio_zone_thread_id() - Get the thread id of the bio zone in which a vio should submit + * its I/O. + * @vio: The vio. + * + * Return: The id of the bio zone thread the vio should use. + */ +static inline thread_id_t __must_check get_vio_bio_zone_thread_id(struct vio *vio) +{ + return vio->completion.vdo->thread_config.bio_threads[vio->bio_zone]; +} + +physical_block_number_t __must_check pbn_from_vio_bio(struct bio *bio); + +/** + * assert_vio_in_bio_zone() - Check that a vio is running on the correct thread for its bio zone. + * @vio: The vio to check. 
+ */ +static inline void assert_vio_in_bio_zone(struct vio *vio) +{ + thread_id_t expected = get_vio_bio_zone_thread_id(vio); + thread_id_t thread_id = vdo_get_callback_thread_id(); + + ASSERT_LOG_ONLY((expected == thread_id), + "vio I/O for physical block %llu on thread %u, should be on bio zone thread %u", + (unsigned long long) pbn_from_vio_bio(vio->bio), + thread_id, + expected); +} + +int vdo_create_bio(struct bio **bio_ptr); +void vdo_free_bio(struct bio *bio); +int allocate_vio_components(struct vdo *vdo, + enum vio_type vio_type, + enum vio_priority priority, + void *parent, + unsigned int block_count, + char *data, + struct vio *vio); +int __must_check create_multi_block_metadata_vio(struct vdo *vdo, + enum vio_type vio_type, + enum vio_priority priority, + void *parent, + unsigned int block_count, + char *data, + struct vio **vio_ptr); + +static inline int __must_check +create_metadata_vio(struct vdo *vdo, + enum vio_type vio_type, + enum vio_priority priority, + void *parent, + char *data, + struct vio **vio_ptr) +{ + return create_multi_block_metadata_vio(vdo, vio_type, priority, parent, 1, data, vio_ptr); +} + +void free_vio_components(struct vio *vio); +void free_vio(struct vio *vio); + +/** + * initialize_vio() - Initialize a vio. + * @vio: The vio to initialize. + * @bio: The bio this vio should use for its I/O. + * @block_count: The size of this vio in vdo blocks. + * @vio_type: The vio type. + * @priority: The relative priority of the vio. + * @vdo: The vdo for this vio. + */ +static inline void initialize_vio(struct vio *vio, + struct bio *bio, + unsigned int block_count, + enum vio_type vio_type, + enum vio_priority priority, + struct vdo *vdo) +{ + /* data_vio's may not span multiple blocks */ + BUG_ON((vio_type == VIO_TYPE_DATA) && (block_count != 1)); + + vio->bio = bio; + vio->block_count = block_count; + vio->type = vio_type; + vio->priority = priority; + vdo_initialize_completion(&vio->completion, vdo, VIO_COMPLETION); +} + +void vdo_set_bio_properties(struct bio *bio, + struct vio *vio, + bio_end_io_t callback, + unsigned int bi_opf, + physical_block_number_t pbn); + +int vio_reset_bio(struct vio *vio, + char *data, + bio_end_io_t callback, + unsigned int bi_opf, + physical_block_number_t pbn); + +void update_vio_error_stats(struct vio *vio, const char *format, ...) + __printf(2, 3); + +/** + * is_data_vio() - Check whether a vio is servicing an external data request. + * @vio: The vio to check. + */ +static inline bool is_data_vio(struct vio *vio) +{ + return (vio->type == VIO_TYPE_DATA); +} + +/** + * get_metadata_priority() - Convert a vio's priority to a work item priority. + * @vio: The vio. + * + * Return: The priority with which to submit the vio's bio. + */ +static inline enum vdo_completion_priority get_metadata_priority(struct vio *vio) +{ + return ((vio->priority == VIO_PRIORITY_HIGH) ? + BIO_Q_HIGH_PRIORITY : + BIO_Q_METADATA_PRIORITY); +} + +/** + * continue_vio() - Enqueue a vio to run its next callback. + * @vio: The vio to continue. + * + * Return: The result of the current operation. + */ +static inline void continue_vio(struct vio *vio, int result) +{ + if (unlikely(result != VDO_SUCCESS)) + vdo_set_completion_result(&vio->completion, result); + + vdo_enqueue_completion(&vio->completion, VDO_WORK_Q_DEFAULT_PRIORITY); +} + +void vdo_count_bios(struct atomic_bio_stats *bio_stats, struct bio *bio); +void vdo_count_completed_bios(struct bio *bio); + +/** + * continue_vio_after_io() - Continue a vio now that its I/O has returned. 
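+ * @vio: The vio whose I/O has completed.
+ * @callback: The callback to run next.
+ * @thread: The ID of the thread on which to run the callback.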
+ */ +static inline void continue_vio_after_io(struct vio *vio, vdo_action *callback, thread_id_t thread) +{ + vdo_count_completed_bios(vio->bio); + vdo_set_completion_callback(&vio->completion, callback, thread); + continue_vio(vio, blk_status_to_errno(vio->bio->bi_status)); +} + +void vio_record_metadata_io_error(struct vio *vio); + +/* A vio_pool is a collection of preallocated vios used to write arbitrary metadata blocks. */ + +static inline struct pooled_vio *vio_as_pooled_vio(struct vio *vio) +{ + return container_of(vio, struct pooled_vio, vio); +} + +struct vio_pool; + +int __must_check make_vio_pool(struct vdo *vdo, + size_t pool_size, + thread_id_t thread_id, + enum vio_type vio_type, + enum vio_priority priority, + void *context, + struct vio_pool **pool_ptr); +void free_vio_pool(struct vio_pool *pool); +bool __must_check is_vio_pool_busy(struct vio_pool *pool); +void acquire_vio_from_pool(struct vio_pool *pool, struct waiter *waiter); +void return_vio_to_pool(struct vio_pool *pool, struct pooled_vio *vio); + +#endif /* VIO_H */ From patchwork Thu Oct 26 21:41:16 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Matthew Sakai X-Patchwork-Id: 13437947 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id BD986C27C46 for ; Thu, 26 Oct 2023 21:44:27 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1344925AbjJZVo1 (ORCPT ); Thu, 26 Oct 2023 17:44:27 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:52856 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1344889AbjJZVoQ (ORCPT ); Thu, 26 Oct 2023 17:44:16 -0400 Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.133.124]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id A4241D4B for ; Thu, 26 Oct 2023 14:41:45 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1698356504; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=rPEIvODrb0Yro+9cxyGOej/nchgZImaJNVhpOgaH1kY=; b=PiapqtUG7uVzMBABoNllfNHq2j9ccX62UWQGRbgNP0eXlgf5CX8PubfloUpqZloJihDkKG YmlmHbIclyHCZoyBQ0Y6WCl/LwIhxCH3dP4wjvfcjmxqrLcoxb3mjntM9hUvVoc0NnQM61 YQYGQbL8tz9ZEVPReEFRPLMo5r29ek8= Received: from mimecast-mx02.redhat.com (mimecast-mx02.redhat.com [66.187.233.88]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.3, cipher=TLS_AES_256_GCM_SHA384) id us-mta-510-jNftyU9DPiW0ls92qDLIIQ-1; Thu, 26 Oct 2023 17:41:40 -0400 X-MC-Unique: jNftyU9DPiW0ls92qDLIIQ-1 Received: from smtp.corp.redhat.com (int-mx07.intmail.prod.int.rdu2.redhat.com [10.11.54.7]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (2048 bits) server-digest SHA256) (No client certificate requested) by mimecast-mx02.redhat.com (Postfix) with ESMTPS id 7652088D187; Thu, 26 Oct 2023 21:41:40 +0000 (UTC) Received: from pbitcolo-build-10.permabit.com (pbitcolo-build-10.permabit.lab.eng.bos.redhat.com [10.19.117.76]) by smtp.corp.redhat.com (Postfix) with ESMTP id 589851C060AE; Thu, 26 Oct 2023 21:41:40 +0000 (UTC) Received: by pbitcolo-build-10.permabit.com 
(Postfix, from userid 1138) id 0E2293003C; Thu, 26 Oct 2023 17:41:40 -0400 (EDT) From: Matthew Sakai To: dm-devel@lists.linux.dev, linux-block@vger.kernel.org Cc: Matthew Sakai Subject: [PATCH v4 19/39] dm vdo: add data_vio, the request object which services incoming bios Date: Thu, 26 Oct 2023 17:41:16 -0400 Message-Id: <20231026214136.1067410-20-msakai@redhat.com> In-Reply-To: <20231026214136.1067410-1-msakai@redhat.com> References: <20231026214136.1067410-1-msakai@redhat.com> MIME-Version: 1.0 X-Scanned-By: MIMEDefang 3.4.1 on 10.11.54.7 Precedence: bulk List-ID: X-Mailing-List: linux-block@vger.kernel.org Add the data and methods that implement the data_vio object that handles user data bios as they are processed. Co-developed-by: J. corwin Coburn Signed-off-by: J. corwin Coburn Co-developed-by: Michael Sclafani Signed-off-by: Michael Sclafani Co-developed-by: Sweet Tea Dorminy Signed-off-by: Sweet Tea Dorminy Co-developed-by: Bruce Johnston Signed-off-by: Bruce Johnston Co-developed-by: Ken Raeburn Signed-off-by: Ken Raeburn Signed-off-by: Matthew Sakai --- drivers/md/dm-vdo/data-vio.c | 2076 ++++++++++++++++++++++++++++++++++ drivers/md/dm-vdo/data-vio.h | 683 +++++++++++ 2 files changed, 2759 insertions(+) create mode 100644 drivers/md/dm-vdo/data-vio.c create mode 100644 drivers/md/dm-vdo/data-vio.h diff --git a/drivers/md/dm-vdo/data-vio.c b/drivers/md/dm-vdo/data-vio.c new file mode 100644 index 000000000000..6f03b4fac569 --- /dev/null +++ b/drivers/md/dm-vdo/data-vio.c @@ -0,0 +1,2076 @@ +// SPDX-License-Identifier: GPL-2.0-only +/* + * Copyright 2023 Red Hat + */ + +#include "data-vio.h" + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include "logger.h" +#include "memory-alloc.h" +#include "murmurhash3.h" +#include "permassert.h" + +#include "block-map.h" +#include "dump.h" +#include "encodings.h" +#include "int-map.h" +#include "io-submitter.h" +#include "logical-zone.h" +#include "packer.h" +#include "recovery-journal.h" +#include "slab-depot.h" +#include "status-codes.h" +#include "types.h" +#include "vdo.h" +#include "vio.h" +#include "wait-queue.h" + +/** + * DOC: Bio flags. + * + * For certain flags set on user bios, if the user bio has not yet been acknowledged, setting those + * flags on our own bio(s) for that request may help underlying layers better fulfill the user + * bio's needs. This constant contains the aggregate of those flags; VDO strips all the other + * flags, as they convey incorrect information. + * + * These flags are always irrelevant if we have already finished the user bio as they are only + * hints on IO importance. If VDO has finished the user bio, any remaining IO done doesn't care how + * important finishing the finished bio was. + * + * Note that bio.c contains the complete list of flags we believe may be set; the following list + * explains the action taken with each of those flags VDO could receive: + * + * * REQ_SYNC: Passed down if the user bio is not yet completed, since it indicates the user bio + * completion is required for further work to be done by the issuer. + * * REQ_META: Passed down if the user bio is not yet completed, since it may mean the lower layer + * treats it as more urgent, similar to REQ_SYNC. + * * REQ_PRIO: Passed down if the user bio is not yet completed, since it indicates the user bio is + * important. + * * REQ_NOMERGE: Set only if the incoming bio was split; irrelevant to VDO IO. 
+ * * REQ_IDLE: Set if the incoming bio had more IO quickly following; VDO's IO pattern doesn't + * match incoming IO, so this flag is incorrect for it. + * * REQ_FUA: Handled separately, and irrelevant to VDO IO otherwise. + * * REQ_RAHEAD: Passed down, as, for reads, it indicates trivial importance. + * * REQ_BACKGROUND: Not passed down, as VIOs are a limited resource and VDO needs them recycled + * ASAP to service heavy load, which is the only place where REQ_BACKGROUND might aid in load + * prioritization. + */ +static unsigned int PASSTHROUGH_FLAGS = (REQ_PRIO | REQ_META | REQ_SYNC | REQ_RAHEAD); + +/** + * DOC: + * + * The data_vio_pool maintains the pool of data_vios which a vdo uses to service incoming bios. For + * correctness, and in order to avoid potentially expensive or blocking memory allocations during + * normal operation, the number of concurrently active data_vios is capped. Furthermore, in order + * to avoid starvation of reads and writes, at most 75% of the data_vios may be used for + * discards. The data_vio_pool is responsible for enforcing these limits. Threads submitting bios + * for which a data_vio or discard permit are not available will block until the necessary + * resources are available. The pool is also responsible for distributing resources to blocked + * threads and waking them. Finally, the pool attempts to batch the work of recycling data_vios by + * performing the work of actually assigning resources to blocked threads or placing data_vios back + * into the pool on a single cpu at a time. + * + * The pool contains two "limiters", one for tracking data_vios and one for tracking discard + * permits. The limiters also provide safe cross-thread access to pool statistics without the need + * to take the pool's lock. When a thread submits a bio to a vdo device, it will first attempt to + * get a discard permit if it is a discard, and then to get a data_vio. If the necessary resources + * are available, the incoming bio will be assigned to the acquired data_vio, and it will be + * launched. However, if either of these are unavailable, the arrival time of the bio is recorded + * in the bio's bi_private field, the bio and its submitter are both queued on the appropriate + * limiter and the submitting thread will then put itself to sleep. (note that this mechanism will + * break if jiffies are only 32 bits.) + * + * Whenever a data_vio has completed processing for the bio it was servicing, release_data_vio() + * will be called on it. This function will add the data_vio to a funnel queue, and then check the + * state of the pool. If the pool is not currently processing released data_vios, the pool's + * completion will be enqueued on a cpu queue. This obviates the need for the releasing threads to + * hold the pool's lock, and also batches release work while avoiding starvation of the cpu + * threads. + * + * Whenever the pool's completion is run on a cpu thread, it calls process_release_callback() which + * processes a batch of returned data_vios (currently at most 32) from the pool's funnel queue. For + * each data_vio, it first checks whether that data_vio was processing a discard. If so, and there + * is a blocked bio waiting for a discard permit, that permit is notionally transferred to the + * eldest discard waiter, and that waiter is moved to the end of the list of discard bios waiting + * for a data_vio. If there are no discard waiters, the discard permit is returned to the pool. 
+ * Next, the data_vio is assigned to the oldest blocked bio which either has a discard permit, or + * doesn't need one and relaunched. If neither of these exist, the data_vio is returned to the + * pool. Finally, if any waiting bios were launched, the threads which blocked trying to submit + * them are awakened. + */ + +enum { + DATA_VIO_RELEASE_BATCH_SIZE = 128, +}; + +static const unsigned int VDO_SECTORS_PER_BLOCK_MASK = VDO_SECTORS_PER_BLOCK - 1; +static const u32 COMPRESSION_STATUS_MASK = 0xff; +static const u32 MAY_NOT_COMPRESS_MASK = 0x80000000; + +struct limiter; +typedef void assigner(struct limiter *limiter); + +/* Bookkeeping structure for a single type of resource. */ +struct limiter { + /* The data_vio_pool to which this limiter belongs */ + struct data_vio_pool *pool; + /* The maximum number of data_vios available */ + data_vio_count_t limit; + /* The number of resources in use */ + data_vio_count_t busy; + /* The maximum number of resources ever simultaneously in use */ + data_vio_count_t max_busy; + /* The number of resources to release */ + data_vio_count_t release_count; + /* The number of waiters to wake */ + data_vio_count_t wake_count; + /* The list of waiting bios which are known to process_release_callback() */ + struct bio_list waiters; + /* The list of waiting bios which are not yet known to process_release_callback() */ + struct bio_list new_waiters; + /* The list of waiters which have their permits */ + struct bio_list *permitted_waiters; + /* The function for assigning a resource to a waiter */ + assigner *assigner; + /* The queue of blocked threads */ + wait_queue_head_t blocked_threads; + /* The arrival time of the eldest waiter */ + u64 arrival; +}; + +/* + * A data_vio_pool is a collection of preallocated data_vios which may be acquired from any thread, + * and are released in batches. + */ +struct data_vio_pool { + /* Completion for scheduling releases */ + struct vdo_completion completion; + /* The administrative state of the pool */ + struct admin_state state; + /* Lock protecting the pool */ + spinlock_t lock; + /* The main limiter controlling the total data_vios in the pool. */ + struct limiter limiter; + /* The limiter controlling data_vios for discard */ + struct limiter discard_limiter; + /* The list of bios which have discard permits but still need a data_vio */ + struct bio_list permitted_discards; + /* The list of available data_vios */ + struct list_head available; + /* The queue of data_vios waiting to be returned to the pool */ + struct funnel_queue *queue; + /* Whether the pool is processing, or scheduled to process releases */ + atomic_t processing; + /* The data vios in the pool */ + struct data_vio data_vios[]; +}; + +static const char * const ASYNC_OPERATION_NAMES[] = { + "launch", + "acknowledge_write", + "acquire_hash_lock", + "attempt_logical_block_lock", + "lock_duplicate_pbn", + "check_for_duplication", + "cleanup", + "compress_data_vio", + "find_block_map_slot", + "get_mapped_block_for_read", + "get_mapped_block_for_write", + "hash_data_vio", + "journal_remapping", + "vdo_attempt_packing", + "put_mapped_block", + "read_data_vio", + "update_dedupe_index", + "update_reference_counts", + "verify_duplication", + "write_data_vio", +}; + +/* The steps taken cleaning up a VIO, in the order they are performed. 
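+ * Cleanup for a write begins at VIO_CLEANUP_START; cleanup for a read skips
+ * the hash lock, allocation, and recovery stages and begins at
+ * VIO_RELEASE_LOGICAL (see complete_data_vio()).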
*/ +enum data_vio_cleanup_stage { + VIO_CLEANUP_START, + VIO_RELEASE_HASH_LOCK = VIO_CLEANUP_START, + VIO_RELEASE_ALLOCATED, + VIO_RELEASE_RECOVERY_LOCKS, + VIO_RELEASE_LOGICAL, + VIO_CLEANUP_DONE +}; + +static inline struct data_vio_pool * __must_check +as_data_vio_pool(struct vdo_completion *completion) +{ + vdo_assert_completion_type(completion, VDO_DATA_VIO_POOL_COMPLETION); + return container_of(completion, struct data_vio_pool, completion); +} + +static inline u64 get_arrival_time(struct bio *bio) +{ + return (u64) bio->bi_private; +} + +/** + * check_for_drain_complete_locked() - Check whether a data_vio_pool has no outstanding data_vios + * or waiters while holding the pool's lock. + */ +static bool check_for_drain_complete_locked(struct data_vio_pool *pool) +{ + if (pool->limiter.busy > 0) + return false; + + ASSERT_LOG_ONLY((pool->discard_limiter.busy == 0), "no outstanding discard permits"); + + return (bio_list_empty(&pool->limiter.new_waiters) && + bio_list_empty(&pool->discard_limiter.new_waiters)); +} + +static void initialize_lbn_lock(struct data_vio *data_vio, logical_block_number_t lbn) +{ + struct vdo *vdo = vdo_from_data_vio(data_vio); + zone_count_t zone_number; + struct lbn_lock *lock = &data_vio->logical; + + lock->lbn = lbn; + lock->locked = false; + vdo_initialize_wait_queue(&lock->waiters); + zone_number = vdo_compute_logical_zone(data_vio); + lock->zone = &vdo->logical_zones->zones[zone_number]; +} + +static void launch_locked_request(struct data_vio *data_vio) +{ + data_vio->logical.locked = true; + if (data_vio->write) { + struct vdo *vdo = vdo_from_data_vio(data_vio); + + if (vdo_is_read_only(vdo)) { + continue_data_vio_with_error(data_vio, VDO_READ_ONLY); + return; + } + } + + data_vio->last_async_operation = VIO_ASYNC_OP_FIND_BLOCK_MAP_SLOT; + vdo_find_block_map_slot(data_vio); +} + +static void acknowledge_data_vio(struct data_vio *data_vio) +{ + struct vdo *vdo = vdo_from_data_vio(data_vio); + struct bio *bio = data_vio->user_bio; + int error = vdo_map_to_system_error(data_vio->vio.completion.result); + + if (bio == NULL) + return; + + ASSERT_LOG_ONLY((data_vio->remaining_discard <= + (u32) (VDO_BLOCK_SIZE - data_vio->offset)), + "data_vio to acknowledge is not an incomplete discard"); + + data_vio->user_bio = NULL; + vdo_count_bios(&vdo->stats.bios_acknowledged, bio); + if (data_vio->is_partial) + vdo_count_bios(&vdo->stats.bios_acknowledged_partial, bio); + + bio->bi_status = errno_to_blk_status(error); + bio_endio(bio); +} + +static void copy_to_bio(struct bio *bio, char *data_ptr) +{ + struct bio_vec biovec; + struct bvec_iter iter; + + bio_for_each_segment(biovec, bio, iter) { + memcpy_to_bvec(&biovec, data_ptr); + data_ptr += biovec.bv_len; + } +} + +struct data_vio_compression_status +get_data_vio_compression_status(struct data_vio *data_vio) +{ + u32 packed = atomic_read(&data_vio->compression.status); + + /* pairs with cmpxchg in set_data_vio_compression_status */ + smp_rmb(); + return (struct data_vio_compression_status) { + .stage = packed & COMPRESSION_STATUS_MASK, + .may_not_compress = ((packed & MAY_NOT_COMPRESS_MASK) != 0), + }; +} + +/** + * pack_status() - Convert a data_vio_compression_status into a u32 which may be stored + * atomically. + * @status: The state to convert. + * + * Return: The compression state packed into a u32. + */ +static u32 __must_check pack_status(struct data_vio_compression_status status) +{ + return status.stage | (status.may_not_compress ? 
MAY_NOT_COMPRESS_MASK : 0); +} + +/** + * set_data_vio_compression_status() - Set the compression status of a data_vio. + * @state: The expected current status of the data_vio. + * @new_state: The status to set. + * + * Return: true if the new status was set, false if the data_vio's compression status did not + * match the expected state, and so was left unchanged. + */ +static bool __must_check +set_data_vio_compression_status(struct data_vio *data_vio, + struct data_vio_compression_status status, + struct data_vio_compression_status new_status) +{ + u32 actual; + u32 expected = pack_status(status); + u32 replacement = pack_status(new_status); + + /* + * Extra barriers because this was original developed using a CAS operation that implicitly + * had them. + */ + smp_mb__before_atomic(); + actual = atomic_cmpxchg(&data_vio->compression.status, expected, replacement); + /* same as before_atomic */ + smp_mb__after_atomic(); + return (expected == actual); +} + +struct data_vio_compression_status advance_data_vio_compression_stage(struct data_vio *data_vio) +{ + for (;;) { + struct data_vio_compression_status status = + get_data_vio_compression_status(data_vio); + struct data_vio_compression_status new_status = status; + + if (status.stage == DATA_VIO_POST_PACKER) + /* We're already in the last stage. */ + return status; + + if (status.may_not_compress) + /* + * Compression has been dis-allowed for this VIO, so skip the rest of the + * path and go to the end. + */ + new_status.stage = DATA_VIO_POST_PACKER; + else + /* Go to the next state. */ + new_status.stage++; + + if (set_data_vio_compression_status(data_vio, status, new_status)) + return new_status; + + /* Another thread changed the status out from under us so try again. */ + } +} + +/** + * cancel_data_vio_compression() - Prevent this data_vio from being compressed or packed. + * + * Return: true if the data_vio is in the packer and the caller was the first caller to cancel it. + */ +bool cancel_data_vio_compression(struct data_vio *data_vio) +{ + struct data_vio_compression_status status, new_status; + + for (;;) { + status = get_data_vio_compression_status(data_vio); + if (status.may_not_compress || (status.stage == DATA_VIO_POST_PACKER)) + /* This data_vio is already set up to not block in the packer. */ + break; + + new_status.stage = status.stage; + new_status.may_not_compress = true; + + if (set_data_vio_compression_status(data_vio, status, new_status)) + break; + } + + return ((status.stage == DATA_VIO_PACKING) && !status.may_not_compress); +} + +/** + * attempt_logical_block_lock() - Attempt to acquire the lock on a logical block. + * @completion: The data_vio for an external data request as a completion. + * + * This is the start of the path for all external requests. It is registered in launch_data_vio(). 
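+ *
+ * The lock is represented by an entry in the logical zone's lbn_operations
+ * int_map, keyed by LBN. If vdo_int_map_put() installs this data_vio, it holds
+ * the lock and proceeds; otherwise it waits on the current holder (or, for a
+ * pure read of a lock holder which already has an allocation, is serviced
+ * immediately by copying the holder's data).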
+ */ +static void attempt_logical_block_lock(struct vdo_completion *completion) +{ + struct data_vio *data_vio = as_data_vio(completion); + struct lbn_lock *lock = &data_vio->logical; + struct vdo *vdo = vdo_from_data_vio(data_vio); + struct data_vio *lock_holder; + int result; + + assert_data_vio_in_logical_zone(data_vio); + + if (data_vio->logical.lbn >= vdo->states.vdo.config.logical_blocks) { + continue_data_vio_with_error(data_vio, VDO_OUT_OF_RANGE); + return; + } + + result = vdo_int_map_put(lock->zone->lbn_operations, + lock->lbn, + data_vio, + false, + (void **) &lock_holder); + if (result != VDO_SUCCESS) { + continue_data_vio_with_error(data_vio, result); + return; + } + + if (lock_holder == NULL) { + /* We got the lock */ + launch_locked_request(data_vio); + return; + } + + result = ASSERT(lock_holder->logical.locked, "logical block lock held"); + if (result != VDO_SUCCESS) { + continue_data_vio_with_error(data_vio, result); + return; + } + + /* + * If the new request is a pure read request (not read-modify-write) and the lock_holder is + * writing and has received an allocation (VDO-2683), service the read request immediately + * by copying data from the lock_holder to avoid having to flush the write out of the + * packer just to prevent the read from waiting indefinitely. If the lock_holder does not + * yet have an allocation, prevent it from blocking in the packer and wait on it. + */ + if (!data_vio->write && READ_ONCE(lock_holder->allocation_succeeded)) { + copy_to_bio(data_vio->user_bio, (lock_holder->vio.data + data_vio->offset)); + acknowledge_data_vio(data_vio); + complete_data_vio(completion); + return; + } + + data_vio->last_async_operation = VIO_ASYNC_OP_ATTEMPT_LOGICAL_BLOCK_LOCK; + vdo_enqueue_waiter(&lock_holder->logical.waiters, &data_vio->waiter); + + /* + * Prevent writes and read-modify-writes from blocking indefinitely on lock holders in the + * packer. + */ + if (lock_holder->write && cancel_data_vio_compression(lock_holder)) { + data_vio->compression.lock_holder = lock_holder; + launch_data_vio_packer_callback(data_vio, vdo_remove_lock_holder_from_packer); + } +} + +/** + * launch_data_vio() - (Re)initialize a data_vio to have a new logical block number, keeping the + * same parent and other state and send it on its way. + */ +static void launch_data_vio(struct data_vio *data_vio, logical_block_number_t lbn) +{ + struct vdo_completion *completion = &data_vio->vio.completion; + + /* + * Clearing the tree lock must happen before initializing the LBN lock, which also adds + * information to the tree lock. 
+ */ + memset(&data_vio->tree_lock, 0, sizeof(data_vio->tree_lock)); + initialize_lbn_lock(data_vio, lbn); + INIT_LIST_HEAD(&data_vio->hash_lock_entry); + INIT_LIST_HEAD(&data_vio->write_entry); + + memset(&data_vio->allocation, 0, sizeof(data_vio->allocation)); + + data_vio->is_duplicate = false; + + memset(&data_vio->record_name, 0, sizeof(data_vio->record_name)); + memset(&data_vio->duplicate, 0, sizeof(data_vio->duplicate)); + vdo_reset_completion(completion); + completion->error_handler = handle_data_vio_error; + set_data_vio_logical_callback(data_vio, attempt_logical_block_lock); + vdo_enqueue_completion(completion, VDO_DEFAULT_Q_MAP_BIO_PRIORITY); +} + +static bool is_zero_block(char *block) +{ + int i; + + for (i = 0; i < VDO_BLOCK_SIZE; i += sizeof(u64)) + if (*((u64 *) &block[i])) + return false; + return true; +} + +static void copy_from_bio(struct bio *bio, char *data_ptr) +{ + struct bio_vec biovec; + struct bvec_iter iter; + + bio_for_each_segment(biovec, bio, iter) { + memcpy_from_bvec(data_ptr, &biovec); + data_ptr += biovec.bv_len; + } +} + +static void launch_bio(struct vdo *vdo, struct data_vio *data_vio, struct bio *bio) +{ + logical_block_number_t lbn; + /* + * Zero out the fields which don't need to be preserved (i.e. which are not pointers to + * separately allocated objects). + */ + memset(data_vio, 0, offsetof(struct data_vio, vio)); + memset(&data_vio->compression, 0, offsetof(struct compression_state, block)); + + data_vio->user_bio = bio; + data_vio->offset = to_bytes(bio->bi_iter.bi_sector & VDO_SECTORS_PER_BLOCK_MASK); + data_vio->is_partial = (bio->bi_iter.bi_size < VDO_BLOCK_SIZE) || (data_vio->offset != 0); + + /* + * Discards behave very differently than other requests when coming in from device-mapper. + * We have to be able to handle any size discards and various sector offsets within a + * block. + */ + if (bio_op(bio) == REQ_OP_DISCARD) { + data_vio->remaining_discard = bio->bi_iter.bi_size; + data_vio->write = true; + data_vio->is_trim = true; + if (data_vio->is_partial) { + vdo_count_bios(&vdo->stats.bios_in_partial, bio); + data_vio->read = true; + } + } else if (data_vio->is_partial) { + vdo_count_bios(&vdo->stats.bios_in_partial, bio); + data_vio->read = true; + if (bio_data_dir(bio) == WRITE) + data_vio->write = true; + } else if (bio_data_dir(bio) == READ) { + data_vio->read = true; + } else { + /* + * Copy the bio data to a char array so that we can continue to use the data after + * we acknowledge the bio. + */ + copy_from_bio(bio, data_vio->vio.data); + data_vio->is_zero = is_zero_block(data_vio->vio.data); + data_vio->write = true; + } + + if (data_vio->user_bio->bi_opf & REQ_FUA) + data_vio->fua = true; + + lbn = (bio->bi_iter.bi_sector - vdo->starting_sector_offset) / VDO_SECTORS_PER_BLOCK; + launch_data_vio(data_vio, lbn); +} + +static void assign_data_vio(struct limiter *limiter, struct data_vio *data_vio) +{ + struct bio *bio = bio_list_pop(limiter->permitted_waiters); + + launch_bio(limiter->pool->completion.vdo, data_vio, bio); + limiter->wake_count++; + + bio = bio_list_peek(limiter->permitted_waiters); + limiter->arrival = ((bio == NULL) ? 
U64_MAX : get_arrival_time(bio)); +} + +static void assign_discard_permit(struct limiter *limiter) +{ + struct bio *bio = bio_list_pop(&limiter->waiters); + + if (limiter->arrival == U64_MAX) + limiter->arrival = get_arrival_time(bio); + + bio_list_add(limiter->permitted_waiters, bio); +} + +static void get_waiters(struct limiter *limiter) +{ + bio_list_merge(&limiter->waiters, &limiter->new_waiters); + bio_list_init(&limiter->new_waiters); +} + +static inline +struct data_vio *get_available_data_vio(struct data_vio_pool *pool) +{ + struct data_vio *data_vio = + list_first_entry(&pool->available, struct data_vio, pool_entry); + + list_del_init(&data_vio->pool_entry); + return data_vio; +} + +static void assign_data_vio_to_waiter(struct limiter *limiter) +{ + assign_data_vio(limiter, get_available_data_vio(limiter->pool)); +} + +static void update_limiter(struct limiter *limiter) +{ + struct bio_list *waiters = &limiter->waiters; + data_vio_count_t available = limiter->limit - limiter->busy; + + ASSERT_LOG_ONLY((limiter->release_count <= limiter->busy), + "Release count %u is not more than busy count %u", + limiter->release_count, + limiter->busy); + + get_waiters(limiter); + for (; (limiter->release_count > 0) && !bio_list_empty(waiters); limiter->release_count--) + limiter->assigner(limiter); + + if (limiter->release_count > 0) { + WRITE_ONCE(limiter->busy, limiter->busy - limiter->release_count); + limiter->release_count = 0; + return; + } + + for (; (available > 0) && !bio_list_empty(waiters); available--) + limiter->assigner(limiter); + + WRITE_ONCE(limiter->busy, limiter->limit - available); + if (limiter->max_busy < limiter->busy) + WRITE_ONCE(limiter->max_busy, limiter->busy); +} + +/** + * schedule_releases() - Ensure that release processing is scheduled. + * + * If this call switches the state to processing, enqueue. Otherwise, some other thread has already + * done so. + */ +static void schedule_releases(struct data_vio_pool *pool) +{ + /* Pairs with the barrier in process_release_callback(). */ + smp_mb__before_atomic(); + if (atomic_cmpxchg(&pool->processing, false, true)) + return; + + pool->completion.requeue = true; + vdo_launch_completion_with_priority(&pool->completion, CPU_Q_COMPLETE_VIO_PRIORITY); +} + +static void reuse_or_release_resources(struct data_vio_pool *pool, + struct data_vio *data_vio, + struct list_head *returned) +{ + if (data_vio->remaining_discard > 0) { + if (bio_list_empty(&pool->discard_limiter.waiters)) + /* Return the data_vio's discard permit. */ + pool->discard_limiter.release_count++; + else + assign_discard_permit(&pool->discard_limiter); + } + + if (pool->limiter.arrival < pool->discard_limiter.arrival) { + assign_data_vio(&pool->limiter, data_vio); + } else if (pool->discard_limiter.arrival < U64_MAX) { + assign_data_vio(&pool->discard_limiter, data_vio); + } else { + list_add(&data_vio->pool_entry, returned); + pool->limiter.release_count++; + } +} + +/** + * process_release_callback() - Process a batch of data_vio releases. + * @completion: The pool with data_vios to release. 
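+ *
+ * Drains up to DATA_VIO_RELEASE_BATCH_SIZE data_vios from the pool's funnel
+ * queue; each is acknowledged and then either reassigned to a waiting bio or
+ * returned to the pool, and any threads whose bios were launched are woken
+ * afterwards.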
+ */ +static void process_release_callback(struct vdo_completion *completion) +{ + struct data_vio_pool *pool = as_data_vio_pool(completion); + bool reschedule; + bool drained; + data_vio_count_t processed; + data_vio_count_t to_wake; + data_vio_count_t discards_to_wake; + LIST_HEAD(returned); + + spin_lock(&pool->lock); + get_waiters(&pool->discard_limiter); + get_waiters(&pool->limiter); + spin_unlock(&pool->lock); + + if (pool->limiter.arrival == U64_MAX) { + struct bio *bio = bio_list_peek(&pool->limiter.waiters); + + if (bio != NULL) + pool->limiter.arrival = get_arrival_time(bio); + } + + for (processed = 0; processed < DATA_VIO_RELEASE_BATCH_SIZE; processed++) { + struct data_vio *data_vio; + struct funnel_queue_entry *entry = uds_funnel_queue_poll(pool->queue); + + if (entry == NULL) + break; + + data_vio = as_data_vio(container_of(entry, + struct vdo_completion, + work_queue_entry_link)); + acknowledge_data_vio(data_vio); + reuse_or_release_resources(pool, data_vio, &returned); + } + + spin_lock(&pool->lock); + /* + * There is a race where waiters could be added while we are in the unlocked section above. + * Those waiters could not see the resources we are now about to release, so we assign + * those resources now as we have no guarantee of being rescheduled. This is handled in + * update_limiter(). + */ + update_limiter(&pool->discard_limiter); + list_splice(&returned, &pool->available); + update_limiter(&pool->limiter); + to_wake = pool->limiter.wake_count; + pool->limiter.wake_count = 0; + discards_to_wake = pool->discard_limiter.wake_count; + pool->discard_limiter.wake_count = 0; + + atomic_set(&pool->processing, false); + /* Pairs with the barrier in schedule_releases(). */ + smp_mb(); + + reschedule = !uds_is_funnel_queue_empty(pool->queue); + drained = (!reschedule && + vdo_is_state_draining(&pool->state) && + check_for_drain_complete_locked(pool)); + spin_unlock(&pool->lock); + + if (to_wake > 0) + wake_up_nr(&pool->limiter.blocked_threads, to_wake); + + if (discards_to_wake > 0) + wake_up_nr(&pool->discard_limiter.blocked_threads, + discards_to_wake); + + if (reschedule) + schedule_releases(pool); + else if (drained) + vdo_finish_draining(&pool->state); +} + +static void initialize_limiter(struct limiter *limiter, + struct data_vio_pool *pool, + assigner *assigner, + data_vio_count_t limit) +{ + limiter->pool = pool; + limiter->assigner = assigner; + limiter->limit = limit; + limiter->arrival = U64_MAX; + init_waitqueue_head(&limiter->blocked_threads); +} + +/** + * initialize_data_vio() - Allocate the components of a data_vio. + * + * The caller is responsible for cleaning up the data_vio on error. + * + * Return: VDO_SUCCESS or an error. 
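+ *
+ * Each data_vio gets three VDO_BLOCK_SIZE buffers (data, compressed block, and
+ * scratch) plus a single-block bio, all allocated up front so that no memory
+ * allocation is needed while servicing I/O.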
+ */ +static int initialize_data_vio(struct data_vio *data_vio, struct vdo *vdo) +{ + struct bio *bio; + int result; + + STATIC_ASSERT(VDO_BLOCK_SIZE <= PAGE_SIZE); + result = uds_allocate_memory(VDO_BLOCK_SIZE, 0, "data_vio data", &data_vio->vio.data); + if (result != VDO_SUCCESS) + return uds_log_error_strerror(result, "data_vio data allocation failure"); + + result = uds_allocate_memory(VDO_BLOCK_SIZE, + 0, + "compressed block", + &data_vio->compression.block); + if (result != VDO_SUCCESS) + return uds_log_error_strerror(result, + "data_vio compressed block allocation failure"); + + result = uds_allocate_memory(VDO_BLOCK_SIZE, 0, "vio scratch", &data_vio->scratch_block); + if (result != VDO_SUCCESS) + return uds_log_error_strerror(result, "data_vio scratch allocation failure"); + + result = vdo_create_bio(&bio); + if (result != VDO_SUCCESS) + return uds_log_error_strerror(result, "data_vio data bio allocation failure"); + + vdo_initialize_completion(&data_vio->decrement_completion, vdo, VDO_DECREMENT_COMPLETION); + initialize_vio(&data_vio->vio, bio, 1, VIO_TYPE_DATA, VIO_PRIORITY_DATA, vdo); + + return VDO_SUCCESS; +} + +static void destroy_data_vio(struct data_vio *data_vio) +{ + if (data_vio == NULL) + return; + + vdo_free_bio(UDS_FORGET(data_vio->vio.bio)); + UDS_FREE(UDS_FORGET(data_vio->vio.data)); + UDS_FREE(UDS_FORGET(data_vio->compression.block)); + UDS_FREE(UDS_FORGET(data_vio->scratch_block)); +} + +/** + * make_data_vio_pool() - Initialize a data_vio pool. + * @vdo: The vdo to which the pool will belong. + * @pool_size: The number of data_vios in the pool. + * @discard_limit: The maximum number of data_vios which may be used for discards. + * @pool: A pointer to hold the newly allocated pool. + */ +int make_data_vio_pool(struct vdo *vdo, + data_vio_count_t pool_size, + data_vio_count_t discard_limit, + struct data_vio_pool **pool_ptr) +{ + int result; + struct data_vio_pool *pool; + data_vio_count_t i; + + result = UDS_ALLOCATE_EXTENDED(struct data_vio_pool, + pool_size, + struct data_vio, + __func__, + &pool); + if (result != UDS_SUCCESS) + return result; + + ASSERT_LOG_ONLY((discard_limit <= pool_size), "discard limit does not exceed pool size"); + initialize_limiter(&pool->discard_limiter, pool, assign_discard_permit, discard_limit); + pool->discard_limiter.permitted_waiters = &pool->permitted_discards; + initialize_limiter(&pool->limiter, pool, assign_data_vio_to_waiter, pool_size); + pool->limiter.permitted_waiters = &pool->limiter.waiters; + INIT_LIST_HEAD(&pool->available); + spin_lock_init(&pool->lock); + vdo_set_admin_state_code(&pool->state, VDO_ADMIN_STATE_NORMAL_OPERATION); + vdo_initialize_completion(&pool->completion, vdo, VDO_DATA_VIO_POOL_COMPLETION); + vdo_prepare_completion(&pool->completion, + process_release_callback, + process_release_callback, + vdo->thread_config.cpu_thread, + NULL); + + result = uds_make_funnel_queue(&pool->queue); + if (result != UDS_SUCCESS) { + free_data_vio_pool(UDS_FORGET(pool)); + return result; + } + + for (i = 0; i < pool_size; i++) { + struct data_vio *data_vio = &pool->data_vios[i]; + + result = initialize_data_vio(data_vio, vdo); + if (result != VDO_SUCCESS) { + destroy_data_vio(data_vio); + free_data_vio_pool(pool); + return result; + } + + list_add(&data_vio->pool_entry, &pool->available); + } + + *pool_ptr = pool; + return VDO_SUCCESS; +} + +/** + * free_data_vio_pool() - Free a data_vio_pool and the data_vios in it. + * + * All data_vios must be returned to the pool before calling this function. 
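+ *
+ * A rough teardown sketch, for illustration only (the pool and discard sizes
+ * are hypothetical, and error handling is elided):
+ *
+ *   struct data_vio_pool *pool;
+ *   result = make_data_vio_pool(vdo, 2048, 1536, &pool);
+ *   ... vdo_launch_bio(pool, bio) for each incoming bio ...
+ *   drain_data_vio_pool(pool, completion);  - wait for all data_vios to return
+ *   free_data_vio_pool(pool);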
+ */ +void free_data_vio_pool(struct data_vio_pool *pool) +{ + struct data_vio *data_vio, *tmp; + + if (pool == NULL) + return; + + /* + * Pairs with the barrier in process_release_callback(). Possibly not needed since it + * caters to an enqueue vs. free race. + */ + smp_mb(); + BUG_ON(atomic_read(&pool->processing)); + + spin_lock(&pool->lock); + ASSERT_LOG_ONLY((pool->limiter.busy == 0), + "data_vio pool must not have %u busy entries when being freed", + pool->limiter.busy); + ASSERT_LOG_ONLY((bio_list_empty(&pool->limiter.waiters) && + bio_list_empty(&pool->limiter.new_waiters)), + "data_vio pool must not have threads waiting to read or write when being freed"); + ASSERT_LOG_ONLY((bio_list_empty(&pool->discard_limiter.waiters) && + bio_list_empty(&pool->discard_limiter.new_waiters)), + "data_vio pool must not have threads waiting to discard when being freed"); + spin_unlock(&pool->lock); + + list_for_each_entry_safe(data_vio, tmp, &pool->available, pool_entry) { + list_del_init(&data_vio->pool_entry); + destroy_data_vio(data_vio); + } + + uds_free_funnel_queue(UDS_FORGET(pool->queue)); + UDS_FREE(pool); +} + +static bool acquire_permit(struct limiter *limiter, struct bio *bio) +{ + if (limiter->busy >= limiter->limit) { + DEFINE_WAIT(wait); + + bio_list_add(&limiter->new_waiters, bio); + prepare_to_wait_exclusive(&limiter->blocked_threads, &wait, TASK_UNINTERRUPTIBLE); + spin_unlock(&limiter->pool->lock); + io_schedule(); + finish_wait(&limiter->blocked_threads, &wait); + return false; + } + + WRITE_ONCE(limiter->busy, limiter->busy + 1); + if (limiter->max_busy < limiter->busy) + WRITE_ONCE(limiter->max_busy, limiter->busy); + + return true; +} + +/** + * vdo_launch_bio() - Acquire a data_vio from the pool, assign the bio to it, and launch it. + * + * This will block if data_vios or discard permits are not available. + */ +void vdo_launch_bio(struct data_vio_pool *pool, struct bio *bio) +{ + struct data_vio *data_vio; + + ASSERT_LOG_ONLY(!vdo_is_state_quiescent(&pool->state), + "data_vio_pool not quiescent on acquire"); + + bio->bi_private = (void *) jiffies; + spin_lock(&pool->lock); + if ((bio_op(bio) == REQ_OP_DISCARD) && !acquire_permit(&pool->discard_limiter, bio)) + return; + + if (!acquire_permit(&pool->limiter, bio)) + return; + + data_vio = get_available_data_vio(pool); + spin_unlock(&pool->lock); + launch_bio(pool->completion.vdo, data_vio, bio); +} + +/* Implements vdo_admin_initiator. */ +static void initiate_drain(struct admin_state *state) +{ + bool drained; + struct data_vio_pool *pool = container_of(state, struct data_vio_pool, state); + + spin_lock(&pool->lock); + drained = check_for_drain_complete_locked(pool); + spin_unlock(&pool->lock); + + if (drained) + vdo_finish_draining(state); +} + +static void assert_on_vdo_cpu_thread(const struct vdo *vdo, const char *name) +{ + ASSERT_LOG_ONLY((vdo_get_callback_thread_id() == vdo->thread_config.cpu_thread), + "%s called on cpu thread", + name); +} + +/** + * drain_data_vio_pool() - Wait asynchronously for all data_vios to be returned to the pool. + * @completion: The completion to notify when the pool has drained. + */ +void drain_data_vio_pool(struct data_vio_pool *pool, struct vdo_completion *completion) +{ + assert_on_vdo_cpu_thread(completion->vdo, __func__); + vdo_start_draining(&pool->state, VDO_ADMIN_STATE_SUSPENDING, completion, initiate_drain); +} + +/** + * resume_data_vio_pool() - Resume a data_vio pool. + * @completion: The completion to notify when the pool has resumed. 
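+ *
+ * Like drain_data_vio_pool(), this must be called on the vdo's cpu thread.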
+ */ +void resume_data_vio_pool(struct data_vio_pool *pool, struct vdo_completion *completion) +{ + assert_on_vdo_cpu_thread(completion->vdo, __func__); + vdo_continue_completion(completion, vdo_resume_if_quiescent(&pool->state)); +} + +static void dump_limiter(const char *name, struct limiter *limiter) +{ + uds_log_info("%s: %u of %u busy (max %u), %s", + name, + limiter->busy, + limiter->limit, + limiter->max_busy, + ((bio_list_empty(&limiter->waiters) && + bio_list_empty(&limiter->new_waiters)) ? "no waiters" : "has waiters")); +} + +/** + * dump_data_vio_pool() - Dump a data_vio pool to the log. + * @dump_vios: Whether to dump the details of each busy data_vio as well. + */ +void dump_data_vio_pool(struct data_vio_pool *pool, bool dump_vios) +{ + /* + * In order that syslog can empty its buffer, sleep after 35 elements for 4ms (till the + * second clock tick). These numbers were picked based on experiments with lab machines. + */ + enum { ELEMENTS_PER_BATCH = 35 }; + enum { SLEEP_FOR_SYSLOG = 4000 }; + + if (pool == NULL) + return; + + spin_lock(&pool->lock); + dump_limiter("data_vios", &pool->limiter); + dump_limiter("discard permits", &pool->discard_limiter); + if (dump_vios) { + int i; + int dumped = 0; + + for (i = 0; i < pool->limiter.limit; i++) { + struct data_vio *data_vio = &pool->data_vios[i]; + + if (!list_empty(&data_vio->pool_entry)) + continue; + + dump_data_vio(data_vio); + if (++dumped >= ELEMENTS_PER_BATCH) { + spin_unlock(&pool->lock); + dumped = 0; + fsleep(SLEEP_FOR_SYSLOG); + spin_lock(&pool->lock); + } + } + } + + spin_unlock(&pool->lock); +} + +data_vio_count_t get_data_vio_pool_active_discards(struct data_vio_pool *pool) +{ + return READ_ONCE(pool->discard_limiter.busy); +} + +data_vio_count_t get_data_vio_pool_discard_limit(struct data_vio_pool *pool) +{ + return READ_ONCE(pool->discard_limiter.limit); +} + +data_vio_count_t get_data_vio_pool_maximum_discards(struct data_vio_pool *pool) +{ + return READ_ONCE(pool->discard_limiter.max_busy); +} + +int set_data_vio_pool_discard_limit(struct data_vio_pool *pool, data_vio_count_t limit) +{ + if (get_data_vio_pool_request_limit(pool) < limit) + // The discard limit may not be higher than the data_vio limit. 
+ return -EINVAL; + + spin_lock(&pool->lock); + pool->discard_limiter.limit = limit; + spin_unlock(&pool->lock); + + return VDO_SUCCESS; +} + +data_vio_count_t get_data_vio_pool_active_requests(struct data_vio_pool *pool) +{ + return READ_ONCE(pool->limiter.busy); +} + +data_vio_count_t get_data_vio_pool_request_limit(struct data_vio_pool *pool) +{ + return READ_ONCE(pool->limiter.limit); +} + +data_vio_count_t get_data_vio_pool_maximum_requests(struct data_vio_pool *pool) +{ + return READ_ONCE(pool->limiter.max_busy); +} + +static void update_data_vio_error_stats(struct data_vio *data_vio) +{ + u8 index = 0; + static const char * const operations[] = { + [0] = "empty", + [1] = "read", + [2] = "write", + [3] = "read-modify-write", + [5] = "read+fua", + [6] = "write+fua", + [7] = "read-modify-write+fua", + }; + + if (data_vio->read) + index = 1; + + if (data_vio->write) + index += 2; + + if (data_vio->fua) + index += 4; + + update_vio_error_stats(&data_vio->vio, + "Completing %s vio for LBN %llu with error after %s", + operations[index], + (unsigned long long) data_vio->logical.lbn, + get_data_vio_operation_name(data_vio)); +} + +static void perform_cleanup_stage(struct data_vio *data_vio, enum data_vio_cleanup_stage stage); + +/** + * release_allocated_lock() - Release the PBN lock and/or the reference on the allocated block at + * the end of processing a data_vio. + */ +static void release_allocated_lock(struct vdo_completion *completion) +{ + struct data_vio *data_vio = as_data_vio(completion); + + assert_data_vio_in_allocated_zone(data_vio); + release_data_vio_allocation_lock(data_vio, false); + perform_cleanup_stage(data_vio, VIO_RELEASE_RECOVERY_LOCKS); +} + +/** release_lock() - Release an uncontended LBN lock. */ +static void release_lock(struct data_vio *data_vio, struct lbn_lock *lock) +{ + struct int_map *lock_map = lock->zone->lbn_operations; + struct data_vio *lock_holder; + + if (!lock->locked) { + /* The lock is not locked, so it had better not be registered in the lock map. */ + struct data_vio *lock_holder = vdo_int_map_get(lock_map, lock->lbn); + + ASSERT_LOG_ONLY((data_vio != lock_holder), + "no logical block lock held for block %llu", + (unsigned long long) lock->lbn); + return; + } + + /* Release the lock by removing the lock from the map. */ + lock_holder = vdo_int_map_remove(lock_map, lock->lbn); + ASSERT_LOG_ONLY((data_vio == lock_holder), + "logical block lock mismatch for block %llu", + (unsigned long long) lock->lbn); + lock->locked = false; +} + +/** transfer_lock() - Transfer a contended LBN lock to the eldest waiter. */ +static void transfer_lock(struct data_vio *data_vio, struct lbn_lock *lock) +{ + struct data_vio *lock_holder, *next_lock_holder; + int result; + + ASSERT_LOG_ONLY(lock->locked, "lbn_lock with waiters is not locked"); + + /* Another data_vio is waiting for the lock, transfer it in a single lock map operation. */ + next_lock_holder = waiter_as_data_vio(vdo_dequeue_next_waiter(&lock->waiters)); + + /* Transfer the remaining lock waiters to the next lock holder. 
*/ + vdo_transfer_all_waiters(&lock->waiters, &next_lock_holder->logical.waiters); + + result = vdo_int_map_put(lock->zone->lbn_operations, + lock->lbn, + next_lock_holder, + true, + (void **) &lock_holder); + if (result != VDO_SUCCESS) { + continue_data_vio_with_error(next_lock_holder, result); + return; + } + + ASSERT_LOG_ONLY((lock_holder == data_vio), + "logical block lock mismatch for block %llu", + (unsigned long long) lock->lbn); + lock->locked = false; + + /* + * If there are still waiters, other data_vios must be trying to get the lock we just + * transferred. We must ensure that the new lock holder doesn't block in the packer. + */ + if (vdo_has_waiters(&next_lock_holder->logical.waiters)) + cancel_data_vio_compression(next_lock_holder); + + /* + * Avoid stack overflow on lock transfer. + * FIXME: this is only an issue in the 1 thread config. + */ + next_lock_holder->vio.completion.requeue = true; + launch_locked_request(next_lock_holder); +} + +/** + * release_logical_lock() - Release the logical block lock and flush generation lock at the end of + * processing a data_vio. + */ +static void release_logical_lock(struct vdo_completion *completion) +{ + struct data_vio *data_vio = as_data_vio(completion); + struct lbn_lock *lock = &data_vio->logical; + + assert_data_vio_in_logical_zone(data_vio); + + if (vdo_has_waiters(&lock->waiters)) + transfer_lock(data_vio, lock); + else + release_lock(data_vio, lock); + + vdo_release_flush_generation_lock(data_vio); + perform_cleanup_stage(data_vio, VIO_CLEANUP_DONE); +} + +/** clean_hash_lock() - Release the hash lock at the end of processing a data_vio. */ +static void clean_hash_lock(struct vdo_completion *completion) +{ + struct data_vio *data_vio = as_data_vio(completion); + + assert_data_vio_in_hash_zone(data_vio); + if (completion->result != VDO_SUCCESS) { + vdo_clean_failed_hash_lock(data_vio); + return; + } + + vdo_release_hash_lock(data_vio); + perform_cleanup_stage(data_vio, VIO_RELEASE_LOGICAL); +} + +/** + * finish_cleanup() - Make some assertions about a data_vio which has finished cleaning up. + * + * If it is part of a multi-block discard, starts on the next block, otherwise, returns it to the + * pool. + */ +static void finish_cleanup(struct data_vio *data_vio) +{ + struct vdo_completion *completion = &data_vio->vio.completion; + + ASSERT_LOG_ONLY(data_vio->allocation.lock == NULL, + "complete data_vio has no allocation lock"); + ASSERT_LOG_ONLY(data_vio->hash_lock == NULL, "complete data_vio has no hash lock"); + if ((data_vio->remaining_discard <= VDO_BLOCK_SIZE) || + (completion->result != VDO_SUCCESS)) { + struct data_vio_pool *pool = completion->vdo->data_vio_pool; + + uds_funnel_queue_put(pool->queue, &completion->work_queue_entry_link); + schedule_releases(pool); + return; + } + + data_vio->remaining_discard -= min_t(u32, + data_vio->remaining_discard, + VDO_BLOCK_SIZE - data_vio->offset); + data_vio->is_partial = (data_vio->remaining_discard < VDO_BLOCK_SIZE); + data_vio->read = data_vio->is_partial; + data_vio->offset = 0; + completion->requeue = true; + launch_data_vio(data_vio, data_vio->logical.lbn + 1); +} + +/** perform_cleanup_stage() - Perform the next step in the process of cleaning up a data_vio. 
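+ *
+ * The stages fall through in order, and each lock is released on the thread of
+ * the zone which owns it: the hash lock in the hash zone, the allocation lock
+ * in the allocated zone, and the logical block and flush generation locks in
+ * the logical zone.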
*/ +static void perform_cleanup_stage(struct data_vio *data_vio, enum data_vio_cleanup_stage stage) +{ + struct vdo *vdo = vdo_from_data_vio(data_vio); + + switch (stage) { + case VIO_RELEASE_HASH_LOCK: + if (data_vio->hash_lock != NULL) { + launch_data_vio_hash_zone_callback(data_vio, clean_hash_lock); + return; + } + fallthrough; + + case VIO_RELEASE_ALLOCATED: + if (data_vio_has_allocation(data_vio)) { + launch_data_vio_allocated_zone_callback(data_vio, release_allocated_lock); + return; + } + fallthrough; + + case VIO_RELEASE_RECOVERY_LOCKS: + if ((data_vio->recovery_sequence_number > 0) && + (READ_ONCE(vdo->read_only_notifier.read_only_error) == VDO_SUCCESS) && + (data_vio->vio.completion.result != VDO_READ_ONLY)) + uds_log_warning("VDO not read-only when cleaning data_vio with RJ lock"); + fallthrough; + + case VIO_RELEASE_LOGICAL: + launch_data_vio_logical_callback(data_vio, release_logical_lock); + return; + + default: + finish_cleanup(data_vio); + } +} + +void complete_data_vio(struct vdo_completion *completion) +{ + struct data_vio *data_vio = as_data_vio(completion); + + completion->error_handler = NULL; + data_vio->last_async_operation = VIO_ASYNC_OP_CLEANUP; + perform_cleanup_stage(data_vio, + (data_vio->write ? VIO_CLEANUP_START : VIO_RELEASE_LOGICAL)); +} + +static void enter_read_only_mode(struct vdo_completion *completion) +{ + if (vdo_is_read_only(completion->vdo)) + return; + + if (completion->result != VDO_READ_ONLY) { + struct data_vio *data_vio = as_data_vio(completion); + + uds_log_error_strerror(completion->result, + "Preparing to enter read-only mode: data_vio for LBN %llu (becoming mapped to %llu, previously mapped to %llu, allocated %llu) is completing with a fatal error after operation %s", + (unsigned long long) data_vio->logical.lbn, + (unsigned long long) data_vio->new_mapped.pbn, + (unsigned long long) data_vio->mapped.pbn, + (unsigned long long) data_vio->allocation.pbn, + get_data_vio_operation_name(data_vio)); + } + + vdo_enter_read_only_mode(completion->vdo, completion->result); +} + +void handle_data_vio_error(struct vdo_completion *completion) +{ + struct data_vio *data_vio = as_data_vio(completion); + + if ((completion->result == VDO_READ_ONLY) || (data_vio->user_bio == NULL)) + enter_read_only_mode(completion); + + update_data_vio_error_stats(data_vio); + complete_data_vio(completion); +} + +/** + * get_data_vio_operation_name() - Get the name of the last asynchronous operation performed on a + * data_vio. + */ +const char *get_data_vio_operation_name(struct data_vio *data_vio) +{ + STATIC_ASSERT((MAX_VIO_ASYNC_OPERATION_NUMBER - MIN_VIO_ASYNC_OPERATION_NUMBER) == + ARRAY_SIZE(ASYNC_OPERATION_NAMES)); + + return ((data_vio->last_async_operation < MAX_VIO_ASYNC_OPERATION_NUMBER) ? + ASYNC_OPERATION_NAMES[data_vio->last_async_operation] : + "unknown async operation"); +} + +/** + * data_vio_allocate_data_block() - Allocate a data block. + * + * @write_lock_type: The type of write lock to obtain on the block. + * @callback: The callback which will attempt an allocation in the current zone and continue if it + * succeeds. + * @error_handler: The handler for errors while allocating. 
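+ *
+ * The attempt starts in the physical zone returned by vdo_get_next_allocation_zone() for the
+ * data_vio's logical zone, and the callback is invoked on that zone's thread;
+ * first_allocation_zone records where the allocation cycle began.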
+ */ +void data_vio_allocate_data_block(struct data_vio *data_vio, + enum pbn_lock_type write_lock_type, + vdo_action *callback, + vdo_action *error_handler) +{ + struct allocation *allocation = &data_vio->allocation; + + ASSERT_LOG_ONLY((allocation->pbn == VDO_ZERO_BLOCK), + "data_vio does not have an allocation"); + allocation->write_lock_type = write_lock_type; + allocation->zone = vdo_get_next_allocation_zone(data_vio->logical.zone); + allocation->first_allocation_zone = allocation->zone->zone_number; + + data_vio->vio.completion.error_handler = error_handler; + launch_data_vio_allocated_zone_callback(data_vio, callback); +} + +/** + * release_data_vio_allocation_lock() - Release the PBN lock on a data_vio's allocated block. + * @reset: If true, the allocation will be reset (i.e. any allocated pbn will be forgotten). + * + * If the reference to the locked block is still provisional, it will be released as well. + */ +void release_data_vio_allocation_lock(struct data_vio *data_vio, bool reset) +{ + struct allocation *allocation = &data_vio->allocation; + physical_block_number_t locked_pbn = allocation->pbn; + + assert_data_vio_in_allocated_zone(data_vio); + + if (reset || vdo_pbn_lock_has_provisional_reference(allocation->lock)) + allocation->pbn = VDO_ZERO_BLOCK; + + vdo_release_physical_zone_pbn_lock(allocation->zone, + locked_pbn, + UDS_FORGET(allocation->lock)); +} + +/** + * uncompress_data_vio() - Uncompress the data a data_vio has just read. + * @mapping_state: The mapping state indicating which fragment to decompress. + * @buffer: The buffer to receive the uncompressed data. + */ +int uncompress_data_vio(struct data_vio *data_vio, + enum block_mapping_state mapping_state, + char *buffer) +{ + int size; + u16 fragment_offset, fragment_size; + struct compressed_block *block = data_vio->compression.block; + int result = vdo_get_compressed_block_fragment(mapping_state, + block, + &fragment_offset, + &fragment_size); + + if (result != VDO_SUCCESS) { + uds_log_debug("%s: compressed fragment error %d", __func__, result); + return result; + } + + size = LZ4_decompress_safe((block->data + fragment_offset), + buffer, + fragment_size, + VDO_BLOCK_SIZE); + if (size != VDO_BLOCK_SIZE) { + uds_log_debug("%s: lz4 error", __func__); + return VDO_INVALID_FRAGMENT; + } + + return VDO_SUCCESS; +} + +/** + * modify_for_partial_write() - Do the modify-write part of a read-modify-write cycle. + * @completion: The data_vio which has just finished its read. + * + * This callback is registered in read_block(). 
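+ *
+ * The new data (or zeroes, for a discard) is merged into the block which was just read, and
+ * the data_vio then re-enters the write path via continue_data_vio_with_block_map_slot().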
+ */ +static void modify_for_partial_write(struct vdo_completion *completion) +{ + struct data_vio *data_vio = as_data_vio(completion); + char *data = data_vio->vio.data; + struct bio *bio = data_vio->user_bio; + + assert_data_vio_on_cpu_thread(data_vio); + + if (bio_op(bio) == REQ_OP_DISCARD) { + memset(data + data_vio->offset, '\0', min_t(u32, + data_vio->remaining_discard, + VDO_BLOCK_SIZE - data_vio->offset)); + } else { + copy_from_bio(bio, data + data_vio->offset); + } + + data_vio->is_zero = is_zero_block(data); + data_vio->read = false; + launch_data_vio_logical_callback(data_vio, continue_data_vio_with_block_map_slot); +} + +static void complete_read(struct vdo_completion *completion) +{ + struct data_vio *data_vio = as_data_vio(completion); + char *data = data_vio->vio.data; + bool compressed = vdo_is_state_compressed(data_vio->mapped.state); + + assert_data_vio_on_cpu_thread(data_vio); + + if (compressed) { + int result = uncompress_data_vio(data_vio, data_vio->mapped.state, data); + + if (result != VDO_SUCCESS) { + continue_data_vio_with_error(data_vio, result); + return; + } + } + + if (data_vio->write) { + modify_for_partial_write(completion); + return; + } + + if (compressed || data_vio->is_partial) + copy_to_bio(data_vio->user_bio, data + data_vio->offset); + + acknowledge_data_vio(data_vio); + complete_data_vio(completion); +} + +static void read_endio(struct bio *bio) +{ + struct data_vio *data_vio = vio_as_data_vio(bio->bi_private); + int result = blk_status_to_errno(bio->bi_status); + + vdo_count_completed_bios(bio); + if (result != VDO_SUCCESS) { + continue_data_vio_with_error(data_vio, result); + return; + } + + launch_data_vio_cpu_callback(data_vio, complete_read, CPU_Q_COMPLETE_READ_PRIORITY); +} + +static void complete_zero_read(struct vdo_completion *completion) +{ + struct data_vio *data_vio = as_data_vio(completion); + + assert_data_vio_on_cpu_thread(data_vio); + + if (data_vio->is_partial) { + memset(data_vio->vio.data, 0, VDO_BLOCK_SIZE); + if (data_vio->write) { + modify_for_partial_write(completion); + return; + } + } else { + zero_fill_bio(data_vio->user_bio); + } + + complete_read(completion); +} + +/** + * read_block() - Read a block asynchronously. + * + * This is the callback registered in read_block_mapping(). + */ +static void read_block(struct vdo_completion *completion) +{ + struct data_vio *data_vio = as_data_vio(completion); + struct vio *vio = as_vio(completion); + int result = VDO_SUCCESS; + + if (data_vio->mapped.pbn == VDO_ZERO_BLOCK) { + launch_data_vio_cpu_callback(data_vio, + complete_zero_read, + CPU_Q_COMPLETE_VIO_PRIORITY); + return; + } + + data_vio->last_async_operation = VIO_ASYNC_OP_READ_DATA_VIO; + if (vdo_is_state_compressed(data_vio->mapped.state)) { + result = vio_reset_bio(vio, + (char *) data_vio->compression.block, + read_endio, + REQ_OP_READ, + data_vio->mapped.pbn); + } else { + int opf = ((data_vio->user_bio->bi_opf & PASSTHROUGH_FLAGS) | REQ_OP_READ); + + if (data_vio->is_partial) { + result = vio_reset_bio(vio, + vio->data, + read_endio, + opf, + data_vio->mapped.pbn); + } else { + /* A full 4k read. Use the incoming bio to avoid having to copy the data */ + bio_reset(vio->bio, vio->bio->bi_bdev, opf); + bio_init_clone(data_vio->user_bio->bi_bdev, + vio->bio, + data_vio->user_bio, + GFP_KERNEL); + + /* Copy over the original bio iovec and opflags. 
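+			 * The cloned bio targets the user's own pages, so the
+			 * read completes without an extra copy into vio->data.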
*/ + vdo_set_bio_properties(vio->bio, + vio, + read_endio, + opf, + data_vio->mapped.pbn); + } + } + + if (result != VDO_SUCCESS) { + continue_data_vio_with_error(data_vio, result); + return; + } + + submit_data_vio_io(data_vio); +} + +static inline struct data_vio * +reference_count_update_completion_as_data_vio(struct vdo_completion *completion) +{ + if (completion->type == VIO_COMPLETION) + return as_data_vio(completion); + + return container_of(completion, struct data_vio, decrement_completion); +} + +/** + * update_block_map() - Rendezvous of the data_vio and decrement completions after each has + * made its reference updates. Handle any error from either, or proceed + * to updating the block map. + * @completion: The completion of the write in progress. + */ +static void update_block_map(struct vdo_completion *completion) +{ + struct data_vio *data_vio = reference_count_update_completion_as_data_vio(completion); + + assert_data_vio_in_logical_zone(data_vio); + + if (!data_vio->first_reference_operation_complete) { + /* Rendezvous, we're first */ + data_vio->first_reference_operation_complete = true; + return; + } + + completion = &data_vio->vio.completion; + vdo_set_completion_result(completion, data_vio->decrement_completion.result); + if (completion->result != VDO_SUCCESS) { + handle_data_vio_error(completion); + return; + } + + completion->error_handler = handle_data_vio_error; + if (data_vio->hash_lock != NULL) + set_data_vio_hash_zone_callback(data_vio, vdo_continue_hash_lock); + else + completion->callback = complete_data_vio; + + data_vio->last_async_operation = VIO_ASYNC_OP_PUT_MAPPED_BLOCK; + vdo_put_mapped_block(data_vio); +} + +static void decrement_reference_count(struct vdo_completion *completion) +{ + struct data_vio *data_vio = container_of(completion, + struct data_vio, + decrement_completion); + + assert_data_vio_in_mapped_zone(data_vio); + + vdo_set_completion_callback(completion, + update_block_map, + data_vio->logical.zone->thread_id); + completion->error_handler = update_block_map; + vdo_modify_reference_count(completion, &data_vio->decrement_updater); +} + +static void increment_reference_count(struct vdo_completion *completion) +{ + struct data_vio *data_vio = as_data_vio(completion); + + assert_data_vio_in_new_mapped_zone(data_vio); + + if (data_vio->downgrade_allocation_lock) { + /* + * Now that the data has been written, it's safe to deduplicate against the + * block. Downgrade the allocation lock to a read lock so it can be used later by + * the hash lock. This is done here since it needs to happen sometime before we + * return to the hash zone, and we are currently on the correct thread. For + * compressed blocks, the downgrade will have already been done. + */ + vdo_downgrade_pbn_write_lock(data_vio->allocation.lock, false); + } + + set_data_vio_logical_callback(data_vio, update_block_map); + completion->error_handler = update_block_map; + vdo_modify_reference_count(completion, &data_vio->increment_updater); +} + +/** journal_remapping() - Add a recovery journal entry for a data remapping. 
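+ *
+ * The decrement updater always records the removal of the old LBN->PBN mapping; an increment
+ * is only scheduled when the new mapping is not the zero block, since zero blocks have no
+ * reference counts to update.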
*/ +static void journal_remapping(struct vdo_completion *completion) +{ + struct data_vio *data_vio = as_data_vio(completion); + + assert_data_vio_in_journal_zone(data_vio); + + data_vio->decrement_updater.operation = VDO_JOURNAL_DATA_REMAPPING; + data_vio->decrement_updater.zpbn = data_vio->mapped; + if (data_vio->new_mapped.pbn == VDO_ZERO_BLOCK) { + data_vio->first_reference_operation_complete = true; + if (data_vio->mapped.pbn == VDO_ZERO_BLOCK) + set_data_vio_logical_callback(data_vio, update_block_map); + } else { + set_data_vio_new_mapped_zone_callback(data_vio, increment_reference_count); + } + + if (data_vio->mapped.pbn == VDO_ZERO_BLOCK) + data_vio->first_reference_operation_complete = true; + else + vdo_set_completion_callback(&data_vio->decrement_completion, + decrement_reference_count, + data_vio->mapped.zone->thread_id); + + data_vio->last_async_operation = VIO_ASYNC_OP_JOURNAL_REMAPPING; + vdo_add_recovery_journal_entry(completion->vdo->recovery_journal, data_vio); +} + +/** + * read_old_block_mapping() - Get the previous PBN/LBN mapping of an in-progress write. + * + * Gets the previous PBN mapped to this LBN from the block map, so as to make an appropriate + * journal entry referencing the removal of this LBN->PBN mapping. + */ +static void read_old_block_mapping(struct vdo_completion *completion) +{ + struct data_vio *data_vio = as_data_vio(completion); + + assert_data_vio_in_logical_zone(data_vio); + + data_vio->last_async_operation = VIO_ASYNC_OP_GET_MAPPED_BLOCK_FOR_WRITE; + set_data_vio_journal_callback(data_vio, journal_remapping); + vdo_get_mapped_block(data_vio); +} + +void update_metadata_for_data_vio_write(struct data_vio *data_vio, struct pbn_lock *lock) +{ + data_vio->increment_updater = (struct reference_updater) { + .operation = VDO_JOURNAL_DATA_REMAPPING, + .increment = true, + .zpbn = data_vio->new_mapped, + .lock = lock, + }; + + launch_data_vio_logical_callback(data_vio, read_old_block_mapping); +} + +/** + * pack_compressed_data() - Attempt to pack the compressed data_vio into a block. + * + * This is the callback registered in launch_compress_data_vio(). + */ +static void pack_compressed_data(struct vdo_completion *completion) +{ + struct data_vio *data_vio = as_data_vio(completion); + + assert_data_vio_in_packer_zone(data_vio); + + if (!vdo_get_compressing(vdo_from_data_vio(data_vio)) || + get_data_vio_compression_status(data_vio).may_not_compress) { + write_data_vio(data_vio); + return; + } + + data_vio->last_async_operation = VIO_ASYNC_OP_ATTEMPT_PACKING; + vdo_attempt_packing(data_vio); +} + +/** + * compress_data_vio() - Do the actual work of compressing the data on a CPU queue. + * + * This callback is registered in launch_compress_data_vio(). + */ +static void compress_data_vio(struct vdo_completion *completion) +{ + struct data_vio *data_vio = as_data_vio(completion); + int size; + + assert_data_vio_on_cpu_thread(data_vio); + + /* + * By putting the compressed data at the start of the compressed block data field, we won't + * need to copy it if this data_vio becomes a compressed write agent. 
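+	 * The current work queue's private data is passed to LZ4 as its working memory.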
+ */ + size = LZ4_compress_default(data_vio->vio.data, + data_vio->compression.block->data, + VDO_BLOCK_SIZE, + VDO_MAX_COMPRESSED_FRAGMENT_SIZE, + (char *) vdo_get_work_queue_private_data()); + if ((size > 0) && (size < VDO_COMPRESSED_BLOCK_DATA_SIZE)) { + data_vio->compression.size = size; + launch_data_vio_packer_callback(data_vio, pack_compressed_data); + return; + } + + write_data_vio(data_vio); +} + +/** + * launch_compress_data_vio() - Continue a write by attempting to compress the data. + * + * This is a re-entry point to vio_write used by hash locks. + */ +void launch_compress_data_vio(struct data_vio *data_vio) +{ + ASSERT_LOG_ONLY(!data_vio->is_duplicate, "compressing a non-duplicate block"); + ASSERT_LOG_ONLY(data_vio->hash_lock != NULL, "data_vio to compress has a hash_lock"); + ASSERT_LOG_ONLY(data_vio_has_allocation(data_vio), + "data_vio to compress has an allocation"); + + /* + * There are 4 reasons why a data_vio which has reached this point will not be eligible for + * compression: + * + * 1) Since data_vios can block indefinitely in the packer, it would be bad to do so if the + * write request also requests FUA. + * + * 2) A data_vio should not be compressed when compression is disabled for the vdo. + * + * 3) A data_vio could be doing a partial write on behalf of a larger discard which has not + * yet been acknowledged and hence blocking in the packer would be bad. + * + * 4) Some other data_vio may be waiting on this data_vio in which case blocking in the + * packer would also be bad. + */ + if (data_vio->fua || + !vdo_get_compressing(vdo_from_data_vio(data_vio)) || + ((data_vio->user_bio != NULL) && (bio_op(data_vio->user_bio) == REQ_OP_DISCARD)) || + (advance_data_vio_compression_stage(data_vio).stage != DATA_VIO_COMPRESSING)) { + write_data_vio(data_vio); + return; + } + + data_vio->last_async_operation = VIO_ASYNC_OP_COMPRESS_DATA_VIO; + launch_data_vio_cpu_callback(data_vio, compress_data_vio, CPU_Q_COMPRESS_BLOCK_PRIORITY); +} + +/** + * hash_data_vio() - Hash the data in a data_vio and set the hash zone (which also flags the record + * name as set). + + * This callback is registered in prepare_for_dedupe(). + */ +static void hash_data_vio(struct vdo_completion *completion) +{ + struct data_vio *data_vio = as_data_vio(completion); + + assert_data_vio_on_cpu_thread(data_vio); + ASSERT_LOG_ONLY(!data_vio->is_zero, "zero blocks should not be hashed"); + + murmurhash3_128(data_vio->vio.data, + VDO_BLOCK_SIZE, + 0x62ea60be, + &data_vio->record_name); + + data_vio->hash_zone = vdo_select_hash_zone(vdo_from_data_vio(data_vio)->hash_zones, + &data_vio->record_name); + data_vio->last_async_operation = VIO_ASYNC_OP_ACQUIRE_VDO_HASH_LOCK; + launch_data_vio_hash_zone_callback(data_vio, vdo_acquire_hash_lock); +} + +/** prepare_for_dedupe() - Prepare for the dedupe path after attempting to get an allocation. */ +static void prepare_for_dedupe(struct data_vio *data_vio) +{ + /* We don't care what thread we are on. */ + ASSERT_LOG_ONLY(!data_vio->is_zero, "must not prepare to dedupe zero blocks"); + + /* + * Before we can dedupe, we need to know the record name, so the first + * step is to hash the block data. + */ + data_vio->last_async_operation = VIO_ASYNC_OP_HASH_DATA_VIO; + launch_data_vio_cpu_callback(data_vio, hash_data_vio, CPU_Q_HASH_BLOCK_PRIORITY); +} + +/** + * write_bio_finished() - This is the bio_end_io function registered in write_block() to be called + * when a data_vio's write to the underlying storage has completed. 
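+ *
+ * It records the bio's status in the data_vio's completion, marks the allocation lock as
+ * eligible for downgrade, and proceeds to the journal, reference count, and block map updates
+ * via update_metadata_for_data_vio_write().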
+ */ +static void write_bio_finished(struct bio *bio) +{ + struct data_vio *data_vio = vio_as_data_vio((struct vio *) bio->bi_private); + + vdo_count_completed_bios(bio); + vdo_set_completion_result(&data_vio->vio.completion, blk_status_to_errno(bio->bi_status)); + data_vio->downgrade_allocation_lock = true; + update_metadata_for_data_vio_write(data_vio, data_vio->allocation.lock); +} + +/** write_data_vio() - Write a data block to storage without compression. */ +void write_data_vio(struct data_vio *data_vio) +{ + struct data_vio_compression_status status, new_status; + int result; + + if (!data_vio_has_allocation(data_vio)) { + /* + * There was no space to write this block and we failed to deduplicate or compress + * it. + */ + continue_data_vio_with_error(data_vio, VDO_NO_SPACE); + return; + } + + new_status = (struct data_vio_compression_status) { + .stage = DATA_VIO_POST_PACKER, + .may_not_compress = true, + }; + + do { + status = get_data_vio_compression_status(data_vio); + } while ((status.stage != DATA_VIO_POST_PACKER) && + !set_data_vio_compression_status(data_vio, status, new_status)); + + /* Write the data from the data block buffer. */ + result = vio_reset_bio(&data_vio->vio, + data_vio->vio.data, + write_bio_finished, + REQ_OP_WRITE, + data_vio->allocation.pbn); + if (result != VDO_SUCCESS) { + continue_data_vio_with_error(data_vio, result); + return; + } + + data_vio->last_async_operation = VIO_ASYNC_OP_WRITE_DATA_VIO; + submit_data_vio_io(data_vio); +} + +/** + * acknowledge_write_callback() - Acknowledge a write to the requestor. + * + * This callback is registered in allocate_block() and continue_write_with_block_map_slot(). + */ +static void acknowledge_write_callback(struct vdo_completion *completion) +{ + struct data_vio *data_vio = as_data_vio(completion); + struct vdo *vdo = completion->vdo; + + ASSERT_LOG_ONLY((!vdo_uses_bio_ack_queue(vdo) || + (vdo_get_callback_thread_id() == vdo->thread_config.bio_ack_thread)), + "%s() called on bio ack queue", + __func__); + ASSERT_LOG_ONLY(data_vio_has_flush_generation_lock(data_vio), + "write VIO to be acknowledged has a flush generation lock"); + acknowledge_data_vio(data_vio); + if (data_vio->new_mapped.pbn == VDO_ZERO_BLOCK) { + /* This is a zero write or discard */ + update_metadata_for_data_vio_write(data_vio, NULL); + return; + } + + prepare_for_dedupe(data_vio); +} + +/** + * allocate_block() - Attempt to allocate a block in the current allocation zone. + * + * This callback is registered in continue_write_with_block_map_slot(). + */ +static void allocate_block(struct vdo_completion *completion) +{ + struct data_vio *data_vio = as_data_vio(completion); + + assert_data_vio_in_allocated_zone(data_vio); + + if (!vdo_allocate_block_in_zone(data_vio)) + return; + + completion->error_handler = handle_data_vio_error; + WRITE_ONCE(data_vio->allocation_succeeded, true); + data_vio->new_mapped = (struct zoned_pbn) { + .zone = data_vio->allocation.zone, + .pbn = data_vio->allocation.pbn, + .state = VDO_MAPPING_STATE_UNCOMPRESSED, + }; + + if (data_vio->fua) { + prepare_for_dedupe(data_vio); + return; + } + + data_vio->last_async_operation = VIO_ASYNC_OP_ACKNOWLEDGE_WRITE; + launch_data_vio_on_bio_ack_queue(data_vio, acknowledge_write_callback); +} + +/** + * handle_allocation_error() - Handle an error attempting to allocate a block. + * + * This error handler is registered in continue_write_with_block_map_slot(). 
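+ *
+ * Running out of space is not immediately fatal: the write may still succeed if the data
+ * deduplicates against a block which is already stored.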
+ */ +static void handle_allocation_error(struct vdo_completion *completion) +{ + struct data_vio *data_vio = as_data_vio(completion); + + if (completion->result == VDO_NO_SPACE) { + /* We failed to get an allocation, but we can try to dedupe. */ + vdo_reset_completion(completion); + completion->error_handler = handle_data_vio_error; + prepare_for_dedupe(data_vio); + return; + } + + /* We got a "real" error, not just a failure to allocate, so fail the request. */ + handle_data_vio_error(completion); +} + +static int assert_is_trim(struct data_vio *data_vio) +{ + int result = ASSERT(data_vio->is_trim, "data_vio with no block map page is a trim"); + + return ((result == VDO_SUCCESS) ? result : VDO_READ_ONLY); +} + +/** + * continue_data_vio_with_block_map_slot() - Read the data_vio's mapping from the block map. + * + * This callback is registered in launch_read_data_vio(). + */ +void continue_data_vio_with_block_map_slot(struct vdo_completion *completion) +{ + struct data_vio *data_vio = as_data_vio(completion); + + assert_data_vio_in_logical_zone(data_vio); + if (data_vio->read) { + set_data_vio_logical_callback(data_vio, read_block); + data_vio->last_async_operation = VIO_ASYNC_OP_GET_MAPPED_BLOCK_FOR_READ; + vdo_get_mapped_block(data_vio); + return; + } + + vdo_acquire_flush_generation_lock(data_vio); + + if (data_vio->tree_lock.tree_slots[0].block_map_slot.pbn == VDO_ZERO_BLOCK) { + /* + * This is a trim for a block on a block map page which has not been allocated, so + * there's nothing more we need to do. + */ + completion->callback = complete_data_vio; + continue_data_vio_with_error(data_vio, assert_is_trim(data_vio)); + return; + } + + /* + * We need an allocation if this is neither a full-block trim nor a + * full-block zero write. + */ + if (!data_vio->is_zero && (!data_vio->is_trim || data_vio->is_partial)) { + data_vio_allocate_data_block(data_vio, + VIO_WRITE_LOCK, + allocate_block, + handle_allocation_error); + return; + } + + + /* + * We don't need to write any data, so skip allocation and just update the block map and + * reference counts (via the journal). + */ + data_vio->new_mapped.pbn = VDO_ZERO_BLOCK; + if (data_vio->is_zero) + data_vio->new_mapped.state = VDO_MAPPING_STATE_UNCOMPRESSED; + + if (data_vio->remaining_discard > VDO_BLOCK_SIZE) { + /* This is not the final block of a discard so we can't acknowledge it yet. */ + update_metadata_for_data_vio_write(data_vio, NULL); + return; + } + + data_vio->last_async_operation = VIO_ASYNC_OP_ACKNOWLEDGE_WRITE; + launch_data_vio_on_bio_ack_queue(data_vio, acknowledge_write_callback); +} diff --git a/drivers/md/dm-vdo/data-vio.h b/drivers/md/dm-vdo/data-vio.h new file mode 100644 index 000000000000..b6a6f9f9f498 --- /dev/null +++ b/drivers/md/dm-vdo/data-vio.h @@ -0,0 +1,683 @@ +/* SPDX-License-Identifier: GPL-2.0-only */ +/* + * Copyright 2023 Red Hat + */ + +#ifndef DATA_VIO_H +#define DATA_VIO_H + +#include +#include +#include + +#include "permassert.h" +#include "uds.h" + +#include "block-map.h" +#include "completion.h" +#include "constants.h" +#include "dedupe.h" +#include "encodings.h" +#include "logical-zone.h" +#include "physical-zone.h" +#include "types.h" +#include "vdo.h" +#include "vio.h" +#include "wait-queue.h" + +/* Codes for describing the last asynchronous operation performed on a vio. 
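+ * These values index ASYNC_OPERATION_NAMES for logging, so the enum must be kept in sync with
+ * that array (see get_data_vio_operation_name()).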
*/ +enum async_operation_number { + MIN_VIO_ASYNC_OPERATION_NUMBER, + VIO_ASYNC_OP_LAUNCH = MIN_VIO_ASYNC_OPERATION_NUMBER, + VIO_ASYNC_OP_ACKNOWLEDGE_WRITE, + VIO_ASYNC_OP_ACQUIRE_VDO_HASH_LOCK, + VIO_ASYNC_OP_ATTEMPT_LOGICAL_BLOCK_LOCK, + VIO_ASYNC_OP_LOCK_DUPLICATE_PBN, + VIO_ASYNC_OP_CHECK_FOR_DUPLICATION, + VIO_ASYNC_OP_CLEANUP, + VIO_ASYNC_OP_COMPRESS_DATA_VIO, + VIO_ASYNC_OP_FIND_BLOCK_MAP_SLOT, + VIO_ASYNC_OP_GET_MAPPED_BLOCK_FOR_READ, + VIO_ASYNC_OP_GET_MAPPED_BLOCK_FOR_WRITE, + VIO_ASYNC_OP_HASH_DATA_VIO, + VIO_ASYNC_OP_JOURNAL_REMAPPING, + VIO_ASYNC_OP_ATTEMPT_PACKING, + VIO_ASYNC_OP_PUT_MAPPED_BLOCK, + VIO_ASYNC_OP_READ_DATA_VIO, + VIO_ASYNC_OP_UPDATE_DEDUPE_INDEX, + VIO_ASYNC_OP_UPDATE_REFERENCE_COUNTS, + VIO_ASYNC_OP_VERIFY_DUPLICATION, + VIO_ASYNC_OP_WRITE_DATA_VIO, + MAX_VIO_ASYNC_OPERATION_NUMBER, +} __packed; + +struct lbn_lock { + logical_block_number_t lbn; + bool locked; + struct wait_queue waiters; + struct logical_zone *zone; +}; + +/* A position in the arboreal block map at a specific level. */ +struct block_map_tree_slot { + page_number_t page_index; + struct block_map_slot block_map_slot; +}; + +/* Fields for using the arboreal block map. */ +struct tree_lock { + /* The current height at which this data_vio is operating */ + height_t height; + /* The block map tree for this LBN */ + root_count_t root_index; + /* Whether we hold a page lock */ + bool locked; + /* The key for the lock map */ + u64 key; + /* The queue of waiters for the page this vio is allocating or loading */ + struct wait_queue waiters; + /* The block map tree slots for this LBN */ + struct block_map_tree_slot tree_slots[VDO_BLOCK_MAP_TREE_HEIGHT + 1]; +}; + +struct zoned_pbn { + physical_block_number_t pbn; + enum block_mapping_state state; + struct physical_zone *zone; +}; + +/* + * Where a data_vio is on the compression path; advance_compression_stage() depends on the order of + * this enum. + */ +enum data_vio_compression_stage { + /* A data_vio which has not yet entered the compression path */ + DATA_VIO_PRE_COMPRESSOR, + /* A data_vio which is in the compressor */ + DATA_VIO_COMPRESSING, + /* A data_vio which is blocked in the packer */ + DATA_VIO_PACKING, + /* A data_vio which is no longer on the compression path (and never will be) */ + DATA_VIO_POST_PACKER, +}; + +struct data_vio_compression_status { + enum data_vio_compression_stage stage; + bool may_not_compress; +}; + +struct compression_state { + /* + * The current compression status of this data_vio. This field contains a value which + * consists of a data_vio_compression_stage and a flag indicating whether a request has + * been made to cancel (or prevent) compression for this data_vio. + * + * This field should be accessed through the get_data_vio_compression_status() and + * set_data_vio_compression_status() methods. It should not be accessed directly. + */ + atomic_t status; + + /* The compressed size of this block */ + u16 size; + + /* The packer input or output bin slot which holds the enclosing data_vio */ + slot_number_t slot; + + /* The packer bin to which the enclosing data_vio has been assigned */ + struct packer_bin *bin; + + /* A link in the chain of data_vios which have been packed together */ + struct data_vio *next_in_batch; + + /* A vio which is blocked in the packer while holding a lock this vio needs. */ + struct data_vio *lock_holder; + + /* + * The compressed block used to hold the compressed form of this block and that of any + * other blocks for which this data_vio is the compressed write agent. 
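+	 *
+	 * It also receives the raw compressed block when a compressed fragment is read back
+	 * from storage (see read_block() and uncompress_data_vio()).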
+ */ + struct compressed_block *block; +}; + +/* Fields supporting allocation of data blocks. */ +struct allocation { + /* The physical zone in which to allocate a physical block */ + struct physical_zone *zone; + + /* The block allocated to this vio */ + physical_block_number_t pbn; + + /* + * If non-NULL, the pooled PBN lock held on the allocated block. Must be a write lock until + * the block has been written, after which it will become a read lock. + */ + struct pbn_lock *lock; + + /* The type of write lock to obtain on the allocated block */ + enum pbn_lock_type write_lock_type; + + /* The zone which was the start of the current allocation cycle */ + zone_count_t first_allocation_zone; + + /* Whether this vio should wait for a clean slab */ + bool wait_for_clean_slab; +}; + +struct reference_updater { + enum journal_operation operation; + bool increment; + struct zoned_pbn zpbn; + struct pbn_lock *lock; + struct waiter waiter; +}; + +/* A vio for processing user data requests. */ +struct data_vio { + /* The wait_queue entry structure */ + struct waiter waiter; + + /* The logical block of this request */ + struct lbn_lock logical; + + /* The state for traversing the block map tree */ + struct tree_lock tree_lock; + + /* The current partition address of this block */ + struct zoned_pbn mapped; + + /* The hash of this vio (if not zero) */ + struct uds_record_name record_name; + + /* Used for logging and debugging */ + enum async_operation_number last_async_operation; + + /* The operations to record in the recovery and slab journals */ + struct reference_updater increment_updater; + struct reference_updater decrement_updater; + + u16 read : 1; + u16 write : 1; + u16 fua : 1; + u16 is_zero : 1; + u16 is_trim : 1; + u16 is_partial : 1; + u16 is_duplicate : 1; + u16 first_reference_operation_complete : 1; + u16 downgrade_allocation_lock : 1; + + struct allocation allocation; + + /* + * Whether this vio has received an allocation. This field is examined from threads not in + * the allocation zone. + */ + bool allocation_succeeded; + + /* The new partition address of this block after the vio write completes */ + struct zoned_pbn new_mapped; + + /* The hash zone responsible for the name (NULL if is_zero_block) */ + struct hash_zone *hash_zone; + + /* The lock this vio holds or shares with other vios with the same data */ + struct hash_lock *hash_lock; + + /* All data_vios sharing a hash lock are kept in a list linking these list entries */ + struct list_head hash_lock_entry; + + /* The block number in the partition of the UDS deduplication advice */ + struct zoned_pbn duplicate; + + /* + * The sequence number of the recovery journal block containing the increment entry for + * this vio. + */ + sequence_number_t recovery_sequence_number; + + /* The point in the recovery journal where this write last made an entry */ + struct journal_point recovery_journal_point; + + /* The list of vios in user initiated write requests */ + struct list_head write_entry; + + /* The generation number of the VDO that this vio belongs to */ + sequence_number_t flush_generation; + + /* The completion to use for fetching block map pages for this vio */ + struct vdo_page_completion page_completion; + + /* The user bio that initiated this VIO */ + struct bio *user_bio; + + /* partial block support */ + block_size_t offset; + + /* + * The number of bytes to be discarded. For discards, this field will always be positive, + * whereas for non-discards it will always be 0. 
Hence it can be used to determine whether + * a data_vio is processing a discard, even after the user_bio has been acknowledged. + */ + u32 remaining_discard; + + struct dedupe_context *dedupe_context; + + /* Fields beyond this point will not be reset when a pooled data_vio is reused. */ + + struct vio vio; + + /* The completion for making reference count decrements */ + struct vdo_completion decrement_completion; + + /* All of the fields necessary for the compression path */ + struct compression_state compression; + + /* A block used as output during compression or uncompression */ + char *scratch_block; + + struct list_head pool_entry; +}; + +static inline struct data_vio *vio_as_data_vio(struct vio *vio) +{ + ASSERT_LOG_ONLY((vio->type == VIO_TYPE_DATA), "vio is a data_vio"); + return container_of(vio, struct data_vio, vio); +} + +static inline struct data_vio *as_data_vio(struct vdo_completion *completion) +{ + return vio_as_data_vio(as_vio(completion)); +} + +static inline struct data_vio *waiter_as_data_vio(struct waiter *waiter) +{ + if (waiter == NULL) + return NULL; + + return container_of(waiter, struct data_vio, waiter); +} + +static inline struct data_vio * +data_vio_from_reference_updater(struct reference_updater *updater) +{ + if (updater->increment) + return container_of(updater, struct data_vio, increment_updater); + + return container_of(updater, struct data_vio, decrement_updater); +} + +static inline bool data_vio_has_flush_generation_lock(struct data_vio *data_vio) +{ + return !list_empty(&data_vio->write_entry); +} + +static inline struct vdo *vdo_from_data_vio(struct data_vio *data_vio) +{ + return data_vio->vio.completion.vdo; +} + +static inline bool data_vio_has_allocation(struct data_vio *data_vio) +{ + return (data_vio->allocation.pbn != VDO_ZERO_BLOCK); +} + +struct data_vio_compression_status __must_check +advance_data_vio_compression_stage(struct data_vio *data_vio); +struct data_vio_compression_status __must_check +get_data_vio_compression_status(struct data_vio *data_vio); +bool cancel_data_vio_compression(struct data_vio *data_vio); + +struct data_vio_pool; + +int make_data_vio_pool(struct vdo *vdo, + data_vio_count_t pool_size, + data_vio_count_t discard_limit, + struct data_vio_pool **pool_ptr); +void free_data_vio_pool(struct data_vio_pool *pool); +void vdo_launch_bio(struct data_vio_pool *pool, struct bio *bio); +void drain_data_vio_pool(struct data_vio_pool *pool, struct vdo_completion *completion); +void resume_data_vio_pool(struct data_vio_pool *pool, struct vdo_completion *completion); + +void dump_data_vio_pool(struct data_vio_pool *pool, bool dump_vios); +data_vio_count_t get_data_vio_pool_active_discards(struct data_vio_pool *pool); +data_vio_count_t get_data_vio_pool_discard_limit(struct data_vio_pool *pool); +data_vio_count_t get_data_vio_pool_maximum_discards(struct data_vio_pool *pool); +int __must_check +set_data_vio_pool_discard_limit(struct data_vio_pool *pool, data_vio_count_t limit); +data_vio_count_t get_data_vio_pool_active_requests(struct data_vio_pool *pool); +data_vio_count_t get_data_vio_pool_request_limit(struct data_vio_pool *pool); +data_vio_count_t get_data_vio_pool_maximum_requests(struct data_vio_pool *pool); + +void complete_data_vio(struct vdo_completion *completion); +void handle_data_vio_error(struct vdo_completion *completion); + +static inline void continue_data_vio(struct data_vio *data_vio) +{ + vdo_launch_completion(&data_vio->vio.completion); +} + +/** + * continue_data_vio_with_error() - Set an error code and then 
continue processing a data_vio. + * + * This will not mask older errors. This function can be called with a success code, but it is more + * efficient to call continue_data_vio() if the caller knows the result was a success. + */ +static inline void continue_data_vio_with_error(struct data_vio *data_vio, int result) +{ + vdo_continue_completion(&data_vio->vio.completion, result); +} + +const char * __must_check get_data_vio_operation_name(struct data_vio *data_vio); + +static inline void assert_data_vio_in_hash_zone(struct data_vio *data_vio) +{ + thread_id_t expected = data_vio->hash_zone->thread_id; + thread_id_t thread_id = vdo_get_callback_thread_id(); + /* + * It's odd to use the LBN, but converting the record name to hex is a bit clunky for an + * inline, and the LBN better than nothing as an identifier. + */ + ASSERT_LOG_ONLY((expected == thread_id), + "data_vio for logical block %llu on thread %u, should be on hash zone thread %u", + (unsigned long long) data_vio->logical.lbn, + thread_id, + expected); +} + +static inline void set_data_vio_hash_zone_callback(struct data_vio *data_vio, vdo_action *callback) +{ + vdo_set_completion_callback(&data_vio->vio.completion, + callback, + data_vio->hash_zone->thread_id); +} + +/** + * launch_data_vio_hash_zone_callback() - Set a callback as a hash zone operation and invoke it + * immediately. + */ +static inline void +launch_data_vio_hash_zone_callback(struct data_vio *data_vio, vdo_action *callback) +{ + set_data_vio_hash_zone_callback(data_vio, callback); + vdo_launch_completion(&data_vio->vio.completion); +} + +static inline void assert_data_vio_in_logical_zone(struct data_vio *data_vio) +{ + thread_id_t expected = data_vio->logical.zone->thread_id; + thread_id_t thread_id = vdo_get_callback_thread_id(); + + ASSERT_LOG_ONLY((expected == thread_id), + "data_vio for logical block %llu on thread %u, should be on thread %u", + (unsigned long long) data_vio->logical.lbn, + thread_id, + expected); +} + +static inline void set_data_vio_logical_callback(struct data_vio *data_vio, vdo_action *callback) +{ + vdo_set_completion_callback(&data_vio->vio.completion, + callback, + data_vio->logical.zone->thread_id); +} + +/** + * launch_data_vio_logical_callback() - Set a callback as a logical block operation and invoke it + * immediately. + */ +static inline void +launch_data_vio_logical_callback(struct data_vio *data_vio, vdo_action *callback) +{ + set_data_vio_logical_callback(data_vio, callback); + vdo_launch_completion(&data_vio->vio.completion); +} + +static inline void assert_data_vio_in_allocated_zone(struct data_vio *data_vio) +{ + thread_id_t expected = data_vio->allocation.zone->thread_id; + thread_id_t thread_id = vdo_get_callback_thread_id(); + + ASSERT_LOG_ONLY((expected == thread_id), + "struct data_vio for allocated physical block %llu on thread %u, should be on thread %u", + (unsigned long long) data_vio->allocation.pbn, + thread_id, + expected); +} + +static inline void +set_data_vio_allocated_zone_callback(struct data_vio *data_vio, vdo_action *callback) +{ + vdo_set_completion_callback(&data_vio->vio.completion, + callback, + data_vio->allocation.zone->thread_id); +} + +/** + * launch_data_vio_allocated_zone_callback() - Set a callback as a physical block operation in a + * data_vio's allocated zone and queue the data_vio and + * invoke it immediately. 
+ */ +static inline void +launch_data_vio_allocated_zone_callback(struct data_vio *data_vio, vdo_action *callback) +{ + set_data_vio_allocated_zone_callback(data_vio, callback); + vdo_launch_completion(&data_vio->vio.completion); +} + +static inline void assert_data_vio_in_duplicate_zone(struct data_vio *data_vio) +{ + thread_id_t expected = data_vio->duplicate.zone->thread_id; + thread_id_t thread_id = vdo_get_callback_thread_id(); + + ASSERT_LOG_ONLY((expected == thread_id), + "data_vio for duplicate physical block %llu on thread %u, should be on thread %u", + (unsigned long long) data_vio->duplicate.pbn, + thread_id, + expected); +} + +static inline void +set_data_vio_duplicate_zone_callback(struct data_vio *data_vio, vdo_action *callback) +{ + vdo_set_completion_callback(&data_vio->vio.completion, + callback, + data_vio->duplicate.zone->thread_id); +} + +/** + * launch_data_vio_duplicate_zone_callback() - Set a callback as a physical block operation in a + * data_vio's duplicate zone and queue the data_vio and + * invoke it immediately. + */ +static inline void +launch_data_vio_duplicate_zone_callback(struct data_vio *data_vio, vdo_action *callback) +{ + set_data_vio_duplicate_zone_callback(data_vio, callback); + vdo_launch_completion(&data_vio->vio.completion); +} + +static inline void assert_data_vio_in_mapped_zone(struct data_vio *data_vio) +{ + thread_id_t expected = data_vio->mapped.zone->thread_id; + thread_id_t thread_id = vdo_get_callback_thread_id(); + + ASSERT_LOG_ONLY((expected == thread_id), + "data_vio for mapped physical block %llu on thread %u, should be on thread %u", + (unsigned long long) data_vio->mapped.pbn, + thread_id, + expected); +} + +static inline void +set_data_vio_mapped_zone_callback(struct data_vio *data_vio, vdo_action *callback) +{ + vdo_set_completion_callback(&data_vio->vio.completion, + callback, + data_vio->mapped.zone->thread_id); +} + +static inline void assert_data_vio_in_new_mapped_zone(struct data_vio *data_vio) +{ + thread_id_t expected = data_vio->new_mapped.zone->thread_id; + thread_id_t thread_id = vdo_get_callback_thread_id(); + + ASSERT_LOG_ONLY((expected == thread_id), + "data_vio for new_mapped physical block %llu on thread %u, should be on thread %u", + (unsigned long long) data_vio->new_mapped.pbn, + thread_id, + expected); +} + +static inline void +set_data_vio_new_mapped_zone_callback(struct data_vio *data_vio, vdo_action *callback) +{ + vdo_set_completion_callback(&data_vio->vio.completion, + callback, + data_vio->new_mapped.zone->thread_id); +} + +static inline void assert_data_vio_in_journal_zone(struct data_vio *data_vio) +{ + thread_id_t journal_thread = vdo_from_data_vio(data_vio)->thread_config.journal_thread; + thread_id_t thread_id = vdo_get_callback_thread_id(); + + ASSERT_LOG_ONLY((journal_thread == thread_id), + "data_vio for logical block %llu on thread %u, should be on journal thread %u", + (unsigned long long) data_vio->logical.lbn, + thread_id, + journal_thread); +} + +static inline void set_data_vio_journal_callback(struct data_vio *data_vio, vdo_action *callback) +{ + thread_id_t journal_thread = vdo_from_data_vio(data_vio)->thread_config.journal_thread; + + vdo_set_completion_callback(&data_vio->vio.completion, callback, journal_thread); +} + +/** + * launch_data_vio_journal_callback() - Set a callback as a journal operation and invoke it + * immediately. 
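+ *
+ * The vdo has a single journal thread, so all recovery journal additions are serialized there.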
+ */ +static inline void +launch_data_vio_journal_callback(struct data_vio *data_vio, vdo_action *callback) +{ + set_data_vio_journal_callback(data_vio, callback); + vdo_launch_completion(&data_vio->vio.completion); +} + +static inline void assert_data_vio_in_packer_zone(struct data_vio *data_vio) +{ + thread_id_t packer_thread = vdo_from_data_vio(data_vio)->thread_config.packer_thread; + thread_id_t thread_id = vdo_get_callback_thread_id(); + + ASSERT_LOG_ONLY((packer_thread == thread_id), + "data_vio for logical block %llu on thread %u, should be on packer thread %u", + (unsigned long long) data_vio->logical.lbn, + thread_id, + packer_thread); +} + +static inline void set_data_vio_packer_callback(struct data_vio *data_vio, vdo_action *callback) +{ + thread_id_t packer_thread = vdo_from_data_vio(data_vio)->thread_config.packer_thread; + + vdo_set_completion_callback(&data_vio->vio.completion, callback, packer_thread); +} + +/** + * launch_data_vio_packer_callback() - Set a callback as a packer operation and invoke it + * immediately. + */ +static inline void launch_data_vio_packer_callback(struct data_vio *data_vio, vdo_action *callback) +{ + set_data_vio_packer_callback(data_vio, callback); + vdo_launch_completion(&data_vio->vio.completion); +} + +static inline void assert_data_vio_on_cpu_thread(struct data_vio *data_vio) +{ + thread_id_t cpu_thread = vdo_from_data_vio(data_vio)->thread_config.cpu_thread; + thread_id_t thread_id = vdo_get_callback_thread_id(); + + ASSERT_LOG_ONLY((cpu_thread == thread_id), + "data_vio for logical block %llu on thread %u, should be on cpu thread %u", + (unsigned long long) data_vio->logical.lbn, + thread_id, + cpu_thread); +} + +static inline void set_data_vio_cpu_callback(struct data_vio *data_vio, vdo_action *callback) +{ + thread_id_t cpu_thread = vdo_from_data_vio(data_vio)->thread_config.cpu_thread; + + vdo_set_completion_callback(&data_vio->vio.completion, callback, cpu_thread); +} + +/** + * launch_data_vio_cpu_callback() - Set a callback to run on the CPU queues and invoke it + * immediately. + */ +static inline void +launch_data_vio_cpu_callback(struct data_vio *data_vio, + vdo_action *callback, + enum vdo_completion_priority priority) +{ + set_data_vio_cpu_callback(data_vio, callback); + vdo_launch_completion_with_priority(&data_vio->vio.completion, priority); +} + +static inline void set_data_vio_bio_zone_callback(struct data_vio *data_vio, vdo_action *callback) +{ + vdo_set_completion_callback(&data_vio->vio.completion, + callback, + get_vio_bio_zone_thread_id(&data_vio->vio)); +} + +/** + * launch_data_vio_bio_zone_callback() - Set a callback as a bio zone operation and invoke it + * immediately. + */ +static inline void +launch_data_vio_bio_zone_callback(struct data_vio *data_vio, vdo_action *callback) +{ + set_data_vio_bio_zone_callback(data_vio, callback); + vdo_launch_completion_with_priority(&data_vio->vio.completion, BIO_Q_DATA_PRIORITY); +} + +/** + * launch_data_vio_on_bio_ack_queue() - If the vdo uses a bio_ack queue, set a callback to run on + * it and invoke it immediately, otherwise, just run the + * callback on the current thread. 
+ */ +static inline void +launch_data_vio_on_bio_ack_queue(struct data_vio *data_vio, vdo_action *callback) +{ + struct vdo_completion *completion = &data_vio->vio.completion; + struct vdo *vdo = completion->vdo; + + if (!vdo_uses_bio_ack_queue(vdo)) { + callback(completion); + return; + } + + vdo_set_completion_callback(completion, callback, vdo->thread_config.bio_ack_thread); + vdo_launch_completion_with_priority(completion, BIO_ACK_Q_ACK_PRIORITY); +} + +void data_vio_allocate_data_block(struct data_vio *data_vio, + enum pbn_lock_type write_lock_type, + vdo_action *callback, + vdo_action *error_handler); + +void release_data_vio_allocation_lock(struct data_vio *data_vio, bool reset); + +int __must_check uncompress_data_vio(struct data_vio *data_vio, + enum block_mapping_state mapping_state, + char *buffer); + +void update_metadata_for_data_vio_write(struct data_vio *data_vio, struct pbn_lock *lock); +void write_data_vio(struct data_vio *data_vio); +void launch_compress_data_vio(struct data_vio *data_vio); +void continue_data_vio_with_block_map_slot(struct vdo_completion *completion); + +#endif /* DATA_VIO_H */ From patchwork Thu Oct 26 21:41:17 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Matthew Sakai X-Patchwork-Id: 13437958 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 2BCA0C27C5A for ; Thu, 26 Oct 2023 21:44:39 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1344919AbjJZVoi (ORCPT ); Thu, 26 Oct 2023 17:44:38 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:41554 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S234999AbjJZVoZ (ORCPT ); Thu, 26 Oct 2023 17:44:25 -0400 Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.133.124]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 8D0BCD64 for ; Thu, 26 Oct 2023 14:41:53 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1698356512; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=ODsaMcZaEAFO6RAAaXSuPnSjHR9sZbR/Lf5H+MmiMO0=; b=X+e9ni5/MdQQ3p0vaXv4/VT7WUTH1nz7Jnf4xBQWO2nIs5DXqm+bkYoFHhnzeCU+yTkjwd py/pYz5gxkOc6PWeoKj1CxTuVpToFcngsVFk/krMTAKG0Zp7z/fxwubq6LJM5rT/txTGkz IzGzmaDv7cGaaYNEm0zSK+2SOzhj2ZM= Received: from mimecast-mx02.redhat.com (mx-ext.redhat.com [66.187.233.73]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.3, cipher=TLS_AES_256_GCM_SHA384) id us-mta-6-YEtb8nb6Me6fkgyjl2WJWw-1; Thu, 26 Oct 2023 17:41:41 -0400 X-MC-Unique: YEtb8nb6Me6fkgyjl2WJWw-1 Received: from smtp.corp.redhat.com (int-mx02.intmail.prod.int.rdu2.redhat.com [10.11.54.2]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (2048 bits) server-digest SHA256) (No client certificate requested) by mimecast-mx02.redhat.com (Postfix) with ESMTPS id 993F53C02765; Thu, 26 Oct 2023 21:41:40 +0000 (UTC) Received: from pbitcolo-build-10.permabit.com (pbitcolo-build-10.permabit.lab.eng.bos.redhat.com [10.19.117.76]) by smtp.corp.redhat.com (Postfix) with ESMTP id 80CC940C6F79; Thu, 26 Oct 2023 
21:41:40 +0000 (UTC) Received: by pbitcolo-build-10.permabit.com (Postfix, from userid 1138) id 37C083003D; Thu, 26 Oct 2023 17:41:40 -0400 (EDT) From: Matthew Sakai To: dm-devel@lists.linux.dev, linux-block@vger.kernel.org Cc: Matthew Sakai Subject: [PATCH v4 20/39] dm vdo: add flush support Date: Thu, 26 Oct 2023 17:41:17 -0400 Message-Id: <20231026214136.1067410-21-msakai@redhat.com> In-Reply-To: <20231026214136.1067410-1-msakai@redhat.com> References: <20231026214136.1067410-1-msakai@redhat.com> MIME-Version: 1.0 X-Scanned-By: MIMEDefang 3.4.1 on 10.11.54.2 Precedence: bulk List-ID: X-Mailing-List: linux-block@vger.kernel.org This patch adds support for handling incoming flush and/or FUA bios. Each such bio is assigned to a struct vdo_flush. These are allocated as needed, but there is always one kept in reserve in case allocations fail. In the event of an allocation failure, bios may need to wait for an outstanding flush to complete. The logical address space is partitioned into logical zones, each handled by its own thread. Each zone keeps a list of all data_vios handling write requests for logical addresses in that zone. When a flush bio is processed, each logical zone is informed of the flush. When all of the writes which are in progress at the time of the notification have completed in all zones, the flush bio is then allowed to complete. Co-developed-by: J. corwin Coburn Signed-off-by: J. corwin Coburn Co-developed-by: Michael Sclafani Signed-off-by: Michael Sclafani Co-developed-by: Sweet Tea Dorminy Signed-off-by: Sweet Tea Dorminy Signed-off-by: Matthew Sakai --- drivers/md/dm-vdo/flush.c | 563 +++++++++++++++++++++++++++++++ drivers/md/dm-vdo/flush.h | 44 +++ drivers/md/dm-vdo/logical-zone.c | 378 +++++++++++++++++++++ drivers/md/dm-vdo/logical-zone.h | 87 +++++ 4 files changed, 1072 insertions(+) create mode 100644 drivers/md/dm-vdo/flush.c create mode 100644 drivers/md/dm-vdo/flush.h create mode 100644 drivers/md/dm-vdo/logical-zone.c create mode 100644 drivers/md/dm-vdo/logical-zone.h diff --git a/drivers/md/dm-vdo/flush.c b/drivers/md/dm-vdo/flush.c new file mode 100644 index 000000000000..8c77337e5262 --- /dev/null +++ b/drivers/md/dm-vdo/flush.c @@ -0,0 +1,563 @@ +// SPDX-License-Identifier: GPL-2.0-only +/* + * Copyright 2023 Red Hat + */ + +#include "flush.h" + +#include +#include + +#include "logger.h" +#include "memory-alloc.h" +#include "permassert.h" + +#include "admin-state.h" +#include "completion.h" +#include "io-submitter.h" +#include "logical-zone.h" +#include "slab-depot.h" +#include "types.h" +#include "vdo.h" + +struct flusher { + struct vdo_completion completion; + /** The vdo to which this flusher belongs */ + struct vdo *vdo; + /** The administrative state of the flusher */ + struct admin_state state; + /** The current flush generation of the vdo */ + sequence_number_t flush_generation; + /** The first unacknowledged flush generation */ + sequence_number_t first_unacknowledged_generation; + /** The queue of flush requests waiting to notify other threads */ + struct wait_queue notifiers; + /** The queue of flush requests waiting for VIOs to complete */ + struct wait_queue pending_flushes; + /** The flush generation for which notifications are being sent */ + sequence_number_t notify_generation; + /** The logical zone to notify next */ + struct logical_zone *logical_zone_to_notify; + /** The ID of the thread on which flush requests should be made */ + thread_id_t thread_id; + /** The pool of flush requests */ + mempool_t *flush_pool; + /** Bios waiting for 
a flush request to become available */ + struct bio_list waiting_flush_bios; + /** The lock to protect the previous fields */ + spinlock_t lock; + /** The rotor for selecting the bio queue for submitting flush bios */ + zone_count_t bio_queue_rotor; + /** The number of flushes submitted to the current bio queue */ + int flush_count; +}; + +/** + * assert_on_flusher_thread() - Check that we are on the flusher thread. + * @flusher: The flusher. + * @caller: The function which is asserting. + */ +static inline void assert_on_flusher_thread(struct flusher *flusher, const char *caller) +{ + ASSERT_LOG_ONLY((vdo_get_callback_thread_id() == flusher->thread_id), + "%s() called from flusher thread", + caller); +} + +/** + * as_flusher() - Convert a generic vdo_completion to a flusher. + * @completion: The completion to convert. + * + * Return: The completion as a flusher. + */ +static struct flusher *as_flusher(struct vdo_completion *completion) +{ + vdo_assert_completion_type(completion, VDO_FLUSH_NOTIFICATION_COMPLETION); + return container_of(completion, struct flusher, completion); +} + +/** + * completion_as_vdo_flush() - Convert a generic vdo_completion to a vdo_flush. + * @completion: The completion to convert. + * + * Return: The completion as a vdo_flush. + */ +static inline struct vdo_flush *completion_as_vdo_flush(struct vdo_completion *completion) +{ + vdo_assert_completion_type(completion, VDO_FLUSH_COMPLETION); + return container_of(completion, struct vdo_flush, completion); +} + +/** + * waiter_as_flush() - Convert a vdo_flush's generic wait queue entry back to the vdo_flush. + * @waiter: The wait queue entry to convert. + * + * Return: The wait queue entry as a vdo_flush. + */ +static struct vdo_flush *waiter_as_flush(struct waiter *waiter) +{ + return container_of(waiter, struct vdo_flush, waiter); +} + +static void *allocate_flush(gfp_t gfp_mask, void *pool_data) +{ + struct vdo_flush *flush; + + if ((gfp_mask & GFP_NOWAIT) == GFP_NOWAIT) { + flush = UDS_ALLOCATE_NOWAIT(struct vdo_flush, __func__); + } else { + int result = UDS_ALLOCATE(1, struct vdo_flush, __func__, &flush); + + if (result != VDO_SUCCESS) + uds_log_error_strerror(result, "failed to allocate spare flush"); + } + + if (flush != NULL) { + struct flusher *flusher = pool_data; + + vdo_initialize_completion(&flush->completion, flusher->vdo, VDO_FLUSH_COMPLETION); + } + + return flush; +} + +static void free_flush(void *element, void *pool_data __always_unused) +{ + UDS_FREE(element); +} + +/** + * vdo_make_flusher() - Make a flusher for a vdo. + * @vdo: The vdo which owns the flusher. + * + * Return: VDO_SUCCESS or an error. + */ +int vdo_make_flusher(struct vdo *vdo) +{ + int result = UDS_ALLOCATE(1, struct flusher, __func__, &vdo->flusher); + + if (result != VDO_SUCCESS) + return result; + + vdo->flusher->vdo = vdo; + vdo->flusher->thread_id = vdo->thread_config.packer_thread; + vdo_set_admin_state_code(&vdo->flusher->state, VDO_ADMIN_STATE_NORMAL_OPERATION); + vdo_initialize_completion(&vdo->flusher->completion, vdo, + VDO_FLUSH_NOTIFICATION_COMPLETION); + + spin_lock_init(&vdo->flusher->lock); + bio_list_init(&vdo->flusher->waiting_flush_bios); + vdo->flusher->flush_pool = mempool_create(1, allocate_flush, free_flush, vdo->flusher); + return ((vdo->flusher->flush_pool == NULL) ? -ENOMEM : VDO_SUCCESS); +} + +/** + * vdo_free_flusher() - Free a flusher. + * @flusher: The flusher to free. 
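+ *
+ * If @flusher is NULL, this is a no-op.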
+ */ +void vdo_free_flusher(struct flusher *flusher) +{ + if (flusher == NULL) + return; + + if (flusher->flush_pool != NULL) + mempool_destroy(UDS_FORGET(flusher->flush_pool)); + UDS_FREE(flusher); +} + +/** + * vdo_get_flusher_thread_id() - Get the ID of the thread on which flusher functions should be + * called. + * @flusher: The flusher to query. + * + * Return: The ID of the thread which handles the flusher. + */ +thread_id_t vdo_get_flusher_thread_id(struct flusher *flusher) +{ + return flusher->thread_id; +} + +static void notify_flush(struct flusher *flusher); +static void vdo_complete_flush(struct vdo_flush *flush); + +/** + * finish_notification() - Finish the notification process. + * @completion: The flusher completion. + * + * Finishes the notification process by checking if any flushes have completed and then starting + * the notification of the next flush request if one came in while the current notification was in + * progress. This callback is registered in flush_packer_callback(). + */ +static void finish_notification(struct vdo_completion *completion) +{ + struct flusher *flusher = as_flusher(completion); + + assert_on_flusher_thread(flusher, __func__); + + vdo_enqueue_waiter(&flusher->pending_flushes, + vdo_dequeue_next_waiter(&flusher->notifiers)); + vdo_complete_flushes(flusher); + if (vdo_has_waiters(&flusher->notifiers)) + notify_flush(flusher); +} + +/** + * flush_packer_callback() - Flush the packer. + * @completion: The flusher completion. + * + * Flushes the packer now that all of the logical and physical zones have been notified of the new + * flush request. This callback is registered in increment_generation(). + */ +static void flush_packer_callback(struct vdo_completion *completion) +{ + struct flusher *flusher = as_flusher(completion); + + vdo_increment_packer_flush_generation(flusher->vdo->packer); + vdo_launch_completion_callback(completion, finish_notification, flusher->thread_id); +} + +/** + * increment_generation() - Increment the flush generation in a logical zone. + * @completion: The flusher as a completion. + * + * If there are more logical zones, go on to the next one, otherwise, prepare the physical zones. + * This callback is registered both in notify_flush() and in itself. + */ +static void increment_generation(struct vdo_completion *completion) +{ + struct flusher *flusher = as_flusher(completion); + struct logical_zone *zone = flusher->logical_zone_to_notify; + + vdo_increment_logical_zone_flush_generation(zone, flusher->notify_generation); + if (zone->next == NULL) { + vdo_launch_completion_callback(completion, + flush_packer_callback, + flusher->thread_id); + return; + } + + flusher->logical_zone_to_notify = zone->next; + vdo_launch_completion_callback(completion, + increment_generation, + flusher->logical_zone_to_notify->thread_id); +} + +/** + * notify_flush() - Launch a flush notification. + * @flusher: The flusher doing the notification. + */ +static void notify_flush(struct flusher *flusher) +{ + struct vdo_flush *flush = waiter_as_flush(vdo_get_first_waiter(&flusher->notifiers)); + + flusher->notify_generation = flush->flush_generation; + flusher->logical_zone_to_notify = &flusher->vdo->logical_zones->zones[0]; + flusher->completion.requeue = true; + vdo_launch_completion_callback(&flusher->completion, + increment_generation, + flusher->logical_zone_to_notify->thread_id); +} + +/** + * flush_vdo() - Start processing a flush request. 
+ * @completion: A flush request (as a vdo_completion) + * + * This callback is registered in launch_flush(). + */ +static void flush_vdo(struct vdo_completion *completion) +{ + struct vdo_flush *flush = completion_as_vdo_flush(completion); + struct flusher *flusher = completion->vdo->flusher; + bool may_notify; + int result; + + assert_on_flusher_thread(flusher, __func__); + result = ASSERT(vdo_is_state_normal(&flusher->state), "flusher is in normal operation"); + if (result != VDO_SUCCESS) { + vdo_enter_read_only_mode(flusher->vdo, result); + vdo_complete_flush(flush); + return; + } + + flush->flush_generation = flusher->flush_generation++; + may_notify = !vdo_has_waiters(&flusher->notifiers); + vdo_enqueue_waiter(&flusher->notifiers, &flush->waiter); + if (may_notify) + notify_flush(flusher); +} + +/** + * check_for_drain_complete() - Check whether the flusher has drained. + * @flusher: The flusher. + */ +static void check_for_drain_complete(struct flusher *flusher) +{ + bool drained; + + if (!vdo_is_state_draining(&flusher->state) || vdo_has_waiters(&flusher->pending_flushes)) + return; + + spin_lock(&flusher->lock); + drained = bio_list_empty(&flusher->waiting_flush_bios); + spin_unlock(&flusher->lock); + + if (drained) + vdo_finish_draining(&flusher->state); +} + +/** + * vdo_complete_flushes() - Attempt to complete any flushes which might have finished. + * @flusher: The flusher. + */ +void vdo_complete_flushes(struct flusher *flusher) +{ + sequence_number_t oldest_active_generation = U64_MAX; + struct logical_zone *zone; + + assert_on_flusher_thread(flusher, __func__); + + for (zone = &flusher->vdo->logical_zones->zones[0]; zone != NULL; zone = zone->next) + oldest_active_generation = + min(oldest_active_generation, READ_ONCE(zone->oldest_active_generation)); + + while (vdo_has_waiters(&flusher->pending_flushes)) { + struct vdo_flush *flush = + waiter_as_flush(vdo_get_first_waiter(&flusher->pending_flushes)); + + if (flush->flush_generation >= oldest_active_generation) + return; + + ASSERT_LOG_ONLY((flush->flush_generation == + flusher->first_unacknowledged_generation), + "acknowledged next expected flush, %llu, was: %llu", + (unsigned long long) flusher->first_unacknowledged_generation, + (unsigned long long) flush->flush_generation); + vdo_dequeue_next_waiter(&flusher->pending_flushes); + vdo_complete_flush(flush); + flusher->first_unacknowledged_generation++; + } + + check_for_drain_complete(flusher); +} + +/** + * vdo_dump_flusher() - Dump the flusher, in a thread-unsafe fashion. + * @flusher: The flusher. + */ +void vdo_dump_flusher(const struct flusher *flusher) +{ + uds_log_info("struct flusher"); + uds_log_info(" flush_generation=%llu first_unacknowledged_generation=%llu", + (unsigned long long) flusher->flush_generation, + (unsigned long long) flusher->first_unacknowledged_generation); + uds_log_info(" notifiers queue is %s; pending_flushes queue is %s", + (vdo_has_waiters(&flusher->notifiers) ? "not empty" : "empty"), + (vdo_has_waiters(&flusher->pending_flushes) ? "not empty" : "empty")); +} + +/** + * initialize_flush() - Initialize a vdo_flush structure. + * @flush: The flush to initialize. + * @vdo: The vdo being flushed. + * + * Initializes a vdo_flush structure, transferring all the bios in the flusher's waiting_flush_bios + * list to it. The caller MUST already hold the lock. 
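+ * + * (The lock in question is flusher->lock: both vdo_launch_flush() and release_flush() take it + * before calling this function and drop it again before launching the captured flush.)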
+ */ +static void initialize_flush(struct vdo_flush *flush, struct vdo *vdo) +{ + bio_list_init(&flush->bios); + bio_list_merge(&flush->bios, &vdo->flusher->waiting_flush_bios); + bio_list_init(&vdo->flusher->waiting_flush_bios); +} + +static void launch_flush(struct vdo_flush *flush) +{ + struct vdo_completion *completion = &flush->completion; + + vdo_prepare_completion(completion, + flush_vdo, + flush_vdo, + completion->vdo->thread_config.packer_thread, + NULL); + vdo_enqueue_completion(completion, VDO_DEFAULT_Q_FLUSH_PRIORITY); +} + +/** + * vdo_launch_flush() - Function called to start processing a flush request. + * @vdo: The vdo. + * @bio: The bio containing an empty flush request. + * + * This is called when we receive an empty flush bio from the block layer, and before acknowledging + * a non-empty bio with the FUA flag set. + */ +void vdo_launch_flush(struct vdo *vdo, struct bio *bio) +{ + /* + * Try to allocate a vdo_flush to represent the flush request. If the allocation fails, + * we'll deal with it later. + */ + struct vdo_flush *flush = mempool_alloc(vdo->flusher->flush_pool, GFP_NOWAIT); + struct flusher *flusher = vdo->flusher; + const struct admin_state_code *code = vdo_get_admin_state_code(&flusher->state); + + ASSERT_LOG_ONLY(!code->quiescent, "Flushing not allowed in state %s", code->name); + + spin_lock(&flusher->lock); + + /* We have a new bio to start. Add it to the list. */ + bio_list_add(&flusher->waiting_flush_bios, bio); + + if (flush == NULL) { + spin_unlock(&flusher->lock); + return; + } + + /* We have flushes to start. Capture them in the vdo_flush structure. */ + initialize_flush(flush, vdo); + spin_unlock(&flusher->lock); + + /* Finish launching the flushes. */ + launch_flush(flush); +} + +/** + * release_flush() - Release a vdo_flush structure that has completed its work. + * @flush: The completed flush structure to re-use or free. + * + * If there are any pending flush requests whose vdo_flush allocation failed, they will be launched + * by immediately re-using the released vdo_flush. If there is no spare vdo_flush, the released + * structure will become the spare. Otherwise, the vdo_flush will be freed. + */ +static void release_flush(struct vdo_flush *flush) +{ + bool relaunch_flush; + struct flusher *flusher = flush->completion.vdo->flusher; + + spin_lock(&flusher->lock); + if (bio_list_empty(&flusher->waiting_flush_bios)) { + relaunch_flush = false; + } else { + /* We have flushes to start. Capture them in a flush request. */ + initialize_flush(flush, flusher->vdo); + relaunch_flush = true; + } + spin_unlock(&flusher->lock); + + if (relaunch_flush) { + /* Finish launching the flushes. */ + launch_flush(flush); + return; + } + + mempool_free(flush, flusher->flush_pool); +} + +/** + * vdo_complete_flush_callback() - Function called to complete and free a flush request, registered + * in vdo_complete_flush(). + * @completion: The flush request. + */ +static void vdo_complete_flush_callback(struct vdo_completion *completion) +{ + struct vdo_flush *flush = completion_as_vdo_flush(completion); + struct vdo *vdo = completion->vdo; + struct bio *bio; + + while ((bio = bio_list_pop(&flush->bios)) != NULL) { + /* + * We're not acknowledging this bio now, but we'll never touch it again, so this is + * the last chance to account for it. + */ + vdo_count_bios(&vdo->stats.bios_acknowledged, bio); + + /* Update the device, and send it on down... 
*/ + bio_set_dev(bio, vdo_get_backing_device(vdo)); + atomic64_inc(&vdo->stats.flush_out); + submit_bio_noacct(bio); + } + + + /* + * Release the flush structure, freeing it, re-using it as the spare, or using it to launch + * any flushes that had to wait when allocations failed. + */ + release_flush(flush); +} + +/** + * select_bio_queue() - Select the bio queue on which to finish a flush request. + * @flusher: The flusher finishing the request. + */ +static thread_id_t select_bio_queue(struct flusher *flusher) +{ + struct vdo *vdo = flusher->vdo; + zone_count_t bio_threads = flusher->vdo->thread_config.bio_thread_count; + int interval; + + if (bio_threads == 1) + return vdo->thread_config.bio_threads[0]; + + interval = vdo->device_config->thread_counts.bio_rotation_interval; + if (flusher->flush_count == interval) { + flusher->flush_count = 1; + flusher->bio_queue_rotor = ((flusher->bio_queue_rotor + 1) % bio_threads); + } else { + flusher->flush_count++; + } + + return vdo->thread_config.bio_threads[flusher->bio_queue_rotor]; +} + +/** + * vdo_complete_flush() - Complete and free a vdo flush request. + * @flush: The flush request. + */ +static void vdo_complete_flush(struct vdo_flush *flush) +{ + struct vdo_completion *completion = &flush->completion; + + vdo_prepare_completion(completion, + vdo_complete_flush_callback, + vdo_complete_flush_callback, + select_bio_queue(completion->vdo->flusher), + NULL); + vdo_enqueue_completion(completion, BIO_Q_FLUSH_PRIORITY); +} + +/** + * initiate_drain() - Initiate a drain. + * + * Implements vdo_admin_initiator. + */ +static void initiate_drain(struct admin_state *state) +{ + check_for_drain_complete(container_of(state, struct flusher, state)); +} + +/** + * vdo_drain_flusher() - Drain the flusher. + * @flusher: The flusher to drain. + * @completion: The completion to finish when the flusher has drained. + * + * Drains the flusher by preventing any more VIOs from entering the flusher and then flushing. The + * flusher will be left in the suspended state. + */ +void vdo_drain_flusher(struct flusher *flusher, struct vdo_completion *completion) +{ + assert_on_flusher_thread(flusher, __func__); + vdo_start_draining(&flusher->state, + VDO_ADMIN_STATE_SUSPENDING, + completion, + initiate_drain); +} + +/** + * vdo_resume_flusher() - Resume a flusher which has been suspended. + * @flusher: The flusher to resume. + * @parent: The completion to finish when the flusher has resumed. + */ +void vdo_resume_flusher(struct flusher *flusher, struct vdo_completion *parent) +{ + assert_on_flusher_thread(flusher, __func__); + vdo_continue_completion(parent, vdo_resume_if_quiescent(&flusher->state)); +} diff --git a/drivers/md/dm-vdo/flush.h b/drivers/md/dm-vdo/flush.h new file mode 100644 index 000000000000..f36231c493c5 --- /dev/null +++ b/drivers/md/dm-vdo/flush.h @@ -0,0 +1,44 @@ +/* SPDX-License-Identifier: GPL-2.0-only */ +/* + * Copyright 2023 Red Hat + */ + +#ifndef VDO_FLUSH_H +#define VDO_FLUSH_H + +#include "types.h" +#include "vio.h" +#include "wait-queue.h" +#include "work-queue.h" + +/* A marker for tracking which journal entries are affected by a flush request. */ +struct vdo_flush { + /* The completion for enqueueing this flush request. 
*/ + struct vdo_completion completion; + /* The flush bios covered by this request */ + struct bio_list bios; + /* The wait queue entry for this flush */ + struct waiter waiter; + /* Which flush this struct represents */ + sequence_number_t flush_generation; +}; + +struct flusher; + +int __must_check vdo_make_flusher(struct vdo *vdo); + +void vdo_free_flusher(struct flusher *flusher); + +thread_id_t __must_check vdo_get_flusher_thread_id(struct flusher *flusher); + +void vdo_complete_flushes(struct flusher *flusher); + +void vdo_dump_flusher(const struct flusher *flusher); + +void vdo_launch_flush(struct vdo *vdo, struct bio *bio); + +void vdo_drain_flusher(struct flusher *flusher, struct vdo_completion *completion); + +void vdo_resume_flusher(struct flusher *flusher, struct vdo_completion *parent); + +#endif /* VDO_FLUSH_H */ diff --git a/drivers/md/dm-vdo/logical-zone.c b/drivers/md/dm-vdo/logical-zone.c new file mode 100644 index 000000000000..1cb2b39c64ed --- /dev/null +++ b/drivers/md/dm-vdo/logical-zone.c @@ -0,0 +1,378 @@ +// SPDX-License-Identifier: GPL-2.0-only +/* + * Copyright 2023 Red Hat + */ + +#include "logical-zone.h" + +#include "logger.h" +#include "memory-alloc.h" +#include "permassert.h" +#include "string-utils.h" + +#include "action-manager.h" +#include "admin-state.h" +#include "block-map.h" +#include "completion.h" +#include "constants.h" +#include "data-vio.h" +#include "flush.h" +#include "int-map.h" +#include "physical-zone.h" +#include "vdo.h" + +enum { + ALLOCATIONS_PER_ZONE = 128, +}; + +/** + * as_logical_zone() - Convert a generic vdo_completion to a logical_zone. + * @completion: The completion to convert. + * + * Return: The completion as a logical_zone. + */ +static struct logical_zone *as_logical_zone(struct vdo_completion *completion) +{ + vdo_assert_completion_type(completion, VDO_GENERATION_FLUSHED_COMPLETION); + return container_of(completion, struct logical_zone, completion); +} + +/* get_thread_id_for_zone() - Implements vdo_zone_thread_getter. */ +static thread_id_t get_thread_id_for_zone(void *context, zone_count_t zone_number) +{ + struct logical_zones *zones = context; + + return zones->zones[zone_number].thread_id; +} + +/** + * initialize_zone() - Initialize a logical zone. + * @zones: The logical_zones to which this zone belongs. + * @zone_number: The logical_zone's index. + */ +static int initialize_zone(struct logical_zones *zones, zone_count_t zone_number) +{ + int result; + struct vdo *vdo = zones->vdo; + struct logical_zone *zone = &zones->zones[zone_number]; + zone_count_t allocation_zone_number; + + result = vdo_make_int_map(VDO_LOCK_MAP_CAPACITY, 0, &zone->lbn_operations); + if (result != VDO_SUCCESS) + return result; + + if (zone_number < vdo->thread_config.logical_zone_count - 1) + zone->next = &zones->zones[zone_number + 1]; + + vdo_initialize_completion(&zone->completion, vdo, VDO_GENERATION_FLUSHED_COMPLETION); + zone->zones = zones; + zone->zone_number = zone_number; + zone->thread_id = vdo->thread_config.logical_threads[zone_number]; + zone->block_map_zone = &vdo->block_map->zones[zone_number]; + INIT_LIST_HEAD(&zone->write_vios); + vdo_set_admin_state_code(&zone->state, VDO_ADMIN_STATE_NORMAL_OPERATION); + + allocation_zone_number = zone->thread_id % vdo->thread_config.physical_zone_count; + zone->allocation_zone = &vdo->physical_zones->zones[allocation_zone_number]; + + return vdo_make_default_thread(vdo, zone->thread_id); +} + +/** + * vdo_make_logical_zones() - Create a set of logical zones. 
+ * @vdo: The vdo to which the zones will belong. + * @zones_ptr: A pointer to hold the new zones. + * + * Return: VDO_SUCCESS or an error code. + */ +int vdo_make_logical_zones(struct vdo *vdo, struct logical_zones **zones_ptr) +{ + struct logical_zones *zones; + int result; + zone_count_t zone; + zone_count_t zone_count = vdo->thread_config.logical_zone_count; + + if (zone_count == 0) + return VDO_SUCCESS; + + result = UDS_ALLOCATE_EXTENDED(struct logical_zones, zone_count, + struct logical_zone, __func__, &zones); + if (result != VDO_SUCCESS) + return result; + + zones->vdo = vdo; + zones->zone_count = zone_count; + for (zone = 0; zone < zone_count; zone++) { + result = initialize_zone(zones, zone); + if (result != VDO_SUCCESS) { + vdo_free_logical_zones(zones); + return result; + } + } + + result = vdo_make_action_manager(zones->zone_count, + get_thread_id_for_zone, + vdo->thread_config.admin_thread, + zones, + NULL, + vdo, + &zones->manager); + if (result != VDO_SUCCESS) { + vdo_free_logical_zones(zones); + return result; + } + + *zones_ptr = zones; + return VDO_SUCCESS; +} + +/** + * vdo_free_logical_zones() - Free a set of logical zones. + * @zones: The set of zones to free. + */ +void vdo_free_logical_zones(struct logical_zones *zones) +{ + zone_count_t index; + + if (zones == NULL) + return; + + UDS_FREE(UDS_FORGET(zones->manager)); + + for (index = 0; index < zones->zone_count; index++) + vdo_free_int_map(UDS_FORGET(zones->zones[index].lbn_operations)); + + UDS_FREE(zones); +} + +static inline void assert_on_zone_thread(struct logical_zone *zone, const char *what) +{ + ASSERT_LOG_ONLY((vdo_get_callback_thread_id() == zone->thread_id), + "%s() called on correct thread", what); +} + +/** + * check_for_drain_complete() - Check whether this zone has drained. + * @zone: The zone to check. + */ +static void check_for_drain_complete(struct logical_zone *zone) +{ + if (!vdo_is_state_draining(&zone->state) || zone->notifying || + !list_empty(&zone->write_vios)) + return; + + vdo_finish_draining(&zone->state); +} + +/** + * initiate_drain() - Initiate a drain. + * + * Implements vdo_admin_initiator. + */ +static void initiate_drain(struct admin_state *state) +{ + check_for_drain_complete(container_of(state, struct logical_zone, state)); +} + +/** + * drain_logical_zone() - Drain a logical zone. + * + * Implements vdo_zone_action. + */ +static void +drain_logical_zone(void *context, zone_count_t zone_number, struct vdo_completion *parent) +{ + struct logical_zones *zones = context; + + vdo_start_draining(&zones->zones[zone_number].state, + vdo_get_current_manager_operation(zones->manager), + parent, + initiate_drain); +} + +void vdo_drain_logical_zones(struct logical_zones *zones, + const struct admin_state_code *operation, + struct vdo_completion *parent) +{ + vdo_schedule_operation(zones->manager, operation, NULL, drain_logical_zone, NULL, parent); +} + +/** + * resume_logical_zone() - Resume a logical zone. + * + * Implements vdo_zone_action. + */ +static void +resume_logical_zone(void *context, zone_count_t zone_number, struct vdo_completion *parent) +{ + struct logical_zone *zone = &(((struct logical_zones *) context)->zones[zone_number]); + + vdo_fail_completion(parent, vdo_resume_if_quiescent(&zone->state)); +} + +/** + * vdo_resume_logical_zones() - Resume a set of logical zones. + * @zones: The logical zones to resume. + * @parent: The object to notify when the zones have resumed. 
+ */ +void vdo_resume_logical_zones(struct logical_zones *zones, struct vdo_completion *parent) +{ + vdo_schedule_operation(zones->manager, VDO_ADMIN_STATE_RESUMING, NULL, + resume_logical_zone, NULL, parent); +} + +/** + * update_oldest_active_generation() - Update the oldest active generation. + * @zone: The zone. + * + * Return: true if the oldest active generation has changed. + */ +static bool update_oldest_active_generation(struct logical_zone *zone) +{ + struct data_vio *data_vio = + list_first_entry_or_null(&zone->write_vios, struct data_vio, write_entry); + sequence_number_t oldest = + (data_vio == NULL) ? zone->flush_generation : data_vio->flush_generation; + + if (oldest == zone->oldest_active_generation) + return false; + + WRITE_ONCE(zone->oldest_active_generation, oldest); + return true; +} + +/** + * vdo_increment_logical_zone_flush_generation() - Increment the flush generation in a logical + * zone. + * @zone: The logical zone. + * @expected_generation: The expected value of the flush generation before the increment. + */ +void vdo_increment_logical_zone_flush_generation(struct logical_zone *zone, + sequence_number_t expected_generation) +{ + assert_on_zone_thread(zone, __func__); + ASSERT_LOG_ONLY((zone->flush_generation == expected_generation), + "logical zone %u flush generation %llu should be %llu before increment", + zone->zone_number, + (unsigned long long) zone->flush_generation, + (unsigned long long) expected_generation); + + zone->flush_generation++; + zone->ios_in_flush_generation = 0; + update_oldest_active_generation(zone); +} + +/** + * vdo_acquire_flush_generation_lock() - Acquire the shared lock on a flush generation by a write + * data_vio. + * @data_vio: The data_vio. + */ +void vdo_acquire_flush_generation_lock(struct data_vio *data_vio) +{ + struct logical_zone *zone = data_vio->logical.zone; + + assert_on_zone_thread(zone, __func__); + ASSERT_LOG_ONLY(vdo_is_state_normal(&zone->state), "vdo state is normal"); + + data_vio->flush_generation = zone->flush_generation; + list_add_tail(&data_vio->write_entry, &zone->write_vios); + zone->ios_in_flush_generation++; +} + +static void attempt_generation_complete_notification(struct vdo_completion *completion); + +/** + * notify_flusher() - Notify the flusher that at least one generation no longer has active VIOs. + * @completion: The zone completion. + * + * This callback is registered in attempt_generation_complete_notification(). + */ +static void notify_flusher(struct vdo_completion *completion) +{ + struct logical_zone *zone = as_logical_zone(completion); + + vdo_complete_flushes(zone->zones->vdo->flusher); + vdo_launch_completion_callback(completion, + attempt_generation_complete_notification, + zone->thread_id); +} + +/** + * attempt_generation_complete_notification() - Notify the flusher if some generation no + * longer has active VIOs. + * @completion: The zone completion.
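+ * + * Note that notification bounces between threads: notify_flusher() runs vdo_complete_flushes() on + * the flusher's thread and then re-launches this function on the zone thread, repeating until the + * zone's notification_generation has caught up with its oldest_active_generation.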
+ */ +static void attempt_generation_complete_notification(struct vdo_completion *completion) +{ + struct logical_zone *zone = as_logical_zone(completion); + + assert_on_zone_thread(zone, __func__); + if (zone->oldest_active_generation <= zone->notification_generation) { + zone->notifying = false; + check_for_drain_complete(zone); + return; + } + + zone->notifying = true; + zone->notification_generation = zone->oldest_active_generation; + vdo_launch_completion_callback(&zone->completion, notify_flusher, + vdo_get_flusher_thread_id(zone->zones->vdo->flusher)); +} + +/** + * vdo_release_flush_generation_lock() - Release the shared lock on a flush generation held by a + * write data_vio. + * @data_vio: The data_vio whose lock is to be released. + * + * If there are pending flushes, and this data_vio completes the oldest generation active in this + * zone, an attempt will be made to finish any flushes which may now be complete. + */ +void vdo_release_flush_generation_lock(struct data_vio *data_vio) +{ + struct logical_zone *zone = data_vio->logical.zone; + + assert_on_zone_thread(zone, __func__); + + if (!data_vio_has_flush_generation_lock(data_vio)) + return; + + list_del_init(&data_vio->write_entry); + ASSERT_LOG_ONLY((zone->oldest_active_generation <= data_vio->flush_generation), + "data_vio releasing lock on generation %llu is not older than oldest active generation %llu", + (unsigned long long) data_vio->flush_generation, + (unsigned long long) zone->oldest_active_generation); + + if (!update_oldest_active_generation(zone) || zone->notifying) + return; + + attempt_generation_complete_notification(&zone->completion); +} + +struct physical_zone *vdo_get_next_allocation_zone(struct logical_zone *zone) +{ + if (zone->allocation_count == ALLOCATIONS_PER_ZONE) { + zone->allocation_count = 0; + zone->allocation_zone = zone->allocation_zone->next; + } + + zone->allocation_count++; + return zone->allocation_zone; +} + +/** + * vdo_dump_logical_zone() - Dump information about a logical zone to the log for debugging. + * @zone: The zone to dump + * + * Context: the information is dumped in a thread-unsafe fashion. 
+ * + */ +void vdo_dump_logical_zone(const struct logical_zone *zone) +{ + uds_log_info("logical_zone %u", zone->zone_number); + uds_log_info(" flush_generation=%llu oldest_active_generation=%llu notification_generation=%llu notifying=%s ios_in_flush_generation=%llu", + (unsigned long long) READ_ONCE(zone->flush_generation), + (unsigned long long) READ_ONCE(zone->oldest_active_generation), + (unsigned long long) READ_ONCE(zone->notification_generation), + uds_bool_to_string(READ_ONCE(zone->notifying)), + (unsigned long long) READ_ONCE(zone->ios_in_flush_generation)); +} diff --git a/drivers/md/dm-vdo/logical-zone.h b/drivers/md/dm-vdo/logical-zone.h new file mode 100644 index 000000000000..82afe4cc661e --- /dev/null +++ b/drivers/md/dm-vdo/logical-zone.h @@ -0,0 +1,87 @@ +/* SPDX-License-Identifier: GPL-2.0-only */ +/* + * Copyright 2023 Red Hat + */ + +#ifndef VDO_LOGICAL_ZONE_H +#define VDO_LOGICAL_ZONE_H + +#include + +#include "admin-state.h" +#include "int-map.h" +#include "types.h" + +struct physical_zone; + +struct logical_zone { + /* The completion for flush notifications */ + struct vdo_completion completion; + /* The owner of this zone */ + struct logical_zones *zones; + /* Which logical zone this is */ + zone_count_t zone_number; + /* The thread id for this zone */ + thread_id_t thread_id; + /* In progress operations keyed by LBN */ + struct int_map *lbn_operations; + /* The logical to physical map */ + struct block_map_zone *block_map_zone; + /* The current flush generation */ + sequence_number_t flush_generation; + /* + * The oldest active generation in this zone. This is mutated only on the logical zone + * thread but is queried from the flusher thread. + */ + sequence_number_t oldest_active_generation; + /* The number of IOs in the current flush generation */ + block_count_t ios_in_flush_generation; + /* The youngest generation of the current notification */ + sequence_number_t notification_generation; + /* Whether a notification is in progress */ + bool notifying; + /* The queue of active data write VIOs */ + struct list_head write_vios; + /* The administrative state of the zone */ + struct admin_state state; + /* The physical zone from which to allocate */ + struct physical_zone *allocation_zone; + /* The number of allocations done from the current allocation_zone */ + block_count_t allocation_count; + /* The next zone */ + struct logical_zone *next; +}; + +struct logical_zones { + /* The vdo whose zones these are */ + struct vdo *vdo; + /* The manager for administrative actions */ + struct action_manager *manager; + /* The number of zones */ + zone_count_t zone_count; + /* The logical zones themselves */ + struct logical_zone zones[]; +}; + +int __must_check vdo_make_logical_zones(struct vdo *vdo, struct logical_zones **zones_ptr); + +void vdo_free_logical_zones(struct logical_zones *zones); + +void vdo_drain_logical_zones(struct logical_zones *zones, + const struct admin_state_code *operation, + struct vdo_completion *completion); + +void vdo_resume_logical_zones(struct logical_zones *zones, struct vdo_completion *parent); + +void vdo_increment_logical_zone_flush_generation(struct logical_zone *zone, + sequence_number_t expected_generation); + +void vdo_acquire_flush_generation_lock(struct data_vio *data_vio); + +void vdo_release_flush_generation_lock(struct data_vio *data_vio); + +struct physical_zone * __must_check vdo_get_next_allocation_zone(struct logical_zone *zone); + +void vdo_dump_logical_zone(const struct logical_zone *zone); + +#endif /* VDO_LOGICAL_ZONE_H */ 
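[Editorial aside; not part of the patch.] The ordering guarantee provided by flush.c and logical-zone.c above reduces to one rule: vdo_complete_flushes() acknowledges a flush only when its generation is strictly older than the oldest generation still held by a write data_vio in every logical zone. The standalone C sketch below models that rule for a single zone. It is a simplification: the toy_* names, the fixed-size write table, and the single-threaded structure are invented for illustration and are not part of dm-vdo.

/* Toy model of generation-based flush ordering (illustrative only). */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define TOY_MAX_WRITES 16

struct toy_zone {
	uint64_t flush_generation;                  /* bumped each time a flush is notified */
	uint64_t write_generations[TOY_MAX_WRITES]; /* generations held by in-flight writes */
	unsigned int active_writes;
};

/* Like vdo_acquire_flush_generation_lock(): tag a new write with the current generation. */
static void toy_start_write(struct toy_zone *zone)
{
	zone->write_generations[zone->active_writes++] = zone->flush_generation;
}

/* Like update_oldest_active_generation(): the oldest in-flight write, or the current generation. */
static uint64_t toy_oldest_active(const struct toy_zone *zone)
{
	uint64_t oldest = zone->flush_generation;
	unsigned int i;

	for (i = 0; i < zone->active_writes; i++) {
		if (zone->write_generations[i] < oldest)
			oldest = zone->write_generations[i];
	}

	return oldest;
}

/* A notified flush covers the generation in force when it arrived; the zone then moves on. */
static uint64_t toy_start_flush(struct toy_zone *zone)
{
	return zone->flush_generation++;
}

/* The vdo_complete_flushes() rule: done only when strictly older than every active write. */
static bool toy_flush_done(const struct toy_zone *zone, uint64_t flush_generation)
{
	return flush_generation < toy_oldest_active(zone);
}

int main(void)
{
	struct toy_zone zone = { .flush_generation = 0 };
	uint64_t flush;

	toy_start_write(&zone);         /* a write enters generation 0 */
	flush = toy_start_flush(&zone); /* a flush arrives; it covers generation 0 */
	printf("done? %d\n", toy_flush_done(&zone, flush)); /* 0: the write still holds generation 0 */

	zone.active_writes = 0;         /* the write finishes and releases its generation lock */
	printf("done? %d\n", toy_flush_done(&zone, flush)); /* 1: the oldest active generation is now 1 */

	return 0;
}

In the driver the same comparison is split across threads: each logical zone updates its oldest_active_generation with WRITE_ONCE() on its own thread, and the flusher thread samples every zone with READ_ONCE() before completing flushes.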
From patchwork Thu Oct 26 21:41:18 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Matthew Sakai X-Patchwork-Id: 13437942 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 9781EC25B48 for ; Thu, 26 Oct 2023 21:44:25 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S234917AbjJZVoZ (ORCPT ); Thu, 26 Oct 2023 17:44:25 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:52830 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S234958AbjJZVoP (ORCPT ); Thu, 26 Oct 2023 17:44:15 -0400 Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.129.124]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id D2C9DD56 for ; Thu, 26 Oct 2023 14:41:46 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1698356505; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=68nlWM8XoFhIwjUDUKH9YTjj7LNflBOSRKo06es11gQ=; b=AxASDbykwhYqD6LkNM8ooCE9WJK0yWFbNVKZMosnQrAFbDumIZCxpc4qb57qnM/rL+TBSV FT4w9RsIrzqQQCvXy4bQzlXYUdk7S8MxGwwXQHk0MCOjejjnUkEa9H0Aoct85DywpULslX 4RA16OoTfxdAYYW4ZHF5xMhdnNlJRJk= Received: from mimecast-mx02.redhat.com (mx-ext.redhat.com [66.187.233.73]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.3, cipher=TLS_AES_256_GCM_SHA384) id us-mta-121-rBlQp-1NOoOG97wiAfuZDg-1; Thu, 26 Oct 2023 17:41:41 -0400 X-MC-Unique: rBlQp-1NOoOG97wiAfuZDg-1 Received: from smtp.corp.redhat.com (int-mx05.intmail.prod.int.rdu2.redhat.com [10.11.54.5]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (2048 bits) server-digest SHA256) (No client certificate requested) by mimecast-mx02.redhat.com (Postfix) with ESMTPS id B09561C04B50; Thu, 26 Oct 2023 21:41:40 +0000 (UTC) Received: from pbitcolo-build-10.permabit.com (pbitcolo-build-10.permabit.lab.eng.bos.redhat.com [10.19.117.76]) by smtp.corp.redhat.com (Postfix) with ESMTP id A17812949; Thu, 26 Oct 2023 21:41:40 +0000 (UTC) Received: by pbitcolo-build-10.permabit.com (Postfix, from userid 1138) id 615F03003B; Thu, 26 Oct 2023 17:41:40 -0400 (EDT) From: Matthew Sakai To: dm-devel@lists.linux.dev, linux-block@vger.kernel.org Cc: Matthew Sakai Subject: [PATCH v4 21/39] dm vdo: add the vdo io_submitter Date: Thu, 26 Oct 2023 17:41:18 -0400 Message-Id: <20231026214136.1067410-22-msakai@redhat.com> In-Reply-To: <20231026214136.1067410-1-msakai@redhat.com> References: <20231026214136.1067410-1-msakai@redhat.com> MIME-Version: 1.0 X-Scanned-By: MIMEDefang 3.4.1 on 10.11.54.5 Precedence: bulk List-ID: X-Mailing-List: linux-block@vger.kernel.org The io_submitter handles bio submission from vdo data store to the storage below. It will merge bios when possible. Co-developed-by: J. corwin Coburn Signed-off-by: J. 
corwin Coburn Co-developed-by: Michael Sclafani Signed-off-by: Michael Sclafani Co-developed-by: Sweet Tea Dorminy Signed-off-by: Sweet Tea Dorminy Co-developed-by: Bruce Johnston Signed-off-by: Bruce Johnston Co-developed-by: Ken Raeburn Signed-off-by: Ken Raeburn Signed-off-by: Matthew Sakai --- drivers/md/dm-vdo/io-submitter.c | 483 +++++++++++++++++++++++++++++++ drivers/md/dm-vdo/io-submitter.h | 52 ++++ 2 files changed, 535 insertions(+) create mode 100644 drivers/md/dm-vdo/io-submitter.c create mode 100644 drivers/md/dm-vdo/io-submitter.h diff --git a/drivers/md/dm-vdo/io-submitter.c b/drivers/md/dm-vdo/io-submitter.c new file mode 100644 index 000000000000..995c8c7c908d --- /dev/null +++ b/drivers/md/dm-vdo/io-submitter.c @@ -0,0 +1,483 @@ +// SPDX-License-Identifier: GPL-2.0-only +/* + * Copyright 2023 Red Hat + */ + +#include "io-submitter.h" + +#include +#include +#include + +#include "memory-alloc.h" +#include "permassert.h" + +#include "data-vio.h" +#include "logger.h" +#include "types.h" +#include "vdo.h" +#include "vio.h" + +/* + * Submission of bio operations to the underlying storage device will go through a separate work + * queue thread (or more than one) to prevent blocking in other threads if the storage device has a + * full queue. The plug structure allows that thread to do better batching of requests to make the + * I/O more efficient. + * + * When multiple worker threads are used, a thread is chosen for a I/O operation submission based + * on the PBN, so a given PBN will consistently wind up on the same thread. Flush operations are + * assigned round-robin. + * + * The map (protected by the mutex) collects pending I/O operations so that the worker thread can + * reorder them to try to encourage I/O request merging in the request queue underneath. + */ +struct bio_queue_data { + struct vdo_work_queue *queue; + struct blk_plug plug; + struct int_map *map; + struct mutex lock; + unsigned int queue_number; +}; + +struct io_submitter { + unsigned int num_bio_queues_used; + unsigned int bio_queue_rotation_interval; + struct bio_queue_data bio_queue_data[]; +}; + +static void start_bio_queue(void *ptr) +{ + struct bio_queue_data *bio_queue_data = (struct bio_queue_data *) ptr; + + blk_start_plug(&bio_queue_data->plug); +} + +static void finish_bio_queue(void *ptr) +{ + struct bio_queue_data *bio_queue_data = (struct bio_queue_data *) ptr; + + blk_finish_plug(&bio_queue_data->plug); +} + +static const struct vdo_work_queue_type bio_queue_type = { + .start = start_bio_queue, + .finish = finish_bio_queue, + .max_priority = BIO_Q_MAX_PRIORITY, + .default_priority = BIO_Q_DATA_PRIORITY, +}; + +/** + * count_all_bios() - Determine which bio counter to use. + * @vio: The vio associated with the bio. + * @bio: The bio to count. + */ +static void count_all_bios(struct vio *vio, struct bio *bio) +{ + struct atomic_statistics *stats = &vio->completion.vdo->stats; + + if (is_data_vio(vio)) { + vdo_count_bios(&stats->bios_out, bio); + return; + } + + vdo_count_bios(&stats->bios_meta, bio); + if (vio->type == VIO_TYPE_RECOVERY_JOURNAL) + vdo_count_bios(&stats->bios_journal, bio); + else if (vio->type == VIO_TYPE_BLOCK_MAP) + vdo_count_bios(&stats->bios_page_cache, bio); +} + +/** + * assert_in_bio_zone() - Assert that a vio is in the correct bio zone and not in interrupt + * context. + * @vio: The vio to check. 
+ */ +static void assert_in_bio_zone(struct vio *vio) +{ + ASSERT_LOG_ONLY(!in_interrupt(), "not in interrupt context"); + assert_vio_in_bio_zone(vio); +} + +/** + * send_bio_to_device() - Update stats and tracing info, then submit the supplied bio to the OS for + * processing. + * @vio: The vio associated with the bio. + * @bio: The bio to submit to the OS. + */ +static void send_bio_to_device(struct vio *vio, struct bio *bio) +{ + struct vdo *vdo = vio->completion.vdo; + + assert_in_bio_zone(vio); + atomic64_inc(&vdo->stats.bios_submitted); + count_all_bios(vio, bio); + bio_set_dev(bio, vdo_get_backing_device(vdo)); + submit_bio_noacct(bio); +} + +static sector_t get_bio_sector(struct bio *bio) +{ + return bio->bi_iter.bi_sector; +} + +/** + * process_vio_io() - Submits a vio's bio to the underlying block device. May block if the device + * is busy. This callback should be used by vios which did not attempt to merge. + */ +void process_vio_io(struct vdo_completion *completion) +{ + struct vio *vio = as_vio(completion); + + send_bio_to_device(vio, vio->bio); +} + +/** + * get_bio_list() - Extract the list of bios to submit from a vio. + * @vio: The vio submitting I/O. + * + * The list will always contain at least one entry (the bio for the vio on which it is called), but + * other bios may have been merged with it as well. + * + * Return: bio The head of the bio list to submit. + */ +static struct bio *get_bio_list(struct vio *vio) +{ + struct bio *bio; + struct io_submitter *submitter = vio->completion.vdo->io_submitter; + struct bio_queue_data *bio_queue_data = &(submitter->bio_queue_data[vio->bio_zone]); + + assert_in_bio_zone(vio); + + mutex_lock(&bio_queue_data->lock); + vdo_int_map_remove(bio_queue_data->map, get_bio_sector(vio->bios_merged.head)); + vdo_int_map_remove(bio_queue_data->map, get_bio_sector(vio->bios_merged.tail)); + bio = vio->bios_merged.head; + bio_list_init(&vio->bios_merged); + mutex_unlock(&bio_queue_data->lock); + + return bio; +} + +/** + * process_data_vio_io() - Submit a data_vio's bio to the storage below along with any bios that + * have been merged with it. + * + * Context: This call may block and so should only be called from a bio thread. + */ +static void process_data_vio_io(struct vdo_completion *completion) +{ + struct bio *bio, *next; + struct vio *vio = as_vio(completion); + + assert_in_bio_zone(vio); + for (bio = get_bio_list(vio); bio != NULL; bio = next) { + next = bio->bi_next; + bio->bi_next = NULL; + send_bio_to_device((struct vio *) bio->bi_private, bio); + } +} + +/** + * get_mergeable_locked() - Attempt to find an already queued bio that the current bio can be + * merged with. + * @map: The bio map to use for merging. + * @vio: The vio we want to merge. + * @back_merge: Set to true for a back merge, false for a front merge. + * + * There are two types of merging possible, forward and backward, which are distinguished by a flag + * that uses kernel elevator terminology. + * + * Return: the vio to merge to, NULL if no merging is possible. 
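+ * + * For example, assuming the usual 4K vdo block (8 512-byte sectors): for a vio whose bio starts + * at sector 80, a back merge looks up sector 72 and appends behind a queued vio whose merged tail + * begins there, while a front merge looks up sector 88 and prepends ahead of a queued vio whose + * merged head begins there.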
+ */ +static struct vio *get_mergeable_locked(struct int_map *map, struct vio *vio, bool back_merge) +{ + struct bio *bio = vio->bio; + sector_t merge_sector = get_bio_sector(bio); + struct vio *vio_merge; + + if (back_merge) + merge_sector -= VDO_SECTORS_PER_BLOCK; + else + merge_sector += VDO_SECTORS_PER_BLOCK; + + vio_merge = vdo_int_map_get(map, merge_sector); + + if (vio_merge == NULL) + return NULL; + + if (vio->completion.priority != vio_merge->completion.priority) + return NULL; + + if (bio_data_dir(bio) != bio_data_dir(vio_merge->bio)) + return NULL; + + if (bio_list_empty(&vio_merge->bios_merged)) + return NULL; + + if (back_merge) + return ((get_bio_sector(vio_merge->bios_merged.tail) == merge_sector) ? + vio_merge : + NULL); + + return ((get_bio_sector(vio_merge->bios_merged.head) == merge_sector) ? vio_merge : NULL); +} + +static int map_merged_vio(struct int_map *bio_map, struct vio *vio) +{ + int result; + + result = vdo_int_map_put(bio_map, get_bio_sector(vio->bios_merged.head), vio, true, NULL); + if (result != VDO_SUCCESS) + return result; + + return vdo_int_map_put(bio_map, get_bio_sector(vio->bios_merged.tail), vio, true, NULL); +} + +static int merge_to_prev_tail(struct int_map *bio_map, struct vio *vio, struct vio *prev_vio) +{ + vdo_int_map_remove(bio_map, get_bio_sector(prev_vio->bios_merged.tail)); + bio_list_merge(&prev_vio->bios_merged, &vio->bios_merged); + return map_merged_vio(bio_map, prev_vio); +} + +static int merge_to_next_head(struct int_map *bio_map, struct vio *vio, struct vio *next_vio) +{ + /* + * Handle "next merge" and "gap fill" cases the same way so as to reorder bios in a way + * that's compatible with using funnel queues in work queues. This avoids removing an + * existing completion. + */ + vdo_int_map_remove(bio_map, get_bio_sector(next_vio->bios_merged.head)); + bio_list_merge_head(&next_vio->bios_merged, &vio->bios_merged); + return map_merged_vio(bio_map, next_vio); +} + +/** + * try_bio_map_merge() - Attempt to merge a vio's bio with other pending I/Os. + * @vio: The vio to merge. + * + * Currently this is only used for data_vios, but is broken out for future use with metadata vios. + * + * Return: whether or not the vio was merged. + */ +static bool try_bio_map_merge(struct vio *vio) +{ + int result; + bool merged = true; + struct bio *bio = vio->bio; + struct vio *prev_vio, *next_vio; + struct vdo *vdo = vio->completion.vdo; + struct bio_queue_data *bio_queue_data = &vdo->io_submitter->bio_queue_data[vio->bio_zone]; + + bio->bi_next = NULL; + bio_list_init(&vio->bios_merged); + bio_list_add(&vio->bios_merged, bio); + + mutex_lock(&bio_queue_data->lock); + prev_vio = get_mergeable_locked(bio_queue_data->map, vio, true); + next_vio = get_mergeable_locked(bio_queue_data->map, vio, false); + if (prev_vio == next_vio) + next_vio = NULL; + + if ((prev_vio == NULL) && (next_vio == NULL)) { + /* no merge. just add to bio_queue */ + merged = false; + result = vdo_int_map_put(bio_queue_data->map, + get_bio_sector(bio), + vio, + true, + NULL); + } else if (next_vio == NULL) { + /* Only prev. merge to prev's tail */ + result = merge_to_prev_tail(bio_queue_data->map, vio, prev_vio); + } else { + /* Only next. merge to next's head */ + result = merge_to_next_head(bio_queue_data->map, vio, next_vio); + } + + mutex_unlock(&bio_queue_data->lock); + + /* We don't care about failure of int_map_put in this case. 
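+ * The bio is already captured on a bios_merged list (its own or a neighbour's), so it will + * still be submitted; a missing map entry only prevents later bios from merging with it.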
*/ + ASSERT_LOG_ONLY(result == UDS_SUCCESS, "bio map insertion succeeds"); + return merged; +} + +/** + * submit_data_vio_io() - Submit I/O for a data_vio. + * @data_vio: the data_vio for which to issue I/O. + * + * If possible, this I/O will be merged other pending I/Os. Otherwise, the data_vio will be sent to + * the appropriate bio zone directly. + */ +void submit_data_vio_io(struct data_vio *data_vio) +{ + if (try_bio_map_merge(&data_vio->vio)) + return; + + launch_data_vio_bio_zone_callback(data_vio, process_data_vio_io); +} + +/** + * vdo_submit_metadata_io() - Submit I/O for a metadata vio. + * @vio: the vio for which to issue I/O + * @physical: the physical block number to read or write + * @callback: the bio endio function which will be called after the I/O completes + * @error_handler: the handler for submission or I/O errors (may be NULL) + * @operation: the type of I/O to perform + * @data: the buffer to read or write (may be NULL) + * + * The vio is enqueued on a vdo bio queue so that bio submission (which may block) does not block + * other vdo threads. + * + * That the error handler will run on the correct thread is only true so long as the thread calling + * this function, and the thread set in the endio callback are the same, as well as the fact that + * no error can occur on the bio queue. Currently this is true for all callers, but additional care + * will be needed if this ever changes. + */ +void vdo_submit_metadata_io(struct vio *vio, + physical_block_number_t physical, + bio_end_io_t callback, + vdo_action *error_handler, + unsigned int operation, + char *data) +{ + struct vdo_completion *completion = &vio->completion; + int result; + const struct admin_state_code *code = vdo_get_admin_state(completion->vdo); + + + ASSERT_LOG_ONLY(!code->quiescent, "I/O not allowed in state %s", code->name); + ASSERT_LOG_ONLY(vio->bio->bi_next == NULL, "metadata bio has no next bio"); + + vdo_reset_completion(completion); + completion->error_handler = error_handler; + result = vio_reset_bio(vio, data, callback, operation | REQ_META, physical); + if (result != VDO_SUCCESS) { + continue_vio(vio, result); + return; + } + + vdo_set_completion_callback(completion, process_vio_io, get_vio_bio_zone_thread_id(vio)); + vdo_launch_completion_with_priority(completion, get_metadata_priority(vio)); +} + +/** + * vdo_make_io_submitter() - Create an io_submitter structure. + * @thread_count: Number of bio-submission threads to set up. + * @rotation_interval: Interval to use when rotating between bio-submission threads when enqueuing + * completions. + * @max_requests_active: Number of bios for merge tracking. + * @vdo: The vdo which will use this submitter. + * @io_submitter: pointer to the new data structure. + * + * Return: VDO_SUCCESS or an error. 
+ */ +int vdo_make_io_submitter(unsigned int thread_count, + unsigned int rotation_interval, + unsigned int max_requests_active, + struct vdo *vdo, + struct io_submitter **io_submitter_ptr) +{ + unsigned int i; + struct io_submitter *io_submitter; + int result; + + result = UDS_ALLOCATE_EXTENDED(struct io_submitter, + thread_count, + struct bio_queue_data, + "bio submission data", + &io_submitter); + if (result != UDS_SUCCESS) + return result; + + io_submitter->bio_queue_rotation_interval = rotation_interval; + + /* Setup for each bio-submission work queue */ + for (i = 0; i < thread_count; i++) { + struct bio_queue_data *bio_queue_data = &io_submitter->bio_queue_data[i]; + + mutex_init(&bio_queue_data->lock); + /* + * One I/O operation per request, but both first & last sector numbers. + * + * If requests are assigned to threads round-robin, they should be distributed + * quite evenly. But if they're assigned based on PBN, things can sometimes be very + * uneven. So for now, we'll assume that all requests *may* wind up on one thread, + * and thus all in the same map. + */ + result = vdo_make_int_map(max_requests_active * 2, 0, &bio_queue_data->map); + if (result != 0) { + /* + * Clean up the partially initialized bio-queue entirely and indicate that + * initialization failed. + */ + uds_log_error("bio map initialization failed %d", result); + vdo_cleanup_io_submitter(io_submitter); + vdo_free_io_submitter(io_submitter); + return result; + } + + bio_queue_data->queue_number = i; + result = vdo_make_thread(vdo, + vdo->thread_config.bio_threads[i], + &bio_queue_type, + 1, + (void **) &bio_queue_data); + if (result != VDO_SUCCESS) { + /* + * Clean up the partially initialized bio-queue entirely and indicate that + * initialization failed. + */ + vdo_free_int_map(UDS_FORGET(bio_queue_data->map)); + uds_log_error("bio queue initialization failed %d", result); + vdo_cleanup_io_submitter(io_submitter); + vdo_free_io_submitter(io_submitter); + return result; + } + + bio_queue_data->queue = vdo->threads[vdo->thread_config.bio_threads[i]].queue; + io_submitter->num_bio_queues_used++; + } + + *io_submitter_ptr = io_submitter; + + return VDO_SUCCESS; +} + +/** + * vdo_cleanup_io_submitter() - Tear down the io_submitter fields as needed for a physical layer. + * @io_submitter: The I/O submitter data to tear down (may be NULL). + */ +void vdo_cleanup_io_submitter(struct io_submitter *io_submitter) +{ + int i; + + if (io_submitter == NULL) + return; + + for (i = io_submitter->num_bio_queues_used - 1; i >= 0; i--) + vdo_finish_work_queue(io_submitter->bio_queue_data[i].queue); +} + +/** + * vdo_free_io_submitter() - Free the io_submitter fields and structure as needed. + * @io_submitter: The I/O submitter data to destroy. + * + * This must be called after vdo_cleanup_io_submitter(). It is used to release resources late in + * the shutdown process to avoid or reduce the chance of race conditions. + */ +void vdo_free_io_submitter(struct io_submitter *io_submitter) +{ + int i; + + if (io_submitter == NULL) + return; + + for (i = io_submitter->num_bio_queues_used - 1; i >= 0; i--) { + io_submitter->num_bio_queues_used--; + /* vdo_destroy() will free the work queue, so just give up our reference to it. 
*/ + UDS_FORGET(io_submitter->bio_queue_data[i].queue); + vdo_free_int_map(UDS_FORGET(io_submitter->bio_queue_data[i].map)); + } + UDS_FREE(io_submitter); +} diff --git a/drivers/md/dm-vdo/io-submitter.h b/drivers/md/dm-vdo/io-submitter.h new file mode 100644 index 000000000000..f21ae2630ff8 --- /dev/null +++ b/drivers/md/dm-vdo/io-submitter.h @@ -0,0 +1,52 @@ +/* SPDX-License-Identifier: GPL-2.0-only */ +/* + * Copyright 2023 Red Hat + */ + +#ifndef VDO_IO_SUBMITTER_H +#define VDO_IO_SUBMITTER_H + +#include + +#include "types.h" + +struct io_submitter; + +int vdo_make_io_submitter(unsigned int thread_count, + unsigned int rotation_interval, + unsigned int max_requests_active, + struct vdo *vdo, + struct io_submitter **io_submitter); + +void vdo_cleanup_io_submitter(struct io_submitter *io_submitter); + +void vdo_free_io_submitter(struct io_submitter *io_submitter); + +void process_vio_io(struct vdo_completion *completion); + +void submit_data_vio_io(struct data_vio *data_vio); + +void vdo_submit_metadata_io(struct vio *vio, + physical_block_number_t physical, + bio_end_io_t callback, + vdo_action *error_handler, + unsigned int operation, + char *data); + +static inline void submit_metadata_vio(struct vio *vio, + physical_block_number_t physical, + bio_end_io_t callback, + vdo_action *error_handler, + unsigned int operation) +{ + vdo_submit_metadata_io(vio, physical, callback, error_handler, operation, vio->data); +} + +static inline void +submit_flush_vio(struct vio *vio, bio_end_io_t callback, vdo_action *error_handler) +{ + /* FIXME: Can we just use REQ_OP_FLUSH? */ + vdo_submit_metadata_io(vio, 0, callback, error_handler, REQ_OP_WRITE | REQ_PREFLUSH, NULL); +} + +#endif /* VDO_IO_SUBMITTER_H */ From patchwork Thu Oct 26 21:41:19 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Matthew Sakai X-Patchwork-Id: 13437953 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 5736BC27C4B for ; Thu, 26 Oct 2023 21:44:30 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1344958AbjJZVoa (ORCPT ); Thu, 26 Oct 2023 17:44:30 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:41000 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1344901AbjJZVoQ (ORCPT ); Thu, 26 Oct 2023 17:44:16 -0400 Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.129.124]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 45151D49 for ; Thu, 26 Oct 2023 14:41:45 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1698356504; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=g3JqTagz+moyBflhV/aTh+qHF07usH+Ko892gn5V4E4=; b=Z0G93phrJmIi6LtN3H2s9VvlSlmYr7/NUfbFJaJqdToNu1/P8ZOeXWt+Lw3lwnqqpKXeFb fXWxngsTuWBDCTICTlNL9eaGlSR8EKYqX2a8edQpOHKe6Tg0XQyfosSCECwolnO5p2v0Gc c7xXWbY55kBZ4I030CuAc7PANgkEHMg= Received: from mimecast-mx02.redhat.com (mx-ext.redhat.com [66.187.233.73]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.3, cipher=TLS_AES_256_GCM_SHA384) id us-mta-465-STgw2Jc9NrC_idVEamUCQg-1; Thu, 26 Oct 2023 17:41:41 -0400 
X-MC-Unique: STgw2Jc9NrC_idVEamUCQg-1 Received: from smtp.corp.redhat.com (int-mx02.intmail.prod.int.rdu2.redhat.com [10.11.54.2]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (2048 bits) server-digest SHA256) (No client certificate requested) by mimecast-mx02.redhat.com (Postfix) with ESMTPS id F40B11C04B73; Thu, 26 Oct 2023 21:41:40 +0000 (UTC) Received: from pbitcolo-build-10.permabit.com (pbitcolo-build-10.permabit.lab.eng.bos.redhat.com [10.19.117.76]) by smtp.corp.redhat.com (Postfix) with ESMTP id D607240C6F79; Thu, 26 Oct 2023 21:41:40 +0000 (UTC) Received: by pbitcolo-build-10.permabit.com (Postfix, from userid 1138) id 8B63C3003E; Thu, 26 Oct 2023 17:41:40 -0400 (EDT) From: Matthew Sakai To: dm-devel@lists.linux.dev, linux-block@vger.kernel.org Cc: Matthew Sakai Subject: [PATCH v4 22/39] dm vdo: add hash locks and hash zones Date: Thu, 26 Oct 2023 17:41:19 -0400 Message-Id: <20231026214136.1067410-23-msakai@redhat.com> In-Reply-To: <20231026214136.1067410-1-msakai@redhat.com> References: <20231026214136.1067410-1-msakai@redhat.com> MIME-Version: 1.0 X-Scanned-By: MIMEDefang 3.4.1 on 10.11.54.2 Precedence: bulk List-ID: X-Mailing-List: linux-block@vger.kernel.org In order to deduplicate concurrent writes of the same data (to different locations), data_vios which are writing the same data are grouped together in a "hash lock," named for and keyed by the hash of the data being written. Each hash lock is assigned to a hash zone based on a portion of its hash. Co-developed-by: J. corwin Coburn Signed-off-by: J. corwin Coburn Co-developed-by: Michael Sclafani Signed-off-by: Michael Sclafani Co-developed-by: Sweet Tea Dorminy Signed-off-by: Sweet Tea Dorminy Signed-off-by: Matthew Sakai --- drivers/md/dm-vdo/dedupe.c | 2450 ++++++++++++++++++++++++++++++++++++ drivers/md/dm-vdo/dedupe.h | 93 ++ 2 files changed, 2543 insertions(+) create mode 100644 drivers/md/dm-vdo/dedupe.c create mode 100644 drivers/md/dm-vdo/dedupe.h diff --git a/drivers/md/dm-vdo/dedupe.c b/drivers/md/dm-vdo/dedupe.c new file mode 100644 index 000000000000..cf0fbdc30ca5 --- /dev/null +++ b/drivers/md/dm-vdo/dedupe.c @@ -0,0 +1,2450 @@ +// SPDX-License-Identifier: GPL-2.0-only +/* + * Copyright 2023 Red Hat + */ + +/** + * DOC: + * + * Hash Locks: + * + * A hash_lock controls and coordinates writing, index access, and dedupe among groups of data_vios + * concurrently writing identical blocks, allowing them to deduplicate not only against advice but + * also against each other. This saves on index queries and allows those data_vios to concurrently + * deduplicate against a single block instead of being serialized through a PBN read lock. Only one + * index query is needed for each hash_lock, instead of one for every data_vio. + * + * A hash_lock acts like a state machine perhaps more than as a lock. Other than the starting and + * ending states INITIALIZING and BYPASSING, every state represents and is held for the duration of + * an asynchronous operation. All state transitions are performed on the thread of the hash_zone + * containing the lock. An asynchronous operation is almost always performed upon entering a state, + * and the callback from that operation triggers exiting the state and entering a new state. + * + * In all states except DEDUPING, there is a single data_vio, called the lock agent, performing the + * asynchronous operations on behalf of the lock. 
The agent will change during the lifetime of the + * lock if the lock is shared by more than one data_vio. data_vios waiting to deduplicate are kept + * on a wait queue. Viewed a different way, the agent holds the lock exclusively until the lock + * enters the DEDUPING state, at which point it becomes a shared lock that all the waiters (and any + * new data_vios that arrive) use to share a PBN lock. In state DEDUPING, there is no agent. When + * the last data_vio in the lock calls back in DEDUPING, it becomes the agent and the lock becomes + * exclusive again. New data_vios that arrive in the lock will also go on the wait queue. + * + * The existence of lock waiters is a key factor controlling which state the lock transitions to + * next. When the lock is new or has waiters, it will always try to reach DEDUPING, and when it + * doesn't, it will try to clean up and exit. + * + * Deduping requires holding a PBN lock on a block that is known to contain data identical to the + * data_vios in the lock, so the lock will send the agent to the duplicate zone to acquire the PBN + * lock (LOCKING), to the kernel I/O threads to read and verify the data (VERIFYING), or to write a + * new copy of the data to a full data block or a slot in a compressed block (WRITING). + * + * Cleaning up consists of updating the index when the data location is different from the initial + * index query (UPDATING, triggered by stale advice, compression, and rollover), releasing the PBN + * lock on the duplicate block (UNLOCKING), and if the agent is the last data_vio referencing the + * lock, releasing the hash_lock itself back to the hash zone (BYPASSING). + * + * The shortest sequence of states is for non-concurrent writes of new data: + * INITIALIZING -> QUERYING -> WRITING -> BYPASSING + * This sequence is short because no PBN read lock or index update is needed. + * + * Non-concurrent, finding valid advice looks like this (endpoints elided): + * -> QUERYING -> LOCKING -> VERIFYING -> DEDUPING -> UNLOCKING -> + * Or with stale advice (endpoints elided): + * -> QUERYING -> LOCKING -> VERIFYING -> UNLOCKING -> WRITING -> UPDATING -> + * + * When there are not enough available reference count increments available on a PBN for a data_vio + * to deduplicate, a new lock is forked and the excess waiters roll over to the new lock (which + * goes directly to WRITING). The new lock takes the place of the old lock in the lock map so new + * data_vios will be directed to it. The two locks will proceed independently, but only the new + * lock will have the right to update the index (unless it also forks). + * + * Since rollover happens in a lock instance, once a valid data location has been selected, it will + * not change. QUERYING and WRITING are only performed once per lock lifetime. All other + * non-endpoint states can be re-entered. + * + * The function names in this module follow a convention referencing the states and transitions in + * the state machine. For example, for the LOCKING state, there are start_locking() and + * finish_locking() functions. start_locking() is invoked by the finish function of the state (or + * states) that transition to LOCKING. It performs the actual lock state change and must be invoked + * on the hash zone thread. finish_locking() is called by (or continued via callback from) the + * code actually obtaining the lock. It does any bookkeeping or decision-making required and + * invokes the appropriate start function of the state being transitioned to after LOCKING. 
+ * + * ---------------------------------------------------------------------- + * + * Index Queries: + * + * A query to the UDS index is handled asynchronously by the index's threads. When the query is + * complete, a callback supplied with the query will be called from one of those threads. Under + * heavy system load, the index may be slower to respond than is desirable for reasonable I/O + * throughput. Since deduplication of writes is not necessary for correct operation of a VDO + * device, it is acceptable to time out slow index queries and proceed to fulfill a write + * request without deduplicating. However, because the uds_request struct itself is supplied by the + * caller, we cannot simply reuse a uds_request object which we have chosen to time out. Hence, + * each hash_zone maintains a pool of dedupe_contexts which each contain a uds_request along with a + * reference to the data_vio on behalf of which they are performing a query. + * + * When a hash_lock needs to query the index, it attempts to acquire an unused dedupe_context from + * its hash_zone's pool. If one is available, that context is prepared, associated with the + * hash_lock's agent, added to the list of pending contexts, and then sent to the index. The + * context's state will be transitioned from DEDUPE_CONTEXT_IDLE to DEDUPE_CONTEXT_PENDING. If all + * goes well, the dedupe callback will be called by the index which will change the context's state + * to DEDUPE_CONTEXT_COMPLETE, and the associated data_vio will be enqueued to run back in the hash + * zone where the query results will be processed and the context will be put back in the idle + * state and returned to the hash_zone's available list. + * + * The first time an index query is launched from a given hash_zone, a timer is started. When the + * timer fires, the hash_zone's completion is enqueued to run in the hash_zone where the zone's + * pending list will be searched for any contexts in the pending state which have been running for + * too long. Those contexts are transitioned to the DEDUPE_CONTEXT_TIMED_OUT state and moved to the + * zone's timed_out list where they won't be examined again if there is a subsequent time out. The + * data_vios associated with timed out contexts are sent to continue processing their write + * operation without deduplicating. The timer is also restarted. + * + * When the dedupe callback is run for a context which is in the timed out state, that context is + * moved to the DEDUPE_CONTEXT_TIMED_OUT_COMPLETE state. No other action need be taken as the + * associated data_vios have already been dispatched. + * + * If a hash_lock needs a dedupe context, and the available list is empty, the timed_out list will + * be searched for any contexts which are timed out and complete. One of these will be used + * immediately, and the rest will be returned to the available list and marked idle.
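+ * + * In short, a dedupe_context normally moves IDLE -> PENDING -> COMPLETE and back to idle + * (DEDUPE_CONTEXT_ prefixes elided); when the index is slow it instead moves PENDING -> TIMED_OUT + * -> TIMED_OUT_COMPLETE and only becomes idle again when it is reclaimed from the timed_out list.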
+ */ + +#include "dedupe.h" + +#include +#include +#include +#include +#include +#include +#include +#include + +#include "logger.h" +#include "memory-alloc.h" +#include "numeric.h" +#include "permassert.h" +#include "string-utils.h" +#include "uds.h" + +#include "action-manager.h" +#include "admin-state.h" +#include "completion.h" +#include "constants.h" +#include "data-vio.h" +#include "io-submitter.h" +#include "packer.h" +#include "physical-zone.h" +#include "pointer-map.h" +#include "slab-depot.h" +#include "statistics.h" +#include "types.h" +#include "vdo.h" +#include "wait-queue.h" + +enum hash_lock_state { + /* State for locks that are not in use or are being initialized. */ + VDO_HASH_LOCK_INITIALIZING, + + /* This is the sequence of states typically used on the non-dedupe path. */ + VDO_HASH_LOCK_QUERYING, + VDO_HASH_LOCK_WRITING, + VDO_HASH_LOCK_UPDATING, + + /* The remaining states are typically used on the dedupe path in this order. */ + VDO_HASH_LOCK_LOCKING, + VDO_HASH_LOCK_VERIFYING, + VDO_HASH_LOCK_DEDUPING, + VDO_HASH_LOCK_UNLOCKING, + + /* + * Terminal state for locks returning to the pool. Must be last both because it's the final + * state, and also because it's used to count the states. + */ + VDO_HASH_LOCK_BYPASSING, +}; + +static const char * const LOCK_STATE_NAMES[] = { + [VDO_HASH_LOCK_BYPASSING] = "BYPASSING", + [VDO_HASH_LOCK_DEDUPING] = "DEDUPING", + [VDO_HASH_LOCK_INITIALIZING] = "INITIALIZING", + [VDO_HASH_LOCK_LOCKING] = "LOCKING", + [VDO_HASH_LOCK_QUERYING] = "QUERYING", + [VDO_HASH_LOCK_UNLOCKING] = "UNLOCKING", + [VDO_HASH_LOCK_UPDATING] = "UPDATING", + [VDO_HASH_LOCK_VERIFYING] = "VERIFYING", + [VDO_HASH_LOCK_WRITING] = "WRITING", +}; + +struct hash_lock { + /* The block hash covered by this lock */ + struct uds_record_name hash; + + /* When the lock is unused, this list entry allows the lock to be pooled */ + struct list_head pool_node; + + /* + * A list containing the data VIOs sharing this lock, all having the same record name and + * data block contents, linked by their hash_lock_node fields. + */ + struct list_head duplicate_ring; + + /* The number of data_vios sharing this lock instance */ + data_vio_count_t reference_count; + + /* The maximum value of reference_count in the lifetime of this lock */ + data_vio_count_t max_references; + + /* The current state of this lock */ + enum hash_lock_state state; + + /* True if the UDS index should be updated with new advice */ + bool update_advice; + + /* True if the advice has been verified to be a true duplicate */ + bool verified; + + /* True if the lock has already accounted for an initial verification */ + bool verify_counted; + + /* True if this lock is registered in the lock map (cleared on rollover) */ + bool registered; + + /* + * If verified is false, this is the location of a possible duplicate. If verified is true, + * it is the verified location of a true duplicate. + */ + struct zoned_pbn duplicate; + + /* The PBN lock on the block containing the duplicate data */ + struct pbn_lock *duplicate_lock; + + /* The data_vio designated to act on behalf of the lock */ + struct data_vio *agent; + + /* + * Other data_vios with data identical to the agent who are currently waiting for the agent + * to get the information they all need to deduplicate--either against each other, or + * against an existing duplicate on disk. 
+ */ + struct wait_queue waiters; +}; + +enum { + LOCK_POOL_CAPACITY = MAXIMUM_VDO_USER_VIOS, +}; + +struct hash_zones { + struct action_manager *manager; + struct kobject dedupe_directory; + struct uds_parameters parameters; + struct uds_index_session *index_session; + struct ratelimit_state ratelimiter; + atomic64_t timeouts; + atomic64_t dedupe_context_busy; + + /* This spinlock protects the state fields and the starting of dedupe requests. */ + spinlock_t lock; + + /* The fields in the next block are all protected by the lock */ + struct vdo_completion completion; + enum index_state index_state; + enum index_state index_target; + struct admin_state state; + bool changing; + bool create_flag; + bool dedupe_flag; + bool error_flag; + u64 reported_timeouts; + + /* The number of zones */ + zone_count_t zone_count; + /* The hash zones themselves */ + struct hash_zone zones[]; +}; + +static inline struct hash_zone *as_hash_zone(struct vdo_completion *completion) +{ + vdo_assert_completion_type(completion, VDO_HASH_ZONE_COMPLETION); + return container_of(completion, struct hash_zone, completion); +} + +static inline struct hash_zones *as_hash_zones(struct vdo_completion *completion) +{ + vdo_assert_completion_type(completion, VDO_HASH_ZONES_COMPLETION); + return container_of(completion, struct hash_zones, completion); +} + +static inline void assert_in_hash_zone(struct hash_zone *zone, const char *name) +{ + ASSERT_LOG_ONLY((vdo_get_callback_thread_id() == zone->thread_id), + "%s called on hash zone thread", + name); +} + +/** + * return_hash_lock_to_pool() - (Re)initialize a hash lock and return it to its pool. + * @zone: The zone from which the lock was borrowed. + * @lock: The lock that is no longer in use. + */ +static void return_hash_lock_to_pool(struct hash_zone *zone, struct hash_lock *lock) +{ + memset(lock, 0, sizeof(*lock)); + INIT_LIST_HEAD(&lock->pool_node); + INIT_LIST_HEAD(&lock->duplicate_ring); + vdo_initialize_wait_queue(&lock->waiters); + list_add_tail(&lock->pool_node, &zone->lock_pool); +} + +/** + * vdo_get_duplicate_lock() - Get the PBN lock on the duplicate data location for a data_vio from + * the hash_lock the data_vio holds (if there is one). + * @data_vio: The data_vio to query. + * + * Return: The PBN lock on the data_vio's duplicate location. + */ +struct pbn_lock *vdo_get_duplicate_lock(struct data_vio *data_vio) +{ + if (data_vio->hash_lock == NULL) + return NULL; + return data_vio->hash_lock->duplicate_lock; +} + +/** + * get_hash_lock_state_name() - Get the string representation of a hash lock state. + * @state: The hash lock state. + * + * Return: The short string representing the state + */ +static const char *get_hash_lock_state_name(enum hash_lock_state state) +{ + /* Catch if a state has been added without updating the name array. */ + STATIC_ASSERT((VDO_HASH_LOCK_BYPASSING + 1) == ARRAY_SIZE(LOCK_STATE_NAMES)); + return (state < ARRAY_SIZE(LOCK_STATE_NAMES)) ? LOCK_STATE_NAMES[state] : "INVALID"; +} + +/** + * assert_hash_lock_agent() - Assert that a data_vio is the agent of its hash lock, and that this + * is being called in the hash zone. + * @data_vio: The data_vio expected to be the lock agent. + * @where: A string describing the function making the assertion. + */ +static void assert_hash_lock_agent(struct data_vio *data_vio, const char *where) +{ + /* Not safe to access the agent field except from the hash zone. 
*/ + assert_data_vio_in_hash_zone(data_vio); + ASSERT_LOG_ONLY(data_vio == data_vio->hash_lock->agent, + "%s must be for the hash lock agent", where); +} + +/** + * set_duplicate_lock() - Set the duplicate lock held by a hash lock. May only be called in the + * physical zone of the PBN lock. + * @hash_lock: The hash lock to update. + * @pbn_lock: The PBN read lock to use as the duplicate lock. + */ +static void set_duplicate_lock(struct hash_lock *hash_lock, struct pbn_lock *pbn_lock) +{ + ASSERT_LOG_ONLY((hash_lock->duplicate_lock == NULL), + "hash lock must not already hold a duplicate lock"); + + pbn_lock->holder_count += 1; + hash_lock->duplicate_lock = pbn_lock; +} + +/** + * dequeue_lock_waiter() - Remove the first data_vio from the lock's wait queue and return it. + * @lock: The lock containing the wait queue. + * + * Return: The first (oldest) waiter in the queue, or NULL if the queue is empty. + */ +static inline struct data_vio *dequeue_lock_waiter(struct hash_lock *lock) +{ + return waiter_as_data_vio(vdo_dequeue_next_waiter(&lock->waiters)); +} + +/** + * set_hash_lock() - Set, change, or clear the hash lock a data_vio is using. + * @data_vio: The data_vio to update. + * @new_lock: The hash lock the data_vio is joining. + * + * Updates the hash lock (or locks) to reflect the change in membership. + */ +static void set_hash_lock(struct data_vio *data_vio, struct hash_lock *new_lock) +{ + struct hash_lock *old_lock = data_vio->hash_lock; + + if (old_lock != NULL) { + ASSERT_LOG_ONLY(data_vio->hash_zone != NULL, + "must have a hash zone when holding a hash lock"); + ASSERT_LOG_ONLY(!list_empty(&data_vio->hash_lock_entry), + "must be on a hash lock ring when holding a hash lock"); + ASSERT_LOG_ONLY(old_lock->reference_count > 0, + "hash lock reference must be counted"); + + if ((old_lock->state != VDO_HASH_LOCK_BYPASSING) && + (old_lock->state != VDO_HASH_LOCK_UNLOCKING)) + /* + * If the reference count goes to zero in a non-terminal state, we're most + * likely leaking this lock. + */ + ASSERT_LOG_ONLY(old_lock->reference_count > 1, + "hash locks should only become unreferenced in a terminal state, not state %s", + get_hash_lock_state_name(old_lock->state)); + + list_del_init(&data_vio->hash_lock_entry); + old_lock->reference_count -= 1; + + data_vio->hash_lock = NULL; + } + + if (new_lock != NULL) { + /* + * Keep all data_vios sharing the lock on a ring since they can complete in any + * order and we'll always need a pointer to one to compare data. + */ + list_move_tail(&data_vio->hash_lock_entry, &new_lock->duplicate_ring); + new_lock->reference_count += 1; + if (new_lock->max_references < new_lock->reference_count) + new_lock->max_references = new_lock->reference_count; + + data_vio->hash_lock = new_lock; + } +} + +/* There are loops in the state diagram, so some forward decl's are needed. */ +static void start_deduping(struct hash_lock *lock, struct data_vio *agent, bool agent_is_done); +static void start_locking(struct hash_lock *lock, struct data_vio *agent); +static void start_writing(struct hash_lock *lock, struct data_vio *agent); +static void unlock_duplicate_pbn(struct vdo_completion *completion); +static void transfer_allocation_lock(struct data_vio *data_vio); + +/** + * exit_hash_lock() - Bottleneck for data_vios that have written or deduplicated and that are no + * longer needed to be an agent for the hash lock. + * @data_vio: The data_vio to complete and send to be cleaned up. 
+ */ +static void exit_hash_lock(struct data_vio *data_vio) +{ + /* Release the hash lock now, saving a thread transition in cleanup. */ + vdo_release_hash_lock(data_vio); + + /* Complete the data_vio and start the clean-up path to release any locks it still holds. */ + data_vio->vio.completion.callback = complete_data_vio; + + continue_data_vio(data_vio); +} + +/** + * set_duplicate_location() - Set the location of the duplicate block for data_vio, updating the + * is_duplicate and duplicate fields from a zoned_pbn. + * @data_vio: The data_vio to modify. + * @source: The location of the duplicate. + */ +static void set_duplicate_location(struct data_vio *data_vio, const struct zoned_pbn source) +{ + data_vio->is_duplicate = (source.pbn != VDO_ZERO_BLOCK); + data_vio->duplicate = source; +} + +/** + * retire_lock_agent() - Retire the active lock agent, replacing it with the first lock waiter, and + * make the retired agent exit the hash lock. + * @lock: The hash lock to update. + * + * Return: The new lock agent (which will be NULL if there was no waiter) + */ +static struct data_vio *retire_lock_agent(struct hash_lock *lock) +{ + struct data_vio *old_agent = lock->agent; + struct data_vio *new_agent = dequeue_lock_waiter(lock); + + lock->agent = new_agent; + exit_hash_lock(old_agent); + if (new_agent != NULL) + set_duplicate_location(new_agent, lock->duplicate); + return new_agent; +} + +/** + * wait_on_hash_lock() - Add a data_vio to the lock's queue of waiters. + * @lock: The hash lock on which to wait. + * @data_vio: The data_vio to add to the queue. + */ +static void wait_on_hash_lock(struct hash_lock *lock, struct data_vio *data_vio) +{ + vdo_enqueue_waiter(&lock->waiters, &data_vio->waiter); + + /* + * Make sure the agent doesn't block indefinitely in the packer since it now has at least + * one other data_vio waiting on it. + */ + if ((lock->state != VDO_HASH_LOCK_WRITING) || !cancel_data_vio_compression(lock->agent)) + return; + + /* + * Even though we're waiting, we also have to send ourselves as a one-way message to the + * packer to ensure the agent continues executing. This is safe because + * cancel_vio_compression() guarantees the agent won't continue executing until this + * message arrives in the packer, and because the wait queue link isn't used for sending + * the message. + */ + data_vio->compression.lock_holder = lock->agent; + launch_data_vio_packer_callback(data_vio, vdo_remove_lock_holder_from_packer); +} + +/** + * abort_waiter() - waiter_callback function that shunts waiters to write their blocks without + * optimization. + * @waiter: The data_vio's waiter link. + * @context: Not used. + */ +static void abort_waiter(struct waiter *waiter, void *context __always_unused) +{ + write_data_vio(waiter_as_data_vio(waiter)); +} + +/** + * start_bypassing() - Stop using the hash lock. + * @lock: The hash lock. + * @agent: The data_vio acting as the agent for the lock. + * + * Stops using the hash lock. This is the final transition for hash locks which did not get an + * error. 
+ */ +static void start_bypassing(struct hash_lock *lock, struct data_vio *agent) +{ + lock->state = VDO_HASH_LOCK_BYPASSING; + exit_hash_lock(agent); +} + +void vdo_clean_failed_hash_lock(struct data_vio *data_vio) +{ + struct hash_lock *lock = data_vio->hash_lock; + + if (lock->state == VDO_HASH_LOCK_BYPASSING) { + exit_hash_lock(data_vio); + return; + } + + if (lock->agent == NULL) { + lock->agent = data_vio; + } else if (data_vio != lock->agent) { + exit_hash_lock(data_vio); + return; + } + + lock->state = VDO_HASH_LOCK_BYPASSING; + + /* Ensure we don't attempt to update advice when cleaning up. */ + lock->update_advice = false; + + vdo_notify_all_waiters(&lock->waiters, abort_waiter, NULL); + + if (lock->duplicate_lock != NULL) { + /* The agent must reference the duplicate zone to launch it. */ + data_vio->duplicate = lock->duplicate; + launch_data_vio_duplicate_zone_callback(data_vio, unlock_duplicate_pbn); + return; + } + + lock->agent = NULL; + data_vio->is_duplicate = false; + exit_hash_lock(data_vio); +} + +/** + * finish_unlocking() - Handle the result of the agent for the lock releasing a read lock on + * duplicate candidate. + * @completion: The completion of the data_vio acting as the lock's agent. + * + * This continuation is registered in unlock_duplicate_pbn(). + */ +static void finish_unlocking(struct vdo_completion *completion) +{ + struct data_vio *agent = as_data_vio(completion); + struct hash_lock *lock = agent->hash_lock; + + assert_hash_lock_agent(agent, __func__); + + ASSERT_LOG_ONLY(lock->duplicate_lock == NULL, + "must have released the duplicate lock for the hash lock"); + + if (!lock->verified) { + /* + * UNLOCKING -> WRITING transition: The lock we released was on an unverified + * block, so it must have been a lock on advice we were verifying, not on a + * location that was used for deduplication. Go write (or compress) the block to + * get a location to dedupe against. + */ + start_writing(lock, agent); + return; + } + + /* + * With the lock released, the verified duplicate block may already have changed and will + * need to be re-verified if a waiter arrived. + */ + lock->verified = false; + + if (vdo_has_waiters(&lock->waiters)) { + /* + * UNLOCKING -> LOCKING transition: A new data_vio entered the hash lock while the + * agent was releasing the PBN lock. The current agent exits and the waiter has to + * re-lock and re-verify the duplicate location. + * + * TODO: If we used the current agent to re-acquire the PBN lock we wouldn't need + * to re-verify. + */ + agent = retire_lock_agent(lock); + start_locking(lock, agent); + return; + } + + /* + * UNLOCKING -> BYPASSING transition: The agent is done with the lock and no other + * data_vios reference it, so remove it from the lock map and return it to the pool. + */ + start_bypassing(lock, agent); +} + +/** + * unlock_duplicate_pbn() - Release a read lock on the PBN of the block that may or may not have + * contained duplicate data. + * @completion: The completion of the data_vio acting as the lock's agent. + * + * This continuation is launched by start_unlocking(), and calls back to finish_unlocking() on the + * hash zone thread. 
+ */ +static void unlock_duplicate_pbn(struct vdo_completion *completion) +{ + struct data_vio *agent = as_data_vio(completion); + struct hash_lock *lock = agent->hash_lock; + + assert_data_vio_in_duplicate_zone(agent); + ASSERT_LOG_ONLY(lock->duplicate_lock != NULL, "must have a duplicate lock to release"); + + vdo_release_physical_zone_pbn_lock(agent->duplicate.zone, + agent->duplicate.pbn, + UDS_FORGET(lock->duplicate_lock)); + if (lock->state == VDO_HASH_LOCK_BYPASSING) { + complete_data_vio(completion); + return; + } + + launch_data_vio_hash_zone_callback(agent, finish_unlocking); +} + +/** + * start_unlocking() - Release a read lock on the PBN of the block that may or may not have + * contained duplicate data. + * @lock: The hash lock. + * @agent: The data_vio currently acting as the agent for the lock. + */ +static void start_unlocking(struct hash_lock *lock, struct data_vio *agent) +{ + lock->state = VDO_HASH_LOCK_UNLOCKING; + launch_data_vio_duplicate_zone_callback(agent, unlock_duplicate_pbn); +} + +static void release_context(struct dedupe_context *context) +{ + struct hash_zone *zone = context->zone; + + WRITE_ONCE(zone->active, zone->active - 1); + list_move(&context->list_entry, &zone->available); +} + +static void process_update_result(struct data_vio *agent) +{ + struct dedupe_context *context = agent->dedupe_context; + + if ((context == NULL) || + !change_context_state(context, DEDUPE_CONTEXT_COMPLETE, DEDUPE_CONTEXT_IDLE)) + return; + + release_context(context); +} + +/** + * finish_updating() - Process the result of a UDS update performed by the agent for the lock. + * @completion: The completion of the data_vio that performed the update + * + * This continuation is registered in start_querying(). + */ +static void finish_updating(struct vdo_completion *completion) +{ + struct data_vio *agent = as_data_vio(completion); + struct hash_lock *lock = agent->hash_lock; + + assert_hash_lock_agent(agent, __func__); + + process_update_result(agent); + + /* + * UDS was updated successfully, so don't update again unless the duplicate location + * changes due to rollover. + */ + lock->update_advice = false; + + if (vdo_has_waiters(&lock->waiters)) { + /* + * UPDATING -> DEDUPING transition: A new data_vio arrived during the UDS update. + * Send it on the verified dedupe path. The agent is done with the lock, but the + * lock may still need to use it to clean up after rollover. + */ + start_deduping(lock, agent, true); + return; + } + + if (lock->duplicate_lock != NULL) { + /* + * UPDATING -> UNLOCKING transition: No one is waiting to dedupe, but we hold a + * duplicate PBN lock, so go release it. + */ + start_unlocking(lock, agent); + return; + } + + /* + * UPDATING -> BYPASSING transition: No one is waiting to dedupe and there's no lock to + * release. + */ + start_bypassing(lock, agent); +} + +static void query_index(struct data_vio *data_vio, enum uds_request_type operation); + +/** + * start_updating() - Continue deduplication with the last step, updating UDS with the location of + * the duplicate that should be returned as advice in the future. + * @lock: The hash lock. + * @agent: The data_vio currently acting as the agent for the lock. 
+ */ +static void start_updating(struct hash_lock *lock, struct data_vio *agent) +{ + lock->state = VDO_HASH_LOCK_UPDATING; + + ASSERT_LOG_ONLY(lock->verified, "new advice should have been verified"); + ASSERT_LOG_ONLY(lock->update_advice, "should only update advice if needed"); + + agent->last_async_operation = VIO_ASYNC_OP_UPDATE_DEDUPE_INDEX; + set_data_vio_hash_zone_callback(agent, finish_updating); + query_index(agent, UDS_UPDATE); +} + +/** + * finish_deduping() - Handle a data_vio that has finished deduplicating against the block locked + * by the hash lock. + * @lock: The hash lock. + * @data_vio: The lock holder that has finished deduplicating. + * + * If there are other data_vios still sharing the lock, this will just release the data_vio's share + * of the lock and finish processing the data_vio. If this is the last data_vio holding the lock, + * this makes the data_vio the lock agent and uses it to advance the state of the lock so it can + * eventually be released. + */ +static void finish_deduping(struct hash_lock *lock, struct data_vio *data_vio) +{ + struct data_vio *agent = data_vio; + + ASSERT_LOG_ONLY(lock->agent == NULL, "shouldn't have an agent in DEDUPING"); + ASSERT_LOG_ONLY(!vdo_has_waiters(&lock->waiters), + "shouldn't have any lock waiters in DEDUPING"); + + /* Just release the lock reference if other data_vios are still deduping. */ + if (lock->reference_count > 1) { + exit_hash_lock(data_vio); + return; + } + + /* The hash lock must have an agent for all other lock states. */ + lock->agent = agent; + if (lock->update_advice) + /* + * DEDUPING -> UPDATING transition: The location of the duplicate block changed + * since the initial UDS query because of compression, rollover, or because the + * query agent didn't have an allocation. The UDS update was delayed in case there + * was another change in location, but with only this data_vio using the hash lock, + * it's time to update the advice. + */ + start_updating(lock, agent); + else + /* + * DEDUPING -> UNLOCKING transition: Release the PBN read lock on the duplicate + * location so the hash lock itself can be released (contingent on no new data_vios + * arriving in the lock before the agent returns). + */ + start_unlocking(lock, agent); +} + +/** + * acquire_lock() - Get the lock for a record name. + * @zone: The zone responsible for the hash. + * @hash: The hash to lock. + * @replace_lock: If non-NULL, the lock already registered for the hash which should be replaced by + * the new lock. + * @lock_ptr: A pointer to receive the hash lock. + * + * Gets the lock for the hash (record name) of the data in a data_vio, or if one does not exist (or + * if we are explicitly rolling over), initialize a new lock for the hash and register it in the + * zone. This must only be called in the correct thread for the zone. + * + * Return: VDO_SUCCESS or an error code. + */ +static int __must_check acquire_lock(struct hash_zone *zone, + const struct uds_record_name *hash, + struct hash_lock *replace_lock, + struct hash_lock **lock_ptr) +{ + struct hash_lock *lock, *new_lock; + int result; + + /* + * Borrow and prepare a lock from the pool so we don't have to do two pointer_map accesses + * in the common case of no lock contention. 
+ */ + result = ASSERT(!list_empty(&zone->lock_pool), "never need to wait for a free hash lock"); + if (result != VDO_SUCCESS) + return result; + + new_lock = list_entry(zone->lock_pool.prev, struct hash_lock, pool_node); + list_del_init(&new_lock->pool_node); + + /* + * Fill in the hash of the new lock so we can map it, since we have to use the hash as the + * map key. + */ + new_lock->hash = *hash; + + result = vdo_pointer_map_put(zone->hash_lock_map, + &new_lock->hash, + new_lock, + (replace_lock != NULL), + (void **) &lock); + if (result != VDO_SUCCESS) { + return_hash_lock_to_pool(zone, UDS_FORGET(new_lock)); + return result; + } + + if (replace_lock != NULL) { + /* On mismatch put the old lock back and return a severe error */ + ASSERT_LOG_ONLY(lock == replace_lock, "old lock must have been in the lock map"); + /* TODO: Check earlier and bail out? */ + ASSERT_LOG_ONLY(replace_lock->registered, + "old lock must have been marked registered"); + replace_lock->registered = false; + } + + if (lock == replace_lock) { + lock = new_lock; + lock->registered = true; + } else { + /* There's already a lock for the hash, so we don't need the borrowed lock. */ + return_hash_lock_to_pool(zone, UDS_FORGET(new_lock)); + } + + *lock_ptr = lock; + return VDO_SUCCESS; +} + +/** + * enter_forked_lock() - Bind the data_vio to a new hash lock. + * + * Implements waiter_callback. Binds the data_vio that was waiting to a new hash lock and waits on + * that lock. + */ +static void enter_forked_lock(struct waiter *waiter, void *context) +{ + struct data_vio *data_vio = waiter_as_data_vio(waiter); + struct hash_lock *new_lock = (struct hash_lock *) context; + + set_hash_lock(data_vio, new_lock); + wait_on_hash_lock(new_lock, data_vio); +} + +/** + * fork_hash_lock() - Fork a hash lock because it has run out of increments on the duplicate PBN. + * @old_lock: The hash lock to fork. + * @new_agent: The data_vio that will be the agent for the new lock. + * + * Transfers the new agent and any lock waiters to a new hash lock instance which takes the place + * of the old lock in the lock map. The old lock remains active, but will not update advice. + */ +static void fork_hash_lock(struct hash_lock *old_lock, struct data_vio *new_agent) +{ + struct hash_lock *new_lock; + int result; + + result = acquire_lock(new_agent->hash_zone, &new_agent->record_name, old_lock, &new_lock); + if (result != VDO_SUCCESS) { + continue_data_vio_with_error(new_agent, result); + return; + } + + /* + * Only one of the two locks should update UDS. The old lock is out of references, so it + * would be poor dedupe advice in the short term. + */ + old_lock->update_advice = false; + new_lock->update_advice = true; + + set_hash_lock(new_agent, new_lock); + new_lock->agent = new_agent; + + vdo_notify_all_waiters(&old_lock->waiters, enter_forked_lock, new_lock); + + new_agent->is_duplicate = false; + start_writing(new_lock, new_agent); +} + +/** + * launch_dedupe() - Reserve a reference count increment for a data_vio and launch it on the dedupe + * path. + * @lock: The hash lock. + * @data_vio: The data_vio to deduplicate using the hash lock. + * @has_claim: true if the data_vio already has claimed an increment from the duplicate lock. + * + * If no increments are available, this will roll over to a new hash lock and launch the data_vio + * as the writing agent for that lock. 
+ */ +static void launch_dedupe(struct hash_lock *lock, struct data_vio *data_vio, bool has_claim) +{ + if (!has_claim && !vdo_claim_pbn_lock_increment(lock->duplicate_lock)) { + /* Out of increments, so must roll over to a new lock. */ + fork_hash_lock(lock, data_vio); + return; + } + + /* Deduplicate against the lock's verified location. */ + set_duplicate_location(data_vio, lock->duplicate); + data_vio->new_mapped = data_vio->duplicate; + update_metadata_for_data_vio_write(data_vio, lock->duplicate_lock); +} + +/** + * start_deduping() - Enter the hash lock state where data_vios deduplicate in parallel against a + * true copy of their data on disk. + * @lock: The hash lock. + * @agent: The data_vio acting as the agent for the lock. + * @agent_is_done: true only if the agent has already written or deduplicated against its data. + * + * If the agent itself needs to deduplicate, an increment for it must already have been claimed + * from the duplicate lock, ensuring the hash lock will still have a data_vio holding it. + */ +static void start_deduping(struct hash_lock *lock, struct data_vio *agent, bool agent_is_done) +{ + lock->state = VDO_HASH_LOCK_DEDUPING; + + /* + * We don't take the downgraded allocation lock from the agent unless we actually need to + * deduplicate against it. + */ + if (lock->duplicate_lock == NULL) { + ASSERT_LOG_ONLY(!vdo_is_state_compressed(agent->new_mapped.state), + "compression must have shared a lock"); + ASSERT_LOG_ONLY(agent_is_done, "agent must have written the new duplicate"); + transfer_allocation_lock(agent); + } + + ASSERT_LOG_ONLY(vdo_is_pbn_read_lock(lock->duplicate_lock), + "duplicate_lock must be a PBN read lock"); + + /* + * This state is not like any of the other states. There is no designated agent--the agent + * transitioning to this state and all the waiters will be launched to deduplicate in + * parallel. + */ + lock->agent = NULL; + + /* + * Launch the agent (if not already deduplicated) and as many lock waiters as we have + * available increments for on the dedupe path. If we run out of increments, rollover will + * be triggered and the remaining waiters will be transferred to the new lock. + */ + if (!agent_is_done) { + launch_dedupe(lock, agent, true); + agent = NULL; + } + while (vdo_has_waiters(&lock->waiters)) + launch_dedupe(lock, dequeue_lock_waiter(lock), false); + + if (agent_is_done) + /* + * In the degenerate case where all the waiters rolled over to a new lock, this + * will continue to use the old agent to clean up this lock, and otherwise it just + * lets the agent exit the lock. + */ + finish_deduping(lock, agent); +} + +/** + * increment_stat() - Increment a statistic counter in a non-atomic yet thread-safe manner. + * @stat: The statistic field to increment. + */ +static void increment_stat(u64 *stat) +{ + /* + * Must only be mutated on the hash zone thread. Prevents any compiler shenanigans from + * affecting other threads reading stats. + */ + WRITE_ONCE(*stat, *stat + 1); +} + +/** + * finish_verifying() - Handle the result of the agent for the lock comparing its data to the + * duplicate candidate. + * @completion: The completion of the data_vio used to verify dedupe + * + * This continuation is registered in start_verifying(). 
+ */ +static void finish_verifying(struct vdo_completion *completion) +{ + struct data_vio *agent = as_data_vio(completion); + struct hash_lock *lock = agent->hash_lock; + + assert_hash_lock_agent(agent, __func__); + + lock->verified = agent->is_duplicate; + + /* + * Only count the result of the initial verification of the advice as valid or stale, and + * not any re-verifications due to PBN lock releases. + */ + if (!lock->verify_counted) { + lock->verify_counted = true; + if (lock->verified) + increment_stat(&agent->hash_zone->statistics.dedupe_advice_valid); + else + increment_stat(&agent->hash_zone->statistics.dedupe_advice_stale); + } + + /* + * Even if the block is a verified duplicate, we can't start to deduplicate unless we can + * claim a reference count increment for the agent. + */ + if (lock->verified && !vdo_claim_pbn_lock_increment(lock->duplicate_lock)) { + agent->is_duplicate = false; + lock->verified = false; + } + + if (lock->verified) { + /* + * VERIFYING -> DEDUPING transition: The advice is for a true duplicate, so start + * deduplicating against it, if references are available. + */ + start_deduping(lock, agent, false); + } else { + /* + * VERIFYING -> UNLOCKING transition: Either the verify failed or we'd try to + * dedupe and roll over immediately, which would fail because it would leave the + * lock without an agent to release the PBN lock. In both cases, the data will have + * to be written or compressed, but first the advice PBN must be unlocked by the + * VERIFYING agent. + */ + lock->update_advice = true; + start_unlocking(lock, agent); + } +} + +static bool blocks_equal(char *block1, char *block2) +{ + int i; + + + for (i = 0; i < VDO_BLOCK_SIZE; i += sizeof(u64)) + if (*((u64 *) &block1[i]) != *((u64 *) &block2[i])) + return false; + + return true; +} + +static void verify_callback(struct vdo_completion *completion) +{ + struct data_vio *agent = as_data_vio(completion); + + agent->is_duplicate = blocks_equal(agent->vio.data, agent->scratch_block); + launch_data_vio_hash_zone_callback(agent, finish_verifying); +} + +static void uncompress_and_verify(struct vdo_completion *completion) +{ + struct data_vio *agent = as_data_vio(completion); + int result; + + result = uncompress_data_vio(agent, agent->duplicate.state, agent->scratch_block); + if (result == VDO_SUCCESS) { + verify_callback(completion); + return; + } + + agent->is_duplicate = false; + launch_data_vio_hash_zone_callback(agent, finish_verifying); +} + +static void verify_endio(struct bio *bio) +{ + struct data_vio *agent = vio_as_data_vio(bio->bi_private); + int result = blk_status_to_errno(bio->bi_status); + + vdo_count_completed_bios(bio); + if (result != VDO_SUCCESS) { + agent->is_duplicate = false; + launch_data_vio_hash_zone_callback(agent, finish_verifying); + return; + } + + if (vdo_is_state_compressed(agent->duplicate.state)) { + launch_data_vio_cpu_callback(agent, + uncompress_and_verify, + CPU_Q_COMPRESS_BLOCK_PRIORITY); + return; + } + + launch_data_vio_cpu_callback(agent, verify_callback, CPU_Q_COMPLETE_READ_PRIORITY); +} + +/** + * start_verifying() - Begin the data verification phase. + * @lock: The hash lock (must be LOCKING). + * @agent: The data_vio to use to read and compare candidate data. + * + * Continue the deduplication path for a hash lock by using the agent to read (and possibly + * decompress) the data at the candidate duplicate location, comparing it to the data in the agent + * to verify that the candidate is identical to all the data_vios sharing the hash. 
If so, it can + * be deduplicated against, otherwise a data_vio allocation will have to be written to and used for + * dedupe. + */ +static void start_verifying(struct hash_lock *lock, struct data_vio *agent) +{ + int result; + struct vio *vio = &agent->vio; + char *buffer = (vdo_is_state_compressed(agent->duplicate.state) ? + (char *) agent->compression.block : + agent->scratch_block); + + lock->state = VDO_HASH_LOCK_VERIFYING; + ASSERT_LOG_ONLY(!lock->verified, "hash lock only verifies advice once"); + + agent->last_async_operation = VIO_ASYNC_OP_VERIFY_DUPLICATION; + result = vio_reset_bio(vio, buffer, verify_endio, REQ_OP_READ, agent->duplicate.pbn); + if (result != VDO_SUCCESS) { + set_data_vio_hash_zone_callback(agent, finish_verifying); + continue_data_vio_with_error(agent, result); + return; + } + + set_data_vio_bio_zone_callback(agent, process_vio_io); + vdo_launch_completion_with_priority(&vio->completion, BIO_Q_VERIFY_PRIORITY); +} + +/** + * finish_locking() - Handle the result of the agent for the lock attempting to obtain a PBN read + * lock on the candidate duplicate block. + * @completion: The completion of the data_vio that attempted to get the read lock. + * + * This continuation is registered in lock_duplicate_pbn(). + */ +static void finish_locking(struct vdo_completion *completion) +{ + struct data_vio *agent = as_data_vio(completion); + struct hash_lock *lock = agent->hash_lock; + + assert_hash_lock_agent(agent, __func__); + + if (!agent->is_duplicate) { + ASSERT_LOG_ONLY(lock->duplicate_lock == NULL, + "must not hold duplicate_lock if not flagged as a duplicate"); + /* + * LOCKING -> WRITING transition: The advice block is being modified or has no + * available references, so try to write or compress the data, remembering to + * update UDS later with the new advice. + */ + increment_stat(&agent->hash_zone->statistics.dedupe_advice_stale); + lock->update_advice = true; + start_writing(lock, agent); + return; + } + + ASSERT_LOG_ONLY(lock->duplicate_lock != NULL, + "must hold duplicate_lock if flagged as a duplicate"); + + if (!lock->verified) { + /* + * LOCKING -> VERIFYING transition: Continue on the unverified dedupe path, reading + * the candidate duplicate and comparing it to the agent's data to decide whether + * it is a true duplicate or stale advice. + */ + start_verifying(lock, agent); + return; + } + + if (!vdo_claim_pbn_lock_increment(lock->duplicate_lock)) { + /* + * LOCKING -> UNLOCKING transition: The verified block was re-locked, but has no + * available increments left. Must first release the useless PBN read lock before + * rolling over to a new copy of the block. + */ + agent->is_duplicate = false; + lock->verified = false; + lock->update_advice = true; + start_unlocking(lock, agent); + return; + } + + /* + * LOCKING -> DEDUPING transition: Continue on the verified dedupe path, deduplicating + * against a location that was previously verified or written to. + */ + start_deduping(lock, agent, false); +} + +static bool acquire_provisional_reference(struct data_vio *agent, + struct pbn_lock *lock, + struct slab_depot *depot) +{ + /* Ensure that the newly-locked block is referenced. 
*/ + struct vdo_slab *slab = vdo_get_slab(depot, agent->duplicate.pbn); + int result = vdo_acquire_provisional_reference(slab, agent->duplicate.pbn, lock); + + if (result == VDO_SUCCESS) + return true; + + uds_log_warning_strerror(result, + "Error acquiring provisional reference for dedupe candidate; aborting dedupe"); + agent->is_duplicate = false; + vdo_release_physical_zone_pbn_lock(agent->duplicate.zone, agent->duplicate.pbn, lock); + continue_data_vio_with_error(agent, result); + return false; +} + +/** + * lock_duplicate_pbn() - Acquire a read lock on the PBN of the block containing candidate + * duplicate data (compressed or uncompressed). + * @completion: The completion of the data_vio attempting to acquire the physical block lock on + * behalf of its hash lock. + * + * If the PBN is already locked for writing, the lock attempt is abandoned and is_duplicate will be + * cleared before calling back. this continuation is launched from start_locking(), and calls back + * to finish_locking() on the hash zone thread. + */ +static void lock_duplicate_pbn(struct vdo_completion *completion) +{ + unsigned int increment_limit; + struct pbn_lock *lock; + int result; + + struct data_vio *agent = as_data_vio(completion); + struct slab_depot *depot = vdo_from_data_vio(agent)->depot; + struct physical_zone *zone = agent->duplicate.zone; + + assert_data_vio_in_duplicate_zone(agent); + + set_data_vio_hash_zone_callback(agent, finish_locking); + + /* + * While in the zone that owns it, find out how many additional references can be made to + * the block if it turns out to truly be a duplicate. + */ + increment_limit = vdo_get_increment_limit(depot, agent->duplicate.pbn); + if (increment_limit == 0) { + /* + * We could deduplicate against it later if a reference happened to be released + * during verification, but it's probably better to bail out now. + */ + agent->is_duplicate = false; + continue_data_vio(agent); + return; + } + + result = vdo_attempt_physical_zone_pbn_lock(zone, + agent->duplicate.pbn, + VIO_READ_LOCK, + &lock); + if (result != VDO_SUCCESS) { + continue_data_vio_with_error(agent, result); + return; + } + + if (!vdo_is_pbn_read_lock(lock)) { + /* + * There are three cases of write locks: uncompressed data block writes, compressed + * (packed) block writes, and block map page writes. In all three cases, we give up + * on trying to verify the advice and don't bother to try deduplicate against the + * data in the write lock holder. + * + * 1) We don't ever want to try to deduplicate against a block map page. + * + * 2a) It's very unlikely we'd deduplicate against an entire packed block, both + * because of the chance of matching it, and because we don't record advice for it, + * but for the uncompressed representation of all the fragments it contains. The + * only way we'd be getting lock contention is if we've written the same + * representation coincidentally before, had it become unreferenced, and it just + * happened to be packed together from compressed writes when we go to verify the + * lucky advice. Giving up is a minuscule loss of potential dedupe. + * + * 2b) If the advice is for a slot of a compressed block, it's about to get + * smashed, and the write smashing it cannot contain our data--it would have to be + * writing on behalf of our hash lock, but that's impossible since we're the lock + * agent. + * + * 3a) If the lock is held by a data_vio with different data, the advice is already + * stale or is about to become stale. 
+ * + * 3b) If the lock is held by a data_vio that matches us, we may as well either + * write it ourselves (or reference the copy we already wrote) instead of + * potentially having many duplicates wait for the lock holder to write, journal, + * hash, and finally arrive in the hash lock. We lose a chance to avoid a UDS + * update in the very rare case of advice for a free block that just happened to be + * allocated to a data_vio with the same hash. There's also a chance to save on a + * block write, at the cost of a block verify. Saving on a full block compare in + * all stale advice cases almost certainly outweighs saving a UDS update and + * trading a write for a read in a lucky case where advice would have been saved + * from becoming stale. + */ + agent->is_duplicate = false; + continue_data_vio(agent); + return; + } + + if (lock->holder_count == 0) { + if (!acquire_provisional_reference(agent, lock, depot)) + return; + + /* + * The increment limit we grabbed earlier is still valid. The lock now holds the + * rights to acquire all those references. Those rights will be claimed by hash + * locks sharing this read lock. + */ + lock->increment_limit = increment_limit; + } + + /* + * We've successfully acquired a read lock on behalf of the hash lock, so mark it as such. + */ + set_duplicate_lock(agent->hash_lock, lock); + + /* + * TODO: Optimization: We could directly launch the block verify, then switch to a hash + * thread. + */ + continue_data_vio(agent); +} + +/** + * start_locking() - Continue deduplication for a hash lock that has obtained valid advice of a + * potential duplicate through its agent. + * @lock: The hash lock (currently must be QUERYING). + * @agent: The data_vio bearing the dedupe advice. + */ +static void start_locking(struct hash_lock *lock, struct data_vio *agent) +{ + ASSERT_LOG_ONLY(lock->duplicate_lock == NULL, + "must not acquire a duplicate lock when already holding it"); + + lock->state = VDO_HASH_LOCK_LOCKING; + + /* + * TODO: Optimization: If we arrange to continue on the duplicate zone thread when + * accepting the advice, and don't explicitly change lock states (or use an agent-local + * state, or an atomic), we can avoid a thread transition here. + */ + agent->last_async_operation = VIO_ASYNC_OP_LOCK_DUPLICATE_PBN; + launch_data_vio_duplicate_zone_callback(agent, lock_duplicate_pbn); +} + +/** + * finish_writing() - Re-entry point for the lock agent after it has finished writing or + * compressing its copy of the data block. + * @lock: The hash lock, which must be in state WRITING. + * @agent: The data_vio that wrote its data for the lock. + * + * The agent will never need to dedupe against anything, so it's done with the lock, but the lock + * may not be finished with it, as a UDS update might still be needed. + * + * If there are other lock holders, the agent will hand the job to one of them and exit, leaving + * the lock to deduplicate against the just-written block. If there are no other lock holders, the + * agent either exits (and later tears down the hash lock), or it remains the agent and updates + * UDS. + */ +static void finish_writing(struct hash_lock *lock, struct data_vio *agent) +{ + /* + * Dedupe against the data block or compressed block slot the agent wrote. Since we know + * the write succeeded, there's no need to verify it. 
+ */ + lock->duplicate = agent->new_mapped; + lock->verified = true; + + if (vdo_is_state_compressed(lock->duplicate.state) && + lock->registered) + /* + * Compression means the location we gave in the UDS query is not the location + * we're using to deduplicate. + */ + lock->update_advice = true; + + /* If there are any waiters, we need to start deduping them. */ + if (vdo_has_waiters(&lock->waiters)) { + /* + * WRITING -> DEDUPING transition: an asynchronously-written block failed to + * compress, so the PBN lock on the written copy was already transferred. The agent + * is done with the lock, but the lock may still need to use it to clean up after + * rollover. + */ + start_deduping(lock, agent, true); + return; + } + + /* + * There are no waiters and the agent has successfully written, so take a step towards + * being able to release the hash lock (or just release it). + */ + if (lock->update_advice) { + /* + * WRITING -> UPDATING transition: There's no waiter and a UDS update is needed, so + * retain the WRITING agent and use it to launch the update. This happens on + * compression, rollover, or the QUERYING agent not having an allocation. + */ + start_updating(lock, agent); + } else if (lock->duplicate_lock != NULL) { + /* + * WRITING -> UNLOCKING transition: There's no waiter and no update needed, but the + * compressed write gave us a shared duplicate lock that we must release. + */ + set_duplicate_location(agent, lock->duplicate); + start_unlocking(lock, agent); + } else { + /* + * WRITING -> BYPASSING transition: There's no waiter, no update needed, and no + * duplicate lock held, so both the agent and lock have no more work to do. The + * agent will release its allocation lock in cleanup. + */ + start_bypassing(lock, agent); + } +} + +/** + * select_writing_agent() - Search through the lock waiters for a data_vio that has an allocation. + * @lock: The hash lock to modify. + * + * If an allocation is found, swap agents, put the old agent at the head of the wait queue, then + * return the new agent. Otherwise, just return the current agent. + */ +static struct data_vio *select_writing_agent(struct hash_lock *lock) +{ + struct wait_queue temp_queue; + struct data_vio *data_vio; + + vdo_initialize_wait_queue(&temp_queue); + + /* + * Move waiters to the temp queue one-by-one until we find an allocation. Not ideal to + * search, but it only happens when nearly out of space. + */ + while (((data_vio = dequeue_lock_waiter(lock)) != NULL) && + !data_vio_has_allocation(data_vio)) { + /* Use the lower-level enqueue since we're just moving waiters around. */ + vdo_enqueue_waiter(&temp_queue, &data_vio->waiter); + } + + if (data_vio != NULL) { + /* + * Move the rest of the waiters over to the temp queue, preserving the order they + * arrived at the lock. + */ + vdo_transfer_all_waiters(&lock->waiters, &temp_queue); + + /* + * The current agent is being replaced and will have to wait to dedupe; make it the + * first waiter since it was the first to reach the lock. + */ + vdo_enqueue_waiter(&lock->waiters, &lock->agent->waiter); + lock->agent = data_vio; + } else { + /* No one has an allocation, so keep the current agent. */ + data_vio = lock->agent; + } + + /* Swap all the waiters back onto the lock's queue. */ + vdo_transfer_all_waiters(&temp_queue, &lock->waiters); + return data_vio; +} + +/** + * start_writing() - Begin the non-duplicate write path. + * @lock: The hash lock (currently must be QUERYING). + * @agent: The data_vio currently acting as the agent for the lock.
+ * + * Begins the non-duplicate write path for a hash lock that had no advice, selecting a data_vio + * with an allocation as a new agent, if necessary, then resuming the agent on the data_vio write + * path. + */ +static void start_writing(struct hash_lock *lock, struct data_vio *agent) +{ + lock->state = VDO_HASH_LOCK_WRITING; + + /* + * The agent might not have received an allocation and so can't be used for writing, but + * it's entirely possible that one of the waiters did. + */ + if (!data_vio_has_allocation(agent)) { + agent = select_writing_agent(lock); + /* If none of the waiters had an allocation, the writes all have to fail. */ + if (!data_vio_has_allocation(agent)) { + /* + * TODO: Should we keep a variant of BYPASSING that causes new arrivals to + * fail immediately if they don't have an allocation? It might be possible + * that on some path there would be non-waiters still referencing the lock, + * so it would remain in the map as everything is currently spelled, even + * if the agent and all waiters release. + */ + continue_data_vio_with_error(agent, VDO_NO_SPACE); + return; + } + } + + /* + * If the agent compresses, it might wait indefinitely in the packer, which would be bad if + * there are any other data_vios waiting. + */ + if (vdo_has_waiters(&lock->waiters)) + cancel_data_vio_compression(agent); + + /* + * Send the agent to the compress/pack/write path in vioWrite. If it succeeds, it will + * return to the hash lock via vdo_continue_hash_lock() and call finish_writing(). + */ + launch_compress_data_vio(agent); +} + +/* + * Decode VDO duplicate advice from the old_metadata field of a UDS request. + * Returns true if valid advice was found and decoded + */ +static bool decode_uds_advice(struct dedupe_context *context) +{ + const struct uds_request *request = &context->request; + struct data_vio *data_vio = context->requestor; + size_t offset = 0; + const struct uds_record_data *encoding = &request->old_metadata; + struct vdo *vdo = vdo_from_data_vio(data_vio); + struct zoned_pbn *advice = &data_vio->duplicate; + u8 version; + int result; + + if ((request->status != UDS_SUCCESS) || !request->found) + return false; + + version = encoding->data[offset++]; + if (version != UDS_ADVICE_VERSION) { + uds_log_error("invalid UDS advice version code %u", version); + return false; + } + + advice->state = encoding->data[offset++]; + advice->pbn = get_unaligned_le64(&encoding->data[offset]); + offset += sizeof(u64); + BUG_ON(offset != UDS_ADVICE_SIZE); + + /* Don't use advice that's clearly meaningless. */ + if ((advice->state == VDO_MAPPING_STATE_UNMAPPED) || (advice->pbn == VDO_ZERO_BLOCK)) { + uds_log_debug("Invalid advice from deduplication server: pbn %llu, state %u. 
Giving up on deduplication of logical block %llu", + (unsigned long long) advice->pbn, + advice->state, + (unsigned long long) data_vio->logical.lbn); + atomic64_inc(&vdo->stats.invalid_advice_pbn_count); + return false; + } + + result = vdo_get_physical_zone(vdo, advice->pbn, &advice->zone); + if ((result != VDO_SUCCESS) || (advice->zone == NULL)) { + uds_log_debug("Invalid physical block number from deduplication server: %llu, giving up on deduplication of logical block %llu", + (unsigned long long) advice->pbn, + (unsigned long long) data_vio->logical.lbn); + atomic64_inc(&vdo->stats.invalid_advice_pbn_count); + return false; + } + + return true; +} + +static void process_query_result(struct data_vio *agent) +{ + struct dedupe_context *context = agent->dedupe_context; + + if (context == NULL) + return; + + if (change_context_state(context, DEDUPE_CONTEXT_COMPLETE, DEDUPE_CONTEXT_IDLE)) { + agent->is_duplicate = decode_uds_advice(context); + release_context(context); + } +} + +/** + * finish_querying() - Process the result of a UDS query performed by the agent for the lock. + * @completion: The completion of the data_vio that performed the query. + * + * This continuation is registered in start_querying(). + */ +static void finish_querying(struct vdo_completion *completion) +{ + struct data_vio *agent = as_data_vio(completion); + struct hash_lock *lock = agent->hash_lock; + + assert_hash_lock_agent(agent, __func__); + + process_query_result(agent); + + if (agent->is_duplicate) { + lock->duplicate = agent->duplicate; + /* + * QUERYING -> LOCKING transition: Valid advice was obtained from UDS. Use the + * QUERYING agent to start the hash lock on the unverified dedupe path, verifying + * that the advice can be used. + */ + start_locking(lock, agent); + } else { + /* + * The agent will be used as the duplicate if it has an allocation; if it does, that + * location was posted to UDS, so no update will be needed. + */ + lock->update_advice = !data_vio_has_allocation(agent); + /* + * QUERYING -> WRITING transition: There was no advice or the advice wasn't valid, + * so try to write or compress the data. + */ + start_writing(lock, agent); + } +} + +/** + * start_querying() - Start deduplication for a hash lock. + * @lock: The initialized hash lock. + * @data_vio: The data_vio that has just obtained the new lock. + * + * Starts deduplication for a hash lock that has finished initializing by making the data_vio that + * requested it the agent, entering the QUERYING state, and using the agent to perform the UDS + * query on behalf of the lock. + */ +static void start_querying(struct hash_lock *lock, struct data_vio *data_vio) +{ + lock->agent = data_vio; + lock->state = VDO_HASH_LOCK_QUERYING; + data_vio->last_async_operation = VIO_ASYNC_OP_CHECK_FOR_DUPLICATION; + set_data_vio_hash_zone_callback(data_vio, finish_querying); + query_index(data_vio, (data_vio_has_allocation(data_vio) ? UDS_POST : UDS_QUERY)); +} + +/** + * report_bogus_lock_state() - Complain that a data_vio has entered a hash_lock that is in an + * unimplemented or unusable state and continue the data_vio with an + * error. + * @lock: The hash lock. + * @data_vio: The data_vio attempting to enter the lock.
+ */ +static void report_bogus_lock_state(struct hash_lock *lock, struct data_vio *data_vio) +{ + ASSERT_LOG_ONLY(false, + "hash lock must not be in unimplemented state %s", + get_hash_lock_state_name(lock->state)); + continue_data_vio_with_error(data_vio, VDO_LOCK_ERROR); +} + +/** + * vdo_continue_hash_lock() - Continue the processing state after writing, compressing, or + * deduplicating. + * @data_vio: The data_vio to continue processing in its hash lock. + * + * Asynchronously continue processing a data_vio in its hash lock after it has finished writing, + * compressing, or deduplicating, so it can share the result with any data_vios waiting in the hash + * lock, or update the UDS index, or simply release its share of the lock. + * + * Context: This must only be called in the correct thread for the hash zone. + */ +void vdo_continue_hash_lock(struct vdo_completion *completion) +{ + struct data_vio *data_vio = as_data_vio(completion); + struct hash_lock *lock = data_vio->hash_lock; + + switch (lock->state) { + case VDO_HASH_LOCK_WRITING: + ASSERT_LOG_ONLY(data_vio == lock->agent, + "only the lock agent may continue the lock"); + finish_writing(lock, data_vio); + break; + + case VDO_HASH_LOCK_DEDUPING: + finish_deduping(lock, data_vio); + break; + + case VDO_HASH_LOCK_BYPASSING: + /* This data_vio has finished the write path and the lock doesn't need it. */ + exit_hash_lock(data_vio); + break; + + case VDO_HASH_LOCK_INITIALIZING: + case VDO_HASH_LOCK_QUERYING: + case VDO_HASH_LOCK_UPDATING: + case VDO_HASH_LOCK_LOCKING: + case VDO_HASH_LOCK_VERIFYING: + case VDO_HASH_LOCK_UNLOCKING: + /* A lock in this state should never be re-entered. */ + report_bogus_lock_state(lock, data_vio); + break; + + default: + report_bogus_lock_state(lock, data_vio); + } +} + +/** + * is_hash_collision() - Check to see if a hash collision has occurred. + * @lock: The lock to check. + * @candidate: The data_vio seeking to share the lock. + * + * Check whether the data in data_vios sharing a lock is different than in a data_vio seeking to + * share the lock, which should only be possible in the extremely unlikely case of a hash + * collision. + * + * Return: true if the given data_vio must not share the lock because it doesn't have the same data + * as the lock holders. + */ +static bool is_hash_collision(struct hash_lock *lock, struct data_vio *candidate) +{ + struct data_vio *lock_holder; + struct hash_zone *zone; + bool collides; + + if (list_empty(&lock->duplicate_ring)) + return false; + + lock_holder = list_first_entry(&lock->duplicate_ring, struct data_vio, hash_lock_entry); + zone = candidate->hash_zone; + collides = !blocks_equal(lock_holder->vio.data, candidate->vio.data); + if (collides) + increment_stat(&zone->statistics.concurrent_hash_collisions); + else + increment_stat(&zone->statistics.concurrent_data_matches); + + return collides; +} + +static inline int assert_hash_lock_preconditions(const struct data_vio *data_vio) +{ + int result; + + /* FIXME: BUG_ON() and/or enter read-only mode? */ + result = ASSERT(data_vio->hash_lock == NULL, "must not already hold a hash lock"); + if (result != VDO_SUCCESS) + return result; + + result = ASSERT(list_empty(&data_vio->hash_lock_entry), + "must not already be a member of a hash lock ring"); + if (result != VDO_SUCCESS) + return result; + + return ASSERT(data_vio->recovery_sequence_number == 0, + "must not hold a recovery lock when getting a hash lock"); +} + +/** + * vdo_acquire_hash_lock() - Acquire or share a lock on a record name. 
+ * @data_vio: The data_vio acquiring a lock on its record name. + * + * Acquire or share a lock on the hash (record name) of the data in a data_vio, updating the + * data_vio to reference the lock. This must only be called in the correct thread for the zone. In + * the unlikely case of a hash collision, this function will succeed, but the data_vio will not get + * a lock reference. + */ +void vdo_acquire_hash_lock(struct vdo_completion *completion) +{ + struct data_vio *data_vio = as_data_vio(completion); + struct hash_lock *lock; + int result; + + assert_data_vio_in_hash_zone(data_vio); + + result = assert_hash_lock_preconditions(data_vio); + if (result != VDO_SUCCESS) { + continue_data_vio_with_error(data_vio, result); + return; + } + + result = acquire_lock(data_vio->hash_zone, &data_vio->record_name, NULL, &lock); + if (result != VDO_SUCCESS) { + continue_data_vio_with_error(data_vio, result); + return; + } + + if (is_hash_collision(lock, data_vio)) { + /* + * Hash collisions are extremely unlikely, but the bogus dedupe would be a data + * corruption. Bypass optimization entirely. We can't compress a data_vio without + * a hash_lock as the compressed write depends on the hash_lock to manage the + * references for the compressed block. + */ + write_data_vio(data_vio); + return; + } + + set_hash_lock(data_vio, lock); + switch (lock->state) { + case VDO_HASH_LOCK_INITIALIZING: + start_querying(lock, data_vio); + return; + + case VDO_HASH_LOCK_QUERYING: + case VDO_HASH_LOCK_WRITING: + case VDO_HASH_LOCK_UPDATING: + case VDO_HASH_LOCK_LOCKING: + case VDO_HASH_LOCK_VERIFYING: + case VDO_HASH_LOCK_UNLOCKING: + /* The lock is busy, and can't be shared yet. */ + wait_on_hash_lock(lock, data_vio); + return; + + case VDO_HASH_LOCK_BYPASSING: + /* We can't use this lock, so bypass optimization entirely. */ + vdo_release_hash_lock(data_vio); + write_data_vio(data_vio); + return; + + case VDO_HASH_LOCK_DEDUPING: + launch_dedupe(lock, data_vio, false); + return; + + default: + /* A lock in this state should not be acquired by new VIOs. */ + report_bogus_lock_state(lock, data_vio); + } +} + +/** + * vdo_release_hash_lock() - Release a data_vio's share of a hash lock, if held, and null out the + * data_vio's reference to it. + * @data_vio: The data_vio releasing its hash lock. + * + * If the data_vio is the only one holding the lock, this also releases any resources or locks used + * by the hash lock (such as a PBN read lock on a block containing data with the same hash) and + * returns the lock to the hash zone's lock pool. + * + * Context: This must only be called in the correct thread for the hash zone. + */ +void vdo_release_hash_lock(struct data_vio *data_vio) +{ + struct hash_lock *lock = data_vio->hash_lock; + struct hash_zone *zone = data_vio->hash_zone; + + if (lock == NULL) + return; + + set_hash_lock(data_vio, NULL); + + if (lock->reference_count > 0) + /* The lock is still in use by other data_vios. 
*/ + return; + + if (lock->registered) { + struct hash_lock *removed; + + removed = vdo_pointer_map_remove(zone->hash_lock_map, &lock->hash); + ASSERT_LOG_ONLY(lock == removed, "hash lock being released must have been mapped"); + } else { + ASSERT_LOG_ONLY(lock != vdo_pointer_map_get(zone->hash_lock_map, &lock->hash), + "unregistered hash lock must not be in the lock map"); + } + + ASSERT_LOG_ONLY(!vdo_has_waiters(&lock->waiters), + "hash lock returned to zone must have no waiters"); + ASSERT_LOG_ONLY((lock->duplicate_lock == NULL), + "hash lock returned to zone must not reference a PBN lock"); + ASSERT_LOG_ONLY((lock->state == VDO_HASH_LOCK_BYPASSING), + "returned hash lock must not be in use with state %s", + get_hash_lock_state_name(lock->state)); + ASSERT_LOG_ONLY(list_empty(&lock->pool_node), + "hash lock returned to zone must not be in a pool ring"); + ASSERT_LOG_ONLY(list_empty(&lock->duplicate_ring), + "hash lock returned to zone must not reference DataVIOs"); + + return_hash_lock_to_pool(zone, lock); +} + +/** + * transfer_allocation_lock() - Transfer a data_vio's downgraded allocation PBN lock to the + * data_vio's hash lock, converting it to a duplicate PBN lock. + * @data_vio: The data_vio holding the allocation lock to transfer. + */ +static void transfer_allocation_lock(struct data_vio *data_vio) +{ + struct allocation *allocation = &data_vio->allocation; + struct hash_lock *hash_lock = data_vio->hash_lock; + + ASSERT_LOG_ONLY(data_vio->new_mapped.pbn == allocation->pbn, + "transferred lock must be for the block written"); + + allocation->pbn = VDO_ZERO_BLOCK; + + ASSERT_LOG_ONLY(vdo_is_pbn_read_lock(allocation->lock), + "must have downgraded the allocation lock before transfer"); + + hash_lock->duplicate = data_vio->new_mapped; + data_vio->duplicate = data_vio->new_mapped; + + /* + * Since the lock is being transferred, the holder count doesn't change (and isn't even + * safe to examine on this thread). + */ + hash_lock->duplicate_lock = UDS_FORGET(allocation->lock); +} + +/** + * vdo_share_compressed_write_lock() - Make a data_vio's hash lock a shared holder of the PBN lock + * on the compressed block to which its data was just written. + * @data_vio: The data_vio which was just compressed. + * @pbn_lock: The PBN lock on the compressed block. + * + * If the lock is still a write lock (as it will be for the first share), it will be converted to a + * read lock. This also reserves a reference count increment for the data_vio. + */ +void vdo_share_compressed_write_lock(struct data_vio *data_vio, struct pbn_lock *pbn_lock) +{ + bool claimed; + + ASSERT_LOG_ONLY(vdo_get_duplicate_lock(data_vio) == NULL, + "a duplicate PBN lock should not exist when writing"); + ASSERT_LOG_ONLY(vdo_is_state_compressed(data_vio->new_mapped.state), + "lock transfer must be for a compressed write"); + assert_data_vio_in_new_mapped_zone(data_vio); + + /* First sharer downgrades the lock. */ + if (!vdo_is_pbn_read_lock(pbn_lock)) + vdo_downgrade_pbn_write_lock(pbn_lock, true); + + /* + * Get a share of the PBN lock, ensuring it cannot be released until after this data_vio + * has had a chance to journal a reference. + */ + data_vio->duplicate = data_vio->new_mapped; + data_vio->hash_lock->duplicate = data_vio->new_mapped; + set_duplicate_lock(data_vio->hash_lock, pbn_lock); + + /* + * Claim a reference for this data_vio. Necessary since another hash_lock might start + * deduplicating against it before our incRef. 
+ */ + claimed = vdo_claim_pbn_lock_increment(pbn_lock); + ASSERT_LOG_ONLY(claimed, "impossible to fail to claim an initial increment"); +} + +/** compare_keys() - Implements pointer_key_comparator. */ +static bool compare_keys(const void *this_key, const void *that_key) +{ + /* Null keys are not supported. */ + return (memcmp(this_key, that_key, sizeof(struct uds_record_name)) == 0); +} + +/** hash_key() - Implements pointer_key_comparator. */ +static u32 hash_key(const void *key) +{ + const struct uds_record_name *name = key; + + /* Use a fragment of the record name as a hash code. */ + return get_unaligned_le32(&name->name[4]); +} + +static int __must_check +initialize_zone(struct vdo *vdo, struct hash_zones *zones, zone_count_t zone_number) +{ + int result; + data_vio_count_t i; + struct hash_zone *zone = &zones->zones[zone_number]; + + result = vdo_make_pointer_map(VDO_LOCK_MAP_CAPACITY, + 0, + compare_keys, + hash_key, + &zone->hash_lock_map); + if (result != VDO_SUCCESS) + return result; + + vdo_set_admin_state_code(&zone->state, VDO_ADMIN_STATE_NORMAL_OPERATION); + zone->zone_number = zone_number; + zone->thread_id = vdo->thread_config.hash_zone_threads[zone_number]; + vdo_initialize_completion(&zone->completion, vdo, VDO_HASH_ZONE_COMPLETION); + vdo_set_completion_callback(&zone->completion, + timeout_index_operations_callback, + zone->thread_id); + INIT_LIST_HEAD(&zone->lock_pool); + result = UDS_ALLOCATE(LOCK_POOL_CAPACITY, + struct hash_lock, + "hash_lock array", + &zone->lock_array); + if (result != VDO_SUCCESS) + return result; + + for (i = 0; i < LOCK_POOL_CAPACITY; i++) + return_hash_lock_to_pool(zone, &zone->lock_array[i]); + + INIT_LIST_HEAD(&zone->available); + INIT_LIST_HEAD(&zone->pending); + result = uds_make_funnel_queue(&zone->timed_out_complete); + if (result != VDO_SUCCESS) + return result; + + timer_setup(&zone->timer, timeout_index_operations, 0); + + for (i = 0; i < MAXIMUM_VDO_USER_VIOS; i++) { + struct dedupe_context *context = &zone->contexts[i]; + + context->zone = zone; + context->request.callback = finish_index_operation; + context->request.session = zones->index_session; + list_add(&context->list_entry, &zone->available); + } + + return vdo_make_default_thread(vdo, zone->thread_id); +} + +/** get_thread_id_for_zone() - Implements vdo_zone_thread_getter. */ +static thread_id_t get_thread_id_for_zone(void *context, zone_count_t zone_number) +{ + struct hash_zones *zones = context; + + return zones->zones[zone_number].thread_id; +} + +/** + * vdo_make_hash_zones() - Create the hash zones. + * + * @vdo: The vdo to which the zone will belong. + * @zones_ptr: A pointer to hold the zones. + * + * Return: VDO_SUCCESS or an error code. 
+ */ +int vdo_make_hash_zones(struct vdo *vdo, struct hash_zones **zones_ptr) +{ + int result; + struct hash_zones *zones; + zone_count_t z; + zone_count_t zone_count = vdo->thread_config.hash_zone_count; + + if (zone_count == 0) + return VDO_SUCCESS; + + result = UDS_ALLOCATE_EXTENDED(struct hash_zones, + zone_count, + struct hash_zone, + __func__, + &zones); + if (result != VDO_SUCCESS) + return result; + + result = initialize_index(vdo, zones); + if (result != VDO_SUCCESS) { + UDS_FREE(zones); + return result; + } + + vdo_set_admin_state_code(&zones->state, VDO_ADMIN_STATE_NEW); + + zones->zone_count = zone_count; + for (z = 0; z < zone_count; z++) { + result = initialize_zone(vdo, zones, z); + if (result != VDO_SUCCESS) { + vdo_free_hash_zones(zones); + return result; + } + } + + result = vdo_make_action_manager(zones->zone_count, + get_thread_id_for_zone, + vdo->thread_config.admin_thread, + zones, + NULL, + vdo, + &zones->manager); + if (result != VDO_SUCCESS) { + vdo_free_hash_zones(zones); + return result; + } + + *zones_ptr = zones; + return VDO_SUCCESS; +} + +void vdo_finish_dedupe_index(struct hash_zones *zones) +{ + if (zones == NULL) + return; + + uds_destroy_index_session(UDS_FORGET(zones->index_session)); +} + +/** + * vdo_free_hash_zones() - Free the hash zones. + * @zones: The zone to free. + */ +void vdo_free_hash_zones(struct hash_zones *zones) +{ + zone_count_t i; + + if (zones == NULL) + return; + + UDS_FREE(UDS_FORGET(zones->manager)); + + for (i = 0; i < zones->zone_count; i++) { + struct hash_zone *zone = &zones->zones[i]; + + uds_free_funnel_queue(UDS_FORGET(zone->timed_out_complete)); + vdo_free_pointer_map(UDS_FORGET(zone->hash_lock_map)); + UDS_FREE(UDS_FORGET(zone->lock_array)); + } + + if (zones->index_session != NULL) + vdo_finish_dedupe_index(zones); + + ratelimit_state_exit(&zones->ratelimiter); + if (vdo_get_admin_state_code(&zones->state) == VDO_ADMIN_STATE_NEW) + UDS_FREE(zones); + else + kobject_put(&zones->dedupe_directory); +} + +static void initiate_suspend_index(struct admin_state *state) +{ + struct hash_zones *zones = container_of(state, struct hash_zones, state); + enum index_state index_state; + + spin_lock(&zones->lock); + index_state = zones->index_state; + spin_unlock(&zones->lock); + + if (index_state != IS_CLOSED) { + bool save = vdo_is_state_saving(&zones->state); + int result; + + result = uds_suspend_index_session(zones->index_session, save); + if (result != UDS_SUCCESS) + uds_log_error_strerror(result, "Error suspending dedupe index"); + } + + vdo_finish_draining(state); +} + +/** + * suspend_index() - Suspend the UDS index prior to draining hash zones. + * + * Implements vdo_action_preamble + */ +static void suspend_index(void *context, struct vdo_completion *completion) +{ + struct hash_zones *zones = context; + + vdo_start_draining(&zones->state, + vdo_get_current_manager_operation(zones->manager), + completion, + initiate_suspend_index); +} + +/** + * initiate_drain() - Initiate a drain. + * + * Implements vdo_admin_initiator. + */ +static void initiate_drain(struct admin_state *state) +{ + check_for_drain_complete(container_of(state, struct hash_zone, state)); +} + +/** + * drain_hash_zone() - Drain a hash zone. + * + * Implements vdo_zone_action. 
+ */ +static void drain_hash_zone(void *context, zone_count_t zone_number, struct vdo_completion *parent) +{ + struct hash_zones *zones = context; + + vdo_start_draining(&zones->zones[zone_number].state, + vdo_get_current_manager_operation(zones->manager), + parent, + initiate_drain); +} + +/** vdo_drain_hash_zones() - Drain all hash zones. */ +void vdo_drain_hash_zones(struct hash_zones *zones, struct vdo_completion *parent) +{ + vdo_schedule_operation(zones->manager, + parent->vdo->suspend_type, + suspend_index, + drain_hash_zone, + NULL, + parent); +} + +static void launch_dedupe_state_change(struct hash_zones *zones) +{ + /* ASSERTION: We enter with the lock held. */ + if (zones->changing || !vdo_is_state_normal(&zones->state)) + /* Either a change is already in progress, or changes are not allowed. */ + return; + + if (zones->create_flag || (zones->index_state != zones->index_target)) { + zones->changing = true; + vdo_launch_completion(&zones->completion); + return; + } + + /* ASSERTION: We exit with the lock held. */ +} + +/** + * resume_index() - Resume the UDS index prior to resuming hash zones. + * + * Implements vdo_action_preamble + */ +static void resume_index(void *context, struct vdo_completion *parent) +{ + struct hash_zones *zones = context; + struct device_config *config = parent->vdo->device_config; + int result; + + zones->parameters.bdev = config->owned_device->bdev; + result = uds_resume_index_session(zones->index_session, zones->parameters.bdev); + if (result != UDS_SUCCESS) + uds_log_error_strerror(result, "Error resuming dedupe index"); + + spin_lock(&zones->lock); + vdo_resume_if_quiescent(&zones->state); + + if (config->deduplication) { + zones->index_target = IS_OPENED; + WRITE_ONCE(zones->dedupe_flag, true); + } else { + zones->index_target = IS_CLOSED; + } + + launch_dedupe_state_change(zones); + spin_unlock(&zones->lock); + + vdo_finish_completion(parent); +} + +/** + * resume_hash_zone() - Resume a hash zone. + * + * Implements vdo_zone_action. + */ +static void +resume_hash_zone(void *context, zone_count_t zone_number, struct vdo_completion *parent) +{ + struct hash_zone *zone = &(((struct hash_zones *) context)->zones[zone_number]); + + vdo_fail_completion(parent, vdo_resume_if_quiescent(&zone->state)); +} + +/** + * vdo_resume_hash_zones() - Resume a set of hash zones. + * @zones: The hash zones to resume. + * @parent: The object to notify when the zones have resumed. + */ +void vdo_resume_hash_zones(struct hash_zones *zones, struct vdo_completion *parent) +{ + if (vdo_is_read_only(parent->vdo)) { + vdo_launch_completion(parent); + return; + } + + vdo_schedule_operation(zones->manager, + VDO_ADMIN_STATE_RESUMING, + resume_index, + resume_hash_zone, + NULL, + parent); +} + +/** + * get_hash_zone_statistics() - Add the statistics for this hash zone to the tally for all zones. + * @zone: The hash zone to query. 
+ * @tally: The statistics tally to update. + */ +static void +get_hash_zone_statistics(const struct hash_zone *zone, struct hash_lock_statistics *tally) +{ + const struct hash_lock_statistics *stats = &zone->statistics; + + tally->dedupe_advice_valid += READ_ONCE(stats->dedupe_advice_valid); + tally->dedupe_advice_stale += READ_ONCE(stats->dedupe_advice_stale); + tally->concurrent_data_matches += READ_ONCE(stats->concurrent_data_matches); + tally->concurrent_hash_collisions += READ_ONCE(stats->concurrent_hash_collisions); + tally->curr_dedupe_queries += READ_ONCE(zone->active); +} + +static void get_index_statistics(struct hash_zones *zones, struct index_statistics *stats) +{ + enum index_state state; + struct uds_index_stats index_stats; + int result; + + spin_lock(&zones->lock); + state = zones->index_state; + spin_unlock(&zones->lock); + + if (state != IS_OPENED) + return; + + result = uds_get_index_session_stats(zones->index_session, &index_stats); + if (result != UDS_SUCCESS) { + uds_log_error_strerror(result, "Error reading index stats"); + return; + } + + stats->entries_indexed = index_stats.entries_indexed; + stats->posts_found = index_stats.posts_found; + stats->posts_not_found = index_stats.posts_not_found; + stats->queries_found = index_stats.queries_found; + stats->queries_not_found = index_stats.queries_not_found; + stats->updates_found = index_stats.updates_found; + stats->updates_not_found = index_stats.updates_not_found; + stats->entries_discarded = index_stats.entries_discarded; +} + +/** + * vdo_get_dedupe_statistics() - Tally the statistics from all the hash zones and the UDS index. + * @zones: The hash zones to query. + * @stats: The vdo statistics to update with the sum of the hash lock statistics from all hash + * zones plus the statistics from the UDS index. + */ +void vdo_get_dedupe_statistics(struct hash_zones *zones, struct vdo_statistics *stats) + +{ + zone_count_t zone; + + for (zone = 0; zone < zones->zone_count; zone++) + get_hash_zone_statistics(&zones->zones[zone], &stats->hash_lock); + + get_index_statistics(zones, &stats->index); + + /* + * zones->timeouts gives the number of timeouts, and dedupe_context_busy gives the number + * of queries not made because of earlier timeouts. + */ + stats->dedupe_advice_timeouts = + (atomic64_read(&zones->timeouts) + atomic64_read(&zones->dedupe_context_busy)); +} + +/** + * vdo_select_hash_zone() - Select the hash zone responsible for locking a given record name. + * @zones: The hash_zones from which to select. + * @name: The record name. + * + * Return: The hash zone responsible for the record name. + */ +struct hash_zone * +vdo_select_hash_zone(struct hash_zones *zones, const struct uds_record_name *name) +{ + /* + * Use a fragment of the record name as a hash code. Eight bits of hash should suffice + * since the number of hash zones is small. + * TODO: Verify that the first byte is independent enough. + */ + u32 hash = name->name[0]; + + /* + * Scale the 8-bit hash fragment to a zone index by treating it as a binary fraction and + * multiplying that by the zone count. If the hash is uniformly distributed over [0 .. + * 2^8-1], then (hash * count / 2^8) should be uniformly distributed over [0 .. count-1]. + * The multiply and shift is much faster than a divide (modulus) on X86 CPUs. + */ + hash = (hash * zones->zone_count) >> 8; + return &zones->zones[hash]; +} + +/** + * dump_hash_lock() - Dump a compact description of a hash_lock to the log if the lock is not on + * the free list. + * @lock: The hash lock to dump.
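The two record-name fragments used in this file are easy to check by hand: byte 0 feeds the multiply-and-shift zone selection in vdo_select_hash_zone() above, and bytes 4-7 (read little-endian by hash_key()) feed the per-zone lock map. A stand-alone sketch of the same arithmetic, assuming a 16-byte record name (hypothetical user-space code, not part of the patch):

#include <stdint.h>
#include <stdio.h>

#define NAME_SIZE 16	/* assumption: size of a UDS record name */

/* Mirror of vdo_select_hash_zone(): scale an 8-bit fragment to [0, zone_count). */
static uint32_t select_zone(const uint8_t name[NAME_SIZE], uint32_t zone_count)
{
	return ((uint32_t) name[0] * zone_count) >> 8;
}

/* Mirror of hash_key(): bytes 4..7 of the name, read little-endian. */
static uint32_t lock_map_hash(const uint8_t name[NAME_SIZE])
{
	return ((uint32_t) name[4] | ((uint32_t) name[5] << 8) |
		((uint32_t) name[6] << 16) | ((uint32_t) name[7] << 24));
}

int main(void)
{
	uint8_t name[NAME_SIZE] = { 0x9c, 0, 0, 0, 0x78, 0x56, 0x34, 0x12 };

	/* 0x9c = 156; with 4 zones, (156 * 4) >> 8 = 2. */
	printf("zone index: %u\n", select_zone(name, 4));
	/* Little-endian read of bytes 4..7 gives 0x12345678. */
	printf("lock map hash: 0x%x\n", lock_map_hash(name));
	return 0;
}

With four zones, any record name whose first byte falls in [128, 191] lands in zone 2, matching the binary-fraction argument in the comment above.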
+ */ +static void dump_hash_lock(const struct hash_lock *lock) +{ + const char *state; + + if (!list_empty(&lock->pool_node)) + /* This lock is on the free list. */ + return; + + /* + * Necessarily cryptic since we can log a lot of these. First three chars of state is + * unambiguous. 'U' indicates a lock not registered in the map. + */ + state = get_hash_lock_state_name(lock->state); + uds_log_info(" hl %px: %3.3s %c%llu/%u rc=%u wc=%zu agt=%px", + (const void *) lock, state, (lock->registered ? 'D' : 'U'), + (unsigned long long) lock->duplicate.pbn, + lock->duplicate.state, lock->reference_count, + vdo_count_waiters(&lock->waiters), (void *) lock->agent); +} + +static const char *index_state_to_string(struct hash_zones *zones, enum index_state state) +{ + if (!vdo_is_state_normal(&zones->state)) + return SUSPENDED; + + switch (state) { + case IS_CLOSED: + return zones->error_flag ? ERROR : CLOSED; + case IS_CHANGING: + return zones->index_target == IS_OPENED ? OPENING : CLOSING; + case IS_OPENED: + return READ_ONCE(zones->dedupe_flag) ? ONLINE : OFFLINE; + default: + return UNKNOWN; + } +} + +/** + * vdo_dump_hash_zone() - Dump information about a hash zone to the log for debugging. + * @zone: The zone to dump. + */ +static void dump_hash_zone(const struct hash_zone *zone) +{ + data_vio_count_t i; + + if (zone->hash_lock_map == NULL) { + uds_log_info("struct hash_zone %u: NULL map", zone->zone_number); + return; + } + + uds_log_info("struct hash_zone %u: mapSize=%zu", + zone->zone_number, + vdo_pointer_map_size(zone->hash_lock_map)); + for (i = 0; i < LOCK_POOL_CAPACITY; i++) + dump_hash_lock(&zone->lock_array[i]); +} + +/** + * vdo_dump_hash_zones() - Dump information about the hash zones to the log for debugging. + * @zones: The zones to dump. + */ +void vdo_dump_hash_zones(struct hash_zones *zones) +{ + const char *state, *target; + zone_count_t zone; + + spin_lock(&zones->lock); + state = index_state_to_string(zones, zones->index_state); + target = (zones->changing ? index_state_to_string(zones, zones->index_target) : NULL); + spin_unlock(&zones->lock); + + uds_log_info("UDS index: state: %s", state); + if (target != NULL) + uds_log_info("UDS index: changing to state: %s", target); + + for (zone = 0; zone < zones->zone_count; zone++) + dump_hash_zone(&zones->zones[zone]); +} diff --git a/drivers/md/dm-vdo/dedupe.h b/drivers/md/dm-vdo/dedupe.h new file mode 100644 index 000000000000..af329fb0fa68 --- /dev/null +++ b/drivers/md/dm-vdo/dedupe.h @@ -0,0 +1,93 @@ +/* SPDX-License-Identifier: GPL-2.0-only */ +/* + * Copyright 2023 Red Hat + */ + +#ifndef VDO_DEDUPE_H +#define VDO_DEDUPE_H + +#include +#include + +#include "uds.h" + +#include "admin-state.h" +#include "constants.h" +#include "statistics.h" +#include "types.h" +#include "wait-queue.h" + +struct dedupe_context { + struct hash_zone *zone; + struct uds_request request; + struct list_head list_entry; + struct funnel_queue_entry queue_entry; + u64 submission_jiffies; + struct data_vio *requestor; + atomic_t state; +}; + +struct hash_lock; + +struct hash_zone { + /* Which hash zone this is */ + zone_count_t zone_number; + + /* The administrative state of the zone */ + struct admin_state state; + + /* The thread ID for this zone */ + thread_id_t thread_id; + + /* Mapping from record name fields to hash_locks */ + struct pointer_map *hash_lock_map; + + /* List containing all unused hash_locks */ + struct list_head lock_pool; + + /* + * Statistics shared by all hash locks in this zone. 
Only modified on the hash zone thread, + * but queried by other threads. + */ + struct hash_lock_statistics statistics; + + /* Array of all hash_locks */ + struct hash_lock *lock_array; + + /* These fields are used to manage the dedupe contexts */ + struct list_head available; + struct list_head pending; + struct funnel_queue *timed_out_complete; + struct timer_list timer; + struct vdo_completion completion; + unsigned int active; + atomic_t timer_state; + + /* The dedupe contexts for querying the index from this zone */ + struct dedupe_context contexts[MAXIMUM_VDO_USER_VIOS]; +}; + +struct hash_zones; + +struct pbn_lock * __must_check vdo_get_duplicate_lock(struct data_vio *data_vio); + +void vdo_acquire_hash_lock(struct vdo_completion *completion); +void vdo_continue_hash_lock(struct vdo_completion *completion); +void vdo_release_hash_lock(struct data_vio *data_vio); +void vdo_clean_failed_hash_lock(struct data_vio *data_vio); +void vdo_share_compressed_write_lock(struct data_vio *data_vio, struct pbn_lock *pbn_lock); + +int __must_check vdo_make_hash_zones(struct vdo *vdo, struct hash_zones **zones_ptr); + +void vdo_free_hash_zones(struct hash_zones *zones); + +void vdo_drain_hash_zones(struct hash_zones *zones, struct vdo_completion *parent); + +void vdo_get_dedupe_statistics(struct hash_zones *zones, struct vdo_statistics *stats); + +struct hash_zone * __must_check +vdo_select_hash_zone(struct hash_zones *zones, const struct uds_record_name *name); + +void vdo_dump_hash_zones(struct hash_zones *zones); + +#endif /* VDO_DEDUPE_H */ From patchwork Thu Oct 26 21:41:20 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Matthew Sakai X-Patchwork-Id: 13437946 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 65CB4C27C4A for ; Thu, 26 Oct 2023 21:44:28 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1344889AbjJZVo2 (ORCPT ); Thu, 26 Oct 2023 17:44:28 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:40978 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1344897AbjJZVoQ (ORCPT ); Thu, 26 Oct 2023 17:44:16 -0400 Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.129.124]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 15143D4E for ; Thu, 26 Oct 2023 14:41:45 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1698356504; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=Fyv+qkUNIE1dYhEw0bAkLYwL70dlXQcZ7ntf/CeeDEQ=; b=IKKV+JcKfn7kADfoVfgxoRVO1fjJ5fOsXkXeR9AoSCR9G8AoMuyeI6yy6i3mdd3j7Te0mm ReDov5y9ylIDS8TG2F62fqjlDlTZdH0IMleGE324zKvg7y3y1yikGaJk9EgoU14APqCjdQ IE7D9YF4AubYJZLOAUBR50SFSV+8XjM= Received: from mimecast-mx02.redhat.com (mimecast-mx02.redhat.com [66.187.233.88]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.3, cipher=TLS_AES_256_GCM_SHA384) id us-mta-155-_e7USfzANpmOIPzP7h5yjQ-1; Thu, 26 Oct 2023 17:41:41 -0400 X-MC-Unique: _e7USfzANpmOIPzP7h5yjQ-1 Received: from smtp.corp.redhat.com (int-mx07.intmail.prod.int.rdu2.redhat.com [10.11.54.7]) (using TLSv1.3 with cipher 
TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (2048 bits) server-digest SHA256) (No client certificate requested) by mimecast-mx02.redhat.com (Postfix) with ESMTPS id 1E97E185A784; Thu, 26 Oct 2023 21:41:41 +0000 (UTC) Received: from pbitcolo-build-10.permabit.com (pbitcolo-build-10.permabit.lab.eng.bos.redhat.com [10.19.117.76]) by smtp.corp.redhat.com (Postfix) with ESMTP id 0A5A81C060AE; Thu, 26 Oct 2023 21:41:41 +0000 (UTC) Received: by pbitcolo-build-10.permabit.com (Postfix, from userid 1138) id B56803003B; Thu, 26 Oct 2023 17:41:40 -0400 (EDT) From: Matthew Sakai To: dm-devel@lists.linux.dev, linux-block@vger.kernel.org Cc: Matthew Sakai Subject: [PATCH v4 23/39] dm vdo: add use of deduplication index in hash zones Date: Thu, 26 Oct 2023 17:41:20 -0400 Message-Id: <20231026214136.1067410-24-msakai@redhat.com> In-Reply-To: <20231026214136.1067410-1-msakai@redhat.com> References: <20231026214136.1067410-1-msakai@redhat.com> MIME-Version: 1.0 X-Scanned-By: MIMEDefang 3.4.1 on 10.11.54.7 Precedence: bulk List-ID: X-Mailing-List: linux-block@vger.kernel.org Add the data and methods that manage queries to the deduplication index and the responses from the index. Co-developed-by: J. corwin Coburn Signed-off-by: J. corwin Coburn Co-developed-by: Michael Sclafani Signed-off-by: Michael Sclafani Co-developed-by: Sweet Tea Dorminy Signed-off-by: Sweet Tea Dorminy Co-developed-by: Bruce Johnston Signed-off-by: Bruce Johnston Co-developed-by: Ken Raeburn Signed-off-by: Ken Raeburn Signed-off-by: Matthew Sakai --- drivers/md/dm-vdo/dedupe.c | 623 +++++++++++++++++++++++++++++++++++++ drivers/md/dm-vdo/dedupe.h | 26 ++ 2 files changed, 649 insertions(+) diff --git a/drivers/md/dm-vdo/dedupe.c b/drivers/md/dm-vdo/dedupe.c index cf0fbdc30ca5..a6546a903120 100644 --- a/drivers/md/dm-vdo/dedupe.c +++ b/drivers/md/dm-vdo/dedupe.c @@ -143,6 +143,48 @@ #include "vdo.h" #include "wait-queue.h" +struct uds_attribute { + struct attribute attr; + const char *(*show_string)(struct hash_zones *hash_zones); +}; + +enum timer_state { + DEDUPE_QUERY_TIMER_IDLE, + DEDUPE_QUERY_TIMER_RUNNING, + DEDUPE_QUERY_TIMER_FIRED, +}; + +enum dedupe_context_state { + DEDUPE_CONTEXT_IDLE, + DEDUPE_CONTEXT_PENDING, + DEDUPE_CONTEXT_TIMED_OUT, + DEDUPE_CONTEXT_COMPLETE, + DEDUPE_CONTEXT_TIMED_OUT_COMPLETE, +}; + +/* Possible index states: closed, opened, or transitioning between those two. */ +enum index_state { + IS_CLOSED, + IS_CHANGING, + IS_OPENED, +}; + +static const char *CLOSED = "closed"; +static const char *CLOSING = "closing"; +static const char *ERROR = "error"; +static const char *OFFLINE = "offline"; +static const char *ONLINE = "online"; +static const char *OPENING = "opening"; +static const char *SUSPENDED = "suspended"; +static const char *UNKNOWN = "unknown"; + +/* Version 2 uses the kernel space UDS index and is limited to 16 bytes */ +enum { + UDS_ADVICE_VERSION = 2, + /* version byte + state byte + 64-bit little-endian PBN */ + UDS_ADVICE_SIZE = 1 + 1 + sizeof(u64), +}; + enum hash_lock_state { /* State for locks that are not in use or are being initialized. */ VDO_HASH_LOCK_INITIALIZING, @@ -264,6 +306,13 @@ struct hash_zones { struct hash_zone zones[]; }; +/* These are in milliseconds. */ +unsigned int vdo_dedupe_index_timeout_interval = 5000; +unsigned int vdo_dedupe_index_min_timer_interval = 100; +/* Same two variables, in jiffies for easier consumption. 
*/ +static u64 vdo_dedupe_index_timeout_jiffies; +static u64 vdo_dedupe_index_min_timer_jiffies; + static inline struct hash_zone *as_hash_zone(struct vdo_completion *completion) { vdo_assert_completion_type(completion, VDO_HASH_ZONE_COMPLETION); @@ -283,6 +332,16 @@ static inline void assert_in_hash_zone(struct hash_zone *zone, const char *name) name); } +static inline bool change_context_state(struct dedupe_context *context, int old, int new) +{ + return (atomic_cmpxchg(&context->state, old, new) == old); +} + +static inline bool change_timer_state(struct hash_zone *zone, int old, int new) +{ + return (atomic_cmpxchg(&zone->timer_state, old, new) == old); +} + /** * return_hash_lock_to_pool() - (Re)initialize a hash lock and return it to its pool. * @zone: The zone from which the lock was borrowed. @@ -1955,6 +2014,376 @@ static u32 hash_key(const void *key) return get_unaligned_le32(&name->name[4]); } +static void dedupe_kobj_release(struct kobject *directory) +{ + UDS_FREE(container_of(directory, struct hash_zones, dedupe_directory)); +} + +static ssize_t dedupe_status_show(struct kobject *directory, struct attribute *attr, char *buf) +{ + struct uds_attribute *ua = container_of(attr, struct uds_attribute, attr); + struct hash_zones *zones = container_of(directory, struct hash_zones, dedupe_directory); + + if (ua->show_string != NULL) + return sprintf(buf, "%s\n", ua->show_string(zones)); + else + return -EINVAL; +} + +static ssize_t dedupe_status_store(struct kobject *kobj __always_unused, + struct attribute *attr __always_unused, + const char *buf __always_unused, + size_t length __always_unused) +{ + return -EINVAL; +} + +/*----------------------------------------------------------------------*/ + +static const struct sysfs_ops dedupe_sysfs_ops = { + .show = dedupe_status_show, + .store = dedupe_status_store, +}; + +static struct uds_attribute dedupe_status_attribute = { + .attr = {.name = "status", .mode = 0444, }, + .show_string = vdo_get_dedupe_index_state_name, +}; + +static struct attribute *dedupe_attrs[] = { + &dedupe_status_attribute.attr, + NULL, +}; +ATTRIBUTE_GROUPS(dedupe); + +static const struct kobj_type dedupe_directory_type = { + .release = dedupe_kobj_release, + .sysfs_ops = &dedupe_sysfs_ops, + .default_groups = dedupe_groups, +}; + +static void start_uds_queue(void *ptr) +{ + /* + * Allow the UDS dedupe worker thread to do memory allocations. It will only do allocations + * during the UDS calls that open or close an index, but those allocations can safely sleep + * while reserving a large amount of memory. We could use an allocations_allowed boolean + * (like the base threads do), but it would be an unnecessary embellishment. + */ + struct vdo_thread *thread = vdo_get_work_queue_owner(vdo_get_current_work_queue()); + + uds_register_allocating_thread(&thread->allocating_thread, NULL); +} + +static void finish_uds_queue(void *ptr __always_unused) +{ + uds_unregister_allocating_thread(); +} + +static void close_index(struct hash_zones *zones) +{ + int result; + + /* + * Change the index state so that get_index_statistics() will not try to use the index + * session we are closing. + */ + zones->index_state = IS_CHANGING; + /* Close the index session, while not holding the lock. 
+ */ + spin_unlock(&zones->lock); + result = uds_close_index(zones->index_session); + + if (result != UDS_SUCCESS) + uds_log_error_strerror(result, "Error closing index"); + spin_lock(&zones->lock); + zones->index_state = IS_CLOSED; + zones->error_flag |= result != UDS_SUCCESS; + /* ASSERTION: We leave in IS_CLOSED state. */ +} + +static void open_index(struct hash_zones *zones) +{ + /* ASSERTION: We enter in IS_CLOSED state. */ + int result; + bool create_flag = zones->create_flag; + + zones->create_flag = false; + /* + * Change the index state so that it will be reported to the outside world as + * "opening". + */ + zones->index_state = IS_CHANGING; + zones->error_flag = false; + + /* Open the index session, while not holding the lock. */ + spin_unlock(&zones->lock); + result = uds_open_index(create_flag ? UDS_CREATE : UDS_LOAD, + &zones->parameters, + zones->index_session); + if (result != UDS_SUCCESS) + uds_log_error_strerror(result, "Error opening index"); + + spin_lock(&zones->lock); + if (!create_flag) { + switch (result) { + case -ENOENT: + /* + * Either there is no index, or there is no way we can recover the index. + * We will be called again and try to create a new index. + */ + zones->index_state = IS_CLOSED; + zones->create_flag = true; + return; + default: + break; + } + } + if (result == UDS_SUCCESS) { + zones->index_state = IS_OPENED; + } else { + zones->index_state = IS_CLOSED; + zones->index_target = IS_CLOSED; + zones->error_flag = true; + spin_unlock(&zones->lock); + uds_log_info("Setting UDS index target state to error"); + spin_lock(&zones->lock); + } + /* + * ASSERTION: On success, we leave in IS_OPENED state. + * ASSERTION: On failure, we leave in IS_CLOSED state. + */ +} + +static void change_dedupe_state(struct vdo_completion *completion) +{ + struct hash_zones *zones = as_hash_zones(completion); + + spin_lock(&zones->lock); + + /* Loop until the index is in the target state and the create flag is clear. */ + while (vdo_is_state_normal(&zones->state) && + ((zones->index_state != zones->index_target) || zones->create_flag)) { + if (zones->index_state == IS_OPENED) + close_index(zones); + else + open_index(zones); + } + + zones->changing = false; + spin_unlock(&zones->lock); +} + +static void start_expiration_timer(struct dedupe_context *context) +{ + u64 start_time = context->submission_jiffies; + u64 end_time; + + if (!change_timer_state(context->zone, + DEDUPE_QUERY_TIMER_IDLE, + DEDUPE_QUERY_TIMER_RUNNING)) + return; + + end_time = max(start_time + vdo_dedupe_index_timeout_jiffies, + jiffies + vdo_dedupe_index_min_timer_jiffies); + mod_timer(&context->zone->timer, end_time); +} + +/** + * report_dedupe_timeouts() - Record and eventually report that some dedupe requests reached their + * expiration time without getting answers, so we timed them out. + * @zones: The hash zones. + * @timeouts: The number of newly timed-out requests.
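The max() in start_expiration_timer() above guarantees the timer never fires sooner than the minimum interval, even when the oldest pending request is already close to its deadline. A small worked example using the defaults declared earlier in this file (5000 ms timeout, 100 ms minimum) and assuming HZ=1000 so that one jiffy is one millisecond; the real code converts with msecs_to_jiffies():

#include <stdint.h>
#include <stdio.h>

/* Assumed HZ == 1000, so one jiffy == 1 ms; defaults from this file. */
#define TIMEOUT_JIFFIES 5000	/* vdo_dedupe_index_timeout_interval */
#define MIN_TIMER_JIFFIES 100	/* vdo_dedupe_index_min_timer_interval */

static uint64_t expiration(uint64_t submission_jiffies, uint64_t now)
{
	uint64_t nominal = submission_jiffies + TIMEOUT_JIFFIES;
	uint64_t floor = now + MIN_TIMER_JIFFIES;

	/*
	 * Same max() as start_expiration_timer(): never arm the timer sooner
	 * than the minimum interval, even for an almost-expired request.
	 */
	return (nominal > floor) ? nominal : floor;
}

int main(void)
{
	/*
	 * A request submitted at jiffy 1000000, timer armed at jiffy 1004950:
	 * the nominal expiry of 1005000 loses to the floor of 1005050.
	 */
	printf("%llu\n", (unsigned long long) expiration(1000000, 1004950));
	return 0;
}

Here the nominal expiration at jiffy 1005000 is overridden by the 1005050 floor, so a burst of nearly-expired requests cannot re-arm the timer more often than once per minimum interval.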
+ */ +static void report_dedupe_timeouts(struct hash_zones *zones, unsigned int timeouts) +{ + atomic64_add(timeouts, &zones->timeouts); + spin_lock(&zones->lock); + if (__ratelimit(&zones->ratelimiter)) { + u64 unreported = atomic64_read(&zones->timeouts); + + unreported -= zones->reported_timeouts; + uds_log_debug("UDS index timeout on %llu requests", + (unsigned long long) unreported); + zones->reported_timeouts += unreported; + } + spin_unlock(&zones->lock); +} + +static int initialize_index(struct vdo *vdo, struct hash_zones *zones) +{ + int result; + off_t uds_offset; + struct volume_geometry geometry = vdo->geometry; + static const struct vdo_work_queue_type uds_queue_type = { + .start = start_uds_queue, + .finish = finish_uds_queue, + .max_priority = UDS_Q_MAX_PRIORITY, + .default_priority = UDS_Q_PRIORITY, + }; + + vdo_set_dedupe_index_timeout_interval(vdo_dedupe_index_timeout_interval); + vdo_set_dedupe_index_min_timer_interval(vdo_dedupe_index_min_timer_interval); + + /* + * Since we will save up the timeouts that would have been reported but were ratelimited, + * we don't need to report ratelimiting. + */ + ratelimit_default_init(&zones->ratelimiter); + ratelimit_set_flags(&zones->ratelimiter, RATELIMIT_MSG_ON_RELEASE); + uds_offset = ((vdo_get_index_region_start(geometry) - + geometry.bio_offset) * VDO_BLOCK_SIZE); + zones->parameters = (struct uds_parameters) { + .bdev = vdo->device_config->owned_device->bdev, + .offset = uds_offset, + .size = (vdo_get_index_region_size(geometry) * VDO_BLOCK_SIZE), + .memory_size = geometry.index_config.mem, + .sparse = geometry.index_config.sparse, + .nonce = (u64) geometry.nonce, + }; + + result = uds_create_index_session(&zones->index_session); + if (result != UDS_SUCCESS) + return result; + + result = vdo_make_thread(vdo, vdo->thread_config.dedupe_thread, &uds_queue_type, 1, NULL); + if (result != VDO_SUCCESS) { + uds_destroy_index_session(UDS_FORGET(zones->index_session)); + uds_log_error("UDS index queue initialization failed (%d)", result); + return result; + } + + vdo_initialize_completion(&zones->completion, vdo, VDO_HASH_ZONES_COMPLETION); + vdo_set_completion_callback(&zones->completion, + change_dedupe_state, + vdo->thread_config.dedupe_thread); + kobject_init(&zones->dedupe_directory, &dedupe_directory_type); + return VDO_SUCCESS; +} + +/** + * finish_index_operation() - This is the UDS callback for index queries. + * @request: The uds request which has just completed. + */ +static void finish_index_operation(struct uds_request *request) +{ + struct dedupe_context *context = container_of(request, struct dedupe_context, request); + + if (change_context_state(context, DEDUPE_CONTEXT_PENDING, DEDUPE_CONTEXT_COMPLETE)) { + /* + * This query has not timed out, so send its data_vio back to its hash zone to + * process the results. + */ + continue_data_vio(context->requestor); + return; + } + + /* + * This query has timed out, so try to mark it complete and hence eligible for reuse. Its + * data_vio has already moved on. + */ + if (!change_context_state(context, + DEDUPE_CONTEXT_TIMED_OUT, + DEDUPE_CONTEXT_TIMED_OUT_COMPLETE)) + ASSERT_LOG_ONLY(false, + "uds request was timed out (state %d)", + atomic_read(&context->state)); + + uds_funnel_queue_put(context->zone->timed_out_complete, &context->queue_entry); +} + +/** + * check_for_drain_complete() - Check whether this zone has drained. + * @zone: The zone to check. 
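The interplay between finish_index_operation() above and the timeout path below is a small compare-and-swap state machine: whichever side first moves a context out of DEDUPE_CONTEXT_PENDING owns the requestor. A stand-alone C11 sketch of that pattern (not the kernel code; kernel atomics replaced by <stdatomic.h>):

#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

enum { IDLE, PENDING, TIMED_OUT, COMPLETE, TIMED_OUT_COMPLETE };

/* Mirror of change_context_state(): advance only from the expected state. */
static bool change_state(atomic_int *state, int old, int new)
{
	return atomic_compare_exchange_strong(state, &old, new);
}

int main(void)
{
	atomic_int state = PENDING;

	/*
	 * Timeout path wins the race: PENDING -> TIMED_OUT, so it sends the
	 * requestor on its way and keeps the context off the free list.
	 */
	if (change_state(&state, PENDING, TIMED_OUT))
		printf("timeout claimed the context\n");

	/*
	 * The index callback then loses PENDING -> COMPLETE and instead
	 * retires the context with TIMED_OUT -> TIMED_OUT_COMPLETE.
	 */
	if (!change_state(&state, PENDING, COMPLETE))
		change_state(&state, TIMED_OUT, TIMED_OUT_COMPLETE);

	printf("final state: %d\n", atomic_load(&state));
	return 0;
}

Whichever transition succeeds, exactly one of the two paths hands the requestor back to its hash zone.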
+ */ +static void check_for_drain_complete(struct hash_zone *zone) +{ + data_vio_count_t recycled = 0; + + if (!vdo_is_state_draining(&zone->state)) + return; + + if ((atomic_read(&zone->timer_state) == DEDUPE_QUERY_TIMER_IDLE) || + change_timer_state(zone, DEDUPE_QUERY_TIMER_RUNNING, DEDUPE_QUERY_TIMER_IDLE)) + del_timer_sync(&zone->timer); + else + /* + * There is an in-flight timeout, which must be processed before we can continue. + */ + return; + + for (;;) { + struct dedupe_context *context; + struct funnel_queue_entry *entry; + + entry = uds_funnel_queue_poll(zone->timed_out_complete); + if (entry == NULL) + break; + + context = container_of(entry, struct dedupe_context, queue_entry); + atomic_set(&context->state, DEDUPE_CONTEXT_IDLE); + list_add(&context->list_entry, &zone->available); + recycled++; + } + + if (recycled > 0) + WRITE_ONCE(zone->active, zone->active - recycled); + ASSERT_LOG_ONLY(READ_ONCE(zone->active) == 0, "all contexts inactive"); + vdo_finish_draining(&zone->state); +} + +static void timeout_index_operations_callback(struct vdo_completion *completion) +{ + struct dedupe_context *context, *tmp; + struct hash_zone *zone = as_hash_zone(completion); + u64 timeout_jiffies = msecs_to_jiffies(vdo_dedupe_index_timeout_interval); + unsigned long cutoff = jiffies - timeout_jiffies; + unsigned int timed_out = 0; + + atomic_set(&zone->timer_state, DEDUPE_QUERY_TIMER_IDLE); + list_for_each_entry_safe(context, tmp, &zone->pending, list_entry) { + if (cutoff <= context->submission_jiffies) { + /* + * We have reached the oldest query which has not timed out yet, so restart + * the timer. + */ + start_expiration_timer(context); + break; + } + + if (!change_context_state(context, + DEDUPE_CONTEXT_PENDING, + DEDUPE_CONTEXT_TIMED_OUT)) + /* + * This context completed between the time the timeout fired and now. We + * can treat it as a successful query; its requestor is already enqueued + * to process it. + */ + continue; + + /* + * Remove this context from the pending list so we won't look at it again on a + * subsequent timeout. Once the index completes it, it will be reused. Meanwhile, + * send its requestor on its way.
+ */ + list_del_init(&context->list_entry); + continue_data_vio(context->requestor); + timed_out++; + } + + if (timed_out > 0) + report_dedupe_timeouts(completion->vdo->hash_zones, timed_out); + + check_for_drain_complete(zone); +} + +static void timeout_index_operations(struct timer_list *t) +{ + struct hash_zone *zone = from_timer(zone, t, timer); + + if (change_timer_state(zone, DEDUPE_QUERY_TIMER_RUNNING, DEDUPE_QUERY_TIMER_FIRED)) + vdo_launch_completion(&zone->completion); +} + static int __must_check initialize_zone(struct vdo *vdo, struct hash_zones *zones, zone_count_t zone_number) { @@ -2448,3 +2877,197 @@ void vdo_dump_hash_zones(struct hash_zones *zones) for (zone = 0; zone < zones->zone_count; zone++) dump_hash_zone(&zones->zones[zone]); } + +void vdo_set_dedupe_index_timeout_interval(unsigned int value) +{ + u64 alb_jiffies; + + /* Arbitrary maximum value is two minutes */ + if (value > 120000) + value = 120000; + /* Arbitrary minimum value is 2 jiffies */ + alb_jiffies = msecs_to_jiffies(value); + + if (alb_jiffies < 2) { + alb_jiffies = 2; + value = jiffies_to_msecs(alb_jiffies); + } + vdo_dedupe_index_timeout_interval = value; + vdo_dedupe_index_timeout_jiffies = alb_jiffies; +} + +void vdo_set_dedupe_index_min_timer_interval(unsigned int value) +{ + u64 min_jiffies; + + /* Arbitrary maximum value is one second */ + if (value > 1000) + value = 1000; + + /* Arbitrary minimum value is 2 jiffies */ + min_jiffies = msecs_to_jiffies(value); + + if (min_jiffies < 2) { + min_jiffies = 2; + value = jiffies_to_msecs(min_jiffies); + } + + vdo_dedupe_index_min_timer_interval = value; + vdo_dedupe_index_min_timer_jiffies = min_jiffies; +} + +/** + * acquire_context() - Acquire a dedupe context from a hash_zone if any are available. + * @zone: the hash zone + * + * Return: A dedupe_context or NULL if none are available + */ +static struct dedupe_context * __must_check acquire_context(struct hash_zone *zone) +{ + struct dedupe_context *context; + struct funnel_queue_entry *entry; + + assert_in_hash_zone(zone, __func__); + + if (!list_empty(&zone->available)) { + WRITE_ONCE(zone->active, zone->active + 1); + context = list_first_entry(&zone->available, struct dedupe_context, list_entry); + list_del_init(&context->list_entry); + return context; + } + + entry = uds_funnel_queue_poll(zone->timed_out_complete); + return ((entry == NULL) ? NULL : container_of(entry, struct dedupe_context, queue_entry)); +} + +static void prepare_uds_request(struct uds_request *request, + struct data_vio *data_vio, + enum uds_request_type operation) +{ + request->record_name = data_vio->record_name; + request->type = operation; + if ((operation == UDS_POST) || (operation == UDS_UPDATE)) { + size_t offset = 0; + struct uds_record_data *encoding = &request->new_metadata; + + encoding->data[offset++] = UDS_ADVICE_VERSION; + encoding->data[offset++] = data_vio->new_mapped.state; + put_unaligned_le64(data_vio->new_mapped.pbn, &encoding->data[offset]); + offset += sizeof(u64); + BUG_ON(offset != UDS_ADVICE_SIZE); + } +} + +/* + * The index operation will inquire about data_vio.record_name, providing (if the operation is + * appropriate) advice from the data_vio's new_mapped fields. The advice found in the index (or + * NULL if none) will be returned via receive_data_vio_dedupe_advice(). dedupe_context.status is + * set to the return status code of any asynchronous index processing. 
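The advice payload built by prepare_uds_request() above is deliberately tiny: a version byte, a block-mapping-state byte, and a little-endian 64-bit physical block number, ten bytes in all (UDS_ADVICE_SIZE). A stand-alone encode/decode sketch of that layout (user-space code with a hypothetical struct standing in for the data_vio fields; not part of the patch):

#include <assert.h>
#include <stdint.h>

#define ADVICE_VERSION 2	/* UDS_ADVICE_VERSION */
#define ADVICE_SIZE (1 + 1 + sizeof(uint64_t))	/* UDS_ADVICE_SIZE */

struct advice {
	uint8_t version;
	uint8_t mapping_state;
	uint64_t pbn;
};

/* Pack the advice the way prepare_uds_request() lays it out. */
static void encode_advice(const struct advice *advice, uint8_t buf[ADVICE_SIZE])
{
	unsigned int i;

	buf[0] = advice->version;
	buf[1] = advice->mapping_state;
	for (i = 0; i < 8; i++)	/* little-endian PBN */
		buf[2 + i] = (uint8_t) (advice->pbn >> (8 * i));
}

/* Reverse the layout; a real consumer would also validate the version byte. */
static void decode_advice(const uint8_t buf[ADVICE_SIZE], struct advice *advice)
{
	unsigned int i;

	advice->version = buf[0];
	advice->mapping_state = buf[1];
	advice->pbn = 0;
	for (i = 0; i < 8; i++)
		advice->pbn |= ((uint64_t) buf[2 + i]) << (8 * i);
}

int main(void)
{
	struct advice in = { ADVICE_VERSION, 1, 123456789 }, out;
	uint8_t buf[ADVICE_SIZE];

	encode_advice(&in, buf);
	decode_advice(buf, &out);
	assert(out.pbn == in.pbn && out.mapping_state == in.mapping_state);
	return 0;
}

A real consumer would want to reject advice whose version byte it does not recognize before trusting the PBN it carries.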
+ */ +static void query_index(struct data_vio *data_vio, enum uds_request_type operation) +{ + int result; + struct dedupe_context *context; + struct vdo *vdo = vdo_from_data_vio(data_vio); + struct hash_zone *zone = data_vio->hash_zone; + + assert_data_vio_in_hash_zone(data_vio); + + if (!READ_ONCE(vdo->hash_zones->dedupe_flag)) { + continue_data_vio(data_vio); + return; + } + + context = acquire_context(zone); + if (context == NULL) { + atomic64_inc(&vdo->hash_zones->dedupe_context_busy); + continue_data_vio(data_vio); + return; + } + + data_vio->dedupe_context = context; + context->requestor = data_vio; + context->submission_jiffies = jiffies; + prepare_uds_request(&context->request, data_vio, operation); + atomic_set(&context->state, DEDUPE_CONTEXT_PENDING); + list_add_tail(&context->list_entry, &zone->pending); + start_expiration_timer(context); + result = uds_launch_request(&context->request); + if (result != UDS_SUCCESS) { + context->request.status = result; + finish_index_operation(&context->request); + } +} + +static void set_target_state(struct hash_zones *zones, + enum index_state target, + bool change_dedupe, + bool dedupe, + bool set_create) +{ + const char *old_state, *new_state; + + spin_lock(&zones->lock); + old_state = index_state_to_string(zones, zones->index_target); + if (change_dedupe) + WRITE_ONCE(zones->dedupe_flag, dedupe); + + if (set_create) + zones->create_flag = true; + + zones->index_target = target; + launch_dedupe_state_change(zones); + new_state = index_state_to_string(zones, zones->index_target); + spin_unlock(&zones->lock); + + if (old_state != new_state) + uds_log_info("Setting UDS index target state to %s", new_state); +} + +const char *vdo_get_dedupe_index_state_name(struct hash_zones *zones) +{ + const char *state; + + spin_lock(&zones->lock); + state = index_state_to_string(zones, zones->index_state); + spin_unlock(&zones->lock); + + return state; +} + +/* Handle a dmsetup message relevant to the index. */ +int vdo_message_dedupe_index(struct hash_zones *zones, const char *name) +{ + if (strcasecmp(name, "index-close") == 0) { + set_target_state(zones, IS_CLOSED, false, false, false); + return 0; + } else if (strcasecmp(name, "index-create") == 0) { + set_target_state(zones, IS_OPENED, false, false, true); + return 0; + } else if (strcasecmp(name, "index-disable") == 0) { + set_target_state(zones, IS_OPENED, true, false, false); + return 0; + } else if (strcasecmp(name, "index-enable") == 0) { + set_target_state(zones, IS_OPENED, true, true, false); + return 0; + } + return -EINVAL; +} + +int vdo_add_dedupe_index_sysfs(struct hash_zones *zones) +{ + int result = kobject_add(&zones->dedupe_directory, + &zones->completion.vdo->vdo_directory, + "dedupe"); + + if (result == 0) + vdo_set_admin_state_code(&zones->state, VDO_ADMIN_STATE_NORMAL_OPERATION); + + return result; +} + +/* If create_flag, create a new index without first attempting to load an existing index. 
*/ +void vdo_start_dedupe_index(struct hash_zones *zones, bool create_flag) +{ + set_target_state(zones, IS_OPENED, true, true, create_flag); +} diff --git a/drivers/md/dm-vdo/dedupe.h b/drivers/md/dm-vdo/dedupe.h index af329fb0fa68..01fdb9bc29c1 100644 --- a/drivers/md/dm-vdo/dedupe.h +++ b/drivers/md/dm-vdo/dedupe.h @@ -90,4 +90,30 @@ vdo_select_hash_zone(struct hash_zones *zones, const struct uds_record_name *nam void vdo_dump_hash_zones(struct hash_zones *zones); +const char *vdo_get_dedupe_index_state_name(struct hash_zones *zones); + +u64 vdo_get_dedupe_index_timeout_count(struct hash_zones *zones); + +int vdo_message_dedupe_index(struct hash_zones *zones, const char *name); + +int vdo_add_dedupe_index_sysfs(struct hash_zones *zones); + +void vdo_start_dedupe_index(struct hash_zones *zones, bool create_flag); + +void vdo_resume_hash_zones(struct hash_zones *zones, struct vdo_completion *parent); + +void vdo_finish_dedupe_index(struct hash_zones *zones); + +/* Interval (in milliseconds) from submission until switching to fast path and skipping UDS. */ +extern unsigned int vdo_dedupe_index_timeout_interval; + +/* + * Minimum time interval (in milliseconds) between timer invocations to check for requests waiting + * for UDS that should now time out. + */ +extern unsigned int vdo_dedupe_index_min_timer_interval; + +void vdo_set_dedupe_index_timeout_interval(unsigned int value); +void vdo_set_dedupe_index_min_timer_interval(unsigned int value); + #endif /* VDO_DEDUPE_H */ From patchwork Thu Oct 26 21:41:21 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Matthew Sakai X-Patchwork-Id: 13437941 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 73EBDC25B48 for ; Thu, 26 Oct 2023 21:44:10 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S234854AbjJZVoK (ORCPT ); Thu, 26 Oct 2023 17:44:10 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:40958 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S234816AbjJZVoJ (ORCPT ); Thu, 26 Oct 2023 17:44:09 -0400 Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.129.124]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 8253BD4A for ; Thu, 26 Oct 2023 14:41:45 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1698356504; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=JL29HXBSdYE+EMxOpRxxI1rqpTvivAzV1rrrVQSyP9E=; b=N3WTaw+QIja6JfCtl1Ba7bDWXVY7poVBX1GRNvyqCitYP2EtG0KabQbudskqYIgm9uKwv5 mXSEeeGqSyWV0EcZ1KD05UWXMV+GaKDI641yKZl48ERvj/Ca0FpTuuSUqGO3k0t9XJ5Xg3 8c6cyROeYvcnaKvj/+ZVnLe3YwQJtH8= Received: from mimecast-mx02.redhat.com (mimecast-mx02.redhat.com [66.187.233.88]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.3, cipher=TLS_AES_256_GCM_SHA384) id us-mta-681-yGD2oWGxNLe4Uk897XwhVw-1; Thu, 26 Oct 2023 17:41:41 -0400 X-MC-Unique: yGD2oWGxNLe4Uk897XwhVw-1 Received: from smtp.corp.redhat.com (int-mx05.intmail.prod.int.rdu2.redhat.com [10.11.54.5]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature 
RSA-PSS (2048 bits) server-digest SHA256) (No client certificate requested) by mimecast-mx02.redhat.com (Postfix) with ESMTPS id 3C78A81101F; Thu, 26 Oct 2023 21:41:41 +0000 (UTC) Received: from pbitcolo-build-10.permabit.com (pbitcolo-build-10.permabit.lab.eng.bos.redhat.com [10.19.117.76]) by smtp.corp.redhat.com (Postfix) with ESMTP id 2D5612949; Thu, 26 Oct 2023 21:41:41 +0000 (UTC) Received: by pbitcolo-build-10.permabit.com (Postfix, from userid 1138) id DE3C43003D; Thu, 26 Oct 2023 17:41:40 -0400 (EDT) From: Matthew Sakai To: dm-devel@lists.linux.dev, linux-block@vger.kernel.org Cc: Matthew Sakai Subject: [PATCH v4 24/39] dm vdo: add the compressed block bin packer Date: Thu, 26 Oct 2023 17:41:21 -0400 Message-Id: <20231026214136.1067410-25-msakai@redhat.com> In-Reply-To: <20231026214136.1067410-1-msakai@redhat.com> References: <20231026214136.1067410-1-msakai@redhat.com> MIME-Version: 1.0 X-Scanned-By: MIMEDefang 3.4.1 on 10.11.54.5 Precedence: bulk List-ID: X-Mailing-List: linux-block@vger.kernel.org When blocks do not deduplicate, vdo will attempt to compress them. Up to 14 compressed blocks may be packed into a single data block (this limitation is imposed by the block map). The packer implements a simple best-fit packing algorithm and also manages the formatting and writing of compressed blocks when bins fill. Co-developed-by: J. corwin Coburn Signed-off-by: J. corwin Coburn Co-developed-by: Michael Sclafani Signed-off-by: Michael Sclafani Co-developed-by: Sweet Tea Dorminy Signed-off-by: Sweet Tea Dorminy Signed-off-by: Matthew Sakai --- drivers/md/dm-vdo/packer.c | 794 +++++++++++++++++++++++++++++++++++++ drivers/md/dm-vdo/packer.h | 123 ++++++ 2 files changed, 917 insertions(+) create mode 100644 drivers/md/dm-vdo/packer.c create mode 100644 drivers/md/dm-vdo/packer.h diff --git a/drivers/md/dm-vdo/packer.c b/drivers/md/dm-vdo/packer.c new file mode 100644 index 000000000000..03d2c77931b2 --- /dev/null +++ b/drivers/md/dm-vdo/packer.c @@ -0,0 +1,794 @@ +// SPDX-License-Identifier: GPL-2.0-only +/* + * Copyright 2023 Red Hat + */ + +#include "packer.h" + +#include +#include + +#include "logger.h" +#include "memory-alloc.h" +#include "permassert.h" +#include "string-utils.h" + +#include "admin-state.h" +#include "completion.h" +#include "constants.h" +#include "data-vio.h" +#include "dedupe.h" +#include "encodings.h" +#include "io-submitter.h" +#include "physical-zone.h" +#include "status-codes.h" +#include "vdo.h" +#include "vio.h" + +static const struct version_number COMPRESSED_BLOCK_1_0 = { + .major_version = 1, + .minor_version = 0, +}; + +enum { + COMPRESSED_BLOCK_1_0_SIZE = 4 + 4 + (2 * VDO_MAX_COMPRESSION_SLOTS), +}; + +/** + * vdo_get_compressed_block_fragment() - Get a reference to a compressed fragment from a compressed + * block. + * @mapping_state [in] The mapping state for the look up. + * @compressed_block [in] The compressed block that was read from disk. + * @fragment_offset [out] The offset of the fragment within a compressed block. + * @fragment_size [out] The size of the fragment. + * + * Return: If a valid compressed fragment is found, VDO_SUCCESS; otherwise, VDO_INVALID_FRAGMENT if + * the fragment is invalid. 
+ */ +int vdo_get_compressed_block_fragment(enum block_mapping_state mapping_state, + struct compressed_block *block, + u16 *fragment_offset, + u16 *fragment_size) +{ + u16 compressed_size; + u16 offset = 0; + unsigned int i; + u8 slot; + struct version_number version; + + if (!vdo_is_state_compressed(mapping_state)) + return VDO_INVALID_FRAGMENT; + + version = vdo_unpack_version_number(block->header.version); + if (!vdo_are_same_version(version, COMPRESSED_BLOCK_1_0)) + return VDO_INVALID_FRAGMENT; + + slot = mapping_state - VDO_MAPPING_STATE_COMPRESSED_BASE; + if (slot >= VDO_MAX_COMPRESSION_SLOTS) + return VDO_INVALID_FRAGMENT; + + compressed_size = __le16_to_cpu(block->header.sizes[slot]); + for (i = 0; i < slot; i++) { + offset += __le16_to_cpu(block->header.sizes[i]); + if (offset >= VDO_COMPRESSED_BLOCK_DATA_SIZE) + return VDO_INVALID_FRAGMENT; + } + + if ((offset + compressed_size) > VDO_COMPRESSED_BLOCK_DATA_SIZE) + return VDO_INVALID_FRAGMENT; + + *fragment_offset = offset; + *fragment_size = compressed_size; + return VDO_SUCCESS; +} + +/** + * assert_on_packer_thread() - Check that we are on the packer thread. + * @packer: The packer. + * @caller: The function which is asserting. + */ +static inline void assert_on_packer_thread(struct packer *packer, const char *caller) +{ + ASSERT_LOG_ONLY((vdo_get_callback_thread_id() == packer->thread_id), + "%s() called from packer thread", caller); +} + +/** + * insert_in_sorted_list() - Insert a bin to the list. + * @packer: The packer. + * @bin: The bin to move to its sorted position. + * + * The list is in ascending order of free space. Since all bins are already in the list, this + * actually moves the bin to the correct position in the list. + */ +static void insert_in_sorted_list(struct packer *packer, struct packer_bin *bin) +{ + struct packer_bin *active_bin; + + list_for_each_entry(active_bin, &packer->bins, list) + if (active_bin->free_space > bin->free_space) { + list_move_tail(&bin->list, &active_bin->list); + return; + } + + list_move_tail(&bin->list, &packer->bins); +} + +/** + * make_bin() - Allocate a bin and put it into the packer's list. + * @packer: The packer. + */ +static int __must_check make_bin(struct packer *packer) +{ + struct packer_bin *bin; + int result; + + result = UDS_ALLOCATE_EXTENDED(struct packer_bin, + VDO_MAX_COMPRESSION_SLOTS, + struct vio *, + __func__, + &bin); + if (result != VDO_SUCCESS) + return result; + + bin->free_space = VDO_COMPRESSED_BLOCK_DATA_SIZE; + INIT_LIST_HEAD(&bin->list); + list_add_tail(&bin->list, &packer->bins); + return VDO_SUCCESS; +} + +/** + * vdo_make_packer() - Make a new block packer. + * + * @vdo: The vdo to which this packer belongs. + * @bin_count: The number of partial bins to keep in memory. + * @packer_ptr: A pointer to hold the new packer. + * + * Return: VDO_SUCCESS or an error + */ +int vdo_make_packer(struct vdo *vdo, block_count_t bin_count, struct packer **packer_ptr) +{ + struct packer *packer; + block_count_t i; + int result; + + result = UDS_ALLOCATE(1, struct packer, __func__, &packer); + if (result != VDO_SUCCESS) + return result; + + packer->thread_id = vdo->thread_config.packer_thread; + packer->size = bin_count; + INIT_LIST_HEAD(&packer->bins); + vdo_set_admin_state_code(&packer->state, VDO_ADMIN_STATE_NORMAL_OPERATION); + + for (i = 0; i < bin_count; i++) { + result = make_bin(packer); + if (result != VDO_SUCCESS) { + vdo_free_packer(packer); + return result; + } + } + + /* + * The canceled bin can hold up to half the number of user vios. 
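The fragment lookup in vdo_get_compressed_block_fragment() above is pure arithmetic over the on-disk header: a packed version number (the 4 + 4 in COMPRESSED_BLOCK_1_0_SIZE) followed by one little-endian 16-bit size per slot, with a fragment's offset being the sum of the sizes of all earlier slots. A user-space sketch of the same walk, assuming the 14 compression slots mentioned in the patch description, a 4K block, and a data area of block size minus header size (hypothetical constants, not the target's definitions):

#include <stdint.h>
#include <stdio.h>

#define SLOTS 14	/* VDO_MAX_COMPRESSION_SLOTS, per the patch description */
#define HEADER_SIZE (4 + 4 + 2 * SLOTS)	/* COMPRESSED_BLOCK_1_0_SIZE */
#define DATA_SIZE (4096 - HEADER_SIZE)	/* assumed data area of a compressed block */

/*
 * Given the per-slot sizes from a compressed block header, compute the byte
 * offset and size of one fragment, mirroring the loop in
 * vdo_get_compressed_block_fragment(). Returns -1 for an invalid fragment.
 */
static int find_fragment(const uint16_t sizes[SLOTS], unsigned int slot,
			 uint16_t *offset, uint16_t *size)
{
	uint32_t start = 0;
	unsigned int i;

	if (slot >= SLOTS)
		return -1;

	for (i = 0; i < slot; i++) {
		start += sizes[i];
		if (start >= DATA_SIZE)
			return -1;
	}

	if (start + sizes[slot] > DATA_SIZE)
		return -1;

	*offset = (uint16_t) start;
	*size = sizes[slot];
	return 0;
}

int main(void)
{
	/* Three fragments of 1000, 800, and 1200 bytes; the rest unused. */
	uint16_t sizes[SLOTS] = { 1000, 800, 1200 };
	uint16_t offset, size;

	if (find_fragment(sizes, 2, &offset, &size) == 0)
		printf("slot 2: offset %u, size %u\n", offset, size);	/* 1800, 1200 */
	return 0;
}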
Every canceled vio in the + * bin must have a canceler for which it is waiting, and any canceler will only have + * canceled one lock holder at a time. + */ + result = UDS_ALLOCATE_EXTENDED(struct packer_bin, + MAXIMUM_VDO_USER_VIOS / 2, + struct vio *, __func__, + &packer->canceled_bin); + if (result != VDO_SUCCESS) { + vdo_free_packer(packer); + return result; + } + + result = vdo_make_default_thread(vdo, packer->thread_id); + if (result != VDO_SUCCESS) { + vdo_free_packer(packer); + return result; + } + + *packer_ptr = packer; + return VDO_SUCCESS; +} + +/** + * vdo_free_packer() - Free a block packer. + * @packer: The packer to free. + */ +void vdo_free_packer(struct packer *packer) +{ + struct packer_bin *bin, *tmp; + + if (packer == NULL) + return; + + list_for_each_entry_safe(bin, tmp, &packer->bins, list) { + list_del_init(&bin->list); + UDS_FREE(bin); + } + + UDS_FREE(UDS_FORGET(packer->canceled_bin)); + UDS_FREE(packer); +} + +/** + * get_packer_from_data_vio() - Get the packer from a data_vio. + * @data_vio: The data_vio. + * + * Return: The packer from the VDO to which the data_vio belongs. + */ +static inline struct packer *get_packer_from_data_vio(struct data_vio *data_vio) +{ + return vdo_from_data_vio(data_vio)->packer; +} + +/** + * vdo_get_packer_statistics() - Get the current statistics from the packer. + * @packer: The packer to query. + * + * Return: a copy of the current statistics for the packer. + */ +struct packer_statistics vdo_get_packer_statistics(const struct packer *packer) +{ + const struct packer_statistics *stats = &packer->statistics; + + return (struct packer_statistics) { + .compressed_fragments_written = READ_ONCE(stats->compressed_fragments_written), + .compressed_blocks_written = READ_ONCE(stats->compressed_blocks_written), + .compressed_fragments_in_packer = READ_ONCE(stats->compressed_fragments_in_packer), + }; +} + +/** + * abort_packing() - Abort packing a data_vio. + * @data_vio: The data_vio to abort. + */ +static void abort_packing(struct data_vio *data_vio) +{ + struct packer *packer = get_packer_from_data_vio(data_vio); + + WRITE_ONCE(packer->statistics.compressed_fragments_in_packer, + packer->statistics.compressed_fragments_in_packer - 1); + + write_data_vio(data_vio); +} + +/** + * release_compressed_write_waiter() - Update a data_vio for which a successful compressed write + * has completed and send it on its way. + + * @data_vio: The data_vio to release. + * @allocation: The allocation to which the compressed block was written. + */ +static void +release_compressed_write_waiter(struct data_vio *data_vio, struct allocation *allocation) +{ + data_vio->new_mapped = (struct zoned_pbn) { + .pbn = allocation->pbn, + .zone = allocation->zone, + .state = data_vio->compression.slot + VDO_MAPPING_STATE_COMPRESSED_BASE, + }; + + vdo_share_compressed_write_lock(data_vio, allocation->lock); + update_metadata_for_data_vio_write(data_vio, allocation->lock); +} + +/** + * finish_compressed_write() - Finish a compressed block write. + * @completion: The compressed write completion. + * + * This callback is registered in continue_after_allocation(). + */ +static void finish_compressed_write(struct vdo_completion *completion) +{ + struct data_vio *agent = as_data_vio(completion); + struct data_vio *client, *next; + + assert_data_vio_in_allocated_zone(agent); + + /* + * Process all the non-agent waiters first to ensure that the pbn lock can not be released + * until all of them have had a chance to journal their increfs. 
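insert_in_sorted_list() earlier in this file keeps the bins ordered by ascending free space, which is what makes the best-fit policy from the patch description cheap: the first bin in that order with enough room is also the tightest fit. The bin-selection code itself is not in this hunk, so the following is only a plausible user-space sketch of that idea, not the target's implementation:

#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

struct bin {
	struct bin *next;	/* list kept in ascending order of free_space */
	uint32_t free_space;
};

/*
 * Best fit over a sorted list: the first bin that can hold the fragment is
 * also the one that will be left with the least slack.
 */
static struct bin *best_fit(struct bin *head, uint32_t fragment_size)
{
	struct bin *bin;

	for (bin = head; bin != NULL; bin = bin->next) {
		if (bin->free_space >= fragment_size)
			return bin;
	}
	return NULL;	/* nothing fits; a full bin would have to be written out */
}

int main(void)
{
	struct bin c = { NULL, 3000 };
	struct bin b = { &c, 1100 };
	struct bin a = { &b, 400 };

	/*
	 * A 1000-byte fragment skips the 400-byte bin and lands in the
	 * 1100-byte one, leaving only 100 bytes of slack.
	 */
	printf("chose bin with %u bytes free\n", best_fit(&a, 1000)->free_space);
	return 0;
}

When nothing fits, write_bin() below shows what eventually happens to a full bin: its fragments are packed into a single compressed block and written out.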
+ */ + for (client = agent->compression.next_in_batch; client != NULL; client = next) { + next = client->compression.next_in_batch; + release_compressed_write_waiter(client, &agent->allocation); + } + + completion->error_handler = handle_data_vio_error; + release_compressed_write_waiter(agent, &agent->allocation); +} + +static void handle_compressed_write_error(struct vdo_completion *completion) +{ + struct data_vio *agent = as_data_vio(completion); + struct allocation *allocation = &agent->allocation; + struct data_vio *client, *next; + + if (vdo_requeue_completion_if_needed(completion, allocation->zone->thread_id)) + return; + + update_vio_error_stats(as_vio(completion), + "Completing compressed write vio for physical block %llu with error", + (unsigned long long) allocation->pbn); + + for (client = agent->compression.next_in_batch; client != NULL; client = next) { + next = client->compression.next_in_batch; + write_data_vio(client); + } + + /* Now that we've released the batch from the packer, forget the error and continue on. */ + vdo_reset_completion(completion); + completion->error_handler = handle_data_vio_error; + write_data_vio(agent); +} + +/** + * add_to_bin() - Put a data_vio in a specific packer_bin in which it will definitely fit. + * @bin: The bin in which to put the data_vio. + * @data_vio: The data_vio to add. + */ +static void add_to_bin(struct packer_bin *bin, struct data_vio *data_vio) +{ + data_vio->compression.bin = bin; + data_vio->compression.slot = bin->slots_used; + bin->incoming[bin->slots_used++] = data_vio; +} + +/** + * remove_from_bin() - Get the next data_vio whose compression has not been canceled from a bin. + * @packer: The packer. + * @bin: The bin from which to get a data_vio. + * + * Any canceled data_vios will be moved to the canceled bin. + * Return: An uncanceled data_vio from the bin or NULL if there are none. + */ +static struct data_vio *remove_from_bin(struct packer *packer, struct packer_bin *bin) +{ + while (bin->slots_used > 0) { + struct data_vio *data_vio = bin->incoming[--bin->slots_used]; + + if (!advance_data_vio_compression_stage(data_vio).may_not_compress) { + data_vio->compression.bin = NULL; + return data_vio; + } + + add_to_bin(packer->canceled_bin, data_vio); + } + + /* The bin is now empty. */ + bin->free_space = VDO_COMPRESSED_BLOCK_DATA_SIZE; + return NULL; +} + +/** + * initialize_compressed_block() - Initialize a compressed block. + * @block: The compressed block to initialize. + * @size: The size of the agent's fragment. + * + * This method initializes the compressed block in the compressed write agent. Because the + * compressor already put the agent's compressed fragment at the start of the compressed block's + * data field, it needn't be copied. So all we need do is initialize the header and set the size of + * the agent's fragment. + */ +static void initialize_compressed_block(struct compressed_block *block, u16 size) +{ + /* + * Make sure the block layout isn't accidentally changed by changing the length of the + * block header. + */ + STATIC_ASSERT_SIZEOF(struct compressed_block_header, COMPRESSED_BLOCK_1_0_SIZE); + + block->header.version = vdo_pack_version_number(COMPRESSED_BLOCK_1_0); + block->header.sizes[0] = __cpu_to_le16(size); +} + +/** + * pack_fragment() - Pack a data_vio's fragment into the compressed block in which it is already + * known to fit. + * @compression: The agent's compression_state to pack in to. + * @data_vio: The data_vio to pack. 
+ * @offset: The offset into the compressed block at which to pack the fragment. + * @compressed_block: The compressed block which will be written out when batch is fully packed. + * + * Return: The new amount of space used. + */ +static block_size_t __must_check +pack_fragment(struct compression_state *compression, + struct data_vio *data_vio, + block_size_t offset, + slot_number_t slot, + struct compressed_block *block) +{ + struct compression_state *to_pack = &data_vio->compression; + char *fragment = to_pack->block->data; + + to_pack->next_in_batch = compression->next_in_batch; + compression->next_in_batch = data_vio; + to_pack->slot = slot; + block->header.sizes[slot] = __cpu_to_le16(to_pack->size); + memcpy(&block->data[offset], fragment, to_pack->size); + return (offset + to_pack->size); +} + +/** + * compressed_write_end_io() - The bio_end_io for a compressed block write. + * @bio: The bio for the compressed write. + */ +static void compressed_write_end_io(struct bio *bio) +{ + struct data_vio *data_vio = vio_as_data_vio(bio->bi_private); + + vdo_count_completed_bios(bio); + set_data_vio_allocated_zone_callback(data_vio, finish_compressed_write); + continue_data_vio_with_error(data_vio, blk_status_to_errno(bio->bi_status)); +} + +/** + * write_bin() - Write out a bin. + * @packer: The packer. + * @bin: The bin to write. + */ +static void write_bin(struct packer *packer, struct packer_bin *bin) +{ + int result; + block_size_t offset; + slot_number_t slot = 1; + struct compression_state *compression; + struct compressed_block *block; + struct data_vio *agent = remove_from_bin(packer, bin); + struct data_vio *client; + struct packer_statistics *stats; + + if (agent == NULL) + return; + + compression = &agent->compression; + compression->slot = 0; + block = compression->block; + initialize_compressed_block(block, compression->size); + offset = compression->size; + + while ((client = remove_from_bin(packer, bin)) != NULL) + offset = pack_fragment(compression, client, offset, slot++, block); + + /* + * If the batch contains only a single vio, then we save nothing by saving the compressed + * form. Continue processing the single vio in the batch. + */ + if (slot == 1) { + abort_packing(agent); + return; + } + + if (slot < VDO_MAX_COMPRESSION_SLOTS) + /* Clear out the sizes of the unused slots */ + memset(&block->header.sizes[slot], + 0, + (VDO_MAX_COMPRESSION_SLOTS - slot) * sizeof(__le16)); + + agent->vio.completion.error_handler = handle_compressed_write_error; + if (vdo_is_read_only(vdo_from_data_vio(agent))) { + continue_data_vio_with_error(agent, VDO_READ_ONLY); + return; + } + + result = vio_reset_bio(&agent->vio, + (char *) block, + compressed_write_end_io, + REQ_OP_WRITE, + agent->allocation.pbn); + if (result != VDO_SUCCESS) { + continue_data_vio_with_error(agent, result); + return; + } + + /* + * Once the compressed write is submitted, the fragments are no longer in the packer, so + * update stats now. + */ + stats = &packer->statistics; + WRITE_ONCE(stats->compressed_fragments_in_packer, + (stats->compressed_fragments_in_packer - slot)); + WRITE_ONCE(stats->compressed_fragments_written, + (stats->compressed_fragments_written + slot)); + WRITE_ONCE(stats->compressed_blocks_written, stats->compressed_blocks_written + 1); + + submit_data_vio_io(agent); +} + +/** + * add_data_vio_to_packer_bin() - Add a data_vio to a bin's incoming queue + * @packer: The packer. + * @bin: The bin to which to add the data_vio. + * @data_vio: The data_vio to add to the bin's queue. 
+ * + * Adds a data_vio to a bin's incoming queue, handles logical space change, and calls physical + * space processor. + */ +static void add_data_vio_to_packer_bin(struct packer *packer, + struct packer_bin *bin, + struct data_vio *data_vio) +{ + /* If the selected bin doesn't have room, start a new batch to make room. */ + if (bin->free_space < data_vio->compression.size) + write_bin(packer, bin); + + add_to_bin(bin, data_vio); + bin->free_space -= data_vio->compression.size; + + /* If we happen to exactly fill the bin, start a new batch. */ + if ((bin->slots_used == VDO_MAX_COMPRESSION_SLOTS) || + (bin->free_space == 0)) + write_bin(packer, bin); + + /* Now that we've finished changing the free space, restore the sort order. */ + insert_in_sorted_list(packer, bin); +} + +/** + * select_bin() - Select the bin that should be used to pack the compressed data in a data_vio with + * other data_vios. + * @packer: The packer. + * @data_vio: The data_vio. + */ +static struct packer_bin * __must_check +select_bin(struct packer *packer, struct data_vio *data_vio) +{ + /* + * First best fit: select the bin with the least free space that has enough room for the + * compressed data in the data_vio. + */ + struct packer_bin *bin, *fullest_bin; + + list_for_each_entry(bin, &packer->bins, list) + if (bin->free_space >= data_vio->compression.size) + return bin; + + /* + * None of the bins have enough space for the data_vio. We're not allowed to create new + * bins, so we have to overflow one of the existing bins. It's pretty intuitive to select + * the fullest bin, since that "wastes" the least amount of free space in the compressed + * block. But if the space currently used in the fullest bin is smaller than the compressed + * size of the incoming block, it seems wrong to force that bin to write when giving up on + * compressing the incoming data_vio would likewise "waste" the least amount of free space. + */ + fullest_bin = list_first_entry(&packer->bins, struct packer_bin, list); + if (data_vio->compression.size >= + (VDO_COMPRESSED_BLOCK_DATA_SIZE - fullest_bin->free_space)) + return NULL; + + /* + * The fullest bin doesn't have room, but writing it out and starting a new batch with the + * incoming data_vio will increase the packer's free space. + */ + return fullest_bin; +} + +/** + * vdo_attempt_packing() - Attempt to rewrite the data in this data_vio as part of a compressed + * block. + * @data_vio: The data_vio to pack. + */ +void vdo_attempt_packing(struct data_vio *data_vio) +{ + int result; + struct packer_bin *bin; + struct data_vio_compression_status status = get_data_vio_compression_status(data_vio); + struct packer *packer = get_packer_from_data_vio(data_vio); + + assert_on_packer_thread(packer, __func__); + + result = ASSERT((status.stage == DATA_VIO_COMPRESSING), + "attempt to pack data_vio not ready for packing, stage: %u", + status.stage); + if (result != VDO_SUCCESS) + return; + + /* + * Increment whether or not this data_vio will be packed or not since abort_packing() + * always decrements the counter. + */ + WRITE_ONCE(packer->statistics.compressed_fragments_in_packer, + packer->statistics.compressed_fragments_in_packer + 1); + + /* + * If packing of this data_vio is disallowed for administrative reasons, give up before + * making any state changes. 
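Because insert_in_sorted_list() keeps the bins ordered by ascending free space, the scan in select_bin() above is a best-fit search: the first bin with enough room is also the fullest bin that still fits the fragment. The same idea in a self-contained form, with a plain array standing in for the kernel's list_head ring:

#include <stddef.h>
#include <stdio.h>

struct toy_bin {
	size_t free_space;
};

/* Bins are ordered by ascending free space, so the first fit is also the best fit. */
static struct toy_bin *select_toy_bin(struct toy_bin *bins, size_t count, size_t fragment_size)
{
	size_t i;

	for (i = 0; i < count; i++) {
		if (bins[i].free_space >= fragment_size)
			return &bins[i];
	}

	/* No bin has room; the caller decides whether to force one to write out. */
	return NULL;
}

int main(void)
{
	struct toy_bin bins[] = { { 200 }, { 1024 }, { 4060 } };
	struct toy_bin *bin = select_toy_bin(bins, 3, 800);

	printf("chose bin with %zu bytes free\n", bin ? bin->free_space : (size_t) 0);
	return 0;
}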
+ */ + if (!vdo_is_state_normal(&packer->state) || + (data_vio->flush_generation < packer->flush_generation)) { + abort_packing(data_vio); + return; + } + + /* + * The call to advance_data_vio_compression_stage() here will advance the data_vio to the + * DATA_VIO_PACKING stage if the data_vio is still allowed to be packed (if it has already + * been canceled, we'll fall out here). Once the data_vio is in the DATA_VIO_PACKING stage, + * it must be guaranteed to be put in a bin before any more requests can be processed by + * the packer thread. Otherwise, a canceling data_vio could attempt to remove the canceled + * data_vio from the packer and fail to rendezvous with it (VDO-2809). We must also make + * sure that we will actually bin the data_vio and not give up on it as being larger than + * the space used in the fullest bin. Hence we must call select_bin() before calling + * advance_data_vio_compression_stage() (VDO-2826). + */ + bin = select_bin(packer, data_vio); + if ((bin == NULL) || + (advance_data_vio_compression_stage(data_vio).stage != DATA_VIO_PACKING)) { + abort_packing(data_vio); + return; + } + + add_data_vio_to_packer_bin(packer, bin, data_vio); +} + +/** + * check_for_drain_complete() - Check whether the packer has drained. + * @packer: The packer. + */ +static void check_for_drain_complete(struct packer *packer) +{ + if (vdo_is_state_draining(&packer->state) && (packer->canceled_bin->slots_used == 0)) + vdo_finish_draining(&packer->state); +} + +/** + * write_all_non_empty_bins() - Write out all non-empty bins on behalf of a flush or suspend. + * @packer: The packer being flushed. + */ +static void write_all_non_empty_bins(struct packer *packer) +{ + struct packer_bin *bin; + + list_for_each_entry(bin, &packer->bins, list) + write_bin(packer, bin); + /* + * We don't need to re-sort the bin here since this loop will make every bin have + * the same amount of free space, so every ordering is sorted. + */ + + check_for_drain_complete(packer); +} + +/** + * vdo_flush_packer() - Request that the packer flush asynchronously. + * @packer: The packer to flush. + * + * All bins with at least two compressed data blocks will be written out, and any solitary pending + * VIOs will be released from the packer. While flushing is in progress, any VIOs submitted to + * vdo_attempt_packing() will be continued immediately without attempting to pack them. + */ +void vdo_flush_packer(struct packer *packer) +{ + assert_on_packer_thread(packer, __func__); + if (vdo_is_state_normal(&packer->state)) + write_all_non_empty_bins(packer); +} + +/** + * vdo_remove_lock_holder_from_packer() - Remove a lock holder from the packer. + * @completion: The data_vio which needs a lock held by a data_vio in the packer. The data_vio's + * compression.lock_holder field will point to the data_vio to remove.
+ */ +void vdo_remove_lock_holder_from_packer(struct vdo_completion *completion) +{ + struct data_vio *data_vio = as_data_vio(completion); + struct packer *packer = get_packer_from_data_vio(data_vio); + struct data_vio *lock_holder; + struct packer_bin *bin; + slot_number_t slot; + + assert_data_vio_in_packer_zone(data_vio); + + lock_holder = UDS_FORGET(data_vio->compression.lock_holder); + bin = lock_holder->compression.bin; + ASSERT_LOG_ONLY((bin != NULL), "data_vio in packer has a bin"); + + slot = lock_holder->compression.slot; + bin->slots_used--; + if (slot < bin->slots_used) { + bin->incoming[slot] = bin->incoming[bin->slots_used]; + bin->incoming[slot]->compression.slot = slot; + } + + lock_holder->compression.bin = NULL; + lock_holder->compression.slot = 0; + + if (bin != packer->canceled_bin) { + bin->free_space += lock_holder->compression.size; + insert_in_sorted_list(packer, bin); + } + + abort_packing(lock_holder); + check_for_drain_complete(packer); +} + +/** + * vdo_increment_packer_flush_generation() - Increment the flush generation in the packer. + * @packer: The packer. + * + * This will also cause the packer to flush so that any VIOs from previous generations will exit + * the packer. + */ +void vdo_increment_packer_flush_generation(struct packer *packer) +{ + assert_on_packer_thread(packer, __func__); + packer->flush_generation++; + vdo_flush_packer(packer); +} + +/** + * initiate_drain() - Initiate a drain. + * + * Implements vdo_admin_initiator. + */ +static void initiate_drain(struct admin_state *state) +{ + struct packer *packer = container_of(state, struct packer, state); + + write_all_non_empty_bins(packer); +} + +/** + * vdo_drain_packer() - Drain the packer by preventing any more VIOs from entering the packer and + * then flushing. + * @packer: The packer to drain. + * @completion: The completion to finish when the packer has drained. + */ +void vdo_drain_packer(struct packer *packer, struct vdo_completion *completion) +{ + assert_on_packer_thread(packer, __func__); + vdo_start_draining(&packer->state, VDO_ADMIN_STATE_SUSPENDING, completion, initiate_drain); +} + +/** + * vdo_resume_packer() - Resume a packer which has been suspended. + * @packer: The packer to resume. + * @parent: The completion to finish when the packer has resumed. + */ +void vdo_resume_packer(struct packer *packer, struct vdo_completion *parent) +{ + assert_on_packer_thread(packer, __func__); + vdo_continue_completion(parent, vdo_resume_if_quiescent(&packer->state)); +} + +static void dump_packer_bin(const struct packer_bin *bin, bool canceled) +{ + if (bin->slots_used == 0) + /* Don't dump empty bins. */ + return; + + uds_log_info(" %sBin slots_used=%u free_space=%zu", + (canceled ? "Canceled" : ""), bin->slots_used, + bin->free_space); + + /* + * FIXME: dump vios in bin->incoming? The vios should have been dumped from the vio pool. + * Maybe just dump their addresses so it's clear they're here? + */ +} + +/** + * vdo_dump_packer() - Dump the packer. + * @packer: The packer. + * + * Context: dumps in a thread-unsafe fashion. 
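The bin bookkeeping in vdo_remove_lock_holder_from_packer() above is a swap-removal: the last entry in bin->incoming is moved into the vacated slot and its recorded slot number is patched, so the array stays dense without shifting every element. A standalone sketch of that pattern with hypothetical toy types:

#include <stdio.h>

#define TOY_SLOTS 8

struct toy_vio {
	unsigned int slot;	/* This vio's index within its bin. */
};

struct toy_bin {
	unsigned int slots_used;
	struct toy_vio *incoming[TOY_SLOTS];
};

/* Remove the entry at 'slot', keeping the array dense and every slot field accurate. */
static void swap_remove(struct toy_bin *bin, unsigned int slot)
{
	bin->slots_used--;
	if (slot < bin->slots_used) {
		bin->incoming[slot] = bin->incoming[bin->slots_used];
		bin->incoming[slot]->slot = slot;
	}
}

int main(void)
{
	struct toy_vio a = { 0 }, b = { 1 }, c = { 2 };
	struct toy_bin bin = { 3, { &a, &b, &c } };

	swap_remove(&bin, 0);	/* The last entry, c, takes over slot 0. */
	printf("slots_used=%u slot0=%u\n", bin.slots_used, bin.incoming[0]->slot);
	return 0;
}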
+ */ +void vdo_dump_packer(const struct packer *packer) +{ + struct packer_bin *bin; + + uds_log_info("packer"); + uds_log_info(" flushGeneration=%llu state %s packer_bin_count=%llu", + (unsigned long long) packer->flush_generation, + vdo_get_admin_state_code(&packer->state)->name, + (unsigned long long) packer->size); + + list_for_each_entry(bin, &packer->bins, list) + dump_packer_bin(bin, false); + + dump_packer_bin(packer->canceled_bin, true); +} diff --git a/drivers/md/dm-vdo/packer.h b/drivers/md/dm-vdo/packer.h new file mode 100644 index 000000000000..6bf6822398ef --- /dev/null +++ b/drivers/md/dm-vdo/packer.h @@ -0,0 +1,123 @@ +/* SPDX-License-Identifier: GPL-2.0-only */ +/* + * Copyright 2023 Red Hat + */ + +#ifndef VDO_PACKER_H +#define VDO_PACKER_H + +#include + +#include "admin-state.h" +#include "constants.h" +#include "encodings.h" +#include "statistics.h" +#include "types.h" +#include "wait-queue.h" + +enum { + DEFAULT_PACKER_BINS = 16, +}; + +/* The header of a compressed block. */ +struct compressed_block_header { + /* Unsigned 32-bit major and minor versions, little-endian */ + struct packed_version_number version; + + /* List of unsigned 16-bit compressed block sizes, little-endian */ + __le16 sizes[VDO_MAX_COMPRESSION_SLOTS]; +} __packed; + +enum { + VDO_COMPRESSED_BLOCK_DATA_SIZE = VDO_BLOCK_SIZE - sizeof(struct compressed_block_header), + + /* + * A compressed block is only written if we can pack at least two fragments into it, so a + * fragment which fills the entire data portion of a compressed block is too big. + */ + VDO_MAX_COMPRESSED_FRAGMENT_SIZE = VDO_COMPRESSED_BLOCK_DATA_SIZE - 1, +}; + +/* * The compressed block overlay. */ +struct compressed_block { + struct compressed_block_header header; + char data[VDO_COMPRESSED_BLOCK_DATA_SIZE]; +} __packed; + +/* + * Each packer_bin holds an incomplete batch of data_vios that only partially fill a compressed + * block. The bins are kept in a ring sorted by the amount of unused space so the first bin with + * enough space to hold a newly-compressed data_vio can easily be found. When the bin fills up or + * is flushed, the first uncanceled data_vio in the bin is selected to be the agent for that bin. + * Upon entering the packer, each data_vio already has its compressed data in the first slot of the + * data_vio's compressed_block (overlaid on the data_vio's scratch_block). So the agent's fragment + * is already in place. The fragments for the other uncanceled data_vios in the bin are packed into + * the agent's compressed block. The agent then writes out the compressed block. If the write is + * successful, the agent shares its pbn lock which each of the other data_vios in its compressed + * block and sends each on its way. Finally the agent itself continues on the write path as before. + * + * There is one special bin which is used to hold data_vios which have been canceled and removed + * from their bin by the packer. These data_vios need to wait for the canceller to rendezvous with + * them (VDO-2809) and so they sit in this special bin. 
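The size constants above follow directly from the header layout: the data area is whatever remains of a 4K block once the packed version number and the per-slot size array are accounted for. A standalone check of that arithmetic, assuming for illustration a 4096-byte block, a 14-slot size array, and an 8-byte packed version number (the authoritative definitions live in constants.h and encodings.h):

#include <stdint.h>
#include <stdio.h>

#define TOY_BLOCK_SIZE 4096
#define TOY_MAX_SLOTS 14	/* assumed stand-in for VDO_MAX_COMPRESSION_SLOTS */

struct toy_version_number {
	uint32_t major_version;
	uint32_t minor_version;
} __attribute__((packed));

struct toy_compressed_block_header {
	struct toy_version_number version;
	uint16_t sizes[TOY_MAX_SLOTS];
} __attribute__((packed));

int main(void)
{
	size_t header = sizeof(struct toy_compressed_block_header);

	/* 4096 - (8 + 14 * 2) leaves 4060 bytes of fragment data per compressed block. */
	printf("header=%zu data=%zu\n", header, (size_t) TOY_BLOCK_SIZE - header);
	return 0;
}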
+ */ +struct packer_bin { + /* List links for packer.packer_bins */ + struct list_head list; + /* The number of items in the bin */ + slot_number_t slots_used; + /* The number of compressed block bytes remaining in the current batch */ + size_t free_space; + /* The current partial batch of data_vios, waiting for more */ + struct data_vio *incoming[]; +}; + +struct packer { + /* The ID of the packer's callback thread */ + thread_id_t thread_id; + /* The number of bins */ + block_count_t size; + /* A list of all packer_bins, kept sorted by free_space */ + struct list_head bins; + /* + * A bin to hold data_vios which were canceled out of the packer and are waiting to + * rendezvous with the canceling data_vio. + */ + struct packer_bin *canceled_bin; + + /* The current flush generation */ + sequence_number_t flush_generation; + + /* The administrative state of the packer */ + struct admin_state state; + + /* Statistics are only updated on the packer thread, but are accessed from other threads */ + struct packer_statistics statistics; +}; + +int vdo_get_compressed_block_fragment(enum block_mapping_state mapping_state, + struct compressed_block *block, + u16 *fragment_offset, + u16 *fragment_size); + +int __must_check +vdo_make_packer(struct vdo *vdo, block_count_t bin_count, struct packer **packer_ptr); + +void vdo_free_packer(struct packer *packer); + +struct packer_statistics __must_check vdo_get_packer_statistics(const struct packer *packer); + +void vdo_attempt_packing(struct data_vio *data_vio); + +void vdo_flush_packer(struct packer *packer); + +void vdo_remove_lock_holder_from_packer(struct vdo_completion *completion); + +void vdo_increment_packer_flush_generation(struct packer *packer); + +void vdo_drain_packer(struct packer *packer, struct vdo_completion *completion); + +void vdo_resume_packer(struct packer *packer, struct vdo_completion *parent); + +void vdo_dump_packer(const struct packer *packer); + +#endif /* VDO_PACKER_H */ From patchwork Thu Oct 26 21:41:22 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Matthew Sakai X-Patchwork-Id: 13437961 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 70BAAC27C47 for ; Thu, 26 Oct 2023 21:44:34 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1344880AbjJZVoe (ORCPT ); Thu, 26 Oct 2023 17:44:34 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:41528 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1344830AbjJZVoW (ORCPT ); Thu, 26 Oct 2023 17:44:22 -0400 Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.133.124]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 845C2D4F for ; Thu, 26 Oct 2023 14:41:46 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1698356505; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=r6LRWmr2icS5jnOcuwPWBxMt8CIdZmzqcRONcHqSLZ4=; b=I29p6XONTRwReoWZ6W9JGMPQPf6hDWkTo1I1amG+tLUgdzAEqBNkvjOebHUHXAN5V2JtqG 2yzMEmlb1B5PPCjr8LWbFrr4W4nzQWtS4wu37kViBnjhXfXjxp3z6ljSGVcSaEFZsmSpiW 3+7wmdLgerbr6AcX9VPEp5cx2ZKDNJE= Received: from 
mimecast-mx02.redhat.com (mimecast-mx02.redhat.com [66.187.233.88]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.3, cipher=TLS_AES_256_GCM_SHA384) id us-mta-634-zWM1SzR6OvOKnkX2Il0A6Q-1; Thu, 26 Oct 2023 17:41:41 -0400 X-MC-Unique: zWM1SzR6OvOKnkX2Il0A6Q-1 Received: from smtp.corp.redhat.com (int-mx02.intmail.prod.int.rdu2.redhat.com [10.11.54.2]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (2048 bits) server-digest SHA256) (No client certificate requested) by mimecast-mx02.redhat.com (Postfix) with ESMTPS id 9575B1019C86; Thu, 26 Oct 2023 21:41:41 +0000 (UTC) Received: from pbitcolo-build-10.permabit.com (pbitcolo-build-10.permabit.lab.eng.bos.redhat.com [10.19.117.76]) by smtp.corp.redhat.com (Postfix) with ESMTP id 7241D40C6F79; Thu, 26 Oct 2023 21:41:41 +0000 (UTC) Received: by pbitcolo-build-10.permabit.com (Postfix, from userid 1138) id 174F33003C; Thu, 26 Oct 2023 17:41:41 -0400 (EDT) From: Matthew Sakai To: dm-devel@lists.linux.dev, linux-block@vger.kernel.org Cc: Matthew Sakai Subject: [PATCH v4 25/39] dm vdo: add slab structure, slab journal and reference counters Date: Thu, 26 Oct 2023 17:41:22 -0400 Message-Id: <20231026214136.1067410-26-msakai@redhat.com> In-Reply-To: <20231026214136.1067410-1-msakai@redhat.com> References: <20231026214136.1067410-1-msakai@redhat.com> MIME-Version: 1.0 X-Scanned-By: MIMEDefang 3.4.1 on 10.11.54.2 Precedence: bulk List-ID: X-Mailing-List: linux-block@vger.kernel.org Most of the vdo volume belongs to the slab depot. The depot contains a collection of slabs. The slabs can be up to 32GB, and are divided into three sections. Most of a slab consists of a linear sequence of 4K blocks. These blocks are used either to store data, or to hold portions of the block map (see subsequent patches). In addition to the data blocks, each slab has a set of reference counters, using 1 byte for each data block. Finally each slab has a journal. Reference updates are written to the slab journal, which is written out one block at a time as each block fills. A copy of the reference counters is kept in memory, and are written out a block at a time, in oldest-dirtied-order whenever there is a need to reclaim slab journal space. The journal is used both to ensure that the main recovery journal (see subsequent patches) can regularly free up space, and also to amortize the cost of updating individual reference blocks. This patch adds the slab structure as well as the slab journal and reference counters. Co-developed-by: J. corwin Coburn Signed-off-by: J. 
corwin Coburn Co-developed-by: Michael Sclafani Signed-off-by: Michael Sclafani Co-developed-by: Sweet Tea Dorminy Signed-off-by: Sweet Tea Dorminy Signed-off-by: Matthew Sakai --- drivers/md/dm-vdo/slab-depot.c | 2460 ++++++++++++++++++++++++++++++++ drivers/md/dm-vdo/slab-depot.h | 253 ++++ 2 files changed, 2713 insertions(+) create mode 100644 drivers/md/dm-vdo/slab-depot.c create mode 100644 drivers/md/dm-vdo/slab-depot.h diff --git a/drivers/md/dm-vdo/slab-depot.c b/drivers/md/dm-vdo/slab-depot.c new file mode 100644 index 000000000000..48adf297a394 --- /dev/null +++ b/drivers/md/dm-vdo/slab-depot.c @@ -0,0 +1,2460 @@ +// SPDX-License-Identifier: GPL-2.0-only +/* + * Copyright 2023 Red Hat + */ + +#include "slab-depot.h" + +#include +#include +#include +#include +#include + +#include "logger.h" +#include "memory-alloc.h" +#include "numeric.h" +#include "permassert.h" +#include "string-utils.h" + +#include "action-manager.h" +#include "admin-state.h" +#include "completion.h" +#include "constants.h" +#include "data-vio.h" +#include "encodings.h" +#include "io-submitter.h" +#include "physical-zone.h" +#include "priority-table.h" +#include "recovery-journal.h" +#include "repair.h" +#include "status-codes.h" +#include "types.h" +#include "vdo.h" +#include "vio.h" +#include "wait-queue.h" + +static const u64 BYTES_PER_WORD = sizeof(u64); +static const bool NORMAL_OPERATION = true; + +/** + * get_lock() - Get the lock object for a slab journal block by sequence number. + * @journal: vdo_slab journal to retrieve from. + * @sequence_number: Sequence number of the block. + * + * Return: The lock object for the given sequence number. + */ +static inline struct journal_lock * __must_check +get_lock(struct slab_journal *journal, sequence_number_t sequence_number) +{ + return &journal->locks[sequence_number % journal->size]; +} + +static bool is_slab_open(struct vdo_slab *slab) +{ + return (!vdo_is_state_quiescing(&slab->state) && !vdo_is_state_quiescent(&slab->state)); +} + +/** + * must_make_entries_to_flush() - Check whether there are entry waiters which should delay a flush. + * @journal: The journal to check. + * + * Return: true if there are no entry waiters, or if the slab is unrecovered. + */ +static inline bool __must_check must_make_entries_to_flush(struct slab_journal *journal) +{ + return ((journal->slab->status != VDO_SLAB_REBUILDING) && + vdo_has_waiters(&journal->entry_waiters)); +} + +/** + * is_reaping() - Check whether a reap is currently in progress. + * @journal: The journal which may be reaping. + * + * Return: true if the journal is reaping. + */ +static inline bool __must_check is_reaping(struct slab_journal *journal) +{ + return (journal->head != journal->unreapable); +} + +/** + * initialize_tail_block() - Initialize tail block as a new block. + * @journal: The journal whose tail block is being initialized. + */ +static void initialize_tail_block(struct slab_journal *journal) +{ + struct slab_journal_block_header *header = &journal->tail_header; + + header->sequence_number = journal->tail; + header->entry_count = 0; + header->has_block_map_increments = false; +} + +/** + * initialize_journal_state() - Set all journal fields appropriately to start journaling. + * @journal: The journal to be reset, based on its tail sequence number. 
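To put the sizes in the patch description above into perspective, a rough back-of-the-envelope for a maximal slab, assuming a 32 GB slab of 4K blocks with one byte of reference count per data block and ignoring the space taken by the slab's own metadata:

#include <stdio.h>

int main(void)
{
	unsigned long long slab_bytes = 32ULL << 30;		/* 32 GB slab */
	unsigned long long block_size = 4096;			/* 4K blocks */
	unsigned long long data_blocks = slab_bytes / block_size;
	unsigned long long refcount_bytes = data_blocks;	/* one byte per data block */
	unsigned long long refcount_blocks = refcount_bytes / block_size;

	/* Roughly 8M data blocks and 8 MB of counters, i.e. about 2048 reference blocks. */
	printf("data_blocks=%llu refcount_blocks=%llu\n", data_blocks, refcount_blocks);
	return 0;
}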
+ */ +static void initialize_journal_state(struct slab_journal *journal) +{ + journal->unreapable = journal->head; + journal->reap_lock = get_lock(journal, journal->unreapable); + journal->next_commit = journal->tail; + journal->summarized = journal->last_summarized = journal->tail; + initialize_tail_block(journal); +} + +/** + * block_is_full() - Check whether a journal block is full. + * @journal: The slab journal for the block. + * + * Return: true if the tail block is full. + */ +static bool __must_check block_is_full(struct slab_journal *journal) +{ + journal_entry_count_t count = journal->tail_header.entry_count; + + return (journal->tail_header.has_block_map_increments ? + (journal->full_entries_per_block == count) : + (journal->entries_per_block == count)); +} + +static void add_entries(struct slab_journal *journal); +static void update_tail_block_location(struct slab_journal *journal); +static void release_journal_locks(struct waiter *waiter, void *context); + +/** + * is_slab_journal_blank() - Check whether a slab's journal is blank. + * + * A slab journal is blank if it has never had any entries recorded in it. + * + * Return: true if the slab's journal has never been modified. + */ +static bool is_slab_journal_blank(const struct vdo_slab *slab) +{ + return ((slab->journal.tail == 1) && (slab->journal.tail_header.entry_count == 0)); +} + +/** + * mark_slab_journal_dirty() - Put a slab journal on the dirty ring of its allocator in the correct + * order. + * @journal: The journal to be marked dirty. + * @lock: The recovery journal lock held by the slab journal. + */ +static void mark_slab_journal_dirty(struct slab_journal *journal, sequence_number_t lock) +{ + struct slab_journal *dirty_journal; + struct list_head *dirty_list = &journal->slab->allocator->dirty_slab_journals; + + ASSERT_LOG_ONLY(journal->recovery_lock == 0, "slab journal was clean"); + + journal->recovery_lock = lock; + list_for_each_entry_reverse(dirty_journal, dirty_list, dirty_entry) { + if (dirty_journal->recovery_lock <= journal->recovery_lock) + break; + } + + list_move_tail(&journal->dirty_entry, dirty_journal->dirty_entry.next); +} + +static void mark_slab_journal_clean(struct slab_journal *journal) +{ + journal->recovery_lock = 0; + list_del_init(&journal->dirty_entry); +} + +static void check_if_slab_drained(struct vdo_slab *slab) +{ + bool read_only; + struct slab_journal *journal = &slab->journal; + const struct admin_state_code *code; + + if (!vdo_is_state_draining(&slab->state) || + must_make_entries_to_flush(journal) || + is_reaping(journal) || + journal->waiting_to_commit || + !list_empty(&journal->uncommitted_blocks) || + journal->updating_slab_summary || + (slab->active_count > 0)) + return; + + /* When not suspending or recovering, the slab must be clean. */ + code = vdo_get_admin_state_code(&slab->state); + read_only = vdo_is_read_only(slab->allocator->depot->vdo); + if (!read_only && + vdo_has_waiters(&slab->dirty_blocks) && + (code != VDO_ADMIN_STATE_SUSPENDING) && + (code != VDO_ADMIN_STATE_RECOVERING)) + return; + + vdo_finish_draining_with_result(&slab->state, (read_only ? VDO_READ_ONLY : VDO_SUCCESS)); +} + +/** + * finish_reaping() - Actually advance the head of the journal now that any necessary flushes are + * complete. + * @journal: The journal to be reaped. 
+ */ +static void finish_reaping(struct slab_journal *journal) +{ + journal->head = journal->unreapable; + add_entries(journal); + check_if_slab_drained(journal->slab); +} + +static void reap_slab_journal(struct slab_journal *journal); + +/** + * complete_reaping() - Finish reaping now that we have flushed the lower layer and then try + * reaping again in case we deferred reaping due to an outstanding vio. + * @completion: The flush vio. + */ +static void complete_reaping(struct vdo_completion *completion) +{ + struct slab_journal *journal = completion->parent; + + return_vio_to_pool(journal->slab->allocator->vio_pool, + vio_as_pooled_vio(as_vio(UDS_FORGET(completion)))); + finish_reaping(journal); + reap_slab_journal(journal); +} + +/** + * handle_flush_error() - Handle an error flushing the lower layer. + * @completion: The flush vio. + */ +static void handle_flush_error(struct vdo_completion *completion) +{ + vio_record_metadata_io_error(as_vio(completion)); + vdo_enter_read_only_mode(completion->vdo, completion->result); + complete_reaping(completion); +} + +static void flush_endio(struct bio *bio) +{ + struct vio *vio = bio->bi_private; + struct slab_journal *journal = vio->completion.parent; + + continue_vio_after_io(vio, + complete_reaping, + journal->slab->allocator->thread_id); +} + +/** + * flush_for_reaping() - A waiter callback for getting a vio with which to flush the lower layer + * prior to reaping. + * @waiter: The journal as a flush waiter. + * @context: The newly acquired flush vio. + */ +static void flush_for_reaping(struct waiter *waiter, void *context) +{ + struct slab_journal *journal = container_of(waiter, struct slab_journal, flush_waiter); + struct pooled_vio *pooled = context; + struct vio *vio = &pooled->vio; + + vio->completion.parent = journal; + submit_flush_vio(vio, flush_endio, handle_flush_error); +} + +/** + * reap_slab_journal() - Conduct a reap on a slab journal to reclaim unreferenced blocks. + * @journal: The slab journal. + */ +static void reap_slab_journal(struct slab_journal *journal) +{ + bool reaped = false; + + if (is_reaping(journal)) + /* We already have a reap in progress so wait for it to finish. */ + return; + + if ((journal->slab->status != VDO_SLAB_REBUILT) || + !vdo_is_state_normal(&journal->slab->state) || + vdo_is_read_only(journal->slab->allocator->depot->vdo)) + /* + * We must not reap in the first two cases, and there's no point in read-only mode. + */ + return; + + /* + * Start reclaiming blocks only when the journal head has no references. Then stop when a + * block is referenced or reap reaches the most recently written block, referenced by the + * slab summary, which has the sequence number just before the tail. + */ + while ((journal->unreapable < journal->tail) && (journal->reap_lock->count == 0)) { + reaped = true; + journal->unreapable++; + journal->reap_lock++; + if (journal->reap_lock == &journal->locks[journal->size]) + journal->reap_lock = &journal->locks[0]; + } + + if (!reaped) + return; + + /* + * It is never safe to reap a slab journal block without first issuing a flush, regardless + * of whether a user flush has been received or not. In the absence of the flush, the + * reference block write which released the locks allowing the slab journal to reap may not + * be persisted. Although slab summary writes will eventually issue flushes, multiple slab + * journal block writes can be issued while previous slab summary updates have not yet been + * made. 
Even though those slab journal block writes will be ignored if the slab summary + * update is not persisted, they may still overwrite the to-be-reaped slab journal block + * resulting in a loss of reference count updates (VDO-2912). + */ + journal->flush_waiter.callback = flush_for_reaping; + acquire_vio_from_pool(journal->slab->allocator->vio_pool, &journal->flush_waiter); +} + +/** + * adjust_slab_journal_block_reference() - Adjust the reference count for a slab journal block. + * @journal: The slab journal. + * @sequence_number: The journal sequence number of the referenced block. + * @adjustment: Amount to adjust the reference counter. + * + * Note that when the adjustment is negative, the slab journal will be reaped. + */ +static void +adjust_slab_journal_block_reference(struct slab_journal *journal, + sequence_number_t sequence_number, + int adjustment) +{ + struct journal_lock *lock; + + if (sequence_number == 0) + return; + + if (journal->slab->status == VDO_SLAB_REPLAYING) + /* Locks should not be used during offline replay. */ + return; + + ASSERT_LOG_ONLY((adjustment != 0), "adjustment must be non-zero"); + lock = get_lock(journal, sequence_number); + if (adjustment < 0) + ASSERT_LOG_ONLY((-adjustment <= lock->count), + "adjustment %d of lock count %u for slab journal block %llu must not underflow", + adjustment, + lock->count, + (unsigned long long) sequence_number); + + lock->count += adjustment; + if (lock->count == 0) + reap_slab_journal(journal); +} + +/** + * release_journal_locks() - Callback invoked after a slab summary update completes. + * @waiter: The slab summary waiter that has just been notified. + * @context: The result code of the update. + * + * Registered in the constructor on behalf of update_tail_block_location(). + * + * Implements waiter_callback. + */ +static void release_journal_locks(struct waiter *waiter, void *context) +{ + sequence_number_t first, i; + struct slab_journal *journal = + container_of(waiter, struct slab_journal, slab_summary_waiter); + int result = *((int *)context); + + if (result != VDO_SUCCESS) { + if (result != VDO_READ_ONLY) + /* + * Don't bother logging what might be lots of errors if we are already in + * read-only mode. + */ + uds_log_error_strerror(result, + "failed slab summary update %llu", + (unsigned long long) journal->summarized); + + journal->updating_slab_summary = false; + vdo_enter_read_only_mode(journal->slab->allocator->depot->vdo, result); + check_if_slab_drained(journal->slab); + return; + } + + if (journal->partial_write_in_progress && (journal->summarized == journal->tail)) { + journal->partial_write_in_progress = false; + add_entries(journal); + } + + first = journal->last_summarized; + journal->last_summarized = journal->summarized; + for (i = journal->summarized - 1; i >= first; i--) { + /* + * Release the lock the summarized block held on the recovery journal. (During + * replay, recovery_start will always be 0.) + */ + if (journal->recovery_journal != NULL) { + zone_count_t zone_number = journal->slab->allocator->zone_number; + struct journal_lock *lock = get_lock(journal, i); + + vdo_release_recovery_journal_block_reference(journal->recovery_journal, + lock->recovery_start, + VDO_ZONE_TYPE_PHYSICAL, + zone_number); + } + + /* + * Release our own lock against reaping for blocks that are committed. (This + * function will not change locks during replay.) 
+ */ + adjust_slab_journal_block_reference(journal, i, -1); + } + + journal->updating_slab_summary = false; + + reap_slab_journal(journal); + + /* Check if the slab summary needs to be updated again. */ + update_tail_block_location(journal); +} + +/** + * update_tail_block_location() - Update the tail block location in the slab summary, if necessary. + * @journal: The slab journal that is updating its tail block location. + */ +static void update_tail_block_location(struct slab_journal *journal) +{ + block_count_t free_block_count; + struct vdo_slab *slab = journal->slab; + + if (journal->updating_slab_summary || + vdo_is_read_only(journal->slab->allocator->depot->vdo) || + (journal->last_summarized >= journal->next_commit)) { + check_if_slab_drained(slab); + return; + } + + if (slab->status != VDO_SLAB_REBUILT) { + u8 hint = slab->allocator->summary_entries[slab->slab_number].fullness_hint; + + free_block_count = ((block_count_t) hint) << slab->allocator->depot->hint_shift; + } else { + free_block_count = slab->free_blocks; + } + + journal->summarized = journal->next_commit; + journal->updating_slab_summary = true; + + /* + * Update slab summary as dirty. + * vdo_slab journal can only reap past sequence number 1 when all the ref counts for this + * slab have been written to the layer. Therefore, indicate that the ref counts must be + * loaded when the journal head has reaped past sequence number 1. + */ + update_slab_summary_entry(slab, + &journal->slab_summary_waiter, + journal->summarized % journal->size, + (journal->head > 1), + false, + free_block_count); +} + +/** + * reopen_slab_journal() - Reopen a slab's journal by emptying it and then adding pending entries. + */ +static void reopen_slab_journal(struct vdo_slab *slab) +{ + struct slab_journal *journal = &slab->journal; + sequence_number_t block; + + ASSERT_LOG_ONLY(journal->tail_header.entry_count == 0, + "vdo_slab journal's active block empty before reopening"); + journal->head = journal->tail; + initialize_journal_state(journal); + + /* Ensure no locks are spuriously held on an empty journal. */ + for (block = 1; block <= journal->size; block++) + ASSERT_LOG_ONLY((get_lock(journal, block)->count == 0), + "Scrubbed journal's block %llu is not locked", + (unsigned long long) block); + + add_entries(journal); +} + +static sequence_number_t get_committing_sequence_number(const struct pooled_vio *vio) +{ + const struct packed_slab_journal_block *block = + (const struct packed_slab_journal_block *) vio->vio.data; + + return __le64_to_cpu(block->header.sequence_number); +} + +/** + * complete_write() - Handle post-commit processing. + * @completion: The write vio as a completion. + * + * This is the callback registered by write_slab_journal_block(). 
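When the slab is not in the VDO_SLAB_REBUILT state, the code above falls back to the approximate free-block count kept in the slab summary, which records only a small fullness hint that is scaled back up with the depot's hint_shift. A sketch of that round trip; the encoding direction and the example shift value are assumptions for illustration, not definitions taken from this patch:

#include <stdio.h>

typedef unsigned long long block_count_t;

/* Decode: the summary keeps free space only to hint granularity. */
static block_count_t hint_to_free_blocks(unsigned char hint, unsigned int hint_shift)
{
	return ((block_count_t) hint) << hint_shift;
}

/* Encode (assumed inverse): drop the low bits the hint cannot represent. */
static unsigned char free_blocks_to_hint(block_count_t free_blocks, unsigned int hint_shift)
{
	return (unsigned char) (free_blocks >> hint_shift);
}

int main(void)
{
	unsigned int hint_shift = 15;	/* illustrative value only */
	block_count_t free_blocks = 1000000;
	unsigned char hint = free_blocks_to_hint(free_blocks, hint_shift);

	/* The round trip is lossy: 1000000 -> hint 30 -> approximately 983040. */
	printf("hint=%u approx_free=%llu\n", (unsigned int) hint,
	       hint_to_free_blocks(hint, hint_shift));
	return 0;
}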
+ */ +static void complete_write(struct vdo_completion *completion) +{ + int result = completion->result; + struct pooled_vio *pooled = vio_as_pooled_vio(as_vio(completion)); + struct slab_journal *journal = completion->parent; + sequence_number_t committed = get_committing_sequence_number(pooled); + + list_del_init(&pooled->list_entry); + return_vio_to_pool(journal->slab->allocator->vio_pool, UDS_FORGET(pooled)); + + if (result != VDO_SUCCESS) { + vio_record_metadata_io_error(as_vio(completion)); + uds_log_error_strerror(result, + "cannot write slab journal block %llu", + (unsigned long long) committed); + vdo_enter_read_only_mode(journal->slab->allocator->depot->vdo, result); + check_if_slab_drained(journal->slab); + return; + } + + WRITE_ONCE(journal->events->blocks_written, journal->events->blocks_written + 1); + + if (list_empty(&journal->uncommitted_blocks)) { + /* If no blocks are outstanding, then the commit point is at the tail. */ + journal->next_commit = journal->tail; + } else { + /* The commit point is always the beginning of the oldest incomplete block. */ + pooled = container_of(journal->uncommitted_blocks.next, + struct pooled_vio, + list_entry); + journal->next_commit = get_committing_sequence_number(pooled); + } + + update_tail_block_location(journal); +} + +static void write_slab_journal_endio(struct bio *bio) +{ + struct vio *vio = bio->bi_private; + struct slab_journal *journal = vio->completion.parent; + + continue_vio_after_io(vio, complete_write, journal->slab->allocator->thread_id); +} + +/** + * write_slab_journal_block() - Write a slab journal block. + * @waiter: The vio pool waiter which was just notified. + * @context: The vio pool entry for the write. + * + * Callback from acquire_vio_from_pool() registered in commit_tail(). + */ +static void write_slab_journal_block(struct waiter *waiter, void *context) +{ + struct pooled_vio *pooled = context; + struct vio *vio = &pooled->vio; + struct slab_journal *journal = container_of(waiter, struct slab_journal, resource_waiter); + struct slab_journal_block_header *header = &journal->tail_header; + int unused_entries = journal->entries_per_block - header->entry_count; + physical_block_number_t block_number; + const struct admin_state_code *operation; + + header->head = journal->head; + list_add_tail(&pooled->list_entry, &journal->uncommitted_blocks); + vdo_pack_slab_journal_block_header(header, &journal->block->header); + + /* Copy the tail block into the vio. */ + memcpy(pooled->vio.data, journal->block, VDO_BLOCK_SIZE); + + ASSERT_LOG_ONLY(unused_entries >= 0, "vdo_slab journal block is not overfull"); + if (unused_entries > 0) { + /* + * Release the per-entry locks for any unused entries in the block we are about to + * write. + */ + adjust_slab_journal_block_reference(journal, + header->sequence_number, + -unused_entries); + journal->partial_write_in_progress = !block_is_full(journal); + } + + block_number = journal->slab->journal_origin + (header->sequence_number % journal->size); + vio->completion.parent = journal; + + /* + * This block won't be read in recovery until the slab summary is updated to refer to it. + * The slab summary update does a flush which is sufficient to protect us from VDO-2331. + */ + submit_metadata_vio(UDS_FORGET(vio), + block_number, + write_slab_journal_endio, + complete_write, + REQ_OP_WRITE); + + /* Since the write is submitted, the tail block structure can be reused. 
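As with the lock array returned by get_lock() earlier in this file, the on-disk home of a slab journal block is chosen modulo the journal size, so the journal occupies a fixed ring of blocks starting at the slab's journal_origin and reuses slots as the sequence number grows. A minimal illustration of the placement computed in write_slab_journal_block(), with toy values:

#include <stdio.h>

typedef unsigned long long sequence_number_t;
typedef unsigned long long physical_block_number_t;

/* Sequence numbers grow without bound; the on-disk slots are reused modulo the ring size. */
static physical_block_number_t journal_block_pbn(physical_block_number_t journal_origin,
						 sequence_number_t sequence_number,
						 unsigned int journal_size)
{
	return journal_origin + (sequence_number % journal_size);
}

int main(void)
{
	/* With an 8-block ring at origin 1000, sequence numbers 1 and 9 share a slot. */
	printf("seq 1 -> %llu, seq 9 -> %llu\n",
	       journal_block_pbn(1000, 1, 8),
	       journal_block_pbn(1000, 9, 8));
	return 0;
}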
*/ + journal->tail++; + initialize_tail_block(journal); + journal->waiting_to_commit = false; + + operation = vdo_get_admin_state_code(&journal->slab->state); + if (operation == VDO_ADMIN_STATE_WAITING_FOR_RECOVERY) { + vdo_finish_operation(&journal->slab->state, + (vdo_is_read_only(journal->slab->allocator->depot->vdo) ? + VDO_READ_ONLY : + VDO_SUCCESS)); + return; + } + + add_entries(journal); +} + +/** + * commit_tail() - Commit the tail block of the slab journal. + * @journal: The journal whose tail block should be committed. + */ +static void commit_tail(struct slab_journal *journal) +{ + if ((journal->tail_header.entry_count == 0) && must_make_entries_to_flush(journal)) + /* + * There are no entries at the moment, but there are some waiters, so defer + * initiating the flush until those entries are ready to write. + */ + return; + + if (vdo_is_read_only(journal->slab->allocator->depot->vdo) || + journal->waiting_to_commit || + (journal->tail_header.entry_count == 0)) + /* + * There is nothing to do since the tail block is empty, or writing, or the journal + * is in read-only mode. + */ + return; + + /* + * Since we are about to commit the tail block, this journal no longer needs to be on the + * ring of journals which the recovery journal might ask to commit. + */ + mark_slab_journal_clean(journal); + + journal->waiting_to_commit = true; + + journal->resource_waiter.callback = write_slab_journal_block; + acquire_vio_from_pool(journal->slab->allocator->vio_pool, &journal->resource_waiter); +} + +/** + * encode_slab_journal_entry() - Encode a slab journal entry. + * @tail_header: The unpacked header for the block. + * @payload: The journal block payload to hold the entry. + * @sbn: The slab block number of the entry to encode. + * @operation: The type of the entry. + * @increment: True if this is an increment. + * + * Exposed for unit tests. + */ +static void +encode_slab_journal_entry(struct slab_journal_block_header *tail_header, + slab_journal_payload *payload, + slab_block_number sbn, + enum journal_operation operation, + bool increment) +{ + journal_entry_count_t entry_number = tail_header->entry_count++; + + if (operation == VDO_JOURNAL_BLOCK_MAP_REMAPPING) { + if (!tail_header->has_block_map_increments) { + memset(payload->full_entries.entry_types, + 0, + VDO_SLAB_JOURNAL_ENTRY_TYPES_SIZE); + tail_header->has_block_map_increments = true; + } + + payload->full_entries.entry_types[entry_number / 8] |= + ((u8)1 << (entry_number % 8)); + } + + vdo_pack_slab_journal_entry(&payload->entries[entry_number], sbn, increment); +} + +/** + * expand_journal_point() - Convert a recovery journal journal_point which refers to both an + * increment and a decrement to a single point which refers to one or the + * other. + * @recovery_point: The journal point to convert. + * @increment: Whether the current entry is an increment. + * + * Return: The expanded journal point + * + * Because each data_vio has but a single recovery journal point, but may need to make both + * increment and decrement entries in the same slab journal. In order to distinguish the two + * entries, the entry count of the expanded journal point is twice the actual recovery journal + * entry count for increments, and one more than that for decrements. 
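Concretely, the expansion described above maps a recovery journal point (sequence, n) to (sequence, 2n) for the increment entry and (sequence, 2n + 1) for the decrement, so the two slab journal entries made on behalf of one data_vio stay distinct and correctly ordered. A tiny standalone version of the mapping:

#include <stdbool.h>
#include <stdio.h>

struct toy_journal_point {
	unsigned long long sequence_number;
	unsigned int entry_count;
};

/* Split one recovery journal point into distinct increment and decrement points. */
static struct toy_journal_point expand(struct toy_journal_point point, bool increment)
{
	point.entry_count *= 2;
	if (!increment)
		point.entry_count++;

	return point;
}

int main(void)
{
	struct toy_journal_point p = { 42, 7 };
	struct toy_journal_point inc = expand(p, true);
	struct toy_journal_point dec = expand(p, false);

	/* Entry 7 of recovery block 42 becomes slab journal points 42.14 and 42.15. */
	printf("inc=%llu.%u dec=%llu.%u\n",
	       inc.sequence_number, inc.entry_count,
	       dec.sequence_number, dec.entry_count);
	return 0;
}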
+ */ +static struct journal_point +expand_journal_point(struct journal_point recovery_point, bool increment) +{ + recovery_point.entry_count *= 2; + if (!increment) + recovery_point.entry_count++; + + return recovery_point; +} + +/** + * add_entry() - Actually add an entry to the slab journal, potentially firing off a write if a + * block becomes full. + * @journal: The slab journal to append to. + * @pbn: The pbn being adjusted. + * @operation: The type of entry to make. + * @increment: True if this is an increment. + * @recovery_point: The expanded recovery point. + * + * This function is synchronous. + */ +static void add_entry(struct slab_journal *journal, + physical_block_number_t pbn, + enum journal_operation operation, + bool increment, + struct journal_point recovery_point) +{ + struct packed_slab_journal_block *block = journal->block; + int result; + + result = ASSERT(vdo_before_journal_point(&journal->tail_header.recovery_point, + &recovery_point), + "recovery journal point is monotonically increasing, recovery point: %llu.%u, block recovery point: %llu.%u", + (unsigned long long) recovery_point.sequence_number, + recovery_point.entry_count, + (unsigned long long) journal->tail_header.recovery_point.sequence_number, + journal->tail_header.recovery_point.entry_count); + if (result != VDO_SUCCESS) { + vdo_enter_read_only_mode(journal->slab->allocator->depot->vdo, result); + return; + } + + if (operation == VDO_JOURNAL_BLOCK_MAP_REMAPPING) { + result = ASSERT((journal->tail_header.entry_count < + journal->full_entries_per_block), + "block has room for full entries"); + if (result != VDO_SUCCESS) { + vdo_enter_read_only_mode(journal->slab->allocator->depot->vdo, result); + return; + } + } + + encode_slab_journal_entry(&journal->tail_header, + &block->payload, + pbn - journal->slab->start, + operation, + increment); + journal->tail_header.recovery_point = recovery_point; + if (block_is_full(journal)) + commit_tail(journal); +} + +static inline block_count_t journal_length(const struct slab_journal *journal) +{ + return journal->tail - journal->head; +} + +/** + * vdo_attempt_replay_into_slab() - Replay a recovery journal entry into a slab's journal. + * @slab: The slab to play into. + * @pbn: The PBN for the entry. + * @operation: The type of entry to add. + * @increment: True if this entry is an increment. + * @recovery_point: The recovery journal point corresponding to this entry. + * @parent: The completion to notify when there is space to add the entry if the entry could not be + * added immediately. + * + * Return: true if the entry was added immediately. + */ +bool vdo_attempt_replay_into_slab(struct vdo_slab *slab, + physical_block_number_t pbn, + enum journal_operation operation, + bool increment, + struct journal_point *recovery_point, + struct vdo_completion *parent) +{ + struct slab_journal *journal = &slab->journal; + struct slab_journal_block_header *header = &journal->tail_header; + struct journal_point expanded = expand_journal_point(*recovery_point, increment); + + /* Only accept entries after the current recovery point. */ + if (!vdo_before_journal_point(&journal->tail_header.recovery_point, &expanded)) + return true; + + if ((header->entry_count >= journal->full_entries_per_block) && + (header->has_block_map_increments || (operation == VDO_JOURNAL_BLOCK_MAP_REMAPPING))) + /* + * The tail block does not have room for the entry we are attempting to add so + * commit the tail block now. 
+ */ + commit_tail(journal); + + if (journal->waiting_to_commit) { + vdo_start_operation_with_waiter(&journal->slab->state, + VDO_ADMIN_STATE_WAITING_FOR_RECOVERY, + parent, + NULL); + return false; + } + + if (journal_length(journal) >= journal->size) { + /* + * We must have reaped the current head before the crash, since the blocked + * threshold keeps us from having more entries than fit in a slab journal; hence we + * can just advance the head (and unreapable block), as needed. + */ + journal->head++; + journal->unreapable++; + } + + if (journal->slab->status == VDO_SLAB_REBUILT) + journal->slab->status = VDO_SLAB_REPLAYING; + + add_entry(journal, pbn, operation, increment, expanded); + return true; +} + +/** + * requires_reaping() - Check whether the journal must be reaped before adding new entries. + * @journal: The journal to check. + * + * Return: true if the journal must be reaped. + */ +static bool requires_reaping(const struct slab_journal *journal) +{ + return (journal_length(journal) >= journal->blocking_threshold); +} + +/** finish_summary_update() - A waiter callback that resets the writing state of a slab. */ +static void finish_summary_update(struct waiter *waiter, void *context) +{ + struct vdo_slab *slab = container_of(waiter, struct vdo_slab, summary_waiter); + int result = *((int *) context); + + slab->active_count--; + + if ((result != VDO_SUCCESS) && (result != VDO_READ_ONLY)) { + uds_log_error_strerror(result, "failed to update slab summary"); + vdo_enter_read_only_mode(slab->allocator->depot->vdo, result); + } + + check_if_slab_drained(slab); +} + +static void write_reference_block(struct waiter *waiter, void *context); + +/** + * launch_reference_block_write() - Launch the write of a dirty reference block by first acquiring + * a VIO for it from the pool. + * @waiter: The waiter of the block which is starting to write. + * @context: The parent slab of the block. + * + * This can be asynchronous since the writer will have to wait if all VIOs in the pool are + * currently in use. + */ +static void launch_reference_block_write(struct waiter *waiter, void *context) +{ + struct vdo_slab *slab = context; + + if (vdo_is_read_only(slab->allocator->depot->vdo)) + return; + + slab->active_count++; + container_of(waiter, struct reference_block, waiter)->is_writing = true; + waiter->callback = write_reference_block; + acquire_vio_from_pool(slab->allocator->vio_pool, waiter); +} + +static void save_dirty_reference_blocks(struct vdo_slab *slab) +{ + vdo_notify_all_waiters(&slab->dirty_blocks, launch_reference_block_write, slab); + check_if_slab_drained(slab); +} + +/** + * finish_reference_block_write() - After a reference block has written, clean it, release its + * locks, and return its VIO to the pool. + * @completion: The VIO that just finished writing. + */ +static void finish_reference_block_write(struct vdo_completion *completion) +{ + struct vio *vio = as_vio(completion); + struct pooled_vio *pooled = vio_as_pooled_vio(vio); + struct reference_block *block = completion->parent; + struct vdo_slab *slab = block->slab; + tail_block_offset_t offset; + + slab->active_count--; + + /* Release the slab journal lock. */ + adjust_slab_journal_block_reference(&slab->journal, + block->slab_journal_lock_to_release, + -1); + return_vio_to_pool(slab->allocator->vio_pool, pooled); + + /* + * We can't clear the is_writing flag earlier as releasing the slab journal lock may cause + * us to be dirtied again, but we don't want to double enqueue. 
+ */ + block->is_writing = false; + + if (vdo_is_read_only(completion->vdo)) { + check_if_slab_drained(slab); + return; + } + + /* Re-queue the block if it was re-dirtied while it was writing. */ + if (block->is_dirty) { + vdo_enqueue_waiter(&block->slab->dirty_blocks, &block->waiter); + if (vdo_is_state_draining(&slab->state)) + /* We must be saving, and this block will otherwise not be relaunched. */ + save_dirty_reference_blocks(slab); + + return; + } + + /* + * Mark the slab as clean in the slab summary if there are no dirty or writing blocks + * and no summary update in progress. + */ + if ((slab->active_count > 0) || vdo_has_waiters(&slab->dirty_blocks)) { + check_if_slab_drained(slab); + return; + } + + offset = slab->allocator->summary_entries[slab->slab_number].tail_block_offset; + slab->active_count++; + slab->summary_waiter.callback = finish_summary_update; + update_slab_summary_entry(slab, + &slab->summary_waiter, + offset, + true, + true, + slab->free_blocks); +} + +/** + * get_reference_counters_for_block() - Find the reference counters for a given block. + * @block: The reference_block in question. + * + * Return: A pointer to the reference counters for this block. + */ +static vdo_refcount_t * __must_check +get_reference_counters_for_block(struct reference_block *block) +{ + size_t block_index = block - block->slab->reference_blocks; + + return &block->slab->counters[block_index * COUNTS_PER_BLOCK]; +} + +/** + * pack_reference_block() - Copy data from a reference block to a buffer ready to be written out. + * @block: The block to copy. + * @buffer: The char buffer to fill with the packed block. + */ +static void pack_reference_block(struct reference_block *block, void *buffer) +{ + struct packed_reference_block *packed = buffer; + vdo_refcount_t *counters = get_reference_counters_for_block(block); + sector_count_t i; + struct packed_journal_point commit_point; + + vdo_pack_journal_point(&block->slab->slab_journal_point, &commit_point); + + for (i = 0; i < VDO_SECTORS_PER_BLOCK; i++) { + packed->sectors[i].commit_point = commit_point; + memcpy(packed->sectors[i].counts, + counters + (i * COUNTS_PER_SECTOR), + (sizeof(vdo_refcount_t) * COUNTS_PER_SECTOR)); + } +} + +static void write_reference_block_endio(struct bio *bio) +{ + struct vio *vio = bio->bi_private; + struct reference_block *block = vio->completion.parent; + thread_id_t thread_id = block->slab->allocator->thread_id; + + continue_vio_after_io(vio, finish_reference_block_write, thread_id); +} + +/** + * handle_io_error() - Handle an I/O error reading or writing a reference count block. + * @completion: The VIO doing the I/O as a completion. + */ +static void handle_io_error(struct vdo_completion *completion) +{ + int result = completion->result; + struct vio *vio = as_vio(completion); + struct vdo_slab *slab = ((struct reference_block *) completion->parent)->slab; + + vio_record_metadata_io_error(vio); + return_vio_to_pool(slab->allocator->vio_pool, vio_as_pooled_vio(vio)); + slab->active_count--; + vdo_enter_read_only_mode(slab->allocator->depot->vdo, result); + check_if_slab_drained(slab); +} + +/** + * write_reference_block() - After a dirty block waiter has gotten a VIO from the VIO pool, copy + * its counters and associated data into the VIO, and launch the write. + * @waiter: The waiter of the dirty block. + * @context: The VIO returned by the pool. 
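Finding the counter for a data block is plain index arithmetic: the slab-relative block number selects a reference_block holding COUNTS_PER_BLOCK one-byte counters, and pack_reference_block() above lays those counters out as per-sector runs of COUNTS_PER_SECTOR bytes behind a packed commit point. A standalone sketch of the addressing; the 504 and 4032 values are assumptions based on 512-byte sectors and an 8-byte packed journal point, not definitions from this patch:

#include <stdio.h>

#define TOY_COUNTS_PER_SECTOR 504	/* assumed: 512-byte sector minus an 8-byte commit point */
#define TOY_SECTORS_PER_BLOCK 8
#define TOY_COUNTS_PER_BLOCK (TOY_COUNTS_PER_SECTOR * TOY_SECTORS_PER_BLOCK)

int main(void)
{
	unsigned long long slab_block_number = 10000;
	unsigned long long ref_block = slab_block_number / TOY_COUNTS_PER_BLOCK;
	unsigned long long within_block = slab_block_number % TOY_COUNTS_PER_BLOCK;
	unsigned long long sector = within_block / TOY_COUNTS_PER_SECTOR;

	/* The counter for data block 10000 lands in reference block 2, sector 3. */
	printf("ref_block=%llu sector=%llu\n", ref_block, sector);
	return 0;
}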
+ */ +static void write_reference_block(struct waiter *waiter, void *context) +{ + size_t block_offset; + physical_block_number_t pbn; + struct pooled_vio *pooled = context; + struct vdo_completion *completion = &pooled->vio.completion; + struct reference_block *block = container_of(waiter, struct reference_block, waiter); + + pack_reference_block(block, pooled->vio.data); + block_offset = (block - block->slab->reference_blocks); + pbn = (block->slab->ref_counts_origin + block_offset); + block->slab_journal_lock_to_release = block->slab_journal_lock; + completion->parent = block; + + /* + * Mark the block as clean, since we won't be committing any updates that happen after this + * moment. As long as VIO order is preserved, two VIOs updating this block at once will not + * cause complications. + */ + block->is_dirty = false; + + /* + * Flush before writing to ensure that the recovery journal and slab journal entries which + * cover this reference update are stable (VDO-2331). + */ + WRITE_ONCE(block->slab->allocator->ref_counts_statistics.blocks_written, + block->slab->allocator->ref_counts_statistics.blocks_written + 1); + + completion->callback_thread_id = ((struct block_allocator *) pooled->context)->thread_id; + submit_metadata_vio(&pooled->vio, + pbn, + write_reference_block_endio, + handle_io_error, + REQ_OP_WRITE | REQ_PREFLUSH); +} + +static void reclaim_journal_space(struct slab_journal *journal) +{ + block_count_t length = journal_length(journal); + struct vdo_slab *slab = journal->slab; + block_count_t write_count = vdo_count_waiters(&slab->dirty_blocks); + block_count_t written; + + if ((length < journal->flushing_threshold) || (write_count == 0)) + return; + + /* The slab journal is over the first threshold, schedule some reference block writes. */ + WRITE_ONCE(journal->events->flush_count, journal->events->flush_count + 1); + if (length < journal->flushing_deadline) + /* Schedule more writes the closer to the deadline we get. */ + write_count = max_t(block_count_t, + write_count / (journal->flushing_deadline - length + 1), + 1); + + for (written = 0; written < write_count; written++) + vdo_notify_next_waiter(&slab->dirty_blocks, launch_reference_block_write, slab); +} + +/** + * reference_count_to_status() - Convert a reference count to a reference status. + * @count: The count to convert. + * + * Return: The appropriate reference status. + */ +static enum reference_status __must_check +reference_count_to_status(vdo_refcount_t count) +{ + if (count == EMPTY_REFERENCE_COUNT) + return RS_FREE; + else if (count == 1) + return RS_SINGLE; + else if (count == PROVISIONAL_REFERENCE_COUNT) + return RS_PROVISIONAL; + else + return RS_SHARED; +} + +/** + * dirty_block() - Mark a reference count block as dirty, potentially adding it to the dirty queue + * if it wasn't already dirty. + * @block: The reference block to mark as dirty. + */ +static void dirty_block(struct reference_block *block) +{ + if (block->is_dirty) + return; + + block->is_dirty = true; + if (!block->is_writing) + vdo_enqueue_waiter(&block->slab->dirty_blocks, &block->waiter); +} + +/** + * get_reference_block() - Get the reference block that covers the given block index. + */ +static struct reference_block * __must_check +get_reference_block(struct vdo_slab *slab, slab_block_number index) +{ + return &slab->reference_blocks[index / COUNTS_PER_BLOCK]; +} + +/** + * slab_block_number_from_pbn() - Determine the index within the slab of a particular physical + * block number. + * @slab: The slab. 
+ * @physical_block_number: The physical block number. + * @slab_block_number_ptr: A pointer to the slab block number. + * + * Return: VDO_SUCCESS or an error code. + */ +static int __must_check +slab_block_number_from_pbn(struct vdo_slab *slab, + physical_block_number_t physical_block_number, + slab_block_number *slab_block_number_ptr) +{ + u64 slab_block_number; + + if (physical_block_number < slab->start) + return VDO_OUT_OF_RANGE; + + slab_block_number = physical_block_number - slab->start; + if (slab_block_number >= slab->allocator->depot->slab_config.data_blocks) + return VDO_OUT_OF_RANGE; + + *slab_block_number_ptr = slab_block_number; + return VDO_SUCCESS; +} + +/** + * get_reference_counter() - Get the reference counter that covers the given physical block number. + * @slab: The slab to query. + * @pbn: The physical block number. + * @counter_ptr: A pointer to the reference counter. + */ +static int __must_check +get_reference_counter(struct vdo_slab *slab, + physical_block_number_t pbn, + vdo_refcount_t **counter_ptr) +{ + slab_block_number index; + int result = slab_block_number_from_pbn(slab, pbn, &index); + + if (result != VDO_SUCCESS) + return result; + + *counter_ptr = &slab->counters[index]; + + return VDO_SUCCESS; +} + +static unsigned int calculate_slab_priority(struct vdo_slab *slab) +{ + block_count_t free_blocks = slab->free_blocks; + unsigned int unopened_slab_priority = slab->allocator->unopened_slab_priority; + unsigned int priority; + + /* + * Wholly full slabs must be the only ones with lowest priority, 0. + * + * Slabs that have never been opened (empty, newly initialized, and never been written to) + * have lower priority than previously opened slabs that have a significant number of free + * blocks. This ranking causes VDO to avoid writing physical blocks for the first time + * unless there are very few free blocks that have been previously written to. + * + * Since VDO doesn't discard blocks currently, reusing previously written blocks makes VDO + * a better client of any underlying storage that is thinly-provisioned (though discarding + * would be better). + * + * For all other slabs, the priority is derived from the logarithm of the number of free + * blocks. Slabs with the same order of magnitude of free blocks have the same priority. + * With 2^23 blocks, the priority will range from 1 to 25. The reserved + * unopened_slab_priority divides the range and is skipped by the logarithmic mapping. + */ + + if (free_blocks == 0) + return 0; + + if (is_slab_journal_blank(slab)) + return unopened_slab_priority; + + priority = (1 + ilog2(free_blocks)); + return ((priority < unopened_slab_priority) ? priority : priority + 1); +} + +/* + * Slabs are essentially prioritized by an approximation of the number of free blocks in the slab + * so slabs with lots of free blocks with be opened for allocation before slabs that have few free + * blocks. + */ +static void prioritize_slab(struct vdo_slab *slab) +{ + ASSERT_LOG_ONLY(list_empty(&slab->allocq_entry), + "a slab must not already be on a ring when prioritizing"); + slab->priority = calculate_slab_priority(slab); + vdo_priority_table_enqueue(slab->allocator->prioritized_slabs, + slab->priority, + &slab->allocq_entry); +} + +/** + * adjust_free_block_count() - Adjust the free block count and (if needed) reprioritize the slab. + * @increment: should be true if the free block count went up. 
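+ *
+ * The allocator-wide allocated_blocks count moves in the opposite direction of the slab's
+ * free block count. The open slab is not reprioritized until it is closed.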
+ */ +static void adjust_free_block_count(struct vdo_slab *slab, bool increment) +{ + struct block_allocator *allocator = slab->allocator; + + WRITE_ONCE(allocator->allocated_blocks, + allocator->allocated_blocks + (increment ? -1 : 1)); + + /* The open slab doesn't need to be reprioritized until it is closed. */ + if (slab == allocator->open_slab) + return; + + /* Don't bother adjusting the priority table if unneeded. */ + if (slab->priority == calculate_slab_priority(slab)) + return; + + /* + * Reprioritize the slab to reflect the new free block count by removing it from the table + * and re-enqueuing it with the new priority. + */ + vdo_priority_table_remove(allocator->prioritized_slabs, &slab->allocq_entry); + prioritize_slab(slab); +} + +/** + * increment_for_data() - Increment the reference count for a data block. + * @slab: The slab which owns the block. + * @block: The reference block which contains the block being updated. + * @block_number: The block to update. + * @old_status: The reference status of the data block before this increment. + * @lock: The pbn_lock associated with this increment (may be NULL). + * @counter_ptr: A pointer to the count for the data block (in, out). + * @adjust_block_count: Whether to update the allocator's free block count. + * + * Return: VDO_SUCCESS or an error. + */ +static int increment_for_data(struct vdo_slab *slab, + struct reference_block *block, + slab_block_number block_number, + enum reference_status old_status, + struct pbn_lock *lock, + vdo_refcount_t *counter_ptr, + bool adjust_block_count) +{ + switch (old_status) { + case RS_FREE: + *counter_ptr = 1; + block->allocated_count++; + slab->free_blocks--; + if (adjust_block_count) + adjust_free_block_count(slab, false); + + break; + + case RS_PROVISIONAL: + *counter_ptr = 1; + break; + + default: + /* Single or shared */ + if (*counter_ptr >= MAXIMUM_REFERENCE_COUNT) + return uds_log_error_strerror(VDO_REF_COUNT_INVALID, + "Incrementing a block already having 254 references (slab %u, offset %u)", + slab->slab_number, + block_number); + (*counter_ptr)++; + } + + if (lock != NULL) + vdo_unassign_pbn_lock_provisional_reference(lock); + return VDO_SUCCESS; +} + +/** + * decrement_for_data() - Decrement the reference count for a data block. + * @slab: The slab which owns the block. + * @block: The reference block which contains the block being updated. + * @block_number: The block to update. + * @old_status: The reference status of the data block before this decrement. + * @updater: The reference updater doing this operation in case we need to look up the pbn lock. + * @lock: The pbn_lock associated with the block being decremented (may be NULL). + * @counter_ptr: A pointer to the count for the data block (in, out). + * @adjust_block_count: Whether to update the allocator's free block count. + * + * Return: VDO_SUCCESS or an error. 
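+ *
+ * If a read lock is held on the block, the count is left at PROVISIONAL_REFERENCE_COUNT
+ * rather than dropping to zero, so the block does not become unreferenced while the lock
+ * exists.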
+ */ +static int decrement_for_data(struct vdo_slab *slab, + struct reference_block *block, + slab_block_number block_number, + enum reference_status old_status, + struct reference_updater *updater, + vdo_refcount_t *counter_ptr, + bool adjust_block_count) +{ + switch (old_status) { + case RS_FREE: + return uds_log_error_strerror(VDO_REF_COUNT_INVALID, + "Decrementing free block at offset %u in slab %u", + block_number, + slab->slab_number); + + case RS_PROVISIONAL: + case RS_SINGLE: + if (updater->zpbn.zone != NULL) { + struct pbn_lock *lock = vdo_get_physical_zone_pbn_lock(updater->zpbn.zone, + updater->zpbn.pbn); + + if (lock != NULL) { + /* + * There is a read lock on this block, so the block must not become + * unreferenced. + */ + *counter_ptr = PROVISIONAL_REFERENCE_COUNT; + vdo_assign_pbn_lock_provisional_reference(lock); + break; + } + } + + *counter_ptr = EMPTY_REFERENCE_COUNT; + block->allocated_count--; + slab->free_blocks++; + if (adjust_block_count) + adjust_free_block_count(slab, true); + + break; + + default: + /* Shared */ + (*counter_ptr)--; + } + + return VDO_SUCCESS; +} + +/** + * increment_for_block_map() - Increment the reference count for a block map page. + * @slab: The slab which owns the block. + * @block: The reference block which contains the block being updated. + * @block_number: The block to update. + * @old_status: The reference status of the block before this increment. + * @lock: The pbn_lock associated with this increment (may be NULL). + * @normal_operation: Whether we are in normal operation vs. recovery or rebuild. + * @counter_ptr: A pointer to the count for the block (in, out). + * @adjust_block_count: Whether to update the allocator's free block count. + * + * All block map increments should be from provisional to MAXIMUM_REFERENCE_COUNT. Since block map + * blocks never dedupe they should never be adjusted from any other state. The adjustment always + * results in MAXIMUM_REFERENCE_COUNT as this value is used to prevent dedupe against block map + * blocks. + * + * Return: VDO_SUCCESS or an error. 
+ */ +static int increment_for_block_map(struct vdo_slab *slab, + struct reference_block *block, + slab_block_number block_number, + enum reference_status old_status, + struct pbn_lock *lock, + bool normal_operation, + vdo_refcount_t *counter_ptr, + bool adjust_block_count) +{ + switch (old_status) { + case RS_FREE: + if (normal_operation) + return uds_log_error_strerror(VDO_REF_COUNT_INVALID, + "Incrementing unallocated block map block (slab %u, offset %u)", + slab->slab_number, + block_number); + + *counter_ptr = MAXIMUM_REFERENCE_COUNT; + block->allocated_count++; + slab->free_blocks--; + if (adjust_block_count) + adjust_free_block_count(slab, false); + + return VDO_SUCCESS; + + case RS_PROVISIONAL: + if (!normal_operation) + return uds_log_error_strerror(VDO_REF_COUNT_INVALID, + "Block map block had provisional reference during replay (slab %u, offset %u)", + slab->slab_number, + block_number); + + *counter_ptr = MAXIMUM_REFERENCE_COUNT; + if (lock != NULL) + vdo_unassign_pbn_lock_provisional_reference(lock); + return VDO_SUCCESS; + + default: + return uds_log_error_strerror(VDO_REF_COUNT_INVALID, + "Incrementing a block map block which is already referenced %u times (slab %u, offset %u)", + *counter_ptr, + slab->slab_number, + block_number); + } +} + +static bool __must_check is_valid_journal_point(const struct journal_point *point) +{ + return ((point != NULL) && (point->sequence_number > 0)); +} + +/** + * update_reference_count() - Update the reference count of a block. + * @slab: The slab which owns the block. + * @block: The reference block which contains the block being updated. + * @block_number: The block to update. + * @slab_journal_point: The slab journal point at which this update is journaled. + * @updater: The reference updater. + * @normal_operation: Whether we are in normal operation vs. recovery or rebuild. + * @adjust_block_count: Whether to update the slab's free block count. + * @provisional_decrement_ptr: A pointer which will be set to true if this update was a decrement + * of a provisional reference. + * + * Return: VDO_SUCCESS or an error. 
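+ *
+ * A decrement of a provisional reference sets *provisional_decrement_ptr (when provided) and
+ * does not advance the slab's journal point.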
+ */ +static int +update_reference_count(struct vdo_slab *slab, + struct reference_block *block, + slab_block_number block_number, + const struct journal_point *slab_journal_point, + struct reference_updater *updater, + bool normal_operation, + bool adjust_block_count, + bool *provisional_decrement_ptr) +{ + vdo_refcount_t *counter_ptr = &slab->counters[block_number]; + enum reference_status old_status = reference_count_to_status(*counter_ptr); + int result; + + if (!updater->increment) { + result = decrement_for_data(slab, + block, + block_number, + old_status, + updater, + counter_ptr, + adjust_block_count); + if ((result == VDO_SUCCESS) && (old_status == RS_PROVISIONAL)) { + if (provisional_decrement_ptr != NULL) + *provisional_decrement_ptr = true; + return VDO_SUCCESS; + } + } else if (updater->operation == VDO_JOURNAL_DATA_REMAPPING) { + result = increment_for_data(slab, + block, + block_number, + old_status, + updater->lock, + counter_ptr, + adjust_block_count); + } else { + result = increment_for_block_map(slab, + block, + block_number, + old_status, + updater->lock, + normal_operation, + counter_ptr, + adjust_block_count); + } + + if (result != VDO_SUCCESS) + return result; + + if (is_valid_journal_point(slab_journal_point)) + slab->slab_journal_point = *slab_journal_point; + + return VDO_SUCCESS; +} + +static int __must_check +adjust_reference_count(struct vdo_slab *slab, + struct reference_updater *updater, + const struct journal_point *slab_journal_point) +{ + slab_block_number block_number; + int result; + struct reference_block *block; + bool provisional_decrement = false; + + if (!is_slab_open(slab)) + return VDO_INVALID_ADMIN_STATE; + + result = slab_block_number_from_pbn(slab, updater->zpbn.pbn, &block_number); + if (result != VDO_SUCCESS) + return result; + + block = get_reference_block(slab, block_number); + result = update_reference_count(slab, + block, + block_number, + slab_journal_point, + updater, + NORMAL_OPERATION, + true, + &provisional_decrement); + if ((result != VDO_SUCCESS) || provisional_decrement) + return result; + + if (block->is_dirty && (block->slab_journal_lock > 0)) { + sequence_number_t entry_lock = slab_journal_point->sequence_number; + /* + * This block is already dirty and a slab journal entry has been made for it since + * the last time it was clean. We must release the per-entry slab journal lock for + * the entry associated with the update we are now doing. + */ + result = ASSERT(is_valid_journal_point(slab_journal_point), + "Reference count adjustments need slab journal points."); + if (result != VDO_SUCCESS) + return result; + + adjust_slab_journal_block_reference(&slab->journal, entry_lock, -1); + return VDO_SUCCESS; + } + + /* + * This may be the first time we are applying an update for which there is a slab journal + * entry to this block since the block was cleaned. Therefore, we convert the per-entry + * slab journal lock to an uncommitted reference block lock, if there is a per-entry lock. + */ + if (is_valid_journal_point(slab_journal_point)) + block->slab_journal_lock = slab_journal_point->sequence_number; + else + block->slab_journal_lock = 0; + + dirty_block(block); + return VDO_SUCCESS; +} + +/** + * add_entry_from_waiter() - Add an entry to the slab journal. + * @waiter: The vio which should make an entry now. + * @context: The slab journal to make an entry in. + * + * This callback is invoked by add_entries() once it has determined that we are ready to make + * another entry in the slab journal. Implements waiter_callback. 
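+ *
+ * The slab journal entry is always made before the reference count itself is updated. If the
+ * slab has not yet been rebuilt, the count update is skipped here, since scrubbing will apply
+ * the journal entry later.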
+ */ +static void add_entry_from_waiter(struct waiter *waiter, void *context) +{ + int result; + struct reference_updater *updater = container_of(waiter, struct reference_updater, waiter); + struct data_vio *data_vio = data_vio_from_reference_updater(updater); + struct slab_journal *journal = context; + struct slab_journal_block_header *header = &journal->tail_header; + struct journal_point slab_journal_point = { + .sequence_number = header->sequence_number, + .entry_count = header->entry_count, + }; + sequence_number_t recovery_block = data_vio->recovery_journal_point.sequence_number; + + if (header->entry_count == 0) { + /* + * This is the first entry in the current tail block, so get a lock on the recovery + * journal which we will hold until this tail block is committed. + */ + get_lock(journal, header->sequence_number)->recovery_start = recovery_block; + if (journal->recovery_journal != NULL) { + zone_count_t zone_number = journal->slab->allocator->zone_number; + + vdo_acquire_recovery_journal_block_reference(journal->recovery_journal, + recovery_block, + VDO_ZONE_TYPE_PHYSICAL, + zone_number); + } + + mark_slab_journal_dirty(journal, recovery_block); + reclaim_journal_space(journal); + } + + add_entry(journal, + updater->zpbn.pbn, + updater->operation, + updater->increment, + expand_journal_point(data_vio->recovery_journal_point, updater->increment)); + + if (journal->slab->status != VDO_SLAB_REBUILT) { + /* + * If the slab is unrecovered, scrubbing will take care of the count since the + * update is now recorded in the journal. + */ + adjust_slab_journal_block_reference(journal, + slab_journal_point.sequence_number, + -1); + result = VDO_SUCCESS; + } else { + /* Now that an entry has been made in the slab journal, update the counter. */ + result = adjust_reference_count(journal->slab, updater, &slab_journal_point); + } + + if (updater->increment) + continue_data_vio_with_error(data_vio, result); + else + vdo_continue_completion(&data_vio->decrement_completion, result); +} + +/** + * is_next_entry_a_block_map_increment() - Check whether the next entry to be made is a block map + * increment. + * @journal: The journal. + * + * Return: true if the first entry waiter's operation is a block map increment. + */ +static inline bool is_next_entry_a_block_map_increment(struct slab_journal *journal) +{ + struct waiter *waiter = vdo_get_first_waiter(&journal->entry_waiters); + struct reference_updater *updater = container_of(waiter, struct reference_updater, waiter); + + return (updater->operation == VDO_JOURNAL_BLOCK_MAP_REMAPPING); +} + +/** + * add_entries() - Add as many entries as possible from the queue of vios waiting to make entries. + * @journal: The journal to which entries may be added. + * + * By processing the queue in order, we ensure that slab journal entries are made in the same order + * as recovery journal entries for the same increment or decrement. + */ +static void add_entries(struct slab_journal *journal) +{ + if (journal->adding_entries) + /* Protect against re-entrancy. */ + return; + + journal->adding_entries = true; + while (vdo_has_waiters(&journal->entry_waiters)) { + struct slab_journal_block_header *header = &journal->tail_header; + + if (journal->partial_write_in_progress || + (journal->slab->status == VDO_SLAB_REBUILDING)) + /* + * Don't add entries while rebuilding or while a partial write is + * outstanding (VDO-2399). 
+ */ + break; + + if (journal->waiting_to_commit) { + /* + * If we are waiting for resources to write the tail block, and the tail + * block is full, we can't make another entry. + */ + WRITE_ONCE(journal->events->tail_busy_count, + journal->events->tail_busy_count + 1); + break; + } else if (is_next_entry_a_block_map_increment(journal) && + (header->entry_count >= journal->full_entries_per_block)) { + /* + * The tail block does not have room for a block map increment, so commit + * it now. + */ + commit_tail(journal); + if (journal->waiting_to_commit) { + WRITE_ONCE(journal->events->tail_busy_count, + journal->events->tail_busy_count + 1); + break; + } + } + + /* If the slab is over the blocking threshold, make the vio wait. */ + if (requires_reaping(journal)) { + WRITE_ONCE(journal->events->blocked_count, + journal->events->blocked_count + 1); + save_dirty_reference_blocks(journal->slab); + break; + } + + if (header->entry_count == 0) { + struct journal_lock *lock = get_lock(journal, header->sequence_number); + + /* + * Check if the on disk slab journal is full. Because of the blocking and + * scrubbing thresholds, this should never happen. + */ + if (lock->count > 0) { + ASSERT_LOG_ONLY((journal->head + journal->size) == journal->tail, + "New block has locks, but journal is not full"); + + /* + * The blocking threshold must let the journal fill up if the new + * block has locks; if the blocking threshold is smaller than the + * journal size, the new block cannot possibly have locks already. + */ + ASSERT_LOG_ONLY((journal->blocking_threshold >= journal->size), + "New block can have locks already iff blocking threshold is at the end of the journal"); + + WRITE_ONCE(journal->events->disk_full_count, + journal->events->disk_full_count + 1); + save_dirty_reference_blocks(journal->slab); + break; + } + + /* + * Don't allow the new block to be reaped until all of the reference count + * blocks are written and the journal block has been fully committed as + * well. + */ + lock->count = journal->entries_per_block + 1; + + if (header->sequence_number == 1) { + struct vdo_slab *slab = journal->slab; + block_count_t i; + + /* + * This is the first entry in this slab journal, ever. Dirty all of + * the reference count blocks. Each will acquire a lock on the tail + * block so that the journal won't be reaped until the reference + * counts are initialized. The lock acquisition must be done by the + * ref_counts since here we don't know how many reference blocks + * the ref_counts has. + */ + for (i = 0; i < slab->reference_block_count; i++) { + slab->reference_blocks[i].slab_journal_lock = 1; + dirty_block(&slab->reference_blocks[i]); + } + + adjust_slab_journal_block_reference(journal, + 1, + slab->reference_block_count); + } + } + + vdo_notify_next_waiter(&journal->entry_waiters, add_entry_from_waiter, journal); + } + + journal->adding_entries = false; + + /* If there are no waiters, and we are flushing or saving, commit the tail block. */ + if (vdo_is_state_draining(&journal->slab->state) && + !vdo_is_state_suspending(&journal->slab->state) && + !vdo_has_waiters(&journal->entry_waiters)) + commit_tail(journal); +} + +/** + * reset_search_cursor() - Reset the free block search back to the first reference counter in the + * first reference block of a slab. 
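+ * @slab: The slab whose search cursor is being reset.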
+ */ +static void reset_search_cursor(struct vdo_slab *slab) +{ + struct search_cursor *cursor = &slab->search_cursor; + + cursor->block = cursor->first_block; + cursor->index = 0; + /* Unit tests have slabs with only one reference block (and it's a runt). */ + cursor->end_index = min_t(u32, COUNTS_PER_BLOCK, slab->block_count); +} + +/** + * advance_search_cursor() - Advance the search cursor to the start of the next reference block in + * a slab, + * + * Wraps around to the first reference block if the current block is the last reference block. + * + * Return: true unless the cursor was at the last reference block. + */ +static bool advance_search_cursor(struct vdo_slab *slab) +{ + struct search_cursor *cursor = &slab->search_cursor; + + /* + * If we just finished searching the last reference block, then wrap back around to the + * start of the array. + */ + if (cursor->block == cursor->last_block) { + reset_search_cursor(slab); + return false; + } + + /* We're not already at the end, so advance to cursor to the next block. */ + cursor->block++; + cursor->index = cursor->end_index; + + if (cursor->block == cursor->last_block) + /* The last reference block will usually be a runt. */ + cursor->end_index = slab->block_count; + else + cursor->end_index += COUNTS_PER_BLOCK; + return true; +} + +/** + * replay_reference_count_change() - Replay the reference count adjustment from a slab journal + * entry into the reference count for a block. + * @slab: The slab. + * @entry_point: The slab journal point for the entry. + * @entry: The slab journal entry being replayed. + * + * The adjustment will be ignored if it was already recorded in the reference count. + * + * Return: VDO_SUCCESS or an error code. + */ +static int +replay_reference_count_change(struct vdo_slab *slab, + const struct journal_point *entry_point, + struct slab_journal_entry entry) +{ + int result; + struct reference_block *block = get_reference_block(slab, entry.sbn); + sector_count_t sector = (entry.sbn % COUNTS_PER_BLOCK) / COUNTS_PER_SECTOR; + struct reference_updater updater = { + .operation = entry.operation, + .increment = entry.increment, + }; + + if (!vdo_before_journal_point(&block->commit_points[sector], entry_point)) + /* This entry is already reflected in the existing counts, so do nothing. */ + return VDO_SUCCESS; + + /* This entry is not yet counted in the reference counts. */ + result = update_reference_count(slab, + block, + entry.sbn, + entry_point, + &updater, + !NORMAL_OPERATION, + false, + NULL); + if (result != VDO_SUCCESS) + return result; + + dirty_block(block); + return VDO_SUCCESS; +} + +/** + * find_zero_byte_in_word() - Find the array index of the first zero byte in word-sized range of + * reference counters. + * @word_ptr: A pointer to the eight counter bytes to check. + * @start_index: The array index corresponding to word_ptr[0]. + * @fail_index: The array index to return if no zero byte is found. + * + * The search does no bounds checking; the function relies on the array being sufficiently padded. + * + * Return: The array index of the first zero byte in the word, or the value passed as fail_index if + * no zero byte was found. + */ +static inline slab_block_number +find_zero_byte_in_word(const u8 *word_ptr, + slab_block_number start_index, + slab_block_number fail_index) +{ + u64 word = get_unaligned_le64(word_ptr); + + /* This looks like a loop, but GCC will unroll the eight iterations for us. 
*/ + unsigned int offset; + + for (offset = 0; offset < BYTES_PER_WORD; offset++) { + /* Assumes little-endian byte order, which we have on X86. */ + if ((word & 0xFF) == 0) + return (start_index + offset); + word >>= 8; + } + + return fail_index; +} + +/** + * vdo_find_free_block() - Find the first block with a reference count of zero in the specified + * range of reference counter indexes. + * @slab: The slab counters to scan. + * @index_ptr: A pointer to hold the array index of the free block. + * + * Exposed for unit testing. + * + * Return: true if a free block was found in the specified range. + */ +static bool +find_free_block(const struct vdo_slab *slab, slab_block_number *index_ptr) +{ + slab_block_number zero_index; + slab_block_number next_index = slab->search_cursor.index; + slab_block_number end_index = slab->search_cursor.end_index; + u8 *next_counter = &slab->counters[next_index]; + u8 *end_counter = &slab->counters[end_index]; + + /* + * Search every byte of the first unaligned word. (Array is padded so reading past end is + * safe.) + */ + zero_index = find_zero_byte_in_word(next_counter, next_index, end_index); + if (zero_index < end_index) { + *index_ptr = zero_index; + return true; + } + + /* + * On architectures where unaligned word access is expensive, this would be a good place to + * advance to an alignment boundary. + */ + next_index += BYTES_PER_WORD; + next_counter += BYTES_PER_WORD; + + /* + * Now we're word-aligned; check an word at a time until we find a word containing a zero. + * (Array is padded so reading past end is safe.) + */ + while (next_counter < end_counter) { + /* + * The following code is currently an exact copy of the code preceding the loop, + * but if you try to merge them by using a do loop, it runs slower because a jump + * instruction gets added at the start of the iteration. + */ + zero_index = find_zero_byte_in_word(next_counter, next_index, end_index); + if (zero_index < end_index) { + *index_ptr = zero_index; + return true; + } + + next_index += BYTES_PER_WORD; + next_counter += BYTES_PER_WORD; + } + + return false; +} + +/** + * search_current_reference_block() - Search the reference block currently saved in the search + * cursor for a reference count of zero, starting at the saved + * counter index. + * @slab: The slab to search. + * @free_index_ptr: A pointer to receive the array index of the zero reference count. + * + * Return: true if an unreferenced counter was found. + */ +static bool search_current_reference_block(const struct vdo_slab *slab, + slab_block_number *free_index_ptr) +{ + /* Don't bother searching if the current block is known to be full. */ + return ((slab->search_cursor.block->allocated_count < COUNTS_PER_BLOCK) && + find_free_block(slab, free_index_ptr)); +} + +/** + * search_reference_blocks() - Search each reference block for a reference count of zero. + * @slab: The slab to search. + * @free_index_ptr: A pointer to receive the array index of the zero reference count. + * + * Searches each reference block for a reference count of zero, starting at the reference block and + * counter index saved in the search cursor and searching up to the end of the last reference + * block. The search does not wrap. + * + * Return: true if an unreferenced counter was found. + */ +static bool +search_reference_blocks(struct vdo_slab *slab, slab_block_number *free_index_ptr) +{ + /* Start searching at the saved search position in the current block. 
*/ + if (search_current_reference_block(slab, free_index_ptr)) + return true; + + /* Search each reference block up to the end of the slab. */ + while (advance_search_cursor(slab)) + if (search_current_reference_block(slab, free_index_ptr)) + return true; + + return false; +} + +/** + * make_provisional_reference() - Do the bookkeeping for making a provisional reference. + */ +static void +make_provisional_reference(struct vdo_slab *slab, slab_block_number block_number) +{ + struct reference_block *block = get_reference_block(slab, block_number); + + /* + * Make the initial transition from an unreferenced block to a + * provisionally allocated block. + */ + slab->counters[block_number] = PROVISIONAL_REFERENCE_COUNT; + + /* Account for the allocation. */ + block->allocated_count++; + slab->free_blocks--; +} + +/** + * dirty_all_reference_blocks() - Mark all reference count blocks in a slab as dirty. + */ +static void dirty_all_reference_blocks(struct vdo_slab *slab) +{ + block_count_t i; + + for (i = 0; i < slab->reference_block_count; i++) + dirty_block(&slab->reference_blocks[i]); +} + +/** + * clear_provisional_references() - Clear the provisional reference counts from a reference block. + * @block: The block to clear. + */ +static void clear_provisional_references(struct reference_block *block) +{ + vdo_refcount_t *counters = get_reference_counters_for_block(block); + block_count_t j; + + for (j = 0; j < COUNTS_PER_BLOCK; j++) { + if (counters[j] == PROVISIONAL_REFERENCE_COUNT) { + counters[j] = EMPTY_REFERENCE_COUNT; + block->allocated_count--; + } + } +} + +static inline bool journal_points_equal(struct journal_point first, struct journal_point second) +{ + return ((first.sequence_number == second.sequence_number) && + (first.entry_count == second.entry_count)); +} + +/** + * unpack_reference_block() - Unpack reference counts blocks into the internal memory structure. + * @packed: The written reference block to be unpacked. + * @block: The internal reference block to be loaded. + */ +static void +unpack_reference_block(struct packed_reference_block *packed, struct reference_block *block) +{ + block_count_t index; + sector_count_t i; + struct vdo_slab *slab = block->slab; + vdo_refcount_t *counters = get_reference_counters_for_block(block); + + for (i = 0; i < VDO_SECTORS_PER_BLOCK; i++) { + struct packed_reference_sector *sector = &packed->sectors[i]; + + vdo_unpack_journal_point(§or->commit_point, &block->commit_points[i]); + memcpy(counters + (i * COUNTS_PER_SECTOR), + sector->counts, + (sizeof(vdo_refcount_t) * COUNTS_PER_SECTOR)); + /* The slab_journal_point must be the latest point found in any sector. */ + if (vdo_before_journal_point(&slab->slab_journal_point, &block->commit_points[i])) + slab->slab_journal_point = block->commit_points[i]; + + if ((i > 0) && + !journal_points_equal(block->commit_points[0], block->commit_points[i])) { + size_t block_index = block - block->slab->reference_blocks; + + uds_log_warning("Torn write detected in sector %u of reference block %zu of slab %u", + i, + block_index, + block->slab->slab_number); + } + } + + block->allocated_count = 0; + for (index = 0; index < COUNTS_PER_BLOCK; index++) + if (counters[index] != EMPTY_REFERENCE_COUNT) + block->allocated_count++; +} + +/** + * finish_reference_block_load() - After a reference block has been read, unpack it. + * @completion: The VIO that just finished reading. 
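+ *
+ * Any provisional reference counts found in the loaded block are cleared before the slab's
+ * free block count is adjusted.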
+ */ +static void finish_reference_block_load(struct vdo_completion *completion) +{ + struct vio *vio = as_vio(completion); + struct pooled_vio *pooled = vio_as_pooled_vio(vio); + struct reference_block *block = completion->parent; + struct vdo_slab *slab = block->slab; + + unpack_reference_block((struct packed_reference_block *) vio->data, block); + return_vio_to_pool(slab->allocator->vio_pool, pooled); + slab->active_count--; + clear_provisional_references(block); + + slab->free_blocks -= block->allocated_count; + check_if_slab_drained(slab); +} + +static void load_reference_block_endio(struct bio *bio) +{ + struct vio *vio = bio->bi_private; + struct reference_block *block = vio->completion.parent; + + continue_vio_after_io(vio, finish_reference_block_load, block->slab->allocator->thread_id); +} + +/** + * load_reference_block() - After a block waiter has gotten a VIO from the VIO pool, load the + * block. + * @waiter: The waiter of the block to load. + * @context: The VIO returned by the pool. + */ +static void load_reference_block(struct waiter *waiter, void *context) +{ + struct pooled_vio *pooled = context; + struct vio *vio = &pooled->vio; + struct reference_block *block = container_of(waiter, struct reference_block, waiter); + size_t block_offset = (block - block->slab->reference_blocks); + + vio->completion.parent = block; + submit_metadata_vio(vio, + block->slab->ref_counts_origin + block_offset, + load_reference_block_endio, + handle_io_error, + REQ_OP_READ); +} + +/** + * load_reference_blocks() - Load a slab's reference blocks from the underlying storage into a + * pre-allocated reference counter. + */ +static void load_reference_blocks(struct vdo_slab *slab) +{ + block_count_t i; + + slab->free_blocks = slab->block_count; + slab->active_count = slab->reference_block_count; + for (i = 0; i < slab->reference_block_count; i++) { + struct waiter *waiter = &slab->reference_blocks[i].waiter; + + waiter->callback = load_reference_block; + acquire_vio_from_pool(slab->allocator->vio_pool, waiter); + } +} + +/** + * drain_slab() - Drain all reference count I/O. + * + * Depending upon the type of drain being performed (as recorded in the ref_count's vdo_slab), the + * reference blocks may be loaded from disk or dirty reference blocks may be written out. + */ +static void drain_slab(struct vdo_slab *slab) +{ + bool save; + bool load; + const struct admin_state_code *state = vdo_get_admin_state_code(&slab->state); + + if (state == VDO_ADMIN_STATE_SUSPENDING) + return; + + if ((state != VDO_ADMIN_STATE_REBUILDING) && (state != VDO_ADMIN_STATE_SAVE_FOR_SCRUBBING)) + commit_tail(&slab->journal); + + if ((state == VDO_ADMIN_STATE_RECOVERING) || (slab->counters == NULL)) + return; + + save = false; + load = slab->allocator->summary_entries[slab->slab_number].load_ref_counts; + if (state == VDO_ADMIN_STATE_SCRUBBING) { + if (load) { + load_reference_blocks(slab); + return; + } + } else if (state == VDO_ADMIN_STATE_SAVE_FOR_SCRUBBING) { + if (!load) + /* These reference counts were never written, so mark them all dirty. */ + dirty_all_reference_blocks(slab); + + save = true; + } else if (state == VDO_ADMIN_STATE_REBUILDING) { + /* + * Write out the counters if the slab has written them before, or it has any + * non-zero reference counts, or there are any slab journal blocks. 
+ */ + block_count_t data_blocks = slab->allocator->depot->slab_config.data_blocks; + + if (load || + (slab->free_blocks != data_blocks) || + !is_slab_journal_blank(slab)) { + dirty_all_reference_blocks(slab); + save = true; + } + } else if (state == VDO_ADMIN_STATE_SAVING) { + save = (slab->status == VDO_SLAB_REBUILT); + } else { + vdo_finish_draining_with_result(&slab->state, VDO_SUCCESS); + return; + } + + if (save) + save_dirty_reference_blocks(slab); +} + +static int allocate_slab_counters(struct vdo_slab *slab) +{ + int result; + size_t index, bytes; + + result = ASSERT(slab->reference_blocks == NULL, + "vdo_slab %u doesn't allocate refcounts twice", + slab->slab_number); + if (result != VDO_SUCCESS) + return result; + + result = UDS_ALLOCATE(slab->reference_block_count, + struct reference_block, + __func__, + &slab->reference_blocks); + if (result != VDO_SUCCESS) + return result; + + /* + * Allocate such that the runt slab has a full-length memory array, plus a little padding + * so we can word-search even at the very end. + */ + bytes = (slab->reference_block_count * COUNTS_PER_BLOCK) + (2 * BYTES_PER_WORD); + result = UDS_ALLOCATE(bytes, vdo_refcount_t, "ref counts array", &slab->counters); + if (result != UDS_SUCCESS) { + UDS_FREE(UDS_FORGET(slab->reference_blocks)); + return result; + } + + slab->search_cursor.first_block = slab->reference_blocks; + slab->search_cursor.last_block = &slab->reference_blocks[slab->reference_block_count - 1]; + reset_search_cursor(slab); + + for (index = 0; index < slab->reference_block_count; index++) { + slab->reference_blocks[index] = (struct reference_block) { + .slab = slab, + }; + } + + return VDO_SUCCESS; +} + +static int allocate_counters_if_clean(struct vdo_slab *slab) +{ + if (vdo_is_state_clean_load(&slab->state)) + return allocate_slab_counters(slab); + + return VDO_SUCCESS; +} + +static void finish_loading_journal(struct vdo_completion *completion) +{ + struct vio *vio = as_vio(completion); + struct slab_journal *journal = completion->parent; + struct vdo_slab *slab = journal->slab; + struct packed_slab_journal_block *block = (struct packed_slab_journal_block *) vio->data; + struct slab_journal_block_header header; + + vdo_unpack_slab_journal_block_header(&block->header, &header); + + /* FIXME: should it be an error if the following conditional fails? */ + if ((header.metadata_type == VDO_METADATA_SLAB_JOURNAL) && + (header.nonce == slab->allocator->nonce)) { + journal->tail = header.sequence_number + 1; + + /* + * If the slab is clean, this implies the slab journal is empty, so advance the + * head appropriately. + */ + journal->head = (slab->allocator->summary_entries[slab->slab_number].is_dirty ? 
+ header.head : + journal->tail); + journal->tail_header = header; + initialize_journal_state(journal); + } + + return_vio_to_pool(slab->allocator->vio_pool, vio_as_pooled_vio(vio)); + vdo_finish_loading_with_result(&slab->state, allocate_counters_if_clean(slab)); +} + +static void read_slab_journal_tail_endio(struct bio *bio) +{ + struct vio *vio = bio->bi_private; + struct slab_journal *journal = vio->completion.parent; + + continue_vio_after_io(vio, finish_loading_journal, journal->slab->allocator->thread_id); +} + +static void handle_load_error(struct vdo_completion *completion) +{ + int result = completion->result; + struct slab_journal *journal = completion->parent; + struct vio *vio = as_vio(completion); + + vio_record_metadata_io_error(vio); + return_vio_to_pool(journal->slab->allocator->vio_pool, vio_as_pooled_vio(vio)); + vdo_finish_loading_with_result(&journal->slab->state, result); +} + +/** + * read_slab_journal_tail() - Read the slab journal tail block by using a vio acquired from the vio + * pool. + * @waiter: The vio pool waiter which has just been notified. + * @context: The vio pool entry given to the waiter. + * + * This is the success callback from acquire_vio_from_pool() when loading a slab journal. + */ +static void read_slab_journal_tail(struct waiter *waiter, void *context) +{ + struct slab_journal *journal = container_of(waiter, struct slab_journal, resource_waiter); + struct vdo_slab *slab = journal->slab; + struct pooled_vio *pooled = context; + struct vio *vio = &pooled->vio; + tail_block_offset_t last_commit_point = + slab->allocator->summary_entries[slab->slab_number].tail_block_offset; + + /* + * Slab summary keeps the commit point offset, so the tail block is the block before that. + * Calculation supports small journals in unit tests. + */ + tail_block_offset_t tail_block = ((last_commit_point == 0) ? + (tail_block_offset_t)(journal->size - 1) : + (last_commit_point - 1)); + + vio->completion.parent = journal; + vio->completion.callback_thread_id = slab->allocator->thread_id; + submit_metadata_vio(vio, + slab->journal_origin + tail_block, + read_slab_journal_tail_endio, + handle_load_error, + REQ_OP_READ); +} + +/** + * load_slab_journal() - Load a slab's journal by reading the journal's tail. + */ +static void load_slab_journal(struct vdo_slab *slab) +{ + struct slab_journal *journal = &slab->journal; + tail_block_offset_t last_commit_point; + + last_commit_point = slab->allocator->summary_entries[slab->slab_number].tail_block_offset; + if ((last_commit_point == 0) && + !slab->allocator->summary_entries[slab->slab_number].load_ref_counts) { + /* + * This slab claims that it has a tail block at (journal->size - 1), but a head of + * 1. This is impossible, due to the scrubbing threshold, on a real system, so + * don't bother reading the (bogus) data off disk. 
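+		 * The journal is treated as empty and loading completes immediately.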
+ */ + ASSERT_LOG_ONLY(((journal->size < 16) || + (journal->scrubbing_threshold < (journal->size - 1))), + "Scrubbing threshold protects against reads of unwritten slab journal blocks"); + vdo_finish_loading_with_result(&slab->state, allocate_counters_if_clean(slab)); + return; + } + + journal->resource_waiter.callback = read_slab_journal_tail; + acquire_vio_from_pool(slab->allocator->vio_pool, &journal->resource_waiter); +} + +static void free_slab(struct vdo_slab *slab) +{ + if (slab == NULL) + return; + + list_del(&slab->allocq_entry); + UDS_FREE(UDS_FORGET(slab->journal.block)); + UDS_FREE(UDS_FORGET(slab->journal.locks)); + UDS_FREE(UDS_FORGET(slab->counters)); + UDS_FREE(UDS_FORGET(slab->reference_blocks)); + UDS_FREE(slab); +} + +static int initialize_slab_journal(struct vdo_slab *slab) +{ + struct slab_journal *journal = &slab->journal; + const struct slab_config *slab_config = &slab->allocator->depot->slab_config; + int result; + + result = UDS_ALLOCATE(slab_config->slab_journal_blocks, + struct journal_lock, + __func__, + &journal->locks); + if (result != VDO_SUCCESS) + return result; + + result = UDS_ALLOCATE(VDO_BLOCK_SIZE, + char, + "struct packed_slab_journal_block", + (char **) &journal->block); + if (result != VDO_SUCCESS) + return result; + + journal->slab = slab; + journal->size = slab_config->slab_journal_blocks; + journal->flushing_threshold = slab_config->slab_journal_flushing_threshold; + journal->blocking_threshold = slab_config->slab_journal_blocking_threshold; + journal->scrubbing_threshold = slab_config->slab_journal_scrubbing_threshold; + journal->entries_per_block = VDO_SLAB_JOURNAL_ENTRIES_PER_BLOCK; + journal->full_entries_per_block = VDO_SLAB_JOURNAL_FULL_ENTRIES_PER_BLOCK; + journal->events = &slab->allocator->slab_journal_statistics; + journal->recovery_journal = slab->allocator->depot->vdo->recovery_journal; + journal->tail = 1; + journal->head = 1; + + journal->flushing_deadline = journal->flushing_threshold; + /* + * Set there to be some time between the deadline and the blocking threshold, so that + * hopefully all are done before blocking. + */ + if ((journal->blocking_threshold - journal->flushing_threshold) > 5) + journal->flushing_deadline = journal->blocking_threshold - 5; + + journal->slab_summary_waiter.callback = release_journal_locks; + + INIT_LIST_HEAD(&journal->dirty_entry); + INIT_LIST_HEAD(&journal->uncommitted_blocks); + + journal->tail_header.nonce = slab->allocator->nonce; + journal->tail_header.metadata_type = VDO_METADATA_SLAB_JOURNAL; + initialize_journal_state(journal); + return VDO_SUCCESS; +} + +/** + * make_slab() - Construct a new, empty slab. + * @slab_origin: The physical block number within the block allocator partition of the first block + * in the slab. + * @allocator: The block allocator to which the slab belongs. + * @slab_number: The slab number of the slab. + * @is_new: true if this slab is being allocated as part of a resize. + * @slab_ptr: A pointer to receive the new slab. + * + * Return: VDO_SUCCESS or an error code. 
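+ *
+ * A slab created as part of a resize has its reference counters allocated immediately and
+ * starts in the NEW state; an existing slab starts in NORMAL_OPERATION.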
+ */ +static int __must_check +make_slab(physical_block_number_t slab_origin, + struct block_allocator *allocator, + slab_count_t slab_number, + bool is_new, + struct vdo_slab **slab_ptr) +{ + const struct slab_config *slab_config = &allocator->depot->slab_config; + struct vdo_slab *slab; + int result; + + result = UDS_ALLOCATE(1, struct vdo_slab, __func__, &slab); + if (result != VDO_SUCCESS) + return result; + + *slab = (struct vdo_slab) { + .allocator = allocator, + .start = slab_origin, + .end = slab_origin + slab_config->slab_blocks, + .slab_number = slab_number, + .ref_counts_origin = slab_origin + slab_config->data_blocks, + .journal_origin = vdo_get_slab_journal_start_block(slab_config, slab_origin), + .block_count = slab_config->data_blocks, + .free_blocks = slab_config->data_blocks, + .reference_block_count = + vdo_get_saved_reference_count_size(slab_config->data_blocks), + }; + INIT_LIST_HEAD(&slab->allocq_entry); + + result = initialize_slab_journal(slab); + if (result != VDO_SUCCESS) { + free_slab(slab); + return result; + } + + if (is_new) { + vdo_set_admin_state_code(&slab->state, VDO_ADMIN_STATE_NEW); + result = allocate_slab_counters(slab); + if (result != VDO_SUCCESS) { + free_slab(slab); + return result; + } + } else { + vdo_set_admin_state_code(&slab->state, VDO_ADMIN_STATE_NORMAL_OPERATION); + } + + *slab_ptr = slab; + return VDO_SUCCESS; +} diff --git a/drivers/md/dm-vdo/slab-depot.h b/drivers/md/dm-vdo/slab-depot.h new file mode 100644 index 000000000000..12086d475d1a --- /dev/null +++ b/drivers/md/dm-vdo/slab-depot.h @@ -0,0 +1,253 @@ +/* SPDX-License-Identifier: GPL-2.0-only */ +/* + * Copyright 2023 Red Hat + */ + +#ifndef VDO_SLAB_DEPOT_H +#define VDO_SLAB_DEPOT_H + +#include +#include +#include + +#include "numeric.h" + +#include "admin-state.h" +#include "completion.h" +#include "data-vio.h" +#include "encodings.h" +#include "physical-zone.h" +#include "priority-table.h" +#include "recovery-journal.h" +#include "statistics.h" +#include "types.h" +#include "vio.h" +#include "wait-queue.h" + +/* + * Represents the possible status of a block. 
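+ * The status is not stored separately; it is derived from a block's stored reference count
+ * (see reference_count_to_status() in slab-depot.c).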
+ */ +enum reference_status { + RS_FREE, /* this block is free */ + RS_SINGLE, /* this block is singly-referenced */ + RS_SHARED, /* this block is shared */ + RS_PROVISIONAL /* this block is provisionally allocated */ +}; + +struct vdo_slab; + +struct journal_lock { + u16 count; + sequence_number_t recovery_start; +}; + +struct slab_journal { + /* A waiter object for getting a VIO pool entry */ + struct waiter resource_waiter; + /* A waiter object for updating the slab summary */ + struct waiter slab_summary_waiter; + /* A waiter object for getting a vio with which to flush */ + struct waiter flush_waiter; + /* The queue of VIOs waiting to make an entry */ + struct wait_queue entry_waiters; + /* The parent slab reference of this journal */ + struct vdo_slab *slab; + + /* Whether a tail block commit is pending */ + bool waiting_to_commit; + /* Whether the journal is updating the slab summary */ + bool updating_slab_summary; + /* Whether the journal is adding entries from the entry_waiters queue */ + bool adding_entries; + /* Whether a partial write is in progress */ + bool partial_write_in_progress; + + /* The oldest block in the journal on disk */ + sequence_number_t head; + /* The oldest block in the journal which may not be reaped */ + sequence_number_t unreapable; + /* The end of the half-open interval of the active journal */ + sequence_number_t tail; + /* The next journal block to be committed */ + sequence_number_t next_commit; + /* The tail sequence number that is written in the slab summary */ + sequence_number_t summarized; + /* The tail sequence number that was last summarized in slab summary */ + sequence_number_t last_summarized; + + /* The sequence number of the recovery journal lock */ + sequence_number_t recovery_lock; + + /* + * The number of entries which fit in a single block. Can't use the constant because unit + * tests change this number. + */ + journal_entry_count_t entries_per_block; + /* + * The number of full entries which fit in a single block. Can't use the constant because + * unit tests change this number. + */ + journal_entry_count_t full_entries_per_block; + + /* The recovery journal of the VDO (slab journal holds locks on it) */ + struct recovery_journal *recovery_journal; + + /* The statistics shared by all slab journals in our physical zone */ + struct slab_journal_statistics *events; + /* A list of the VIO pool entries for outstanding journal block writes */ + struct list_head uncommitted_blocks; + + /* + * The current tail block header state. This will be packed into the block just before it + * is written. 
+ */ + struct slab_journal_block_header tail_header; + /* A pointer to a block-sized buffer holding the packed block data */ + struct packed_slab_journal_block *block; + + /* The number of blocks in the on-disk journal */ + block_count_t size; + /* The number of blocks at which to start pushing reference blocks */ + block_count_t flushing_threshold; + /* The number of blocks at which all reference blocks should be writing */ + block_count_t flushing_deadline; + /* The number of blocks at which to wait for reference blocks to write */ + block_count_t blocking_threshold; + /* The number of blocks at which to scrub the slab before coming online */ + block_count_t scrubbing_threshold; + + /* This list entry is for block_allocator to keep a queue of dirty journals */ + struct list_head dirty_entry; + + /* The lock for the oldest unreaped block of the journal */ + struct journal_lock *reap_lock; + /* The locks for each on disk block */ + struct journal_lock *locks; +}; + +/* + * Reference_block structure + * + * Blocks are used as a proxy, permitting saves of partial refcounts. + */ +struct reference_block { + /* This block waits on the ref_counts to tell it to write */ + struct waiter waiter; + /* The slab to which this reference_block belongs */ + struct vdo_slab *slab; + /* The number of references in this block that represent allocations */ + block_size_t allocated_count; + /* The slab journal block on which this block must hold a lock */ + sequence_number_t slab_journal_lock; + /* The slab journal block which should be released when this block is committed */ + sequence_number_t slab_journal_lock_to_release; + /* The point up to which each sector is accurate on disk */ + struct journal_point commit_points[VDO_SECTORS_PER_BLOCK]; + /* Whether this block has been modified since it was written to disk */ + bool is_dirty; + /* Whether this block is currently writing */ + bool is_writing; +}; + +/* The search_cursor represents the saved position of a free block search. */ +struct search_cursor { + /* The reference block containing the current search index */ + struct reference_block *block; + /* The position at which to start searching for the next free counter */ + slab_block_number index; + /* The position just past the last valid counter in the current block */ + slab_block_number end_index; + + /* A pointer to the first reference block in the slab */ + struct reference_block *first_block; + /* A pointer to the last reference block in the slab */ + struct reference_block *last_block; +}; + +enum slab_rebuild_status { + VDO_SLAB_REBUILT, + VDO_SLAB_REPLAYING, + VDO_SLAB_REQUIRES_SCRUBBING, + VDO_SLAB_REQUIRES_HIGH_PRIORITY_SCRUBBING, + VDO_SLAB_REBUILDING, +}; + +/* + * This is the type declaration for the vdo_slab type. A vdo_slab currently consists of a run of + * 2^23 data blocks, but that will soon change to dedicate a small number of those blocks for + * metadata storage for the reference counts and slab journal for the slab. + * + * A reference count is maintained for each physical block number. The vast majority of blocks have + * a very small reference count (usually 0 or 1). For references less than or equal to MAXIMUM_REFS + * (254) the reference count is stored in counters[pbn]. 
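+ *
+ * The special count PROVISIONAL_REFERENCE_COUNT marks a block which must be treated as
+ * allocated even though it does not currently hold a real reference.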
+ */ +struct vdo_slab { + /* A list entry to queue this slab in a block_allocator list */ + struct list_head allocq_entry; + + /* The struct block_allocator that owns this slab */ + struct block_allocator *allocator; + + /* The journal for this slab */ + struct slab_journal journal; + + /* The slab number of this slab */ + slab_count_t slab_number; + /* The offset in the allocator partition of the first block in this slab */ + physical_block_number_t start; + /* The offset of the first block past the end of this slab */ + physical_block_number_t end; + /* The starting translated PBN of the slab journal */ + physical_block_number_t journal_origin; + /* The starting translated PBN of the reference counts */ + physical_block_number_t ref_counts_origin; + + /* The administrative state of the slab */ + struct admin_state state; + /* The status of the slab */ + enum slab_rebuild_status status; + /* Whether the slab was ever queued for scrubbing */ + bool was_queued_for_scrubbing; + + /* The priority at which this slab has been queued for allocation */ + u8 priority; + + /* Fields beyond this point are the reference counts for the data blocks in this slab. */ + /* The size of the counters array */ + u32 block_count; + /* The number of free blocks */ + u32 free_blocks; + /* The array of reference counts */ + vdo_refcount_t *counters; /* use UDS_ALLOCATE to align data ptr */ + + /* The saved block pointer and array indexes for the free block search */ + struct search_cursor search_cursor; + + /* A list of the dirty blocks waiting to be written out */ + struct wait_queue dirty_blocks; + /* The number of blocks which are currently writing */ + size_t active_count; + + /* A waiter object for updating the slab summary */ + struct waiter summary_waiter; + + /* The latest slab journal for which there has been a reference count update */ + struct journal_point slab_journal_point; + + /* The number of reference count blocks */ + u32 reference_block_count; + /* reference count block array */ + struct reference_block *reference_blocks; +}; + +bool __must_check +vdo_attempt_replay_into_slab(struct vdo_slab *slab, + physical_block_number_t pbn, + enum journal_operation operation, + bool increment, + struct journal_point *recovery_point, + struct vdo_completion *parent); + +void vdo_notify_slab_journals_are_recovered(struct vdo_completion *completion); + +#endif /* VDO_SLAB_DEPOT_H */ From patchwork Thu Oct 26 21:41:23 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Matthew Sakai X-Patchwork-Id: 13437943 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id D3FB5C25B6E for ; Thu, 26 Oct 2023 21:44:25 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S234963AbjJZVoZ (ORCPT ); Thu, 26 Oct 2023 17:44:25 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:52820 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S235021AbjJZVoP (ORCPT ); Thu, 26 Oct 2023 17:44:15 -0400 Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.133.124]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id D15E9D55 for ; Thu, 26 Oct 2023 14:41:46 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1698356505; 
h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=FLdkNZeDzBDZw0g7S19QfUjJntZEmFaL/AX3V9UwJCc=; b=bm9QMHhUF8ylqrKVJHMGgQaDBkzHLKZE6MPi9BGE0bZK0msDASAyRJ2Ns8J8GOA92wJQ6a hMLo9PZbNhXb+wg0i1wwGf+COHu42LoBXo5mom05ir8/i84xO7ctwBhHBXzSN8ZoK/b5U/ sZ3wxrpBPXH5soOL2CgEkH1m5Sl+UT8= Received: from mimecast-mx02.redhat.com (mx-ext.redhat.com [66.187.233.73]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.3, cipher=TLS_AES_256_GCM_SHA384) id us-mta-634-Y1-fw5DLPLa5XLFmMF7Bxw-1; Thu, 26 Oct 2023 17:41:42 -0400 X-MC-Unique: Y1-fw5DLPLa5XLFmMF7Bxw-1 Received: from smtp.corp.redhat.com (int-mx02.intmail.prod.int.rdu2.redhat.com [10.11.54.2]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (2048 bits) server-digest SHA256) (No client certificate requested) by mimecast-mx02.redhat.com (Postfix) with ESMTPS id AF09638143BF; Thu, 26 Oct 2023 21:41:41 +0000 (UTC) Received: from pbitcolo-build-10.permabit.com (pbitcolo-build-10.permabit.lab.eng.bos.redhat.com [10.19.117.76]) by smtp.corp.redhat.com (Postfix) with ESMTP id 9AE9E40C6F79; Thu, 26 Oct 2023 21:41:41 +0000 (UTC) Received: by pbitcolo-build-10.permabit.com (Postfix, from userid 1138) id 517613003D; Thu, 26 Oct 2023 17:41:41 -0400 (EDT) From: Matthew Sakai To: dm-devel@lists.linux.dev, linux-block@vger.kernel.org Cc: Matthew Sakai Subject: [PATCH v4 26/39] dm vdo: add the slab summary Date: Thu, 26 Oct 2023 17:41:23 -0400 Message-Id: <20231026214136.1067410-27-msakai@redhat.com> In-Reply-To: <20231026214136.1067410-1-msakai@redhat.com> References: <20231026214136.1067410-1-msakai@redhat.com> MIME-Version: 1.0 X-Scanned-By: MIMEDefang 3.4.1 on 10.11.54.2 Precedence: bulk List-ID: X-Mailing-List: linux-block@vger.kernel.org The slab depot maintains an additional small data structure, the "slab summary," which is used to reduce the amount of work needed to come back online after a crash. The slab summary maintains an entry for each slab indicating whether or not the slab has ever been used, whether it is clean (i.e. all of its reference count updates have been persisted to storage), and approximately how full it is. During recovery, each physical zone will attempt to recover at least one slab, stopping whenever it has recovered a slab which has some free blocks. Once each zone has some space (or has determined that none is available), the target can resume normal operation in a degraded mode. Read and write requests can be serviced, perhaps with degraded performance, while the remainder of the dirty slabs are recovered. Co-developed-by: J. corwin Coburn Signed-off-by: J. corwin Coburn Co-developed-by: Michael Sclafani Signed-off-by: Michael Sclafani Co-developed-by: Sweet Tea Dorminy Signed-off-by: Sweet Tea Dorminy Signed-off-by: Matthew Sakai --- drivers/md/dm-vdo/slab-depot.c | 326 +++++++++++++++++++++++++++++++++ drivers/md/dm-vdo/slab-depot.h | 74 ++++++++ 2 files changed, 400 insertions(+) diff --git a/drivers/md/dm-vdo/slab-depot.c b/drivers/md/dm-vdo/slab-depot.c index 48adf297a394..e2270319513a 100644 --- a/drivers/md/dm-vdo/slab-depot.c +++ b/drivers/md/dm-vdo/slab-depot.c @@ -190,6 +190,201 @@ static void check_if_slab_drained(struct vdo_slab *slab) vdo_finish_draining_with_result(&slab->state, (read_only ? 
VDO_READ_ONLY : VDO_SUCCESS)); } +/* FULLNESS HINT COMPUTATION */ + +/** + * compute_fullness_hint() - Translate a slab's free block count into a 'fullness hint' that can be + * stored in a slab_summary_entry's 7 bits that are dedicated to its free + * count. + * @depot: The depot whose summary being updated. + * @free_blocks: The number of free blocks. + * + * Note: the number of free blocks must be strictly less than 2^23 blocks, even though + * theoretically slabs could contain precisely 2^23 blocks; there is an assumption that at least + * one block is used by metadata. This assumption is necessary; otherwise, the fullness hint might + * overflow. The fullness hint formula is roughly (fullness >> 16) & 0x7f, but (2^23 >> 16) & 0x7f + * is 0, which would make it impossible to distinguish completely full from completely empty. + * + * Return: A fullness hint, which can be stored in 7 bits. + */ +static u8 __must_check compute_fullness_hint(struct slab_depot *depot, block_count_t free_blocks) +{ + block_count_t hint; + + ASSERT_LOG_ONLY((free_blocks < (1 << 23)), "free blocks must be less than 2^23"); + + if (free_blocks == 0) + return 0; + + hint = free_blocks >> depot->hint_shift; + return ((hint == 0) ? 1 : hint); +} + +/** + * check_summary_drain_complete() - Check whether an allocators summary has finished draining. + */ +static void check_summary_drain_complete(struct block_allocator *allocator) +{ + struct vdo *vdo = allocator->depot->vdo; + + if (!vdo_is_state_draining(&allocator->summary_state) || + (allocator->summary_write_count > 0)) + return; + + vdo_finish_operation(&allocator->summary_state, + (vdo_is_read_only(vdo) ? VDO_READ_ONLY : VDO_SUCCESS)); +} + +/** + * notify_summary_waiters() - Wake all the waiters in a given queue. + * @allocator: The block allocator summary which owns the queue. + * @queue: The queue to notify. + */ +static void notify_summary_waiters(struct block_allocator *allocator, struct wait_queue *queue) +{ + int result = (vdo_is_read_only(allocator->depot->vdo) ? VDO_READ_ONLY : VDO_SUCCESS); + + vdo_notify_all_waiters(queue, NULL, &result); +} + +static void launch_write(struct slab_summary_block *summary_block); + +/** + * finish_updating_slab_summary_block() - Finish processing a block which attempted to write, + * whether or not the attempt succeeded. + * @block: The block. + */ +static void finish_updating_slab_summary_block(struct slab_summary_block *block) +{ + notify_summary_waiters(block->allocator, &block->current_update_waiters); + block->writing = false; + block->allocator->summary_write_count--; + if (vdo_has_waiters(&block->next_update_waiters)) + launch_write(block); + else + check_summary_drain_complete(block->allocator); +} + +/** + * finish_update() - This is the callback for a successful summary block write. + * @completion: The write vio. + */ +static void finish_update(struct vdo_completion *completion) +{ + struct slab_summary_block *block = + container_of(as_vio(completion), struct slab_summary_block, vio); + + atomic64_inc(&block->allocator->depot->summary_statistics.blocks_written); + finish_updating_slab_summary_block(block); +} + +/** + * handle_write_error() - Handle an error writing a slab summary block. + * @completion: The write VIO. 
+ */ +static void handle_write_error(struct vdo_completion *completion) +{ + struct slab_summary_block *block = + container_of(as_vio(completion), struct slab_summary_block, vio); + + vio_record_metadata_io_error(as_vio(completion)); + vdo_enter_read_only_mode(completion->vdo, completion->result); + finish_updating_slab_summary_block(block); +} + +static void write_slab_summary_endio(struct bio *bio) +{ + struct vio *vio = bio->bi_private; + struct slab_summary_block *block = container_of(vio, struct slab_summary_block, vio); + + continue_vio_after_io(vio, finish_update, block->allocator->thread_id); +} + +/** + * launch_write() - Write a slab summary block unless it is currently out for writing. + * @block: The block that needs to be committed. + */ +static void launch_write(struct slab_summary_block *block) +{ + struct block_allocator *allocator = block->allocator; + struct slab_depot *depot = allocator->depot; + physical_block_number_t pbn; + + if (block->writing) + return; + + allocator->summary_write_count++; + vdo_transfer_all_waiters(&block->next_update_waiters, &block->current_update_waiters); + block->writing = true; + + if (vdo_is_read_only(depot->vdo)) { + finish_updating_slab_summary_block(block); + return; + } + + memcpy(block->outgoing_entries, block->entries, VDO_BLOCK_SIZE); + + /* + * Flush before writing to ensure that the slab journal tail blocks and reference updates + * covered by this summary update are stable (VDO-2332). + */ + pbn = (depot->summary_origin + + (VDO_SLAB_SUMMARY_BLOCKS_PER_ZONE * allocator->zone_number) + + block->index); + submit_metadata_vio(&block->vio, + pbn, + write_slab_summary_endio, + handle_write_error, + REQ_OP_WRITE | REQ_PREFLUSH); +} + +/** + * update_slab_summary_entry() - Update the entry for a slab. + * @slab: The slab whose entry is to be updated + * @waiter: The waiter that is updating the summary. + * @tail_block_offset: The offset of the slab journal's tail block. + * @load_ref_counts: Whether the reference counts must be loaded from disk on the vdo load. + * @is_clean: Whether the slab is clean. + * @free_blocks: The number of free blocks. + */ +static void +update_slab_summary_entry(struct vdo_slab *slab, + struct waiter *waiter, + tail_block_offset_t tail_block_offset, + bool load_ref_counts, + bool is_clean, + block_count_t free_blocks) +{ + u8 index = slab->slab_number / VDO_SLAB_SUMMARY_ENTRIES_PER_BLOCK; + struct block_allocator *allocator = slab->allocator; + struct slab_summary_block *block = &allocator->summary_blocks[index]; + int result; + struct slab_summary_entry *entry; + + if (vdo_is_read_only(block->vio.completion.vdo)) { + result = VDO_READ_ONLY; + waiter->callback(waiter, &result); + return; + } + + if (vdo_is_state_draining(&allocator->summary_state) || + vdo_is_state_quiescent(&allocator->summary_state)) { + result = VDO_INVALID_ADMIN_STATE; + waiter->callback(waiter, &result); + return; + } + + entry = &allocator->summary_entries[slab->slab_number]; + *entry = (struct slab_summary_entry) { + .tail_block_offset = tail_block_offset, + .load_ref_counts = (entry->load_ref_counts || load_ref_counts), + .is_dirty = !is_clean, + .fullness_hint = compute_fullness_hint(allocator->depot, free_blocks), + }; + vdo_enqueue_waiter(&block->next_update_waiters, waiter); + launch_write(block); +} + /** * finish_reaping() - Actually advance the head of the journal now that any necessary flushes are * complete. 
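The hunk above packs each slab's state into a single slab_summary_entry, with free space reduced to the 7-bit hint produced by compute_fullness_hint(). The standalone sketch below shows how that bucketing behaves; the slab size and shift value are invented for the example, whereas the patch derives the shift from the depot's hint_shift.

#include <stdint.h>
#include <stdio.h>

/* Invented example values: a 2^15-block slab and a shift that keeps hints within 7 bits. */
#define EXAMPLE_SLAB_DATA_BLOCKS (1 << 15)
#define EXAMPLE_HINT_SHIFT 9

static uint8_t example_fullness_hint(uint64_t free_blocks)
{
        uint64_t hint;

        if (free_blocks == 0)
                return 0; /* 0 is reserved to mean "no free blocks at all" */

        hint = free_blocks >> EXAMPLE_HINT_SHIFT;
        /* Never let a slab that still has free space collapse to the "full" encoding. */
        return (hint == 0) ? 1 : (uint8_t) hint;
}

int main(void)
{
        uint64_t samples[] = { 0, 1, 200, 4096, EXAMPLE_SLAB_DATA_BLOCKS };
        size_t i;

        for (i = 0; i < sizeof(samples) / sizeof(samples[0]); i++) {
                uint8_t hint = example_fullness_hint(samples[i]);

                printf("free=%llu hint=%u bucket=%llu\n",
                       (unsigned long long) samples[i], (unsigned) hint,
                       (unsigned long long) hint << EXAMPLE_HINT_SHIFT);
        }
        return 0;
}

Decoding a hint only recovers a coarse bucket, which is enough for the scrubbing prioritization described in the patch summary.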
@@ -2458,3 +2653,134 @@ make_slab(physical_block_number_t slab_origin, *slab_ptr = slab; return VDO_SUCCESS; } + +/** + * finish_combining_zones() - Clean up after saving out the combined slab summary. + * @completion: The vio which was used to write the summary data. + */ +static void finish_combining_zones(struct vdo_completion *completion) +{ + int result = completion->result; + struct vdo_completion *parent = completion->parent; + + free_vio(as_vio(UDS_FORGET(completion))); + vdo_fail_completion(parent, result); +} + +static void handle_combining_error(struct vdo_completion *completion) +{ + vio_record_metadata_io_error(as_vio(completion)); + finish_combining_zones(completion); +} + +static void write_summary_endio(struct bio *bio) +{ + struct vio *vio = bio->bi_private; + struct vdo *vdo = vio->completion.vdo; + + continue_vio_after_io(vio, finish_combining_zones, vdo->thread_config.admin_thread); +} + +/** + * combine_summaries() - Treating the current entries buffer as the on-disk value of all zones, + * update every zone to the correct values for every slab. + * @depot: The depot whose summary entries should be combined. + */ +static void combine_summaries(struct slab_depot *depot) +{ + /* + * Combine all the old summary data into the portion of the buffer corresponding to the + * first zone. + */ + zone_count_t zone = 0; + struct slab_summary_entry *entries = depot->summary_entries; + + if (depot->old_zone_count > 1) { + slab_count_t entry_number; + + for (entry_number = 0; entry_number < MAX_VDO_SLABS; entry_number++) { + if (zone != 0) + memcpy(entries + entry_number, + entries + (zone * MAX_VDO_SLABS) + entry_number, + sizeof(struct slab_summary_entry)); + zone++; + if (zone == depot->old_zone_count) + zone = 0; + } + } + + /* Copy the combined data to each zones's region of the buffer. */ + for (zone = 1; zone < MAX_VDO_PHYSICAL_ZONES; zone++) + memcpy(entries + (zone * MAX_VDO_SLABS), + entries, + MAX_VDO_SLABS * sizeof(struct slab_summary_entry)); +} + +/** + * finish_loading_summary() - Finish loading slab summary data. + * @completion: The vio which was used to read the summary data. + * + * Combines the slab summary data from all the previously written zones and copies the combined + * summary to each partition's data region. Then writes the combined summary back out to disk. This + * callback is registered in load_summary_endio(). + */ +static void finish_loading_summary(struct vdo_completion *completion) +{ + struct slab_depot *depot = completion->vdo->depot; + + /* Combine the summary from each zone so each zone is correct for all slabs. */ + combine_summaries(depot); + + /* Write the combined summary back out. */ + submit_metadata_vio(as_vio(completion), + depot->summary_origin, + write_summary_endio, + handle_combining_error, + REQ_OP_WRITE); +} + +static void load_summary_endio(struct bio *bio) +{ + struct vio *vio = bio->bi_private; + struct vdo *vdo = vio->completion.vdo; + + continue_vio_after_io(vio, finish_loading_summary, vdo->thread_config.admin_thread); +} + +/** + * load_slab_summary() - The preamble of a load operation. + * + * Implements vdo_action_preamble. 
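Before the load path continues below, it may help to see the indexing that combine_summaries() above relies on: slab n was owned by zone (n mod old_zone_count), so that zone's region of the buffer holds the authoritative entry for slab n. A toy version follows, with tiny invented sizes standing in for MAX_VDO_SLABS and the real zone limits.

#include <stdio.h>
#include <string.h>

#define EX_MAX_SLABS 8
#define EX_MAX_ZONES 4

/* One byte per entry is enough to show the indexing. */
static unsigned char entries[EX_MAX_ZONES][EX_MAX_SLABS];

static void example_combine(unsigned old_zone_count)
{
        unsigned slab, zone;

        /* Fold: slab n's authoritative entry lives in zone (n % old_zone_count). */
        for (slab = 0; slab < EX_MAX_SLABS; slab++)
                entries[0][slab] = entries[slab % old_zone_count][slab];

        /* Replicate the combined region to every zone's region. */
        for (zone = 1; zone < EX_MAX_ZONES; zone++)
                memcpy(entries[zone], entries[0], sizeof(entries[0]));
}

int main(void)
{
        unsigned zone, slab;

        /* Tag each zone's region so ownership is visible after combining. */
        for (zone = 0; zone < EX_MAX_ZONES; zone++)
                for (slab = 0; slab < EX_MAX_SLABS; slab++)
                        entries[zone][slab] = (unsigned char) (10 * zone + slab);

        example_combine(2); /* pretend the previous run used two zones */

        for (slab = 0; slab < EX_MAX_SLABS; slab++)
                printf("slab %u -> entry %u\n", slab, entries[0][slab]);
        return 0;
}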
+ */ +static void load_slab_summary(void *context, struct vdo_completion *parent) +{ + int result; + struct vio *vio; + struct slab_depot *depot = context; + const struct admin_state_code *operation = + vdo_get_current_manager_operation(depot->action_manager); + + result = create_multi_block_metadata_vio(depot->vdo, + VIO_TYPE_SLAB_SUMMARY, + VIO_PRIORITY_METADATA, + parent, + VDO_SLAB_SUMMARY_BLOCKS, + (char *) depot->summary_entries, + &vio); + if (result != VDO_SUCCESS) { + vdo_fail_completion(parent, result); + return; + } + + if ((operation == VDO_ADMIN_STATE_FORMATTING) || + (operation == VDO_ADMIN_STATE_LOADING_FOR_REBUILD)) { + finish_loading_summary(&vio->completion); + return; + } + + submit_metadata_vio(vio, + depot->summary_origin, + load_summary_endio, + handle_combining_error, + REQ_OP_READ); +} diff --git a/drivers/md/dm-vdo/slab-depot.h b/drivers/md/dm-vdo/slab-depot.h index 12086d475d1a..5e06746c7e71 100644 --- a/drivers/md/dm-vdo/slab-depot.h +++ b/drivers/md/dm-vdo/slab-depot.h @@ -24,6 +24,23 @@ #include "vio.h" #include "wait-queue.h" +/* + * A slab_depot is responsible for managing all of the slabs and block allocators of a VDO. It has + * a single array of slabs in order to eliminate the need for additional math in order to compute + * which physical zone a PBN is in. It also has a block_allocator per zone. + * + * Load operations are required to be performed on a single thread. Normal operations are assumed + * to be performed in the appropriate zone. Allocations and reference count updates must be done + * from the thread of their physical zone. Requests to commit slab journal tail blocks from the + * recovery journal must be done on the journal zone thread. Save operations are required to be + * launched from the same thread as the original load operation. + */ + +enum { + /* The number of vios in the vio pool is proportional to the throughput of the VDO. */ + BLOCK_ALLOCATOR_VIO_POOL_SIZE = 128, +}; + /* * Represents the possible status of a block. */ @@ -240,6 +257,63 @@ struct vdo_slab { struct reference_block *reference_blocks; }; +/* + * The slab_summary provides hints during load and recovery about the state of the slabs in order + * to avoid the need to read the slab journals in their entirety before a VDO can come online. + * + * The information in the summary for each slab includes the rough number of free blocks (which is + * used to prioritize scrubbing), the cleanliness of a slab (so that clean slabs containing free + * space will be used on restart), and the location of the tail block of the slab's journal. + * + * The slab_summary has its own partition at the end of the volume which is sized to allow for a + * complete copy of the summary for each of up to 16 physical zones. + * + * During resize, the slab_summary moves its backing partition and is saved once moved; the + * slab_summary is not permitted to overwrite the previous recovery journal space. + * + * The slab_summary does not have its own version information, but relies on the VDO volume version + * number. + */ + +/* + * A slab status is a very small structure for use in determining the ordering of slabs in the + * scrubbing process. 
+ */ +struct slab_status { + slab_count_t slab_number; + bool is_clean; + u8 emptiness; +}; + +struct slab_summary_block { + /* The block_allocator to which this block belongs */ + struct block_allocator *allocator; + /* The index of this block in its zone's summary */ + block_count_t index; + /* Whether this block has a write outstanding */ + bool writing; + /* Ring of updates waiting on the outstanding write */ + struct wait_queue current_update_waiters; + /* Ring of updates waiting on the next write */ + struct wait_queue next_update_waiters; + /* The active slab_summary_entry array for this block */ + struct slab_summary_entry *entries; + /* The vio used to write this block */ + struct vio vio; + /* The packed entries, one block long, backing the vio */ + char *outgoing_entries; +}; + +/* + * The statistics for all the slab summary zones owned by this slab summary. These fields are all + * mutated only by their physical zone threads, but are read by other threads when gathering + * statistics for the entire depot. + */ +struct atomic_slab_summary_statistics { + /* Number of blocks written */ + atomic64_t blocks_written; +}; + bool __must_check vdo_attempt_replay_into_slab(struct vdo_slab *slab, physical_block_number_t pbn, From patchwork Thu Oct 26 21:41:24 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Matthew Sakai X-Patchwork-Id: 13437964 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id E419BC25B48 for ; Thu, 26 Oct 2023 21:44:57 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S229785AbjJZVo5 (ORCPT ); Thu, 26 Oct 2023 17:44:57 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:38964 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229621AbjJZVo4 (ORCPT ); Thu, 26 Oct 2023 17:44:56 -0400 Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.133.124]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 0BAA7D7A for ; Thu, 26 Oct 2023 14:42:04 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1698356524; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=YA74pHO8HjrOjiw6xYJQMo4t3ajikkYYnAmQX9rQ8a0=; b=Jdsg2TaWCObfCENvWibe1oly4iTL8CNtJlVgZEtWNg71ALOExOagRQG3hMoF2v/GzDFi23 eSa/4vf75FFT1mhEeX5sBSWZiaRQpe+jaOUsYQIB+SeiTuO+FFKibdbVZEJdfR7vDoUSt0 YYbzZ9HLfVZoSwO8e+PRWDDdohGa5lU= Received: from mimecast-mx02.redhat.com (mimecast-mx02.redhat.com [66.187.233.88]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.3, cipher=TLS_AES_256_GCM_SHA384) id us-mta-52-Yg90SzUbNLeSxp6fd5mWlQ-1; Thu, 26 Oct 2023 17:41:42 -0400 X-MC-Unique: Yg90SzUbNLeSxp6fd5mWlQ-1 Received: from smtp.corp.redhat.com (int-mx05.intmail.prod.int.rdu2.redhat.com [10.11.54.5]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (2048 bits) server-digest SHA256) (No client certificate requested) by mimecast-mx02.redhat.com (Postfix) with ESMTPS id CC75F185A780; Thu, 26 Oct 2023 21:41:41 +0000 (UTC) Received: from pbitcolo-build-10.permabit.com 
(pbitcolo-build-10.permabit.lab.eng.bos.redhat.com [10.19.117.76]) by smtp.corp.redhat.com (Postfix) with ESMTP id B7EC410751; Thu, 26 Oct 2023 21:41:41 +0000 (UTC) Received: by pbitcolo-build-10.permabit.com (Postfix, from userid 1138) id 79F4F3003B; Thu, 26 Oct 2023 17:41:41 -0400 (EDT) From: Matthew Sakai To: dm-devel@lists.linux.dev, linux-block@vger.kernel.org Cc: Matthew Sakai Subject: [PATCH v4 27/39] dm vdo: add the block allocators and physical zones Date: Thu, 26 Oct 2023 17:41:24 -0400 Message-Id: <20231026214136.1067410-28-msakai@redhat.com> In-Reply-To: <20231026214136.1067410-1-msakai@redhat.com> References: <20231026214136.1067410-1-msakai@redhat.com> MIME-Version: 1.0 X-Scanned-By: MIMEDefang 3.4.1 on 10.11.54.5 Precedence: bulk List-ID: X-Mailing-List: linux-block@vger.kernel.org Each slab is independent of every other. They are assigned to "physical zones" in round-robin fashion. If there are P physical zones, then slab n is assigned to zone n mod P. The set of slabs in each physical zone is managed by a block allocator. Co-developed-by: J. corwin Coburn Signed-off-by: J. corwin Coburn Co-developed-by: Michael Sclafani Signed-off-by: Michael Sclafani Co-developed-by: Sweet Tea Dorminy Signed-off-by: Sweet Tea Dorminy Signed-off-by: Matthew Sakai --- drivers/md/dm-vdo/physical-zone.c | 650 +++++++++++ drivers/md/dm-vdo/physical-zone.h | 115 ++ drivers/md/dm-vdo/slab-depot.c | 1796 ++++++++++++++++++++++++++--- drivers/md/dm-vdo/slab-depot.h | 146 +++ 4 files changed, 2540 insertions(+), 167 deletions(-) create mode 100644 drivers/md/dm-vdo/physical-zone.c create mode 100644 drivers/md/dm-vdo/physical-zone.h diff --git a/drivers/md/dm-vdo/physical-zone.c b/drivers/md/dm-vdo/physical-zone.c new file mode 100644 index 000000000000..822da4e8f579 --- /dev/null +++ b/drivers/md/dm-vdo/physical-zone.c @@ -0,0 +1,650 @@ +// SPDX-License-Identifier: GPL-2.0-only +/* + * Copyright 2023 Red Hat + */ + +#include "physical-zone.h" + +#include + +#include "logger.h" +#include "memory-alloc.h" +#include "permassert.h" + +#include "block-map.h" +#include "completion.h" +#include "constants.h" +#include "data-vio.h" +#include "dedupe.h" +#include "encodings.h" +#include "flush.h" +#include "int-map.h" +#include "slab-depot.h" +#include "status-codes.h" +#include "vdo.h" + +enum { + /* Each user data_vio needs a PBN read lock and write lock. */ + LOCK_POOL_CAPACITY = 2 * MAXIMUM_VDO_USER_VIOS, +}; + +struct pbn_lock_implementation { + enum pbn_lock_type type; + const char *name; + const char *release_reason; +}; + +/* This array must have an entry for every pbn_lock_type value. */ +static const struct pbn_lock_implementation LOCK_IMPLEMENTATIONS[] = { + [VIO_READ_LOCK] = { + .type = VIO_READ_LOCK, + .name = "read", + .release_reason = "candidate duplicate", + }, + [VIO_WRITE_LOCK] = { + .type = VIO_WRITE_LOCK, + .name = "write", + .release_reason = "newly allocated", + }, + [VIO_BLOCK_MAP_WRITE_LOCK] = { + .type = VIO_BLOCK_MAP_WRITE_LOCK, + .name = "block map write", + .release_reason = "block map write", + }, +}; + +static inline bool has_lock_type(const struct pbn_lock *lock, enum pbn_lock_type type) +{ + return (lock->implementation == &LOCK_IMPLEMENTATIONS[type]); +} + +/** + * vdo_is_pbn_read_lock() - Check whether a pbn_lock is a read lock. + * @lock: The lock to check. + * + * Return: true if the lock is a read lock. 
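As the patch summary above notes, slabs are spread across physical zones round-robin, so with P zones slab n belongs to zone n mod P. A trivial sketch of that assignment, with invented counts:

#include <stdio.h>

int main(void)
{
        unsigned physical_zones = 3; /* P, an invented example value */
        unsigned slab_count = 10;
        unsigned slab;

        for (slab = 0; slab < slab_count; slab++)
                printf("slab %u -> physical zone %u\n", slab, slab % physical_zones);
        return 0;
}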
+ */ +bool vdo_is_pbn_read_lock(const struct pbn_lock *lock) +{ + return has_lock_type(lock, VIO_READ_LOCK); +} + +static inline void set_pbn_lock_type(struct pbn_lock *lock, enum pbn_lock_type type) +{ + lock->implementation = &LOCK_IMPLEMENTATIONS[type]; +} + +/** + * vdo_downgrade_pbn_write_lock() - Downgrade a PBN write lock to a PBN read lock. + * @lock: The PBN write lock to downgrade. + * + * The lock holder count is cleared and the caller is responsible for setting the new count. + */ +void vdo_downgrade_pbn_write_lock(struct pbn_lock *lock, bool compressed_write) +{ + ASSERT_LOG_ONLY(!vdo_is_pbn_read_lock(lock), + "PBN lock must not already have been downgraded"); + ASSERT_LOG_ONLY(!has_lock_type(lock, VIO_BLOCK_MAP_WRITE_LOCK), + "must not downgrade block map write locks"); + ASSERT_LOG_ONLY(lock->holder_count == 1, + "PBN write lock should have one holder but has %u", + lock->holder_count); + /* + * data_vio write locks are downgraded in place--the writer retains the hold on the lock. + * If this was a compressed write, the holder has not yet journaled its own inc ref, + * otherwise, it has. + */ + lock->increment_limit = + (compressed_write ? MAXIMUM_REFERENCE_COUNT : MAXIMUM_REFERENCE_COUNT - 1); + set_pbn_lock_type(lock, VIO_READ_LOCK); +} + +/** + * vdo_claim_pbn_lock_increment() - Try to claim one of the available reference count increments on + * a read lock. + * @lock: The PBN read lock from which to claim an increment. + * + * Claims may be attempted from any thread. A claim is only valid until the PBN lock is released. + * + * Return: true if the claim succeeded, guaranteeing one increment can be made without overflowing + * the PBN's reference count. + */ +bool vdo_claim_pbn_lock_increment(struct pbn_lock *lock) +{ + /* + * Claim the next free reference atomically since hash locks from multiple hash zone + * threads might be concurrently deduplicating against a single PBN lock on compressed + * block. As long as hitting the increment limit will lead to the PBN lock being released + * in a sane time-frame, we won't overflow a 32-bit claim counter, allowing a simple add + * instead of a compare-and-swap. + */ + u32 claim_number = (u32) atomic_add_return(1, &lock->increments_claimed); + + return (claim_number <= lock->increment_limit); +} + +/** + * vdo_assign_pbn_lock_provisional_reference() - Inform a PBN lock that it is responsible for a + * provisional reference. + * @lock: The PBN lock. + */ +void vdo_assign_pbn_lock_provisional_reference(struct pbn_lock *lock) +{ + ASSERT_LOG_ONLY(!lock->has_provisional_reference, + "lock does not have a provisional reference"); + lock->has_provisional_reference = true; +} + +/** + * vdo_unassign_pbn_lock_provisional_reference() - Inform a PBN lock that it is no longer + * responsible for a provisional reference. + * @lock: The PBN lock. + */ +void vdo_unassign_pbn_lock_provisional_reference(struct pbn_lock *lock) +{ + lock->has_provisional_reference = false; +} + +/** + * release_pbn_lock_provisional_reference() - If the lock is responsible for a provisional + * reference, release that reference. + * @lock: The lock. + * @locked_pbn: The PBN covered by the lock. + * @allocator: The block allocator from which to release the reference. + * + * This method is called when the lock is released. 
+ */ +static void +release_pbn_lock_provisional_reference(struct pbn_lock *lock, + physical_block_number_t locked_pbn, + struct block_allocator *allocator) +{ + int result; + + if (!vdo_pbn_lock_has_provisional_reference(lock)) + return; + + result = vdo_release_block_reference(allocator, locked_pbn); + if (result != VDO_SUCCESS) + uds_log_error_strerror(result, + "Failed to release reference to %s physical block %llu", + lock->implementation->release_reason, + (unsigned long long) locked_pbn); + + vdo_unassign_pbn_lock_provisional_reference(lock); +} + +/** + * union idle_pbn_lock - PBN lock list entries. + * + * Unused (idle) PBN locks are kept in a list. Just like in a malloc implementation, the lock + * structure is unused memory, so we can save a bit of space (and not pollute the lock structure + * proper) by using a union to overlay the lock structure with the free list. + */ +typedef union { + /** @entry: Only used while locks are in the pool. */ + struct list_head entry; + /** @lock: Only used while locks are not in the pool. */ + struct pbn_lock lock; +} idle_pbn_lock; + +/** + * struct pbn_lock_pool - list of PBN locks. + * + * The lock pool is little more than the memory allocated for the locks. + */ +struct pbn_lock_pool { + /** @capacity: The number of locks allocated for the pool. */ + size_t capacity; + /** @borrowed: The number of locks currently borrowed from the pool. */ + size_t borrowed; + /** @idle_list: A list containing all idle PBN lock instances. */ + struct list_head idle_list; + /** @locks: The memory for all the locks allocated by this pool. */ + idle_pbn_lock locks[]; +}; + +/** + * return_pbn_lock_to_pool() - Return a pbn lock to its pool. + * @pool: The pool from which the lock was borrowed. + * @lock: The last reference to the lock being returned. + * + * It must be the last live reference, as if the memory were being freed (the lock memory will + * re-initialized or zeroed). + */ +static void return_pbn_lock_to_pool(struct pbn_lock_pool *pool, struct pbn_lock *lock) +{ + idle_pbn_lock *idle; + + /* A bit expensive, but will promptly catch some use-after-free errors. */ + memset(lock, 0, sizeof(*lock)); + + idle = container_of(lock, idle_pbn_lock, lock); + INIT_LIST_HEAD(&idle->entry); + list_add_tail(&idle->entry, &pool->idle_list); + + ASSERT_LOG_ONLY(pool->borrowed > 0, "shouldn't return more than borrowed"); + pool->borrowed -= 1; +} + +/** + * make_pbn_lock_pool() - Create a new PBN lock pool and all the lock instances it can loan out. + * + * @capacity: The number of PBN locks to allocate for the pool. + * @pool_ptr: A pointer to receive the new pool. + * + * Return: VDO_SUCCESS or an error code. + */ +static int make_pbn_lock_pool(size_t capacity, struct pbn_lock_pool **pool_ptr) +{ + size_t i; + struct pbn_lock_pool *pool; + int result; + + result = UDS_ALLOCATE_EXTENDED(struct pbn_lock_pool, + capacity, + idle_pbn_lock, + __func__, + &pool); + if (result != VDO_SUCCESS) + return result; + + pool->capacity = capacity; + pool->borrowed = capacity; + INIT_LIST_HEAD(&pool->idle_list); + + for (i = 0; i < capacity; i++) + return_pbn_lock_to_pool(pool, &pool->locks[i].lock); + + *pool_ptr = pool; + return VDO_SUCCESS; +} + +/** + * vdo_free_pbn_lock_pool() - Free a PBN lock pool. + * @pool: The lock pool to free. + * + * This also frees all the PBN locks it allocated, so the caller must ensure that all locks have + * been returned to the pool. 
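The pool above stores its free list in the lock memory itself via the idle_pbn_lock union, the same trick a malloc implementation uses for free chunks. Below is a minimal userspace sketch of that pattern, with all names invented and a plain next pointer standing in for the kernel list_head.

#include <stddef.h>
#include <stdio.h>

struct example_lock {
        unsigned holder_count;
        int type;
};

/* While idle, a slot's memory holds the free-list link; while borrowed, the lock. */
union pool_slot {
        union pool_slot *next_idle;
        struct example_lock lock;
};

struct example_pool {
        union pool_slot *idle; /* head of the idle list */
        size_t borrowed;
        union pool_slot slots[8];
};

static void pool_init(struct example_pool *pool)
{
        size_t i;

        pool->idle = NULL;
        pool->borrowed = 0;
        for (i = 0; i < 8; i++) {
                pool->slots[i].next_idle = pool->idle;
                pool->idle = &pool->slots[i];
        }
}

static struct example_lock *pool_borrow(struct example_pool *pool)
{
        union pool_slot *slot = pool->idle;

        if (slot == NULL)
                return NULL; /* like the patch's pool, this never grows on demand */

        pool->idle = slot->next_idle;
        pool->borrowed++;
        slot->lock = (struct example_lock) { 0, 0 };
        return &slot->lock;
}

static void pool_return(struct example_pool *pool, struct example_lock *lock)
{
        /* The lock is a union member, so its address is the slot's address. */
        union pool_slot *slot = (union pool_slot *) lock;

        slot->next_idle = pool->idle;
        pool->idle = slot;
        pool->borrowed--;
}

int main(void)
{
        struct example_pool pool;
        struct example_lock *lock;

        pool_init(&pool);
        lock = pool_borrow(&pool);
        printf("borrowed=%zu\n", pool.borrowed);
        pool_return(&pool, lock);
        printf("borrowed=%zu\n", pool.borrowed);
        return 0;
}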
+ */ +static void free_pbn_lock_pool(struct pbn_lock_pool *pool) +{ + if (pool == NULL) + return; + + ASSERT_LOG_ONLY(pool->borrowed == 0, + "All PBN locks must be returned to the pool before it is freed, but %zu locks are still on loan", + pool->borrowed); + UDS_FREE(pool); +} + +/** + * borrow_pbn_lock_from_pool() - Borrow a PBN lock from the pool and initialize it with the + * provided type. + * @pool: The pool from which to borrow. + * @type: The type with which to initialize the lock. + * @lock_ptr: A pointer to receive the borrowed lock. + * + * Pools do not grow on demand or allocate memory, so this will fail if the pool is empty. Borrowed + * locks are still associated with this pool and must be returned to only this pool. + * + * Return: VDO_SUCCESS, or VDO_LOCK_ERROR if the pool is empty. + */ +static int __must_check +borrow_pbn_lock_from_pool(struct pbn_lock_pool *pool, + enum pbn_lock_type type, + struct pbn_lock **lock_ptr) +{ + int result; + struct list_head *idle_entry; + idle_pbn_lock *idle; + + if (pool->borrowed >= pool->capacity) + return uds_log_error_strerror(VDO_LOCK_ERROR, "no free PBN locks left to borrow"); + pool->borrowed += 1; + + result = ASSERT(!list_empty(&pool->idle_list), + "idle list should not be empty if pool not at capacity"); + if (result != VDO_SUCCESS) + return result; + + idle_entry = pool->idle_list.prev; + list_del(idle_entry); + memset(idle_entry, 0, sizeof(*idle_entry)); + + idle = list_entry(idle_entry, idle_pbn_lock, entry); + idle->lock.holder_count = 0; + set_pbn_lock_type(&idle->lock, type); + + *lock_ptr = &idle->lock; + return VDO_SUCCESS; +} + +/** + * initialize_zone() - Initialize a physical zone. + * @vdo: The vdo to which the zone will belong. + * @zones: The physical_zones to which the zone being initialized belongs + * + * Return: VDO_SUCCESS or an error code. + */ +static int initialize_zone(struct vdo *vdo, struct physical_zones *zones) +{ + int result; + zone_count_t zone_number = zones->zone_count; + struct physical_zone *zone = &zones->zones[zone_number]; + + result = vdo_make_int_map(VDO_LOCK_MAP_CAPACITY, 0, &zone->pbn_operations); + if (result != VDO_SUCCESS) + return result; + + result = make_pbn_lock_pool(LOCK_POOL_CAPACITY, &zone->lock_pool); + if (result != VDO_SUCCESS) { + vdo_free_int_map(zone->pbn_operations); + return result; + } + + zone->zone_number = zone_number; + zone->thread_id = vdo->thread_config.physical_threads[zone_number]; + zone->allocator = &vdo->depot->allocators[zone_number]; + zone->next = &zones->zones[(zone_number + 1) % vdo->thread_config.physical_zone_count]; + result = vdo_make_default_thread(vdo, zone->thread_id); + if (result != VDO_SUCCESS) { + free_pbn_lock_pool(UDS_FORGET(zone->lock_pool)); + vdo_free_int_map(zone->pbn_operations); + return result; + } + return result; +} + +/** + * vdo_make_physical_zones() - Make the physical zones for a vdo. + * @vdo: The vdo being constructed + * @zones_ptr: A pointer to hold the zones + * + * Return: VDO_SUCCESS or an error code. 
+ */ +int vdo_make_physical_zones(struct vdo *vdo, struct physical_zones **zones_ptr) +{ + struct physical_zones *zones; + int result; + zone_count_t zone_count = vdo->thread_config.physical_zone_count; + + if (zone_count == 0) + return VDO_SUCCESS; + + result = UDS_ALLOCATE_EXTENDED(struct physical_zones, + zone_count, + struct physical_zone, + __func__, + &zones); + if (result != VDO_SUCCESS) + return result; + + for (zones->zone_count = 0; zones->zone_count < zone_count; zones->zone_count++) { + result = initialize_zone(vdo, zones); + if (result != VDO_SUCCESS) { + vdo_free_physical_zones(zones); + return result; + } + } + + *zones_ptr = zones; + return VDO_SUCCESS; +} + +/** + * vdo_free_physical_zones() - Destroy the physical zones. + * @zones: The zones to free. + */ +void vdo_free_physical_zones(struct physical_zones *zones) +{ + zone_count_t index; + + if (zones == NULL) + return; + + for (index = 0; index < zones->zone_count; index++) { + struct physical_zone *zone = &zones->zones[index]; + + free_pbn_lock_pool(UDS_FORGET(zone->lock_pool)); + vdo_free_int_map(UDS_FORGET(zone->pbn_operations)); + } + + UDS_FREE(zones); +} + +/** + * vdo_get_physical_zone_pbn_lock() - Get the lock on a PBN if one exists. + * @zone: The physical zone responsible for the PBN. + * @pbn: The physical block number whose lock is desired. + * + * Return: The lock or NULL if the PBN is not locked. + */ +struct pbn_lock * +vdo_get_physical_zone_pbn_lock(struct physical_zone *zone, physical_block_number_t pbn) +{ + return ((zone == NULL) ? NULL : vdo_int_map_get(zone->pbn_operations, pbn)); +} + +/** + * vdo_attempt_physical_zone_pbn_lock() - Attempt to lock a physical block in the zone responsible + * for it. + * @zone: The physical zone responsible for the PBN. + * @pbn: The physical block number to lock. + * @type: The type with which to initialize a new lock. + * @lock_ptr: A pointer to receive the lock, existing or new. + * + * If the PBN is already locked, the existing lock will be returned. Otherwise, a new lock instance + * will be borrowed from the pool, initialized, and returned. The lock owner will be NULL for a new + * lock acquired by the caller, who is responsible for setting that field promptly. The lock owner + * will be non-NULL when there is already an existing lock on the PBN. + * + * Return: VDO_SUCCESS or an error. + */ +int vdo_attempt_physical_zone_pbn_lock(struct physical_zone *zone, + physical_block_number_t pbn, + enum pbn_lock_type type, + struct pbn_lock **lock_ptr) +{ + /* + * Borrow and prepare a lock from the pool so we don't have to do two int_map accesses in + * the common case of no lock contention. + */ + struct pbn_lock *lock, *new_lock = NULL; + int result; + + result = borrow_pbn_lock_from_pool(zone->lock_pool, type, &new_lock); + if (result != VDO_SUCCESS) { + ASSERT_LOG_ONLY(false, "must always be able to borrow a PBN lock"); + return result; + } + + result = vdo_int_map_put(zone->pbn_operations, pbn, new_lock, false, (void **) &lock); + if (result != VDO_SUCCESS) { + return_pbn_lock_to_pool(zone->lock_pool, new_lock); + return result; + } + + if (lock != NULL) { + /* The lock is already held, so we don't need the borrowed one. 
*/ + return_pbn_lock_to_pool(zone->lock_pool, UDS_FORGET(new_lock)); + result = ASSERT(lock->holder_count > 0, + "physical block %llu lock held", + (unsigned long long) pbn); + if (result != VDO_SUCCESS) + return result; + *lock_ptr = lock; + } else { + *lock_ptr = new_lock; + } + return VDO_SUCCESS; +} + +/** + * allocate_and_lock_block() - Attempt to allocate a block from this zone. + * @allocation: The struct allocation of the data_vio attempting to allocate. + * + * If a block is allocated, the recipient will also hold a lock on it. + * + * Return: VDO_SUCCESS if a block was allocated, or an error code. + */ +static int allocate_and_lock_block(struct allocation *allocation) +{ + int result; + struct pbn_lock *lock; + + ASSERT_LOG_ONLY(allocation->lock == NULL, + "must not allocate a block while already holding a lock on one"); + + result = vdo_allocate_block(allocation->zone->allocator, &allocation->pbn); + if (result != VDO_SUCCESS) + return result; + + result = vdo_attempt_physical_zone_pbn_lock(allocation->zone, + allocation->pbn, + allocation->write_lock_type, + &lock); + if (result != VDO_SUCCESS) + return result; + + if (lock->holder_count > 0) + /* This block is already locked, which should be impossible. */ + return uds_log_error_strerror(VDO_LOCK_ERROR, + "Newly allocated block %llu was spuriously locked (holder_count=%u)", + (unsigned long long) allocation->pbn, + lock->holder_count); + + /* We've successfully acquired a new lock, so mark it as ours. */ + lock->holder_count += 1; + allocation->lock = lock; + vdo_assign_pbn_lock_provisional_reference(lock); + return VDO_SUCCESS; +} + +/** + * retry_allocation() - Retry allocating a block now that we're done waiting for scrubbing. + * @waiter: The allocating_vio that was waiting to allocate. + * @context: The context (unused). + */ +static void retry_allocation(struct waiter *waiter, void *context __always_unused) +{ + struct data_vio *data_vio = waiter_as_data_vio(waiter); + + /* Now that some slab has scrubbed, restart the allocation process. */ + data_vio->allocation.wait_for_clean_slab = false; + data_vio->allocation.first_allocation_zone = data_vio->allocation.zone->zone_number; + continue_data_vio(data_vio); +} + +/** + * continue_allocating() - Continue searching for an allocation by enqueuing to wait for scrubbing + * or switching to the next zone. + * @data_vio: The data_vio attempting to get an allocation. + * + * This method should only be called from the error handler set in data_vio_allocate_data_block. + * + * Return: true if the allocation process has continued in another zone. + */ +static bool continue_allocating(struct data_vio *data_vio) +{ + struct allocation *allocation = &data_vio->allocation; + struct physical_zone *zone = allocation->zone; + struct vdo_completion *completion = &data_vio->vio.completion; + int result = VDO_SUCCESS; + bool was_waiting = allocation->wait_for_clean_slab; + bool tried_all = (allocation->first_allocation_zone == zone->next->zone_number); + + vdo_reset_completion(completion); + + if (tried_all && !was_waiting) { + /* + * We've already looked in all the zones, and found nothing. So go through the + * zones again, and wait for each to scrub before trying to allocate. 
+ */ + allocation->wait_for_clean_slab = true; + allocation->first_allocation_zone = zone->zone_number; + } + + if (allocation->wait_for_clean_slab) { + data_vio->waiter.callback = retry_allocation; + result = vdo_enqueue_clean_slab_waiter(zone->allocator, &data_vio->waiter); + if (result == VDO_SUCCESS) + /* We've enqueued to wait for a slab to be scrubbed. */ + return true; + + if ((result != VDO_NO_SPACE) || (was_waiting && tried_all)) { + vdo_set_completion_result(completion, result); + return false; + } + } + + allocation->zone = zone->next; + completion->callback_thread_id = allocation->zone->thread_id; + vdo_launch_completion(completion); + return true; +} + +/** + * vdo_allocate_block_in_zone() - Attempt to allocate a block in the current physical zone, and if + * that fails try the next if possible. + * @data_vio: The data_vio needing an allocation. + * + * Return: true if a block was allocated, if not the data_vio will have been dispatched so the + * caller must not touch it. + */ +bool vdo_allocate_block_in_zone(struct data_vio *data_vio) +{ + int result = allocate_and_lock_block(&data_vio->allocation); + + if (result == VDO_SUCCESS) + return true; + + if ((result != VDO_NO_SPACE) || !continue_allocating(data_vio)) + continue_data_vio_with_error(data_vio, result); + + return false; +} + +/** + * vdo_release_physical_zone_pbn_lock() - Release a physical block lock if it is held and return it + * to the lock pool. + * @zone: The physical zone in which the lock was obtained. + * @locked_pbn: The physical block number to unlock. + * @lock: The lock being released. + * + * It must be the last live reference, as if the memory were being freed (the + * lock memory will re-initialized or zeroed). + */ +void vdo_release_physical_zone_pbn_lock(struct physical_zone *zone, + physical_block_number_t locked_pbn, + struct pbn_lock *lock) +{ + struct pbn_lock *holder; + + if (lock == NULL) + return; + + ASSERT_LOG_ONLY(lock->holder_count > 0, "should not be releasing a lock that is not held"); + + lock->holder_count -= 1; + if (lock->holder_count > 0) + /* The lock was shared and is still referenced, so don't release it yet. */ + return; + + holder = vdo_int_map_remove(zone->pbn_operations, locked_pbn); + ASSERT_LOG_ONLY((lock == holder), + "physical block lock mismatch for block %llu", + (unsigned long long) locked_pbn); + + release_pbn_lock_provisional_reference(lock, locked_pbn, zone->allocator); + return_pbn_lock_to_pool(zone->lock_pool, lock); +} + +/** + * vdo_dump_physical_zone() - Dump information about a physical zone to the log for debugging. + * @zone: The zone to dump. + */ +void vdo_dump_physical_zone(const struct physical_zone *zone) +{ + vdo_dump_block_allocator(zone->allocator); +} diff --git a/drivers/md/dm-vdo/physical-zone.h b/drivers/md/dm-vdo/physical-zone.h new file mode 100644 index 000000000000..ec693d54b488 --- /dev/null +++ b/drivers/md/dm-vdo/physical-zone.h @@ -0,0 +1,115 @@ +/* SPDX-License-Identifier: GPL-2.0-only */ +/* + * Copyright 2023 Red Hat + */ + +#ifndef VDO_PHYSICAL_ZONE_H +#define VDO_PHYSICAL_ZONE_H + +#include + +#include "types.h" + +/* + * The type of a PBN lock. + */ +enum pbn_lock_type { + VIO_READ_LOCK, + VIO_WRITE_LOCK, + VIO_BLOCK_MAP_WRITE_LOCK, +}; + +struct pbn_lock_implementation; + +/* + * A PBN lock. 
+ */ +struct pbn_lock { + /* The implementation of the lock */ + const struct pbn_lock_implementation *implementation; + + /* The number of VIOs holding or sharing this lock */ + data_vio_count_t holder_count; + /* + * The number of compressed block writers holding a share of this lock while they are + * acquiring a reference to the PBN. + */ + u8 fragment_locks; + + /* Whether the locked PBN has been provisionally referenced on behalf of the lock holder. */ + bool has_provisional_reference; + + /* + * For read locks, the number of references that were known to be available on the locked + * block at the time the lock was acquired. + */ + u8 increment_limit; + + /* + * For read locks, the number of data_vios that have tried to claim one of the available + * increments during the lifetime of the lock. Each claim will first increment this + * counter, so it can exceed the increment limit. + */ + atomic_t increments_claimed; +}; + +struct physical_zone { + /* Which physical zone this is */ + zone_count_t zone_number; + /* The thread ID for this zone */ + thread_id_t thread_id; + /* In progress operations keyed by PBN */ + struct int_map *pbn_operations; + /* Pool of unused pbn_lock instances */ + struct pbn_lock_pool *lock_pool; + /* The block allocator for this zone */ + struct block_allocator *allocator; + /* The next zone from which to attempt an allocation */ + struct physical_zone *next; +}; + +struct physical_zones { + /* The number of zones */ + zone_count_t zone_count; + /* The physical zones themselves */ + struct physical_zone zones[]; +}; + +bool __must_check vdo_is_pbn_read_lock(const struct pbn_lock *lock); +void vdo_downgrade_pbn_write_lock(struct pbn_lock *lock, bool compressed_write); +bool __must_check vdo_claim_pbn_lock_increment(struct pbn_lock *lock); + +/** + * vdo_pbn_lock_has_provisional_reference() - Check whether a PBN lock has a provisional reference. + * @lock: The PBN lock. + */ +static inline bool vdo_pbn_lock_has_provisional_reference(struct pbn_lock *lock) +{ + return ((lock != NULL) && lock->has_provisional_reference); +} + +void vdo_assign_pbn_lock_provisional_reference(struct pbn_lock *lock); +void vdo_unassign_pbn_lock_provisional_reference(struct pbn_lock *lock); + +int __must_check vdo_make_physical_zones(struct vdo *vdo, struct physical_zones **zones_ptr); + +void vdo_free_physical_zones(struct physical_zones *zones); + +struct pbn_lock * __must_check +vdo_get_physical_zone_pbn_lock(struct physical_zone *zone, physical_block_number_t pbn); + +int __must_check +vdo_attempt_physical_zone_pbn_lock(struct physical_zone *zone, + physical_block_number_t pbn, + enum pbn_lock_type type, + struct pbn_lock **lock_ptr); + +bool __must_check vdo_allocate_block_in_zone(struct data_vio *data_vio); + +void vdo_release_physical_zone_pbn_lock(struct physical_zone *zone, + physical_block_number_t locked_pbn, + struct pbn_lock *lock); + +void vdo_dump_physical_zone(const struct physical_zone *zone); + +#endif /* VDO_PHYSICAL_ZONE_H */ diff --git a/drivers/md/dm-vdo/slab-depot.c b/drivers/md/dm-vdo/slab-depot.c index e2270319513a..dbed9c3c8a62 100644 --- a/drivers/md/dm-vdo/slab-depot.c +++ b/drivers/md/dm-vdo/slab-depot.c @@ -37,6 +37,13 @@ static const u64 BYTES_PER_WORD = sizeof(u64); static const bool NORMAL_OPERATION = true; +struct slab_journal_eraser { + struct vdo_completion *parent; + struct dm_kcopyd_client *client; + block_count_t blocks; + struct slab_iterator slabs; +}; + /** * get_lock() - Get the lock object for a slab journal block by sequence number. 
* @journal: vdo_slab journal to retrieve from. @@ -1972,6 +1979,44 @@ static bool advance_search_cursor(struct vdo_slab *slab) return true; } +/** + * vdo_adjust_reference_count_for_rebuild() - Adjust the reference count of a block during rebuild. + * + * Return: VDO_SUCCESS or an error. + */ +int vdo_adjust_reference_count_for_rebuild(struct slab_depot *depot, + physical_block_number_t pbn, + enum journal_operation operation) +{ + int result; + slab_block_number block_number; + struct reference_block *block; + struct vdo_slab *slab = vdo_get_slab(depot, pbn); + struct reference_updater updater = { + .operation = operation, + .increment = true, + }; + + result = slab_block_number_from_pbn(slab, pbn, &block_number); + if (result != VDO_SUCCESS) + return result; + + block = get_reference_block(slab, block_number); + result = update_reference_count(slab, + block, + block_number, + NULL, + &updater, + !NORMAL_OPERATION, + false, + NULL); + if (result != VDO_SUCCESS) + return result; + + dirty_block(block); + return VDO_SUCCESS; +} + /** * replay_reference_count_change() - Replay the reference count adjustment from a slab journal * entry into the reference count for a block. @@ -2529,240 +2574,1510 @@ static void load_slab_journal(struct vdo_slab *slab) acquire_vio_from_pool(slab->allocator->vio_pool, &journal->resource_waiter); } -static void free_slab(struct vdo_slab *slab) +static void register_slab_for_scrubbing(struct vdo_slab *slab, bool high_priority) { - if (slab == NULL) + struct slab_scrubber *scrubber = &slab->allocator->scrubber; + + ASSERT_LOG_ONLY((slab->status != VDO_SLAB_REBUILT), "slab to be scrubbed is unrecovered"); + + if (slab->status != VDO_SLAB_REQUIRES_SCRUBBING) return; - list_del(&slab->allocq_entry); - UDS_FREE(UDS_FORGET(slab->journal.block)); - UDS_FREE(UDS_FORGET(slab->journal.locks)); - UDS_FREE(UDS_FORGET(slab->counters)); - UDS_FREE(UDS_FORGET(slab->reference_blocks)); - UDS_FREE(slab); + list_del_init(&slab->allocq_entry); + if (!slab->was_queued_for_scrubbing) { + WRITE_ONCE(scrubber->slab_count, scrubber->slab_count + 1); + slab->was_queued_for_scrubbing = true; + } + + if (high_priority) { + slab->status = VDO_SLAB_REQUIRES_HIGH_PRIORITY_SCRUBBING; + list_add_tail(&slab->allocq_entry, &scrubber->high_priority_slabs); + return; + } + + list_add_tail(&slab->allocq_entry, &scrubber->slabs); } -static int initialize_slab_journal(struct vdo_slab *slab) +/* Queue a slab for allocation or scrubbing. 
*/ +static void queue_slab(struct vdo_slab *slab) { - struct slab_journal *journal = &slab->journal; - const struct slab_config *slab_config = &slab->allocator->depot->slab_config; + struct block_allocator *allocator = slab->allocator; + block_count_t free_blocks; int result; - result = UDS_ALLOCATE(slab_config->slab_journal_blocks, - struct journal_lock, - __func__, - &journal->locks); - if (result != VDO_SUCCESS) - return result; + ASSERT_LOG_ONLY(list_empty(&slab->allocq_entry), + "a requeued slab must not already be on a ring"); - result = UDS_ALLOCATE(VDO_BLOCK_SIZE, - char, - "struct packed_slab_journal_block", - (char **) &journal->block); - if (result != VDO_SUCCESS) - return result; + if (vdo_is_read_only(allocator->depot->vdo)) + return; - journal->slab = slab; - journal->size = slab_config->slab_journal_blocks; - journal->flushing_threshold = slab_config->slab_journal_flushing_threshold; - journal->blocking_threshold = slab_config->slab_journal_blocking_threshold; - journal->scrubbing_threshold = slab_config->slab_journal_scrubbing_threshold; - journal->entries_per_block = VDO_SLAB_JOURNAL_ENTRIES_PER_BLOCK; - journal->full_entries_per_block = VDO_SLAB_JOURNAL_FULL_ENTRIES_PER_BLOCK; - journal->events = &slab->allocator->slab_journal_statistics; - journal->recovery_journal = slab->allocator->depot->vdo->recovery_journal; - journal->tail = 1; - journal->head = 1; + free_blocks = slab->free_blocks; + result = ASSERT((free_blocks <= allocator->depot->slab_config.data_blocks), + "rebuilt slab %u must have a valid free block count (has %llu, expected maximum %llu)", + slab->slab_number, + (unsigned long long) free_blocks, + (unsigned long long) allocator->depot->slab_config.data_blocks); + if (result != VDO_SUCCESS) { + vdo_enter_read_only_mode(allocator->depot->vdo, result); + return; + } - journal->flushing_deadline = journal->flushing_threshold; - /* - * Set there to be some time between the deadline and the blocking threshold, so that - * hopefully all are done before blocking. - */ - if ((journal->blocking_threshold - journal->flushing_threshold) > 5) - journal->flushing_deadline = journal->blocking_threshold - 5; + if (slab->status != VDO_SLAB_REBUILT) { + register_slab_for_scrubbing(slab, false); + return; + } - journal->slab_summary_waiter.callback = release_journal_locks; + if (!vdo_is_state_resuming(&slab->state)) { + /* + * If the slab is resuming, we've already accounted for it here, so don't do it + * again. + * FIXME: under what situation would the slab be resuming here? + */ + WRITE_ONCE(allocator->allocated_blocks, allocator->allocated_blocks - free_blocks); + if (!is_slab_journal_blank(slab)) + WRITE_ONCE(allocator->statistics.slabs_opened, + allocator->statistics.slabs_opened + 1); + } - INIT_LIST_HEAD(&journal->dirty_entry); - INIT_LIST_HEAD(&journal->uncommitted_blocks); + if (allocator->depot->vdo->suspend_type == VDO_ADMIN_STATE_SAVING) + reopen_slab_journal(slab); - journal->tail_header.nonce = slab->allocator->nonce; - journal->tail_header.metadata_type = VDO_METADATA_SLAB_JOURNAL; - initialize_journal_state(journal); - return VDO_SUCCESS; + prioritize_slab(slab); } /** - * make_slab() - Construct a new, empty slab. - * @slab_origin: The physical block number within the block allocator partition of the first block - * in the slab. - * @allocator: The block allocator to which the slab belongs. - * @slab_number: The slab number of the slab. - * @is_new: true if this slab is being allocated as part of a resize. - * @slab_ptr: A pointer to receive the new slab. 
+ * initiate_slab_action() - Initiate a slab action. * - * Return: VDO_SUCCESS or an error code. + * Implements vdo_admin_initiator. */ -static int __must_check -make_slab(physical_block_number_t slab_origin, - struct block_allocator *allocator, - slab_count_t slab_number, - bool is_new, - struct vdo_slab **slab_ptr) +static void initiate_slab_action(struct admin_state *state) { - const struct slab_config *slab_config = &allocator->depot->slab_config; - struct vdo_slab *slab; - int result; + struct vdo_slab *slab = container_of(state, struct vdo_slab, state); - result = UDS_ALLOCATE(1, struct vdo_slab, __func__, &slab); - if (result != VDO_SUCCESS) - return result; + if (vdo_is_state_draining(state)) { + const struct admin_state_code *operation = vdo_get_admin_state_code(state); - *slab = (struct vdo_slab) { - .allocator = allocator, - .start = slab_origin, - .end = slab_origin + slab_config->slab_blocks, - .slab_number = slab_number, - .ref_counts_origin = slab_origin + slab_config->data_blocks, - .journal_origin = vdo_get_slab_journal_start_block(slab_config, slab_origin), - .block_count = slab_config->data_blocks, - .free_blocks = slab_config->data_blocks, - .reference_block_count = - vdo_get_saved_reference_count_size(slab_config->data_blocks), - }; - INIT_LIST_HEAD(&slab->allocq_entry); + if (operation == VDO_ADMIN_STATE_SCRUBBING) + slab->status = VDO_SLAB_REBUILDING; - result = initialize_slab_journal(slab); - if (result != VDO_SUCCESS) { - free_slab(slab); - return result; + drain_slab(slab); + check_if_slab_drained(slab); + return; } - if (is_new) { - vdo_set_admin_state_code(&slab->state, VDO_ADMIN_STATE_NEW); - result = allocate_slab_counters(slab); - if (result != VDO_SUCCESS) { - free_slab(slab); - return result; - } - } else { - vdo_set_admin_state_code(&slab->state, VDO_ADMIN_STATE_NORMAL_OPERATION); + if (vdo_is_state_loading(state)) { + load_slab_journal(slab); + return; } - *slab_ptr = slab; - return VDO_SUCCESS; + if (vdo_is_state_resuming(state)) { + queue_slab(slab); + vdo_finish_resuming(state); + return; + } + + vdo_finish_operation(state, VDO_INVALID_ADMIN_STATE); } /** - * finish_combining_zones() - Clean up after saving out the combined slab summary. - * @completion: The vio which was used to write the summary data. + * get_next_slab() - Get the next slab to scrub. + * @scrubber: The slab scrubber. + * + * Return: The next slab to scrub or NULL if there are none. */ -static void finish_combining_zones(struct vdo_completion *completion) +static struct vdo_slab *get_next_slab(struct slab_scrubber *scrubber) { - int result = completion->result; - struct vdo_completion *parent = completion->parent; + struct vdo_slab *slab; - free_vio(as_vio(UDS_FORGET(completion))); - vdo_fail_completion(parent, result); + slab = list_first_entry_or_null(&scrubber->high_priority_slabs, + struct vdo_slab, + allocq_entry); + if (slab != NULL) + return slab; + + return list_first_entry_or_null(&scrubber->slabs, struct vdo_slab, allocq_entry); } -static void handle_combining_error(struct vdo_completion *completion) +/** + * has_slabs_to_scrub() - Check whether a scrubber has slabs to scrub. + * @scrubber: The scrubber to check. + * + * Return: true if the scrubber has slabs to scrub. 
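get_next_slab() above always drains the high-priority queue before the normal one. A minimal sketch of that selection order, using plain arrays as stand-ins for the scrubber's lists:

#include <stdio.h>

static int pick_next(const int *high, int high_count, const int *normal, int normal_count)
{
        if (high_count > 0)
                return high[0];
        if (normal_count > 0)
                return normal[0];
        return -1; /* nothing left to scrub */
}

int main(void)
{
        int high[] = { 7 };
        int normal[] = { 2, 5 };

        printf("next slab: %d\n", pick_next(high, 1, normal, 2));
        printf("next slab with no high-priority work: %d\n", pick_next(high, 0, normal, 2));
        return 0;
}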
+ */ +static bool __must_check has_slabs_to_scrub(struct slab_scrubber *scrubber) { - vio_record_metadata_io_error(as_vio(completion)); - finish_combining_zones(completion); + return (get_next_slab(scrubber) != NULL); } -static void write_summary_endio(struct bio *bio) +/** + * uninitialize_scrubber_vio() - Clean up the slab_scrubber's vio. + * @scrubber: The scrubber. + */ +static void uninitialize_scrubber_vio(struct slab_scrubber *scrubber) { - struct vio *vio = bio->bi_private; - struct vdo *vdo = vio->completion.vdo; - - continue_vio_after_io(vio, finish_combining_zones, vdo->thread_config.admin_thread); + UDS_FREE(UDS_FORGET(scrubber->vio.data)); + free_vio_components(&scrubber->vio); } /** - * combine_summaries() - Treating the current entries buffer as the on-disk value of all zones, - * update every zone to the correct values for every slab. - * @depot: The depot whose summary entries should be combined. + * finish_scrubbing() - Stop scrubbing, either because there are no more slabs to scrub or because + * there's been an error. + * @scrubber: The scrubber. */ -static void combine_summaries(struct slab_depot *depot) +static void finish_scrubbing(struct slab_scrubber *scrubber, int result) { - /* - * Combine all the old summary data into the portion of the buffer corresponding to the - * first zone. - */ - zone_count_t zone = 0; - struct slab_summary_entry *entries = depot->summary_entries; + bool notify = vdo_has_waiters(&scrubber->waiters); + bool done = !has_slabs_to_scrub(scrubber); + struct block_allocator *allocator = + container_of(scrubber, struct block_allocator, scrubber); + + if (done) + uninitialize_scrubber_vio(scrubber); + + if (scrubber->high_priority_only) { + scrubber->high_priority_only = false; + vdo_fail_completion(UDS_FORGET(scrubber->vio.completion.parent), result); + } else if (done && (atomic_add_return(-1, &allocator->depot->zones_to_scrub) == 0)) { + /* All of our slabs were scrubbed, and we're the last allocator to finish. */ + enum vdo_state prior_state = + atomic_cmpxchg(&allocator->depot->vdo->state, VDO_RECOVERING, VDO_DIRTY); - if (depot->old_zone_count > 1) { - slab_count_t entry_number; + /* + * To be safe, even if the CAS failed, ensure anything that follows is ordered with + * respect to whatever state change did happen. + */ + smp_mb__after_atomic(); - for (entry_number = 0; entry_number < MAX_VDO_SLABS; entry_number++) { - if (zone != 0) - memcpy(entries + entry_number, - entries + (zone * MAX_VDO_SLABS) + entry_number, - sizeof(struct slab_summary_entry)); - zone++; - if (zone == depot->old_zone_count) - zone = 0; - } + /* + * We must check the VDO state here and not the depot's read_only_notifier since + * the compare-swap-above could have failed due to a read-only entry which our own + * thread does not yet know about. + */ + if (prior_state == VDO_DIRTY) + uds_log_info("VDO commencing normal operation"); + else if (prior_state == VDO_RECOVERING) + uds_log_info("Exiting recovery mode"); } - /* Copy the combined data to each zones's region of the buffer. */ - for (zone = 1; zone < MAX_VDO_PHYSICAL_ZONES; zone++) - memcpy(entries + (zone * MAX_VDO_SLABS), - entries, - MAX_VDO_SLABS * sizeof(struct slab_summary_entry)); + /* + * Note that the scrubber has stopped, and inform anyone who might be waiting for that to + * happen. 
+ */ + if (!vdo_finish_draining(&scrubber->admin_state)) + WRITE_ONCE(scrubber->admin_state.current_state, VDO_ADMIN_STATE_SUSPENDED); + + /* + * We can't notify waiters until after we've finished draining or they'll just requeue. + * Fortunately if there were waiters, we can't have been freed yet. + */ + if (notify) + vdo_notify_all_waiters(&scrubber->waiters, NULL, NULL); } +static void scrub_next_slab(struct slab_scrubber *scrubber); + /** - * finish_loading_summary() - Finish loading slab summary data. - * @completion: The vio which was used to read the summary data. + * slab_scrubbed() - Notify the scrubber that a slab has been scrubbed. + * @completion: The slab rebuild completion. * - * Combines the slab summary data from all the previously written zones and copies the combined - * summary to each partition's data region. Then writes the combined summary back out to disk. This - * callback is registered in load_summary_endio(). + * This callback is registered in apply_journal_entries(). */ -static void finish_loading_summary(struct vdo_completion *completion) +static void slab_scrubbed(struct vdo_completion *completion) { - struct slab_depot *depot = completion->vdo->depot; - - /* Combine the summary from each zone so each zone is correct for all slabs. */ - combine_summaries(depot); + struct slab_scrubber *scrubber = + container_of(as_vio(completion), struct slab_scrubber, vio); + struct vdo_slab *slab = scrubber->slab; + + slab->status = VDO_SLAB_REBUILT; + queue_slab(slab); + reopen_slab_journal(slab); + WRITE_ONCE(scrubber->slab_count, scrubber->slab_count - 1); + scrub_next_slab(scrubber); +} - /* Write the combined summary back out. */ - submit_metadata_vio(as_vio(completion), - depot->summary_origin, - write_summary_endio, - handle_combining_error, - REQ_OP_WRITE); +/** + * abort_scrubbing() - Abort scrubbing due to an error. + * @scrubber: The slab scrubber. + * @result: The error. + */ +static void abort_scrubbing(struct slab_scrubber *scrubber, int result) +{ + vdo_enter_read_only_mode(scrubber->vio.completion.vdo, result); + finish_scrubbing(scrubber, result); } -static void load_summary_endio(struct bio *bio) +/** + * handle_scrubber_error() - Handle errors while rebuilding a slab. + * @completion: The slab rebuild completion. + */ +static void handle_scrubber_error(struct vdo_completion *completion) { - struct vio *vio = bio->bi_private; - struct vdo *vdo = vio->completion.vdo; + struct vio *vio = as_vio(completion); - continue_vio_after_io(vio, finish_loading_summary, vdo->thread_config.admin_thread); + vio_record_metadata_io_error(vio); + abort_scrubbing(container_of(vio, struct slab_scrubber, vio), completion->result); } /** - * load_slab_summary() - The preamble of a load operation. + * apply_block_entries() - Apply all the entries in a block to the reference counts. + * @block: A block with entries to apply. + * @entry_count: The number of entries to apply. + * @block_number: The sequence number of the block. + * @slab: The slab to apply the entries to. * - * Implements vdo_action_preamble. + * Return: VDO_SUCCESS or an error code. 
*/ -static void load_slab_summary(void *context, struct vdo_completion *parent) +static int apply_block_entries(struct packed_slab_journal_block *block, + journal_entry_count_t entry_count, + sequence_number_t block_number, + struct vdo_slab *slab) { + struct journal_point entry_point = { + .sequence_number = block_number, + .entry_count = 0, + }; int result; - struct vio *vio; - struct slab_depot *depot = context; - const struct admin_state_code *operation = - vdo_get_current_manager_operation(depot->action_manager); + slab_block_number max_sbn = slab->end - slab->start; + + while (entry_point.entry_count < entry_count) { + struct slab_journal_entry entry = + vdo_decode_slab_journal_entry(block, entry_point.entry_count); + + if (entry.sbn > max_sbn) + /* This entry is out of bounds. */ + return uds_log_error_strerror(VDO_CORRUPT_JOURNAL, + "vdo_slab journal entry (%llu, %u) had invalid offset %u in slab (size %u blocks)", + (unsigned long long) block_number, + entry_point.entry_count, + entry.sbn, + max_sbn); + + result = replay_reference_count_change(slab, &entry_point, entry); + if (result != VDO_SUCCESS) { + uds_log_error_strerror(result, + "vdo_slab journal entry (%llu, %u) (%s of offset %u) could not be applied in slab %u", + (unsigned long long) block_number, + entry_point.entry_count, + vdo_get_journal_operation_name(entry.operation), + entry.sbn, + slab->slab_number); + return result; + } + entry_point.entry_count++; + } - result = create_multi_block_metadata_vio(depot->vdo, - VIO_TYPE_SLAB_SUMMARY, - VIO_PRIORITY_METADATA, + return VDO_SUCCESS; +} + +/** + * apply_journal_entries() - Find the relevant vio of the slab journal and apply all valid entries. + * @completion: The metadata read vio completion. + * + * This is a callback registered in start_scrubbing(). + */ +static void apply_journal_entries(struct vdo_completion *completion) +{ + int result; + struct slab_scrubber *scrubber + = container_of(as_vio(completion), struct slab_scrubber, vio); + struct vdo_slab *slab = scrubber->slab; + struct slab_journal *journal = &slab->journal; + + /* Find the boundaries of the useful part of the journal. */ + sequence_number_t tail = journal->tail; + tail_block_offset_t end_index = (tail - 1) % journal->size; + char *end_data = scrubber->vio.data + (end_index * VDO_BLOCK_SIZE); + struct packed_slab_journal_block *end_block = + (struct packed_slab_journal_block *) end_data; + + sequence_number_t head = __le64_to_cpu(end_block->header.head); + tail_block_offset_t head_index = head % journal->size; + block_count_t index = head_index; + + struct journal_point ref_counts_point = slab->slab_journal_point; + struct journal_point last_entry_applied = ref_counts_point; + sequence_number_t sequence; + + for (sequence = head; sequence < tail; sequence++) { + char *block_data = scrubber->vio.data + (index * VDO_BLOCK_SIZE); + struct packed_slab_journal_block *block = + (struct packed_slab_journal_block *) block_data; + struct slab_journal_block_header header; + + vdo_unpack_slab_journal_block_header(&block->header, &header); + + if ((header.nonce != slab->allocator->nonce) || + (header.metadata_type != VDO_METADATA_SLAB_JOURNAL) || + (header.sequence_number != sequence) || + (header.entry_count > journal->entries_per_block) || + (header.has_block_map_increments && + (header.entry_count > journal->full_entries_per_block))) { + /* The block is not what we expect it to be. 
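A stand-alone sketch of the circular-journal arithmetic used in apply_journal_entries(): sequence numbers increase forever and map onto block slots modulo the journal size, and the walk runs from head to tail with the slot index wrapping. The sizes and sequence numbers below are arbitrary, and block-header validation is reduced to a comment.

#include <stdio.h>

typedef unsigned long long sequence_number_t;

/* Walk the block slots of a circular on-disk journal from head to tail. */
static void walk_journal(sequence_number_t head, sequence_number_t tail,
			 unsigned int size)
{
	unsigned int index = head % size;

	for (sequence_number_t seq = head; seq < tail; seq++) {
		printf("sequence %llu lives in block slot %u\n", seq, index);
		/* A block is only trusted if its header matches the expected
		 * sequence number (plus nonce, type, and entry counts);
		 * anything else means the journal is stale or corrupt. */
		if (++index == size)
			index = 0;
	}
}

int main(void)
{
	/* e.g. an 8-block journal whose useful region is sequences 13..17 */
	walk_journal(13, 18, 8);
	return 0;
}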
*/ + uds_log_error("vdo_slab journal block for slab %u was invalid", + slab->slab_number); + abort_scrubbing(scrubber, VDO_CORRUPT_JOURNAL); + return; + } + + result = apply_block_entries(block, header.entry_count, sequence, slab); + if (result != VDO_SUCCESS) { + abort_scrubbing(scrubber, result); + return; + } + + last_entry_applied.sequence_number = sequence; + last_entry_applied.entry_count = header.entry_count - 1; + index++; + if (index == journal->size) + index = 0; + } + + /* + * At the end of rebuild, the reference counters should be accurate to the end of the + * journal we just applied. + */ + result = ASSERT(!vdo_before_journal_point(&last_entry_applied, &ref_counts_point), + "Refcounts are not more accurate than the slab journal"); + if (result != VDO_SUCCESS) { + abort_scrubbing(scrubber, result); + return; + } + + /* Save out the rebuilt reference blocks. */ + vdo_prepare_completion(completion, + slab_scrubbed, + handle_scrubber_error, + slab->allocator->thread_id, + completion->parent); + vdo_start_operation_with_waiter(&slab->state, + VDO_ADMIN_STATE_SAVE_FOR_SCRUBBING, + completion, + initiate_slab_action); +} + +static void read_slab_journal_endio(struct bio *bio) +{ + struct vio *vio = bio->bi_private; + struct slab_scrubber *scrubber = container_of(vio, struct slab_scrubber, vio); + + continue_vio_after_io(bio->bi_private, + apply_journal_entries, + scrubber->slab->allocator->thread_id); +} + +/** + * start_scrubbing() - Read the current slab's journal from disk now that it has been flushed. + * @completion: The scrubber's vio completion. + * + * This callback is registered in scrub_next_slab(). + */ +static void start_scrubbing(struct vdo_completion *completion) +{ + struct slab_scrubber *scrubber = + container_of(as_vio(completion), struct slab_scrubber, vio); + struct vdo_slab *slab = scrubber->slab; + + if (!slab->allocator->summary_entries[slab->slab_number].is_dirty) { + slab_scrubbed(completion); + return; + } + + submit_metadata_vio(&scrubber->vio, + slab->journal_origin, + read_slab_journal_endio, + handle_scrubber_error, + REQ_OP_READ); +} + +/** + * scrub_next_slab() - Scrub the next slab if there is one. + * @scrubber: The scrubber. + */ +static void scrub_next_slab(struct slab_scrubber *scrubber) +{ + struct vdo_completion *completion = &scrubber->vio.completion; + struct vdo_slab *slab; + + /* + * Note: this notify call is always safe only because scrubbing can only be started when + * the VDO is quiescent. + */ + vdo_notify_all_waiters(&scrubber->waiters, NULL, NULL); + + if (vdo_is_read_only(completion->vdo)) { + finish_scrubbing(scrubber, VDO_READ_ONLY); + return; + } + + slab = get_next_slab(scrubber); + if ((slab == NULL) || + (scrubber->high_priority_only && list_empty(&scrubber->high_priority_slabs))) { + finish_scrubbing(scrubber, VDO_SUCCESS); + return; + } + + if (vdo_finish_draining(&scrubber->admin_state)) + return; + + list_del_init(&slab->allocq_entry); + scrubber->slab = slab; + vdo_prepare_completion(completion, + start_scrubbing, + handle_scrubber_error, + slab->allocator->thread_id, + completion->parent); + vdo_start_operation_with_waiter(&slab->state, + VDO_ADMIN_STATE_SCRUBBING, + completion, + initiate_slab_action); +} + +/** + * scrub_slabs() - Scrub all of an allocator's slabs that are eligible for scrubbing. + * @allocator: The block_allocator to scrub. + * @parent: The completion to notify when scrubbing is done, implies high_priority, may be NULL. 
+ */ +static void scrub_slabs(struct block_allocator *allocator, struct vdo_completion *parent) +{ + struct slab_scrubber *scrubber = &allocator->scrubber; + + scrubber->vio.completion.parent = parent; + scrubber->high_priority_only = (parent != NULL); + if (!has_slabs_to_scrub(scrubber)) { + finish_scrubbing(scrubber, VDO_SUCCESS); + return; + } + + if (scrubber->high_priority_only && + vdo_is_priority_table_empty(allocator->prioritized_slabs) && + list_empty(&scrubber->high_priority_slabs)) + register_slab_for_scrubbing(get_next_slab(scrubber), true); + + vdo_resume_if_quiescent(&scrubber->admin_state); + scrub_next_slab(scrubber); +} + +static inline void assert_on_allocator_thread(thread_id_t thread_id, const char *function_name) +{ + ASSERT_LOG_ONLY((vdo_get_callback_thread_id() == thread_id), + "%s called on correct thread", + function_name); +} + +static void register_slab_with_allocator(struct block_allocator *allocator, struct vdo_slab *slab) +{ + allocator->slab_count++; + allocator->last_slab = slab->slab_number; +} + +static struct slab_iterator get_slab_iterator(const struct block_allocator *allocator) +{ + return get_depot_slab_iterator(allocator->depot, + allocator->last_slab, + allocator->zone_number, + allocator->depot->zone_count); +} + +/** + * next_slab() - Get the next slab from a slab_iterator and advance the iterator + * @iterator: The slab_iterator. + * + * Return: The next slab or NULL if the iterator is exhausted. + */ +static struct vdo_slab *next_slab(struct slab_iterator *iterator) +{ + struct vdo_slab *slab = iterator->next; + + if ((slab == NULL) || (slab->slab_number < iterator->end + iterator->stride)) + iterator->next = NULL; + else + iterator->next = iterator->slabs[slab->slab_number - iterator->stride]; + + return slab; +} + +/** + * abort_waiter() - Abort vios waiting to make journal entries when read-only. + * + * This callback is invoked on all vios waiting to make slab journal entries after the VDO has gone + * into read-only mode. Implements waiter_callback. + */ +static void abort_waiter(struct waiter *waiter, void *context __always_unused) +{ + struct reference_updater *updater = container_of(waiter, struct reference_updater, waiter); + struct data_vio *data_vio = data_vio_from_reference_updater(updater); + + if (updater->increment) { + continue_data_vio_with_error(data_vio, VDO_READ_ONLY); + return; + } + + vdo_continue_completion(&data_vio->decrement_completion, VDO_READ_ONLY); +} + +/* Implements vdo_read_only_notification. */ +static void notify_block_allocator_of_read_only_mode(void *listener, struct vdo_completion *parent) +{ + struct block_allocator *allocator = listener; + struct slab_iterator iterator; + + assert_on_allocator_thread(allocator->thread_id, __func__); + iterator = get_slab_iterator(allocator); + while (iterator.next != NULL) { + struct vdo_slab *slab = next_slab(&iterator); + + vdo_notify_all_waiters(&slab->journal.entry_waiters, abort_waiter, &slab->journal); + check_if_slab_drained(slab); + } + + vdo_finish_completion(parent); +} + +/** + * vdo_acquire_provisional_reference() - Acquire a provisional reference on behalf of a PBN lock if + * the block it locks is unreferenced. + * @slab: The slab which contains the block. + * @pbn: The physical block to reference. + * @lock: The lock. + * + * Return: VDO_SUCCESS or an error. 
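The strided iteration in get_slab_iterator()/next_slab() can be pictured with a small program: each allocator owns the slabs congruent to its zone number modulo the zone count and walks them from the highest-numbered one downward. The slab and zone counts, and the formula for the last owned slab, are illustrative only; in the driver last_slab is tracked as slabs are registered.

#include <stdio.h>

int main(void)
{
	unsigned int slab_count = 10;
	unsigned int zone_count = 3;

	for (unsigned int zone = 0; zone < zone_count; zone++) {
		/* highest-numbered slab owned by this zone */
		unsigned int last = zone +
			((slab_count - 1 - zone) / zone_count) * zone_count;

		printf("zone %u:", zone);
		for (unsigned int n = last; ; n -= zone_count) {
			printf(" %u", n);
			if (n < zone + zone_count)
				break; /* iterator exhausted */
		}
		printf("\n");
	}
	return 0;
}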
+ */ +int vdo_acquire_provisional_reference(struct vdo_slab *slab, + physical_block_number_t pbn, + struct pbn_lock *lock) +{ + slab_block_number block_number; + int result; + + if (vdo_pbn_lock_has_provisional_reference(lock)) + return VDO_SUCCESS; + + if (!is_slab_open(slab)) + return VDO_INVALID_ADMIN_STATE; + + result = slab_block_number_from_pbn(slab, pbn, &block_number); + if (result != VDO_SUCCESS) + return result; + + if (slab->counters[block_number] == EMPTY_REFERENCE_COUNT) { + make_provisional_reference(slab, block_number); + if (lock != NULL) + vdo_assign_pbn_lock_provisional_reference(lock); + } + + if (vdo_pbn_lock_has_provisional_reference(lock)) + adjust_free_block_count(slab, false); + + return VDO_SUCCESS; +} + +static int __must_check +allocate_slab_block(struct vdo_slab *slab, physical_block_number_t *block_number_ptr) +{ + slab_block_number free_index; + + if (!is_slab_open(slab)) + return VDO_INVALID_ADMIN_STATE; + + if (!search_reference_blocks(slab, &free_index)) + return VDO_NO_SPACE; + + ASSERT_LOG_ONLY((slab->counters[free_index] == EMPTY_REFERENCE_COUNT), + "free block must have ref count of zero"); + make_provisional_reference(slab, free_index); + adjust_free_block_count(slab, false); + + /* + * Update the search hint so the next search will start at the array index just past the + * free block we just found. + */ + slab->search_cursor.index = (free_index + 1); + + *block_number_ptr = slab->start + free_index; + return VDO_SUCCESS; +} + +/** + * open_slab() - Prepare a slab to be allocated from. + * @slab: The slab. + */ +static void open_slab(struct vdo_slab *slab) +{ + reset_search_cursor(slab); + if (is_slab_journal_blank(slab)) { + WRITE_ONCE(slab->allocator->statistics.slabs_opened, + slab->allocator->statistics.slabs_opened + 1); + dirty_all_reference_blocks(slab); + } else { + WRITE_ONCE(slab->allocator->statistics.slabs_reopened, + slab->allocator->statistics.slabs_reopened + 1); + } + + slab->allocator->open_slab = slab; +} + + +/* + * The block allocated will have a provisional reference and the reference must be either confirmed + * with a subsequent increment or vacated with a subsequent decrement via + * vdo_release_block_reference(). + */ +int vdo_allocate_block(struct block_allocator *allocator, + physical_block_number_t *block_number_ptr) +{ + int result; + + if (allocator->open_slab != NULL) { + /* Try to allocate the next block in the currently open slab. */ + result = allocate_slab_block(allocator->open_slab, block_number_ptr); + if ((result == VDO_SUCCESS) || (result != VDO_NO_SPACE)) + return result; + + /* Put the exhausted open slab back into the priority table. */ + prioritize_slab(allocator->open_slab); + } + + /* Remove the highest priority slab from the priority table and make it the open slab. */ + open_slab(list_entry(vdo_priority_table_dequeue(allocator->prioritized_slabs), + struct vdo_slab, + allocq_entry)); + + /* + * Try allocating again. If we're out of space immediately after opening a slab, then every + * slab must be fully allocated. + */ + return allocate_slab_block(allocator->open_slab, block_number_ptr); +} + +/** + * vdo_enqueue_clean_slab_waiter() - Wait for a clean slab. + * @allocator: The block_allocator on which to wait. + * @waiter: The waiter. + * + * Return: VDO_SUCCESS if the waiter was queued, VDO_NO_SPACE if there are no slabs to scrub, and + * some other error otherwise. 
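A compact model of the control flow in vdo_allocate_block(): try the currently open slab, and when it reports no space, open the best remaining slab and retry exactly once. The toy_slab type is invented, and a linear "most free blocks" scan stands in for the priority table dequeue.

#include <stdio.h>
#include <stdbool.h>

struct toy_slab {
	unsigned int free_blocks;
	unsigned int next_block;
};

/* One attempt to take a block from the open slab (false ~ VDO_NO_SPACE). */
static bool take_block(struct toy_slab *slab, unsigned int *block)
{
	if (slab->free_blocks == 0)
		return false;
	*block = slab->next_block++;
	slab->free_blocks--;
	return true;
}

static bool allocate(struct toy_slab *slabs, unsigned int count,
		     unsigned int *open, unsigned int *block)
{
	if (take_block(&slabs[*open], block))
		return true;

	/* "Open" the slab with the most free blocks and try again once. */
	unsigned int best = 0;
	for (unsigned int i = 1; i < count; i++)
		if (slabs[i].free_blocks > slabs[best].free_blocks)
			best = i;
	*open = best;
	return take_block(&slabs[*open], block);
}

int main(void)
{
	struct toy_slab slabs[3] = { { 0, 0 }, { 2, 100 }, { 5, 200 } };
	unsigned int open = 0, block;

	if (allocate(slabs, 3, &open, &block))
		printf("allocated block %u from slab %u\n", block, open);
	else
		printf("out of space\n");
	return 0;
}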
+ */ +int vdo_enqueue_clean_slab_waiter(struct block_allocator *allocator, struct waiter *waiter) +{ + if (vdo_is_read_only(allocator->depot->vdo)) + return VDO_READ_ONLY; + + if (vdo_is_state_quiescent(&allocator->scrubber.admin_state)) + return VDO_NO_SPACE; + + vdo_enqueue_waiter(&allocator->scrubber.waiters, waiter); + return VDO_SUCCESS; +} + +/** + * vdo_modify_reference_count() - Modify the reference count of a block by first making a slab + * journal entry and then updating the reference counter. + * + * @data_vio: The data_vio for which to add the entry. + * @updater: Which of the data_vio's reference updaters is being submitted. + */ +void vdo_modify_reference_count(struct vdo_completion *completion, + struct reference_updater *updater) +{ + struct vdo_slab *slab = vdo_get_slab(completion->vdo->depot, updater->zpbn.pbn); + + if (!is_slab_open(slab)) { + vdo_continue_completion(completion, VDO_INVALID_ADMIN_STATE); + return; + } + + if (vdo_is_read_only(completion->vdo)) { + vdo_continue_completion(completion, VDO_READ_ONLY); + return; + } + + vdo_enqueue_waiter(&slab->journal.entry_waiters, &updater->waiter); + if ((slab->status != VDO_SLAB_REBUILT) && requires_reaping(&slab->journal)) + register_slab_for_scrubbing(slab, true); + + add_entries(&slab->journal); +} + +/* Release an unused provisional reference. */ +int vdo_release_block_reference(struct block_allocator *allocator, physical_block_number_t pbn) +{ + struct reference_updater updater; + + if (pbn == VDO_ZERO_BLOCK) + return VDO_SUCCESS; + + updater = (struct reference_updater) { + .operation = VDO_JOURNAL_DATA_REMAPPING, + .increment = false, + .zpbn = { + .pbn = pbn, + }, + }; + + return adjust_reference_count(vdo_get_slab(allocator->depot, pbn), &updater, NULL); +} + +/* + * This is a min_heap callback function orders slab_status structures using the 'is_clean' field as + * the primary key and the 'emptiness' field as the secondary key. + * + * Slabs need to be pushed onto the rings in the same order they are to be popped off. Popping + * should always get the most empty first, so pushing should be from most empty to least empty. + * Thus, the ordering is reversed from the usual sense since min_heap returns smaller elements + * before larger ones. + */ +static bool slab_status_is_less_than(const void *item1, const void *item2) +{ + const struct slab_status *info1 = (const struct slab_status *) item1; + const struct slab_status *info2 = (const struct slab_status *) item2; + + if (info1->is_clean != info2->is_clean) + return info1->is_clean; + if (info1->emptiness != info2->emptiness) + return info1->emptiness > info2->emptiness; + return info1->slab_number < info2->slab_number; +} + +static void swap_slab_statuses(void *item1, void *item2) +{ + struct slab_status *info1 = item1; + struct slab_status *info2 = item2; + + swap(*info1, *info2); +} + +static const struct min_heap_callbacks slab_status_min_heap = { + .elem_size = sizeof(struct slab_status), + .less = slab_status_is_less_than, + .swp = swap_slab_statuses, +}; + +/* Inform the slab actor that a action has finished on some slab; used by apply_to_slabs(). */ +static void slab_action_callback(struct vdo_completion *completion) +{ + struct block_allocator *allocator = vdo_as_block_allocator(completion); + struct slab_actor *actor = &allocator->slab_actor; + + if (--actor->slab_action_count == 0) { + actor->callback(completion); + return; + } + + vdo_reset_completion(completion); +} + +/* Preserve the error from part of an action and continue. 
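The ordering described for slab_status_is_less_than() can be demonstrated with an ordinary qsort() comparator that ranks entries the way the heap pops them: clean slabs first, then emptier slabs, with slab number as the tiebreaker. This only illustrates the comparator's intent; it is not the kernel min_heap API, and the sample values are made up.

#include <stdio.h>
#include <stdlib.h>
#include <stdbool.h>

struct status {
	unsigned int slab_number;
	bool is_clean;
	unsigned int emptiness;
};

/* Clean ahead of dirty, emptier ahead of fuller, lower slab number last. */
static int cmp_status(const void *a, const void *b)
{
	const struct status *s1 = a, *s2 = b;

	if (s1->is_clean != s2->is_clean)
		return s1->is_clean ? -1 : 1;
	if (s1->emptiness != s2->emptiness)
		return (s1->emptiness > s2->emptiness) ? -1 : 1;
	return (s1->slab_number < s2->slab_number) ? -1 : 1;
}

int main(void)
{
	struct status statuses[] = {
		{ 0, false, 10 }, { 1, true, 3 }, { 2, true, 9 }, { 3, false, 20 },
	};

	qsort(statuses, 4, sizeof(statuses[0]), cmp_status);
	for (int i = 0; i < 4; i++)
		printf("slab %u (clean=%d, emptiness=%u)\n",
		       statuses[i].slab_number, statuses[i].is_clean,
		       statuses[i].emptiness);
	return 0;
}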
*/ +static void handle_operation_error(struct vdo_completion *completion) +{ + struct block_allocator *allocator = vdo_as_block_allocator(completion); + + if (allocator->state.waiter != NULL) + vdo_set_completion_result(allocator->state.waiter, completion->result); + completion->callback(completion); +} + +/* Perform an action on each of an allocator's slabs in parallel. */ +static void apply_to_slabs(struct block_allocator *allocator, vdo_action *callback) +{ + struct slab_iterator iterator; + + vdo_prepare_completion(&allocator->completion, + slab_action_callback, + handle_operation_error, + allocator->thread_id, + NULL); + allocator->completion.requeue = false; + + /* + * Since we are going to dequeue all of the slabs, the open slab will become invalid, so + * clear it. + */ + allocator->open_slab = NULL; + + /* Ensure that we don't finish before we're done starting. */ + allocator->slab_actor = (struct slab_actor) { + .slab_action_count = 1, + .callback = callback, + }; + + iterator = get_slab_iterator(allocator); + while (iterator.next != NULL) { + const struct admin_state_code *operation = + vdo_get_admin_state_code(&allocator->state); + struct vdo_slab *slab = next_slab(&iterator); + + list_del_init(&slab->allocq_entry); + allocator->slab_actor.slab_action_count++; + vdo_start_operation_with_waiter(&slab->state, + operation, + &allocator->completion, + initiate_slab_action); + } + + slab_action_callback(&allocator->completion); +} + +static void finish_loading_allocator(struct vdo_completion *completion) +{ + struct block_allocator *allocator = vdo_as_block_allocator(completion); + const struct admin_state_code *operation = vdo_get_admin_state_code(&allocator->state); + + if (allocator->eraser != NULL) + dm_kcopyd_client_destroy(UDS_FORGET(allocator->eraser)); + + if (operation == VDO_ADMIN_STATE_LOADING_FOR_RECOVERY) { + void *context = vdo_get_current_action_context(allocator->depot->action_manager); + + vdo_replay_into_slab_journals(allocator, context); + return; + } + + vdo_finish_loading(&allocator->state); +} + +static void erase_next_slab_journal(struct block_allocator *allocator); + +static void copy_callback(int read_err, unsigned long write_err, void *context) +{ + struct block_allocator *allocator = context; + int result = (((read_err == 0) && (write_err == 0)) ? VDO_SUCCESS : -EIO); + + if (result != VDO_SUCCESS) { + vdo_fail_completion(&allocator->completion, result); + return; + } + + erase_next_slab_journal(allocator); +} + +/* erase_next_slab_journal() - Erase the next slab journal. */ +static void erase_next_slab_journal(struct block_allocator *allocator) +{ + struct vdo_slab *slab; + physical_block_number_t pbn; + struct dm_io_region regions[1]; + struct slab_depot *depot = allocator->depot; + block_count_t blocks = depot->slab_config.slab_journal_blocks; + + if (allocator->slabs_to_erase.next == NULL) { + vdo_finish_completion(&allocator->completion); + return; + } + + slab = next_slab(&allocator->slabs_to_erase); + pbn = slab->journal_origin - depot->vdo->geometry.bio_offset; + regions[0] = (struct dm_io_region) { + .bdev = vdo_get_backing_device(depot->vdo), + .sector = pbn * VDO_SECTORS_PER_BLOCK, + .count = blocks * VDO_SECTORS_PER_BLOCK, + }; + dm_kcopyd_zero(allocator->eraser, 1, regions, 0, copy_callback, allocator); +} + +/* Implements vdo_admin_initiator. 
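apply_to_slabs() relies on the familiar "start the completion count at one" trick so the final callback cannot fire while actions are still being launched; the extra self-reference is dropped only after the launch loop. A minimal synchronous sketch of that counting pattern (in the driver each slab action completes asynchronously):

#include <stdio.h>

static unsigned int pending;

static void action_done(void)
{
	if (--pending == 0)
		printf("all slab actions finished\n");
}

static void apply_to_all(unsigned int slab_count)
{
	pending = 1; /* self-reference held during the launch loop */

	for (unsigned int i = 0; i < slab_count; i++) {
		pending++;
		/* Here the "action" completes immediately. */
		action_done();
	}

	action_done(); /* drop the self-reference */
}

int main(void)
{
	apply_to_all(4);
	return 0;
}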
*/ +static void initiate_load(struct admin_state *state) +{ + struct block_allocator *allocator = container_of(state, struct block_allocator, state); + const struct admin_state_code *operation = vdo_get_admin_state_code(state); + + if (operation == VDO_ADMIN_STATE_LOADING_FOR_REBUILD) { + /* + * Must requeue because the kcopyd client cannot be freed in the same stack frame + * as the kcopyd callback, lest it deadlock. + */ + vdo_prepare_completion_for_requeue(&allocator->completion, + finish_loading_allocator, + handle_operation_error, + allocator->thread_id, + NULL); + allocator->eraser = dm_kcopyd_client_create(NULL); + if (allocator->eraser == NULL) { + vdo_fail_completion(&allocator->completion, -ENOMEM); + return; + } + allocator->slabs_to_erase = get_slab_iterator(allocator); + + erase_next_slab_journal(allocator); + return; + } + + apply_to_slabs(allocator, finish_loading_allocator); +} + +/** + * vdo_notify_slab_journals_are_recovered() - Inform a block allocator that its slab journals have + * been recovered from the recovery journal. + * @completion The allocator completion + */ +void vdo_notify_slab_journals_are_recovered(struct vdo_completion *completion) +{ + struct block_allocator *allocator = vdo_as_block_allocator(completion); + + vdo_finish_loading_with_result(&allocator->state, completion->result); +} + +static int +get_slab_statuses(struct block_allocator *allocator, struct slab_status **statuses_ptr) +{ + int result; + struct slab_status *statuses; + struct slab_iterator iterator = get_slab_iterator(allocator); + + result = UDS_ALLOCATE(allocator->slab_count, struct slab_status, __func__, &statuses); + if (result != VDO_SUCCESS) + return result; + + *statuses_ptr = statuses; + + while (iterator.next != NULL) { + slab_count_t slab_number = next_slab(&iterator)->slab_number; + + *statuses++ = (struct slab_status) { + .slab_number = slab_number, + .is_clean = !allocator->summary_entries[slab_number].is_dirty, + .emptiness = allocator->summary_entries[slab_number].fullness_hint, + }; + } + + return VDO_SUCCESS; +} + +/* Prepare slabs for allocation or scrubbing. */ +static int __must_check +vdo_prepare_slabs_for_allocation(struct block_allocator *allocator) +{ + struct slab_status current_slab_status; + struct min_heap heap; + int result; + struct slab_status *slab_statuses; + struct slab_depot *depot = allocator->depot; + + WRITE_ONCE(allocator->allocated_blocks, + allocator->slab_count * depot->slab_config.data_blocks); + result = get_slab_statuses(allocator, &slab_statuses); + if (result != VDO_SUCCESS) + return result; + + /* Sort the slabs by cleanliness, then by emptiness hint. 
*/ + heap = (struct min_heap) { + .data = slab_statuses, + .nr = allocator->slab_count, + .size = allocator->slab_count, + }; + min_heapify_all(&heap, &slab_status_min_heap); + + while (heap.nr > 0) { + bool high_priority; + struct vdo_slab *slab; + struct slab_journal *journal; + + current_slab_status = slab_statuses[0]; + min_heap_pop(&heap, &slab_status_min_heap); + slab = depot->slabs[current_slab_status.slab_number]; + + if ((depot->load_type == VDO_SLAB_DEPOT_REBUILD_LOAD) || + (!allocator->summary_entries[slab->slab_number].load_ref_counts && + current_slab_status.is_clean)) { + queue_slab(slab); + continue; + } + + slab->status = VDO_SLAB_REQUIRES_SCRUBBING; + journal = &slab->journal; + high_priority = ((current_slab_status.is_clean && + (depot->load_type == VDO_SLAB_DEPOT_NORMAL_LOAD)) || + (journal_length(journal) >= journal->scrubbing_threshold)); + register_slab_for_scrubbing(slab, high_priority); + } + + UDS_FREE(slab_statuses); + return VDO_SUCCESS; +} + +static const char *status_to_string(enum slab_rebuild_status status) +{ + switch (status) { + case VDO_SLAB_REBUILT: + return "REBUILT"; + case VDO_SLAB_REQUIRES_SCRUBBING: + return "SCRUBBING"; + case VDO_SLAB_REQUIRES_HIGH_PRIORITY_SCRUBBING: + return "PRIORITY_SCRUBBING"; + case VDO_SLAB_REBUILDING: + return "REBUILDING"; + case VDO_SLAB_REPLAYING: + return "REPLAYING"; + default: + return "UNKNOWN"; + } +} + +void vdo_dump_block_allocator(const struct block_allocator *allocator) +{ + unsigned int pause_counter = 0; + struct slab_iterator iterator = get_slab_iterator(allocator); + const struct slab_scrubber *scrubber = &allocator->scrubber; + + uds_log_info("block_allocator zone %u", allocator->zone_number); + while (iterator.next != NULL) { + struct vdo_slab *slab = next_slab(&iterator); + struct slab_journal *journal = &slab->journal; + + if (slab->reference_blocks != NULL) + /* Terse because there are a lot of slabs to dump and syslog is lossy. */ + uds_log_info("slab %u: P%u, %llu free", + slab->slab_number, + slab->priority, + (unsigned long long) slab->free_blocks); + else + uds_log_info("slab %u: status %s", + slab->slab_number, + status_to_string(slab->status)); + + uds_log_info(" slab journal: entry_waiters=%zu waiting_to_commit=%s updating_slab_summary=%s head=%llu unreapable=%llu tail=%llu next_commit=%llu summarized=%llu last_summarized=%llu recovery_lock=%llu dirty=%s", + vdo_count_waiters(&journal->entry_waiters), + uds_bool_to_string(journal->waiting_to_commit), + uds_bool_to_string(journal->updating_slab_summary), + (unsigned long long) journal->head, + (unsigned long long) journal->unreapable, + (unsigned long long) journal->tail, + (unsigned long long) journal->next_commit, + (unsigned long long) journal->summarized, + (unsigned long long) journal->last_summarized, + (unsigned long long) journal->recovery_lock, + uds_bool_to_string(journal->recovery_lock != 0)); + /* + * Given the frequency with which the locks are just a tiny bit off, it might be + * worth dumping all the locks, but that might be too much logging. + */ + + if (slab->counters != NULL) + /* Terse because there are a lot of slabs to dump and syslog is lossy. 
*/ + uds_log_info(" slab: free=%u/%u blocks=%u dirty=%zu active=%zu journal@(%llu,%u)", + slab->free_blocks, + slab->block_count, + slab->reference_block_count, + vdo_count_waiters(&slab->dirty_blocks), + slab->active_count, + (unsigned long long) slab->slab_journal_point.sequence_number, + slab->slab_journal_point.entry_count); + else + uds_log_info(" no counters"); + + /* + * Wait for a while after each batch of 32 slabs dumped, an arbitrary number, + * allowing the kernel log a chance to be flushed instead of being overrun. + */ + if (pause_counter++ == 31) { + pause_counter = 0; + uds_pause_for_logger(); + } + } + + uds_log_info("slab_scrubber slab_count %u waiters %zu %s%s", + READ_ONCE(scrubber->slab_count), + vdo_count_waiters(&scrubber->waiters), + vdo_get_admin_state_code(&scrubber->admin_state)->name, + scrubber->high_priority_only ? ", high_priority_only " : ""); +} + +static void free_slab(struct vdo_slab *slab) +{ + if (slab == NULL) + return; + + list_del(&slab->allocq_entry); + UDS_FREE(UDS_FORGET(slab->journal.block)); + UDS_FREE(UDS_FORGET(slab->journal.locks)); + UDS_FREE(UDS_FORGET(slab->counters)); + UDS_FREE(UDS_FORGET(slab->reference_blocks)); + UDS_FREE(slab); +} + +static int initialize_slab_journal(struct vdo_slab *slab) +{ + struct slab_journal *journal = &slab->journal; + const struct slab_config *slab_config = &slab->allocator->depot->slab_config; + int result; + + result = UDS_ALLOCATE(slab_config->slab_journal_blocks, + struct journal_lock, + __func__, + &journal->locks); + if (result != VDO_SUCCESS) + return result; + + result = UDS_ALLOCATE(VDO_BLOCK_SIZE, + char, + "struct packed_slab_journal_block", + (char **) &journal->block); + if (result != VDO_SUCCESS) + return result; + + journal->slab = slab; + journal->size = slab_config->slab_journal_blocks; + journal->flushing_threshold = slab_config->slab_journal_flushing_threshold; + journal->blocking_threshold = slab_config->slab_journal_blocking_threshold; + journal->scrubbing_threshold = slab_config->slab_journal_scrubbing_threshold; + journal->entries_per_block = VDO_SLAB_JOURNAL_ENTRIES_PER_BLOCK; + journal->full_entries_per_block = VDO_SLAB_JOURNAL_FULL_ENTRIES_PER_BLOCK; + journal->events = &slab->allocator->slab_journal_statistics; + journal->recovery_journal = slab->allocator->depot->vdo->recovery_journal; + journal->tail = 1; + journal->head = 1; + + journal->flushing_deadline = journal->flushing_threshold; + /* + * Set there to be some time between the deadline and the blocking threshold, so that + * hopefully all are done before blocking. + */ + if ((journal->blocking_threshold - journal->flushing_threshold) > 5) + journal->flushing_deadline = journal->blocking_threshold - 5; + + journal->slab_summary_waiter.callback = release_journal_locks; + + INIT_LIST_HEAD(&journal->dirty_entry); + INIT_LIST_HEAD(&journal->uncommitted_blocks); + + journal->tail_header.nonce = slab->allocator->nonce; + journal->tail_header.metadata_type = VDO_METADATA_SLAB_JOURNAL; + initialize_journal_state(journal); + return VDO_SUCCESS; +} + +/** + * make_slab() - Construct a new, empty slab. + * @slab_origin: The physical block number within the block allocator partition of the first block + * in the slab. + * @allocator: The block allocator to which the slab belongs. + * @slab_number: The slab number of the slab. + * @is_new: true if this slab is being allocated as part of a resize. + * @slab_ptr: A pointer to receive the new slab. + * + * Return: VDO_SUCCESS or an error code. 
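The flushing-deadline rule from initialize_slab_journal() in isolation: the deadline starts at the flushing threshold but, when there is room, is pushed to five blocks short of the blocking threshold so flushes normally finish before writers would be blocked. The threshold values below are examples; real ones come from the slab_config.

#include <stdio.h>

typedef unsigned long long block_count_t;

static block_count_t flushing_deadline(block_count_t flushing_threshold,
				       block_count_t blocking_threshold)
{
	if ((blocking_threshold - flushing_threshold) > 5)
		return blocking_threshold - 5;
	return flushing_threshold;
}

int main(void)
{
	printf("deadline = %llu\n", flushing_deadline(90, 100)); /* -> 95 */
	printf("deadline = %llu\n", flushing_deadline(97, 100)); /* -> 97 */
	return 0;
}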
+ */ +static int __must_check +make_slab(physical_block_number_t slab_origin, + struct block_allocator *allocator, + slab_count_t slab_number, + bool is_new, + struct vdo_slab **slab_ptr) +{ + const struct slab_config *slab_config = &allocator->depot->slab_config; + struct vdo_slab *slab; + int result; + + result = UDS_ALLOCATE(1, struct vdo_slab, __func__, &slab); + if (result != VDO_SUCCESS) + return result; + + *slab = (struct vdo_slab) { + .allocator = allocator, + .start = slab_origin, + .end = slab_origin + slab_config->slab_blocks, + .slab_number = slab_number, + .ref_counts_origin = slab_origin + slab_config->data_blocks, + .journal_origin = vdo_get_slab_journal_start_block(slab_config, slab_origin), + .block_count = slab_config->data_blocks, + .free_blocks = slab_config->data_blocks, + .reference_block_count = + vdo_get_saved_reference_count_size(slab_config->data_blocks), + }; + INIT_LIST_HEAD(&slab->allocq_entry); + + result = initialize_slab_journal(slab); + if (result != VDO_SUCCESS) { + free_slab(slab); + return result; + } + + if (is_new) { + vdo_set_admin_state_code(&slab->state, VDO_ADMIN_STATE_NEW); + result = allocate_slab_counters(slab); + if (result != VDO_SUCCESS) { + free_slab(slab); + return result; + } + } else { + vdo_set_admin_state_code(&slab->state, VDO_ADMIN_STATE_NORMAL_OPERATION); + } + + *slab_ptr = slab; + return VDO_SUCCESS; +} + +/** + * initialize_slab_scrubber() - Initialize an allocator's slab scrubber. + * @allocator: The allocator being initialized + * + * Return: VDO_SUCCESS or an error. + */ +static int initialize_slab_scrubber(struct block_allocator *allocator) +{ + struct slab_scrubber *scrubber = &allocator->scrubber; + block_count_t slab_journal_size = allocator->depot->slab_config.slab_journal_blocks; + char *journal_data; + int result; + + result = UDS_ALLOCATE(VDO_BLOCK_SIZE * slab_journal_size, char, __func__, &journal_data); + if (result != VDO_SUCCESS) + return result; + + result = allocate_vio_components(allocator->completion.vdo, + VIO_TYPE_SLAB_JOURNAL, + VIO_PRIORITY_METADATA, + allocator, + slab_journal_size, + journal_data, + &scrubber->vio); + if (result != VDO_SUCCESS) { + UDS_FREE(journal_data); + return result; + } + + INIT_LIST_HEAD(&scrubber->high_priority_slabs); + INIT_LIST_HEAD(&scrubber->slabs); + vdo_set_admin_state_code(&scrubber->admin_state, VDO_ADMIN_STATE_SUSPENDED); + return VDO_SUCCESS; +} + +/** + * initialize_slab_summary_block() - Initialize a slab_summary_block. + * @allocator: The allocator which owns the block. + * @index: The index of this block in its zone's summary. + * + * Return: VDO_SUCCESS or an error. 
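A rough picture of the per-slab layout implied by make_slab(): data blocks first, then the saved reference-count blocks. The sketch assumes the slab journal immediately follows the reference counts; in the driver the journal origin actually comes from vdo_get_slab_journal_start_block(), and all block counts below are invented for illustration.

#include <stdio.h>

typedef unsigned long long pbn_t;

struct toy_slab_config {
	pbn_t slab_blocks;      /* total blocks in the slab */
	pbn_t data_blocks;      /* user-data blocks at the front */
	pbn_t ref_count_blocks; /* saved reference-count blocks */
	pbn_t journal_blocks;   /* slab journal blocks at the end */
};

int main(void)
{
	struct toy_slab_config cfg = { 32768, 32256, 128, 384 };
	pbn_t slab_origin = 1 << 20; /* arbitrary example */

	printf("slab:       [%llu, %llu)\n", slab_origin, slab_origin + cfg.slab_blocks);
	printf("data:       [%llu, %llu)\n", slab_origin, slab_origin + cfg.data_blocks);
	printf("ref counts: start at %llu\n", slab_origin + cfg.data_blocks);
	printf("journal:    start at %llu (assumed data + ref counts)\n",
	       slab_origin + cfg.data_blocks + cfg.ref_count_blocks);
	return 0;
}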
+ */ +static int __must_check +initialize_slab_summary_block(struct block_allocator *allocator, block_count_t index) +{ + struct slab_summary_block *block = &allocator->summary_blocks[index]; + int result; + + result = UDS_ALLOCATE(VDO_BLOCK_SIZE, char, __func__, &block->outgoing_entries); + if (result != VDO_SUCCESS) + return result; + + result = allocate_vio_components(allocator->depot->vdo, + VIO_TYPE_SLAB_SUMMARY, + VIO_PRIORITY_METADATA, + NULL, + 1, + block->outgoing_entries, + &block->vio); + if (result != VDO_SUCCESS) + return result; + + block->allocator = allocator; + block->entries = &allocator->summary_entries[VDO_SLAB_SUMMARY_ENTRIES_PER_BLOCK * index]; + block->index = index; + return VDO_SUCCESS; +} + +static int __must_check initialize_block_allocator(struct slab_depot *depot, zone_count_t zone) +{ + int result; + block_count_t i; + struct block_allocator *allocator = &depot->allocators[zone]; + struct vdo *vdo = depot->vdo; + block_count_t max_free_blocks = depot->slab_config.data_blocks; + unsigned int max_priority = (2 + ilog2(max_free_blocks)); + + *allocator = (struct block_allocator) { + .depot = depot, + .zone_number = zone, + .thread_id = vdo->thread_config.physical_threads[zone], + .nonce = vdo->states.vdo.nonce, + }; + + INIT_LIST_HEAD(&allocator->dirty_slab_journals); + vdo_set_admin_state_code(&allocator->state, VDO_ADMIN_STATE_NORMAL_OPERATION); + result = vdo_register_read_only_listener(vdo, + allocator, + notify_block_allocator_of_read_only_mode, + allocator->thread_id); + if (result != VDO_SUCCESS) + return result; + + vdo_initialize_completion(&allocator->completion, vdo, VDO_BLOCK_ALLOCATOR_COMPLETION); + result = make_vio_pool(vdo, + BLOCK_ALLOCATOR_VIO_POOL_SIZE, + allocator->thread_id, + VIO_TYPE_SLAB_JOURNAL, + VIO_PRIORITY_METADATA, + allocator, + &allocator->vio_pool); + if (result != VDO_SUCCESS) + return result; + + result = initialize_slab_scrubber(allocator); + if (result != VDO_SUCCESS) + return result; + + result = vdo_make_priority_table(max_priority, &allocator->prioritized_slabs); + if (result != VDO_SUCCESS) + return result; + + result = UDS_ALLOCATE(VDO_SLAB_SUMMARY_BLOCKS_PER_ZONE, + struct slab_summary_block, + __func__, + &allocator->summary_blocks); + if (result != VDO_SUCCESS) + return result; + + vdo_set_admin_state_code(&allocator->summary_state, VDO_ADMIN_STATE_NORMAL_OPERATION); + allocator->summary_entries = depot->summary_entries + (MAX_VDO_SLABS * zone); + + /* Initialize each summary block. */ + for (i = 0; i < VDO_SLAB_SUMMARY_BLOCKS_PER_ZONE; i++) { + result = initialize_slab_summary_block(allocator, i); + if (result != VDO_SUCCESS) + return result; + } + + /* + * Performing well atop thin provisioned storage requires either that VDO discards freed + * blocks, or that the block allocator try to use slabs that already have allocated blocks + * in preference to slabs that have never been opened. For reasons we have not been able to + * fully understand, some SSD machines have been have been very sensitive (50% reduction in + * test throughput) to very slight differences in the timing and locality of block + * allocation. Assigning a low priority to unopened slabs (max_priority/2, say) would be + * ideal for the story, but anything less than a very high threshold (max_priority - 1) + * hurts on these machines. 
+ * + * This sets the free block threshold for preferring to open an unopened slab to the binary + * floor of 3/4ths the total number of data blocks in a slab, which will generally evaluate + * to about half the slab size. + */ + allocator->unopened_slab_priority = (1 + ilog2((max_free_blocks * 3) / 4)); + + return VDO_SUCCESS; +} + +static void uninitialize_allocator_summary(struct block_allocator *allocator) +{ + block_count_t i; + + if (allocator->summary_blocks == NULL) + return; + + for (i = 0; i < VDO_SLAB_SUMMARY_BLOCKS_PER_ZONE; i++) { + free_vio_components(&allocator->summary_blocks[i].vio); + UDS_FREE(UDS_FORGET(allocator->summary_blocks[i].outgoing_entries)); + } + + UDS_FREE(UDS_FORGET(allocator->summary_blocks)); +} + +/** + * finish_combining_zones() - Clean up after saving out the combined slab summary. + * @completion: The vio which was used to write the summary data. + */ +static void finish_combining_zones(struct vdo_completion *completion) +{ + int result = completion->result; + struct vdo_completion *parent = completion->parent; + + free_vio(as_vio(UDS_FORGET(completion))); + vdo_fail_completion(parent, result); +} + +static void handle_combining_error(struct vdo_completion *completion) +{ + vio_record_metadata_io_error(as_vio(completion)); + finish_combining_zones(completion); +} + +static void write_summary_endio(struct bio *bio) +{ + struct vio *vio = bio->bi_private; + struct vdo *vdo = vio->completion.vdo; + + continue_vio_after_io(vio, finish_combining_zones, vdo->thread_config.admin_thread); +} + +/** + * combine_summaries() - Treating the current entries buffer as the on-disk value of all zones, + * update every zone to the correct values for every slab. + * @depot: The depot whose summary entries should be combined. + */ +static void combine_summaries(struct slab_depot *depot) +{ + /* + * Combine all the old summary data into the portion of the buffer corresponding to the + * first zone. + */ + zone_count_t zone = 0; + struct slab_summary_entry *entries = depot->summary_entries; + + if (depot->old_zone_count > 1) { + slab_count_t entry_number; + + for (entry_number = 0; entry_number < MAX_VDO_SLABS; entry_number++) { + if (zone != 0) + memcpy(entries + entry_number, + entries + (zone * MAX_VDO_SLABS) + entry_number, + sizeof(struct slab_summary_entry)); + zone++; + if (zone == depot->old_zone_count) + zone = 0; + } + } + + /* Copy the combined data to each zones's region of the buffer. */ + for (zone = 1; zone < MAX_VDO_PHYSICAL_ZONES; zone++) + memcpy(entries + (zone * MAX_VDO_SLABS), + entries, + MAX_VDO_SLABS * sizeof(struct slab_summary_entry)); +} + +/** + * finish_loading_summary() - Finish loading slab summary data. + * @completion: The vio which was used to read the summary data. + * + * Combines the slab summary data from all the previously written zones and copies the combined + * summary to each partition's data region. Then writes the combined summary back out to disk. This + * callback is registered in load_summary_endio(). + */ +static void finish_loading_summary(struct vdo_completion *completion) +{ + struct slab_depot *depot = completion->vdo->depot; + + /* Combine the summary from each zone so each zone is correct for all slabs. */ + combine_summaries(depot); + + /* Write the combined summary back out. 
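The round-robin merge performed by combine_summaries() in miniature: slab n's authoritative entry was written by zone (n % old_zone_count), so it is pulled into zone 0's region of the buffer and zone 0 is then replicated to every zone. The buffer sizes and marker values here are toy numbers, and plain ints stand in for slab_summary_entry.

#include <stdio.h>
#include <string.h>

#define TOY_SLABS 8
#define TOY_ZONES 4 /* maximum zones; zone z's region is entries[z] */

static void combine(int entries[TOY_ZONES][TOY_SLABS], int old_zone_count)
{
	for (int n = 0; n < TOY_SLABS; n++) {
		int zone = n % old_zone_count;

		if (zone != 0)
			entries[0][n] = entries[zone][n];
	}

	for (int zone = 1; zone < TOY_ZONES; zone++)
		memcpy(entries[zone], entries[0], sizeof(entries[0]));
}

int main(void)
{
	int entries[TOY_ZONES][TOY_SLABS];

	/* Mark which zone originally owned each slab's entry (3 old zones). */
	for (int z = 0; z < TOY_ZONES; z++)
		for (int n = 0; n < TOY_SLABS; n++)
			entries[z][n] = (n % 3 == z) ? 100 + n : -1;

	combine(entries, 3);

	for (int n = 0; n < TOY_SLABS; n++)
		printf("slab %d -> %d\n", n, entries[1][n]); /* every zone now agrees */
	return 0;
}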
*/ + submit_metadata_vio(as_vio(completion), + depot->summary_origin, + write_summary_endio, + handle_combining_error, + REQ_OP_WRITE); +} + +static void load_summary_endio(struct bio *bio) +{ + struct vio *vio = bio->bi_private; + struct vdo *vdo = vio->completion.vdo; + + continue_vio_after_io(vio, finish_loading_summary, vdo->thread_config.admin_thread); +} + +/** + * load_slab_summary() - The preamble of a load operation. + * + * Implements vdo_action_preamble. + */ +static void load_slab_summary(void *context, struct vdo_completion *parent) +{ + int result; + struct vio *vio; + struct slab_depot *depot = context; + const struct admin_state_code *operation = + vdo_get_current_manager_operation(depot->action_manager); + + result = create_multi_block_metadata_vio(depot->vdo, + VIO_TYPE_SLAB_SUMMARY, + VIO_PRIORITY_METADATA, parent, VDO_SLAB_SUMMARY_BLOCKS, (char *) depot->summary_entries, @@ -2784,3 +4099,150 @@ static void load_slab_summary(void *context, struct vdo_completion *parent) handle_combining_error, REQ_OP_READ); } + +/** + * stop_scrubbing() - Tell the scrubber to stop scrubbing after it finishes the slab it is + * currently working on. + * @scrubber: The scrubber to stop. + * @parent: The completion to notify when scrubbing has stopped. + */ +static void stop_scrubbing(struct block_allocator *allocator) +{ + struct slab_scrubber *scrubber = &allocator->scrubber; + + if (vdo_is_state_quiescent(&scrubber->admin_state)) + vdo_finish_completion(&allocator->completion); + else + vdo_start_draining(&scrubber->admin_state, + VDO_ADMIN_STATE_SUSPENDING, + &allocator->completion, + NULL); +} + +/* Implements vdo_admin_initiator. */ +static void initiate_summary_drain(struct admin_state *state) +{ + check_summary_drain_complete(container_of(state, struct block_allocator, summary_state)); +} + +static void do_drain_step(struct vdo_completion *completion) +{ + struct block_allocator *allocator = vdo_as_block_allocator(completion); + + vdo_prepare_completion_for_requeue(&allocator->completion, + do_drain_step, + handle_operation_error, + allocator->thread_id, + NULL); + switch (++allocator->drain_step) { + case VDO_DRAIN_ALLOCATOR_STEP_SCRUBBER: + stop_scrubbing(allocator); + return; + + case VDO_DRAIN_ALLOCATOR_STEP_SLABS: + apply_to_slabs(allocator, do_drain_step); + return; + + case VDO_DRAIN_ALLOCATOR_STEP_SUMMARY: + vdo_start_draining(&allocator->summary_state, + vdo_get_admin_state_code(&allocator->state), + completion, + initiate_summary_drain); + return; + + case VDO_DRAIN_ALLOCATOR_STEP_FINISHED: + ASSERT_LOG_ONLY(!is_vio_pool_busy(allocator->vio_pool), "vio pool not busy"); + vdo_finish_draining_with_result(&allocator->state, completion->result); + return; + + default: + vdo_finish_draining_with_result(&allocator->state, UDS_BAD_STATE); + } +} + +/* Implements vdo_admin_initiator. */ +static void initiate_drain(struct admin_state *state) +{ + struct block_allocator *allocator = container_of(state, struct block_allocator, state); + + allocator->drain_step = VDO_DRAIN_ALLOCATOR_START; + do_drain_step(&allocator->completion); +} + +/** + * resume_scrubbing() - Tell the scrubber to resume scrubbing if it has been stopped. + * @allocator: The allocator being resumed. 
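The drain and resume paths walk the same step enum in opposite directions (pre-increment in do_drain_step(), pre-decrement in do_resume_step()), so components are resumed in the reverse of the order they were drained. A trivial sketch of that stepping, with the step names shortened:

#include <stdio.h>

enum drain_step {
	DRAIN_START,
	DRAIN_SCRUBBER,
	DRAIN_SLABS,
	DRAIN_SUMMARY,
	DRAIN_FINISHED,
};

static const char *names[] = {
	"start", "scrubber", "slabs", "summary", "finished",
};

int main(void)
{
	int step = DRAIN_START;

	while (step != DRAIN_FINISHED)
		printf("drain:  %s\n", names[++step]);

	while (step != DRAIN_START)
		printf("resume: %s\n", names[--step]);
	return 0;
}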
+ */ +static void resume_scrubbing(struct block_allocator *allocator) +{ + int result; + struct slab_scrubber *scrubber = &allocator->scrubber; + + if (!has_slabs_to_scrub(scrubber)) { + vdo_finish_completion(&allocator->completion); + return; + } + + result = vdo_resume_if_quiescent(&scrubber->admin_state); + if (result != VDO_SUCCESS) { + vdo_fail_completion(&allocator->completion, result); + return; + } + + scrub_next_slab(scrubber); + vdo_finish_completion(&allocator->completion); +} + +static void do_resume_step(struct vdo_completion *completion) +{ + struct block_allocator *allocator = vdo_as_block_allocator(completion); + + vdo_prepare_completion_for_requeue(&allocator->completion, + do_resume_step, + handle_operation_error, + allocator->thread_id, + NULL); + switch (--allocator->drain_step) { + case VDO_DRAIN_ALLOCATOR_STEP_SUMMARY: + vdo_fail_completion(completion, + vdo_resume_if_quiescent(&allocator->summary_state)); + return; + + case VDO_DRAIN_ALLOCATOR_STEP_SLABS: + apply_to_slabs(allocator, do_resume_step); + return; + + case VDO_DRAIN_ALLOCATOR_STEP_SCRUBBER: + resume_scrubbing(allocator); + return; + + case VDO_DRAIN_ALLOCATOR_START: + vdo_finish_resuming_with_result(&allocator->state, completion->result); + return; + + default: + vdo_finish_resuming_with_result(&allocator->state, UDS_BAD_STATE); + } +} + +/* Implements vdo_admin_initiator. */ +static void initiate_resume(struct admin_state *state) +{ + struct block_allocator *allocator = container_of(state, struct block_allocator, state); + + allocator->drain_step = VDO_DRAIN_ALLOCATOR_STEP_FINISHED; + do_resume_step(&allocator->completion); +} + +/* Implements vdo_zone_action. */ +static void resume_allocator(void *context, + zone_count_t zone_number, + struct vdo_completion *parent) +{ + struct slab_depot *depot = context; + + vdo_start_resuming(&depot->allocators[zone_number].state, + vdo_get_current_manager_operation(depot->action_manager), + parent, + initiate_resume); +} diff --git a/drivers/md/dm-vdo/slab-depot.h b/drivers/md/dm-vdo/slab-depot.h index 5e06746c7e71..6ec4be7b5822 100644 --- a/drivers/md/dm-vdo/slab-depot.h +++ b/drivers/md/dm-vdo/slab-depot.h @@ -257,6 +257,54 @@ struct vdo_slab { struct reference_block *reference_blocks; }; +enum block_allocator_drain_step { + VDO_DRAIN_ALLOCATOR_START, + VDO_DRAIN_ALLOCATOR_STEP_SCRUBBER, + VDO_DRAIN_ALLOCATOR_STEP_SLABS, + VDO_DRAIN_ALLOCATOR_STEP_SUMMARY, + VDO_DRAIN_ALLOCATOR_STEP_FINISHED, +}; + +struct slab_scrubber { + /* The queue of slabs to scrub first */ + struct list_head high_priority_slabs; + /* The queue of slabs to scrub once there are no high_priority_slabs */ + struct list_head slabs; + /* The queue of VIOs waiting for a slab to be scrubbed */ + struct wait_queue waiters; + + /* + * The number of slabs that are unrecovered or being scrubbed. This field is modified by + * the physical zone thread, but is queried by other threads. + */ + slab_count_t slab_count; + + /* The administrative state of the scrubber */ + struct admin_state admin_state; + /* Whether to only scrub high-priority slabs */ + bool high_priority_only; + /* The slab currently being scrubbed */ + struct vdo_slab *slab; + /* The vio for loading slab journal blocks */ + struct vio vio; +}; + +/* A sub-structure for applying actions in parallel to all an allocator's slabs. 
*/ +struct slab_actor { + /* The number of slabs performing a slab action */ + slab_count_t slab_action_count; + /* The method to call when a slab action has been completed by all slabs */ + vdo_action *callback; +}; + +/* A slab_iterator is a structure for iterating over a set of slabs. */ +struct slab_iterator { + struct vdo_slab **slabs; + struct vdo_slab *next; + slab_count_t end; + slab_count_t stride; +}; + /* * The slab_summary provides hints during load and recovery about the state of the slabs in order * to avoid the need to read the slab journals in their entirety before a VDO can come online. @@ -314,6 +362,81 @@ struct atomic_slab_summary_statistics { atomic64_t blocks_written; }; +struct block_allocator { + struct vdo_completion completion; + /* The slab depot for this allocator */ + struct slab_depot *depot; + /* The nonce of the VDO */ + nonce_t nonce; + /* The physical zone number of this allocator */ + zone_count_t zone_number; + /* The thread ID for this allocator's physical zone */ + thread_id_t thread_id; + /* The number of slabs in this allocator */ + slab_count_t slab_count; + /* The number of the last slab owned by this allocator */ + slab_count_t last_slab; + /* The reduced priority level used to preserve unopened slabs */ + unsigned int unopened_slab_priority; + /* The state of this allocator */ + struct admin_state state; + /* The actor for applying an action to all slabs */ + struct slab_actor slab_actor; + + /* The slab from which blocks are currently being allocated */ + struct vdo_slab *open_slab; + /* A priority queue containing all slabs available for allocation */ + struct priority_table *prioritized_slabs; + /* The slab scrubber */ + struct slab_scrubber scrubber; + /* What phase of the close operation the allocator is to perform */ + enum block_allocator_drain_step drain_step; + + /* + * These statistics are all mutated only by the physical zone thread, but are read by other + * threads when gathering statistics for the entire depot. + */ + /* + * The count of allocated blocks in this zone. Not in block_allocator_statistics for + * historical reasons. + */ + u64 allocated_blocks; + /* Statistics for this block allocator */ + struct block_allocator_statistics statistics; + /* Cumulative statistics for the slab journals in this zone */ + struct slab_journal_statistics slab_journal_statistics; + /* Cumulative statistics for the reference counters in this zone */ + struct ref_counts_statistics ref_counts_statistics; + + /* + * This is the head of a queue of slab journals which have entries in their tail blocks + * which have not yet started to commit. When the recovery journal is under space pressure, + * slab journals which have uncommitted entries holding a lock on the recovery journal head + * are forced to commit their blocks early. This list is kept in order, with the tail + * containing the slab journal holding the most recent recovery journal lock. 
+ */ + struct list_head dirty_slab_journals; + + /* The vio pool for reading and writing block allocator metadata */ + struct vio_pool *vio_pool; + /* The dm_kcopyd client for erasing slab journals */ + struct dm_kcopyd_client *eraser; + /* Iterator over the slabs to be erased */ + struct slab_iterator slabs_to_erase; + + /* The portion of the slab summary managed by this allocator */ + /* The state of the slab summary */ + struct admin_state summary_state; + /* The number of outstanding summary writes */ + block_count_t summary_write_count; + /* The array (owned by the blocks) of all entries */ + struct slab_summary_entry *summary_entries; + /* The array of slab_summary_blocks */ + struct slab_summary_block *summary_blocks; +}; + +struct reference_updater; + bool __must_check vdo_attempt_replay_into_slab(struct vdo_slab *slab, physical_block_number_t pbn, @@ -322,6 +445,29 @@ vdo_attempt_replay_into_slab(struct vdo_slab *slab, struct journal_point *recovery_point, struct vdo_completion *parent); +static inline struct block_allocator *vdo_as_block_allocator(struct vdo_completion *completion) +{ + vdo_assert_completion_type(completion, VDO_BLOCK_ALLOCATOR_COMPLETION); + return container_of(completion, struct block_allocator, completion); +} + +int __must_check vdo_acquire_provisional_reference(struct vdo_slab *slab, + physical_block_number_t pbn, + struct pbn_lock *lock); + +int __must_check +vdo_allocate_block(struct block_allocator *allocator, physical_block_number_t *block_number_ptr); + +int vdo_enqueue_clean_slab_waiter(struct block_allocator *allocator, struct waiter *waiter); + +void vdo_modify_reference_count(struct vdo_completion *completion, + struct reference_updater *updater); + +int __must_check vdo_release_block_reference(struct block_allocator *allocator, + physical_block_number_t pbn); + void vdo_notify_slab_journals_are_recovered(struct vdo_completion *completion); +void vdo_dump_block_allocator(const struct block_allocator *allocator); + #endif /* VDO_SLAB_DEPOT_H */ From patchwork Thu Oct 26 21:41:25 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Matthew Sakai X-Patchwork-Id: 13437937 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 6DC60C27C48 for ; Thu, 26 Oct 2023 21:43:18 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S232306AbjJZVnS (ORCPT ); Thu, 26 Oct 2023 17:43:18 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:41014 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1344852AbjJZVnQ (ORCPT ); Thu, 26 Oct 2023 17:43:16 -0400 Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.129.124]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 84B49D45 for ; Thu, 26 Oct 2023 14:41:44 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1698356503; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=/iMXLHNntDucEEZtd8E1eBTpqdAdPA/fcEkpzCJcGeA=; b=ao/3trVDnikCjwJt4XMn3rJmQBm73ZzHBTv9dH7Sugb1qisTqJLW+VcsyJir+RSdRLTLS6 
Oze1wtQG41Zx1CdEULeK/spMScsX41/lkpYHBQRF9RLGqY0Ro9zQUjBagAGNuuIzEZ4u7/ szDO0zqIDL6XSjicMY5tDdxXHDKHBUk= Received: from mimecast-mx02.redhat.com (mimecast-mx02.redhat.com [66.187.233.88]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.3, cipher=TLS_AES_256_GCM_SHA384) id us-mta-205-xrRbsqJvNFOZgZLD6KSZ4w-1; Thu, 26 Oct 2023 17:41:42 -0400 X-MC-Unique: xrRbsqJvNFOZgZLD6KSZ4w-1 Received: from smtp.corp.redhat.com (int-mx05.intmail.prod.int.rdu2.redhat.com [10.11.54.5]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (2048 bits) server-digest SHA256) (No client certificate requested) by mimecast-mx02.redhat.com (Postfix) with ESMTPS id 1116880F923; Thu, 26 Oct 2023 21:41:42 +0000 (UTC) Received: from pbitcolo-build-10.permabit.com (pbitcolo-build-10.permabit.lab.eng.bos.redhat.com [10.19.117.76]) by smtp.corp.redhat.com (Postfix) with ESMTP id ECE1C2949; Thu, 26 Oct 2023 21:41:41 +0000 (UTC) Received: by pbitcolo-build-10.permabit.com (Postfix, from userid 1138) id A22003003C; Thu, 26 Oct 2023 17:41:41 -0400 (EDT) From: Matthew Sakai To: dm-devel@lists.linux.dev, linux-block@vger.kernel.org Cc: Matthew Sakai Subject: [PATCH v4 28/39] dm vdo: add the slab depot Date: Thu, 26 Oct 2023 17:41:25 -0400 Message-Id: <20231026214136.1067410-29-msakai@redhat.com> In-Reply-To: <20231026214136.1067410-1-msakai@redhat.com> References: <20231026214136.1067410-1-msakai@redhat.com> MIME-Version: 1.0 X-Scanned-By: MIMEDefang 3.4.1 on 10.11.54.5 Precedence: bulk List-ID: X-Mailing-List: linux-block@vger.kernel.org Add the data and methods that implement the slab_depot that manages the allocation of slabs of blocks added by the preceding patches. Co-developed-by: J. corwin Coburn Signed-off-by: J. corwin Coburn Co-developed-by: Michael Sclafani Signed-off-by: Michael Sclafani Co-developed-by: Sweet Tea Dorminy Signed-off-by: Sweet Tea Dorminy Signed-off-by: Matthew Sakai --- drivers/md/dm-vdo/slab-depot.c | 965 +++++++++++++++++++++++++++++++++ drivers/md/dm-vdo/slab-depot.h | 121 +++++ 2 files changed, 1086 insertions(+) diff --git a/drivers/md/dm-vdo/slab-depot.c b/drivers/md/dm-vdo/slab-depot.c index dbed9c3c8a62..ba9cdb720506 100644 --- a/drivers/md/dm-vdo/slab-depot.c +++ b/drivers/md/dm-vdo/slab-depot.c @@ -3070,6 +3070,32 @@ static void register_slab_with_allocator(struct block_allocator *allocator, stru allocator->last_slab = slab->slab_number; } +/** + * get_depot_slab_iterator() - Return a slab_iterator over the slabs in a slab_depot. + * @depot: The depot over which to iterate. + * @start: The number of the slab to start iterating from. + * @end: The number of the last slab which may be returned. + * @stride: The difference in slab number between successive slabs. + * + * Iteration always occurs from higher to lower numbered slabs. + * + * Return: An initialized iterator structure. + */ +static struct slab_iterator get_depot_slab_iterator(struct slab_depot *depot, + slab_count_t start, + slab_count_t end, + slab_count_t stride) +{ + struct vdo_slab **slabs = depot->slabs; + + return (struct slab_iterator) { + .slabs = slabs, + .next = (((slabs == NULL) || (start < end)) ? NULL : slabs[start]), + .end = end, + .stride = stride, + }; +} + static struct slab_iterator get_slab_iterator(const struct block_allocator *allocator) { return get_depot_slab_iterator(allocator->depot, @@ -3806,6 +3832,171 @@ make_slab(physical_block_number_t slab_origin, return VDO_SUCCESS; } +/** + * allocate_slabs() - Allocate a new slab pointer array. 
+ * @depot: The depot. + * @slab_count: The number of slabs the depot should have in the new array. + * + * Any existing slab pointers will be copied into the new array, and slabs will be allocated as + * needed. The newly allocated slabs will not be distributed for use by the block allocators. + * + * Return: VDO_SUCCESS or an error code. + */ +static int allocate_slabs(struct slab_depot *depot, slab_count_t slab_count) +{ + block_count_t slab_size; + bool resizing = false; + physical_block_number_t slab_origin; + int result; + + result = UDS_ALLOCATE(slab_count, + struct vdo_slab *, + "slab pointer array", + &depot->new_slabs); + if (result != VDO_SUCCESS) + return result; + + if (depot->slabs != NULL) { + memcpy(depot->new_slabs, + depot->slabs, + depot->slab_count * sizeof(struct vdo_slab *)); + resizing = true; + } + + slab_size = depot->slab_config.slab_blocks; + slab_origin = depot->first_block + (depot->slab_count * slab_size); + + for (depot->new_slab_count = depot->slab_count; + depot->new_slab_count < slab_count; + depot->new_slab_count++, slab_origin += slab_size) { + struct block_allocator *allocator = + &depot->allocators[depot->new_slab_count % depot->zone_count]; + struct vdo_slab **slab_ptr = &depot->new_slabs[depot->new_slab_count]; + + result = make_slab(slab_origin, + allocator, + depot->new_slab_count, + resizing, + slab_ptr); + if (result != VDO_SUCCESS) + return result; + } + + return VDO_SUCCESS; +} + +/** + * vdo_abandon_new_slabs() - Abandon any new slabs in this depot, freeing them as needed. + * @depot: The depot. + */ +void vdo_abandon_new_slabs(struct slab_depot *depot) +{ + slab_count_t i; + + if (depot->new_slabs == NULL) + return; + + for (i = depot->slab_count; i < depot->new_slab_count; i++) + free_slab(UDS_FORGET(depot->new_slabs[i])); + depot->new_slab_count = 0; + depot->new_size = 0; + UDS_FREE(UDS_FORGET(depot->new_slabs)); +} + +/** + * get_allocator_thread_id() - Get the ID of the thread on which a given allocator operates. + * + * Implements vdo_zone_thread_getter. + */ +static thread_id_t get_allocator_thread_id(void *context, zone_count_t zone_number) +{ + return ((struct slab_depot *) context)->allocators[zone_number].thread_id; +} + +/** + * release_recovery_journal_lock() - Request the slab journal to release the recovery journal lock + * it may hold on a specified recovery journal block. + * @journal: The slab journal. + * @recovery_lock: The sequence number of the recovery journal block whose locks should be + * released. + * + * Return: true if the journal does hold a lock on the specified block (which it will release). + */ +static bool __must_check +release_recovery_journal_lock(struct slab_journal *journal, sequence_number_t recovery_lock) +{ + if (recovery_lock > journal->recovery_lock) { + ASSERT_LOG_ONLY((recovery_lock < journal->recovery_lock), + "slab journal recovery lock is not older than the recovery journal head"); + return false; + } + + if ((recovery_lock < journal->recovery_lock) || + vdo_is_read_only(journal->slab->allocator->depot->vdo)) + return false; + + /* All locks are held by the block which is in progress; write it. */ + commit_tail(journal); + return true; +} + +/* + * Request a commit of all dirty tail blocks which are locking the recovery journal block the depot + * is seeking to release. + * + * Implements vdo_zone_action. 
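+ *
+ * Runs on each allocator's thread in turn. The walk of that zone's dirty slab journals stops
+ * at the first journal which is not holding (and therefore does not release) a lock on the
+ * recovery journal block named by depot->active_release_request.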
+ */ +static void release_tail_block_locks(void *context, + zone_count_t zone_number, + struct vdo_completion *parent) +{ + struct slab_journal *journal, *tmp; + struct slab_depot *depot = context; + struct list_head *list = &depot->allocators[zone_number].dirty_slab_journals; + + list_for_each_entry_safe(journal, tmp, list, dirty_entry) { + if (!release_recovery_journal_lock(journal, depot->active_release_request)) + break; + } + + vdo_finish_completion(parent); +} + +/** + * prepare_for_tail_block_commit() - Prepare to commit oldest tail blocks. + * + * Implements vdo_action_preamble. + */ +static void prepare_for_tail_block_commit(void *context, struct vdo_completion *parent) +{ + struct slab_depot *depot = context; + + depot->active_release_request = depot->new_release_request; + vdo_finish_completion(parent); +} + +/** + * schedule_tail_block_commit() - Schedule a tail block commit if necessary. + * + * This method should not be called directly. Rather, call vdo_schedule_default_action() on the + * depot's action manager. + * + * Implements vdo_action_scheduler. + */ +static bool schedule_tail_block_commit(void *context) +{ + struct slab_depot *depot = context; + + if (depot->new_release_request == depot->active_release_request) + return false; + + return vdo_schedule_action(depot->action_manager, + prepare_for_tail_block_commit, + release_tail_block_locks, + NULL, + NULL); +} + /** * initialize_slab_scrubber() - Initialize an allocator's slab scrubber. * @allocator: The allocator being initialized @@ -3954,6 +4145,151 @@ static int __must_check initialize_block_allocator(struct slab_depot *depot, zon return VDO_SUCCESS; } +static int allocate_components(struct slab_depot *depot, + struct partition *summary_partition) +{ + int result; + zone_count_t zone; + slab_count_t slab_count; + u8 hint; + u32 i; + const struct thread_config *thread_config = &depot->vdo->thread_config; + + result = vdo_make_action_manager(depot->zone_count, + get_allocator_thread_id, + thread_config->journal_thread, + depot, + schedule_tail_block_commit, + depot->vdo, + &depot->action_manager); + if (result != VDO_SUCCESS) + return result; + + depot->origin = depot->first_block; + + /* block size must be a multiple of entry size */ + STATIC_ASSERT((VDO_BLOCK_SIZE % sizeof(struct slab_summary_entry)) == 0); + + depot->summary_origin = summary_partition->offset; + depot->hint_shift = vdo_get_slab_summary_hint_shift(depot->slab_size_shift); + result = UDS_ALLOCATE(MAXIMUM_VDO_SLAB_SUMMARY_ENTRIES, + struct slab_summary_entry, + __func__, + &depot->summary_entries); + if (result != VDO_SUCCESS) + return result; + + + /* Initialize all the entries. */ + hint = compute_fullness_hint(depot, depot->slab_config.data_blocks); + for (i = 0; i < MAXIMUM_VDO_SLAB_SUMMARY_ENTRIES; i++) { + /* + * This default tail block offset must be reflected in + * slabJournal.c::read_slab_journal_tail(). + */ + depot->summary_entries[i] = (struct slab_summary_entry) { + .tail_block_offset = 0, + .fullness_hint = hint, + .load_ref_counts = false, + .is_dirty = false, + }; + } + + if (result != VDO_SUCCESS) + return result; + + slab_count = vdo_compute_slab_count(depot->first_block, + depot->last_block, + depot->slab_size_shift); + if (thread_config->physical_zone_count > slab_count) + return uds_log_error_strerror(VDO_BAD_CONFIGURATION, + "%u physical zones exceeds slab count %u", + thread_config->physical_zone_count, + slab_count); + + /* Initialize the block allocators. 
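One block allocator is set up per physical zone, each bound to its own zone thread with its own vio pool and scrubber.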
*/ + for (zone = 0; zone < depot->zone_count; zone++) { + result = initialize_block_allocator(depot, zone); + if (result != VDO_SUCCESS) + return result; + } + + /* Allocate slabs. */ + result = allocate_slabs(depot, slab_count); + if (result != VDO_SUCCESS) + return result; + + /* Use the new slabs. */ + for (i = depot->slab_count; i < depot->new_slab_count; i++) { + struct vdo_slab *slab = depot->new_slabs[i]; + + register_slab_with_allocator(slab->allocator, slab); + WRITE_ONCE(depot->slab_count, depot->slab_count + 1); + } + + depot->slabs = depot->new_slabs; + depot->new_slabs = NULL; + depot->new_slab_count = 0; + + return VDO_SUCCESS; +} + +/** + * vdo_decode_slab_depot() - Make a slab depot and configure it with the state read from the super + * block. + * @state: The slab depot state from the super block. + * @vdo: The VDO which will own the depot. + * @summary_partition: The partition which holds the slab summary. + * @depot_ptr: A pointer to hold the depot. + * + * Return: A success or error code. + */ +int vdo_decode_slab_depot(struct slab_depot_state_2_0 state, + struct vdo *vdo, + struct partition *summary_partition, + struct slab_depot **depot_ptr) +{ + unsigned int slab_size_shift; + struct slab_depot *depot; + int result; + + /* + * Calculate the bit shift for efficiently mapping block numbers to slabs. Using a shift + * requires that the slab size be a power of two. + */ + block_count_t slab_size = state.slab_config.slab_blocks; + + if (!is_power_of_2(slab_size)) + return uds_log_error_strerror(UDS_INVALID_ARGUMENT, + "slab size must be a power of two"); + slab_size_shift = ilog2(slab_size); + + result = UDS_ALLOCATE_EXTENDED(struct slab_depot, + vdo->thread_config.physical_zone_count, + struct block_allocator, + __func__, + &depot); + if (result != VDO_SUCCESS) + return result; + + depot->vdo = vdo; + depot->old_zone_count = state.zone_count; + depot->zone_count = vdo->thread_config.physical_zone_count; + depot->slab_config = state.slab_config; + depot->first_block = state.first_block; + depot->last_block = state.last_block; + depot->slab_size_shift = slab_size_shift; + + result = allocate_components(depot, summary_partition); + if (result != VDO_SUCCESS) { + vdo_free_slab_depot(depot); + return result; + } + + *depot_ptr = depot; + return VDO_SUCCESS; +} + static void uninitialize_allocator_summary(struct block_allocator *allocator) { block_count_t i; @@ -3969,6 +4305,229 @@ static void uninitialize_allocator_summary(struct block_allocator *allocator) UDS_FREE(UDS_FORGET(allocator->summary_blocks)); } +/** + * vdo_free_slab_depot() - Destroy a slab depot. + * @depot: The depot to destroy. 
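+ *
+ * Abandons any slabs allocated for an in-progress resize, releases each allocator's per-zone
+ * resources (eraser, summary blocks, scrubber vio, vio pool, and priority table), and then
+ * frees the slabs themselves along with the depot's arrays.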
+ */ +void vdo_free_slab_depot(struct slab_depot *depot) +{ + zone_count_t zone = 0; + + if (depot == NULL) + return; + + vdo_abandon_new_slabs(depot); + + for (zone = 0; zone < depot->zone_count; zone++) { + struct block_allocator *allocator = &depot->allocators[zone]; + + if (allocator->eraser != NULL) + dm_kcopyd_client_destroy(UDS_FORGET(allocator->eraser)); + + uninitialize_allocator_summary(allocator); + uninitialize_scrubber_vio(&allocator->scrubber); + free_vio_pool(UDS_FORGET(allocator->vio_pool)); + vdo_free_priority_table(UDS_FORGET(allocator->prioritized_slabs)); + } + + if (depot->slabs != NULL) { + slab_count_t i; + + for (i = 0; i < depot->slab_count; i++) + free_slab(UDS_FORGET(depot->slabs[i])); + } + + UDS_FREE(UDS_FORGET(depot->slabs)); + UDS_FREE(UDS_FORGET(depot->action_manager)); + UDS_FREE(UDS_FORGET(depot->summary_entries)); + UDS_FREE(depot); +} + +/** + * vdo_record_slab_depot() - Record the state of a slab depot for encoding into the super block. + * @depot: The depot to encode. + * + * Return: The depot state. + */ +struct slab_depot_state_2_0 vdo_record_slab_depot(const struct slab_depot *depot) +{ + /* + * If this depot is currently using 0 zones, it must have been synchronously loaded by a + * tool and is now being saved. We did not load and combine the slab summary, so we still + * need to do that next time we load with the old zone count rather than 0. + */ + struct slab_depot_state_2_0 state; + zone_count_t zones_to_record = depot->zone_count; + + if (depot->zone_count == 0) + zones_to_record = depot->old_zone_count; + + state = (struct slab_depot_state_2_0) { + .slab_config = depot->slab_config, + .first_block = depot->first_block, + .last_block = depot->last_block, + .zone_count = zones_to_record, + }; + + return state; +} + +/** + * vdo_allocate_reference_counters() - Allocate the reference counters for all slabs in the depot. + * + * Context: This method may be called only before entering normal operation from the load thread. + * + * Return: VDO_SUCCESS or an error. + */ +int vdo_allocate_reference_counters(struct slab_depot *depot) +{ + struct slab_iterator iterator = + get_depot_slab_iterator(depot, depot->slab_count - 1, 0, 1); + + while (iterator.next != NULL) { + int result = allocate_slab_counters(next_slab(&iterator)); + + if (result != VDO_SUCCESS) + return result; + } + + return VDO_SUCCESS; +} + +/** + * get_slab_number() - Get the number of the slab that contains a specified block. + * @depot: The slab depot. + * @pbn: The physical block number. + * @slab_number_ptr: A pointer to hold the slab number. + * + * Return: VDO_SUCCESS or an error. + */ +static int __must_check get_slab_number(const struct slab_depot *depot, + physical_block_number_t pbn, + slab_count_t *slab_number_ptr) +{ + slab_count_t slab_number; + + if (pbn < depot->first_block) + return VDO_OUT_OF_RANGE; + + slab_number = (pbn - depot->first_block) >> depot->slab_size_shift; + if (slab_number >= depot->slab_count) + return VDO_OUT_OF_RANGE; + + *slab_number_ptr = slab_number; + return VDO_SUCCESS; +} + +/** + * vdo_get_slab() - Get the slab object for the slab that contains a specified block. + * @depot: The slab depot. + * @pbn: The physical block number. + * + * Will put the VDO in read-only mode if the PBN is not a valid data block nor the zero block. + * + * Return: The slab containing the block, or NULL if the block number is the zero block or + * otherwise out of range. 
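+ *
+ * The containing slab is located by index: (pbn - depot->first_block) >> depot->slab_size_shift.
+ * This is why the slab size is required to be a power of two.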
+ */ +struct vdo_slab *vdo_get_slab(const struct slab_depot *depot, physical_block_number_t pbn) +{ + slab_count_t slab_number; + int result; + + if (pbn == VDO_ZERO_BLOCK) + return NULL; + + result = get_slab_number(depot, pbn, &slab_number); + if (result != VDO_SUCCESS) { + vdo_enter_read_only_mode(depot->vdo, result); + return NULL; + } + + return depot->slabs[slab_number]; +} + +/** + * vdo_get_increment_limit() - Determine how many new references a block can acquire. + * @depot: The slab depot. + * @pbn: The physical block number that is being queried. + * + * Context: This method must be called from the physical zone thread of the PBN. + * + * Return: The number of available references. + */ +u8 vdo_get_increment_limit(struct slab_depot *depot, physical_block_number_t pbn) +{ + struct vdo_slab *slab = vdo_get_slab(depot, pbn); + vdo_refcount_t *counter_ptr = NULL; + int result; + + if ((slab == NULL) || (slab->status != VDO_SLAB_REBUILT)) + return 0; + + result = get_reference_counter(slab, pbn, &counter_ptr); + if (result != VDO_SUCCESS) + return 0; + + if (*counter_ptr == PROVISIONAL_REFERENCE_COUNT) + return (MAXIMUM_REFERENCE_COUNT - 1); + + return (MAXIMUM_REFERENCE_COUNT - *counter_ptr); +} + +/** + * vdo_is_physical_data_block() - Determine whether the given PBN refers to a data block. + * @depot: The depot. + * @pbn: The physical block number to ask about. + * + * Return: True if the PBN corresponds to a data block. + */ +bool vdo_is_physical_data_block(const struct slab_depot *depot, physical_block_number_t pbn) +{ + slab_count_t slab_number; + slab_block_number sbn; + + return ((pbn == VDO_ZERO_BLOCK) || + ((get_slab_number(depot, pbn, &slab_number) == VDO_SUCCESS) && + (slab_block_number_from_pbn(depot->slabs[slab_number], pbn, &sbn) == + VDO_SUCCESS))); +} + +/** + * vdo_get_slab_depot_allocated_blocks() - Get the total number of data blocks allocated across all + * the slabs in the depot. + * @depot: The slab depot. + * + * This is the total number of blocks with a non-zero reference count. + * + * Context: This may be called from any thread. + * + * Return: The total number of blocks with a non-zero reference count. + */ +block_count_t vdo_get_slab_depot_allocated_blocks(const struct slab_depot *depot) +{ + block_count_t total = 0; + zone_count_t zone; + + for (zone = 0; zone < depot->zone_count; zone++) + /* The allocators are responsible for thread safety. */ + total += READ_ONCE(depot->allocators[zone].allocated_blocks); + return total; +} + +/** + * vdo_get_slab_depot_data_blocks() - Get the total number of data blocks in all the slabs in the + * depot. + * @depot: The slab depot. + * + * Context: This may be called from any thread. + * + * Return: The total number of data blocks in all slabs. + */ +block_count_t vdo_get_slab_depot_data_blocks(const struct slab_depot *depot) +{ + return (READ_ONCE(depot->slab_count) * depot->slab_config.data_blocks); +} + /** * finish_combining_zones() - Clean up after saving out the combined slab summary. * @completion: The vio which was used to write the summary data. @@ -4100,6 +4659,193 @@ static void load_slab_summary(void *context, struct vdo_completion *parent) REQ_OP_READ); } +/* Implements vdo_zone_action. 
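Starts loading one zone's allocator state on that zone's thread.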
*/ +static void load_allocator(void *context, zone_count_t zone_number, struct vdo_completion *parent) +{ + struct slab_depot *depot = context; + + vdo_start_loading(&depot->allocators[zone_number].state, + vdo_get_current_manager_operation(depot->action_manager), + parent, + initiate_load); +} + +/** + * vdo_load_slab_depot() - Asynchronously load any slab depot state that isn't included in the + * super_block component. + * @depot: The depot to load. + * @operation: The type of load to perform. + * @parent: The completion to notify when the load is complete. + * @context: Additional context for the load operation; may be NULL. + * + * This method may be called only before entering normal operation from the load thread. + */ +void vdo_load_slab_depot(struct slab_depot *depot, + const struct admin_state_code *operation, + struct vdo_completion *parent, + void *context) +{ + if (vdo_assert_load_operation(operation, parent)) + vdo_schedule_operation_with_context(depot->action_manager, + operation, + load_slab_summary, + load_allocator, + NULL, + context, + parent); +} + +/* Implements vdo_zone_action. */ +static void prepare_to_allocate(void *context, + zone_count_t zone_number, + struct vdo_completion *parent) +{ + struct slab_depot *depot = context; + struct block_allocator *allocator = &depot->allocators[zone_number]; + int result; + + result = vdo_prepare_slabs_for_allocation(allocator); + if (result != VDO_SUCCESS) { + vdo_fail_completion(parent, result); + return; + } + + scrub_slabs(allocator, parent); +} + +/** + * vdo_prepare_slab_depot_to_allocate() - Prepare the slab depot to come online and start + * allocating blocks. + * @depot: The depot to prepare. + * @load_type: The load type. + * @parent: The completion to notify when the operation is complete. + * + * This method may be called only before entering normal operation from the load thread. It must be + * called before allocation may proceed. + */ +void vdo_prepare_slab_depot_to_allocate(struct slab_depot *depot, + enum slab_depot_load_type load_type, + struct vdo_completion *parent) +{ + depot->load_type = load_type; + atomic_set(&depot->zones_to_scrub, depot->zone_count); + vdo_schedule_action(depot->action_manager, NULL, prepare_to_allocate, NULL, parent); +} + +/** + * vdo_update_slab_depot_size() - Update the slab depot to reflect its new size in memory. + * @depot: The depot to update. + * + * This size is saved to disk as part of the super block. + */ +void vdo_update_slab_depot_size(struct slab_depot *depot) +{ + depot->last_block = depot->new_last_block; +} + +/** + * vdo_prepare_to_grow_slab_depot() - Allocate new memory needed for a resize of a slab depot to + * the given size. + * @depot: The depot to prepare to resize. + * @partition: The new depot partition + * + * Return: VDO_SUCCESS or an error. + */ +int vdo_prepare_to_grow_slab_depot(struct slab_depot *depot, const struct partition *partition) +{ + struct slab_depot_state_2_0 new_state; + int result; + slab_count_t new_slab_count; + + if ((partition->count >> depot->slab_size_shift) <= depot->slab_count) + return VDO_INCREMENT_TOO_SMALL; + + /* Generate the depot configuration for the new block count. 
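The resize must add at least one whole slab, and the partition origin must not move.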
*/ + ASSERT_LOG_ONLY(depot->first_block == partition->offset, + "New slab depot partition doesn't change origin"); + result = vdo_configure_slab_depot(partition, + depot->slab_config, + depot->zone_count, + &new_state); + if (result != VDO_SUCCESS) + return result; + + new_slab_count = vdo_compute_slab_count(depot->first_block, + new_state.last_block, + depot->slab_size_shift); + if (new_slab_count <= depot->slab_count) + return uds_log_error_strerror(VDO_INCREMENT_TOO_SMALL, "Depot can only grow"); + if (new_slab_count == depot->new_slab_count) + /* Check it out, we've already got all the new slabs allocated! */ + return VDO_SUCCESS; + + vdo_abandon_new_slabs(depot); + result = allocate_slabs(depot, new_slab_count); + if (result != VDO_SUCCESS) { + vdo_abandon_new_slabs(depot); + return result; + } + + depot->new_size = partition->count; + depot->old_last_block = depot->last_block; + depot->new_last_block = new_state.last_block; + + return VDO_SUCCESS; +} + +/** + * finish_registration() - Finish registering new slabs now that all of the allocators have + * received their new slabs. + * + * Implements vdo_action_conclusion. + */ +static int finish_registration(void *context) +{ + struct slab_depot *depot = context; + + WRITE_ONCE(depot->slab_count, depot->new_slab_count); + UDS_FREE(depot->slabs); + depot->slabs = depot->new_slabs; + depot->new_slabs = NULL; + depot->new_slab_count = 0; + return VDO_SUCCESS; +} + +/* Implements vdo_zone_action. */ +static void register_new_slabs(void *context, + zone_count_t zone_number, + struct vdo_completion *parent) +{ + struct slab_depot *depot = context; + struct block_allocator *allocator = &depot->allocators[zone_number]; + slab_count_t i; + + for (i = depot->slab_count; i < depot->new_slab_count; i++) { + struct vdo_slab *slab = depot->new_slabs[i]; + + if (slab->allocator == allocator) + register_slab_with_allocator(allocator, slab); + } + + vdo_finish_completion(parent); +} + +/** + * vdo_use_new_slabs() - Use the new slabs allocated for resize. + * @depot: The depot. + * @parent: The object to notify when complete. + */ +void vdo_use_new_slabs(struct slab_depot *depot, struct vdo_completion *parent) +{ + ASSERT_LOG_ONLY(depot->new_slabs != NULL, "Must have new slabs to use"); + vdo_schedule_operation(depot->action_manager, + VDO_ADMIN_STATE_SUSPENDED_OPERATION, + NULL, + register_new_slabs, + finish_registration, + parent); +} + /** * stop_scrubbing() - Tell the scrubber to stop scrubbing after it finishes the slab it is * currently working on. @@ -4169,6 +4915,43 @@ static void initiate_drain(struct admin_state *state) do_drain_step(&allocator->completion); } +/* + * Drain all allocator I/O. Depending upon the type of drain, some or all dirty metadata may be + * written to disk. The type of drain will be determined from the state of the allocator's depot. + * + * Implements vdo_zone_action. + */ +static void drain_allocator(void *context, zone_count_t zone_number, struct vdo_completion *parent) +{ + struct slab_depot *depot = context; + + vdo_start_draining(&depot->allocators[zone_number].state, + vdo_get_current_manager_operation(depot->action_manager), + parent, + initiate_drain); +} + +/** + * vdo_drain_slab_depot() - Drain all slab depot I/O. + * @depot: The depot to drain. + * @operation: The drain operation (flush, rebuild, suspend, or save). + * @parent: The completion to finish when the drain is complete. + * + * If saving, or flushing, all dirty depot metadata will be written out. 
If saving or suspending, + * the depot will be left in a suspended state. + */ +void vdo_drain_slab_depot(struct slab_depot *depot, + const struct admin_state_code *operation, + struct vdo_completion *parent) +{ + vdo_schedule_operation(depot->action_manager, + operation, + NULL, + drain_allocator, + NULL, + parent); +} + /** * resume_scrubbing() - Tell the scrubber to resume scrubbing if it has been stopped. * @allocator: The allocator being resumed. @@ -4246,3 +5029,185 @@ static void resume_allocator(void *context, parent, initiate_resume); } + +/** + * vdo_resume_slab_depot() - Resume a suspended slab depot. + * @depot: The depot to resume. + * @parent: The completion to finish when the depot has resumed. + */ +void vdo_resume_slab_depot(struct slab_depot *depot, struct vdo_completion *parent) +{ + if (vdo_is_read_only(depot->vdo)) { + vdo_continue_completion(parent, VDO_READ_ONLY); + return; + } + + vdo_schedule_operation(depot->action_manager, + VDO_ADMIN_STATE_RESUMING, + NULL, + resume_allocator, + NULL, + parent); +} + +/** + * vdo_commit_oldest_slab_journal_tail_blocks() - Commit all dirty tail blocks which are locking a + * given recovery journal block. + * @depot: The depot. + * @recovery_block_number: The sequence number of the recovery journal block whose locks should be + * released. + * + * Context: This method must be called from the journal zone thread. + */ +void vdo_commit_oldest_slab_journal_tail_blocks(struct slab_depot *depot, + sequence_number_t recovery_block_number) +{ + if (depot == NULL) + return; + + depot->new_release_request = recovery_block_number; + vdo_schedule_default_action(depot->action_manager); +} + +/* Implements vdo_zone_action. */ +static void scrub_all_unrecovered_slabs(void *context, + zone_count_t zone_number, + struct vdo_completion *parent) +{ + struct slab_depot *depot = context; + + scrub_slabs(&depot->allocators[zone_number], NULL); + vdo_launch_completion(parent); +} + +/** + * vdo_scrub_all_unrecovered_slabs() - Scrub all unrecovered slabs. + * @depot: The depot to scrub. + * @parent: The object to notify when scrubbing has been launched for all zones. + */ +void vdo_scrub_all_unrecovered_slabs(struct slab_depot *depot, struct vdo_completion *parent) +{ + vdo_schedule_action(depot->action_manager, + NULL, + scrub_all_unrecovered_slabs, + NULL, + parent); +} + +/** + * get_block_allocator_statistics() - Get the total of the statistics from all the block allocators + * in the depot. + * @depot: The slab depot. + * + * Return: The statistics from all block allocators in the depot. + */ +static struct block_allocator_statistics __must_check +get_block_allocator_statistics(const struct slab_depot *depot) +{ + struct block_allocator_statistics totals; + zone_count_t zone; + + memset(&totals, 0, sizeof(totals)); + + for (zone = 0; zone < depot->zone_count; zone++) { + const struct block_allocator *allocator = &depot->allocators[zone]; + const struct block_allocator_statistics *stats = &allocator->statistics; + + totals.slab_count += allocator->slab_count; + totals.slabs_opened += READ_ONCE(stats->slabs_opened); + totals.slabs_reopened += READ_ONCE(stats->slabs_reopened); + } + + return totals; +} + +/** + * get_ref_counts_statistics() - Get the cumulative ref_counts statistics for the depot. + * @depot: The slab depot. + * + * Return: The cumulative statistics for all ref_counts in the depot. 
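+ *
+ * The blocks_written counts are summed across zones using READ_ONCE(), since each allocator
+ * updates its own counter from its own thread.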
+ */ +static struct ref_counts_statistics __must_check +get_ref_counts_statistics(const struct slab_depot *depot) +{ + struct ref_counts_statistics totals; + zone_count_t zone; + + memset(&totals, 0, sizeof(totals)); + + for (zone = 0; zone < depot->zone_count; zone++) { + totals.blocks_written += + READ_ONCE(depot->allocators[zone].ref_counts_statistics.blocks_written); + } + + return totals; +} + +/** + * get_depot_slab_journal_statistics() - Get the aggregated slab journal statistics for the depot. + * @depot: The slab depot. + * + * Return: The aggregated statistics for all slab journals in the depot. + */ +static struct slab_journal_statistics __must_check +get_slab_journal_statistics(const struct slab_depot *depot) +{ + struct slab_journal_statistics totals; + zone_count_t zone; + + memset(&totals, 0, sizeof(totals)); + + for (zone = 0; zone < depot->zone_count; zone++) { + const struct slab_journal_statistics *stats = + &depot->allocators[zone].slab_journal_statistics; + + totals.disk_full_count += READ_ONCE(stats->disk_full_count); + totals.flush_count += READ_ONCE(stats->flush_count); + totals.blocked_count += READ_ONCE(stats->blocked_count); + totals.blocks_written += READ_ONCE(stats->blocks_written); + totals.tail_busy_count += READ_ONCE(stats->tail_busy_count); + } + + return totals; +} + +/** + * vdo_get_slab_depot_statistics() - Get all the vdo_statistics fields that are properties of the + * slab depot. + * @depot: The slab depot. + * @stats: The vdo statistics structure to partially fill. + */ +void vdo_get_slab_depot_statistics(const struct slab_depot *depot, struct vdo_statistics *stats) +{ + slab_count_t slab_count = READ_ONCE(depot->slab_count); + slab_count_t unrecovered = 0; + zone_count_t zone; + + for (zone = 0; zone < depot->zone_count; zone++) { + /* The allocators are responsible for thread safety. */ + unrecovered += READ_ONCE(depot->allocators[zone].scrubber.slab_count); + } + + stats->recovery_percentage = (slab_count - unrecovered) * 100 / slab_count; + stats->allocator = get_block_allocator_statistics(depot); + stats->ref_counts = get_ref_counts_statistics(depot); + stats->slab_journal = get_slab_journal_statistics(depot); + stats->slab_summary = (struct slab_summary_statistics) { + .blocks_written = atomic64_read(&depot->summary_statistics.blocks_written), + }; +} + +/** + * vdo_dump_slab_depot() - Dump the slab depot, in a thread-unsafe fashion. + * @depot: The slab depot. 
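+ *
+ * Logs a one-line summary (zone counts, slab count, and release requests) via uds_log_info();
+ * intended only for debugging.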
+ */ +void vdo_dump_slab_depot(const struct slab_depot *depot) +{ + uds_log_info("vdo slab depot"); + uds_log_info(" zone_count=%u old_zone_count=%u slabCount=%u active_release_request=%llu new_release_request=%llu", + (unsigned int) depot->zone_count, + (unsigned int) depot->old_zone_count, + READ_ONCE(depot->slab_count), + (unsigned long long) depot->active_release_request, + (unsigned long long) depot->new_release_request); +} diff --git a/drivers/md/dm-vdo/slab-depot.h b/drivers/md/dm-vdo/slab-depot.h index 6ec4be7b5822..44655d697fa0 100644 --- a/drivers/md/dm-vdo/slab-depot.h +++ b/drivers/md/dm-vdo/slab-depot.h @@ -435,6 +435,66 @@ struct block_allocator { struct slab_summary_block *summary_blocks; }; +enum slab_depot_load_type { + VDO_SLAB_DEPOT_NORMAL_LOAD, + VDO_SLAB_DEPOT_RECOVERY_LOAD, + VDO_SLAB_DEPOT_REBUILD_LOAD +}; + +struct slab_depot { + zone_count_t zone_count; + zone_count_t old_zone_count; + struct vdo *vdo; + struct slab_config slab_config; + struct action_manager *action_manager; + + physical_block_number_t first_block; + physical_block_number_t last_block; + physical_block_number_t origin; + + /* slab_size == (1 << slab_size_shift) */ + unsigned int slab_size_shift; + + /* Determines how slabs should be queued during load */ + enum slab_depot_load_type load_type; + + /* The state for notifying slab journals to release recovery journal */ + sequence_number_t active_release_request; + sequence_number_t new_release_request; + + /* State variables for scrubbing complete handling */ + atomic_t zones_to_scrub; + + /* Array of pointers to individually allocated slabs */ + struct vdo_slab **slabs; + /* The number of slabs currently allocated and stored in 'slabs' */ + slab_count_t slab_count; + + /* Array of pointers to a larger set of slabs (used during resize) */ + struct vdo_slab **new_slabs; + /* The number of slabs currently allocated and stored in 'new_slabs' */ + slab_count_t new_slab_count; + /* The size that 'new_slabs' was allocated for */ + block_count_t new_size; + + /* The last block before resize, for rollback */ + physical_block_number_t old_last_block; + /* The last block after resize, for resize */ + physical_block_number_t new_last_block; + + /* The statistics for the slab summary */ + struct atomic_slab_summary_statistics summary_statistics; + /* The start of the slab summary partition */ + physical_block_number_t summary_origin; + /* The number of bits to shift to get a 7-bit fullness hint */ + unsigned int hint_shift; + /* The slab summary entries for all of the zones the partition can hold */ + struct slab_summary_entry *summary_entries; + + /* The block allocators for this depot */ + struct block_allocator allocators[]; +}; + struct reference_updater; bool __must_check @@ -445,6 +505,11 @@ vdo_attempt_replay_into_slab(struct vdo_slab *slab, struct journal_point *recovery_point, struct vdo_completion *parent); +int __must_check +vdo_adjust_reference_count_for_rebuild(struct slab_depot *depot, + physical_block_number_t pbn, + enum journal_operation operation); + static inline struct block_allocator *vdo_as_block_allocator(struct vdo_completion *completion) { vdo_assert_completion_type(completion, VDO_BLOCK_ALLOCATOR_COMPLETION); @@ -470,4 +535,60 @@ void vdo_notify_slab_journals_are_recovered(struct vdo_completion *completion); void vdo_dump_block_allocator(const struct block_allocator *allocator); +int __must_check vdo_decode_slab_depot(struct slab_depot_state_2_0 state, + struct vdo *vdo, + struct partition *summary_partition, + struct slab_depot 
**depot_ptr); + +void vdo_free_slab_depot(struct slab_depot *depot); + +struct slab_depot_state_2_0 __must_check vdo_record_slab_depot(const struct slab_depot *depot); + +int __must_check vdo_allocate_reference_counters(struct slab_depot *depot); + +struct vdo_slab * __must_check +vdo_get_slab(const struct slab_depot *depot, physical_block_number_t pbn); + +u8 __must_check vdo_get_increment_limit(struct slab_depot *depot, physical_block_number_t pbn); + +bool __must_check +vdo_is_physical_data_block(const struct slab_depot *depot, physical_block_number_t pbn); + +block_count_t __must_check vdo_get_slab_depot_allocated_blocks(const struct slab_depot *depot); + +block_count_t __must_check vdo_get_slab_depot_data_blocks(const struct slab_depot *depot); + +void vdo_get_slab_depot_statistics(const struct slab_depot *depot, struct vdo_statistics *stats); + +void vdo_load_slab_depot(struct slab_depot *depot, + const struct admin_state_code *operation, + struct vdo_completion *parent, + void *context); + +void vdo_prepare_slab_depot_to_allocate(struct slab_depot *depot, + enum slab_depot_load_type load_type, + struct vdo_completion *parent); + +void vdo_update_slab_depot_size(struct slab_depot *depot); + +int __must_check +vdo_prepare_to_grow_slab_depot(struct slab_depot *depot, const struct partition *partition); + +void vdo_use_new_slabs(struct slab_depot *depot, struct vdo_completion *parent); + +void vdo_abandon_new_slabs(struct slab_depot *depot); + +void vdo_drain_slab_depot(struct slab_depot *depot, + const struct admin_state_code *operation, + struct vdo_completion *parent); + +void vdo_resume_slab_depot(struct slab_depot *depot, struct vdo_completion *parent); + +void vdo_commit_oldest_slab_journal_tail_blocks(struct slab_depot *depot, + sequence_number_t recovery_block_number); + +void vdo_scrub_all_unrecovered_slabs(struct slab_depot *depot, struct vdo_completion *parent); + +void vdo_dump_slab_depot(const struct slab_depot *depot); + #endif /* VDO_SLAB_DEPOT_H */ From patchwork Thu Oct 26 21:41:26 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Matthew Sakai X-Patchwork-Id: 13437948 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 43CFBC27C49 for ; Thu, 26 Oct 2023 21:44:29 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1344933AbjJZVo3 (ORCPT ); Thu, 26 Oct 2023 17:44:29 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:41026 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1344887AbjJZVoQ (ORCPT ); Thu, 26 Oct 2023 17:44:16 -0400 Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.129.124]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id C176AD4C for ; Thu, 26 Oct 2023 14:41:45 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1698356504; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=YRpqX318yKs0FI7wcKnos8/xZf3b0XP2y9bXPQ0s/9U=; b=PESfxq7D0yhGvPUqSOlpXuPZajMfXJ0QAYlT8te7yMH85EAmDLZoQgrcNVnG9hZNz7+HMU /u/3A3YufxVdWSdULFIHbQAn5do3OWGcTSLczijfA1/hZMNUeUhAPxYx38h/vMQ4Yo2Z6s 
GIim8UqTGnVr3B7Hsbjw0ttmwM0OTvw= Received: from mimecast-mx02.redhat.com (mx-ext.redhat.com [66.187.233.73]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.3, cipher=TLS_AES_256_GCM_SHA384) id us-mta-685-j1Js9vq-MbWh0e3cMv25Og-1; Thu, 26 Oct 2023 17:41:43 -0400 X-MC-Unique: j1Js9vq-MbWh0e3cMv25Og-1 Received: from smtp.corp.redhat.com (int-mx05.intmail.prod.int.rdu2.redhat.com [10.11.54.5]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (2048 bits) server-digest SHA256) (No client certificate requested) by mimecast-mx02.redhat.com (Postfix) with ESMTPS id 3F43529ABA38; Thu, 26 Oct 2023 21:41:42 +0000 (UTC) Received: from pbitcolo-build-10.permabit.com (pbitcolo-build-10.permabit.lab.eng.bos.redhat.com [10.19.117.76]) by smtp.corp.redhat.com (Postfix) with ESMTP id 22B222949; Thu, 26 Oct 2023 21:41:42 +0000 (UTC) Received: by pbitcolo-build-10.permabit.com (Postfix, from userid 1138) id CCF1D3003B; Thu, 26 Oct 2023 17:41:41 -0400 (EDT) From: Matthew Sakai To: dm-devel@lists.linux.dev, linux-block@vger.kernel.org Cc: Matthew Sakai Subject: [PATCH v4 29/39] dm vdo: add the block map Date: Thu, 26 Oct 2023 17:41:26 -0400 Message-Id: <20231026214136.1067410-30-msakai@redhat.com> In-Reply-To: <20231026214136.1067410-1-msakai@redhat.com> References: <20231026214136.1067410-1-msakai@redhat.com> MIME-Version: 1.0 X-Scanned-By: MIMEDefang 3.4.1 on 10.11.54.5 Precedence: bulk List-ID: X-Mailing-List: linux-block@vger.kernel.org The block map contains the logical to physical mapping. It can be thought of as an array with one entry per logical address. Each entry is 5 bytes: 36 bits contain the physical block number which holds the data for the given logical address, and the remaining 4 bits are used to indicate the nature of the mapping. Of the 16 possible states, one represents a logical address which is unmapped (i.e. it has never been written, or has been discarded), one represents an uncompressed block, and the other 14 states are used to indicate that the mapped data is compressed, and which of the compression slots in the compressed block this logical address maps to. Co-developed-by: J. corwin Coburn Signed-off-by: J. 
corwin Coburn Co-developed-by: Michael Sclafani Signed-off-by: Michael Sclafani Co-developed-by: Sweet Tea Dorminy Signed-off-by: Sweet Tea Dorminy Signed-off-by: Matthew Sakai --- drivers/md/dm-vdo/block-map.c | 2145 +++++++++++++++++++++++++++++++++ drivers/md/dm-vdo/block-map.h | 232 ++++ 2 files changed, 2377 insertions(+) create mode 100644 drivers/md/dm-vdo/block-map.c create mode 100644 drivers/md/dm-vdo/block-map.h diff --git a/drivers/md/dm-vdo/block-map.c b/drivers/md/dm-vdo/block-map.c new file mode 100644 index 000000000000..a4bf60cd2613 --- /dev/null +++ b/drivers/md/dm-vdo/block-map.c @@ -0,0 +1,2145 @@ +// SPDX-License-Identifier: GPL-2.0-only +/* + * Copyright 2023 Red Hat + */ + +#include "block-map.h" + +#include +#include + +#include "errors.h" +#include "logger.h" +#include "memory-alloc.h" +#include "permassert.h" + +#include "action-manager.h" +#include "admin-state.h" +#include "completion.h" +#include "constants.h" +#include "data-vio.h" +#include "encodings.h" +#include "io-submitter.h" +#include "physical-zone.h" +#include "recovery-journal.h" +#include "slab-depot.h" +#include "status-codes.h" +#include "types.h" +#include "vdo.h" +#include "vio.h" +#include "wait-queue.h" + +/** + * DOC: Block map eras + * + * The block map era, or maximum age, is used as follows: + * + * Each block map page, when dirty, records the earliest recovery journal block sequence number of + * the changes reflected in that dirty block. Sequence numbers are classified into eras: every + * @maximum_age sequence numbers, we switch to a new era. Block map pages are assigned to eras + * according to the sequence number they record. + * + * In the current (newest) era, block map pages are not written unless there is cache pressure. In + * the next oldest era, each time a new journal block is written 1/@maximum_age of the pages in + * this era are issued for write. In all older eras, pages are issued for write immediately. + */ + +struct page_descriptor { + root_count_t root_index; + height_t height; + page_number_t page_index; + slot_number_t slot; +} __packed; + +union page_key { + struct page_descriptor descriptor; + u64 key; +}; + +struct write_if_not_dirtied_context { + struct block_map_zone *zone; + u8 generation; +}; + +struct block_map_tree_segment { + struct tree_page *levels[VDO_BLOCK_MAP_TREE_HEIGHT]; +}; + +struct block_map_tree { + struct block_map_tree_segment *segments; +} block_map_tree; + +struct forest { + struct block_map *map; + size_t segments; + struct boundary *boundaries; + struct tree_page **pages; + struct block_map_tree trees[]; +}; + +struct cursor_level { + page_number_t page_index; + slot_number_t slot; +}; + +struct cursors; + +struct cursor { + struct waiter waiter; + struct block_map_tree *tree; + height_t height; + struct cursors *parent; + struct boundary boundary; + struct cursor_level levels[VDO_BLOCK_MAP_TREE_HEIGHT]; + struct pooled_vio *vio; +}; + +struct cursors { + struct block_map_zone *zone; + struct vio_pool *pool; + vdo_entry_callback *entry_callback; + struct vdo_completion *parent; + root_count_t active_roots; + struct cursor cursors[]; +}; + +/* Used to indicate that the page holding the location of a tree root has been "loaded". */ +const physical_block_number_t VDO_INVALID_PBN = 0xFFFFFFFFFFFFFFFF; + +/** + * get_tree_page_by_index() - Get the tree page for a given height and page index. + * + * Return: The requested page. 
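+ *
+ * The forest may have been grown in several segments. For each level, forest->boundaries
+ * records the index of the first page beyond each segment, so the lookup walks the segments
+ * and subtracts the pages accounted for by earlier segments from @page_index.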
+ */ +static struct tree_page * __must_check +get_tree_page_by_index(struct forest *forest, + root_count_t root_index, + height_t height, + page_number_t page_index) +{ + page_number_t offset = 0; + size_t segment; + + for (segment = 0; segment < forest->segments; segment++) { + page_number_t border = forest->boundaries[segment].levels[height - 1]; + + if (page_index < border) { + struct block_map_tree *tree = &forest->trees[root_index]; + + return &(tree->segments[segment].levels[height - 1][page_index - offset]); + } + offset = border; + } + + return NULL; +} + +/* Get the page referred to by the lock's tree slot at its current height. */ +static inline struct tree_page * +get_tree_page(const struct block_map_zone *zone, const struct tree_lock *lock) +{ + return get_tree_page_by_index(zone->block_map->forest, + lock->root_index, + lock->height, + lock->tree_slots[lock->height].page_index); +} + +/** vdo_copy_valid_page() - Validate and copy a buffer to a page. */ +bool vdo_copy_valid_page(char *buffer, + nonce_t nonce, + physical_block_number_t pbn, + struct block_map_page *page) +{ + struct block_map_page *loaded = (struct block_map_page *) buffer; + enum block_map_page_validity validity = vdo_validate_block_map_page(loaded, nonce, pbn); + + if (validity == VDO_BLOCK_MAP_PAGE_VALID) { + memcpy(page, loaded, VDO_BLOCK_SIZE); + return true; + } + + if (validity == VDO_BLOCK_MAP_PAGE_BAD) + uds_log_error_strerror(VDO_BAD_PAGE, + "Expected page %llu but got page %llu instead", + (unsigned long long) pbn, + (unsigned long long) vdo_get_block_map_page_pbn(loaded)); + + return false; +} + +/** + * in_cyclic_range() - Check whether the given value is between the lower and upper bounds, within + * a cyclic range of values from 0 to (modulus - 1). + * @lower: The lowest value to accept. + * @value: The value to check. + * @upper: The highest value to accept. + * @modulus: The size of the cyclic space, no more than 2^15. + * + * The value and both bounds must be smaller than the modulus. + * + * Return: true if the value is in range. + */ +static bool in_cyclic_range(u16 lower, u16 value, u16 upper, u16 modulus) +{ + if (value < lower) + value += modulus; + if (upper < lower) + upper += modulus; + return (value <= upper); +} + +/** + * is_not_older() - Check whether a generation is strictly older than some other generation in the + * context of a zone's current generation range. + * @zone: The zone in which to do the comparison. + * @a: The generation in question. + * @b: The generation to compare to. 
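+ *
+ * Generations are 8-bit values that wrap, so the comparison is made within the cyclic range
+ * [oldest_generation, generation]. For example, if that range is [250, 3], then generation
+ * 254 is strictly older than generation 2 even though it is numerically larger.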
+ * + * Return: true if generation @a is not strictly older than generation @b in the context of @zone + */ +static bool __must_check is_not_older(struct block_map_zone *zone, u8 a, u8 b) +{ + int result; + + result = ASSERT((in_cyclic_range(zone->oldest_generation, a, zone->generation, 1 << 8) && + in_cyclic_range(zone->oldest_generation, b, zone->generation, 1 << 8)), + "generation(s) %u, %u are out of range [%u, %u]", + a, b, zone->oldest_generation, zone->generation); + if (result != VDO_SUCCESS) { + enter_zone_read_only_mode(zone, result); + return true; + } + + return in_cyclic_range(b, a, zone->generation, 1 << 8); +} + +static void release_generation(struct block_map_zone *zone, u8 generation) +{ + int result; + + result = ASSERT((zone->dirty_page_counts[generation] > 0), + "dirty page count underflow for generation %u", + generation); + if (result != VDO_SUCCESS) { + enter_zone_read_only_mode(zone, result); + return; + } + + zone->dirty_page_counts[generation]--; + while ((zone->dirty_page_counts[zone->oldest_generation] == 0) && + (zone->oldest_generation != zone->generation)) + zone->oldest_generation++; +} + +static void +set_generation(struct block_map_zone *zone, struct tree_page *page, u8 new_generation) +{ + u32 new_count; + int result; + bool decrement_old = vdo_is_waiting(&page->waiter); + u8 old_generation = page->generation; + + if (decrement_old && (old_generation == new_generation)) + return; + + page->generation = new_generation; + new_count = ++zone->dirty_page_counts[new_generation]; + result = ASSERT((new_count != 0), + "dirty page count overflow for generation %u", + new_generation); + if (result != VDO_SUCCESS) { + enter_zone_read_only_mode(zone, result); + return; + } + + if (decrement_old) + release_generation(zone, old_generation); +} + +static void write_page(struct tree_page *tree_page, struct pooled_vio *vio); + +/* Implements waiter_callback */ +static void write_page_callback(struct waiter *waiter, void *context) +{ + write_page(container_of(waiter, struct tree_page, waiter), (struct pooled_vio *) context); +} + +static void acquire_vio(struct waiter *waiter, struct block_map_zone *zone) +{ + waiter->callback = write_page_callback; + acquire_vio_from_pool(zone->vio_pool, waiter); +} + +/* Return: true if all possible generations were not already active */ +static bool attempt_increment(struct block_map_zone *zone) +{ + u8 generation = zone->generation + 1; + + if (zone->oldest_generation == generation) + return false; + + zone->generation = generation; + return true; +} + +/* Launches a flush if one is not already in progress. */ +static void enqueue_page(struct tree_page *page, struct block_map_zone *zone) +{ + if ((zone->flusher == NULL) && attempt_increment(zone)) { + zone->flusher = page; + acquire_vio(&page->waiter, zone); + return; + } + + vdo_enqueue_waiter(&zone->flush_waiters, &page->waiter); +} + +static void write_page_if_not_dirtied(struct waiter *waiter, void *context) +{ + struct tree_page *page = container_of(waiter, struct tree_page, waiter); + struct write_if_not_dirtied_context *write_context = context; + + if (page->generation == write_context->generation) { + acquire_vio(waiter, write_context->zone); + return; + } + + enqueue_page(page, write_context->zone); +} + +static void return_to_pool(struct block_map_zone *zone, struct pooled_vio *vio) +{ + return_vio_to_pool(zone->vio_pool, vio); + check_for_drain_complete(zone); +} + +/* This callback is registered in write_initialized_page(). 
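It runs on the block map zone's thread once the page write has completed.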
*/ +static void finish_page_write(struct vdo_completion *completion) +{ + bool dirty; + struct vio *vio = as_vio(completion); + struct pooled_vio *pooled = container_of(vio, struct pooled_vio, vio); + struct tree_page *page = completion->parent; + struct block_map_zone *zone = pooled->context; + + vdo_release_recovery_journal_block_reference(zone->block_map->journal, + page->writing_recovery_lock, + VDO_ZONE_TYPE_LOGICAL, + zone->zone_number); + + dirty = (page->writing_generation != page->generation); + release_generation(zone, page->writing_generation); + page->writing = false; + + if (zone->flusher == page) { + struct write_if_not_dirtied_context context = { + .zone = zone, + .generation = page->writing_generation, + }; + + vdo_notify_all_waiters(&zone->flush_waiters, write_page_if_not_dirtied, &context); + if (dirty && attempt_increment(zone)) { + write_page(page, pooled); + return; + } + + zone->flusher = NULL; + } + + if (dirty) { + enqueue_page(page, zone); + } else if ((zone->flusher == NULL) && + vdo_has_waiters(&zone->flush_waiters) && + attempt_increment(zone)) { + zone->flusher = container_of(vdo_dequeue_next_waiter(&zone->flush_waiters), + struct tree_page, + waiter); + write_page(zone->flusher, pooled); + return; + } + + return_to_pool(zone, pooled); +} + +static void handle_write_error(struct vdo_completion *completion) +{ + int result = completion->result; + struct vio *vio = as_vio(completion); + struct pooled_vio *pooled = container_of(vio, struct pooled_vio, vio); + struct block_map_zone *zone = pooled->context; + + vio_record_metadata_io_error(vio); + enter_zone_read_only_mode(zone, result); + return_to_pool(zone, pooled); +} + +static void write_page_endio(struct bio *bio); + +static void write_initialized_page(struct vdo_completion *completion) +{ + struct vio *vio = as_vio(completion); + struct pooled_vio *pooled = container_of(vio, struct pooled_vio, vio); + struct block_map_zone *zone = pooled->context; + struct tree_page *tree_page = completion->parent; + struct block_map_page *page = (struct block_map_page *) vio->data; + unsigned int operation = REQ_OP_WRITE | REQ_PRIO; + + /* + * Now that we know the page has been written at least once, mark the copy we are writing + * as initialized. + */ + page->header.initialized = true; + + if (zone->flusher == tree_page) + operation |= REQ_PREFLUSH; + + submit_metadata_vio(vio, + vdo_get_block_map_page_pbn(page), + write_page_endio, + handle_write_error, + operation); +} + +static void write_page_endio(struct bio *bio) +{ + struct pooled_vio *vio = bio->bi_private; + struct block_map_zone *zone = vio->context; + struct block_map_page *page = (struct block_map_page *) vio->vio.data; + + continue_vio_after_io(&vio->vio, + (page->header.initialized ? + finish_page_write : + write_initialized_page), + zone->thread_id); +} + +static void write_page(struct tree_page *tree_page, struct pooled_vio *vio) +{ + struct vdo_completion *completion = &vio->vio.completion; + struct block_map_zone *zone = vio->context; + struct block_map_page *page = vdo_as_block_map_page(tree_page); + + if ((zone->flusher != tree_page) && + is_not_older(zone, tree_page->generation, zone->generation)) { + /* + * This page was re-dirtied after the last flush was issued, hence we need to do + * another flush. 
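+ * Re-queue the page and return this vio to the pool; the page will be written again as, or
+ * after, the next flusher.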
+ */ + enqueue_page(tree_page, zone); + return_to_pool(zone, vio); + return; + } + + completion->parent = tree_page; + memcpy(vio->vio.data, tree_page->page_buffer, VDO_BLOCK_SIZE); + completion->callback_thread_id = zone->thread_id; + + tree_page->writing = true; + tree_page->writing_generation = tree_page->generation; + tree_page->writing_recovery_lock = tree_page->recovery_lock; + + /* Clear this now so that we know this page is not on any dirty list. */ + tree_page->recovery_lock = 0; + + /* + * We've already copied the page into the vio which will write it, so if it was not yet + * initialized, the first write will indicate that (for torn write protection). It is now + * safe to mark it as initialized in memory since if the write fails, the in memory state + * will become irrelevant. + */ + if (page->header.initialized) { + write_initialized_page(completion); + return; + } + + page->header.initialized = true; + submit_metadata_vio(&vio->vio, + vdo_get_block_map_page_pbn(page), + write_page_endio, + handle_write_error, + REQ_OP_WRITE | REQ_PRIO); +} + +/* Release a lock on a page which was being loaded or allocated. */ +static void release_page_lock(struct data_vio *data_vio, char *what) +{ + struct block_map_zone *zone; + struct tree_lock *lock_holder; + struct tree_lock *lock = &data_vio->tree_lock; + + ASSERT_LOG_ONLY(lock->locked, + "release of unlocked block map page %s for key %llu in tree %u", + what, (unsigned long long) lock->key, + lock->root_index); + + zone = data_vio->logical.zone->block_map_zone; + lock_holder = vdo_int_map_remove(zone->loading_pages, lock->key); + ASSERT_LOG_ONLY((lock_holder == lock), + "block map page %s mismatch for key %llu in tree %u", + what, + (unsigned long long) lock->key, + lock->root_index); + lock->locked = false; +} + +static void finish_lookup(struct data_vio *data_vio, int result) +{ + data_vio->tree_lock.height = 0; + + --data_vio->logical.zone->block_map_zone->active_lookups; + + set_data_vio_logical_callback(data_vio, continue_data_vio_with_block_map_slot); + data_vio->vio.completion.error_handler = handle_data_vio_error; + continue_data_vio_with_error(data_vio, result); +} + +static void abort_lookup_for_waiter(struct waiter *waiter, void *context) +{ + struct data_vio *data_vio = waiter_as_data_vio(waiter); + int result = *((int *) context); + + if (!data_vio->write) { + if (result == VDO_NO_SPACE) + result = VDO_SUCCESS; + } else if (result != VDO_NO_SPACE) { + result = VDO_READ_ONLY; + } + + finish_lookup(data_vio, result); +} + +static void abort_lookup(struct data_vio *data_vio, int result, char *what) +{ + if (result != VDO_NO_SPACE) + enter_zone_read_only_mode(data_vio->logical.zone->block_map_zone, result); + + if (data_vio->tree_lock.locked) { + release_page_lock(data_vio, what); + vdo_notify_all_waiters(&data_vio->tree_lock.waiters, + abort_lookup_for_waiter, + &result); + } + + finish_lookup(data_vio, result); +} + +static void abort_load(struct data_vio *data_vio, int result) +{ + abort_lookup(data_vio, result, "load"); +} + +static bool __must_check +is_invalid_tree_entry(const struct vdo *vdo, const struct data_location *mapping, height_t height) +{ + if (!vdo_is_valid_location(mapping) || + vdo_is_state_compressed(mapping->state) || + (vdo_is_mapped_location(mapping) && (mapping->pbn == VDO_ZERO_BLOCK))) + return true; + + /* Roots aren't physical data blocks, so we can't check their PBNs. 
*/ + if (height == VDO_BLOCK_MAP_TREE_HEIGHT) + return false; + + return !vdo_is_physical_data_block(vdo->depot, mapping->pbn); +} + +static void load_block_map_page(struct block_map_zone *zone, struct data_vio *data_vio); +static void allocate_block_map_page(struct block_map_zone *zone, struct data_vio *data_vio); + +static void continue_with_loaded_page(struct data_vio *data_vio, struct block_map_page *page) +{ + struct tree_lock *lock = &data_vio->tree_lock; + struct block_map_tree_slot slot = lock->tree_slots[lock->height]; + struct data_location mapping = + vdo_unpack_block_map_entry(&page->entries[slot.block_map_slot.slot]); + + if (is_invalid_tree_entry(vdo_from_data_vio(data_vio), &mapping, lock->height)) { + uds_log_error_strerror(VDO_BAD_MAPPING, + "Invalid block map tree PBN: %llu with state %u for page index %u at height %u", + (unsigned long long) mapping.pbn, + mapping.state, + lock->tree_slots[lock->height - 1].page_index, + lock->height - 1); + abort_load(data_vio, VDO_BAD_MAPPING); + return; + } + + if (!vdo_is_mapped_location(&mapping)) { + /* The page we need is unallocated */ + allocate_block_map_page(data_vio->logical.zone->block_map_zone, data_vio); + return; + } + + lock->tree_slots[lock->height - 1].block_map_slot.pbn = mapping.pbn; + if (lock->height == 1) { + finish_lookup(data_vio, VDO_SUCCESS); + return; + } + + /* We know what page we need to load next */ + load_block_map_page(data_vio->logical.zone->block_map_zone, data_vio); +} + +static void continue_load_for_waiter(struct waiter *waiter, void *context) +{ + struct data_vio *data_vio = waiter_as_data_vio(waiter); + + data_vio->tree_lock.height--; + continue_with_loaded_page(data_vio, (struct block_map_page *) context); +} + +static void finish_block_map_page_load(struct vdo_completion *completion) +{ + physical_block_number_t pbn; + struct tree_page *tree_page; + struct block_map_page *page; + nonce_t nonce; + struct vio *vio = as_vio(completion); + struct pooled_vio *pooled = vio_as_pooled_vio(vio); + struct data_vio *data_vio = completion->parent; + struct block_map_zone *zone = pooled->context; + struct tree_lock *tree_lock = &data_vio->tree_lock; + + tree_lock->height--; + pbn = tree_lock->tree_slots[tree_lock->height].block_map_slot.pbn; + tree_page = get_tree_page(zone, tree_lock); + page = (struct block_map_page *) tree_page->page_buffer; + nonce = zone->block_map->nonce; + + if (!vdo_copy_valid_page(vio->data, nonce, pbn, page)) + vdo_format_block_map_page(page, nonce, pbn, false); + return_vio_to_pool(zone->vio_pool, pooled); + + /* Release our claim to the load and wake any waiters */ + release_page_lock(data_vio, "load"); + vdo_notify_all_waiters(&tree_lock->waiters, continue_load_for_waiter, page); + continue_with_loaded_page(data_vio, page); +} + +static void handle_io_error(struct vdo_completion *completion) +{ + int result = completion->result; + struct vio *vio = as_vio(completion); + struct pooled_vio *pooled = container_of(vio, struct pooled_vio, vio); + struct data_vio *data_vio = completion->parent; + struct block_map_zone *zone = pooled->context; + + vio_record_metadata_io_error(vio); + return_vio_to_pool(zone->vio_pool, pooled); + abort_load(data_vio, result); +} + +static void load_page_endio(struct bio *bio) +{ + struct vio *vio = bio->bi_private; + struct data_vio *data_vio = vio->completion.parent; + + continue_vio_after_io(vio, finish_block_map_page_load, data_vio->logical.zone->thread_id); +} + +static void load_page(struct waiter *waiter, void *context) +{ + struct pooled_vio 
*pooled = context; + struct data_vio *data_vio = waiter_as_data_vio(waiter); + struct tree_lock *lock = &data_vio->tree_lock; + physical_block_number_t pbn = lock->tree_slots[lock->height - 1].block_map_slot.pbn; + + pooled->vio.completion.parent = data_vio; + submit_metadata_vio(&pooled->vio, + pbn, + load_page_endio, + handle_io_error, + REQ_OP_READ | REQ_PRIO); +} + +/* + * If the page is already locked, queue up to wait for the lock to be released. If the lock is + * acquired, @data_vio->tree_lock.locked will be true. + */ +static int attempt_page_lock(struct block_map_zone *zone, struct data_vio *data_vio) +{ + int result; + struct tree_lock *lock_holder; + struct tree_lock *lock = &data_vio->tree_lock; + height_t height = lock->height; + struct block_map_tree_slot tree_slot = lock->tree_slots[height]; + union page_key key; + + key.descriptor = (struct page_descriptor) { + .root_index = lock->root_index, + .height = height, + .page_index = tree_slot.page_index, + .slot = tree_slot.block_map_slot.slot, + }; + lock->key = key.key; + + result = vdo_int_map_put(zone->loading_pages, + lock->key, + lock, + false, + (void **) &lock_holder); + if (result != VDO_SUCCESS) + return result; + + if (lock_holder == NULL) { + /* We got the lock */ + data_vio->tree_lock.locked = true; + return VDO_SUCCESS; + } + + /* Someone else is loading or allocating the page we need */ + vdo_enqueue_waiter(&lock_holder->waiters, &data_vio->waiter); + return VDO_SUCCESS; +} + +/* Load a block map tree page from disk, for the next level in the data vio tree lock. */ +static void load_block_map_page(struct block_map_zone *zone, struct data_vio *data_vio) +{ + int result; + + result = attempt_page_lock(zone, data_vio); + if (result != VDO_SUCCESS) { + abort_load(data_vio, result); + return; + } + + if (data_vio->tree_lock.locked) { + data_vio->waiter.callback = load_page; + acquire_vio_from_pool(zone->vio_pool, &data_vio->waiter); + } +} + +static void allocation_failure(struct vdo_completion *completion) +{ + struct data_vio *data_vio = as_data_vio(completion); + + if (vdo_requeue_completion_if_needed(completion, data_vio->logical.zone->thread_id)) + return; + + abort_lookup(data_vio, completion->result, "allocation"); +} + +static void continue_allocation_for_waiter(struct waiter *waiter, void *context) +{ + struct data_vio *data_vio = waiter_as_data_vio(waiter); + struct tree_lock *tree_lock = &data_vio->tree_lock; + physical_block_number_t pbn = *((physical_block_number_t *) context); + + tree_lock->height--; + data_vio->tree_lock.tree_slots[tree_lock->height].block_map_slot.pbn = pbn; + + if (tree_lock->height == 0) { + finish_lookup(data_vio, VDO_SUCCESS); + return; + } + + allocate_block_map_page(data_vio->logical.zone->block_map_zone, data_vio); +} + +/** expire_oldest_list() - Expire the oldest list. */ +static void expire_oldest_list(struct dirty_lists *dirty_lists) +{ + block_count_t i = dirty_lists->offset++; + + dirty_lists->oldest_period++; + if (!list_empty(&dirty_lists->eras[i][VDO_TREE_PAGE])) + list_splice_tail_init(&dirty_lists->eras[i][VDO_TREE_PAGE], + &dirty_lists->expired[VDO_TREE_PAGE]); + if (!list_empty(&dirty_lists->eras[i][VDO_CACHE_PAGE])) + list_splice_tail_init(&dirty_lists->eras[i][VDO_CACHE_PAGE], + &dirty_lists->expired[VDO_CACHE_PAGE]); + + if (dirty_lists->offset == dirty_lists->maximum_age) + dirty_lists->offset = 0; +} + + +/** update_period() - Update the dirty_lists period if necessary. 
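+ *
+ * The eras array is used as a ring of maximum_age periods: advancing to a new period
+ * first expires the oldest era whenever the ring is full, so at most maximum_age
+ * periods hold dirty pages at any time. A rough sketch with illustrative values:
+ *
+ *   maximum_age = 3, oldest_period = 4, next_period = 7 (periods 4, 5 and 6 are live);
+ *   update_period(dirty, 7) expires period 4 and advances next_period to 8, leaving
+ *   periods 5, 6 and 7 live.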
*/ +static void update_period(struct dirty_lists *dirty, sequence_number_t period) +{ + while (dirty->next_period <= period) { + if ((dirty->next_period - dirty->oldest_period) == dirty->maximum_age) + expire_oldest_list(dirty); + dirty->next_period++; + } +} + +/** write_expired_elements() - Write out the expired list. */ +static void write_expired_elements(struct block_map_zone *zone) +{ + struct tree_page *page, *ttmp; + struct page_info *info, *ptmp; + struct list_head *expired; + u8 generation = zone->generation; + + expired = &zone->dirty_lists->expired[VDO_TREE_PAGE]; + list_for_each_entry_safe(page, ttmp, expired, entry) { + int result; + + list_del_init(&page->entry); + + result = ASSERT(!vdo_is_waiting(&page->waiter), + "Newly expired page not already waiting to write"); + if (result != VDO_SUCCESS) { + enter_zone_read_only_mode(zone, result); + continue; + } + + set_generation(zone, page, generation); + if (!page->writing) + enqueue_page(page, zone); + } + + expired = &zone->dirty_lists->expired[VDO_CACHE_PAGE]; + list_for_each_entry_safe(info, ptmp, expired, state_entry) { + list_del_init(&info->state_entry); + schedule_page_save(info); + } + + save_pages(&zone->page_cache); +} + +/** + * add_to_dirty_lists() - Add an element to the dirty lists. + * @zone: The zone in which we are operating. + * @entry: The list entry of the element to add. + * @type: The type of page. + * @old_period: The period in which the element was previously dirtied, or 0 if it was not dirty. + * @new_period: The period in which the element has now been dirtied, or 0 if it does not hold a + * lock. + */ +static void +add_to_dirty_lists(struct block_map_zone *zone, + struct list_head *entry, + enum block_map_page_type type, + sequence_number_t old_period, + sequence_number_t new_period) +{ + struct dirty_lists *dirty_lists = zone->dirty_lists; + + if ((old_period == new_period) || ((old_period != 0) && (old_period < new_period))) + return; + + if (new_period < dirty_lists->oldest_period) { + list_move_tail(entry, &dirty_lists->expired[type]); + } else { + update_period(dirty_lists, new_period); + list_move_tail(entry, + &dirty_lists->eras[new_period % dirty_lists->maximum_age][type]); + } + + write_expired_elements(zone); +} + +/* + * Record the allocation in the tree and wake any waiters now that the write lock has been + * released. + */ +static void finish_block_map_allocation(struct vdo_completion *completion) +{ + physical_block_number_t pbn; + struct tree_page *tree_page; + struct block_map_page *page; + sequence_number_t old_lock; + struct data_vio *data_vio = as_data_vio(completion); + struct block_map_zone *zone = data_vio->logical.zone->block_map_zone; + struct tree_lock *tree_lock = &data_vio->tree_lock; + height_t height = tree_lock->height; + + assert_data_vio_in_logical_zone(data_vio); + + tree_page = get_tree_page(zone, tree_lock); + pbn = tree_lock->tree_slots[height - 1].block_map_slot.pbn; + + /* Record the allocation. */ + page = (struct block_map_page *) tree_page->page_buffer; + old_lock = tree_page->recovery_lock; + vdo_update_block_map_page(page, + data_vio, + pbn, + VDO_MAPPING_STATE_UNCOMPRESSED, + &tree_page->recovery_lock); + + if (vdo_is_waiting(&tree_page->waiter)) { + /* This page is waiting to be written out. */ + if (zone->flusher != tree_page) + /* + * The outstanding flush won't cover the update we just made, so mark the + * page as needing another flush. 
+ */ + set_generation(zone, tree_page, zone->generation); + } else { + /* Put the page on a dirty list */ + if (old_lock == 0) + INIT_LIST_HEAD(&tree_page->entry); + add_to_dirty_lists(zone, + &tree_page->entry, + VDO_TREE_PAGE, + old_lock, + tree_page->recovery_lock); + } + + tree_lock->height--; + if (height > 1) { + /* Format the interior node we just allocated (in memory). */ + tree_page = get_tree_page(zone, tree_lock); + vdo_format_block_map_page(tree_page->page_buffer, + zone->block_map->nonce, + pbn, + false); + } + + /* Release our claim to the allocation and wake any waiters */ + release_page_lock(data_vio, "allocation"); + vdo_notify_all_waiters(&tree_lock->waiters, continue_allocation_for_waiter, &pbn); + if (tree_lock->height == 0) { + finish_lookup(data_vio, VDO_SUCCESS); + return; + } + + allocate_block_map_page(zone, data_vio); +} + +static void release_block_map_write_lock(struct vdo_completion *completion) +{ + struct data_vio *data_vio = as_data_vio(completion); + + assert_data_vio_in_allocated_zone(data_vio); + + release_data_vio_allocation_lock(data_vio, true); + launch_data_vio_logical_callback(data_vio, finish_block_map_allocation); +} + +/* + * Newly allocated block map pages are set to have to MAXIMUM_REFERENCES after they are journaled, + * to prevent deduplication against the block after we release the write lock on it, but before we + * write out the page. + */ +static void set_block_map_page_reference_count(struct vdo_completion *completion) +{ + struct data_vio *data_vio = as_data_vio(completion); + + assert_data_vio_in_allocated_zone(data_vio); + + completion->callback = release_block_map_write_lock; + vdo_modify_reference_count(completion, &data_vio->increment_updater); +} + +static void journal_block_map_allocation(struct vdo_completion *completion) +{ + struct data_vio *data_vio = as_data_vio(completion); + + assert_data_vio_in_journal_zone(data_vio); + + set_data_vio_allocated_zone_callback(data_vio, set_block_map_page_reference_count); + vdo_add_recovery_journal_entry(completion->vdo->recovery_journal, data_vio); +} + +static void allocate_block(struct vdo_completion *completion) +{ + struct data_vio *data_vio = as_data_vio(completion); + struct tree_lock *lock = &data_vio->tree_lock; + physical_block_number_t pbn; + + assert_data_vio_in_allocated_zone(data_vio); + + if (!vdo_allocate_block_in_zone(data_vio)) + return; + + pbn = data_vio->allocation.pbn; + lock->tree_slots[lock->height - 1].block_map_slot.pbn = pbn; + data_vio->increment_updater = (struct reference_updater) { + .operation = VDO_JOURNAL_BLOCK_MAP_REMAPPING, + .increment = true, + .zpbn = { + .pbn = pbn, + .state = VDO_MAPPING_STATE_UNCOMPRESSED, + }, + .lock = data_vio->allocation.lock, + }; + + launch_data_vio_journal_callback(data_vio, journal_block_map_allocation); +} + +static void allocate_block_map_page(struct block_map_zone *zone, struct data_vio *data_vio) +{ + int result; + + if (!data_vio->write || data_vio->is_trim) { + /* This is a pure read or a trim, so there's nothing left to do here. 
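+ * An unallocated interior page simply means the logical block is unmapped, which a
+ * read or trim can report without materializing any missing tree pages.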
*/ + finish_lookup(data_vio, VDO_SUCCESS); + return; + } + + result = attempt_page_lock(zone, data_vio); + if (result != VDO_SUCCESS) { + abort_lookup(data_vio, result, "allocation"); + return; + } + + if (!data_vio->tree_lock.locked) + return; + + data_vio_allocate_data_block(data_vio, + VIO_BLOCK_MAP_WRITE_LOCK, + allocate_block, + allocation_failure); +} + +/** + * vdo_find_block_map_slot() - Find the block map slot in which the block map entry for a data_vio + * resides and cache that result in the data_vio. + * + * All ancestors in the tree will be allocated or loaded, as needed. + */ +void vdo_find_block_map_slot(struct data_vio *data_vio) +{ + page_number_t page_index; + struct block_map_tree_slot tree_slot; + struct data_location mapping; + struct block_map_page *page = NULL; + struct tree_lock *lock = &data_vio->tree_lock; + struct block_map_zone *zone = data_vio->logical.zone->block_map_zone; + + zone->active_lookups++; + if (vdo_is_state_draining(&zone->state)) { + finish_lookup(data_vio, VDO_SHUTTING_DOWN); + return; + } + + lock->tree_slots[0].block_map_slot.slot = + data_vio->logical.lbn % VDO_BLOCK_MAP_ENTRIES_PER_PAGE; + page_index = (lock->tree_slots[0].page_index / zone->block_map->root_count); + tree_slot = (struct block_map_tree_slot) { + .page_index = page_index / VDO_BLOCK_MAP_ENTRIES_PER_PAGE, + .block_map_slot = { + .pbn = 0, + .slot = page_index % VDO_BLOCK_MAP_ENTRIES_PER_PAGE, + }, + }; + + for (lock->height = 1; lock->height <= VDO_BLOCK_MAP_TREE_HEIGHT; lock->height++) { + physical_block_number_t pbn; + + lock->tree_slots[lock->height] = tree_slot; + page = (struct block_map_page *) (get_tree_page(zone, lock)->page_buffer); + pbn = vdo_get_block_map_page_pbn(page); + if (pbn != VDO_ZERO_BLOCK) { + lock->tree_slots[lock->height].block_map_slot.pbn = pbn; + break; + } + + /* Calculate the index and slot for the next level. */ + tree_slot.block_map_slot.slot = + tree_slot.page_index % VDO_BLOCK_MAP_ENTRIES_PER_PAGE; + tree_slot.page_index = tree_slot.page_index / VDO_BLOCK_MAP_ENTRIES_PER_PAGE; + } + + /* The page at this height has been allocated and loaded. */ + mapping = vdo_unpack_block_map_entry(&page->entries[tree_slot.block_map_slot.slot]); + if (is_invalid_tree_entry(vdo_from_data_vio(data_vio), &mapping, lock->height)) { + uds_log_error_strerror(VDO_BAD_MAPPING, + "Invalid block map tree PBN: %llu with state %u for page index %u at height %u", + (unsigned long long) mapping.pbn, + mapping.state, + lock->tree_slots[lock->height - 1].page_index, + lock->height - 1); + abort_load(data_vio, VDO_BAD_MAPPING); + return; + } + + if (!vdo_is_mapped_location(&mapping)) { + /* The page we want one level down has not been allocated, so allocate it. */ + allocate_block_map_page(zone, data_vio); + return; + } + + lock->tree_slots[lock->height - 1].block_map_slot.pbn = mapping.pbn; + if (lock->height == 1) { + /* This is the ultimate block map page, so we're done */ + finish_lookup(data_vio, VDO_SUCCESS); + return; + } + + /* We know what page we need to load. */ + load_block_map_page(zone, data_vio); +} + +/* + * Find the PBN of a leaf block map page. This method may only be used after all allocated tree + * pages have been loaded, otherwise, it may give the wrong answer (0). 
+ */ +physical_block_number_t +vdo_find_block_map_page_pbn(struct block_map *map, page_number_t page_number) +{ + struct data_location mapping; + struct tree_page *tree_page; + struct block_map_page *page; + root_count_t root_index = page_number % map->root_count; + page_number_t page_index = page_number / map->root_count; + slot_number_t slot = page_index % VDO_BLOCK_MAP_ENTRIES_PER_PAGE; + + page_index /= VDO_BLOCK_MAP_ENTRIES_PER_PAGE; + + tree_page = get_tree_page_by_index(map->forest, root_index, 1, page_index); + page = (struct block_map_page *) tree_page->page_buffer; + if (!page->header.initialized) + return VDO_ZERO_BLOCK; + + mapping = vdo_unpack_block_map_entry(&page->entries[slot]); + if (!vdo_is_valid_location(&mapping) || vdo_is_state_compressed(mapping.state)) + return VDO_ZERO_BLOCK; + return mapping.pbn; +} + +/* + * Write a tree page or indicate that it has been re-dirtied if it is already being written. This + * method is used when correcting errors in the tree during read-only rebuild. + */ +void vdo_write_tree_page(struct tree_page *page, struct block_map_zone *zone) +{ + bool waiting = vdo_is_waiting(&page->waiter); + + if (waiting && (zone->flusher == page)) + return; + + set_generation(zone, page, zone->generation); + if (waiting || page->writing) + return; + + enqueue_page(page, zone); +} + +static int make_segment(struct forest *old_forest, + block_count_t new_pages, + struct boundary *new_boundary, + struct forest *forest) +{ + size_t index = (old_forest == NULL) ? 0 : old_forest->segments; + struct tree_page *page_ptr; + page_count_t segment_sizes[VDO_BLOCK_MAP_TREE_HEIGHT]; + height_t height; + root_count_t root; + int result; + + forest->segments = index + 1; + + result = UDS_ALLOCATE(forest->segments, struct boundary, + "forest boundary array", &forest->boundaries); + if (result != VDO_SUCCESS) + return result; + + result = UDS_ALLOCATE(forest->segments, + struct tree_page *, + "forest page pointers", + &forest->pages); + if (result != VDO_SUCCESS) + return result; + + result = UDS_ALLOCATE(new_pages, + struct tree_page, + "new forest pages", + &forest->pages[index]); + if (result != VDO_SUCCESS) + return result; + + if (index > 0) { + memcpy(forest->boundaries, + old_forest->boundaries, + index * sizeof(struct boundary)); + memcpy(forest->pages, old_forest->pages, index * sizeof(struct tree_page *)); + } + + memcpy(&(forest->boundaries[index]), new_boundary, sizeof(struct boundary)); + + for (height = 0; height < VDO_BLOCK_MAP_TREE_HEIGHT; height++) { + segment_sizes[height] = new_boundary->levels[height]; + if (index > 0) + segment_sizes[height] -= old_forest->boundaries[index - 1].levels[height]; + } + + page_ptr = forest->pages[index]; + for (root = 0; root < forest->map->root_count; root++) { + struct block_map_tree_segment *segment; + struct block_map_tree *tree = &(forest->trees[root]); + height_t height; + + int result = UDS_ALLOCATE(forest->segments, + struct block_map_tree_segment, + "tree root segments", + &tree->segments); + if (result != VDO_SUCCESS) + return result; + + if (index > 0) + memcpy(tree->segments, + old_forest->trees[root].segments, + index * sizeof(struct block_map_tree_segment)); + + segment = &(tree->segments[index]); + for (height = 0; height < VDO_BLOCK_MAP_TREE_HEIGHT; height++) { + if (segment_sizes[height] == 0) + continue; + + segment->levels[height] = page_ptr; + if (height == (VDO_BLOCK_MAP_TREE_HEIGHT - 1)) { + /* Record the root. 
*/ + struct block_map_page *page = + vdo_format_block_map_page(page_ptr->page_buffer, + forest->map->nonce, + VDO_INVALID_PBN, + true); + page->entries[0] = + vdo_pack_block_map_entry(forest->map->root_origin + root, + VDO_MAPPING_STATE_UNCOMPRESSED); + } + page_ptr += segment_sizes[height]; + } + } + + return VDO_SUCCESS; +} + +static void deforest(struct forest *forest, size_t first_page_segment) +{ + root_count_t root; + + if (forest->pages != NULL) { + size_t segment; + + for (segment = first_page_segment; segment < forest->segments; segment++) + UDS_FREE(forest->pages[segment]); + UDS_FREE(forest->pages); + } + + for (root = 0; root < forest->map->root_count; root++) + UDS_FREE(forest->trees[root].segments); + + UDS_FREE(forest->boundaries); + UDS_FREE(forest); +} + +/** + * make_forest() - Make a collection of trees for a block_map, expanding the existing forest if + * there is one. + * @entries: The number of entries the block map will hold. + * + * Return: VDO_SUCCESS or an error. + */ +static int make_forest(struct block_map *map, block_count_t entries) +{ + struct forest *forest, *old_forest = map->forest; + struct boundary new_boundary, *old_boundary = NULL; + block_count_t new_pages; + int result; + + if (old_forest != NULL) + old_boundary = &(old_forest->boundaries[old_forest->segments - 1]); + + new_pages = vdo_compute_new_forest_pages(map->root_count, old_boundary, + entries, &new_boundary); + if (new_pages == 0) { + map->next_entry_count = entries; + return VDO_SUCCESS; + } + + result = UDS_ALLOCATE_EXTENDED(struct forest, map->root_count, + struct block_map_tree, __func__, + &forest); + if (result != VDO_SUCCESS) + return result; + + forest->map = map; + result = make_segment(old_forest, new_pages, &new_boundary, forest); + if (result != VDO_SUCCESS) { + deforest(forest, forest->segments - 1); + return result; + } + + map->next_forest = forest; + map->next_entry_count = entries; + return VDO_SUCCESS; +} + +/** + * replace_forest() - Replace a block_map's forest with the already-prepared larger forest. + */ +static void replace_forest(struct block_map *map) +{ + if (map->next_forest != NULL) { + if (map->forest != NULL) + deforest(map->forest, map->forest->segments); + map->forest = UDS_FORGET(map->next_forest); + } + + map->entry_count = map->next_entry_count; + map->next_entry_count = 0; +} + +/** + * finish_cursor() - Finish the traversal of a single tree. If it was the last cursor, finish the + * traversal. + */ +static void finish_cursor(struct cursor *cursor) +{ + struct cursors *cursors = cursor->parent; + struct vdo_completion *parent = cursors->parent; + + return_vio_to_pool(cursors->pool, UDS_FORGET(cursor->vio)); + if (--cursors->active_roots > 0) + return; + + UDS_FREE(cursors); + + vdo_finish_completion(parent); +} + +static void traverse(struct cursor *cursor); + +/** + * continue_traversal() - Continue traversing a block map tree. + * @completion: The VIO doing a read or write. + */ +static void continue_traversal(struct vdo_completion *completion) +{ + vio_record_metadata_io_error(as_vio(completion)); + traverse(completion->parent); +} + +/** + * finish_traversal_load() - Continue traversing a block map tree now that a page has been loaded. + * @completion: The VIO doing the read. 
+ */ +static void finish_traversal_load(struct vdo_completion *completion) +{ + struct cursor *cursor = completion->parent; + height_t height = cursor->height; + struct cursor_level *level = &cursor->levels[height]; + struct tree_page *tree_page = + &(cursor->tree->segments[0].levels[height][level->page_index]); + struct block_map_page *page = (struct block_map_page *) tree_page->page_buffer; + + vdo_copy_valid_page(cursor->vio->vio.data, + cursor->parent->zone->block_map->nonce, + pbn_from_vio_bio(cursor->vio->vio.bio), + page); + traverse(cursor); +} + +static void traversal_endio(struct bio *bio) +{ + struct vio *vio = bio->bi_private; + struct cursor *cursor = vio->completion.parent; + + continue_vio_after_io(vio, finish_traversal_load, cursor->parent->zone->thread_id); +} + +/** + * traverse() - Traverse a single block map tree. + * + * This is the recursive heart of the traversal process. + */ +static void traverse(struct cursor *cursor) +{ + for (; cursor->height < VDO_BLOCK_MAP_TREE_HEIGHT; cursor->height++) { + height_t height = cursor->height; + struct cursor_level *level = &cursor->levels[height]; + struct tree_page *tree_page = + &(cursor->tree->segments[0].levels[height][level->page_index]); + struct block_map_page *page = (struct block_map_page *) tree_page->page_buffer; + + if (!page->header.initialized) + continue; + + for (; level->slot < VDO_BLOCK_MAP_ENTRIES_PER_PAGE; level->slot++) { + struct cursor_level *next_level; + page_number_t entry_index = + (VDO_BLOCK_MAP_ENTRIES_PER_PAGE * level->page_index) + level->slot; + struct data_location location = + vdo_unpack_block_map_entry(&page->entries[level->slot]); + + if (!vdo_is_valid_location(&location)) { + /* This entry is invalid, so remove it from the page. */ + page->entries[level->slot] = UNMAPPED_BLOCK_MAP_ENTRY; + vdo_write_tree_page(tree_page, cursor->parent->zone); + continue; + } + + if (!vdo_is_mapped_location(&location)) + continue; + + /* Erase mapped entries past the end of the logical space. */ + if (entry_index >= cursor->boundary.levels[height]) { + page->entries[level->slot] = UNMAPPED_BLOCK_MAP_ENTRY; + vdo_write_tree_page(tree_page, cursor->parent->zone); + continue; + } + + if (cursor->height < VDO_BLOCK_MAP_TREE_HEIGHT - 1) { + int result = + cursor->parent->entry_callback(location.pbn, + cursor->parent->parent); + + if (result != VDO_SUCCESS) { + page->entries[level->slot] = UNMAPPED_BLOCK_MAP_ENTRY; + vdo_write_tree_page(tree_page, cursor->parent->zone); + continue; + } + } + + if (cursor->height == 0) + continue; + + cursor->height--; + next_level = &cursor->levels[cursor->height]; + next_level->page_index = entry_index; + next_level->slot = 0; + level->slot++; + submit_metadata_vio(&cursor->vio->vio, + location.pbn, + traversal_endio, + continue_traversal, + REQ_OP_READ | REQ_PRIO); + return; + } + } + + finish_cursor(cursor); +} + +/** + * launch_cursor() - Start traversing a single block map tree now that the cursor has a VIO with + * which to load pages. + * @context: The pooled_vio just acquired. + * + * Implements waiter_callback. + */ +static void launch_cursor(struct waiter *waiter, void *context) +{ + struct cursor *cursor = container_of(waiter, struct cursor, waiter); + struct pooled_vio *pooled = context; + + cursor->vio = pooled; + pooled->vio.completion.parent = cursor; + pooled->vio.completion.callback_thread_id = cursor->parent->zone->thread_id; + traverse(cursor); +} + +/** + * compute_boundary() - Compute the number of pages used at each level of the given root's tree. 
+ * + * Return: The list of page counts as a boundary structure. + */ +static struct boundary compute_boundary(struct block_map *map, root_count_t root_index) +{ + struct boundary boundary; + height_t height; + page_count_t leaf_pages = vdo_compute_block_map_page_count(map->entry_count); + /* + * Compute the leaf pages for this root. If the number of leaf pages does not distribute + * evenly, we must determine if this root gets an extra page. Extra pages are assigned to + * roots starting from tree 0. + */ + page_count_t last_tree_root = (leaf_pages - 1) % map->root_count; + page_count_t level_pages = leaf_pages / map->root_count; + + if (root_index <= last_tree_root) + level_pages++; + + for (height = 0; height < VDO_BLOCK_MAP_TREE_HEIGHT - 1; height++) { + boundary.levels[height] = level_pages; + level_pages = DIV_ROUND_UP(level_pages, VDO_BLOCK_MAP_ENTRIES_PER_PAGE); + } + + /* The root node always exists, even if the root is otherwise unused. */ + boundary.levels[VDO_BLOCK_MAP_TREE_HEIGHT - 1] = 1; + + return boundary; +} + +/** + * vdo_traverse_forest() - Walk the entire forest of a block map. + * @callback: A function to call with the pbn of each allocated node in the forest. + * @parent: The completion to notify on each traversed PBN, and when the traversal is complete. + */ +void vdo_traverse_forest(struct block_map *map, + vdo_entry_callback *callback, + struct vdo_completion *parent) +{ + root_count_t root; + struct cursors *cursors; + int result; + + result = UDS_ALLOCATE_EXTENDED(struct cursors, + map->root_count, + struct cursor, + __func__, + &cursors); + if (result != VDO_SUCCESS) { + vdo_fail_completion(parent, result); + return; + } + + cursors->zone = &map->zones[0]; + cursors->pool = cursors->zone->vio_pool; + cursors->entry_callback = callback; + cursors->parent = parent; + cursors->active_roots = map->root_count; + for (root = 0; root < map->root_count; root++) { + struct cursor *cursor = &cursors->cursors[root]; + + *cursor = (struct cursor) { + .tree = &map->forest->trees[root], + .height = VDO_BLOCK_MAP_TREE_HEIGHT - 1, + .parent = cursors, + .boundary = compute_boundary(map, root), + }; + + cursor->waiter.callback = launch_cursor; + acquire_vio_from_pool(cursors->pool, &cursor->waiter); + }; +} + +/** + * initialize_block_map_zone() - Initialize the per-zone portions of the block map. + * @maximum_age: The number of journal blocks before a dirtied page is considered old and must be + * written out. 
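+ * @cache_size: The total size of the page cache, in pages. Each zone receives an even
+ *              share of cache_size / zone_count pages.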
+ */ +static int __must_check initialize_block_map_zone(struct block_map *map, + zone_count_t zone_number, + struct vdo *vdo, + page_count_t cache_size, + block_count_t maximum_age) +{ + int result; + block_count_t i; + struct block_map_zone *zone = &map->zones[zone_number]; + + STATIC_ASSERT_SIZEOF(struct page_descriptor, sizeof(u64)); + + zone->zone_number = zone_number; + zone->thread_id = vdo->thread_config.logical_threads[zone_number]; + zone->block_map = map; + + result = UDS_ALLOCATE_EXTENDED(struct dirty_lists, + maximum_age, + dirty_era_t, + __func__, + &zone->dirty_lists); + if (result != VDO_SUCCESS) + return result; + + zone->dirty_lists->maximum_age = maximum_age; + INIT_LIST_HEAD(&zone->dirty_lists->expired[VDO_TREE_PAGE]); + INIT_LIST_HEAD(&zone->dirty_lists->expired[VDO_CACHE_PAGE]); + + for (i = 0; i < maximum_age; i++) { + INIT_LIST_HEAD(&zone->dirty_lists->eras[i][VDO_TREE_PAGE]); + INIT_LIST_HEAD(&zone->dirty_lists->eras[i][VDO_CACHE_PAGE]); + } + + result = vdo_make_int_map(VDO_LOCK_MAP_CAPACITY, 0, &zone->loading_pages); + if (result != VDO_SUCCESS) + return result; + + result = make_vio_pool(vdo, + BLOCK_MAP_VIO_POOL_SIZE, + zone->thread_id, + VIO_TYPE_BLOCK_MAP_INTERIOR, + VIO_PRIORITY_METADATA, + zone, + &zone->vio_pool); + if (result != VDO_SUCCESS) + return result; + + vdo_set_admin_state_code(&zone->state, VDO_ADMIN_STATE_NORMAL_OPERATION); + + zone->page_cache.zone = zone; + zone->page_cache.vdo = vdo; + zone->page_cache.page_count = cache_size / map->zone_count; + zone->page_cache.stats.free_pages = zone->page_cache.page_count; + + result = allocate_cache_components(&zone->page_cache); + if (result != VDO_SUCCESS) + return result; + + /* initialize empty circular queues */ + INIT_LIST_HEAD(&zone->page_cache.lru_list); + INIT_LIST_HEAD(&zone->page_cache.outgoing_list); + + return VDO_SUCCESS; +} + +/* Implements vdo_zone_thread_getter */ +static thread_id_t get_block_map_zone_thread_id(void *context, zone_count_t zone_number) +{ + struct block_map *map = context; + + return map->zones[zone_number].thread_id; +} + +/* Implements vdo_action_preamble */ +static void prepare_for_era_advance(void *context, struct vdo_completion *parent) +{ + struct block_map *map = context; + + map->current_era_point = map->pending_era_point; + vdo_finish_completion(parent); +} + +/* Implements vdo_zone_action */ +static void advance_block_map_zone_era(void *context, + zone_count_t zone_number, + struct vdo_completion *parent) +{ + struct block_map *map = context; + struct block_map_zone *zone = &map->zones[zone_number]; + + update_period(zone->dirty_lists, map->current_era_point); + write_expired_elements(zone); + vdo_finish_completion(parent); +} + +/* + * Schedule an era advance if necessary. This method should not be called directly. Rather, call + * vdo_schedule_default_action() on the block map's action manager. + * + * Implements vdo_action_scheduler. 
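+ *
+ * Advancing the era lets each zone expire and write out pages dirtied more than
+ * maximum_age journal blocks ago, which is what keeps block map recovery work bounded.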
+ */ +static bool schedule_era_advance(void *context) +{ + struct block_map *map = context; + + if (map->current_era_point == map->pending_era_point) + return false; + + return vdo_schedule_action(map->action_manager, + prepare_for_era_advance, + advance_block_map_zone_era, + NULL, + NULL); +} + +static void uninitialize_block_map_zone(struct block_map_zone *zone) +{ + struct vdo_page_cache *cache = &zone->page_cache; + + UDS_FREE(UDS_FORGET(zone->dirty_lists)); + free_vio_pool(UDS_FORGET(zone->vio_pool)); + vdo_free_int_map(UDS_FORGET(zone->loading_pages)); + if (cache->infos != NULL) { + struct page_info *info; + + for (info = cache->infos; info < cache->infos + cache->page_count; ++info) + free_vio(UDS_FORGET(info->vio)); + } + + vdo_free_int_map(UDS_FORGET(cache->page_map)); + UDS_FREE(UDS_FORGET(cache->infos)); + UDS_FREE(UDS_FORGET(cache->pages)); +} + +void vdo_free_block_map(struct block_map *map) +{ + zone_count_t zone; + + if (map == NULL) + return; + + for (zone = 0; zone < map->zone_count; zone++) + uninitialize_block_map_zone(&map->zones[zone]); + + vdo_abandon_block_map_growth(map); + if (map->forest != NULL) + deforest(UDS_FORGET(map->forest), 0); + UDS_FREE(UDS_FORGET(map->action_manager)); + UDS_FREE(map); +} + +/* @journal may be NULL. */ +int vdo_decode_block_map(struct block_map_state_2_0 state, + block_count_t logical_blocks, + struct vdo *vdo, + struct recovery_journal *journal, + nonce_t nonce, + page_count_t cache_size, + block_count_t maximum_age, + struct block_map **map_ptr) +{ + struct block_map *map; + int result; + zone_count_t zone = 0; + + STATIC_ASSERT(VDO_BLOCK_MAP_ENTRIES_PER_PAGE == + ((VDO_BLOCK_SIZE - sizeof(struct block_map_page)) / + sizeof(struct block_map_entry))); + result = ASSERT(cache_size > 0, "block map cache size is specified"); + if (result != UDS_SUCCESS) + return result; + + result = UDS_ALLOCATE_EXTENDED(struct block_map, + vdo->thread_config.logical_zone_count, + struct block_map_zone, + __func__, + &map); + if (result != UDS_SUCCESS) + return result; + + map->vdo = vdo; + map->root_origin = state.root_origin; + map->root_count = state.root_count; + map->entry_count = logical_blocks; + map->journal = journal; + map->nonce = nonce; + + result = make_forest(map, map->entry_count); + if (result != VDO_SUCCESS) { + vdo_free_block_map(map); + return result; + } + + replace_forest(map); + + map->zone_count = vdo->thread_config.logical_zone_count; + for (zone = 0; zone < map->zone_count; zone++) { + result = initialize_block_map_zone(map, zone, vdo, cache_size, maximum_age); + if (result != VDO_SUCCESS) { + vdo_free_block_map(map); + return result; + } + } + + result = vdo_make_action_manager(map->zone_count, + get_block_map_zone_thread_id, + vdo_get_recovery_journal_thread_id(journal), + map, + schedule_era_advance, + vdo, + &map->action_manager); + if (result != VDO_SUCCESS) { + vdo_free_block_map(map); + return result; + } + + *map_ptr = map; + return VDO_SUCCESS; +} + +struct block_map_state_2_0 vdo_record_block_map(const struct block_map *map) +{ + return (struct block_map_state_2_0) { + .flat_page_origin = VDO_BLOCK_MAP_FLAT_PAGE_ORIGIN, + /* This is the flat page count, which has turned out to always be 0. */ + .flat_page_count = 0, + .root_origin = map->root_origin, + .root_count = map->root_count, + }; +} + +/* The block map needs to know the journals' sequence number to initialize the eras. 
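+ * Each zone's dirty lists start with the journal's current sequence number (seq) as
+ * the oldest live period:
+ *
+ *   oldest_period = seq, next_period = seq + 1, offset = seq % maximum_age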
*/ +void vdo_initialize_block_map_from_journal(struct block_map *map, struct recovery_journal *journal) +{ + zone_count_t z = 0; + + map->current_era_point = vdo_get_recovery_journal_current_sequence_number(journal); + map->pending_era_point = map->current_era_point; + + for (z = 0; z < map->zone_count; z++) { + struct dirty_lists *dirty_lists = map->zones[z].dirty_lists; + + ASSERT_LOG_ONLY(dirty_lists->next_period == 0, "current period not set"); + dirty_lists->oldest_period = map->current_era_point; + dirty_lists->next_period = map->current_era_point + 1; + dirty_lists->offset = map->current_era_point % dirty_lists->maximum_age; + } +} + +/* Compute the logical zone for the LBN of a data vio. */ +zone_count_t vdo_compute_logical_zone(struct data_vio *data_vio) +{ + struct block_map *map = vdo_from_data_vio(data_vio)->block_map; + struct tree_lock *tree_lock = &data_vio->tree_lock; + page_number_t page_number = data_vio->logical.lbn / VDO_BLOCK_MAP_ENTRIES_PER_PAGE; + + tree_lock->tree_slots[0].page_index = page_number; + tree_lock->root_index = page_number % map->root_count; + return (tree_lock->root_index % map->zone_count); +} + +void vdo_advance_block_map_era(struct block_map *map, sequence_number_t recovery_block_number) +{ + if (map == NULL) + return; + + map->pending_era_point = recovery_block_number; + vdo_schedule_default_action(map->action_manager); +} + +/* Implements vdo_admin_initiator */ +static void initiate_drain(struct admin_state *state) +{ + struct block_map_zone *zone = container_of(state, struct block_map_zone, state); + + ASSERT_LOG_ONLY((zone->active_lookups == 0), + "%s() called with no active lookups", + __func__); + + if (!vdo_is_state_suspending(state)) { + while (zone->dirty_lists->oldest_period < zone->dirty_lists->next_period) + expire_oldest_list(zone->dirty_lists); + write_expired_elements(zone); + } + + check_for_drain_complete(zone); +} + +/* Implements vdo_zone_action. */ +static void drain_zone(void *context, zone_count_t zone_number, struct vdo_completion *parent) +{ + struct block_map *map = context; + struct block_map_zone *zone = &map->zones[zone_number]; + + vdo_start_draining(&zone->state, + vdo_get_current_manager_operation(map->action_manager), + parent, + initiate_drain); +} + +void vdo_drain_block_map(struct block_map *map, + const struct admin_state_code *operation, + struct vdo_completion *parent) +{ + vdo_schedule_operation(map->action_manager, operation, NULL, drain_zone, NULL, parent); +} + +/* Implements vdo_zone_action. */ +static void +resume_block_map_zone(void *context, zone_count_t zone_number, struct vdo_completion *parent) +{ + struct block_map *map = context; + struct block_map_zone *zone = &map->zones[zone_number]; + + vdo_fail_completion(parent, vdo_resume_if_quiescent(&zone->state)); +} + +void vdo_resume_block_map(struct block_map *map, struct vdo_completion *parent) +{ + vdo_schedule_operation(map->action_manager, + VDO_ADMIN_STATE_RESUMING, + NULL, + resume_block_map_zone, + NULL, + parent); +} + +/* Allocate an expanded collection of trees, for a future growth. 
*/ +int vdo_prepare_to_grow_block_map(struct block_map *map, block_count_t new_logical_blocks) +{ + if (map->next_entry_count == new_logical_blocks) + return VDO_SUCCESS; + + if (map->next_entry_count > 0) + vdo_abandon_block_map_growth(map); + + if (new_logical_blocks < map->entry_count) { + map->next_entry_count = map->entry_count; + return VDO_SUCCESS; + } + + return make_forest(map, new_logical_blocks); +} + +/* Implements vdo_action_preamble */ +static void grow_forest(void *context, struct vdo_completion *completion) +{ + replace_forest(context); + vdo_finish_completion(completion); +} + +/* Requires vdo_prepare_to_grow_block_map() to have been previously called. */ +void vdo_grow_block_map(struct block_map *map, struct vdo_completion *parent) +{ + vdo_schedule_operation(map->action_manager, + VDO_ADMIN_STATE_SUSPENDED_OPERATION, + grow_forest, + NULL, + NULL, + parent); +} + +void vdo_abandon_block_map_growth(struct block_map *map) +{ + struct forest *forest = UDS_FORGET(map->next_forest); + + if (forest != NULL) + deforest(forest, forest->segments - 1); + + map->next_entry_count = 0; +} + +/* Release the page completion and then continue the requester. */ +static inline void finish_processing_page(struct vdo_completion *completion, int result) +{ + struct vdo_completion *parent = completion->parent; + + vdo_release_page_completion(completion); + vdo_continue_completion(parent, result); +} + +static void handle_page_error(struct vdo_completion *completion) +{ + finish_processing_page(completion, completion->result); +} + +/* Fetch the mapping page for a block map update, and call the provided handler when fetched. */ +static void fetch_mapping_page(struct data_vio *data_vio, bool modifiable, vdo_action *action) +{ + struct block_map_zone *zone = data_vio->logical.zone->block_map_zone; + + if (vdo_is_state_draining(&zone->state)) { + continue_data_vio_with_error(data_vio, VDO_SHUTTING_DOWN); + return; + } + + vdo_get_page(&data_vio->page_completion, + zone, + data_vio->tree_lock.tree_slots[0].block_map_slot.pbn, + modifiable, + &data_vio->vio.completion, + action, + handle_page_error, + false); +} + +/** + * clear_mapped_location() - Clear a data_vio's mapped block location, setting it to be unmapped. + * + * This indicates the block map entry for the logical block is either unmapped or corrupted. + */ +static void clear_mapped_location(struct data_vio *data_vio) +{ + data_vio->mapped = (struct zoned_pbn) { + .state = VDO_MAPPING_STATE_UNMAPPED, + }; +} + +/** + * set_mapped_location() - Decode and validate a block map entry, and set the mapped location of a + * data_vio. + * + * Return: VDO_SUCCESS or VDO_BAD_MAPPING if the map entry is invalid or an error code for any + * other failure + */ +static int __must_check +set_mapped_location(struct data_vio *data_vio, const struct block_map_entry *entry) +{ + /* Unpack the PBN for logging purposes even if the entry is invalid. */ + struct data_location mapped = vdo_unpack_block_map_entry(entry); + + if (vdo_is_valid_location(&mapped)) { + int result; + + result = vdo_get_physical_zone(vdo_from_data_vio(data_vio), + mapped.pbn, + &data_vio->mapped.zone); + if (result == VDO_SUCCESS) { + data_vio->mapped.pbn = mapped.pbn; + data_vio->mapped.state = mapped.state; + return VDO_SUCCESS; + } + + /* + * Return all errors not specifically known to be errors from validating the + * location. 
+ */ + if ((result != VDO_OUT_OF_RANGE) && (result != VDO_BAD_MAPPING)) + return result; + } + + /* + * Log the corruption even if we wind up ignoring it for write VIOs, converting all cases + * to VDO_BAD_MAPPING. + */ + uds_log_error_strerror(VDO_BAD_MAPPING, + "PBN %llu with state %u read from the block map was invalid", + (unsigned long long) mapped.pbn, + mapped.state); + + /* + * A read VIO has no option but to report the bad mapping--reading zeros would be hiding + * known data loss. + */ + if (!data_vio->write) + return VDO_BAD_MAPPING; + + /* + * A write VIO only reads this mapping to decref the old block. Treat this as an unmapped + * entry rather than fail the write. + */ + clear_mapped_location(data_vio); + return VDO_SUCCESS; +} + +/* This callback is registered in vdo_get_mapped_block(). */ +static void get_mapping_from_fetched_page(struct vdo_completion *completion) +{ + int result; + struct vdo_page_completion *vpc = as_vdo_page_completion(completion); + const struct block_map_page *page; + const struct block_map_entry *entry; + struct data_vio *data_vio = as_data_vio(completion->parent); + struct block_map_tree_slot *tree_slot; + + if (completion->result != VDO_SUCCESS) { + finish_processing_page(completion, completion->result); + return; + } + + result = validate_completed_page(vpc, false); + if (result != VDO_SUCCESS) { + finish_processing_page(completion, result); + return; + } + + page = (const struct block_map_page *) get_page_buffer(vpc->info); + tree_slot = &data_vio->tree_lock.tree_slots[0]; + entry = &page->entries[tree_slot->block_map_slot.slot]; + + result = set_mapped_location(data_vio, entry); + finish_processing_page(completion, result); +} + +void vdo_update_block_map_page(struct block_map_page *page, + struct data_vio *data_vio, + physical_block_number_t pbn, + enum block_mapping_state mapping_state, + sequence_number_t *recovery_lock) +{ + struct block_map_zone *zone = data_vio->logical.zone->block_map_zone; + struct block_map *block_map = zone->block_map; + struct recovery_journal *journal = block_map->journal; + sequence_number_t old_locked, new_locked; + struct tree_lock *tree_lock = &data_vio->tree_lock; + + /* Encode the new mapping. */ + page->entries[tree_lock->tree_slots[tree_lock->height].block_map_slot.slot] = + vdo_pack_block_map_entry(pbn, mapping_state); + + /* Adjust references on the recovery journal blocks. */ + old_locked = *recovery_lock; + new_locked = data_vio->recovery_sequence_number; + + if ((old_locked == 0) || (old_locked > new_locked)) { + vdo_acquire_recovery_journal_block_reference(journal, + new_locked, + VDO_ZONE_TYPE_LOGICAL, + zone->zone_number); + + if (old_locked > 0) + vdo_release_recovery_journal_block_reference(journal, + old_locked, + VDO_ZONE_TYPE_LOGICAL, + zone->zone_number); + + *recovery_lock = new_locked; + } + + /* + * FIXME: explain this more + * Release the transferred lock from the data_vio. 
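+ * A plausible reading of the transfer: the page's recovery_lock now pins a journal
+ * block no newer than new_locked through the zone's block reference, so the data_vio's
+ * individual entry lock on new_locked is redundant and can be dropped here.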
+ */ + vdo_release_journal_entry_lock(journal, new_locked); + data_vio->recovery_sequence_number = 0; +} + +static void put_mapping_in_fetched_page(struct vdo_completion *completion) +{ + struct data_vio *data_vio = as_data_vio(completion->parent); + sequence_number_t old_lock; + struct vdo_page_completion *vpc; + struct page_info *info; + int result; + + if (completion->result != VDO_SUCCESS) { + finish_processing_page(completion, completion->result); + return; + } + + vpc = as_vdo_page_completion(completion); + result = validate_completed_page(vpc, true); + if (result != VDO_SUCCESS) { + finish_processing_page(completion, result); + return; + } + + info = vpc->info; + old_lock = info->recovery_lock; + vdo_update_block_map_page((struct block_map_page *) get_page_buffer(info), + data_vio, + data_vio->new_mapped.pbn, + data_vio->new_mapped.state, + &info->recovery_lock); + set_info_state(info, PS_DIRTY); + add_to_dirty_lists(info->cache->zone, + &info->state_entry, + VDO_CACHE_PAGE, + old_lock, + info->recovery_lock); + finish_processing_page(completion, VDO_SUCCESS); +} + +/* Read a stored block mapping into a data_vio. */ +void vdo_get_mapped_block(struct data_vio *data_vio) +{ + if (data_vio->tree_lock.tree_slots[0].block_map_slot.pbn == VDO_ZERO_BLOCK) { + /* + * We know that the block map page for this LBN has not been allocated, so the + * block must be unmapped. + */ + clear_mapped_location(data_vio); + continue_data_vio(data_vio); + return; + } + + fetch_mapping_page(data_vio, false, get_mapping_from_fetched_page); +} + +/* Update a stored block mapping to reflect a data_vio's new mapping. */ +void vdo_put_mapped_block(struct data_vio *data_vio) +{ + fetch_mapping_page(data_vio, true, put_mapping_in_fetched_page); +} + +struct block_map_statistics vdo_get_block_map_statistics(struct block_map *map) +{ + zone_count_t zone = 0; + struct block_map_statistics totals; + + memset(&totals, 0, sizeof(struct block_map_statistics)); + for (zone = 0; zone < map->zone_count; zone++) { + const struct block_map_statistics *stats = &(map->zones[zone].page_cache.stats); + + totals.dirty_pages += READ_ONCE(stats->dirty_pages); + totals.clean_pages += READ_ONCE(stats->clean_pages); + totals.free_pages += READ_ONCE(stats->free_pages); + totals.failed_pages += READ_ONCE(stats->failed_pages); + totals.incoming_pages += READ_ONCE(stats->incoming_pages); + totals.outgoing_pages += READ_ONCE(stats->outgoing_pages); + totals.cache_pressure += READ_ONCE(stats->cache_pressure); + totals.read_count += READ_ONCE(stats->read_count); + totals.write_count += READ_ONCE(stats->write_count); + totals.failed_reads += READ_ONCE(stats->failed_reads); + totals.failed_writes += READ_ONCE(stats->failed_writes); + totals.reclaimed += READ_ONCE(stats->reclaimed); + totals.read_outgoing += READ_ONCE(stats->read_outgoing); + totals.found_in_cache += READ_ONCE(stats->found_in_cache); + totals.discard_required += READ_ONCE(stats->discard_required); + totals.wait_for_page += READ_ONCE(stats->wait_for_page); + totals.fetch_required += READ_ONCE(stats->fetch_required); + totals.pages_loaded += READ_ONCE(stats->pages_loaded); + totals.pages_saved += READ_ONCE(stats->pages_saved); + totals.flush_count += READ_ONCE(stats->flush_count); + } + + return totals; +} diff --git a/drivers/md/dm-vdo/block-map.h b/drivers/md/dm-vdo/block-map.h new file mode 100644 index 000000000000..2c4d40f21a69 --- /dev/null +++ b/drivers/md/dm-vdo/block-map.h @@ -0,0 +1,232 @@ +/* SPDX-License-Identifier: GPL-2.0-only */ +/* + * Copyright 2023 Red Hat + */ 
+ +#ifndef VDO_BLOCK_MAP_H +#define VDO_BLOCK_MAP_H + +#include + +#include "numeric.h" + +#include "admin-state.h" +#include "completion.h" +#include "encodings.h" +#include "int-map.h" +#include "statistics.h" +#include "types.h" +#include "vio.h" +#include "wait-queue.h" + +enum { + BLOCK_MAP_VIO_POOL_SIZE = 64, +}; + +/* Used to indicate that the page holding the location of a tree root has been "loaded". */ +extern const physical_block_number_t VDO_INVALID_PBN; + +/* + * Generation counter for page references. + */ +typedef u32 vdo_page_generation; + +struct forest; + +struct tree_page { + struct waiter waiter; + + /* Dirty list entry */ + struct list_head entry; + + /* If dirty, the tree zone flush generation in which it was last dirtied. */ + u8 generation; + + /* Whether this page is an interior tree page being written out. */ + bool writing; + + /* If writing, the tree zone flush generation of the copy being written. */ + u8 writing_generation; + + /* + * Sequence number of the earliest recovery journal block containing uncommitted updates to + * this page + */ + sequence_number_t recovery_lock; + + /* The value of recovery_lock when the this page last started writing */ + sequence_number_t writing_recovery_lock; + + char page_buffer[VDO_BLOCK_SIZE]; +}; + +enum block_map_page_type { + VDO_TREE_PAGE, + VDO_CACHE_PAGE, +}; + +typedef struct list_head dirty_era_t[2]; + +struct dirty_lists { + /** The number of periods after which an element will be expired */ + block_count_t maximum_age; + /** The oldest period which has unexpired elements */ + sequence_number_t oldest_period; + /** One more than the current period */ + sequence_number_t next_period; + /** The offset in the array of lists of the oldest period */ + block_count_t offset; + /** Expired pages */ + dirty_era_t expired; + /** The lists of dirty pages */ + dirty_era_t eras[]; +}; + +struct block_map_zone { + zone_count_t zone_number; + thread_id_t thread_id; + struct admin_state state; + struct block_map *block_map; + /* Dirty pages, by era*/ + struct dirty_lists *dirty_lists; + struct vdo_page_cache page_cache; + data_vio_count_t active_lookups; + struct int_map *loading_pages; + struct vio_pool *vio_pool; + /* The tree page which has issued or will be issuing a flush */ + struct tree_page *flusher; + struct wait_queue flush_waiters; + /* The generation after the most recent flush */ + u8 generation; + u8 oldest_generation; + /* The counts of dirty pages in each generation */ + u32 dirty_page_counts[256]; +}; + +struct block_map { + struct vdo *vdo; + struct action_manager *action_manager; + /* The absolute PBN of the first root of the tree part of the block map */ + physical_block_number_t root_origin; + block_count_t root_count; + + /* The era point we are currently distributing to the zones */ + sequence_number_t current_era_point; + /* The next era point */ + sequence_number_t pending_era_point; + + /* The number of entries in block map */ + block_count_t entry_count; + nonce_t nonce; + struct recovery_journal *journal; + + /* The trees for finding block map pages */ + struct forest *forest; + /* The expanded trees awaiting growth */ + struct forest *next_forest; + /* The number of entries after growth */ + block_count_t next_entry_count; + + zone_count_t zone_count; + struct block_map_zone zones[]; +}; + +/** + * typedef vdo_entry_callback - A function to be called for each allocated PBN when traversing the + * forest. + * @pbn: A PBN of a tree node. + * @completion: The parent completion of the traversal. 
+ * + * Return: VDO_SUCCESS or an error. + */ +typedef int vdo_entry_callback(physical_block_number_t pbn, struct vdo_completion *completion); + +static inline struct block_map_page * __must_check +vdo_as_block_map_page(struct tree_page *tree_page) +{ + return (struct block_map_page *) tree_page->page_buffer; +} + +bool vdo_copy_valid_page(char *buffer, + nonce_t nonce, + physical_block_number_t pbn, + struct block_map_page *page); + +void vdo_find_block_map_slot(struct data_vio *data_vio); + +physical_block_number_t +vdo_find_block_map_page_pbn(struct block_map *map, page_number_t page_number); + +void vdo_write_tree_page(struct tree_page *page, struct block_map_zone *zone); + +void vdo_traverse_forest(struct block_map *map, + vdo_entry_callback *callback, + struct vdo_completion *parent); + +int __must_check vdo_decode_block_map(struct block_map_state_2_0 state, + block_count_t logical_blocks, + struct vdo *vdo, + struct recovery_journal *journal, + nonce_t nonce, + page_count_t cache_size, + block_count_t maximum_age, + struct block_map **map_ptr); + +void vdo_drain_block_map(struct block_map *map, + const struct admin_state_code *operation, + struct vdo_completion *parent); + +void vdo_resume_block_map(struct block_map *map, struct vdo_completion *parent); + +int __must_check +vdo_prepare_to_grow_block_map(struct block_map *map, block_count_t new_logical_blocks); + +void vdo_grow_block_map(struct block_map *map, struct vdo_completion *parent); + +void vdo_abandon_block_map_growth(struct block_map *map); + +void vdo_free_block_map(struct block_map *map); + +struct block_map_state_2_0 __must_check +vdo_record_block_map(const struct block_map *map); + +void vdo_initialize_block_map_from_journal(struct block_map *map, + struct recovery_journal *journal); + +zone_count_t vdo_compute_logical_zone(struct data_vio *data_vio); + +void vdo_advance_block_map_era(struct block_map *map, sequence_number_t recovery_block_number); + +void vdo_update_block_map_page(struct block_map_page *page, + struct data_vio *data_vio, + physical_block_number_t pbn, + enum block_mapping_state mapping_state, + sequence_number_t *recovery_lock); + +void vdo_get_mapped_block(struct data_vio *data_vio); + +void vdo_put_mapped_block(struct data_vio *data_vio); + +struct block_map_statistics __must_check vdo_get_block_map_statistics(struct block_map *map); + +/** + * vdo_convert_maximum_age() - Convert the maximum age to reflect the new recovery journal format + * @age: The configured maximum age + * + * Return: The converted age + * + * In the old recovery journal format, each journal block held 311 entries, and every write bio + * made two entries. The old maximum age was half the usable journal length. In the new format, + * each block holds only 217 entries, but each bio only makes one entry. We convert the configured + * age so that the number of writes in a block map era is the same in the old and new formats. This + * keeps the bound on the amount of work required to recover the block map from the recovery + * journal the same across the format change. It also keeps the amortization of block map page + * writes to write bios the same. 
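+ *
+ * For example, using the entry counts above (311 old entries per block with two entries
+ * per write, versus 217 new entries per block with one entry per write), a configured
+ * age of 10 blocks converts to DIV_ROUND_UP(10 * 311, 2 * 217) = DIV_ROUND_UP(3110, 434)
+ * = 8 blocks.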
+ */
+static inline block_count_t vdo_convert_maximum_age(block_count_t age)
+{
+	return DIV_ROUND_UP(age * RECOVERY_JOURNAL_1_ENTRIES_PER_BLOCK,
+			    2 * RECOVERY_JOURNAL_ENTRIES_PER_BLOCK);
+}
+
+#endif /* VDO_BLOCK_MAP_H */

From patchwork Thu Oct 26 21:41:27 2023
From: Matthew Sakai
To: dm-devel@lists.linux.dev, linux-block@vger.kernel.org
Cc: Matthew Sakai
Subject: [PATCH v4 30/39] dm vdo: implement the block map page cache
Date: Thu, 26 Oct 2023 17:41:27 -0400
Message-Id: <20231026214136.1067410-31-msakai@redhat.com>
In-Reply-To: <20231026214136.1067410-1-msakai@redhat.com>
References: <20231026214136.1067410-1-msakai@redhat.com>

The set of leaf pages of the block map tree is too large to fit in
memory, so each block map zone maintains a cache of leaf pages. This
patch adds the implementation of that cache.

Co-developed-by: J.
corwin Coburn Co-developed-by: Michael Sclafani Signed-off-by: Michael Sclafani Co-developed-by: Sweet Tea Dorminy Signed-off-by: Sweet Tea Dorminy Signed-off-by: Matthew Sakai --- drivers/md/dm-vdo/block-map.c | 1231 +++++++++++++++++++++++++++++++++ drivers/md/dm-vdo/block-map.h | 165 +++++ 2 files changed, 1396 insertions(+) diff --git a/drivers/md/dm-vdo/block-map.c b/drivers/md/dm-vdo/block-map.c index a4bf60cd2613..d989e0366a25 100644 --- a/drivers/md/dm-vdo/block-map.c +++ b/drivers/md/dm-vdo/block-map.c @@ -106,6 +106,1237 @@ struct cursors { /* Used to indicate that the page holding the location of a tree root has been "loaded". */ const physical_block_number_t VDO_INVALID_PBN = 0xFFFFFFFFFFFFFFFF; +enum { + LOG_INTERVAL = 4000, + DISPLAY_INTERVAL = 100000, +}; + +/* + * For adjusting VDO page cache statistic fields which are only mutated on the logical zone thread. + * Prevents any compiler shenanigans from affecting other threads reading those stats. + */ +#define ADD_ONCE(value, delta) WRITE_ONCE(value, (value) + (delta)) + +static inline bool is_dirty(const struct page_info *info) +{ + return info->state == PS_DIRTY; +} + +static inline bool is_present(const struct page_info *info) +{ + return (info->state == PS_RESIDENT) || (info->state == PS_DIRTY); +} + +static inline bool is_in_flight(const struct page_info *info) +{ + return (info->state == PS_INCOMING) || (info->state == PS_OUTGOING); +} + +static inline bool is_incoming(const struct page_info *info) +{ + return info->state == PS_INCOMING; +} + +static inline bool is_outgoing(const struct page_info *info) +{ + return info->state == PS_OUTGOING; +} + +static inline bool is_valid(const struct page_info *info) +{ + return is_present(info) || is_outgoing(info); +} + +static char *get_page_buffer(struct page_info *info) +{ + struct vdo_page_cache *cache = info->cache; + + return &cache->pages[(info - cache->infos) * VDO_BLOCK_SIZE]; +} + +static inline struct vdo_page_completion *page_completion_from_waiter(struct waiter *waiter) +{ + struct vdo_page_completion *completion; + + if (waiter == NULL) + return NULL; + + completion = container_of(waiter, struct vdo_page_completion, waiter); + vdo_assert_completion_type(&completion->completion, VDO_PAGE_COMPLETION); + return completion; +} + +/** + * initialize_info() - Initialize all page info structures and put them on the free list. + * + * Return: VDO_SUCCESS or an error. + */ +static int initialize_info(struct vdo_page_cache *cache) +{ + struct page_info *info; + + INIT_LIST_HEAD(&cache->free_list); + for (info = cache->infos; info < cache->infos + cache->page_count; ++info) { + int result; + + info->cache = cache; + info->state = PS_FREE; + info->pbn = NO_PAGE; + + result = create_metadata_vio(cache->vdo, + VIO_TYPE_BLOCK_MAP, + VIO_PRIORITY_METADATA, info, + get_page_buffer(info), + &info->vio); + if (result != VDO_SUCCESS) + return result; + + /* The thread ID should never change. */ + info->vio->completion.callback_thread_id = cache->zone->thread_id; + + INIT_LIST_HEAD(&info->state_entry); + list_add_tail(&info->state_entry, &cache->free_list); + INIT_LIST_HEAD(&info->lru_entry); + } + + return VDO_SUCCESS; +} + +/** + * allocate_cache_components() - Allocate components of the cache which require their own + * allocation. + * @maximum_age: The number of journal blocks before a dirtied page is considered old and must be + * written out. + * + * The caller is responsible for all clean up on errors. + * + * Return: VDO_SUCCESS or an error code. 
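+ * This allocates the page_info array, one contiguous buffer of page_count *
+ * VDO_BLOCK_SIZE bytes holding the cached page data, and the int_map used to find
+ * resident pages by PBN, then hands off to initialize_info() to set up each page's vio.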
+ */ +static int __must_check allocate_cache_components(struct vdo_page_cache *cache) +{ + u64 size = cache->page_count * (u64) VDO_BLOCK_SIZE; + int result; + + result = UDS_ALLOCATE(cache->page_count, struct page_info, "page infos", &cache->infos); + if (result != UDS_SUCCESS) + return result; + + result = uds_allocate_memory(size, VDO_BLOCK_SIZE, "cache pages", &cache->pages); + if (result != UDS_SUCCESS) + return result; + + result = vdo_make_int_map(cache->page_count, 0, &cache->page_map); + if (result != UDS_SUCCESS) + return result; + + return initialize_info(cache); +} + +/** + * assert_on_cache_thread() - Assert that a function has been called on the VDO page cache's + * thread. + */ +static inline void assert_on_cache_thread(struct vdo_page_cache *cache, const char *function_name) +{ + thread_id_t thread_id = vdo_get_callback_thread_id(); + + ASSERT_LOG_ONLY((thread_id == cache->zone->thread_id), + "%s() must only be called on cache thread %d, not thread %d", + function_name, + cache->zone->thread_id, + thread_id); +} + +/** assert_io_allowed() - Assert that a page cache may issue I/O. */ +static inline void assert_io_allowed(struct vdo_page_cache *cache) +{ + ASSERT_LOG_ONLY(!vdo_is_state_quiescent(&cache->zone->state), + "VDO page cache may issue I/O"); +} + +/** report_cache_pressure() - Log and, if enabled, report cache pressure. */ +static void report_cache_pressure(struct vdo_page_cache *cache) +{ + ADD_ONCE(cache->stats.cache_pressure, 1); + if (cache->waiter_count > cache->page_count) { + if ((cache->pressure_report % LOG_INTERVAL) == 0) + uds_log_info("page cache pressure %u", cache->stats.cache_pressure); + + if (++cache->pressure_report >= DISPLAY_INTERVAL) + cache->pressure_report = 0; + } +} + +/** + * get_page_state_name() - Return the name of a page state. + * + * If the page state is invalid a static string is returned and the invalid state is logged. + * + * Return: A pointer to a static page state name. + */ +static const char * __must_check get_page_state_name(enum vdo_page_buffer_state state) +{ + int result; + static const char * const state_names[] = { + "UDS_FREE", "INCOMING", "FAILED", "RESIDENT", "DIRTY", "OUTGOING" + }; + + STATIC_ASSERT(ARRAY_SIZE(state_names) == PAGE_STATE_COUNT); + + result = ASSERT(state < ARRAY_SIZE(state_names), "Unknown page_state value %d", state); + if (result != UDS_SUCCESS) + return "[UNKNOWN PAGE STATE]"; + + return state_names[state]; +} + +/** + * update_counter() - Update the counter associated with a given state. + * @info: The page info to count. + * @delta: The delta to apply to the counter. + */ +static void update_counter(struct page_info *info, s32 delta) +{ + struct block_map_statistics *stats = &info->cache->stats; + + switch (info->state) { + case PS_FREE: + ADD_ONCE(stats->free_pages, delta); + return; + + case PS_INCOMING: + ADD_ONCE(stats->incoming_pages, delta); + return; + + case PS_OUTGOING: + ADD_ONCE(stats->outgoing_pages, delta); + return; + + case PS_FAILED: + ADD_ONCE(stats->failed_pages, delta); + return; + + case PS_RESIDENT: + ADD_ONCE(stats->clean_pages, delta); + return; + + case PS_DIRTY: + ADD_ONCE(stats->dirty_pages, delta); + return; + + default: + return; + } +} + +/** update_lru() - Update the lru information for an active page. 
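+ *
+ * Moves the page to the most-recently-used end of the cache's lru list, unless it is
+ * already there.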
*/ +static void update_lru(struct page_info *info) +{ + if (info->cache->lru_list.prev != &info->lru_entry) + list_move_tail(&info->lru_entry, &info->cache->lru_list); +} + +/** + * set_info_state() - Set the state of a page_info and put it on the right list, adjusting + * counters. + */ +static void +set_info_state(struct page_info *info, enum vdo_page_buffer_state new_state) +{ + if (new_state == info->state) + return; + + update_counter(info, -1); + info->state = new_state; + update_counter(info, 1); + + switch (info->state) { + case PS_FREE: + case PS_FAILED: + list_move_tail(&info->state_entry, &info->cache->free_list); + return; + + case PS_OUTGOING: + list_move_tail(&info->state_entry, &info->cache->outgoing_list); + return; + + case PS_DIRTY: + return; + + default: + list_del_init(&info->state_entry); + } +} + +/** set_info_pbn() - Set the pbn for an info, updating the map as needed. */ +static int __must_check set_info_pbn(struct page_info *info, physical_block_number_t pbn) +{ + struct vdo_page_cache *cache = info->cache; + + /* Either the new or the old page number must be NO_PAGE. */ + int result = ASSERT((pbn == NO_PAGE) || (info->pbn == NO_PAGE), + "Must free a page before reusing it."); + if (result != VDO_SUCCESS) + return result; + + if (info->pbn != NO_PAGE) + vdo_int_map_remove(cache->page_map, info->pbn); + + info->pbn = pbn; + + if (pbn != NO_PAGE) { + result = vdo_int_map_put(cache->page_map, pbn, info, true, NULL); + if (result != UDS_SUCCESS) + return result; + } + return VDO_SUCCESS; +} + +/** reset_page_info() - Reset page info to represent an unallocated page. */ +static int reset_page_info(struct page_info *info) +{ + int result; + + result = ASSERT(info->busy == 0, "VDO Page must not be busy"); + if (result != UDS_SUCCESS) + return result; + + result = ASSERT(!vdo_has_waiters(&info->waiting), "VDO Page must not have waiters"); + if (result != UDS_SUCCESS) + return result; + + result = set_info_pbn(info, NO_PAGE); + set_info_state(info, PS_FREE); + list_del_init(&info->lru_entry); + return result; +} + +/** + * find_free_page() - Find a free page. + * + * Return: A pointer to the page info structure (if found), NULL otherwise. + */ +static struct page_info * __must_check find_free_page(struct vdo_page_cache *cache) +{ + struct page_info *info; + + info = list_first_entry_or_null(&cache->free_list, struct page_info, state_entry); + if (info != NULL) + list_del_init(&info->state_entry); + return info; +} + +/** + * find_page() - Find the page info (if any) associated with a given pbn. + * @pbn: The absolute physical block number of the page. + * + * Return: The page info for the page if available, or NULL if not. + */ +static struct page_info * __must_check +find_page(struct vdo_page_cache *cache, physical_block_number_t pbn) +{ + if ((cache->last_found != NULL) && (cache->last_found->pbn == pbn)) + return cache->last_found; + cache->last_found = vdo_int_map_get(cache->page_map, pbn); + return cache->last_found; +} + +/** + * select_lru_page() - Determine which page is least recently used. + * + * Picks the least recently used from among the non-busy entries at the front of each of the lru + * ring. Since whenever we mark a page busy we also put it to the end of the ring it is unlikely + * that the entries at the front are busy unless the queue is very short, but not impossible. + * + * Return: A pointer to the info structure for a relevant page, or NULL if no such page can be + * found. The page can be dirty or resident. 
+ */ +static struct page_info * __must_check select_lru_page(struct vdo_page_cache *cache) +{ + struct page_info *info; + + list_for_each_entry(info, &cache->lru_list, lru_entry) + if ((info->busy == 0) && !is_in_flight(info)) + return info; + + return NULL; +} + +/* ASYNCHRONOUS INTERFACE BEYOND THIS POINT */ + +/** + * complete_with_page() - Helper to complete the VDO Page Completion request successfully. + * @info: The page info representing the result page. + * @vdo_page_comp: The VDO page completion to complete. + */ +static void complete_with_page(struct page_info *info, struct vdo_page_completion *vdo_page_comp) +{ + bool available = vdo_page_comp->writable ? is_present(info) : is_valid(info); + + if (!available) { + uds_log_error_strerror(VDO_BAD_PAGE, + "Requested cache page %llu in state %s is not %s", + (unsigned long long) info->pbn, + get_page_state_name(info->state), + vdo_page_comp->writable ? "present" : + "valid"); + vdo_fail_completion(&vdo_page_comp->completion, VDO_BAD_PAGE); + return; + } + + vdo_page_comp->info = info; + vdo_page_comp->ready = true; + vdo_finish_completion(&vdo_page_comp->completion); +} + +/** + * complete_waiter_with_error() - Complete a page completion with an error code. + * @waiter: The page completion, as a waiter. + * @result_ptr: A pointer to the error code. + * + * Implements waiter_callback. + */ +static void complete_waiter_with_error(struct waiter *waiter, void *result_ptr) +{ + int *result = result_ptr; + + vdo_fail_completion(&page_completion_from_waiter(waiter)->completion, *result); +} + +/** + * complete_waiter_with_page() - Complete a page completion with a page. + * @waiter: The page completion, as a waiter. + * @page_info: The page info to complete with. + * + * Implements waiter_callback. + */ +static void complete_waiter_with_page(struct waiter *waiter, void *page_info) +{ + complete_with_page((struct page_info *) page_info, page_completion_from_waiter(waiter)); +} + +/** + * distribute_page_over_queue() - Complete a queue of VDO page completions with a page result. + * + * Upon completion the queue will be empty. + * + * Return: The number of pages distributed. + */ +static unsigned int distribute_page_over_queue(struct page_info *info, struct wait_queue *queue) +{ + size_t pages; + + update_lru(info); + pages = vdo_count_waiters(queue); + + /* + * Increment the busy count once for each pending completion so that this page does not + * stop being busy until all completions have been processed (VDO-83). + */ + info->busy += pages; + + vdo_notify_all_waiters(queue, complete_waiter_with_page, info); + return pages; +} + +/** + * set_persistent_error() - Set a persistent error which all requests will receive in the future. + * @context: A string describing what triggered the error. + * + * Once triggered, all enqueued completions will get this error. Any future requests will result in + * this error as well. + */ +static void set_persistent_error(struct vdo_page_cache *cache, const char *context, int result) +{ + struct page_info *info; + /* If we're already read-only, there's no need to log. 
*/ + struct vdo *vdo = cache->zone->block_map->vdo; + + if ((result != VDO_READ_ONLY) && !vdo_is_read_only(vdo)) { + uds_log_error_strerror(result, "VDO Page Cache persistent error: %s", context); + vdo_enter_read_only_mode(vdo, result); + } + + assert_on_cache_thread(cache, __func__); + + vdo_notify_all_waiters(&cache->free_waiters, complete_waiter_with_error, &result); + cache->waiter_count = 0; + + for (info = cache->infos; info < cache->infos + cache->page_count; ++info) + vdo_notify_all_waiters(&info->waiting, complete_waiter_with_error, &result); +} + +/** + * validate_completed_page() - Check that a page completion which is being freed to the cache + * referred to a valid page and is in a valid state. + * @writable: Whether a writable page is required. + * + * Return: VDO_SUCCESS if the page was valid, otherwise as error + */ +static int __must_check +validate_completed_page(struct vdo_page_completion *completion, bool writable) +{ + int result; + + result = ASSERT(completion->ready, "VDO Page completion not ready"); + if (result != UDS_SUCCESS) + return result; + + result = ASSERT(completion->info != NULL, "VDO Page Completion must be complete"); + if (result != UDS_SUCCESS) + return result; + + result = ASSERT(completion->info->pbn == completion->pbn, + "VDO Page Completion pbn must be consistent"); + if (result != UDS_SUCCESS) + return result; + + result = ASSERT(is_valid(completion->info), "VDO Page Completion page must be valid"); + if (result != UDS_SUCCESS) + return result; + + if (writable) { + result = ASSERT(completion->writable, "VDO Page Completion is writable"); + if (result != UDS_SUCCESS) + return result; + } + + return VDO_SUCCESS; +} + +static void check_for_drain_complete(struct block_map_zone *zone) +{ + if (vdo_is_state_draining(&zone->state) && + (zone->active_lookups == 0) && + !vdo_has_waiters(&zone->flush_waiters) && + !is_vio_pool_busy(zone->vio_pool) && + (zone->page_cache.outstanding_reads == 0) && + (zone->page_cache.outstanding_writes == 0)) + vdo_finish_draining_with_result(&zone->state, + (vdo_is_read_only(zone->block_map->vdo) ? + VDO_READ_ONLY : VDO_SUCCESS)); +} + +static void enter_zone_read_only_mode(struct block_map_zone *zone, int result) +{ + vdo_enter_read_only_mode(zone->block_map->vdo, result); + + /* + * We are in read-only mode, so we won't ever write any page out. Just take all waiters off + * the queue so the zone can drain. + */ + while (vdo_has_waiters(&zone->flush_waiters)) + vdo_dequeue_next_waiter(&zone->flush_waiters); + + check_for_drain_complete(zone); +} + +static bool __must_check +validate_completed_page_or_enter_read_only_mode(struct vdo_page_completion *completion, + bool writable) +{ + int result = validate_completed_page(completion, writable); + + if (result == VDO_SUCCESS) + return true; + + enter_zone_read_only_mode(completion->info->cache->zone, result); + return false; +} + +/** + * handle_load_error() - Handle page load errors. + * @completion: The page read vio. 
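+ *
+ * Enters read-only mode, fails every completion waiting on the page, and resets the page
+ * info so its buffer can be reused.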
+ */ +static void handle_load_error(struct vdo_completion *completion) +{ + int result = completion->result; + struct page_info *info = completion->parent; + struct vdo_page_cache *cache = info->cache; + + assert_on_cache_thread(cache, __func__); + vio_record_metadata_io_error(as_vio(completion)); + vdo_enter_read_only_mode(cache->zone->block_map->vdo, result); + ADD_ONCE(cache->stats.failed_reads, 1); + set_info_state(info, PS_FAILED); + vdo_notify_all_waiters(&info->waiting, complete_waiter_with_error, &result); + reset_page_info(info); + + /* + * Don't decrement until right before calling check_for_drain_complete() to + * ensure that the above work can't cause the page cache to be freed out from under us. + */ + cache->outstanding_reads--; + check_for_drain_complete(cache->zone); +} + +/** + * page_is_loaded() - Callback used when a page has been loaded. + * @completion: The vio which has loaded the page. Its parent is the page_info. + */ +static void page_is_loaded(struct vdo_completion *completion) +{ + struct page_info *info = completion->parent; + struct vdo_page_cache *cache = info->cache; + nonce_t nonce = info->cache->zone->block_map->nonce; + struct block_map_page *page; + enum block_map_page_validity validity; + + assert_on_cache_thread(cache, __func__); + + page = (struct block_map_page *) get_page_buffer(info); + validity = vdo_validate_block_map_page(page, nonce, info->pbn); + if (validity == VDO_BLOCK_MAP_PAGE_BAD) { + physical_block_number_t pbn = vdo_get_block_map_page_pbn(page); + int result = uds_log_error_strerror(VDO_BAD_PAGE, + "Expected page %llu but got page %llu instead", + (unsigned long long) info->pbn, + (unsigned long long) pbn); + + vdo_continue_completion(completion, result); + return; + } + + if (validity == VDO_BLOCK_MAP_PAGE_INVALID) + vdo_format_block_map_page(page, nonce, info->pbn, false); + + info->recovery_lock = 0; + set_info_state(info, PS_RESIDENT); + distribute_page_over_queue(info, &info->waiting); + + /* + * Don't decrement until right before calling check_for_drain_complete() to + * ensure that the above work can't cause the page cache to be freed out from under us. + */ + cache->outstanding_reads--; + check_for_drain_complete(cache->zone); +} + +/** + * handle_rebuild_read_error() - Handle a read error during a read-only rebuild. + * @completion: The page load completion. + */ +static void handle_rebuild_read_error(struct vdo_completion *completion) +{ + struct page_info *info = completion->parent; + struct vdo_page_cache *cache = info->cache; + + assert_on_cache_thread(cache, __func__); + + /* + * We are doing a read-only rebuild, so treat this as a successful read of an uninitialized + * page. + */ + vio_record_metadata_io_error(as_vio(completion)); + ADD_ONCE(cache->stats.failed_reads, 1); + memset(get_page_buffer(info), 0, VDO_BLOCK_SIZE); + vdo_reset_completion(completion); + page_is_loaded(completion); +} + +static void load_cache_page_endio(struct bio *bio) +{ + struct vio *vio = bio->bi_private; + struct page_info *info = vio->completion.parent; + + continue_vio_after_io(vio, page_is_loaded, info->cache->zone->thread_id); +} + +/** + * launch_page_load() - Begin the process of loading a page. + * + * Return: VDO_SUCCESS or an error code. 
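+ *
+ * Binds the page info to the requested pbn, marks it PS_INCOMING, and submits a metadata
+ * read. The read completes on the cache thread via page_is_loaded(); errors go to
+ * handle_load_error(), or to handle_rebuild_read_error() during a read-only rebuild.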
+ */ +static int __must_check launch_page_load(struct page_info *info, physical_block_number_t pbn) +{ + int result; + vdo_action *callback; + struct vdo_page_cache *cache = info->cache; + + assert_io_allowed(cache); + + result = set_info_pbn(info, pbn); + if (result != VDO_SUCCESS) + return result; + + result = ASSERT((info->busy == 0), "Page is not busy before loading."); + if (result != VDO_SUCCESS) + return result; + + set_info_state(info, PS_INCOMING); + cache->outstanding_reads++; + ADD_ONCE(cache->stats.pages_loaded, 1); + callback = (cache->rebuilding ? handle_rebuild_read_error : handle_load_error); + submit_metadata_vio(info->vio, + pbn, + load_cache_page_endio, + callback, + REQ_OP_READ | REQ_PRIO); + return VDO_SUCCESS; +} + +static void write_pages(struct vdo_completion *completion); + +/** handle_flush_error() - Handle errors flushing the layer. */ +static void handle_flush_error(struct vdo_completion *completion) +{ + struct page_info *info = completion->parent; + + vio_record_metadata_io_error(as_vio(completion)); + set_persistent_error(info->cache, "flush failed", completion->result); + write_pages(completion); +} + +static void flush_endio(struct bio *bio) +{ + struct vio *vio = bio->bi_private; + struct page_info *info = vio->completion.parent; + + continue_vio_after_io(vio, write_pages, info->cache->zone->thread_id); +} + +/** save_pages() - Attempt to save the outgoing pages by first flushing the layer. */ +static void save_pages(struct vdo_page_cache *cache) +{ + struct page_info *info; + struct vio *vio; + + if ((cache->pages_in_flush > 0) || (cache->pages_to_flush == 0)) + return; + + assert_io_allowed(cache); + + info = list_first_entry(&cache->outgoing_list, struct page_info, state_entry); + + cache->pages_in_flush = cache->pages_to_flush; + cache->pages_to_flush = 0; + ADD_ONCE(cache->stats.flush_count, 1); + + vio = info->vio; + + /* + * We must make sure that the recovery journal entries that changed these pages were + * successfully persisted, and thus must issue a flush before each batch of pages is + * written to ensure this. + */ + submit_flush_vio(vio, flush_endio, handle_flush_error); +} + +/** + * schedule_page_save() - Add a page to the outgoing list of pages waiting to be saved. + * + * Once in the list, a page may not be used until it has been written out. + */ +static void schedule_page_save(struct page_info *info) +{ + if (info->busy > 0) { + info->write_status = WRITE_STATUS_DEFERRED; + return; + } + + info->cache->pages_to_flush++; + info->cache->outstanding_writes++; + set_info_state(info, PS_OUTGOING); +} + +/** + * launch_page_save() - Add a page to outgoing pages waiting to be saved, and then start saving + * pages if another save is not in progress. + */ +static void launch_page_save(struct page_info *info) +{ + schedule_page_save(info); + save_pages(info->cache); +} + +/** + * completion_needs_page() - Determine whether a given vdo_page_completion (as a waiter) is + * requesting a given page number. + * @context: A pointer to the pbn of the desired page. + * + * Implements waiter_match. + * + * Return: true if the page completion is for the desired page number. + */ +static bool completion_needs_page(struct waiter *waiter, void *context) +{ + physical_block_number_t *pbn = context; + + return (page_completion_from_waiter(waiter)->pbn == *pbn); +} + +/** + * allocate_free_page() - Allocate a free page to the first completion in the waiting queue, and + * any other completions that match it in page number. 
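+ *
+ * Every waiter whose pbn matches that of the oldest free-page waiter is moved onto the
+ * page's own wait queue before the load is launched, so a single read satisfies all
+ * outstanding requests for that page.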
+ */ +static void allocate_free_page(struct page_info *info) +{ + int result; + struct waiter *oldest_waiter; + physical_block_number_t pbn; + struct vdo_page_cache *cache = info->cache; + + assert_on_cache_thread(cache, __func__); + + if (!vdo_has_waiters(&cache->free_waiters)) { + if (cache->stats.cache_pressure > 0) { + uds_log_info("page cache pressure relieved"); + WRITE_ONCE(cache->stats.cache_pressure, 0); + } + return; + } + + result = reset_page_info(info); + if (result != VDO_SUCCESS) { + set_persistent_error(cache, "cannot reset page info", result); + return; + } + + oldest_waiter = vdo_get_first_waiter(&cache->free_waiters); + pbn = page_completion_from_waiter(oldest_waiter)->pbn; + + /* + * Remove all entries which match the page number in question and push them onto the page + * info's wait queue. + */ + vdo_dequeue_matching_waiters(&cache->free_waiters, + completion_needs_page, + &pbn, + &info->waiting); + cache->waiter_count -= vdo_count_waiters(&info->waiting); + + result = launch_page_load(info, pbn); + if (result != VDO_SUCCESS) + vdo_notify_all_waiters(&info->waiting, complete_waiter_with_error, &result); +} + +/** + * discard_a_page() - Begin the process of discarding a page. + * + * If no page is discardable, increments a count of deferred frees so that the next release of a + * page which is no longer busy will kick off another discard cycle. This is an indication that the + * cache is not big enough. + * + * If the selected page is not dirty, immediately allocates the page to the oldest completion + * waiting for a free page. + */ +static void discard_a_page(struct vdo_page_cache *cache) +{ + struct page_info *info = select_lru_page(cache); + + if (info == NULL) { + report_cache_pressure(cache); + return; + } + + if (!is_dirty(info)) { + allocate_free_page(info); + return; + } + + ASSERT_LOG_ONLY(!is_in_flight(info), "page selected for discard is not in flight"); + + ++cache->discard_count; + info->write_status = WRITE_STATUS_DISCARD; + launch_page_save(info); +} + +/** + * discard_page_for_completion() - Helper used to trigger a discard so that the completion can get + * a different page. + */ +static void discard_page_for_completion(struct vdo_page_completion *vdo_page_comp) +{ + struct vdo_page_cache *cache = vdo_page_comp->cache; + + ++cache->waiter_count; + vdo_enqueue_waiter(&cache->free_waiters, &vdo_page_comp->waiter); + discard_a_page(cache); +} + +/** + * discard_page_if_needed() - Helper used to trigger a discard if the cache needs another free + * page. + * @cache: The page cache. + */ +static void discard_page_if_needed(struct vdo_page_cache *cache) +{ + if (cache->waiter_count > cache->discard_count) + discard_a_page(cache); +} + +/** + * write_has_finished() - Inform the cache that a write has finished (possibly with an error). + * @info: The info structure for the page whose write just completed. + * + * Return: true if the page write was a discard. + */ +static bool write_has_finished(struct page_info *info) +{ + bool was_discard = (info->write_status == WRITE_STATUS_DISCARD); + + assert_on_cache_thread(info->cache, __func__); + info->cache->outstanding_writes--; + + info->write_status = WRITE_STATUS_NORMAL; + return was_discard; +} + +/** + * handle_page_write_error() - Handler for page write errors. + * @completion: The page write vio. 
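+ *
+ * Returns the page to the dirty state and sets a persistent error on the cache, so current
+ * and future page requests receive the error.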
+ */ +static void handle_page_write_error(struct vdo_completion *completion) +{ + int result = completion->result; + struct page_info *info = completion->parent; + struct vdo_page_cache *cache = info->cache; + + vio_record_metadata_io_error(as_vio(completion)); + + /* If we're already read-only, write failures are to be expected. */ + if (result != VDO_READ_ONLY) { + static DEFINE_RATELIMIT_STATE(error_limiter, + DEFAULT_RATELIMIT_INTERVAL, + DEFAULT_RATELIMIT_BURST); + + if (__ratelimit(&error_limiter)) + uds_log_error("failed to write block map page %llu", + (unsigned long long) info->pbn); + } + + set_info_state(info, PS_DIRTY); + ADD_ONCE(cache->stats.failed_writes, 1); + set_persistent_error(cache, "cannot write page", result); + + if (!write_has_finished(info)) + discard_page_if_needed(cache); + + check_for_drain_complete(cache->zone); +} + +static void page_is_written_out(struct vdo_completion *completion); + +static void write_cache_page_endio(struct bio *bio) +{ + struct vio *vio = bio->bi_private; + struct page_info *info = vio->completion.parent; + + continue_vio_after_io(vio, page_is_written_out, info->cache->zone->thread_id); +} + +/** + * page_is_written_out() - Callback used when a page has been written out. + * @completion: The vio which wrote the page. Its parent is a page_info. + */ +static void page_is_written_out(struct vdo_completion *completion) +{ + bool was_discard, reclaimed; + u32 reclamations; + struct page_info *info = completion->parent; + struct vdo_page_cache *cache = info->cache; + struct block_map_page *page = (struct block_map_page *) get_page_buffer(info); + + if (!page->header.initialized) { + page->header.initialized = true; + submit_metadata_vio(info->vio, + info->pbn, + write_cache_page_endio, + handle_page_write_error, + (REQ_OP_WRITE | REQ_PRIO | REQ_PREFLUSH)); + return; + } + + /* Handle journal updates and torn write protection. */ + vdo_release_recovery_journal_block_reference(cache->zone->block_map->journal, + info->recovery_lock, + VDO_ZONE_TYPE_LOGICAL, + cache->zone->zone_number); + info->recovery_lock = 0; + was_discard = write_has_finished(info); + reclaimed = (!was_discard || (info->busy > 0) || vdo_has_waiters(&info->waiting)); + + set_info_state(info, PS_RESIDENT); + + reclamations = distribute_page_over_queue(info, &info->waiting); + ADD_ONCE(cache->stats.reclaimed, reclamations); + + if (was_discard) + cache->discard_count--; + + if (reclaimed) + discard_page_if_needed(cache); + else + allocate_free_page(info); + + check_for_drain_complete(cache->zone); +} + +/** + * write_pages() - Write the batch of pages which were covered by the layer flush which just + * completed. + * @flush_completion: The flush vio. + * + * This callback is registered in save_pages(). + */ +static void write_pages(struct vdo_completion *flush_completion) +{ + struct vdo_page_cache *cache = ((struct page_info *) flush_completion->parent)->cache; + + /* + * We need to cache these two values on the stack since in the error case below, it is + * possible for the last page info to cause the page cache to get freed. Hence once we + * launch the last page, it may be unsafe to dereference the cache [VDO-4724]. 
+ */ + bool has_unflushed_pages = (cache->pages_to_flush > 0); + page_count_t pages_in_flush = cache->pages_in_flush; + + cache->pages_in_flush = 0; + while (pages_in_flush-- > 0) { + struct page_info *info = + list_first_entry(&cache->outgoing_list, struct page_info, state_entry); + + list_del_init(&info->state_entry); + if (vdo_is_read_only(info->cache->zone->block_map->vdo)) { + struct vdo_completion *completion = &info->vio->completion; + + vdo_reset_completion(completion); + completion->callback = page_is_written_out; + completion->error_handler = handle_page_write_error; + vdo_fail_completion(completion, VDO_READ_ONLY); + continue; + } + ADD_ONCE(info->cache->stats.pages_saved, 1); + submit_metadata_vio(info->vio, + info->pbn, + write_cache_page_endio, + handle_page_write_error, + REQ_OP_WRITE | REQ_PRIO); + } + + if (has_unflushed_pages) + /* + * If there are unflushed pages, the cache can't have been freed, so this call is + * safe. + */ + save_pages(cache); +} + +/** + * vdo_release_page_completion() - Release a VDO Page Completion. + * + * The page referenced by this completion (if any) will no longer be held busy by this completion. + * If a page becomes discardable and there are completions awaiting free pages then a new round of + * page discarding is started. + */ +void vdo_release_page_completion(struct vdo_completion *completion) +{ + struct page_info *discard_info = NULL; + struct vdo_page_completion *page_completion = as_vdo_page_completion(completion); + struct vdo_page_cache *cache; + + if (completion->result == VDO_SUCCESS) { + if (!validate_completed_page_or_enter_read_only_mode(page_completion, false)) + return; + + if (--page_completion->info->busy == 0) + discard_info = page_completion->info; + } + + ASSERT_LOG_ONLY((page_completion->waiter.next_waiter == NULL), + "Page being released after leaving all queues"); + + page_completion->info = NULL; + cache = page_completion->cache; + assert_on_cache_thread(cache, __func__); + + if (discard_info != NULL) { + if (discard_info->write_status == WRITE_STATUS_DEFERRED) { + discard_info->write_status = WRITE_STATUS_NORMAL; + launch_page_save(discard_info); + } + /* + * if there are excess requests for pages (that have not already started discards) + * we need to discard some page (which may be this one) + */ + discard_page_if_needed(cache); + } +} + +/** + * load_page_for_completion() - Helper function to load a page as described by a VDO Page + * Completion. + */ +static void +load_page_for_completion(struct page_info *info, struct vdo_page_completion *vdo_page_comp) +{ + int result; + + vdo_enqueue_waiter(&info->waiting, &vdo_page_comp->waiter); + result = launch_page_load(info, vdo_page_comp->pbn); + if (result != VDO_SUCCESS) + vdo_notify_all_waiters(&info->waiting, complete_waiter_with_error, &result); +} + +/** + * vdo_get_page() - Initialize a page completion and get a block map page. + * @page_completion: The vdo_page_completion to initialize. + * @zone: The block map zone of the desired page. + * @pbn: The absolute physical block of the desired page. + * @writable: Whether the page can be modified. + * @parent: The object to notify when the fetch is complete. + * @callback: The notification callback. + * @error_handler: The handler for fetch errors. + * @requeue: Whether we must requeue when notifying the parent. + * + * May cause another page to be discarded (potentially writing a dirty page) and the one nominated + * by the completion to be loaded from disk. 
When the callback is invoked, the page will be + * resident in the cache and marked busy. All callers must call vdo_release_page_completion() + * when they are done with the page to clear the busy mark. + */ +void vdo_get_page(struct vdo_page_completion *page_completion, + struct block_map_zone *zone, + physical_block_number_t pbn, + bool writable, + void *parent, + vdo_action *callback, + vdo_action *error_handler, + bool requeue) +{ + struct vdo_page_cache *cache = &zone->page_cache; + struct vdo_completion *completion = &page_completion->completion; + struct page_info *info; + + assert_on_cache_thread(cache, __func__); + ASSERT_LOG_ONLY((page_completion->waiter.next_waiter == NULL), + "New page completion was not already on a wait queue"); + + *page_completion = (struct vdo_page_completion) { + .pbn = pbn, + .writable = writable, + .cache = cache, + }; + + vdo_initialize_completion(completion, cache->vdo, VDO_PAGE_COMPLETION); + vdo_prepare_completion(completion, + callback, + error_handler, + cache->zone->thread_id, + parent); + completion->requeue = requeue; + + if (page_completion->writable && vdo_is_read_only(cache->zone->block_map->vdo)) { + vdo_fail_completion(completion, VDO_READ_ONLY); + return; + } + + if (page_completion->writable) + ADD_ONCE(cache->stats.write_count, 1); + else + ADD_ONCE(cache->stats.read_count, 1); + + info = find_page(cache, page_completion->pbn); + if (info != NULL) { + /* The page is in the cache already. */ + if ((info->write_status == WRITE_STATUS_DEFERRED) || + is_incoming(info) || + (is_outgoing(info) && page_completion->writable)) { + /* The page is unusable until it has finished I/O. */ + ADD_ONCE(cache->stats.wait_for_page, 1); + vdo_enqueue_waiter(&info->waiting, &page_completion->waiter); + return; + } + + if (is_valid(info)) { + /* The page is usable. */ + ADD_ONCE(cache->stats.found_in_cache, 1); + if (!is_present(info)) + ADD_ONCE(cache->stats.read_outgoing, 1); + update_lru(info); + ++info->busy; + complete_with_page(info, page_completion); + return; + } + /* Something horrible has gone wrong. */ + ASSERT_LOG_ONLY(false, "Info found in a usable state."); + } + + /* The page must be fetched. */ + info = find_free_page(cache); + if (info != NULL) { + ADD_ONCE(cache->stats.fetch_required, 1); + load_page_for_completion(info, page_completion); + return; + } + + /* The page must wait for a page to be discarded. */ + ADD_ONCE(cache->stats.discard_required, 1); + discard_page_for_completion(page_completion); +} + +/** + * vdo_request_page_write() - Request that a VDO page be written out as soon as it is not busy. + * @completion: The vdo_page_completion containing the page. + */ +void vdo_request_page_write(struct vdo_completion *completion) +{ + struct page_info *info; + struct vdo_page_completion *vdo_page_comp = as_vdo_page_completion(completion); + + if (!validate_completed_page_or_enter_read_only_mode(vdo_page_comp, true)) + return; + + info = vdo_page_comp->info; + set_info_state(info, PS_DIRTY); + launch_page_save(info); +} + +/** + * vdo_get_cached_page() - Get the block map page from a page completion. + * @completion: A vdo page completion whose callback has been called. 
+ * @page_ptr: A pointer to hold the page + * + * Return: VDO_SUCCESS or an error + */ +int vdo_get_cached_page(struct vdo_completion *completion, struct block_map_page **page_ptr) +{ + int result; + struct vdo_page_completion *vpc; + + vpc = as_vdo_page_completion(completion); + result = validate_completed_page(vpc, true); + if (result == VDO_SUCCESS) + *page_ptr = (struct block_map_page *) get_page_buffer(vpc->info); + + return result; +} + +/** + * vdo_invalidate_page_cache() - Invalidate all entries in the VDO page cache. + * + * There must not be any dirty pages in the cache. + * + * Return: A success or error code. + */ +int vdo_invalidate_page_cache(struct vdo_page_cache *cache) +{ + struct page_info *info; + + assert_on_cache_thread(cache, __func__); + + /* Make sure we don't throw away any dirty pages. */ + for (info = cache->infos; info < cache->infos + cache->page_count; info++) { + int result = ASSERT(!is_dirty(info), "cache must have no dirty pages"); + + if (result != VDO_SUCCESS) + return result; + } + + /* Reset the page map by re-allocating it. */ + vdo_free_int_map(UDS_FORGET(cache->page_map)); + return vdo_make_int_map(cache->page_count, 0, &cache->page_map); +} + /** * get_tree_page_by_index() - Get the tree page for a given height and page index. * diff --git a/drivers/md/dm-vdo/block-map.h b/drivers/md/dm-vdo/block-map.h index 2c4d40f21a69..d13e87b79806 100644 --- a/drivers/md/dm-vdo/block-map.h +++ b/drivers/md/dm-vdo/block-map.h @@ -31,6 +31,147 @@ extern const physical_block_number_t VDO_INVALID_PBN; */ typedef u32 vdo_page_generation; +static const physical_block_number_t NO_PAGE = 0xFFFFFFFFFFFFFFFF; +static const struct block_map_entry UNMAPPED_BLOCK_MAP_ENTRY = { + .mapping_state = VDO_MAPPING_STATE_UNMAPPED & 0x0F, + .pbn_high_nibble = 0, + .pbn_low_word = __cpu_to_le32(VDO_ZERO_BLOCK & UINT_MAX), +}; + +/* The VDO Page Cache abstraction. */ +struct vdo_page_cache { + /* the VDO which owns this cache */ + struct vdo *vdo; + /* number of pages in cache */ + page_count_t page_count; + /* number of pages to write in the current batch */ + page_count_t pages_in_batch; + /* Whether the VDO is doing a read-only rebuild */ + bool rebuilding; + + /* array of page information entries */ + struct page_info *infos; + /* raw memory for pages */ + char *pages; + /* cache last found page info */ + struct page_info *last_found; + /* map of page number to info */ + struct int_map *page_map; + /* main LRU list (all infos) */ + struct list_head lru_list; + /* free page list (oldest first) */ + struct list_head free_list; + /* outgoing page list */ + struct list_head outgoing_list; + /* number of read I/O operations pending */ + page_count_t outstanding_reads; + /* number of write I/O operations pending */ + page_count_t outstanding_writes; + /* number of pages covered by the current flush */ + page_count_t pages_in_flush; + /* number of pages waiting to be included in the next flush */ + page_count_t pages_to_flush; + /* number of discards in progress */ + unsigned int discard_count; + /* how many VPCs waiting for free page */ + unsigned int waiter_count; + /* queue of waiters who want a free page */ + struct wait_queue free_waiters; + /* + * Statistics are only updated on the logical zone thread, but are accessed from other + * threads. + */ + struct block_map_statistics stats; + /* counter for pressure reports */ + u32 pressure_report; + /* the block map zone to which this cache belongs */ + struct block_map_zone *zone; +}; + +/* + * The state of a page buffer. 
If the page buffer is free no particular page is bound to it, + * otherwise the page buffer is bound to particular page whose absolute pbn is in the pbn field. If + * the page is resident or dirty the page data is stable and may be accessed. Otherwise the page is + * in flight (incoming or outgoing) and its data should not be accessed. + * + * @note Update the static data in get_page_state_name() if you change this enumeration. + */ +enum vdo_page_buffer_state { + /* this page buffer is not being used */ + PS_FREE, + /* this page is being read from store */ + PS_INCOMING, + /* attempt to load this page failed */ + PS_FAILED, + /* this page is valid and un-modified */ + PS_RESIDENT, + /* this page is valid and modified */ + PS_DIRTY, + /* this page is being written and should not be used */ + PS_OUTGOING, + /* not a state */ + PAGE_STATE_COUNT, +} __packed; + +/* + * The write status of page + */ +enum vdo_page_write_status { + WRITE_STATUS_NORMAL, + WRITE_STATUS_DISCARD, + WRITE_STATUS_DEFERRED, +} __packed; + +/* Per-page-slot information. */ +struct page_info { + /* Preallocated page struct vio */ + struct vio *vio; + /* back-link for references */ + struct vdo_page_cache *cache; + /* the pbn of the page */ + physical_block_number_t pbn; + /* page is busy (temporarily locked) */ + u16 busy; + /* the write status the page */ + enum vdo_page_write_status write_status; + /* page state */ + enum vdo_page_buffer_state state; + /* queue of completions awaiting this item */ + struct wait_queue waiting; + /* state linked list entry */ + struct list_head state_entry; + /* LRU entry */ + struct list_head lru_entry; + /* + * The earliest recovery journal block containing uncommitted updates to the block map page + * associated with this page_info. A reference (lock) is held on that block to prevent it + * from being reaped. When this value changes, the reference on the old value must be + * released and a reference on the new value must be acquired. + */ + sequence_number_t recovery_lock; +}; + +/* + * A completion awaiting a specific page. Also a live reference into the page once completed, until + * freed. 
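+ *
+ * A minimal usage sketch (illustrative only; vpc is assumed to be a vdo_page_completion
+ * embedded in a long-lived caller structure, since the fetch is asynchronous):
+ *
+ *   vdo_get_page(&vpc, zone, pbn, true, parent, callback, error_handler, false);
+ *
+ * Later, once callback has been invoked on the cache thread:
+ *
+ *   struct block_map_page *page;
+ *   int result = vdo_get_cached_page(&vpc.completion, &page);
+ *
+ * After modifying the page, schedule the write and drop the busy reference:
+ *
+ *   vdo_request_page_write(&vpc.completion);
+ *   vdo_release_page_completion(&vpc.completion);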
+ */ +struct vdo_page_completion { + /* The generic completion */ + struct vdo_completion completion; + /* The cache involved */ + struct vdo_page_cache *cache; + /* The waiter for the pending list */ + struct waiter waiter; + /* The absolute physical block number of the page on disk */ + physical_block_number_t pbn; + /* Whether the page may be modified */ + bool writable; + /* Whether the page is available */ + bool ready; + /* The info structure for the page, only valid when ready */ + struct page_info *info; +}; + struct forest; struct tree_page { @@ -141,6 +282,30 @@ struct block_map { */ typedef int vdo_entry_callback(physical_block_number_t pbn, struct vdo_completion *completion); +static inline struct vdo_page_completion *as_vdo_page_completion(struct vdo_completion *completion) +{ + vdo_assert_completion_type(completion, VDO_PAGE_COMPLETION); + return container_of(completion, struct vdo_page_completion, completion); +} + +void vdo_release_page_completion(struct vdo_completion *completion); + +void vdo_get_page(struct vdo_page_completion *page_completion, + struct block_map_zone *zone, + physical_block_number_t pbn, + bool writable, + void *parent, + vdo_action *callback, + vdo_action *error_handler, + bool requeue); + +void vdo_request_page_write(struct vdo_completion *completion); + +int __must_check vdo_get_cached_page(struct vdo_completion *completion, + struct block_map_page **page_ptr); + +int __must_check vdo_invalidate_page_cache(struct vdo_page_cache *cache); + static inline struct block_map_page * __must_check vdo_as_block_map_page(struct tree_page *tree_page) { From patchwork Thu Oct 26 21:41:28 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Matthew Sakai X-Patchwork-Id: 13437955 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 3695AC25B70 for ; Thu, 26 Oct 2023 21:44:33 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1344896AbjJZVoc (ORCPT ); Thu, 26 Oct 2023 17:44:32 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:41498 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S234978AbjJZVoU (ORCPT ); Thu, 26 Oct 2023 17:44:20 -0400 Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.133.124]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id EF0F9D4D for ; Thu, 26 Oct 2023 14:41:45 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1698356504; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=EW37FE/4eS1FyK1wdvp5n/uvJfYCNVFzS555lMT0ohM=; b=a+6crojT+smY+xGdTfwqpvByHADlS+AltSTuI1qWZ4+yp0oaYd3iGn80MdW15dsu1W3I+C RtD+EqlgXCYLa/MEeWOZ9amnXce6YlVhINaSIWye2xCsZqKCOCTXsb5rl7ieGli6K/25XF a5ZvwVETKzEtHagla84RqyMNztZf4XI= Received: from mimecast-mx02.redhat.com (mimecast-mx02.redhat.com [66.187.233.88]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.3, cipher=TLS_AES_256_GCM_SHA384) id us-mta-228-cYMjLd_DNGqAZNxTKpLEQg-1; Thu, 26 Oct 2023 17:41:42 -0400 X-MC-Unique: cYMjLd_DNGqAZNxTKpLEQg-1 Received: from smtp.corp.redhat.com (int-mx08.intmail.prod.int.rdu2.redhat.com 
[10.11.54.8]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (2048 bits) server-digest SHA256) (No client certificate requested) by mimecast-mx02.redhat.com (Postfix) with ESMTPS id 989FC800CAA; Thu, 26 Oct 2023 21:41:42 +0000 (UTC) Received: from pbitcolo-build-10.permabit.com (pbitcolo-build-10.permabit.lab.eng.bos.redhat.com [10.19.117.76]) by smtp.corp.redhat.com (Postfix) with ESMTP id 7B618C1596D; Thu, 26 Oct 2023 21:41:42 +0000 (UTC) Received: by pbitcolo-build-10.permabit.com (Postfix, from userid 1138) id 2D1483003E; Thu, 26 Oct 2023 17:41:42 -0400 (EDT) From: Matthew Sakai To: dm-devel@lists.linux.dev, linux-block@vger.kernel.org Cc: Matthew Sakai Subject: [PATCH v4 31/39] dm vdo: add the recovery journal Date: Thu, 26 Oct 2023 17:41:28 -0400 Message-Id: <20231026214136.1067410-32-msakai@redhat.com> In-Reply-To: <20231026214136.1067410-1-msakai@redhat.com> References: <20231026214136.1067410-1-msakai@redhat.com> MIME-Version: 1.0 X-Scanned-By: MIMEDefang 3.4.1 on 10.11.54.8 Precedence: bulk List-ID: X-Mailing-List: linux-block@vger.kernel.org The recovery journal is used to amortize updates across the block map and slab depot. Each write request causes an entry to be made in the journal. Entries are either "data remappings" or "block map remappings." For a data remapping, the journal records the logical address affected and its old and new physical mappings. For a block map remapping, the journal records the block map page number and the physical block allocated for it (block map pages are never reclaimed, so the old mapping is always 0). Each journal entry and the data write it represents must be stable on disk before the other metadata structures may be updated to reflect the operation. Co-developed-by: J. corwin Coburn Signed-off-by: J. corwin Coburn Co-developed-by: Michael Sclafani Signed-off-by: Michael Sclafani Co-developed-by: Sweet Tea Dorminy Signed-off-by: Sweet Tea Dorminy Signed-off-by: Matthew Sakai --- drivers/md/dm-vdo/recovery-journal.c | 1772 ++++++++++++++++++++++++++ drivers/md/dm-vdo/recovery-journal.h | 313 +++++ 2 files changed, 2085 insertions(+) create mode 100644 drivers/md/dm-vdo/recovery-journal.c create mode 100644 drivers/md/dm-vdo/recovery-journal.h diff --git a/drivers/md/dm-vdo/recovery-journal.c b/drivers/md/dm-vdo/recovery-journal.c new file mode 100644 index 000000000000..6abf43e99a48 --- /dev/null +++ b/drivers/md/dm-vdo/recovery-journal.c @@ -0,0 +1,1772 @@ +// SPDX-License-Identifier: GPL-2.0-only +/* + * Copyright 2023 Red Hat + */ + +#include "recovery-journal.h" + +#include +#include + +#include "logger.h" +#include "memory-alloc.h" +#include "permassert.h" + +#include "block-map.h" +#include "completion.h" +#include "constants.h" +#include "data-vio.h" +#include "encodings.h" +#include "io-submitter.h" +#include "slab-depot.h" +#include "types.h" +#include "vdo.h" +#include "vio.h" +#include "wait-queue.h" + +static const u64 RECOVERY_COUNT_MASK = 0xff; + +enum { + /* + * The number of reserved blocks must be large enough to prevent a new recovery journal + * block write from overwriting a block which appears to still be a valid head block of the + * journal. Currently, that means reserving enough space for all 2048 data_vios. + */ + RECOVERY_JOURNAL_RESERVED_BLOCKS = + (MAXIMUM_VDO_USER_VIOS / RECOVERY_JOURNAL_ENTRIES_PER_BLOCK) + 2, + WRITE_FLAGS = REQ_OP_WRITE | REQ_PRIO | REQ_PREFLUSH | REQ_SYNC | REQ_FUA, +}; + +/** + * DOC: Lock Counters. 
+ * + * A lock_counter is intended to keep all of the locks for the blocks in the recovery journal. The + * per-zone counters are all kept in a single array which is arranged by zone (i.e. zone 0's lock 0 + * is at index 0, zone 0's lock 1 is at index 1, and zone 1's lock 0 is at index 'locks'. This + * arrangement is intended to minimize cache-line contention for counters from different zones. + * + * The locks are implemented as a single object instead of as a lock counter per lock both to + * afford this opportunity to reduce cache line contention and also to eliminate the need to have a + * completion per lock. + * + * Lock sets are laid out with the set for recovery journal first, followed by the logical zones, + * and then the physical zones. + */ + +enum lock_counter_state { + LOCK_COUNTER_STATE_NOT_NOTIFYING, + LOCK_COUNTER_STATE_NOTIFYING, + LOCK_COUNTER_STATE_SUSPENDED, +}; + +/** + * get_zone_count_ptr() - Get a pointer to the zone count for a given lock on a given zone. + * @journal: The recovery journal. + * @lock_number: The lock to get. + * @zone_type: The zone type whose count is desired. + * + * Return: A pointer to the zone count for the given lock and zone. + */ +static inline atomic_t *get_zone_count_ptr(struct recovery_journal *journal, + block_count_t lock_number, + enum vdo_zone_type zone_type) +{ + return ((zone_type == VDO_ZONE_TYPE_LOGICAL) + ? &journal->lock_counter.logical_zone_counts[lock_number] + : &journal->lock_counter.physical_zone_counts[lock_number]); +} + +/** + * get_counter() - Get the zone counter for a given lock on a given zone. + * @journal: The recovery journal. + * @lock_number: The lock to get. + * @zone_type: The zone type whose count is desired. + * @zone_id: The zone index whose count is desired. + * + * Return: The counter for the given lock and zone. + */ +static inline u16 *get_counter(struct recovery_journal *journal, + block_count_t lock_number, + enum vdo_zone_type zone_type, + zone_count_t zone_id) +{ + struct lock_counter *counter = &journal->lock_counter; + block_count_t zone_counter = (counter->locks * zone_id) + lock_number; + + if (zone_type == VDO_ZONE_TYPE_JOURNAL) + return &counter->journal_counters[zone_counter]; + + if (zone_type == VDO_ZONE_TYPE_LOGICAL) + return &counter->logical_counters[zone_counter]; + + return &counter->physical_counters[zone_counter]; +} + +static atomic_t *get_decrement_counter(struct recovery_journal *journal, block_count_t lock_number) +{ + return &journal->lock_counter.journal_decrement_counts[lock_number]; +} + +/** + * is_journal_zone_locked() - Check whether the journal zone is locked for a given lock. + * @journal: The recovery journal. + * @lock_number: The lock to check. + * + * Return: true if the journal zone is locked. + */ +static bool is_journal_zone_locked(struct recovery_journal *journal, block_count_t lock_number) +{ + u16 journal_value = *(get_counter(journal, lock_number, VDO_ZONE_TYPE_JOURNAL, 0)); + u32 decrements = atomic_read(get_decrement_counter(journal, lock_number)); + + /* Pairs with barrier in vdo_release_journal_entry_lock() */ + smp_rmb(); + ASSERT_LOG_ONLY((decrements <= journal_value), + "journal zone lock counter must not underflow"); + return (journal_value != decrements); +} + +/** + * vdo_release_recovery_journal_block_reference() - Release a reference to a recovery journal + * block. + * @journal: The recovery journal. + * @sequence_number: The journal sequence number of the referenced block. + * @zone_type: The type of the zone making the adjustment. 
+ * @zone_id: The ID of the zone making the adjustment. + * + * If this is the last reference for a given zone type, an attempt will be made to reap the + * journal. + */ +void vdo_release_recovery_journal_block_reference(struct recovery_journal *journal, + sequence_number_t sequence_number, + enum vdo_zone_type zone_type, + zone_count_t zone_id) +{ + u16 *current_value; + block_count_t lock_number; + int prior_state; + + if (sequence_number == 0) + return; + + lock_number = vdo_get_recovery_journal_block_number(journal, sequence_number); + current_value = get_counter(journal, lock_number, zone_type, zone_id); + + ASSERT_LOG_ONLY((*current_value >= 1), "decrement of lock counter must not underflow"); + *current_value -= 1; + + if (zone_type == VDO_ZONE_TYPE_JOURNAL) { + if (is_journal_zone_locked(journal, lock_number)) + return; + } else { + atomic_t *zone_count; + + if (*current_value != 0) + return; + + zone_count = get_zone_count_ptr(journal, lock_number, zone_type); + + if (atomic_add_return(-1, zone_count) > 0) + return; + } + + /* + * Extra barriers because this was original developed using a CAS operation that implicitly + * had them. + */ + smp_mb__before_atomic(); + prior_state = atomic_cmpxchg(&journal->lock_counter.state, + LOCK_COUNTER_STATE_NOT_NOTIFYING, + LOCK_COUNTER_STATE_NOTIFYING); + /* same as before_atomic */ + smp_mb__after_atomic(); + + if (prior_state != LOCK_COUNTER_STATE_NOT_NOTIFYING) + return; + + vdo_launch_completion(&journal->lock_counter.completion); +} + +static inline struct recovery_journal_block * __must_check +get_journal_block(struct list_head *list) +{ + return list_first_entry_or_null(list, struct recovery_journal_block, list_node); +} + +/** + * pop_free_list() - Get a block from the end of the free list. + * @journal: The journal. + * + * Return: The block or NULL if the list is empty. + */ +static struct recovery_journal_block * __must_check +pop_free_list(struct recovery_journal *journal) +{ + struct recovery_journal_block *block; + + if (list_empty(&journal->free_tail_blocks)) + return NULL; + + block = list_last_entry(&journal->free_tail_blocks, + struct recovery_journal_block, + list_node); + list_del_init(&block->list_node); + return block; +} + +/** + * is_block_dirty() - Check whether a recovery block is dirty. + * @block: The block to check. + * + * Indicates it has any uncommitted entries, which includes both entries not written and entries + * written but not yet acknowledged. + * + * Return: true if the block has any uncommitted entries. + */ +static inline bool __must_check is_block_dirty(const struct recovery_journal_block *block) +{ + return (block->uncommitted_entry_count > 0); +} + +/** + * is_block_empty() - Check whether a journal block is empty. + * @block: The block to check. + * + * Return: true if the block has no entries. + */ +static inline bool __must_check is_block_empty(const struct recovery_journal_block *block) +{ + return (block->entry_count == 0); +} + +/** + * is_block_full() - Check whether a journal block is full. + * @block: The block to check. + * + * Return: true if the block is full. + */ +static inline bool __must_check is_block_full(const struct recovery_journal_block *block) +{ + return ((block == NULL) || (block->journal->entries_per_block == block->entry_count)); +} + +/** + * assert_on_journal_thread() - Assert that we are running on the journal thread. + * @journal: The journal. + * @function_name: The function doing the check (for logging). 
+ */ +static void assert_on_journal_thread(struct recovery_journal *journal, const char *function_name) +{ + ASSERT_LOG_ONLY((vdo_get_callback_thread_id() == journal->thread_id), + "%s() called on journal thread", function_name); +} + +/** + * continue_waiter() - Release a data_vio from the journal. + * + * Invoked whenever a data_vio is to be released from the journal, either because its entry was + * committed to disk, or because there was an error. Implements waiter_callback. + */ +static void continue_waiter(struct waiter *waiter, void *context) +{ + continue_data_vio_with_error(waiter_as_data_vio(waiter), *((int *) context)); +} + +/** + * has_block_waiters() - Check whether the journal has any waiters on any blocks. + * @journal: The journal in question. + * + * Return: true if any block has a waiter. + */ +static inline bool has_block_waiters(struct recovery_journal *journal) +{ + struct recovery_journal_block *block = get_journal_block(&journal->active_tail_blocks); + + /* + * Either the first active tail block (if it exists) has waiters, or no active tail block + * has waiters. + */ + return ((block != NULL) && + (vdo_has_waiters(&block->entry_waiters) || + vdo_has_waiters(&block->commit_waiters))); +} + +static void recycle_journal_blocks(struct recovery_journal *journal); +static void recycle_journal_block(struct recovery_journal_block *block); +static void notify_commit_waiters(struct recovery_journal *journal); + +/** + * suspend_lock_counter() - Prevent the lock counter from notifying. + * @counter: The counter. + * + * Return: true if the lock counter was not notifying and hence the suspend was efficacious. + */ +static bool suspend_lock_counter(struct lock_counter *counter) +{ + int prior_state; + + /* + * Extra barriers because this was originally developed using a CAS operation that + * implicitly had them. + */ + smp_mb__before_atomic(); + prior_state = atomic_cmpxchg(&counter->state, + LOCK_COUNTER_STATE_NOT_NOTIFYING, + LOCK_COUNTER_STATE_SUSPENDED); + /* same as before_atomic */ + smp_mb__after_atomic(); + + return ((prior_state == LOCK_COUNTER_STATE_SUSPENDED) || + (prior_state == LOCK_COUNTER_STATE_NOT_NOTIFYING)); +} + +static inline bool is_read_only(struct recovery_journal *journal) +{ + return vdo_is_read_only(journal->flush_vio->completion.vdo); +} + +/** + * check_for_drain_complete() - Check whether the journal has drained. + * @journal: The journal which may have just drained. + */ +static void check_for_drain_complete(struct recovery_journal *journal) +{ + int result = VDO_SUCCESS; + + if (is_read_only(journal)) { + result = VDO_READ_ONLY; + /* + * Clean up any full active blocks which were not written due to read-only mode. + * + * FIXME: This would probably be better as a short-circuit in write_block(). + */ + notify_commit_waiters(journal); + recycle_journal_blocks(journal); + + /* Release any data_vios waiting to be assigned entries. 
*/ + vdo_notify_all_waiters(&journal->entry_waiters, continue_waiter, &result); + } + + if (!vdo_is_state_draining(&journal->state) || + journal->reaping || + has_block_waiters(journal) || + vdo_has_waiters(&journal->entry_waiters) || + !suspend_lock_counter(&journal->lock_counter)) + return; + + if (vdo_is_state_saving(&journal->state)) { + if (journal->active_block != NULL) { + ASSERT_LOG_ONLY(((result == VDO_READ_ONLY) || + !is_block_dirty(journal->active_block)), + "journal being saved has clean active block"); + recycle_journal_block(journal->active_block); + } + + ASSERT_LOG_ONLY(list_empty(&journal->active_tail_blocks), + "all blocks in a journal being saved must be inactive"); + } + + vdo_finish_draining_with_result(&journal->state, result); +} + +/** + * notify_recovery_journal_of_read_only_mode() - Notify a recovery journal that the VDO has gone + * read-only. + * @listener: The journal. + * @parent: The completion to notify in order to acknowledge the notification. + * + * Implements vdo_read_only_notification. + */ +static void +notify_recovery_journal_of_read_only_mode(void *listener, struct vdo_completion *parent) +{ + check_for_drain_complete(listener); + vdo_finish_completion(parent); +} + +/** + * enter_journal_read_only_mode() - Put the journal in read-only mode. + * @journal: The journal which has failed. + * @error_code: The error result triggering this call. + * + * All attempts to add entries after this function is called will fail. All VIOs waiting for + * commits will be awakened with an error. + */ +static void enter_journal_read_only_mode(struct recovery_journal *journal, int error_code) +{ + vdo_enter_read_only_mode(journal->flush_vio->completion.vdo, error_code); + check_for_drain_complete(journal); +} + +/** + * vdo_get_recovery_journal_current_sequence_number() - Obtain the recovery journal's current + * sequence number. + * @journal: The journal in question. + * + * Exposed only so the block map can be initialized therefrom. + * + * Return: The sequence number of the tail block. + */ +sequence_number_t +vdo_get_recovery_journal_current_sequence_number(struct recovery_journal *journal) +{ + return journal->tail; +} + +/** + * get_recovery_journal_head() - Get the head of the recovery journal. + * @journal: The journal. + * + * The head is the lowest sequence number of the block map head and the slab journal head. + * + * Return: the head of the journal. + */ +static inline sequence_number_t get_recovery_journal_head(const struct recovery_journal *journal) +{ + return min(journal->block_map_head, journal->slab_journal_head); +} + +/** + * compute_recovery_count_byte() - Compute the recovery count byte for a given recovery count. + * @recovery_count: The recovery count. + * + * Return: The byte corresponding to the recovery count. + */ +static inline u8 __must_check compute_recovery_count_byte(u64 recovery_count) +{ + return (u8)(recovery_count & RECOVERY_COUNT_MASK); +} + +/** + * check_slab_journal_commit_threshold() - Check whether the journal is over the threshold, and if + * so, force the oldest slab journal tail block to commit. + * @journal: The journal. 
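+ *
+ * If the distance from the slab journal head to the journal tail exceeds the configured
+ * commit threshold, the slab depot is asked to commit its oldest slab journal tail blocks
+ * so that the recovery journal blocks they reference can eventually be reaped.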
+ */ +static void check_slab_journal_commit_threshold(struct recovery_journal *journal) +{ + block_count_t current_length = journal->tail - journal->slab_journal_head; + + if (current_length > journal->slab_journal_commit_threshold) { + journal->events.slab_journal_commits_requested++; + vdo_commit_oldest_slab_journal_tail_blocks(journal->depot, + journal->slab_journal_head); + } +} + +static void reap_recovery_journal(struct recovery_journal *journal); +static void assign_entries(struct recovery_journal *journal); + +/** + * finish_reaping() - Finish reaping the journal. + * @journal: The journal being reaped. + */ +static void finish_reaping(struct recovery_journal *journal) +{ + block_count_t blocks_reaped; + sequence_number_t old_head = get_recovery_journal_head(journal); + + journal->block_map_head = journal->block_map_reap_head; + journal->slab_journal_head = journal->slab_journal_reap_head; + blocks_reaped = get_recovery_journal_head(journal) - old_head; + journal->available_space += blocks_reaped * journal->entries_per_block; + journal->reaping = false; + check_slab_journal_commit_threshold(journal); + assign_entries(journal); + check_for_drain_complete(journal); +} + +/** + * complete_reaping() - Finish reaping the journal after flushing the lower layer. + * @completion: The journal's flush VIO. + * + * This is the callback registered in reap_recovery_journal(). + */ +static void complete_reaping(struct vdo_completion *completion) +{ + struct recovery_journal *journal = completion->parent; + + finish_reaping(journal); + + /* Try reaping again in case more locks were released while flush was out. */ + reap_recovery_journal(journal); +} + +/** + * handle_flush_error() - Handle an error when flushing the lower layer due to reaping. + * @completion: The journal's flush VIO. + */ +static void handle_flush_error(struct vdo_completion *completion) +{ + struct recovery_journal *journal = completion->parent; + + vio_record_metadata_io_error(as_vio(completion)); + journal->reaping = false; + enter_journal_read_only_mode(journal, completion->result); +} + +static void flush_endio(struct bio *bio) +{ + struct vio *vio = bio->bi_private; + struct recovery_journal *journal = vio->completion.parent; + + continue_vio_after_io(vio, complete_reaping, journal->thread_id); +} + +/** + * initialize_journal_state() - Set all journal fields appropriately to start journaling from the + * current active block. + * @journal: The journal to be reset based on its active block. + */ +static void initialize_journal_state(struct recovery_journal *journal) +{ + journal->append_point.sequence_number = journal->tail; + journal->last_write_acknowledged = journal->tail; + journal->block_map_head = journal->tail; + journal->slab_journal_head = journal->tail; + journal->block_map_reap_head = journal->tail; + journal->slab_journal_reap_head = journal->tail; + journal->block_map_head_block_number = + vdo_get_recovery_journal_block_number(journal, journal->block_map_head); + journal->slab_journal_head_block_number = + vdo_get_recovery_journal_block_number(journal, journal->slab_journal_head); + journal->available_space = + (journal->entries_per_block * vdo_get_recovery_journal_length(journal->size)); +} + +/** + * vdo_get_recovery_journal_length() - Get the number of usable recovery journal blocks. + * @journal_size: The size of the recovery journal in blocks. + * + * Return: the number of recovery journal blocks usable for entries. 
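+ *
+ * For example (size is hypothetical): a 16-block journal sets aside one
+ * quarter of its blocks (4, capped at RECOVERY_JOURNAL_RESERVED_BLOCKS),
+ * and the remainder is usable for entries.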
+ */ +block_count_t vdo_get_recovery_journal_length(block_count_t journal_size) +{ + block_count_t reserved_blocks = journal_size / 4; + + if (reserved_blocks > RECOVERY_JOURNAL_RESERVED_BLOCKS) + reserved_blocks = RECOVERY_JOURNAL_RESERVED_BLOCKS; + return (journal_size - reserved_blocks); +} +
+/** + * reap_recovery_journal_callback() - Attempt to reap the journal. + * @completion: The lock counter completion. + * + * Attempts to reap the journal now that all the locks on some journal block have been released. + * This is the callback registered with the lock counter. + */ +static void reap_recovery_journal_callback(struct vdo_completion *completion) +{ + struct recovery_journal *journal = (struct recovery_journal *) completion->parent; + /* + * The acknowledgment must be done before reaping so that there is no race between + * acknowledging the notification and unlocks wishing to notify. + */ + smp_wmb(); + atomic_set(&journal->lock_counter.state, LOCK_COUNTER_STATE_NOT_NOTIFYING); + + if (vdo_is_state_quiescing(&journal->state)) { + /* + * Don't start reaping when the journal is trying to quiesce. Do check if this + * notification is the last thing the journal is waiting on. + */ + check_for_drain_complete(journal); + return; + } + + reap_recovery_journal(journal); + check_slab_journal_commit_threshold(journal); +} +
+/** + * initialize_lock_counter() - Initialize a lock counter. + * + * @journal: The recovery journal. + * @vdo: The vdo. + * + * Return: VDO_SUCCESS or an error. + */ +static int __must_check initialize_lock_counter(struct recovery_journal *journal, struct vdo *vdo) +{ + int result; + struct thread_config *config = &vdo->thread_config; + struct lock_counter *counter = &journal->lock_counter; + + result = UDS_ALLOCATE(journal->size, u16, __func__, &counter->journal_counters); + if (result != VDO_SUCCESS) + return result; + + result = UDS_ALLOCATE(journal->size, + atomic_t, + __func__, + &counter->journal_decrement_counts); + if (result != VDO_SUCCESS) + return result; + + result = UDS_ALLOCATE(journal->size * config->logical_zone_count, + u16, + __func__, + &counter->logical_counters); + if (result != VDO_SUCCESS) + return result; + + result = UDS_ALLOCATE(journal->size, atomic_t, __func__, &counter->logical_zone_counts); + if (result != VDO_SUCCESS) + return result; + + result = UDS_ALLOCATE(journal->size * config->physical_zone_count, + u16, + __func__, + &counter->physical_counters); + if (result != VDO_SUCCESS) + return result; + + result = UDS_ALLOCATE(journal->size, atomic_t, __func__, &counter->physical_zone_counts); + if (result != VDO_SUCCESS) + return result; + + vdo_initialize_completion(&counter->completion, vdo, VDO_LOCK_COUNTER_COMPLETION); + vdo_prepare_completion(&counter->completion, + reap_recovery_journal_callback, + reap_recovery_journal_callback, + config->journal_thread, + journal); + counter->logical_zones = config->logical_zone_count; + counter->physical_zones = config->physical_zone_count; + counter->locks = journal->size; + return VDO_SUCCESS; +} +
+/** + * set_journal_tail() - Set the journal's tail sequence number. + * @journal: The journal whose tail is to be set. + * @tail: The new tail value. + */ +static void set_journal_tail(struct recovery_journal *journal, sequence_number_t tail) +{ + /* VDO does not support sequence numbers above 1 << 48 in the slab journal.
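+ * A tail at or above that limit forces the journal into read-only mode with
+ * VDO_JOURNAL_OVERFLOW below.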
*/ + if (tail >= (1ULL << 48)) + enter_journal_read_only_mode(journal, VDO_JOURNAL_OVERFLOW); + + journal->tail = tail; +} + +/** + * initialize_recovery_block() - Initialize a journal block. + * @vdo: The vdo from which to construct vios. + * @journal: The journal to which the block will belong. + * @block: The block to initialize. + * + * Return: VDO_SUCCESS or an error. + */ +static int initialize_recovery_block(struct vdo *vdo, + struct recovery_journal *journal, + struct recovery_journal_block *block) +{ + char *data; + int result; + + /* + * Ensure that a block is large enough to store RECOVERY_JOURNAL_ENTRIES_PER_BLOCK entries. + */ + STATIC_ASSERT(RECOVERY_JOURNAL_ENTRIES_PER_BLOCK + <= ((VDO_BLOCK_SIZE - sizeof(struct packed_journal_header)) / + sizeof(struct packed_recovery_journal_entry))); + + /* + * Allocate a full block for the journal block even though not all of the space is used + * since the VIO needs to write a full disk block. + */ + result = UDS_ALLOCATE(VDO_BLOCK_SIZE, char, __func__, &data); + if (result != VDO_SUCCESS) + return result; + + result = allocate_vio_components(vdo, + VIO_TYPE_RECOVERY_JOURNAL, + VIO_PRIORITY_HIGH, + block, + 1, + data, + &block->vio); + if (result != VDO_SUCCESS) { + UDS_FREE(data); + return result; + } + + list_add_tail(&block->list_node, &journal->free_tail_blocks); + block->journal = journal; + return VDO_SUCCESS; +} + +/** + * vdo_decode_recovery_journal() - Make a recovery journal and initialize it with the state that + * was decoded from the super block. + * + * @state: The decoded state of the journal. + * @nonce: The nonce of the VDO. + * @vdo: The VDO. + * @partition: The partition for the journal. + * @recovery_count: The VDO's number of completed recoveries. + * @journal_size: The number of blocks in the journal on disk. + * @journal_ptr: The pointer to hold the new recovery journal. + * + * Return: A success or error code. 
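+ *
+ * A minimal usage sketch (caller-side variable names are hypothetical,
+ * error handling elided):
+ *
+ *   struct recovery_journal *journal;
+ *   result = vdo_decode_recovery_journal(state, nonce, vdo, partition,
+ *                                        recovery_count, journal_size,
+ *                                        &journal);
+ *   ...
+ *   vdo_free_recovery_journal(journal);
+ *
+ * On any failure this function frees the partially constructed journal and
+ * returns the error, so the caller owns *journal_ptr only on VDO_SUCCESS.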
+ */ +int vdo_decode_recovery_journal(struct recovery_journal_state_7_0 state, + nonce_t nonce, + struct vdo *vdo, + struct partition *partition, + u64 recovery_count, + block_count_t journal_size, + struct recovery_journal **journal_ptr) +{ + block_count_t i; + struct recovery_journal *journal; + int result; + + result = UDS_ALLOCATE_EXTENDED(struct recovery_journal, + RECOVERY_JOURNAL_RESERVED_BLOCKS, + struct recovery_journal_block, + __func__, + &journal); + if (result != VDO_SUCCESS) + return result; + + INIT_LIST_HEAD(&journal->free_tail_blocks); + INIT_LIST_HEAD(&journal->active_tail_blocks); + vdo_initialize_wait_queue(&journal->pending_writes); + + journal->thread_id = vdo->thread_config.journal_thread; + journal->origin = partition->offset; + journal->nonce = nonce; + journal->recovery_count = compute_recovery_count_byte(recovery_count); + journal->size = journal_size; + journal->slab_journal_commit_threshold = (journal_size * 2) / 3; + journal->logical_blocks_used = state.logical_blocks_used; + journal->block_map_data_blocks = state.block_map_data_blocks; + journal->entries_per_block = RECOVERY_JOURNAL_ENTRIES_PER_BLOCK; + set_journal_tail(journal, state.journal_start); + initialize_journal_state(journal); + /* TODO: this will have to change if we make initial resume of a VDO a real resume */ + vdo_set_admin_state_code(&journal->state, VDO_ADMIN_STATE_SUSPENDED); + + for (i = 0; i < RECOVERY_JOURNAL_RESERVED_BLOCKS; i++) { + struct recovery_journal_block *block = &journal->blocks[i]; + + result = initialize_recovery_block(vdo, journal, block); + if (result != VDO_SUCCESS) { + vdo_free_recovery_journal(journal); + return result; + } + } + + result = initialize_lock_counter(journal, vdo); + if (result != VDO_SUCCESS) { + vdo_free_recovery_journal(journal); + return result; + } + + result = create_metadata_vio(vdo, + VIO_TYPE_RECOVERY_JOURNAL, + VIO_PRIORITY_HIGH, + journal, + NULL, + &journal->flush_vio); + if (result != VDO_SUCCESS) { + vdo_free_recovery_journal(journal); + return result; + } + + result = vdo_register_read_only_listener(vdo, + journal, + notify_recovery_journal_of_read_only_mode, + journal->thread_id); + if (result != VDO_SUCCESS) { + vdo_free_recovery_journal(journal); + return result; + } + + result = vdo_make_default_thread(vdo, journal->thread_id); + if (result != VDO_SUCCESS) { + vdo_free_recovery_journal(journal); + return result; + } + + journal->flush_vio->completion.callback_thread_id = journal->thread_id; + *journal_ptr = journal; + return VDO_SUCCESS; +} + +/** + * vdo_free_recovery_journal() - Free a recovery journal. + * @journal: The recovery journal to free. + */ +void vdo_free_recovery_journal(struct recovery_journal *journal) +{ + block_count_t i; + + if (journal == NULL) + return; + + UDS_FREE(UDS_FORGET(journal->lock_counter.logical_zone_counts)); + UDS_FREE(UDS_FORGET(journal->lock_counter.physical_zone_counts)); + UDS_FREE(UDS_FORGET(journal->lock_counter.journal_counters)); + UDS_FREE(UDS_FORGET(journal->lock_counter.journal_decrement_counts)); + UDS_FREE(UDS_FORGET(journal->lock_counter.logical_counters)); + UDS_FREE(UDS_FORGET(journal->lock_counter.physical_counters)); + free_vio(UDS_FORGET(journal->flush_vio)); + + /* + * FIXME: eventually, the journal should be constructed in a quiescent state which + * requires opening before use. 
+ */ + if (!vdo_is_state_quiescent(&journal->state)) + ASSERT_LOG_ONLY(list_empty(&journal->active_tail_blocks), + "journal being freed has no active tail blocks"); + else if (!vdo_is_state_saved(&journal->state) && !list_empty(&journal->active_tail_blocks)) + uds_log_warning("journal being freed has uncommitted entries"); + + for (i = 0; i < RECOVERY_JOURNAL_RESERVED_BLOCKS; i++) { + struct recovery_journal_block *block = &journal->blocks[i]; + + UDS_FREE(UDS_FORGET(block->vio.data)); + free_vio_components(&block->vio); + } + + UDS_FREE(journal); +} + +/** + * vdo_initialize_recovery_journal_post_repair() - Initialize the journal after a repair. + * @journal: The journal in question. + * @recovery_count: The number of completed recoveries. + * @tail: The new tail block sequence number. + * @logical_blocks_used: The new number of logical blocks used. + * @block_map_data_blocks: The new number of block map data blocks. + */ +void vdo_initialize_recovery_journal_post_repair(struct recovery_journal *journal, + u64 recovery_count, + sequence_number_t tail, + block_count_t logical_blocks_used, + block_count_t block_map_data_blocks) +{ + set_journal_tail(journal, tail + 1); + journal->recovery_count = compute_recovery_count_byte(recovery_count); + initialize_journal_state(journal); + journal->logical_blocks_used = logical_blocks_used; + journal->block_map_data_blocks = block_map_data_blocks; +} + +/** + * vdo_get_journal_block_map_data_blocks_used() - Get the number of block map pages, allocated from + * data blocks, currently in use. + * @journal: The journal in question. + * + * Return: The number of block map pages allocated from slabs. + */ +block_count_t vdo_get_journal_block_map_data_blocks_used(struct recovery_journal *journal) +{ + return journal->block_map_data_blocks; +} + +/** + * vdo_get_recovery_journal_thread_id() - Get the ID of a recovery journal's thread. + * @journal: The journal to query. + * + * Return: The ID of the journal's thread. + */ +thread_id_t vdo_get_recovery_journal_thread_id(struct recovery_journal *journal) +{ + return journal->thread_id; +} + +/** + * vdo_open_recovery_journal() - Prepare the journal for new entries. + * @journal: The journal in question. + * @depot: The slab depot for this VDO. + * @block_map: The block map for this VDO. + */ +void vdo_open_recovery_journal(struct recovery_journal *journal, + struct slab_depot *depot, + struct block_map *block_map) +{ + journal->depot = depot; + journal->block_map = block_map; + WRITE_ONCE(journal->state.current_state, + VDO_ADMIN_STATE_NORMAL_OPERATION); +} + +/** + * vdo_record_recovery_journal() - Record the state of a recovery journal for encoding in the super + * block. + * @journal: the recovery journal. + * + * Return: the state of the journal. + */ +struct recovery_journal_state_7_0 +vdo_record_recovery_journal(const struct recovery_journal *journal) +{ + struct recovery_journal_state_7_0 state = { + .logical_blocks_used = journal->logical_blocks_used, + .block_map_data_blocks = journal->block_map_data_blocks, + }; + + if (vdo_is_state_saved(&journal->state)) + /* + * If the journal is saved, we should start one past the active block (since the + * active block is not guaranteed to be empty). + */ + state.journal_start = journal->tail; + else + /* + * When we're merely suspended or have gone read-only, we must record the first + * block that might have entries that need to be applied. 
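+ * That first block is the journal head, i.e. the lower of the block map
+ * head and the slab journal head.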
+ */ + state.journal_start = get_recovery_journal_head(journal); + + return state; +} + +/** + * get_block_header() - Get a pointer to the packed journal block header in the block buffer. + * @block: The recovery block. + * + * Return: The block's header. + */ +static inline struct packed_journal_header * +get_block_header(const struct recovery_journal_block *block) +{ + return (struct packed_journal_header *) block->vio.data; +} + +/** + * set_active_sector() - Set the current sector of the current block and initialize it. + * @block: The block to update. + * @sector: A pointer to the first byte of the new sector. + */ +static void set_active_sector(struct recovery_journal_block *block, void *sector) +{ + block->sector = (struct packed_journal_sector *) sector; + block->sector->check_byte = get_block_header(block)->check_byte; + block->sector->recovery_count = block->journal->recovery_count; + block->sector->entry_count = 0; +} + +/** + * advance_tail() - Advance the tail of the journal. + * @journal: The journal whose tail should be advanced. + * + * Return: true if the tail was advanced. + */ +static bool advance_tail(struct recovery_journal *journal) +{ + struct recovery_block_header unpacked; + struct packed_journal_header *header; + struct recovery_journal_block *block; + + block = journal->active_block = pop_free_list(journal); + if (block == NULL) + return false; + + list_move_tail(&block->list_node, &journal->active_tail_blocks); + + unpacked = (struct recovery_block_header) { + .metadata_type = VDO_METADATA_RECOVERY_JOURNAL_2, + .block_map_data_blocks = journal->block_map_data_blocks, + .logical_blocks_used = journal->logical_blocks_used, + .nonce = journal->nonce, + .recovery_count = journal->recovery_count, + .sequence_number = journal->tail, + .check_byte = vdo_compute_recovery_journal_check_byte(journal, journal->tail), + }; + + header = get_block_header(block); + memset(block->vio.data, 0x0, VDO_BLOCK_SIZE); + block->sequence_number = journal->tail; + block->entry_count = 0; + block->uncommitted_entry_count = 0; + block->block_number = vdo_get_recovery_journal_block_number(journal, journal->tail); + + vdo_pack_recovery_block_header(&unpacked, header); + set_active_sector(block, vdo_get_journal_block_sector(header, 1)); + set_journal_tail(journal, journal->tail + 1); + vdo_advance_block_map_era(journal->block_map, journal->tail); + return true; +} + +/** + * initialize_lock_count() - Initialize the value of the journal zone's counter for a given lock. + * @journal: The recovery journal. + * + * Context: This must be called from the journal zone. + */ +static void initialize_lock_count(struct recovery_journal *journal) +{ + u16 *journal_value; + block_count_t lock_number = journal->active_block->block_number; + atomic_t *decrement_counter = get_decrement_counter(journal, lock_number); + + journal_value = get_counter(journal, lock_number, VDO_ZONE_TYPE_JOURNAL, 0); + ASSERT_LOG_ONLY((*journal_value == atomic_read(decrement_counter)), + "count to be initialized not in use"); + *journal_value = journal->entries_per_block + 1; + atomic_set(decrement_counter, 0); +} + +/** + * prepare_to_assign_entry() - Prepare the currently active block to receive an entry and check + * whether an entry of the given type may be assigned at this time. + * @journal: The journal receiving an entry. + * + * Return: true if there is space in the journal to store an entry of the specified type. 
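+ *
+ * The checks below cover the three ways the journal can refuse an entry:
+ * it has no available_space left, its active block is full and no free
+ * tail block can be activated, or the on-disk journal itself is full (the
+ * tail is more than 'size' blocks past the head).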
+ */ +static bool prepare_to_assign_entry(struct recovery_journal *journal) +{ + if (journal->available_space == 0) + return false; + + if (is_block_full(journal->active_block) && !advance_tail(journal)) + return false; + + if (!is_block_empty(journal->active_block)) + return true; + + if ((journal->tail - get_recovery_journal_head(journal)) > journal->size) { + /* Cannot use this block since the journal is full. */ + journal->events.disk_full++; + return false; + } + + /* + * Don't allow the new block to be reaped until all of its entries have been committed to + * the block map and until the journal block has been fully committed as well. Because the + * block map update is done only after any slab journal entries have been made, the + * per-entry lock for the block map entry serves to protect those as well. + */ + initialize_lock_count(journal); + return true; +} + +static void write_blocks(struct recovery_journal *journal); + +/** + * schedule_block_write() - Queue a block for writing. + * @journal: The journal in question. + * @block: The block which is now ready to write. + * + * The block is expected to be full. If the block is currently writing, this is a noop as the block + * will be queued for writing when the write finishes. The block must not currently be queued for + * writing. + */ +static void +schedule_block_write(struct recovery_journal *journal, struct recovery_journal_block *block) +{ + if (!block->committing) + vdo_enqueue_waiter(&journal->pending_writes, &block->write_waiter); + /* + * At the end of adding entries, or discovering this partial block is now full and ready to + * rewrite, we will call write_blocks() and write a whole batch. + */ +} + +/** + * release_journal_block_reference() - Release a reference to a journal block. + * @block: The journal block from which to release a reference. + */ +static void release_journal_block_reference(struct recovery_journal_block *block) +{ + vdo_release_recovery_journal_block_reference(block->journal, + block->sequence_number, + VDO_ZONE_TYPE_JOURNAL, + 0); +} + +static void update_usages(struct recovery_journal *journal, struct data_vio *data_vio) +{ + if (data_vio->increment_updater.operation == VDO_JOURNAL_BLOCK_MAP_REMAPPING) { + journal->block_map_data_blocks++; + return; + } + + if (data_vio->new_mapped.state != VDO_MAPPING_STATE_UNMAPPED) + journal->logical_blocks_used++; + + if (data_vio->mapped.state != VDO_MAPPING_STATE_UNMAPPED) + journal->logical_blocks_used--; +} + +/** + * assign_entry() - Assign an entry waiter to the active block. + * + * Implements waiter_callback. + */ +static void assign_entry(struct waiter *waiter, void *context) +{ + struct data_vio *data_vio = waiter_as_data_vio(waiter); + struct recovery_journal_block *block = (struct recovery_journal_block *) context; + struct recovery_journal *journal = block->journal; + + /* Record the point at which we will make the journal entry. */ + data_vio->recovery_journal_point = (struct journal_point) { + .sequence_number = block->sequence_number, + .entry_count = block->entry_count, + }; + + update_usages(journal, data_vio); + journal->available_space--; + + if (!vdo_has_waiters(&block->entry_waiters)) + journal->events.blocks.started++; + + vdo_enqueue_waiter(&block->entry_waiters, &data_vio->waiter); + block->entry_count++; + block->uncommitted_entry_count++; + journal->events.entries.started++; + + if (is_block_full(block)) + /* + * The block is full, so we can write it anytime henceforth. 
If it is already + * committing, we'll queue it for writing when it comes back. + */ + schedule_block_write(journal, block); + + /* Force out slab journal tail blocks when threshold is reached. */ + check_slab_journal_commit_threshold(journal); +} + +static void assign_entries(struct recovery_journal *journal) +{ + if (journal->adding_entries) + /* Protect against re-entrancy. */ + return; + + journal->adding_entries = true; + while (vdo_has_waiters(&journal->entry_waiters) && prepare_to_assign_entry(journal)) + vdo_notify_next_waiter(&journal->entry_waiters, + assign_entry, + journal->active_block); + + /* Now that we've finished with entries, see if we have a batch of blocks to write. */ + write_blocks(journal); + journal->adding_entries = false; +} + +/** + * recycle_journal_block() - Prepare an in-memory journal block to be reused now that it has been + * fully committed. + * @block: The block to be recycled. + */ +static void recycle_journal_block(struct recovery_journal_block *block) +{ + struct recovery_journal *journal = block->journal; + block_count_t i; + + list_move_tail(&block->list_node, &journal->free_tail_blocks); + + /* Release any unused entry locks. */ + for (i = block->entry_count; i < journal->entries_per_block; i++) + release_journal_block_reference(block); + + /* + * Release our own lock against reaping now that the block is completely committed, or + * we're giving up because we're in read-only mode. + */ + if (block->entry_count > 0) + release_journal_block_reference(block); + + if (block == journal->active_block) + journal->active_block = NULL; +} + +/** + * continue_committed_waiter() - invoked whenever a VIO is to be released from the journal because + * its entry was committed to disk. + * + * Implements waiter_callback. + */ +static void continue_committed_waiter(struct waiter *waiter, void *context) +{ + struct data_vio *data_vio = waiter_as_data_vio(waiter); + struct recovery_journal *journal = (struct recovery_journal *)context; + int result = (is_read_only(journal) ? VDO_READ_ONLY : VDO_SUCCESS); + bool has_decrement; + + ASSERT_LOG_ONLY(vdo_before_journal_point(&journal->commit_point, + &data_vio->recovery_journal_point), + "DataVIOs released from recovery journal in order. Recovery journal point is (%llu, %u), but commit waiter point is (%llu, %u)", + (unsigned long long) journal->commit_point.sequence_number, + journal->commit_point.entry_count, + (unsigned long long) data_vio->recovery_journal_point.sequence_number, + data_vio->recovery_journal_point.entry_count); + + journal->commit_point = data_vio->recovery_journal_point; + data_vio->last_async_operation = VIO_ASYNC_OP_UPDATE_REFERENCE_COUNTS; + if (result != VDO_SUCCESS) { + continue_data_vio_with_error(data_vio, result); + return; + } + + /* + * The increment must be launched first since it must come before the + * decrement if they are in the same slab. + */ + has_decrement = (data_vio->decrement_updater.zpbn.pbn != VDO_ZERO_BLOCK); + if ((data_vio->increment_updater.zpbn.pbn != VDO_ZERO_BLOCK) || !has_decrement) + continue_data_vio(data_vio); + + if (has_decrement) + vdo_launch_completion(&data_vio->decrement_completion); +} + +/** + * notify_commit_waiters() - Notify any VIOs whose entries have now committed. + * @journal: The recovery journal to update. 
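+ *
+ * Blocks are scanned in commit order; the scan stops at the first block
+ * still committing and, outside of read-only mode, at the first block which
+ * is dirty or not yet full, so that waiters are always released in journal
+ * order.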
+ */ +static void notify_commit_waiters(struct recovery_journal *journal) +{ + struct recovery_journal_block *block; + + list_for_each_entry(block, &journal->active_tail_blocks, list_node) { + if (block->committing) + return; + + vdo_notify_all_waiters(&block->commit_waiters, continue_committed_waiter, journal); + if (is_read_only(journal)) + vdo_notify_all_waiters(&block->entry_waiters, + continue_committed_waiter, + journal); + else if (is_block_dirty(block) || !is_block_full(block)) + /* Stop at partially-committed or partially-filled blocks. */ + return; + } +} + +/** + * recycle_journal_blocks() - Recycle any journal blocks which have been fully committed. + * @journal: The recovery journal to update. + */ +static void recycle_journal_blocks(struct recovery_journal *journal) +{ + struct recovery_journal_block *block, *tmp; + + list_for_each_entry_safe(block, tmp, &journal->active_tail_blocks, list_node) { + if (block->committing) + /* Don't recycle committing blocks. */ + return; + + if (!is_read_only(journal) && (is_block_dirty(block) || !is_block_full(block))) + /* + * Don't recycle partially written or partially full blocks, except in + * read-only mode. + */ + return; + + recycle_journal_block(block); + } +} + +/** + * complete_write() - Handle post-commit processing. + * @completion: The completion of the VIO writing this block. + * + * This is the callback registered by write_block(). If more entries accumulated in the block being + * committed while the commit was in progress, another commit will be initiated. + */ +static void complete_write(struct vdo_completion *completion) +{ + struct recovery_journal_block *block = completion->parent; + struct recovery_journal *journal = block->journal; + struct recovery_journal_block *last_active_block; + + assert_on_journal_thread(journal, __func__); + + journal->pending_write_count -= 1; + journal->events.blocks.committed += 1; + journal->events.entries.committed += block->entries_in_commit; + block->uncommitted_entry_count -= block->entries_in_commit; + block->entries_in_commit = 0; + block->committing = false; + + /* If this block is the latest block to be acknowledged, record that fact. */ + if (block->sequence_number > journal->last_write_acknowledged) + journal->last_write_acknowledged = block->sequence_number; + + last_active_block = get_journal_block(&journal->active_tail_blocks); + ASSERT_LOG_ONLY((block->sequence_number >= last_active_block->sequence_number), + "completed journal write is still active"); + + notify_commit_waiters(journal); + + /* + * Is this block now full? Reaping, and adding entries, might have already sent it off for + * rewriting; else, queue it for rewrite. 
+ */ + if (is_block_dirty(block) && is_block_full(block)) + schedule_block_write(journal, block); + + recycle_journal_blocks(journal); + write_blocks(journal); + + check_for_drain_complete(journal); +} + +static void handle_write_error(struct vdo_completion *completion) +{ + struct recovery_journal_block *block = completion->parent; + struct recovery_journal *journal = block->journal; + + vio_record_metadata_io_error(as_vio(completion)); + uds_log_error_strerror(completion->result, + "cannot write recovery journal block %llu", + (unsigned long long) block->sequence_number); + enter_journal_read_only_mode(journal, completion->result); + complete_write(completion); +} + +static void complete_write_endio(struct bio *bio) +{ + struct vio *vio = bio->bi_private; + struct recovery_journal_block *block = vio->completion.parent; + struct recovery_journal *journal = block->journal; + + continue_vio_after_io(vio, complete_write, journal->thread_id); +} + +/** + * add_queued_recovery_entries() - Actually add entries from the queue to the given block. + * @block: The journal block. + */ +static void add_queued_recovery_entries(struct recovery_journal_block *block) +{ + while (vdo_has_waiters(&block->entry_waiters)) { + struct data_vio *data_vio = + waiter_as_data_vio(vdo_dequeue_next_waiter(&block->entry_waiters)); + struct tree_lock *lock = &data_vio->tree_lock; + struct packed_recovery_journal_entry *packed_entry; + struct recovery_journal_entry new_entry; + + if (block->sector->entry_count == RECOVERY_JOURNAL_ENTRIES_PER_SECTOR) + set_active_sector(block, (char *) block->sector + VDO_SECTOR_SIZE); + + /* Compose and encode the entry. */ + packed_entry = &block->sector->entries[block->sector->entry_count++]; + new_entry = (struct recovery_journal_entry) { + .mapping = { + .pbn = data_vio->increment_updater.zpbn.pbn, + .state = data_vio->increment_updater.zpbn.state, + }, + .unmapping = { + .pbn = data_vio->decrement_updater.zpbn.pbn, + .state = data_vio->decrement_updater.zpbn.state, + }, + .operation = data_vio->increment_updater.operation, + .slot = lock->tree_slots[lock->height].block_map_slot, + }; + *packed_entry = vdo_pack_recovery_journal_entry(&new_entry); + data_vio->recovery_sequence_number = block->sequence_number; + + /* Enqueue the data_vio to wait for its entry to commit. */ + vdo_enqueue_waiter(&block->commit_waiters, &data_vio->waiter); + } +} + +/** + * write_block() - Issue a block for writing. + * + * Implements waiter_callback. + */ +static void write_block(struct waiter *waiter, void *context __always_unused) +{ + struct recovery_journal_block *block = + container_of(waiter, struct recovery_journal_block, write_waiter); + struct recovery_journal *journal = block->journal; + struct packed_journal_header *header = get_block_header(block); + + if (block->committing || !vdo_has_waiters(&block->entry_waiters) || is_read_only(journal)) + return; + + block->entries_in_commit = vdo_count_waiters(&block->entry_waiters); + add_queued_recovery_entries(block); + + journal->pending_write_count += 1; + journal->events.blocks.written += 1; + journal->events.entries.written += block->entries_in_commit; + + header->block_map_head = __cpu_to_le64(journal->block_map_head); + header->slab_journal_head = __cpu_to_le64(journal->slab_journal_head); + header->entry_count = __cpu_to_le16(block->entry_count); + + block->committing = true; + + /* + * We must issue a flush and a FUA for every commit. The flush is necessary to ensure that + * the data being referenced is stable. 
The FUA is necessary to ensure that the journal + * block itself is stable before allowing overwrites of the lbn's previous data. + */ + submit_metadata_vio(&block->vio, + journal->origin + block->block_number, + complete_write_endio, + handle_write_error, + WRITE_FLAGS); +} + + +/** + * write_blocks() - Attempt to commit blocks, according to write policy. + * @journal: The recovery journal. + */ +static void write_blocks(struct recovery_journal *journal) +{ + assert_on_journal_thread(journal, __func__); + /* + * We call this function after adding entries to the journal and after finishing a block + * write. Thus, when this function terminates we must either have no VIOs waiting in the + * journal or have some outstanding IO to provide a future wakeup. + * + * We want to only issue full blocks if there are no pending writes. However, if there are + * no outstanding writes and some unwritten entries, we must issue a block, even if it's + * the active block and it isn't full. + */ + if (journal->pending_write_count > 0) + return; + + /* Write all the full blocks. */ + vdo_notify_all_waiters(&journal->pending_writes, write_block, NULL); + + /* + * Do we need to write the active block? Only if we have no outstanding writes, even after + * issuing all of the full writes. + */ + if ((journal->pending_write_count == 0) && (journal->active_block != NULL)) + write_block(&journal->active_block->write_waiter, NULL); +} + +/** + * vdo_add_recovery_journal_entry() - Add an entry to a recovery journal. + * @journal: The journal in which to make an entry. + * @data_vio: The data_vio for which to add the entry. The entry will be taken + * from the logical and new_mapped fields of the data_vio. The + * data_vio's recovery_sequence_number field will be set to the + * sequence number of the journal block in which the entry was + * made. + * + * This method is asynchronous. The data_vio will not be called back until the entry is committed + * to the on-disk journal. + */ +void vdo_add_recovery_journal_entry(struct recovery_journal *journal, struct data_vio *data_vio) +{ + assert_on_journal_thread(journal, __func__); + if (!vdo_is_state_normal(&journal->state)) { + continue_data_vio_with_error(data_vio, VDO_INVALID_ADMIN_STATE); + return; + } + + if (is_read_only(journal)) { + continue_data_vio_with_error(data_vio, VDO_READ_ONLY); + return; + } + + ASSERT_LOG_ONLY(data_vio->recovery_sequence_number == 0, + "journal lock not held for new entry"); + + vdo_advance_journal_point(&journal->append_point, journal->entries_per_block); + vdo_enqueue_waiter(&journal->entry_waiters, &data_vio->waiter); + assign_entries(journal); +} + +/** + * is_lock_locked() - Check whether a lock is locked for a zone type. + * @journal: The recovery journal. + * @lock_number: The lock to check. + * @zone_type: The type of the zone. + * + * If the recovery journal has a lock on the lock number, both logical and physical zones are + * considered locked. + * + * Return: true if the specified lock has references (is locked). 
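+ *
+ * reap_recovery_journal() uses this check to advance the block map reap
+ * head (consulting logical zone locks) and the slab journal reap head
+ * (consulting physical zone locks) independently.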
+ */ +static bool +is_lock_locked(struct recovery_journal *journal, + block_count_t lock_number, + enum vdo_zone_type zone_type) +{ + atomic_t *zone_count; + bool locked; + + if (is_journal_zone_locked(journal, lock_number)) + return true; + + zone_count = get_zone_count_ptr(journal, lock_number, zone_type); + locked = (atomic_read(zone_count) != 0); + /* Pairs with implicit barrier in vdo_release_recovery_journal_block_reference() */ + smp_rmb(); + return locked; +} + +/** + * reap_recovery_journal() - Conduct a sweep on a recovery journal to reclaim unreferenced blocks. + * @journal: The recovery journal. + */ +static void reap_recovery_journal(struct recovery_journal *journal) +{ + if (journal->reaping) + /* + * We already have an outstanding reap in progress. We need to wait for it to + * finish. + */ + return; + + if (vdo_is_state_quiescent(&journal->state)) + /* We are supposed to not do IO. Don't botch it by reaping. */ + return; + + /* + * Start reclaiming blocks only when the journal head has no references. Then stop when a + * block is referenced. + */ + while ((journal->block_map_reap_head < journal->last_write_acknowledged) && + !is_lock_locked(journal, + journal->block_map_head_block_number, + VDO_ZONE_TYPE_LOGICAL)) { + journal->block_map_reap_head++; + if (++journal->block_map_head_block_number == journal->size) + journal->block_map_head_block_number = 0; + } + + while ((journal->slab_journal_reap_head < journal->last_write_acknowledged) && + !is_lock_locked(journal, + journal->slab_journal_head_block_number, + VDO_ZONE_TYPE_PHYSICAL)) { + journal->slab_journal_reap_head++; + if (++journal->slab_journal_head_block_number == journal->size) + journal->slab_journal_head_block_number = 0; + } + + if ((journal->block_map_reap_head == journal->block_map_head) && + (journal->slab_journal_reap_head == journal->slab_journal_head)) + /* Nothing happened. */ + return; + + /* + * If the block map head will advance, we must flush any block map page modified by the + * entries we are reaping. If the slab journal head will advance, we must flush the slab + * summary update covering the slab journal that just released some lock. + */ + journal->reaping = true; + submit_flush_vio(journal->flush_vio, flush_endio, handle_flush_error); +} + +/** + * vdo_acquire_recovery_journal_block_reference() - Acquire a reference to a recovery journal block + * from somewhere other than the journal itself. + * @journal: The recovery journal. + * @sequence_number: The journal sequence number of the referenced block. + * @zone_type: The type of the zone making the adjustment. + * @zone_id: The ID of the zone making the adjustment. + */ +void vdo_acquire_recovery_journal_block_reference(struct recovery_journal *journal, + sequence_number_t sequence_number, + enum vdo_zone_type zone_type, + zone_count_t zone_id) +{ + block_count_t lock_number; + u16 *current_value; + + if (sequence_number == 0) + return; + + ASSERT_LOG_ONLY((zone_type != VDO_ZONE_TYPE_JOURNAL), + "invalid lock count increment from journal zone"); + + lock_number = vdo_get_recovery_journal_block_number(journal, sequence_number); + current_value = get_counter(journal, lock_number, zone_type, zone_id); + ASSERT_LOG_ONLY(*current_value < U16_MAX, + "increment of lock counter must not overflow"); + + if (*current_value == 0) { + /* + * This zone is acquiring this lock for the first time. Extra barriers because this + * was original developed using an atomic add operation that implicitly had them. 
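+ * (A fully ordered atomic add would imply these barriers itself; the
+ * explicit smp_mb__before_atomic()/smp_mb__after_atomic() pair around the
+ * plain atomic_inc() below preserves that ordering.)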
+ */ + smp_mb__before_atomic(); + atomic_inc(get_zone_count_ptr(journal, lock_number, zone_type)); + /* same as before_atomic */ + smp_mb__after_atomic(); + } + *current_value += 1; +} + +/** + * vdo_release_journal_entry_lock() - Release a single per-entry reference count for a recovery + * journal block. + * @journal: The recovery journal. + * @sequence_number: The journal sequence number of the referenced block. + */ +void vdo_release_journal_entry_lock(struct recovery_journal *journal, + sequence_number_t sequence_number) +{ + block_count_t lock_number; + + if (sequence_number == 0) + return; + + lock_number = vdo_get_recovery_journal_block_number(journal, sequence_number); + /* + * Extra barriers because this was originally developed using an atomic add operation that + * implicitly had them. + */ + smp_mb__before_atomic(); + atomic_inc(get_decrement_counter(journal, lock_number)); + /* same as before_atomic */ + smp_mb__after_atomic(); +} + +/** + * initiate_drain() - Initiate a drain. + * + * Implements vdo_admin_initiator. + */ +static void initiate_drain(struct admin_state *state) +{ + check_for_drain_complete(container_of(state, struct recovery_journal, state)); +} + +/** + * vdo_drain_recovery_journal() - Drain recovery journal I/O. + * @journal: The journal to drain. + * @operation: The drain operation (suspend or save). + * @parent: The completion to notify once the journal is drained. + * + * All uncommitted entries will be written out. + */ +void vdo_drain_recovery_journal(struct recovery_journal *journal, + const struct admin_state_code *operation, + struct vdo_completion *parent) +{ + assert_on_journal_thread(journal, __func__); + vdo_start_draining(&journal->state, operation, parent, initiate_drain); +} + +/** + * resume_lock_counter() - Re-allow notifications from a suspended lock counter. + * @counter: The counter. + * + * Return: true if the lock counter was suspended. + */ +static bool resume_lock_counter(struct lock_counter *counter) +{ + int prior_state; + + /* + * Extra barriers because this was original developed using a CAS operation that implicitly + * had them. + */ + smp_mb__before_atomic(); + prior_state = atomic_cmpxchg(&counter->state, + LOCK_COUNTER_STATE_SUSPENDED, + LOCK_COUNTER_STATE_NOT_NOTIFYING); + /* same as before_atomic */ + smp_mb__after_atomic(); + + return (prior_state == LOCK_COUNTER_STATE_SUSPENDED); +} + +/** + * vdo_resume_recovery_journal() - Resume a recovery journal which has been drained. + * @journal: The journal to resume. + * @parent: The completion to finish once the journal is resumed. + */ +void vdo_resume_recovery_journal(struct recovery_journal *journal, struct vdo_completion *parent) +{ + bool saved; + + assert_on_journal_thread(journal, __func__); + saved = vdo_is_state_saved(&journal->state); + vdo_set_completion_result(parent, vdo_resume_if_quiescent(&journal->state)); + if (is_read_only(journal)) { + vdo_continue_completion(parent, VDO_READ_ONLY); + return; + } + + if (saved) + initialize_journal_state(journal); + + if (resume_lock_counter(&journal->lock_counter)) + /* We might have missed a notification. */ + reap_recovery_journal(journal); + + vdo_launch_completion(parent); +} + +/** + * vdo_get_recovery_journal_logical_blocks_used() - Get the number of logical blocks in use by the + * VDO. + * @journal: The journal. + * + * Return: The number of logical blocks in use by the VDO. 
+ */ +block_count_t vdo_get_recovery_journal_logical_blocks_used(const struct recovery_journal *journal) +{ + return journal->logical_blocks_used; +} + +/** + * vdo_get_recovery_journal_statistics() - Get the current statistics from the recovery journal. + * @journal: The recovery journal to query. + * + * Return: A copy of the current statistics for the journal. + */ +struct recovery_journal_statistics +vdo_get_recovery_journal_statistics(const struct recovery_journal *journal) +{ + return journal->events; +} + +/** + * dump_recovery_block() - Dump the contents of the recovery block to the log. + * @block: The block to dump. + */ +static void dump_recovery_block(const struct recovery_journal_block *block) +{ + uds_log_info(" sequence number %llu; entries %u; %s; %zu entry waiters; %zu commit waiters", + (unsigned long long) block->sequence_number, + block->entry_count, + (block->committing ? "committing" : "waiting"), + vdo_count_waiters(&block->entry_waiters), + vdo_count_waiters(&block->commit_waiters)); +} + +/** + * vdo_dump_recovery_journal_statistics() - Dump some current statistics and other debug info from + * the recovery journal. + * @journal: The recovery journal to dump. + */ +void vdo_dump_recovery_journal_statistics(const struct recovery_journal *journal) +{ + const struct recovery_journal_block *block; + struct recovery_journal_statistics stats = vdo_get_recovery_journal_statistics(journal); + + uds_log_info("Recovery Journal"); + uds_log_info(" block_map_head=%llu slab_journal_head=%llu last_write_acknowledged=%llu tail=%llu block_map_reap_head=%llu slab_journal_reap_head=%llu disk_full=%llu slab_journal_commits_requested=%llu entry_waiters=%zu", + (unsigned long long) journal->block_map_head, + (unsigned long long) journal->slab_journal_head, + (unsigned long long) journal->last_write_acknowledged, + (unsigned long long) journal->tail, + (unsigned long long) journal->block_map_reap_head, + (unsigned long long) journal->slab_journal_reap_head, + (unsigned long long) stats.disk_full, + (unsigned long long) stats.slab_journal_commits_requested, + vdo_count_waiters(&journal->entry_waiters)); + uds_log_info(" entries: started=%llu written=%llu committed=%llu", + (unsigned long long) stats.entries.started, + (unsigned long long) stats.entries.written, + (unsigned long long) stats.entries.committed); + uds_log_info(" blocks: started=%llu written=%llu committed=%llu", + (unsigned long long) stats.blocks.started, + (unsigned long long) stats.blocks.written, + (unsigned long long) stats.blocks.committed); + + uds_log_info(" active blocks:"); + list_for_each_entry(block, &journal->active_tail_blocks, list_node) + dump_recovery_block(block); +} diff --git a/drivers/md/dm-vdo/recovery-journal.h b/drivers/md/dm-vdo/recovery-journal.h new file mode 100644 index 000000000000..8c22493e4f21 --- /dev/null +++ b/drivers/md/dm-vdo/recovery-journal.h @@ -0,0 +1,313 @@ +/* SPDX-License-Identifier: GPL-2.0-only */ +/* + * Copyright 2023 Red Hat + */ + +#ifndef VDO_RECOVERY_JOURNAL_H +#define VDO_RECOVERY_JOURNAL_H + +#include + +#include "numeric.h" + +#include "admin-state.h" +#include "constants.h" +#include "encodings.h" +#include "flush.h" +#include "statistics.h" +#include "types.h" +#include "wait-queue.h" + +/** + * DOC: recovery journal. + * + * The recovery_journal provides a log of all block mapping and reference count changes which have + * not yet been stably written to the block map or slab journals. 
This log helps to reduce the + * write amplification of writes by providing amortization of slab journal and block map page + * updates. + * + * The journal consists of a set of on-disk blocks arranged as a circular log with monotonically + * increasing sequence numbers. Three sequence numbers serve to define the active extent of the + * journal. The 'head' is the oldest active block in the journal. The 'tail' is the end of the + * half-open interval containing the active blocks. 'active' is the number of the block actively + * receiving entries. In an empty journal, head == active == tail. Once any entries are added, tail + * = active + 1, and head may be any value in the interval [tail - size, active]. + * + * The journal also contains a set of in-memory blocks which are used to buffer up entries until + * they can be committed. In general the number of in-memory blocks ('tail_buffer_count') will be + * less than the on-disk size. Each in-memory block is also a vdo_completion. Each in-memory block + * has a vio which is used to commit that block to disk. The vio's data is the on-disk + * representation of the journal block. In addition each in-memory block has a buffer which is used + * to accumulate entries while a partial commit of the block is in progress. In-memory blocks are + * kept on two rings. Free blocks live on the 'free_tail_blocks' ring. When a block becomes active + * (see below) it is moved to the 'active_tail_blocks' ring. When a block is fully committed, it is + * moved back to the 'free_tail_blocks' ring. + * + * When entries are added to the journal, they are added to the active in-memory block, as + * indicated by the 'active_block' field. If the caller wishes to wait for the entry to be + * committed, the requesting VIO will be attached to the in-memory block to which the caller's + * entry was added. If the caller does wish to wait, or if the entry filled the active block, an + * attempt will be made to commit that block to disk. If there is already another commit in + * progress, the attempt will be ignored and then automatically retried when the in-progress commit + * completes. If there is no commit in progress, any data_vios waiting on the block are transferred + * to the block's vio which is then written, automatically waking all of the waiters when it + * completes. When the write completes, any entries which accumulated in the block are copied to + * the vio's data buffer. + * + * Finally, the journal maintains a set of counters, one for each on disk journal block. These + * counters are used as locks to prevent premature reaping of journal blocks. Each time a new + * sequence number is used, the counter for the corresponding block is incremented. The counter is + * subsequently decremented when that block is filled and then committed for the last time. This + * prevents blocks from being reaped while they are still being updated. The counter is also + * incremented once for each entry added to a block, and decremented once each time the block map + * is updated in memory for that request. This prevents blocks from being reaped while their VIOs + * are still active. Finally, each in-memory block map page tracks the oldest journal block that + * contains entries corresponding to uncommitted updates to that block map page. 
Each time an + * in-memory block map page is updated, it checks if the journal block for the VIO is earlier than + * the one it references, in which case it increments the count on the earlier journal block and + * decrements the count on the later journal block, maintaining a lock on the oldest journal block + * containing entries for that page. When a block map page has been flushed from the cache, the + * counter for the journal block it references is decremented. Whenever the counter for the head + * block goes to 0, the head is advanced until it comes to a block whose counter is not 0 or until + * it reaches the active block. This is the mechanism for reclaiming journal space on disk. + * + * If there is no in-memory space when a VIO attempts to add an entry, the VIO will be attached to + * the 'commit_completion' and will be woken the next time a full block has committed. If there is + * no on-disk space when a VIO attempts to add an entry, the VIO will be attached to the + * 'reap_completion', and will be woken the next time a journal block is reaped. + */ + +enum vdo_zone_type { + VDO_ZONE_TYPE_ADMIN, + VDO_ZONE_TYPE_JOURNAL, + VDO_ZONE_TYPE_LOGICAL, + VDO_ZONE_TYPE_PHYSICAL, +}; + +struct lock_counter { + /** The completion for notifying the owner of a lock release */ + struct vdo_completion completion; + /** The number of logical zones which may hold locks */ + zone_count_t logical_zones; + /** The number of physical zones which may hold locks */ + zone_count_t physical_zones; + /** The number of locks */ + block_count_t locks; + /** Whether the lock release notification is in flight */ + atomic_t state; + /** The number of logical zones which hold each lock */ + atomic_t *logical_zone_counts; + /** The number of physical zones which hold each lock */ + atomic_t *physical_zone_counts; + /** The per-lock counts for the journal zone */ + u16 *journal_counters; + /** The per-lock decrement counts for the journal zone */ + atomic_t *journal_decrement_counts; + /** The per-zone, per-lock reference counts for logical zones */ + u16 *logical_counters; + /** The per-zone, per-lock reference counts for physical zones */ + u16 *physical_counters; +}; + +struct recovery_journal_block { + /* The doubly linked pointers for the free or active lists */ + struct list_head list_node; + /* The waiter for the pending full block list */ + struct waiter write_waiter; + /* The journal to which this block belongs */ + struct recovery_journal *journal; + /* A pointer to the current sector in the packed block buffer */ + struct packed_journal_sector *sector; + /* The vio for writing this block */ + struct vio vio; + /* The sequence number for this block */ + sequence_number_t sequence_number; + /* The location of this block in the on-disk journal */ + physical_block_number_t block_number; + /* Whether this block is being committed */ + bool committing; + /* The total number of entries in this block */ + journal_entry_count_t entry_count; + /* The total number of uncommitted entries (queued or committing) */ + journal_entry_count_t uncommitted_entry_count; + /* The number of new entries in the current commit */ + journal_entry_count_t entries_in_commit; + /* The queue of vios which will make entries for the next commit */ + struct wait_queue entry_waiters; + /* The queue of vios waiting for the current commit */ + struct wait_queue commit_waiters; +}; + +struct recovery_journal { + /* The thread ID of the journal zone */ + thread_id_t thread_id; + /* The slab depot which can hold locks on this journal */ 
+ struct slab_depot *depot; + /* The block map which can hold locks on this journal */ + struct block_map *block_map; + /* The queue of vios waiting to make entries */ + struct wait_queue entry_waiters; + /* The number of free entries in the journal */ + u64 available_space; + /* The number of decrement entries which need to be made */ + data_vio_count_t pending_decrement_count; + /* Whether the journal is adding entries from the increment or decrement waiters queues */ + bool adding_entries; + /* The administrative state of the journal */ + struct admin_state state; + /* Whether a reap is in progress */ + bool reaping; + /* The location of the first journal block */ + physical_block_number_t origin; + /* The oldest active block in the journal on disk for block map rebuild */ + sequence_number_t block_map_head; + /* The oldest active block in the journal on disk for slab journal replay */ + sequence_number_t slab_journal_head; + /* The newest block in the journal on disk to which a write has finished */ + sequence_number_t last_write_acknowledged; + /* The end of the half-open interval of the active journal */ + sequence_number_t tail; + /* The point at which the last entry will have been added */ + struct journal_point append_point; + /* The journal point of the vio most recently released from the journal */ + struct journal_point commit_point; + /* The nonce of the VDO */ + nonce_t nonce; + /* The number of recoveries completed by the VDO */ + u8 recovery_count; + /* The number of entries which fit in a single block */ + journal_entry_count_t entries_per_block; + /* Unused in-memory journal blocks */ + struct list_head free_tail_blocks; + /* In-memory journal blocks with records */ + struct list_head active_tail_blocks; + /* A pointer to the active block (the one we are adding entries to now) */ + struct recovery_journal_block *active_block; + /* Journal blocks that need writing */ + struct wait_queue pending_writes; + /* The new block map reap head after reaping */ + sequence_number_t block_map_reap_head; + /* The head block number for the block map rebuild range */ + block_count_t block_map_head_block_number; + /* The new slab journal reap head after reaping */ + sequence_number_t slab_journal_reap_head; + /* The head block number for the slab journal replay range */ + block_count_t slab_journal_head_block_number; + /* The data-less vio, usable only for flushing */ + struct vio *flush_vio; + /* The number of blocks in the on-disk journal */ + block_count_t size; + /* The number of logical blocks that are in-use */ + block_count_t logical_blocks_used; + /* The number of block map pages that are allocated */ + block_count_t block_map_data_blocks; + /* The number of journal blocks written but not yet acknowledged */ + block_count_t pending_write_count; + /* The threshold at which slab journal tail blocks will be written out */ + block_count_t slab_journal_commit_threshold; + /* Counters for events in the journal that are reported as statistics */ + struct recovery_journal_statistics events; + /* The locks for each on-disk block */ + struct lock_counter lock_counter; + /* The tail blocks */ + struct recovery_journal_block blocks[]; +}; + +/** + * vdo_get_recovery_journal_block_number() - Get the physical block number for a given sequence + * number. + * @journal: The journal. + * @sequence: The sequence number of the desired block. + * + * Return: The block number corresponding to the sequence number. 
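+ *
+ * For example (size is hypothetical): in an 8192-block journal, sequence
+ * number 10000 maps to journal block 10000 mod 8192 = 1808, relative to the
+ * journal's origin.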
+ */ +static inline physical_block_number_t __must_check +vdo_get_recovery_journal_block_number(const struct recovery_journal *journal, + sequence_number_t sequence) +{ + /* + * Since journal size is a power of two, the block number modulus can just be extracted + * from the low-order bits of the sequence. + */ + return vdo_compute_recovery_journal_block_number(journal->size, sequence); +} + +/** + * vdo_compute_recovery_journal_check_byte() - Compute the check byte for a given sequence number. + * @journal: The journal. + * @sequence: The sequence number. + * + * Return: The check byte corresponding to the sequence number. + */ +static inline u8 __must_check +vdo_compute_recovery_journal_check_byte(const struct recovery_journal *journal, + sequence_number_t sequence) +{ + /* The check byte must change with each trip around the journal. */ + return (((sequence / journal->size) & 0x7F) | 0x80); +} + +int __must_check vdo_decode_recovery_journal(struct recovery_journal_state_7_0 state, + nonce_t nonce, + struct vdo *vdo, + struct partition *partition, + u64 recovery_count, + block_count_t journal_size, + struct recovery_journal **journal_ptr); + +void vdo_free_recovery_journal(struct recovery_journal *journal); + +void vdo_initialize_recovery_journal_post_repair(struct recovery_journal *journal, + u64 recovery_count, + sequence_number_t tail, + block_count_t logical_blocks_used, + block_count_t block_map_data_blocks); + +block_count_t __must_check +vdo_get_journal_block_map_data_blocks_used(struct recovery_journal *journal); + +thread_id_t __must_check vdo_get_recovery_journal_thread_id(struct recovery_journal *journal); + +void vdo_open_recovery_journal(struct recovery_journal *journal, + struct slab_depot *depot, + struct block_map *block_map); + +sequence_number_t +vdo_get_recovery_journal_current_sequence_number(struct recovery_journal *journal); + +block_count_t __must_check vdo_get_recovery_journal_length(block_count_t journal_size); + +struct recovery_journal_state_7_0 __must_check +vdo_record_recovery_journal(const struct recovery_journal *journal); + +void vdo_add_recovery_journal_entry(struct recovery_journal *journal, struct data_vio *data_vio); + +void vdo_acquire_recovery_journal_block_reference(struct recovery_journal *journal, + sequence_number_t sequence_number, + enum vdo_zone_type zone_type, + zone_count_t zone_id); + +void vdo_release_recovery_journal_block_reference(struct recovery_journal *journal, + sequence_number_t sequence_number, + enum vdo_zone_type zone_type, + zone_count_t zone_id); + +void vdo_release_journal_entry_lock(struct recovery_journal *journal, + sequence_number_t sequence_number); + +void vdo_drain_recovery_journal(struct recovery_journal *journal, + const struct admin_state_code *operation, + struct vdo_completion *parent); + +void vdo_resume_recovery_journal(struct recovery_journal *journal, + struct vdo_completion *parent); + +block_count_t __must_check +vdo_get_recovery_journal_logical_blocks_used(const struct recovery_journal *journal); + +struct recovery_journal_statistics __must_check +vdo_get_recovery_journal_statistics(const struct recovery_journal *journal); + +void vdo_dump_recovery_journal_statistics(const struct recovery_journal *journal); + +#endif /* VDO_RECOVERY_JOURNAL_H */ From patchwork Thu Oct 26 21:41:29 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Matthew Sakai X-Patchwork-Id: 13437960 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on 
aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id EB0DBC27C48 for ; Thu, 26 Oct 2023 21:44:36 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1344887AbjJZVog (ORCPT ); Thu, 26 Oct 2023 17:44:36 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:41056 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S234957AbjJZVoY (ORCPT ); Thu, 26 Oct 2023 17:44:24 -0400 Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.129.124]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id A23BAD5E for ; Thu, 26 Oct 2023 14:41:47 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1698356507; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=payuzbiSCsa5GeWGRuB9Zq2l2VcFyo+AMJMii6kgC0I=; b=RDcskvIUlNjOizjkMY85U5p2lIEbzKmQWSkqSJlJj6F2CRidEEQJXtcEtYOmtScWul80AM rNBB6G6HSKUqPB7t7NdTfDBVUGprfjs8R3jIxA/mMTxS2KHcqFL3W9ZLG8AQVog3E6f1gB afU1BZ64VQnyFufJ3iuwV1Ioymb2FY8= Received: from mimecast-mx02.redhat.com (mimecast-mx02.redhat.com [66.187.233.88]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.3, cipher=TLS_AES_256_GCM_SHA384) id us-mta-271-nANN7grsNKiJIgY5KkTZHw-1; Thu, 26 Oct 2023 17:41:43 -0400 X-MC-Unique: nANN7grsNKiJIgY5KkTZHw-1 Received: from smtp.corp.redhat.com (int-mx03.intmail.prod.int.rdu2.redhat.com [10.11.54.3]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (2048 bits) server-digest SHA256) (No client certificate requested) by mimecast-mx02.redhat.com (Postfix) with ESMTPS id BF49F88D185; Thu, 26 Oct 2023 21:41:42 +0000 (UTC) Received: from pbitcolo-build-10.permabit.com (pbitcolo-build-10.permabit.lab.eng.bos.redhat.com [10.19.117.76]) by smtp.corp.redhat.com (Postfix) with ESMTP id A571C1121320; Thu, 26 Oct 2023 21:41:42 +0000 (UTC) Received: by pbitcolo-build-10.permabit.com (Postfix, from userid 1138) id 5BA1A3003C; Thu, 26 Oct 2023 17:41:42 -0400 (EDT) From: Matthew Sakai To: dm-devel@lists.linux.dev, linux-block@vger.kernel.org Cc: Matthew Sakai Subject: [PATCH v4 32/39] dm vdo: add repair of damaged vdo volumes Date: Thu, 26 Oct 2023 17:41:29 -0400 Message-Id: <20231026214136.1067410-33-msakai@redhat.com> In-Reply-To: <20231026214136.1067410-1-msakai@redhat.com> References: <20231026214136.1067410-1-msakai@redhat.com> MIME-Version: 1.0 X-Scanned-By: MIMEDefang 3.4.1 on 10.11.54.3 Precedence: bulk List-ID: X-Mailing-List: linux-block@vger.kernel.org When a vdo is restarted after a crash, it will automatically attempt to recover from its journals. If a vdo encounters an unrecoverable error, it will enter read-only mode. This mode indicates that some previously acknowledged data may have been lost. The vdo may be instructed to rebuild as best it can in order to return to a writable state. Although some data may be lost, this process will ensure that the vdo's own metadata is self-consistent. Co-developed-by: J. corwin Coburn Signed-off-by: J. 
corwin Coburn Co-developed-by: Michael Sclafani Signed-off-by: Michael Sclafani Co-developed-by: Sweet Tea Dorminy Signed-off-by: Sweet Tea Dorminy Signed-off-by: Matthew Sakai --- drivers/md/dm-vdo/repair.c | 1775 ++++++++++++++++++++++++++++++++++++ drivers/md/dm-vdo/repair.h | 14 + 2 files changed, 1789 insertions(+) create mode 100644 drivers/md/dm-vdo/repair.c create mode 100644 drivers/md/dm-vdo/repair.h diff --git a/drivers/md/dm-vdo/repair.c b/drivers/md/dm-vdo/repair.c new file mode 100644 index 000000000000..3f4aa570befc --- /dev/null +++ b/drivers/md/dm-vdo/repair.c @@ -0,0 +1,1775 @@ +// SPDX-License-Identifier: GPL-2.0-only +/* + * Copyright 2023 Red Hat + */ + +#include "repair.h" + +#include +#include + +#include "logger.h" +#include "memory-alloc.h" +#include "permassert.h" + +#include "block-map.h" +#include "completion.h" +#include "constants.h" +#include "encodings.h" +#include "int-map.h" +#include "io-submitter.h" +#include "recovery-journal.h" +#include "slab-depot.h" +#include "types.h" +#include "vdo.h" +#include "wait-queue.h" + +/* + * An explicitly numbered block mapping. Numbering the mappings allows them to be sorted by logical + * block number during repair while still preserving the relative order of journal entries with + * the same logical block number. + */ +struct numbered_block_mapping { + struct block_map_slot block_map_slot; + struct block_map_entry block_map_entry; + /* A serial number to use during replay */ + u32 number; +} __packed; + +/* + * The absolute position of an entry in the recovery journal, including the sector number and the + * entry number within the sector. + */ +struct recovery_point { + /* Block sequence number */ + sequence_number_t sequence_number; + /* Sector number */ + u8 sector_count; + /* Entry number */ + journal_entry_count_t entry_count; + /* Whether or not the increment portion of the current entry has been applied */ + bool increment_applied; +}; + +struct repair_completion { + /* The completion header */ + struct vdo_completion completion; + + /* A buffer to hold the data read off disk */ + char *journal_data; + + /* For loading the journal */ + data_vio_count_t vio_count; + data_vio_count_t vios_complete; + struct vio *vios; + + /* The number of entries to be applied to the block map */ + size_t block_map_entry_count; + /* The sequence number of the first valid block for block map recovery */ + sequence_number_t block_map_head; + /* The sequence number of the first valid block for slab journal replay */ + sequence_number_t slab_journal_head; + /* The sequence number of the last valid block of the journal (if known) */ + sequence_number_t tail; + /* + * The highest sequence number of the journal. During recovery (vs read-only rebuild), not + * the same as the tail, since the tail ignores blocks after the first hole. + */ + sequence_number_t highest_tail; + + /* The number of logical blocks currently known to be in use */ + block_count_t logical_blocks_used; + /* The number of block map data blocks known to be allocated */ + block_count_t block_map_data_blocks; + + /* These fields are for playing the journal into the block map */ + /* The entry data for the block map recovery */ + struct numbered_block_mapping *entries; + /* The number of entries in the entry array */ + size_t entry_count; + /* number of pending (non-ready) requests*/ + page_count_t outstanding; + /* number of page completions */ + page_count_t page_count; + bool launching; + /* + * a heap wrapping journal_entries. 
It re-orders and sorts journal entries in ascending LBN + * order, then original journal order. This permits efficient iteration over the journal + * entries in order. + */ + struct min_heap replay_heap; + /* Fields tracking progress through the journal entries. */ + struct numbered_block_mapping *current_entry; + struct numbered_block_mapping *current_unfetched_entry; + /* Current requested page's PBN */ + physical_block_number_t pbn; + + /* These fields are only used during recovery. */ + /* A location just beyond the last valid entry of the journal */ + struct recovery_point tail_recovery_point; + /* The location of the next recovery journal entry to apply */ + struct recovery_point next_recovery_point; + /* The journal point to give to the next synthesized decref */ + struct journal_point next_journal_point; + /* The number of entries played into slab journals */ + size_t entries_added_to_slab_journals; + + /* These fields are only used during read-only rebuild */ + page_count_t page_to_fetch; + /* the number of leaf pages in the block map */ + page_count_t leaf_pages; + /* the last slot of the block map */ + struct block_map_slot last_slot; + + /* + * The page completions used for playing the journal into the block map, and, during + * read-only rebuild, for rebuilding the reference counts from the block map. + */ + struct vdo_page_completion page_completions[]; +}; + +/* + * This is a min_heap callback function that orders numbered_block_mappings using the + * 'block_map_slot' field as the primary key and the mapping 'number' field as the secondary key. + * Using the mapping number preserves the journal order of entries for the same slot, allowing us + * to sort by slot while still ensuring we replay all entries with the same slot in the exact order + * as they appeared in the journal. + */ +static bool mapping_is_less_than(const void *item1, const void *item2) +{ + const struct numbered_block_mapping *mapping1 = + (const struct numbered_block_mapping *) item1; + const struct numbered_block_mapping *mapping2 = + (const struct numbered_block_mapping *) item2; + + if (mapping1->block_map_slot.pbn != mapping2->block_map_slot.pbn) + return mapping1->block_map_slot.pbn < mapping2->block_map_slot.pbn; + + if (mapping1->block_map_slot.slot != mapping2->block_map_slot.slot) + return mapping1->block_map_slot.slot < mapping2->block_map_slot.slot; + + if (mapping1->number != mapping2->number) + return mapping1->number < mapping2->number; + + return 0; +} + +static void swap_mappings(void *item1, void *item2) +{ + struct numbered_block_mapping *mapping1 = item1; + struct numbered_block_mapping *mapping2 = item2; + + swap(*mapping1, *mapping2); +} + +static const struct min_heap_callbacks repair_min_heap = { + .elem_size = sizeof(struct numbered_block_mapping), + .less = mapping_is_less_than, + .swp = swap_mappings, +}; + +static struct numbered_block_mapping * +sort_next_heap_element(struct repair_completion *repair) +{ + struct min_heap *heap = &repair->replay_heap; + struct numbered_block_mapping *last; + + if (heap->nr == 0) + return NULL; + + /* + * Swap the next heap element with the last one on the heap, popping it off the heap, + * restore the heap invariant, and return a pointer to the popped element. + */ + last = &repair->entries[--heap->nr]; + swap_mappings(heap->data, last); + min_heapify(heap, 0, &repair_min_heap); + return last; +} + +/** + * as_repair_completion() - Convert a generic completion to a repair_completion. + * @completion: The completion to convert. 
+ * + * Return: The repair_completion. + */ +static inline struct repair_completion * __must_check +as_repair_completion(struct vdo_completion *completion) +{ + vdo_assert_completion_type(completion, VDO_REPAIR_COMPLETION); + return container_of(completion, struct repair_completion, completion); +} + +static void prepare_repair_completion(struct repair_completion *repair, + vdo_action *callback, + enum vdo_zone_type zone_type) +{ + struct vdo_completion *completion = &repair->completion; + const struct thread_config *thread_config = &completion->vdo->thread_config; + thread_id_t thread_id; + + /* All blockmap access is done on single thread, so use logical zone 0. */ + thread_id = ((zone_type == VDO_ZONE_TYPE_LOGICAL) ? + thread_config->logical_threads[0] : + thread_config->admin_thread); + vdo_reset_completion(completion); + vdo_set_completion_callback(completion, callback, thread_id); +} + +static void launch_repair_completion(struct repair_completion *repair, + vdo_action *callback, + enum vdo_zone_type zone_type) +{ + prepare_repair_completion(repair, callback, zone_type); + vdo_launch_completion(&repair->completion); +} + +static void uninitialize_vios(struct repair_completion *repair) +{ + while (repair->vio_count > 0) + free_vio_components(&repair->vios[--repair->vio_count]); + + UDS_FREE(UDS_FORGET(repair->vios)); +} + +static void free_repair_completion(struct repair_completion *repair) +{ + if (repair == NULL) + return; + + /* + * We do this here because this function is the only common bottleneck for all clean up + * paths. + */ + repair->completion.vdo->block_map->zones[0].page_cache.rebuilding = false; + + uninitialize_vios(repair); + UDS_FREE(UDS_FORGET(repair->journal_data)); + UDS_FREE(UDS_FORGET(repair->entries)); + UDS_FREE(repair); +} + +static void finish_repair(struct vdo_completion *completion) +{ + struct vdo_completion *parent = completion->parent; + struct vdo *vdo = completion->vdo; + struct repair_completion *repair = as_repair_completion(completion); + + vdo_assert_on_admin_thread(vdo, __func__); + + if (vdo->load_state != VDO_REBUILD_FOR_UPGRADE) + vdo->states.vdo.complete_recoveries++; + + vdo_initialize_recovery_journal_post_repair(vdo->recovery_journal, + vdo->states.vdo.complete_recoveries, + repair->highest_tail, + repair->logical_blocks_used, + repair->block_map_data_blocks); + free_repair_completion(UDS_FORGET(repair)); + + if (vdo_state_requires_read_only_rebuild(vdo->load_state)) { + uds_log_info("Read-only rebuild complete"); + vdo_launch_completion(parent); + return; + } + + /* FIXME: shouldn't this say either "recovery" or "repair"? */ + uds_log_info("Rebuild complete"); + + /* + * Now that we've freed the repair completion and its vast array of journal entries, we + * can allocate refcounts. + */ + vdo_continue_completion(parent, vdo_allocate_reference_counters(vdo->depot)); +} + +/** + * abort_repair() - Handle a repair error. + * @completion: The repair completion. + */ +static void abort_repair(struct vdo_completion *completion) +{ + struct vdo_completion *parent = completion->parent; + int result = completion->result; + struct repair_completion *repair = as_repair_completion(completion); + + if (vdo_state_requires_read_only_rebuild(completion->vdo->load_state)) + uds_log_info("Read-only rebuild aborted"); + else + uds_log_warning("Recovery aborted"); + + free_repair_completion(UDS_FORGET(repair)); + vdo_continue_completion(parent, result); +} + +/** + * abort_on_error() - Abort a repair if there is an error. + * @result: The result to check. 
+ * @repair: The repair completion. + * + * Return: true if the result was an error. + */ +static bool __must_check abort_on_error(int result, struct repair_completion *repair) +{ + if (result == VDO_SUCCESS) + return false; + + vdo_fail_completion(&repair->completion, result); + return true; +} + +/** + * drain_slab_depot() - Flush out all dirty refcount blocks now that they have been rebuilt or + * recovered. + */ +static void drain_slab_depot(struct vdo_completion *completion) +{ + struct vdo *vdo = completion->vdo; + struct repair_completion *repair = as_repair_completion(completion); + const struct admin_state_code *operation; + + vdo_assert_on_admin_thread(vdo, __func__); + + prepare_repair_completion(repair, finish_repair, VDO_ZONE_TYPE_ADMIN); + if (vdo_state_requires_read_only_rebuild(vdo->load_state)) { + uds_log_info("Saving rebuilt state"); + operation = VDO_ADMIN_STATE_REBUILDING; + } else { + uds_log_info("Replayed %zu journal entries into slab journals", + repair->entries_added_to_slab_journals); + operation = VDO_ADMIN_STATE_RECOVERING; + } + + vdo_drain_slab_depot(vdo->depot, operation, completion); +} + +/** + * flush_block_map_updates() - Flush the block map now that all the reference counts are rebuilt. + * @completion: The repair completion. + * + * This callback is registered in finish_if_done(). + */ +static void flush_block_map_updates(struct vdo_completion *completion) +{ + vdo_assert_on_admin_thread(completion->vdo, __func__); + + uds_log_info("Flushing block map changes"); + prepare_repair_completion(as_repair_completion(completion), + drain_slab_depot, + VDO_ZONE_TYPE_ADMIN); + vdo_drain_block_map(completion->vdo->block_map, VDO_ADMIN_STATE_RECOVERING, completion); +} + +static bool fetch_page(struct repair_completion *repair, struct vdo_completion *completion); + +/** + * handle_page_load_error() - Handle an error loading a page. + * @completion: The vdo_page_completion. + */ +static void handle_page_load_error(struct vdo_completion *completion) +{ + struct repair_completion *repair = completion->parent; + + repair->outstanding--; + vdo_set_completion_result(&repair->completion, completion->result); + vdo_release_page_completion(completion); + fetch_page(repair, completion); +} + +/** + * unmap_entry() - Unmap an invalid entry and indicate that its page must be written out. + * @page: The page containing the entries + * @completion: The page_completion for writing the page + * @slot: The slot to unmap + */ +static void +unmap_entry(struct block_map_page *page, struct vdo_completion *completion, slot_number_t slot) +{ + page->entries[slot] = UNMAPPED_BLOCK_MAP_ENTRY; + vdo_request_page_write(completion); +} + +/** + * remove_out_of_bounds_entries() - Unmap entries which are outside the logical space. + * @page: The page containing the entries + * @completion: The page_completion for writing the page + * @start: The first slot to check + */ +static void remove_out_of_bounds_entries(struct block_map_page *page, + struct vdo_completion *completion, + slot_number_t start) +{ + slot_number_t slot; + + for (slot = start; slot < VDO_BLOCK_MAP_ENTRIES_PER_PAGE; slot++) { + struct data_location mapping = vdo_unpack_block_map_entry(&page->entries[slot]); + + if (vdo_is_mapped_location(&mapping)) + unmap_entry(page, completion, slot); + } +} + +/** + * process_slot() - Update the reference counts for a single entry.
+ * @page: The page containing the entries + * @completion: The page_completion for writing the page + * @slot: The slot to check + * + * Return: true if the entry was a valid mapping + */ +static bool +process_slot(struct block_map_page *page, struct vdo_completion *completion, slot_number_t slot) +{ + struct slab_depot *depot = completion->vdo->depot; + int result; + struct data_location mapping = vdo_unpack_block_map_entry(&page->entries[slot]); + + if (!vdo_is_valid_location(&mapping)) { + /* This entry is invalid, so remove it from the page. */ + unmap_entry(page, completion, slot); + return false; + } + + if (!vdo_is_mapped_location(&mapping)) + return false; + + + if (mapping.pbn == VDO_ZERO_BLOCK) + return true; + + if (!vdo_is_physical_data_block(depot, mapping.pbn)) { + /* + * This is a nonsense mapping. Remove it from the map so we're at least consistent + * and mark the page dirty. + */ + unmap_entry(page, completion, slot); + return false; + } + + result = vdo_adjust_reference_count_for_rebuild(depot, + mapping.pbn, + VDO_JOURNAL_DATA_REMAPPING); + if (result == VDO_SUCCESS) + return true; + + uds_log_error_strerror(result, + "Could not adjust reference count for PBN %llu, slot %u mapped to PBN %llu", + (unsigned long long) vdo_get_block_map_page_pbn(page), + slot, + (unsigned long long) mapping.pbn); + unmap_entry(page, completion, slot); + return false; +} + +/** + * rebuild_reference_counts_from_page() - Rebuild reference counts from a block map page. + * @repair: The repair completion. + * @completion: The page completion holding the page. + */ +static void rebuild_reference_counts_from_page(struct repair_completion *repair, + struct vdo_completion *completion) +{ + slot_number_t slot, last_slot; + struct block_map_page *page; + int result; + + result = vdo_get_cached_page(completion, &page); + if (result != VDO_SUCCESS) { + vdo_set_completion_result(&repair->completion, result); + return; + } + + if (!page->header.initialized) + return; + + /* Remove any bogus entries which exist beyond the end of the logical space. */ + if (vdo_get_block_map_page_pbn(page) == repair->last_slot.pbn) { + last_slot = repair->last_slot.slot; + remove_out_of_bounds_entries(page, completion, last_slot); + } else { + last_slot = VDO_BLOCK_MAP_ENTRIES_PER_PAGE; + } + + /* Inform the slab depot of all entries on this page. */ + for (slot = 0; slot < last_slot; slot++) { + if (process_slot(page, completion, slot)) + repair->logical_blocks_used++; + } +} + +/** + * page_loaded() - Process a page which has just been loaded. + * @completion: The vdo_page_completion for the fetched page. + * + * This callback is registered by fetch_page(). + */ +static void page_loaded(struct vdo_completion *completion) +{ + struct repair_completion *repair = completion->parent; + + repair->outstanding--; + rebuild_reference_counts_from_page(repair, completion); + vdo_release_page_completion(completion); + + /* Advance progress to the next page, and fetch the next page we haven't yet requested. 
*/ + fetch_page(repair, completion); +} + +static physical_block_number_t +get_pbn_to_fetch(struct repair_completion *repair, struct block_map *block_map) +{ + physical_block_number_t pbn = VDO_ZERO_BLOCK; + + if (repair->completion.result != VDO_SUCCESS) + return VDO_ZERO_BLOCK; + + while ((pbn == VDO_ZERO_BLOCK) && (repair->page_to_fetch < repair->leaf_pages)) + pbn = vdo_find_block_map_page_pbn(block_map, repair->page_to_fetch++); + + if (vdo_is_physical_data_block(repair->completion.vdo->depot, pbn)) + return pbn; + + vdo_set_completion_result(&repair->completion, VDO_BAD_MAPPING); + return VDO_ZERO_BLOCK; +} + +/** + * fetch_page() - Fetch a page from the block map. + * @repair: The repair_completion. + * @completion: The page completion to use. + * + * Return true if the rebuild is complete + */ +static bool fetch_page(struct repair_completion *repair, struct vdo_completion *completion) +{ + struct vdo_page_completion *page_completion = (struct vdo_page_completion *) completion; + struct block_map *block_map = repair->completion.vdo->block_map; + physical_block_number_t pbn = get_pbn_to_fetch(repair, block_map); + + if (pbn != VDO_ZERO_BLOCK) { + repair->outstanding++; + /* + * We must set the requeue flag here to ensure that we don't blow the stack if all + * the requested pages are already in the cache or get load errors. + */ + vdo_get_page(page_completion, + &block_map->zones[0], + pbn, + true, + repair, + page_loaded, + handle_page_load_error, + true); + } + + if (repair->outstanding > 0) + return false; + + launch_repair_completion(repair, flush_block_map_updates, VDO_ZONE_TYPE_ADMIN); + return true; +} + +/** + * rebuild_from_leaves() - Rebuild reference counts from the leaf block map pages. + * @completion: The repair completion. + * + * Rebuilds reference counts from the leaf block map pages now that reference counts have been + * rebuilt from the interior tree pages (which have been loaded in the process). This callback is + * registered in rebuild_reference_counts(). + */ +static void rebuild_from_leaves(struct vdo_completion *completion) +{ + page_count_t i; + struct repair_completion *repair = as_repair_completion(completion); + struct block_map *map = completion->vdo->block_map; + + repair->logical_blocks_used = 0; + + /* + * The PBN calculation doesn't work until the tree pages have been loaded, so we can't set + * this value at the start of repair. + */ + repair->leaf_pages = vdo_compute_block_map_page_count(map->entry_count); + repair->last_slot = (struct block_map_slot) { + .slot = map->entry_count % VDO_BLOCK_MAP_ENTRIES_PER_PAGE, + .pbn = vdo_find_block_map_page_pbn(map, repair->leaf_pages - 1), + }; + if (repair->last_slot.slot == 0) + repair->last_slot.slot = VDO_BLOCK_MAP_ENTRIES_PER_PAGE; + + for (i = 0; i < repair->page_count; i++) { + if (fetch_page(repair, &repair->page_completions[i].completion)) + /* + * The rebuild has already moved on, so it isn't safe nor is there a need + * to launch any more fetches. + */ + return; + } +} + +/** + * process_entry() - Process a single entry from the block map tree. + * @pbn: A pbn which holds a block map tree page. + * @completion: The parent completion of the traversal. + * + * Implements vdo_entry_callback. + * + * Return: VDO_SUCCESS or an error. 
+ */ +static int process_entry(physical_block_number_t pbn, struct vdo_completion *completion) +{ + struct repair_completion *repair = as_repair_completion(completion); + struct slab_depot *depot = completion->vdo->depot; + int result; + + if ((pbn == VDO_ZERO_BLOCK) || !vdo_is_physical_data_block(depot, pbn)) + return uds_log_error_strerror(VDO_BAD_CONFIGURATION, + "PBN %llu out of range", + (unsigned long long) pbn); + + result = vdo_adjust_reference_count_for_rebuild(depot, + pbn, + VDO_JOURNAL_BLOCK_MAP_REMAPPING); + if (result != VDO_SUCCESS) + return uds_log_error_strerror(result, + "Could not adjust reference count for block map tree PBN %llu", + (unsigned long long) pbn); + + repair->block_map_data_blocks++; + return VDO_SUCCESS; +} + +static void rebuild_reference_counts(struct vdo_completion *completion) +{ + struct repair_completion *repair = as_repair_completion(completion); + struct vdo *vdo = completion->vdo; + struct vdo_page_cache *cache = &vdo->block_map->zones[0].page_cache; + + /* We must allocate ref_counts before we can rebuild them. */ + if (abort_on_error(vdo_allocate_reference_counters(vdo->depot), repair)) + return; + + /* + * Completion chaining from page cache hits can lead to stack overflow during the rebuild, + * so clear out the cache before this rebuild phase. + */ + if (abort_on_error(vdo_invalidate_page_cache(cache), repair)) + return; + + prepare_repair_completion(repair, rebuild_from_leaves, VDO_ZONE_TYPE_LOGICAL); + vdo_traverse_forest(vdo->block_map, process_entry, completion); +} + +/** + * increment_recovery_point() - Move the given recovery point forward by one entry. + */ +static void increment_recovery_point(struct recovery_point *point) +{ + if (++point->entry_count < RECOVERY_JOURNAL_ENTRIES_PER_SECTOR) + return; + + point->entry_count = 0; + if (point->sector_count < (VDO_SECTORS_PER_BLOCK - 1)) { + point->sector_count++; + return; + } + + point->sequence_number++; + point->sector_count = 1; +} + +/** + * advance_points() - Advance the current recovery and journal points. + * @repair: The repair_completion whose points are to be advanced. + * @entries_per_block: The number of entries in a recovery journal block. + */ +static void +advance_points(struct repair_completion *repair, journal_entry_count_t entries_per_block) +{ + if (!repair->next_recovery_point.increment_applied) { + repair->next_recovery_point.increment_applied = true; + return; + } + + increment_recovery_point(&repair->next_recovery_point); + vdo_advance_journal_point(&repair->next_journal_point, entries_per_block); + repair->next_recovery_point.increment_applied = false; +} + +/** + * before_recovery_point() - Check whether the first point precedes the second point. + * @first: The first recovery point. + * @second: The second recovery point. + * + * Return: true if the first point precedes the second point. 
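+ * + * Illustrative example (hypothetical values): the comparison is lexicographic on + * (sequence_number, sector_count, entry_count), so (4, 1, 63) precedes (4, 2, 0), which in turn + * precedes (5, 1, 0).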
+ */ +static bool __must_check +before_recovery_point(const struct recovery_point *first, const struct recovery_point *second) +{ + if (first->sequence_number < second->sequence_number) + return true; + + if (first->sequence_number > second->sequence_number) + return false; + + if (first->sector_count < second->sector_count) + return true; + + return ((first->sector_count == second->sector_count) && + (first->entry_count < second->entry_count)); +} + +static struct packed_journal_sector * __must_check +get_sector(struct recovery_journal *journal, + char *journal_data, + sequence_number_t sequence, + u8 sector_number) +{ + off_t offset; + + offset = ((vdo_get_recovery_journal_block_number(journal, sequence) * VDO_BLOCK_SIZE) + + (VDO_SECTOR_SIZE * sector_number)); + return (struct packed_journal_sector *) (journal_data + offset); +} + +/** + * get_entry() - Unpack the recovery journal entry associated with the given recovery point. + * @repair: The repair completion. + * @point: The recovery point. + * + * Return: The unpacked contents of the matching recovery journal entry. + */ +static struct recovery_journal_entry +get_entry(const struct repair_completion *repair, const struct recovery_point *point) +{ + struct packed_journal_sector *sector; + + sector = get_sector(repair->completion.vdo->recovery_journal, + repair->journal_data, + point->sequence_number, + point->sector_count); + return vdo_unpack_recovery_journal_entry(&sector->entries[point->entry_count]); +} + +/** + * validate_recovery_journal_entry() - Validate a recovery journal entry. + * @vdo: The vdo. + * @entry: The entry to validate. + * + * Return: VDO_SUCCESS or an error. + */ +static int +validate_recovery_journal_entry(const struct vdo *vdo, const struct recovery_journal_entry *entry) +{ + if ((entry->slot.pbn >= vdo->states.vdo.config.physical_blocks) || + (entry->slot.slot >= VDO_BLOCK_MAP_ENTRIES_PER_PAGE) || + !vdo_is_valid_location(&entry->mapping) || + !vdo_is_valid_location(&entry->unmapping) || + !vdo_is_physical_data_block(vdo->depot, entry->mapping.pbn) || + !vdo_is_physical_data_block(vdo->depot, entry->unmapping.pbn)) + return uds_log_error_strerror(VDO_CORRUPT_JOURNAL, + "Invalid entry: %s (%llu, %u) from %llu to %llu is not within bounds", + vdo_get_journal_operation_name(entry->operation), + (unsigned long long) entry->slot.pbn, + entry->slot.slot, + (unsigned long long) entry->unmapping.pbn, + (unsigned long long) entry->mapping.pbn); + + if ((entry->operation == VDO_JOURNAL_BLOCK_MAP_REMAPPING) && + (vdo_is_state_compressed(entry->mapping.state) || + (entry->mapping.pbn == VDO_ZERO_BLOCK) || + (entry->unmapping.state != VDO_MAPPING_STATE_UNMAPPED) || + (entry->unmapping.pbn != VDO_ZERO_BLOCK))) + return uds_log_error_strerror(VDO_CORRUPT_JOURNAL, + "Invalid entry: %s (%llu, %u) from %llu to %llu is not a valid tree mapping", + vdo_get_journal_operation_name(entry->operation), + (unsigned long long) entry->slot.pbn, + entry->slot.slot, + (unsigned long long) entry->unmapping.pbn, + (unsigned long long) entry->mapping.pbn); + + return VDO_SUCCESS; +} + +/** + * add_slab_journal_entries() - Replay recovery journal entries into the slab journals of the + * allocator currently being recovered. + * @completion: The allocator completion. + * + * Waits for slab journal tailblock space when necessary. This method is its own callback.
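+ * + * That is, when vdo_attempt_replay_into_slab() must wait for tail block space, this function + * returns early; because it is registered as the completion's callback (see the + * vdo_prepare_completion() call below), it is invoked again and resumes from next_recovery_point + * once space becomes available.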
+ */ +static void add_slab_journal_entries(struct vdo_completion *completion) +{ + struct recovery_point *recovery_point; + struct repair_completion *repair = completion->parent; + struct vdo *vdo = completion->vdo; + struct recovery_journal *journal = vdo->recovery_journal; + struct block_allocator *allocator = vdo_as_block_allocator(completion); + + /* Get ready in case we need to enqueue again. */ + vdo_prepare_completion(completion, + add_slab_journal_entries, + vdo_notify_slab_journals_are_recovered, + completion->callback_thread_id, + repair); + for (recovery_point = &repair->next_recovery_point; + before_recovery_point(recovery_point, &repair->tail_recovery_point); + advance_points(repair, journal->entries_per_block)) { + int result; + physical_block_number_t pbn; + struct vdo_slab *slab; + struct recovery_journal_entry entry = get_entry(repair, recovery_point); + bool increment = !repair->next_recovery_point.increment_applied; + + if (increment) { + result = validate_recovery_journal_entry(vdo, &entry); + if (result != VDO_SUCCESS) { + vdo_enter_read_only_mode(vdo, result); + vdo_fail_completion(completion, result); + return; + } + + pbn = entry.mapping.pbn; + } else { + pbn = entry.unmapping.pbn; + } + + if (pbn == VDO_ZERO_BLOCK) + continue; + + slab = vdo_get_slab(vdo->depot, pbn); + if (slab->allocator != allocator) + continue; + + if (!vdo_attempt_replay_into_slab(slab, + pbn, + entry.operation, + increment, + &repair->next_journal_point, + completion)) + return; + + repair->entries_added_to_slab_journals++; + } + + vdo_notify_slab_journals_are_recovered(completion); +} + +/** + * vdo_replay_into_slab_journals() - Replay recovery journal entries in the slab journals of slabs + * owned by a given block_allocator. + * @allocator: The allocator whose slab journals are to be recovered. + * @context: The slab depot load context supplied by a recovery when it loads the depot. 
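+ * + * This runs on the allocator's physical zone thread; the context pointer is the + * repair_completion passed to vdo_load_slab_depot() in load_slab_depot() below.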
+ */ +void vdo_replay_into_slab_journals(struct block_allocator *allocator, void *context) +{ + struct vdo_completion *completion = &allocator->completion; + struct repair_completion *repair = context; + struct vdo *vdo = completion->vdo; + + vdo_assert_on_physical_zone_thread(vdo, allocator->zone_number, __func__); + if (repair->entry_count == 0) { + /* there's nothing to replay */ + repair->logical_blocks_used = vdo->recovery_journal->logical_blocks_used; + repair->block_map_data_blocks = vdo->recovery_journal->block_map_data_blocks; + vdo_notify_slab_journals_are_recovered(completion); + return; + } + + repair->next_recovery_point = (struct recovery_point) { + .sequence_number = repair->slab_journal_head, + .sector_count = 1, + .entry_count = 0, + }; + + repair->next_journal_point = (struct journal_point) { + .sequence_number = repair->slab_journal_head, + .entry_count = 0, + }; + + uds_log_info("Replaying entries into slab journals for zone %u", allocator->zone_number); + completion->parent = repair; + add_slab_journal_entries(completion); +} + +static void load_slab_depot(struct vdo_completion *completion) +{ + struct repair_completion *repair = as_repair_completion(completion); + const struct admin_state_code *operation; + + vdo_assert_on_admin_thread(completion->vdo, __func__); + + if (vdo_state_requires_read_only_rebuild(completion->vdo->load_state)) { + prepare_repair_completion(repair, rebuild_reference_counts, VDO_ZONE_TYPE_LOGICAL); + operation = VDO_ADMIN_STATE_LOADING_FOR_REBUILD; + } else { + prepare_repair_completion(repair, drain_slab_depot, VDO_ZONE_TYPE_ADMIN); + operation = VDO_ADMIN_STATE_LOADING_FOR_RECOVERY; + } + + vdo_load_slab_depot(completion->vdo->depot, operation, completion, repair); +} + +static void flush_block_map(struct vdo_completion *completion) +{ + struct repair_completion *repair = as_repair_completion(completion); + const struct admin_state_code *operation; + + vdo_assert_on_admin_thread(completion->vdo, __func__); + + uds_log_info("Flushing block map changes"); + prepare_repair_completion(repair, load_slab_depot, VDO_ZONE_TYPE_ADMIN); + operation = (vdo_state_requires_read_only_rebuild(completion->vdo->load_state) ? + VDO_ADMIN_STATE_REBUILDING : + VDO_ADMIN_STATE_RECOVERING); + vdo_drain_block_map(completion->vdo->block_map, operation, completion); +} + +static bool finish_if_done(struct repair_completion *repair) +{ + /* Pages are still being launched or there is still work to do */ + if (repair->launching || (repair->outstanding > 0)) + return false; + + if (repair->completion.result != VDO_SUCCESS) { + page_count_t i; + + for (i = 0; i < repair->page_count; i++) { + struct vdo_page_completion *page_completion = + &repair->page_completions[i]; + + if (page_completion->ready) + vdo_release_page_completion(&page_completion->completion); + } + + vdo_launch_completion(&repair->completion); + return true; + } + + if (repair->current_entry >= repair->entries) + return false; + + launch_repair_completion(repair, flush_block_map, VDO_ZONE_TYPE_ADMIN); + return true; +} + +static void abort_block_map_recovery(struct repair_completion *repair, int result) +{ + vdo_set_completion_result(&repair->completion, result); + finish_if_done(repair); +} + +/** + * find_entry_starting_next_page() - Find the first journal entry after a given entry which is not + * on the same block map page. + * @current_entry: The entry to search from. + * @needs_sort: Whether sorting is needed to proceed. 
+ * + * Return: Pointer to the first later journal entry on a different block map page, or a pointer to + * just before the journal entries if no subsequent entry is on a different block map page. + */ +static struct numbered_block_mapping * +find_entry_starting_next_page(struct repair_completion *repair, + struct numbered_block_mapping *current_entry, + bool needs_sort) +{ + size_t current_page; + + /* If current_entry is invalid, return immediately. */ + if (current_entry < repair->entries) + return current_entry; + + current_page = current_entry->block_map_slot.pbn; + + /* Decrement current_entry until it's out of bounds or on a different page. */ + while ((current_entry >= repair->entries) && + (current_entry->block_map_slot.pbn == current_page)) { + if (needs_sort) { + struct numbered_block_mapping *just_sorted_entry = + sort_next_heap_element(repair); + ASSERT_LOG_ONLY(just_sorted_entry < current_entry, + "heap is returning elements in an unexpected order"); + } + + current_entry--; + } + + return current_entry; +} + +/* + * Apply a range of journal entries [starting_entry, ending_entry) to a block map page. + */ +static void apply_journal_entries_to_page(struct block_map_page *page, + struct numbered_block_mapping *starting_entry, + struct numbered_block_mapping *ending_entry) +{ + struct numbered_block_mapping *current_entry = starting_entry; + + while (current_entry != ending_entry) { + page->entries[current_entry->block_map_slot.slot] = current_entry->block_map_entry; + current_entry--; + } +} + +static void recover_ready_pages(struct repair_completion *repair, + struct vdo_completion *completion); + +static void block_map_page_loaded(struct vdo_completion *completion) +{ + struct repair_completion *repair = as_repair_completion(completion->parent); + + repair->outstanding--; + if (!repair->launching) + recover_ready_pages(repair, completion); +} + +static void handle_block_map_page_load_error(struct vdo_completion *completion) +{ + struct repair_completion *repair = as_repair_completion(completion->parent); + + repair->outstanding--; + abort_block_map_recovery(repair, completion->result); +} + +static void fetch_block_map_page(struct repair_completion *repair, + struct vdo_completion *completion) +{ + physical_block_number_t pbn; + + if (repair->current_unfetched_entry < repair->entries) + /* Nothing left to fetch. */ + return; + + /* Fetch the next page we haven't yet requested.
*/ + pbn = repair->current_unfetched_entry->block_map_slot.pbn; + repair->current_unfetched_entry = + find_entry_starting_next_page(repair, repair->current_unfetched_entry, true); + repair->outstanding++; + vdo_get_page(((struct vdo_page_completion *) completion), + &repair->completion.vdo->block_map->zones[0], + pbn, + true, + &repair->completion, + block_map_page_loaded, + handle_block_map_page_load_error, + false); +} + +static struct vdo_page_completion * +get_next_page_completion(struct repair_completion *repair, struct vdo_page_completion *completion) +{ + completion++; + if (completion == (&repair->page_completions[repair->page_count])) + completion = &repair->page_completions[0]; + return completion; +} + +static void recover_ready_pages(struct repair_completion *repair, + struct vdo_completion *completion) +{ + struct vdo_page_completion *page_completion = (struct vdo_page_completion *) completion; + + if (finish_if_done(repair)) + return; + + if (repair->pbn != page_completion->pbn) + return; + + while (page_completion->ready) { + struct numbered_block_mapping *start_of_next_page; + struct block_map_page *page; + int result; + + result = vdo_get_cached_page(completion, &page); + if (result != VDO_SUCCESS) { + abort_block_map_recovery(repair, result); + return; + } + + start_of_next_page = + find_entry_starting_next_page(repair, repair->current_entry, false); + apply_journal_entries_to_page(page, repair->current_entry, start_of_next_page); + repair->current_entry = start_of_next_page; + vdo_request_page_write(completion); + vdo_release_page_completion(completion); + + if (finish_if_done(repair)) + return; + + repair->pbn = repair->current_entry->block_map_slot.pbn; + fetch_block_map_page(repair, completion); + page_completion = get_next_page_completion(repair, page_completion); + completion = &page_completion->completion; + } +} + +static void recover_block_map(struct vdo_completion *completion) +{ + struct repair_completion *repair = as_repair_completion(completion); + struct vdo *vdo = completion->vdo; + struct numbered_block_mapping *first_sorted_entry; + page_count_t i; + + vdo_assert_on_logical_zone_thread(vdo, 0, __func__); + + /* Suppress block map errors. */ + vdo->block_map->zones[0].page_cache.rebuilding = + vdo_state_requires_read_only_rebuild(vdo->load_state); + + if (repair->block_map_entry_count == 0) { + uds_log_info("Replaying 0 recovery entries into block map"); + UDS_FREE(UDS_FORGET(repair->journal_data)); + launch_repair_completion(repair, load_slab_depot, VDO_ZONE_TYPE_ADMIN); + return; + } + + /* + * Organize the journal entries into a binary heap so we can iterate over them in sorted + * order incrementally, avoiding an expensive sort call. + */ + repair->replay_heap = (struct min_heap) { + .data = repair->entries, + .nr = repair->block_map_entry_count, + .size = repair->block_map_entry_count, + }; + min_heapify_all(&repair->replay_heap, &repair_min_heap); + + uds_log_info("Replaying %zu recovery entries into block map", + repair->block_map_entry_count); + + repair->current_entry = &repair->entries[repair->block_map_entry_count - 1]; + first_sorted_entry = sort_next_heap_element(repair); + ASSERT_LOG_ONLY(first_sorted_entry == repair->current_entry, + "heap is returning elements in an unexpected order"); + + /* Prevent any page from being processed until all pages have been launched. 
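+ * For illustration: block_map_page_loaded() checks this flag, so a fetch that completes + * immediately (for example, a page cache hit) will not start recover_ready_pages() until every + * fetch in the loop below has been launched.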
*/ + repair->launching = true; + repair->pbn = repair->current_entry->block_map_slot.pbn; + repair->current_unfetched_entry = repair->current_entry; + for (i = 0; i < repair->page_count; i++) { + if (repair->current_unfetched_entry < repair->entries) + break; + + fetch_block_map_page(repair, &repair->page_completions[i].completion); + } + repair->launching = false; + + /* Process any ready pages. */ + recover_ready_pages(repair, &repair->page_completions[0].completion); +} + +/** + * get_recovery_journal_block_header() - Get the block header for a block at a position in the + * journal data and unpack it. + * @journal: The recovery journal. + * @data: The recovery journal data. + * @sequence: The sequence number. + * + * Return: The unpacked header. + */ +static struct recovery_block_header __must_check +get_recovery_journal_block_header(struct recovery_journal *journal, + char *data, + sequence_number_t sequence) +{ + physical_block_number_t pbn = vdo_get_recovery_journal_block_number(journal, sequence); + char *header = &data[pbn * VDO_BLOCK_SIZE]; + + return vdo_unpack_recovery_block_header((struct packed_journal_header *) header); +} + +/** + * is_valid_recovery_journal_block() - Determine whether the given header describes a valid block + * for the given journal. + * @journal: The journal to use. + * @header: The unpacked block header to check. + * @old_ok: Whether an old format header is valid. + * + * A block is not valid if it is unformatted, or if it is older than the last successful recovery + * or reformat. + * + * Return: True if the header is valid. + */ +static bool __must_check +is_valid_recovery_journal_block(const struct recovery_journal *journal, + const struct recovery_block_header *header, + bool old_ok) +{ + if ((header->nonce != journal->nonce) || + (header->recovery_count != journal->recovery_count)) + return false; + + if (header->metadata_type == VDO_METADATA_RECOVERY_JOURNAL_2) + return (header->entry_count <= journal->entries_per_block); + + return (old_ok && + (header->metadata_type == VDO_METADATA_RECOVERY_JOURNAL) && + (header->entry_count <= RECOVERY_JOURNAL_1_ENTRIES_PER_BLOCK)); +} + +/** + * is_exact_recovery_journal_block() - Determine whether the given header describes the exact block + * indicated. + * @journal: The journal to use. + * @header: The unpacked block header to check. + * @sequence: The expected sequence number. + * @type: The expected metadata type. + * + * Return: True if the block matches. + */ +static bool __must_check +is_exact_recovery_journal_block(const struct recovery_journal *journal, + const struct recovery_block_header *header, + sequence_number_t sequence, + enum vdo_metadata_type type) +{ + return ((header->metadata_type == type) && + (header->sequence_number == sequence) && + (is_valid_recovery_journal_block(journal, header, true))); +} + +/** + * find_recovery_journal_head_and_tail() - Find the tail and head of the journal. + * + * Return: True if there were valid journal blocks. + */ +static bool find_recovery_journal_head_and_tail(struct repair_completion *repair) +{ + struct recovery_journal *journal = repair->completion.vdo->recovery_journal; + bool found_entries = false; + physical_block_number_t i; + + /* + * Ensure that we don't replay old entries since we know the tail recorded in the super + * block must be a lower bound. Not doing so can result in extra data loss by setting the + * tail too early. 
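+ * + * For example (hypothetical numbers): if the super block records a tail of 100 but the highest + * valid block found on disk has sequence number 90, highest_tail remains 100 rather than being + * set too early to 90.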
+ */ + repair->highest_tail = journal->tail; + for (i = 0; i < journal->size; i++) { + struct recovery_block_header header = + get_recovery_journal_block_header(journal, repair->journal_data, i); + + if (!is_valid_recovery_journal_block(journal, &header, true)) + /* This block is old or incorrectly formatted */ + continue; + + if (vdo_get_recovery_journal_block_number(journal, header.sequence_number) != i) + /* This block is in the wrong location */ + continue; + + if (header.sequence_number >= repair->highest_tail) { + found_entries = true; + repair->highest_tail = header.sequence_number; + } + + if (!found_entries) + continue; + + if (header.block_map_head > repair->block_map_head) + repair->block_map_head = header.block_map_head; + + if (header.slab_journal_head > repair->slab_journal_head) + repair->slab_journal_head = header.slab_journal_head; + } + + return found_entries; +} + +/** + * unpack_entry() - Unpack a recovery journal entry in either format. + * @vdo: The vdo. + * @packed: The entry to unpack. + * @format: The expected format of the entry. + * @entry: The unpacked entry. + * + * Return: true if the entry should be applied. + */ +static bool unpack_entry(struct vdo *vdo, + char *packed, + enum vdo_metadata_type format, + struct recovery_journal_entry *entry) +{ + if (format == VDO_METADATA_RECOVERY_JOURNAL_2) { + struct packed_recovery_journal_entry *packed_entry = + (struct packed_recovery_journal_entry *) packed; + + *entry = vdo_unpack_recovery_journal_entry(packed_entry); + } else { + physical_block_number_t low32, high4; + + struct packed_recovery_journal_entry_1 *packed_entry = + (struct packed_recovery_journal_entry_1 *) packed; + + if (packed_entry->operation == VDO_JOURNAL_DATA_INCREMENT) + entry->operation = VDO_JOURNAL_DATA_REMAPPING; + else if (packed_entry->operation == VDO_JOURNAL_BLOCK_MAP_INCREMENT) + entry->operation = VDO_JOURNAL_BLOCK_MAP_REMAPPING; + else + return false; + + low32 = __le32_to_cpu(packed_entry->pbn_low_word); + high4 = packed_entry->pbn_high_nibble; + entry->slot = (struct block_map_slot) { + .pbn = ((high4 << 32) | low32), + .slot = (packed_entry->slot_low | (packed_entry->slot_high << 6)), + }; + entry->mapping = vdo_unpack_block_map_entry(&packed_entry->block_map_entry); + entry->unmapping = (struct data_location) { + .pbn = VDO_ZERO_BLOCK, + .state = VDO_MAPPING_STATE_UNMAPPED, + }; + } + + return (validate_recovery_journal_entry(vdo, entry) == VDO_SUCCESS); +} + +/** + * append_sector_entries() - Append an array of recovery journal entries from a journal block + * sector to the array of numbered mappings in the repair completion, + * numbering each entry in the order they are appended. + * @repair: The repair completion. + * @entries: The entries in the sector. + * @format: The format of the sector. + * @entry_count: The number of entries to append. + */ +static void append_sector_entries(struct repair_completion *repair, + char *entries, + enum vdo_metadata_type format, + journal_entry_count_t entry_count) +{ + journal_entry_count_t i; + struct vdo *vdo = repair->completion.vdo; + off_t increment = ((format == VDO_METADATA_RECOVERY_JOURNAL_2) + ? sizeof(struct packed_recovery_journal_entry) + : sizeof(struct packed_recovery_journal_entry_1)); + + for (i = 0; i < entry_count; i++, entries += increment) { + struct recovery_journal_entry entry; + + if (!unpack_entry(vdo, entries, format, &entry)) + /* When recovering from read-only mode, ignore damaged entries.
*/ + continue; + + repair->entries[repair->block_map_entry_count] = + (struct numbered_block_mapping) { + .block_map_slot = entry.slot, + .block_map_entry = vdo_pack_block_map_entry(entry.mapping.pbn, + entry.mapping.state), + .number = repair->block_map_entry_count, + }; + repair->block_map_entry_count++; + } +} + +static journal_entry_count_t entries_per_sector(enum vdo_metadata_type format, u8 sector_number) +{ + if (format == VDO_METADATA_RECOVERY_JOURNAL_2) + return RECOVERY_JOURNAL_ENTRIES_PER_SECTOR; + + return ((sector_number == (VDO_SECTORS_PER_BLOCK - 1)) + ? RECOVERY_JOURNAL_1_ENTRIES_IN_LAST_SECTOR + : RECOVERY_JOURNAL_1_ENTRIES_PER_SECTOR); +} + +static void extract_entries_from_block(struct repair_completion *repair, + struct recovery_journal *journal, + sequence_number_t sequence, + enum vdo_metadata_type format, + journal_entry_count_t entries) +{ + sector_count_t i; + struct recovery_block_header header = + get_recovery_journal_block_header(journal, repair->journal_data, sequence); + + if (!is_exact_recovery_journal_block(journal, &header, sequence, format)) + /* This block is invalid, so skip it. */ + return; + + entries = min(entries, header.entry_count); + for (i = 1; i < VDO_SECTORS_PER_BLOCK; i++) { + struct packed_journal_sector *sector = + get_sector(journal, repair->journal_data, sequence, i); + journal_entry_count_t sector_entries = min(entries, entries_per_sector(format, i)); + + if (vdo_is_valid_recovery_journal_sector(&header, sector, i)) { + /* Only extract as many as the block header calls for. */ + append_sector_entries(repair, + (char *) sector->entries, + format, + min_t(journal_entry_count_t, + sector->entry_count, + sector_entries)); + } + + /* + * Even if the sector wasn't full, count it as full when counting up to the + * entry count the block header claims. + */ + entries -= sector_entries; + } +} + +static int parse_journal_for_rebuild(struct repair_completion *repair) +{ + int result; + sequence_number_t i; + block_count_t count; + enum vdo_metadata_type format; + struct vdo *vdo = repair->completion.vdo; + struct recovery_journal *journal = vdo->recovery_journal; + journal_entry_count_t entries_per_block = journal->entries_per_block; + + format = get_recovery_journal_block_header(journal, + repair->journal_data, + repair->highest_tail).metadata_type; + if (format == VDO_METADATA_RECOVERY_JOURNAL) + entries_per_block = RECOVERY_JOURNAL_1_ENTRIES_PER_BLOCK; + + /* + * Allocate an array of numbered_block_mapping structures large enough to transcribe every + * packed_recovery_journal_entry from every valid journal block. + */ + count = ((repair->highest_tail - repair->block_map_head + 1) * entries_per_block); + result = UDS_ALLOCATE(count, struct numbered_block_mapping, __func__, &repair->entries); + if (result != VDO_SUCCESS) + return result; + + for (i = repair->block_map_head; i <= repair->highest_tail; i++) + extract_entries_from_block(repair, journal, i, format, entries_per_block); + + return VDO_SUCCESS; +} + +static int validate_heads(struct repair_completion *repair) +{ + /* Both reap heads must be behind the tail. */ + if ((repair->block_map_head <= repair->tail) && + (repair->slab_journal_head <= repair->tail)) + return VDO_SUCCESS; + + + return uds_log_error_strerror(VDO_CORRUPT_JOURNAL, + "Journal tail too early. 
block map head: %llu, slab journal head: %llu, tail: %llu", + (unsigned long long) repair->block_map_head, + (unsigned long long) repair->slab_journal_head, + (unsigned long long) repair->tail); +} + +/** + * extract_new_mappings() - Find all valid new mappings to be applied to the block map. + * + * The mappings are extracted from the journal and stored in a sortable array so that all of the + * mappings to be applied to a given block map page can be done in a single page fetch. + */ +static int extract_new_mappings(struct repair_completion *repair) +{ + int result; + struct vdo *vdo = repair->completion.vdo; + struct recovery_point recovery_point = { + .sequence_number = repair->block_map_head, + .sector_count = 1, + .entry_count = 0, + }; + + /* + * Allocate an array of numbered_block_mapping structs just large enough to transcribe + * every packed_recovery_journal_entry from every valid journal block. + */ + result = UDS_ALLOCATE(repair->entry_count, + struct numbered_block_mapping, + __func__, + &repair->entries); + if (result != VDO_SUCCESS) + return result; + + for (; before_recovery_point(&recovery_point, &repair->tail_recovery_point); + increment_recovery_point(&recovery_point)) { + struct recovery_journal_entry entry = get_entry(repair, &recovery_point); + + result = validate_recovery_journal_entry(vdo, &entry); + if (result != VDO_SUCCESS) { + vdo_enter_read_only_mode(vdo, result); + return result; + } + + repair->entries[repair->block_map_entry_count] = + (struct numbered_block_mapping) { + .block_map_slot = entry.slot, + .block_map_entry = vdo_pack_block_map_entry(entry.mapping.pbn, + entry.mapping.state), + .number = repair->block_map_entry_count, + }; + repair->block_map_entry_count++; + } + + result = ASSERT((repair->block_map_entry_count <= repair->entry_count), + "approximate entry count is an upper bound"); + if (result != VDO_SUCCESS) + vdo_enter_read_only_mode(vdo, result); + + return result; +} + +/** + * compute_usages() - Compute the lbns in use and block map data blocks counts from the tail of + * the journal. + */ +static noinline int compute_usages(struct repair_completion *repair) +{ + /* + * VDO-5182: function is declared noinline to avoid what is likely a spurious valgrind + * error about this structure being uninitialized. 
+ */ + struct recovery_point recovery_point = { + .sequence_number = repair->tail, + .sector_count = 1, + .entry_count = 0, + }; + + struct vdo *vdo = repair->completion.vdo; + struct recovery_journal *journal = vdo->recovery_journal; + struct recovery_block_header header = + get_recovery_journal_block_header(journal, repair->journal_data, repair->tail); + + repair->logical_blocks_used = header.logical_blocks_used; + repair->block_map_data_blocks = header.block_map_data_blocks; + + for (; before_recovery_point(&recovery_point, &repair->tail_recovery_point); + increment_recovery_point(&recovery_point)) { + struct recovery_journal_entry entry = get_entry(repair, &recovery_point); + int result; + + result = validate_recovery_journal_entry(vdo, &entry); + if (result != VDO_SUCCESS) { + vdo_enter_read_only_mode(vdo, result); + return result; + } + + if (entry.operation == VDO_JOURNAL_BLOCK_MAP_REMAPPING) { + repair->block_map_data_blocks++; + continue; + } + + if (vdo_is_mapped_location(&entry.mapping)) + repair->logical_blocks_used++; + + if (vdo_is_mapped_location(&entry.unmapping)) + repair->logical_blocks_used--; + } + + return VDO_SUCCESS; +} + +static int parse_journal_for_recovery(struct repair_completion *repair) +{ + int result; + sequence_number_t i, head; + bool found_entries = false; + struct recovery_journal *journal = repair->completion.vdo->recovery_journal; + + head = min(repair->block_map_head, repair->slab_journal_head); + for (i = head; i <= repair->highest_tail; i++) { + struct recovery_block_header header; + journal_entry_count_t block_entries; + u8 j; + + repair->tail = i; + repair->tail_recovery_point = (struct recovery_point) { + .sequence_number = i, + .sector_count = 0, + .entry_count = 0, + }; + + header = get_recovery_journal_block_header(journal, repair->journal_data, i); + if (header.metadata_type == VDO_METADATA_RECOVERY_JOURNAL) { + /* This is an old format block, so we need to upgrade */ + uds_log_error_strerror(VDO_UNSUPPORTED_VERSION, + "Recovery journal is in the old format, a read-only rebuild is required."); + vdo_enter_read_only_mode(repair->completion.vdo, VDO_UNSUPPORTED_VERSION); + return VDO_UNSUPPORTED_VERSION; + } + + if (!is_exact_recovery_journal_block(journal, + &header, + i, + VDO_METADATA_RECOVERY_JOURNAL_2)) + /* A bad block header was found so this must be the end of the journal. */ + break; + + block_entries = header.entry_count; + + /* Examine each sector in turn to determine the last valid sector. */ + for (j = 1; j < VDO_SECTORS_PER_BLOCK; j++) { + struct packed_journal_sector *sector = + get_sector(journal, repair->journal_data, i, j); + journal_entry_count_t sector_entries = + min_t(journal_entry_count_t, sector->entry_count, block_entries); + + /* A bad sector means that this block was torn. */ + if (!vdo_is_valid_recovery_journal_sector(&header, sector, j)) + break; + + if (sector_entries > 0) { + found_entries = true; + repair->tail_recovery_point.sector_count++; + repair->tail_recovery_point.entry_count = sector_entries; + block_entries -= sector_entries; + repair->entry_count += sector_entries; + } + + /* If this sector is short, the later sectors can't matter. */ + if ((sector_entries < RECOVERY_JOURNAL_ENTRIES_PER_SECTOR) || + (block_entries == 0)) + break; + } + + /* If this block was not filled, or if it tore, no later block can matter. 
*/ + if ((header.entry_count != journal->entries_per_block) || (block_entries > 0)) + break; + } + + if (!found_entries) + return validate_heads(repair); + + /* Set the tail to the last valid tail block, if there is one. */ + if (repair->tail_recovery_point.sector_count == 0) + repair->tail--; + + result = validate_heads(repair); + if (result != VDO_SUCCESS) + return result; + + uds_log_info("Highest-numbered recovery journal block has sequence number %llu, and the highest-numbered usable block is %llu", + (unsigned long long) repair->highest_tail, + (unsigned long long) repair->tail); + + result = extract_new_mappings(repair); + if (result != VDO_SUCCESS) + return result; + + return compute_usages(repair); +} + +static int parse_journal(struct repair_completion *repair) +{ + if (!find_recovery_journal_head_and_tail(repair)) + return VDO_SUCCESS; + + return (vdo_state_requires_read_only_rebuild(repair->completion.vdo->load_state) ? + parse_journal_for_rebuild(repair) : + parse_journal_for_recovery(repair)); +} + +static void finish_journal_load(struct vdo_completion *completion) +{ + struct repair_completion *repair = completion->parent; + + if (++repair->vios_complete != repair->vio_count) + return; + + uds_log_info("Finished reading recovery journal"); + uninitialize_vios(repair); + prepare_repair_completion(repair, recover_block_map, VDO_ZONE_TYPE_LOGICAL); + vdo_continue_completion(&repair->completion, parse_journal(repair)); +} + +static void handle_journal_load_error(struct vdo_completion *completion) +{ + struct repair_completion *repair = completion->parent; + + /* Preserve the error */ + vdo_set_completion_result(&repair->completion, completion->result); + vio_record_metadata_io_error(as_vio(completion)); + completion->callback(completion); +} + +static void read_journal_endio(struct bio *bio) +{ + struct vio *vio = bio->bi_private; + struct vdo *vdo = vio->completion.vdo; + + continue_vio_after_io(vio, finish_journal_load, vdo->thread_config.admin_thread); +} + +/** + * vdo_repair() - Load the recovery journal and then recover or rebuild a vdo. 
+ * @parent: The completion to notify when the operation is complete + */ +void vdo_repair(struct vdo_completion *parent) +{ + int result; + char *ptr; + struct repair_completion *repair; + struct vdo *vdo = parent->vdo; + struct recovery_journal *journal = vdo->recovery_journal; + physical_block_number_t pbn = journal->origin; + block_count_t remaining = journal->size; + block_count_t vio_count = DIV_ROUND_UP(remaining, MAX_BLOCKS_PER_VIO); + page_count_t page_count = min_t(page_count_t, + vdo->device_config->cache_size >> 1, + MAXIMUM_SIMULTANEOUS_VDO_BLOCK_MAP_RESTORATION_READS); + + vdo_assert_on_admin_thread(vdo, __func__); + + if (vdo->load_state == VDO_FORCE_REBUILD) { + uds_log_warning("Rebuilding reference counts to clear read-only mode"); + vdo->states.vdo.read_only_recoveries++; + } else if (vdo->load_state == VDO_REBUILD_FOR_UPGRADE) { + uds_log_warning("Rebuilding reference counts for upgrade"); + } else { + uds_log_warning("Device was dirty, rebuilding reference counts"); + } + + result = UDS_ALLOCATE_EXTENDED(struct repair_completion, + page_count, + struct vdo_page_completion, + __func__, + &repair); + if (result != VDO_SUCCESS) { + vdo_fail_completion(parent, result); + return; + } + + vdo_initialize_completion(&repair->completion, vdo, VDO_REPAIR_COMPLETION); + repair->completion.error_handler = abort_repair; + repair->completion.parent = parent; + prepare_repair_completion(repair, finish_repair, VDO_ZONE_TYPE_ADMIN); + repair->page_count = page_count; + + result = UDS_ALLOCATE(remaining * VDO_BLOCK_SIZE, char, __func__, &repair->journal_data); + if (abort_on_error(result, repair)) + return; + + result = UDS_ALLOCATE(vio_count, struct vio, __func__, &repair->vios); + if (abort_on_error(result, repair)) + return; + + ptr = repair->journal_data; + for (repair->vio_count = 0; repair->vio_count < vio_count; repair->vio_count++) { + block_count_t blocks = min_t(block_count_t, remaining, MAX_BLOCKS_PER_VIO); + + result = allocate_vio_components(vdo, + VIO_TYPE_RECOVERY_JOURNAL, + VIO_PRIORITY_METADATA, + repair, + blocks, + ptr, + &repair->vios[repair->vio_count]); + if (abort_on_error(result, repair)) + return; + + ptr += (blocks * VDO_BLOCK_SIZE); + remaining -= blocks; + } + + for (vio_count = 0; + vio_count < repair->vio_count; + vio_count++, pbn += MAX_BLOCKS_PER_VIO) + submit_metadata_vio(&repair->vios[vio_count], + pbn, + read_journal_endio, + handle_journal_load_error, + REQ_OP_READ); +} diff --git a/drivers/md/dm-vdo/repair.h b/drivers/md/dm-vdo/repair.h new file mode 100644 index 000000000000..ff255cf41486 --- /dev/null +++ b/drivers/md/dm-vdo/repair.h @@ -0,0 +1,14 @@ +/* SPDX-License-Identifier: GPL-2.0-only */ +/* + * Copyright 2023 Red Hat + */ + +#ifndef VDO_REPAIR_H +#define VDO_REPAIR_H + +#include "types.h" + +void vdo_replay_into_slab_journals(struct block_allocator *allocator, void *context); +void vdo_repair(struct vdo_completion *parent); + +#endif /* VDO_REPAIR_H */ From patchwork Thu Oct 26 21:41:30 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Matthew Sakai X-Patchwork-Id: 13437959 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 2A55AC27C59 for ; Thu, 26 Oct 2023 21:44:37 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1344913AbjJZVoh (ORCPT ); Thu, 26 Oct 2023 
17:44:37 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:41042 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S234902AbjJZVoY (ORCPT ); Thu, 26 Oct 2023 17:44:24 -0400 Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.133.124]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 78199D58 for ; Thu, 26 Oct 2023 14:41:47 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1698356506; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=BbJB2yij4eneO6j8G/VsGLsB8aznACfTOUh/fqF9zq8=; b=FWEYlQiqBOYAsnjqmU3+eJGbM67KW5kRNoPMoEwbXZ1H1s2xaLi2AEKluReBaIr41hfcuo NfKVShT8uuk7q2e66FVyKuTnWLV8WKoZ9+xvDCwm+ebaXfWx+drXms5Byp3gYI7y6EL2dv e99/fSdIdNoLQwcX9cRbekJkrvoNFNY= Received: from mimecast-mx02.redhat.com (mimecast-mx02.redhat.com [66.187.233.88]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.3, cipher=TLS_AES_256_GCM_SHA384) id us-mta-228-w6a2Z7dyOUS1poTitKVI_A-1; Thu, 26 Oct 2023 17:41:43 -0400 X-MC-Unique: w6a2Z7dyOUS1poTitKVI_A-1 Received: from smtp.corp.redhat.com (int-mx05.intmail.prod.int.rdu2.redhat.com [10.11.54.5]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (2048 bits) server-digest SHA256) (No client certificate requested) by mimecast-mx02.redhat.com (Postfix) with ESMTPS id D960E80F922; Thu, 26 Oct 2023 21:41:42 +0000 (UTC) Received: from pbitcolo-build-10.permabit.com (pbitcolo-build-10.permabit.lab.eng.bos.redhat.com [10.19.117.76]) by smtp.corp.redhat.com (Postfix) with ESMTP id C54402949; Thu, 26 Oct 2023 21:41:42 +0000 (UTC) Received: by pbitcolo-build-10.permabit.com (Postfix, from userid 1138) id 850933003B; Thu, 26 Oct 2023 17:41:42 -0400 (EDT) From: Matthew Sakai To: dm-devel@lists.linux.dev, linux-block@vger.kernel.org Cc: Matthew Sakai Subject: [PATCH v4 33/39] dm vdo: add the primary vdo structure Date: Thu, 26 Oct 2023 17:41:30 -0400 Message-Id: <20231026214136.1067410-34-msakai@redhat.com> In-Reply-To: <20231026214136.1067410-1-msakai@redhat.com> References: <20231026214136.1067410-1-msakai@redhat.com> MIME-Version: 1.0 X-Scanned-By: MIMEDefang 3.4.1 on 10.11.54.5 Precedence: bulk List-ID: X-Mailing-List: linux-block@vger.kernel.org Add the data and methods that manage the dm-vdo target itself. This includes the overall state of the target and its threads, the state of the logical volumes, startup, shutdown, and statistics. Co-developed-by: J. corwin Coburn Signed-off-by: J. 
corwin Coburn Co-developed-by: Michael Sclafani Signed-off-by: Michael Sclafani Co-developed-by: Sweet Tea Dorminy Signed-off-by: Sweet Tea Dorminy Signed-off-by: Matthew Sakai --- drivers/md/dm-vdo/vdo.c | 1845 +++++++++++++++++++++++++++++++++++++++ drivers/md/dm-vdo/vdo.h | 381 ++++++++ 2 files changed, 2226 insertions(+) create mode 100644 drivers/md/dm-vdo/vdo.c create mode 100644 drivers/md/dm-vdo/vdo.h diff --git a/drivers/md/dm-vdo/vdo.c b/drivers/md/dm-vdo/vdo.c new file mode 100644 index 000000000000..72cbe6921893 --- /dev/null +++ b/drivers/md/dm-vdo/vdo.c @@ -0,0 +1,1845 @@ +// SPDX-License-Identifier: GPL-2.0-only +/* + * Copyright 2023 Red Hat + */ + +/* + * This file contains the main entry points for normal operations on a vdo as well as functions for + * constructing and destroying vdo instances (in memory). + */ + +/** + * DOC: + * + * A read_only_notifier has a single completion which is used to perform read-only notifications, + * however, vdo_enter_read_only_mode() may be called from any thread. A pair of fields, protected + * by a spinlock, are used to control the read-only mode entry process. The first field holds the + * read-only error. The second is the state field, which may hold any of the four special values + * enumerated here. + * + * When vdo_enter_read_only_mode() is called from some vdo thread, if the read_only_error field + * already contains an error (i.e. its value is not VDO_SUCCESS), then some other error has already + * initiated the read-only process, and nothing more is done. Otherwise, the new error is stored in + * the read_only_error field, and the state field is consulted. If the state is MAY_NOTIFY, it is + * set to NOTIFYING, and the notification process begins. If the state is MAY_NOT_NOTIFY, then + * notifications are currently disallowed, generally due to the vdo being suspended. In this case, + * the nothing more will be done until the vdo is resumed, at which point the notification will be + * performed. In any other case, the vdo is already read-only, and there is nothing more to do. + */ + +#include "vdo.h" + +#include +#include +#include +#include +#include +#include +#include +#include + +#include "logger.h" +#include "memory-alloc.h" +#include "permassert.h" +#include "string-utils.h" + +#include "block-map.h" +#include "completion.h" +#include "data-vio.h" +#include "dedupe.h" +#include "encodings.h" +#include "io-submitter.h" +#include "logical-zone.h" +#include "packer.h" +#include "physical-zone.h" +#include "pool-sysfs.h" +#include "recovery-journal.h" +#include "slab-depot.h" +#include "statistics.h" +#include "status-codes.h" +#include "vio.h" +#include "work-queue.h" + + +enum { PARANOID_THREAD_CONSISTENCY_CHECKS = 0 }; + +struct sync_completion { + struct vdo_completion vdo_completion; + struct completion completion; +}; + +/* + * We don't expect this set to ever get really large, so a linked list is adequate. We can use a + * pointer_map if we need to later. + */ +struct device_registry { + struct list_head links; + /* TODO: Convert to rcu per kernel recommendation. */ + rwlock_t lock; +}; + +static struct device_registry registry; + +/** + * vdo_initialize_device_registry_once() - Initialize the necessary structures for the device + * registry. + */ +void vdo_initialize_device_registry_once(void) +{ + INIT_LIST_HEAD(®istry.links); + rwlock_init(®istry.lock); +} + +/** vdo_is_equal() - Implements vdo_filter_t. 
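[Annotation, not part of the patch.] The device registry above keeps every constructed vdo on a plain linked list guarded by a reader/writer lock, and the filter helpers around this point search it with a caller-supplied vdo_filter_t. Below is a minimal userspace sketch of that pattern using pthreads; struct device, registry_add(), and registry_find() are invented names standing in for the kernel list, rwlock_t, and filter_vdos_locked().

#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

struct device {
	int id;
	struct device *next;
};

typedef bool device_filter(struct device *dev, const void *context);

static struct device *registry_head;
static pthread_rwlock_t registry_lock = PTHREAD_RWLOCK_INITIALIZER;

/* Add a device to the registry under the write lock. */
static void registry_add(struct device *dev)
{
	pthread_rwlock_wrlock(&registry_lock);
	dev->next = registry_head;
	registry_head = dev;
	pthread_rwlock_unlock(&registry_lock);
}

/* Find the first registered device matching the filter, or NULL. */
static struct device *registry_find(device_filter *filter, const void *context)
{
	struct device *found = NULL;

	pthread_rwlock_rdlock(&registry_lock);
	for (struct device *dev = registry_head; dev != NULL; dev = dev->next) {
		if (filter(dev, context)) {
			found = dev;
			break;
		}
	}
	pthread_rwlock_unlock(&registry_lock);
	return found;
}

/* The analogue of vdo_is_equal(): match by pointer identity. */
static bool is_same_device(struct device *dev, const void *context)
{
	return ((const void *) dev == context);
}

int main(void)
{
	struct device a = { .id = 1 }, b = { .id = 2 };

	registry_add(&a);
	registry_add(&b);
	printf("found id %d\n", registry_find(is_same_device, &a)->id);
	return 0;
}

The same pointer-identity filter is what register_vdo() uses further down to assert that a vdo is never added to the registry twice.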
*/ +static bool vdo_is_equal(struct vdo *vdo, const void *context) +{ + return ((void *) vdo == context); +} + +/** + * filter_vdos_locked() - Find a vdo in the registry if it exists there. + * @filter: The filter function to apply to devices. + * @context: A bit of context to provide the filter. + * + * Context: Must be called holding the lock. + * + * Return: the vdo object found, if any. + */ +static struct vdo * __must_check filter_vdos_locked(vdo_filter_t *filter, const void *context) +{ + struct vdo *vdo; + + list_for_each_entry(vdo, ®istry.links, registration) + if (filter(vdo, context)) + return vdo; + + return NULL; +} + +/** + * vdo_find_matching() - Find and return the first (if any) vdo matching a given filter function. + * @filter: The filter function to apply to vdos. + * @context: A bit of context to provide the filter. + */ +struct vdo *vdo_find_matching(vdo_filter_t *filter, const void *context) +{ + struct vdo *vdo; + + read_lock(®istry.lock); + vdo = filter_vdos_locked(filter, context); + read_unlock(®istry.lock); + return vdo; +} + +static void start_vdo_request_queue(void *ptr) +{ + struct vdo_thread *thread = vdo_get_work_queue_owner(vdo_get_current_work_queue()); + + uds_register_allocating_thread(&thread->allocating_thread, + &thread->vdo->allocations_allowed); +} + +static void finish_vdo_request_queue(void *ptr) +{ + uds_unregister_allocating_thread(); +} + +#ifdef MODULE +#define MODULE_NAME THIS_MODULE->name +#else +#define MODULE_NAME "dm-vdo" +#endif /* MODULE */ + +static const struct vdo_work_queue_type default_queue_type = { + .start = start_vdo_request_queue, + .finish = finish_vdo_request_queue, + .max_priority = VDO_DEFAULT_Q_MAX_PRIORITY, + .default_priority = VDO_DEFAULT_Q_COMPLETION_PRIORITY, +}; + +static const struct vdo_work_queue_type bio_ack_q_type = { + .start = NULL, + .finish = NULL, + .max_priority = BIO_ACK_Q_MAX_PRIORITY, + .default_priority = BIO_ACK_Q_ACK_PRIORITY, +}; + +static const struct vdo_work_queue_type cpu_q_type = { + .start = NULL, + .finish = NULL, + .max_priority = CPU_Q_MAX_PRIORITY, + .default_priority = CPU_Q_MAX_PRIORITY, +}; + +static void uninitialize_thread_config(struct thread_config *config) +{ + UDS_FREE(UDS_FORGET(config->logical_threads)); + UDS_FREE(UDS_FORGET(config->physical_threads)); + UDS_FREE(UDS_FORGET(config->hash_zone_threads)); + UDS_FREE(UDS_FORGET(config->bio_threads)); + memset(config, 0, sizeof(struct thread_config)); +} + +static void +assign_thread_ids(struct thread_config *config, thread_id_t thread_ids[], zone_count_t count) +{ + zone_count_t zone; + + for (zone = 0; zone < count; zone++) + thread_ids[zone] = config->thread_count++; +} + +/** + * initialize_thread_config() - Initialize the thread mapping + * + * If the logical, physical, and hash zone counts are all 0, a single thread will be shared by all + * three plus the packer and recovery journal. Otherwise, there must be at least one of each type, + * and each will have its own thread, as will the packer and recovery journal. + * + * Return: VDO_SUCCESS or an error. 
+ */ +static int __must_check +initialize_thread_config(struct thread_count_config counts, struct thread_config *config) +{ + int result; + bool single = ((counts.logical_zones + counts.physical_zones + counts.hash_zones) == 0); + + config->bio_thread_count = counts.bio_threads; + if (single) { + config->logical_zone_count = 1; + config->physical_zone_count = 1; + config->hash_zone_count = 1; + } else { + config->logical_zone_count = counts.logical_zones; + config->physical_zone_count = counts.physical_zones; + config->hash_zone_count = counts.hash_zones; + } + + result = UDS_ALLOCATE(config->logical_zone_count, + thread_id_t, + "logical thread array", + &config->logical_threads); + if (result != VDO_SUCCESS) { + uninitialize_thread_config(config); + return result; + } + + result = UDS_ALLOCATE(config->physical_zone_count, + thread_id_t, + "physical thread array", + &config->physical_threads); + if (result != VDO_SUCCESS) { + uninitialize_thread_config(config); + return result; + } + + result = UDS_ALLOCATE(config->hash_zone_count, + thread_id_t, + "hash thread array", + &config->hash_zone_threads); + if (result != VDO_SUCCESS) { + uninitialize_thread_config(config); + return result; + } + + result = UDS_ALLOCATE(config->bio_thread_count, + thread_id_t, + "bio thread array", + &config->bio_threads); + if (result != VDO_SUCCESS) { + uninitialize_thread_config(config); + return result; + } + + if (single) { + config->logical_threads[0] = config->thread_count; + config->physical_threads[0] = config->thread_count; + config->hash_zone_threads[0] = config->thread_count++; + } else { + config->admin_thread = config->thread_count; + config->journal_thread = config->thread_count++; + config->packer_thread = config->thread_count++; + assign_thread_ids(config, config->logical_threads, counts.logical_zones); + assign_thread_ids(config, config->physical_threads, counts.physical_zones); + assign_thread_ids(config, config->hash_zone_threads, counts.hash_zones); + } + + config->dedupe_thread = config->thread_count++; + config->bio_ack_thread = + ((counts.bio_ack_threads > 0) ? config->thread_count++ : VDO_INVALID_THREAD_ID); + config->cpu_thread = config->thread_count++; + assign_thread_ids(config, config->bio_threads, counts.bio_threads); + return VDO_SUCCESS; +} + +/** + * vdo_read_geometry_block() - Synchronously read the geometry block from a vdo's underlying block + * device. + * @vdo: The vdo whose geometry is to be read. + * + * Return: VDO_SUCCESS or an error code. + */ +static int __must_check read_geometry_block(struct vdo *vdo) +{ + struct vio *vio; + char *block; + int result; + + result = UDS_ALLOCATE(VDO_BLOCK_SIZE, u8, __func__, &block); + if (result != VDO_SUCCESS) + return result; + + result = create_metadata_vio(vdo, VIO_TYPE_GEOMETRY, VIO_PRIORITY_HIGH, NULL, block, &vio); + if (result != VDO_SUCCESS) { + UDS_FREE(block); + return result; + } + + /* + * This is only safe because, having not already loaded the geometry, the vdo's geometry's + * bio_offset field is 0, so the fact that vio_reset_bio() will subtract that offset from + * the supplied pbn is not a problem. 
+ */ + result = vio_reset_bio(vio, block, NULL, REQ_OP_READ, VDO_GEOMETRY_BLOCK_LOCATION); + if (result != VDO_SUCCESS) { + free_vio(UDS_FORGET(vio)); + UDS_FREE(block); + return result; + } + + bio_set_dev(vio->bio, vdo_get_backing_device(vdo)); + submit_bio_wait(vio->bio); + result = blk_status_to_errno(vio->bio->bi_status); + free_vio(UDS_FORGET(vio)); + if (result != 0) { + uds_log_error_strerror(result, "synchronous read failed"); + UDS_FREE(block); + return -EIO; + } + + result = vdo_parse_geometry_block((u8 *) block, &vdo->geometry); + UDS_FREE(block); + return result; +} + +static bool get_zone_thread_name(const thread_id_t thread_ids[], + zone_count_t count, + thread_id_t id, + const char *prefix, + char *buffer, + size_t buffer_length) +{ + if (id >= thread_ids[0]) { + thread_id_t index = id - thread_ids[0]; + + if (index < count) { + snprintf(buffer, buffer_length, "%s%d", prefix, index); + return true; + } + } + return false; +} + +/** + * get_thread_name() - Format the name of the worker thread desired to support a given work queue. + * @thread_config: The thread configuration. + * @thread_id: The thread id. + * @buffer: Where to put the formatted name. + * @buffer_length: Size of the output buffer. + * + * The physical layer may add a prefix identifying the product; the output from this function + * should just identify the thread. + */ +static void +get_thread_name(const struct thread_config *thread_config, + thread_id_t thread_id, + char *buffer, + size_t buffer_length) +{ + if (thread_id == thread_config->journal_thread) { + if (thread_config->packer_thread == thread_id) { + /* + * This is the "single thread" config where one thread is used for the + * journal, packer, logical, physical, and hash zones. In that case, it is + * known as the "request queue." + */ + snprintf(buffer, buffer_length, "reqQ"); + return; + } + + snprintf(buffer, buffer_length, "journalQ"); + return; + } else if (thread_id == thread_config->admin_thread) { + /* Theoretically this could be different from the journal thread. */ + snprintf(buffer, buffer_length, "adminQ"); + return; + } else if (thread_id == thread_config->packer_thread) { + snprintf(buffer, buffer_length, "packerQ"); + return; + } else if (thread_id == thread_config->dedupe_thread) { + snprintf(buffer, buffer_length, "dedupeQ"); + return; + } else if (thread_id == thread_config->bio_ack_thread) { + snprintf(buffer, buffer_length, "ackQ"); + return; + } else if (thread_id == thread_config->cpu_thread) { + snprintf(buffer, buffer_length, "cpuQ"); + return; + } + + if (get_zone_thread_name(thread_config->logical_threads, + thread_config->logical_zone_count, + thread_id, + "logQ", + buffer, + buffer_length)) + return; + + if (get_zone_thread_name(thread_config->physical_threads, + thread_config->physical_zone_count, + thread_id, + "physQ", + buffer, + buffer_length)) + return; + + if (get_zone_thread_name(thread_config->hash_zone_threads, + thread_config->hash_zone_count, + thread_id, + "hashQ", + buffer, + buffer_length)) + return; + + if (get_zone_thread_name(thread_config->bio_threads, + thread_config->bio_thread_count, + thread_id, + "bioQ", + buffer, + buffer_length)) + return; + + /* Some sort of misconfiguration? */ + snprintf(buffer, buffer_length, "reqQ%d", thread_id); +} + +/** + * vdo_make_thread() - Construct a single vdo work_queue and its associated thread (or threads for + * round-robin queues). + * @vdo: The vdo which owns the thread. + * @thread_id: The id of the thread to create (as determined by the thread_config). 
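[Annotation, not part of the patch.] Putting initialize_thread_config() and get_thread_name() together, a multi-zone configuration lays out its thread ids in a fixed order: one id shared by the admin and recovery journal threads, then the packer, the logical, physical, and hash zones, the dedupe thread, an optional bio-ack thread, the CPU thread, and finally the bio submission threads. The sketch below just prints that layout for one arbitrary set of counts; the counts are example inputs, not dm-vdo defaults.

#include <stdio.h>

int main(void)
{
	unsigned int logical = 2, physical = 1, hash = 1;
	unsigned int bio_threads = 3, bio_ack_threads = 1;
	unsigned int id = 0;

	/* The admin thread shares its id with the recovery journal thread. */
	printf("%2u  journalQ (also admin)\n", id++);
	printf("%2u  packerQ\n", id++);
	for (unsigned int z = 0; z < logical; z++)
		printf("%2u  logQ%u\n", id++, z);
	for (unsigned int z = 0; z < physical; z++)
		printf("%2u  physQ%u\n", id++, z);
	for (unsigned int z = 0; z < hash; z++)
		printf("%2u  hashQ%u\n", id++, z);
	printf("%2u  dedupeQ\n", id++);
	if (bio_ack_threads > 0)
		printf("%2u  ackQ\n", id++);
	printf("%2u  cpuQ\n", id++);
	for (unsigned int z = 0; z < bio_threads; z++)
		printf("%2u  bioQ%u\n", id++, z);
	printf("total: %u thread ids\n", id);
	return 0;
}

In the single-thread configuration the first id is instead shared by the journal, the packer, and all three zone types, and get_thread_name() calls that queue "reqQ".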
+ * @type: The description of the work queue for this thread. + * @queue_count: The number of actual threads/queues contained in the "thread". + * @contexts: An array of queue_count contexts, one for each individual queue; may be NULL. + * + * Each "thread" constructed by this method is represented by a unique thread id in the thread + * config, and completions can be enqueued to the queue and run on the threads comprising this + * entity. + * + * Return: VDO_SUCCESS or an error. + */ +int vdo_make_thread(struct vdo *vdo, + thread_id_t thread_id, + const struct vdo_work_queue_type *type, + unsigned int queue_count, + void *contexts[]) +{ + struct vdo_thread *thread = &vdo->threads[thread_id]; + char queue_name[MAX_VDO_WORK_QUEUE_NAME_LEN]; + + if (type == NULL) + type = &default_queue_type; + + if (thread->queue != NULL) + return ASSERT(vdo_work_queue_type_is(thread->queue, type), + "already constructed vdo thread %u is of the correct type", + thread_id); + + thread->vdo = vdo; + thread->thread_id = thread_id; + get_thread_name(&vdo->thread_config, thread_id, queue_name, sizeof(queue_name)); + return vdo_make_work_queue(vdo->thread_name_prefix, + queue_name, + thread, + type, + queue_count, + contexts, + &thread->queue); +} + +/** + * register_vdo() - Register a VDO; it must not already be registered. + * @vdo: The vdo to register. + * + * Return: VDO_SUCCESS or an error. + */ +static int register_vdo(struct vdo *vdo) +{ + int result; + + write_lock(®istry.lock); + result = ASSERT(filter_vdos_locked(vdo_is_equal, vdo) == NULL, + "VDO not already registered"); + if (result == VDO_SUCCESS) { + INIT_LIST_HEAD(&vdo->registration); + list_add_tail(&vdo->registration, ®istry.links); + } + write_unlock(®istry.lock); + + return result; +} + +/** + * initialize_vdo() - Do the portion of initializing a vdo which will clean up after itself on + * error. 
+ * @vdo: The vdo being initialized + * @config: The configuration of the vdo + * @instance: The instance number of the vdo + * @reason: The buffer to hold the failure reason on error + */ +static int +initialize_vdo(struct vdo *vdo, struct device_config *config, unsigned int instance, char **reason) +{ + int result; + zone_count_t i; + + vdo->device_config = config; + vdo->starting_sector_offset = config->owning_target->begin; + vdo->instance = instance; + vdo->allocations_allowed = true; + vdo_set_admin_state_code(&vdo->admin.state, VDO_ADMIN_STATE_NEW); + INIT_LIST_HEAD(&vdo->device_config_list); + vdo_initialize_completion(&vdo->admin.completion, vdo, VDO_ADMIN_COMPLETION); + init_completion(&vdo->admin.callback_sync); + mutex_init(&vdo->stats_mutex); + result = read_geometry_block(vdo); + if (result != VDO_SUCCESS) { + *reason = "Could not load geometry block"; + return result; + } + + result = initialize_thread_config(config->thread_counts, &vdo->thread_config); + if (result != VDO_SUCCESS) { + *reason = "Cannot create thread configuration"; + return result; + } + + uds_log_info("zones: %d logical, %d physical, %d hash; total threads: %d", + config->thread_counts.logical_zones, + config->thread_counts.physical_zones, + config->thread_counts.hash_zones, + vdo->thread_config.thread_count); + + /* Compression context storage */ + result = UDS_ALLOCATE(config->thread_counts.cpu_threads, + char *, + "LZ4 context", + &vdo->compression_context); + if (result != VDO_SUCCESS) { + *reason = "cannot allocate LZ4 context"; + return result; + } + + for (i = 0; i < config->thread_counts.cpu_threads; i++) { + result = UDS_ALLOCATE(LZ4_MEM_COMPRESS, + char, + "LZ4 context", + &vdo->compression_context[i]); + if (result != VDO_SUCCESS) { + *reason = "cannot allocate LZ4 context"; + return result; + } + } + + result = register_vdo(vdo); + if (result != VDO_SUCCESS) { + *reason = "Cannot add VDO to device registry"; + return result; + } + + vdo_set_admin_state_code(&vdo->admin.state, VDO_ADMIN_STATE_INITIALIZED); + return result; +} + +/** + * vdo_make() - Allocate and initialize a vdo. + * @instance: Device instantiation counter. + * @config: The device configuration. + * @reason: The reason for any failure during this call. + * @vdo_ptr: A pointer to hold the created vdo. + * + * Return: VDO_SUCCESS or an error. + */ +int vdo_make(unsigned int instance, + struct device_config *config, + char **reason, + struct vdo **vdo_ptr) +{ + int result; + struct vdo *vdo; + + /* VDO-3769 - Set a generic reason so we don't ever return garbage. */ + *reason = "Unspecified error"; + + result = UDS_ALLOCATE(1, struct vdo, __func__, &vdo); + if (result != UDS_SUCCESS) { + *reason = "Cannot allocate VDO"; + return result; + } + + result = initialize_vdo(vdo, config, instance, reason); + if (result != VDO_SUCCESS) { + vdo_destroy(vdo); + return result; + } + + /* From here on, the caller will clean up if there is an error. 
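[Annotation, not part of the patch.] initialize_vdo() above allocates one LZ4 scratch context per configured CPU thread, and the teardown path later frees them one by one. The fragment below sketches that allocate-one-buffer-per-thread idiom in plain C, with cleanup of whatever was allocated when a later allocation fails; struct contexts and the function names are invented for the example.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

struct contexts {
	unsigned int count;
	char **buffers;
};

/* Free every buffer that was successfully allocated; safe to call twice. */
static void free_contexts(struct contexts *ctx)
{
	if (ctx->buffers == NULL)
		return;
	for (unsigned int i = 0; i < ctx->count; i++)
		free(ctx->buffers[i]);
	free(ctx->buffers);
	ctx->buffers = NULL;
}

/* Allocate one scratch buffer per worker thread; undo partial work on failure. */
static int allocate_contexts(struct contexts *ctx, unsigned int count, size_t size)
{
	ctx->count = count;
	ctx->buffers = calloc(count, sizeof(char *));
	if (ctx->buffers == NULL)
		return -1;

	for (unsigned int i = 0; i < count; i++) {
		ctx->buffers[i] = malloc(size);
		if (ctx->buffers[i] == NULL) {
			free_contexts(ctx);
			return -1;
		}
		memset(ctx->buffers[i], 0, size);
	}
	return 0;
}

int main(void)
{
	struct contexts ctx = { 0 };

	if (allocate_contexts(&ctx, 4, 16384) == 0)
		printf("allocated %u contexts\n", ctx.count);
	free_contexts(&ctx);
	return 0;
}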
*/ + *vdo_ptr = vdo; + + snprintf(vdo->thread_name_prefix, + sizeof(vdo->thread_name_prefix), + "%s%u", + MODULE_NAME, + instance); + BUG_ON(vdo->thread_name_prefix[0] == '\0'); + result = UDS_ALLOCATE(vdo->thread_config.thread_count, + struct vdo_thread, + __func__, + &vdo->threads); + if (result != VDO_SUCCESS) { + *reason = "Cannot allocate thread structures"; + return result; + } + + result = vdo_make_thread(vdo, + vdo->thread_config.admin_thread, + &default_queue_type, + 1, + NULL); + if (result != VDO_SUCCESS) { + *reason = "Cannot make admin thread"; + return result; + } + + result = vdo_make_flusher(vdo); + if (result != VDO_SUCCESS) { + *reason = "Cannot make flusher zones"; + return result; + } + + result = vdo_make_packer(vdo, DEFAULT_PACKER_BINS, &vdo->packer); + if (result != VDO_SUCCESS) { + *reason = "Cannot make packer zones"; + return result; + } + + BUG_ON(vdo->device_config->logical_block_size <= 0); + BUG_ON(vdo->device_config->owned_device == NULL); + result = make_data_vio_pool(vdo, + MAXIMUM_VDO_USER_VIOS, + MAXIMUM_VDO_USER_VIOS * 3 / 4, + &vdo->data_vio_pool); + if (result != VDO_SUCCESS) { + *reason = "Cannot allocate data_vio pool"; + return result; + } + + result = vdo_make_io_submitter(config->thread_counts.bio_threads, + config->thread_counts.bio_rotation_interval, + get_data_vio_pool_request_limit(vdo->data_vio_pool), + vdo, + &vdo->io_submitter); + if (result != VDO_SUCCESS) { + *reason = "bio submission initialization failed"; + return result; + } + + if (vdo_uses_bio_ack_queue(vdo)) { + result = vdo_make_thread(vdo, + vdo->thread_config.bio_ack_thread, + &bio_ack_q_type, + config->thread_counts.bio_ack_threads, + NULL); + if (result != VDO_SUCCESS) { + *reason = "bio ack queue initialization failed"; + return result; + } + } + + result = vdo_make_thread(vdo, + vdo->thread_config.cpu_thread, + &cpu_q_type, + config->thread_counts.cpu_threads, + (void **) vdo->compression_context); + if (result != VDO_SUCCESS) { + *reason = "CPU queue initialization failed"; + return result; + } + + return VDO_SUCCESS; +} + +static void finish_vdo(struct vdo *vdo) +{ + int i; + + if (vdo->threads == NULL) + return; + + vdo_cleanup_io_submitter(vdo->io_submitter); + vdo_finish_dedupe_index(vdo->hash_zones); + + for (i = 0; i < vdo->thread_config.thread_count; i++) + vdo_finish_work_queue(vdo->threads[i].queue); +} + +/** + * free_listeners() - Free the list of read-only listeners associated with a thread. + * @thread_data: The thread holding the list to free. + */ +static void free_listeners(struct vdo_thread *thread) +{ + struct read_only_listener *listener, *next; + + for (listener = UDS_FORGET(thread->listeners); listener != NULL; listener = next) { + next = UDS_FORGET(listener->next); + UDS_FREE(listener); + } +} + +static void uninitialize_super_block(struct vdo_super_block *super_block) +{ + free_vio_components(&super_block->vio); + UDS_FREE(super_block->buffer); +} + +/** + * unregister_vdo() - Remove a vdo from the device registry. + * @vdo: The vdo to remove. + */ +static void unregister_vdo(struct vdo *vdo) +{ + write_lock(®istry.lock); + if (filter_vdos_locked(vdo_is_equal, vdo) == vdo) + list_del_init(&vdo->registration); + + write_unlock(®istry.lock); +} + +/** + * vdo_destroy() - Destroy a vdo instance. + * @vdo: The vdo to destroy (may be NULL). + */ +void vdo_destroy(struct vdo *vdo) +{ + unsigned int i; + + if (vdo == NULL) + return; + + /* A running VDO should never be destroyed without suspending first. 
*/ + BUG_ON(vdo_get_admin_state(vdo)->normal); + + vdo->allocations_allowed = true; + + /* Stop services that need to gather VDO statistics from the worker threads. */ + if (vdo->sysfs_added) { + init_completion(&vdo->stats_shutdown); + kobject_put(&vdo->stats_directory); + wait_for_completion(&vdo->stats_shutdown); + } + + finish_vdo(vdo); + unregister_vdo(vdo); + free_data_vio_pool(vdo->data_vio_pool); + vdo_free_io_submitter(UDS_FORGET(vdo->io_submitter)); + vdo_free_flusher(UDS_FORGET(vdo->flusher)); + vdo_free_packer(UDS_FORGET(vdo->packer)); + vdo_free_recovery_journal(UDS_FORGET(vdo->recovery_journal)); + vdo_free_slab_depot(UDS_FORGET(vdo->depot)); + vdo_uninitialize_layout(&vdo->layout); + vdo_uninitialize_layout(&vdo->next_layout); + if (vdo->partition_copier) + dm_kcopyd_client_destroy(UDS_FORGET(vdo->partition_copier)); + uninitialize_super_block(&vdo->super_block); + vdo_free_block_map(UDS_FORGET(vdo->block_map)); + vdo_free_hash_zones(UDS_FORGET(vdo->hash_zones)); + vdo_free_physical_zones(UDS_FORGET(vdo->physical_zones)); + vdo_free_logical_zones(UDS_FORGET(vdo->logical_zones)); + + if (vdo->threads != NULL) { + for (i = 0; i < vdo->thread_config.thread_count; i++) { + free_listeners(&vdo->threads[i]); + vdo_free_work_queue(UDS_FORGET(vdo->threads[i].queue)); + } + UDS_FREE(UDS_FORGET(vdo->threads)); + } + + uninitialize_thread_config(&vdo->thread_config); + + if (vdo->compression_context != NULL) { + for (i = 0; i < vdo->device_config->thread_counts.cpu_threads; i++) + UDS_FREE(UDS_FORGET(vdo->compression_context[i])); + + UDS_FREE(UDS_FORGET(vdo->compression_context)); + } + + /* + * The call to kobject_put on the kobj sysfs node will decrement its reference count; when + * the count goes to zero the VDO object will be freed as a side effect. + */ + if (!vdo->sysfs_added) + UDS_FREE(vdo); + else + kobject_put(&vdo->vdo_directory); +} + +static int initialize_super_block(struct vdo *vdo, struct vdo_super_block *super_block) +{ + int result; + + result = UDS_ALLOCATE(VDO_BLOCK_SIZE, + char, + "encoded super block", + (char **) &vdo->super_block.buffer); + if (result != VDO_SUCCESS) + return result; + + return allocate_vio_components(vdo, + VIO_TYPE_SUPER_BLOCK, + VIO_PRIORITY_METADATA, + NULL, + 1, + (char *) super_block->buffer, + &vdo->super_block.vio); +} + +/** + * finish_reading_super_block() - Continue after loading the super block. + * @completion: The super block vio. + * + * This callback is registered in vdo_load_super_block(). + */ +static void finish_reading_super_block(struct vdo_completion *completion) +{ + struct vdo_super_block *super_block = + container_of(as_vio(completion), struct vdo_super_block, vio); + + vdo_continue_completion(UDS_FORGET(completion->parent), + vdo_decode_super_block(super_block->buffer)); +} + +/** + * handle_super_block_read_error() - Handle an error reading the super block. + * @completion: The super block vio. + * + * This error handler is registered in vdo_load_super_block(). + */ +static void handle_super_block_read_error(struct vdo_completion *completion) +{ + vio_record_metadata_io_error(as_vio(completion)); + finish_reading_super_block(completion); +} + +static void read_super_block_endio(struct bio *bio) +{ + struct vio *vio = bio->bi_private; + struct vdo_completion *parent = vio->completion.parent; + + continue_vio_after_io(vio, finish_reading_super_block, parent->callback_thread_id); +} + +/** + * vdo_load_super_block() - Allocate a super block and read its contents from storage. 
+ * @vdo: The vdo containing the super block on disk. + * @parent: The completion to notify after loading the super block. + */ +void vdo_load_super_block(struct vdo *vdo, struct vdo_completion *parent) +{ + int result; + + result = initialize_super_block(vdo, &vdo->super_block); + if (result != VDO_SUCCESS) { + vdo_continue_completion(parent, result); + return; + } + + vdo->super_block.vio.completion.parent = parent; + submit_metadata_vio(&vdo->super_block.vio, + vdo_get_data_region_start(vdo->geometry), + read_super_block_endio, + handle_super_block_read_error, + REQ_OP_READ); +} + +/** + * pool_stats_release() - Signal that sysfs stats have been shut down. + * @directory: The vdo stats directory. + */ +static void pool_stats_release(struct kobject *directory) +{ + struct vdo *vdo = container_of(directory, struct vdo, stats_directory); + + complete(&vdo->stats_shutdown); +} + +ATTRIBUTE_GROUPS(vdo_pool_stats); +static const struct kobj_type stats_directory_type = { + .release = pool_stats_release, + .sysfs_ops = &vdo_pool_stats_sysfs_ops, + .default_groups = vdo_pool_stats_groups, +}; + +/** + * vdo_add_sysfs_stats_dir() - Add the stats directory to the vdo sysfs directory. + * @vdo: The vdo. + * + * Return: VDO_SUCCESS or an error. + */ +int vdo_add_sysfs_stats_dir(struct vdo *vdo) +{ + int result; + + kobject_init(&vdo->stats_directory, &stats_directory_type); + result = kobject_add(&vdo->stats_directory, &vdo->vdo_directory, "statistics"); + if (result != 0) + return VDO_CANT_ADD_SYSFS_NODE; + + return VDO_SUCCESS; +} + +/** + * vdo_get_backing_device() - Get the block device object underlying a vdo. + * @vdo: The vdo. + * + * Return: The vdo's current block device. + */ +struct block_device *vdo_get_backing_device(const struct vdo *vdo) +{ + return vdo->device_config->owned_device->bdev; +} + +/** + * vdo_get_device_name() - Get the device name associated with the vdo target. + * @target: The target device interface. + * + * Return: The block device name. + */ +const char *vdo_get_device_name(const struct dm_target *target) +{ + return dm_device_name(dm_table_get_md(target->table)); +} + +/** + * vdo_synchronous_flush() - Issue a flush request and wait for it to complete. + * @vdo: The vdo. + * + * Return: VDO_SUCCESS or an error. + */ +int vdo_synchronous_flush(struct vdo *vdo) +{ + int result; + struct bio bio; + + bio_init(&bio, vdo_get_backing_device(vdo), 0, 0, + REQ_OP_WRITE | REQ_PREFLUSH); + submit_bio_wait(&bio); + result = blk_status_to_errno(bio.bi_status); + + atomic64_inc(&vdo->stats.flush_out); + if (result != 0) { + uds_log_error_strerror(result, "synchronous flush failed"); + result = -EIO; + } + + bio_uninit(&bio); + return result; +} + +/** + * vdo_get_state() - Get the current state of the vdo. + * @vdo: The vdo. + + * Context: This method may be called from any thread. + * + * Return: The current state of the vdo. + */ +enum vdo_state vdo_get_state(const struct vdo *vdo) +{ + enum vdo_state state = atomic_read(&vdo->state); + + /* pairs with barriers where state field is changed */ + smp_rmb(); + return state; +} + +/** + * vdo_set_state() - Set the current state of the vdo. + * @vdo: The vdo whose state is to be set. + * @state: The new state of the vdo. + * + * Context: This method may be called from any thread. + */ +void vdo_set_state(struct vdo *vdo, enum vdo_state state) +{ + /* pairs with barrier in vdo_get_state */ + smp_wmb(); + atomic_set(&vdo->state, state); +} + +/** + * vdo_get_admin_state() - Get the admin state of the vdo. + * @vdo: The vdo. 
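[Annotation, not part of the patch.] vdo_get_state() and vdo_set_state() above pair an smp_rmb() after the atomic read with an smp_wmb() before the atomic write, so that anything published before a state change is visible to any thread that observes the new state. The userspace sketch below expresses the same pairing with C11 acquire/release atomics; the state names and the last_error field are illustrative only.

#include <stdatomic.h>
#include <stdio.h>

enum device_state { STATE_NEW, STATE_DIRTY, STATE_CLEAN, STATE_READ_ONLY };

static _Atomic int device_state = STATE_NEW;
static int last_error;  /* ordinary field, ordered by the state accessors */

/* May be called from any thread. */
static enum device_state get_state(void)
{
	return (enum device_state) atomic_load_explicit(&device_state,
							memory_order_acquire);
}

static void set_state(enum device_state state, int error)
{
	last_error = error;  /* published by the release store below */
	atomic_store_explicit(&device_state, state, memory_order_release);
}

int main(void)
{
	set_state(STATE_READ_ONLY, 5);
	if (get_state() == STATE_READ_ONLY)
		printf("read-only, error %d\n", last_error);
	return 0;
}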
+ * + * Return: The code for the vdo's current admin state. + */ +const struct admin_state_code *vdo_get_admin_state(const struct vdo *vdo) +{ + return vdo_get_admin_state_code(&vdo->admin.state); +} + +/** + * record_vdo() - Record the state of the VDO for encoding in the super block. + */ +static void record_vdo(struct vdo *vdo) +{ + /* This is for backwards compatibility. */ + vdo->states.unused = vdo->geometry.unused; + vdo->states.vdo.state = vdo_get_state(vdo); + vdo->states.block_map = vdo_record_block_map(vdo->block_map); + vdo->states.recovery_journal = vdo_record_recovery_journal(vdo->recovery_journal); + vdo->states.slab_depot = vdo_record_slab_depot(vdo->depot); + vdo->states.layout = vdo->layout; +} + +/** + * continue_super_block_parent() - Continue the parent of a super block save operation. + * @completion: The super block vio. + * + * This callback is registered in vdo_save_components(). + */ +static void continue_super_block_parent(struct vdo_completion *completion) +{ + vdo_continue_completion(UDS_FORGET(completion->parent), completion->result); +} + +/** + * handle_save_error() - Log a super block save error. + * @completion: The super block vio. + * + * This error handler is registered in vdo_save_components(). + */ +static void handle_save_error(struct vdo_completion *completion) +{ + struct vdo_super_block *super_block = + container_of(as_vio(completion), struct vdo_super_block, vio); + + vio_record_metadata_io_error(&super_block->vio); + uds_log_error_strerror(completion->result, "super block save failed"); + /* + * Mark the super block as unwritable so that we won't attempt to write it again. This + * avoids the case where a growth attempt fails writing the super block with the new size, + * but the subsequent attempt to write out the read-only state succeeds. In this case, + * writes which happened just before the suspend would not be visible if the VDO is + * restarted without rebuilding, but, after a read-only rebuild, the effects of those + * writes would reappear. + */ + super_block->unwritable = true; + completion->callback(completion); +} + +static void super_block_write_endio(struct bio *bio) +{ + struct vio *vio = bio->bi_private; + struct vdo_completion *parent = vio->completion.parent; + + continue_vio_after_io(vio, continue_super_block_parent, parent->callback_thread_id); +} + +/** + * vdo_save_components() - Encode the vdo and save the super block asynchronously. + * @vdo: The vdo whose state is being saved. + * @parent: The completion to notify when the save is complete. + */ +void vdo_save_components(struct vdo *vdo, struct vdo_completion *parent) +{ + struct vdo_super_block *super_block = &vdo->super_block; + + if (super_block->unwritable) { + vdo_continue_completion(parent, VDO_READ_ONLY); + return; + } + + if (super_block->vio.completion.parent != NULL) { + vdo_continue_completion(parent, VDO_COMPONENT_BUSY); + return; + } + + record_vdo(vdo); + + vdo_encode_super_block(super_block->buffer, &vdo->states); + super_block->vio.completion.parent = parent; + super_block->vio.completion.callback_thread_id = parent->callback_thread_id; + submit_metadata_vio(&super_block->vio, + vdo_get_data_region_start(vdo->geometry), + super_block_write_endio, + handle_save_error, + REQ_OP_WRITE | REQ_PREFLUSH | REQ_FUA); +} + +/** + * vdo_register_read_only_listener() - Register a listener to be notified when the VDO goes + * read-only. + * @vdo: The vdo to register with. + * @listener: The object to notify. 
+ * @notification: The function to call to send the notification. + * @thread_id: The id of the thread on which to send the notification. + * + * Return: VDO_SUCCESS or an error. + */ +int vdo_register_read_only_listener(struct vdo *vdo, + void *listener, + vdo_read_only_notification *notification, + thread_id_t thread_id) +{ + struct vdo_thread *thread = &vdo->threads[thread_id]; + struct read_only_listener *read_only_listener; + int result; + + result = ASSERT(thread_id != vdo->thread_config.dedupe_thread, + "read only listener not registered on dedupe thread"); + if (result != VDO_SUCCESS) + return result; + + result = UDS_ALLOCATE(1, struct read_only_listener, __func__, &read_only_listener); + if (result != VDO_SUCCESS) + return result; + + *read_only_listener = (struct read_only_listener) { + .listener = listener, + .notify = notification, + .next = thread->listeners, + }; + + thread->listeners = read_only_listener; + return VDO_SUCCESS; +} + +/** + * notify_vdo_of_read_only_mode() - Notify a vdo that it is going read-only. + * @listener: The vdo. + * @parent: The completion to notify in order to acknowledge the notification. + * + * This will save the read-only state to the super block. + * + * Implements vdo_read_only_notification. + */ +static void notify_vdo_of_read_only_mode(void *listener, struct vdo_completion *parent) +{ + struct vdo *vdo = listener; + + if (vdo_in_read_only_mode(vdo)) + vdo_finish_completion(parent); + + vdo_set_state(vdo, VDO_READ_ONLY_MODE); + vdo_save_components(vdo, parent); +} + +/** + * vdo_enable_read_only_entry() - Enable a vdo to enter read-only mode on errors. + * @vdo: The vdo to enable. + * + * Return: VDO_SUCCESS or an error. + */ +int vdo_enable_read_only_entry(struct vdo *vdo) +{ + thread_id_t id; + bool is_read_only = vdo_in_read_only_mode(vdo); + struct read_only_notifier *notifier = &vdo->read_only_notifier; + + if (is_read_only) { + notifier->read_only_error = VDO_READ_ONLY; + notifier->state = NOTIFIED; + } else { + notifier->state = MAY_NOT_NOTIFY; + } + + spin_lock_init(¬ifier->lock); + vdo_initialize_completion(¬ifier->completion, vdo, VDO_READ_ONLY_MODE_COMPLETION); + + for (id = 0; id < vdo->thread_config.thread_count; id++) + vdo->threads[id].is_read_only = is_read_only; + + return vdo_register_read_only_listener(vdo, + vdo, + notify_vdo_of_read_only_mode, + vdo->thread_config.admin_thread); +} + +/** + * vdo_wait_until_not_entering_read_only_mode() - Wait until no read-only notifications are in + * progress and prevent any subsequent + * notifications. + * @parent: The completion to notify when no threads are entering read-only mode. + * + * Notifications may be re-enabled by calling vdo_allow_read_only_mode_entry(). + */ +void vdo_wait_until_not_entering_read_only_mode(struct vdo_completion *parent) +{ + struct vdo *vdo = parent->vdo; + struct read_only_notifier *notifier = &vdo->read_only_notifier; + + vdo_assert_on_admin_thread(vdo, __func__); + + if (notifier->waiter != NULL) { + vdo_continue_completion(parent, VDO_COMPONENT_BUSY); + return; + } + + spin_lock(¬ifier->lock); + if (notifier->state == NOTIFYING) + notifier->waiter = parent; + else if (notifier->state == MAY_NOTIFY) + notifier->state = MAY_NOT_NOTIFY; + spin_unlock(¬ifier->lock); + + if (notifier->waiter == NULL) { + /* + * A notification was not in progress, and now they are + * disallowed. + */ + vdo_launch_completion(parent); + return; + } +} + +/** + * as_notifier() - Convert a generic vdo_completion to a read_only_notifier. 
+ * @completion: The completion to convert. + * + * Return: The completion as a read_only_notifier. + */ +static inline struct read_only_notifier *as_notifier(struct vdo_completion *completion) +{ + vdo_assert_completion_type(completion, VDO_READ_ONLY_MODE_COMPLETION); + return container_of(completion, struct read_only_notifier, completion); +} + +/** + * finish_entering_read_only_mode() - Complete the process of entering read only mode. + * @completion: The read-only mode completion. + */ +static void finish_entering_read_only_mode(struct vdo_completion *completion) +{ + struct read_only_notifier *notifier = as_notifier(completion); + + vdo_assert_on_admin_thread(completion->vdo, __func__); + + spin_lock(¬ifier->lock); + notifier->state = NOTIFIED; + spin_unlock(¬ifier->lock); + + if (notifier->waiter != NULL) + vdo_continue_completion(UDS_FORGET(notifier->waiter), completion->result); +} + +/** + * make_thread_read_only() - Inform each thread that the VDO is in read-only mode. + * @completion: The read-only mode completion. + */ +static void make_thread_read_only(struct vdo_completion *completion) +{ + struct vdo *vdo = completion->vdo; + thread_id_t thread_id = completion->callback_thread_id; + struct read_only_notifier *notifier = as_notifier(completion); + struct read_only_listener *listener = completion->parent; + + if (listener == NULL) { + /* This is the first call on this thread */ + struct vdo_thread *thread = &vdo->threads[thread_id]; + + thread->is_read_only = true; + listener = thread->listeners; + if (thread_id == 0) + uds_log_error_strerror(READ_ONCE(notifier->read_only_error), + "Unrecoverable error, entering read-only mode"); + } else { + /* We've just finished notifying a listener */ + listener = listener->next; + } + + if (listener != NULL) { + /* We have a listener to notify */ + vdo_prepare_completion(completion, + make_thread_read_only, + make_thread_read_only, + thread_id, + listener); + listener->notify(listener->listener, completion); + return; + } + + /* We're done with this thread */ + if (++thread_id == vdo->thread_config.dedupe_thread) + /* + * We don't want to notify the dedupe thread since it may be + * blocked rebuilding the index. + */ + ++thread_id; + + if (thread_id >= vdo->thread_config.thread_count) + /* There are no more threads */ + vdo_prepare_completion(completion, + finish_entering_read_only_mode, + finish_entering_read_only_mode, + vdo->thread_config.admin_thread, + NULL); + else + vdo_prepare_completion(completion, + make_thread_read_only, + make_thread_read_only, + thread_id, + NULL); + + vdo_launch_completion(completion); +} + +/** + * vdo_allow_read_only_mode_entry() - Allow the notifier to put the VDO into read-only mode, + * reversing the effects of + * vdo_wait_until_not_entering_read_only_mode(). + * @parent: The object to notify once the operation is complete. + * + * If some thread tried to put the vdo into read-only mode while notifications were disallowed, it + * will be done when this method is called. If that happens, the parent will not be notified until + * the vdo has actually entered read-only mode and attempted to save the super block. + * + * Context: This method may only be called from the admin thread. 
+ */ +void vdo_allow_read_only_mode_entry(struct vdo_completion *parent) +{ + struct vdo *vdo = parent->vdo; + struct read_only_notifier *notifier = &vdo->read_only_notifier; + + vdo_assert_on_admin_thread(vdo, __func__); + + if (notifier->waiter != NULL) { + vdo_continue_completion(parent, VDO_COMPONENT_BUSY); + return; + } + + spin_lock(¬ifier->lock); + if (notifier->state == MAY_NOT_NOTIFY) { + if (notifier->read_only_error == VDO_SUCCESS) { + notifier->state = MAY_NOTIFY; + } else { + notifier->state = NOTIFYING; + notifier->waiter = parent; + } + } + spin_unlock(¬ifier->lock); + + if (notifier->waiter == NULL) { + /* We're done */ + vdo_launch_completion(parent); + return; + } + + /* Do the pending notification. */ + make_thread_read_only(¬ifier->completion); +} + +/** + * vdo_enter_read_only_mode() - Put a VDO into read-only mode and save the read-only state in the + * super block. + * @vdo: The vdo. + * @error_code: The error which caused the VDO to enter read-only mode. + * + * This method is a no-op if the VDO is already read-only. + */ +void vdo_enter_read_only_mode(struct vdo *vdo, int error_code) +{ + bool notify = false; + thread_id_t thread_id = vdo_get_callback_thread_id(); + struct read_only_notifier *notifier = &vdo->read_only_notifier; + struct vdo_thread *thread; + + if (thread_id != VDO_INVALID_THREAD_ID) { + thread = &vdo->threads[thread_id]; + if (thread->is_read_only) + /* This thread has already gone read-only. */ + return; + + /* Record for this thread that the VDO is read-only. */ + thread->is_read_only = true; + } + + spin_lock(¬ifier->lock); + if (notifier->read_only_error == VDO_SUCCESS) { + WRITE_ONCE(notifier->read_only_error, error_code); + if (notifier->state == MAY_NOTIFY) { + notifier->state = NOTIFYING; + notify = true; + } + } + spin_unlock(¬ifier->lock); + + if (!notify) + /* The notifier is already aware of a read-only error */ + return; + + /* Initiate a notification starting on the lowest numbered thread. */ + vdo_launch_completion_callback(¬ifier->completion, make_thread_read_only, 0); +} + +/** + * vdo_is_read_only() - Check whether the VDO is read-only. + * @vdo: The vdo. + * + * Return: true if the vdo is read-only. + * + * This method may be called from any thread, as opposed to examining the VDO's state field which + * is only safe to check from the admin thread. + */ +bool vdo_is_read_only(struct vdo *vdo) +{ + return vdo->threads[vdo_get_callback_thread_id()].is_read_only; +} + +/** + * vdo_in_read_only_mode() - Check whether a vdo is in read-only mode. + * @vdo: The vdo to query. + * + * Return: true if the vdo is in read-only mode. + */ +bool vdo_in_read_only_mode(const struct vdo *vdo) +{ + return (vdo_get_state(vdo) == VDO_READ_ONLY_MODE); +} + +/** + * vdo_in_recovery_mode() - Check whether the vdo is in recovery mode. + * @vdo: The vdo to query. + * + * Return: true if the vdo is in recovery mode. + */ +bool vdo_in_recovery_mode(const struct vdo *vdo) +{ + return (vdo_get_state(vdo) == VDO_RECOVERING); +} + +/** + * vdo_enter_recovery_mode() - Put the vdo into recovery mode. + * @vdo: The vdo. + */ +void vdo_enter_recovery_mode(struct vdo *vdo) +{ + vdo_assert_on_admin_thread(vdo, __func__); + + if (vdo_in_read_only_mode(vdo)) + return; + + uds_log_info("Entering recovery mode"); + vdo_set_state(vdo, VDO_RECOVERING); +} + +/** + * complete_synchronous_action() - Signal the waiting thread that a synchronous action is complete. + * @completion: The sync completion. 
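[Annotation, not part of the patch.] The read-only entry path above follows the state machine described in the DOC comment at the top of the file: a lock protects the recorded error and one of four states, only the first error is kept, and only a MAY_NOTIFY to NOTIFYING transition actually launches the per-thread notification. Here is a compact userspace rendering of just that decision, with a pthread mutex standing in for the spinlock and invented helper names.

#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>

enum notify_state { MAY_NOTIFY, NOTIFYING, MAY_NOT_NOTIFY, NOTIFIED };

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static int read_only_error;             /* 0 means "no error recorded yet" */
static enum notify_state state = MAY_NOTIFY;

/* Returns true if the caller should launch the notification. */
static bool enter_read_only(int error)
{
	bool notify = false;

	pthread_mutex_lock(&lock);
	if (read_only_error == 0) {
		read_only_error = error;        /* the first error wins */
		if (state == MAY_NOTIFY) {
			state = NOTIFYING;
			notify = true;
		}
		/* MAY_NOT_NOTIFY: remembered, delivered when re-enabled */
	}
	pthread_mutex_unlock(&lock);
	return notify;
}

int main(void)
{
	printf("%d %d\n", enter_read_only(-5), enter_read_only(-6));
	return 0;
}

Built with -pthread, this prints "1 0": the first caller starts the notification and the second merely finds the error already recorded.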
+ */ +static void complete_synchronous_action(struct vdo_completion *completion) +{ + vdo_assert_completion_type(completion, VDO_SYNC_COMPLETION); + complete(&(container_of(completion, struct sync_completion, vdo_completion)->completion)); +} + +/** + * perform_synchronous_action() - Launch an action on a VDO thread and wait for it to complete. + * @vdo: The vdo. + * @action: The callback to launch. + * @thread_id: The thread on which to run the action. + * @parent: The parent of the sync completion (may be NULL). + */ +static int perform_synchronous_action(struct vdo *vdo, + vdo_action *action, + thread_id_t thread_id, + void *parent) +{ + struct sync_completion sync; + + vdo_initialize_completion(&sync.vdo_completion, vdo, VDO_SYNC_COMPLETION); + init_completion(&sync.completion); + sync.vdo_completion.parent = parent; + vdo_launch_completion_callback(&sync.vdo_completion, action, thread_id); + wait_for_completion(&sync.completion); + return sync.vdo_completion.result; +} + +/** + * set_compression_callback() - Callback to turn compression on or off. + * @completion: The completion. + */ +static void set_compression_callback(struct vdo_completion *completion) +{ + struct vdo *vdo = completion->vdo; + bool *enable = completion->parent; + bool was_enabled = vdo_get_compressing(vdo); + + if (*enable != was_enabled) { + WRITE_ONCE(vdo->compressing, *enable); + if (was_enabled) + /* Signal the packer to flush since compression has been disabled. */ + vdo_flush_packer(vdo->packer); + } + + uds_log_info("compression is %s", (*enable ? "enabled" : "disabled")); + *enable = was_enabled; + complete_synchronous_action(completion); +} + +/** + * vdo_set_compressing() - Turn compression on or off. + * @vdo: The vdo. + * @enable: Whether to enable or disable compression. + * + * Return: Whether compression was previously on or off. + */ +bool vdo_set_compressing(struct vdo *vdo, bool enable) +{ + perform_synchronous_action(vdo, + set_compression_callback, + vdo->thread_config.packer_thread, + &enable); + return enable; +} + +/** + * vdo_get_compressing() - Get whether compression is enabled in a vdo. + * @vdo: The vdo. + * + * Return: State of compression. + */ +bool vdo_get_compressing(struct vdo *vdo) +{ + return READ_ONCE(vdo->compressing); +} + +static size_t get_block_map_cache_size(const struct vdo *vdo) +{ + return ((size_t) vdo->device_config->cache_size) * VDO_BLOCK_SIZE; +} + +static struct error_statistics __must_check get_vdo_error_statistics(const struct vdo *vdo) +{ + /* + * The error counts can be incremented from arbitrary threads and so must be incremented + * atomically, but they are just statistics with no semantics that could rely on memory + * order, so unfenced reads are sufficient. 
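[Annotation, not part of the patch.] perform_synchronous_action() above is the bridge between the asynchronous completion world and callers that need a plain blocking result: it launches a callback on the target thread and sleeps until complete_synchronous_action() fires. Below is a userspace sketch of the same shape, with a pthread playing the role of the target work queue and a mutex/condvar pair playing the role of struct completion; all names here are invented.

#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>

struct sync_action {
	void (*action)(struct sync_action *sync);
	void *parent;           /* optional context for the action */
	int result;
	bool done;
	pthread_mutex_t lock;
	pthread_cond_t cond;
};

/* The analogue of complete_synchronous_action(): publish a result and wake the waiter. */
static void complete_action(struct sync_action *sync, int result)
{
	pthread_mutex_lock(&sync->lock);
	sync->result = result;
	sync->done = true;
	pthread_cond_signal(&sync->cond);
	pthread_mutex_unlock(&sync->lock);
}

static void *worker(void *arg)
{
	struct sync_action *sync = arg;

	sync->action(sync);     /* the "callback on the target thread" */
	return NULL;
}

/* An example action: report a constant result. */
static void report_action(struct sync_action *sync)
{
	complete_action(sync, 42);
}

/* Run an action on another thread and block until it completes. */
static int perform_synchronous(void (*action)(struct sync_action *), void *parent)
{
	pthread_t thread;
	struct sync_action sync = { .action = action, .parent = parent };

	pthread_mutex_init(&sync.lock, NULL);
	pthread_cond_init(&sync.cond, NULL);
	pthread_create(&thread, NULL, worker, &sync);

	pthread_mutex_lock(&sync.lock);
	while (!sync.done)
		pthread_cond_wait(&sync.cond, &sync.lock);
	pthread_mutex_unlock(&sync.lock);

	pthread_join(thread, NULL);
	pthread_cond_destroy(&sync.cond);
	pthread_mutex_destroy(&sync.lock);
	return sync.result;
}

int main(void)
{
	printf("result: %d\n", perform_synchronous(report_action, NULL));
	return 0;
}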
+ */ + const struct atomic_statistics *atoms = &vdo->stats; + + return (struct error_statistics) { + .invalid_advice_pbn_count = atomic64_read(&atoms->invalid_advice_pbn_count), + .no_space_error_count = atomic64_read(&atoms->no_space_error_count), + .read_only_error_count = atomic64_read(&atoms->read_only_error_count), + }; +} + +static void copy_bio_stat(struct bio_stats *b, const struct atomic_bio_stats *a) +{ + b->read = atomic64_read(&a->read); + b->write = atomic64_read(&a->write); + b->discard = atomic64_read(&a->discard); + b->flush = atomic64_read(&a->flush); + b->empty_flush = atomic64_read(&a->empty_flush); + b->fua = atomic64_read(&a->fua); +} + +static struct bio_stats subtract_bio_stats(struct bio_stats minuend, struct bio_stats subtrahend) +{ + return (struct bio_stats) { + .read = minuend.read - subtrahend.read, + .write = minuend.write - subtrahend.write, + .discard = minuend.discard - subtrahend.discard, + .flush = minuend.flush - subtrahend.flush, + .empty_flush = minuend.empty_flush - subtrahend.empty_flush, + .fua = minuend.fua - subtrahend.fua, + }; +} + +/** + * vdo_get_physical_blocks_allocated() - Get the number of physical blocks in use by user data. + * @vdo: The vdo. + * + * Return: The number of blocks allocated for user data. + */ +static block_count_t __must_check vdo_get_physical_blocks_allocated(const struct vdo *vdo) +{ + return (vdo_get_slab_depot_allocated_blocks(vdo->depot) - + vdo_get_journal_block_map_data_blocks_used(vdo->recovery_journal)); +} + +/** + * vdo_get_physical_blocks_overhead() - Get the number of physical blocks used by vdo metadata. + * @vdo: The vdo. + * + * Return: The number of overhead blocks. + */ +static block_count_t __must_check vdo_get_physical_blocks_overhead(const struct vdo *vdo) +{ + /* + * config.physical_blocks is mutated during resize and is in a packed structure, + * but resize runs on admin thread. + * TODO: Verify that this is always safe. + */ + return (vdo->states.vdo.config.physical_blocks - + vdo_get_slab_depot_data_blocks(vdo->depot) + + vdo_get_journal_block_map_data_blocks_used(vdo->recovery_journal)); +} + +static const char *vdo_describe_state(enum vdo_state state) +{ + /* These strings should all fit in the 15 chars of VDOStatistics.mode. */ + switch (state) { + case VDO_RECOVERING: + return "recovering"; + + case VDO_READ_ONLY_MODE: + return "read-only"; + + default: + return "normal"; + } +} + +/** + * get_vdo_statistics() - Populate a vdo_statistics structure on the admin thread. + * @vdo: The vdo. + * @stats: The statistics structure to populate. + */ +static void get_vdo_statistics(const struct vdo *vdo, struct vdo_statistics *stats) +{ + struct recovery_journal *journal = vdo->recovery_journal; + enum vdo_state state = vdo_get_state(vdo); + + vdo_assert_on_admin_thread(vdo, __func__); + + /* start with a clean slate */ + memset(stats, 0, sizeof(struct vdo_statistics)); + + /* + * These are immutable properties of the vdo object, so it is safe to query them from any + * thread. + */ + stats->version = STATISTICS_VERSION; + stats->logical_blocks = vdo->states.vdo.config.logical_blocks; + /* + * config.physical_blocks is mutated during resize and is in a packed structure, but resize + * runs on the admin thread. 
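[Annotation, not part of the patch.] The two helpers above derive user-visible numbers from raw counters: data blocks used is the depot's allocated count minus the block map pages charged to the recovery journal, and overhead is everything outside the depot's data space plus those same block map pages. A tiny standalone example of that arithmetic with made-up sample values:

#include <stdio.h>

int main(void)
{
	unsigned long long physical_blocks = 1000000;       /* total size of the volume */
	unsigned long long depot_data_blocks = 950000;      /* space the slab depot can hand out */
	unsigned long long depot_allocated = 123456;        /* blocks currently allocated by the depot */
	unsigned long long journal_block_map_blocks = 456;  /* block map pages counted by the journal */

	unsigned long long data_blocks_used =
		depot_allocated - journal_block_map_blocks;
	unsigned long long overhead_blocks_used =
		physical_blocks - depot_data_blocks + journal_block_map_blocks;

	printf("data blocks used:     %llu\n", data_blocks_used);
	printf("overhead blocks used: %llu\n", overhead_blocks_used);
	return 0;
}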
+ * TODO: verify that this is always safe + */ + stats->physical_blocks = vdo->states.vdo.config.physical_blocks; + stats->block_size = VDO_BLOCK_SIZE; + stats->complete_recoveries = vdo->states.vdo.complete_recoveries; + stats->read_only_recoveries = vdo->states.vdo.read_only_recoveries; + stats->block_map_cache_size = get_block_map_cache_size(vdo); + + /* The callees are responsible for thread-safety. */ + stats->data_blocks_used = vdo_get_physical_blocks_allocated(vdo); + stats->overhead_blocks_used = vdo_get_physical_blocks_overhead(vdo); + stats->logical_blocks_used = vdo_get_recovery_journal_logical_blocks_used(journal); + vdo_get_slab_depot_statistics(vdo->depot, stats); + stats->journal = vdo_get_recovery_journal_statistics(journal); + stats->packer = vdo_get_packer_statistics(vdo->packer); + stats->block_map = vdo_get_block_map_statistics(vdo->block_map); + vdo_get_dedupe_statistics(vdo->hash_zones, stats); + stats->errors = get_vdo_error_statistics(vdo); + stats->in_recovery_mode = (state == VDO_RECOVERING); + snprintf(stats->mode, sizeof(stats->mode), "%s", vdo_describe_state(state)); + + stats->instance = vdo->instance; + stats->current_vios_in_progress = get_data_vio_pool_active_requests(vdo->data_vio_pool); + stats->max_vios = get_data_vio_pool_maximum_requests(vdo->data_vio_pool); + + stats->flush_out = atomic64_read(&vdo->stats.flush_out); + stats->logical_block_size = vdo->device_config->logical_block_size; + copy_bio_stat(&stats->bios_in, &vdo->stats.bios_in); + copy_bio_stat(&stats->bios_in_partial, &vdo->stats.bios_in_partial); + copy_bio_stat(&stats->bios_out, &vdo->stats.bios_out); + copy_bio_stat(&stats->bios_meta, &vdo->stats.bios_meta); + copy_bio_stat(&stats->bios_journal, &vdo->stats.bios_journal); + copy_bio_stat(&stats->bios_page_cache, &vdo->stats.bios_page_cache); + copy_bio_stat(&stats->bios_out_completed, &vdo->stats.bios_out_completed); + copy_bio_stat(&stats->bios_meta_completed, &vdo->stats.bios_meta_completed); + copy_bio_stat(&stats->bios_journal_completed, &vdo->stats.bios_journal_completed); + copy_bio_stat(&stats->bios_page_cache_completed, &vdo->stats.bios_page_cache_completed); + copy_bio_stat(&stats->bios_acknowledged, &vdo->stats.bios_acknowledged); + copy_bio_stat(&stats->bios_acknowledged_partial, &vdo->stats.bios_acknowledged_partial); + stats->bios_in_progress = subtract_bio_stats(stats->bios_in, stats->bios_acknowledged); + uds_get_memory_stats(&stats->memory_usage.bytes_used, + &stats->memory_usage.peak_bytes_used); +} + +/** + * vdo_fetch_statistics_callback() - Action to populate a vdo_statistics + * structure on the admin thread. + * @completion: The completion. + * + * This callback is registered in vdo_fetch_statistics(). + */ +static void vdo_fetch_statistics_callback(struct vdo_completion *completion) +{ + get_vdo_statistics(completion->vdo, completion->parent); + complete_synchronous_action(completion); +} + +/** + * vdo_fetch_statistics() - Fetch statistics on the correct thread. + * @vdo: The vdo. + * @stats: The vdo statistics are returned here. + */ +void vdo_fetch_statistics(struct vdo *vdo, struct vdo_statistics *stats) +{ + perform_synchronous_action(vdo, + vdo_fetch_statistics_callback, + vdo->thread_config.admin_thread, + stats); +} + +/** + * vdo_get_callback_thread_id() - Get the id of the callback thread on which a completion is + * currently running. + * + * Return: The current thread ID, or -1 if no such thread. 
+ */ +thread_id_t vdo_get_callback_thread_id(void) +{ + struct vdo_work_queue *queue = vdo_get_current_work_queue(); + struct vdo_thread *thread; + thread_id_t thread_id; + + if (queue == NULL) + return VDO_INVALID_THREAD_ID; + + thread = vdo_get_work_queue_owner(queue); + thread_id = thread->thread_id; + + if (PARANOID_THREAD_CONSISTENCY_CHECKS) { + BUG_ON(thread_id >= thread->vdo->thread_config.thread_count); + BUG_ON(thread != &thread->vdo->threads[thread_id]); + } + + return thread_id; +} + +/** + * vdo_dump_status() - Dump status information about a vdo to the log for debugging. + * @vdo: The vdo to dump. + */ +void vdo_dump_status(const struct vdo *vdo) +{ + zone_count_t zone; + + vdo_dump_flusher(vdo->flusher); + vdo_dump_recovery_journal_statistics(vdo->recovery_journal); + vdo_dump_packer(vdo->packer); + vdo_dump_slab_depot(vdo->depot); + + for (zone = 0; zone < vdo->thread_config.logical_zone_count; zone++) + vdo_dump_logical_zone(&vdo->logical_zones->zones[zone]); + + for (zone = 0; zone < vdo->thread_config.physical_zone_count; zone++) + vdo_dump_physical_zone(&vdo->physical_zones->zones[zone]); + + vdo_dump_hash_zones(vdo->hash_zones); +} + +/** + * vdo_assert_on_admin_thread() - Assert that we are running on the admin thread. + * @vdo: The vdo. + * @name: The name of the function which should be running on the admin thread (for logging). + */ +void vdo_assert_on_admin_thread(const struct vdo *vdo, const char *name) +{ + ASSERT_LOG_ONLY((vdo_get_callback_thread_id() == vdo->thread_config.admin_thread), + "%s called on admin thread", + name); +} + +/** + * vdo_assert_on_logical_zone_thread() - Assert that this function was called on the specified + * logical zone thread. + * @vdo: The vdo. + * @logical_zone: The number of the logical zone. + * @name: The name of the calling function. + */ +void vdo_assert_on_logical_zone_thread(const struct vdo *vdo, + zone_count_t logical_zone, + const char *name) +{ + ASSERT_LOG_ONLY((vdo_get_callback_thread_id() == + vdo->thread_config.logical_threads[logical_zone]), + "%s called on logical thread", + name); +} + +/** + * vdo_assert_on_physical_zone_thread() - Assert that this function was called on the specified + * physical zone thread. + * @vdo: The vdo. + * @physical_zone: The number of the physical zone. + * @name: The name of the calling function. + */ +void vdo_assert_on_physical_zone_thread(const struct vdo *vdo, + zone_count_t physical_zone, + const char *name) +{ + ASSERT_LOG_ONLY((vdo_get_callback_thread_id() == + vdo->thread_config.physical_threads[physical_zone]), + "%s called on physical thread", + name); +} + +/** + * vdo_get_physical_zone() - Get the physical zone responsible for a given physical block number. + * @vdo: The vdo containing the physical zones. + * @pbn: The PBN of the data block. + * @zone_ptr: A pointer to return the physical zone. + * + * Gets the physical zone responsible for a given physical block number of a data block in this vdo + * instance, or of the zero block (for which a NULL zone is returned). For any other block number + * that is not in the range of valid data block numbers in any slab, an error will be returned. + * This function is safe to call on invalid block numbers; it will not put the vdo into read-only + * mode. + * + * Return: VDO_SUCCESS or VDO_OUT_OF_RANGE if the block number is invalid or an error code for any + * other failure. 
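+ *
+ * A typical call (illustrative sketch) checks both the result and the returned zone,
+ * since the zero block yields VDO_SUCCESS with a NULL zone:
+ *
+ *   struct physical_zone *zone;
+ *   int result = vdo_get_physical_zone(vdo, pbn, &zone);
+ *
+ *   if ((result == VDO_SUCCESS) && (zone != NULL))
+ *           ... pbn is a data block owned by this zone ...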
+ */ +int vdo_get_physical_zone(const struct vdo *vdo, + physical_block_number_t pbn, + struct physical_zone **zone_ptr) +{ + struct vdo_slab *slab; + int result; + + if (pbn == VDO_ZERO_BLOCK) { + *zone_ptr = NULL; + return VDO_SUCCESS; + } + + /* + * Used because it does a more restrictive bounds check than vdo_get_slab(), and done first + * because it won't trigger read-only mode on an invalid PBN. + */ + if (!vdo_is_physical_data_block(vdo->depot, pbn)) + return VDO_OUT_OF_RANGE; + + /* With the PBN already checked, we should always succeed in finding a slab. */ + slab = vdo_get_slab(vdo->depot, pbn); + result = ASSERT(slab != NULL, "vdo_get_slab must succeed on all valid PBNs"); + if (result != VDO_SUCCESS) + return result; + + *zone_ptr = &vdo->physical_zones->zones[slab->allocator->zone_number]; + return VDO_SUCCESS; +} diff --git a/drivers/md/dm-vdo/vdo.h b/drivers/md/dm-vdo/vdo.h new file mode 100644 index 000000000000..a2caf54221c8 --- /dev/null +++ b/drivers/md/dm-vdo/vdo.h @@ -0,0 +1,381 @@ +/* SPDX-License-Identifier: GPL-2.0-only */ +/* + * Copyright 2023 Red Hat + */ + +#ifndef VDO_H +#define VDO_H + +#include +#include +#include +#include +#include +#include +#include + +#include "admin-state.h" +#include "encodings.h" +#include "packer.h" +#include "physical-zone.h" +#include "statistics.h" +#include "thread-registry.h" +#include "types.h" +#include "uds.h" +#include "work-queue.h" + +enum notifier_state { + /** Notifications are allowed but not in progress */ + MAY_NOTIFY, + /** A notification is in progress */ + NOTIFYING, + /** Notifications are not allowed */ + MAY_NOT_NOTIFY, + /** A notification has completed */ + NOTIFIED, +}; + +/** + * typedef vdo_read_only_notification - A function to notify a listener that the VDO has gone + * read-only. + * @listener: The object to notify. + * @parent: The completion to notify in order to acknowledge the notification. + */ +typedef void vdo_read_only_notification(void *listener, struct vdo_completion *parent); + +/* + * An object to be notified when the VDO enters read-only mode + */ +struct read_only_listener { + /* The listener */ + void *listener; + /* The method to call to notify the listener */ + vdo_read_only_notification *notify; + /* A pointer to the next listener */ + struct read_only_listener *next; +}; + +struct vdo_thread { + struct vdo *vdo; + thread_id_t thread_id; + struct vdo_work_queue *queue; + /* + * Each thread maintains its own notion of whether the VDO is read-only so that the + * read-only state can be checked from any base thread without worrying about + * synchronization or thread safety. This does mean that knowledge of the VDO going + * read-only does not occur simultaneously across the VDO's threads, but that does not seem + * to cause any problems. + */ + bool is_read_only; + /* + * A list of objects waiting to be notified on this thread that the VDO has entered + * read-only mode. + */ + struct read_only_listener *listeners; + struct registered_thread allocating_thread; +}; + +/* Keep struct bio statistics atomically */ +struct atomic_bio_stats { + atomic64_t read; /* Number of not REQ_WRITE bios */ + atomic64_t write; /* Number of REQ_WRITE bios */ + atomic64_t discard; /* Number of REQ_DISCARD bios */ + atomic64_t flush; /* Number of REQ_FLUSH bios */ + atomic64_t empty_flush; /* Number of REQ_PREFLUSH bios without data */ + atomic64_t fua; /* Number of REQ_FUA bios */ +}; + +/* Counters are atomic since updates can arrive concurrently from arbitrary threads. 
*/ +struct atomic_statistics { + atomic64_t bios_submitted; + atomic64_t bios_completed; + atomic64_t flush_out; + atomic64_t invalid_advice_pbn_count; + atomic64_t no_space_error_count; + atomic64_t read_only_error_count; + struct atomic_bio_stats bios_in; + struct atomic_bio_stats bios_in_partial; + struct atomic_bio_stats bios_out; + struct atomic_bio_stats bios_out_completed; + struct atomic_bio_stats bios_acknowledged; + struct atomic_bio_stats bios_acknowledged_partial; + struct atomic_bio_stats bios_meta; + struct atomic_bio_stats bios_meta_completed; + struct atomic_bio_stats bios_journal; + struct atomic_bio_stats bios_journal_completed; + struct atomic_bio_stats bios_page_cache; + struct atomic_bio_stats bios_page_cache_completed; +}; + +struct read_only_notifier { + /* The completion for entering read-only mode */ + struct vdo_completion completion; + /* A completion waiting for notifications to be drained or enabled */ + struct vdo_completion *waiter; + /* Lock to protect the next two fields */ + spinlock_t lock; + /* The code of the error which put the VDO into read-only mode */ + int read_only_error; + /* The current state of the notifier (values described above) */ + enum notifier_state state; +}; + +/* + * The thread ID returned when the current thread is not a vdo thread, or can not be determined + * (usually due to being at interrupt context). + */ +#define VDO_INVALID_THREAD_ID ((thread_id_t) -1) + +struct thread_config { + zone_count_t logical_zone_count; + zone_count_t physical_zone_count; + zone_count_t hash_zone_count; + thread_count_t bio_thread_count; + thread_count_t thread_count; + thread_id_t admin_thread; + thread_id_t journal_thread; + thread_id_t packer_thread; + thread_id_t dedupe_thread; + thread_id_t bio_ack_thread; + thread_id_t cpu_thread; + thread_id_t *logical_threads; + thread_id_t *physical_threads; + thread_id_t *hash_zone_threads; + thread_id_t *bio_threads; +}; + +struct thread_count_config; + +struct vdo_super_block { + /* The vio for reading and writing the super block to disk */ + struct vio vio; + /* A buffer to hold the super block */ + u8 *buffer; + /* Whether this super block may not be written */ + bool unwritable; +}; + +struct data_vio_pool; + +struct vdo_administrator { + struct vdo_completion completion; + struct admin_state state; + atomic_t busy; + u32 phase; + struct completion callback_sync; +}; + +struct vdo { + char thread_name_prefix[MAX_VDO_WORK_QUEUE_NAME_LEN]; + struct vdo_thread *threads; + vdo_action *action; + struct vdo_completion *completion; + struct vio_tracer *vio_tracer; + + /* The atomic version of the state of this vdo */ + atomic_t state; + /* The full state of all components */ + struct vdo_component_states states; + /* + * A counter value to attach to thread names and log messages to identify the individual + * device. 
+ */ + unsigned int instance; + /* The read-only notifier */ + struct read_only_notifier read_only_notifier; + /* The load-time configuration of this vdo */ + struct device_config *device_config; + /* The thread mapping */ + struct thread_config thread_config; + + /* The super block */ + struct vdo_super_block super_block; + + /* The partitioning of the underlying storage */ + struct layout layout; + struct layout next_layout; + struct dm_kcopyd_client *partition_copier; + + /* The block map */ + struct block_map *block_map; + + /* The journal for block map recovery */ + struct recovery_journal *recovery_journal; + + /* The slab depot */ + struct slab_depot *depot; + + /* The compressed-block packer */ + struct packer *packer; + /* Whether incoming data should be compressed */ + bool compressing; + + /* The handler for flush requests */ + struct flusher *flusher; + + /* The state the vdo was in when loaded (primarily for unit tests) */ + enum vdo_state load_state; + + /* The logical zones of this vdo */ + struct logical_zones *logical_zones; + + /* The physical zones of this vdo */ + struct physical_zones *physical_zones; + + /* The hash lock zones of this vdo */ + struct hash_zones *hash_zones; + + /* Bio submission manager used for sending bios to the storage device. */ + struct io_submitter *io_submitter; + + /* The pool of data_vios for servicing incoming bios */ + struct data_vio_pool *data_vio_pool; + + /* The manager for administrative operations */ + struct vdo_administrator admin; + + /* Flags controlling administrative operations */ + const struct admin_state_code *suspend_type; + bool allocations_allowed; + bool dump_on_shutdown; + atomic_t processing_message; + + /* + * Statistics + * Atomic stats counters + */ + struct atomic_statistics stats; + /* Used to gather statistics without allocating memory */ + struct vdo_statistics stats_buffer; + /* Protects the stats_buffer */ + struct mutex stats_mutex; + /* true if sysfs directory is set up */ + bool sysfs_added; + /* Used when shutting down the sysfs statistics */ + struct completion stats_shutdown; + + + /* A list of all device_configs referencing this vdo */ + struct list_head device_config_list; + + /* This VDO's list entry for the device registry */ + struct list_head registration; + + /* Underlying block device info. */ + u64 starting_sector_offset; + struct volume_geometry geometry; + + /* For sysfs */ + struct kobject vdo_directory; + struct kobject stats_directory; + + /* N blobs of context data for LZ4 code, one per CPU thread. */ + char **compression_context; +}; + + +/** + * vdo_uses_bio_ack_queue() - Indicate whether the vdo is configured to use a separate work queue + * for acknowledging received and processed bios. + * @vdo: The vdo. + * + * Note that this directly controls the handling of write operations, but the compile-time flag + * VDO_USE_BIO_ACK_QUEUE_FOR_READ is also checked for read operations. + * + * Return: Whether a bio-acknowledgement work queue is in use. + */ +static inline bool vdo_uses_bio_ack_queue(struct vdo *vdo) +{ + return vdo->device_config->thread_counts.bio_ack_threads > 0; +} + +/** + * typedef vdo_filter_t - Method type for vdo matching methods. + * + * A filter function returns false if the vdo doesn't match. 
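+ *
+ * An illustrative (hypothetical) filter matching a vdo by its instance number, suitable
+ * for passing to vdo_find_matching():
+ *
+ *   static bool filter_by_instance(struct vdo *vdo, const void *context)
+ *   {
+ *           return vdo->instance == *((const unsigned int *) context);
+ *   }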
+ */ +typedef bool vdo_filter_t(struct vdo *vdo, const void *context); + +void vdo_initialize_device_registry_once(void); +struct vdo * __must_check vdo_find_matching(vdo_filter_t *filter, const void *context); + +int __must_check vdo_make_thread(struct vdo *vdo, + thread_id_t thread_id, + const struct vdo_work_queue_type *type, + unsigned int queue_count, + void *contexts[]); + +static inline int __must_check vdo_make_default_thread(struct vdo *vdo, thread_id_t thread_id) +{ + return vdo_make_thread(vdo, thread_id, NULL, 1, NULL); +} + +int __must_check +vdo_make(unsigned int instance, struct device_config *config, char **reason, struct vdo **vdo_ptr); + +void vdo_destroy(struct vdo *vdo); + +void vdo_load_super_block(struct vdo *vdo, struct vdo_completion *parent); + +int __must_check vdo_add_sysfs_stats_dir(struct vdo *vdo); + +struct block_device * __must_check vdo_get_backing_device(const struct vdo *vdo); + +const char * __must_check vdo_get_device_name(const struct dm_target *target); + +int __must_check vdo_synchronous_flush(struct vdo *vdo); + +const struct admin_state_code * __must_check vdo_get_admin_state(const struct vdo *vdo); + +bool vdo_set_compressing(struct vdo *vdo, bool enable); + +bool vdo_get_compressing(struct vdo *vdo); + +void vdo_fetch_statistics(struct vdo *vdo, struct vdo_statistics *stats); + +thread_id_t vdo_get_callback_thread_id(void); + +enum vdo_state __must_check vdo_get_state(const struct vdo *vdo); + +void vdo_set_state(struct vdo *vdo, enum vdo_state state); + +void vdo_save_components(struct vdo *vdo, struct vdo_completion *parent); + +int vdo_register_read_only_listener(struct vdo *vdo, + void *listener, + vdo_read_only_notification *notification, + thread_id_t thread_id); + +int vdo_enable_read_only_entry(struct vdo *vdo); + +void vdo_wait_until_not_entering_read_only_mode(struct vdo_completion *parent); + +void vdo_allow_read_only_mode_entry(struct vdo_completion *parent); + +void vdo_enter_read_only_mode(struct vdo *vdo, int error_code); + +bool __must_check vdo_is_read_only(struct vdo *vdo); + +bool __must_check vdo_in_read_only_mode(const struct vdo *vdo); + +bool __must_check vdo_in_recovery_mode(const struct vdo *vdo); + +void vdo_enter_recovery_mode(struct vdo *vdo); + +void vdo_assert_on_admin_thread(const struct vdo *vdo, const char *name); + +void vdo_assert_on_logical_zone_thread(const struct vdo *vdo, + zone_count_t logical_zone, + const char *name); + +void vdo_assert_on_physical_zone_thread(const struct vdo *vdo, + zone_count_t physical_zone, + const char *name); + +int __must_check vdo_get_physical_zone(const struct vdo *vdo, + physical_block_number_t pbn, + struct physical_zone **zone_ptr); + +void vdo_dump_status(const struct vdo *vdo); + +#endif /* VDO_H */ From patchwork Thu Oct 26 21:41:31 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Matthew Sakai X-Patchwork-Id: 13437962 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 3353CC27C57 for ; Thu, 26 Oct 2023 21:44:35 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1344830AbjJZVoe (ORCPT ); Thu, 26 Oct 2023 17:44:34 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:41538 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S234875AbjJZVoX (ORCPT 
); Thu, 26 Oct 2023 17:44:23 -0400 Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.129.124]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id D5864D5A for ; Thu, 26 Oct 2023 14:41:48 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1698356508; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=voNjYh+XCHZBSQn3Cd55J1ZDlYEQhHPpLgFtmB0aszM=; b=UA/hnJ9mcg8c3Ctt9gXpJmxBAmrlw+18KL9CQS1XH4q5T+Jzo2xLVifZmJAVR/UKaCYdxk CFj8myVMg38gcwQQCFWmwqn2I1WVSyWfat/Vsb5m0MqJAjctCUUFRiQLKWfGPXnoqSAhzA NrEfhLjrr2BIZwrVCY1yPosNd3BUO2M= Received: from mimecast-mx02.redhat.com (mimecast-mx02.redhat.com [66.187.233.88]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.3, cipher=TLS_AES_256_GCM_SHA384) id us-mta-127-Gb3BrOKiOeOl1_qi0lPuOA-1; Thu, 26 Oct 2023 17:41:43 -0400 X-MC-Unique: Gb3BrOKiOeOl1_qi0lPuOA-1 Received: from smtp.corp.redhat.com (int-mx07.intmail.prod.int.rdu2.redhat.com [10.11.54.7]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (2048 bits) server-digest SHA256) (No client certificate requested) by mimecast-mx02.redhat.com (Postfix) with ESMTPS id 60524857F8C; Thu, 26 Oct 2023 21:41:43 +0000 (UTC) Received: from pbitcolo-build-10.permabit.com (pbitcolo-build-10.permabit.lab.eng.bos.redhat.com [10.19.117.76]) by smtp.corp.redhat.com (Postfix) with ESMTP id 05AD81C060B0; Thu, 26 Oct 2023 21:41:43 +0000 (UTC) Received: by pbitcolo-build-10.permabit.com (Postfix, from userid 1138) id AFBDC3003D; Thu, 26 Oct 2023 17:41:42 -0400 (EDT) From: Matthew Sakai To: dm-devel@lists.linux.dev, linux-block@vger.kernel.org Cc: Matthew Sakai Subject: [PATCH v4 34/39] dm vdo: add the on-disk formats and marshalling of vdo structures Date: Thu, 26 Oct 2023 17:41:31 -0400 Message-Id: <20231026214136.1067410-35-msakai@redhat.com> In-Reply-To: <20231026214136.1067410-1-msakai@redhat.com> References: <20231026214136.1067410-1-msakai@redhat.com> MIME-Version: 1.0 X-Scanned-By: MIMEDefang 3.4.1 on 10.11.54.7 Precedence: bulk List-ID: X-Mailing-List: linux-block@vger.kernel.org Add data and methods for marshalling and unmarshalling the persistent metadata. Co-developed-by: J. corwin Coburn Signed-off-by: J. 
corwin Coburn Co-developed-by: Michael Sclafani Signed-off-by: Michael Sclafani Co-developed-by: Sweet Tea Dorminy Signed-off-by: Sweet Tea Dorminy Signed-off-by: Matthew Sakai --- drivers/md/dm-vdo/encodings.c | 1488 +++++++++++++++++++++++++++++++++ drivers/md/dm-vdo/encodings.h | 1307 +++++++++++++++++++++++++++++ 2 files changed, 2795 insertions(+) create mode 100644 drivers/md/dm-vdo/encodings.c create mode 100644 drivers/md/dm-vdo/encodings.h diff --git a/drivers/md/dm-vdo/encodings.c b/drivers/md/dm-vdo/encodings.c new file mode 100644 index 000000000000..00ed1e70b647 --- /dev/null +++ b/drivers/md/dm-vdo/encodings.c @@ -0,0 +1,1488 @@ +// SPDX-License-Identifier: GPL-2.0-only +/* + * Copyright 2023 Red Hat + */ + +#include "encodings.h" + +#include + +#include "logger.h" +#include "memory-alloc.h" +#include "permassert.h" + +#include "constants.h" +#include "status-codes.h" +#include "types.h" + +struct geometry_block { + char magic_number[VDO_GEOMETRY_MAGIC_NUMBER_SIZE]; + struct packed_header header; + u32 checksum; +} __packed; + +static const struct header GEOMETRY_BLOCK_HEADER_5_0 = { + .id = VDO_GEOMETRY_BLOCK, + .version = { + .major_version = 5, + .minor_version = 0, + }, + /* + * Note: this size isn't just the payload size following the header, like it is everywhere + * else in VDO. + */ + .size = sizeof(struct geometry_block) + sizeof(struct volume_geometry), +}; + +static const struct header GEOMETRY_BLOCK_HEADER_4_0 = { + .id = VDO_GEOMETRY_BLOCK, + .version = { + .major_version = 4, + .minor_version = 0, + }, + /* + * Note: this size isn't just the payload size following the header, like it is everywhere + * else in VDO. + */ + .size = sizeof(struct geometry_block) + sizeof(struct volume_geometry_4_0), +}; + +const u8 VDO_GEOMETRY_MAGIC_NUMBER[VDO_GEOMETRY_MAGIC_NUMBER_SIZE + 1] = "dmvdo001"; + +enum { + PAGE_HEADER_4_1_SIZE = 8 + 8 + 8 + 1 + 1 + 1 + 1, +}; + +static const struct version_number BLOCK_MAP_4_1 = { + .major_version = 4, + .minor_version = 1, +}; + +const struct header VDO_BLOCK_MAP_HEADER_2_0 = { + .id = VDO_BLOCK_MAP, + .version = { + .major_version = 2, + .minor_version = 0, + }, + .size = sizeof(struct block_map_state_2_0), +}; + +const struct header VDO_RECOVERY_JOURNAL_HEADER_7_0 = { + .id = VDO_RECOVERY_JOURNAL, + .version = { + .major_version = 7, + .minor_version = 0, + }, + .size = sizeof(struct recovery_journal_state_7_0), +}; + +const struct header VDO_SLAB_DEPOT_HEADER_2_0 = { + .id = VDO_SLAB_DEPOT, + .version = { + .major_version = 2, + .minor_version = 0, + }, + .size = sizeof(struct slab_depot_state_2_0), +}; + +const struct header VDO_LAYOUT_HEADER_3_0 = { + .id = VDO_LAYOUT, + .version = { + .major_version = 3, + .minor_version = 0, + }, + .size = sizeof(struct layout_3_0) + (sizeof(struct partition_3_0) * VDO_PARTITION_COUNT), +}; + +static const enum partition_id REQUIRED_PARTITIONS[] = { + VDO_BLOCK_MAP_PARTITION, + VDO_SLAB_DEPOT_PARTITION, + VDO_RECOVERY_JOURNAL_PARTITION, + VDO_SLAB_SUMMARY_PARTITION, +}; + +/* + * The current version for the data encoded in the super block. This must be changed any time there + * is a change to encoding of the component data of any VDO component. 
+ */ +static const struct version_number VDO_COMPONENT_DATA_41_0 = { + .major_version = 41, + .minor_version = 0, +}; + +const struct version_number VDO_VOLUME_VERSION_67_0 = { + .major_version = 67, + .minor_version = 0, +}; + +static const struct header SUPER_BLOCK_HEADER_12_0 = { + .id = VDO_SUPER_BLOCK, + .version = { + .major_version = 12, + .minor_version = 0, + }, + + /* This is the minimum size, if the super block contains no components. */ + .size = VDO_SUPER_BLOCK_FIXED_SIZE - VDO_ENCODED_HEADER_SIZE, +}; + +/** + * validate_version() - Check whether a version matches an expected version. + * @expected_version: The expected version. + * @actual_version: The version being validated. + * @component_name: The name of the component or the calling function (for error logging). + * + * Logs an error describing a mismatch. + * + * Return: VDO_SUCCESS if the versions are the same, + * VDO_UNSUPPORTED_VERSION if the versions don't match. + */ +static int __must_check validate_version(struct version_number expected_version, + struct version_number actual_version, + const char *component_name) +{ + if (!vdo_are_same_version(expected_version, actual_version)) + return uds_log_error_strerror(VDO_UNSUPPORTED_VERSION, + "%s version mismatch, expected %d.%d, got %d.%d", + component_name, + expected_version.major_version, + expected_version.minor_version, + actual_version.major_version, + actual_version.minor_version); + return VDO_SUCCESS; +} + +/** + * vdo_validate_header() - Check whether a header matches expectations. + * @expected_header: The expected header. + * @actual_header: The header being validated. + * @exact_size: If true, the size fields of the two headers must be the same, otherwise it is + * required that actual_header.size >= expected_header.size. + * @name: The name of the component or the calling function (for error logging). + * + * Logs an error describing the first mismatch found. + * + * Return: VDO_SUCCESS if the header meets expectations, + * VDO_INCORRECT_COMPONENT if the component ids don't match, + * VDO_UNSUPPORTED_VERSION if the versions or sizes don't match. 
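+ *
+ * The usual pattern in this file (shown here as a sketch) is to decode a header from
+ * the buffer and then validate it against the expected constant before decoding the
+ * payload:
+ *
+ *   vdo_decode_header(buffer, offset, &header);
+ *   result = vdo_validate_header(&VDO_BLOCK_MAP_HEADER_2_0, &header, true, __func__);
+ *   if (result != VDO_SUCCESS)
+ *           return result;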
+ */ +int vdo_validate_header(const struct header *expected_header, + const struct header *actual_header, + bool exact_size, + const char *name) +{ + int result; + + if (expected_header->id != actual_header->id) + return uds_log_error_strerror(VDO_INCORRECT_COMPONENT, + "%s ID mismatch, expected %d, got %d", + name, + expected_header->id, + actual_header->id); + + result = validate_version(expected_header->version, actual_header->version, name); + if (result != VDO_SUCCESS) + return result; + + if ((expected_header->size > actual_header->size) || + (exact_size && (expected_header->size < actual_header->size))) + return uds_log_error_strerror(VDO_UNSUPPORTED_VERSION, + "%s size mismatch, expected %zu, got %zu", + name, + expected_header->size, + actual_header->size); + + return VDO_SUCCESS; +} + +static void encode_version_number(u8 *buffer, size_t *offset, struct version_number version) +{ + struct packed_version_number packed = vdo_pack_version_number(version); + + memcpy(buffer + *offset, &packed, sizeof(packed)); + *offset += sizeof(packed); +} + +void vdo_encode_header(u8 *buffer, size_t *offset, const struct header *header) +{ + struct packed_header packed = vdo_pack_header(header); + + memcpy(buffer + *offset, &packed, sizeof(packed)); + *offset += sizeof(packed); +} + +static void decode_version_number(u8 *buffer, size_t *offset, struct version_number *version) +{ + struct packed_version_number packed; + + memcpy(&packed, buffer + *offset, sizeof(packed)); + *offset += sizeof(packed); + *version = vdo_unpack_version_number(packed); +} + +void vdo_decode_header(u8 *buffer, size_t *offset, struct header *header) +{ + struct packed_header packed; + + memcpy(&packed, buffer + *offset, sizeof(packed)); + *offset += sizeof(packed); + + *header = vdo_unpack_header(&packed); +} + +/** + * decode_volume_geometry() - Decode the on-disk representation of a volume geometry from a buffer. + * @buffer: A buffer to decode from. + * @offset: The offset in the buffer at which to decode. + * @geometry: The structure to receive the decoded fields. + * @version: The geometry block version to decode. + */ +static void +decode_volume_geometry(u8 *buffer, size_t *offset, struct volume_geometry *geometry, u32 version) +{ + u32 unused, mem; + enum volume_region_id id; + nonce_t nonce; + block_count_t bio_offset = 0; + bool sparse; + + /* This is for backwards compatibility. */ + decode_u32_le(buffer, offset, &unused); + geometry->unused = unused; + + decode_u64_le(buffer, offset, &nonce); + geometry->nonce = nonce; + + memcpy((unsigned char *) &geometry->uuid, buffer + *offset, sizeof(uuid_t)); + *offset += sizeof(uuid_t); + + if (version > 4) + decode_u64_le(buffer, offset, &bio_offset); + geometry->bio_offset = bio_offset; + + for (id = 0; id < VDO_VOLUME_REGION_COUNT; id++) { + physical_block_number_t start_block; + enum volume_region_id saved_id; + + decode_u32_le(buffer, offset, &saved_id); + decode_u64_le(buffer, offset, &start_block); + + geometry->regions[id] = (struct volume_region) { + .id = saved_id, + .start_block = start_block, + }; + } + + decode_u32_le(buffer, offset, &mem); + *offset += sizeof(u32); + sparse = buffer[(*offset)++]; + + geometry->index_config = (struct index_config) { + .mem = mem, + .sparse = sparse, + }; +} + +/** + * vdo_parse_geometry_block() - Decode and validate an encoded geometry block. + * @block: The encoded geometry block. + * @geometry: The structure to receive the decoded fields. 
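+ *
+ * The encoded block consists of the geometry magic number, a packed header, the encoded
+ * volume geometry, and a trailing vdo_crc32 checksum covering everything before it.
+ *
+ * Return: VDO_SUCCESS, or an error code such as VDO_BAD_MAGIC or VDO_CHECKSUM_MISMATCH
+ *         if the block is not a valid geometry block.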
+ */ +int __must_check vdo_parse_geometry_block(u8 *block, struct volume_geometry *geometry) +{ + u32 checksum, saved_checksum; + struct header header; + size_t offset = 0; + int result; + + if (memcmp(block, VDO_GEOMETRY_MAGIC_NUMBER, VDO_GEOMETRY_MAGIC_NUMBER_SIZE) != 0) + return VDO_BAD_MAGIC; + offset += VDO_GEOMETRY_MAGIC_NUMBER_SIZE; + + vdo_decode_header(block, &offset, &header); + if (header.version.major_version <= 4) + result = vdo_validate_header(&GEOMETRY_BLOCK_HEADER_4_0, &header, true, __func__); + else + result = vdo_validate_header(&GEOMETRY_BLOCK_HEADER_5_0, &header, true, __func__); + if (result != VDO_SUCCESS) + return result; + + decode_volume_geometry(block, &offset, geometry, header.version.major_version); + + result = ASSERT(header.size == offset + sizeof(u32), + "should have decoded up to the geometry checksum"); + if (result != VDO_SUCCESS) + return result; + + /* Decode and verify the checksum. */ + checksum = vdo_crc32(block, offset); + decode_u32_le(block, &offset, &saved_checksum); + + return ((checksum == saved_checksum) ? VDO_SUCCESS : VDO_CHECKSUM_MISMATCH); +} + +struct block_map_page *vdo_format_block_map_page(void *buffer, + nonce_t nonce, + physical_block_number_t pbn, + bool initialized) +{ + struct block_map_page *page = (struct block_map_page *) buffer; + + memset(buffer, 0, VDO_BLOCK_SIZE); + page->version = vdo_pack_version_number(BLOCK_MAP_4_1); + page->header.nonce = __cpu_to_le64(nonce); + page->header.pbn = __cpu_to_le64(pbn); + page->header.initialized = initialized; + return page; +} + +enum block_map_page_validity +vdo_validate_block_map_page(struct block_map_page *page, + nonce_t nonce, + physical_block_number_t pbn) +{ + STATIC_ASSERT_SIZEOF(struct block_map_page_header, PAGE_HEADER_4_1_SIZE); + + if (!vdo_are_same_version(BLOCK_MAP_4_1, vdo_unpack_version_number(page->version)) || + !page->header.initialized || + (nonce != __le64_to_cpu(page->header.nonce))) + return VDO_BLOCK_MAP_PAGE_INVALID; + + if (pbn != vdo_get_block_map_page_pbn(page)) + return VDO_BLOCK_MAP_PAGE_BAD; + + return VDO_BLOCK_MAP_PAGE_VALID; +} + +static int +decode_block_map_state_2_0(u8 *buffer, size_t *offset, struct block_map_state_2_0 *state) +{ + size_t initial_offset; + block_count_t flat_page_count, root_count; + physical_block_number_t flat_page_origin, root_origin; + struct header header; + int result; + + vdo_decode_header(buffer, offset, &header); + result = vdo_validate_header(&VDO_BLOCK_MAP_HEADER_2_0, &header, true, __func__); + if (result != VDO_SUCCESS) + return result; + + initial_offset = *offset; + + decode_u64_le(buffer, offset, &flat_page_origin); + result = ASSERT(flat_page_origin == VDO_BLOCK_MAP_FLAT_PAGE_ORIGIN, + "Flat page origin must be %u (recorded as %llu)", + VDO_BLOCK_MAP_FLAT_PAGE_ORIGIN, + (unsigned long long) state->flat_page_origin); + if (result != UDS_SUCCESS) + return result; + + decode_u64_le(buffer, offset, &flat_page_count); + result = ASSERT(flat_page_count == 0, + "Flat page count must be 0 (recorded as %llu)", + (unsigned long long) state->flat_page_count); + if (result != UDS_SUCCESS) + return result; + + decode_u64_le(buffer, offset, &root_origin); + decode_u64_le(buffer, offset, &root_count); + + result = ASSERT(VDO_BLOCK_MAP_HEADER_2_0.size == *offset - initial_offset, + "decoded block map component size must match header size"); + if (result != VDO_SUCCESS) + return result; + + *state = (struct block_map_state_2_0) { + .flat_page_origin = flat_page_origin, + .flat_page_count = flat_page_count, + .root_origin = 
root_origin, + .root_count = root_count, + }; + + return VDO_SUCCESS; +} + +static void +encode_block_map_state_2_0(u8 *buffer, size_t *offset, struct block_map_state_2_0 state) +{ + size_t initial_offset; + + vdo_encode_header(buffer, offset, &VDO_BLOCK_MAP_HEADER_2_0); + + initial_offset = *offset; + encode_u64_le(buffer, offset, state.flat_page_origin); + encode_u64_le(buffer, offset, state.flat_page_count); + encode_u64_le(buffer, offset, state.root_origin); + encode_u64_le(buffer, offset, state.root_count); + + ASSERT_LOG_ONLY(VDO_BLOCK_MAP_HEADER_2_0.size == *offset - initial_offset, + "encoded block map component size must match header size"); +} + +/** + * vdo_compute_new_forest_pages() - Compute the number of pages which must be allocated at each + * level in order to grow the forest to a new number of entries. + * @entries: The new number of entries the block map must address. + * + * Return: The total number of non-leaf pages required. + */ +block_count_t vdo_compute_new_forest_pages(root_count_t root_count, + struct boundary *old_sizes, + block_count_t entries, + struct boundary *new_sizes) +{ + page_count_t leaf_pages = max(vdo_compute_block_map_page_count(entries), 1U); + page_count_t level_size = DIV_ROUND_UP(leaf_pages, root_count); + block_count_t total_pages = 0; + height_t height; + + for (height = 0; height < VDO_BLOCK_MAP_TREE_HEIGHT; height++) { + block_count_t new_pages; + + level_size = DIV_ROUND_UP(level_size, VDO_BLOCK_MAP_ENTRIES_PER_PAGE); + new_sizes->levels[height] = level_size; + new_pages = level_size; + if (old_sizes != NULL) + new_pages -= old_sizes->levels[height]; + total_pages += (new_pages * root_count); + } + + return total_pages; +} + +/** + * encode_recovery_journal_state_7_0() - Encode the state of a recovery journal. + * + * Return: VDO_SUCCESS or an error code. + */ +static void +encode_recovery_journal_state_7_0(u8 *buffer, + size_t *offset, + struct recovery_journal_state_7_0 state) +{ + size_t initial_offset; + + vdo_encode_header(buffer, offset, &VDO_RECOVERY_JOURNAL_HEADER_7_0); + + initial_offset = *offset; + encode_u64_le(buffer, offset, state.journal_start); + encode_u64_le(buffer, offset, state.logical_blocks_used); + encode_u64_le(buffer, offset, state.block_map_data_blocks); + + ASSERT_LOG_ONLY(VDO_RECOVERY_JOURNAL_HEADER_7_0.size == *offset - initial_offset, + "encoded recovery journal component size must match header size"); +} + +/** + * decode_recovery_journal_state_7_0() - Decode the state of a recovery journal saved in a buffer. + * @buffer: The buffer containing the saved state. + * @state: A pointer to a recovery journal state to hold the result of a successful decode. + * + * Return: VDO_SUCCESS or an error code. 
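+ *
+ * The encoded form (see encode_recovery_journal_state_7_0() above) is the 7.0 header
+ * followed by three little-endian u64 fields: journal_start, logical_blocks_used, and
+ * block_map_data_blocks.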
+ */ +static int __must_check +decode_recovery_journal_state_7_0(u8 *buffer, + size_t *offset, + struct recovery_journal_state_7_0 *state) +{ + struct header header; + int result; + size_t initial_offset; + sequence_number_t journal_start; + block_count_t logical_blocks_used, block_map_data_blocks; + + vdo_decode_header(buffer, offset, &header); + result = vdo_validate_header(&VDO_RECOVERY_JOURNAL_HEADER_7_0, &header, true, __func__); + if (result != VDO_SUCCESS) + return result; + + initial_offset = *offset; + decode_u64_le(buffer, offset, &journal_start); + decode_u64_le(buffer, offset, &logical_blocks_used); + decode_u64_le(buffer, offset, &block_map_data_blocks); + + result = ASSERT(VDO_RECOVERY_JOURNAL_HEADER_7_0.size == *offset - initial_offset, + "decoded recovery journal component size must match header size"); + if (result != UDS_SUCCESS) + return result; + + *state = (struct recovery_journal_state_7_0) { + .journal_start = journal_start, + .logical_blocks_used = logical_blocks_used, + .block_map_data_blocks = block_map_data_blocks, + }; + + return VDO_SUCCESS; +} + +/** + * vdo_get_journal_operation_name() - Get the name of a journal operation. + * @operation: The operation to name. + * + * Return: The name of the operation. + */ +const char *vdo_get_journal_operation_name(enum journal_operation operation) +{ + switch (operation) { + case VDO_JOURNAL_DATA_REMAPPING: + return "data remapping"; + + case VDO_JOURNAL_BLOCK_MAP_REMAPPING: + return "block map remapping"; + + default: + return "unknown journal operation"; + } +} + +/** + * encode_slab_depot_state_2_0() - Encode the state of a slab depot into a buffer. + * + * Return: UDS_SUCCESS or an error. + */ +static void +encode_slab_depot_state_2_0(u8 *buffer, size_t *offset, struct slab_depot_state_2_0 state) +{ + size_t initial_offset; + + vdo_encode_header(buffer, offset, &VDO_SLAB_DEPOT_HEADER_2_0); + + initial_offset = *offset; + encode_u64_le(buffer, offset, state.slab_config.slab_blocks); + encode_u64_le(buffer, offset, state.slab_config.data_blocks); + encode_u64_le(buffer, offset, state.slab_config.reference_count_blocks); + encode_u64_le(buffer, offset, state.slab_config.slab_journal_blocks); + encode_u64_le(buffer, offset, state.slab_config.slab_journal_flushing_threshold); + encode_u64_le(buffer, offset, state.slab_config.slab_journal_blocking_threshold); + encode_u64_le(buffer, offset, state.slab_config.slab_journal_scrubbing_threshold); + encode_u64_le(buffer, offset, state.first_block); + encode_u64_le(buffer, offset, state.last_block); + buffer[(*offset)++] = state.zone_count; + + ASSERT_LOG_ONLY(VDO_SLAB_DEPOT_HEADER_2_0.size == *offset - initial_offset, + "encoded block map component size must match header size"); +} + +/** + * decode_slab_depot_state_2_0() - Decode slab depot component state version 2.0 from a buffer. + * + * Return: UDS_SUCCESS or an error code. 
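+ *
+ * The encoded form (see encode_slab_depot_state_2_0() above) is the 2.0 header, the
+ * seven little-endian u64 fields of the slab_config, the first and last block numbers,
+ * and a single zone count byte.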
+ */ +static int +decode_slab_depot_state_2_0(u8 *buffer, size_t *offset, struct slab_depot_state_2_0 *state) +{ + struct header header; + int result; + size_t initial_offset; + struct slab_config slab_config; + block_count_t count; + physical_block_number_t first_block, last_block; + zone_count_t zone_count; + + vdo_decode_header(buffer, offset, &header); + result = vdo_validate_header(&VDO_SLAB_DEPOT_HEADER_2_0, &header, true, __func__); + if (result != VDO_SUCCESS) + return result; + + initial_offset = *offset; + decode_u64_le(buffer, offset, &count); + slab_config.slab_blocks = count; + + decode_u64_le(buffer, offset, &count); + slab_config.data_blocks = count; + + decode_u64_le(buffer, offset, &count); + slab_config.reference_count_blocks = count; + + decode_u64_le(buffer, offset, &count); + slab_config.slab_journal_blocks = count; + + decode_u64_le(buffer, offset, &count); + slab_config.slab_journal_flushing_threshold = count; + + decode_u64_le(buffer, offset, &count); + slab_config.slab_journal_blocking_threshold = count; + + decode_u64_le(buffer, offset, &count); + slab_config.slab_journal_scrubbing_threshold = count; + + decode_u64_le(buffer, offset, &first_block); + decode_u64_le(buffer, offset, &last_block); + zone_count = buffer[(*offset)++]; + + result = ASSERT(VDO_SLAB_DEPOT_HEADER_2_0.size == *offset - initial_offset, + "decoded slab depot component size must match header size"); + if (result != UDS_SUCCESS) + return result; + + *state = (struct slab_depot_state_2_0) { + .slab_config = slab_config, + .first_block = first_block, + .last_block = last_block, + .zone_count = zone_count, + }; + + return VDO_SUCCESS; +} + +/** + * vdo_configure_slab_depot() - Configure the slab depot. + * @partition: The slab depot partition + * @slab_config: The configuration of a single slab. + * @zone_count: The number of zones the depot will use. + * @state: The state structure to be configured. + * + * Configures the slab_depot for the specified storage capacity, finding the number of data blocks + * that will fit and still leave room for the depot metadata, then return the saved state for that + * configuration. + * + * Return: VDO_SUCCESS or an error code. + */ +int vdo_configure_slab_depot(const struct partition *partition, + struct slab_config slab_config, + zone_count_t zone_count, + struct slab_depot_state_2_0 *state) +{ + block_count_t total_slab_blocks, total_data_blocks; + size_t slab_count; + physical_block_number_t last_block; + block_count_t slab_size = slab_config.slab_blocks; + + uds_log_debug("slabDepot %s(block_count=%llu, first_block=%llu, slab_size=%llu, zone_count=%u)", + __func__, + (unsigned long long) partition->count, + (unsigned long long) partition->offset, + (unsigned long long) slab_size, + zone_count); + + /* We do not allow runt slabs, so we waste up to a slab's worth. 
*/ + slab_count = (partition->count / slab_size); + if (slab_count == 0) + return VDO_NO_SPACE; + + if (slab_count > MAX_VDO_SLABS) + return VDO_TOO_MANY_SLABS; + + total_slab_blocks = slab_count * slab_config.slab_blocks; + total_data_blocks = slab_count * slab_config.data_blocks; + last_block = partition->offset + total_slab_blocks; + + *state = (struct slab_depot_state_2_0) { + .slab_config = slab_config, + .first_block = partition->offset, + .last_block = last_block, + .zone_count = zone_count, + }; + + uds_log_debug("slab_depot last_block=%llu, total_data_blocks=%llu, slab_count=%zu, left_over=%llu", + (unsigned long long) last_block, + (unsigned long long) total_data_blocks, + slab_count, + (unsigned long long) (partition->count - (last_block - partition->offset))); + + return VDO_SUCCESS; +} + +/** + * vdo_configure_slab() - Measure and initialize the configuration to use for each slab. + * @slab_size: The number of blocks per slab. + * @slab_journal_blocks: The number of blocks for the slab journal. + * @slab_config: The slab configuration to initialize. + * + * Return: VDO_SUCCESS or an error code. + */ +int vdo_configure_slab(block_count_t slab_size, + block_count_t slab_journal_blocks, + struct slab_config *slab_config) +{ + block_count_t ref_blocks, meta_blocks, data_blocks; + block_count_t flushing_threshold, remaining, blocking_threshold; + block_count_t minimal_extra_space, scrubbing_threshold; + + if (slab_journal_blocks >= slab_size) + return VDO_BAD_CONFIGURATION; + + /* + * This calculation should technically be a recurrence, but the total number of metadata + * blocks is currently less than a single block of ref_counts, so we'd gain at most one + * data block in each slab with more iteration. + */ + ref_blocks = vdo_get_saved_reference_count_size(slab_size - slab_journal_blocks); + meta_blocks = (ref_blocks + slab_journal_blocks); + + /* Make sure test code hasn't configured slabs to be too small. */ + if (meta_blocks >= slab_size) + return VDO_BAD_CONFIGURATION; + + /* + * If the slab size is very small, assume this must be a unit test and override the number + * of data blocks to be a power of two (wasting blocks in the slab). Many tests need their + * data_blocks fields to be the exact capacity of the configured volume, and that used to + * fall out since they use a power of two for the number of data blocks, the slab size was + * a power of two, and every block in a slab was a data block. + * + * TODO: Try to figure out some way of structuring testParameters and unit tests so this + * hack isn't needed without having to edit several unit tests every time the metadata size + * changes by one block. + */ + data_blocks = slab_size - meta_blocks; + if ((slab_size < 1024) && !is_power_of_2(data_blocks)) + data_blocks = ((block_count_t) 1 << ilog2(data_blocks)); + + /* + * Configure the slab journal thresholds. The flush threshold is 168 of 224 blocks in + * production, or 3/4ths, so we use this ratio for all sizes. + */ + flushing_threshold = ((slab_journal_blocks * 3) + 3) / 4; + /* + * The blocking threshold should be far enough from the flushing threshold to not produce + * delays, but far enough from the end of the journal to allow multiple successive recovery + * failures. + */ + remaining = slab_journal_blocks - flushing_threshold; + blocking_threshold = flushing_threshold + ((remaining * 5) / 7); + /* The scrubbing threshold should be at least 2048 entries before the end of the journal. 
*/ + minimal_extra_space = 1 + (MAXIMUM_VDO_USER_VIOS / VDO_SLAB_JOURNAL_FULL_ENTRIES_PER_BLOCK); + scrubbing_threshold = blocking_threshold; + if (slab_journal_blocks > minimal_extra_space) + scrubbing_threshold = slab_journal_blocks - minimal_extra_space; + if (blocking_threshold > scrubbing_threshold) + blocking_threshold = scrubbing_threshold; + + *slab_config = (struct slab_config) { + .slab_blocks = slab_size, + .data_blocks = data_blocks, + .reference_count_blocks = ref_blocks, + .slab_journal_blocks = slab_journal_blocks, + .slab_journal_flushing_threshold = flushing_threshold, + .slab_journal_blocking_threshold = blocking_threshold, + .slab_journal_scrubbing_threshold = scrubbing_threshold}; + return VDO_SUCCESS; +} + +/** + * vdo_decode_slab_journal_entry() - Decode a slab journal entry. + * @block: The journal block holding the entry. + * @entry_count: The number of the entry. + * + * Return: The decoded entry. + */ +struct slab_journal_entry +vdo_decode_slab_journal_entry(struct packed_slab_journal_block *block, + journal_entry_count_t entry_count) +{ + struct slab_journal_entry entry = + vdo_unpack_slab_journal_entry(&block->payload.entries[entry_count]); + + if (block->header.has_block_map_increments && + ((block->payload.full_entries.entry_types[entry_count / 8] & + ((u8)1 << (entry_count % 8))) != 0)) + entry.operation = VDO_JOURNAL_BLOCK_MAP_REMAPPING; + + return entry; +} + +/** + * allocate_partition() - Allocate a partition and add it to a layout. + * @layout: The layout containing the partition. + * @id: The id of the partition. + * @offset: The offset into the layout at which the partition begins. + * @size: The size of the partition in blocks. + * + * Return: VDO_SUCCESS or an error. + */ +static int allocate_partition(struct layout *layout, + u8 id, + physical_block_number_t offset, + block_count_t size) +{ + struct partition *partition; + int result; + + result = UDS_ALLOCATE(1, struct partition, __func__, &partition); + if (result != UDS_SUCCESS) + return result; + + partition->id = id; + partition->offset = offset; + partition->count = size; + partition->next = layout->head; + layout->head = partition; + + return VDO_SUCCESS; +} + +/** + * make_partition() - Create a new partition from the beginning or end of the unused space in a + * layout. + * @layout: The layout. + * @id: The id of the partition to make. + * @size: The number of blocks to carve out; if 0, all remaining space will be used. + * @beginning: True if the partition should start at the beginning of the unused space. + * + * Return: A success or error code, particularly VDO_NO_SPACE if there are fewer than size blocks + * remaining. + */ +static int __must_check +make_partition(struct layout *layout, + enum partition_id id, + block_count_t size, + bool beginning) +{ + int result; + physical_block_number_t offset; + block_count_t free_blocks = layout->last_free - layout->first_free; + + if (size == 0) { + if (free_blocks == 0) + return VDO_NO_SPACE; + size = free_blocks; + } else if (size > free_blocks) { + return VDO_NO_SPACE; + } + + result = vdo_get_partition(layout, id, NULL); + if (result != VDO_UNKNOWN_PARTITION) + return VDO_PARTITION_EXISTS; + + offset = beginning ? 
layout->first_free : (layout->last_free - size); + + result = allocate_partition(layout, id, offset, size); + if (result != VDO_SUCCESS) + return result; + + layout->num_partitions++; + if (beginning) + layout->first_free += size; + else + layout->last_free = layout->last_free - size; + + return VDO_SUCCESS; +} + +/** + * vdo_initialize_layout() - Lay out the partitions of a vdo. + * @size: The entire size of the vdo. + * @origin: The start of the layout on the underlying storage in blocks. + * @block_map_blocks: The size of the block map partition. + * @journal_blocks: The size of the journal partition. + * @summary_blocks: The size of the slab summary partition. + * @layout: The layout to initialize. + * + * Return: VDO_SUCCESS or an error. + */ +int vdo_initialize_layout(block_count_t size, + physical_block_number_t offset, + block_count_t block_map_blocks, + block_count_t journal_blocks, + block_count_t summary_blocks, + struct layout *layout) +{ + int result; + block_count_t necessary_size = + (offset + block_map_blocks + journal_blocks + summary_blocks); + + if (necessary_size > size) + return uds_log_error_strerror(VDO_NO_SPACE, "Not enough space to make a VDO"); + + *layout = (struct layout) { + .start = offset, + .size = size, + .first_free = offset, + .last_free = size, + .num_partitions = 0, + .head = NULL, + }; + + result = make_partition(layout, VDO_BLOCK_MAP_PARTITION, block_map_blocks, true); + if (result != VDO_SUCCESS) { + vdo_uninitialize_layout(layout); + return result; + } + + result = make_partition(layout, VDO_SLAB_SUMMARY_PARTITION, summary_blocks, false); + if (result != VDO_SUCCESS) { + vdo_uninitialize_layout(layout); + return result; + } + + result = make_partition(layout, VDO_RECOVERY_JOURNAL_PARTITION, journal_blocks, false); + if (result != VDO_SUCCESS) { + vdo_uninitialize_layout(layout); + return result; + } + + result = make_partition(layout, VDO_SLAB_DEPOT_PARTITION, 0, true); + if (result != VDO_SUCCESS) + vdo_uninitialize_layout(layout); + + return result; +} + +/** + * vdo_uninitialize_layout() - Clean up a layout. + * @layout: The layout to clean up. + * + * All partitions created by this layout become invalid pointers. + */ +void vdo_uninitialize_layout(struct layout *layout) +{ + while (layout->head != NULL) { + struct partition *part = layout->head; + + layout->head = part->next; + UDS_FREE(part); + } + + memset(layout, 0, sizeof(struct layout)); +} + +/** + * vdo_get_partition() - Get a partition by id. + * @layout: The layout from which to get a partition. + * @id: The id of the partition. + * @partition_ptr: A pointer to hold the partition. + * + * Return: VDO_SUCCESS or an error. + */ +int vdo_get_partition(struct layout *layout, + enum partition_id id, + struct partition **partition_ptr) +{ + struct partition *partition; + + for (partition = layout->head; partition != NULL; partition = partition->next) { + if (partition->id == id) { + if (partition_ptr != NULL) + *partition_ptr = partition; + return VDO_SUCCESS; + } + } + + return VDO_UNKNOWN_PARTITION; +} + +/** + * vdo_get_known_partition() - Get a partition by id from a validated layout. + * @layout: The layout from which to get a partition. + * @id: The id of the partition. 
+ * + * Return: the partition + */ +struct partition *vdo_get_known_partition(struct layout *layout, enum partition_id id) +{ + struct partition *partition; + int result = vdo_get_partition(layout, id, &partition); + + ASSERT_LOG_ONLY(result == VDO_SUCCESS, "layout has expected partition: %u", id); + + return partition; +} + +static void encode_layout(u8 *buffer, size_t *offset, const struct layout *layout) +{ + const struct partition *partition; + size_t initial_offset; + struct header header = VDO_LAYOUT_HEADER_3_0; + + STATIC_ASSERT_SIZEOF(enum partition_id, sizeof(u8)); + ASSERT_LOG_ONLY(layout->num_partitions <= U8_MAX, + "layout partition count must fit in a byte"); + + vdo_encode_header(buffer, offset, &header); + + initial_offset = *offset; + encode_u64_le(buffer, offset, layout->first_free); + encode_u64_le(buffer, offset, layout->last_free); + buffer[(*offset)++] = layout->num_partitions; + + ASSERT_LOG_ONLY(sizeof(struct layout_3_0) == *offset - initial_offset, + "encoded size of a layout header must match structure"); + + for (partition = layout->head; partition != NULL; partition = partition->next) { + buffer[(*offset)++] = partition->id; + encode_u64_le(buffer, offset, partition->offset); + /* This field only exists for backwards compatibility */ + encode_u64_le(buffer, offset, 0); + encode_u64_le(buffer, offset, partition->count); + } + + ASSERT_LOG_ONLY(header.size == *offset - initial_offset, + "encoded size of a layout must match header size"); +} + +static int +decode_layout(u8 *buffer, + size_t *offset, + physical_block_number_t start, + block_count_t size, + struct layout *layout) +{ + struct header header; + struct layout_3_0 layout_header; + struct partition *partition; + size_t initial_offset; + physical_block_number_t first_free, last_free; + u8 partition_count; + u8 i; + int result; + + vdo_decode_header(buffer, offset, &header); + /* Layout is variable size, so only do a minimum size check here. 
*/ + result = vdo_validate_header(&VDO_LAYOUT_HEADER_3_0, &header, false, __func__); + if (result != VDO_SUCCESS) + return result; + + initial_offset = *offset; + decode_u64_le(buffer, offset, &first_free); + decode_u64_le(buffer, offset, &last_free); + partition_count = buffer[(*offset)++]; + layout_header = (struct layout_3_0) { + .first_free = first_free, + .last_free = last_free, + .partition_count = partition_count, + }; + + result = ASSERT(sizeof(struct layout_3_0) == *offset - initial_offset, + "decoded size of a layout header must match structure"); + if (result != VDO_SUCCESS) + return result; + + layout->start = start; + layout->size = size; + layout->first_free = layout_header.first_free; + layout->last_free = layout_header.last_free; + layout->num_partitions = layout_header.partition_count; + + if (layout->num_partitions > VDO_PARTITION_COUNT) + return uds_log_error_strerror(VDO_UNKNOWN_PARTITION, + "layout has extra partitions"); + + for (i = 0; i < layout->num_partitions; i++) { + u8 id; + u64 partition_offset, count; + + id = buffer[(*offset)++]; + decode_u64_le(buffer, offset, &partition_offset); + *offset += sizeof(u64); + decode_u64_le(buffer, offset, &count); + + result = allocate_partition(layout, id, partition_offset, count); + if (result != VDO_SUCCESS) { + vdo_uninitialize_layout(layout); + return result; + } + } + + /* Validate that the layout has all (and only) the required partitions */ + for (i = 0; i < VDO_PARTITION_COUNT; i++) { + result = vdo_get_partition(layout, REQUIRED_PARTITIONS[i], &partition); + if (result != VDO_SUCCESS) { + vdo_uninitialize_layout(layout); + return uds_log_error_strerror(result, + "layout is missing required partition %u", + REQUIRED_PARTITIONS[i]); + } + + start += partition->count; + } + + if (start != size) { + vdo_uninitialize_layout(layout); + return uds_log_error_strerror(UDS_BAD_STATE, "partitions do not cover the layout"); + } + + return VDO_SUCCESS; +} + +/** + * pack_vdo_config() - Convert a vdo_config to its packed on-disk representation. + * @config: The vdo config to convert. + * + * Return: The platform-independent representation of the config. + */ +static struct packed_vdo_config pack_vdo_config(struct vdo_config config) +{ + return (struct packed_vdo_config) { + .logical_blocks = __cpu_to_le64(config.logical_blocks), + .physical_blocks = __cpu_to_le64(config.physical_blocks), + .slab_size = __cpu_to_le64(config.slab_size), + .recovery_journal_size = __cpu_to_le64(config.recovery_journal_size), + .slab_journal_blocks = __cpu_to_le64(config.slab_journal_blocks), + }; +} + +/** + * pack_vdo_component() - Convert a vdo_component to its packed on-disk representation. + * @component: The VDO component data to convert. + * + * Return: The platform-independent representation of the component. 
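+ *
+ * This is the inverse of unpack_vdo_component_41_0() below; packing and then unpacking
+ * a component yields the original values.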
+ */ +static struct packed_vdo_component_41_0 pack_vdo_component(const struct vdo_component component) +{ + return (struct packed_vdo_component_41_0) { + .state = __cpu_to_le32(component.state), + .complete_recoveries = __cpu_to_le64(component.complete_recoveries), + .read_only_recoveries = __cpu_to_le64(component.read_only_recoveries), + .config = pack_vdo_config(component.config), + .nonce = __cpu_to_le64(component.nonce), + }; +} + +static void encode_vdo_component(u8 *buffer, size_t *offset, struct vdo_component component) +{ + struct packed_vdo_component_41_0 packed; + + encode_version_number(buffer, offset, VDO_COMPONENT_DATA_41_0); + packed = pack_vdo_component(component); + memcpy(buffer + *offset, &packed, sizeof(packed)); + *offset += sizeof(packed); +} + +/** + * unpack_vdo_config() - Convert a packed_vdo_config to its native in-memory representation. + * @config: The packed vdo config to convert. + * + * Return: The native in-memory representation of the vdo config. + */ +static struct vdo_config unpack_vdo_config(struct packed_vdo_config config) +{ + return (struct vdo_config) { + .logical_blocks = __le64_to_cpu(config.logical_blocks), + .physical_blocks = __le64_to_cpu(config.physical_blocks), + .slab_size = __le64_to_cpu(config.slab_size), + .recovery_journal_size = __le64_to_cpu(config.recovery_journal_size), + .slab_journal_blocks = __le64_to_cpu(config.slab_journal_blocks), + }; +} + +/** + * unpack_vdo_component_41_0() - Convert a packed_vdo_component_41_0 to its native in-memory + * representation. + * @component: The packed vdo component data to convert. + * + * Return: The native in-memory representation of the component. + */ +static struct vdo_component +unpack_vdo_component_41_0(struct packed_vdo_component_41_0 component) +{ + return (struct vdo_component) { + .state = __le32_to_cpu(component.state), + .complete_recoveries = __le64_to_cpu(component.complete_recoveries), + .read_only_recoveries = __le64_to_cpu(component.read_only_recoveries), + .config = unpack_vdo_config(component.config), + .nonce = __le64_to_cpu(component.nonce), + }; +} + +/** + * vdo_decode_component() - Decode the component data for the vdo itself out of the super block. + * + * Return: VDO_SUCCESS or an error. + */ +static int decode_vdo_component(u8 *buffer, size_t *offset, struct vdo_component *component) +{ + struct version_number version; + struct packed_vdo_component_41_0 packed; + int result; + + decode_version_number(buffer, offset, &version); + result = validate_version(version, VDO_COMPONENT_DATA_41_0, "VDO component data"); + if (result != VDO_SUCCESS) + return result; + + memcpy(&packed, buffer + *offset, sizeof(packed)); + *offset += sizeof(packed); + *component = unpack_vdo_component_41_0(packed); + return VDO_SUCCESS; +} + +/** + * vdo_validate_config() - Validate constraints on a VDO config. + * @config: The VDO config. + * @physical_block_count: The minimum block count of the underlying storage. + * @logical_block_count: The expected logical size of the VDO, or 0 if the logical size may be + * unspecified. + * + * Return: A success or error code. 
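+ *
+ * For example, vdo_validate_component_states() below invokes this as
+ * vdo_validate_config(&states->vdo.config, physical_size, logical_size), where a
+ * logical_size of 0 means the logical size was not specified by the caller.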
+ */ +int vdo_validate_config(const struct vdo_config *config, + block_count_t physical_block_count, + block_count_t logical_block_count) +{ + struct slab_config slab_config; + int result; + + result = ASSERT(config->slab_size > 0, "slab size unspecified"); + if (result != UDS_SUCCESS) + return result; + + result = ASSERT(is_power_of_2(config->slab_size), "slab size must be a power of two"); + if (result != UDS_SUCCESS) + return result; + + result = ASSERT(config->slab_size <= (1 << MAX_VDO_SLAB_BITS), + "slab size must be less than or equal to 2^%d", + MAX_VDO_SLAB_BITS); + if (result != VDO_SUCCESS) + return result; + + result = ASSERT(config->slab_journal_blocks >= MINIMUM_VDO_SLAB_JOURNAL_BLOCKS, + "slab journal size meets minimum size"); + if (result != UDS_SUCCESS) + return result; + + result = ASSERT(config->slab_journal_blocks <= config->slab_size, + "slab journal size is within expected bound"); + if (result != UDS_SUCCESS) + return result; + + result = vdo_configure_slab(config->slab_size, config->slab_journal_blocks, &slab_config); + if (result != VDO_SUCCESS) + return result; + + result = ASSERT((slab_config.data_blocks >= 1), + "slab must be able to hold at least one block"); + if (result != UDS_SUCCESS) + return result; + + result = ASSERT(config->physical_blocks > 0, "physical blocks unspecified"); + if (result != UDS_SUCCESS) + return result; + + result = ASSERT(config->physical_blocks <= MAXIMUM_VDO_PHYSICAL_BLOCKS, + "physical block count %llu exceeds maximum %llu", + (unsigned long long) config->physical_blocks, + (unsigned long long) MAXIMUM_VDO_PHYSICAL_BLOCKS); + if (result != UDS_SUCCESS) + return VDO_OUT_OF_RANGE; + + if (physical_block_count != config->physical_blocks) { + uds_log_error("A physical size of %llu blocks was specified, not the %llu blocks configured in the vdo super block", + (unsigned long long) physical_block_count, + (unsigned long long) config->physical_blocks); + return VDO_PARAMETER_MISMATCH; + } + + if (logical_block_count > 0) { + result = ASSERT((config->logical_blocks > 0), "logical blocks unspecified"); + if (result != UDS_SUCCESS) + return result; + + if (logical_block_count != config->logical_blocks) { + uds_log_error("A logical size of %llu blocks was specified, but that differs from the %llu blocks configured in the vdo super block", + (unsigned long long) logical_block_count, + (unsigned long long) config->logical_blocks); + return VDO_PARAMETER_MISMATCH; + } + } + + result = ASSERT(config->logical_blocks <= MAXIMUM_VDO_LOGICAL_BLOCKS, + "logical blocks too large"); + if (result != UDS_SUCCESS) + return result; + + result = ASSERT(config->recovery_journal_size > 0, "recovery journal size unspecified"); + if (result != UDS_SUCCESS) + return result; + + result = ASSERT(is_power_of_2(config->recovery_journal_size), + "recovery journal size must be a power of two"); + if (result != UDS_SUCCESS) + return result; + + return result; +} + +/** + * vdo_destroy_component_states() - Clean up any allocations in a vdo_component_states. + * @states: The component states to destroy. + */ +void vdo_destroy_component_states(struct vdo_component_states *states) +{ + if (states == NULL) + return; + + vdo_uninitialize_layout(&states->layout); +} + +/** + * decode_components() - Decode the components now that we know the component data is a version we + * understand. + * @buffer: The buffer being decoded. + * @offset: The offset to start decoding from. + * @geometry: The vdo geometry + * @states: An object to hold the successfully decoded state. 
+ * + * Return: VDO_SUCCESS or an error. + */ +static int __must_check decode_components(u8 *buffer, + size_t *offset, + struct volume_geometry *geometry, + struct vdo_component_states *states) +{ + int result; + + decode_vdo_component(buffer, offset, &states->vdo); + + result = decode_layout(buffer, + offset, + vdo_get_data_region_start(*geometry) + 1, + states->vdo.config.physical_blocks, + &states->layout); + if (result != VDO_SUCCESS) + return result; + + result = decode_recovery_journal_state_7_0(buffer, offset, &states->recovery_journal); + if (result != VDO_SUCCESS) + return result; + + result = decode_slab_depot_state_2_0(buffer, offset, &states->slab_depot); + if (result != VDO_SUCCESS) + return result; + + result = decode_block_map_state_2_0(buffer, offset, &states->block_map); + if (result != VDO_SUCCESS) + return result; + + ASSERT_LOG_ONLY(*offset == VDO_COMPONENT_DATA_OFFSET + VDO_COMPONENT_DATA_SIZE, + "All decoded component data was used"); + return VDO_SUCCESS; +} + +/** + * vdo_decode_component_states() - Decode the payload of a super block. + * @buffer: The buffer containing the encoded super block contents. + * @geometry: The vdo geometry + * @states: A pointer to hold the decoded states. + * + * Return: VDO_SUCCESS or an error. + */ +int vdo_decode_component_states(u8 *buffer, + struct volume_geometry *geometry, + struct vdo_component_states *states) +{ + int result; + size_t offset = VDO_COMPONENT_DATA_OFFSET; + + /* This is for backwards compatibility. */ + decode_u32_le(buffer, &offset, &states->unused); + + /* Check the VDO volume version */ + decode_version_number(buffer, &offset, &states->volume_version); + result = validate_version(VDO_VOLUME_VERSION_67_0, states->volume_version, "volume"); + if (result != VDO_SUCCESS) + return result; + + result = decode_components(buffer, &offset, geometry, states); + if (result != VDO_SUCCESS) + vdo_uninitialize_layout(&states->layout); + + return result; +} + +/** + * vdo_validate_component_states() - Validate the decoded super block configuration. + * @states: The state decoded from the super block. + * @geometry_nonce: The nonce from the geometry block. + * @physical_size: The minimum block count of the underlying storage. + * @logical_size: The expected logical size of the VDO, or 0 if the logical size may be + * unspecified. + * + * Return: VDO_SUCCESS or an error if the configuration is invalid. + */ +int vdo_validate_component_states(struct vdo_component_states *states, + nonce_t geometry_nonce, + block_count_t physical_size, + block_count_t logical_size) +{ + if (geometry_nonce != states->vdo.nonce) + return uds_log_error_strerror(VDO_BAD_NONCE, + "Geometry nonce %llu does not match superblock nonce %llu", + (unsigned long long) geometry_nonce, + (unsigned long long) states->vdo.nonce); + + return vdo_validate_config(&states->vdo.config, physical_size, logical_size); +} + +/** + * vdo_encode_component_states() - Encode the state of all vdo components in the super block. + */ +static void +vdo_encode_component_states(u8 *buffer, size_t *offset, const struct vdo_component_states *states) +{ + /* This is for backwards compatibility. 
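+ * The unused field carries no information, but it is still written (and read in the decode
+ * path above) so that the encoded component data keeps its historical size and layout.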
*/ + encode_u32_le(buffer, offset, states->unused); + encode_version_number(buffer, offset, states->volume_version); + encode_vdo_component(buffer, offset, states->vdo); + encode_layout(buffer, offset, &states->layout); + encode_recovery_journal_state_7_0(buffer, offset, states->recovery_journal); + encode_slab_depot_state_2_0(buffer, offset, states->slab_depot); + encode_block_map_state_2_0(buffer, offset, states->block_map); + + ASSERT_LOG_ONLY(*offset == VDO_COMPONENT_DATA_OFFSET + VDO_COMPONENT_DATA_SIZE, + "All super block component data was encoded"); +} + +/** + * vdo_encode_super_block() - Encode a super block into its on-disk representation. + */ +void vdo_encode_super_block(u8 *buffer, struct vdo_component_states *states) +{ + u32 checksum; + struct header header = SUPER_BLOCK_HEADER_12_0; + size_t offset = 0; + + header.size += VDO_COMPONENT_DATA_SIZE; + vdo_encode_header(buffer, &offset, &header); + vdo_encode_component_states(buffer, &offset, states); + + checksum = vdo_crc32(buffer, offset); + encode_u32_le(buffer, &offset, checksum); + + /* + * Even though the buffer is a full block, to avoid the potential corruption from a torn + * write, the entire encoding must fit in the first sector. + */ + ASSERT_LOG_ONLY(offset <= VDO_SECTOR_SIZE, "entire superblock must fit in one sector"); +} + +/** + * vdo_decode_super_block() - Decode a super block from its on-disk representation. + */ +int vdo_decode_super_block(u8 *buffer) +{ + struct header header; + int result; + u32 checksum, saved_checksum; + size_t offset = 0; + + /* Decode and validate the header. */ + vdo_decode_header(buffer, &offset, &header); + result = vdo_validate_header(&SUPER_BLOCK_HEADER_12_0, &header, false, __func__); + if (result != VDO_SUCCESS) + return result; + + if (header.size > VDO_COMPONENT_DATA_SIZE + sizeof(u32)) + /* + * We can't check release version or checksum until we know the content size, so we + * have to assume a version mismatch on unexpected values. + */ + return uds_log_error_strerror(VDO_UNSUPPORTED_VERSION, + "super block contents too large: %zu", + header.size); + + /* Skip past the component data for now, to verify the checksum. */ + offset += VDO_COMPONENT_DATA_SIZE; + + checksum = vdo_crc32(buffer, offset); + decode_u32_le(buffer, &offset, &saved_checksum); + + result = ASSERT(offset == VDO_SUPER_BLOCK_FIXED_SIZE + VDO_COMPONENT_DATA_SIZE, + "must have decoded entire superblock payload"); + if (result != VDO_SUCCESS) + return result; + + return ((checksum != saved_checksum) ? VDO_CHECKSUM_MISMATCH : VDO_SUCCESS); +} diff --git a/drivers/md/dm-vdo/encodings.h b/drivers/md/dm-vdo/encodings.h new file mode 100644 index 000000000000..49b849a087d3 --- /dev/null +++ b/drivers/md/dm-vdo/encodings.h @@ -0,0 +1,1307 @@ +/* SPDX-License-Identifier: GPL-2.0-only */ +/* + * Copyright 2023 Red Hat + */ + +#ifndef VDO_ENCODINGS_H +#define VDO_ENCODINGS_H + +#include +#include +#include +#include + +#include "numeric.h" +#include "uds.h" + +#include "constants.h" +#include "types.h" + +/* + * An in-memory representation of a version number for versioned structures on disk. + * + * A version number consists of two portions, a major version and a minor version. Any format + * change which does not require an explicit upgrade step from the previous version should + * increment the minor version. Any format change which either requires an explicit upgrade step, + * or is wholly incompatible (i.e. can not be upgraded to), should increment the major version, and + * set the minor version to 0. 
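+ *
+ * For example, a change that can be read without any conversion step would take a component
+ * from version 2.0 to 2.1, while an incompatible re-layout of the same component would take
+ * it to 3.0.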
+ */ +struct version_number { + u32 major_version; + u32 minor_version; +}; + +/* + * A packed, machine-independent, on-disk representation of a version_number. Both fields are + * stored in little-endian byte order. + */ +struct packed_version_number { + __le32 major_version; + __le32 minor_version; +} __packed; + +/* The registry of component ids for use in headers */ +#define VDO_SUPER_BLOCK 0 +#define VDO_LAYOUT 1 +#define VDO_RECOVERY_JOURNAL 2 +#define VDO_SLAB_DEPOT 3 +#define VDO_BLOCK_MAP 4 +#define VDO_GEOMETRY_BLOCK 5 + +/* The header for versioned data stored on disk. */ +struct header { + u32 id; /* The component this is a header for */ + struct version_number version; /* The version of the data format */ + size_t size; /* The size of the data following this header */ +}; + +/* A packed, machine-independent, on-disk representation of a component header. */ +struct packed_header { + __le32 id; + struct packed_version_number version; + __le64 size; +} __packed; + +enum { + VDO_GEOMETRY_BLOCK_LOCATION = 0, + VDO_GEOMETRY_MAGIC_NUMBER_SIZE = 8, + VDO_DEFAULT_GEOMETRY_BLOCK_VERSION = 5, +}; + +struct index_config { + u32 mem; + u32 unused; + bool sparse; +} __packed; + +enum volume_region_id { + VDO_INDEX_REGION = 0, + VDO_DATA_REGION = 1, + VDO_VOLUME_REGION_COUNT, +}; + +struct volume_region { + /* The ID of the region */ + enum volume_region_id id; + /* + * The absolute starting offset on the device. The region continues until the next region + * begins. + */ + physical_block_number_t start_block; +} __packed; + +struct volume_geometry { + /* For backwards compatibility */ + u32 unused; + /* The nonce of this volume */ + nonce_t nonce; + /* The uuid of this volume */ + uuid_t uuid; + /* The block offset to be applied to bios */ + block_count_t bio_offset; + /* The regions in ID order */ + struct volume_region regions[VDO_VOLUME_REGION_COUNT]; + /* The index config */ + struct index_config index_config; +} __packed; + +/* This volume geometry struct is used for sizing only */ +struct volume_geometry_4_0 { + /* For backwards compatibility */ + u32 unused; + /* The nonce of this volume */ + nonce_t nonce; + /* The uuid of this volume */ + uuid_t uuid; + /* The regions in ID order */ + struct volume_region regions[VDO_VOLUME_REGION_COUNT]; + /* The index config */ + struct index_config index_config; +} __packed; + +extern const u8 VDO_GEOMETRY_MAGIC_NUMBER[VDO_GEOMETRY_MAGIC_NUMBER_SIZE + 1]; + +/** + * DOC: Block map entries + * + * The entry for each logical block in the block map is encoded into five bytes, which saves space + * in both the on-disk and in-memory layouts. It consists of the 36 low-order bits of a + * physical_block_number_t (addressing 256 terabytes with a 4KB block size) and a 4-bit encoding of + * a block_mapping_state. + * + * Of the 8 high bits of the 5-byte structure: + * + * Bits 7..4: The four highest bits of the 36-bit physical block number + * Bits 3..0: The 4-bit block_mapping_state + * + * The following 4 bytes are the low order bytes of the physical block number, in little-endian + * order. + * + * Conversion functions to and from a data location are provided. 
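+ *
+ * For example, packing PBN 0x123456789 with a mapping state of 0x2 stores 0x1 in the high
+ * nibble, 0x2 in the state nibble, and 0x23456789 (little-endian) in the low word; see
+ * vdo_pack_block_map_entry() and vdo_unpack_block_map_entry() below.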
+ */ +struct block_map_entry { +#if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__ + unsigned mapping_state : 4; + unsigned pbn_high_nibble : 4; +#else + unsigned pbn_high_nibble : 4; + unsigned mapping_state : 4; +#endif + + __le32 pbn_low_word; +} __packed; + +struct block_map_page_header { + __le64 nonce; + __le64 pbn; + + /** May be non-zero on disk */ + u8 unused_long_word[8]; + + /* Whether this page has been written twice to disk */ + bool initialized; + + /* Always zero on disk */ + u8 unused_byte1; + + /* May be non-zero on disk */ + u8 unused_byte2; + u8 unused_byte3; +} __packed; + +struct block_map_page { + struct packed_version_number version; + struct block_map_page_header header; + struct block_map_entry entries[]; +} __packed; + +enum block_map_page_validity { + VDO_BLOCK_MAP_PAGE_VALID, + VDO_BLOCK_MAP_PAGE_INVALID, + /* Valid page found in the wrong location on disk */ + VDO_BLOCK_MAP_PAGE_BAD, +}; + +struct block_map_state_2_0 { + physical_block_number_t flat_page_origin; + block_count_t flat_page_count; + physical_block_number_t root_origin; + block_count_t root_count; +} __packed; + +struct boundary { + page_number_t levels[VDO_BLOCK_MAP_TREE_HEIGHT]; +}; + +extern const struct header VDO_BLOCK_MAP_HEADER_2_0; + +/* The state of the recovery journal as encoded in the VDO super block. */ +struct recovery_journal_state_7_0 { + /** Sequence number to start the journal */ + sequence_number_t journal_start; + /** Number of logical blocks used by VDO */ + block_count_t logical_blocks_used; + /** Number of block map pages allocated */ + block_count_t block_map_data_blocks; +} __packed; + +extern const struct header VDO_RECOVERY_JOURNAL_HEADER_7_0; + +typedef u16 journal_entry_count_t; + +/* + * A recovery journal entry stores three physical locations: a data location that is the value of a + * single mapping in the block map tree, and the two locations of the block map pages and slots + * that are acquiring and releasing a reference to the location. The journal entry also stores an + * operation code that says whether the mapping is for a logical block or for the block map tree + * itself. + */ +struct recovery_journal_entry { + struct block_map_slot slot; + struct data_location mapping; + struct data_location unmapping; + enum journal_operation operation; +}; + +/* The packed, on-disk representation of a recovery journal entry. */ +struct packed_recovery_journal_entry { + /* + * In little-endian bit order: + * Bits 15..12: The four highest bits of the 36-bit physical block number of the block map + * tree page + * Bits 11..2: The 10-bit block map page slot number + * Bit 1..0: The journal_operation of the entry (this actually only requires 1 bit, but + * it is convenient to keep the extra bit as part of this field. 
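+ *
+ * For example, a ten-bit slot number of 0x2A5 is stored as slot_low = 0x25 and slot_high = 0xA,
+ * matching the split performed by vdo_pack_recovery_journal_entry().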
+ */ +#if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__ + unsigned operation : 2; + unsigned slot_low : 6; + unsigned slot_high : 4; + unsigned pbn_high_nibble : 4; +#else + unsigned slot_low : 6; + unsigned operation : 2; + unsigned pbn_high_nibble : 4; + unsigned slot_high : 4; +#endif + + /* + * Bits 47..16: The 32 low-order bits of the block map page PBN, in little-endian byte + * order + */ + __le32 pbn_low_word; + + /* + * Bits 87..48: The five-byte block map entry encoding the location that will be stored in + * the block map page slot + */ + struct block_map_entry mapping; + + /* + * Bits 127..88: The five-byte block map entry encoding the location that was stored in the + * block map page slot + */ + struct block_map_entry unmapping; +} __packed; + +/* The packed, on-disk representation of an old format recovery journal entry. */ +struct packed_recovery_journal_entry_1 { + /* + * In little-endian bit order: + * Bits 15..12: The four highest bits of the 36-bit physical block number of the block map + * tree page + * Bits 11..2: The 10-bit block map page slot number + * Bits 1..0: The 2-bit journal_operation of the entry + * + */ +#if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__ + unsigned operation : 2; + unsigned slot_low : 6; + unsigned slot_high : 4; + unsigned pbn_high_nibble : 4; +#else + unsigned slot_low : 6; + unsigned operation : 2; + unsigned pbn_high_nibble : 4; + unsigned slot_high : 4; +#endif + + /* + * Bits 47..16: The 32 low-order bits of the block map page PBN, in little-endian byte + * order + */ + __le32 pbn_low_word; + + /* + * Bits 87..48: The five-byte block map entry encoding the location that was or will be + * stored in the block map page slot + */ + struct block_map_entry block_map_entry; +} __packed; + +enum journal_operation_1 { + VDO_JOURNAL_DATA_DECREMENT = 0, + VDO_JOURNAL_DATA_INCREMENT = 1, + VDO_JOURNAL_BLOCK_MAP_DECREMENT = 2, + VDO_JOURNAL_BLOCK_MAP_INCREMENT = 3, +} __packed; + +struct recovery_block_header { + sequence_number_t block_map_head; /* Block map head sequence number */ + sequence_number_t slab_journal_head; /* Slab journal head seq. number */ + sequence_number_t sequence_number; /* Sequence number for this block */ + nonce_t nonce; /* A given VDO instance's nonce */ + block_count_t logical_blocks_used; /* Logical blocks in use */ + block_count_t block_map_data_blocks; /* Allocated block map pages */ + journal_entry_count_t entry_count; /* Number of entries written */ + u8 check_byte; /* The protection check byte */ + u8 recovery_count; /* Number of recoveries completed */ + enum vdo_metadata_type metadata_type; /* Metadata type */ +}; + +/* + * The packed, on-disk representation of a recovery journal block header. All fields are kept in + * little-endian byte order. 
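+ *
+ * Sector 0 of a journal block holds this header; sectors 1 through 7 hold entries. Assuming
+ * 512-byte sectors, each entry sector fits 31 of the 16-byte packed entries after its 3-byte
+ * sector header, which is how RECOVERY_JOURNAL_ENTRIES_PER_BLOCK works out to 217 (31 * 7).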
+ */ +struct packed_journal_header { + /* Block map head 64-bit sequence number */ + __le64 block_map_head; + + /* Slab journal head 64-bit sequence number */ + __le64 slab_journal_head; + + /* The 64-bit sequence number for this block */ + __le64 sequence_number; + + /* A given VDO instance's 64-bit nonce */ + __le64 nonce; + + /* 8-bit metadata type (should always be one for the recovery journal) */ + u8 metadata_type; + + /* 16-bit count of the entries encoded in the block */ + __le16 entry_count; + + /* 64-bit count of the logical blocks used when this block was opened */ + __le64 logical_blocks_used; + + /* 64-bit count of the block map blocks used when this block was opened */ + __le64 block_map_data_blocks; + + /* The protection check byte */ + u8 check_byte; + + /* The number of recoveries completed */ + u8 recovery_count; +} __packed; + +struct packed_journal_sector { + /* The protection check byte */ + u8 check_byte; + + /* The number of recoveries completed */ + u8 recovery_count; + + /* The number of entries in this sector */ + u8 entry_count; + + /* Journal entries for this sector */ + struct packed_recovery_journal_entry entries[]; +} __packed; + +enum { + /* The number of entries in each sector (except the last) when filled */ + RECOVERY_JOURNAL_ENTRIES_PER_SECTOR = + ((VDO_SECTOR_SIZE - sizeof(struct packed_journal_sector)) / + sizeof(struct packed_recovery_journal_entry)), + RECOVERY_JOURNAL_ENTRIES_PER_BLOCK = RECOVERY_JOURNAL_ENTRIES_PER_SECTOR * 7, + /* The number of entries in a v1 recovery journal block. */ + RECOVERY_JOURNAL_1_ENTRIES_PER_BLOCK = 311, + /* The number of entries in each v1 sector (except the last) when filled */ + RECOVERY_JOURNAL_1_ENTRIES_PER_SECTOR = + ((VDO_SECTOR_SIZE - sizeof(struct packed_journal_sector)) / + sizeof(struct packed_recovery_journal_entry_1)), + /* The number of entries in the last sector when a block is full */ + RECOVERY_JOURNAL_1_ENTRIES_IN_LAST_SECTOR = + (RECOVERY_JOURNAL_1_ENTRIES_PER_BLOCK % RECOVERY_JOURNAL_1_ENTRIES_PER_SECTOR), +}; + +/* A type representing a reference count of a block. */ +typedef u8 vdo_refcount_t; + +/* The absolute position of an entry in a recovery journal or slab journal. */ +struct journal_point { + sequence_number_t sequence_number; + journal_entry_count_t entry_count; +}; + +/* A packed, platform-independent encoding of a struct journal_point. */ +struct packed_journal_point { + /* + * The packed representation is the little-endian 64-bit representation of the low-order 48 + * bits of the sequence number, shifted up 16 bits, or'ed with the 16-bit entry count. + * + * Very long-term, the top 16 bits of the sequence number may not always be zero, as this + * encoding assumes--see BZ 1523240. + */ + __le64 encoded_point; +} __packed; + +/* Special vdo_refcount_t values. */ +#define EMPTY_REFERENCE_COUNT 0 +enum { + MAXIMUM_REFERENCE_COUNT = 254, + PROVISIONAL_REFERENCE_COUNT = 255, +}; + +enum { + COUNTS_PER_SECTOR = + ((VDO_SECTOR_SIZE - sizeof(struct packed_journal_point)) / sizeof(vdo_refcount_t)), + COUNTS_PER_BLOCK = COUNTS_PER_SECTOR * VDO_SECTORS_PER_BLOCK, +}; + +/* The format of each sector of a reference_block on disk. 
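+ *
+ * Each sector pairs an 8-byte packed journal point with one-byte reference counts. Assuming
+ * 512-byte sectors, that is 504 counts per sector, so a reference block of 8 sectors covers
+ * 4032 data blocks (COUNTS_PER_BLOCK).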
*/ +struct packed_reference_sector { + struct packed_journal_point commit_point; + vdo_refcount_t counts[COUNTS_PER_SECTOR]; +} __packed; + +struct packed_reference_block { + struct packed_reference_sector sectors[VDO_SECTORS_PER_BLOCK]; +}; + +struct slab_depot_state_2_0 { + struct slab_config slab_config; + physical_block_number_t first_block; + physical_block_number_t last_block; + zone_count_t zone_count; +} __packed; + +extern const struct header VDO_SLAB_DEPOT_HEADER_2_0; + +/* + * vdo_slab journal blocks may have one of two formats, depending upon whether or not any of the + * entries in the block are block map increments. Since the steady state for a VDO is that all of + * the necessary block map pages will be allocated, most slab journal blocks will have only data + * entries. Such blocks can hold more entries, hence the two formats. + */ + +/* A single slab journal entry */ +struct slab_journal_entry { + slab_block_number sbn; + enum journal_operation operation; + bool increment; +}; + +/* A single slab journal entry in its on-disk form */ +typedef struct { + u8 offset_low8; + u8 offset_mid8; + +#if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__ + unsigned offset_high7 : 7; + unsigned increment : 1; +#else + unsigned increment : 1; + unsigned offset_high7 : 7; +#endif +} __packed packed_slab_journal_entry; + +/* The unpacked representation of the header of a slab journal block */ +struct slab_journal_block_header { + /* Sequence number for head of journal */ + sequence_number_t head; + /* Sequence number for this block */ + sequence_number_t sequence_number; + /* The nonce for a given VDO instance */ + nonce_t nonce; + /* Recovery journal point for last entry */ + struct journal_point recovery_point; + /* Metadata type */ + enum vdo_metadata_type metadata_type; + /* Whether this block contains block map increments */ + bool has_block_map_increments; + /* The number of entries in the block */ + journal_entry_count_t entry_count; +}; + +/* + * The packed, on-disk representation of a slab journal block header. All fields are kept in + * little-endian byte order. 
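+ *
+ * This header occupies 36 bytes, leaving a 4060-byte payload in a 4K block: room for 1353
+ * three-byte entries in the data-only format, or 1299 entries plus a 163-byte type bitmap
+ * (25 bits per entry) when block map increments are present, as the enums below compute.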
+ */ +struct packed_slab_journal_block_header { + /* 64-bit sequence number for head of journal */ + __le64 head; + /* 64-bit sequence number for this block */ + __le64 sequence_number; + /* Recovery journal point for the last entry, packed into 64 bits */ + struct packed_journal_point recovery_point; + /* The 64-bit nonce for a given VDO instance */ + __le64 nonce; + /* 8-bit metadata type (should always be two, for the slab journal) */ + u8 metadata_type; + /* Whether this block contains block map increments */ + bool has_block_map_increments; + /* 16-bit count of the entries encoded in the block */ + __le16 entry_count; +} __packed; + +enum { + VDO_SLAB_JOURNAL_PAYLOAD_SIZE = + VDO_BLOCK_SIZE - sizeof(struct packed_slab_journal_block_header), + VDO_SLAB_JOURNAL_FULL_ENTRIES_PER_BLOCK = (VDO_SLAB_JOURNAL_PAYLOAD_SIZE * 8) / 25, + VDO_SLAB_JOURNAL_ENTRY_TYPES_SIZE = + ((VDO_SLAB_JOURNAL_FULL_ENTRIES_PER_BLOCK - 1) / 8) + 1, + VDO_SLAB_JOURNAL_ENTRIES_PER_BLOCK = + (VDO_SLAB_JOURNAL_PAYLOAD_SIZE / sizeof(packed_slab_journal_entry)), +}; + +/* The payload of a slab journal block which has block map increments */ +struct full_slab_journal_entries { + /* The entries themselves */ + packed_slab_journal_entry entries[VDO_SLAB_JOURNAL_FULL_ENTRIES_PER_BLOCK]; + /* The bit map indicating which entries are block map increments */ + u8 entry_types[VDO_SLAB_JOURNAL_ENTRY_TYPES_SIZE]; +} __packed; + +typedef union { + /* Entries which include block map increments */ + struct full_slab_journal_entries full_entries; + /* Entries which are only data updates */ + packed_slab_journal_entry entries[VDO_SLAB_JOURNAL_ENTRIES_PER_BLOCK]; + /* Ensure the payload fills to the end of the block */ + u8 space[VDO_SLAB_JOURNAL_PAYLOAD_SIZE]; +} __packed slab_journal_payload; + +struct packed_slab_journal_block { + struct packed_slab_journal_block_header header; + slab_journal_payload payload; +} __packed; + +/* The offset of a slab journal tail block. 
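+ * This offset is recorded per slab in the slab_summary_entry defined below.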
*/ +typedef u8 tail_block_offset_t; + +struct slab_summary_entry { + /* Bits 7..0: The offset of the tail block within the slab journal */ + tail_block_offset_t tail_block_offset; + +#if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__ + /* Bits 13..8: A hint about the fullness of the slab */ + unsigned int fullness_hint : 6; + /* Bit 14: Whether the ref_counts must be loaded from the layer */ + unsigned int load_ref_counts : 1; + /* Bit 15: The believed cleanliness of this slab */ + unsigned int is_dirty : 1; +#else + /* Bit 15: The believed cleanliness of this slab */ + unsigned int is_dirty : 1; + /* Bit 14: Whether the ref_counts must be loaded from the layer */ + unsigned int load_ref_counts : 1; + /* Bits 13..8: A hint about the fullness of the slab */ + unsigned int fullness_hint : 6; +#endif +} __packed; + +enum { + VDO_SLAB_SUMMARY_FULLNESS_HINT_BITS = 6, + VDO_SLAB_SUMMARY_ENTRIES_PER_BLOCK = VDO_BLOCK_SIZE / sizeof(struct slab_summary_entry), + VDO_SLAB_SUMMARY_BLOCKS_PER_ZONE = MAX_VDO_SLABS / VDO_SLAB_SUMMARY_ENTRIES_PER_BLOCK, + VDO_SLAB_SUMMARY_BLOCKS = VDO_SLAB_SUMMARY_BLOCKS_PER_ZONE * MAX_VDO_PHYSICAL_ZONES, +}; + +struct layout { + physical_block_number_t start; + block_count_t size; + physical_block_number_t first_free; + physical_block_number_t last_free; + size_t num_partitions; + struct partition *head; +}; + +struct partition { + enum partition_id id; /* The id of this partition */ + physical_block_number_t offset; /* The offset into the layout of this partition */ + block_count_t count; /* The number of blocks in the partition */ + struct partition *next; /* A pointer to the next partition in the layout */ +}; + +struct layout_3_0 { + physical_block_number_t first_free; + physical_block_number_t last_free; + u8 partition_count; +} __packed; + +struct partition_3_0 { + enum partition_id id; + physical_block_number_t offset; + physical_block_number_t base; /* unused but retained for backwards compatibility */ + block_count_t count; +} __packed; + +/* + * The configuration of the VDO service. + */ +struct vdo_config { + block_count_t logical_blocks; /* number of logical blocks */ + block_count_t physical_blocks; /* number of physical blocks */ + block_count_t slab_size; /* number of blocks in a slab */ + block_count_t recovery_journal_size; /* number of recovery journal blocks */ + block_count_t slab_journal_blocks; /* number of slab journal blocks */ +}; + +/* This is the structure that captures the vdo fields saved as a super block component. */ +struct vdo_component { + enum vdo_state state; + u64 complete_recoveries; + u64 read_only_recoveries; + struct vdo_config config; + nonce_t nonce; +}; + +/* + * A packed, machine-independent, on-disk representation of the vdo_config in the VDO component + * data in the super block. + */ +struct packed_vdo_config { + __le64 logical_blocks; + __le64 physical_blocks; + __le64 slab_size; + __le64 recovery_journal_size; + __le64 slab_journal_blocks; +} __packed; + +/* + * A packed, machine-independent, on-disk representation of version 41.0 of the VDO component data + * in the super block. + */ +struct packed_vdo_component_41_0 { + __le32 state; + __le64 complete_recoveries; + __le64 read_only_recoveries; + struct packed_vdo_config config; + __le64 nonce; +} __packed; + +/* + * The version of the on-disk format of a VDO volume. This should be incremented any time the + * on-disk representation of any VDO structure changes. Changes which require only online upgrade + * steps should increment the minor version. 
Changes which require an offline upgrade or which can + * not be upgraded to at all should increment the major version and set the minor version to 0. + */ +extern const struct version_number VDO_VOLUME_VERSION_67_0; + +enum { + VDO_ENCODED_HEADER_SIZE = sizeof(struct packed_header), + BLOCK_MAP_COMPONENT_ENCODED_SIZE = + VDO_ENCODED_HEADER_SIZE + sizeof(struct block_map_state_2_0), + RECOVERY_JOURNAL_COMPONENT_ENCODED_SIZE = + VDO_ENCODED_HEADER_SIZE + sizeof(struct recovery_journal_state_7_0), + SLAB_DEPOT_COMPONENT_ENCODED_SIZE = + VDO_ENCODED_HEADER_SIZE + sizeof(struct slab_depot_state_2_0), + VDO_PARTITION_COUNT = 4, + VDO_LAYOUT_ENCODED_SIZE = (VDO_ENCODED_HEADER_SIZE + + sizeof(struct layout_3_0) + + (sizeof(struct partition_3_0) * VDO_PARTITION_COUNT)), + VDO_SUPER_BLOCK_FIXED_SIZE = VDO_ENCODED_HEADER_SIZE + sizeof(u32), + VDO_MAX_COMPONENT_DATA_SIZE = VDO_SECTOR_SIZE - VDO_SUPER_BLOCK_FIXED_SIZE, + VDO_COMPONENT_ENCODED_SIZE = + (sizeof(struct packed_version_number) + sizeof(struct packed_vdo_component_41_0)), + VDO_COMPONENT_DATA_OFFSET = VDO_ENCODED_HEADER_SIZE, + VDO_COMPONENT_DATA_SIZE = (sizeof(u32) + + sizeof(struct packed_version_number) + + VDO_COMPONENT_ENCODED_SIZE + + VDO_LAYOUT_ENCODED_SIZE + + RECOVERY_JOURNAL_COMPONENT_ENCODED_SIZE + + SLAB_DEPOT_COMPONENT_ENCODED_SIZE + + BLOCK_MAP_COMPONENT_ENCODED_SIZE), +}; + +/* The entirety of the component data encoded in the VDO super block. */ +struct vdo_component_states { + /* For backwards compatibility */ + u32 unused; + + /* The VDO volume version */ + struct version_number volume_version; + + /* Components */ + struct vdo_component vdo; + struct block_map_state_2_0 block_map; + struct recovery_journal_state_7_0 recovery_journal; + struct slab_depot_state_2_0 slab_depot; + + /* Our partitioning of the underlying storage */ + struct layout layout; +}; + +/** + * vdo_are_same_version() - Check whether two version numbers are the same. + * @version_a: The first version. + * @version_b: The second version. + * + * Return: true if the two versions are the same. + */ +static inline bool +vdo_are_same_version(struct version_number version_a, struct version_number version_b) +{ + return ((version_a.major_version == version_b.major_version) && + (version_a.minor_version == version_b.minor_version)); +} + +/** + * vdo_is_upgradable_version() - Check whether an actual version is upgradable to an expected + * version. + * @expected_version: The expected version. + * @actual_version: The version being validated. + * + * An actual version is upgradable if its major number is expected but its minor number differs, + * and the expected version's minor number is greater than the actual version's minor number. + * + * Return: true if the actual version is upgradable. + */ +static inline bool vdo_is_upgradable_version(struct version_number expected_version, + struct version_number actual_version) +{ + return ((expected_version.major_version == actual_version.major_version) && + (expected_version.minor_version > actual_version.minor_version)); +} + +int __must_check vdo_validate_header(const struct header *expected_header, + const struct header *actual_header, + bool exact_size, + const char *component_name); + +void vdo_encode_header(u8 *buffer, size_t *offset, const struct header *header); +void vdo_decode_header(u8 *buffer, size_t *offset, struct header *header); + +/** + * vdo_pack_version_number() - Convert a version_number to its packed on-disk representation. + * @version: The version number to convert. 
+ * + * Return: the platform-independent representation of the version + */ +static inline struct packed_version_number vdo_pack_version_number(struct version_number version) +{ + return (struct packed_version_number) { + .major_version = __cpu_to_le32(version.major_version), + .minor_version = __cpu_to_le32(version.minor_version), + }; +} + +/** + * vdo_unpack_version_number() - Convert a packed_version_number to its native in-memory + * representation. + * @version: The version number to convert. + * + * Return: The platform-independent representation of the version. + */ +static inline struct version_number vdo_unpack_version_number(struct packed_version_number version) +{ + return (struct version_number) { + .major_version = __le32_to_cpu(version.major_version), + .minor_version = __le32_to_cpu(version.minor_version), + }; +} + +/** + * vdo_pack_header() - Convert a component header to its packed on-disk representation. + * @header: The header to convert. + * + * Return: the platform-independent representation of the header + */ +static inline struct packed_header vdo_pack_header(const struct header *header) +{ + return (struct packed_header) { + .id = __cpu_to_le32(header->id), + .version = vdo_pack_version_number(header->version), + .size = __cpu_to_le64(header->size), + }; +} + +/** + * vdo_unpack_header() - Convert a packed_header to its native in-memory representation. + * @header: The header to convert. + * + * Return: The platform-independent representation of the version. + */ +static inline struct header vdo_unpack_header(const struct packed_header *header) +{ + return (struct header) { + .id = __le32_to_cpu(header->id), + .version = vdo_unpack_version_number(header->version), + .size = __le64_to_cpu(header->size), + }; +} + +/** + * vdo_get_index_region_start() - Get the start of the index region from a geometry. + * @geometry: The geometry. + * + * Return: The start of the index region. + */ +static inline physical_block_number_t __must_check +vdo_get_index_region_start(struct volume_geometry geometry) +{ + return geometry.regions[VDO_INDEX_REGION].start_block; +} + +/** + * vdo_get_data_region_start() - Get the start of the data region from a geometry. + * @geometry: The geometry. + * + * Return: The start of the data region. + */ +static inline physical_block_number_t __must_check +vdo_get_data_region_start(struct volume_geometry geometry) +{ + return geometry.regions[VDO_DATA_REGION].start_block; +} + +/** + * vdo_get_index_region_size() - Get the size of the index region from a geometry. + * @geometry: The geometry. + * + * Return: The size of the index region. 
+ */ +static inline physical_block_number_t __must_check +vdo_get_index_region_size(struct volume_geometry geometry) +{ + return vdo_get_data_region_start(geometry) - + vdo_get_index_region_start(geometry); +} + +int __must_check vdo_parse_geometry_block(unsigned char *block, struct volume_geometry *geometry); + +static inline bool vdo_is_state_compressed(const enum block_mapping_state mapping_state) +{ + return (mapping_state > VDO_MAPPING_STATE_UNCOMPRESSED); +} + +static inline struct block_map_entry +vdo_pack_block_map_entry(physical_block_number_t pbn, enum block_mapping_state mapping_state) +{ + return (struct block_map_entry) { + .mapping_state = (mapping_state & 0x0F), + .pbn_high_nibble = ((pbn >> 32) & 0x0F), + .pbn_low_word = __cpu_to_le32(pbn & UINT_MAX), + }; +} + +static inline struct data_location vdo_unpack_block_map_entry(const struct block_map_entry *entry) +{ + physical_block_number_t low32 = __le32_to_cpu(entry->pbn_low_word); + physical_block_number_t high4 = entry->pbn_high_nibble; + + return (struct data_location) { + .pbn = ((high4 << 32) | low32), + .state = entry->mapping_state, + }; +} + +static inline bool vdo_is_mapped_location(const struct data_location *location) +{ + return (location->state != VDO_MAPPING_STATE_UNMAPPED); +} + +static inline bool vdo_is_valid_location(const struct data_location *location) +{ + if (location->pbn == VDO_ZERO_BLOCK) + return !vdo_is_state_compressed(location->state); + else + return vdo_is_mapped_location(location); +} + +static inline physical_block_number_t __must_check +vdo_get_block_map_page_pbn(const struct block_map_page *page) +{ + return __le64_to_cpu(page->header.pbn); +} + +struct block_map_page *vdo_format_block_map_page(void *buffer, + nonce_t nonce, + physical_block_number_t pbn, + bool initialized); + +enum block_map_page_validity __must_check +vdo_validate_block_map_page(struct block_map_page *page, + nonce_t nonce, + physical_block_number_t pbn); + +static inline page_count_t vdo_compute_block_map_page_count(block_count_t entries) +{ + return DIV_ROUND_UP(entries, VDO_BLOCK_MAP_ENTRIES_PER_PAGE); +} + +block_count_t __must_check +vdo_compute_new_forest_pages(root_count_t root_count, + struct boundary *old_sizes, + block_count_t entries, + struct boundary *new_sizes); + +/** + * vdo_pack_recovery_journal_entry() - Return the packed, on-disk representation of a recovery + * journal entry. + * @entry: The journal entry to pack. + * + * Return: The packed representation of the journal entry. + */ +static inline struct packed_recovery_journal_entry +vdo_pack_recovery_journal_entry(const struct recovery_journal_entry *entry) +{ + return (struct packed_recovery_journal_entry) { + .operation = entry->operation, + .slot_low = entry->slot.slot & 0x3F, + .slot_high = (entry->slot.slot >> 6) & 0x0F, + .pbn_high_nibble = (entry->slot.pbn >> 32) & 0x0F, + .pbn_low_word = __cpu_to_le32(entry->slot.pbn & UINT_MAX), + .mapping = vdo_pack_block_map_entry(entry->mapping.pbn, entry->mapping.state), + .unmapping = vdo_pack_block_map_entry(entry->unmapping.pbn, + entry->unmapping.state), + }; +} + +/** + * vdo_unpack_recovery_journal_entry() - Unpack the on-disk representation of a recovery journal + * entry. + * @entry: The recovery journal entry to unpack. + * + * Return: The unpacked entry. 
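+ *
+ * This is the inverse of vdo_pack_recovery_journal_entry(): the ten-bit slot is rebuilt as
+ * slot_low | (slot_high << 6) and the 36-bit PBN as (pbn_high_nibble << 32) | pbn_low_word.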
+ */ +static inline struct recovery_journal_entry +vdo_unpack_recovery_journal_entry(const struct packed_recovery_journal_entry *entry) +{ + physical_block_number_t low32 = __le32_to_cpu(entry->pbn_low_word); + physical_block_number_t high4 = entry->pbn_high_nibble; + + return (struct recovery_journal_entry) { + .operation = entry->operation, + .slot = { + .pbn = ((high4 << 32) | low32), + .slot = (entry->slot_low | (entry->slot_high << 6)), + }, + .mapping = vdo_unpack_block_map_entry(&entry->mapping), + .unmapping = vdo_unpack_block_map_entry(&entry->unmapping), + }; +} + +const char * __must_check vdo_get_journal_operation_name(enum journal_operation operation); + +/** + * vdo_is_valid_recovery_journal_sector() - Determine whether the header of the given sector could + * describe a valid sector for the given journal block + * header. + * @header: The unpacked block header to compare against. + * @sector: The packed sector to check. + * @sector_number: The number of the sector being checked. + * + * Return: true if the sector matches the block header. + */ +static inline bool __must_check +vdo_is_valid_recovery_journal_sector(const struct recovery_block_header *header, + const struct packed_journal_sector *sector, + u8 sector_number) +{ + if ((header->check_byte != sector->check_byte) || + (header->recovery_count != sector->recovery_count)) + return false; + + if (header->metadata_type == VDO_METADATA_RECOVERY_JOURNAL_2) + return sector->entry_count <= RECOVERY_JOURNAL_ENTRIES_PER_SECTOR; + + if (sector_number == 7) + return sector->entry_count <= RECOVERY_JOURNAL_1_ENTRIES_IN_LAST_SECTOR; + + return sector->entry_count <= RECOVERY_JOURNAL_1_ENTRIES_PER_SECTOR; +} + +/** + * vdo_compute_recovery_journal_block_number() - Compute the physical block number of the recovery + * journal block which would have a given sequence + * number. + * @journal_size: The size of the journal. + * @sequence_number: The sequence number. + * + * Return: The pbn of the journal block which would the specified sequence number. + */ +static inline physical_block_number_t __must_check +vdo_compute_recovery_journal_block_number(block_count_t journal_size, + sequence_number_t sequence_number) +{ + /* + * Since journal size is a power of two, the block number modulus can just be extracted + * from the low-order bits of the sequence. + */ + return (sequence_number & (journal_size - 1)); +} + +/** + * vdo_get_journal_block_sector() - Find the recovery journal sector from the block header and + * sector number. + * @header: The header of the recovery journal block. + * @sector_number: The index of the sector (1-based). + * + * Return: A packed recovery journal sector. + */ +static inline struct packed_journal_sector * __must_check +vdo_get_journal_block_sector(struct packed_journal_header *header, int sector_number) +{ + char *sector_data = ((char *) header) + (VDO_SECTOR_SIZE * sector_number); + + return (struct packed_journal_sector *) sector_data; +} + +/** + * vdo_pack_recovery_block_header() - Generate the packed representation of a recovery block + * header. + * @header: The header containing the values to encode. + * @packed: The header into which to pack the values. 
+ */ +static inline void vdo_pack_recovery_block_header(const struct recovery_block_header *header, + struct packed_journal_header *packed) +{ + *packed = (struct packed_journal_header) { + .block_map_head = __cpu_to_le64(header->block_map_head), + .slab_journal_head = __cpu_to_le64(header->slab_journal_head), + .sequence_number = __cpu_to_le64(header->sequence_number), + .nonce = __cpu_to_le64(header->nonce), + .logical_blocks_used = __cpu_to_le64(header->logical_blocks_used), + .block_map_data_blocks = __cpu_to_le64(header->block_map_data_blocks), + .entry_count = __cpu_to_le16(header->entry_count), + .check_byte = header->check_byte, + .recovery_count = header->recovery_count, + .metadata_type = header->metadata_type, + }; +} + +/** + * vdo_unpack_recovery_block_header() - Decode the packed representation of a recovery block + * header. + * @packed: The packed header to decode. + * + * Return: The unpacked header. + */ +static inline struct recovery_block_header +vdo_unpack_recovery_block_header(const struct packed_journal_header *packed) +{ + return (struct recovery_block_header) { + .block_map_head = __le64_to_cpu(packed->block_map_head), + .slab_journal_head = __le64_to_cpu(packed->slab_journal_head), + .sequence_number = __le64_to_cpu(packed->sequence_number), + .nonce = __le64_to_cpu(packed->nonce), + .logical_blocks_used = __le64_to_cpu(packed->logical_blocks_used), + .block_map_data_blocks = __le64_to_cpu(packed->block_map_data_blocks), + .entry_count = __le16_to_cpu(packed->entry_count), + .check_byte = packed->check_byte, + .recovery_count = packed->recovery_count, + .metadata_type = packed->metadata_type, + }; +} + +/** + * vdo_compute_slab_count() - Compute the number of slabs a depot with given parameters would have. + * @first_block: PBN of the first data block. + * @last_block: PBN of the last data block. + * @slab_size_shift: Exponent for the number of blocks per slab. + * + * Return: The number of slabs. + */ +static inline slab_count_t +vdo_compute_slab_count(physical_block_number_t first_block, + physical_block_number_t last_block, + unsigned int slab_size_shift) +{ + return (slab_count_t) ((last_block - first_block) >> slab_size_shift); +} + +int __must_check vdo_configure_slab_depot(const struct partition *partition, + struct slab_config slab_config, + zone_count_t zone_count, + struct slab_depot_state_2_0 *state); + +int __must_check vdo_configure_slab(block_count_t slab_size, + block_count_t slab_journal_blocks, + struct slab_config *slab_config); + +/** + * vdo_get_saved_reference_count_size() - Get the number of blocks required to save a reference + * counts state covering the specified number of data + * blocks. + * @block_count: The number of physical data blocks that can be referenced. + * + * Return: The number of blocks required to save reference counts with the given block count. + */ +static inline block_count_t +vdo_get_saved_reference_count_size(block_count_t block_count) +{ + return DIV_ROUND_UP(block_count, COUNTS_PER_BLOCK); +} + +/** + * vdo_get_slab_journal_start_block() - Get the physical block number of the start of the slab + * journal relative to the start block allocator partition. + * @slab_config: The slab configuration of the VDO. + * @origin: The first block of the slab. 
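+ *
+ * Return: The block number at which the slab's journal starts, immediately after the slab's
+ * data blocks and reference count blocks.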
+ */ +static inline physical_block_number_t __must_check +vdo_get_slab_journal_start_block(const struct slab_config *slab_config, + physical_block_number_t origin) +{ + return origin + slab_config->data_blocks + slab_config->reference_count_blocks; +} + +/** + * vdo_advance_journal_point() - Move the given journal point forward by one entry. + * @point: The journal point to adjust. + * @entries_per_block: The number of entries in one full block. + */ +static inline void +vdo_advance_journal_point(struct journal_point *point, journal_entry_count_t entries_per_block) +{ + point->entry_count++; + if (point->entry_count == entries_per_block) { + point->sequence_number++; + point->entry_count = 0; + } +} + +/** + * vdo_before_journal_point() - Check whether the first point precedes the second point. + * @first: The first journal point. + * @second: The second journal point. + * + * Return: true if the first point precedes the second point. + */ +static inline bool +vdo_before_journal_point(const struct journal_point *first, const struct journal_point *second) +{ + return ((first->sequence_number < second->sequence_number) || + ((first->sequence_number == second->sequence_number) && + (first->entry_count < second->entry_count))); +} + +/** + * vdo_pack_journal_point() - Encode the journal location represented by a + * journal_point into a packed_journal_point. + * @unpacked: The unpacked input point. + * @packed: The packed output point. + */ +static inline void +vdo_pack_journal_point(const struct journal_point *unpacked, struct packed_journal_point *packed) +{ + packed->encoded_point = + __cpu_to_le64((unpacked->sequence_number << 16) | unpacked->entry_count); +} + +/** + * vdo_unpack_journal_point() - Decode the journal location represented by a packed_journal_point + * into a journal_point. + * @packed: The packed input point. + * @unpacked: The unpacked output point. + */ +static inline void +vdo_unpack_journal_point(const struct packed_journal_point *packed, struct journal_point *unpacked) +{ + u64 native = __le64_to_cpu(packed->encoded_point); + + unpacked->sequence_number = (native >> 16); + unpacked->entry_count = (native & 0xffff); +} + +/** + * vdo_pack_slab_journal_block_header() - Generate the packed representation of a slab block + * header. + * @header: The header containing the values to encode. + * @packed: The header into which to pack the values. + */ +static inline void +vdo_pack_slab_journal_block_header(const struct slab_journal_block_header *header, + struct packed_slab_journal_block_header *packed) +{ + packed->head = __cpu_to_le64(header->head); + packed->sequence_number = __cpu_to_le64(header->sequence_number); + packed->nonce = __cpu_to_le64(header->nonce); + packed->entry_count = __cpu_to_le16(header->entry_count); + packed->metadata_type = header->metadata_type; + packed->has_block_map_increments = header->has_block_map_increments; + + vdo_pack_journal_point(&header->recovery_point, &packed->recovery_point); +} + +/** + * vdo_unpack_slab_journal_block_header() - Decode the packed representation of a slab block + * header. + * @packed: The packed header to decode. + * @header: The header into which to unpack the values. 
+ */ +static inline void +vdo_unpack_slab_journal_block_header(const struct packed_slab_journal_block_header *packed, + struct slab_journal_block_header *header) +{ + *header = (struct slab_journal_block_header) { + .head = __le64_to_cpu(packed->head), + .sequence_number = __le64_to_cpu(packed->sequence_number), + .nonce = __le64_to_cpu(packed->nonce), + .entry_count = __le16_to_cpu(packed->entry_count), + .metadata_type = packed->metadata_type, + .has_block_map_increments = packed->has_block_map_increments, + }; + vdo_unpack_journal_point(&packed->recovery_point, &header->recovery_point); +} + +/** + * vdo_pack_slab_journal_entry() - Generate the packed encoding of a slab journal entry. + * @packed: The entry into which to pack the values. + * @sbn: The slab block number of the entry to encode. + * @is_increment: The increment flag. + */ +static inline void vdo_pack_slab_journal_entry(packed_slab_journal_entry *packed, + slab_block_number sbn, + bool is_increment) +{ + packed->offset_low8 = (sbn & 0x0000FF); + packed->offset_mid8 = (sbn & 0x00FF00) >> 8; + packed->offset_high7 = (sbn & 0x7F0000) >> 16; + packed->increment = is_increment ? 1 : 0; +} + +/** + * vdo_unpack_slab_journal_entry() - Decode the packed representation of a slab journal entry. + * @packed: The packed entry to decode. + * + * Return: The decoded slab journal entry. + */ +static inline struct slab_journal_entry __must_check +vdo_unpack_slab_journal_entry(const packed_slab_journal_entry *packed) +{ + struct slab_journal_entry entry; + + entry.sbn = packed->offset_high7; + entry.sbn <<= 8; + entry.sbn |= packed->offset_mid8; + entry.sbn <<= 8; + entry.sbn |= packed->offset_low8; + entry.operation = VDO_JOURNAL_DATA_REMAPPING; + entry.increment = packed->increment; + return entry; +} + +struct slab_journal_entry __must_check +vdo_decode_slab_journal_entry(struct packed_slab_journal_block *block, + journal_entry_count_t entry_count); + +/** + * vdo_get_slab_summary_hint_shift() - Compute the shift for slab summary hints. + * @slab_size_shift: Exponent for the number of blocks per slab. + * + * Return: The hint shift. + */ +static inline u8 __must_check vdo_get_slab_summary_hint_shift(unsigned int slab_size_shift) +{ + return ((slab_size_shift > VDO_SLAB_SUMMARY_FULLNESS_HINT_BITS) ? 
+ (slab_size_shift - VDO_SLAB_SUMMARY_FULLNESS_HINT_BITS) : + 0); +} + +int __must_check vdo_initialize_layout(block_count_t size, + physical_block_number_t offset, + block_count_t block_map_blocks, + block_count_t journal_blocks, + block_count_t summary_blocks, + struct layout *layout); + +void vdo_uninitialize_layout(struct layout *layout); + +int __must_check vdo_get_partition(struct layout *layout, + enum partition_id id, + struct partition **partition_ptr); + +struct partition * __must_check +vdo_get_known_partition(struct layout *layout, enum partition_id id); + +int vdo_validate_config(const struct vdo_config *config, + block_count_t physical_block_count, + block_count_t logical_block_count); + +void vdo_destroy_component_states(struct vdo_component_states *states); + +int __must_check +vdo_decode_component_states(u8 *buffer, + struct volume_geometry *geometry, + struct vdo_component_states *states); + +int __must_check +vdo_validate_component_states(struct vdo_component_states *states, + nonce_t geometry_nonce, + block_count_t physical_size, + block_count_t logical_size); + +void vdo_encode_super_block(u8 *buffer, struct vdo_component_states *states); +int __must_check vdo_decode_super_block(u8 *buffer); + +/* We start with 0L and postcondition with ~0L to match our historical usage in userspace. */ +static inline u32 vdo_crc32(const void *buf, unsigned long len) +{ + return (crc32(0L, buf, len) ^ ~0L); +} + +#endif /* VDO_ENCODINGS_H */ From patchwork Thu Oct 26 21:41:32 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Matthew Sakai X-Patchwork-Id: 13437949 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 0D536C27C4D for ; Thu, 26 Oct 2023 21:44:32 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1344732AbjJZVob (ORCPT ); Thu, 26 Oct 2023 17:44:31 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:41014 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1344852AbjJZVoQ (ORCPT ); Thu, 26 Oct 2023 17:44:16 -0400 Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.129.124]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 90CC0D50 for ; Thu, 26 Oct 2023 14:41:46 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1698356505; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=/0dcr44mbPyxQkR2N69CxYlN9/0nB+mfST+rvNkSC14=; b=UKgOXuYAOsu5jY35kwHrk5gAD1QKwbfg79xJY1flP7zMd/QOxopg1VbLMSwrGSrW9NS3Cq VlPhjcJkVnkU8wKRRI2lIYMG2aRASik5DAzgifS+UjOSMOCAIaiscBwVXflTQpfJ8xAEW+ Quo+qhmrG99kLWQaUOxr3NuxV9pEnG4= Received: from mimecast-mx02.redhat.com (mx-ext.redhat.com [66.187.233.73]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.3, cipher=TLS_AES_256_GCM_SHA384) id us-mta-345-lKpZFmbyPpSevIYSdXjWqw-1; Thu, 26 Oct 2023 17:41:43 -0400 X-MC-Unique: lKpZFmbyPpSevIYSdXjWqw-1 Received: from smtp.corp.redhat.com (int-mx04.intmail.prod.int.rdu2.redhat.com [10.11.54.4]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (2048 bits) server-digest 
SHA256) (No client certificate requested) by mimecast-mx02.redhat.com (Postfix) with ESMTPS id 491323C0276C; Thu, 26 Oct 2023 21:41:43 +0000 (UTC) Received: from pbitcolo-build-10.permabit.com (pbitcolo-build-10.permabit.lab.eng.bos.redhat.com [10.19.117.76]) by smtp.corp.redhat.com (Postfix) with ESMTP id 314AE2026D4C; Thu, 26 Oct 2023 21:41:43 +0000 (UTC) Received: by pbitcolo-build-10.permabit.com (Postfix, from userid 1138) id D975D3003B; Thu, 26 Oct 2023 17:41:42 -0400 (EDT) From: Matthew Sakai To: dm-devel@lists.linux.dev, linux-block@vger.kernel.org Cc: Matthew Sakai Subject: [PATCH v4 35/39] dm vdo: add statistics reporting Date: Thu, 26 Oct 2023 17:41:32 -0400 Message-Id: <20231026214136.1067410-36-msakai@redhat.com> In-Reply-To: <20231026214136.1067410-1-msakai@redhat.com> References: <20231026214136.1067410-1-msakai@redhat.com> MIME-Version: 1.0 X-Scanned-By: MIMEDefang 3.4.1 on 10.11.54.4 Precedence: bulk List-ID: X-Mailing-List: linux-block@vger.kernel.org Add data and methods to report statisics. Co-developed-by: J. corwin Coburn Signed-off-by: J. corwin Coburn Co-developed-by: Michael Sclafani Signed-off-by: Michael Sclafani Co-developed-by: Sweet Tea Dorminy Signed-off-by: Sweet Tea Dorminy Signed-off-by: Matthew Sakai --- drivers/md/dm-vdo/message-stats.c | 1215 +++++++++++++++++++++++++++++ drivers/md/dm-vdo/message-stats.h | 13 + drivers/md/dm-vdo/statistics.h | 278 +++++++ 3 files changed, 1506 insertions(+) create mode 100644 drivers/md/dm-vdo/message-stats.c create mode 100644 drivers/md/dm-vdo/message-stats.h create mode 100644 drivers/md/dm-vdo/statistics.h diff --git a/drivers/md/dm-vdo/message-stats.c b/drivers/md/dm-vdo/message-stats.c new file mode 100644 index 000000000000..b91a841043eb --- /dev/null +++ b/drivers/md/dm-vdo/message-stats.c @@ -0,0 +1,1215 @@ +// SPDX-License-Identifier: GPL-2.0-only +/* + * Copyright 2023 Red Hat + */ + +#include "dedupe.h" +#include "logger.h" +#include "memory-alloc.h" +#include "message-stats.h" +#include "statistics.h" +#include "thread-device.h" +#include "vdo.h" + +static int write_u64(char *prefix, + u64 value, + char *suffix, + char **buf, + unsigned int *maxlen) +{ + int count = scnprintf(*buf, *maxlen, "%s%llu%s", + prefix == NULL ? "" : prefix, + value, + suffix == NULL ? "" : suffix); + *buf += count; + *maxlen -= count; + if (count >= *maxlen) + return VDO_UNEXPECTED_EOF; + return VDO_SUCCESS; +} + +static int write_u32(char *prefix, + u32 value, + char *suffix, + char **buf, + unsigned int *maxlen) +{ + int count = scnprintf(*buf, *maxlen, "%s%u%s", + prefix == NULL ? "" : prefix, + value, + suffix == NULL ? "" : suffix); + *buf += count; + *maxlen -= count; + if (count >= *maxlen) + return VDO_UNEXPECTED_EOF; + return VDO_SUCCESS; +} + +static int write_block_count_t(char *prefix, + block_count_t value, + char *suffix, + char **buf, + unsigned int *maxlen) +{ + int count = scnprintf(*buf, *maxlen, "%s%llu%s", + prefix == NULL ? "" : prefix, + value, + suffix == NULL ? "" : suffix); + *buf += count; + *maxlen -= count; + if (count >= *maxlen) + return VDO_UNEXPECTED_EOF; + return VDO_SUCCESS; +} + +static int write_string(char *prefix, + char *value, + char *suffix, + char **buf, + unsigned int *maxlen) +{ + int count = scnprintf(*buf, *maxlen, "%s%s%s", + prefix == NULL ? "" : prefix, + value, + suffix == NULL ? 
"" : suffix); + *buf += count; + *maxlen -= count; + if (count >= *maxlen) + return VDO_UNEXPECTED_EOF; + return VDO_SUCCESS; +} + +static int write_bool(char *prefix, + bool value, + char *suffix, + char **buf, + unsigned int *maxlen) +{ + int count = scnprintf(*buf, *maxlen, "%s%d%s", + prefix == NULL ? "" : prefix, + value, + suffix == NULL ? "" : suffix); + *buf += count; + *maxlen -= count; + if (count >= *maxlen) + return VDO_UNEXPECTED_EOF; + return VDO_SUCCESS; +} + +static int write_u8(char *prefix, + u8 value, + char *suffix, + char **buf, + unsigned int *maxlen) +{ + int count = scnprintf(*buf, *maxlen, "%s%u%s", + prefix == NULL ? "" : prefix, + value, + suffix == NULL ? "" : suffix); + *buf += count; + *maxlen -= count; + if (count >= *maxlen) + return VDO_UNEXPECTED_EOF; + return VDO_SUCCESS; +} + +static int write_block_allocator_statistics(char *prefix, + struct block_allocator_statistics *stats, + char *suffix, + char **buf, + unsigned int *maxlen) +{ + int result; + + result = write_string(prefix, "{ ", NULL, buf, maxlen); + if (result != VDO_SUCCESS) + return result; + /* The total number of slabs from which blocks may be allocated */ + result = write_u64("slabCount : ", + stats->slab_count, + ", ", + buf, + maxlen); + if (result != VDO_SUCCESS) + return result; + /* The total number of slabs from which blocks have ever been allocated */ + result = write_u64("slabsOpened : ", + stats->slabs_opened, + ", ", + buf, + maxlen); + if (result != VDO_SUCCESS) + return result; + /* The number of times since loading that a slab has been re-opened */ + result = write_u64("slabsReopened : ", + stats->slabs_reopened, + ", ", + buf, + maxlen); + if (result != VDO_SUCCESS) + return result; + result = write_string(NULL, "}", suffix, buf, maxlen); + if (result != VDO_SUCCESS) + return result; + return VDO_SUCCESS; +} + +static int write_commit_statistics(char *prefix, + struct commit_statistics *stats, + char *suffix, + char **buf, + unsigned int *maxlen) +{ + int result; + + result = write_string(prefix, "{ ", NULL, buf, maxlen); + if (result != VDO_SUCCESS) + return result; + /* The total number of items on which processing has started */ + result = write_u64("started : ", + stats->started, + ", ", + buf, + maxlen); + if (result != VDO_SUCCESS) + return result; + /* The total number of items for which a write operation has been issued */ + result = write_u64("written : ", + stats->written, + ", ", + buf, + maxlen); + if (result != VDO_SUCCESS) + return result; + /* The total number of items for which a write operation has completed */ + result = write_u64("committed : ", + stats->committed, + ", ", + buf, + maxlen); + if (result != VDO_SUCCESS) + return result; + result = write_string(NULL, "}", suffix, buf, maxlen); + if (result != VDO_SUCCESS) + return result; + return VDO_SUCCESS; +} + +static int write_recovery_journal_statistics(char *prefix, + struct recovery_journal_statistics *stats, + char *suffix, + char **buf, + unsigned int *maxlen) +{ + int result; + + result = write_string(prefix, "{ ", NULL, buf, maxlen); + if (result != VDO_SUCCESS) + return result; + /* Number of times the on-disk journal was full */ + result = write_u64("diskFull : ", + stats->disk_full, + ", ", + buf, + maxlen); + if (result != VDO_SUCCESS) + return result; + /* Number of times the recovery journal requested slab journal commits. 
*/ + result = write_u64("slabJournalCommitsRequested : ", + stats->slab_journal_commits_requested, + ", ", + buf, + maxlen); + if (result != VDO_SUCCESS) + return result; + /* Write/Commit totals for individual journal entries */ + result = write_commit_statistics("entries : ", + &stats->entries, + ", ", + buf, + maxlen); + if (result != VDO_SUCCESS) + return result; + /* Write/Commit totals for journal blocks */ + result = write_commit_statistics("blocks : ", + &stats->blocks, + ", ", + buf, + maxlen); + if (result != VDO_SUCCESS) + return result; + result = write_string(NULL, "}", suffix, buf, maxlen); + if (result != VDO_SUCCESS) + return result; + return VDO_SUCCESS; +} + +static int write_packer_statistics(char *prefix, + struct packer_statistics *stats, + char *suffix, + char **buf, + unsigned int *maxlen) +{ + int result; + + result = write_string(prefix, "{ ", NULL, buf, maxlen); + if (result != VDO_SUCCESS) + return result; + /* Number of compressed data items written since startup */ + result = write_u64("compressedFragmentsWritten : ", + stats->compressed_fragments_written, + ", ", + buf, + maxlen); + if (result != VDO_SUCCESS) + return result; + /* Number of blocks containing compressed items written since startup */ + result = write_u64("compressedBlocksWritten : ", + stats->compressed_blocks_written, + ", ", + buf, + maxlen); + if (result != VDO_SUCCESS) + return result; + /* Number of VIOs that are pending in the packer */ + result = write_u64("compressedFragmentsInPacker : ", + stats->compressed_fragments_in_packer, + ", ", + buf, + maxlen); + if (result != VDO_SUCCESS) + return result; + result = write_string(NULL, "}", suffix, buf, maxlen); + if (result != VDO_SUCCESS) + return result; + return VDO_SUCCESS; +} + +static int write_slab_journal_statistics(char *prefix, + struct slab_journal_statistics *stats, + char *suffix, + char **buf, + unsigned int *maxlen) +{ + int result; + + result = write_string(prefix, "{ ", NULL, buf, maxlen); + if (result != VDO_SUCCESS) + return result; + /* Number of times the on-disk journal was full */ + result = write_u64("diskFullCount : ", + stats->disk_full_count, + ", ", + buf, + maxlen); + if (result != VDO_SUCCESS) + return result; + /* Number of times an entry was added over the flush threshold */ + result = write_u64("flushCount : ", + stats->flush_count, + ", ", + buf, + maxlen); + if (result != VDO_SUCCESS) + return result; + /* Number of times an entry was added over the block threshold */ + result = write_u64("blockedCount : ", + stats->blocked_count, + ", ", + buf, + maxlen); + if (result != VDO_SUCCESS) + return result; + /* Number of times a tail block was written */ + result = write_u64("blocksWritten : ", + stats->blocks_written, + ", ", + buf, + maxlen); + if (result != VDO_SUCCESS) + return result; + /* Number of times we had to wait for the tail to write */ + result = write_u64("tailBusyCount : ", + stats->tail_busy_count, + ", ", + buf, + maxlen); + if (result != VDO_SUCCESS) + return result; + result = write_string(NULL, "}", suffix, buf, maxlen); + if (result != VDO_SUCCESS) + return result; + return VDO_SUCCESS; +} + +static int write_slab_summary_statistics(char *prefix, + struct slab_summary_statistics *stats, + char *suffix, + char **buf, + unsigned int *maxlen) +{ + int result; + + result = write_string(prefix, "{ ", NULL, buf, maxlen); + if (result != VDO_SUCCESS) + return result; + /* Number of blocks written */ + result = write_u64("blocksWritten : ", + stats->blocks_written, + ", ", + buf, + maxlen); + if 
(result != VDO_SUCCESS) + return result; + result = write_string(NULL, "}", suffix, buf, maxlen); + if (result != VDO_SUCCESS) + return result; + return VDO_SUCCESS; +} + +static int write_ref_counts_statistics(char *prefix, + struct ref_counts_statistics *stats, + char *suffix, + char **buf, + unsigned int *maxlen) +{ + int result; + + result = write_string(prefix, "{ ", NULL, buf, maxlen); + if (result != VDO_SUCCESS) + return result; + /* Number of reference blocks written */ + result = write_u64("blocksWritten : ", + stats->blocks_written, + ", ", + buf, + maxlen); + if (result != VDO_SUCCESS) + return result; + result = write_string(NULL, "}", suffix, buf, maxlen); + if (result != VDO_SUCCESS) + return result; + return VDO_SUCCESS; +} + +static int write_block_map_statistics(char *prefix, + struct block_map_statistics *stats, + char *suffix, + char **buf, + unsigned int *maxlen) +{ + int result; + + result = write_string(prefix, "{ ", NULL, buf, maxlen); + if (result != VDO_SUCCESS) + return result; + /* number of dirty (resident) pages */ + result = write_u32("dirtyPages : ", + stats->dirty_pages, + ", ", + buf, + maxlen); + if (result != VDO_SUCCESS) + return result; + /* number of clean (resident) pages */ + result = write_u32("cleanPages : ", + stats->clean_pages, + ", ", + buf, + maxlen); + if (result != VDO_SUCCESS) + return result; + /* number of free pages */ + result = write_u32("freePages : ", + stats->free_pages, + ", ", + buf, + maxlen); + if (result != VDO_SUCCESS) + return result; + /* number of pages in failed state */ + result = write_u32("failedPages : ", + stats->failed_pages, + ", ", + buf, + maxlen); + if (result != VDO_SUCCESS) + return result; + /* number of pages incoming */ + result = write_u32("incomingPages : ", + stats->incoming_pages, + ", ", + buf, + maxlen); + if (result != VDO_SUCCESS) + return result; + /* number of pages outgoing */ + result = write_u32("outgoingPages : ", + stats->outgoing_pages, + ", ", + buf, + maxlen); + if (result != VDO_SUCCESS) + return result; + /* how many times free page not avail */ + result = write_u32("cachePressure : ", + stats->cache_pressure, + ", ", + buf, + maxlen); + if (result != VDO_SUCCESS) + return result; + /* number of get_vdo_page() calls for read */ + result = write_u64("readCount : ", + stats->read_count, + ", ", + buf, + maxlen); + if (result != VDO_SUCCESS) + return result; + /* number of get_vdo_page() calls for write */ + result = write_u64("writeCount : ", + stats->write_count, + ", ", + buf, + maxlen); + if (result != VDO_SUCCESS) + return result; + /* number of times pages failed to read */ + result = write_u64("failedReads : ", + stats->failed_reads, + ", ", + buf, + maxlen); + if (result != VDO_SUCCESS) + return result; + /* number of times pages failed to write */ + result = write_u64("failedWrites : ", + stats->failed_writes, + ", ", + buf, + maxlen); + if (result != VDO_SUCCESS) + return result; + /* number of gets that are reclaimed */ + result = write_u64("reclaimed : ", + stats->reclaimed, + ", ", + buf, + maxlen); + if (result != VDO_SUCCESS) + return result; + /* number of gets for outgoing pages */ + result = write_u64("readOutgoing : ", + stats->read_outgoing, + ", ", + buf, + maxlen); + if (result != VDO_SUCCESS) + return result; + /* number of gets that were already there */ + result = write_u64("foundInCache : ", + stats->found_in_cache, + ", ", + buf, + maxlen); + if (result != VDO_SUCCESS) + return result; + /* number of gets requiring discard */ + result = write_u64("discardRequired 
: ", + stats->discard_required, + ", ", + buf, + maxlen); + if (result != VDO_SUCCESS) + return result; + /* number of gets enqueued for their page */ + result = write_u64("waitForPage : ", + stats->wait_for_page, + ", ", + buf, + maxlen); + if (result != VDO_SUCCESS) + return result; + /* number of gets that have to fetch */ + result = write_u64("fetchRequired : ", + stats->fetch_required, + ", ", + buf, + maxlen); + if (result != VDO_SUCCESS) + return result; + /* number of page fetches */ + result = write_u64("pagesLoaded : ", + stats->pages_loaded, + ", ", + buf, + maxlen); + if (result != VDO_SUCCESS) + return result; + /* number of page saves */ + result = write_u64("pagesSaved : ", + stats->pages_saved, + ", ", + buf, + maxlen); + if (result != VDO_SUCCESS) + return result; + /* the number of flushes issued */ + result = write_u64("flushCount : ", + stats->flush_count, + ", ", + buf, + maxlen); + if (result != VDO_SUCCESS) + return result; + result = write_string(NULL, "}", suffix, buf, maxlen); + if (result != VDO_SUCCESS) + return result; + return VDO_SUCCESS; +} + +static int write_hash_lock_statistics(char *prefix, + struct hash_lock_statistics *stats, + char *suffix, + char **buf, + unsigned int *maxlen) +{ + int result; + + result = write_string(prefix, "{ ", NULL, buf, maxlen); + if (result != VDO_SUCCESS) + return result; + /* Number of times the UDS advice proved correct */ + result = write_u64("dedupeAdviceValid : ", + stats->dedupe_advice_valid, + ", ", + buf, + maxlen); + if (result != VDO_SUCCESS) + return result; + /* Number of times the UDS advice proved incorrect */ + result = write_u64("dedupeAdviceStale : ", + stats->dedupe_advice_stale, + ", ", + buf, + maxlen); + if (result != VDO_SUCCESS) + return result; + /* Number of writes with the same data as another in-flight write */ + result = write_u64("concurrentDataMatches : ", + stats->concurrent_data_matches, + ", ", + buf, + maxlen); + if (result != VDO_SUCCESS) + return result; + /* Number of writes whose hash collided with an in-flight write */ + result = write_u64("concurrentHashCollisions : ", + stats->concurrent_hash_collisions, + ", ", + buf, + maxlen); + if (result != VDO_SUCCESS) + return result; + /* Current number of dedupe queries that are in flight */ + result = write_u32("currDedupeQueries : ", + stats->curr_dedupe_queries, + ", ", + buf, + maxlen); + if (result != VDO_SUCCESS) + return result; + result = write_string(NULL, "}", suffix, buf, maxlen); + if (result != VDO_SUCCESS) + return result; + return VDO_SUCCESS; +} + +static int write_error_statistics(char *prefix, + struct error_statistics *stats, + char *suffix, + char **buf, + unsigned int *maxlen) +{ + int result; + + result = write_string(prefix, "{ ", NULL, buf, maxlen); + if (result != VDO_SUCCESS) + return result; + /* number of times VDO got an invalid dedupe advice PBN from UDS */ + result = write_u64("invalidAdvicePBNCount : ", + stats->invalid_advice_pbn_count, + ", ", + buf, + maxlen); + if (result != VDO_SUCCESS) + return result; + /* number of times a VIO completed with a VDO_NO_SPACE error */ + result = write_u64("noSpaceErrorCount : ", + stats->no_space_error_count, + ", ", + buf, + maxlen); + if (result != VDO_SUCCESS) + return result; + /* number of times a VIO completed with a VDO_READ_ONLY error */ + result = write_u64("readOnlyErrorCount : ", + stats->read_only_error_count, + ", ", + buf, + maxlen); + if (result != VDO_SUCCESS) + return result; + result = write_string(NULL, "}", suffix, buf, maxlen); + if (result != 
VDO_SUCCESS) + return result; + return VDO_SUCCESS; +} + +static int write_bio_stats(char *prefix, + struct bio_stats *stats, + char *suffix, + char **buf, + unsigned int *maxlen) +{ + int result; + + result = write_string(prefix, "{ ", NULL, buf, maxlen); + if (result != VDO_SUCCESS) + return result; + /* Number of REQ_OP_READ bios */ + result = write_u64("read : ", + stats->read, + ", ", + buf, + maxlen); + if (result != VDO_SUCCESS) + return result; + /* Number of REQ_OP_WRITE bios with data */ + result = write_u64("write : ", + stats->write, + ", ", + buf, + maxlen); + if (result != VDO_SUCCESS) + return result; + /* Number of bios tagged with REQ_PREFLUSH and containing no data */ + result = write_u64("emptyFlush : ", + stats->empty_flush, + ", ", + buf, + maxlen); + if (result != VDO_SUCCESS) + return result; + /* Number of REQ_OP_DISCARD bios */ + result = write_u64("discard : ", + stats->discard, + ", ", + buf, + maxlen); + if (result != VDO_SUCCESS) + return result; + /* Number of bios tagged with REQ_PREFLUSH */ + result = write_u64("flush : ", + stats->flush, + ", ", + buf, + maxlen); + if (result != VDO_SUCCESS) + return result; + /* Number of bios tagged with REQ_FUA */ + result = write_u64("fua : ", + stats->fua, + ", ", + buf, + maxlen); + if (result != VDO_SUCCESS) + return result; + result = write_string(NULL, "}", suffix, buf, maxlen); + if (result != VDO_SUCCESS) + return result; + return VDO_SUCCESS; +} + +static int write_memory_usage(char *prefix, + struct memory_usage *stats, + char *suffix, + char **buf, + unsigned int *maxlen) +{ + int result; + + result = write_string(prefix, "{ ", NULL, buf, maxlen); + if (result != VDO_SUCCESS) + return result; + /* Tracked bytes currently allocated. */ + result = write_u64("bytesUsed : ", + stats->bytes_used, + ", ", + buf, + maxlen); + if (result != VDO_SUCCESS) + return result; + /* Maximum tracked bytes allocated. 
*/ + result = write_u64("peakBytesUsed : ", + stats->peak_bytes_used, + ", ", + buf, + maxlen); + if (result != VDO_SUCCESS) + return result; + result = write_string(NULL, "}", suffix, buf, maxlen); + if (result != VDO_SUCCESS) + return result; + return VDO_SUCCESS; +} + +static int write_index_statistics(char *prefix, + struct index_statistics *stats, + char *suffix, + char **buf, + unsigned int *maxlen) +{ + int result; + + result = write_string(prefix, "{ ", NULL, buf, maxlen); + if (result != VDO_SUCCESS) + return result; + /* Number of records stored in the index */ + result = write_u64("entriesIndexed : ", + stats->entries_indexed, + ", ", + buf, + maxlen); + if (result != VDO_SUCCESS) + return result; + /* Number of post calls that found an existing entry */ + result = write_u64("postsFound : ", + stats->posts_found, + ", ", + buf, + maxlen); + if (result != VDO_SUCCESS) + return result; + /* Number of post calls that added a new entry */ + result = write_u64("postsNotFound : ", + stats->posts_not_found, + ", ", + buf, + maxlen); + if (result != VDO_SUCCESS) + return result; + /* Number of query calls that found an existing entry */ + result = write_u64("queriesFound : ", + stats->queries_found, + ", ", + buf, + maxlen); + if (result != VDO_SUCCESS) + return result; + /* Number of query calls that added a new entry */ + result = write_u64("queriesNotFound : ", + stats->queries_not_found, + ", ", + buf, + maxlen); + if (result != VDO_SUCCESS) + return result; + /* Number of update calls that found an existing entry */ + result = write_u64("updatesFound : ", + stats->updates_found, + ", ", + buf, + maxlen); + if (result != VDO_SUCCESS) + return result; + /* Number of update calls that added a new entry */ + result = write_u64("updatesNotFound : ", + stats->updates_not_found, + ", ", + buf, + maxlen); + if (result != VDO_SUCCESS) + return result; + /* Number of entries discarded */ + result = write_u64("entriesDiscarded : ", + stats->entries_discarded, + ", ", + buf, + maxlen); + if (result != VDO_SUCCESS) + return result; + result = write_string(NULL, "}", suffix, buf, maxlen); + if (result != VDO_SUCCESS) + return result; + return VDO_SUCCESS; +} + +static int write_vdo_statistics(char *prefix, + struct vdo_statistics *stats, + char *suffix, + char **buf, + unsigned int *maxlen) +{ + int result; + + result = write_string(prefix, "{ ", NULL, buf, maxlen); + if (result != VDO_SUCCESS) + return result; + result = write_u32("version : ", + stats->version, + ", ", + buf, + maxlen); + if (result != VDO_SUCCESS) + return result; + /* Number of blocks used for data */ + result = write_u64("dataBlocksUsed : ", + stats->data_blocks_used, + ", ", + buf, + maxlen); + if (result != VDO_SUCCESS) + return result; + /* Number of blocks used for VDO metadata */ + result = write_u64("overheadBlocksUsed : ", + stats->overhead_blocks_used, + ", ", + buf, + maxlen); + if (result != VDO_SUCCESS) + return result; + /* Number of logical blocks that are currently mapped to physical blocks */ + result = write_u64("logicalBlocksUsed : ", + stats->logical_blocks_used, + ", ", + buf, + maxlen); + if (result != VDO_SUCCESS) + return result; + /* number of physical blocks */ + result = write_block_count_t("physicalBlocks : ", + stats->physical_blocks, + ", ", + buf, + maxlen); + if (result != VDO_SUCCESS) + return result; + /* number of logical blocks */ + result = write_block_count_t("logicalBlocks : ", + stats->logical_blocks, + ", ", + buf, + maxlen); + if (result != VDO_SUCCESS) + return result; + /* Size of 
the block map page cache, in bytes */ + result = write_u64("blockMapCacheSize : ", + stats->block_map_cache_size, + ", ", + buf, + maxlen); + if (result != VDO_SUCCESS) + return result; + /* The physical block size */ + result = write_u64("blockSize : ", + stats->block_size, + ", ", + buf, + maxlen); + if (result != VDO_SUCCESS) + return result; + /* Number of times the VDO has successfully recovered */ + result = write_u64("completeRecoveries : ", + stats->complete_recoveries, + ", ", + buf, + maxlen); + if (result != VDO_SUCCESS) + return result; + /* Number of times the VDO has recovered from read-only mode */ + result = write_u64("readOnlyRecoveries : ", + stats->read_only_recoveries, + ", ", + buf, + maxlen); + if (result != VDO_SUCCESS) + return result; + /* String describing the operating mode of the VDO */ + result = write_string("mode : ", + stats->mode, + ", ", + buf, + maxlen); + if (result != VDO_SUCCESS) + return result; + /* Whether the VDO is in recovery mode */ + result = write_bool("inRecoveryMode : ", + stats->in_recovery_mode, + ", ", + buf, + maxlen); + if (result != VDO_SUCCESS) + return result; + /* What percentage of recovery mode work has been completed */ + result = write_u8("recoveryPercentage : ", + stats->recovery_percentage, + ", ", + buf, + maxlen); + if (result != VDO_SUCCESS) + return result; + /* The statistics for the compressed block packer */ + result = write_packer_statistics("packer : ", + &stats->packer, + ", ", + buf, + maxlen); + if (result != VDO_SUCCESS) + return result; + /* Counters for events in the block allocator */ + result = write_block_allocator_statistics("allocator : ", + &stats->allocator, + ", ", + buf, + maxlen); + if (result != VDO_SUCCESS) + return result; + /* Counters for events in the recovery journal */ + result = write_recovery_journal_statistics("journal : ", + &stats->journal, + ", ", + buf, + maxlen); + if (result != VDO_SUCCESS) + return result; + /* The statistics for the slab journals */ + result = write_slab_journal_statistics("slabJournal : ", + &stats->slab_journal, + ", ", + buf, + maxlen); + if (result != VDO_SUCCESS) + return result; + /* The statistics for the slab summary */ + result = write_slab_summary_statistics("slabSummary : ", + &stats->slab_summary, + ", ", + buf, + maxlen); + if (result != VDO_SUCCESS) + return result; + /* The statistics for the reference counts */ + result = write_ref_counts_statistics("refCounts : ", + &stats->ref_counts, + ", ", + buf, + maxlen); + if (result != VDO_SUCCESS) + return result; + /* The statistics for the block map */ + result = write_block_map_statistics("blockMap : ", + &stats->block_map, + ", ", + buf, + maxlen); + if (result != VDO_SUCCESS) + return result; + /* The dedupe statistics from hash locks */ + result = write_hash_lock_statistics("hashLock : ", + &stats->hash_lock, + ", ", + buf, + maxlen); + if (result != VDO_SUCCESS) + return result; + /* Counts of error conditions */ + result = write_error_statistics("errors : ", + &stats->errors, + ", ", + buf, + maxlen); + if (result != VDO_SUCCESS) + return result; + /* The VDO instance */ + result = write_u32("instance : ", + stats->instance, + ", ", + buf, + maxlen); + if (result != VDO_SUCCESS) + return result; + /* Current number of active VIOs */ + result = write_u32("currentVIOsInProgress : ", + stats->current_vios_in_progress, + ", ", + buf, + maxlen); + if (result != VDO_SUCCESS) + return result; + /* Maximum number of active VIOs */ + result = write_u32("maxVIOs : ", + stats->max_vios, + ", ", + buf, + 
maxlen); + if (result != VDO_SUCCESS) + return result; + /* Number of times the UDS index was too slow in responding */ + result = write_u64("dedupeAdviceTimeouts : ", + stats->dedupe_advice_timeouts, + ", ", + buf, + maxlen); + if (result != VDO_SUCCESS) + return result; + /* Number of flush requests submitted to the storage device */ + result = write_u64("flushOut : ", + stats->flush_out, + ", ", + buf, + maxlen); + if (result != VDO_SUCCESS) + return result; + /* Logical block size */ + result = write_u64("logicalBlockSize : ", + stats->logical_block_size, + ", ", + buf, + maxlen); + if (result != VDO_SUCCESS) + return result; + /* Bios submitted into VDO from above */ + result = write_bio_stats("biosIn : ", + &stats->bios_in, + ", ", + buf, + maxlen); + if (result != VDO_SUCCESS) + return result; + result = write_bio_stats("biosInPartial : ", + &stats->bios_in_partial, + ", ", + buf, + maxlen); + if (result != VDO_SUCCESS) + return result; + /* Bios submitted onward for user data */ + result = write_bio_stats("biosOut : ", + &stats->bios_out, + ", ", + buf, + maxlen); + if (result != VDO_SUCCESS) + return result; + /* Bios submitted onward for metadata */ + result = write_bio_stats("biosMeta : ", + &stats->bios_meta, + ", ", + buf, + maxlen); + if (result != VDO_SUCCESS) + return result; + result = write_bio_stats("biosJournal : ", + &stats->bios_journal, + ", ", + buf, + maxlen); + if (result != VDO_SUCCESS) + return result; + result = write_bio_stats("biosPageCache : ", + &stats->bios_page_cache, + ", ", + buf, + maxlen); + if (result != VDO_SUCCESS) + return result; + result = write_bio_stats("biosOutCompleted : ", + &stats->bios_out_completed, + ", ", + buf, + maxlen); + if (result != VDO_SUCCESS) + return result; + result = write_bio_stats("biosMetaCompleted : ", + &stats->bios_meta_completed, + ", ", + buf, + maxlen); + if (result != VDO_SUCCESS) + return result; + result = write_bio_stats("biosJournalCompleted : ", + &stats->bios_journal_completed, + ", ", + buf, + maxlen); + if (result != VDO_SUCCESS) + return result; + result = write_bio_stats("biosPageCacheCompleted : ", + &stats->bios_page_cache_completed, + ", ", + buf, + maxlen); + if (result != VDO_SUCCESS) + return result; + result = write_bio_stats("biosAcknowledged : ", + &stats->bios_acknowledged, + ", ", + buf, + maxlen); + if (result != VDO_SUCCESS) + return result; + result = write_bio_stats("biosAcknowledgedPartial : ", + &stats->bios_acknowledged_partial, + ", ", + buf, + maxlen); + if (result != VDO_SUCCESS) + return result; + /* Current number of bios in progress */ + result = write_bio_stats("biosInProgress : ", + &stats->bios_in_progress, + ", ", + buf, + maxlen); + if (result != VDO_SUCCESS) + return result; + /* Memory usage stats. 
*/ + result = write_memory_usage("memoryUsage : ", + &stats->memory_usage, + ", ", + buf, + maxlen); + if (result != VDO_SUCCESS) + return result; + /* The statistics for the UDS index */ + result = write_index_statistics("index : ", + &stats->index, + ", ", + buf, + maxlen); + if (result != VDO_SUCCESS) + return result; + result = write_string(NULL, "}", suffix, buf, maxlen); + if (result != VDO_SUCCESS) + return result; + return VDO_SUCCESS; +} + +int vdo_write_stats(struct vdo *vdo, + char *buf, + unsigned int maxlen) +{ + struct vdo_statistics *stats; + int result; + + result = UDS_ALLOCATE(1, struct vdo_statistics, __func__, &stats); + if (result != VDO_SUCCESS) + return result; + + vdo_fetch_statistics(vdo, stats); + result = write_vdo_statistics(NULL, stats, NULL, &buf, &maxlen); + UDS_FREE(stats); + return result; +} diff --git a/drivers/md/dm-vdo/message-stats.h b/drivers/md/dm-vdo/message-stats.h new file mode 100644 index 000000000000..f7fceca9acab --- /dev/null +++ b/drivers/md/dm-vdo/message-stats.h @@ -0,0 +1,13 @@ +/* SPDX-License-Identifier: GPL-2.0-only */ +/* + * Copyright 2023 Red Hat + */ + +#ifndef VDO_MESSAGE_STATS_H +#define VDO_MESSAGE_STATS_H + +#include "types.h" + +int vdo_write_stats(struct vdo *vdo, char *buf, unsigned int maxlen); + +#endif /* VDO_MESSAGE_STATS_H */ diff --git a/drivers/md/dm-vdo/statistics.h b/drivers/md/dm-vdo/statistics.h new file mode 100644 index 000000000000..da8b6a7fe24f --- /dev/null +++ b/drivers/md/dm-vdo/statistics.h @@ -0,0 +1,278 @@ +/* SPDX-License-Identifier: GPL-2.0-only */ +/* + * Copyright 2023 Red Hat + */ + +#ifndef STATISTICS_H +#define STATISTICS_H + +#include "types.h" + +enum { + STATISTICS_VERSION = 36, +}; + +struct block_allocator_statistics { + /** The total number of slabs from which blocks may be allocated */ + u64 slab_count; + /** The total number of slabs from which blocks have ever been allocated */ + u64 slabs_opened; + /** The number of times since loading that a slab has been re-opened */ + u64 slabs_reopened; +}; + +/** + * Counters for tracking the number of items written (blocks, requests, etc.) + * that keep track of totals at steps in the write pipeline. Three counters + * allow the number of buffered, in-memory items and the number of in-flight, + * unacknowledged writes to be derived, while still tracking totals for + * reporting purposes + */ +struct commit_statistics { + /** The total number of items on which processing has started */ + u64 started; + /** The total number of items for which a write operation has been issued */ + u64 written; + /** The total number of items for which a write operation has completed */ + u64 committed; +}; + +/** Counters for events in the recovery journal */ +struct recovery_journal_statistics { + /** Number of times the on-disk journal was full */ + u64 disk_full; + /** Number of times the recovery journal requested slab journal commits. */ + u64 slab_journal_commits_requested; + /** Write/Commit totals for individual journal entries */ + struct commit_statistics entries; + /** Write/Commit totals for journal blocks */ + struct commit_statistics blocks; +}; + +/** The statistics for the compressed block packer. 
*/ +struct packer_statistics { + /** Number of compressed data items written since startup */ + u64 compressed_fragments_written; + /** Number of blocks containing compressed items written since startup */ + u64 compressed_blocks_written; + /** Number of VIOs that are pending in the packer */ + u64 compressed_fragments_in_packer; +}; + +/** The statistics for the slab journals. */ +struct slab_journal_statistics { + /** Number of times the on-disk journal was full */ + u64 disk_full_count; + /** Number of times an entry was added over the flush threshold */ + u64 flush_count; + /** Number of times an entry was added over the block threshold */ + u64 blocked_count; + /** Number of times a tail block was written */ + u64 blocks_written; + /** Number of times we had to wait for the tail to write */ + u64 tail_busy_count; +}; + +/** The statistics for the slab summary. */ +struct slab_summary_statistics { + /** Number of blocks written */ + u64 blocks_written; +}; + +/** The statistics for the reference counts. */ +struct ref_counts_statistics { + /** Number of reference blocks written */ + u64 blocks_written; +}; + +/** The statistics for the block map. */ +struct block_map_statistics { + /** number of dirty (resident) pages */ + u32 dirty_pages; + /** number of clean (resident) pages */ + u32 clean_pages; + /** number of free pages */ + u32 free_pages; + /** number of pages in failed state */ + u32 failed_pages; + /** number of pages incoming */ + u32 incoming_pages; + /** number of pages outgoing */ + u32 outgoing_pages; + /** how many times free page not avail */ + u32 cache_pressure; + /** number of get_vdo_page() calls for read */ + u64 read_count; + /** number of get_vdo_page() calls for write */ + u64 write_count; + /** number of times pages failed to read */ + u64 failed_reads; + /** number of times pages failed to write */ + u64 failed_writes; + /** number of gets that are reclaimed */ + u64 reclaimed; + /** number of gets for outgoing pages */ + u64 read_outgoing; + /** number of gets that were already there */ + u64 found_in_cache; + /** number of gets requiring discard */ + u64 discard_required; + /** number of gets enqueued for their page */ + u64 wait_for_page; + /** number of gets that have to fetch */ + u64 fetch_required; + /** number of page fetches */ + u64 pages_loaded; + /** number of page saves */ + u64 pages_saved; + /** the number of flushes issued */ + u64 flush_count; +}; + +/** The dedupe statistics from hash locks */ +struct hash_lock_statistics { + /** Number of times the UDS advice proved correct */ + u64 dedupe_advice_valid; + /** Number of times the UDS advice proved incorrect */ + u64 dedupe_advice_stale; + /** Number of writes with the same data as another in-flight write */ + u64 concurrent_data_matches; + /** Number of writes whose hash collided with an in-flight write */ + u64 concurrent_hash_collisions; + /** Current number of dedupe queries that are in flight */ + u32 curr_dedupe_queries; +}; + +/** Counts of error conditions in VDO. 
*/ +struct error_statistics { + /** number of times VDO got an invalid dedupe advice PBN from UDS */ + u64 invalid_advice_pbn_count; + /** number of times a VIO completed with a VDO_NO_SPACE error */ + u64 no_space_error_count; + /** number of times a VIO completed with a VDO_READ_ONLY error */ + u64 read_only_error_count; +}; + +struct bio_stats { + /** Number of REQ_OP_READ bios */ + u64 read; + /** Number of REQ_OP_WRITE bios with data */ + u64 write; + /** Number of bios tagged with REQ_PREFLUSH and containing no data */ + u64 empty_flush; + /** Number of REQ_OP_DISCARD bios */ + u64 discard; + /** Number of bios tagged with REQ_PREFLUSH */ + u64 flush; + /** Number of bios tagged with REQ_FUA */ + u64 fua; +}; + +struct memory_usage { + /** Tracked bytes currently allocated. */ + u64 bytes_used; + /** Maximum tracked bytes allocated. */ + u64 peak_bytes_used; +}; + +/** UDS index statistics */ +struct index_statistics { + /** Number of records stored in the index */ + u64 entries_indexed; + /** Number of post calls that found an existing entry */ + u64 posts_found; + /** Number of post calls that added a new entry */ + u64 posts_not_found; + /** Number of query calls that found an existing entry */ + u64 queries_found; + /** Number of query calls that added a new entry */ + u64 queries_not_found; + /** Number of update calls that found an existing entry */ + u64 updates_found; + /** Number of update calls that added a new entry */ + u64 updates_not_found; + /** Number of entries discarded */ + u64 entries_discarded; +}; + +/** The statistics of the vdo service. */ +struct vdo_statistics { + u32 version; + /** Number of blocks used for data */ + u64 data_blocks_used; + /** Number of blocks used for VDO metadata */ + u64 overhead_blocks_used; + /** Number of logical blocks that are currently mapped to physical blocks */ + u64 logical_blocks_used; + /** number of physical blocks */ + block_count_t physical_blocks; + /** number of logical blocks */ + block_count_t logical_blocks; + /** Size of the block map page cache, in bytes */ + u64 block_map_cache_size; + /** The physical block size */ + u64 block_size; + /** Number of times the VDO has successfully recovered */ + u64 complete_recoveries; + /** Number of times the VDO has recovered from read-only mode */ + u64 read_only_recoveries; + /** String describing the operating mode of the VDO */ + char mode[15]; + /** Whether the VDO is in recovery mode */ + bool in_recovery_mode; + /** What percentage of recovery mode work has been completed */ + u8 recovery_percentage; + /** The statistics for the compressed block packer */ + struct packer_statistics packer; + /** Counters for events in the block allocator */ + struct block_allocator_statistics allocator; + /** Counters for events in the recovery journal */ + struct recovery_journal_statistics journal; + /** The statistics for the slab journals */ + struct slab_journal_statistics slab_journal; + /** The statistics for the slab summary */ + struct slab_summary_statistics slab_summary; + /** The statistics for the reference counts */ + struct ref_counts_statistics ref_counts; + /** The statistics for the block map */ + struct block_map_statistics block_map; + /** The dedupe statistics from hash locks */ + struct hash_lock_statistics hash_lock; + /** Counts of error conditions */ + struct error_statistics errors; + /** The VDO instance */ + u32 instance; + /** Current number of active VIOs */ + u32 current_vios_in_progress; + /** Maximum number of active VIOs */ + u32 max_vios; + /** Number 
of times the UDS index was too slow in responding */ + u64 dedupe_advice_timeouts; + /** Number of flush requests submitted to the storage device */ + u64 flush_out; + /** Logical block size */ + u64 logical_block_size; + /** Bios submitted into VDO from above */ + struct bio_stats bios_in; + struct bio_stats bios_in_partial; + /** Bios submitted onward for user data */ + struct bio_stats bios_out; + /** Bios submitted onward for metadata */ + struct bio_stats bios_meta; + struct bio_stats bios_journal; + struct bio_stats bios_page_cache; + struct bio_stats bios_out_completed; + struct bio_stats bios_meta_completed; + struct bio_stats bios_journal_completed; + struct bio_stats bios_page_cache_completed; + struct bio_stats bios_acknowledged; + struct bio_stats bios_acknowledged_partial; + /** Current number of bios in progress */ + struct bio_stats bios_in_progress; + /** Memory usage stats. */ + struct memory_usage memory_usage; + /** The statistics for the UDS index */ + struct index_statistics index; +}; + +#endif /* not STATISTICS_H */ From patchwork Thu Oct 26 21:41:33 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Matthew Sakai X-Patchwork-Id: 13437954 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 780B3C25B6B for ; Thu, 26 Oct 2023 21:44:32 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1344852AbjJZVoc (ORCPT ); Thu, 26 Oct 2023 17:44:32 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:40990 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S231947AbjJZVoT (ORCPT ); Thu, 26 Oct 2023 17:44:19 -0400 Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.129.124]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id A397DD54 for ; Thu, 26 Oct 2023 14:41:46 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1698356505; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=1jVoe3rXMWrMQLtsclPr+8o2cvQKTOddCbPfEM9W4I4=; b=YmzRQQfklEAyozU/Cm/OlNY8+RHtC99sLPvYiMreWDueSYqrD51ZVd4x1HFLzRCavqUzpd jKYmV+CwxIGC5HImaBW3wmXunKgW2AcoN4L8BXN11MDbxW8x/bzBlda4J+eIPDlXQ7OH3b xkE6EyhwD36AhTeSSF8gxE/pIaqxsWM= Received: from mimecast-mx02.redhat.com (mimecast-mx02.redhat.com [66.187.233.88]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.3, cipher=TLS_AES_256_GCM_SHA384) id us-mta-41-pcRogTvEOtGMHGhh13L2fw-1; Thu, 26 Oct 2023 17:41:43 -0400 X-MC-Unique: pcRogTvEOtGMHGhh13L2fw-1 Received: from smtp.corp.redhat.com (int-mx02.intmail.prod.int.rdu2.redhat.com [10.11.54.2]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (2048 bits) server-digest SHA256) (No client certificate requested) by mimecast-mx02.redhat.com (Postfix) with ESMTPS id 77AAB857CF1; Thu, 26 Oct 2023 21:41:43 +0000 (UTC) Received: from pbitcolo-build-10.permabit.com (pbitcolo-build-10.permabit.lab.eng.bos.redhat.com [10.19.117.76]) by smtp.corp.redhat.com (Postfix) with ESMTP id 5A40640C6F7B; Thu, 26 Oct 2023 21:41:43 +0000 (UTC) Received: by pbitcolo-build-10.permabit.com 
(Postfix, from userid 1138) id 0ED213003C; Thu, 26 Oct 2023 17:41:43 -0400 (EDT) From: Matthew Sakai To: dm-devel@lists.linux.dev, linux-block@vger.kernel.org Cc: Matthew Sakai Subject: [PATCH v4 36/39] dm vdo: add sysfs support for setting parameters and fetching stats Date: Thu, 26 Oct 2023 17:41:33 -0400 Message-Id: <20231026214136.1067410-37-msakai@redhat.com> In-Reply-To: <20231026214136.1067410-1-msakai@redhat.com> References: <20231026214136.1067410-1-msakai@redhat.com> MIME-Version: 1.0 X-Scanned-By: MIMEDefang 3.4.1 on 10.11.54.2 Precedence: bulk List-ID: X-Mailing-List: linux-block@vger.kernel.org Add data and methods setting run time parameters via sysfs, and to make state and statistics information available through sysfs. Co-developed-by: J. corwin Coburn Signed-off-by: J. corwin Coburn Co-developed-by: Michael Sclafani Signed-off-by: Michael Sclafani Co-developed-by: Sweet Tea Dorminy Signed-off-by: Sweet Tea Dorminy Co-developed-by: Bruce Johnston Signed-off-by: Bruce Johnston Signed-off-by: Matthew Sakai --- drivers/md/dm-vdo/pool-sysfs-stats.c | 2063 ++++++++++++++++++++++++++ drivers/md/dm-vdo/pool-sysfs.c | 193 +++ drivers/md/dm-vdo/pool-sysfs.h | 19 + drivers/md/dm-vdo/sysfs.c | 84 ++ 4 files changed, 2359 insertions(+) create mode 100644 drivers/md/dm-vdo/pool-sysfs-stats.c create mode 100644 drivers/md/dm-vdo/pool-sysfs.c create mode 100644 drivers/md/dm-vdo/pool-sysfs.h create mode 100644 drivers/md/dm-vdo/sysfs.c diff --git a/drivers/md/dm-vdo/pool-sysfs-stats.c b/drivers/md/dm-vdo/pool-sysfs-stats.c new file mode 100644 index 000000000000..ae3838894a1c --- /dev/null +++ b/drivers/md/dm-vdo/pool-sysfs-stats.c @@ -0,0 +1,2063 @@ +// SPDX-License-Identifier: GPL-2.0-only +/* + * Copyright 2023 Red Hat + */ + +#include + +#include "logger.h" +#include "string-utils.h" + +#include "dedupe.h" +#include "pool-sysfs.h" +#include "statistics.h" +#include "vdo.h" + +struct pool_stats_attribute { + struct attribute attr; + ssize_t (*print)(struct vdo_statistics *stats, char *buf); +}; + +static ssize_t pool_stats_attr_show(struct kobject *directory, + struct attribute *attr, + char *buf) +{ + ssize_t size; + struct pool_stats_attribute *pool_stats_attr = + container_of(attr, struct pool_stats_attribute, attr); + struct vdo *vdo = container_of(directory, struct vdo, stats_directory); + + if (pool_stats_attr->print == NULL) + return -EINVAL; + + mutex_lock(&vdo->stats_mutex); + vdo_fetch_statistics(vdo, &vdo->stats_buffer); + size = pool_stats_attr->print(&vdo->stats_buffer, buf); + mutex_unlock(&vdo->stats_mutex); + + return size; +} + +const struct sysfs_ops vdo_pool_stats_sysfs_ops = { + .show = pool_stats_attr_show, + .store = NULL, +}; + +/* Number of blocks used for data */ +static ssize_t +pool_stats_print_data_blocks_used(struct vdo_statistics *stats, char *buf) +{ + return sprintf(buf, "%llu\n", stats->data_blocks_used); +} + +static struct pool_stats_attribute pool_stats_attr_data_blocks_used = { + .attr = { .name = "data_blocks_used", .mode = 0444, }, + .print = pool_stats_print_data_blocks_used, +}; + +/* Number of blocks used for VDO metadata */ +static ssize_t +pool_stats_print_overhead_blocks_used(struct vdo_statistics *stats, char *buf) +{ + return sprintf(buf, "%llu\n", stats->overhead_blocks_used); +} + +static struct pool_stats_attribute pool_stats_attr_overhead_blocks_used = { + .attr = { .name = "overhead_blocks_used", .mode = 0444, }, + .print = pool_stats_print_overhead_blocks_used, +}; + +/* Number of logical blocks that are currently mapped to physical 
blocks */ +static ssize_t +pool_stats_print_logical_blocks_used(struct vdo_statistics *stats, char *buf) +{ + return sprintf(buf, "%llu\n", stats->logical_blocks_used); +} + +static struct pool_stats_attribute pool_stats_attr_logical_blocks_used = { + .attr = { .name = "logical_blocks_used", .mode = 0444, }, + .print = pool_stats_print_logical_blocks_used, +}; + +/* number of physical blocks */ +static ssize_t +pool_stats_print_physical_blocks(struct vdo_statistics *stats, char *buf) +{ + return sprintf(buf, "%llu\n", stats->physical_blocks); +} + +static struct pool_stats_attribute pool_stats_attr_physical_blocks = { + .attr = { .name = "physical_blocks", .mode = 0444, }, + .print = pool_stats_print_physical_blocks, +}; + +/* number of logical blocks */ +static ssize_t +pool_stats_print_logical_blocks(struct vdo_statistics *stats, char *buf) +{ + return sprintf(buf, "%llu\n", stats->logical_blocks); +} + +static struct pool_stats_attribute pool_stats_attr_logical_blocks = { + .attr = { .name = "logical_blocks", .mode = 0444, }, + .print = pool_stats_print_logical_blocks, +}; + +/* Size of the block map page cache, in bytes */ +static ssize_t +pool_stats_print_block_map_cache_size(struct vdo_statistics *stats, char *buf) +{ + return sprintf(buf, "%llu\n", stats->block_map_cache_size); +} + +static struct pool_stats_attribute pool_stats_attr_block_map_cache_size = { + .attr = { .name = "block_map_cache_size", .mode = 0444, }, + .print = pool_stats_print_block_map_cache_size, +}; + +/* The physical block size */ +static ssize_t +pool_stats_print_block_size(struct vdo_statistics *stats, char *buf) +{ + return sprintf(buf, "%llu\n", stats->block_size); +} + +static struct pool_stats_attribute pool_stats_attr_block_size = { + .attr = { .name = "block_size", .mode = 0444, }, + .print = pool_stats_print_block_size, +}; + +/* Number of times the VDO has successfully recovered */ +static ssize_t +pool_stats_print_complete_recoveries(struct vdo_statistics *stats, char *buf) +{ + return sprintf(buf, "%llu\n", stats->complete_recoveries); +} + +static struct pool_stats_attribute pool_stats_attr_complete_recoveries = { + .attr = { .name = "complete_recoveries", .mode = 0444, }, + .print = pool_stats_print_complete_recoveries, +}; + +/* Number of times the VDO has recovered from read-only mode */ +static ssize_t +pool_stats_print_read_only_recoveries(struct vdo_statistics *stats, char *buf) +{ + return sprintf(buf, "%llu\n", stats->read_only_recoveries); +} + +static struct pool_stats_attribute pool_stats_attr_read_only_recoveries = { + .attr = { .name = "read_only_recoveries", .mode = 0444, }, + .print = pool_stats_print_read_only_recoveries, +}; + +/* String describing the operating mode of the VDO */ +static ssize_t +pool_stats_print_mode(struct vdo_statistics *stats, char *buf) +{ + return sprintf(buf, "%s\n", stats->mode); +} + +static struct pool_stats_attribute pool_stats_attr_mode = { + .attr = { .name = "mode", .mode = 0444, }, + .print = pool_stats_print_mode, +}; + +/* Whether the VDO is in recovery mode */ +static ssize_t +pool_stats_print_in_recovery_mode(struct vdo_statistics *stats, char *buf) +{ + return sprintf(buf, "%d\n", stats->in_recovery_mode); +} + +static struct pool_stats_attribute pool_stats_attr_in_recovery_mode = { + .attr = { .name = "in_recovery_mode", .mode = 0444, }, + .print = pool_stats_print_in_recovery_mode, +}; + +/* What percentage of recovery mode work has been completed */ +static ssize_t +pool_stats_print_recovery_percentage(struct vdo_statistics *stats, char 
*buf) +{ + return sprintf(buf, "%u\n", stats->recovery_percentage); +} + +static struct pool_stats_attribute pool_stats_attr_recovery_percentage = { + .attr = { .name = "recovery_percentage", .mode = 0444, }, + .print = pool_stats_print_recovery_percentage, +}; + +/* Number of compressed data items written since startup */ +static ssize_t +pool_stats_print_packer_compressed_fragments_written(struct vdo_statistics *stats, char *buf) +{ + return sprintf(buf, "%llu\n", stats->packer.compressed_fragments_written); +} + +static struct pool_stats_attribute pool_stats_attr_packer_compressed_fragments_written = { + .attr = { .name = "packer_compressed_fragments_written", .mode = 0444, }, + .print = pool_stats_print_packer_compressed_fragments_written, +}; + +/* Number of blocks containing compressed items written since startup */ +static ssize_t +pool_stats_print_packer_compressed_blocks_written(struct vdo_statistics *stats, char *buf) +{ + return sprintf(buf, "%llu\n", stats->packer.compressed_blocks_written); +} + +static struct pool_stats_attribute pool_stats_attr_packer_compressed_blocks_written = { + .attr = { .name = "packer_compressed_blocks_written", .mode = 0444, }, + .print = pool_stats_print_packer_compressed_blocks_written, +}; + +/* Number of VIOs that are pending in the packer */ +static ssize_t +pool_stats_print_packer_compressed_fragments_in_packer(struct vdo_statistics *stats, char *buf) +{ + return sprintf(buf, "%llu\n", stats->packer.compressed_fragments_in_packer); +} + +static struct pool_stats_attribute pool_stats_attr_packer_compressed_fragments_in_packer = { + .attr = { .name = "packer_compressed_fragments_in_packer", .mode = 0444, }, + .print = pool_stats_print_packer_compressed_fragments_in_packer, +}; + +/* The total number of slabs from which blocks may be allocated */ +static ssize_t +pool_stats_print_allocator_slab_count(struct vdo_statistics *stats, char *buf) +{ + return sprintf(buf, "%llu\n", stats->allocator.slab_count); +} + +static struct pool_stats_attribute pool_stats_attr_allocator_slab_count = { + .attr = { .name = "allocator_slab_count", .mode = 0444, }, + .print = pool_stats_print_allocator_slab_count, +}; + +/* The total number of slabs from which blocks have ever been allocated */ +static ssize_t +pool_stats_print_allocator_slabs_opened(struct vdo_statistics *stats, char *buf) +{ + return sprintf(buf, "%llu\n", stats->allocator.slabs_opened); +} + +static struct pool_stats_attribute pool_stats_attr_allocator_slabs_opened = { + .attr = { .name = "allocator_slabs_opened", .mode = 0444, }, + .print = pool_stats_print_allocator_slabs_opened, +}; + +/* The number of times since loading that a slab has been re-opened */ +static ssize_t +pool_stats_print_allocator_slabs_reopened(struct vdo_statistics *stats, char *buf) +{ + return sprintf(buf, "%llu\n", stats->allocator.slabs_reopened); +} + +static struct pool_stats_attribute pool_stats_attr_allocator_slabs_reopened = { + .attr = { .name = "allocator_slabs_reopened", .mode = 0444, }, + .print = pool_stats_print_allocator_slabs_reopened, +}; + +/* Number of times the on-disk journal was full */ +static ssize_t +pool_stats_print_journal_disk_full(struct vdo_statistics *stats, char *buf) +{ + return sprintf(buf, "%llu\n", stats->journal.disk_full); +} + +static struct pool_stats_attribute pool_stats_attr_journal_disk_full = { + .attr = { .name = "journal_disk_full", .mode = 0444, }, + .print = pool_stats_print_journal_disk_full, +}; + +/* Number of times the recovery journal requested slab journal commits. 
*/ +static ssize_t +pool_stats_print_journal_slab_journal_commits_requested(struct vdo_statistics *stats, char *buf) +{ + return sprintf(buf, "%llu\n", stats->journal.slab_journal_commits_requested); +} + +static struct pool_stats_attribute pool_stats_attr_journal_slab_journal_commits_requested = { + .attr = { .name = "journal_slab_journal_commits_requested", .mode = 0444, }, + .print = pool_stats_print_journal_slab_journal_commits_requested, +}; + +/* The total number of items on which processing has started */ +static ssize_t +pool_stats_print_journal_entries_started(struct vdo_statistics *stats, char *buf) +{ + return sprintf(buf, "%llu\n", stats->journal.entries.started); +} + +static struct pool_stats_attribute pool_stats_attr_journal_entries_started = { + .attr = { .name = "journal_entries_started", .mode = 0444, }, + .print = pool_stats_print_journal_entries_started, +}; + +/* The total number of items for which a write operation has been issued */ +static ssize_t +pool_stats_print_journal_entries_written(struct vdo_statistics *stats, char *buf) +{ + return sprintf(buf, "%llu\n", stats->journal.entries.written); +} + +static struct pool_stats_attribute pool_stats_attr_journal_entries_written = { + .attr = { .name = "journal_entries_written", .mode = 0444, }, + .print = pool_stats_print_journal_entries_written, +}; + +/* The total number of items for which a write operation has completed */ +static ssize_t +pool_stats_print_journal_entries_committed(struct vdo_statistics *stats, char *buf) +{ + return sprintf(buf, "%llu\n", stats->journal.entries.committed); +} + +static struct pool_stats_attribute pool_stats_attr_journal_entries_committed = { + .attr = { .name = "journal_entries_committed", .mode = 0444, }, + .print = pool_stats_print_journal_entries_committed, +}; + +/* The total number of items on which processing has started */ +static ssize_t +pool_stats_print_journal_blocks_started(struct vdo_statistics *stats, char *buf) +{ + return sprintf(buf, "%llu\n", stats->journal.blocks.started); +} + +static struct pool_stats_attribute pool_stats_attr_journal_blocks_started = { + .attr = { .name = "journal_blocks_started", .mode = 0444, }, + .print = pool_stats_print_journal_blocks_started, +}; + +/* The total number of items for which a write operation has been issued */ +static ssize_t +pool_stats_print_journal_blocks_written(struct vdo_statistics *stats, char *buf) +{ + return sprintf(buf, "%llu\n", stats->journal.blocks.written); +} + +static struct pool_stats_attribute pool_stats_attr_journal_blocks_written = { + .attr = { .name = "journal_blocks_written", .mode = 0444, }, + .print = pool_stats_print_journal_blocks_written, +}; + +/* The total number of items for which a write operation has completed */ +static ssize_t +pool_stats_print_journal_blocks_committed(struct vdo_statistics *stats, char *buf) +{ + return sprintf(buf, "%llu\n", stats->journal.blocks.committed); +} + +static struct pool_stats_attribute pool_stats_attr_journal_blocks_committed = { + .attr = { .name = "journal_blocks_committed", .mode = 0444, }, + .print = pool_stats_print_journal_blocks_committed, +}; + +/* Number of times the on-disk journal was full */ +static ssize_t +pool_stats_print_slab_journal_disk_full_count(struct vdo_statistics *stats, char *buf) +{ + return sprintf(buf, "%llu\n", stats->slab_journal.disk_full_count); +} + +static struct pool_stats_attribute pool_stats_attr_slab_journal_disk_full_count = { + .attr = { .name = "slab_journal_disk_full_count", .mode = 0444, }, + .print = 
pool_stats_print_slab_journal_disk_full_count, +}; + +/* Number of times an entry was added over the flush threshold */ +static ssize_t +pool_stats_print_slab_journal_flush_count(struct vdo_statistics *stats, char *buf) +{ + return sprintf(buf, "%llu\n", stats->slab_journal.flush_count); +} + +static struct pool_stats_attribute pool_stats_attr_slab_journal_flush_count = { + .attr = { .name = "slab_journal_flush_count", .mode = 0444, }, + .print = pool_stats_print_slab_journal_flush_count, +}; + +/* Number of times an entry was added over the block threshold */ +static ssize_t +pool_stats_print_slab_journal_blocked_count(struct vdo_statistics *stats, char *buf) +{ + return sprintf(buf, "%llu\n", stats->slab_journal.blocked_count); +} + +static struct pool_stats_attribute pool_stats_attr_slab_journal_blocked_count = { + .attr = { .name = "slab_journal_blocked_count", .mode = 0444, }, + .print = pool_stats_print_slab_journal_blocked_count, +}; + +/* Number of times a tail block was written */ +static ssize_t +pool_stats_print_slab_journal_blocks_written(struct vdo_statistics *stats, char *buf) +{ + return sprintf(buf, "%llu\n", stats->slab_journal.blocks_written); +} + +static struct pool_stats_attribute pool_stats_attr_slab_journal_blocks_written = { + .attr = { .name = "slab_journal_blocks_written", .mode = 0444, }, + .print = pool_stats_print_slab_journal_blocks_written, +}; + +/* Number of times we had to wait for the tail to write */ +static ssize_t +pool_stats_print_slab_journal_tail_busy_count(struct vdo_statistics *stats, char *buf) +{ + return sprintf(buf, "%llu\n", stats->slab_journal.tail_busy_count); +} + +static struct pool_stats_attribute pool_stats_attr_slab_journal_tail_busy_count = { + .attr = { .name = "slab_journal_tail_busy_count", .mode = 0444, }, + .print = pool_stats_print_slab_journal_tail_busy_count, +}; + +/* Number of blocks written */ +static ssize_t +pool_stats_print_slab_summary_blocks_written(struct vdo_statistics *stats, char *buf) +{ + return sprintf(buf, "%llu\n", stats->slab_summary.blocks_written); +} + +static struct pool_stats_attribute pool_stats_attr_slab_summary_blocks_written = { + .attr = { .name = "slab_summary_blocks_written", .mode = 0444, }, + .print = pool_stats_print_slab_summary_blocks_written, +}; + +/* Number of reference blocks written */ +static ssize_t +pool_stats_print_ref_counts_blocks_written(struct vdo_statistics *stats, char *buf) +{ + return sprintf(buf, "%llu\n", stats->ref_counts.blocks_written); +} + +static struct pool_stats_attribute pool_stats_attr_ref_counts_blocks_written = { + .attr = { .name = "ref_counts_blocks_written", .mode = 0444, }, + .print = pool_stats_print_ref_counts_blocks_written, +}; + +/* number of dirty (resident) pages */ +static ssize_t +pool_stats_print_block_map_dirty_pages(struct vdo_statistics *stats, char *buf) +{ + return sprintf(buf, "%u\n", stats->block_map.dirty_pages); +} + +static struct pool_stats_attribute pool_stats_attr_block_map_dirty_pages = { + .attr = { .name = "block_map_dirty_pages", .mode = 0444, }, + .print = pool_stats_print_block_map_dirty_pages, +}; + +/* number of clean (resident) pages */ +static ssize_t +pool_stats_print_block_map_clean_pages(struct vdo_statistics *stats, char *buf) +{ + return sprintf(buf, "%u\n", stats->block_map.clean_pages); +} + +static struct pool_stats_attribute pool_stats_attr_block_map_clean_pages = { + .attr = { .name = "block_map_clean_pages", .mode = 0444, }, + .print = pool_stats_print_block_map_clean_pages, +}; + +/* number of free pages */ 
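[Editorial aside, not part of the patch.] Each attribute defined in pool-sysfs-stats.c exposes one field of struct vdo_statistics as a read-only sysfs file whose filename matches the .name string in its pool_stats_attribute. As a purely illustrative sketch, the userspace program below shows one way a monitoring tool might consume two of the attributes declared in this file, data_blocks_used and logical_blocks_used. The directory containing the attribute files is passed on the command line, since the sysfs registration that determines the actual path happens elsewhere in the series; the savings calculation at the end is only an example, not something the driver reports.

	#include <inttypes.h>
	#include <stdio.h>

	/* Read one u64-valued attribute file, e.g. "data_blocks_used". */
	static int read_stat(const char *dir, const char *name, uint64_t *value)
	{
		char path[4096];
		FILE *file;
		int ok;

		snprintf(path, sizeof(path), "%s/%s", dir, name);
		file = fopen(path, "r");
		if (file == NULL)
			return -1;
		ok = (fscanf(file, "%" SCNu64, value) == 1) ? 0 : -1;
		fclose(file);
		return ok;
	}

	int main(int argc, char **argv)
	{
		uint64_t data_used, logical_used;

		if (argc != 2) {
			fprintf(stderr, "usage: %s <pool statistics directory>\n",
				argv[0]);
			return 1;
		}

		/* Attribute names match the .name fields declared in this file. */
		if (read_stat(argv[1], "data_blocks_used", &data_used) != 0 ||
		    read_stat(argv[1], "logical_blocks_used", &logical_used) != 0) {
			fprintf(stderr, "failed to read statistics from %s\n",
				argv[1]);
			return 1;
		}

		printf("data blocks used:    %" PRIu64 "\n", data_used);
		printf("logical blocks used: %" PRIu64 "\n", logical_used);
		if (logical_used > 0)
			printf("approximate savings: %.1f%%\n",
			       100.0 * (1.0 - (double)data_used /
					      (double)logical_used));
		return 0;
	}

Built with any C compiler, the sketch takes the pool's statistics directory as its only argument; it relies on nothing beyond the one-value-per-file format produced by the pool_stats_print_* helpers above.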
+static ssize_t +pool_stats_print_block_map_free_pages(struct vdo_statistics *stats, char *buf) +{ + return sprintf(buf, "%u\n", stats->block_map.free_pages); +} + +static struct pool_stats_attribute pool_stats_attr_block_map_free_pages = { + .attr = { .name = "block_map_free_pages", .mode = 0444, }, + .print = pool_stats_print_block_map_free_pages, +}; + +/* number of pages in failed state */ +static ssize_t +pool_stats_print_block_map_failed_pages(struct vdo_statistics *stats, char *buf) +{ + return sprintf(buf, "%u\n", stats->block_map.failed_pages); +} + +static struct pool_stats_attribute pool_stats_attr_block_map_failed_pages = { + .attr = { .name = "block_map_failed_pages", .mode = 0444, }, + .print = pool_stats_print_block_map_failed_pages, +}; + +/* number of pages incoming */ +static ssize_t +pool_stats_print_block_map_incoming_pages(struct vdo_statistics *stats, char *buf) +{ + return sprintf(buf, "%u\n", stats->block_map.incoming_pages); +} + +static struct pool_stats_attribute pool_stats_attr_block_map_incoming_pages = { + .attr = { .name = "block_map_incoming_pages", .mode = 0444, }, + .print = pool_stats_print_block_map_incoming_pages, +}; + +/* number of pages outgoing */ +static ssize_t +pool_stats_print_block_map_outgoing_pages(struct vdo_statistics *stats, char *buf) +{ + return sprintf(buf, "%u\n", stats->block_map.outgoing_pages); +} + +static struct pool_stats_attribute pool_stats_attr_block_map_outgoing_pages = { + .attr = { .name = "block_map_outgoing_pages", .mode = 0444, }, + .print = pool_stats_print_block_map_outgoing_pages, +}; + +/* how many times free page not avail */ +static ssize_t +pool_stats_print_block_map_cache_pressure(struct vdo_statistics *stats, char *buf) +{ + return sprintf(buf, "%u\n", stats->block_map.cache_pressure); +} + +static struct pool_stats_attribute pool_stats_attr_block_map_cache_pressure = { + .attr = { .name = "block_map_cache_pressure", .mode = 0444, }, + .print = pool_stats_print_block_map_cache_pressure, +}; + +/* number of get_vdo_page() calls for read */ +static ssize_t +pool_stats_print_block_map_read_count(struct vdo_statistics *stats, char *buf) +{ + return sprintf(buf, "%llu\n", stats->block_map.read_count); +} + +static struct pool_stats_attribute pool_stats_attr_block_map_read_count = { + .attr = { .name = "block_map_read_count", .mode = 0444, }, + .print = pool_stats_print_block_map_read_count, +}; + +/* number of get_vdo_page() calls for write */ +static ssize_t +pool_stats_print_block_map_write_count(struct vdo_statistics *stats, char *buf) +{ + return sprintf(buf, "%llu\n", stats->block_map.write_count); +} + +static struct pool_stats_attribute pool_stats_attr_block_map_write_count = { + .attr = { .name = "block_map_write_count", .mode = 0444, }, + .print = pool_stats_print_block_map_write_count, +}; + +/* number of times pages failed to read */ +static ssize_t +pool_stats_print_block_map_failed_reads(struct vdo_statistics *stats, char *buf) +{ + return sprintf(buf, "%llu\n", stats->block_map.failed_reads); +} + +static struct pool_stats_attribute pool_stats_attr_block_map_failed_reads = { + .attr = { .name = "block_map_failed_reads", .mode = 0444, }, + .print = pool_stats_print_block_map_failed_reads, +}; + +/* number of times pages failed to write */ +static ssize_t +pool_stats_print_block_map_failed_writes(struct vdo_statistics *stats, char *buf) +{ + return sprintf(buf, "%llu\n", stats->block_map.failed_writes); +} + +static struct pool_stats_attribute pool_stats_attr_block_map_failed_writes = { + .attr = { .name 
= "block_map_failed_writes", .mode = 0444, }, + .print = pool_stats_print_block_map_failed_writes, +}; + +/* number of gets that are reclaimed */ +static ssize_t +pool_stats_print_block_map_reclaimed(struct vdo_statistics *stats, char *buf) +{ + return sprintf(buf, "%llu\n", stats->block_map.reclaimed); +} + +static struct pool_stats_attribute pool_stats_attr_block_map_reclaimed = { + .attr = { .name = "block_map_reclaimed", .mode = 0444, }, + .print = pool_stats_print_block_map_reclaimed, +}; + +/* number of gets for outgoing pages */ +static ssize_t +pool_stats_print_block_map_read_outgoing(struct vdo_statistics *stats, char *buf) +{ + return sprintf(buf, "%llu\n", stats->block_map.read_outgoing); +} + +static struct pool_stats_attribute pool_stats_attr_block_map_read_outgoing = { + .attr = { .name = "block_map_read_outgoing", .mode = 0444, }, + .print = pool_stats_print_block_map_read_outgoing, +}; + +/* number of gets that were already there */ +static ssize_t +pool_stats_print_block_map_found_in_cache(struct vdo_statistics *stats, char *buf) +{ + return sprintf(buf, "%llu\n", stats->block_map.found_in_cache); +} + +static struct pool_stats_attribute pool_stats_attr_block_map_found_in_cache = { + .attr = { .name = "block_map_found_in_cache", .mode = 0444, }, + .print = pool_stats_print_block_map_found_in_cache, +}; + +/* number of gets requiring discard */ +static ssize_t +pool_stats_print_block_map_discard_required(struct vdo_statistics *stats, char *buf) +{ + return sprintf(buf, "%llu\n", stats->block_map.discard_required); +} + +static struct pool_stats_attribute pool_stats_attr_block_map_discard_required = { + .attr = { .name = "block_map_discard_required", .mode = 0444, }, + .print = pool_stats_print_block_map_discard_required, +}; + +/* number of gets enqueued for their page */ +static ssize_t +pool_stats_print_block_map_wait_for_page(struct vdo_statistics *stats, char *buf) +{ + return sprintf(buf, "%llu\n", stats->block_map.wait_for_page); +} + +static struct pool_stats_attribute pool_stats_attr_block_map_wait_for_page = { + .attr = { .name = "block_map_wait_for_page", .mode = 0444, }, + .print = pool_stats_print_block_map_wait_for_page, +}; + +/* number of gets that have to fetch */ +static ssize_t +pool_stats_print_block_map_fetch_required(struct vdo_statistics *stats, char *buf) +{ + return sprintf(buf, "%llu\n", stats->block_map.fetch_required); +} + +static struct pool_stats_attribute pool_stats_attr_block_map_fetch_required = { + .attr = { .name = "block_map_fetch_required", .mode = 0444, }, + .print = pool_stats_print_block_map_fetch_required, +}; + +/* number of page fetches */ +static ssize_t +pool_stats_print_block_map_pages_loaded(struct vdo_statistics *stats, char *buf) +{ + return sprintf(buf, "%llu\n", stats->block_map.pages_loaded); +} + +static struct pool_stats_attribute pool_stats_attr_block_map_pages_loaded = { + .attr = { .name = "block_map_pages_loaded", .mode = 0444, }, + .print = pool_stats_print_block_map_pages_loaded, +}; + +/* number of page saves */ +static ssize_t +pool_stats_print_block_map_pages_saved(struct vdo_statistics *stats, char *buf) +{ + return sprintf(buf, "%llu\n", stats->block_map.pages_saved); +} + +static struct pool_stats_attribute pool_stats_attr_block_map_pages_saved = { + .attr = { .name = "block_map_pages_saved", .mode = 0444, }, + .print = pool_stats_print_block_map_pages_saved, +}; + +/* the number of flushes issued */ +static ssize_t +pool_stats_print_block_map_flush_count(struct vdo_statistics *stats, char *buf) +{ + return 
sprintf(buf, "%llu\n", stats->block_map.flush_count); +} + +static struct pool_stats_attribute pool_stats_attr_block_map_flush_count = { + .attr = { .name = "block_map_flush_count", .mode = 0444, }, + .print = pool_stats_print_block_map_flush_count, +}; + +/* Number of times the UDS advice proved correct */ +static ssize_t +pool_stats_print_hash_lock_dedupe_advice_valid(struct vdo_statistics *stats, char *buf) +{ + return sprintf(buf, "%llu\n", stats->hash_lock.dedupe_advice_valid); +} + +static struct pool_stats_attribute pool_stats_attr_hash_lock_dedupe_advice_valid = { + .attr = { .name = "hash_lock_dedupe_advice_valid", .mode = 0444, }, + .print = pool_stats_print_hash_lock_dedupe_advice_valid, +}; + +/* Number of times the UDS advice proved incorrect */ +static ssize_t +pool_stats_print_hash_lock_dedupe_advice_stale(struct vdo_statistics *stats, char *buf) +{ + return sprintf(buf, "%llu\n", stats->hash_lock.dedupe_advice_stale); +} + +static struct pool_stats_attribute pool_stats_attr_hash_lock_dedupe_advice_stale = { + .attr = { .name = "hash_lock_dedupe_advice_stale", .mode = 0444, }, + .print = pool_stats_print_hash_lock_dedupe_advice_stale, +}; + +/* Number of writes with the same data as another in-flight write */ +static ssize_t +pool_stats_print_hash_lock_concurrent_data_matches(struct vdo_statistics *stats, char *buf) +{ + return sprintf(buf, "%llu\n", stats->hash_lock.concurrent_data_matches); +} + +static struct pool_stats_attribute pool_stats_attr_hash_lock_concurrent_data_matches = { + .attr = { .name = "hash_lock_concurrent_data_matches", .mode = 0444, }, + .print = pool_stats_print_hash_lock_concurrent_data_matches, +}; + +/* Number of writes whose hash collided with an in-flight write */ +static ssize_t +pool_stats_print_hash_lock_concurrent_hash_collisions(struct vdo_statistics *stats, char *buf) +{ + return sprintf(buf, "%llu\n", stats->hash_lock.concurrent_hash_collisions); +} + +static struct pool_stats_attribute pool_stats_attr_hash_lock_concurrent_hash_collisions = { + .attr = { .name = "hash_lock_concurrent_hash_collisions", .mode = 0444, }, + .print = pool_stats_print_hash_lock_concurrent_hash_collisions, +}; + +/* Current number of dedupe queries that are in flight */ +static ssize_t +pool_stats_print_hash_lock_curr_dedupe_queries(struct vdo_statistics *stats, char *buf) +{ + return sprintf(buf, "%u\n", stats->hash_lock.curr_dedupe_queries); +} + +static struct pool_stats_attribute pool_stats_attr_hash_lock_curr_dedupe_queries = { + .attr = { .name = "hash_lock_curr_dedupe_queries", .mode = 0444, }, + .print = pool_stats_print_hash_lock_curr_dedupe_queries, +}; + +/* number of times VDO got an invalid dedupe advice PBN from UDS */ +static ssize_t +pool_stats_print_errors_invalid_advice_pbn_count(struct vdo_statistics *stats, char *buf) +{ + return sprintf(buf, "%llu\n", stats->errors.invalid_advice_pbn_count); +} + +static struct pool_stats_attribute pool_stats_attr_errors_invalid_advice_pbn_count = { + .attr = { .name = "errors_invalid_advice_pbn_count", .mode = 0444, }, + .print = pool_stats_print_errors_invalid_advice_pbn_count, +}; + +/* number of times a VIO completed with a VDO_NO_SPACE error */ +static ssize_t +pool_stats_print_errors_no_space_error_count(struct vdo_statistics *stats, char *buf) +{ + return sprintf(buf, "%llu\n", stats->errors.no_space_error_count); +} + +static struct pool_stats_attribute pool_stats_attr_errors_no_space_error_count = { + .attr = { .name = "errors_no_space_error_count", .mode = 0444, }, + .print = 
pool_stats_print_errors_no_space_error_count, +}; + +/* number of times a VIO completed with a VDO_READ_ONLY error */ +static ssize_t +pool_stats_print_errors_read_only_error_count(struct vdo_statistics *stats, char *buf) +{ + return sprintf(buf, "%llu\n", stats->errors.read_only_error_count); +} + +static struct pool_stats_attribute pool_stats_attr_errors_read_only_error_count = { + .attr = { .name = "errors_read_only_error_count", .mode = 0444, }, + .print = pool_stats_print_errors_read_only_error_count, +}; + +/* The VDO instance */ +static ssize_t +pool_stats_print_instance(struct vdo_statistics *stats, char *buf) +{ + return sprintf(buf, "%u\n", stats->instance); +} + +static struct pool_stats_attribute pool_stats_attr_instance = { + .attr = { .name = "instance", .mode = 0444, }, + .print = pool_stats_print_instance, +}; + +/* Current number of active VIOs */ +static ssize_t +pool_stats_print_current_vios_in_progress(struct vdo_statistics *stats, char *buf) +{ + return sprintf(buf, "%u\n", stats->current_vios_in_progress); +} + +static struct pool_stats_attribute pool_stats_attr_current_vios_in_progress = { + .attr = { .name = "current_vios_in_progress", .mode = 0444, }, + .print = pool_stats_print_current_vios_in_progress, +}; + +/* Maximum number of active VIOs */ +static ssize_t +pool_stats_print_max_vios(struct vdo_statistics *stats, char *buf) +{ + return sprintf(buf, "%u\n", stats->max_vios); +} + +static struct pool_stats_attribute pool_stats_attr_max_vios = { + .attr = { .name = "max_vios", .mode = 0444, }, + .print = pool_stats_print_max_vios, +}; + +/* Number of times the UDS index was too slow in responding */ +static ssize_t +pool_stats_print_dedupe_advice_timeouts(struct vdo_statistics *stats, char *buf) +{ + return sprintf(buf, "%llu\n", stats->dedupe_advice_timeouts); +} + +static struct pool_stats_attribute pool_stats_attr_dedupe_advice_timeouts = { + .attr = { .name = "dedupe_advice_timeouts", .mode = 0444, }, + .print = pool_stats_print_dedupe_advice_timeouts, +}; + +/* Number of flush requests submitted to the storage device */ +static ssize_t +pool_stats_print_flush_out(struct vdo_statistics *stats, char *buf) +{ + return sprintf(buf, "%llu\n", stats->flush_out); +} + +static struct pool_stats_attribute pool_stats_attr_flush_out = { + .attr = { .name = "flush_out", .mode = 0444, }, + .print = pool_stats_print_flush_out, +}; + +/* Logical block size */ +static ssize_t +pool_stats_print_logical_block_size(struct vdo_statistics *stats, char *buf) +{ + return sprintf(buf, "%llu\n", stats->logical_block_size); +} + +static struct pool_stats_attribute pool_stats_attr_logical_block_size = { + .attr = { .name = "logical_block_size", .mode = 0444, }, + .print = pool_stats_print_logical_block_size, +}; + +/* Number of REQ_OP_READ bios */ +static ssize_t +pool_stats_print_bios_in_read(struct vdo_statistics *stats, char *buf) +{ + return sprintf(buf, "%llu\n", stats->bios_in.read); +} + +static struct pool_stats_attribute pool_stats_attr_bios_in_read = { + .attr = { .name = "bios_in_read", .mode = 0444, }, + .print = pool_stats_print_bios_in_read, +}; + +/* Number of REQ_OP_WRITE bios with data */ +static ssize_t +pool_stats_print_bios_in_write(struct vdo_statistics *stats, char *buf) +{ + return sprintf(buf, "%llu\n", stats->bios_in.write); +} + +static struct pool_stats_attribute pool_stats_attr_bios_in_write = { + .attr = { .name = "bios_in_write", .mode = 0444, }, + .print = pool_stats_print_bios_in_write, +}; + +/* Number of bios tagged with REQ_PREFLUSH and containing no 
data */ +static ssize_t +pool_stats_print_bios_in_empty_flush(struct vdo_statistics *stats, char *buf) +{ + return sprintf(buf, "%llu\n", stats->bios_in.empty_flush); +} + +static struct pool_stats_attribute pool_stats_attr_bios_in_empty_flush = { + .attr = { .name = "bios_in_empty_flush", .mode = 0444, }, + .print = pool_stats_print_bios_in_empty_flush, +}; + +/* Number of REQ_OP_DISCARD bios */ +static ssize_t +pool_stats_print_bios_in_discard(struct vdo_statistics *stats, char *buf) +{ + return sprintf(buf, "%llu\n", stats->bios_in.discard); +} + +static struct pool_stats_attribute pool_stats_attr_bios_in_discard = { + .attr = { .name = "bios_in_discard", .mode = 0444, }, + .print = pool_stats_print_bios_in_discard, +}; + +/* Number of bios tagged with REQ_PREFLUSH */ +static ssize_t +pool_stats_print_bios_in_flush(struct vdo_statistics *stats, char *buf) +{ + return sprintf(buf, "%llu\n", stats->bios_in.flush); +} + +static struct pool_stats_attribute pool_stats_attr_bios_in_flush = { + .attr = { .name = "bios_in_flush", .mode = 0444, }, + .print = pool_stats_print_bios_in_flush, +}; + +/* Number of bios tagged with REQ_FUA */ +static ssize_t +pool_stats_print_bios_in_fua(struct vdo_statistics *stats, char *buf) +{ + return sprintf(buf, "%llu\n", stats->bios_in.fua); +} + +static struct pool_stats_attribute pool_stats_attr_bios_in_fua = { + .attr = { .name = "bios_in_fua", .mode = 0444, }, + .print = pool_stats_print_bios_in_fua, +}; + +/* Number of REQ_OP_READ bios */ +static ssize_t +pool_stats_print_bios_in_partial_read(struct vdo_statistics *stats, char *buf) +{ + return sprintf(buf, "%llu\n", stats->bios_in_partial.read); +} + +static struct pool_stats_attribute pool_stats_attr_bios_in_partial_read = { + .attr = { .name = "bios_in_partial_read", .mode = 0444, }, + .print = pool_stats_print_bios_in_partial_read, +}; + +/* Number of REQ_OP_WRITE bios with data */ +static ssize_t +pool_stats_print_bios_in_partial_write(struct vdo_statistics *stats, char *buf) +{ + return sprintf(buf, "%llu\n", stats->bios_in_partial.write); +} + +static struct pool_stats_attribute pool_stats_attr_bios_in_partial_write = { + .attr = { .name = "bios_in_partial_write", .mode = 0444, }, + .print = pool_stats_print_bios_in_partial_write, +}; + +/* Number of bios tagged with REQ_PREFLUSH and containing no data */ +static ssize_t +pool_stats_print_bios_in_partial_empty_flush(struct vdo_statistics *stats, char *buf) +{ + return sprintf(buf, "%llu\n", stats->bios_in_partial.empty_flush); +} + +static struct pool_stats_attribute pool_stats_attr_bios_in_partial_empty_flush = { + .attr = { .name = "bios_in_partial_empty_flush", .mode = 0444, }, + .print = pool_stats_print_bios_in_partial_empty_flush, +}; + +/* Number of REQ_OP_DISCARD bios */ +static ssize_t +pool_stats_print_bios_in_partial_discard(struct vdo_statistics *stats, char *buf) +{ + return sprintf(buf, "%llu\n", stats->bios_in_partial.discard); +} + +static struct pool_stats_attribute pool_stats_attr_bios_in_partial_discard = { + .attr = { .name = "bios_in_partial_discard", .mode = 0444, }, + .print = pool_stats_print_bios_in_partial_discard, +}; + +/* Number of bios tagged with REQ_PREFLUSH */ +static ssize_t +pool_stats_print_bios_in_partial_flush(struct vdo_statistics *stats, char *buf) +{ + return sprintf(buf, "%llu\n", stats->bios_in_partial.flush); +} + +static struct pool_stats_attribute pool_stats_attr_bios_in_partial_flush = { + .attr = { .name = "bios_in_partial_flush", .mode = 0444, }, + .print = pool_stats_print_bios_in_partial_flush, +}; 
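Editorial aside (not part of the patch): every statistics field in this file is exported through the same two definitions, a pool_stats_print_<field>() formatter plus a read-only struct pool_stats_attribute that names the sysfs file. As a reading aid, the shared pattern can be captured once by a hypothetical macro such as the sketch below; POOL_STATS_U64_ATTR and its parameter names are invented for illustration, and the patch itself deliberately spells each definition out longhand.

#define POOL_STATS_U64_ATTR(name_, field_) \
static ssize_t pool_stats_print_##name_(struct vdo_statistics *stats, char *buf) \
{ \
	return sprintf(buf, "%llu\n", stats->field_); \
} \
static struct pool_stats_attribute pool_stats_attr_##name_ = { \
	.attr = { .name = #name_, .mode = 0444, }, \
	.print = pool_stats_print_##name_, \
}

/* Would expand to the equivalent of the handwritten bios_in_read definitions above. */
POOL_STATS_U64_ATTR(bios_in_read, bios_in.read);

Counters printed with "%u" (for example block_map_free_pages) would need a second variant, which is why this is only a sketch of the pattern and not a proposed change to the patch.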
+ +/* Number of bios tagged with REQ_FUA */ +static ssize_t +pool_stats_print_bios_in_partial_fua(struct vdo_statistics *stats, char *buf) +{ + return sprintf(buf, "%llu\n", stats->bios_in_partial.fua); +} + +static struct pool_stats_attribute pool_stats_attr_bios_in_partial_fua = { + .attr = { .name = "bios_in_partial_fua", .mode = 0444, }, + .print = pool_stats_print_bios_in_partial_fua, +}; + +/* Number of REQ_OP_READ bios */ +static ssize_t +pool_stats_print_bios_out_read(struct vdo_statistics *stats, char *buf) +{ + return sprintf(buf, "%llu\n", stats->bios_out.read); +} + +static struct pool_stats_attribute pool_stats_attr_bios_out_read = { + .attr = { .name = "bios_out_read", .mode = 0444, }, + .print = pool_stats_print_bios_out_read, +}; + +/* Number of REQ_OP_WRITE bios with data */ +static ssize_t +pool_stats_print_bios_out_write(struct vdo_statistics *stats, char *buf) +{ + return sprintf(buf, "%llu\n", stats->bios_out.write); +} + +static struct pool_stats_attribute pool_stats_attr_bios_out_write = { + .attr = { .name = "bios_out_write", .mode = 0444, }, + .print = pool_stats_print_bios_out_write, +}; + +/* Number of bios tagged with REQ_PREFLUSH and containing no data */ +static ssize_t +pool_stats_print_bios_out_empty_flush(struct vdo_statistics *stats, char *buf) +{ + return sprintf(buf, "%llu\n", stats->bios_out.empty_flush); +} + +static struct pool_stats_attribute pool_stats_attr_bios_out_empty_flush = { + .attr = { .name = "bios_out_empty_flush", .mode = 0444, }, + .print = pool_stats_print_bios_out_empty_flush, +}; + +/* Number of REQ_OP_DISCARD bios */ +static ssize_t +pool_stats_print_bios_out_discard(struct vdo_statistics *stats, char *buf) +{ + return sprintf(buf, "%llu\n", stats->bios_out.discard); +} + +static struct pool_stats_attribute pool_stats_attr_bios_out_discard = { + .attr = { .name = "bios_out_discard", .mode = 0444, }, + .print = pool_stats_print_bios_out_discard, +}; + +/* Number of bios tagged with REQ_PREFLUSH */ +static ssize_t +pool_stats_print_bios_out_flush(struct vdo_statistics *stats, char *buf) +{ + return sprintf(buf, "%llu\n", stats->bios_out.flush); +} + +static struct pool_stats_attribute pool_stats_attr_bios_out_flush = { + .attr = { .name = "bios_out_flush", .mode = 0444, }, + .print = pool_stats_print_bios_out_flush, +}; + +/* Number of bios tagged with REQ_FUA */ +static ssize_t +pool_stats_print_bios_out_fua(struct vdo_statistics *stats, char *buf) +{ + return sprintf(buf, "%llu\n", stats->bios_out.fua); +} + +static struct pool_stats_attribute pool_stats_attr_bios_out_fua = { + .attr = { .name = "bios_out_fua", .mode = 0444, }, + .print = pool_stats_print_bios_out_fua, +}; + +/* Number of REQ_OP_READ bios */ +static ssize_t +pool_stats_print_bios_meta_read(struct vdo_statistics *stats, char *buf) +{ + return sprintf(buf, "%llu\n", stats->bios_meta.read); +} + +static struct pool_stats_attribute pool_stats_attr_bios_meta_read = { + .attr = { .name = "bios_meta_read", .mode = 0444, }, + .print = pool_stats_print_bios_meta_read, +}; + +/* Number of REQ_OP_WRITE bios with data */ +static ssize_t +pool_stats_print_bios_meta_write(struct vdo_statistics *stats, char *buf) +{ + return sprintf(buf, "%llu\n", stats->bios_meta.write); +} + +static struct pool_stats_attribute pool_stats_attr_bios_meta_write = { + .attr = { .name = "bios_meta_write", .mode = 0444, }, + .print = pool_stats_print_bios_meta_write, +}; + +/* Number of bios tagged with REQ_PREFLUSH and containing no data */ +static ssize_t 
+pool_stats_print_bios_meta_empty_flush(struct vdo_statistics *stats, char *buf) +{ + return sprintf(buf, "%llu\n", stats->bios_meta.empty_flush); +} + +static struct pool_stats_attribute pool_stats_attr_bios_meta_empty_flush = { + .attr = { .name = "bios_meta_empty_flush", .mode = 0444, }, + .print = pool_stats_print_bios_meta_empty_flush, +}; + +/* Number of REQ_OP_DISCARD bios */ +static ssize_t +pool_stats_print_bios_meta_discard(struct vdo_statistics *stats, char *buf) +{ + return sprintf(buf, "%llu\n", stats->bios_meta.discard); +} + +static struct pool_stats_attribute pool_stats_attr_bios_meta_discard = { + .attr = { .name = "bios_meta_discard", .mode = 0444, }, + .print = pool_stats_print_bios_meta_discard, +}; + +/* Number of bios tagged with REQ_PREFLUSH */ +static ssize_t +pool_stats_print_bios_meta_flush(struct vdo_statistics *stats, char *buf) +{ + return sprintf(buf, "%llu\n", stats->bios_meta.flush); +} + +static struct pool_stats_attribute pool_stats_attr_bios_meta_flush = { + .attr = { .name = "bios_meta_flush", .mode = 0444, }, + .print = pool_stats_print_bios_meta_flush, +}; + +/* Number of bios tagged with REQ_FUA */ +static ssize_t +pool_stats_print_bios_meta_fua(struct vdo_statistics *stats, char *buf) +{ + return sprintf(buf, "%llu\n", stats->bios_meta.fua); +} + +static struct pool_stats_attribute pool_stats_attr_bios_meta_fua = { + .attr = { .name = "bios_meta_fua", .mode = 0444, }, + .print = pool_stats_print_bios_meta_fua, +}; + +/* Number of REQ_OP_READ bios */ +static ssize_t +pool_stats_print_bios_journal_read(struct vdo_statistics *stats, char *buf) +{ + return sprintf(buf, "%llu\n", stats->bios_journal.read); +} + +static struct pool_stats_attribute pool_stats_attr_bios_journal_read = { + .attr = { .name = "bios_journal_read", .mode = 0444, }, + .print = pool_stats_print_bios_journal_read, +}; + +/* Number of REQ_OP_WRITE bios with data */ +static ssize_t +pool_stats_print_bios_journal_write(struct vdo_statistics *stats, char *buf) +{ + return sprintf(buf, "%llu\n", stats->bios_journal.write); +} + +static struct pool_stats_attribute pool_stats_attr_bios_journal_write = { + .attr = { .name = "bios_journal_write", .mode = 0444, }, + .print = pool_stats_print_bios_journal_write, +}; + +/* Number of bios tagged with REQ_PREFLUSH and containing no data */ +static ssize_t +pool_stats_print_bios_journal_empty_flush(struct vdo_statistics *stats, char *buf) +{ + return sprintf(buf, "%llu\n", stats->bios_journal.empty_flush); +} + +static struct pool_stats_attribute pool_stats_attr_bios_journal_empty_flush = { + .attr = { .name = "bios_journal_empty_flush", .mode = 0444, }, + .print = pool_stats_print_bios_journal_empty_flush, +}; + +/* Number of REQ_OP_DISCARD bios */ +static ssize_t +pool_stats_print_bios_journal_discard(struct vdo_statistics *stats, char *buf) +{ + return sprintf(buf, "%llu\n", stats->bios_journal.discard); +} + +static struct pool_stats_attribute pool_stats_attr_bios_journal_discard = { + .attr = { .name = "bios_journal_discard", .mode = 0444, }, + .print = pool_stats_print_bios_journal_discard, +}; + +/* Number of bios tagged with REQ_PREFLUSH */ +static ssize_t +pool_stats_print_bios_journal_flush(struct vdo_statistics *stats, char *buf) +{ + return sprintf(buf, "%llu\n", stats->bios_journal.flush); +} + +static struct pool_stats_attribute pool_stats_attr_bios_journal_flush = { + .attr = { .name = "bios_journal_flush", .mode = 0444, }, + .print = pool_stats_print_bios_journal_flush, +}; + +/* Number of bios tagged with REQ_FUA */ +static 
ssize_t +pool_stats_print_bios_journal_fua(struct vdo_statistics *stats, char *buf) +{ + return sprintf(buf, "%llu\n", stats->bios_journal.fua); +} + +static struct pool_stats_attribute pool_stats_attr_bios_journal_fua = { + .attr = { .name = "bios_journal_fua", .mode = 0444, }, + .print = pool_stats_print_bios_journal_fua, +}; + +/* Number of REQ_OP_READ bios */ +static ssize_t +pool_stats_print_bios_page_cache_read(struct vdo_statistics *stats, char *buf) +{ + return sprintf(buf, "%llu\n", stats->bios_page_cache.read); +} + +static struct pool_stats_attribute pool_stats_attr_bios_page_cache_read = { + .attr = { .name = "bios_page_cache_read", .mode = 0444, }, + .print = pool_stats_print_bios_page_cache_read, +}; + +/* Number of REQ_OP_WRITE bios with data */ +static ssize_t +pool_stats_print_bios_page_cache_write(struct vdo_statistics *stats, char *buf) +{ + return sprintf(buf, "%llu\n", stats->bios_page_cache.write); +} + +static struct pool_stats_attribute pool_stats_attr_bios_page_cache_write = { + .attr = { .name = "bios_page_cache_write", .mode = 0444, }, + .print = pool_stats_print_bios_page_cache_write, +}; + +/* Number of bios tagged with REQ_PREFLUSH and containing no data */ +static ssize_t +pool_stats_print_bios_page_cache_empty_flush(struct vdo_statistics *stats, char *buf) +{ + return sprintf(buf, "%llu\n", stats->bios_page_cache.empty_flush); +} + +static struct pool_stats_attribute pool_stats_attr_bios_page_cache_empty_flush = { + .attr = { .name = "bios_page_cache_empty_flush", .mode = 0444, }, + .print = pool_stats_print_bios_page_cache_empty_flush, +}; + +/* Number of REQ_OP_DISCARD bios */ +static ssize_t +pool_stats_print_bios_page_cache_discard(struct vdo_statistics *stats, char *buf) +{ + return sprintf(buf, "%llu\n", stats->bios_page_cache.discard); +} + +static struct pool_stats_attribute pool_stats_attr_bios_page_cache_discard = { + .attr = { .name = "bios_page_cache_discard", .mode = 0444, }, + .print = pool_stats_print_bios_page_cache_discard, +}; + +/* Number of bios tagged with REQ_PREFLUSH */ +static ssize_t +pool_stats_print_bios_page_cache_flush(struct vdo_statistics *stats, char *buf) +{ + return sprintf(buf, "%llu\n", stats->bios_page_cache.flush); +} + +static struct pool_stats_attribute pool_stats_attr_bios_page_cache_flush = { + .attr = { .name = "bios_page_cache_flush", .mode = 0444, }, + .print = pool_stats_print_bios_page_cache_flush, +}; + +/* Number of bios tagged with REQ_FUA */ +static ssize_t +pool_stats_print_bios_page_cache_fua(struct vdo_statistics *stats, char *buf) +{ + return sprintf(buf, "%llu\n", stats->bios_page_cache.fua); +} + +static struct pool_stats_attribute pool_stats_attr_bios_page_cache_fua = { + .attr = { .name = "bios_page_cache_fua", .mode = 0444, }, + .print = pool_stats_print_bios_page_cache_fua, +}; + +/* Number of REQ_OP_READ bios */ +static ssize_t +pool_stats_print_bios_out_completed_read(struct vdo_statistics *stats, char *buf) +{ + return sprintf(buf, "%llu\n", stats->bios_out_completed.read); +} + +static struct pool_stats_attribute pool_stats_attr_bios_out_completed_read = { + .attr = { .name = "bios_out_completed_read", .mode = 0444, }, + .print = pool_stats_print_bios_out_completed_read, +}; + +/* Number of REQ_OP_WRITE bios with data */ +static ssize_t +pool_stats_print_bios_out_completed_write(struct vdo_statistics *stats, char *buf) +{ + return sprintf(buf, "%llu\n", stats->bios_out_completed.write); +} + +static struct pool_stats_attribute pool_stats_attr_bios_out_completed_write = { + .attr = { .name = 
"bios_out_completed_write", .mode = 0444, }, + .print = pool_stats_print_bios_out_completed_write, +}; + +/* Number of bios tagged with REQ_PREFLUSH and containing no data */ +static ssize_t +pool_stats_print_bios_out_completed_empty_flush(struct vdo_statistics *stats, char *buf) +{ + return sprintf(buf, "%llu\n", stats->bios_out_completed.empty_flush); +} + +static struct pool_stats_attribute pool_stats_attr_bios_out_completed_empty_flush = { + .attr = { .name = "bios_out_completed_empty_flush", .mode = 0444, }, + .print = pool_stats_print_bios_out_completed_empty_flush, +}; + +/* Number of REQ_OP_DISCARD bios */ +static ssize_t +pool_stats_print_bios_out_completed_discard(struct vdo_statistics *stats, char *buf) +{ + return sprintf(buf, "%llu\n", stats->bios_out_completed.discard); +} + +static struct pool_stats_attribute pool_stats_attr_bios_out_completed_discard = { + .attr = { .name = "bios_out_completed_discard", .mode = 0444, }, + .print = pool_stats_print_bios_out_completed_discard, +}; + +/* Number of bios tagged with REQ_PREFLUSH */ +static ssize_t +pool_stats_print_bios_out_completed_flush(struct vdo_statistics *stats, char *buf) +{ + return sprintf(buf, "%llu\n", stats->bios_out_completed.flush); +} + +static struct pool_stats_attribute pool_stats_attr_bios_out_completed_flush = { + .attr = { .name = "bios_out_completed_flush", .mode = 0444, }, + .print = pool_stats_print_bios_out_completed_flush, +}; + +/* Number of bios tagged with REQ_FUA */ +static ssize_t +pool_stats_print_bios_out_completed_fua(struct vdo_statistics *stats, char *buf) +{ + return sprintf(buf, "%llu\n", stats->bios_out_completed.fua); +} + +static struct pool_stats_attribute pool_stats_attr_bios_out_completed_fua = { + .attr = { .name = "bios_out_completed_fua", .mode = 0444, }, + .print = pool_stats_print_bios_out_completed_fua, +}; + +/* Number of REQ_OP_READ bios */ +static ssize_t +pool_stats_print_bios_meta_completed_read(struct vdo_statistics *stats, char *buf) +{ + return sprintf(buf, "%llu\n", stats->bios_meta_completed.read); +} + +static struct pool_stats_attribute pool_stats_attr_bios_meta_completed_read = { + .attr = { .name = "bios_meta_completed_read", .mode = 0444, }, + .print = pool_stats_print_bios_meta_completed_read, +}; + +/* Number of REQ_OP_WRITE bios with data */ +static ssize_t +pool_stats_print_bios_meta_completed_write(struct vdo_statistics *stats, char *buf) +{ + return sprintf(buf, "%llu\n", stats->bios_meta_completed.write); +} + +static struct pool_stats_attribute pool_stats_attr_bios_meta_completed_write = { + .attr = { .name = "bios_meta_completed_write", .mode = 0444, }, + .print = pool_stats_print_bios_meta_completed_write, +}; + +/* Number of bios tagged with REQ_PREFLUSH and containing no data */ +static ssize_t +pool_stats_print_bios_meta_completed_empty_flush(struct vdo_statistics *stats, char *buf) +{ + return sprintf(buf, "%llu\n", stats->bios_meta_completed.empty_flush); +} + +static struct pool_stats_attribute pool_stats_attr_bios_meta_completed_empty_flush = { + .attr = { .name = "bios_meta_completed_empty_flush", .mode = 0444, }, + .print = pool_stats_print_bios_meta_completed_empty_flush, +}; + +/* Number of REQ_OP_DISCARD bios */ +static ssize_t +pool_stats_print_bios_meta_completed_discard(struct vdo_statistics *stats, char *buf) +{ + return sprintf(buf, "%llu\n", stats->bios_meta_completed.discard); +} + +static struct pool_stats_attribute pool_stats_attr_bios_meta_completed_discard = { + .attr = { .name = "bios_meta_completed_discard", .mode = 0444, }, + 
.print = pool_stats_print_bios_meta_completed_discard, +}; + +/* Number of bios tagged with REQ_PREFLUSH */ +static ssize_t +pool_stats_print_bios_meta_completed_flush(struct vdo_statistics *stats, char *buf) +{ + return sprintf(buf, "%llu\n", stats->bios_meta_completed.flush); +} + +static struct pool_stats_attribute pool_stats_attr_bios_meta_completed_flush = { + .attr = { .name = "bios_meta_completed_flush", .mode = 0444, }, + .print = pool_stats_print_bios_meta_completed_flush, +}; + +/* Number of bios tagged with REQ_FUA */ +static ssize_t +pool_stats_print_bios_meta_completed_fua(struct vdo_statistics *stats, char *buf) +{ + return sprintf(buf, "%llu\n", stats->bios_meta_completed.fua); +} + +static struct pool_stats_attribute pool_stats_attr_bios_meta_completed_fua = { + .attr = { .name = "bios_meta_completed_fua", .mode = 0444, }, + .print = pool_stats_print_bios_meta_completed_fua, +}; + +/* Number of REQ_OP_READ bios */ +static ssize_t +pool_stats_print_bios_journal_completed_read(struct vdo_statistics *stats, char *buf) +{ + return sprintf(buf, "%llu\n", stats->bios_journal_completed.read); +} + +static struct pool_stats_attribute pool_stats_attr_bios_journal_completed_read = { + .attr = { .name = "bios_journal_completed_read", .mode = 0444, }, + .print = pool_stats_print_bios_journal_completed_read, +}; + +/* Number of REQ_OP_WRITE bios with data */ +static ssize_t +pool_stats_print_bios_journal_completed_write(struct vdo_statistics *stats, char *buf) +{ + return sprintf(buf, "%llu\n", stats->bios_journal_completed.write); +} + +static struct pool_stats_attribute pool_stats_attr_bios_journal_completed_write = { + .attr = { .name = "bios_journal_completed_write", .mode = 0444, }, + .print = pool_stats_print_bios_journal_completed_write, +}; + +/* Number of bios tagged with REQ_PREFLUSH and containing no data */ +static ssize_t +pool_stats_print_bios_journal_completed_empty_flush(struct vdo_statistics *stats, char *buf) +{ + return sprintf(buf, "%llu\n", stats->bios_journal_completed.empty_flush); +} + +static struct pool_stats_attribute pool_stats_attr_bios_journal_completed_empty_flush = { + .attr = { .name = "bios_journal_completed_empty_flush", .mode = 0444, }, + .print = pool_stats_print_bios_journal_completed_empty_flush, +}; + +/* Number of REQ_OP_DISCARD bios */ +static ssize_t +pool_stats_print_bios_journal_completed_discard(struct vdo_statistics *stats, char *buf) +{ + return sprintf(buf, "%llu\n", stats->bios_journal_completed.discard); +} + +static struct pool_stats_attribute pool_stats_attr_bios_journal_completed_discard = { + .attr = { .name = "bios_journal_completed_discard", .mode = 0444, }, + .print = pool_stats_print_bios_journal_completed_discard, +}; + +/* Number of bios tagged with REQ_PREFLUSH */ +static ssize_t +pool_stats_print_bios_journal_completed_flush(struct vdo_statistics *stats, char *buf) +{ + return sprintf(buf, "%llu\n", stats->bios_journal_completed.flush); +} + +static struct pool_stats_attribute pool_stats_attr_bios_journal_completed_flush = { + .attr = { .name = "bios_journal_completed_flush", .mode = 0444, }, + .print = pool_stats_print_bios_journal_completed_flush, +}; + +/* Number of bios tagged with REQ_FUA */ +static ssize_t +pool_stats_print_bios_journal_completed_fua(struct vdo_statistics *stats, char *buf) +{ + return sprintf(buf, "%llu\n", stats->bios_journal_completed.fua); +} + +static struct pool_stats_attribute pool_stats_attr_bios_journal_completed_fua = { + .attr = { .name = "bios_journal_completed_fua", .mode = 0444, }, + 
.print = pool_stats_print_bios_journal_completed_fua, +}; + +/* Number of REQ_OP_READ bios */ +static ssize_t +pool_stats_print_bios_page_cache_completed_read(struct vdo_statistics *stats, char *buf) +{ + return sprintf(buf, "%llu\n", stats->bios_page_cache_completed.read); +} + +static struct pool_stats_attribute pool_stats_attr_bios_page_cache_completed_read = { + .attr = { .name = "bios_page_cache_completed_read", .mode = 0444, }, + .print = pool_stats_print_bios_page_cache_completed_read, +}; + +/* Number of REQ_OP_WRITE bios with data */ +static ssize_t +pool_stats_print_bios_page_cache_completed_write(struct vdo_statistics *stats, char *buf) +{ + return sprintf(buf, "%llu\n", stats->bios_page_cache_completed.write); +} + +static struct pool_stats_attribute pool_stats_attr_bios_page_cache_completed_write = { + .attr = { .name = "bios_page_cache_completed_write", .mode = 0444, }, + .print = pool_stats_print_bios_page_cache_completed_write, +}; + +/* Number of bios tagged with REQ_PREFLUSH and containing no data */ +static ssize_t +pool_stats_print_bios_page_cache_completed_empty_flush(struct vdo_statistics *stats, char *buf) +{ + return sprintf(buf, "%llu\n", stats->bios_page_cache_completed.empty_flush); +} + +static struct pool_stats_attribute pool_stats_attr_bios_page_cache_completed_empty_flush = { + .attr = { .name = "bios_page_cache_completed_empty_flush", .mode = 0444, }, + .print = pool_stats_print_bios_page_cache_completed_empty_flush, +}; + +/* Number of REQ_OP_DISCARD bios */ +static ssize_t +pool_stats_print_bios_page_cache_completed_discard(struct vdo_statistics *stats, char *buf) +{ + return sprintf(buf, "%llu\n", stats->bios_page_cache_completed.discard); +} + +static struct pool_stats_attribute pool_stats_attr_bios_page_cache_completed_discard = { + .attr = { .name = "bios_page_cache_completed_discard", .mode = 0444, }, + .print = pool_stats_print_bios_page_cache_completed_discard, +}; + +/* Number of bios tagged with REQ_PREFLUSH */ +static ssize_t +pool_stats_print_bios_page_cache_completed_flush(struct vdo_statistics *stats, char *buf) +{ + return sprintf(buf, "%llu\n", stats->bios_page_cache_completed.flush); +} + +static struct pool_stats_attribute pool_stats_attr_bios_page_cache_completed_flush = { + .attr = { .name = "bios_page_cache_completed_flush", .mode = 0444, }, + .print = pool_stats_print_bios_page_cache_completed_flush, +}; + +/* Number of bios tagged with REQ_FUA */ +static ssize_t +pool_stats_print_bios_page_cache_completed_fua(struct vdo_statistics *stats, char *buf) +{ + return sprintf(buf, "%llu\n", stats->bios_page_cache_completed.fua); +} + +static struct pool_stats_attribute pool_stats_attr_bios_page_cache_completed_fua = { + .attr = { .name = "bios_page_cache_completed_fua", .mode = 0444, }, + .print = pool_stats_print_bios_page_cache_completed_fua, +}; + +/* Number of REQ_OP_READ bios */ +static ssize_t +pool_stats_print_bios_acknowledged_read(struct vdo_statistics *stats, char *buf) +{ + return sprintf(buf, "%llu\n", stats->bios_acknowledged.read); +} + +static struct pool_stats_attribute pool_stats_attr_bios_acknowledged_read = { + .attr = { .name = "bios_acknowledged_read", .mode = 0444, }, + .print = pool_stats_print_bios_acknowledged_read, +}; + +/* Number of REQ_OP_WRITE bios with data */ +static ssize_t +pool_stats_print_bios_acknowledged_write(struct vdo_statistics *stats, char *buf) +{ + return sprintf(buf, "%llu\n", stats->bios_acknowledged.write); +} + +static struct pool_stats_attribute pool_stats_attr_bios_acknowledged_write = { + 
.attr = { .name = "bios_acknowledged_write", .mode = 0444, }, + .print = pool_stats_print_bios_acknowledged_write, +}; + +/* Number of bios tagged with REQ_PREFLUSH and containing no data */ +static ssize_t +pool_stats_print_bios_acknowledged_empty_flush(struct vdo_statistics *stats, char *buf) +{ + return sprintf(buf, "%llu\n", stats->bios_acknowledged.empty_flush); +} + +static struct pool_stats_attribute pool_stats_attr_bios_acknowledged_empty_flush = { + .attr = { .name = "bios_acknowledged_empty_flush", .mode = 0444, }, + .print = pool_stats_print_bios_acknowledged_empty_flush, +}; + +/* Number of REQ_OP_DISCARD bios */ +static ssize_t +pool_stats_print_bios_acknowledged_discard(struct vdo_statistics *stats, char *buf) +{ + return sprintf(buf, "%llu\n", stats->bios_acknowledged.discard); +} + +static struct pool_stats_attribute pool_stats_attr_bios_acknowledged_discard = { + .attr = { .name = "bios_acknowledged_discard", .mode = 0444, }, + .print = pool_stats_print_bios_acknowledged_discard, +}; + +/* Number of bios tagged with REQ_PREFLUSH */ +static ssize_t +pool_stats_print_bios_acknowledged_flush(struct vdo_statistics *stats, char *buf) +{ + return sprintf(buf, "%llu\n", stats->bios_acknowledged.flush); +} + +static struct pool_stats_attribute pool_stats_attr_bios_acknowledged_flush = { + .attr = { .name = "bios_acknowledged_flush", .mode = 0444, }, + .print = pool_stats_print_bios_acknowledged_flush, +}; + +/* Number of bios tagged with REQ_FUA */ +static ssize_t +pool_stats_print_bios_acknowledged_fua(struct vdo_statistics *stats, char *buf) +{ + return sprintf(buf, "%llu\n", stats->bios_acknowledged.fua); +} + +static struct pool_stats_attribute pool_stats_attr_bios_acknowledged_fua = { + .attr = { .name = "bios_acknowledged_fua", .mode = 0444, }, + .print = pool_stats_print_bios_acknowledged_fua, +}; + +/* Number of REQ_OP_READ bios */ +static ssize_t +pool_stats_print_bios_acknowledged_partial_read(struct vdo_statistics *stats, char *buf) +{ + return sprintf(buf, "%llu\n", stats->bios_acknowledged_partial.read); +} + +static struct pool_stats_attribute pool_stats_attr_bios_acknowledged_partial_read = { + .attr = { .name = "bios_acknowledged_partial_read", .mode = 0444, }, + .print = pool_stats_print_bios_acknowledged_partial_read, +}; + +/* Number of REQ_OP_WRITE bios with data */ +static ssize_t +pool_stats_print_bios_acknowledged_partial_write(struct vdo_statistics *stats, char *buf) +{ + return sprintf(buf, "%llu\n", stats->bios_acknowledged_partial.write); +} + +static struct pool_stats_attribute pool_stats_attr_bios_acknowledged_partial_write = { + .attr = { .name = "bios_acknowledged_partial_write", .mode = 0444, }, + .print = pool_stats_print_bios_acknowledged_partial_write, +}; + +/* Number of bios tagged with REQ_PREFLUSH and containing no data */ +static ssize_t +pool_stats_print_bios_acknowledged_partial_empty_flush(struct vdo_statistics *stats, char *buf) +{ + return sprintf(buf, "%llu\n", stats->bios_acknowledged_partial.empty_flush); +} + +static struct pool_stats_attribute pool_stats_attr_bios_acknowledged_partial_empty_flush = { + .attr = { .name = "bios_acknowledged_partial_empty_flush", .mode = 0444, }, + .print = pool_stats_print_bios_acknowledged_partial_empty_flush, +}; + +/* Number of REQ_OP_DISCARD bios */ +static ssize_t +pool_stats_print_bios_acknowledged_partial_discard(struct vdo_statistics *stats, char *buf) +{ + return sprintf(buf, "%llu\n", stats->bios_acknowledged_partial.discard); +} + +static struct pool_stats_attribute 
pool_stats_attr_bios_acknowledged_partial_discard = { + .attr = { .name = "bios_acknowledged_partial_discard", .mode = 0444, }, + .print = pool_stats_print_bios_acknowledged_partial_discard, +}; + +/* Number of bios tagged with REQ_PREFLUSH */ +static ssize_t +pool_stats_print_bios_acknowledged_partial_flush(struct vdo_statistics *stats, char *buf) +{ + return sprintf(buf, "%llu\n", stats->bios_acknowledged_partial.flush); +} + +static struct pool_stats_attribute pool_stats_attr_bios_acknowledged_partial_flush = { + .attr = { .name = "bios_acknowledged_partial_flush", .mode = 0444, }, + .print = pool_stats_print_bios_acknowledged_partial_flush, +}; + +/* Number of bios tagged with REQ_FUA */ +static ssize_t +pool_stats_print_bios_acknowledged_partial_fua(struct vdo_statistics *stats, char *buf) +{ + return sprintf(buf, "%llu\n", stats->bios_acknowledged_partial.fua); +} + +static struct pool_stats_attribute pool_stats_attr_bios_acknowledged_partial_fua = { + .attr = { .name = "bios_acknowledged_partial_fua", .mode = 0444, }, + .print = pool_stats_print_bios_acknowledged_partial_fua, +}; + +/* Number of REQ_OP_READ bios */ +static ssize_t +pool_stats_print_bios_in_progress_read(struct vdo_statistics *stats, char *buf) +{ + return sprintf(buf, "%llu\n", stats->bios_in_progress.read); +} + +static struct pool_stats_attribute pool_stats_attr_bios_in_progress_read = { + .attr = { .name = "bios_in_progress_read", .mode = 0444, }, + .print = pool_stats_print_bios_in_progress_read, +}; + +/* Number of REQ_OP_WRITE bios with data */ +static ssize_t +pool_stats_print_bios_in_progress_write(struct vdo_statistics *stats, char *buf) +{ + return sprintf(buf, "%llu\n", stats->bios_in_progress.write); +} + +static struct pool_stats_attribute pool_stats_attr_bios_in_progress_write = { + .attr = { .name = "bios_in_progress_write", .mode = 0444, }, + .print = pool_stats_print_bios_in_progress_write, +}; + +/* Number of bios tagged with REQ_PREFLUSH and containing no data */ +static ssize_t +pool_stats_print_bios_in_progress_empty_flush(struct vdo_statistics *stats, char *buf) +{ + return sprintf(buf, "%llu\n", stats->bios_in_progress.empty_flush); +} + +static struct pool_stats_attribute pool_stats_attr_bios_in_progress_empty_flush = { + .attr = { .name = "bios_in_progress_empty_flush", .mode = 0444, }, + .print = pool_stats_print_bios_in_progress_empty_flush, +}; + +/* Number of REQ_OP_DISCARD bios */ +static ssize_t +pool_stats_print_bios_in_progress_discard(struct vdo_statistics *stats, char *buf) +{ + return sprintf(buf, "%llu\n", stats->bios_in_progress.discard); +} + +static struct pool_stats_attribute pool_stats_attr_bios_in_progress_discard = { + .attr = { .name = "bios_in_progress_discard", .mode = 0444, }, + .print = pool_stats_print_bios_in_progress_discard, +}; + +/* Number of bios tagged with REQ_PREFLUSH */ +static ssize_t +pool_stats_print_bios_in_progress_flush(struct vdo_statistics *stats, char *buf) +{ + return sprintf(buf, "%llu\n", stats->bios_in_progress.flush); +} + +static struct pool_stats_attribute pool_stats_attr_bios_in_progress_flush = { + .attr = { .name = "bios_in_progress_flush", .mode = 0444, }, + .print = pool_stats_print_bios_in_progress_flush, +}; + +/* Number of bios tagged with REQ_FUA */ +static ssize_t +pool_stats_print_bios_in_progress_fua(struct vdo_statistics *stats, char *buf) +{ + return sprintf(buf, "%llu\n", stats->bios_in_progress.fua); +} + +static struct pool_stats_attribute pool_stats_attr_bios_in_progress_fua = { + .attr = { .name = "bios_in_progress_fua", 
.mode = 0444, }, + .print = pool_stats_print_bios_in_progress_fua, +}; + +/* Tracked bytes currently allocated. */ +static ssize_t +pool_stats_print_memory_usage_bytes_used(struct vdo_statistics *stats, char *buf) +{ + return sprintf(buf, "%llu\n", stats->memory_usage.bytes_used); +} + +static struct pool_stats_attribute pool_stats_attr_memory_usage_bytes_used = { + .attr = { .name = "memory_usage_bytes_used", .mode = 0444, }, + .print = pool_stats_print_memory_usage_bytes_used, +}; + +/* Maximum tracked bytes allocated. */ +static ssize_t +pool_stats_print_memory_usage_peak_bytes_used(struct vdo_statistics *stats, char *buf) +{ + return sprintf(buf, "%llu\n", stats->memory_usage.peak_bytes_used); +} + +static struct pool_stats_attribute pool_stats_attr_memory_usage_peak_bytes_used = { + .attr = { .name = "memory_usage_peak_bytes_used", .mode = 0444, }, + .print = pool_stats_print_memory_usage_peak_bytes_used, +}; + +/* Number of records stored in the index */ +static ssize_t +pool_stats_print_index_entries_indexed(struct vdo_statistics *stats, char *buf) +{ + return sprintf(buf, "%llu\n", stats->index.entries_indexed); +} + +static struct pool_stats_attribute pool_stats_attr_index_entries_indexed = { + .attr = { .name = "index_entries_indexed", .mode = 0444, }, + .print = pool_stats_print_index_entries_indexed, +}; + +/* Number of post calls that found an existing entry */ +static ssize_t +pool_stats_print_index_posts_found(struct vdo_statistics *stats, char *buf) +{ + return sprintf(buf, "%llu\n", stats->index.posts_found); +} + +static struct pool_stats_attribute pool_stats_attr_index_posts_found = { + .attr = { .name = "index_posts_found", .mode = 0444, }, + .print = pool_stats_print_index_posts_found, +}; + +/* Number of post calls that added a new entry */ +static ssize_t +pool_stats_print_index_posts_not_found(struct vdo_statistics *stats, char *buf) +{ + return sprintf(buf, "%llu\n", stats->index.posts_not_found); +} + +static struct pool_stats_attribute pool_stats_attr_index_posts_not_found = { + .attr = { .name = "index_posts_not_found", .mode = 0444, }, + .print = pool_stats_print_index_posts_not_found, +}; + +/* Number of query calls that found an existing entry */ +static ssize_t +pool_stats_print_index_queries_found(struct vdo_statistics *stats, char *buf) +{ + return sprintf(buf, "%llu\n", stats->index.queries_found); +} + +static struct pool_stats_attribute pool_stats_attr_index_queries_found = { + .attr = { .name = "index_queries_found", .mode = 0444, }, + .print = pool_stats_print_index_queries_found, +}; + +/* Number of query calls that added a new entry */ +static ssize_t +pool_stats_print_index_queries_not_found(struct vdo_statistics *stats, char *buf) +{ + return sprintf(buf, "%llu\n", stats->index.queries_not_found); +} + +static struct pool_stats_attribute pool_stats_attr_index_queries_not_found = { + .attr = { .name = "index_queries_not_found", .mode = 0444, }, + .print = pool_stats_print_index_queries_not_found, +}; + +/* Number of update calls that found an existing entry */ +static ssize_t +pool_stats_print_index_updates_found(struct vdo_statistics *stats, char *buf) +{ + return sprintf(buf, "%llu\n", stats->index.updates_found); +} + +static struct pool_stats_attribute pool_stats_attr_index_updates_found = { + .attr = { .name = "index_updates_found", .mode = 0444, }, + .print = pool_stats_print_index_updates_found, +}; + +/* Number of update calls that added a new entry */ +static ssize_t +pool_stats_print_index_updates_not_found(struct vdo_statistics *stats, 
char *buf) +{ + return sprintf(buf, "%llu\n", stats->index.updates_not_found); +} + +static struct pool_stats_attribute pool_stats_attr_index_updates_not_found = { + .attr = { .name = "index_updates_not_found", .mode = 0444, }, + .print = pool_stats_print_index_updates_not_found, +}; + +/* Number of entries discarded */ +static ssize_t +pool_stats_print_index_entries_discarded(struct vdo_statistics *stats, char *buf) +{ + return sprintf(buf, "%llu\n", stats->index.entries_discarded); +} + +static struct pool_stats_attribute pool_stats_attr_index_entries_discarded = { + .attr = { .name = "index_entries_discarded", .mode = 0444, }, + .print = pool_stats_print_index_entries_discarded, +}; + +struct attribute *vdo_pool_stats_attrs[] = { + &pool_stats_attr_data_blocks_used.attr, + &pool_stats_attr_overhead_blocks_used.attr, + &pool_stats_attr_logical_blocks_used.attr, + &pool_stats_attr_physical_blocks.attr, + &pool_stats_attr_logical_blocks.attr, + &pool_stats_attr_block_map_cache_size.attr, + &pool_stats_attr_block_size.attr, + &pool_stats_attr_complete_recoveries.attr, + &pool_stats_attr_read_only_recoveries.attr, + &pool_stats_attr_mode.attr, + &pool_stats_attr_in_recovery_mode.attr, + &pool_stats_attr_recovery_percentage.attr, + &pool_stats_attr_packer_compressed_fragments_written.attr, + &pool_stats_attr_packer_compressed_blocks_written.attr, + &pool_stats_attr_packer_compressed_fragments_in_packer.attr, + &pool_stats_attr_allocator_slab_count.attr, + &pool_stats_attr_allocator_slabs_opened.attr, + &pool_stats_attr_allocator_slabs_reopened.attr, + &pool_stats_attr_journal_disk_full.attr, + &pool_stats_attr_journal_slab_journal_commits_requested.attr, + &pool_stats_attr_journal_entries_started.attr, + &pool_stats_attr_journal_entries_written.attr, + &pool_stats_attr_journal_entries_committed.attr, + &pool_stats_attr_journal_blocks_started.attr, + &pool_stats_attr_journal_blocks_written.attr, + &pool_stats_attr_journal_blocks_committed.attr, + &pool_stats_attr_slab_journal_disk_full_count.attr, + &pool_stats_attr_slab_journal_flush_count.attr, + &pool_stats_attr_slab_journal_blocked_count.attr, + &pool_stats_attr_slab_journal_blocks_written.attr, + &pool_stats_attr_slab_journal_tail_busy_count.attr, + &pool_stats_attr_slab_summary_blocks_written.attr, + &pool_stats_attr_ref_counts_blocks_written.attr, + &pool_stats_attr_block_map_dirty_pages.attr, + &pool_stats_attr_block_map_clean_pages.attr, + &pool_stats_attr_block_map_free_pages.attr, + &pool_stats_attr_block_map_failed_pages.attr, + &pool_stats_attr_block_map_incoming_pages.attr, + &pool_stats_attr_block_map_outgoing_pages.attr, + &pool_stats_attr_block_map_cache_pressure.attr, + &pool_stats_attr_block_map_read_count.attr, + &pool_stats_attr_block_map_write_count.attr, + &pool_stats_attr_block_map_failed_reads.attr, + &pool_stats_attr_block_map_failed_writes.attr, + &pool_stats_attr_block_map_reclaimed.attr, + &pool_stats_attr_block_map_read_outgoing.attr, + &pool_stats_attr_block_map_found_in_cache.attr, + &pool_stats_attr_block_map_discard_required.attr, + &pool_stats_attr_block_map_wait_for_page.attr, + &pool_stats_attr_block_map_fetch_required.attr, + &pool_stats_attr_block_map_pages_loaded.attr, + &pool_stats_attr_block_map_pages_saved.attr, + &pool_stats_attr_block_map_flush_count.attr, + &pool_stats_attr_hash_lock_dedupe_advice_valid.attr, + &pool_stats_attr_hash_lock_dedupe_advice_stale.attr, + &pool_stats_attr_hash_lock_concurrent_data_matches.attr, + &pool_stats_attr_hash_lock_concurrent_hash_collisions.attr, + 
&pool_stats_attr_hash_lock_curr_dedupe_queries.attr, + &pool_stats_attr_errors_invalid_advice_pbn_count.attr, + &pool_stats_attr_errors_no_space_error_count.attr, + &pool_stats_attr_errors_read_only_error_count.attr, + &pool_stats_attr_instance.attr, + &pool_stats_attr_current_vios_in_progress.attr, + &pool_stats_attr_max_vios.attr, + &pool_stats_attr_dedupe_advice_timeouts.attr, + &pool_stats_attr_flush_out.attr, + &pool_stats_attr_logical_block_size.attr, + &pool_stats_attr_bios_in_read.attr, + &pool_stats_attr_bios_in_write.attr, + &pool_stats_attr_bios_in_empty_flush.attr, + &pool_stats_attr_bios_in_discard.attr, + &pool_stats_attr_bios_in_flush.attr, + &pool_stats_attr_bios_in_fua.attr, + &pool_stats_attr_bios_in_partial_read.attr, + &pool_stats_attr_bios_in_partial_write.attr, + &pool_stats_attr_bios_in_partial_empty_flush.attr, + &pool_stats_attr_bios_in_partial_discard.attr, + &pool_stats_attr_bios_in_partial_flush.attr, + &pool_stats_attr_bios_in_partial_fua.attr, + &pool_stats_attr_bios_out_read.attr, + &pool_stats_attr_bios_out_write.attr, + &pool_stats_attr_bios_out_empty_flush.attr, + &pool_stats_attr_bios_out_discard.attr, + &pool_stats_attr_bios_out_flush.attr, + &pool_stats_attr_bios_out_fua.attr, + &pool_stats_attr_bios_meta_read.attr, + &pool_stats_attr_bios_meta_write.attr, + &pool_stats_attr_bios_meta_empty_flush.attr, + &pool_stats_attr_bios_meta_discard.attr, + &pool_stats_attr_bios_meta_flush.attr, + &pool_stats_attr_bios_meta_fua.attr, + &pool_stats_attr_bios_journal_read.attr, + &pool_stats_attr_bios_journal_write.attr, + &pool_stats_attr_bios_journal_empty_flush.attr, + &pool_stats_attr_bios_journal_discard.attr, + &pool_stats_attr_bios_journal_flush.attr, + &pool_stats_attr_bios_journal_fua.attr, + &pool_stats_attr_bios_page_cache_read.attr, + &pool_stats_attr_bios_page_cache_write.attr, + &pool_stats_attr_bios_page_cache_empty_flush.attr, + &pool_stats_attr_bios_page_cache_discard.attr, + &pool_stats_attr_bios_page_cache_flush.attr, + &pool_stats_attr_bios_page_cache_fua.attr, + &pool_stats_attr_bios_out_completed_read.attr, + &pool_stats_attr_bios_out_completed_write.attr, + &pool_stats_attr_bios_out_completed_empty_flush.attr, + &pool_stats_attr_bios_out_completed_discard.attr, + &pool_stats_attr_bios_out_completed_flush.attr, + &pool_stats_attr_bios_out_completed_fua.attr, + &pool_stats_attr_bios_meta_completed_read.attr, + &pool_stats_attr_bios_meta_completed_write.attr, + &pool_stats_attr_bios_meta_completed_empty_flush.attr, + &pool_stats_attr_bios_meta_completed_discard.attr, + &pool_stats_attr_bios_meta_completed_flush.attr, + &pool_stats_attr_bios_meta_completed_fua.attr, + &pool_stats_attr_bios_journal_completed_read.attr, + &pool_stats_attr_bios_journal_completed_write.attr, + &pool_stats_attr_bios_journal_completed_empty_flush.attr, + &pool_stats_attr_bios_journal_completed_discard.attr, + &pool_stats_attr_bios_journal_completed_flush.attr, + &pool_stats_attr_bios_journal_completed_fua.attr, + &pool_stats_attr_bios_page_cache_completed_read.attr, + &pool_stats_attr_bios_page_cache_completed_write.attr, + &pool_stats_attr_bios_page_cache_completed_empty_flush.attr, + &pool_stats_attr_bios_page_cache_completed_discard.attr, + &pool_stats_attr_bios_page_cache_completed_flush.attr, + &pool_stats_attr_bios_page_cache_completed_fua.attr, + &pool_stats_attr_bios_acknowledged_read.attr, + &pool_stats_attr_bios_acknowledged_write.attr, + &pool_stats_attr_bios_acknowledged_empty_flush.attr, + &pool_stats_attr_bios_acknowledged_discard.attr, + 
&pool_stats_attr_bios_acknowledged_flush.attr, + &pool_stats_attr_bios_acknowledged_fua.attr, + &pool_stats_attr_bios_acknowledged_partial_read.attr, + &pool_stats_attr_bios_acknowledged_partial_write.attr, + &pool_stats_attr_bios_acknowledged_partial_empty_flush.attr, + &pool_stats_attr_bios_acknowledged_partial_discard.attr, + &pool_stats_attr_bios_acknowledged_partial_flush.attr, + &pool_stats_attr_bios_acknowledged_partial_fua.attr, + &pool_stats_attr_bios_in_progress_read.attr, + &pool_stats_attr_bios_in_progress_write.attr, + &pool_stats_attr_bios_in_progress_empty_flush.attr, + &pool_stats_attr_bios_in_progress_discard.attr, + &pool_stats_attr_bios_in_progress_flush.attr, + &pool_stats_attr_bios_in_progress_fua.attr, + &pool_stats_attr_memory_usage_bytes_used.attr, + &pool_stats_attr_memory_usage_peak_bytes_used.attr, + &pool_stats_attr_index_entries_indexed.attr, + &pool_stats_attr_index_posts_found.attr, + &pool_stats_attr_index_posts_not_found.attr, + &pool_stats_attr_index_queries_found.attr, + &pool_stats_attr_index_queries_not_found.attr, + &pool_stats_attr_index_updates_found.attr, + &pool_stats_attr_index_updates_not_found.attr, + &pool_stats_attr_index_entries_discarded.attr, + NULL, +}; diff --git a/drivers/md/dm-vdo/pool-sysfs.c b/drivers/md/dm-vdo/pool-sysfs.c new file mode 100644 index 000000000000..413a5194586e --- /dev/null +++ b/drivers/md/dm-vdo/pool-sysfs.c @@ -0,0 +1,193 @@ +// SPDX-License-Identifier: GPL-2.0-only +/* + * Copyright 2023 Red Hat + */ + +#include "pool-sysfs.h" + +#include + +#include "memory-alloc.h" +#include "string-utils.h" + +#include "data-vio.h" +#include "dedupe.h" +#include "vdo.h" + +struct pool_attribute { + struct attribute attr; + ssize_t (*show)(struct vdo *vdo, char *buf); + ssize_t (*store)(struct vdo *vdo, const char *value, size_t count); +}; + +static ssize_t vdo_pool_attr_show(struct kobject *directory, struct attribute *attr, char *buf) +{ + struct pool_attribute *pool_attr = container_of(attr, struct pool_attribute, attr); + struct vdo *vdo = container_of(directory, struct vdo, vdo_directory); + + if (pool_attr->show == NULL) + return -EINVAL; + return pool_attr->show(vdo, buf); +} + +static ssize_t vdo_pool_attr_store(struct kobject *directory, + struct attribute *attr, + const char *buf, + size_t length) +{ + struct pool_attribute *pool_attr = container_of(attr, struct pool_attribute, attr); + struct vdo *vdo = container_of(directory, struct vdo, vdo_directory); + + if (pool_attr->store == NULL) + return -EINVAL; + return pool_attr->store(vdo, buf, length); +} + +static const struct sysfs_ops vdo_pool_sysfs_ops = { + .show = vdo_pool_attr_show, + .store = vdo_pool_attr_store, +}; + +static ssize_t pool_compressing_show(struct vdo *vdo, char *buf) +{ + return sprintf(buf, "%s\n", (vdo_get_compressing(vdo) ? 
"1" : "0")); +} + +static ssize_t pool_discards_active_show(struct vdo *vdo, char *buf) +{ + return sprintf(buf, "%u\n", get_data_vio_pool_active_discards(vdo->data_vio_pool)); +} + +static ssize_t pool_discards_limit_show(struct vdo *vdo, char *buf) +{ + return sprintf(buf, "%u\n", get_data_vio_pool_discard_limit(vdo->data_vio_pool)); +} + +static ssize_t pool_discards_limit_store(struct vdo *vdo, const char *buf, size_t length) +{ + unsigned int value; + int result; + + if ((length > 12) || (kstrtouint(buf, 10, &value) < 0) || (value < 1)) + return -EINVAL; + + result = set_data_vio_pool_discard_limit(vdo->data_vio_pool, value); + if (result != VDO_SUCCESS) + return -EINVAL; + + return length; +} + +static ssize_t pool_discards_maximum_show(struct vdo *vdo, char *buf) +{ + return sprintf(buf, "%u\n", get_data_vio_pool_maximum_discards(vdo->data_vio_pool)); +} + +static ssize_t pool_instance_show(struct vdo *vdo, char *buf) +{ + return sprintf(buf, "%u\n", vdo->instance); +} + +static ssize_t pool_requests_active_show(struct vdo *vdo, char *buf) +{ + return sprintf(buf, "%u\n", get_data_vio_pool_active_requests(vdo->data_vio_pool)); +} + +static ssize_t pool_requests_limit_show(struct vdo *vdo, char *buf) +{ + return sprintf(buf, "%u\n", get_data_vio_pool_request_limit(vdo->data_vio_pool)); +} + +static ssize_t pool_requests_maximum_show(struct vdo *vdo, char *buf) +{ + return sprintf(buf, "%u\n", get_data_vio_pool_maximum_requests(vdo->data_vio_pool)); +} + +static void vdo_pool_release(struct kobject *directory) +{ + UDS_FREE(container_of(directory, struct vdo, vdo_directory)); +} + +static struct pool_attribute vdo_pool_compressing_attr = { + .attr = { + .name = "compressing", + .mode = 0444, + }, + .show = pool_compressing_show, +}; + +static struct pool_attribute vdo_pool_discards_active_attr = { + .attr = { + .name = "discards_active", + .mode = 0444, + }, + .show = pool_discards_active_show, +}; + +static struct pool_attribute vdo_pool_discards_limit_attr = { + .attr = { + .name = "discards_limit", + .mode = 0644, + }, + .show = pool_discards_limit_show, + .store = pool_discards_limit_store, +}; + +static struct pool_attribute vdo_pool_discards_maximum_attr = { + .attr = { + .name = "discards_maximum", + .mode = 0444, + }, + .show = pool_discards_maximum_show, +}; + +static struct pool_attribute vdo_pool_instance_attr = { + .attr = { + .name = "instance", + .mode = 0444, + }, + .show = pool_instance_show, +}; + +static struct pool_attribute vdo_pool_requests_active_attr = { + .attr = { + .name = "requests_active", + .mode = 0444, + }, + .show = pool_requests_active_show, +}; + +static struct pool_attribute vdo_pool_requests_limit_attr = { + .attr = { + .name = "requests_limit", + .mode = 0444, + }, + .show = pool_requests_limit_show, +}; + +static struct pool_attribute vdo_pool_requests_maximum_attr = { + .attr = { + .name = "requests_maximum", + .mode = 0444, + }, + .show = pool_requests_maximum_show, +}; + +static struct attribute *pool_attrs[] = { + &vdo_pool_compressing_attr.attr, + &vdo_pool_discards_active_attr.attr, + &vdo_pool_discards_limit_attr.attr, + &vdo_pool_discards_maximum_attr.attr, + &vdo_pool_instance_attr.attr, + &vdo_pool_requests_active_attr.attr, + &vdo_pool_requests_limit_attr.attr, + &vdo_pool_requests_maximum_attr.attr, + NULL, +}; +ATTRIBUTE_GROUPS(pool); + +const struct kobj_type vdo_directory_type = { + .release = vdo_pool_release, + .sysfs_ops = &vdo_pool_sysfs_ops, + .default_groups = pool_groups, +}; diff --git a/drivers/md/dm-vdo/pool-sysfs.h 
b/drivers/md/dm-vdo/pool-sysfs.h new file mode 100644 index 000000000000..00e680924dc1 --- /dev/null +++ b/drivers/md/dm-vdo/pool-sysfs.h @@ -0,0 +1,19 @@ +/* SPDX-License-Identifier: GPL-2.0-only */ +/* + * Copyright 2023 Red Hat + */ + +#ifndef VDO_POOL_SYSFS_H +#define VDO_POOL_SYSFS_H + +#include + +/* The kobj_type used for setting up the kernel layer kobject. */ +extern const struct kobj_type vdo_directory_type; + +/* The sysfs_ops used for the "statistics" subdirectory. */ +extern const struct sysfs_ops vdo_pool_stats_sysfs_ops; +/* The attribute used for the "statistics" subdirectory. */ +extern struct attribute *vdo_pool_stats_attrs[]; + +#endif /* VDO_POOL_SYSFS_H */ diff --git a/drivers/md/dm-vdo/sysfs.c b/drivers/md/dm-vdo/sysfs.c new file mode 100644 index 000000000000..b2995004f236 --- /dev/null +++ b/drivers/md/dm-vdo/sysfs.c @@ -0,0 +1,84 @@ +// SPDX-License-Identifier: GPL-2.0-only +/* + * Copyright 2023 Red Hat + */ + +#include + +#include "logger.h" + +#include "constants.h" +#include "dedupe.h" +#include "vdo.h" + +static int vdo_log_level_show(char *buf, const struct kernel_param *kp) +{ + return sprintf(buf, "%s\n", uds_log_priority_to_string(uds_get_log_level())); +} + +static int vdo_log_level_store(const char *buf, const struct kernel_param *kp) +{ + static char internal_buf[11]; + + int n = strlen(buf); + + if (n > 10) + return -EINVAL; + + memset(internal_buf, '\000', sizeof(internal_buf)); + memcpy(internal_buf, buf, n); + if (internal_buf[n - 1] == '\n') + internal_buf[n - 1] = '\000'; + uds_set_log_level(uds_log_string_to_priority(internal_buf)); + return 0; +} + + +static int vdo_dedupe_timeout_interval_store(const char *buf, const struct kernel_param *kp) +{ + int result = param_set_uint(buf, kp); + + if (result != 0) + return result; + vdo_set_dedupe_index_timeout_interval(*(uint *)kp->arg); + return 0; +} + +static int vdo_min_dedupe_timer_interval_store(const char *buf, const struct kernel_param *kp) +{ + int result = param_set_uint(buf, kp); + + if (result != 0) + return result; + vdo_set_dedupe_index_min_timer_interval(*(uint *)kp->arg); + return 0; +} + +static const struct kernel_param_ops log_level_ops = { + .set = vdo_log_level_store, + .get = vdo_log_level_show, +}; + + +static const struct kernel_param_ops dedupe_timeout_ops = { + .set = vdo_dedupe_timeout_interval_store, + .get = param_get_uint, +}; + +static const struct kernel_param_ops dedupe_timer_ops = { + .set = vdo_min_dedupe_timer_interval_store, + .get = param_get_uint, +}; + +module_param_cb(log_level, &log_level_ops, NULL, 0644); + + +module_param_cb(deduplication_timeout_interval, + &dedupe_timeout_ops, + &vdo_dedupe_index_timeout_interval, + 0644); + +module_param_cb(min_deduplication_timer_interval, + &dedupe_timer_ops, + &vdo_dedupe_index_min_timer_interval, + 0644); From patchwork Thu Oct 26 21:41:34 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Matthew Sakai X-Patchwork-Id: 13437945 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 999B5C25B72 for ; Thu, 26 Oct 2023 21:44:27 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S234816AbjJZVo1 (ORCPT ); Thu, 26 Oct 2023 17:44:27 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:40966 "EHLO lindbergh.monkeyblade.net" 
rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1344880AbjJZVoQ (ORCPT ); Thu, 26 Oct 2023 17:44:16 -0400 Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.133.124]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id E434ED59 for ; Thu, 26 Oct 2023 14:41:47 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1698356507; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=jx5RbrAUF9h/5t6gpZ2RnZh5eCLaU8AGWqukn3niFfk=; b=KEVg12YLnEc65bPn4F7r+YFtqzY96Gd2BzJ93sUD1Iye+tH6QLffcL6655+uJ2Q3SRaXcj jY96M6e59w7dsaHUSOCxK2xbOPLG5UAEuzCeXUZizJQ4fd1FF52AtXsT1JfHfKbPyU3vCk w2wn19/XkS/4wr9DZtmpdX2WaI9SweU= Received: from mimecast-mx02.redhat.com (mimecast-mx02.redhat.com [66.187.233.88]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.3, cipher=TLS_AES_256_GCM_SHA384) id us-mta-182-DPeei2jzM2eFJaWF_yoohQ-1; Thu, 26 Oct 2023 17:41:43 -0400 X-MC-Unique: DPeei2jzM2eFJaWF_yoohQ-1 Received: from smtp.corp.redhat.com (int-mx05.intmail.prod.int.rdu2.redhat.com [10.11.54.5]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (2048 bits) server-digest SHA256) (No client certificate requested) by mimecast-mx02.redhat.com (Postfix) with ESMTPS id 8176B811E7D; Thu, 26 Oct 2023 21:41:43 +0000 (UTC) Received: from pbitcolo-build-10.permabit.com (pbitcolo-build-10.permabit.lab.eng.bos.redhat.com [10.19.117.76]) by smtp.corp.redhat.com (Postfix) with ESMTP id 77EC82949; Thu, 26 Oct 2023 21:41:43 +0000 (UTC) Received: by pbitcolo-build-10.permabit.com (Postfix, from userid 1138) id 3819F3003E; Thu, 26 Oct 2023 17:41:43 -0400 (EDT) From: Matthew Sakai To: dm-devel@lists.linux.dev, linux-block@vger.kernel.org Cc: Matthew Sakai Subject: [PATCH v4 37/39] dm vdo: add debugging support Date: Thu, 26 Oct 2023 17:41:34 -0400 Message-Id: <20231026214136.1067410-38-msakai@redhat.com> In-Reply-To: <20231026214136.1067410-1-msakai@redhat.com> References: <20231026214136.1067410-1-msakai@redhat.com> MIME-Version: 1.0 X-Scanned-By: MIMEDefang 3.4.1 on 10.11.54.5 Precedence: bulk List-ID: X-Mailing-List: linux-block@vger.kernel.org Add support for dumping detailed vdo state to the kernel log via a dmsetup message. The dump code is not thread-safe and is generally intended for use only when the vdo is hung. Co-developed-by: J. corwin Coburn Signed-off-by: J. 
corwin Coburn Co-developed-by: Michael Sclafani Signed-off-by: Michael Sclafani Co-developed-by: Sweet Tea Dorminy Signed-off-by: Sweet Tea Dorminy Co-developed-by: Bruce Johnston Signed-off-by: Bruce Johnston Co-developed-by: Ken Raeburn Signed-off-by: Ken Raeburn Signed-off-by: Matthew Sakai --- drivers/md/dm-vdo/dump.c | 288 +++++++++++++++++++++++++++++++++++++++ drivers/md/dm-vdo/dump.h | 17 +++ 2 files changed, 305 insertions(+) create mode 100644 drivers/md/dm-vdo/dump.c create mode 100644 drivers/md/dm-vdo/dump.h diff --git a/drivers/md/dm-vdo/dump.c b/drivers/md/dm-vdo/dump.c new file mode 100644 index 000000000000..d50db3f7a14d --- /dev/null +++ b/drivers/md/dm-vdo/dump.c @@ -0,0 +1,288 @@ +// SPDX-License-Identifier: GPL-2.0-only +/* + * Copyright 2023 Red Hat + */ + +#include "dump.h" + +#include + +#include "memory-alloc.h" +#include "string-utils.h" + +#include "constants.h" +#include "data-vio.h" +#include "dedupe.h" +#include "io-submitter.h" +#include "logger.h" +#include "types.h" +#include "vdo.h" +#include "work-queue.h" + +enum dump_options { + /* Work queues */ + SHOW_QUEUES, + /* Memory pools */ + SHOW_VIO_POOL, + /* Others */ + SHOW_VDO_STATUS, + /* This one means an option overrides the "default" choices, instead of altering them. */ + SKIP_DEFAULT +}; + +enum dump_option_flags { + /* Work queues */ + FLAG_SHOW_QUEUES = (1 << SHOW_QUEUES), + /* Memory pools */ + FLAG_SHOW_VIO_POOL = (1 << SHOW_VIO_POOL), + /* Others */ + FLAG_SHOW_VDO_STATUS = (1 << SHOW_VDO_STATUS), + /* Special */ + FLAG_SKIP_DEFAULT = (1 << SKIP_DEFAULT) +}; + +enum { + FLAGS_ALL_POOLS = (FLAG_SHOW_VIO_POOL), + DEFAULT_DUMP_FLAGS = (FLAG_SHOW_QUEUES | FLAG_SHOW_VDO_STATUS) +}; + +static inline bool is_arg_string(const char *arg, const char *this_option) +{ + /* convention seems to be case-independent options */ + return strncasecmp(arg, this_option, strlen(this_option)) == 0; +} + +static void do_dump(struct vdo *vdo, unsigned int dump_options_requested, const char *why) +{ + u32 active, maximum; + s64 outstanding; + + uds_log_info("%s dump triggered via %s", UDS_LOGGING_MODULE_NAME, why); + active = get_data_vio_pool_active_requests(vdo->data_vio_pool); + maximum = get_data_vio_pool_maximum_requests(vdo->data_vio_pool); + outstanding = (atomic64_read(&vdo->stats.bios_submitted) - + atomic64_read(&vdo->stats.bios_completed)); + uds_log_info("%u device requests outstanding (max %u), %lld bio requests outstanding, device '%s'", + active, + maximum, + outstanding, + vdo_get_device_name(vdo->device_config->owning_target)); + if (((dump_options_requested & FLAG_SHOW_QUEUES) != 0) && (vdo->threads != NULL)) { + thread_id_t id; + + for (id = 0; id < vdo->thread_config.thread_count; id++) + vdo_dump_work_queue(vdo->threads[id].queue); + } + + vdo_dump_hash_zones(vdo->hash_zones); + dump_data_vio_pool(vdo->data_vio_pool, (dump_options_requested & FLAG_SHOW_VIO_POOL) != 0); + if ((dump_options_requested & FLAG_SHOW_VDO_STATUS) != 0) + vdo_dump_status(vdo); + + uds_report_memory_usage(); + uds_log_info("end of %s dump", UDS_LOGGING_MODULE_NAME); +} + +static int +parse_dump_options(unsigned int argc, char *const *argv, unsigned int *dump_options_requested_ptr) +{ + unsigned int dump_options_requested = 0; + + static const struct { + const char *name; + unsigned int flags; + } option_names[] = { + { "viopool", FLAG_SKIP_DEFAULT | FLAG_SHOW_VIO_POOL }, + { "vdo", FLAG_SKIP_DEFAULT | FLAG_SHOW_VDO_STATUS }, + { "pools", FLAG_SKIP_DEFAULT | FLAGS_ALL_POOLS }, + { "queues", FLAG_SKIP_DEFAULT | FLAG_SHOW_QUEUES 
}, + { "threads", FLAG_SKIP_DEFAULT | FLAG_SHOW_QUEUES }, + { "default", FLAG_SKIP_DEFAULT | DEFAULT_DUMP_FLAGS }, + { "all", ~0 }, + }; + + bool options_okay = true; + unsigned int i; + + for (i = 1; i < argc; i++) { + unsigned int j; + + for (j = 0; j < ARRAY_SIZE(option_names); j++) { + if (is_arg_string(argv[i], option_names[j].name)) { + dump_options_requested |= option_names[j].flags; + break; + } + } + if (j == ARRAY_SIZE(option_names)) { + uds_log_warning("dump option name '%s' unknown", argv[i]); + options_okay = false; + } + } + if (!options_okay) + return -EINVAL; + if ((dump_options_requested & FLAG_SKIP_DEFAULT) == 0) + dump_options_requested |= DEFAULT_DUMP_FLAGS; + *dump_options_requested_ptr = dump_options_requested; + return 0; +} + +/* Dump as specified by zero or more string arguments. */ +int vdo_dump(struct vdo *vdo, unsigned int argc, char *const *argv, const char *why) +{ + unsigned int dump_options_requested = 0; + int result = parse_dump_options(argc, argv, &dump_options_requested); + + if (result != 0) + return result; + + do_dump(vdo, dump_options_requested, why); + return 0; +} + +/* Dump everything we know how to dump */ +void vdo_dump_all(struct vdo *vdo, const char *why) +{ + do_dump(vdo, ~0, why); +} + +/* + * Dump out the data_vio waiters on a wait queue. + * wait_on should be the label to print for queue (e.g. logical or physical) + */ +static void dump_vio_waiters(struct wait_queue *queue, char *wait_on) +{ + struct waiter *waiter, *first = vdo_get_first_waiter(queue); + struct data_vio *data_vio; + + if (first == NULL) + return; + + data_vio = waiter_as_data_vio(first); + + uds_log_info(" %s is locked. Waited on by: vio %px pbn %llu lbn %llu d-pbn %llu lastOp %s", + wait_on, + data_vio, + data_vio->allocation.pbn, + data_vio->logical.lbn, + data_vio->duplicate.pbn, + get_data_vio_operation_name(data_vio)); + + for (waiter = first->next_waiter; waiter != first; waiter = waiter->next_waiter) { + data_vio = waiter_as_data_vio(waiter); + uds_log_info(" ... and : vio %px pbn %llu lbn %llu d-pbn %llu lastOp %s", + data_vio, + data_vio->allocation.pbn, + data_vio->logical.lbn, + data_vio->duplicate.pbn, + get_data_vio_operation_name(data_vio)); + } +} + +/* + * Encode various attributes of a data_vio as a string of one-character flags. This encoding is for + * logging brevity: + * + * R => vio completion result not VDO_SUCCESS + * W => vio is on a wait queue + * D => vio is a duplicate + * p => vio is a partial block operation + * z => vio is a zero block + * d => vio is a discard + * + * The common case of no flags set will result in an empty, null-terminated buffer. If any flags + * are encoded, the first character in the string will be a space character. + */ +static void encode_vio_dump_flags(struct data_vio *data_vio, char buffer[8]) +{ + char *p_flag = buffer; + *p_flag++ = ' '; + if (data_vio->vio.completion.result != VDO_SUCCESS) + *p_flag++ = 'R'; + if (data_vio->waiter.next_waiter != NULL) + *p_flag++ = 'W'; + if (data_vio->is_duplicate) + *p_flag++ = 'D'; + if (data_vio->is_partial) + *p_flag++ = 'p'; + if (data_vio->is_zero) + *p_flag++ = 'z'; + if (data_vio->remaining_discard > 0) + *p_flag++ = 'd'; + if (p_flag == &buffer[1]) + /* No flags, so remove the blank space. */ + p_flag = buffer; + *p_flag = '\0'; +} + +/* Implements buffer_dump_function. 
*/ +void dump_data_vio(void *data) +{ + struct data_vio *data_vio = (struct data_vio *) data; + + /* + * This just needs to be big enough to hold a queue (thread) name and a function name (plus + * a separator character and NUL). The latter is limited only by taste. + * + * In making this static, we're assuming only one "dump" will run at a time. If more than + * one does run, the log output will be garbled anyway. + */ + static char vio_completion_dump_buffer[100 + MAX_VDO_WORK_QUEUE_NAME_LEN]; + /* Another static buffer... log10(256) = 2.408+, round up: */ + enum { DIGITS_PER_U64 = 1 + sizeof(u64) * 2409 / 1000 }; + + static char vio_block_number_dump_buffer[sizeof("P L D") + 3 * DIGITS_PER_U64]; + static char vio_flush_generation_buffer[sizeof(" FG") + DIGITS_PER_U64]; + static char flags_dump_buffer[8]; + + /* + * We're likely to be logging a couple thousand of these lines, and in some circumstances + * syslogd may have trouble keeping up, so keep it BRIEF rather than user-friendly. + */ + vdo_dump_completion_to_buffer(&data_vio->vio.completion, + vio_completion_dump_buffer, + sizeof(vio_completion_dump_buffer)); + if (data_vio->is_duplicate) + snprintf(vio_block_number_dump_buffer, + sizeof(vio_block_number_dump_buffer), + "P%llu L%llu D%llu", + data_vio->allocation.pbn, + data_vio->logical.lbn, + data_vio->duplicate.pbn); + else if (data_vio_has_allocation(data_vio)) + snprintf(vio_block_number_dump_buffer, + sizeof(vio_block_number_dump_buffer), + "P%llu L%llu", + data_vio->allocation.pbn, + data_vio->logical.lbn); + else + snprintf(vio_block_number_dump_buffer, + sizeof(vio_block_number_dump_buffer), + "L%llu", + data_vio->logical.lbn); + + if (data_vio->flush_generation != 0) + snprintf(vio_flush_generation_buffer, + sizeof(vio_flush_generation_buffer), + " FG%llu", + data_vio->flush_generation); + else + vio_flush_generation_buffer[0] = 0; + + encode_vio_dump_flags(data_vio, flags_dump_buffer); + + uds_log_info(" vio %px %s%s %s %s%s", + data_vio, + vio_block_number_dump_buffer, + vio_flush_generation_buffer, + get_data_vio_operation_name(data_vio), + vio_completion_dump_buffer, + flags_dump_buffer); + /* + * might want info on: wantUDSAnswer / operation / status + * might want info on: bio / bios_merged + */ + + dump_vio_waiters(&data_vio->logical.waiters, "lbn"); + + /* might want to dump more info from vio here */ +} diff --git a/drivers/md/dm-vdo/dump.h b/drivers/md/dm-vdo/dump.h new file mode 100644 index 000000000000..ad47c70cca78 --- /dev/null +++ b/drivers/md/dm-vdo/dump.h @@ -0,0 +1,17 @@ +/* SPDX-License-Identifier: GPL-2.0-only */ +/* + * Copyright 2023 Red Hat + */ + +#ifndef VDO_DUMP_H +#define VDO_DUMP_H + +#include "types.h" + +int vdo_dump(struct vdo *vdo, unsigned int argc, char *const *argv, const char *why); + +void vdo_dump_all(struct vdo *vdo, const char *why); + +void dump_data_vio(void *data); + +#endif /* VDO_DUMP_H */ From patchwork Thu Oct 26 21:41:35 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Matthew Sakai X-Patchwork-Id: 13437963 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 38D40C27C4C for ; Thu, 26 Oct 2023 21:44:40 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1344905AbjJZVoj (ORCPT ); Thu, 26 Oct 2023 17:44:39 -0400 Received: from 
lindbergh.monkeyblade.net ([23.128.96.19]:41018 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S234971AbjJZVoZ (ORCPT ); Thu, 26 Oct 2023 17:44:25 -0400 Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.129.124]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id AB501D62 for ; Thu, 26 Oct 2023 14:41:52 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1698356511; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=w0ItEexfKWLjsFE1C7+/oS4i4s8ZX4I48DggcthiV0A=; b=Ua6TCZ6Y3+P2ne9PUKSkpCpX/ThZcg0JdkV98PPFk74hbnhZbykWItzROawS/vEyRwrsFY VCxvp3SoU4bPzXqZj3xgSCCLs3e/qRS3QzbZdwcfn2P7w/GKxwkSYrLMoaEhZ4bAU1itVP 9eUVECI2lvjfCAx0TVa6gaa2T69maQw= Received: from mimecast-mx02.redhat.com (mx-ext.redhat.com [66.187.233.73]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.3, cipher=TLS_AES_256_GCM_SHA384) id us-mta-43-lvqGQYocNiOxdi6fG9kIZw-1; Thu, 26 Oct 2023 17:41:44 -0400 X-MC-Unique: lvqGQYocNiOxdi6fG9kIZw-1 Received: from smtp.corp.redhat.com (int-mx06.intmail.prod.int.rdu2.redhat.com [10.11.54.6]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (2048 bits) server-digest SHA256) (No client certificate requested) by mimecast-mx02.redhat.com (Postfix) with ESMTPS id CEAB628237CC; Thu, 26 Oct 2023 21:41:43 +0000 (UTC) Received: from pbitcolo-build-10.permabit.com (pbitcolo-build-10.permabit.lab.eng.bos.redhat.com [10.19.117.76]) by smtp.corp.redhat.com (Postfix) with ESMTP id ABFE82166B26; Thu, 26 Oct 2023 21:41:43 +0000 (UTC) Received: by pbitcolo-build-10.permabit.com (Postfix, from userid 1138) id 623C63003D; Thu, 26 Oct 2023 17:41:43 -0400 (EDT) From: Matthew Sakai To: dm-devel@lists.linux.dev, linux-block@vger.kernel.org Cc: Matthew Sakai Subject: [PATCH v4 38/39] dm vdo: add the top-level DM target Date: Thu, 26 Oct 2023 17:41:35 -0400 Message-Id: <20231026214136.1067410-39-msakai@redhat.com> In-Reply-To: <20231026214136.1067410-1-msakai@redhat.com> References: <20231026214136.1067410-1-msakai@redhat.com> MIME-Version: 1.0 X-Scanned-By: MIMEDefang 3.4.1 on 10.11.54.6 Precedence: bulk List-ID: X-Mailing-List: linux-block@vger.kernel.org This adds the dm-vdo target. The dm-vdo target provides inline deduplication, compression, and zero-block elimination, allowing applications to consume less actual storage than a normal target. By layering it with other device mapper targets, it can add these features to any storage stack. It can also provide a common deduplication pool for groups of targets. The vdo target does not protect against data corruption, relying instead on integrity protection of the storage below it. Co-developed-by: J. corwin Coburn Signed-off-by: J. 
corwin Coburn Co-developed-by: Michael Sclafani Signed-off-by: Michael Sclafani Co-developed-by: Sweet Tea Dorminy Signed-off-by: Sweet Tea Dorminy Co-developed-by: Bruce Johnston Signed-off-by: Bruce Johnston Co-developed-by: Ken Raeburn Signed-off-by: Ken Raeburn Signed-off-by: Matthew Sakai --- drivers/md/dm-vdo-target.c | 2982 ++++++++++++++++++++++++++++++++++++ 1 file changed, 2982 insertions(+) create mode 100644 drivers/md/dm-vdo-target.c diff --git a/drivers/md/dm-vdo-target.c b/drivers/md/dm-vdo-target.c new file mode 100644 index 000000000000..4415e9bd31b4 --- /dev/null +++ b/drivers/md/dm-vdo-target.c @@ -0,0 +1,2982 @@ +// SPDX-License-Identifier: GPL-2.0-only +/* + * Copyright 2023 Red Hat + */ + +#include +#include +#include +#include +#include +#include +#include +#include + +#include "dm-vdo/admin-state.h" +#include "dm-vdo/block-map.h" +#include "dm-vdo/completion.h" +#include "dm-vdo/constants.h" +#include "dm-vdo/data-vio.h" +#include "dm-vdo/dedupe.h" +#include "dm-vdo/dump.h" +#include "dm-vdo/encodings.h" +#include "dm-vdo/errors.h" +#include "dm-vdo/flush.h" +#include "dm-vdo/io-submitter.h" +#include "dm-vdo/logger.h" +#include "dm-vdo/memory-alloc.h" +#include "dm-vdo/message-stats.h" +#include "dm-vdo/pool-sysfs.h" +#include "dm-vdo/recovery-journal.h" +#include "dm-vdo/repair.h" +#include "dm-vdo/slab-depot.h" +#include "dm-vdo/status-codes.h" +#include "dm-vdo/string-utils.h" +#include "dm-vdo/thread-device.h" +#include "dm-vdo/thread-registry.h" +#include "dm-vdo/types.h" +#include "dm-vdo/uds-sysfs.h" +#include "dm-vdo/vdo.h" +#include "dm-vdo/vio.h" + +#define CURRENT_VERSION "8.3.0.65" + +enum { + GROW_LOGICAL_PHASE_START, + GROW_LOGICAL_PHASE_GROW_BLOCK_MAP, + GROW_LOGICAL_PHASE_END, + GROW_LOGICAL_PHASE_ERROR, + GROW_PHYSICAL_PHASE_START, + GROW_PHYSICAL_PHASE_COPY_SUMMARY, + GROW_PHYSICAL_PHASE_UPDATE_COMPONENTS, + GROW_PHYSICAL_PHASE_USE_NEW_SLABS, + GROW_PHYSICAL_PHASE_END, + GROW_PHYSICAL_PHASE_ERROR, + LOAD_PHASE_START, + LOAD_PHASE_STATS, + LOAD_PHASE_LOAD_DEPOT, + LOAD_PHASE_MAKE_DIRTY, + LOAD_PHASE_PREPARE_TO_ALLOCATE, + LOAD_PHASE_SCRUB_SLABS, + LOAD_PHASE_DATA_REDUCTION, + LOAD_PHASE_FINISHED, + LOAD_PHASE_DRAIN_JOURNAL, + LOAD_PHASE_WAIT_FOR_READ_ONLY, + PRE_LOAD_PHASE_START, + PRE_LOAD_PHASE_LOAD_COMPONENTS, + PRE_LOAD_PHASE_END, + PREPARE_GROW_PHYSICAL_PHASE_START, + RESUME_PHASE_START, + RESUME_PHASE_ALLOW_READ_ONLY_MODE, + RESUME_PHASE_DEDUPE, + RESUME_PHASE_DEPOT, + RESUME_PHASE_JOURNAL, + RESUME_PHASE_BLOCK_MAP, + RESUME_PHASE_LOGICAL_ZONES, + RESUME_PHASE_PACKER, + RESUME_PHASE_FLUSHER, + RESUME_PHASE_DATA_VIOS, + RESUME_PHASE_END, + SUSPEND_PHASE_START, + SUSPEND_PHASE_PACKER, + SUSPEND_PHASE_DATA_VIOS, + SUSPEND_PHASE_DEDUPE, + SUSPEND_PHASE_FLUSHES, + SUSPEND_PHASE_LOGICAL_ZONES, + SUSPEND_PHASE_BLOCK_MAP, + SUSPEND_PHASE_JOURNAL, + SUSPEND_PHASE_DEPOT, + SUSPEND_PHASE_READ_ONLY_WAIT, + SUSPEND_PHASE_WRITE_SUPER_BLOCK, + SUSPEND_PHASE_END, +}; + +static const char * const ADMIN_PHASE_NAMES[] = { + "GROW_LOGICAL_PHASE_START", + "GROW_LOGICAL_PHASE_GROW_BLOCK_MAP", + "GROW_LOGICAL_PHASE_END", + "GROW_LOGICAL_PHASE_ERROR", + "GROW_PHYSICAL_PHASE_START", + "GROW_PHYSICAL_PHASE_COPY_SUMMARY", + "GROW_PHYSICAL_PHASE_UPDATE_COMPONENTS", + "GROW_PHYSICAL_PHASE_USE_NEW_SLABS", + "GROW_PHYSICAL_PHASE_END", + "GROW_PHYSICAL_PHASE_ERROR", + "LOAD_PHASE_START", + "LOAD_PHASE_STATS", + "LOAD_PHASE_LOAD_DEPOT", + "LOAD_PHASE_MAKE_DIRTY", + "LOAD_PHASE_PREPARE_TO_ALLOCATE", + "LOAD_PHASE_SCRUB_SLABS", + "LOAD_PHASE_DATA_REDUCTION", + 
"LOAD_PHASE_FINISHED", + "LOAD_PHASE_DRAIN_JOURNAL", + "LOAD_PHASE_WAIT_FOR_READ_ONLY", + "PRE_LOAD_PHASE_START", + "PRE_LOAD_PHASE_LOAD_COMPONENTS", + "PRE_LOAD_PHASE_END", + "PREPARE_GROW_PHYSICAL_PHASE_START", + "RESUME_PHASE_START", + "RESUME_PHASE_ALLOW_READ_ONLY_MODE", + "RESUME_PHASE_DEDUPE", + "RESUME_PHASE_DEPOT", + "RESUME_PHASE_JOURNAL", + "RESUME_PHASE_BLOCK_MAP", + "RESUME_PHASE_LOGICAL_ZONES", + "RESUME_PHASE_PACKER", + "RESUME_PHASE_FLUSHER", + "RESUME_PHASE_DATA_VIOS", + "RESUME_PHASE_END", + "SUSPEND_PHASE_START", + "SUSPEND_PHASE_PACKER", + "SUSPEND_PHASE_DATA_VIOS", + "SUSPEND_PHASE_DEDUPE", + "SUSPEND_PHASE_FLUSHES", + "SUSPEND_PHASE_LOGICAL_ZONES", + "SUSPEND_PHASE_BLOCK_MAP", + "SUSPEND_PHASE_JOURNAL", + "SUSPEND_PHASE_DEPOT", + "SUSPEND_PHASE_READ_ONLY_WAIT", + "SUSPEND_PHASE_WRITE_SUPER_BLOCK", + "SUSPEND_PHASE_END", +}; + +enum { + /* If we bump this, update the arrays below */ + TABLE_VERSION = 4, +}; + +/* arrays for handling different table versions */ +static const u8 REQUIRED_ARGC[] = { 10, 12, 9, 7, 6 }; +/* pool name no longer used. only here for verification of older versions */ +static const u8 POOL_NAME_ARG_INDEX[] = { 8, 10, 8 }; + +/* + * Track in-use instance numbers using a flat bit array. + * + * O(n) run time isn't ideal, but if we have 1000 VDO devices in use simultaneously we still only + * need to scan 16 words, so it's not likely to be a big deal compared to other resource usage. + */ + +enum { + /* + * This minimum size for the bit array creates a numbering space of 0-999, which allows + * successive starts of the same volume to have different instance numbers in any + * reasonably-sized test. Changing instances on restart allows vdoMonReport to detect that + * the ephemeral stats have reset to zero. + */ + BIT_COUNT_MINIMUM = 1000, + /** Grow the bit array by this many bits when needed */ + BIT_COUNT_INCREMENT = 100, +}; + +struct instance_tracker { + unsigned int bit_count; + unsigned long *words; + unsigned int count; + unsigned int next; +}; + +static DEFINE_MUTEX(instances_lock); +static struct instance_tracker instances; + +/** + * free_device_config() - Free a device config created by parse_device_config(). + * @config: The config to free. + */ +static void free_device_config(struct device_config *config) +{ + if (config == NULL) + return; + + if (config->owned_device != NULL) + dm_put_device(config->owning_target, config->owned_device); + + UDS_FREE(config->parent_device_name); + UDS_FREE(config->original_string); + + /* Reduce the chance a use-after-free (as in BZ 1669960) happens to work. */ + memset(config, 0, sizeof(*config)); + UDS_FREE(config); +} + +/** + * get_version_number() - Decide the version number from argv. + * + * @argc: The number of table values. + * @argv: The array of table values. + * @error_ptr: A pointer to return a error string in. + * @version_ptr: A pointer to return the version. + * + * Return: VDO_SUCCESS or an error code. + */ +static int get_version_number(int argc, char **argv, char **error_ptr, unsigned int *version_ptr) +{ + /* version, if it exists, is in a form of V */ + if (sscanf(argv[0], "V%u", version_ptr) == 1) { + if (*version_ptr < 1 || *version_ptr > TABLE_VERSION) { + *error_ptr = "Unknown version number detected"; + return VDO_BAD_CONFIGURATION; + } + } else { + /* V0 actually has no version number in the table string */ + *version_ptr = 0; + } + + /* + * V0 and V1 have no optional parameters. There will always be a parameter for thread + * config, even if it's a "." 
to show it's an empty list. + */ + if (*version_ptr <= 1) { + if (argc != REQUIRED_ARGC[*version_ptr]) { + *error_ptr = "Incorrect number of arguments for version"; + return VDO_BAD_CONFIGURATION; + } + } else if (argc < REQUIRED_ARGC[*version_ptr]) { + *error_ptr = "Incorrect number of arguments for version"; + return VDO_BAD_CONFIGURATION; + } + + if (*version_ptr != TABLE_VERSION) { + uds_log_warning("Detected version mismatch between kernel module and tools kernel: %d, tool: %d", + TABLE_VERSION, + *version_ptr); + uds_log_warning("Please consider upgrading management tools to match kernel."); + } + return VDO_SUCCESS; +} + +/* Free a list of non-NULL string pointers, and then the list itself. */ +static void free_string_array(char **string_array) +{ + unsigned int offset; + + for (offset = 0; string_array[offset] != NULL; offset++) + UDS_FREE(string_array[offset]); + UDS_FREE(string_array); +} + +/* + * Split the input string into substrings, separated at occurrences of the indicated character, + * returning a null-terminated list of string pointers. + * + * The string pointers and the pointer array itself should both be freed with UDS_FREE() when no + * longer needed. This can be done with vdo_free_string_array (below) if the pointers in the array + * are not changed. Since the array and copied strings are allocated by this function, it may only + * be used in contexts where allocation is permitted. + * + * Empty substrings are not ignored; that is, returned substrings may be empty strings if the + * separator occurs twice in a row. + */ +static int split_string(const char *string, char separator, char ***substring_array_ptr) +{ + unsigned int current_substring = 0, substring_count = 1; + const char *s; + char **substrings; + int result; + ptrdiff_t length; + + for (s = string; *s != 0; s++) + if (*s == separator) + substring_count++; + + result = UDS_ALLOCATE(substring_count + 1, char *, "string-splitting array", &substrings); + if (result != UDS_SUCCESS) + return result; + + for (s = string; *s != 0; s++) { + if (*s == separator) { + ptrdiff_t length = s - string; + + result = UDS_ALLOCATE(length + 1, + char, + "split string", + &substrings[current_substring]); + if (result != UDS_SUCCESS) { + free_string_array(substrings); + return result; + } + /* + * Trailing NUL is already in place after allocation; deal with the zero or + * more non-NUL bytes in the string. + */ + if (length > 0) + memcpy(substrings[current_substring], string, length); + string = s + 1; + current_substring++; + BUG_ON(current_substring >= substring_count); + } + } + /* Process final string, with no trailing separator. */ + BUG_ON(current_substring != (substring_count - 1)); + length = strlen(string); + + result = UDS_ALLOCATE(length + 1, char, "split string", &substrings[current_substring]); + if (result != UDS_SUCCESS) { + free_string_array(substrings); + return result; + } + memcpy(substrings[current_substring], string, length); + current_substring++; + /* substrings[current_substring] is NULL already */ + *substring_array_ptr = substrings; + return UDS_SUCCESS; +} + +/* + * Join the input substrings into one string, joined with the indicated character, returning a + * string. array_length is a bound on the number of valid elements in substring_array, in case it + * is not NULL-terminated. 
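The splitting behavior spelled out above (empty substrings preserved, result array NULL-terminated) can be illustrated with a small stand-alone sketch. It is not part of the patch; plain malloc()/strndup() stand in for the UDS allocation helpers:

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    /* Toy version of split_string(): same semantics, user-space allocation. */
    static char **toy_split(const char *string, char separator)
    {
        size_t count = 1, i = 0;
        const char *s;
        char **parts;

        for (s = string; *s != '\0'; s++)
            if (*s == separator)
                count++;

        parts = calloc(count + 1, sizeof(*parts)); /* final slot stays NULL */
        for (s = string; ; s++) {
            if ((*s == separator) || (*s == '\0')) {
                parts[i++] = strndup(string, s - string);
                string = s + 1;
                if (*s == '\0')
                    break;
            }
        }
        return parts;
    }

    int main(void)
    {
        /* "a,,b" keeps the empty middle substring, as described above. */
        char **parts = toy_split("a,,b", ',');
        size_t i;

        for (i = 0; parts[i] != NULL; i++)
            printf("[%s]\n", parts[i]);   /* prints [a], [], [b] */
        for (i = 0; parts[i] != NULL; i++)
            free(parts[i]);
        free(parts);
        return 0;
    }

join_strings() is the inverse operation: it rejoins the array with the separator, so a split followed by a join round-trips as long as the array is not reordered.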
+ */ +static int +join_strings(char **substring_array, size_t array_length, char separator, char **string_ptr) +{ + size_t string_length = 0; + size_t i; + int result; + char *output, *current_position; + + for (i = 0; (i < array_length) && (substring_array[i] != NULL); i++) + string_length += strlen(substring_array[i]) + 1; + + result = UDS_ALLOCATE(string_length, char, __func__, &output); + if (result != VDO_SUCCESS) + return result; + + current_position = &output[0]; + + for (i = 0; (i < array_length) && (substring_array[i] != NULL); i++) { + current_position = uds_append_to_buffer(current_position, + output + string_length, + "%s", + substring_array[i]); + *current_position = separator; + current_position++; + } + + /* We output one too many separators; replace the last with a zero byte. */ + if (current_position != output) + *(current_position - 1) = '\0'; + + *string_ptr = output; + return UDS_SUCCESS; +} + +/** + * parse_bool() - Parse a two-valued option into a bool. + * @bool_str: The string value to convert to a bool. + * @true_str: The string value which should be converted to true. + * @false_str: The string value which should be converted to false. + * @bool_ptr: A pointer to return the bool value in. + * + * Return: VDO_SUCCESS or an error if bool_str is neither true_str nor false_str. + */ +static inline int __must_check +parse_bool(const char *bool_str, const char *true_str, const char *false_str, bool *bool_ptr) +{ + bool value = false; + + if (strcmp(bool_str, true_str) == 0) + value = true; + else if (strcmp(bool_str, false_str) == 0) + value = false; + else + return VDO_BAD_CONFIGURATION; + + *bool_ptr = value; + return VDO_SUCCESS; +} + +/** + * process_one_thread_config_spec() - Process one component of a thread parameter configuration + * string and update the configuration data structure. + * @thread_param_type: The type of thread specified. + * @count: The thread count requested. + * @config: The configuration data structure to update. + * + * If the thread count requested is invalid, a message is logged and -EINVAL returned. If the + * thread name is unknown, a message is logged but no error is returned. 
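parse_bool() above only ever maps two exact strings onto a bool; anything else is rejected. A stand-alone restatement of that contract (the numeric return values here are stand-ins for this sketch, not the kernel's codes):

    #include <assert.h>
    #include <stdbool.h>
    #include <string.h>

    enum { EX_SUCCESS = 0, EX_BAD_CONFIGURATION = -1 }; /* stand-in values */

    static int toy_parse_bool(const char *str, const char *true_str,
                              const char *false_str, bool *out)
    {
        if (strcmp(str, true_str) == 0)
            *out = true;
        else if (strcmp(str, false_str) == 0)
            *out = false;
        else
            return EX_BAD_CONFIGURATION;
        return EX_SUCCESS;
    }

    int main(void)
    {
        bool enable_512e, dedupe;

        /* Same pattern as the logical-block-size and on/off option parsing. */
        assert(toy_parse_bool("4096", "512", "4096", &enable_512e) == EX_SUCCESS);
        assert(!enable_512e);
        assert(toy_parse_bool("on", "on", "off", &dedupe) == EX_SUCCESS && dedupe);
        assert(toy_parse_bool("maybe", "on", "off", &dedupe) == EX_BAD_CONFIGURATION);
        return 0;
    }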
+ * + * Return: VDO_SUCCESS or -EINVAL + */ +static int process_one_thread_config_spec(const char *thread_param_type, + unsigned int count, + struct thread_count_config *config) +{ + /* Handle limited thread parameters */ + if (strcmp(thread_param_type, "bioRotationInterval") == 0) { + if (count == 0) { + uds_log_error("thread config string error: 'bioRotationInterval' of at least 1 is required"); + return -EINVAL; + } else if (count > VDO_BIO_ROTATION_INTERVAL_LIMIT) { + uds_log_error("thread config string error: 'bioRotationInterval' cannot be higher than %d", + VDO_BIO_ROTATION_INTERVAL_LIMIT); + return -EINVAL; + } + config->bio_rotation_interval = count; + return VDO_SUCCESS; + } + if (strcmp(thread_param_type, "logical") == 0) { + if (count > MAX_VDO_LOGICAL_ZONES) { + uds_log_error("thread config string error: at most %d 'logical' threads are allowed", + MAX_VDO_LOGICAL_ZONES); + return -EINVAL; + } + config->logical_zones = count; + return VDO_SUCCESS; + } + if (strcmp(thread_param_type, "physical") == 0) { + if (count > MAX_VDO_PHYSICAL_ZONES) { + uds_log_error("thread config string error: at most %d 'physical' threads are allowed", + MAX_VDO_PHYSICAL_ZONES); + return -EINVAL; + } + config->physical_zones = count; + return VDO_SUCCESS; + } + /* Handle other thread count parameters */ + if (count > MAXIMUM_VDO_THREADS) { + uds_log_error("thread config string error: at most %d '%s' threads are allowed", + MAXIMUM_VDO_THREADS, + thread_param_type); + return -EINVAL; + } + if (strcmp(thread_param_type, "hash") == 0) { + config->hash_zones = count; + return VDO_SUCCESS; + } + if (strcmp(thread_param_type, "cpu") == 0) { + if (count == 0) { + uds_log_error("thread config string error: at least one 'cpu' thread required"); + return -EINVAL; + } + config->cpu_threads = count; + return VDO_SUCCESS; + } + if (strcmp(thread_param_type, "ack") == 0) { + config->bio_ack_threads = count; + return VDO_SUCCESS; + } + if (strcmp(thread_param_type, "bio") == 0) { + if (count == 0) { + uds_log_error("thread config string error: at least one 'bio' thread required"); + return -EINVAL; + } + config->bio_threads = count; + return VDO_SUCCESS; + } + + /* + * Don't fail, just log. This will handle version mismatches between user mode tools and + * kernel. + */ + uds_log_info("unknown thread parameter type \"%s\"", thread_param_type); + return VDO_SUCCESS; +} + +/** + * parse_one_thread_config_spec() - Parse one component of a thread parameter configuration string + * and update the configuration data structure. + * @spec: The thread parameter specification string. + * @config: The configuration data to be updated. 
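For quick reference, the parameters recognized by process_one_thread_config_spec() and their checks can be summarized table-style. The numeric limits below are placeholders for this sketch only; the real bounds are the MAX_VDO_*, MAXIMUM_VDO_THREADS, and VDO_BIO_ROTATION_INTERVAL_LIMIT macros from constants.h:

    #include <stdio.h>
    #include <string.h>

    /* Placeholder limits for illustration only; see constants.h. */
    #define EX_MAX_LOGICAL_ZONES   16
    #define EX_MAX_PHYSICAL_ZONES  16
    #define EX_MAX_THREADS         64
    #define EX_MAX_ROTATION        1024

    struct rule { const char *name; unsigned int min; unsigned int max; };

    static const struct rule rules[] = {
        { "bioRotationInterval", 1, EX_MAX_ROTATION },
        { "logical",             0, EX_MAX_LOGICAL_ZONES },
        { "physical",            0, EX_MAX_PHYSICAL_ZONES },
        { "hash",                0, EX_MAX_THREADS },
        { "cpu",                 1, EX_MAX_THREADS },
        { "ack",                 0, EX_MAX_THREADS },
        { "bio",                 1, EX_MAX_THREADS },
    };

    static int check(const char *name, unsigned int count)
    {
        size_t i;

        for (i = 0; i < sizeof(rules) / sizeof(rules[0]); i++) {
            if (strcmp(name, rules[i].name) != 0)
                continue;
            return (count >= rules[i].min && count <= rules[i].max) ? 0 : -1;
        }
        /* Unknown names are logged but tolerated, to allow tool/kernel skew. */
        printf("unknown thread parameter type \"%s\"\n", name);
        return 0;
    }

    int main(void)
    {
        printf("%d %d %d\n", check("cpu", 0), check("logical", 4), check("frob", 2));
        return 0;
    }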
+ */ +static int parse_one_thread_config_spec(const char *spec, struct thread_count_config *config) +{ + unsigned int count; + char **fields; + int result; + + result = split_string(spec, '=', &fields); + if (result != UDS_SUCCESS) + return result; + + if ((fields[0] == NULL) || (fields[1] == NULL) || (fields[2] != NULL)) { + uds_log_error("thread config string error: expected thread parameter assignment, saw \"%s\"", + spec); + free_string_array(fields); + return -EINVAL; + } + + result = kstrtouint(fields[1], 10, &count); + if (result != UDS_SUCCESS) { + uds_log_error("thread config string error: integer value needed, found \"%s\"", + fields[1]); + free_string_array(fields); + return result; + } + + result = process_one_thread_config_spec(fields[0], count, config); + free_string_array(fields); + return result; +} + +/** + * parse_thread_config_string() - Parse the configuration string passed and update the specified + * counts and other parameters of various types of threads to be + * created. + * @string: Thread parameter configuration string. + * @config: The thread configuration data to update. + * + * The configuration string should contain one or more comma-separated specs of the form + * "typename=number"; the supported type names are "cpu", "ack", "bio", "bioRotationInterval", + * "logical", "physical", and "hash". + * + * If an error occurs during parsing of a single key/value pair, we deem it serious enough to stop + * further parsing. + * + * This function can't set the "reason" value the caller wants to pass back, because we'd want to + * format it to say which field was invalid, and we can't allocate the "reason" strings + * dynamically. So if an error occurs, we'll log the details and pass back an error. + * + * Return: VDO_SUCCESS or -EINVAL or -ENOMEM + */ +static int parse_thread_config_string(const char *string, struct thread_count_config *config) +{ + int result = VDO_SUCCESS; + char **specs; + + if (strcmp(".", string) != 0) { + unsigned int i; + + result = split_string(string, ',', &specs); + if (result != UDS_SUCCESS) + return result; + + for (i = 0; specs[i] != NULL; i++) { + result = parse_one_thread_config_spec(specs[i], config); + if (result != VDO_SUCCESS) + break; + } + free_string_array(specs); + } + return result; +} + +/** + * process_one_key_value_pair() - Process one component of an optional parameter string and update + * the configuration data structure. + * @key: The optional parameter key name. + * @value: The optional parameter value. + * @config: The configuration data structure to update. + * + * If the value requested is invalid, a message is logged and -EINVAL returned. If the key is + * unknown, a message is logged but no error is returned. 
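Concretely, a V0/V1 thread configuration string of the kind handled just above might look like this (counts purely illustrative):

    ack=1,bio=4,bioRotationInterval=64,cpu=2,hash=1,logical=3,physical=3

Each comma-separated spec is split on '=' into a name and a count and fed to process_one_thread_config_spec(); the single string "." stands for an empty list and leaves every default in place.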
+ * + * Return: VDO_SUCCESS or -EINVAL + */ +static int +process_one_key_value_pair(const char *key, unsigned int value, struct device_config *config) +{ + /* Non thread optional parameters */ + if (strcmp(key, "maxDiscard") == 0) { + if (value == 0) { + uds_log_error("optional parameter error: at least one max discard block required"); + return -EINVAL; + } + /* Max discard sectors in blkdev_issue_discard is UINT_MAX >> 9 */ + if (value > (UINT_MAX / VDO_BLOCK_SIZE)) { + uds_log_error("optional parameter error: at most %d max discard blocks are allowed", + UINT_MAX / VDO_BLOCK_SIZE); + return -EINVAL; + } + config->max_discard_blocks = value; + return VDO_SUCCESS; + } + /* Handles unknown key names */ + return process_one_thread_config_spec(key, value, &config->thread_counts); +} + +/** + * parse_one_key_value_pair() - Parse one key/value pair and update the configuration data + * structure. + * @key: The optional key name. + * @value: The optional value. + * @config: The configuration data to be updated. + * + * Return: VDO_SUCCESS or error. + */ +static int +parse_one_key_value_pair(const char *key, const char *value, struct device_config *config) +{ + unsigned int count; + int result; + + if (strcmp(key, "deduplication") == 0) + return parse_bool(value, "on", "off", &config->deduplication); + + if (strcmp(key, "compression") == 0) + return parse_bool(value, "on", "off", &config->compression); + + /* The remaining arguments must have integral values. */ + result = kstrtouint(value, 10, &count); + if (result != UDS_SUCCESS) { + uds_log_error("optional config string error: integer value needed, found \"%s\"", + value); + return result; + } + return process_one_key_value_pair(key, count, config); +} + +/** + * parse_key_value_pairs() - Parse all key/value pairs from a list of arguments. + * @argc: The total number of arguments in list. + * @argv: The list of key/value pairs. + * @config: The device configuration data to update. + * + * If an error occurs during parsing of a single key/value pair, we deem it serious enough to stop + * further parsing. + * + * This function can't set the "reason" value the caller wants to pass back, because we'd want to + * format it to say which field was invalid, and we can't allocate the "reason" strings + * dynamically. So if an error occurs, we'll log the details and return the error. + * + * Return: VDO_SUCCESS or error + */ +static int parse_key_value_pairs(int argc, char **argv, struct device_config *config) +{ + int result = VDO_SUCCESS; + + while (argc) { + result = parse_one_key_value_pair(argv[0], argv[1], config); + if (result != VDO_SUCCESS) + break; + + argc -= 2; + argv += 2; + } + + return result; +} + +/** + * parse_optional_arguments() - Parse the configuration string passed in for optional arguments. + * @arg_set: The structure holding the arguments to parse. + * @error_ptr: Pointer to a buffer to hold the error string. + * @config: Pointer to device configuration data to update. + * + * For V0/V1 configurations, there will only be one optional parameter; the thread configuration. + * The configuration string should contain one or more comma-separated specs of the form + * "typename=number"; the supported type names are "cpu", "ack", "bio", "bioRotationInterval", + * "logical", "physical", and "hash". + * + * For V2 configurations and beyond, there could be any number of arguments. They should contain + * one or more key/value pairs separated by a space. 
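For newer table versions the optional arguments are flat, space-separated key/value pairs, consumed two at a time by parse_key_value_pairs(); an odd count is rejected before parsing starts. An illustrative tail (keys taken from the code above, values made up):

    maxDiscard 1500 deduplication on compression off cpu 2 ack 1 bio 4

Thread counts may appear here as well, since unknown keys fall through to process_one_thread_config_spec().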
+ * + * Return: VDO_SUCCESS or error + */ +static int parse_optional_arguments(struct dm_arg_set *arg_set, + char **error_ptr, + struct device_config *config) +{ + int result = VDO_SUCCESS; + + if (config->version == 0 || config->version == 1) { + result = parse_thread_config_string(arg_set->argv[0], &config->thread_counts); + if (result != VDO_SUCCESS) { + *error_ptr = "Invalid thread-count configuration"; + return VDO_BAD_CONFIGURATION; + } + } else { + if ((arg_set->argc % 2) != 0) { + *error_ptr = "Odd number of optional arguments given but they should be pairs"; + return VDO_BAD_CONFIGURATION; + } + result = parse_key_value_pairs(arg_set->argc, arg_set->argv, config); + if (result != VDO_SUCCESS) { + *error_ptr = "Invalid optional argument configuration"; + return VDO_BAD_CONFIGURATION; + } + } + return result; +} + +/** + * handle_parse_error() - Handle a parsing error. + * @config: The config to free. + * @error_ptr: A place to store a constant string about the error. + * @error_str: A constant string to store in error_ptr. + */ +static void handle_parse_error(struct device_config *config, char **error_ptr, char *error_str) +{ + free_device_config(config); + *error_ptr = error_str; +} + +/** + * parse_device_config() - Convert the dmsetup table into a struct device_config. + * @argc: The number of table values. + * @argv: The array of table values. + * @ti: The target structure for this table. + * @config_ptr: A pointer to return the allocated config. + * + * Return: VDO_SUCCESS or an error code. + */ +static int parse_device_config(int argc, + char **argv, + struct dm_target *ti, + struct device_config **config_ptr) +{ + bool enable_512e; + size_t logical_bytes = to_bytes(ti->len); + struct dm_arg_set arg_set; + char **error_ptr = &ti->error; + struct device_config *config = NULL; + int result; + + if ((logical_bytes % VDO_BLOCK_SIZE) != 0) { + handle_parse_error(config, error_ptr, "Logical size must be a multiple of 4096"); + return VDO_BAD_CONFIGURATION; + } + + if (argc == 0) { + handle_parse_error(config, error_ptr, "Incorrect number of arguments"); + return VDO_BAD_CONFIGURATION; + } + + result = UDS_ALLOCATE(1, struct device_config, "device_config", &config); + if (result != VDO_SUCCESS) { + handle_parse_error(config, error_ptr, "Could not allocate config structure"); + return VDO_BAD_CONFIGURATION; + } + + config->owning_target = ti; + config->logical_blocks = logical_bytes / VDO_BLOCK_SIZE; + INIT_LIST_HEAD(&config->config_list); + + /* Save the original string. */ + result = join_strings(argv, argc, ' ', &config->original_string); + if (result != VDO_SUCCESS) { + handle_parse_error(config, error_ptr, "Could not populate string"); + return VDO_BAD_CONFIGURATION; + } + + uds_log_info("table line: %s", config->original_string); + + config->thread_counts = (struct thread_count_config) { + .bio_ack_threads = 1, + .bio_threads = DEFAULT_VDO_BIO_SUBMIT_QUEUE_COUNT, + .bio_rotation_interval = DEFAULT_VDO_BIO_SUBMIT_QUEUE_ROTATE_INTERVAL, + .cpu_threads = 1, + .logical_zones = 0, + .physical_zones = 0, + .hash_zones = 0, + }; + config->max_discard_blocks = 1; + config->deduplication = true; + config->compression = false; + + arg_set.argc = argc; + arg_set.argv = argv; + + result = get_version_number(argc, argv, error_ptr, &config->version); + if (result != VDO_SUCCESS) { + /* get_version_number sets error_ptr itself. */ + handle_parse_error(config, error_ptr, *error_ptr); + return result; + } + /* Move the arg pointer forward only if the argument was there. 
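Putting the pieces together, a hypothetical table line for the current format (TABLE_VERSION 4) carries six positional fields, version tag, backing device, physical block count, logical block size, block map cache size (in 4 KB blocks), and block map era length, followed by any optional key/value pairs. The device name and sizes below are made up for illustration only:

    dmsetup create vdo0 --table \
      "0 41943040 vdo V4 /dev/sdb 10485760 4096 32768 16380 maxDiscard 1500 deduplication on"

The leading "0 41943040 vdo" portion is the generic device-mapper start/length/target prefix; the length is in 512-byte sectors and must describe a whole number of 4096-byte blocks, as checked at the top of parse_device_config().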
*/ + if (config->version >= 1) + dm_shift_arg(&arg_set); + + result = uds_duplicate_string(dm_shift_arg(&arg_set), + "parent device name", + &config->parent_device_name); + if (result != VDO_SUCCESS) { + handle_parse_error(config, error_ptr, "Could not copy parent device name"); + return VDO_BAD_CONFIGURATION; + } + + /* Get the physical blocks, if known. */ + if (config->version >= 1) { + result = kstrtoull(dm_shift_arg(&arg_set), 10, &config->physical_blocks); + if (result != VDO_SUCCESS) { + handle_parse_error(config, error_ptr, "Invalid physical block count"); + return VDO_BAD_CONFIGURATION; + } + } + + /* Get the logical block size and validate */ + result = parse_bool(dm_shift_arg(&arg_set), "512", "4096", &enable_512e); + if (result != VDO_SUCCESS) { + handle_parse_error(config, error_ptr, "Invalid logical block size"); + return VDO_BAD_CONFIGURATION; + } + config->logical_block_size = (enable_512e ? 512 : 4096); + + /* Skip past the two no longer used read cache options. */ + if (config->version <= 1) + dm_consume_args(&arg_set, 2); + + /* Get the page cache size. */ + result = kstrtouint(dm_shift_arg(&arg_set), 10, &config->cache_size); + if (result != VDO_SUCCESS) { + handle_parse_error(config, error_ptr, "Invalid block map page cache size"); + return VDO_BAD_CONFIGURATION; + } + + /* Get the block map era length. */ + result = kstrtouint(dm_shift_arg(&arg_set), 10, &config->block_map_maximum_age); + if (result != VDO_SUCCESS) { + handle_parse_error(config, error_ptr, "Invalid block map maximum age"); + return VDO_BAD_CONFIGURATION; + } + + /* Skip past the no longer used MD RAID5 optimization mode */ + if (config->version <= 2) + dm_consume_args(&arg_set, 1); + + /* Skip past the no longer used write policy setting */ + if (config->version <= 3) + dm_consume_args(&arg_set, 1); + + /* Skip past the no longer used pool name for older table lines */ + if (config->version <= 2) { + /* + * Make sure the enum to get the pool name from argv directly is still in sync with + * the parsing of the table line. + */ + if (&arg_set.argv[0] != &argv[POOL_NAME_ARG_INDEX[config->version]]) { + handle_parse_error(config, error_ptr, "Pool name not in expected location"); + return VDO_BAD_CONFIGURATION; + } + dm_shift_arg(&arg_set); + } + + /* Get the optional arguments and validate. */ + result = parse_optional_arguments(&arg_set, error_ptr, config); + if (result != VDO_SUCCESS) { + /* parse_optional_arguments sets error_ptr itself. */ + handle_parse_error(config, error_ptr, *error_ptr); + return result; + } + + /* + * Logical, physical, and hash zone counts can all be zero; then we get one thread doing + * everything, our older configuration. If any zone count is non-zero, the others must be + * as well. 
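The two sanity checks that follow are easy to restate in isolation. In this sketch MAXIMUM_VDO_USER_VIOS is given a placeholder value, since the real value comes from constants.h:

    #include <stdbool.h>
    #include <stdio.h>

    #define EX_MAXIMUM_VDO_USER_VIOS 2000u  /* placeholder, not the kernel constant */

    int main(void)
    {
        unsigned int logical = 3, physical = 3, hash = 1, cache_size = 15000;

        /* Zone counts must be all zero or all non-zero. */
        bool zones_ok = ((logical == 0) == (physical == 0)) &&
                        ((physical == 0) == (hash == 0));

        /* The cache must hold two full sets of user vios per logical zone. */
        bool cache_ok = cache_size >= 2u * EX_MAXIMUM_VDO_USER_VIOS * logical;

        printf("zones_ok=%d cache_ok=%d\n", zones_ok, cache_ok);  /* 1 1 */
        return 0;
    }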
+ */ + if (((config->thread_counts.logical_zones == 0) != + (config->thread_counts.physical_zones == 0)) || + ((config->thread_counts.physical_zones == 0) != + (config->thread_counts.hash_zones == 0))) { + handle_parse_error(config, + error_ptr, + "Logical, physical, and hash zones counts must all be zero or all non-zero"); + return VDO_BAD_CONFIGURATION; + } + + if (config->cache_size < + (2 * MAXIMUM_VDO_USER_VIOS * config->thread_counts.logical_zones)) { + handle_parse_error(config, + error_ptr, + "Insufficient block map cache for logical zones"); + return VDO_BAD_CONFIGURATION; + } + + result = dm_get_device(ti, + config->parent_device_name, + dm_table_get_mode(ti->table), + &config->owned_device); + if (result != 0) { + uds_log_error("couldn't open device \"%s\": error %d", + config->parent_device_name, + result); + handle_parse_error(config, error_ptr, "Unable to open storage device"); + return VDO_BAD_CONFIGURATION; + } + + if (config->version == 0) { + u64 device_size = i_size_read(config->owned_device->bdev->bd_inode); + + config->physical_blocks = device_size / VDO_BLOCK_SIZE; + } + + *config_ptr = config; + return result; +} + +static struct vdo *get_vdo_for_target(struct dm_target *ti) +{ + return ((struct device_config *) ti->private)->vdo; +} + + +static int vdo_map_bio(struct dm_target *ti, struct bio *bio) +{ + struct vdo *vdo = get_vdo_for_target(ti); + struct vdo_work_queue *current_work_queue; + const struct admin_state_code *code = vdo_get_admin_state_code(&vdo->admin.state); + + ASSERT_LOG_ONLY(code->normal, "vdo should not receive bios while in state %s", code->name); + + /* Count all incoming bios. */ + vdo_count_bios(&vdo->stats.bios_in, bio); + + + /* Handle empty bios. Empty flush bios are not associated with a vio. */ + if ((bio_op(bio) == REQ_OP_FLUSH) || ((bio->bi_opf & REQ_PREFLUSH) != 0)) { + vdo_launch_flush(vdo, bio); + return DM_MAPIO_SUBMITTED; + } + + /* This could deadlock, */ + current_work_queue = vdo_get_current_work_queue(); + BUG_ON((current_work_queue != NULL) && + (vdo == vdo_get_work_queue_owner(current_work_queue)->vdo)); + vdo_launch_bio(vdo->data_vio_pool, bio); + return DM_MAPIO_SUBMITTED; +} + +static void vdo_io_hints(struct dm_target *ti, struct queue_limits *limits) +{ + struct vdo *vdo = get_vdo_for_target(ti); + + limits->logical_block_size = vdo->device_config->logical_block_size; + limits->physical_block_size = VDO_BLOCK_SIZE; + + /* The minimum io size for random io */ + blk_limits_io_min(limits, VDO_BLOCK_SIZE); + /* The optimal io size for streamed/sequential io */ + blk_limits_io_opt(limits, VDO_BLOCK_SIZE); + + /* + * Sets the maximum discard size that will be passed into VDO. This value comes from a + * table line value passed in during dmsetup create. + * + * The value 1024 is the largest usable value on HD systems. A 2048 sector discard on a + * busy HD system takes 31 seconds. We should use a value no higher than 1024, which takes + * 15 to 16 seconds on a busy HD system. + * + * But using large values results in 120 second blocked task warnings in /var/log/kern.log. + * In order to avoid these warnings, we choose to use the smallest reasonable value. See + * VDO-3062 and VDO-3087. + * + * The value is displayed in sysfs, and also used by dm-thin to determine whether to pass + * down discards. The block layer splits large discards on this boundary when this is set. 
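As a sanity check on the sizing above: a 4096-byte vdo block is 8 sectors of 512 bytes, so max_discard_sectors scales linearly with the maxDiscard table value (the default of 1 block gives 8 sectors, a single 4 KB discard). A two-line illustration with a made-up maxDiscard:

    #include <stdio.h>

    int main(void)
    {
        const unsigned int sectors_per_block = 4096 / 512;    /* 8 */
        const unsigned int max_discard_blocks = 1500;         /* example maxDiscard */

        printf("%u sectors, %u bytes\n",
               max_discard_blocks * sectors_per_block,        /* 12000 sectors */
               max_discard_blocks * 4096);                    /* ~5.9 MiB */
        return 0;
    }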
+ */ + limits->max_discard_sectors = + (vdo->device_config->max_discard_blocks * VDO_SECTORS_PER_BLOCK); + + /* + * Force discards to not begin or end with a partial block by stating the granularity is + * 4k. + */ + limits->discard_granularity = VDO_BLOCK_SIZE; +} + +static int vdo_iterate_devices(struct dm_target *ti, iterate_devices_callout_fn fn, void *data) +{ + struct device_config *config = get_vdo_for_target(ti)->device_config; + + return fn(ti, + config->owned_device, + 0, + config->physical_blocks * VDO_SECTORS_PER_BLOCK, + data); +} + +/* + * Status line is: + * + * + */ + +static void vdo_status(struct dm_target *ti, + status_type_t status_type, + unsigned int status_flags, + char *result, + unsigned int maxlen) +{ + struct vdo *vdo = get_vdo_for_target(ti); + struct vdo_statistics *stats; + struct device_config *device_config; + /* N.B.: The DMEMIT macro uses the variables named "sz", "result", "maxlen". */ + int sz = 0; + + switch (status_type) { + case STATUSTYPE_INFO: + /* Report info for dmsetup status */ + mutex_lock(&vdo->stats_mutex); + vdo_fetch_statistics(vdo, &vdo->stats_buffer); + stats = &vdo->stats_buffer; + + DMEMIT("/dev/%pg %s %s %s %s %llu %llu", + vdo_get_backing_device(vdo), + stats->mode, + stats->in_recovery_mode ? "recovering" : "-", + vdo_get_dedupe_index_state_name(vdo->hash_zones), + vdo_get_compressing(vdo) ? "online" : "offline", + stats->data_blocks_used + stats->overhead_blocks_used, + stats->physical_blocks); + mutex_unlock(&vdo->stats_mutex); + break; + + case STATUSTYPE_TABLE: + /* Report the string actually specified in the beginning. */ + device_config = (struct device_config *) ti->private; + DMEMIT("%s", device_config->original_string); + break; + + case STATUSTYPE_IMA: + /* FIXME: We ought to be more detailed here, but this is what thin does. */ + *result = '\0'; + break; + } +} + +static block_count_t __must_check get_underlying_device_block_count(const struct vdo *vdo) +{ + return i_size_read(vdo_get_backing_device(vdo)->bd_inode) / VDO_BLOCK_SIZE; +} + +static int __must_check process_vdo_message_locked(struct vdo *vdo, unsigned int argc, char **argv) +{ + if ((argc == 2) && (strcasecmp(argv[0], "compression") == 0)) { + if (strcasecmp(argv[1], "on") == 0) { + vdo_set_compressing(vdo, true); + return 0; + } + + if (strcasecmp(argv[1], "off") == 0) { + vdo_set_compressing(vdo, false); + return 0; + } + + uds_log_warning("invalid argument '%s' to dmsetup compression message", argv[1]); + return -EINVAL; + } + + uds_log_warning("unrecognized dmsetup message '%s' received", argv[0]); + return -EINVAL; +} + +/* + * If the message is a dump, just do it. Otherwise, check that no other message is being processed, + * and only proceed if so. + * Returns -EBUSY if another message is being processed + */ +static int __must_check process_vdo_message(struct vdo *vdo, unsigned int argc, char **argv) +{ + int result; + + /* + * All messages which may be processed in parallel with other messages should be handled + * here before the atomic check below. Messages which should be exclusive should be + * processed in process_vdo_message_locked(). 
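To make these interfaces concrete: the STATUSTYPE_INFO branch above emits a single line, backing device, mode, recovery flag, index state, compression state, blocks used, physical blocks, and process_vdo_message() is driven through dmsetup message. The lines below are illustrative only (device name, states, and counts are made up):

    # dmsetup status vdo0
    0 41943040 vdo /dev/sdb normal - online online 1048576 10485760

    # dmsetup message vdo0 0 stats
    # dmsetup message vdo0 0 compression off
    # dmsetup message vdo0 0 dump all

The "0" in each message invocation is the sector argument dmsetup requires; the message names themselves ("stats", "compression on|off", "dump", "dump-on-shutdown", "index-*") are the ones dispatched in the code above.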
+ */ + + /* Dump messages should always be processed */ + if (strcasecmp(argv[0], "dump") == 0) + return vdo_dump(vdo, argc, argv, "dmsetup message"); + + if (argc == 1) { + if (strcasecmp(argv[0], "dump-on-shutdown") == 0) { + vdo->dump_on_shutdown = true; + return 0; + } + + /* Index messages should always be processed */ + if ((strcasecmp(argv[0], "index-close") == 0) || + (strcasecmp(argv[0], "index-create") == 0) || + (strcasecmp(argv[0], "index-disable") == 0) || + (strcasecmp(argv[0], "index-enable") == 0)) + return vdo_message_dedupe_index(vdo->hash_zones, argv[0]); + } + + if (atomic_cmpxchg(&vdo->processing_message, 0, 1) != 0) + return -EBUSY; + + result = process_vdo_message_locked(vdo, argc, argv); + + /* Pairs with the implicit barrier in cmpxchg just above */ + smp_wmb(); + atomic_set(&vdo->processing_message, 0); + return result; +} + +static int vdo_message(struct dm_target *ti, + unsigned int argc, + char **argv, + char *result_buffer, + unsigned int maxlen) +{ + struct registered_thread allocating_thread, instance_thread; + struct vdo *vdo; + int result; + + if (argc == 0) { + uds_log_warning("unspecified dmsetup message"); + return -EINVAL; + } + + vdo = get_vdo_for_target(ti); + uds_register_allocating_thread(&allocating_thread, NULL); + uds_register_thread_device_id(&instance_thread, &vdo->instance); + + /* + * Must be done here so we don't map return codes. The code in dm-ioctl expects a 1 for a + * return code to look at the buffer and see if it is full or not. + */ + if ((argc == 1) && (strcasecmp(argv[0], "stats") == 0)) { + vdo_write_stats(vdo, result_buffer, maxlen); + result = 1; + } else { + result = vdo_map_to_system_error(process_vdo_message(vdo, argc, argv)); + } + + uds_unregister_thread_device_id(); + uds_unregister_allocating_thread(); + return result; +} + +static void configure_target_capabilities(struct dm_target *ti) +{ + ti->discards_supported = 1; + ti->flush_supported = true; + ti->num_discard_bios = 1; + ti->num_flush_bios = 1; + + /* + * If this value changes, please make sure to update the value for max_discard_sectors + * accordingly. + */ + BUG_ON(dm_set_target_max_io_len(ti, VDO_SECTORS_PER_BLOCK) != 0); +} + +/* + * Implements vdo_filter_t. + */ +static bool vdo_uses_device(struct vdo *vdo, const void *context) +{ + const struct device_config *config = context; + + return vdo_get_backing_device(vdo)->bd_dev == config->owned_device->bdev->bd_dev; +} + +/** + * get_thread_id_for_phase() - Get the thread id for the current phase of the admin operation in + * progress. + */ +static thread_id_t __must_check +get_thread_id_for_phase(struct vdo *vdo) +{ + switch (vdo->admin.phase) { + case RESUME_PHASE_PACKER: + case RESUME_PHASE_FLUSHER: + case SUSPEND_PHASE_PACKER: + case SUSPEND_PHASE_FLUSHES: + return vdo->thread_config.packer_thread; + + case RESUME_PHASE_DATA_VIOS: + case SUSPEND_PHASE_DATA_VIOS: + return vdo->thread_config.cpu_thread; + + case LOAD_PHASE_DRAIN_JOURNAL: + case RESUME_PHASE_JOURNAL: + case SUSPEND_PHASE_JOURNAL: + return vdo->thread_config.journal_thread; + + default: + return vdo->thread_config.admin_thread; + } +} + +static struct vdo_completion *prepare_admin_completion(struct vdo *vdo, + vdo_action *callback, + vdo_action *error_handler) +{ + struct vdo_completion *completion = &vdo->admin.completion; + + /* + * We can't use vdo_prepare_completion_for_requeue() here because we don't want to reset + * any error in the completion. 
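The busy-flag pattern used by process_vdo_message() above, and again by perform_admin_operation() below, is an ordinary compare-and-swap gate. A user-space sketch with C11 atomics standing in for atomic_cmpxchg()/smp_wmb():

    #include <stdatomic.h>
    #include <stdio.h>

    static atomic_int busy;

    static int run_exclusive(void (*op)(void))
    {
        int expected = 0;

        /* Only one caller wins the 0 -> 1 transition; everyone else backs off. */
        if (!atomic_compare_exchange_strong(&busy, &expected, 1))
            return -1;   /* the kernel code returns -EBUSY here */

        op();

        /* Release the gate; release ordering plays the role of smp_wmb(). */
        atomic_store_explicit(&busy, 0, memory_order_release);
        return 0;
    }

    static void say_hello(void)
    {
        puts("exclusive message running");
    }

    int main(void)
    {
        printf("%d\n", run_exclusive(say_hello));   /* 0 */
        return 0;
    }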
+ */ + completion->callback = callback; + completion->error_handler = error_handler; + completion->callback_thread_id = get_thread_id_for_phase(vdo); + completion->requeue = true; + return completion; +} + +/** + * advance_phase() - Increment the phase of the current admin operation and prepare the admin + * completion to run on the thread for the next phase. + * @vdo: The vdo on which an admin operation is being performed + * + * Return: The current phase + */ +static u32 advance_phase(struct vdo *vdo) +{ + u32 phase = vdo->admin.phase++; + + vdo->admin.completion.callback_thread_id = get_thread_id_for_phase(vdo); + vdo->admin.completion.requeue = true; + return phase; +} + +/* + * Perform an administrative operation (load, suspend, grow logical, or grow physical). This method + * should not be called from vdo threads. + */ +static int perform_admin_operation(struct vdo *vdo, + u32 starting_phase, + vdo_action *callback, + vdo_action *error_handler, + const char *type) +{ + int result; + struct vdo_administrator *admin = &vdo->admin; + + if (atomic_cmpxchg(&admin->busy, 0, 1) != 0) + return uds_log_error_strerror(VDO_COMPONENT_BUSY, + "Can't start %s operation, another operation is already in progress", + type); + + admin->phase = starting_phase; + reinit_completion(&admin->callback_sync); + vdo_reset_completion(&admin->completion); + vdo_launch_completion(prepare_admin_completion(vdo, callback, error_handler)); + + /* + * Using the "interruptible" interface means that Linux will not log a message when we wait + * for more than 120 seconds. + */ + while (wait_for_completion_interruptible(&admin->callback_sync) != 0) + /* However, if we get a signal in a user-mode process, we could spin... */ + fsleep(1000); + + result = admin->completion.result; + /* pairs with implicit barrier in cmpxchg above */ + smp_wmb(); + atomic_set(&admin->busy, 0); + return result; +} + +/* Assert that we are operating on the correct thread for the current phase. */ +static void assert_admin_phase_thread(struct vdo *vdo, const char *what) +{ + ASSERT_LOG_ONLY(vdo_get_callback_thread_id() == get_thread_id_for_phase(vdo), + "%s on correct thread for %s", + what, + ADMIN_PHASE_NAMES[vdo->admin.phase]); +} + +/** + * finish_operation_callback() - Callback to finish an admin operation. + * @completion: The admin_completion. + */ +static void finish_operation_callback(struct vdo_completion *completion) +{ + struct vdo_administrator *admin = &completion->vdo->admin; + + vdo_finish_operation(&admin->state, completion->result); + complete(&admin->callback_sync); +} + +/** + * decode_from_super_block() - Decode the VDO state from the super block and validate that it is + * correct. + * @vdo: The vdo being loaded. + * + * On error from this method, the component states must be destroyed explicitly. If this method + * returns successfully, the component states must not be destroyed. + * + * Return: VDO_SUCCESS or an error. + */ +static int __must_check decode_from_super_block(struct vdo *vdo) +{ + const struct device_config *config = vdo->device_config; + int result; + + result = vdo_decode_component_states(vdo->super_block.buffer, + &vdo->geometry, + &vdo->states); + if (result != VDO_SUCCESS) + return result; + + vdo_set_state(vdo, vdo->states.vdo.state); + vdo->load_state = vdo->states.vdo.state; + + /* + * If the device config specifies a larger logical size than was recorded in the super + * block, just accept it.
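
The admin-operation machinery above is callback-driven: each invocation of the registered callback handles one phase (via advance_phase()), requeues itself on the thread chosen for the next phase, and the final phase signals admin->callback_sync. A heavily simplified, single-threaded sketch of that phase-advancing idiom (names and phases are invented for the example; recursion stands in for requeueing on a work queue):

#include <stdio.h>

enum { PHASE_START, PHASE_SAVE, PHASE_END };

struct operation {
        unsigned int phase;
        int result;
};

static unsigned int advance(struct operation *op)
{
        /* Return the current phase and move on to the next one. */
        return op->phase++;
}

static void operation_callback(struct operation *op)
{
        switch (advance(op)) {
        case PHASE_START:
                printf("start\n");
                operation_callback(op); /* the real code requeues on another thread */
                return;

        case PHASE_SAVE:
                printf("save\n");
                operation_callback(op);
                return;

        case PHASE_END:
        default:
                break;
        }

        printf("finished with result %d\n", op->result);
}

int main(void)
{
        struct operation op = { .phase = PHASE_START, .result = 0 };

        operation_callback(&op);
        return 0;
}
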
+ */ + if (vdo->states.vdo.config.logical_blocks < config->logical_blocks) { + uds_log_warning("Growing logical size: a logical size of %llu blocks was specified, but that differs from the %llu blocks configured in the vdo super block", + (unsigned long long) config->logical_blocks, + (unsigned long long) vdo->states.vdo.config.logical_blocks); + vdo->states.vdo.config.logical_blocks = config->logical_blocks; + } + + result = vdo_validate_component_states(&vdo->states, + vdo->geometry.nonce, + config->physical_blocks, + config->logical_blocks); + if (result != VDO_SUCCESS) + return result; + + vdo->layout = vdo->states.layout; + return VDO_SUCCESS; +} + +/** + * decode_vdo() - Decode the component data portion of a super block and fill in the corresponding + * portions of the vdo being loaded. + * @vdo: The vdo being loaded. + * + * This will also allocate the recovery journal and slab depot. If this method is called with an + * asynchronous layer (i.e. a thread config which specifies at least one base thread), the block + * map and packer will be constructed as well. + * + * Return: VDO_SUCCESS or an error. + */ +static int __must_check decode_vdo(struct vdo *vdo) +{ + block_count_t maximum_age, journal_length; + struct partition *partition; + int result; + + result = decode_from_super_block(vdo); + if (result != VDO_SUCCESS) { + vdo_destroy_component_states(&vdo->states); + return result; + } + + maximum_age = vdo_convert_maximum_age(vdo->device_config->block_map_maximum_age); + journal_length = + vdo_get_recovery_journal_length(vdo->states.vdo.config.recovery_journal_size); + if (maximum_age > (journal_length / 2)) + return uds_log_error_strerror(VDO_BAD_CONFIGURATION, + "maximum age: %llu exceeds limit %llu", + (unsigned long long) maximum_age, + (unsigned long long) (journal_length / 2)); + + if (maximum_age == 0) + return uds_log_error_strerror(VDO_BAD_CONFIGURATION, + "maximum age must be greater than 0"); + + result = vdo_enable_read_only_entry(vdo); + if (result != VDO_SUCCESS) + return result; + + partition = vdo_get_known_partition(&vdo->layout, VDO_RECOVERY_JOURNAL_PARTITION); + result = vdo_decode_recovery_journal(vdo->states.recovery_journal, + vdo->states.vdo.nonce, + vdo, + partition, + vdo->states.vdo.complete_recoveries, + vdo->states.vdo.config.recovery_journal_size, + &vdo->recovery_journal); + if (result != VDO_SUCCESS) + return result; + + partition = vdo_get_known_partition(&vdo->layout, VDO_SLAB_SUMMARY_PARTITION); + result = vdo_decode_slab_depot(vdo->states.slab_depot, + vdo, + partition, + &vdo->depot); + if (result != VDO_SUCCESS) + return result; + + result = vdo_decode_block_map(vdo->states.block_map, + vdo->states.vdo.config.logical_blocks, + vdo, + vdo->recovery_journal, + vdo->states.vdo.nonce, + vdo->device_config->cache_size, + maximum_age, + &vdo->block_map); + if (result != VDO_SUCCESS) + return result; + + result = vdo_make_physical_zones(vdo, &vdo->physical_zones); + if (result != VDO_SUCCESS) + return result; + + /* The logical zones depend on the physical zones already existing. */ + result = vdo_make_logical_zones(vdo, &vdo->logical_zones); + if (result != VDO_SUCCESS) + return result; + + return vdo_make_hash_zones(vdo, &vdo->hash_zones); +} + +/** + * pre_load_callback() - Callback to initiate a pre-load, registered in vdo_initialize(). + * @completion: The admin completion. 
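
As a worked example of the configuration check in decode_vdo(): the block map maximum age must be non-zero and no more than half the recovery journal length. The block counts below are invented, not taken from a real super block:

#include <stdio.h>

int main(void)
{
        /* Example values only; real lengths come from the super block. */
        unsigned long long journal_length = 8192; /* recovery journal blocks */
        unsigned long long maximum_age = 4096;    /* block map period */

        if (maximum_age == 0)
                printf("VDO_BAD_CONFIGURATION: maximum age must be greater than 0\n");
        else if (maximum_age > (journal_length / 2))
                printf("VDO_BAD_CONFIGURATION: maximum age %llu exceeds limit %llu\n",
                       maximum_age, journal_length / 2);
        else
                printf("maximum age %llu is acceptable\n", maximum_age);
        return 0;
}
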
+ */ +static void pre_load_callback(struct vdo_completion *completion) +{ + struct vdo *vdo = completion->vdo; + int result; + + assert_admin_phase_thread(vdo, __func__); + + switch (advance_phase(vdo)) { + case PRE_LOAD_PHASE_START: + result = vdo_start_operation(&vdo->admin.state, VDO_ADMIN_STATE_PRE_LOADING); + if (result != VDO_SUCCESS) { + vdo_continue_completion(completion, result); + return; + } + + vdo_load_super_block(vdo, completion); + return; + + case PRE_LOAD_PHASE_LOAD_COMPONENTS: + vdo_continue_completion(completion, decode_vdo(vdo)); + return; + + case PRE_LOAD_PHASE_END: + break; + + default: + vdo_set_completion_result(completion, UDS_BAD_STATE); + } + + finish_operation_callback(completion); +} + +static void release_instance(unsigned int instance) +{ + mutex_lock(&instances_lock); + if (instance >= instances.bit_count) { + ASSERT_LOG_ONLY(false, + "instance number %u must be less than bit count %u", + instance, + instances.bit_count); + } else if (test_bit(instance, instances.words) == 0) { + ASSERT_LOG_ONLY(false, "instance number %u must be allocated", instance); + } else { + __clear_bit(instance, instances.words); + instances.count -= 1; + } + mutex_unlock(&instances_lock); +} + +static void set_device_config(struct dm_target *ti, struct vdo *vdo, struct device_config *config) +{ + list_del_init(&config->config_list); + list_add_tail(&config->config_list, &vdo->device_config_list); + config->vdo = vdo; + ti->private = config; + configure_target_capabilities(ti); +} + +static int +vdo_initialize(struct dm_target *ti, unsigned int instance, struct device_config *config) +{ + struct vdo *vdo; + int result; + u64 block_size = VDO_BLOCK_SIZE; + u64 logical_size = to_bytes(ti->len); + block_count_t logical_blocks = logical_size / block_size; + + uds_log_info("loading device '%s'", vdo_get_device_name(ti)); + uds_log_debug("Logical block size = %llu", (u64) config->logical_block_size); + uds_log_debug("Logical blocks = %llu", logical_blocks); + uds_log_debug("Physical block size = %llu", (u64) block_size); + uds_log_debug("Physical blocks = %llu", config->physical_blocks); + uds_log_debug("Block map cache blocks = %u", config->cache_size); + uds_log_debug("Block map maximum age = %u", config->block_map_maximum_age); + uds_log_debug("Deduplication = %s", (config->deduplication ? "on" : "off")); + uds_log_debug("Compression = %s", (config->compression ? "on" : "off")); + + vdo = vdo_find_matching(vdo_uses_device, config); + if (vdo != NULL) { + uds_log_error("Existing vdo already uses device %s", + vdo->device_config->parent_device_name); + ti->error = "Cannot share storage device with already-running VDO"; + return VDO_BAD_CONFIGURATION; + } + + result = vdo_make(instance, config, &ti->error, &vdo); + if (result != VDO_SUCCESS) { + uds_log_error("Could not create VDO device. (VDO error %d, message %s)", + result, + ti->error); + vdo_destroy(vdo); + return result; + } + + result = perform_admin_operation(vdo, + PRE_LOAD_PHASE_START, + pre_load_callback, + finish_operation_callback, + "pre-load"); + if (result != VDO_SUCCESS) { + ti->error = ((result == VDO_INVALID_ADMIN_STATE) ? + "Pre-load is only valid immediately after initialization" : + "Cannot load metadata from device"); + uds_log_error("Could not start VDO device. (VDO error %d, message %s)", + result, + ti->error); + vdo_destroy(vdo); + return result; + } + + set_device_config(ti, vdo, config); + vdo->device_config = config; + return VDO_SUCCESS; +} + +/* Implements vdo_filter_t. 
*/ +static bool __must_check vdo_is_named(struct vdo *vdo, const void *context) +{ + struct dm_target *ti = vdo->device_config->owning_target; + const char *device_name = vdo_get_device_name(ti); + + return strcmp(device_name, (const char *) context) == 0; +} + +/** + * get_bit_array_size() - Return the number of bytes needed to store a bit array of the specified + * capacity in an array of unsigned longs. + * @bit_count: The number of bits the array must hold. + * + * Return: the number of bytes needed for the array representation. + */ +static size_t get_bit_array_size(unsigned int bit_count) +{ + /* Round up to a multiple of the word size and convert to a byte count. */ + return (BITS_TO_LONGS(bit_count) * sizeof(unsigned long)); +} + +/** + * grow_bit_array() - Re-allocate the bitmap word array so there will be more instance numbers that + * can be allocated. + * + * Since the array is initially NULL, this also initializes the array the first time we allocate an + * instance number. + * + * Return: UDS_SUCCESS or an error code from the allocation + */ +static int grow_bit_array(void) +{ + unsigned int new_count = + max(instances.bit_count + BIT_COUNT_INCREMENT, (unsigned int) BIT_COUNT_MINIMUM); + unsigned long *new_words; + int result; + + result = uds_reallocate_memory(instances.words, + get_bit_array_size(instances.bit_count), + get_bit_array_size(new_count), + "instance number bit array", + &new_words); + if (result != UDS_SUCCESS) + return result; + + instances.bit_count = new_count; + instances.words = new_words; + return UDS_SUCCESS; +} + +/** + * allocate_instance() - Allocate an instance number. + * @instance_ptr: A pointer to hold the instance number + * + * Return: UDS_SUCCESS or an error code + * + * This function must be called while holding the instances lock. + */ +static int allocate_instance(unsigned int *instance_ptr) +{ + unsigned int instance; + int result; + + /* If there are no unallocated instances, grow the bit array. */ + if (instances.count >= instances.bit_count) { + result = grow_bit_array(); + if (result != UDS_SUCCESS) + return result; + } + + /* + * There must be a zero bit somewhere now. Find it, starting just after the last instance + * allocated. + */ + instance = find_next_zero_bit(instances.words, + instances.bit_count, + instances.next); + if (instance >= instances.bit_count) { + /* Nothing free after next, so wrap around to instance zero.
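
The instance-number tracking above is a growable bitmap: search for a zero bit starting at 'next', wrap around to zero at most once, and enlarge the word array when every bit is set. A self-contained userspace sketch of the same bookkeeping (simplified; the kernel code uses find_next_zero_bit() and uds_reallocate_memory(), and runs under instances_lock):

#include <limits.h>
#include <stdio.h>
#include <stdlib.h>

#define BITS_PER_WORD (CHAR_BIT * sizeof(unsigned long))
#define BIT_COUNT_INCREMENT 64 /* assumed grow step for this example */

static struct {
        unsigned long *words;
        unsigned int bit_count;
        unsigned int count;
        unsigned int next;
} instances;

static int test_bit(unsigned int n)
{
        return (instances.words[n / BITS_PER_WORD] >> (n % BITS_PER_WORD)) & 1;
}

static int allocate_instance(unsigned int *instance_ptr)
{
        unsigned int i, n;

        /* If there are no unallocated instances, grow the bit array. */
        if (instances.count >= instances.bit_count) {
                unsigned int new_count = instances.bit_count + BIT_COUNT_INCREMENT;
                unsigned long *new_words = calloc(new_count / BITS_PER_WORD,
                                                  sizeof(unsigned long));

                if (new_words == NULL)
                        return -1;
                for (i = 0; i < instances.bit_count / BITS_PER_WORD; i++)
                        new_words[i] = instances.words[i];
                free(instances.words);
                instances.words = new_words;
                instances.bit_count = new_count;
        }

        /* Scan from 'next', wrapping around to zero at most once. */
        for (i = 0; i < instances.bit_count; i++) {
                n = (instances.next + i) % instances.bit_count;
                if (!test_bit(n)) {
                        instances.words[n / BITS_PER_WORD] |= 1UL << (n % BITS_PER_WORD);
                        instances.count++;
                        instances.next = n + 1;
                        *instance_ptr = n;
                        return 0;
                }
        }
        return -1; /* unreachable after a successful grow */
}

int main(void)
{
        unsigned int first, second;

        allocate_instance(&first);
        allocate_instance(&second);
        printf("allocated instances %u and %u\n", first, second);
        return 0;
}
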
*/ + instance = find_first_zero_bit(instances.words, instances.bit_count); + result = ASSERT(instance < instances.bit_count, "impossibly, no zero bit found"); + if (result != UDS_SUCCESS) + return result; + } + + __set_bit(instance, instances.words); + instances.count++; + instances.next = instance + 1; + *instance_ptr = instance; + return UDS_SUCCESS; +} + +static int construct_new_vdo_registered(struct dm_target *ti, + unsigned int argc, + char **argv, + unsigned int instance) +{ + int result; + struct device_config *config; + + result = parse_device_config(argc, argv, ti, &config); + if (result != VDO_SUCCESS) { + uds_log_error_strerror(result, "parsing failed: %s", ti->error); + release_instance(instance); + return -EINVAL; + } + + /* Beyond this point, the instance number will be cleaned up for us if needed */ + result = vdo_initialize(ti, instance, config); + if (result != VDO_SUCCESS) { + release_instance(instance); + free_device_config(config); + return vdo_map_to_system_error(result); + } + + return VDO_SUCCESS; +} + +static int construct_new_vdo(struct dm_target *ti, unsigned int argc, char **argv) +{ + int result; + unsigned int instance; + struct registered_thread instance_thread; + + mutex_lock(&instances_lock); + result = allocate_instance(&instance); + mutex_unlock(&instances_lock); + if (result != VDO_SUCCESS) + return -ENOMEM; + + uds_register_thread_device_id(&instance_thread, &instance); + result = construct_new_vdo_registered(ti, argc, argv, instance); + uds_unregister_thread_device_id(); + return result; +} + +/** + * check_may_grow_physical() - Callback to check that we're not in recovery mode, used in + * vdo_prepare_to_grow_physical(). + * @completion: The admin completion. + */ +static void check_may_grow_physical(struct vdo_completion *completion) +{ + struct vdo *vdo = completion->vdo; + + assert_admin_phase_thread(vdo, __func__); + + /* These checks can only be done from a vdo thread. */ + if (vdo_is_read_only(vdo)) + vdo_set_completion_result(completion, VDO_READ_ONLY); + + if (vdo_in_recovery_mode(vdo)) + vdo_set_completion_result(completion, VDO_RETRY_AFTER_REBUILD); + + finish_operation_callback(completion); +} + +static block_count_t get_partition_size(struct layout *layout, enum partition_id id) +{ + return vdo_get_known_partition(layout, id)->count; +} + +/** + * grow_layout() - Make the layout for growing a vdo. + * @vdo: The vdo preparing to grow. + * @old_size: The current size of the vdo. + * @new_size: The size to which the vdo will be grown. + * + * Return: VDO_SUCCESS or an error code. + */ +static int grow_layout(struct vdo *vdo, block_count_t old_size, block_count_t new_size) +{ + int result; + block_count_t min_new_size; + + if (vdo->next_layout.size == new_size) + /* We are already prepared to grow to the new size, so we're done. */ + return VDO_SUCCESS; + + /* Make a copy completion if there isn't one */ + if (vdo->partition_copier == NULL) { + vdo->partition_copier = dm_kcopyd_client_create(NULL); + if (vdo->partition_copier == NULL) + return -ENOMEM; + } + + /* Free any unused preparation. */ + vdo_uninitialize_layout(&vdo->next_layout); + + /* + * Make a new layout with the existing partition sizes for everything but the slab depot + * partition. 
+ */ + result = vdo_initialize_layout(new_size, + vdo->layout.start, + get_partition_size(&vdo->layout, VDO_BLOCK_MAP_PARTITION), + get_partition_size(&vdo->layout, + VDO_RECOVERY_JOURNAL_PARTITION), + get_partition_size(&vdo->layout, + VDO_SLAB_SUMMARY_PARTITION), + &vdo->next_layout); + if (result != VDO_SUCCESS) { + dm_kcopyd_client_destroy(UDS_FORGET(vdo->partition_copier)); + return result; + } + + /* Ensure the new journal and summary are entirely within the added blocks. */ + min_new_size = (old_size + + get_partition_size(&vdo->next_layout, VDO_SLAB_SUMMARY_PARTITION) + + get_partition_size(&vdo->next_layout, VDO_RECOVERY_JOURNAL_PARTITION)); + if (min_new_size > new_size) { + /* Copying the journal and summary would destroy some old metadata. */ + vdo_uninitialize_layout(&vdo->next_layout); + dm_kcopyd_client_destroy(UDS_FORGET(vdo->partition_copier)); + return VDO_INCREMENT_TOO_SMALL; + } + + return VDO_SUCCESS; +} + +static int prepare_to_grow_physical(struct vdo *vdo, block_count_t new_physical_blocks) +{ + int result; + block_count_t current_physical_blocks = vdo->states.vdo.config.physical_blocks; + + uds_log_info("Preparing to resize physical to %llu", + (unsigned long long) new_physical_blocks); + ASSERT_LOG_ONLY((new_physical_blocks > current_physical_blocks), + "New physical size is larger than current physical size"); + result = perform_admin_operation(vdo, + PREPARE_GROW_PHYSICAL_PHASE_START, + check_may_grow_physical, + finish_operation_callback, + "prepare grow-physical"); + if (result != VDO_SUCCESS) + return result; + + result = grow_layout(vdo, current_physical_blocks, new_physical_blocks); + if (result != VDO_SUCCESS) + return result; + + result = vdo_prepare_to_grow_slab_depot(vdo->depot, + vdo_get_known_partition(&vdo->next_layout, + VDO_SLAB_DEPOT_PARTITION)); + if (result != VDO_SUCCESS) { + vdo_uninitialize_layout(&vdo->next_layout); + return result; + } + + uds_log_info("Done preparing to resize physical"); + return VDO_SUCCESS; +} + +/** + * validate_new_device_config() - Check whether a new device config represents a valid modification + * to an existing config. + * @to_validate: The new config to validate. + * @config: The existing config. + * @may_grow: Set to true if growing the logical and physical size of the vdo is currently + * permitted. + * @error_ptr: A pointer to hold the reason for any error. + * + * Return: VDO_SUCCESS or an error. 
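
grow_layout() only accepts a growth large enough for the relocated recovery journal and slab summary to fit entirely within the newly added blocks. A worked version of that minimum-size check, with invented block counts:

#include <stdio.h>

int main(void)
{
        /* Invented example sizes, in 4K blocks. */
        unsigned long long old_size = 1048576;     /* current physical size */
        unsigned long long new_size = 1064960;     /* requested physical size */
        unsigned long long journal_blocks = 32768; /* recovery journal partition */
        unsigned long long summary_blocks = 64;    /* slab summary partition */
        unsigned long long min_new_size = old_size + journal_blocks + summary_blocks;

        if (min_new_size > new_size)
                printf("VDO_INCREMENT_TOO_SMALL: need at least %llu blocks, got %llu\n",
                       min_new_size, new_size);
        else
                printf("growth of %llu blocks is sufficient\n", new_size - old_size);
        return 0;
}
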
+ */ +static int validate_new_device_config(struct device_config *to_validate, + struct device_config *config, + bool may_grow, + char **error_ptr) +{ + if (to_validate->owning_target->begin != config->owning_target->begin) { + *error_ptr = "Starting sector cannot change"; + return VDO_PARAMETER_MISMATCH; + } + + if (to_validate->logical_block_size != config->logical_block_size) { + *error_ptr = "Logical block size cannot change"; + return VDO_PARAMETER_MISMATCH; + } + + if (to_validate->logical_blocks < config->logical_blocks) { + *error_ptr = "Can't shrink VDO logical size"; + return VDO_PARAMETER_MISMATCH; + } + + if (to_validate->cache_size != config->cache_size) { + *error_ptr = "Block map cache size cannot change"; + return VDO_PARAMETER_MISMATCH; + } + + if (to_validate->block_map_maximum_age != config->block_map_maximum_age) { + *error_ptr = "Block map maximum age cannot change"; + return VDO_PARAMETER_MISMATCH; + } + + if (memcmp(&to_validate->thread_counts, + &config->thread_counts, + sizeof(struct thread_count_config)) != 0) { + *error_ptr = "Thread configuration cannot change"; + return VDO_PARAMETER_MISMATCH; + } + + if (to_validate->physical_blocks < config->physical_blocks) { + *error_ptr = "Removing physical storage from a VDO is not supported"; + return VDO_NOT_IMPLEMENTED; + } + + if (!may_grow && (to_validate->physical_blocks > config->physical_blocks)) { + *error_ptr = "VDO physical size may not grow in current state"; + return VDO_NOT_IMPLEMENTED; + } + + return VDO_SUCCESS; +} + +static int prepare_to_modify(struct dm_target *ti, struct device_config *config, struct vdo *vdo) +{ + int result; + bool may_grow = (vdo_get_admin_state(vdo) != VDO_ADMIN_STATE_PRE_LOADED); + + result = validate_new_device_config(config, vdo->device_config, may_grow, &ti->error); + if (result != VDO_SUCCESS) + return -EINVAL; + + if (config->logical_blocks > vdo->device_config->logical_blocks) { + block_count_t logical_blocks = vdo->states.vdo.config.logical_blocks; + + uds_log_info("Preparing to resize logical to %llu", + (unsigned long long) config->logical_blocks); + ASSERT_LOG_ONLY((config->logical_blocks > logical_blocks), + "New logical size is larger than current size"); + + result = vdo_prepare_to_grow_block_map(vdo->block_map, config->logical_blocks); + if (result != VDO_SUCCESS) { + ti->error = "Device vdo_prepare_to_grow_logical failed"; + return result; + } + + uds_log_info("Done preparing to resize logical"); + } + + if (config->physical_blocks > vdo->device_config->physical_blocks) { + result = prepare_to_grow_physical(vdo, config->physical_blocks); + if (result != VDO_SUCCESS) { + if (result == VDO_PARAMETER_MISMATCH) + /* + * If we don't trap this case, vdo_map_to_system_error() will remap + * it to -EIO, which is misleading and ahistorical. 
+ */ + result = -EINVAL; + + if (result == VDO_TOO_MANY_SLABS) + ti->error = "Device vdo_prepare_to_grow_physical failed (specified physical size too big based on formatted slab size)"; + else + ti->error = "Device vdo_prepare_to_grow_physical failed"; + + return result; + } + } + + if (strcmp(config->parent_device_name, vdo->device_config->parent_device_name) != 0) { + const char *device_name = vdo_get_device_name(config->owning_target); + + uds_log_info("Updating backing device of %s from %s to %s", + device_name, + vdo->device_config->parent_device_name, + config->parent_device_name); + } + + return VDO_SUCCESS; +} + +static int update_existing_vdo(const char *device_name, + struct dm_target *ti, + unsigned int argc, + char **argv, + struct vdo *vdo) +{ + int result; + struct device_config *config; + + result = parse_device_config(argc, argv, ti, &config); + if (result != VDO_SUCCESS) + return -EINVAL; + + uds_log_info("preparing to modify device '%s'", device_name); + result = prepare_to_modify(ti, config, vdo); + if (result != VDO_SUCCESS) { + free_device_config(config); + return vdo_map_to_system_error(result); + } + + set_device_config(ti, vdo, config); + return VDO_SUCCESS; +} + +static int vdo_ctr(struct dm_target *ti, unsigned int argc, char **argv) +{ + int result; + struct registered_thread allocating_thread, instance_thread; + const char *device_name; + struct vdo *vdo; + + uds_register_allocating_thread(&allocating_thread, NULL); + device_name = vdo_get_device_name(ti); + vdo = vdo_find_matching(vdo_is_named, (const void *) device_name); + if (vdo == NULL) { + result = construct_new_vdo(ti, argc, argv); + } else { + uds_register_thread_device_id(&instance_thread, &vdo->instance); + result = update_existing_vdo(device_name, ti, argc, argv, vdo); + uds_unregister_thread_device_id(); + } + + uds_unregister_allocating_thread(); + return result; +} + +static void vdo_dtr(struct dm_target *ti) +{ + struct device_config *config = ti->private; + struct vdo *vdo = UDS_FORGET(config->vdo); + + list_del_init(&config->config_list); + if (list_empty(&vdo->device_config_list)) { + const char *device_name; + + /* This was the last config referencing the VDO. Free it. */ + unsigned int instance = vdo->instance; + struct registered_thread allocating_thread, instance_thread; + + uds_register_thread_device_id(&instance_thread, &instance); + uds_register_allocating_thread(&allocating_thread, NULL); + + device_name = vdo_get_device_name(ti); + uds_log_info("stopping device '%s'", device_name); + if (vdo->dump_on_shutdown) + vdo_dump_all(vdo, "device shutdown"); + + vdo_destroy(UDS_FORGET(vdo)); + uds_log_info("device '%s' stopped", device_name); + uds_unregister_thread_device_id(); + uds_unregister_allocating_thread(); + release_instance(instance); + } else if (config == vdo->device_config) { + /* + * The VDO still references this config. Give it a reference to a config that isn't + * being destroyed. + */ + vdo->device_config = list_first_entry(&vdo->device_config_list, + struct device_config, + config_list); + } + + free_device_config(config); + ti->private = NULL; +} + +static void vdo_presuspend(struct dm_target *ti) +{ + get_vdo_for_target(ti)->suspend_type = + (dm_noflush_suspending(ti) ? VDO_ADMIN_STATE_SUSPENDING : VDO_ADMIN_STATE_SAVING); +} + +/** + * write_super_block_for_suspend() - Update the VDO state and save the super block. 
+ * @completion: The admin completion + */ +static void write_super_block_for_suspend(struct vdo_completion *completion) +{ + struct vdo *vdo = completion->vdo; + + switch (vdo_get_state(vdo)) { + case VDO_DIRTY: + case VDO_NEW: + vdo_set_state(vdo, VDO_CLEAN); + break; + + case VDO_CLEAN: + case VDO_READ_ONLY_MODE: + case VDO_FORCE_REBUILD: + case VDO_RECOVERING: + case VDO_REBUILD_FOR_UPGRADE: + break; + + case VDO_REPLAYING: + default: + vdo_continue_completion(completion, UDS_BAD_STATE); + return; + } + + vdo_save_components(vdo, completion); +} + +/** + * suspend_callback() - Callback to initiate a suspend, registered in vdo_postsuspend(). + * @completion: The sub-task completion. + */ +static void suspend_callback(struct vdo_completion *completion) +{ + struct vdo *vdo = completion->vdo; + struct admin_state *state = &vdo->admin.state; + int result; + + assert_admin_phase_thread(vdo, __func__); + + switch (advance_phase(vdo)) { + case SUSPEND_PHASE_START: + if (vdo_get_admin_state_code(state)->quiescent) + /* Already suspended */ + break; + + vdo_continue_completion(completion, vdo_start_operation(state, vdo->suspend_type)); + return; + + case SUSPEND_PHASE_PACKER: + /* + * If the VDO was already resumed from a prior suspend while read-only, some of the + * components may not have been resumed. By setting a read-only error here, we + * guarantee that the result of this suspend will be VDO_READ_ONLY and not + * VDO_INVALID_ADMIN_STATE in that case. + */ + if (vdo_in_read_only_mode(vdo)) + vdo_set_completion_result(completion, VDO_READ_ONLY); + + vdo_drain_packer(vdo->packer, completion); + return; + + case SUSPEND_PHASE_DATA_VIOS: + drain_data_vio_pool(vdo->data_vio_pool, completion); + return; + + case SUSPEND_PHASE_DEDUPE: + vdo_drain_hash_zones(vdo->hash_zones, completion); + return; + + case SUSPEND_PHASE_FLUSHES: + vdo_drain_flusher(vdo->flusher, completion); + return; + + case SUSPEND_PHASE_LOGICAL_ZONES: + /* + * Attempt to flush all I/O before completing post suspend work. We believe a + * suspended device is expected to have persisted all data written before the + * suspend, even if it hasn't been flushed yet. + */ + result = vdo_synchronous_flush(vdo); + if (result != VDO_SUCCESS) + vdo_enter_read_only_mode(vdo, result); + + vdo_drain_logical_zones(vdo->logical_zones, + vdo_get_admin_state_code(state), + completion); + return; + + case SUSPEND_PHASE_BLOCK_MAP: + vdo_drain_block_map(vdo->block_map, + vdo_get_admin_state_code(state), + completion); + return; + + case SUSPEND_PHASE_JOURNAL: + vdo_drain_recovery_journal(vdo->recovery_journal, + vdo_get_admin_state_code(state), + completion); + return; + + case SUSPEND_PHASE_DEPOT: + vdo_drain_slab_depot(vdo->depot, + vdo_get_admin_state_code(state), + completion); + return; + + case SUSPEND_PHASE_READ_ONLY_WAIT: + vdo_wait_until_not_entering_read_only_mode(completion); + return; + + case SUSPEND_PHASE_WRITE_SUPER_BLOCK: + if (vdo_is_state_suspending(state) || (completion->result != VDO_SUCCESS)) + /* If we didn't save the VDO or there was an error, we're done. 
*/ + break; + + write_super_block_for_suspend(completion); + return; + + case SUSPEND_PHASE_END: + break; + + default: + vdo_set_completion_result(completion, UDS_BAD_STATE); + } + + finish_operation_callback(completion); +} + +static void vdo_postsuspend(struct dm_target *ti) +{ + struct vdo *vdo = get_vdo_for_target(ti); + struct registered_thread instance_thread; + const char *device_name; + int result; + + uds_register_thread_device_id(&instance_thread, &vdo->instance); + device_name = vdo_get_device_name(vdo->device_config->owning_target); + uds_log_info("suspending device '%s'", device_name); + + /* + * It's important to note any error here does not actually stop device-mapper from + * suspending the device. All this work is done post suspend. + */ + result = perform_admin_operation(vdo, + SUSPEND_PHASE_START, + suspend_callback, + suspend_callback, + "suspend"); + + if ((result == VDO_SUCCESS) || (result == VDO_READ_ONLY)) { + /* + * Treat VDO_READ_ONLY as a success since a read-only suspension still leaves the + * VDO suspended. + */ + uds_log_info("device '%s' suspended", device_name); + } else if (result == VDO_INVALID_ADMIN_STATE) { + uds_log_error("Suspend invoked while in unexpected state: %s", + vdo_get_admin_state(vdo)->name); + } else { + uds_log_error_strerror(result, "Suspend of device '%s' failed", device_name); + } + + uds_unregister_thread_device_id(); +} + +/** + * was_new() - Check whether the vdo was new when it was loaded. + * @vdo: The vdo to query. + * + * Return: true if the vdo was new. + */ +static bool was_new(const struct vdo *vdo) +{ + return (vdo->load_state == VDO_NEW); +} + +/** + * requires_repair() - Check whether a vdo requires recovery or rebuild. + * @vdo: The vdo to query. + * + * Return: true if the vdo must be repaired. + */ +static bool __must_check requires_repair(const struct vdo *vdo) +{ + switch (vdo_get_state(vdo)) { + case VDO_DIRTY: + case VDO_FORCE_REBUILD: + case VDO_REPLAYING: + case VDO_REBUILD_FOR_UPGRADE: + return true; + + default: + return false; + } +} + +/** + * get_load_type() - Determine how the slab depot was loaded. + * @vdo: The vdo. + * + * Return: How the depot was loaded. + */ +static enum slab_depot_load_type get_load_type(struct vdo *vdo) +{ + if (vdo_state_requires_read_only_rebuild(vdo->load_state)) + return VDO_SLAB_DEPOT_REBUILD_LOAD; + + if (vdo_state_requires_recovery(vdo->load_state)) + return VDO_SLAB_DEPOT_RECOVERY_LOAD; + + return VDO_SLAB_DEPOT_NORMAL_LOAD; +} + +/** + * vdo_initialize_kobjects() - Initialize the vdo sysfs directory. + * @vdo: The vdo being initialized. + * + * Return: VDO_SUCCESS or an error code. + */ +static int vdo_initialize_kobjects(struct vdo *vdo) +{ + int result; + struct dm_target *target = vdo->device_config->owning_target; + struct mapped_device *md = dm_table_get_md(target->table); + + kobject_init(&vdo->vdo_directory, &vdo_directory_type); + vdo->sysfs_added = true; + result = kobject_add(&vdo->vdo_directory, &disk_to_dev(dm_disk(md))->kobj, "vdo"); + if (result != 0) + return VDO_CANT_ADD_SYSFS_NODE; + + result = vdo_add_dedupe_index_sysfs(vdo->hash_zones); + if (result != 0) + return VDO_CANT_ADD_SYSFS_NODE; + + return vdo_add_sysfs_stats_dir(vdo); +} + +/** + * load_callback() - Callback to do the destructive parts of loading a VDO. + * @completion: The sub-task completion. 
+ */ +static void load_callback(struct vdo_completion *completion) +{ + struct vdo *vdo = completion->vdo; + int result; + + assert_admin_phase_thread(vdo, __func__); + + switch (advance_phase(vdo)) { + case LOAD_PHASE_START: + result = vdo_start_operation(&vdo->admin.state, VDO_ADMIN_STATE_LOADING); + if (result != VDO_SUCCESS) { + vdo_continue_completion(completion, result); + return; + } + + /* Prepare the recovery journal for new entries. */ + vdo_open_recovery_journal(vdo->recovery_journal, vdo->depot, vdo->block_map); + vdo_allow_read_only_mode_entry(completion); + return; + + case LOAD_PHASE_STATS: + vdo_continue_completion(completion, vdo_initialize_kobjects(vdo)); + return; + + case LOAD_PHASE_LOAD_DEPOT: + if (vdo_is_read_only(vdo)) { + /* + * In read-only mode we don't use the allocator and it may not even be + * readable, so don't bother trying to load it. + */ + vdo_set_completion_result(completion, VDO_READ_ONLY); + break; + } + + if (requires_repair(vdo)) { + vdo_repair(completion); + return; + } + + vdo_load_slab_depot(vdo->depot, + (was_new(vdo) ? + VDO_ADMIN_STATE_FORMATTING : + VDO_ADMIN_STATE_LOADING), + completion, + NULL); + return; + + case LOAD_PHASE_MAKE_DIRTY: + vdo_set_state(vdo, VDO_DIRTY); + vdo_save_components(vdo, completion); + return; + + case LOAD_PHASE_PREPARE_TO_ALLOCATE: + vdo_initialize_block_map_from_journal(vdo->block_map, vdo->recovery_journal); + vdo_prepare_slab_depot_to_allocate(vdo->depot, get_load_type(vdo), completion); + return; + + case LOAD_PHASE_SCRUB_SLABS: + if (vdo_state_requires_recovery(vdo->load_state)) + vdo_enter_recovery_mode(vdo); + + vdo_scrub_all_unrecovered_slabs(vdo->depot, completion); + return; + + case LOAD_PHASE_DATA_REDUCTION: + WRITE_ONCE(vdo->compressing, vdo->device_config->compression); + if (vdo->device_config->deduplication) + /* + * Don't try to load or rebuild the index first (and log scary error + * messages) if this is known to be a newly-formatted volume. + */ + vdo_start_dedupe_index(vdo->hash_zones, was_new(vdo)); + + vdo->allocations_allowed = false; + fallthrough; + + case LOAD_PHASE_FINISHED: + break; + + case LOAD_PHASE_DRAIN_JOURNAL: + vdo_drain_recovery_journal(vdo->recovery_journal, + VDO_ADMIN_STATE_SAVING, + completion); + return; + + case LOAD_PHASE_WAIT_FOR_READ_ONLY: + /* Avoid an infinite loop */ + completion->error_handler = NULL; + vdo->admin.phase = LOAD_PHASE_FINISHED; + vdo_wait_until_not_entering_read_only_mode(completion); + return; + + default: + vdo_set_completion_result(completion, UDS_BAD_STATE); + } + + finish_operation_callback(completion); +} + +/** + * handle_load_error() - Handle an error during the load operation. + * @completion: The admin completion. + * + * If at all possible, brings the vdo online in read-only mode. This handler is registered in + * vdo_preresume_registered(). 
+ */ +static void handle_load_error(struct vdo_completion *completion) +{ + struct vdo *vdo = completion->vdo; + + if (vdo_requeue_completion_if_needed(completion, vdo->thread_config.admin_thread)) + return; + + if (vdo_state_requires_read_only_rebuild(vdo->load_state) && + (vdo->admin.phase == LOAD_PHASE_MAKE_DIRTY)) { + uds_log_error_strerror(completion->result, "aborting load"); + vdo->admin.phase = LOAD_PHASE_DRAIN_JOURNAL; + load_callback(UDS_FORGET(completion)); + return; + } + + uds_log_error_strerror(completion->result, "Entering read-only mode due to load error"); + vdo->admin.phase = LOAD_PHASE_WAIT_FOR_READ_ONLY; + vdo_enter_read_only_mode(vdo, completion->result); + completion->result = VDO_READ_ONLY; + load_callback(completion); +} + +/** + * write_super_block_for_resume() - Update the VDO state and save the super block. + * @completion: The admin completion + */ +static void write_super_block_for_resume(struct vdo_completion *completion) +{ + struct vdo *vdo = completion->vdo; + + switch (vdo_get_state(vdo)) { + case VDO_CLEAN: + case VDO_NEW: + vdo_set_state(vdo, VDO_DIRTY); + vdo_save_components(vdo, completion); + return; + + case VDO_DIRTY: + case VDO_READ_ONLY_MODE: + case VDO_FORCE_REBUILD: + case VDO_RECOVERING: + case VDO_REBUILD_FOR_UPGRADE: + /* No need to write the super block in these cases */ + vdo_launch_completion(completion); + return; + + case VDO_REPLAYING: + default: + vdo_continue_completion(completion, UDS_BAD_STATE); + } +} + +/** + * resume_callback() - Callback to resume a VDO. + * @completion: The admin completion. + */ +static void resume_callback(struct vdo_completion *completion) +{ + struct vdo *vdo = completion->vdo; + int result; + + assert_admin_phase_thread(vdo, __func__); + + switch (advance_phase(vdo)) { + case RESUME_PHASE_START: + result = vdo_start_operation(&vdo->admin.state, VDO_ADMIN_STATE_RESUMING); + if (result != VDO_SUCCESS) { + vdo_continue_completion(completion, result); + return; + } + + write_super_block_for_resume(completion); + return; + + case RESUME_PHASE_ALLOW_READ_ONLY_MODE: + vdo_allow_read_only_mode_entry(completion); + return; + + case RESUME_PHASE_DEDUPE: + vdo_resume_hash_zones(vdo->hash_zones, completion); + return; + + case RESUME_PHASE_DEPOT: + vdo_resume_slab_depot(vdo->depot, completion); + return; + + case RESUME_PHASE_JOURNAL: + vdo_resume_recovery_journal(vdo->recovery_journal, completion); + return; + + case RESUME_PHASE_BLOCK_MAP: + vdo_resume_block_map(vdo->block_map, completion); + return; + + case RESUME_PHASE_LOGICAL_ZONES: + vdo_resume_logical_zones(vdo->logical_zones, completion); + return; + + case RESUME_PHASE_PACKER: + { + bool was_enabled = vdo_get_compressing(vdo); + bool enable = vdo->device_config->compression; + + if (enable != was_enabled) + WRITE_ONCE(vdo->compressing, enable); + uds_log_info("compression is %s", (enable ? "enabled" : "disabled")); + + vdo_resume_packer(vdo->packer, completion); + return; + } + + case RESUME_PHASE_FLUSHER: + vdo_resume_flusher(vdo->flusher, completion); + return; + + case RESUME_PHASE_DATA_VIOS: + resume_data_vio_pool(vdo->data_vio_pool, completion); + return; + + case RESUME_PHASE_END: + break; + + default: + vdo_set_completion_result(completion, UDS_BAD_STATE); + } + + finish_operation_callback(completion); +} + +/** + * grow_logical_callback() - Callback to initiate a grow logical. + * @completion: The admin completion. + * + * Registered in perform_grow_logical(). 
+ */ +static void grow_logical_callback(struct vdo_completion *completion) +{ + struct vdo *vdo = completion->vdo; + int result; + + assert_admin_phase_thread(vdo, __func__); + + switch (advance_phase(vdo)) { + case GROW_LOGICAL_PHASE_START: + if (vdo_is_read_only(vdo)) { + uds_log_error_strerror(VDO_READ_ONLY, + "Can't grow logical size of a read-only VDO"); + vdo_set_completion_result(completion, VDO_READ_ONLY); + break; + } + + result = vdo_start_operation(&vdo->admin.state, + VDO_ADMIN_STATE_SUSPENDED_OPERATION); + if (result != VDO_SUCCESS) { + vdo_continue_completion(completion, result); + return; + } + + vdo->states.vdo.config.logical_blocks = vdo->block_map->next_entry_count; + vdo_save_components(vdo, completion); + return; + + case GROW_LOGICAL_PHASE_GROW_BLOCK_MAP: + vdo_grow_block_map(vdo->block_map, completion); + return; + + case GROW_LOGICAL_PHASE_END: + break; + + case GROW_LOGICAL_PHASE_ERROR: + vdo_enter_read_only_mode(vdo, completion->result); + break; + + default: + vdo_set_completion_result(completion, UDS_BAD_STATE); + } + + finish_operation_callback(completion); +} + +/** + * handle_logical_growth_error() - Handle an error during the grow logical process. + * @completion: The admin completion. + */ +static void handle_logical_growth_error(struct vdo_completion *completion) +{ + struct vdo *vdo = completion->vdo; + + if (vdo->admin.phase == GROW_LOGICAL_PHASE_GROW_BLOCK_MAP) { + /* + * We've failed to write the new size in the super block, so set our in-memory + * config back to the old size. + */ + vdo->states.vdo.config.logical_blocks = vdo->block_map->entry_count; + vdo_abandon_block_map_growth(vdo->block_map); + } + + vdo->admin.phase = GROW_LOGICAL_PHASE_ERROR; + grow_logical_callback(completion); +} + +/** + * perform_grow_logical() - Grow the logical size of the vdo. + * @vdo: The vdo to grow. + * @new_logical_blocks: The size to which the vdo should be grown. + * + * Context: This method may only be called when the vdo has been suspended and must not be called + * from a base thread. + * + * Return: VDO_SUCCESS or an error. + */ +static int perform_grow_logical(struct vdo *vdo, block_count_t new_logical_blocks) +{ + int result; + + if (vdo->device_config->logical_blocks == new_logical_blocks) { + /* + * A table was loaded for which we prepared to grow, but a table without that + * growth was what we are resuming with. + */ + vdo_abandon_block_map_growth(vdo->block_map); + return VDO_SUCCESS; + } + + uds_log_info("Resizing logical to %llu", (unsigned long long) new_logical_blocks); + if (vdo->block_map->next_entry_count != new_logical_blocks) + return VDO_PARAMETER_MISMATCH; + + result = perform_admin_operation(vdo, + GROW_LOGICAL_PHASE_START, + grow_logical_callback, + handle_logical_growth_error, + "grow logical"); + if (result != VDO_SUCCESS) + return result; + + uds_log_info("Logical blocks now %llu", (unsigned long long) new_logical_blocks); + return VDO_SUCCESS; +} + +static void copy_callback(int read_err, unsigned long write_err, void *context) +{ + struct vdo_completion *completion = context; + int result = (((read_err == 0) && (write_err == 0)) ?
VDO_SUCCESS : -EIO); + + vdo_continue_completion(completion, result); +} + +static void +partition_to_region(struct partition *partition, struct vdo *vdo, struct dm_io_region *region) +{ + physical_block_number_t pbn = partition->offset - vdo->geometry.bio_offset; + + *region = (struct dm_io_region) { + .bdev = vdo_get_backing_device(vdo), + .sector = pbn * VDO_SECTORS_PER_BLOCK, + .count = partition->count * VDO_SECTORS_PER_BLOCK, + }; +} + +/** + * copy_partition() - Copy a partition from the location specified in the current layout to that in + * the next layout. + * @vdo: The vdo preparing to grow. + * @id: The ID of the partition to copy. + * @parent: The completion to notify when the copy is complete. + */ +static void copy_partition(struct vdo *vdo, enum partition_id id, struct vdo_completion *parent) +{ + struct dm_io_region read_region, write_regions[1]; + struct partition *from = vdo_get_known_partition(&vdo->layout, id); + struct partition *to = vdo_get_known_partition(&vdo->next_layout, id); + + partition_to_region(from, vdo, &read_region); + partition_to_region(to, vdo, &write_regions[0]); + dm_kcopyd_copy(vdo->partition_copier, + &read_region, + 1, + write_regions, + 0, + copy_callback, + parent); +} + +/** + * grow_physical_callback() - Callback to initiate a grow physical. + * @completion: The admin completion. + * + * Registered in perform_grow_physical(). + */ +static void grow_physical_callback(struct vdo_completion *completion) +{ + struct vdo *vdo = completion->vdo; + int result; + + assert_admin_phase_thread(vdo, __func__); + + switch (advance_phase(vdo)) { + case GROW_PHYSICAL_PHASE_START: + if (vdo_is_read_only(vdo)) { + uds_log_error_strerror(VDO_READ_ONLY, + "Can't grow physical size of a read-only VDO"); + vdo_set_completion_result(completion, VDO_READ_ONLY); + break; + } + + result = vdo_start_operation(&vdo->admin.state, + VDO_ADMIN_STATE_SUSPENDED_OPERATION); + if (result != VDO_SUCCESS) { + vdo_continue_completion(completion, result); + return; + } + + /* Copy the journal into the new layout. */ + copy_partition(vdo, VDO_RECOVERY_JOURNAL_PARTITION, completion); + return; + + case GROW_PHYSICAL_PHASE_COPY_SUMMARY: + copy_partition(vdo, VDO_SLAB_SUMMARY_PARTITION, completion); + return; + + case GROW_PHYSICAL_PHASE_UPDATE_COMPONENTS: + vdo_uninitialize_layout(&vdo->layout); + vdo->layout = vdo->next_layout; + UDS_FORGET(vdo->next_layout.head); + vdo->states.vdo.config.physical_blocks = vdo->layout.size; + vdo_update_slab_depot_size(vdo->depot); + vdo_save_components(vdo, completion); + return; + + case GROW_PHYSICAL_PHASE_USE_NEW_SLABS: + vdo_use_new_slabs(vdo->depot, completion); + return; + + case GROW_PHYSICAL_PHASE_END: + vdo->depot->summary_origin = + vdo_get_known_partition(&vdo->layout, VDO_SLAB_SUMMARY_PARTITION)->offset; + vdo->recovery_journal->origin = + vdo_get_known_partition(&vdo->layout, + VDO_RECOVERY_JOURNAL_PARTITION)->offset; + break; + + case GROW_PHYSICAL_PHASE_ERROR: + vdo_enter_read_only_mode(vdo, completion->result); + break; + + default: + vdo_set_completion_result(completion, UDS_BAD_STATE); + } + + vdo_uninitialize_layout(&vdo->next_layout); + finish_operation_callback(completion); +} + +/** + * handle_physical_growth_error() - Handle an error during the grow physical process. + * @completion: The sub-task completion. 
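
partition_to_region() hands dm-kcopyd regions measured in 512-byte sectors, so each 4K-block offset and count is multiplied by VDO_SECTORS_PER_BLOCK (8). A standalone sketch of that conversion (the partition offsets are example values only):

#include <stdio.h>

#define VDO_BLOCK_SIZE 4096
#define SECTOR_SIZE 512
#define VDO_SECTORS_PER_BLOCK (VDO_BLOCK_SIZE / SECTOR_SIZE)

struct io_region {
        unsigned long long sector; /* first 512-byte sector */
        unsigned long long count;  /* length in sectors */
};

static struct io_region blocks_to_region(unsigned long long offset_blocks,
                                         unsigned long long count_blocks)
{
        struct io_region region = {
                .sector = offset_blocks * VDO_SECTORS_PER_BLOCK,
                .count = count_blocks * VDO_SECTORS_PER_BLOCK,
        };

        return region;
}

int main(void)
{
        /* e.g. a journal partition at block 2048, 8192 blocks long */
        struct io_region region = blocks_to_region(2048, 8192);

        printf("copy %llu sectors starting at sector %llu\n",
               region.count, region.sector);
        return 0;
}
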
+ */ +static void handle_physical_growth_error(struct vdo_completion *completion) +{ + completion->vdo->admin.phase = GROW_PHYSICAL_PHASE_ERROR; + grow_physical_callback(completion); +} + +/** + * perform_grow_physical() - Grow the physical size of the vdo. + * @vdo: The vdo to resize. + * @new_physical_blocks: The new physical size in blocks. + * + * Context: This method may only be called when the vdo has been suspended and must not be called + * from a base thread. + * + * Return: VDO_SUCCESS or an error. + */ +static int perform_grow_physical(struct vdo *vdo, block_count_t new_physical_blocks) +{ + int result; + block_count_t new_depot_size, prepared_depot_size; + block_count_t old_physical_blocks = vdo->states.vdo.config.physical_blocks; + + /* Skip any noop grows. */ + if (old_physical_blocks == new_physical_blocks) + return VDO_SUCCESS; + + if (new_physical_blocks != vdo->next_layout.size) { + /* + * Either the VDO isn't prepared to grow, or it was prepared to grow to a different + * size. Doing this check here relies on the fact that the call to this method is + * done under the dmsetup message lock. + */ + vdo_uninitialize_layout(&vdo->next_layout); + vdo_abandon_new_slabs(vdo->depot); + return VDO_PARAMETER_MISMATCH; + } + + /* Validate that we are prepared to grow appropriately. */ + new_depot_size = + vdo_get_known_partition(&vdo->next_layout, VDO_SLAB_DEPOT_PARTITION)->count; + prepared_depot_size = (vdo->depot->new_slabs == NULL) ? 0 : vdo->depot->new_size; + if (prepared_depot_size != new_depot_size) + return VDO_PARAMETER_MISMATCH; + + result = perform_admin_operation(vdo, + GROW_PHYSICAL_PHASE_START, + grow_physical_callback, + handle_physical_growth_error, + "grow physical"); + if (result != VDO_SUCCESS) + return result; + + uds_log_info("Physical block count was %llu, now %llu", + (unsigned long long) old_physical_blocks, + (unsigned long long) new_physical_blocks); + return VDO_SUCCESS; +} + +/** + * apply_new_vdo_configuration() - Attempt to make any configuration changes from the table being + * resumed. + * @vdo: The vdo being resumed. + * @config: The new device configuration derived from the table with which the vdo is being + * resumed. + * + * Return: VDO_SUCCESS or an error. + */ +static int __must_check apply_new_vdo_configuration(struct vdo *vdo, struct device_config *config) +{ + int result; + + result = perform_grow_logical(vdo, config->logical_blocks); + if (result != VDO_SUCCESS) { + uds_log_error("grow logical operation failed, result = %d", result); + return result; + } + + result = perform_grow_physical(vdo, config->physical_blocks); + if (result != VDO_SUCCESS) + uds_log_error("resize operation failed, result = %d", result); + + return result; +} + +static int vdo_preresume_registered(struct dm_target *ti, struct vdo *vdo) +{ + struct device_config *config = ti->private; + const char *device_name = vdo_get_device_name(ti); + block_count_t backing_blocks; + int result; + + backing_blocks = get_underlying_device_block_count(vdo); + if (backing_blocks < config->physical_blocks) { + /* FIXME: can this still happen? 
*/ + uds_log_error("resume of device '%s' failed: backing device has %llu blocks but VDO physical size is %llu blocks", + device_name, + (unsigned long long) backing_blocks, + (unsigned long long) config->physical_blocks); + return -EINVAL; + } + + if (vdo_get_admin_state(vdo) == VDO_ADMIN_STATE_PRE_LOADED) { + uds_log_info("starting device '%s'", device_name); + result = perform_admin_operation(vdo, + LOAD_PHASE_START, + load_callback, + handle_load_error, + "load"); + if ((result != VDO_SUCCESS) && (result != VDO_READ_ONLY)) { + /* + * Something has gone very wrong. Make sure everything has drained and + * leave the device in an unresumable state. + */ + uds_log_error_strerror(result, + "Start failed, could not load VDO metadata"); + vdo->suspend_type = VDO_ADMIN_STATE_STOPPING; + perform_admin_operation(vdo, + SUSPEND_PHASE_START, + suspend_callback, + suspend_callback, + "suspend"); + return result; + } + + /* Even if the VDO is read-only, it is now able to handle read requests. */ + uds_log_info("device '%s' started", device_name); + } + + uds_log_info("resuming device '%s'", device_name); + + /* If this fails, the VDO was not in a state to be resumed. This should never happen. */ + result = apply_new_vdo_configuration(vdo, config); + BUG_ON(result == VDO_INVALID_ADMIN_STATE); + + /* + * Now that we've tried to modify the vdo, the new config *is* the config, whether the + * modifications worked or not. + */ + vdo->device_config = config; + + /* + * Any error here is highly unexpected and the state of the vdo is questionable, so we mark + * it read-only in memory. Because we are suspended, the read-only state will not be + * written to disk. + */ + if (result != VDO_SUCCESS) { + uds_log_error_strerror(result, + "Commit of modifications to device '%s' failed", + device_name); + vdo_enter_read_only_mode(vdo, result); + return result; + } + + if (vdo_get_admin_state(vdo)->normal) + /* The VDO was just started, so we don't need to resume it. */ + return VDO_SUCCESS; + + result = perform_admin_operation(vdo, + RESUME_PHASE_START, + resume_callback, + resume_callback, + "resume"); + BUG_ON(result == VDO_INVALID_ADMIN_STATE); + if (result == VDO_READ_ONLY) + /* Even if the vdo is read-only, it has still resumed. */ + result = VDO_SUCCESS; + + if (result != VDO_SUCCESS) + uds_log_error("resume of device '%s' failed with error: %d", device_name, result); + + return result; +} + +static int vdo_preresume(struct dm_target *ti) +{ + struct registered_thread instance_thread; + struct vdo *vdo = get_vdo_for_target(ti); + int result; + + uds_register_thread_device_id(&instance_thread, &vdo->instance); + result = vdo_preresume_registered(ti, vdo); + if ((result == VDO_PARAMETER_MISMATCH) || (result == VDO_INVALID_ADMIN_STATE)) + result = -EINVAL; + uds_unregister_thread_device_id(); + return vdo_map_to_system_error(result); +} + +static void vdo_resume(struct dm_target *ti) +{ + struct registered_thread instance_thread; + + uds_register_thread_device_id(&instance_thread, &get_vdo_for_target(ti)->instance); + uds_log_info("device '%s' resumed", vdo_get_device_name(ti)); + uds_unregister_thread_device_id(); +} + +/* + * If anything changes that affects how user tools will interact with vdo, update the version + * number and make sure documentation about the change is complete so tools can properly update + * their management code. 
+ */ +static struct target_type vdo_target_bio = { + .features = DM_TARGET_SINGLETON, + .name = "vdo", + .version = { 8, 2, 0 }, + .module = THIS_MODULE, + .ctr = vdo_ctr, + .dtr = vdo_dtr, + .io_hints = vdo_io_hints, + .iterate_devices = vdo_iterate_devices, + .map = vdo_map_bio, + .message = vdo_message, + .status = vdo_status, + .presuspend = vdo_presuspend, + .postsuspend = vdo_postsuspend, + .preresume = vdo_preresume, + .resume = vdo_resume, +}; + +static bool dm_registered; + +static void vdo_module_destroy(void) +{ + uds_log_debug("unloading"); + + if (dm_registered) + dm_unregister_target(&vdo_target_bio); + + ASSERT_LOG_ONLY(instances.count == 0, + "should have no instance numbers still in use, but have %u", + instances.count); + UDS_FREE(instances.words); + memset(&instances, 0, sizeof(struct instance_tracker)); + + uds_log_info("unloaded version %s", CURRENT_VERSION); +} + +static int __init vdo_init(void) +{ + int result = 0; + + /* + * UDS module level initialization must be done first, as VDO initialization depends on it + */ + uds_initialize_thread_device_registry(); + uds_memory_init(); + uds_init_sysfs(); + + vdo_initialize_device_registry_once(); + uds_log_info("loaded version %s", CURRENT_VERSION); + + /* Add VDO errors to the already existing set of errors in UDS. */ + result = vdo_register_status_codes(); + if (result != UDS_SUCCESS) { + uds_log_error("vdo_register_status_codes failed %d", result); + vdo_module_destroy(); + return result; + } + + result = dm_register_target(&vdo_target_bio); + if (result < 0) { + uds_log_error("dm_register_target failed %d", result); + vdo_module_destroy(); + return result; + } + dm_registered = true; + + return result; +} + +static void __exit vdo_exit(void) +{ + vdo_module_destroy(); + /* + * UDS module level exit processing must be done after all VDO module exit processing is + * complete. 
+ */ + uds_put_sysfs(); + uds_memory_exit(); +} + +module_init(vdo_init); +module_exit(vdo_exit); + +MODULE_DESCRIPTION(DM_NAME " target for transparent deduplication"); +MODULE_AUTHOR("Red Hat, Inc."); +MODULE_LICENSE("GPL"); From patchwork Thu Oct 26 21:41:36 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Matthew Sakai X-Patchwork-Id: 13437944 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id A350EC25B6B for ; Thu, 26 Oct 2023 21:44:26 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S235016AbjJZVo0 (ORCPT ); Thu, 26 Oct 2023 17:44:26 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:52846 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1344882AbjJZVoQ (ORCPT ); Thu, 26 Oct 2023 17:44:16 -0400 Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.133.124]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 2642DD5C for ; Thu, 26 Oct 2023 14:41:49 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1698356508; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=2VCe6vdbWClz+JSHP8xKkLJYnJuJNX6zpMYbifB9jBc=; b=UL3XTl4LAtMkuXa2vcWKFHkqzWgd319FYSK7XkEuL++8BnE+pnNIB2eiWGTcoXC7zthpkn y8zB+pxjnywf4nKumss4q5ByjgZbg3ewLuAfUJD+QqWg0nxthM8xH2icMA8kwhilmHcMlH Dh20o1mczPoAAS+CJx4So2BMJLZvzDU= Received: from mimecast-mx02.redhat.com (mimecast-mx02.redhat.com [66.187.233.88]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.3, cipher=TLS_AES_256_GCM_SHA384) id us-mta-230-efJD2pA_O6iEw58Be_2wAw-1; Thu, 26 Oct 2023 17:41:44 -0400 X-MC-Unique: efJD2pA_O6iEw58Be_2wAw-1 Received: from smtp.corp.redhat.com (int-mx05.intmail.prod.int.rdu2.redhat.com [10.11.54.5]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (2048 bits) server-digest SHA256) (No client certificate requested) by mimecast-mx02.redhat.com (Postfix) with ESMTPS id DFC14185A783; Thu, 26 Oct 2023 21:41:43 +0000 (UTC) Received: from pbitcolo-build-10.permabit.com (pbitcolo-build-10.permabit.lab.eng.bos.redhat.com [10.19.117.76]) by smtp.corp.redhat.com (Postfix) with ESMTP id D5C942949; Thu, 26 Oct 2023 21:41:43 +0000 (UTC) Received: by pbitcolo-build-10.permabit.com (Postfix, from userid 1138) id 8C6203003B; Thu, 26 Oct 2023 17:41:43 -0400 (EDT) From: Matthew Sakai To: dm-devel@lists.linux.dev, linux-block@vger.kernel.org Cc: Matthew Sakai Subject: [PATCH v4 39/39] dm vdo: enable configuration and building of dm-vdo Date: Thu, 26 Oct 2023 17:41:36 -0400 Message-Id: <20231026214136.1067410-40-msakai@redhat.com> In-Reply-To: <20231026214136.1067410-1-msakai@redhat.com> References: <20231026214136.1067410-1-msakai@redhat.com> MIME-Version: 1.0 X-Scanned-By: MIMEDefang 3.4.1 on 10.11.54.5 Precedence: bulk List-ID: X-Mailing-List: linux-block@vger.kernel.org dm-vdo targets are not supported for 32-bit configurations. A vdo target typically requires 1 to 1.5 GB of memory at any given time, which is likely a large fraction of the addressable memory of a 32-bit system. 
At the same time, the amount of addressable storage attached to a 32-bit system may not be large enough for deduplication to provide much benefit. Because of these concerns, 32-bit platforms are deemed unlikely to benefit from using a vdo target, so dm-vdo is targeted only at 64-bit platforms. Co-developed-by: J. corwin Coburn Signed-off-by: J. corwin Coburn Co-developed-by: John Wiele Signed-off-by: John Wiele Signed-off-by: Matthew Sakai --- drivers/md/Kconfig | 16 ++++++++++++++++ drivers/md/Makefile | 2 ++ 2 files changed, 18 insertions(+) diff --git a/drivers/md/Kconfig b/drivers/md/Kconfig index b0a22e99bade..9fa9dec10292 100644 --- a/drivers/md/Kconfig +++ b/drivers/md/Kconfig @@ -521,6 +521,22 @@ config DM_FLAKEY help A target that intermittently fails I/O for debugging purposes. +config DM_VDO + tristate "VDO: deduplication and compression target" + depends on 64BIT + depends on BLK_DEV_DM + select DM_BUFIO + select LZ4_COMPRESS + select LZ4_DECOMPRESS + help + This device mapper target presents a block device with + deduplication, compression and thin-provisioning. + + To compile this code as a module, choose M here: the module will + be called dm-vdo. + + If unsure, say N. + config DM_VERITY tristate "Verity target support" depends on BLK_DEV_DM diff --git a/drivers/md/Makefile b/drivers/md/Makefile index 84291e38dca8..47444b393abb 100644 --- a/drivers/md/Makefile +++ b/drivers/md/Makefile @@ -25,6 +25,7 @@ dm-ebs-y += dm-ebs-target.o dm-era-y += dm-era-target.o dm-clone-y += dm-clone-target.o dm-clone-metadata.o dm-verity-y += dm-verity-target.o +dm-vdo-y += dm-vdo-target.o $(patsubst drivers/md/dm-vdo/%.c,dm-vdo/%.o,$(wildcard $(src)/dm-vdo/*.c)) dm-zoned-y += dm-zoned-target.o dm-zoned-metadata.o dm-zoned-reclaim.o md-mod-y += md.o md-bitmap.o @@ -74,6 +75,7 @@ obj-$(CONFIG_DM_ZERO) += dm-zero.o obj-$(CONFIG_DM_RAID) += dm-raid.o obj-$(CONFIG_DM_THIN_PROVISIONING) += dm-thin-pool.o obj-$(CONFIG_DM_VERITY) += dm-verity.o +obj-$(CONFIG_DM_VDO) += dm-vdo.o obj-$(CONFIG_DM_CACHE) += dm-cache.o obj-$(CONFIG_DM_CACHE_SMQ) += dm-cache-smq.o obj-$(CONFIG_DM_EBS) += dm-ebs.o