[v2,00/39] Add the dm-vdo deduplication and compression device mapper target.

Message ID 20230523214539.226387-1-corwin@redhat.com (mailing list archive)

Message

corwin May 23, 2023, 9:45 p.m. UTC
The dm-vdo target provides inline deduplication, compression, zero-block
elimination, and thin provisioning. A dm-vdo target can be backed by up to
256TB of storage, and can present a logical size of up to 4PB. This target
was originally developed at Permabit Technology Corp. starting in 2009. It
was first released in 2013 and has been used in production environments
ever since. It was made open-source in 2017 after Permabit was acquired by
Red Hat.

Because deduplication rates fall drastically as the block size increases, a
vdo target has a maximum block size of 4KB. However, it can achieve
deduplication rates of 254:1, i.e. up to 254 copies of a given 4KB block
can reference a single 4KB block of actual storage. It can achieve
compression rates of up to 14:1. All-zero blocks consume no storage at all.

Design Summary
--------------

This is a high-level summary of the ideas behind dm-vdo. For details about
the implementation and various design choices, refer to vdo-design.rst
included in this patch set.

Deduplication is a two-part problem. The first part is recognizing
duplicate data; the second part is avoiding multiple copies of the
duplicated data. Therefore, vdo has two main sections: a deduplication
index that is used to discover potential duplicate data, and a data store
with a reference counted block map that maps from logical block addresses
to the actual storage location of the data.

Hashing:

In order to identify blocks, vdo hashes each 4KB block to produce a 128-bit
block name. Since vdo only requires these names to be evenly distributed,
it uses MurmurHash3, a non-cryptographic hash algorithm which is faster
than cryptographic hashes.
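
As a rough illustration only (murmurhash3_128() below is a stand-in
declaration, not vdo's actual API, and the real signature may differ), the
block name can be thought of as a 128-bit value computed over the 4KB of
block data:

#include <linux/types.h>

/* Stand-in declaration for vdo's MurmurHash3 implementation. */
void murmurhash3_128(const void *data, int len, u32 seed, void *out);

struct block_name {
        u64 low;
        u64 high;
};

/* Compute the 128-bit name of a 4KB block. Only even distribution of
 * the output bits matters, not cryptographic strength. */
static struct block_name name_block(const void *data)
{
        struct block_name name;

        murmurhash3_128(data, 4096, 0 /* fixed seed */, &name);
        return name;
}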

The Deduplication Index:

The index is a set of mappings between a block name (the hash of its
contents) and a hint indicating where the block might be stored. These
mappings are stored in temporal order because groups of blocks that are
written together (such as a large file) tend to be rewritten together as
well. The index uses a least-recently-used (LRU) scheme to keep recently
used names in the index while older names are discarded.

The index uses a structure called a delta-index to store its mappings,
which is more space-efficient than using a hashtable. It uses a variable
length encoding with the property that the average size of an entry
decreases as the number of entries increases, resulting in a roughly
constant size as the index fills.
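
As a back-of-envelope illustration of that property (userspace C, assuming
a 64-bit key space purely for the sake of the example, not the actual
encoding): storing sorted keys as deltas from their predecessors means each
entry costs roughly log2(key_space / entries) bits, so the average entry
shrinks as entries are added:

#include <math.h>
#include <stdio.h>

int main(void)
{
        double key_space = pow(2, 64);  /* illustrative key space */
        double entries;

        /* The average delta is key_space / entries, so encoding it takes
         * about log2(key_space / entries) bits plus a small constant. */
        for (entries = 1e6; entries <= 1e9; entries *= 10)
                printf("%.0e entries: ~%.1f bits per entry\n",
                       entries, log2(key_space / entries));
        return 0;
}

In this model each tenfold increase in the number of entries saves about
3.3 bits per entry.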

Because storing hashes along with the data or rehashing blocks on
overwrite would be expensive, entries are never explicitly deleted from the
index. Instead, vdo must always check the data at the physical location
provided by the index to ensure that the hint is still valid.
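
In other words, an index hit is only advisory. A minimal sketch of the
verification step (hypothetical helper, not vdo's code):

#include <linux/string.h>
#include <linux/types.h>

/* The candidate block named by the index hint must be read back and
 * compared byte-for-byte before a duplicate reference is recorded;
 * a stale hint simply fails the comparison and the write proceeds as
 * a normal (non-duplicate) write. */
static bool verify_duplicate(const void *new_data, const void *candidate_data)
{
        return memcmp(new_data, candidate_data, 4096) == 0;
}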

The Data Store:

The data store is implemented by three main data structures: the block map,
the slab depot, and the recovery journal. These structures work in concert
to amortize metadata updates across as many data writes as possible.

The block map contains the mapping from logical addresses to physical
locations. For each logical address, it records whether that address is
unused or all zeros, or else which physical block holds its contents and
whether that block is compressed. The array of mappings is represented as
a tree, with nodes that are allocated as needed from the available
physical space.
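
Conceptually, each block map entry carries a mapping state alongside a
physical block number; the sketch below is illustrative only, and the real
on-disk entry packs this information much more compactly:

#include <linux/types.h>

/* Illustrative only; the real entry also records which fragment of a
 * compressed block holds the data. */
enum mapping_state {
        MAPPING_UNMAPPED,       /* logical address has never been written */
        MAPPING_ZERO,           /* logical address holds all zeros */
        MAPPING_UNCOMPRESSED,   /* pbn holds the data uncompressed */
        MAPPING_COMPRESSED,     /* pbn holds a compressed fragment */
};

struct block_mapping {
        u64 pbn;                /* physical block number, if mapped */
        enum mapping_state state;
};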

The slab depot tracks the physical space available for storing user data.
The depot also maintains a reference count for each physical block. Each
block can have up to 254 logical references.

The recovery journal is a transaction log of the logical-to-physical
mappings made by data writes. Committing this journal regularly allows a
vdo to reduce the frequency of other metadata writes and allows it to
reconstruct its metadata in the event of a crash.
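
A sketch of the kind of record the journal holds (a hypothetical layout,
not the on-disk format): each data write logs the new logical-to-physical
mapping so it can be replayed after a crash:

#include <linux/types.h>

/* Hypothetical journal entry layout for illustration. */
struct journal_entry_sketch {
        u64 lbn;        /* logical block number being written */
        u64 pbn;        /* physical block now holding the data */
        bool increment; /* whether this adds or removes a reference */
};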

Zones and Threading:

Due to the complexity of deduplication, the number of metadata structures
involved in a single write operation to a vdo target is larger than in most
other targets. Furthermore, because vdo operates on small block sizes in
order to achieve good deduplication rates, parallelism is key to good
performance. The deduplication index, the block map, and the slab depot are
all designed to be easily divided into disjoint zones such that any piece
of metadata is handled by a single zone. Each zone is then assigned to a
single thread so that all metadata operations in that zone can proceed
without locking. Each bio is associated with a request object which can be
enqueued on each zone thread it needs to access. The zone divisions are not
reflected in the on-disk representation of the data structures, so the
number of zones, and hence the number of threads, can be configured each
time a vdo target is started.
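
As an illustration of the zoning idea (hypothetical helpers, not vdo's
actual routing code), any piece of metadata can be assigned to a zone, and
therefore to a single thread, by a fixed function of its name or address:

#include <linux/types.h>

/* A few bits of the block name select the hash zone; because the
 * mapping is fixed while the device is running, no locking is needed
 * within a zone. */
static unsigned int hash_zone_for_name(u64 name_bits,
                                       unsigned int hash_zone_count)
{
        return (unsigned int)(name_bits % hash_zone_count);
}

/* Logical block addresses are striped across the logical zones; the
 * zone counts are chosen at start time and never appear on disk. */
static unsigned int logical_zone_for_lbn(u64 lbn,
                                         unsigned int logical_zone_count)
{
        return (unsigned int)(lbn % logical_zone_count);
}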

Existing facilities
-------------------

In a few cases, we found that existing kernel facilities did not meet vdo's
needs, either because of performance or due to a mismatch of semantics.
These are detailed here:

Work Queues:

Handling a single bio requires a number of small operations across a number
of zones. The per-zone worker threads can be very busy, often using upwards
of 30% CPU time. Kernel work queues seem targeted at lighter workloads.
They do not let us easily prioritize individual tasks within a zone, and
they make CPU affinity control at a per-thread level more difficult.

The threads scanning and updating the in-memory portion of the
deduplication index process a large number of queries through a single
function. The index uses its own "request queue" mechanism to process these
efficiently in dedicated threads. In experiments using kernel work queues
for the index lookups, we observed an overall throughput drop of up to
almost 10%. In the following table, randwrite% and write% represent the
change in throughput when switching to kernel work queues for random and
sequential write workloads, respectively.

| compression% | deduplication% | randwrite% | write% |
|--------------+----------------+------------+--------|
|            0 |              0 |       -8.3 |   -6.4 |
|           55 |              0 |       -7.9 |   -8.5 |
|           90 |              0 |       -9.3 |   -8.9 |
|            0 |             50 |       -4.9 |   -4.5 |
|           55 |             50 |       -4.4 |   -4.4 |
|           90 |             50 |       -4.2 |   -4.7 |
|            0 |             90 |       -1.0 |    0.7 |
|           55 |             90 |        0.2 |   -0.4 |
|           90 |             90 |       -0.5 |    0.2 |

Mempools:

There are two types of object pools in the vdo implementation for which the
existing mempool structure was not appropriate. The first of these is the
set of pools of structures wrapping the bios used for vdo's metadata I/O.
Since each of these pools is only accessed from a single thread, the locking
done by mempool is a needless cost. The second, the single pool of wrappers
for incoming bios, has more complicated locking semantics than mempool
provides. When a thread attempts to submit a bio to vdo but the pool is
exhausted, the thread is put to sleep. The pool is designed to wake that
thread only once, when it is certain that that thread's bio will be
processed. It is not desirable to merely allocate more wrappers, because a
number of other vdo structures are designed to handle only a fixed number of
concurrent requests. This limit is also necessary to bound the amount of
work needed when recovering after a crash.
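
For the first case, a minimal sketch of the idea (not vdo's actual pool
code): because each pool is only ever touched from its owning thread, a
plain free list with no locking suffices, unlike mempool_alloc():

#include <linux/list.h>

struct metadata_vio {
        struct list_head entry;
        /* ... wrapped bio and per-vio state ... */
};

struct single_thread_pool {
        struct list_head free;  /* fixed population, owner thread only */
};

static struct metadata_vio *pool_get(struct single_thread_pool *pool)
{
        struct metadata_vio *vio;

        if (list_empty(&pool->free))
                return NULL;    /* caller waits until an entry is returned */

        vio = list_first_entry(&pool->free, struct metadata_vio, entry);
        list_del_init(&vio->entry);
        return vio;
}

static void pool_put(struct single_thread_pool *pool, struct metadata_vio *vio)
{
        list_add_tail(&vio->entry, &pool->free);
}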

MurmurHash:

MurmurHash3 was selected for its hash quality, performance on 4KB blocks,
and its 128-bit output size (vdo needs significantly more than 64 uniformly
distributed bits for its in-memory and on-disk indexing). For
cross-platform compatibility, vdo uses a modified version which always
produces the same output as the original x64 variant, rather than being
optimized per platform. There is no such hash function already in the
kernel.

J. corwin Coburn (39):
  Add documentation for dm-vdo.
  Add the MurmurHash3 fast hashing algorithm.
  Add memory allocation utilities.
  Add basic logging and support utilities.
  Add vdo type declarations, constants, and simple data structures.
  Add thread and synchronization utilities.
  Add specialized request queueing functionality.
  Add basic data structures.
  Add deduplication configuration structures.
  Add deduplication index storage interface.
  Implement the delta index.
  Implement the volume index.
  Implement the open chapter and chapter indexes.
  Implement the chapter volume store.
  Implement top-level deduplication index.
  Implement external deduplication index interface.
  Add administrative state and scheduling for vdo.
  Add vio, the request object for vdo metadata.
  Add data_vio, the request object which services incoming bios.
  Add flush support to vdo.
  Add the vdo io_submitter.
  Add hash locks and hash zones.
  Add use of the deduplication index in hash zones.
  Add the compressed block bin packer.
  Add vdo_slab.
  Add the slab summary.
  Add the block allocators and physical zones.
  Add the slab depot itself.
  Add the vdo block map.
  Implement the vdo block map page cache.
  Add the vdo recovery journal.
  Add repair (crash recovery and read-only rebuild) of damaged vdos.
  Add the vdo structure itself.
  Add the on-disk formats and marshalling of vdo structures.
  Add statistics tracking.
  Add sysfs support for setting vdo parameters and fetching statistics.
  Add vdo debugging support.
  Add dm-vdo-target.c
  Enable configuration and building of dm-vdo.

 .../admin-guide/device-mapper/vdo-design.rst  |  390 ++
 .../admin-guide/device-mapper/vdo.rst         |  386 ++
 drivers/md/Kconfig                            |   16 +
 drivers/md/Makefile                           |    2 +
 drivers/md/dm-vdo-target.c                    | 2983 ++++++++++
 drivers/md/dm-vdo/action-manager.c            |  410 ++
 drivers/md/dm-vdo/action-manager.h            |  117 +
 drivers/md/dm-vdo/admin-state.c               |  512 ++
 drivers/md/dm-vdo/admin-state.h               |  180 +
 drivers/md/dm-vdo/block-map.c                 | 3381 +++++++++++
 drivers/md/dm-vdo/block-map.h                 |  392 ++
 drivers/md/dm-vdo/chapter-index.c             |  304 +
 drivers/md/dm-vdo/chapter-index.h             |   66 +
 drivers/md/dm-vdo/completion.c                |  141 +
 drivers/md/dm-vdo/completion.h                |  155 +
 drivers/md/dm-vdo/config.c                    |  389 ++
 drivers/md/dm-vdo/config.h                    |  125 +
 drivers/md/dm-vdo/constants.c                 |   15 +
 drivers/md/dm-vdo/constants.h                 |  102 +
 drivers/md/dm-vdo/cpu.h                       |   58 +
 drivers/md/dm-vdo/data-vio.c                  | 2076 +++++++
 drivers/md/dm-vdo/data-vio.h                  |  683 +++
 drivers/md/dm-vdo/dedupe.c                    | 3073 ++++++++++
 drivers/md/dm-vdo/dedupe.h                    |  119 +
 drivers/md/dm-vdo/delta-index.c               | 2018 +++++++
 drivers/md/dm-vdo/delta-index.h               |  292 +
 drivers/md/dm-vdo/dump.c                      |  288 +
 drivers/md/dm-vdo/dump.h                      |   17 +
 drivers/md/dm-vdo/encodings.c                 | 1523 +++++
 drivers/md/dm-vdo/encodings.h                 | 1307 +++++
 drivers/md/dm-vdo/errors.c                    |  316 +
 drivers/md/dm-vdo/errors.h                    |   83 +
 drivers/md/dm-vdo/flush.c                     |  563 ++
 drivers/md/dm-vdo/flush.h                     |   44 +
 drivers/md/dm-vdo/funnel-queue.c              |  169 +
 drivers/md/dm-vdo/funnel-queue.h              |  110 +
 drivers/md/dm-vdo/geometry.c                  |  205 +
 drivers/md/dm-vdo/geometry.h                  |  137 +
 drivers/md/dm-vdo/hash-utils.h                |   66 +
 drivers/md/dm-vdo/index-layout.c              | 1775 ++++++
 drivers/md/dm-vdo/index-layout.h              |   42 +
 drivers/md/dm-vdo/index-page-map.c            |  181 +
 drivers/md/dm-vdo/index-page-map.h            |   54 +
 drivers/md/dm-vdo/index-session.c             |  815 +++
 drivers/md/dm-vdo/index-session.h             |   84 +
 drivers/md/dm-vdo/index.c                     | 1403 +++++
 drivers/md/dm-vdo/index.h                     |   83 +
 drivers/md/dm-vdo/int-map.c                   |  710 +++
 drivers/md/dm-vdo/int-map.h                   |   40 +
 drivers/md/dm-vdo/io-factory.c                |  458 ++
 drivers/md/dm-vdo/io-factory.h                |   66 +
 drivers/md/dm-vdo/io-submitter.c              |  483 ++
 drivers/md/dm-vdo/io-submitter.h              |   52 +
 drivers/md/dm-vdo/logger.c                    |  304 +
 drivers/md/dm-vdo/logger.h                    |  112 +
 drivers/md/dm-vdo/logical-zone.c              |  378 ++
 drivers/md/dm-vdo/logical-zone.h              |   87 +
 drivers/md/dm-vdo/memory-alloc.c              |  447 ++
 drivers/md/dm-vdo/memory-alloc.h              |  181 +
 drivers/md/dm-vdo/message-stats.c             | 1222 ++++
 drivers/md/dm-vdo/message-stats.h             |   13 +
 drivers/md/dm-vdo/murmurhash3.c               |  175 +
 drivers/md/dm-vdo/murmurhash3.h               |   15 +
 drivers/md/dm-vdo/numeric.h                   |   78 +
 drivers/md/dm-vdo/open-chapter.c              |  433 ++
 drivers/md/dm-vdo/open-chapter.h              |   79 +
 drivers/md/dm-vdo/packer.c                    |  794 +++
 drivers/md/dm-vdo/packer.h                    |  123 +
 drivers/md/dm-vdo/permassert.c                |   35 +
 drivers/md/dm-vdo/permassert.h                |   65 +
 drivers/md/dm-vdo/physical-zone.c             |  650 ++
 drivers/md/dm-vdo/physical-zone.h             |  115 +
 drivers/md/dm-vdo/pointer-map.c               |  691 +++
 drivers/md/dm-vdo/pointer-map.h               |   81 +
 drivers/md/dm-vdo/pool-sysfs-stats.c          | 2063 +++++++
 drivers/md/dm-vdo/pool-sysfs.c                |  193 +
 drivers/md/dm-vdo/pool-sysfs.h                |   19 +
 drivers/md/dm-vdo/priority-table.c            |  226 +
 drivers/md/dm-vdo/priority-table.h            |   48 +
 drivers/md/dm-vdo/radix-sort.c                |  349 ++
 drivers/md/dm-vdo/radix-sort.h                |   28 +
 drivers/md/dm-vdo/recovery-journal.c          | 1772 ++++++
 drivers/md/dm-vdo/recovery-journal.h          |  313 +
 drivers/md/dm-vdo/release-versions.h          |   20 +
 drivers/md/dm-vdo/repair.c                    | 1775 ++++++
 drivers/md/dm-vdo/repair.h                    |   14 +
 drivers/md/dm-vdo/request-queue.c             |  284 +
 drivers/md/dm-vdo/request-queue.h             |   30 +
 drivers/md/dm-vdo/slab-depot.c                | 5210 +++++++++++++++++
 drivers/md/dm-vdo/slab-depot.h                |  594 ++
 drivers/md/dm-vdo/sparse-cache.c              |  595 ++
 drivers/md/dm-vdo/sparse-cache.h              |   49 +
 drivers/md/dm-vdo/statistics.h                |  279 +
 drivers/md/dm-vdo/status-codes.c              |  126 +
 drivers/md/dm-vdo/status-codes.h              |  112 +
 drivers/md/dm-vdo/string-utils.c              |   28 +
 drivers/md/dm-vdo/string-utils.h              |   23 +
 drivers/md/dm-vdo/sysfs.c                     |   84 +
 drivers/md/dm-vdo/thread-cond-var.c           |   46 +
 drivers/md/dm-vdo/thread-device.c             |   35 +
 drivers/md/dm-vdo/thread-device.h             |   19 +
 drivers/md/dm-vdo/thread-registry.c           |   93 +
 drivers/md/dm-vdo/thread-registry.h           |   33 +
 drivers/md/dm-vdo/time-utils.h                |   28 +
 drivers/md/dm-vdo/types.h                     |  403 ++
 drivers/md/dm-vdo/uds-sysfs.c                 |  185 +
 drivers/md/dm-vdo/uds-sysfs.h                 |   12 +
 drivers/md/dm-vdo/uds-threads.c               |  189 +
 drivers/md/dm-vdo/uds-threads.h               |  126 +
 drivers/md/dm-vdo/uds.h                       |  334 ++
 drivers/md/dm-vdo/vdo.c                       | 1846 ++++++
 drivers/md/dm-vdo/vdo.h                       |  381 ++
 drivers/md/dm-vdo/vio.c                       |  525 ++
 drivers/md/dm-vdo/vio.h                       |  221 +
 drivers/md/dm-vdo/volume-index.c              | 1272 ++++
 drivers/md/dm-vdo/volume-index.h              |  192 +
 drivers/md/dm-vdo/volume.c                    | 1792 ++++++
 drivers/md/dm-vdo/volume.h                    |  174 +
 drivers/md/dm-vdo/wait-queue.c                |  223 +
 drivers/md/dm-vdo/wait-queue.h                |  129 +
 drivers/md/dm-vdo/work-queue.c                |  659 +++
 drivers/md/dm-vdo/work-queue.h                |   53 +
 122 files changed, 58741 insertions(+)
 create mode 100644 Documentation/admin-guide/device-mapper/vdo-design.rst
 create mode 100644 Documentation/admin-guide/device-mapper/vdo.rst
 create mode 100644 drivers/md/dm-vdo-target.c
 create mode 100644 drivers/md/dm-vdo/action-manager.c
 create mode 100644 drivers/md/dm-vdo/action-manager.h
 create mode 100644 drivers/md/dm-vdo/admin-state.c
 create mode 100644 drivers/md/dm-vdo/admin-state.h
 create mode 100644 drivers/md/dm-vdo/block-map.c
 create mode 100644 drivers/md/dm-vdo/block-map.h
 create mode 100644 drivers/md/dm-vdo/chapter-index.c
 create mode 100644 drivers/md/dm-vdo/chapter-index.h
 create mode 100644 drivers/md/dm-vdo/completion.c
 create mode 100644 drivers/md/dm-vdo/completion.h
 create mode 100644 drivers/md/dm-vdo/config.c
 create mode 100644 drivers/md/dm-vdo/config.h
 create mode 100644 drivers/md/dm-vdo/constants.c
 create mode 100644 drivers/md/dm-vdo/constants.h
 create mode 100644 drivers/md/dm-vdo/cpu.h
 create mode 100644 drivers/md/dm-vdo/data-vio.c
 create mode 100644 drivers/md/dm-vdo/data-vio.h
 create mode 100644 drivers/md/dm-vdo/dedupe.c
 create mode 100644 drivers/md/dm-vdo/dedupe.h
 create mode 100644 drivers/md/dm-vdo/delta-index.c
 create mode 100644 drivers/md/dm-vdo/delta-index.h
 create mode 100644 drivers/md/dm-vdo/dump.c
 create mode 100644 drivers/md/dm-vdo/dump.h
 create mode 100644 drivers/md/dm-vdo/encodings.c
 create mode 100644 drivers/md/dm-vdo/encodings.h
 create mode 100644 drivers/md/dm-vdo/errors.c
 create mode 100644 drivers/md/dm-vdo/errors.h
 create mode 100644 drivers/md/dm-vdo/flush.c
 create mode 100644 drivers/md/dm-vdo/flush.h
 create mode 100644 drivers/md/dm-vdo/funnel-queue.c
 create mode 100644 drivers/md/dm-vdo/funnel-queue.h
 create mode 100644 drivers/md/dm-vdo/geometry.c
 create mode 100644 drivers/md/dm-vdo/geometry.h
 create mode 100644 drivers/md/dm-vdo/hash-utils.h
 create mode 100644 drivers/md/dm-vdo/index-layout.c
 create mode 100644 drivers/md/dm-vdo/index-layout.h
 create mode 100644 drivers/md/dm-vdo/index-page-map.c
 create mode 100644 drivers/md/dm-vdo/index-page-map.h
 create mode 100644 drivers/md/dm-vdo/index-session.c
 create mode 100644 drivers/md/dm-vdo/index-session.h
 create mode 100644 drivers/md/dm-vdo/index.c
 create mode 100644 drivers/md/dm-vdo/index.h
 create mode 100644 drivers/md/dm-vdo/int-map.c
 create mode 100644 drivers/md/dm-vdo/int-map.h
 create mode 100644 drivers/md/dm-vdo/io-factory.c
 create mode 100644 drivers/md/dm-vdo/io-factory.h
 create mode 100644 drivers/md/dm-vdo/io-submitter.c
 create mode 100644 drivers/md/dm-vdo/io-submitter.h
 create mode 100644 drivers/md/dm-vdo/logger.c
 create mode 100644 drivers/md/dm-vdo/logger.h
 create mode 100644 drivers/md/dm-vdo/logical-zone.c
 create mode 100644 drivers/md/dm-vdo/logical-zone.h
 create mode 100644 drivers/md/dm-vdo/memory-alloc.c
 create mode 100644 drivers/md/dm-vdo/memory-alloc.h
 create mode 100644 drivers/md/dm-vdo/message-stats.c
 create mode 100644 drivers/md/dm-vdo/message-stats.h
 create mode 100644 drivers/md/dm-vdo/murmurhash3.c
 create mode 100644 drivers/md/dm-vdo/murmurhash3.h
 create mode 100644 drivers/md/dm-vdo/numeric.h
 create mode 100644 drivers/md/dm-vdo/open-chapter.c
 create mode 100644 drivers/md/dm-vdo/open-chapter.h
 create mode 100644 drivers/md/dm-vdo/packer.c
 create mode 100644 drivers/md/dm-vdo/packer.h
 create mode 100644 drivers/md/dm-vdo/permassert.c
 create mode 100644 drivers/md/dm-vdo/permassert.h
 create mode 100644 drivers/md/dm-vdo/physical-zone.c
 create mode 100644 drivers/md/dm-vdo/physical-zone.h
 create mode 100644 drivers/md/dm-vdo/pointer-map.c
 create mode 100644 drivers/md/dm-vdo/pointer-map.h
 create mode 100644 drivers/md/dm-vdo/pool-sysfs-stats.c
 create mode 100644 drivers/md/dm-vdo/pool-sysfs.c
 create mode 100644 drivers/md/dm-vdo/pool-sysfs.h
 create mode 100644 drivers/md/dm-vdo/priority-table.c
 create mode 100644 drivers/md/dm-vdo/priority-table.h
 create mode 100644 drivers/md/dm-vdo/radix-sort.c
 create mode 100644 drivers/md/dm-vdo/radix-sort.h
 create mode 100644 drivers/md/dm-vdo/recovery-journal.c
 create mode 100644 drivers/md/dm-vdo/recovery-journal.h
 create mode 100644 drivers/md/dm-vdo/release-versions.h
 create mode 100644 drivers/md/dm-vdo/repair.c
 create mode 100644 drivers/md/dm-vdo/repair.h
 create mode 100644 drivers/md/dm-vdo/request-queue.c
 create mode 100644 drivers/md/dm-vdo/request-queue.h
 create mode 100644 drivers/md/dm-vdo/slab-depot.c
 create mode 100644 drivers/md/dm-vdo/slab-depot.h
 create mode 100644 drivers/md/dm-vdo/sparse-cache.c
 create mode 100644 drivers/md/dm-vdo/sparse-cache.h
 create mode 100644 drivers/md/dm-vdo/statistics.h
 create mode 100644 drivers/md/dm-vdo/status-codes.c
 create mode 100644 drivers/md/dm-vdo/status-codes.h
 create mode 100644 drivers/md/dm-vdo/string-utils.c
 create mode 100644 drivers/md/dm-vdo/string-utils.h
 create mode 100644 drivers/md/dm-vdo/sysfs.c
 create mode 100644 drivers/md/dm-vdo/thread-cond-var.c
 create mode 100644 drivers/md/dm-vdo/thread-device.c
 create mode 100644 drivers/md/dm-vdo/thread-device.h
 create mode 100644 drivers/md/dm-vdo/thread-registry.c
 create mode 100644 drivers/md/dm-vdo/thread-registry.h
 create mode 100644 drivers/md/dm-vdo/time-utils.h
 create mode 100644 drivers/md/dm-vdo/types.h
 create mode 100644 drivers/md/dm-vdo/uds-sysfs.c
 create mode 100644 drivers/md/dm-vdo/uds-sysfs.h
 create mode 100644 drivers/md/dm-vdo/uds-threads.c
 create mode 100644 drivers/md/dm-vdo/uds-threads.h
 create mode 100644 drivers/md/dm-vdo/uds.h
 create mode 100644 drivers/md/dm-vdo/vdo.c
 create mode 100644 drivers/md/dm-vdo/vdo.h
 create mode 100644 drivers/md/dm-vdo/vio.c
 create mode 100644 drivers/md/dm-vdo/vio.h
 create mode 100644 drivers/md/dm-vdo/volume-index.c
 create mode 100644 drivers/md/dm-vdo/volume-index.h
 create mode 100644 drivers/md/dm-vdo/volume.c
 create mode 100644 drivers/md/dm-vdo/volume.h
 create mode 100644 drivers/md/dm-vdo/wait-queue.c
 create mode 100644 drivers/md/dm-vdo/wait-queue.h
 create mode 100644 drivers/md/dm-vdo/work-queue.c
 create mode 100644 drivers/md/dm-vdo/work-queue.h

Comments

Eric Biggers May 23, 2023, 10:40 p.m. UTC | #1
On Tue, May 23, 2023 at 05:45:00PM -0400, J. corwin Coburn wrote:
> The dm-vdo target provides inline deduplication, compression, zero-block
> elimination, and thin provisioning. A dm-vdo target can be backed by up to
> 256TB of storage, and can present a logical size of up to 4PB. This target
> was originally developed at Permabit Technology Corp. starting in 2009. It
> was first released in 2013 and has been used in production environments
> ever since. It was made open-source in 2017 after Permabit was acquired by
> Red Hat.

As with any kernel patchset, please mention the git commit that it applies to.
This can be done using the --base option to 'git format-patch'.

- Eric

Matthew Sakai May 30, 2023, 11:03 p.m. UTC | #2
On 5/23/23 18:40, Eric Biggers wrote:
> On Tue, May 23, 2023 at 05:45:00PM -0400, J. corwin Coburn wrote:
>> The dm-vdo target provides inline deduplication, compression, zero-block
>> elimination, and thin provisioning. A dm-vdo target can be backed by up to
>> 256TB of storage, and can present a logical size of up to 4PB. This target
>> was originally developed at Permabit Technology Corp. starting in 2009. It
>> was first released in 2013 and has been used in production environments
>> ever since. It was made open-source in 2017 after Permabit was acquired by
>> Red Hat.
> 
> As with any kernel patchset, please mention the git commit that it applies to.
> This can be done using the --base option to 'git format-patch'.

This will be in the next version of the patch set.

> - Eric
> 

Mike Snitzer July 18, 2023, 3:51 p.m. UTC | #3
[Top-posting to provide general update and historic context]

Hi all,

Corwin has decided to leave Red Hat. I want to thank Corwin for his
efforts in developing and/or leading the development of changes to the
VDO codebase (including those that were a prereq for VDO's upstream
submission).

Matt Sakai (cc'd) and I will be meeting regularly to continue the work
of driving VDO changes needed for the code to be what I consider
"ready" for upstream Linux inclusion.  Matt has a lot of learning to
do about Linux kernel development (most of his work was in userspace
since VDO can be compiled and used in userspace).  Conversely, I
still have much to learn about the VDO codebase.  Hopefully we get
over our respective learning curves quicker by working together.

My following email is nearly 6 years old and reflects the list of
VDO changes initially determined to be needed soon after Red Hat
acquired Permabit:

On Wed, Aug 30 2017 at 11:52P -0400,
Mike Snitzer <snitzer@redhat.com> wrote:

> The following list covers what Joe and I feel is a good first step at
> starting to prepare the permabit kernel code for upstream.  This will be
> an iterative process but working through these issues is a start:
> 
> 1) coding style must adhere to the Linux kernel coding style, see
>    Documentation/process/coding-style.rst
>    - camelCase needs to be converted to lower_case_with_under_scores
>    - spaces must be converted to tabs with tab-width of 8
>    - those are the 2 big style issues but scripts/checkpatch.pl will
>      obviously complain about many other issues that will likely need
>      addressing.
> 
> 2) use GPL interfaces that the permabit kernel code previously was
>    unable to use due to licensing
>    - please work through the ones you are aware of
>    - but other more complex data structures will need to be gone over at
>      some point (e.g. Permabit's workqueue, rcu, linked-list, etc)
> 
> 3) the test harness abstraction code needs to be removed
>    - this obviously will require engineering an alternative to having an
>      in kernel abstraction for compiling in userspace.  That could take
>      the form of just switching test coverage to more traditional DM
>      tests against the permabit device interfaces with IO workloads and
>      ioctls (e.g. device-mapper-test-suite's testing for dm-thinp and
>      dm-cache).
>      Or some new novel way to elevate the kernel code up to userspace
>      for userspace testing.  In any case, this level of test harness is
>      _not_ going to fly in the context of upstream inclusion.
>    - the intermediate data type wrappers must be removed; native kernel
>      types must be used
> 
> 4) rename the Permabit DM target
>    - switch from dm-dedupe to dm-vdo?
>    - dm-vdo would seem to reduce renaming throughout the DM target given
>      "VDO" is used extensively
> 
> bonus for the first round of work items:
> 
> 5) the ioctl interfaces need to be reviewed/refactored
>    - not aware why the Permabit DM target cannot just use the ioctl
>      interfaces that every other kernel DM target uses.
>    - though obviously the traditional DM status output is less than
>      ideal given it is positional rather than name=value pairs, etc.
>    - it could be that we can compromise by adding name=value pair
>      support to aide this effort
> 
> 6) extensive in-kernel code comment blocks need to be reduced/removed
>    - quite a few ramble without _really_ saying much
> 
> Obviously some of the above work items are more involved than others
> (e.g. item #3 above).  We're happy to discuss.
> 
> Thanks,
> Mike

The bulk of the above list (and more that wasn't on it) has been
resolved.  It took ~6 years because corwin and the VDO team were also
supporting VDO in production, with historic and new customers, while
also working on incrementally addressing the above list (and more).

But the long-standing dependency on VDO's work-queue data
struct is still lingering (drivers/md/dm-vdo/work-queue.c). At a
minimum we need to work toward pinning down _exactly_ why that is, and
I think the best way to answer that is by simply converting the VDO
code over to using Linux's workqueues.  If doing so causes serious
inherent performance (or functionality) loss then we need to
understand why -- and fix Linux's workqueue code accordingly. (I've
cc'd Tejun so he is aware).

Also, VDO's historic use of murmurhash3 predates there being any
alternative hash that met its requirements.  There was a brief discussion
of valid alternatives; see:
https://listman.redhat.com/archives/dm-devel/2023-May/054267.html
So improving the interface so that the chosen hash is selectable
(while still allowing murmurhash3 to be selected for backward compat)
would be ideal.

Lastly, one of the most challenging problems that VDO currently has
is that discard performance is very slow.  VDO isn't a fast target in
general (born of its need for 4K IO to achieve the best dedup results),
but its discard support is a well-known VDO pain-point that I need to
understand more clearly.

I'm sure there will be other things that elevate to needing further
review and scrutiny.  This will _not_ be a _quick_ process.

But I've always wanted the VDO team to be extremely successful and
look forward to working closer with them on trying to prepare VDO for
upstream inclusion so it can stick the landing.

All said, I really would welcome review from others on the dm-devel and
linux-block mailing lists.  I'll take any and all technical concerns
under advisement and work through them.

Thanks,
Mike


On Tue, May 23 2023 at  5:45P -0400,
J. corwin Coburn <corwin@redhat.com> wrote:

> The dm-vdo target provides inline deduplication, compression, zero-block
> elimination, and thin provisioning. A dm-vdo target can be backed by up to
> 256TB of storage, and can present a logical size of up to 4PB. This target
> was originally developed at Permabit Technology Corp. starting in 2009. It
> was first released in 2013 and has been used in production environments
> ever since. It was made open-source in 2017 after Permabit was acquired by
> Red Hat.

Kenneth Raeburn July 22, 2023, 1:59 a.m. UTC | #4
On Tue, Jul 18, 2023 at 11:51 AM Mike Snitzer <snitzer@kernel.org> wrote:

>
> But the long-standing dependency on VDO's work-queue data
> struct is still lingering (drivers/md/dm-vdo/work-queue.c). At a
> minimum we need to work toward pinning down _exactly_ why that is, and
> I think the best way to answer that is by simply converting the VDO
> code over to using Linux's workqueues.  If doing so causes serious
> inherent performance (or functionality) loss then we need to
> understand why -- and fix Linux's workqueue code accordingly. (I've
> cc'd Tejun so he is aware).
>

 We tried this experiment and did indeed see some significant performance
differences. Nearly a 7x slowdown in some cases.

VDO can be pretty CPU-intensive. In addition to hashing and compression, it
scans some big in-memory data structures as part of the deduplication
process. Some data structures are split across one or more "zones" to
enable concurrency (usually split based on bits of an address or something
like that), but some are not, and a couple of those threads can sometimes
exceed 50% CPU utilization, even 90% depending on the system and test data
configuration. (Usually this is while pushing over 1GB/s through the
deduplication and compression processing on a system with fast storage. On
a slow VM with spinning storage, the CPU load is much smaller.)

We use a sort of message-passing arrangement where a worker thread is
responsible for updating certain data structures as needed for the I/Os in
progress, rather than having the processing of each I/O contend for locks
on the data structures. It gives us some good throughput under load but it
does mean upwards of a dozen handoffs per 4kB write, depending on
compressibility, whether the block is a duplicate, and various other
factors. So processing 1 GB/s means handling over 3M messages per second,
though each step of processing is generally lightweight. For our dedicated
worker threads, it's not unusual for a thread to wake up and process a few
tens or even hundreds of updates to its data structures (likely benefiting
from CPU caching of the data structures) before running out of available
work and going back to sleep.

The experiment I ran was to replace each dedicated thread where we need
serialization with an ordered workqueue, and to use unordered workqueues
where concurrency is allowed. On our slower test systems (> 10y old Supermicro
Xeon E5-1650 v2, RAID-0 storage using SSDs or HDDs), the slowdown was less
significant (under 2x), but on our faster system (4-5? year old Supermicro
1029P-WTR, 2x Xeon Gold 6128 = 12 cores, NVMe storage) we got nearly a 7x
slowdown overall. I haven't yet dug deeply into _why_ the kernel work
queues are slower in this sort of setup. I did run "perf top" briefly
during one test with kernel work queues, and the largest single use of CPU
cycles was in spin lock acquisition, but I didn't get call graphs.
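
In rough outline the conversion looked like this (a sketch with
illustrative names, not the actual patch):

static int make_queues(struct vdo_threads *threads)
{
	/* Ordered workqueue where we previously had one serializing
	 * thread; a plain concurrent workqueue elsewhere. */
	threads->journal_wq = alloc_ordered_workqueue("vdo_journal",
						      WQ_MEM_RECLAIM);
	threads->cpu_wq = alloc_workqueue("vdo_cpu",
					  WQ_MEM_RECLAIM | WQ_CPU_INTENSIVE,
					  0);
	if (!threads->journal_wq || !threads->cpu_wq)
		return -ENOMEM;
	return 0;
}

Each handoff then becomes a queue_work() call on the appropriate
workqueue instead of an enqueue onto a dedicated thread's queue.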

(This was with Fedora 37 6.2.12-200 and 6.2.15-200 kernels, without the
latest submissions from Tejun, which look interesting. Though I suspect we
care more about cache locality for some of our thread-specific data
structures than for accessing the I/O structures.)

Ken
Sweet Tea Dorminy July 23, 2023, 6:24 a.m. UTC | #5
> We use a sort of message-passing arrangement where a worker thread is 
> responsible for updating certain data structures as needed for the I/Os 
> in progress, rather than having the processing of each I/O contend for 
> locks on the data structures. It gives us some good throughput under
> load but it does mean upwards of a dozen handoffs per 4kB write,
> depending on compressibility, whether the block is a duplicate, and
> various other factors. So processing 1 GB/s means handling over 3M
> messages per second, though each step of processing is generally
> lightweight.

  There seems a natural duality between
work items passing between threads, each exclusively owning a structure, 
vs structures passing between threads, each exclusively owning a work 
item. In the first, the threads are grabbing a notional 'lock' on each 
item in turn to deal with their structure, as VDO does now; in the 
second, the threads are grabbing locks on each structure in turn to deal 
with their item.

If kernel workqueues have higher overhead per item for the lightweight 
work VDO currently does in each step, perhaps the dual of the current 
scheme would let more work get done per fixed queuing overhead, and thus 
perform better? VIOs could take locks on sections of structures, and 
operate on multiple structures before requeueing.

This might also enable more fine-grained locking of structures than the 
chunks uniquely owned by threads at the moment. It would also be 
attractive to let the kernel work queues deal with concurrency 
management instead of configuring the number of threads for each of a 
bunch of different structures at start time.

On the other hand, I played around with switching message passing to 
structure locking in VDO a number of years ago for fun on the side, just 
extremely naively replacing each message passing with releasing a mutex 
on the current set of structures and (trying to) take a mutex on the 
next set of structures, and ran into some complexity around certain 
ordering requirements. I think they were around recovery journal entries 
going into the slab journal and the block map in the same order; and 
also around the use of different priorities for some different items. I 
don't have that code anymore, unfortunately, so I don't know how hard it 
would be to try that experiment again.

Sweet Tea

Kenneth Raeburn July 26, 2023, 11:32 p.m. UTC | #7
An offline discussion suggested maybe I should've gone into a little
more detail about how VDO uses its work queues.

VDO is sufficiently work-intensive that we found long ago that doing all
the work in one thread wouldn't keep up.

Our multithreaded design started many years ago and grew out of our
existing design for UDS (VDO's central deduplication index), which,
somewhat akin to partitioning and sharding in databases, does scanning
of the in-memory part of the "database" of values in some number (fixed
at startup) of threads, with the data and work divided up based on
certain bits of the hash value being looked up, and performs its I/O and
callbacks from certain other threads. We aren't splitting work to
multiple machines as database systems sometimes do, but to multiple
threads and potentially multiple NUMA nodes.

We try to optimize for keeping the busy case fast, even if it means
light usage loads don't perform quite as well as they could be made to.
We try to reduce instances of contention between threads by avoiding
locks when we can, preferring a fast queueing mechanism or loose
synchronization between threads. (We haven't kept to it strictly, but
we've mostly tried to.)

In VDO, at the first level, the work is split according to the
collection of data structures to be updated (e.g., recovery journal vs
disk block allocation vs block address mapping management).

For some data structures, we split the structures further based on
values of relevant bit-strings for the data structure in question (block
addresses, hash values). Currently we can split the work N ways for many
small values of N but it's hard to change N without restarting. The
processing of a read or write operation generally doesn't need to touch
more than one "zone" in any of these sets (or two, in a certain write
case).
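
To make the split concrete, choosing the zone is just a function of a
few bits, along the lines of (illustrative only, not the real code):

static inline unsigned int select_zone(u64 name_bits,
				       unsigned int zone_count)
{
	/* Any evenly distributed bits of the hash or address will do;
	 * zone_count is fixed when the target is started. */
	return (unsigned int)(name_bits % zone_count);
}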

Giving one thread exclusive access to the data structures means we can
do away with the locking. Of course, with so many different threads
owning data structures, we get a lot of queueing in exchange, but we
depend on a fast, nearly-lock-free MPSC queueing mechanism to keep that
reasonably efficient.

There's a little more to it in places where we need to preserve the
order of processing of multiple VIOs in a couple different sections of
the write path. So we do make some higher-level use of the fact that
we're adding work to queues with certain behavior, and not just turning
loose a bunch of threads to contend for a just-released mutex.

Some other bits of work like computing the hash value don't update any
other data structures, and not only would be amenable to kernel
workqueue conversion with concurrency greater than 1, but such a
conversion might open up some interesting options, like hashing on the
CPU or NUMA node where the data block is likely to reside in cache. But
for now, using one work management mechanism has been easier than two.

The experiment I referred to in my earlier email with using kernel
workqueues in VDO kept the same model of protecting data structures by
making them exclusive to specific threads (or in this case,
concurrency-1 workqueues) to serialize all access and using message
passing; it didn't change everything over to using mutexes instead.

I hope some of this helps. I'm happy to answer further questions.

Ken

Kenneth Raeburn July 26, 2023, 11:33 p.m. UTC | #8
Sweet Tea Dorminy <sweettea-kernel@dorminy.me> writes:

>  There seems a natural duality between
> work items passing between threads, each exclusively owning a
> structure, vs structures passing between threads, each exclusively
> owning a work item. In the first, the threads are grabbing a notional
> 'lock' on each item in turn to deal with their structure, as VDO does
> now; in the second, the threads are grabbing locks on each structure
> in turn to deal with their item.

Yes.

> If kernel workqueues have higher overhead per item for the lightweight
> work VDO currently does in each step, perhaps the dual of the current
> scheme would let more work get done per fixed queuing overhead, and
> thus perform better? VIOs could take locks on sections of structures,
> and operate on multiple structures before requeueing.

Can you suggest a little more specifically what the "dual" is you're
picturing?

[...]
> On the other hand, I played around with switching message passing to
> structure locking in VDO a number of years ago for fun on the side,
> just extremely naively replacing each message passing with releasing a
> mutex on the current set of structures and (trying to) take a mutex on
> the next set of structures, and ran into some complexity around
> certain ordering requirements. I think they were around recovery
> journal entries going into the slab journal and the block map in the
> same order; and also around the use of different priorities for some
> different items. I don't have that code anymore, unfortunately, so I
> don't know how hard it would be to try that experiment again.

Yes, we do have certain ordering requirements in one or two places,
which sort of breaks the mental model of independently processed VIOs.
There are also occasionally non-VIO objects which get queued to invoke
actions on various threads, which I expect might further complicate the
experiment.

Ken

Mike Snitzer July 27, 2023, 2:57 p.m. UTC | #9
On Wed, Jul 26 2023 at  7:32P -0400,
Ken Raeburn <raeburn@redhat.com> wrote:

> 
> An offline discussion suggested maybe I should've gone into a little
> more detail about how VDO uses its work queues.
> 
> VDO is sufficiently work-intensive that we found long ago that doing all
> the work in one thread wouldn't keep up.
> 
> Our multithreaded design started many years ago and grew out of our
> existing design for UDS (VDO's central deduplication index), which,
> somewhat akin to partitioning and sharding in databases, does scanning
> of the in-memory part of the "database" of values in some number (fixed
> at startup) of threads, with the data and work divided up based on
> certain bits of the hash value being looked up, and performs its I/O and
> callbacks from certain other threads. We aren't splitting work to
> multiple machines as database systems sometimes do, but to multiple
> threads and potentially multiple NUMA nodes.
> 
> We try to optimize for keeping the busy case fast, even if it means
> light usage loads don't perform quite as well as they could be made to.
> We try to reduce instances of contention between threads by avoiding
> locks when we can, preferring a fast queueing mechanism or loose
> synchronization between threads. (We haven't kept to it strictly, but
> we've mostly tried to.)
> 
> In VDO, at the first level, the work is split according to the
> collection of data structures to be updated (e.g., recovery journal vs
> disk block allocation vs block address mapping management).
> 
> For some data structures, we split the structures further based on
> values of relevant bit-strings for the data structure in question (block
> addresses, hash values). Currently we can split the work N ways for many
> small values of N but it's hard to change N without restarting. The
> processing of a read or write operation generally doesn't need to touch
> more than one "zone" in any of these sets (or two, in a certain write
> case).
> 
> Giving one thread exclusive access to the data structures means we can
> do away with the locking. Of course, with so many different threads
> owning data structures, we get a lot of queueing in exchange, but we
> depend on a fast, nearly-lock-free MPSC queueing mechanism to keep that
> reasonably efficient.
> 
> There's a little more to it in places where we need to preserve the
> order of processing of multiple VIOs in a couple different sections of
> the write path. So we do make some higher-level use of the fact that
> we're adding work to queues with certain behavior, and not just turning
> loose a bunch of threads to contend for a just-released mutex.
> 
> Some other bits of work like computing the hash value don't update any
> other data structures, and not only would be amenable to kernel
> workqueue conversion with concurrency greater than 1, but such a
> conversion might open up some interesting options, like hashing on the
> CPU or NUMA node where the data block is likely to reside in cache. But
> for now, using one work management mechanism has been easier than two.
> 
> The experiment I referred to in my earlier email with using kernel
> workqueues in VDO kept the same model of protecting data structures by
> making them exclusive to specific threads (or in this case,
> concurrency-1 workqueues) to serialize all access and using message
> passing; it didn't change everything over to using mutexes instead.
> 
> I hope some of this helps. I'm happy to answer further questions.
> 
> Ken
> 

Thanks for the extra context, but a _big_ elephant in the room for
this line of discussion is that: the Linux workqueue code has
basically always been only available for use by GPL'd code.  Given
VDO's historic non-GPL origins, it seems _to me_ that an alternative
to Linux's workqueues had to be created to allow VDO to drive its
work.  While understandable, I gave guidance 6 years ago that VDO
engineering should work to definitively reconcile whether using Linux
workqueues is viable now that VDO has been GPL'd.

But it appears there wasn't much in the way of serious effort put to
completely converting to using Linux workqueues.  That is a problem
because all of the work item strategy deployed by VDO is quite
bespoke.  I don't think the code lends itself to being properly
maintained by more than a 1 or 2 engineers (if we're lucky at this
point).

And while I appreciate that the prospect of _seriously_ converting
over to using Linux workqueues is itself a destabilizing and challenging
effort: it seems that it needs to be done to legitimately position the
code to go upstream.

I would like to see a patch crafted that allows branching between the
use of Linux and VDO workqueues. Initially have a dm-vdo modparam
(e.g. use_vdo_wq or vice-versa: use_linux_wq).  And have a wrapping
interface and associated data struct(s) that can bridge between work
being driven/coordinated by either (depending on disposition of
modparam).
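
Something along these lines is what I'm picturing (just a sketch; all
of the names below are illustrative, not existing VDO interfaces):

static bool use_vdo_wq = true;
module_param(use_vdo_wq, bool, 0444);

struct vdo_wq_bridge {
	struct vdo_work_queue *vdo_wq;     /* existing VDO queue */
	struct workqueue_struct *linux_wq; /* used when !use_vdo_wq */
};

static void bridge_enqueue(struct vdo_wq_bridge *bridge,
			   struct vdo_work *item)
{
	if (use_vdo_wq)
		vdo_wq_enqueue(bridge->vdo_wq, item); /* name assumed */
	else
		/* assumes an embedded struct work_struct in the item */
		queue_work(bridge->linux_wq, &item->work);
}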

This work isn't trivial, I get that. But it serves to clearly showcase
shortcomings, areas for improvement, while pivoting to more standard
Linux interfaces that really should've been used from VDO's inception.

Is this work that you feel you could focus on with urgency?

Thanks,
Mike

Sweet Tea Dorminy July 27, 2023, 3:29 p.m. UTC | #10
> 
>> If kernel workqueues have higher overhead per item for the lightweight
>> work VDO currently does in each step, perhaps the dual of the current
>> scheme would let more work get done per fixed queuing overhead, and
>> thus perform better? VIOs could take locks on sections of structures,
>> and operate on multiple structures before requeueing.
> 
> Can you suggest a little more specifically what the "dual" is you're
> picturing?

It sounds like your experiment consisted of one kernel workqueue per 
existing thread, with VIOs queueing on each thread in turn precisely as 
they do at present, so that when the VIO work item is running it's 
guaranteed to be the unique actor on a particular set of structures 
(e.g. for a physical thread the physical zone and slabs).

I am thinking of an alternate scheme where e.g. each slab, each block 
map zone, each packer would be protected by a lock instead of owned by a 
thread. There would be one workqueue with concurrency allowed where all 
VIOs would operate.

VIOs would do an initial queuing on a kernel workqueue, and then when 
the VIO work item would run, they'd take and hold the appropriate locks 
while they operated on each structure. So they'd take and release slab 
locks until they found a free block; send off to UDS and get requeued 
when it came back or the timer expired; try to compress and take/release 
a lock on the packer while adding itself to a bin and get requeued if 
appropriate when the packer released it; write and requeue when the 
write finishes if relevant. Then I think the 'make whatever modification 
to structures is relevant' part can be done without any requeue: take 
and release the recovery journal lock; ditto on the relevant slab; again 
the journal; again the other slab; then the part of the block map; etc.
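
In code the idea is roughly this (everything here is invented for
illustration, not the existing VDO structures):

static void vio_update_structures(struct work_struct *work)
{
	struct vio *vio = container_of(work, struct vio, work);

	/* Hold each structure's lock only for the brief update it
	 * needs, instead of handing the vio to the owning thread. */
	spin_lock(&vio->journal->lock);
	add_recovery_journal_entry(vio->journal, vio);
	spin_unlock(&vio->journal->lock);

	spin_lock(&vio->slab->lock);
	adjust_reference_count(vio->slab, vio);
	spin_unlock(&vio->slab->lock);

	/* ...the journal again, the other slab, the block map zone,
	 * and so on, then complete the bio or requeue if it must wait. */
}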

Yes, there are the intriguing ordering requirements to work through, but 
maybe as an initial performance experiment the ordering can be ignored 
to get an idea of whether this scheme could provide acceptable performance.

> There are also occasionally non-VIO objects which get queued to invoke
> actions on various threads, which I expect might further complicate the
> experiment.

I think that's the easy part -- queueing a work item to grab a lock and 
Do Something seems to me a pretty common thing in the kernel code. 
Unless there are ordering requirements among the non-vios I'm not 
calling to mind.

Kenneth Raeburn July 28, 2023, 8:28 a.m. UTC | #11
Mike Snitzer <snitzer@kernel.org> writes:
> Thanks for the extra context, but a _big_ elephant in the room for
> this line of discussion is that: the Linux workqueue code has
> basically always been only available for use by GPL'd code.  Given
> VDO's historic non-GPL origins, it seems _to me_ that an alternative
> to Linux's workqueues had to be created to allow VDO to drive its
> work.  While understandable, I gave guidance 6 years ago that VDO
> engineering should work to definitively reconcile whether using Linux
> workqueues is viable now that VDO has been GPL'd.

Yes, initially that was a significant reason.

More recently, when we've tried switching, the performance loss made it
appear not worth the change. Especially since we also needed to ship a
usable version at the same time.

> But it appears there wasn't much in the way of serious effort put to
> completely converting to using Linux workqueues. That is a problem
> because all of the work item strategy deployed by VDO is quite
> bespoke.  I don't think the code lends itself to being properly
> maintained by more than 1 or 2 engineers (if we're lucky at this
> point).

By "work item strategy" are you referring to the lower level handling of
queueing and executing the work items? Because I've done that. Well, the
first 90%, by making the VDO work queues function as a shim on top of
the kernel ones instead of creating their own threads. It would also
need the kernel workqueues modified to support the SYSFS and ORDERED
options together, because on NUMA systems the VDO performance really
tanks without tweaking CPU affinity, and one or two other small
additions. If we were to actually commit to that version there might be
additional work like tweaking some data structures and eliding some shim
functions if appropriate, but given the performance loss, we decided to
stop there.
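
In rough outline, the shim keeps the existing enqueue interface but
routes the work to a kernel workqueue, something like this (an
illustrative sketch, not the actual code):

struct vdo_work_queue {
	struct workqueue_struct *wq; /* replaces the dedicated kthread */
};

void vdo_enqueue_work(struct vdo_work_queue *queue,
		      struct vdo_work_item *item)
{
	/* item->work is an embedded struct work_struct in this sketch. */
	queue_work(queue->wq, &item->work);
}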

Or do you mean the use of executing all actions affecting a data
structure in a single thread/queue via message passing to serialize
access to data structures instead of having a thread serially lock,
modify, and unlock the various different data structures on behalf of a
single I/O request, while another thread does the same for another I/O
request? The model we use can certainly make things more difficult to
follow. It reads like continuation-passing style code, not the
straight-line code many of us are more accustomed to.

"Converting to using Linux workqueues" really doesn't say the latter to
me, it says the former. But I thought I'd already mentioned I'd tried
the former out. (Perhaps not very clearly?)

> I would like to see a patch crafted that allows branching between the
> use of Linux and VDO workqueues. Initially have a dm-vdo modparam
> (e.g. use_vdo_wq or vice-versa: use_linux_wq).  And have a wrapping
> interface and associated data struct(s) that can bridge between work
> being driven/coordinated by either (depending on disposition of
> modparam).

If we're talking about the lower level handling, I don't think it would
be terribly hard.

> This work isn't trivial, I get that. But it serves to clearly showcase
> shortcomings, areas for improvement, while pivoting to more standard
> Linux interfaces that really should've been used from VDO's inception.
>
> Is this work that you feel you could focus on with urgency?
>
> Thanks,
> Mike

I think so, once we're clear on exactly what we're talking about...

Ken

Mike Snitzer July 28, 2023, 2:49 p.m. UTC | #12
On Fri, Jul 28 2023 at  4:28P -0400,
Ken Raeburn <raeburn@redhat.com> wrote:

> 
> Mike Snitzer <snitzer@kernel.org> writes:
> > Thanks for the extra context, but a _big_ elephant in the room for
> > this line of discussion is that: the Linux workqueue code has
> > basically always been only available for use by GPL'd code.  Given
> > VDO's historic non-GPL origins, it seems _to me_ that an alternative
> > to Linux's workqueues had to be created to allow VDO to drive its
> > work.  While understandable, I gave guidance 6 years ago that VDO
> > engineering should work to definitively reconcile whether using Linux
> > workqueues is viable now that VDO has been GPL'd.
> 
> Yes, initially that was a significant reason.
> 
> More recently, when we've tried switching, the performance loss made it
> appear not worth the change. Especially since we also needed to ship a
> usable version at the same time.
> 
> > But it appears there wasn't much in the way of serious effort put to
> > completely converting to using Linux workqueues. That is a problem
> > because all of the work item strategy deployed by VDO is quite
> > bespoke.  I don't think the code lends itself to being properly
> > maintained by more than 1 or 2 engineers (if we're lucky at this
> > point).
> 
> By "work item strategy" are you referring to the lower level handling of
> queueing and executing the work items? Because I've done that. Well, the
> first 90%, by making the VDO work queues function as a shim on top of
> the kernel ones instead of creating their own threads. It would also
> need the kernel workqueues modified to support the SYSFS and ORDERED
> options together, because on NUMA systems the VDO performance really
> tanks without tweaking CPU affinity, and one or two other small
> additions. If we were to actually commit to that version there might be
> additional work like tweaking some data structures and eliding some shim
> functions if appropriate, but given the performance loss, we decided to
> stop there.

There needs to be a comprehensive audit of the locking and the
granularity of work.  The model VDO uses already requires that
anything that needs a continuation is assigned to the same thread
right?  Matt said that there is additional locking in the rare case
that another thread needs read access to an object.

Determining how best to initiate the work VDO requires, while providing
mutual exclusion that still allows concurrency, is the goal. Having a
deep look at this is needed.
 
> Or do you mean the use of executing all actions affecting a data
> structure in a single thread/queue via message passing to serialize
> access to data structures instead of having a thread serially lock,
> modify, and unlock the various different data structures on behalf of a
> single I/O request, while another thread does the same for another I/O
> request? The model we use can certainly make things more difficult to
> follow. It reads like continuation-passing style code, not the
> straight-line code many of us are more accustomed to.
> 
> "Converting to using Linux workqueues" really doesn't say the latter to
> me, it says the former. But I thought I'd already mentioned I'd tried
> the former out. (Perhaps not very clearly?)

The implicit locking of the VDO thread assignment model needs to be
factored out.  If 'use_vdo_wq' is true then the locking operations are
a noop. But if Linux workqueues are used then appropriate locking is
needed.
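
Something as simple as this shape would do for the factoring (an
illustrative sketch, assuming the use_vdo_wq modparam suggested
earlier):

static inline void vdo_struct_lock(spinlock_t *lock)
{
	/* Under the thread-ownership model the structure is only ever
	 * touched from its own thread, so no lock is needed. */
	if (!use_vdo_wq)
		spin_lock(lock);
}

static inline void vdo_struct_unlock(spinlock_t *lock)
{
	if (!use_vdo_wq)
		spin_unlock(lock);
}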

FYI, dm-cache-target.c uses a struct continuation to queue a sequence
of work.  Can VDO translate its ~12 stages of work into locking a vio
and using continuations to progress through the stages?  The locking
shouldn't be overbearing since VDO is already taking steps to isolate
the work to particular threads.

Also, just so you're aware, DM core now provides helpers to shard a
data structure's locking (used by dm-bufio and dm-bio-prison-v1).  See
dm_hash_locks_index() and dm_num_hash_locks().
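
For reference, using those helpers looks roughly like this at a call
site (the surrounding structure is invented for illustration, and I'm
going from memory on the helper arguments):

struct sharded_state {
	unsigned int num_locks;  /* from dm_num_hash_locks() at init */
	spinlock_t locks[64];    /* sized to at least num_locks */
};

static void lock_for_block(struct sharded_state *state, sector_t block)
{
	spin_lock(&state->locks[dm_hash_locks_index(block,
						    state->num_locks)]);
}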

> > I would like to see a patch crafted that allows branching between the
> > use of Linux and VDO workqueues. Initially have a dm-vdo modparam
> > (e.g. use_vdo_wq or vice-versa: use_linux_wq).  And have a wrapping
> > interface and associated data struct(s) that can bridge between work
> > being driven/coordinated by either (depending on disposition of
> > modparam).
> 
> If we're talking about the lower level handling, I don't think it would
> be terribly hard.
> 
> > This work isn't trivial, I get that. But it serves to clearly showcase
> > shortcomings, areas for improvement, while pivoting to more standard
> > Linux interfaces that really should've been used from VDO's inception.
> >
> > Is this work that you feel you could focus on with urgency?
> >
> > Thanks,
> > Mike
> 
> I think so, once we're clear on exactly what we're talking about...

I'm talking about a comprehensive audit of how work is performed.  And
backfilling proper locking by factoring out adequate protection that
allows conditional use of locking (e.g. IFF using linux workqueues).

In the end, using either VDO workqueue or Linux workqueues must pass
all VDO tests.

Mike

Matthew Sakai Aug. 9, 2023, 11:40 p.m. UTC | #13
On 5/23/23 17:45, J. corwin Coburn wrote:
> [original cover letter and diffstat snipped]

For the series:

Co-developed-by: Matthew Sakai <msakai@redhat.com>
Signed-off-by: Matthew Sakai <msakai@redhat.com>

Matt Sakai
