Message ID | 20230225010927.813929-1-slava@dubeyko.com (mailing list archive)
---|---
Series | SSDFS: flash-friendly LFS file system for ZNS SSD
> Benchmarking results show that SSDFS is capable:

Is there performance data showing IOPS?

These comparisons include file systems that don't support zoned devices
natively, maybe that's why IOPS comparisons cannot be made?

> (3) decrease the write amplification factor compared with:
> 1.3x - 116x (ext4),
> 14x - 42x (xfs),
> 6x - 9x (btrfs),
> 1.5x - 50x (f2fs),
> 1.2x - 20x (nilfs2);
> (4) prolong SSD lifetime compared with:

Is this measuring how many times blocks are erased? I guess this
measurement includes the background I/O from ssdfs migration and moving?

> 1.4x - 7.8x (ext4),
> 15x - 60x (xfs),
> 6x - 12x (btrfs),
> 1.5x - 7x (f2fs),
> 1x - 4.6x (nilfs2).

Thanks,
Stefan
> On Feb 27, 2023, at 5:53 AM, Stefan Hajnoczi <stefanha@redhat.com> wrote:
>
>> Benchmarking results show that SSDFS is capable:
>
> Is there performance data showing IOPS?
>

Yeah, I completely see your point. :) Everybody would like to see the
performance estimation. My first goal was to check how much SSDFS can
prolong SSD lifetime and decrease write amplification. So, I used blktrace
output to estimate the write amplification factor and lifetime prolongation
and to compare various file systems.

The blktrace output also contains timestamp information, and while comparing
timestamps I realized that, currently, the policy of distributing the offset
translation table among logs hurts the read performance of SSDFS. So, if I
compare the whole cycle between mount and unmount (read + write I/O), then
SSDFS can be faster than NILFS2 and XFS but slower than ext4, btrfs, and
F2FS. However, the write I/O path should be faster because SSDFS can
decrease the number of write I/O requests. I am changing the offset
translation table's distribution policy (the patch is under testing) and I
expect it to improve SSDFS read performance significantly. So, performance
benchmarking will make sense after this fix.

> These comparisons include file systems that don't support zoned devices
> natively, maybe that's why IOPS comparisons cannot be made?
>

A performance comparison can be made for conventional SSD devices. Of
course, ZNS SSD has some peculiarities (limited number of open/active zones,
zone size, write pointer, strict append-only mode) and it requires a fair
comparison, because these peculiarities/restrictions can help as much as
they can make life more difficult. However, even if we compare file systems
on the same type of storage device, various configuration options (logical
block size, erase block size, segment size, and so on) or a particular
workload can significantly change a file system's behavior. It is never a
simple statement that one file system is faster than another.

>> (3) decrease the write amplification factor compared with:
>> 1.3x - 116x (ext4),
>> 14x - 42x (xfs),
>> 6x - 9x (btrfs),
>> 1.5x - 50x (f2fs),
>> 1.2x - 20x (nilfs2);
>> (4) prolong SSD lifetime compared with:
>
> Is this measuring how many times blocks are erased? I guess this
> measurement includes the background I/O from ssdfs migration and moving?
>

First of all, I need to explain the testing methodology. Testing included:
(1) create file (empty, 64 bytes, 16K, 100K), (2) update file, (3) delete
file. Every test-case was executed as a sequence of multiple mount/unmount
operations. For example, the total number of file creation operations was
1000 or 10000, but one mount cycle included 10, 100, or 1000 file creation,
file update, or file delete operations. Finally, the file system must flush
all dirty metadata and user data during the unmount operation.

The blktrace tool registers the LBA and size of every I/O request. These
data are the basis for estimating how many erase blocks have been involved
in the operations. SSDFS volumes were created with 128KB, 512KB, and 8MB
erase block sizes, so I used these erase block sizes for the estimation.
Generally speaking, we can estimate the total number of erase blocks
involved in file system operations for a particular use-case by summing the
number of bytes of all I/O requests and dividing by the erase block size.
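For illustration, a minimal sketch of that calculation in Python (not SSDFS
tooling; the writes list and helper name below are hypothetical stand-ins
for blktrace records reduced to (LBA, size-in-bytes) pairs):

# Estimate how many erase blocks' worth of data a test run produced,
# for each of the erase block sizes used in the comparison.
ERASE_BLOCK_SIZES = {"128KB": 128 * 1024, "512KB": 512 * 1024, "8MB": 8 * 1024 * 1024}

def erase_blocks_involved(writes, erase_block_size):
    # writes: iterable of (lba, size_in_bytes) records taken from blktrace output
    total_bytes = sum(size for _, size in writes)
    return total_bytes / erase_block_size

writes = [(32, 131072), (288, 65536), (1024, 4096)]  # hypothetical example records
for name, ebs in ERASE_BLOCK_SIZES.items():
    print(name, erase_blocks_involved(writes, ebs))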
If a file system uses in-place updates, then it is possible to estimate how
many times the same erase block (we know the LBA numbers) has been
completely re-written. For example, if an erase block (starting from LBA
#32) received 1310720 bytes of write I/O requests, then a 128KB erase block
has been re-written 10 times. It means that the FTL needs to store all these
data into 10 x 128KB erase blocks in the background, or execute around 9
erase operations to keep the actual state of the data in one 128KB erase
block. So, this is the estimation of the FTL GC responsibility. However, if
we would like to estimate the total number of erase operations, then we need
to take into account:

E_total = E(FTL GC) + E(TRIM) + E(FS GC) + E(read disturbance) + E(retention)
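A toy sketch of this part of the model, assuming blktrace LBAs in 512-byte
sectors and aligned erase blocks; the helper only illustrates the
arithmetic, it is not how the published numbers were produced:

SECTOR_SIZE = 512  # assumption: blktrace reports LBAs in 512-byte sectors

def rewrites_per_erase_block(writes, erase_block_size):
    # writes: iterable of (lba, size_in_bytes) write records for one device
    bytes_per_block = {}
    for lba, size in writes:
        idx = (lba * SECTOR_SIZE) // erase_block_size
        bytes_per_block[idx] = bytes_per_block.get(idx, 0) + size
    # full re-writes of each erase block's worth of data
    return {idx: total // erase_block_size for idx, total in bytes_per_block.items()}

# The example above: 1310720 bytes hitting one 128KB erase block gives
# 10 full re-writes, i.e. roughly 9 background erase operations (the FTL GC term).
rewrites = rewrites_per_erase_block([(32, 1310720)], 128 * 1024)
e_ftl_gc = sum(max(n - 1, 0) for n in rewrites.values())

# The remaining terms of E_total (TRIM, FS GC, read disturbance, retention)
# are added separately; the last two are negligible for such short tests,
# per the discussion below.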
The estimation of erase operations caused by the retention issue is tricky,
and it shows a negligibly small number for such short testing, so we can
ignore it. However, the retention issue is an important factor in decreasing
SSD lifetime. I executed the estimation of this factor and made a comparison
for various file systems. But this factor deeply depends on time, workload,
and payload size, so it is really hard to share any stable and reasonable
numbers for it; especially, it heavily depends on the FTL implementation.

It is possible to estimate read disturbance but, again, it heavily depends
on the NAND flash type, organization, and FTL algorithms. Also, this
estimation shows really small numbers that can be ignored for short testing.
I have made this estimation and I can see that, currently, SSDFS has a
read-intensive nature because of the offset translation table distribution
policy. I am testing the fix and I hope to remove this issue.

SSDFS has an efficient TRIM/erase policy, so I can see TRIM/erase operations
even for such "short" test-cases. As far as I can see, no other file system
issues discard operations for the same test-cases. I included TRIM/erase
operations in the calculation of the total number of erase operations.

Estimation of GC operations on the FS side (F2FS, NILFS2) is the most
speculative one. I have estimated the number of erase operations that the FS
GC can generate. However, as far as I can see, even without taking the FS GC
erase operations into account, SSDFS looks better compared with F2FS and
NILFS2. I need to add here that SSDFS uses a migration scheme and doesn't
need classical GC. But even for such "short" test-cases the migration scheme
shows a really efficient TRIM/erase policy.

So, the write amplification factor was estimated on the basis of comparing
write I/O requests, and SSD lifetime prolongation has been estimated and
compared by using the model explained above. I hope I explained it clearly
enough. Feel free to ask additional questions if I missed something.

The measurement includes all operations (foreground and background) that the
file system initiates, because of the mount/unmount model. However, the
migration scheme requires additional explanation. Generally speaking, the
migration scheme doesn't generate additional I/O requests; on the contrary,
it decreases the number of I/O requests. It can be tricky to follow. SSDFS
uses compression, delta-encoding, a compaction scheme, and migration
stimulation. It means that regular file system update operations are the
main vehicle of the migration scheme.

Imagine that an application updates a 4KB logical block. SSDFS tries to
compress (or delta-encode) this piece of data. Let's say compression gives
us a 1KB compressed piece of data (4KB uncompressed size). It means that we
can place this 1KB into a 4KB memory page and still have 3KB of free space.
So, the migration logic checks whether the exhausted (completely full) old
erase block that received the update operation has other valid logical
block(s). If it has such valid logical blocks, then we can compress these
logical blocks and store them in the free space of the 4KB memory page. So,
we can finally store, for example, 4 compressed logical blocks (1KB in size
each) into one 4KB memory page. It means that SSDFS issues one I/O request
for 4 logical blocks instead of 4 requests. I simplify the explanation, but
the idea remains the same. I hope I clarified the point. Feel free to ask
additional questions if I missed something.
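To make the packing idea concrete, here is a toy sketch; the 4KB page, the
1KB compressed sizes, and the greedy packing below are illustrative only and
are not the actual SSDFS on-disk logic:

PAGE_SIZE = 4096  # one memory page == one write I/O request in this toy model

def pack_compressed_blocks(compressed_blocks, page_size=PAGE_SIZE):
    # Greedily pack compressed logical blocks into pages; each page is one write I/O.
    pages, current, used = [], [], 0
    for blk in compressed_blocks:
        if used + len(blk) > page_size:
            pages.append(current)
            current, used = [], 0
        current.append(blk)
        used += len(blk)
    if current:
        pages.append(current)
    return pages

# Four logical blocks, each compressed from 4KB down to about 1KB,
# fit into a single page: one write I/O request instead of four.
blocks = [b"x" * 1024 for _ in range(4)]
assert len(pack_compressed_blocks(blocks)) == 1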
Thanks,
Slava.

>> 1.4x - 7.8x (ext4),
>> 15x - 60x (xfs),
>> 6x - 12x (btrfs),
>> 1.5x - 7x (f2fs),
>> 1x - 4.6x (nilfs2).
>
> Thanks,
> Stefan

On Mon, Feb 27, 2023 at 02:59:08PM -0800, Viacheslav A.Dubeyko wrote:
> > On Feb 27, 2023, at 5:53 AM, Stefan Hajnoczi <stefanha@redhat.com> wrote:
> > These comparisons include file systems that don't support zoned devices
> > natively, maybe that's why IOPS comparisons cannot be made?
>
> A performance comparison can be made for conventional SSD devices.
> [...]

I incorrectly assumed ssdfs was only for zoned devices.

> [...]
> I hope I clarified the point. Feel free to ask additional questions if I
> missed something.
Thanks for these explanations, that clarifies things!

Stefan