diff mbox

[RFC,v2,01/83] Introduction and documentation of NOVA filesystem.

Message ID 1520705944-6723-2-git-send-email-jix024@eng.ucsd.edu (mailing list archive)
State Changes Requested
Headers show

Commit Message

Andiry Xu March 10, 2018, 6:17 p.m. UTC
From: Andiry Xu <jix024@cs.ucsd.edu>

NOVA is a log-structured file system tailored for byte-addressable non-volatile memories.
It was designed and developed at the Non-Volatile Systems Laboratory in the Computer
Science and Engineering Department at the University of California, San Diego.
Its primary authors are Andiry Xu <jix024@eng.ucsd.edu>, Lu Zhang
<luzh@eng.ucsd.edu>, and Steven Swanson <swanson@eng.ucsd.edu>.

These two papers provide a detailed, high-level description of NOVA's design goals and approach:

   NOVA: A Log-structured File system for Hybrid Volatile/Non-volatile Main Memories
   In The 14th USENIX Conference on File and Storage Technologies (FAST '16)
   (http://cseweb.ucsd.edu/~swanson/papers/FAST2016NOVA.pdf)

   NOVA-Fortis: A Fault-Tolerant Non-Volatile Main Memory File System
   In The 26th ACM Symposium on Operating Systems Principles (SOSP '17)
   (http://cseweb.ucsd.edu/~swanson/papers/SOSP2017-NOVAFortis.pdf)

This patchset contains features from the FAST paper. We leave NOVA-Fortis features,
such as snapshot, metadata and data replication and RAID parity for
future submission.

Signed-off-by: Andiry Xu <jix024@cs.ucsd.edu>
---
 Documentation/filesystems/00-INDEX |   2 +
 Documentation/filesystems/nova.txt | 498 +++++++++++++++++++++++++++++++++++++
 MAINTAINERS                        |   8 +
 3 files changed, 508 insertions(+)
 create mode 100644 Documentation/filesystems/nova.txt

Comments

Randy Dunlap March 19, 2018, 8:43 p.m. UTC | #1
On 03/10/2018 10:17 AM, Andiry Xu wrote:
> From: Andiry Xu <jix024@cs.ucsd.edu>
> 
> NOVA is a log-structured file system tailored for byte-addressable non-volatile memories.
> It was designed and developed at the Non-Volatile Systems Laboratory in the Computer
> Science and Engineering Department at the University of California, San Diego.
> Its primary authors are Andiry Xu <jix024@eng.ucsd.edu>, Lu Zhang
> <luzh@eng.ucsd.edu>, and Steven Swanson <swanson@eng.ucsd.edu>.
> 
> These two papers provide a detailed, high-level description of NOVA's design goals and approach:
> 
>    NOVA: A Log-structured File system for Hybrid Volatile/Non-volatile Main Memories
>    In The 14th USENIX Conference on File and Storage Technologies (FAST '16)
>    (http://cseweb.ucsd.edu/~swanson/papers/FAST2016NOVA.pdf)
> 
>    NOVA-Fortis: A Fault-Tolerant Non-Volatile Main Memory File System
>    In The 26th ACM Symposium on Operating Systems Principles (SOSP '17)
>    (http://cseweb.ucsd.edu/~swanson/papers/SOSP2017-NOVAFortis.pdf)
> 
> This patchset contains features from the FAST paper. We leave NOVA-Fortis features,
> such as snapshot, metadata and data replication and RAID parity for
> future submission.
> 
> Signed-off-by: Andiry Xu <jix024@cs.ucsd.edu>
> ---
>  Documentation/filesystems/00-INDEX |   2 +
>  Documentation/filesystems/nova.txt | 498 +++++++++++++++++++++++++++++++++++++
>  MAINTAINERS                        |   8 +
>  3 files changed, 508 insertions(+)
>  create mode 100644 Documentation/filesystems/nova.txt

> diff --git a/Documentation/filesystems/nova.txt b/Documentation/filesystems/nova.txt
> new file mode 100644
> index 0000000..4728f50
> --- /dev/null
> +++ b/Documentation/filesystems/nova.txt
> @@ -0,0 +1,498 @@
> +The NOVA Filesystem
> +===================
> +
> +NOn-Volatile memory Accelerated file system (NOVA) is a DAX file system
> +designed to provide a high performance and production-ready file system
> +tailored for byte-addressable non-volatile memories (e.g., NVDIMMs
> +and Intel's soon-to-be-released 3DXPoint DIMMs).
> +NOVA combines design elements from many other file systems
> +and adapts conventional log-structured file system techniques to
> +exploit the fast random access that NVMs provide. In particular, NOVA maintains
> +separate logs for each inode to improve concurrency, and stores file data
> +outside the log to minimize log size and reduce garbage collection costs. NOVA's
> +logs provide metadata and data atomicity and focus on simplicity and
> +reliability, keeping complex metadata structures in DRAM to accelerate lookup
> +operations.
> +
> +NOVA was developed by the Non-Volatile Systems Laboratory (NVSL) in
> +the Computer Science and Engineering Department at the University of
> +California, San Diego.
> +
> +A more thorough discussion of NOVA's design is avaialable in these two papers:

                                                  available

> +
> +NOVA: A Log-structured File System for Hybrid Volatile/Non-volatile Main Memories
> +Jian Xu and Steven Swanson
> +In The 14th USENIX Conference on File and Storage Technologies (FAST '16)
> +
> +NOVA-Fortis: A Fault-Tolerant Non-Volatile Main Memory File System
> +Jian Xu, Lu Zhang, Amirsaman Memaripour, Akshatha Gangadharaiah, Amit Borase,
> +Tamires Brito Da Silva, Andy Rudoff and Steven Swanson
> +In The 26th ACM Symposium on Operating Systems Principles (SOSP '17)
> +
> +This version of NOVA contains features from the FAST paper.
> +NOVA-Fortis features, such as snapshot, metadata and data protection and replication
> +are left for future submission.
> +
> +The main NOVA features include:
> +
> +  * POSIX semantics
> +  * Directly access (DAX) byte-addressable NVMM without page caching
> +  * Per-CPU NVMM pool to maximize concurrency
> +  * Strong consistency guarantees with 8-byte atomic stores
> +
> +
> +Filesystem Design
> +=================
> +
> +NOVA divides NVMM into several regions. NOVA's 512B superblock contains global

                                        (prefer:) 512-byte

> +file system information and the recovery inode. The recovery inode represents a
> +special file that stores recovery information (e.g., the list of unallocated
> +NVMM pages). NOVA divides its inode tables into per-CPU stripes. It also
> +provides per-CPU journals for complex file operations that involve multiple
> +inodes. The rest of the available NVMM stores logs and file data.
> +
> +NOVA is log-structured and stores a separate log for each inode to maximize
> +concurrency and provide atomicity for operations that affect a single file. The
> +logs only store metadata and comprise a linked list of 4 KB pages. Log entries
> +are small – between 32 and 64 bytes. Logs are generally non-contiguous, and log
> +pages may reside anywhere in NVMM.
> +
> +NOVA keeps copies of most file metadata in DRAM during normal
> +operations, eliminating the need to access metadata in NVMM during reads.
> +
> +NOVA supports both copy-on-write and in-place file data updates and appends
> +metadata about the write to the log. For operations that affect multiple inodes

                                                                            inodes,

> +NOVA uses lightweight, fixed-length journals –one per core.

                                                -- one per core.

> +
> +NOVA divides the allocatable NVMM into multiple regions, one region per CPU
> +core. A per-core allocator manages each of the regions, minimizing contention
> +during memory allocation.
> +
> +After a system crash, NOVA must scan all the logs to rebuild the memory
> +allocator state. Since, there are many logs, NOVA aggressively parallelizes the

                    Since there are

> +scan.
> +
> +
> +Building and using NOVA
> +=======================
> +
> +To build NOVA, build the kernel with PMEM (`CONFIG_BLK_DEV_PMEM`),
> +DAX (`CONFIG_FS_DAX`) and NOVA (`CONFIG_NOVA_FS`) support.  Install as usual.
> +
> +NOVA runs on a pmem non-volatile memory region.  You can create one of these
> +regions with the `memmap` kernel command line option.  For instance, adding
> +`memmap=16G!8G` to the kernel boot parameters will reserve 16GB memory starting
> +from address 8GB, and the kernel will create a `pmem0` block device under the
> +`/dev` directory.
> +
> +After the OS has booted, you can initialize a NOVA instance with the following commands:
> +
> +
> +# modprobe nova
> +# mount -t NOVA -o init /dev/pmem0 /mnt/nova

Hmph, unique in upper-case-ness (at least for in-tree fs-es).
Would you consider "nova" instead?

> +
> +
> +The above commands create a NOVA instance on `/dev/pmem0` and mounts it on
> +`/mnt/nova`.
> +
> +NOVA support several module command line options:

        supports

> +
> + * measure_timing: Measure the timing of file system operations for profiling (default: 0)
> +
> + * inplace_data_updates:  Update data in place rather than with COW (default: 0)
> +
> +To recover an existing NOVA instance, mount NOVA without the init option, for example:
> +
> +# mount -t NOVA /dev/pmem0 /mnt/nova
> +
> +
> +Sysfs support
> +-------------
> +
> +NOVA provides sysfs support to enable user to get/set information of 

                                  enable a user
                        or        enable users

And the line above ends with a trailing space.  Please check/remove all of those.

> +a running NOVA instance.
> +After mount, NOVA creates four entries under proc directory /proc/fs/nova/pmem#/:

Above uses lower-case "nova" in /proc/fs/nova/... but the examples below use NOVA.
nova is preferred (IMO).

> +
> +timing_stats	IO_stats	allocator	gc
> +
> +Show NOVA file operation timing statistics:
> +# cat /proc/fs/NOVA/pmem#/timing_stats
> +
> +Clear timing statistics:
> +# echo 1 > /proc/fs/NOVA/pmem#/timing_stats
> +
> +Show NOVA I/O statistics:
> +# cat /proc/fs/NOVA/pmem#/IO_stats
> +
> +Clear I/O statistics:
> +# echo 1 > /proc/fs/NOVA/pmem#/IO_stats
> +
> +Show NOVA allocator information:
> +# cat /proc/fs/NOVA/pmem#/allocator
> +
> +Manual garbage collection:
> +# echo #inode_number > /proc/fs/NOVA/pmem#/gc
> +
> +
> +Source File Structure
> +=====================
> +
> +  * nova_def.h/nova.h
> +   Defines NOVA macros and key inline functions.
> +
> +  * balloc.{h,c}
> +    NOVA's pmem allocator implementation.
> +
> +  * bbuild.c
> +    Implements recovery routines to restore the in-use inode list and the NVMM
> +    allocator information.
> +
> +  * dax.c
> +    Implements DAX read/write and mmap functions to access file data. NOVA uses
> +    copy-on-write to modify file pages by default, unless inplace data update is
> +    enabled at mount-time.
> +
> +  * dir.c
> +    Contains functions to create, update, and remove NOVA dentries.
> +
> +  * file.c
> +    Implements file-related operations such as open, fallocate, llseek, fsync,
> +    and flush.
> +
> +  * gc.c
> +    NOVA's garbage collection functions.
> +
> +  * inode.{h,c}
> +    Creates, reads, and frees NOVA inode tables and inodes.
> +
> +  * ioctl.c
> +    Implements some ioctl commands to call NOVA's internal functions.
> +
> +  * journal.{h,c}
> +    For operations that affect multiple inodes NOVA uses lightweight,
> +    fixed-length journals – one per core. This file contains functions to
> +    create and manage the lite journals.
> +
> +  * log.{h,c}
> +    Functions to manipulate NOVA inode logs, including log page allocation, log
> +    entry creation, commit, modification, and deletion.
> +
> +  * namei.c
> +    Functions to create/remove files, directories, and links. It also looks for
> +    the NOVA inode number for a given path name.
> +
> +  * rebuild.c
> +    When mounting NOVA, rebuild NOVA inodes from its logs.
> +
> +  * stats.{h,c}
> +    Provide routines to gather and print NOVA usage statistics.
> +
> +  * super.{h,c}
> +    Super block structures and NOVA FS layout and entry points for NOVA
> +    mounting and unmounting, initializing or recovering the NOVA super block
> +    and other global file system information.
> +
> +  * symlink.c
> +    Implements functions to create and read symbolic links in the filesystem.
> +
> +  * sysfs.c
> +    Implements sysfs entries to take user inputs for printing NOVA statistics.

s/sysfs/procfs/

> +
> +
> +Filesystem Layout
> +=================
> +
> +A NOVA file systems resides in single PMEM device. *****
> +NOVA divides the device into 4KB blocks.

                                4 KB  {or use 4KB way up above here}

> +
> + block
> ++---------------------------------------------------------+
> +|    0    | primary super block (struct nova_super_block) |
> ++---------------------------------------------------------+
> +|    1    | Reserved inodes                               |
> ++---------------------------------------------------------+
> +|  2 - 15 | reserved                                      |
> ++---------------------------------------------------------+
> +| 16 - 31 | Inode table pointers                          |
> ++---------------------------------------------------------+
> +| 32 - 47 | Journal pointers                              |
> ++---------------------------------------------------------+
> +| 48 - 63 | reserved                                      |
> ++---------------------------------------------------------+
> +|   ...   | log and data pages                            |
> ++---------------------------------------------------------+
> +|   n-2   | replica reserved Inodes                       |
> ++---------------------------------------------------------+
> +|   n-1   | replica super block                           |
> ++---------------------------------------------------------+
> +
> +
> +
> +Superblock and Associated Structures
> +====================================
> +
> +The beginning of the PMEM device hold the super block and its associated

                                    holds

> +tables.  These include reserved inodes, a table of pointers to the journals
> +NOVA uses for complex operations, and pointers to inodes tables.  NOVA
> +maintains replicas of the super block and reserved inodes in the last two
> +blocks of the PMEM area.
> +
> +
> +Block Allocator/Free Lists
> +==========================
> +
> +NOVA uses per-CPU allocators to manage free PMEM blocks.  On initialization,> +NOVA divides the range of blocks in the PMEM device among the CPUs, and those
> +blocks are managed solely by that CPU.  We call these ranges of "allocation regions".
> +Each allocator maintains a red-black tree of unallocated ranges (struct
> +nova_range_node).
> +
> +Allocation Functions
> +--------------------
> +
> +NOVA allocate PMEM blocks using two mechanisms:

        allocates

> +
> +1.  Static allocation as defined in super.h
> +
> +2.  Allocation for log and data pages via nova_new_log_blocks() and
> +nova_new_data_blocks().
> +
> +
> +PMEM Address Translation
> +------------------------
> +
> +In NOVA's persistent data structures, memory locations are given as offsets
> +from the beginning of the PMEM region.  nova_get_block() translates offsets to
> +PMEM addresses.  nova_get_addr_off() performs the reverse translation.
> +
> +
> +Inodes
> +======
> +
> +NOVA maintains per-CPU inode tables, and inode numbers are striped across the
> +tables (i.e., inos 0, n, 2n,... on cpu 0; inos 1, n + 1, 2n + 1, ... on cpu 1).
> +
> +The inodes themselves live in a set of linked lists (one per CPU) of 2MB
> +blocks.  The last 8 bytes of each block points to the next block.  Pointers to
> +heads of these list live in PMEM block INODE_TABLE_START.

                  lists

> +Additional space for inodes is allocated on demand.
> +
> +To allocate inodes, NOVA maintains a per-cpu "inuse_list" in DRAM holds a RB

s/cpu/CPU/g
s/a RB/an RB/

but that isn't quite a sentence. Please fix it.

> +tree that holds ranges of allocated inode numbers.
> +
> +
> +Logs
> +====
> +
> +NOVA maintains a log for each inode that records updates to the inode's
> +metadata and holds pointers to the file data.  NOVA makes updates to file data
> +and metadata atomic by atomically appending log entries to the log.
> +
> +Each inode contains pointers to head and tail of the inode's log.  When the log
> +grows past the end of the last page, nova allocates additional space.  For
> +short logs (less than 1MB) , it doubles the length.  For longer logs, it adds a
> +fixed amount of additional space (1MB).
> +
> +Log space is reclaimed during garbage collection.
> +
> +Log Entries
> +-----------
> +
> +There are four kinds of log entry, documented in log.h.  The log entries have
> +several entries in common:
> +
> +   1.  'epoch_id' gives the epoch during which the log entry was created.
> +   Creating a snapshot increments the epoch_id for the file systems.

                                                          file system.  (?)
or do multiple epochs (snapshots) => multiple fs-es?

> +   Currently disabled (always zero).
> +
> +   2.  'trans_id' is per-inode, monotone increasing, number assigned each
> +   log entry.  It provides an ordering over FS operations on a single inode.
> +
> +   3.  'invalid' is true if the effects of this entry are dead and the log
> +   entry can be garbage collected.
> +
> +   4.  'csum' is a CRC32 checksum for the entry. Currently it is disabled.
> +
> +Log structure
> +-------------
> +
> +The logs comprise a linked list of PMEM blocks.  The tail of each block
> +contains some metadata about the block and pointers to the next block and
> +block's replica (struct nova_inode_page_tail).
> +
> ++----------------+
> +| log entry      |
> ++----------------+
> +| log entry      |
> ++----------------+
> +| ...            |
> ++----------------+
> +| tail           |
> +|  metadata      |
> +|  -> next block |
> ++----------------+
> +
> +
> +Journals
> +========
> +
> +NOVA uses a lightweight journaling mechanisms to provide atomicity for

                                      mechanism

> +operations that modify more than one on inode.  The journals providing logging

end of that "sentence" (above) is confusing or missing something.

> +for two operations:
> +
> +1.  Single word updates (JOURNAL_ENTRY)
> +2.  Copying inodes (JOURNAL_INODE)
> +
> +The journals are undo logs: NOVA creates the journal entries for an operation,
> +and if the operation does not complete due to a system failure, the recovery
> +process rolls back the changes using the journal entries.
> +
> +To commit, NOVA drops the log.
> +
> +NOVA maintains one journal per CPU.  The head and tail pointers for each
> +journal live in a reserved page near the beginning of the file system.
> +
> +During recovery, NOVA scans the journals and undoes the operations described by
> +each entry.
> +
> +
> +File and Directory Access
> +=========================
> +
> +To access file data via read(), NOVA maintains a radix tree in DRAM for each
> +inode (nova_inode_info_header.tree) that maps file offsets to write log
> +entries.  For directories, the same tree maps a hash of filenames to their
> +corresponding dentry.
> +
> +In both cases, the nova populates the tree when the file or directory is opened

                  the nova fs (?)

> +by scanning its log.
> +
> +
> +MMap and DAX
> +============
> +
> +NOVA leverages the kernel's DAX mechanisms for mmap and file data access.
> +NOVA supports DAX-style mmap, i.e. mapping NVM pages directly to the
> +application's address space.
> +
> +
> +Garbage Collection
> +==================
> +
> +NOVA recovers log space with a two-phase garbage collection system.  When a log
> +reaches the end of its allocated pages, NOVA allocates more space.  Then, the
> +fast GC algorithm scans the log to remove pages that have no valid entries.
> +Then, it estimates how many pages the logs valid entries would fill.  If this
> +is less than half the number of pages in the log, the second GC phase copies
> +the valid entries to new pages.
> +
> +For example (V=valid; I=invalid):
> +
> ++---+         +---+	        +---+
> +| I |	       | I |  	      	| V |
> ++---+	       +---+  Thorough	+---+
> +| V |	       | V |  	 GC   	| V |
> ++---+	       +---+   =====> 	+---+
> +| I |	       | I |  	      	| V |
> ++---+	       +---+	        +---+
> +| V |	       | V |  	        | V |
> ++---+	       +---+            +---+
> +  |	         |
> +  V	         V
> ++---+	       +---+
> +| I |	       | V |
> ++---+	       +---+
> +| I | fast GC  | I |
> ++---+  ====>   +---+
> +| I |	       | I |
> ++---+	       +---+
> +| I |	       | V |
> ++---+	       +---+
> +  |
> +  V
> ++---+
> +| V |
> ++---+
> +| I |
> ++---+
> +| I |
> ++---+
> +| V |
> ++---+
> +
> +
> +Umount and Recovery
> +===================
> +
> +Clean umount/mount
> +------------------
> +
> +On a clean unmount, NOVA saves the contents of many of its DRAM data structures
> +to PMEM to accelerate the next mount:
> +
> +1. NOVA stores the allocator state for each of the per-cpu allocators to the
> +   log of a reserved inode (NOVA_BLOCK_NODE_INO).
> +
> +2. NOVA stores the per-CPU lists of alive inodes (the inuse_list) to the
> +   NOVA_BLOCK_INODELIST_INO reserved inode.
> +
> +After a clean unmount, the following mount restores these data and then
> +invalidates them.
> +
> +Recovery after failures
> +-----------------------
> +
> +In case of a unclean dismount (e.g., system crash), NOVA must rebuild these

           of an unclean

> +DRAM structures by scanning the inode logs.  NOVA log scanning is fast because
> +per-CPU inode tables and per-inode logs allow for parallel recovery.
> +
> +The number of live log entries in an inode log is roughly the number of extents
> +in the file.  As a result, NOVA only needs to scan a small fraction of the NVMM
> +during recovery.
> +
> +The NOVA failure recovery consists of two steps:
> +
> +First, NOVA checks its lite weight journals and rolls back any uncommitted

          should be one word: lightweight (or liteweight)

> +transactions to restore the file system to a consistent state.
> +
> +Second, NOVA starts a recovery thread on each CPU and scans the inode tables in
> +parallel, performing log scanning for every valid inode in the inode table.
> +NOVA use different recovery mechanisms for directory inodes and file inodes:

                                                               and file inodes.

> +For a directory inode, NOVA scans the log's linked list to enumerate the pages
> +it occupies, but it does not inspect the log's contents.  For a file inode,
> +NOVA reads the write entries in the log to enumerate the data pages.
> +
> +During the recovery scan NOVA builds a bitmap of occupied pages, and rebuilds
> +the allocator based on the result. After this process completes, the file
> +system is ready to accept new requests.
> +
> +During the same scan, it rebuilds the list of available inodes.
> +
> +
> +Gaps, Missing Features, and Development Status
> +==============================================
> +
> +Although NOVA is a fully-functional file system, there is still much work left
> +to be done.  In particular, (at least) the following items are currently missing:
> +
> +1.  Snapshot, metadata and data replication and protection are left for future submission.
> +2.  There is no mkfs or fsck utility (`mount` takes `-o init` to create a NOVA file system).
> +3.  NOVA only works on x86-64 kernels.
> +4.  NOVA does not currently support extended attributes or ACL.
> +5.  NOVA doesn't provide quota support.
> +6.  Moving NOVA file systems between machines with different numbers of CPUs does not work.

You could artificially limit the number of "known" CPUs so that a NOVA fs could be
moved from a 16-CPU system to an 8-CPU system by telling NOVA to use only 8 CPUs
(as an example).  Just a thought.

> +
> +None of these are fundamental limitations of NOVA's design.
> +
> +NOVA is complete and robust enough to run a range of complex applications, but
> +it is not yet ready for production use.  Our current focus is on adding a few
> +missing features from the list above and finding/fixing bugs.
> +
> +
> +Hacking and Contributing
> +========================
> +
> +If you find bugs, please report them at https://github.com/NVSL/linux-nova/issues.
> +
> +If you have other questions or suggestions you can contact the NOVA developers
> +at cse-nova-hackers@eng.ucsd.edu.
Andiry Xu March 19, 2018, 11 p.m. UTC | #2
Thanks for all the comments.

On Mon, Mar 19, 2018 at 1:43 PM, Randy Dunlap <rdunlap@infradead.org> wrote:
> On 03/10/2018 10:17 AM, Andiry Xu wrote:
>> From: Andiry Xu <jix024@cs.ucsd.edu>
>>
>> NOVA is a log-structured file system tailored for byte-addressable non-volatile memories.
>> It was designed and developed at the Non-Volatile Systems Laboratory in the Computer
>> Science and Engineering Department at the University of California, San Diego.
>> Its primary authors are Andiry Xu <jix024@eng.ucsd.edu>, Lu Zhang
>> <luzh@eng.ucsd.edu>, and Steven Swanson <swanson@eng.ucsd.edu>.
>>
>> These two papers provide a detailed, high-level description of NOVA's design goals and approach:
>>
>>    NOVA: A Log-structured File system for Hybrid Volatile/Non-volatile Main Memories
>>    In The 14th USENIX Conference on File and Storage Technologies (FAST '16)
>>    (http://cseweb.ucsd.edu/~swanson/papers/FAST2016NOVA.pdf)
>>
>>    NOVA-Fortis: A Fault-Tolerant Non-Volatile Main Memory File System
>>    In The 26th ACM Symposium on Operating Systems Principles (SOSP '17)
>>    (http://cseweb.ucsd.edu/~swanson/papers/SOSP2017-NOVAFortis.pdf)
>>
>> This patchset contains features from the FAST paper. We leave NOVA-Fortis features,
>> such as snapshot, metadata and data replication and RAID parity for
>> future submission.
>>
>> Signed-off-by: Andiry Xu <jix024@cs.ucsd.edu>
>> ---
>>  Documentation/filesystems/00-INDEX |   2 +
>>  Documentation/filesystems/nova.txt | 498 +++++++++++++++++++++++++++++++++++++
>>  MAINTAINERS                        |   8 +
>>  3 files changed, 508 insertions(+)
>>  create mode 100644 Documentation/filesystems/nova.txt
>
>> diff --git a/Documentation/filesystems/nova.txt b/Documentation/filesystems/nova.txt
>> new file mode 100644
>> index 0000000..4728f50
>> --- /dev/null
>> +++ b/Documentation/filesystems/nova.txt
>> @@ -0,0 +1,498 @@
>> +The NOVA Filesystem
>> +===================
>> +
>> +NOn-Volatile memory Accelerated file system (NOVA) is a DAX file system
>> +designed to provide a high performance and production-ready file system
>> +tailored for byte-addressable non-volatile memories (e.g., NVDIMMs
>> +and Intel's soon-to-be-released 3DXPoint DIMMs).
>> +NOVA combines design elements from many other file systems
>> +and adapts conventional log-structured file system techniques to
>> +exploit the fast random access that NVMs provide. In particular, NOVA maintains
>> +separate logs for each inode to improve concurrency, and stores file data
>> +outside the log to minimize log size and reduce garbage collection costs. NOVA's
>> +logs provide metadata and data atomicity and focus on simplicity and
>> +reliability, keeping complex metadata structures in DRAM to accelerate lookup
>> +operations.
>> +
>> +NOVA was developed by the Non-Volatile Systems Laboratory (NVSL) in
>> +the Computer Science and Engineering Department at the University of
>> +California, San Diego.
>> +
>> +A more thorough discussion of NOVA's design is avaialable in these two papers:
>
>                                                   available
>
>> +
>> +NOVA: A Log-structured File System for Hybrid Volatile/Non-volatile Main Memories
>> +Jian Xu and Steven Swanson
>> +In The 14th USENIX Conference on File and Storage Technologies (FAST '16)
>> +
>> +NOVA-Fortis: A Fault-Tolerant Non-Volatile Main Memory File System
>> +Jian Xu, Lu Zhang, Amirsaman Memaripour, Akshatha Gangadharaiah, Amit Borase,
>> +Tamires Brito Da Silva, Andy Rudoff and Steven Swanson
>> +In The 26th ACM Symposium on Operating Systems Principles (SOSP '17)
>> +
>> +This version of NOVA contains features from the FAST paper.
>> +NOVA-Fortis features, such as snapshot, metadata and data protection and replication
>> +are left for future submission.
>> +
>> +The main NOVA features include:
>> +
>> +  * POSIX semantics
>> +  * Directly access (DAX) byte-addressable NVMM without page caching
>> +  * Per-CPU NVMM pool to maximize concurrency
>> +  * Strong consistency guarantees with 8-byte atomic stores
>> +
>> +
>> +Filesystem Design
>> +=================
>> +
>> +NOVA divides NVMM into several regions. NOVA's 512B superblock contains global
>
>                                         (prefer:) 512-byte
>
>> +file system information and the recovery inode. The recovery inode represents a
>> +special file that stores recovery information (e.g., the list of unallocated
>> +NVMM pages). NOVA divides its inode tables into per-CPU stripes. It also
>> +provides per-CPU journals for complex file operations that involve multiple
>> +inodes. The rest of the available NVMM stores logs and file data.
>> +
>> +NOVA is log-structured and stores a separate log for each inode to maximize
>> +concurrency and provide atomicity for operations that affect a single file. The
>> +logs only store metadata and comprise a linked list of 4 KB pages. Log entries
>> +are small – between 32 and 64 bytes. Logs are generally non-contiguous, and log
>> +pages may reside anywhere in NVMM.
>> +
>> +NOVA keeps copies of most file metadata in DRAM during normal
>> +operations, eliminating the need to access metadata in NVMM during reads.
>> +
>> +NOVA supports both copy-on-write and in-place file data updates and appends
>> +metadata about the write to the log. For operations that affect multiple inodes
>
>                                                                             inodes,
>
>> +NOVA uses lightweight, fixed-length journals –one per core.
>
>                                                 -- one per core.
>
>> +
>> +NOVA divides the allocatable NVMM into multiple regions, one region per CPU
>> +core. A per-core allocator manages each of the regions, minimizing contention
>> +during memory allocation.
>> +
>> +After a system crash, NOVA must scan all the logs to rebuild the memory
>> +allocator state. Since, there are many logs, NOVA aggressively parallelizes the
>
>                     Since there are
>
>> +scan.
>> +
>> +
>> +Building and using NOVA
>> +=======================
>> +
>> +To build NOVA, build the kernel with PMEM (`CONFIG_BLK_DEV_PMEM`),
>> +DAX (`CONFIG_FS_DAX`) and NOVA (`CONFIG_NOVA_FS`) support.  Install as usual.
>> +
>> +NOVA runs on a pmem non-volatile memory region.  You can create one of these
>> +regions with the `memmap` kernel command line option.  For instance, adding
>> +`memmap=16G!8G` to the kernel boot parameters will reserve 16GB memory starting
>> +from address 8GB, and the kernel will create a `pmem0` block device under the
>> +`/dev` directory.
>> +
>> +After the OS has booted, you can initialize a NOVA instance with the following commands:
>> +
>> +
>> +# modprobe nova
>> +# mount -t NOVA -o init /dev/pmem0 /mnt/nova
>
> Hmph, unique in upper-case-ness (at least for in-tree fs-es).
> Would you consider "nova" instead?
>

I will try that.

>> +
>> +
>> +The above commands create a NOVA instance on `/dev/pmem0` and mounts it on
>> +`/mnt/nova`.
>> +
>> +NOVA support several module command line options:
>
>         supports
>
>> +
>> + * measure_timing: Measure the timing of file system operations for profiling (default: 0)
>> +
>> + * inplace_data_updates:  Update data in place rather than with COW (default: 0)
>> +
>> +To recover an existing NOVA instance, mount NOVA without the init option, for example:
>> +
>> +# mount -t NOVA /dev/pmem0 /mnt/nova
>> +
>> +
>> +Sysfs support
>> +-------------
>> +
>> +NOVA provides sysfs support to enable user to get/set information of
>
>                                   enable a user
>                         or        enable users
>
> And the line above ends with a trailing space.  Please check/remove all of those.
>
>> +a running NOVA instance.
>> +After mount, NOVA creates four entries under proc directory /proc/fs/nova/pmem#/:
>
> Above uses lower-case "nova" in /proc/fs/nova/... but the examples below use NOVA.
> nova is preferred (IMO).
>
>> +
>> +timing_stats IO_stats        allocator       gc
>> +
>> +Show NOVA file operation timing statistics:
>> +# cat /proc/fs/NOVA/pmem#/timing_stats
>> +
>> +Clear timing statistics:
>> +# echo 1 > /proc/fs/NOVA/pmem#/timing_stats
>> +
>> +Show NOVA I/O statistics:
>> +# cat /proc/fs/NOVA/pmem#/IO_stats
>> +
>> +Clear I/O statistics:
>> +# echo 1 > /proc/fs/NOVA/pmem#/IO_stats
>> +
>> +Show NOVA allocator information:
>> +# cat /proc/fs/NOVA/pmem#/allocator
>> +
>> +Manual garbage collection:
>> +# echo #inode_number > /proc/fs/NOVA/pmem#/gc
>> +
>> +
>> +Source File Structure
>> +=====================
>> +
>> +  * nova_def.h/nova.h
>> +   Defines NOVA macros and key inline functions.
>> +
>> +  * balloc.{h,c}
>> +    NOVA's pmem allocator implementation.
>> +
>> +  * bbuild.c
>> +    Implements recovery routines to restore the in-use inode list and the NVMM
>> +    allocator information.
>> +
>> +  * dax.c
>> +    Implements DAX read/write and mmap functions to access file data. NOVA uses
>> +    copy-on-write to modify file pages by default, unless inplace data update is
>> +    enabled at mount-time.
>> +
>> +  * dir.c
>> +    Contains functions to create, update, and remove NOVA dentries.
>> +
>> +  * file.c
>> +    Implements file-related operations such as open, fallocate, llseek, fsync,
>> +    and flush.
>> +
>> +  * gc.c
>> +    NOVA's garbage collection functions.
>> +
>> +  * inode.{h,c}
>> +    Creates, reads, and frees NOVA inode tables and inodes.
>> +
>> +  * ioctl.c
>> +    Implements some ioctl commands to call NOVA's internal functions.
>> +
>> +  * journal.{h,c}
>> +    For operations that affect multiple inodes NOVA uses lightweight,
>> +    fixed-length journals – one per core. This file contains functions to
>> +    create and manage the lite journals.
>> +
>> +  * log.{h,c}
>> +    Functions to manipulate NOVA inode logs, including log page allocation, log
>> +    entry creation, commit, modification, and deletion.
>> +
>> +  * namei.c
>> +    Functions to create/remove files, directories, and links. It also looks for
>> +    the NOVA inode number for a given path name.
>> +
>> +  * rebuild.c
>> +    When mounting NOVA, rebuild NOVA inodes from its logs.
>> +
>> +  * stats.{h,c}
>> +    Provide routines to gather and print NOVA usage statistics.
>> +
>> +  * super.{h,c}
>> +    Super block structures and NOVA FS layout and entry points for NOVA
>> +    mounting and unmounting, initializing or recovering the NOVA super block
>> +    and other global file system information.
>> +
>> +  * symlink.c
>> +    Implements functions to create and read symbolic links in the filesystem.
>> +
>> +  * sysfs.c
>> +    Implements sysfs entries to take user inputs for printing NOVA statistics.
>
> s/sysfs/procfs/
>
>> +
>> +
>> +Filesystem Layout
>> +=================
>> +
>> +A NOVA file systems resides in single PMEM device. *****
>> +NOVA divides the device into 4KB blocks.
>
>                                 4 KB  {or use 4KB way up above here}
>
>> +
>> + block
>> ++---------------------------------------------------------+
>> +|    0    | primary super block (struct nova_super_block) |
>> ++---------------------------------------------------------+
>> +|    1    | Reserved inodes                               |
>> ++---------------------------------------------------------+
>> +|  2 - 15 | reserved                                      |
>> ++---------------------------------------------------------+
>> +| 16 - 31 | Inode table pointers                          |
>> ++---------------------------------------------------------+
>> +| 32 - 47 | Journal pointers                              |
>> ++---------------------------------------------------------+
>> +| 48 - 63 | reserved                                      |
>> ++---------------------------------------------------------+
>> +|   ...   | log and data pages                            |
>> ++---------------------------------------------------------+
>> +|   n-2   | replica reserved Inodes                       |
>> ++---------------------------------------------------------+
>> +|   n-1   | replica super block                           |
>> ++---------------------------------------------------------+
>> +
>> +
>> +
>> +Superblock and Associated Structures
>> +====================================
>> +
>> +The beginning of the PMEM device hold the super block and its associated
>
>                                     holds
>
>> +tables.  These include reserved inodes, a table of pointers to the journals
>> +NOVA uses for complex operations, and pointers to inodes tables.  NOVA
>> +maintains replicas of the super block and reserved inodes in the last two
>> +blocks of the PMEM area.
>> +
>> +
>> +Block Allocator/Free Lists
>> +==========================
>> +
>> +NOVA uses per-CPU allocators to manage free PMEM blocks.  On initialization,> +NOVA divides the range of blocks in the PMEM device among the CPUs, and those
>> +blocks are managed solely by that CPU.  We call these ranges of "allocation regions".
>> +Each allocator maintains a red-black tree of unallocated ranges (struct
>> +nova_range_node).
>> +
>> +Allocation Functions
>> +--------------------
>> +
>> +NOVA allocate PMEM blocks using two mechanisms:
>
>         allocates
>
>> +
>> +1.  Static allocation as defined in super.h
>> +
>> +2.  Allocation for log and data pages via nova_new_log_blocks() and
>> +nova_new_data_blocks().
>> +
>> +
>> +PMEM Address Translation
>> +------------------------
>> +
>> +In NOVA's persistent data structures, memory locations are given as offsets
>> +from the beginning of the PMEM region.  nova_get_block() translates offsets to
>> +PMEM addresses.  nova_get_addr_off() performs the reverse translation.
>> +
>> +
>> +Inodes
>> +======
>> +
>> +NOVA maintains per-CPU inode tables, and inode numbers are striped across the
>> +tables (i.e., inos 0, n, 2n,... on cpu 0; inos 1, n + 1, 2n + 1, ... on cpu 1).
>> +
>> +The inodes themselves live in a set of linked lists (one per CPU) of 2MB
>> +blocks.  The last 8 bytes of each block points to the next block.  Pointers to
>> +heads of these list live in PMEM block INODE_TABLE_START.
>
>                   lists
>
>> +Additional space for inodes is allocated on demand.
>> +
>> +To allocate inodes, NOVA maintains a per-cpu "inuse_list" in DRAM holds a RB
>
> s/cpu/CPU/g
> s/a RB/an RB/
>
> but that isn't quite a sentence. Please fix it.
>
>> +tree that holds ranges of allocated inode numbers.
>> +
>> +
>> +Logs
>> +====
>> +
>> +NOVA maintains a log for each inode that records updates to the inode's
>> +metadata and holds pointers to the file data.  NOVA makes updates to file data
>> +and metadata atomic by atomically appending log entries to the log.
>> +
>> +Each inode contains pointers to head and tail of the inode's log.  When the log
>> +grows past the end of the last page, nova allocates additional space.  For
>> +short logs (less than 1MB) , it doubles the length.  For longer logs, it adds a
>> +fixed amount of additional space (1MB).
>> +
>> +Log space is reclaimed during garbage collection.
>> +
>> +Log Entries
>> +-----------
>> +
>> +There are four kinds of log entry, documented in log.h.  The log entries have
>> +several entries in common:
>> +
>> +   1.  'epoch_id' gives the epoch during which the log entry was created.
>> +   Creating a snapshot increments the epoch_id for the file systems.
>
>                                                           file system.  (?)
> or do multiple epochs (snapshots) => multiple fs-es?
>
>> +   Currently disabled (always zero).
>> +
>> +   2.  'trans_id' is per-inode, monotone increasing, number assigned each
>> +   log entry.  It provides an ordering over FS operations on a single inode.
>> +
>> +   3.  'invalid' is true if the effects of this entry are dead and the log
>> +   entry can be garbage collected.
>> +
>> +   4.  'csum' is a CRC32 checksum for the entry. Currently it is disabled.
>> +
>> +Log structure
>> +-------------
>> +
>> +The logs comprise a linked list of PMEM blocks.  The tail of each block
>> +contains some metadata about the block and pointers to the next block and
>> +block's replica (struct nova_inode_page_tail).
>> +
>> ++----------------+
>> +| log entry      |
>> ++----------------+
>> +| log entry      |
>> ++----------------+
>> +| ...            |
>> ++----------------+
>> +| tail           |
>> +|  metadata      |
>> +|  -> next block |
>> ++----------------+
>> +
>> +
>> +Journals
>> +========
>> +
>> +NOVA uses a lightweight journaling mechanisms to provide atomicity for
>
>                                       mechanism
>
>> +operations that modify more than one on inode.  The journals providing logging
>
> end of that "sentence" (above) is confusing or missing something.
>
>> +for two operations:
>> +
>> +1.  Single word updates (JOURNAL_ENTRY)
>> +2.  Copying inodes (JOURNAL_INODE)
>> +
>> +The journals are undo logs: NOVA creates the journal entries for an operation,
>> +and if the operation does not complete due to a system failure, the recovery
>> +process rolls back the changes using the journal entries.
>> +
>> +To commit, NOVA drops the log.
>> +
>> +NOVA maintains one journal per CPU.  The head and tail pointers for each
>> +journal live in a reserved page near the beginning of the file system.
>> +
>> +During recovery, NOVA scans the journals and undoes the operations described by
>> +each entry.
>> +
>> +
>> +File and Directory Access
>> +=========================
>> +
>> +To access file data via read(), NOVA maintains a radix tree in DRAM for each
>> +inode (nova_inode_info_header.tree) that maps file offsets to write log
>> +entries.  For directories, the same tree maps a hash of filenames to their
>> +corresponding dentry.
>> +
>> +In both cases, the nova populates the tree when the file or directory is opened
>
>                   the nova fs (?)
>
>> +by scanning its log.
>> +
>> +
>> +MMap and DAX
>> +============
>> +
>> +NOVA leverages the kernel's DAX mechanisms for mmap and file data access.
>> +NOVA supports DAX-style mmap, i.e. mapping NVM pages directly to the
>> +application's address space.
>> +
>> +
>> +Garbage Collection
>> +==================
>> +
>> +NOVA recovers log space with a two-phase garbage collection system.  When a log
>> +reaches the end of its allocated pages, NOVA allocates more space.  Then, the
>> +fast GC algorithm scans the log to remove pages that have no valid entries.
>> +Then, it estimates how many pages the logs valid entries would fill.  If this
>> +is less than half the number of pages in the log, the second GC phase copies
>> +the valid entries to new pages.
>> +
>> +For example (V=valid; I=invalid):
>> +
>> ++---+         +---+          +---+
>> +| I |               | I |            | V |
>> ++---+               +---+  Thorough  +---+
>> +| V |               | V |     GC     | V |
>> ++---+               +---+   =====>   +---+
>> +| I |               | I |            | V |
>> ++---+               +---+            +---+
>> +| V |               | V |            | V |
>> ++---+               +---+            +---+
>> +  |           |
>> +  V           V
>> ++---+               +---+
>> +| I |               | V |
>> ++---+               +---+
>> +| I | fast GC  | I |
>> ++---+  ====>   +---+
>> +| I |               | I |
>> ++---+               +---+
>> +| I |               | V |
>> ++---+               +---+
>> +  |
>> +  V
>> ++---+
>> +| V |
>> ++---+
>> +| I |
>> ++---+
>> +| I |
>> ++---+
>> +| V |
>> ++---+
>> +
>> +
>> +Umount and Recovery
>> +===================
>> +
>> +Clean umount/mount
>> +------------------
>> +
>> +On a clean unmount, NOVA saves the contents of many of its DRAM data structures
>> +to PMEM to accelerate the next mount:
>> +
>> +1. NOVA stores the allocator state for each of the per-cpu allocators to the
>> +   log of a reserved inode (NOVA_BLOCK_NODE_INO).
>> +
>> +2. NOVA stores the per-CPU lists of alive inodes (the inuse_list) to the
>> +   NOVA_BLOCK_INODELIST_INO reserved inode.
>> +
>> +After a clean unmount, the following mount restores these data and then
>> +invalidates them.
>> +
>> +Recovery after failures
>> +-----------------------
>> +
>> +In case of a unclean dismount (e.g., system crash), NOVA must rebuild these
>
>            of an unclean
>
>> +DRAM structures by scanning the inode logs.  NOVA log scanning is fast because
>> +per-CPU inode tables and per-inode logs allow for parallel recovery.
>> +
>> +The number of live log entries in an inode log is roughly the number of extents
>> +in the file.  As a result, NOVA only needs to scan a small fraction of the NVMM
>> +during recovery.
>> +
>> +The NOVA failure recovery consists of two steps:
>> +
>> +First, NOVA checks its lite weight journals and rolls back any uncommitted
>
>           should be one word: lightweight (or liteweight)
>
>> +transactions to restore the file system to a consistent state.
>> +
>> +Second, NOVA starts a recovery thread on each CPU and scans the inode tables in
>> +parallel, performing log scanning for every valid inode in the inode table.
>> +NOVA use different recovery mechanisms for directory inodes and file inodes:
>
>                                                                and file inodes.
>
>> +For a directory inode, NOVA scans the log's linked list to enumerate the pages
>> +it occupies, but it does not inspect the log's contents.  For a file inode,
>> +NOVA reads the write entries in the log to enumerate the data pages.
>> +
>> +During the recovery scan NOVA builds a bitmap of occupied pages, and rebuilds
>> +the allocator based on the result. After this process completes, the file
>> +system is ready to accept new requests.
>> +
>> +During the same scan, it rebuilds the list of available inodes.
>> +
>> +
>> +Gaps, Missing Features, and Development Status
>> +==============================================
>> +
>> +Although NOVA is a fully-functional file system, there is still much work left
>> +to be done.  In particular, (at least) the following items are currently missing:
>> +
>> +1.  Snapshot, metadata and data replication and protection are left for future submission.
>> +2.  There is no mkfs or fsck utility (`mount` takes `-o init` to create a NOVA file system).
>> +3.  NOVA only works on x86-64 kernels.
>> +4.  NOVA does not currently support extended attributes or ACL.
>> +5.  NOVA doesn't provide quota support.
>> +6.  Moving NOVA file systems between machines with different numbers of CPUs does not work.
>
> You could artificially limit the number of "known" CPUs so that a NOVA fs could be
> moved from a 16-CPU system to an 8-CPU system by telling NOVA to use only 8 CPUs
> (as an example).  Just a thought.
>

I think storing the number of CPUs in the superblock, and perform
checking during mount phase can fix the issue.

Moving from 8-CPU to 16-CPU should be simple, just allocate more inode
tables and journal pages. Moving from 16-CPU to 8-CPU is a little more
difficult, mainly in inode table linking. CPU hotplug is still a
challenge.

I will try to fix it in the next version if I have time.

Thanks,
Andiry

>> +
>> +None of these are fundamental limitations of NOVA's design.
>> +
>> +NOVA is complete and robust enough to run a range of complex applications, but
>> +it is not yet ready for production use.  Our current focus is on adding a few
>> +missing features from the list above and finding/fixing bugs.
>> +
>> +
>> +Hacking and Contributing
>> +========================
>> +
>> +If you find bugs, please report them at https://github.com/NVSL/linux-nova/issues.
>> +
>> +If you have other questions or suggestions you can contact the NOVA developers
>> +at cse-nova-hackers@eng.ucsd.edu.
>
>
> --
> ~Randy
diff mbox

Patch

diff --git a/Documentation/filesystems/00-INDEX b/Documentation/filesystems/00-INDEX
index b7bd6c9..dc5c722 100644
--- a/Documentation/filesystems/00-INDEX
+++ b/Documentation/filesystems/00-INDEX
@@ -95,6 +95,8 @@  nfs/
 	- nfs-related documentation.
 nilfs2.txt
 	- info and mount options for the NILFS2 filesystem.
+nova.txt
+	- info on the NOVA filesystem.
 ntfs.txt
 	- info and mount options for the NTFS filesystem (Windows NT).
 ocfs2.txt
diff --git a/Documentation/filesystems/nova.txt b/Documentation/filesystems/nova.txt
new file mode 100644
index 0000000..4728f50
--- /dev/null
+++ b/Documentation/filesystems/nova.txt
@@ -0,0 +1,498 @@ 
+The NOVA Filesystem
+===================
+
+NOn-Volatile memory Accelerated file system (NOVA) is a DAX file system
+designed to provide a high performance and production-ready file system
+tailored for byte-addressable non-volatile memories (e.g., NVDIMMs
+and Intel's soon-to-be-released 3DXPoint DIMMs).
+NOVA combines design elements from many other file systems
+and adapts conventional log-structured file system techniques to
+exploit the fast random access that NVMs provide. In particular, NOVA maintains
+separate logs for each inode to improve concurrency, and stores file data
+outside the log to minimize log size and reduce garbage collection costs. NOVA's
+logs provide metadata and data atomicity and focus on simplicity and
+reliability, keeping complex metadata structures in DRAM to accelerate lookup
+operations.
+
+NOVA was developed by the Non-Volatile Systems Laboratory (NVSL) in
+the Computer Science and Engineering Department at the University of
+California, San Diego.
+
+A more thorough discussion of NOVA's design is avaialable in these two papers:
+
+NOVA: A Log-structured File System for Hybrid Volatile/Non-volatile Main Memories
+Jian Xu and Steven Swanson
+In The 14th USENIX Conference on File and Storage Technologies (FAST '16)
+
+NOVA-Fortis: A Fault-Tolerant Non-Volatile Main Memory File System
+Jian Xu, Lu Zhang, Amirsaman Memaripour, Akshatha Gangadharaiah, Amit Borase,
+Tamires Brito Da Silva, Andy Rudoff and Steven Swanson
+In The 26th ACM Symposium on Operating Systems Principles (SOSP '17)
+
+This version of NOVA contains features from the FAST paper.
+NOVA-Fortis features, such as snapshot, metadata and data protection and replication
+are left for future submission.
+
+The main NOVA features include:
+
+  * POSIX semantics
+  * Directly access (DAX) byte-addressable NVMM without page caching
+  * Per-CPU NVMM pool to maximize concurrency
+  * Strong consistency guarantees with 8-byte atomic stores
+
+
+Filesystem Design
+=================
+
+NOVA divides NVMM into several regions. NOVA's 512B superblock contains global
+file system information and the recovery inode. The recovery inode represents a
+special file that stores recovery information (e.g., the list of unallocated
+NVMM pages). NOVA divides its inode tables into per-CPU stripes. It also
+provides per-CPU journals for complex file operations that involve multiple
+inodes. The rest of the available NVMM stores logs and file data.
+
+NOVA is log-structured and stores a separate log for each inode to maximize
+concurrency and provide atomicity for operations that affect a single file. The
+logs only store metadata and comprise a linked list of 4 KB pages. Log entries
+are small – between 32 and 64 bytes. Logs are generally non-contiguous, and log
+pages may reside anywhere in NVMM.
+
+NOVA keeps copies of most file metadata in DRAM during normal
+operations, eliminating the need to access metadata in NVMM during reads.
+
+NOVA supports both copy-on-write and in-place file data updates and appends
+metadata about the write to the log. For operations that affect multiple inodes
+NOVA uses lightweight, fixed-length journals –one per core.
+
+NOVA divides the allocatable NVMM into multiple regions, one region per CPU
+core. A per-core allocator manages each of the regions, minimizing contention
+during memory allocation.
+
+After a system crash, NOVA must scan all the logs to rebuild the memory
+allocator state. Since, there are many logs, NOVA aggressively parallelizes the
+scan.
+
+
+Building and using NOVA
+=======================
+
+To build NOVA, build the kernel with PMEM (`CONFIG_BLK_DEV_PMEM`),
+DAX (`CONFIG_FS_DAX`) and NOVA (`CONFIG_NOVA_FS`) support.  Install as usual.
+
+NOVA runs on a pmem non-volatile memory region.  You can create one of these
+regions with the `memmap` kernel command line option.  For instance, adding
+`memmap=16G!8G` to the kernel boot parameters will reserve 16GB memory starting
+from address 8GB, and the kernel will create a `pmem0` block device under the
+`/dev` directory.
+
+After the OS has booted, you can initialize a NOVA instance with the following commands:
+
+
+# modprobe nova
+# mount -t NOVA -o init /dev/pmem0 /mnt/nova
+
+
+The above commands create a NOVA instance on `/dev/pmem0` and mounts it on
+`/mnt/nova`.
+
+NOVA support several module command line options:
+
+ * measure_timing: Measure the timing of file system operations for profiling (default: 0)
+
+ * inplace_data_updates:  Update data in place rather than with COW (default: 0)
+
+To recover an existing NOVA instance, mount NOVA without the init option, for example:
+
+# mount -t NOVA /dev/pmem0 /mnt/nova
+
+
+Sysfs support
+-------------
+
+NOVA provides sysfs support to enable user to get/set information of 
+a running NOVA instance.
+After mount, NOVA creates four entries under proc directory /proc/fs/nova/pmem#/:
+
+timing_stats	IO_stats	allocator	gc
+
+Show NOVA file operation timing statistics:
+# cat /proc/fs/NOVA/pmem#/timing_stats
+
+Clear timing statistics:
+# echo 1 > /proc/fs/NOVA/pmem#/timing_stats
+
+Show NOVA I/O statistics:
+# cat /proc/fs/NOVA/pmem#/IO_stats
+
+Clear I/O statistics:
+# echo 1 > /proc/fs/NOVA/pmem#/IO_stats
+
+Show NOVA allocator information:
+# cat /proc/fs/NOVA/pmem#/allocator
+
+Manual garbage collection:
+# echo #inode_number > /proc/fs/NOVA/pmem#/gc
+
+
+Source File Structure
+=====================
+
+  * nova_def.h/nova.h
+   Defines NOVA macros and key inline functions.
+
+  * balloc.{h,c}
+    NOVA's pmem allocator implementation.
+
+  * bbuild.c
+    Implements recovery routines to restore the in-use inode list and the NVMM
+    allocator information.
+
+  * dax.c
+    Implements DAX read/write and mmap functions to access file data. NOVA uses
+    copy-on-write to modify file pages by default, unless inplace data update is
+    enabled at mount-time.
+
+  * dir.c
+    Contains functions to create, update, and remove NOVA dentries.
+
+  * file.c
+    Implements file-related operations such as open, fallocate, llseek, fsync,
+    and flush.
+
+  * gc.c
+    NOVA's garbage collection functions.
+
+  * inode.{h,c}
+    Creates, reads, and frees NOVA inode tables and inodes.
+
+  * ioctl.c
+    Implements some ioctl commands to call NOVA's internal functions.
+
+  * journal.{h,c}
+    For operations that affect multiple inodes NOVA uses lightweight,
+    fixed-length journals – one per core. This file contains functions to
+    create and manage the lite journals.
+
+  * log.{h,c}
+    Functions to manipulate NOVA inode logs, including log page allocation, log
+    entry creation, commit, modification, and deletion.
+
+  * namei.c
+    Functions to create/remove files, directories, and links. It also looks for
+    the NOVA inode number for a given path name.
+
+  * rebuild.c
+    When mounting NOVA, rebuild NOVA inodes from its logs.
+
+  * stats.{h,c}
+    Provide routines to gather and print NOVA usage statistics.
+
+  * super.{h,c}
+    Super block structures and NOVA FS layout and entry points for NOVA
+    mounting and unmounting, initializing or recovering the NOVA super block
+    and other global file system information.
+
+  * symlink.c
+    Implements functions to create and read symbolic links in the filesystem.
+
+  * sysfs.c
+    Implements sysfs entries to take user inputs for printing NOVA statistics.
+
+
+Filesystem Layout
+=================
+
+A NOVA file systems resides in single PMEM device.
+NOVA divides the device into 4KB blocks.
+
+ block
++---------------------------------------------------------+
+|    0    | primary super block (struct nova_super_block) |
++---------------------------------------------------------+
+|    1    | Reserved inodes                               |
++---------------------------------------------------------+
+|  2 - 15 | reserved                                      |
++---------------------------------------------------------+
+| 16 - 31 | Inode table pointers                          |
++---------------------------------------------------------+
+| 32 - 47 | Journal pointers                              |
++---------------------------------------------------------+
+| 48 - 63 | reserved                                      |
++---------------------------------------------------------+
+|   ...   | log and data pages                            |
++---------------------------------------------------------+
+|   n-2   | replica reserved Inodes                       |
++---------------------------------------------------------+
+|   n-1   | replica super block                           |
++---------------------------------------------------------+
+
+
+
+Superblock and Associated Structures
+====================================
+
+The beginning of the PMEM device hold the super block and its associated
+tables.  These include reserved inodes, a table of pointers to the journals
+NOVA uses for complex operations, and pointers to inodes tables.  NOVA
+maintains replicas of the super block and reserved inodes in the last two
+blocks of the PMEM area.
+
+
+Block Allocator/Free Lists
+==========================
+
+NOVA uses per-CPU allocators to manage free PMEM blocks.  On initialization,
+NOVA divides the range of blocks in the PMEM device among the CPUs, and those
+blocks are managed solely by that CPU.  We call these ranges of "allocation regions".
+Each allocator maintains a red-black tree of unallocated ranges (struct
+nova_range_node).
+
+Allocation Functions
+--------------------
+
+NOVA allocate PMEM blocks using two mechanisms:
+
+1.  Static allocation as defined in super.h
+
+2.  Allocation for log and data pages via nova_new_log_blocks() and
+nova_new_data_blocks().
+
+
+PMEM Address Translation
+------------------------
+
+In NOVA's persistent data structures, memory locations are given as offsets
+from the beginning of the PMEM region.  nova_get_block() translates offsets to
+PMEM addresses.  nova_get_addr_off() performs the reverse translation.
+
+
+Inodes
+======
+
+NOVA maintains per-CPU inode tables, and inode numbers are striped across the
+tables (i.e., inos 0, n, 2n,... on cpu 0; inos 1, n + 1, 2n + 1, ... on cpu 1).
+
+The inodes themselves live in a set of linked lists (one per CPU) of 2MB
+blocks.  The last 8 bytes of each block points to the next block.  Pointers to
+heads of these list live in PMEM block INODE_TABLE_START.
+Additional space for inodes is allocated on demand.
+
+To allocate inodes, NOVA maintains a per-cpu "inuse_list" in DRAM holds a RB
+tree that holds ranges of allocated inode numbers.
+
+
+Logs
+====
+
+NOVA maintains a log for each inode that records updates to the inode's
+metadata and holds pointers to the file data.  NOVA makes updates to file data
+and metadata atomic by atomically appending log entries to the log.
+
+Each inode contains pointers to head and tail of the inode's log.  When the log
+grows past the end of the last page, nova allocates additional space.  For
+short logs (less than 1MB) , it doubles the length.  For longer logs, it adds a
+fixed amount of additional space (1MB).
+
+Log space is reclaimed during garbage collection.
+
+Log Entries
+-----------
+
+There are four kinds of log entry, documented in log.h.  The log entries have
+several entries in common:
+
+   1.  'epoch_id' gives the epoch during which the log entry was created.
+   Creating a snapshot increments the epoch_id for the file systems.
+   Currently disabled (always zero).
+
+   2.  'trans_id' is per-inode, monotone increasing, number assigned each
+   log entry.  It provides an ordering over FS operations on a single inode.
+
+   3.  'invalid' is true if the effects of this entry are dead and the log
+   entry can be garbage collected.
+
+   4.  'csum' is a CRC32 checksum for the entry. Currently it is disabled.
+
+Log structure
+-------------
+
+The logs comprise a linked list of PMEM blocks.  The tail of each block
+contains some metadata about the block and pointers to the next block and
+block's replica (struct nova_inode_page_tail).
+
++----------------+
+| log entry      |
++----------------+
+| log entry      |
++----------------+
+| ...            |
++----------------+
+| tail           |
+|  metadata      |
+|  -> next block |
++----------------+
+
+
+Journals
+========
+
+NOVA uses a lightweight journaling mechanisms to provide atomicity for
+operations that modify more than one on inode.  The journals providing logging
+for two operations:
+
+1.  Single word updates (JOURNAL_ENTRY)
+2.  Copying inodes (JOURNAL_INODE)
+
+The journals are undo logs: NOVA creates the journal entries for an operation,
+and if the operation does not complete due to a system failure, the recovery
+process rolls back the changes using the journal entries.
+
+To commit, NOVA drops the log.
+
+NOVA maintains one journal per CPU.  The head and tail pointers for each
+journal live in a reserved page near the beginning of the file system.
+
+During recovery, NOVA scans the journals and undoes the operations described by
+each entry.
+
+
+File and Directory Access
+=========================
+
+To access file data via read(), NOVA maintains a radix tree in DRAM for each
+inode (nova_inode_info_header.tree) that maps file offsets to write log
+entries.  For directories, the same tree maps a hash of filenames to their
+corresponding dentry.
+
+In both cases, the nova populates the tree when the file or directory is opened
+by scanning its log.
+
+
+MMap and DAX
+============
+
+NOVA leverages the kernel's DAX mechanisms for mmap and file data access.
+NOVA supports DAX-style mmap, i.e. mapping NVM pages directly to the
+application's address space.
+
+
+Garbage Collection
+==================
+
+NOVA recovers log space with a two-phase garbage collection system.  When a log
+reaches the end of its allocated pages, NOVA allocates more space.  Then, the
+fast GC algorithm scans the log to remove pages that have no valid entries.
+Then, it estimates how many pages the logs valid entries would fill.  If this
+is less than half the number of pages in the log, the second GC phase copies
+the valid entries to new pages.
+
+For example (V=valid; I=invalid):
+
++---+         +---+	        +---+
+| I |	       | I |  	      	| V |
++---+	       +---+  Thorough	+---+
+| V |	       | V |  	 GC   	| V |
++---+	       +---+   =====> 	+---+
+| I |	       | I |  	      	| V |
++---+	       +---+	        +---+
+| V |	       | V |  	        | V |
++---+	       +---+            +---+
+  |	         |
+  V	         V
++---+	       +---+
+| I |	       | V |
++---+	       +---+
+| I | fast GC  | I |
++---+  ====>   +---+
+| I |	       | I |
++---+	       +---+
+| I |	       | V |
++---+	       +---+
+  |
+  V
++---+
+| V |
++---+
+| I |
++---+
+| I |
++---+
+| V |
++---+
+
+
+Umount and Recovery
+===================
+
+Clean umount/mount
+------------------
+
+On a clean unmount, NOVA saves the contents of many of its DRAM data structures
+to PMEM to accelerate the next mount:
+
+1. NOVA stores the allocator state for each of the per-cpu allocators to the
+   log of a reserved inode (NOVA_BLOCK_NODE_INO).
+
+2. NOVA stores the per-CPU lists of alive inodes (the inuse_list) to the
+   NOVA_BLOCK_INODELIST_INO reserved inode.
+
+After a clean unmount, the following mount restores these data and then
+invalidates them.
+
+Recovery after failures
+-----------------------
+
+In case of a unclean dismount (e.g., system crash), NOVA must rebuild these
+DRAM structures by scanning the inode logs.  NOVA log scanning is fast because
+per-CPU inode tables and per-inode logs allow for parallel recovery.
+
+The number of live log entries in an inode log is roughly the number of extents
+in the file.  As a result, NOVA only needs to scan a small fraction of the NVMM
+during recovery.
+
+The NOVA failure recovery consists of two steps:
+
+First, NOVA checks its lite weight journals and rolls back any uncommitted
+transactions to restore the file system to a consistent state.
+
+Second, NOVA starts a recovery thread on each CPU and scans the inode tables in
+parallel, performing log scanning for every valid inode in the inode table.
+NOVA use different recovery mechanisms for directory inodes and file inodes:
+For a directory inode, NOVA scans the log's linked list to enumerate the pages
+it occupies, but it does not inspect the log's contents.  For a file inode,
+NOVA reads the write entries in the log to enumerate the data pages.
+
+During the recovery scan NOVA builds a bitmap of occupied pages, and rebuilds
+the allocator based on the result. After this process completes, the file
+system is ready to accept new requests.
+
+During the same scan, it rebuilds the list of available inodes.
+
+
+Gaps, Missing Features, and Development Status
+==============================================
+
+Although NOVA is a fully-functional file system, there is still much work left
+to be done.  In particular, (at least) the following items are currently missing:
+
+1.  Snapshot, metadata and data replication and protection are left for future submission.
+2.  There is no mkfs or fsck utility (`mount` takes `-o init` to create a NOVA file system).
+3.  NOVA only works on x86-64 kernels.
+4.  NOVA does not currently support extended attributes or ACL.
+5.  NOVA doesn't provide quota support.
+6.  Moving NOVA file systems between machines with different numbers of CPUs does not work.
+
+None of these are fundamental limitations of NOVA's design.
+
+NOVA is complete and robust enough to run a range of complex applications, but
+it is not yet ready for production use.  Our current focus is on adding a few
+missing features from the list above and finding/fixing bugs.
+
+
+Hacking and Contributing
+========================
+
+If you find bugs, please report them at https://github.com/NVSL/linux-nova/issues.
+
+If you have other questions or suggestions you can contact the NOVA developers
+at cse-nova-hackers@eng.ucsd.edu.
diff --git a/MAINTAINERS b/MAINTAINERS
index 4623caf..89ac59b 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -9825,6 +9825,14 @@  F:	drivers/power/supply/bq27xxx_battery_i2c.c
 F:	drivers/power/supply/isp1704_charger.c
 F:	drivers/power/supply/rx51_battery.c
 
+NOVA FILESYSTEM
+M:	Andiry Xu <jix024@cs.ucsd.edu>
+M:	Steven Swanson <swanson@cs.ucsd.edu>
+L:	linux-fsdevel@vger.kernel.org
+L:	linux-nvdimm@lists.01.org
+F:	Documentation/filesystems/nova.txt
+F:	fs/nova/
+
 NTB AMD DRIVER
 M:	Shyam Sundar S K <Shyam-sundar.S-k@amd.com>
 L:	linux-ntb@googlegroups.com