[RFC,11/16] NOVA: Snapshot support

Nova supports snapshots to facilitate backups.

Taking a snapshot
-----------------

Each Nova file system has a current epoch_id in the superblock, and each log
entry has the epoch_id attached to it at creation.  When the user creates a
snapshot, Nova increments the epoch_id for the file system, and the old
epoch_id identifies the moment the snapshot was taken.

Nova records the epoch_id and a timestamp in a new log entry (struct
snapshot_info_log_entry) and appends it to the log of the reserved snapshot
inode (NOVA_SNAPSHOT_INODE) in the superblock.

Nova also maintains a radix tree (nova_sb_info.snapshot_info_tree) of struct
snapshot_info in DRAM indexed by epoch_id.

Nova also marks all mmap'd pages as read-only and uses COW to preserve file
contents after the snapshot.
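
As a rough illustration of the epoch bookkeeping described in this section,
the following user-space sketch records the old epoch_id and a timestamp and
then advances the current epoch.  It is a simplified model, not the code in
fs/nova/snapshot.c: struct toy_fs, struct snap_record, MAX_SNAPSHOTS and
toy_take_snapshot() are hypothetical names, and a fixed array stands in for
both the snapshot inode's log and the DRAM radix tree.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_SNAPSHOTS 16	/* hypothetical limit for this sketch */

/* Simplified stand-in for struct snapshot_info_log_entry. */
struct snap_record {
	uint64_t epoch_id;	/* epoch that the snapshot captures */
	uint64_t timestamp;	/* caller-supplied creation time */
};

/* Simplified stand-in for the per-file-system state. */
struct toy_fs {
	uint64_t epoch_id;			/* current epoch */
	struct snap_record snaps[MAX_SNAPSHOTS];	/* models the snapshot log */
	size_t nr_snaps;
};

/*
 * Record a snapshot of the current epoch, then start a new epoch.
 * Returns the snapshot's epoch_id, or UINT64_MAX if the table is full.
 */
static uint64_t toy_take_snapshot(struct toy_fs *fs, uint64_t now)
{
	uint64_t snap_epoch;

	if (fs->nr_snaps == MAX_SNAPSHOTS)
		return UINT64_MAX;

	snap_epoch = fs->epoch_id;
	fs->snaps[fs->nr_snaps].epoch_id = snap_epoch;
	fs->snaps[fs->nr_snaps].timestamp = now;
	fs->nr_snaps++;

	/* Log entries created from now on carry the new epoch_id. */
	fs->epoch_id++;
	assert(fs->epoch_id > snap_epoch);
	return snap_epoch;
}
```

In this model, a file system at epoch 5 that takes a snapshot ends up at epoch
6, and the snapshot is identified by epoch 5.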

Tracking Live Data
------------------

Supporting snapshots requires Nova to preserve file contents from previous
snapshots while also being able to recover the space a snapshot occupied after
its deletion.

Preserving file contents requires a small change to how Nova implements write
operations.  To perform a write, Nova appends a write log entry to the file's
log.  The log entry includes pointers to newly-allocated and populated NVMM
pages that hold the written data.  If the write overwrites existing data, Nova
locates the previous write log entry for that portion of the file, and performs
an "epoch check" that compares the old log entry's epoch_id to the file
system's current epoch_id.  If the comparison matches, the old write log entry
and the file data blocks it points to no longer belong to any snapshot, and
Nova reclaims the data blocks.

If the epoch_ids do not match, then the data in the old log entry belongs to
an earlier snapshot and Nova leaves the log entry in place.
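
The epoch check itself reduces to a single comparison.  The sketch below is a
minimal model under that reading; struct toy_write_entry and toy_epoch_check()
are hypothetical names and do not reflect the layout of Nova's real write log
entry.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified write log entry. */
struct toy_write_entry {
	uint64_t epoch_id;	/* epoch attached when the entry was created */
	uint64_t block;		/* first data block the entry points to */
	uint64_t num_blocks;	/* number of data blocks */
};

/*
 * Returns true if the old entry was created in the current epoch, so no
 * snapshot can reference it and its data blocks may be reclaimed now.
 * Returns false if the entry predates the current epoch and must stay.
 */
static bool toy_epoch_check(const struct toy_write_entry *old,
			    uint64_t current_epoch)
{
	/* An entry cannot come from a future epoch. */
	assert(old->epoch_id <= current_epoch);
	return old->epoch_id == current_epoch;
}
```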

Determining when to reclaim data belonging to deleted snapshots requires
additional bookkeeping.  For each snapshot, Nova maintains a "snapshot log"
that records the inodes and blocks that belong to that snapshot, but are not
part of the current file system image.

Nova populates the snapshot log during the epoch check: If the epoch_ids for
the new and old log entries do not match, it appends a log entry (either struct
snapshot_inode_entry or struct snapshot_file_write_entry) to the snapshot log
that the old log entry belongs to.  The log entry contains a pointer to the old
log entry, and the filesystem's current epoch_id as the delete_epoch_id.
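
A sketch of how the snapshot log might be populated follows.  It assumes that
an old log entry "belongs to" the earliest live snapshot whose epoch_id is
greater than or equal to the entry's epoch_id; that reading, and all of the
toy_* names and the fixed-size arrays, are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_LOG_CAP 8	/* hypothetical per-snapshot log capacity */

/* Models a snapshot_file_write_entry or snapshot_inode_entry. */
struct toy_snap_entry {
	const void *old_entry;		/* pointer to the superseded log entry */
	uint64_t delete_epoch_id;	/* file system epoch at overwrite time */
};

struct toy_snapshot {
	uint64_t epoch_id;
	struct toy_snap_entry log[TOY_LOG_CAP];
	size_t nr_entries;
};

/* snaps[] is sorted by ascending epoch_id; returns NULL if none qualifies. */
static struct toy_snapshot *toy_find_owner(struct toy_snapshot *snaps,
					    size_t n, uint64_t old_epoch)
{
	for (size_t i = 0; i < n; i++)
		if (snaps[i].epoch_id >= old_epoch)
			return &snaps[i];
	return NULL;
}

/*
 * Called when the epoch check fails.  Returns 0 on success, -1 if no live
 * snapshot owns the entry or the owner's log is full.
 */
static int toy_record_overwrite(struct toy_snapshot *snaps, size_t n,
				const void *old_entry, uint64_t old_epoch,
				uint64_t current_epoch)
{
	struct toy_snapshot *owner;

	assert(old_epoch < current_epoch);	/* mismatch case only */
	owner = toy_find_owner(snaps, n, old_epoch);
	if (!owner || owner->nr_entries == TOY_LOG_CAP)
		return -1;

	owner->log[owner->nr_entries].old_entry = old_entry;
	owner->log[owner->nr_entries].delete_epoch_id = current_epoch;
	owner->nr_entries++;
	return 0;
}
```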

To delete a snapshot, Nova removes the snapshot from the list of live snapshots
and appends its log to the following snapshot's log.  Then, a background thread
traverses the combined log and reclaims dead inodes and data based on the
delete_epoch_id: if the delete_epoch_id for an entry in the log is less than or
equal to the snapshot's epoch_id, the log entry and/or the associated data
blocks are now dead.
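
The reclaim pass can be pictured as a linear scan over the combined log.  The
sketch below is a single-threaded model of that scan, not the background
thread itself; struct toy_del_entry and toy_reclaim_dead() are hypothetical,
and the caller supplies the snapshot epoch_id to compare against.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical entry of a combined snapshot log. */
struct toy_del_entry {
	uint64_t delete_epoch_id;	/* epoch in which the data was replaced */
	bool reclaimed;			/* set once the entry has been freed */
};

/*
 * Mark every entry whose delete_epoch_id is less than or equal to
 * snapshot_epoch as reclaimed.  Returns the number of entries reclaimed.
 */
static size_t toy_reclaim_dead(struct toy_del_entry *log, size_t n,
			       uint64_t snapshot_epoch)
{
	size_t reclaimed = 0;

	assert(log != NULL || n == 0);
	for (size_t i = 0; i < n; i++) {
		if (log[i].reclaimed)
			continue;
		if (log[i].delete_epoch_id <= snapshot_epoch) {
			log[i].reclaimed = true;	/* dead: nothing needs it */
			reclaimed++;
		}
	}
	return reclaimed;
}
```

Entries with a larger delete_epoch_id stay in the log because the snapshot
still needs the data they point to.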

Snapshots and DAX
-----------------

Taking consistent snapshots while applications are modifying files using
DAX-style mmap requires NOVA to reckon with the order in which stores to NVMM
become persistent (i.e., reach physical NVMM so they will survive a system
failure).  These applications rely on the processor's "memory persistence
model" [http://dl.acm.org/citation.cfm?id=2665671.2665712] to make guarantees
about when and in what order stores become persistent.  These guarantees allow
applications to restore their data to a consistent state during recovery from
a system failure.

From the application's perspective, reading a snapshot is equivalent to
recovering from a system failure.  In both cases, the contents of the
memory-mapped file reflect its state at a moment when application operations
might be in-flight and when the application had no chance to shut down cleanly.

A naive approach to checkpointing mmap()'d files in NOVA would simply mark each
of the read/write mapped pages as read-only and then do copy-on-write when a
store occurs to preserve the old pages as part of the snapshot.

However, this approach can leave the snapshot in an inconsistent state:
Setting a page to read-only captures its contents for the snapshot, and the
kernel requires NOVA to set the pages as read-only one at a time.  So, if the
order in which NOVA marks pages as read-only is incompatible with the ordering
that the application requires, the snapshot will contain an inconsistent
version of the file.

To resolve this problem, when NOVA starts marking pages as read-only, it blocks
page faults to the read-only mmap()'d pages until it has marked all the pages
read-only and finished taking the snapshot.

More detail is available in the technical report referenced at the top of this
document.

We have implemented this functionality in NOVA by adding the 'original_write'
flag to struct vm_area_struct that tracks whether the vm_area_struct was created
with write permission, but has been marked read-only in the course of taking a
snapshot.  We have also added a 'dax_cow' operation to struct
vm_operations_struct that the page fault handler runs when applications write
to a page with original_write = 1.  NOVA's dax_cow operation
(nova_restore_page_write()) performs the COW, maps the page to a new physical
page and allows writing.
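
The write-fault decision described in this section can be modeled in user
space as follows.  This is only a sketch of the logic: struct toy_vma, enum
toy_fault_action and the toy_* functions are hypothetical, the
snapshot_in_progress argument stands in for whatever mechanism blocks faults
while a snapshot is being taken, and the real change lives in the kernel's
page fault and mprotect paths.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical outcomes of a write fault in this toy model. */
enum toy_fault_action {
	TOY_FAULT_PROCEED,	/* mapping is writable: nothing to do */
	TOY_FAULT_WAIT,		/* snapshot in progress: block the fault */
	TOY_FAULT_COW,		/* run the dax_cow operation */
	TOY_FAULT_DENY,		/* mapping was never writable */
};

struct toy_vma {
	bool writable;		/* current write permission */
	bool original_write;	/* created writable, read-only due to snapshot */
};

/* Snapshot path: revoke write access but remember that it existed. */
static void toy_mark_read_only(struct toy_vma *vma)
{
	if (vma->writable) {
		vma->writable = false;
		vma->original_write = true;
	}
}

/* Fault path: decide what a store to this mapping should trigger. */
static enum toy_fault_action toy_write_fault(const struct toy_vma *vma,
					     bool snapshot_in_progress)
{
	if (vma->writable)
		return TOY_FAULT_PROCEED;
	if (!vma->original_write)
		return TOY_FAULT_DENY;
	return snapshot_in_progress ? TOY_FAULT_WAIT : TOY_FAULT_COW;
}

/* Models the end of the COW: the mapping is writable again. */
static void toy_restore_write(struct toy_vma *vma)
{
	assert(vma->original_write);
	vma->writable = true;
	vma->original_write = false;
}
```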

Saving Snapshot State
---------------------

During a clean shutdown, Nova stores the snapshot information to PMEM.

Nova reserves an inode for storing snapshot information.  The log for the inode
contains an entry for each snapshot (struct snapshot_info_log_entry).  On
shutdown, Nova allocates one page (struct snapshot_nvmm_page) to store an array
of struct snapshot_nvmm_list.

Each of these lists (one per CPU) contains head and tail pointers to a linked
list of blocks (just like an inode log).  The lists contain a struct
snapshot_file_write_entry or struct snapshot_inode_entry for each operation
that modified file data or an inode.

Superblock
+--------------------+
|   ...              |
+--------------------+
| Reserved Inodes    |
+---+----------------+
|   |     ..         |
+---+----------------+
| 7 | Snapshot Inode |
|   | head           |
+---+----------------+
        /
       /
      /
+---------+---------+---------+
|  Snap   |  Snap   |  Snap   |
| epoch=1 | epoch=4 | epoch=11|
|         |         |         |
|nvmm_page|nvmm_page|nvmm_page|
+---------+---------+---------+
     |
     |
+----------+   +--------+--------+
|  cpu 0   |   | snap   | snap   |
|   head   |-->| inode  | write  |
|          |   | entry  | entry  |
|          |   +--------+--------+
+----------+   +--------+--------+
|  cpu 1   |   | snap   | snap   |
|   head   |-->| write  | write  |
|          |   | entry  | entry  |
|          |   +--------+--------+
+----------+
|    ...   |
+----------+   +--------+
|  cpu 128 |   | snap   |
|   head   |-->| inode  |
|          |   | entry  |
|          |   +--------+
+----------+
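
The per-CPU array in the diagram could be modeled roughly as below.  The field
names, the 128-entry array and the 4096-byte page size are assumptions for
this sketch and are not taken from fs/nova/snapshot.h; the real struct
snapshot_nvmm_list and struct snapshot_nvmm_page may carry additional fields.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_PAGE_SIZE 4096	/* assumed page size */
#define TOY_MAX_CPUS  128	/* assumed number of per-CPU lists */

/* Models struct snapshot_nvmm_list: one linked list of log blocks. */
struct toy_nvmm_list {
	uint64_t head;	/* NVMM offset of the first log block */
	uint64_t tail;	/* NVMM offset just past the last entry */
};

/* Models struct snapshot_nvmm_page: one list per CPU in a single page. */
struct toy_nvmm_page {
	struct toy_nvmm_list lists[TOY_MAX_CPUS];
};

/* The whole array has to fit in the one page allocated at shutdown. */
_Static_assert(sizeof(struct toy_nvmm_page) <= TOY_PAGE_SIZE,
	       "per-CPU list array must fit in one page");

/* Save one CPU's list pointers.  Returns 0 on success, -1 on a bad CPU. */
static int toy_save_cpu_list(struct toy_nvmm_page *page, unsigned int cpu,
			     uint64_t head, uint64_t tail)
{
	if (cpu >= TOY_MAX_CPUS)
		return -1;
	page->lists[cpu].head = head;
	page->lists[cpu].tail = tail;
	return 0;
}
```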

Signed-off-by: Steven Swanson <swanson@cs.ucsd.edu>
---
 arch/x86/mm/fault.c      |   11 
 fs/nova/snapshot.c       | 1407 ++++++++++++++++++++++++++++++++++++++++++++++
 fs/nova/snapshot.h       |   98 +++
 include/linux/mm.h       |    2 
 include/linux/mm_types.h |    3 
 mm/mprotect.c            |   13 
 6 files changed, 1533 insertions(+), 1 deletion(-)
 create mode 100644 fs/nova/snapshot.c
 create mode 100644 fs/nova/snapshot.h
