[3/4] design: incorporate self-describing metadata design doc

Message ID 152623084306.10242.9245415705036181017.stgit@magnolia (mailing list archive)
State New, archived

Commit Message

Darrick J. Wong May 13, 2018, 5 p.m. UTC
From: Darrick J. Wong <darrick.wong@oracle.com>

Incorporate the self-describing metadata doc into the design book.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 design/XFS_Filesystem_Structure/docinfo.xml        |    1 
 .../self_describing_metadata.asciidoc              |  354 ++++++++++++++++++++
 .../xfs_filesystem_structure.asciidoc              |    2 
 design/xfs-self-describing-metadata.asciidoc       |  356 --------------------
 4 files changed, 357 insertions(+), 356 deletions(-)
 create mode 100644 design/XFS_Filesystem_Structure/self_describing_metadata.asciidoc
 delete mode 100644 design/xfs-self-describing-metadata.asciidoc




Comments

Allison Henderson May 14, 2018, 2:30 a.m. UTC | #1
Looks ok.  Thx!

Reviewed-by: Allison Henderson <allison.henderson@oracle.com>

On 05/13/2018 10:00 AM, Darrick J. Wong wrote:
> From: Darrick J. Wong <darrick.wong@oracle.com>
>
> Incorporate the self-describing metadata doc into the design book.
>
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> ---
>   design/XFS_Filesystem_Structure/docinfo.xml        |    1
>   .../self_describing_metadata.asciidoc              |  354 ++++++++++++++++++++
>   .../xfs_filesystem_structure.asciidoc              |    2
>   design/xfs-self-describing-metadata.asciidoc       |  356 --------------------
>   4 files changed, 357 insertions(+), 356 deletions(-)
>   create mode 100644 design/XFS_Filesystem_Structure/self_describing_metadata.asciidoc
>   delete mode 100644 design/xfs-self-describing-metadata.asciidoc
>
>
> diff --git a/design/XFS_Filesystem_Structure/docinfo.xml b/design/XFS_Filesystem_Structure/docinfo.xml
> index 558a04c..29ffbb5 100644
> --- a/design/XFS_Filesystem_Structure/docinfo.xml
> +++ b/design/XFS_Filesystem_Structure/docinfo.xml
> @@ -180,6 +180,7 @@
>   		<revdescription>
>   			<simplelist>
>   				<member>Incorporate Dave Chinner's log design document.</member>
> +				<member>Incorporate Dave Chinner's self-describing metadata design document.</member>
>   			</simplelist>
>   		</revdescription>
>   	</revision>
> diff --git a/design/XFS_Filesystem_Structure/self_describing_metadata.asciidoc b/design/XFS_Filesystem_Structure/self_describing_metadata.asciidoc
> new file mode 100644
> index 0000000..c3038b9
> --- /dev/null
> +++ b/design/XFS_Filesystem_Structure/self_describing_metadata.asciidoc
> @@ -0,0 +1,354 @@
> += XFS Self Describing Metadata
> +
> +== Introduction
> +
> +The largest scalability problem facing XFS is not one of algorithmic
> +scalability, but of verification of the filesystem structure. Scalability of the
> +structures and indexes on disk and the algorithms for iterating them are
> +adequate for supporting PB scale filesystems with billions of inodes; however, it
> +is this very scalability that causes the verification problem.
> +
> +Almost all metadata on XFS is dynamically allocated. The only fixed location
> +metadata is the allocation group headers (SB, AGF, AGFL and AGI), while all
> +other metadata structures need to be discovered by walking the filesystem
> +structure in different ways. While this is already done by userspace tools for
> +validating and repairing the structure, there are limits to what they can
> +verify, and this in turn limits the supportable size of an XFS filesystem.
> +
> +For example, it is entirely possible to manually use xfs_db and a bit of
> +scripting to analyse the structure of a 100TB filesystem when trying to
> +determine the root cause of a corruption problem, but it is still mainly a
> +manual task of verifying that things like single bit errors or misplaced writes
> +weren't the ultimate cause of a corruption event. It may take a few hours to a
> +few days to perform such forensic analysis, so at this scale root cause
> +analysis is entirely possible.
> +
> +However, if we scale the filesystem up to 1PB, we now have 10x as much metadata
> +to analyse and so that analysis blows out towards weeks/months of forensic work.
> +Most of the analysis work is slow and tedious, so as the amount of analysis goes
> +up, the more likely that the cause will be lost in the noise.  Hence the primary
> +concern for supporting PB scale filesystems is minimising the time and effort
> +required for basic forensic analysis of the filesystem structure.
> +
> +
> +== Self Describing Metadata
> +
> +One of the problems with the current metadata format is that apart from the
> +magic number in the metadata block, we have no other way of identifying what it
> +is supposed to be. We can't even identify if it is the right place. Put simply,
> +you can't look at a single metadata block in isolation and say "yes, it is
> +supposed to be there and the contents are valid".
> +
> +Hence most of the time spent on forensic analysis is spent doing basic
> +verification of metadata values, looking for values that are in range (and hence
> +not detected by automated verification checks) but are not correct. Finding and
> +understanding how things like cross linked block lists arise (e.g. sibling
> +pointers in a btree ending up with loops in them) is the key to understanding what
> +went wrong, but it is impossible to tell what order the blocks were linked into
> +each other or written to disk after the fact.
> +
> +Hence we need to record more information into the metadata to allow us to
> +quickly determine if the metadata is intact and can be ignored for the purpose
> +of analysis. We can't protect against every possible type of error, but we can
> +ensure that common types of errors are easily detectable.  Hence the concept of
> +self describing metadata.
> +
> +The first, fundamental requirement of self describing metadata is that the
> +metadata object contains some form of unique identifier in a well known
> +location. This allows us to identify the expected contents of the block and
> +hence parse and verify the metadata object. If we can't independently identify
> +the type of metadata in the object, then the metadata doesn't describe itself
> +very well at all!
> +
> +Luckily, almost all XFS metadata has magic numbers embedded already - only the
> +AGFL, remote symlinks and remote attribute blocks do not contain identifying
> +magic numbers. Hence we can change the on-disk format of all these objects to
> +add more identifying information and detect this simply by changing the magic
> +numbers in the metadata objects. That is, if it has the current magic number,
> +the metadata isn't self identifying. If it contains a new magic number, it is
> +self identifying and we can do much more expansive automated verification of the
> +metadata object at runtime, during forensic analysis or repair.
> +
> +As a primary concern, self describing metadata needs some form of overall
> +integrity checking. We cannot trust the metadata if we cannot verify that it has
> +not been changed as a result of external influences. Hence we need some form of
> +integrity check, and this is done by adding CRC32c validation to the metadata
> +block. If we can verify the block contains the metadata it was intended to
> +contain, a large amount of the manual verification work can be skipped.
> +
> +CRC32c was selected as metadata cannot be more than 64k in length in XFS and
> +hence a 32 bit CRC is more than sufficient to detect multi-bit errors in
> +metadata blocks. CRC32c is also now hardware accelerated on common CPUs so it is
> +fast. So while CRC32c is not the strongest of possible integrity checks that
> +could be used, it is more than sufficient for our needs and has relatively
> +little overhead. Adding support for larger integrity fields and/or algorithms
> +does not really provide any extra value over CRC32c, but it does add a lot of
> +complexity and so there is no provision for changing the integrity checking
> +mechanism.
> +
> +Self describing metadata needs to contain enough information so that the
> +metadata block can be verified as being in the correct place without needing to
> +look at any other metadata. This means it needs to contain location information.
> +Just adding a block number to the metadata is not sufficient to protect against
> +mis-directed writes - a write might be misdirected to the wrong LUN and so be
> +written to the "correct block" of the wrong filesystem. Hence location
> +information must contain a filesystem identifier as well as a block number.
> +
> +Another key information point in forensic analysis is knowing who the metadata
> +block belongs to. We already know the type, the location, that it is valid
> +and/or corrupted, and how long ago it was last modified. Knowing the owner
> +of the block is important as it allows us to find other related metadata to
> +determine the scope of the corruption. For example, if we have an extent btree
> +object, we don't know what inode it belongs to and hence have to walk the entire
> +filesystem to find the owner of the block. Worse, the corruption could mean that
> +no owner can be found (i.e. it's an orphan block), and so without an owner field
> +in the metadata we have no idea of the scope of the corruption. If we have an
> +owner field in the metadata object, we can immediately do top down validation to
> +determine the scope of the problem.
> +
> +Different types of metadata have different owner identifiers. For example,
> +directory, attribute and extent tree blocks are all owned by an inode, whilst
> +freespace btree blocks are owned by an allocation group. Hence the size and
> +contents of the owner field are determined by the type of metadata object we are
> +looking at.  The owner information can also identify misplaced writes (e.g.
> +freespace btree block written to the wrong AG).
> +
> +Self describing metadata also needs to contain some indication of when it was
> +written to the filesystem. One of the key information points when doing forensic
> +analysis is how recently the block was modified. Correlation of a set of corrupted
> +metadata blocks based on modification times is important as it can indicate
> +whether the corruptions are related, whether there have been multiple corruption
> +events that led to the eventual failure, and even whether there are corruptions
> +present that the run-time verification is not detecting.
> +
> +For example, we can determine whether a metadata object is supposed to be free
> +space or still allocated if it is still referenced by its owner by looking at
> +when the free space btree block that contains the block was last written
> +compared to when the metadata object itself was last written.  If the free space
> +block is more recent than the object and the object's owner, then there is a
> +very good chance that the block should have been removed from the owner.
> +
> +To provide this "written timestamp", each metadata block has the Log Sequence
> +Number (LSN) of the most recent transaction that modified it written into it.
> +This number will always increase over the life of the filesystem, and the only
> +thing that resets it is running xfs_repair on the filesystem. Further, by use of
> +the LSN we can tell if the corrupted metadata all belonged to the same log
> +checkpoint and hence have some idea of how much modification occurred between
> +the first and last instance of corrupt metadata on disk and, further, how much
> +modification occurred between the corruption being written and when it was
> +detected.
> +
> +== Runtime Validation
> +
> +Validation of self-describing metadata takes place at runtime in two places:
> +
> +	* immediately after a successful read from disk
> +	* immediately prior to write IO submission
> +
> +The verification is completely stateless - it is done independently of the
> +modification process, and seeks only to check that the metadata is what it says
> +it is and that the metadata fields are within bounds and internally consistent.
> +As such, we cannot catch all types of corruption that can occur within a block
> +as there may be certain limitations that operational state enforces on the
> +metadata, or there may be corruption of interblock relationships (e.g. corrupted
> +sibling pointer lists). Hence we still need stateful checking in the main code
> +body, but in general most of the per-field validation is handled by the
> +verifiers.
> +
> +For read verification, the caller needs to specify the expected type of metadata
> +that it should see, and the IO completion process verifies that the metadata
> +object matches what was expected. If the verification process fails, then it
> +marks the object being read as EFSCORRUPTED. The caller needs to catch this
> +error (same as for IO errors), and if it needs to take special action due to a
> +verification error it can do so by catching the EFSCORRUPTED error value. If we
> +need more discrimination of error type at higher levels, we can define new
> +error numbers for different errors as necessary.
> +
> +The first step in read verification is checking the magic number and determining
> +whether CRC validation is necessary. If it is, the CRC32c is calculated and
> +compared against the value stored in the object itself. Once this is validated,
> +further checks are made against the location information, followed by extensive
> +object specific metadata validation. If any of these checks fail, then the
> +buffer is considered corrupt and the EFSCORRUPTED error is set appropriately.
> +
> +Write verification is the opposite of the read verification - first the object
> +is extensively verified and if it is OK we then update the LSN from the last
> +modification made to the object. After this, we calculate the CRC and insert it
> +into the object. Once this is done the write IO is allowed to continue. If any
> +error occurs during this process, the buffer is again marked with an EFSCORRUPTED
> +error for the higher layers to catch.
> +
> +== Structures
> +
> +A typical on-disk structure needs to contain the following information:
> +
> +[source ,c]
> +----
> +struct xfs_ondisk_hdr {
> +        __be32  magic;		/* magic number */
> +        __be32  crc;		/* CRC, not logged */
> +        uuid_t  uuid;		/* filesystem identifier */
> +        __be64  owner;		/* parent object */
> +        __be64  blkno;		/* location on disk */
> +        __be64  lsn;		/* last modification in log, not logged */
> +};
> +----
> +
> +Depending on the metadata, this information may be part of a header structure
> +separate to the metadata contents, or may be distributed through an existing
> +structure. The latter occurs with metadata that already contains some of this
> +information, such as the superblock and AG headers.
> +
> +Other metadata may have different formats for the information, but the same
> +level of information is generally provided. For example:
> +
> +	* short btree blocks have a 32 bit owner (ag number) and a 32 bit block
> +	  number for location. The two of these combined provide the same
> +	  information as @owner and @blkno in the above structure, but using 8
> +	  bytes less space on disk.
> +
> +	* directory/attribute node blocks have a 16 bit magic number, and the
> +	  header that contains the magic number has other information in it as
> +	  well. Hence the additional metadata headers change the overall format
> +	  of the metadata.
> +
> +A typical buffer read verifier is structured as follows:
> +
> +[source ,c]
> +----
> +#define XFS_FOO_CRC_OFF		offsetof(struct xfs_ondisk_hdr, crc)
> +
> +static void
> +xfs_foo_read_verify(
> +	struct xfs_buf	*bp)
> +{
> +       struct xfs_mount *mp = bp->b_target->bt_mount;
> +
> +        if ((xfs_sb_version_hascrc(&mp->m_sb) &&
> +             !xfs_verify_cksum(bp->b_addr, BBTOB(bp->b_length),
> +					XFS_FOO_CRC_OFF)) ||
> +            !xfs_foo_verify(bp)) {
> +                XFS_CORRUPTION_ERROR(__func__, XFS_ERRLEVEL_LOW, mp, bp->b_addr);
> +                xfs_buf_ioerror(bp, EFSCORRUPTED);
> +        }
> +}
> +----
> +
> +The code ensures that the CRC is only checked if the filesystem has CRCs enabled
> +by checking the superblock for the feature bit, and then if the CRC verifies OK
> +(or is not needed) it verifies the actual contents of the block.
> +
> +The verifier function will take a couple of different forms, depending on
> +whether the magic number can be used to determine the format of the block. In
> +the case it can't, the code is structured as follows:
> +
> +[source ,c]
> +----
> +static bool
> +xfs_foo_verify(
> +	struct xfs_buf		*bp)
> +{
> +        struct xfs_mount	*mp = bp->b_target->bt_mount;
> +        struct xfs_ondisk_hdr	*hdr = bp->b_addr;
> +
> +        if (hdr->magic != cpu_to_be32(XFS_FOO_MAGIC))
> +                return false;
> +
> +        if (xfs_sb_version_hascrc(&mp->m_sb)) {
> +		if (!uuid_equal(&hdr->uuid, &mp->m_sb.sb_uuid))
> +			return false;
> +		if (bp->b_bn != be64_to_cpu(hdr->blkno))
> +			return false;
> +		if (hdr->owner == 0)
> +			return false;
> +	}
> +
> +	/* object specific verification checks here */
> +
> +        return true;
> +}
> +----
> +
> +If there are different magic numbers for the different formats, the verifier
> +will look like:
> +
> +[source ,c]
> +----
> +static bool
> +xfs_foo_verify(
> +	struct xfs_buf		*bp)
> +{
> +        struct xfs_mount	*mp = bp->b_target->bt_mount;
> +        struct xfs_ondisk_hdr	*hdr = bp->b_addr;
> +
> +        if (hdr->magic == cpu_to_be32(XFS_FOO_CRC_MAGIC)) {
> +		if (!uuid_equal(&hdr->uuid, &mp->m_sb.sb_uuid))
> +			return false;
> +		if (bp->b_bn != be64_to_cpu(hdr->blkno))
> +			return false;
> +		if (hdr->owner == 0)
> +			return false;
> +	} else if (hdr->magic != cpu_to_be32(XFS_FOO_MAGIC))
> +		return false;
> +
> +	/* object specific verification checks here */
> +
> +        return true;
> +}
> +----
> +
> +Write verifiers are very similar to the read verifiers; they just do things in
> +the opposite order to the read verifiers. A typical write verifier:
> +
> +[source ,c]
> +----
> +static void
> +xfs_foo_write_verify(
> +	struct xfs_buf	*bp)
> +{
> +	struct xfs_mount	*mp = bp->b_target->bt_mount;
> +	struct xfs_buf_log_item	*bip = bp->b_fspriv;
> +
> +	if (!xfs_foo_verify(bp)) {
> +		XFS_CORRUPTION_ERROR(__func__, XFS_ERRLEVEL_LOW, mp, bp->b_addr);
> +		xfs_buf_ioerror(bp, EFSCORRUPTED);
> +		return;
> +	}
> +
> +	if (!xfs_sb_version_hascrc(&mp->m_sb))
> +		return;
> +
> +
> +	if (bip) {
> +		struct xfs_ondisk_hdr	*hdr = bp->b_addr;
> +		hdr->lsn = cpu_to_be64(bip->bli_item.li_lsn);
> +	}
> +	xfs_update_cksum(bp->b_addr, BBTOB(bp->b_length), XFS_FOO_CRC_OFF);
> +}
> +----
> +
> +This will verify the internal structure of the metadata before we go any
> +further, detecting corruptions that have occurred as the metadata has been
> +modified in memory. If the metadata verifies OK, and CRCs are enabled, we then
> +update the LSN field (when it was last modified) and calculate the CRC on the
> +metadata. Once this is done, we can issue the IO.
> +
> +== Inodes and Dquots
> +
> +Inodes and dquots are special snowflakes. They have per-object CRC and
> +self-identifiers, but they are packed so that there are multiple objects per
> +buffer. Hence we do not use per-buffer verifiers to do the work of per-object
> +verification and CRC calculations. The per-buffer verifiers simply perform basic
> +identification of the buffer - that they contain inodes or dquots, and that
> +there are magic numbers in all the expected spots. All further CRC and
> +verification checks are done when each inode is read from or written back to the
> +buffer.
> +
> +The structure of the verifiers and the identifier checks is very similar to the
> +buffer code described above. The only difference is where they are called. For
> +example, inode read verification is done in xfs_iread() when the inode is first
> +read out of the buffer and the struct xfs_inode is instantiated. The inode is
> +already extensively verified during writeback in xfs_iflush_int, so the only
> +addition here is to add the LSN and CRC to the inode as it is copied back into
> +the buffer.
> +
> diff --git a/design/XFS_Filesystem_Structure/xfs_filesystem_structure.asciidoc b/design/XFS_Filesystem_Structure/xfs_filesystem_structure.asciidoc
> index 7bdfade..15ab185 100644
> --- a/design/XFS_Filesystem_Structure/xfs_filesystem_structure.asciidoc
> +++ b/design/XFS_Filesystem_Structure/xfs_filesystem_structure.asciidoc
> @@ -48,6 +48,8 @@ include::overview.asciidoc[]
>   
>   include::metadata_integrity.asciidoc[]
>   
> +include::self_describing_metadata.asciidoc[]
> +
>   include::delayed_logging.asciidoc[]
>   
>   include::reflink.asciidoc[]
> diff --git a/design/xfs-self-describing-metadata.asciidoc b/design/xfs-self-describing-metadata.asciidoc
> deleted file mode 100644
> index b7dc3ff..0000000
> --- a/design/xfs-self-describing-metadata.asciidoc
> +++ /dev/null
> @@ -1,356 +0,0 @@
> -= XFS Self Describing Metadata
> -Dave Chinner, <dchinner@redhat.com>
> -v1.0, Feb 2014: Initial conversion to asciidoc
> -
> -== Introduction
> -
> -The largest scalability problem facing XFS is not one of algorithmic
> -scalability, but of verification of the filesystem structure. Scalabilty of the
> -structures and indexes on disk and the algorithms for iterating them are
> -adequate for supporting PB scale filesystems with billions of inodes, however it
> -is this very scalability that causes the verification problem.
> -
> -Almost all metadata on XFS is dynamically allocated. The only fixed location
> -metadata is the allocation group headers (SB, AGF, AGFL and AGI), while all
> -other metadata structures need to be discovered by walking the filesystem
> -structure in different ways. While this is already done by userspace tools for
> -validating and repairing the structure, there are limits to what they can
> -verify, and this in turn limits the supportable size of an XFS filesystem.
> -
> -For example, it is entirely possible to manually use xfs_db and a bit of
> -scripting to analyse the structure of a 100TB filesystem when trying to
> -determine the root cause of a corruption problem, but it is still mainly a
> -manual task of verifying that things like single bit errors or misplaced writes
> -weren't the ultimate cause of a corruption event. It may take a few hours to a
> -few days to perform such forensic analysis, so for at this scale root cause
> -analysis is entirely possible.
> -
> -However, if we scale the filesystem up to 1PB, we now have 10x as much metadata
> -to analyse and so that analysis blows out towards weeks/months of forensic work.
> -Most of the analysis work is slow and tedious, so as the amount of analysis goes
> -up, the more likely that the cause will be lost in the noise.  Hence the primary
> -concern for supporting PB scale filesystems is minimising the time and effort
> -required for basic forensic analysis of the filesystem structure.
> -
> -
> -== Self Describing Metadata
> -
> -One of the problems with the current metadata format is that apart from the
> -magic number in the metadata block, we have no other way of identifying what it
> -is supposed to be. We can't even identify if it is the right place. Put simply,
> -you can't look at a single metadata block in isolation and say "yes, it is
> -supposed to be there and the contents are valid".
> -
> -Hence most of the time spent on forensic analysis is spent doing basic
> -verification of metadata values, looking for values that are in range (and hence
> -not detected by automated verification checks) but are not correct. Finding and
> -understanding how things like cross linked block lists (e.g. sibling
> -pointers in a btree end up with loops in them) are the key to understanding what
> -went wrong, but it is impossible to tell what order the blocks were linked into
> -each other or written to disk after the fact.
> -
> -Hence we need to record more information into the metadata to allow us to
> -quickly determine if the metadata is intact and can be ignored for the purpose
> -of analysis. We can't protect against every possible type of error, but we can
> -ensure that common types of errors are easily detectable.  Hence the concept of
> -self describing metadata.
> -
> -The first, fundamental requirement of self describing metadata is that the
> -metadata object contains some form of unique identifier in a well known
> -location. This allows us to identify the expected contents of the block and
> -hence parse and verify the metadata object. IF we can't independently identify
> -the type of metadata in the object, then the metadata doesn't describe itself
> -very well at all!
> -
> -Luckily, almost all XFS metadata has magic numbers embedded already - only the
> -AGFL, remote symlinks and remote attribute blocks do not contain identifying
> -magic numbers. Hence we can change the on-disk format of all these objects to
> -add more identifying information and detect this simply by changing the magic
> -numbers in the metadata objects. That is, if it has the current magic number,
> -the metadata isn't self identifying. If it contains a new magic number, it is
> -self identifying and we can do much more expansive automated verification of the
> -metadata object at runtime, during forensic analysis or repair.
> -
> -As a primary concern, self describing metadata needs some form of overall
> -integrity checking. We cannot trust the metadata if we cannot verify that it has
> -not been changed as a result of external influences. Hence we need some form of
> -integrity check, and this is done by adding CRC32c validation to the metadata
> -block. If we can verify the block contains the metadata it was intended to
> -contain, a large amount of the manual verification work can be skipped.
> -
> -CRC32c was selected as metadata cannot be more than 64k in length in XFS and
> -hence a 32 bit CRC is more than sufficient to detect multi-bit errors in
> -metadata blocks. CRC32c is also now hardware accelerated on common CPUs so it is
> -fast. So while CRC32c is not the strongest of possible integrity checks that
> -could be used, it is more than sufficient for our needs and has relatively
> -little overhead. Adding support for larger integrity fields and/or algorithms
> -does really provide any extra value over CRC32c, but it does add a lot of
> -complexity and so there is no provision for changing the integrity checking
> -mechanism.
> -
> -Self describing metadata needs to contain enough information so that the
> -metadata block can be verified as being in the correct place without needing to
> -look at any other metadata. This means it needs to contain location information.
> -Just adding a block number to the metadata is not sufficient to protect against
> -mis-directed writes - a write might be misdirected to the wrong LUN and so be
> -written to the "correct block" of the wrong filesystem. Hence location
> -information must contain a filesystem identifier as well as a block number.
> -
> -Another key information point in forensic analysis is knowing who the metadata
> -block belongs to. We already know the type, the location, that it is valid
> -and/or corrupted, and how long ago that it was last modified. Knowing the owner
> -of the block is important as it allows us to find other related metadata to
> -determine the scope of the corruption. For example, if we have a extent btree
> -object, we don't know what inode it belongs to and hence have to walk the entire
> -filesystem to find the owner of the block. Worse, the corruption could mean that
> -no owner can be found (i.e. it's an orphan block), and so without an owner field
> -in the metadata we have no idea of the scope of the corruption. If we have an
> -owner field in the metadata object, we can immediately do top down validation to
> -determine the scope of the problem.
> -
> -Different types of metadata have different owner identifiers. For example,
> -directory, attribute and extent tree blocks are all owned by an inode, whilst
> -freespace btree blocks are owned by an allocation group. Hence the size and
> -contents of the owner field are determined by the type of metadata object we are
> -looking at.  The owner information can also identify misplaced writes (e.g.
> -freespace btree block written to the wrong AG).
> -
> -Self describing metadata also needs to contain some indication of when it was
> -written to the filesystem. One of the key information points when doing forensic
> -analysis is how recently the block was modified. Correlation of set of corrupted
> -metadata blocks based on modification times is important as it can indicate
> -whether the corruptions are related, whether there's been multiple corruption
> -events that lead to the eventual failure, and even whether there are corruptions
> -present that the run-time verification is not detecting.
> -
> -For example, we can determine whether a metadata object is supposed to be free
> -space or still allocated if it is still referenced by its owner by looking at
> -when the free space btree block that contains the block was last written
> -compared to when the metadata object itself was last written.  If the free space
> -block is more recent than the object and the object's owner, then there is a
> -very good chance that the block should have been removed from the owner.
> -
> -To provide this "written timestamp", each metadata block gets the Log Sequence
> -Number (LSN) of the most recent transaction it was modified on written into it.
> -This number will always increase over the life of the filesystem, and the only
> -thing that resets it is running xfs_repair on the filesystem. Further, by use of
> -the LSN we can tell if the corrupted metadata all belonged to the same log
> -checkpoint and hence have some idea of how much modification occurred between
> -the first and last instance of corrupt metadata on disk and, further, how much
> -modification occurred between the corruption being written and when it was
> -detected.
> -
> -== Runtime Validation
> -
> -Validation of self-describing metadata takes place at runtime in two places:
> -
> -	* immediately after a successful read from disk
> -	* immediately prior to write IO submission
> -
> -The verification is completely stateless - it is done independently of the
> -modification process, and seeks only to check that the metadata is what it says
> -it is and that the metadata fields are within bounds and internally consistent.
> -As such, we cannot catch all types of corruption that can occur within a block
> -as there may be certain limitations that operational state enforces of the
> -metadata, or there may be corruption of interblock relationships (e.g. corrupted
> -sibling pointer lists). Hence we still need stateful checking in the main code
> -body, but in general most of the per-field validation is handled by the
> -verifiers.
> -
> -For read verification, the caller needs to specify the expected type of metadata
> -that it should see, and the IO completion process verifies that the metadata
> -object matches what was expected. If the verification process fails, then it
> -marks the object being read as EFSCORRUPTED. The caller needs to catch this
> -error (same as for IO errors), and if it needs to take special action due to a
> -verification error it can do so by catching the EFSCORRUPTED error value. If we
> -need more discrimination of error type at higher levels, we can define new
> -error numbers for different errors as necessary.
> -
> -The first step in read verification is checking the magic number and determining
> -whether CRC validating is necessary. If it is, the CRC32c is calculated and
> -compared against the value stored in the object itself. Once this is validated,
> -further checks are made against the location information, followed by extensive
> -object specific metadata validation. If any of these checks fail, then the
> -buffer is considered corrupt and the EFSCORRUPTED error is set appropriately.
> -
> -Write verification is the opposite of the read verification - first the object
> -is extensively verified and if it is OK we then update the LSN from the last
> -modification made to the object, After this, we calculate the CRC and insert it
> -into the object. Once this is done the write IO is allowed to continue. If any
> -error occurs during this process, the buffer is again marked with a EFSCORRUPTED
> -error for the higher layers to catch.
> -
> -== Structures
> -
> -A typical on-disk structure needs to contain the following information:
> -
> -[source ,c]
> -----
> -struct xfs_ondisk_hdr {
> -        __be32  magic;		/* magic number */
> -        __be32  crc;		/* CRC, not logged */
> -        uuid_t  uuid;		/* filesystem identifier */
> -        __be64  owner;		/* parent object */
> -        __be64  blkno;		/* location on disk */
> -        __be64  lsn;		/* last modification in log, not logged */
> -};
> -----
> -
> -Depending on the metadata, this information may be part of a header structure
> -separate to the metadata contents, or may be distributed through an existing
> -structure. The latter occurs with metadata that already contains some of this
> -information, such as the superblock and AG headers.
> -
> -Other metadata may have different formats for the information, but the same
> -level of information is generally provided. For example:
> -
> -	* short btree blocks have a 32 bit owner (ag number) and a 32 bit block
> -	  number for location. The two of these combined provide the same
> -	  information as @owner and @blkno in eh above structure, but using 8
> -	  bytes less space on disk.
> -
> -	* directory/attribute node blocks have a 16 bit magic number, and the
> -	  header that contains the magic number has other information in it as
> -	  well. hence the additional metadata headers change the overall format
> -	  of the metadata.
> -
> -A typical buffer read verifier is structured as follows:
> -
> -[source ,c]
> -----
> -#define XFS_FOO_CRC_OFF		offsetof(struct xfs_ondisk_hdr, crc)
> -
> -static void
> -xfs_foo_read_verify(
> -	struct xfs_buf	*bp)
> -{
> -       struct xfs_mount *mp = bp->b_target->bt_mount;
> -
> -        if ((xfs_sb_version_hascrc(&mp->m_sb) &&
> -             !xfs_verify_cksum(bp->b_addr, BBTOB(bp->b_length),
> -					XFS_FOO_CRC_OFF)) ||
> -            !xfs_foo_verify(bp)) {
> -                XFS_CORRUPTION_ERROR(__func__, XFS_ERRLEVEL_LOW, mp, bp->b_addr);
> -                xfs_buf_ioerror(bp, EFSCORRUPTED);
> -        }
> -}
> -----
> -
> -The code ensures that the CRC is only checked if the filesystem has CRCs enabled
> -by checking the superblock of the feature bit, and then if the CRC verifies OK
> -(or is not needed) it verifies the actual contents of the block.
> -
> -The verifier function will take a couple of different forms, depending on
> -whether the magic number can be used to determine the format of the block. In
> -the case it can't, the code is structured as follows:
> -
> -[source ,c]
> -----
> -static bool
> -xfs_foo_verify(
> -	struct xfs_buf		*bp)
> -{
> -        struct xfs_mount	*mp = bp->b_target->bt_mount;
> -        struct xfs_ondisk_hdr	*hdr = bp->b_addr;
> -
> -        if (hdr->magic != cpu_to_be32(XFS_FOO_MAGIC))
> -                return false;
> -
> -        if (!xfs_sb_version_hascrc(&mp->m_sb)) {
> -		if (!uuid_equal(&hdr->uuid, &mp->m_sb.sb_uuid))
> -			return false;
> -		if (bp->b_bn != be64_to_cpu(hdr->blkno))
> -			return false;
> -		if (hdr->owner == 0)
> -			return false;
> -	}
> -
> -	/* object specific verification checks here */
> -
> -        return true;
> -}
> -----
> -
> -If there are different magic numbers for the different formats, the verifier
> -will look like:
> -
> -[source ,c]
> -----
> -static bool
> -xfs_foo_verify(
> -	struct xfs_buf		*bp)
> -{
> -        struct xfs_mount	*mp = bp->b_target->bt_mount;
> -        struct xfs_ondisk_hdr	*hdr = bp->b_addr;
> -
> -        if (hdr->magic == cpu_to_be32(XFS_FOO_CRC_MAGIC)) {
> -		if (!uuid_equal(&hdr->uuid, &mp->m_sb.sb_uuid))
> -			return false;
> -		if (bp->b_bn != be64_to_cpu(hdr->blkno))
> -			return false;
> -		if (hdr->owner == 0)
> -			return false;
> -	} else if (hdr->magic != cpu_to_be32(XFS_FOO_MAGIC))
> -		return false;
> -
> -	/* object specific verification checks here */
> -
> -        return true;
> -}
> -----
> -
> -Write verifiers are very similar to the read verifiers, they just do things in
> -the opposite order to the read verifiers. A typical write verifier:
> -
> -[source ,c]
> -----
> -static void
> -xfs_foo_write_verify(
> -	struct xfs_buf	*bp)
> -{
> -	struct xfs_mount	*mp = bp->b_target->bt_mount;
> -	struct xfs_buf_log_item	*bip = bp->b_fspriv;
> -
> -	if (!xfs_foo_verify(bp)) {
> -		XFS_CORRUPTION_ERROR(__func__, XFS_ERRLEVEL_LOW, mp, bp->b_addr);
> -		xfs_buf_ioerror(bp, EFSCORRUPTED);
> -		return;
> -	}
> -
> -	if (!xfs_sb_version_hascrc(&mp->m_sb))
> -		return;
> -
> -
> -	if (bip) {
> -		struct xfs_ondisk_hdr	*hdr = bp->b_addr;
> -		hdr->lsn = cpu_to_be64(bip->bli_item.li_lsn);
> -	}
> -	xfs_update_cksum(bp->b_addr, BBTOB(bp->b_length), XFS_FOO_CRC_OFF);
> -}
> -----
> -
> -This will verify the internal structure of the metadata before we go any
> -further, detecting corruptions that have occurred as the metadata has been
> -modified in memory. If the metadata verifies OK, and CRCs are enabled, we then
> -update the LSN field (when it was last modified) and calculate the CRC on the
> -metadata. Once this is done, we can issue the IO.
> -
> -== Inodes and Dquots
> -
> -Inodes and dquots are special snowflakes. They have per-object CRC and
> -self-identifiers, but they are packed so that there are multiple objects per
> -buffer. Hence we do not use per-buffer verifiers to do the work of per-object
> -verification and CRC calculations. The per-buffer verifiers simply perform basic
> -identification of the buffer - that they contain inodes or dquots, and that
> -there are magic numbers in all the expected spots. All further CRC and
> -verification checks are done when each inode is read from or written back to the
> -buffer.
> -
> -The structure of the verifiers and the identifiers checks is very similar to the
> -buffer code described above. The only difference is where they are called. For
> -example, inode read verification is done in xfs_iread() when the inode is first
> -read out of the buffer and the struct xfs_inode is instantiated. The inode is
> -already extensively verified during writeback in xfs_iflush_int, so the only
> -addition here is to add the LSN and CRC to the inode as it is copied back into
> -the buffer.
> -
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


Patch

diff --git a/design/XFS_Filesystem_Structure/docinfo.xml b/design/XFS_Filesystem_Structure/docinfo.xml
index 558a04c..29ffbb5 100644
--- a/design/XFS_Filesystem_Structure/docinfo.xml
+++ b/design/XFS_Filesystem_Structure/docinfo.xml
@@ -180,6 +180,7 @@ 
 		<revdescription>
 			<simplelist>
 				<member>Incorporate Dave Chinner's log design document.</member>
+				<member>Incorporate Dave Chinner's self-describing metadata design document.</member>
 			</simplelist>
 		</revdescription>
 	</revision>
diff --git a/design/XFS_Filesystem_Structure/self_describing_metadata.asciidoc b/design/XFS_Filesystem_Structure/self_describing_metadata.asciidoc
new file mode 100644
index 0000000..c3038b9
--- /dev/null
+++ b/design/XFS_Filesystem_Structure/self_describing_metadata.asciidoc
@@ -0,0 +1,354 @@ 
+= XFS Self Describing Metadata
+
+== Introduction
+
+The largest scalability problem facing XFS is not one of algorithmic
+scalability, but of verification of the filesystem structure. Scalability of the
+structures and indexes on disk and the algorithms for iterating them is
+adequate for supporting PB scale filesystems with billions of inodes; however, it
+is this very scalability that causes the verification problem.
+
+Almost all metadata on XFS is dynamically allocated. The only fixed location
+metadata is the allocation group headers (SB, AGF, AGFL and AGI), while all
+other metadata structures need to be discovered by walking the filesystem
+structure in different ways. While this is already done by userspace tools for
+validating and repairing the structure, there are limits to what they can
+verify, and this in turn limits the supportable size of an XFS filesystem.
+
+For example, it is entirely possible to manually use xfs_db and a bit of
+scripting to analyse the structure of a 100TB filesystem when trying to
+determine the root cause of a corruption problem, but it is still mainly a
+manual task of verifying that things like single bit errors or misplaced writes
+weren't the ultimate cause of a corruption event. It may take a few hours to a
+few days to perform such forensic analysis, so at this scale root cause
+analysis is entirely possible.
+
+However, if we scale the filesystem up to 1PB, we now have 10x as much metadata
+to analyse and so that analysis blows out towards weeks/months of forensic work.
+Most of the analysis work is slow and tedious, so as the amount of analysis goes
+up, the more likely that the cause will be lost in the noise.  Hence the primary
+concern for supporting PB scale filesystems is minimising the time and effort
+required for basic forensic analysis of the filesystem structure.
+
+
+== Self Describing Metadata
+
+One of the problems with the current metadata format is that apart from the
+magic number in the metadata block, we have no other way of identifying what it
+is supposed to be. We can't even identify if it is in the right place. Put simply,
+you can't look at a single metadata block in isolation and say "yes, it is
+supposed to be there and the contents are valid".
+
+Hence most of the time spent on forensic analysis is spent doing basic
+verification of metadata values, looking for values that are in range (and hence
+not detected by automated verification checks) but are not correct. Finding and
+understanding how things like cross linked block lists arise (e.g. sibling
+pointers in a btree ending up with loops in them) is the key to understanding what
+went wrong, but it is impossible to tell what order the blocks were linked into
+each other or written to disk after the fact.
+
+Hence we need to record more information into the metadata to allow us to
+quickly determine if the metadata is intact and can be ignored for the purpose
+of analysis. We can't protect against every possible type of error, but we can
+ensure that common types of errors are easily detectable.  Hence the concept of
+self describing metadata.
+
+The first, fundamental requirement of self describing metadata is that the
+metadata object contains some form of unique identifier in a well known
+location. This allows us to identify the expected contents of the block and
+hence parse and verify the metadata object. If we can't independently identify
+the type of metadata in the object, then the metadata doesn't describe itself
+very well at all!
+
+Luckily, almost all XFS metadata has magic numbers embedded already - only the
+AGFL, remote symlinks and remote attribute blocks do not contain identifying
+magic numbers. Hence we can change the on-disk format of all these objects to
+add more identifying information and detect this simply by changing the magic
+numbers in the metadata objects. That is, if it has the current magic number,
+the metadata isn't self identifying. If it contains a new magic number, it is
+self identifying and we can do much more expansive automated verification of the
+metadata object at runtime, during forensic analysis or repair.
+
+As a primary concern, self describing metadata needs some form of overall
+integrity checking. We cannot trust the metadata if we cannot verify that it has
+not been changed as a result of external influences. Hence we need some form of
+integrity check, and this is done by adding CRC32c validation to the metadata
+block. If we can verify the block contains the metadata it was intended to
+contain, a large amount of the manual verification work can be skipped.
+
+CRC32c was selected as metadata cannot be more than 64k in length in XFS and
+hence a 32 bit CRC is more than sufficient to detect multi-bit errors in
+metadata blocks. CRC32c is also now hardware accelerated on common CPUs so it is
+fast. So while CRC32c is not the strongest of possible integrity checks that
+could be used, it is more than sufficient for our needs and has relatively
+little overhead. Adding support for larger integrity fields and/or algorithms
+does not really provide any extra value over CRC32c, but it does add a lot of
+complexity and so there is no provision for changing the integrity checking
+mechanism.
+
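To make the integrity check concrete, here is a minimal sketch of CRC32c computed one bit at a time. The function name `crc32c_bitwise` is hypothetical, and the sketch assumes the common CRC-32C convention (initial value and final XOR of 0xFFFFFFFF); production code would use a table-driven or hardware-accelerated routine rather than this loop.

```c
#include <stddef.h>
#include <stdint.h>

/* Reflected form of the Castagnoli polynomial 0x1EDC6F41. */
#define CRC32C_POLY_REFLECTED	0x82F63B78u

/*
 * Bit-at-a-time CRC32c: slow, but short enough to audit by eye.
 * Hypothetical helper, not a kernel or library function.
 */
static uint32_t
crc32c_bitwise(const void *buf, size_t len)
{
	const uint8_t	*p = buf;
	uint32_t	crc = 0xFFFFFFFFu;

	while (len--) {
		crc ^= *p++;
		for (int bit = 0; bit < 8; bit++) {
			/* The mask is all ones when the low bit is set. */
			crc = (crc >> 1) ^
			      (CRC32C_POLY_REFLECTED & -(crc & 1u));
		}
	}
	return crc ^ 0xFFFFFFFFu;
}
```

Because a CRC always detects a single flipped bit, changing any one bit of a block changes the value this function returns for it.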
+Self describing metadata needs to contain enough information so that the
+metadata block can be verified as being in the correct place without needing to
+look at any other metadata. This means it needs to contain location information.
+Just adding a block number to the metadata is not sufficient to protect against
+mis-directed writes - a write might be misdirected to the wrong LUN and so be
+written to the "correct block" of the wrong filesystem. Hence location
+information must contain a filesystem identifier as well as a block number.
+
+Another key information point in forensic analysis is knowing who the metadata
+block belongs to. We already know the type, the location, that it is valid
+and/or corrupted, and how long ago it was last modified. Knowing the owner
+of the block is important as it allows us to find other related metadata to
+determine the scope of the corruption. For example, if we have an extent btree
+object, we don't know what inode it belongs to and hence have to walk the entire
+filesystem to find the owner of the block. Worse, the corruption could mean that
+no owner can be found (i.e. it's an orphan block), and so without an owner field
+in the metadata we have no idea of the scope of the corruption. If we have an
+owner field in the metadata object, we can immediately do top down validation to
+determine the scope of the problem.
+
+Different types of metadata have different owner identifiers. For example,
+directory, attribute and extent tree blocks are all owned by an inode, whilst
+freespace btree blocks are owned by an allocation group. Hence the size and
+contents of the owner field are determined by the type of metadata object we are
+looking at.  The owner information can also identify misplaced writes (e.g.
+freespace btree block written to the wrong AG).
+
+Self describing metadata also needs to contain some indication of when it was
+written to the filesystem. One of the key information points when doing forensic
+analysis is how recently the block was modified. Correlation of a set of corrupted
+metadata blocks based on modification times is important as it can indicate
+whether the corruptions are related, whether there have been multiple corruption
+events that led to the eventual failure, and even whether there are corruptions
+present that the run-time verification is not detecting.
+
+For example, we can determine whether a metadata object is supposed to be free
+space or still allocated if it is still referenced by its owner by looking at
+when the free space btree block that contains the block was last written
+compared to when the metadata object itself was last written.  If the free space
+block is more recent than the object and the object's owner, then there is a
+very good chance that the block should have been removed from the owner.
+
+To provide this "written timestamp", each metadata block gets the Log Sequence
+Number (LSN) of the most recent transaction that modified it written into it.
+This number will always increase over the life of the filesystem, and the only
+thing that resets it is running xfs_repair on the filesystem. Further, by use of
+the LSN we can tell if the corrupted metadata all belonged to the same log
+checkpoint and hence have some idea of how much modification occurred between
+the first and last instance of corrupt metadata on disk and, further, how much
+modification occurred between the corruption being written and when it was
+detected.
+
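The ordering argument above can be sketched in a few lines of C. This is an illustration only: it assumes an LSN is a 64 bit value with the log cycle in the upper 32 bits and the log block in the lower 32 bits, and every `demo_` name is hypothetical rather than a kernel symbol.

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t demo_lsn_t;	/* hypothetical stand-in for an LSN */

static demo_lsn_t
demo_make_lsn(uint32_t cycle, uint32_t block)
{
	return ((demo_lsn_t)cycle << 32) | block;
}

/* Returns <0, 0 or >0: compare the cycle first, then the block. */
static int
demo_lsn_cmp(demo_lsn_t a, demo_lsn_t b)
{
	uint32_t	a_cycle = (uint32_t)(a >> 32);
	uint32_t	b_cycle = (uint32_t)(b >> 32);
	uint32_t	a_block = (uint32_t)a;
	uint32_t	b_block = (uint32_t)b;

	if (a_cycle != b_cycle)
		return a_cycle < b_cycle ? -1 : 1;
	if (a_block != b_block)
		return a_block < b_block ? -1 : 1;
	return 0;
}

/*
 * Heuristic from the text: if the free space btree block was written
 * more recently than both the object and its owner, the object was
 * probably freed and the owner's reference is stale.
 */
static bool
demo_probably_freed(demo_lsn_t freespace, demo_lsn_t object,
		    demo_lsn_t owner)
{
	return demo_lsn_cmp(freespace, object) > 0 &&
	       demo_lsn_cmp(freespace, owner) > 0;
}
```

A forensic tool could apply the same comparison across a set of corrupt blocks to see whether they share a log checkpoint.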
+== Runtime Validation
+
+Validation of self-describing metadata takes place at runtime in two places:
+
+	* immediately after a successful read from disk
+	* immediately prior to write IO submission
+
+The verification is completely stateless - it is done independently of the
+modification process, and seeks only to check that the metadata is what it says
+it is and that the metadata fields are within bounds and internally consistent.
+As such, we cannot catch all types of corruption that can occur within a block
+as there may be certain limitations that operational state enforces on the
+metadata, or there may be corruption of interblock relationships (e.g. corrupted
+sibling pointer lists). Hence we still need stateful checking in the main code
+body, but in general most of the per-field validation is handled by the
+verifiers.
+
+For read verification, the caller needs to specify the expected type of metadata
+that it should see, and the IO completion process verifies that the metadata
+object matches what was expected. If the verification process fails, then it
+marks the object being read as EFSCORRUPTED. The caller needs to catch this
+error (same as for IO errors), and if it needs to take special action due to a
+verification error it can do so by catching the EFSCORRUPTED error value. If we
+need more discrimination of error type at higher levels, we can define new
+error numbers for different errors as necessary.
+
+The first step in read verification is checking the magic number and determining
+whether CRC validation is necessary. If it is, the CRC32c is calculated and
+compared against the value stored in the object itself. Once this is validated,
+further checks are made against the location information, followed by extensive
+object specific metadata validation. If any of these checks fail, then the
+buffer is considered corrupt and the EFSCORRUPTED error is set appropriately.
+
+Write verification is the opposite of the read verification - first the object
+is extensively verified and if it is OK we then update the LSN from the last
+modification made to the object. After this, we calculate the CRC and insert it
+into the object. Once this is done the write IO is allowed to continue. If any
+error occurs during this process, the buffer is again marked with an EFSCORRUPTED
+error for the higher layers to catch.
+
+== Structures
+
+A typical on-disk structure needs to contain the following information:
+
+[source ,c]
+----
+struct xfs_ondisk_hdr {
+        __be32  magic;		/* magic number */
+        __be32  crc;		/* CRC, not logged */
+        uuid_t  uuid;		/* filesystem identifier */
+        __be64  owner;		/* parent object */
+        __be64  blkno;		/* location on disk */
+        __be64  lsn;		/* last modification in log, not logged */
+};
+----
+
+Depending on the metadata, this information may be part of a header structure
+separate to the metadata contents, or may be distributed through an existing
+structure. The latter occurs with metadata that already contains some of this
+information, such as the superblock and AG headers.
+
+Other metadata may have different formats for the information, but the same
+level of information is generally provided. For example:
+
+	* short btree blocks have a 32 bit owner (ag number) and a 32 bit block
+	  number for location. The two of these combined provide the same
+	  information as @owner and @blkno in the above structure, but using 8
+	  bytes less space on disk.
+
+	* directory/attribute node blocks have a 16 bit magic number, and the
+	  header that contains the magic number has other information in it as
+	  well. Hence the additional metadata headers change the overall format
+	  of the metadata.
+
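The space saving described for short btree blocks can be checked with a small sketch. Both structures and the helper below are hypothetical, use host-endian fields for brevity, and assume allocation groups are laid out back to back with a fixed number of blocks each.

```c
#include <stdint.h>

/* Identity fields as in the generic header: 64 bit owner and location. */
struct demo_long_ident {
	uint64_t	owner;
	uint64_t	blkno;
};

/* Compact form: AG number as owner, AG-relative block as location. */
struct demo_short_ident {
	uint32_t	owner_agno;
	uint32_t	agbno;
};

/* The compact form is 8 bytes smaller, as the text states. */
_Static_assert(sizeof(struct demo_long_ident) -
	       sizeof(struct demo_short_ident) == 8,
	       "short identity should save 8 bytes");

/*
 * Expand the compact identity into a filesystem-wide block index.
 * ag_blocks is an assumed per-AG block count supplied by the caller.
 */
static uint64_t
demo_linear_blkno(const struct demo_short_ident *id, uint32_t ag_blocks)
{
	return (uint64_t)id->owner_agno * ag_blocks + id->agbno;
}
```

The point of the sketch is only that the pair (AG number, AG-relative block) carries enough information to recover a global location, so no identifying information is lost by the smaller encoding.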
+A typical buffer read verifier is structured as follows:
+
+[source ,c]
+----
+#define XFS_FOO_CRC_OFF		offsetof(struct xfs_ondisk_hdr, crc)
+
+static void
+xfs_foo_read_verify(
+	struct xfs_buf	*bp)
+{
+	struct xfs_mount	*mp = bp->b_target->bt_mount;
+
+	if ((xfs_sb_version_hascrc(&mp->m_sb) &&
+	     !xfs_verify_cksum(bp->b_addr, BBTOB(bp->b_length),
+				XFS_FOO_CRC_OFF)) ||
+	    !xfs_foo_verify(bp)) {
+		XFS_CORRUPTION_ERROR(__func__, XFS_ERRLEVEL_LOW, mp, bp->b_addr);
+		xfs_buf_ioerror(bp, EFSCORRUPTED);
+	}
+}
+----
+
+The code ensures that the CRC is only checked if the filesystem has CRCs enabled
+by checking the feature bit in the superblock, and then if the CRC verifies OK
+(or is not needed) it verifies the actual contents of the block.
+
+The verifier function will take a couple of different forms, depending on
+whether the magic number can be used to determine the format of the block. In
+the case it can't, the code is structured as follows:
+
+[source ,c]
+----
+static bool
+xfs_foo_verify(
+	struct xfs_buf		*bp)
+{
+	struct xfs_mount	*mp = bp->b_target->bt_mount;
+	struct xfs_ondisk_hdr	*hdr = bp->b_addr;
+
+	if (hdr->magic != cpu_to_be32(XFS_FOO_MAGIC))
+		return false;
+
+	if (xfs_sb_version_hascrc(&mp->m_sb)) {
+		if (!uuid_equal(&hdr->uuid, &mp->m_sb.sb_uuid))
+			return false;
+		if (bp->b_bn != be64_to_cpu(hdr->blkno))
+			return false;
+		if (hdr->owner == 0)
+			return false;
+	}
+
+	/* object specific verification checks here */
+
+	return true;
+}
+----
+
+If there are different magic numbers for the different formats, the verifier
+will look like:
+
+[source ,c]
+----
+static bool
+xfs_foo_verify(
+	struct xfs_buf		*bp)
+{
+	struct xfs_mount	*mp = bp->b_target->bt_mount;
+	struct xfs_ondisk_hdr	*hdr = bp->b_addr;
+
+	if (hdr->magic == cpu_to_be32(XFS_FOO_CRC_MAGIC)) {
+		if (!uuid_equal(&hdr->uuid, &mp->m_sb.sb_uuid))
+			return false;
+		if (bp->b_bn != be64_to_cpu(hdr->blkno))
+			return false;
+		if (hdr->owner == 0)
+			return false;
+	} else if (hdr->magic != cpu_to_be32(XFS_FOO_MAGIC))
+		return false;
+
+	/* object specific verification checks here */
+
+	return true;
+}
+----
+
+Write verifiers are very similar to the read verifiers; they just do things in
+the opposite order to the read verifiers. A typical write verifier:
+
+[source ,c]
+----
+static void
+xfs_foo_write_verify(
+	struct xfs_buf	*bp)
+{
+	struct xfs_mount	*mp = bp->b_target->bt_mount;
+	struct xfs_buf_log_item	*bip = bp->b_fspriv;
+
+	if (!xfs_foo_verify(bp)) {
+		XFS_CORRUPTION_ERROR(__func__, XFS_ERRLEVEL_LOW, mp, bp->b_addr);
+		xfs_buf_ioerror(bp, EFSCORRUPTED);
+		return;
+	}
+
+	if (!xfs_sb_version_hascrc(&mp->m_sb))
+		return;
+
+
+	if (bip) {
+		struct xfs_ondisk_hdr	*hdr = bp->b_addr;
+		hdr->lsn = cpu_to_be64(bip->bli_item.li_lsn);
+	}
+	xfs_update_cksum(bp->b_addr, BBTOB(bp->b_length), XFS_FOO_CRC_OFF);
+}
+----
+
+This will verify the internal structure of the metadata before we go any
+further, detecting corruptions that have occurred as the metadata has been
+modified in memory. If the metadata verifies OK, and CRCs are enabled, we then
+update the LSN field (when it was last modified) and calculate the CRC on the
+metadata. Once this is done, we can issue the IO.
+
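The write-then-read ordering can be exercised outside the kernel with a toy block. The sketch below is not the XFS implementation: every `demo_` name, the magic value and the header layout are hypothetical, fields are host-endian for brevity, and the CRC is assumed to be computed with the crc field treated as zero.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DEMO_MAGIC	0x464f4f35u	/* hypothetical magic number */

struct demo_hdr {			/* hypothetical on-disk header */
	uint32_t	magic;
	uint32_t	crc;
	uint64_t	blkno;
	uint64_t	lsn;
};

/* Bitwise CRC32c update step (reflected polynomial 0x82F63B78). */
static uint32_t
demo_crc32c_update(uint32_t crc, const void *buf, size_t len)
{
	const uint8_t	*p = buf;

	while (len--) {
		crc ^= *p++;
		for (int bit = 0; bit < 8; bit++)
			crc = (crc >> 1) ^ (0x82F63B78u & -(crc & 1u));
	}
	return crc;
}

/* Checksum the whole block as if the crc field contained zeroes. */
static uint32_t
demo_block_cksum(const uint8_t *block, size_t len)
{
	const size_t	off = offsetof(struct demo_hdr, crc);
	const uint32_t	zero = 0;
	uint32_t	crc = 0xFFFFFFFFu;

	crc = demo_crc32c_update(crc, block, off);
	crc = demo_crc32c_update(crc, &zero, sizeof(zero));
	crc = demo_crc32c_update(crc, block + off + sizeof(zero),
				 len - off - sizeof(zero));
	return crc ^ 0xFFFFFFFFu;
}

/* Structure checks shared by the read and write paths. */
static bool
demo_verify(const uint8_t *block, uint64_t expected_blkno)
{
	struct demo_hdr	hdr;

	memcpy(&hdr, block, sizeof(hdr));
	return hdr.magic == DEMO_MAGIC && hdr.blkno == expected_blkno;
}

/* Write side: verify first, stamp the LSN, compute the CRC last. */
static bool
demo_write_verify(uint8_t *block, size_t len, uint64_t blkno, uint64_t lsn)
{
	uint32_t	crc;

	if (!demo_verify(block, blkno))
		return false;
	memcpy(block + offsetof(struct demo_hdr, lsn), &lsn, sizeof(lsn));
	crc = demo_block_cksum(block, len);
	memcpy(block + offsetof(struct demo_hdr, crc), &crc, sizeof(crc));
	return true;
}

/* Read side: check the CRC first, then the structure. */
static bool
demo_read_verify(const uint8_t *block, size_t len, uint64_t blkno)
{
	uint32_t	stored;

	memcpy(&stored, block + offsetof(struct demo_hdr, crc),
	       sizeof(stored));
	return stored == demo_block_cksum(block, len) &&
	       demo_verify(block, blkno);
}
```

Computing the CRC last matters: the LSN is stamped into the block on the write path, so a checksum taken before that update would no longer match what reaches disk.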
+== Inodes and Dquots
+
+Inodes and dquots are special snowflakes. They have per-object CRC and
+self-identifiers, but they are packed so that there are multiple objects per
+buffer. Hence we do not use per-buffer verifiers to do the work of per-object
+verification and CRC calculations. The per-buffer verifiers simply perform basic
+identification of the buffer - that they contain inodes or dquots, and that
+there are magic numbers in all the expected spots. All further CRC and
+verification checks are done when each inode is read from or written back to the
+buffer.
+
+The structure of the verifiers and the identifier checks is very similar to the
+buffer code described above. The only difference is where they are called. For
+example, inode read verification is done in xfs_iread() when the inode is first
+read out of the buffer and the struct xfs_inode is instantiated. The inode is
+already extensively verified during writeback in xfs_iflush_int(), so the only
+addition here is to add the LSN and CRC to the inode as it is copied back into
+the buffer.
+
diff --git a/design/XFS_Filesystem_Structure/xfs_filesystem_structure.asciidoc b/design/XFS_Filesystem_Structure/xfs_filesystem_structure.asciidoc
index 7bdfade..15ab185 100644
--- a/design/XFS_Filesystem_Structure/xfs_filesystem_structure.asciidoc
+++ b/design/XFS_Filesystem_Structure/xfs_filesystem_structure.asciidoc
@@ -48,6 +48,8 @@  include::overview.asciidoc[]
 
 include::metadata_integrity.asciidoc[]
 
+include::self_describing_metadata.asciidoc[]
+
 include::delayed_logging.asciidoc[]
 
 include::reflink.asciidoc[]
diff --git a/design/xfs-self-describing-metadata.asciidoc b/design/xfs-self-describing-metadata.asciidoc
deleted file mode 100644
index b7dc3ff..0000000
--- a/design/xfs-self-describing-metadata.asciidoc
+++ /dev/null
@@ -1,356 +0,0 @@ 
-= XFS Self Describing Metadata
-Dave Chinner, <dchinner@redhat.com>
-v1.0, Feb 2014: Initial conversion to asciidoc
-
-== Introduction
-
-The largest scalability problem facing XFS is not one of algorithmic
-scalability, but of verification of the filesystem structure. Scalabilty of the
-structures and indexes on disk and the algorithms for iterating them are
-adequate for supporting PB scale filesystems with billions of inodes, however it
-is this very scalability that causes the verification problem.
-
-Almost all metadata on XFS is dynamically allocated. The only fixed location
-metadata is the allocation group headers (SB, AGF, AGFL and AGI), while all
-other metadata structures need to be discovered by walking the filesystem
-structure in different ways. While this is already done by userspace tools for
-validating and repairing the structure, there are limits to what they can
-verify, and this in turn limits the supportable size of an XFS filesystem.
-
-For example, it is entirely possible to manually use xfs_db and a bit of
-scripting to analyse the structure of a 100TB filesystem when trying to
-determine the root cause of a corruption problem, but it is still mainly a
-manual task of verifying that things like single bit errors or misplaced writes
-weren't the ultimate cause of a corruption event. It may take a few hours to a
-few days to perform such forensic analysis, so for at this scale root cause
-analysis is entirely possible.
-
-However, if we scale the filesystem up to 1PB, we now have 10x as much metadata
-to analyse and so that analysis blows out towards weeks/months of forensic work.
-Most of the analysis work is slow and tedious, so as the amount of analysis goes
-up, the more likely that the cause will be lost in the noise.  Hence the primary
-concern for supporting PB scale filesystems is minimising the time and effort
-required for basic forensic analysis of the filesystem structure.
-
-
-== Self Describing Metadata
-
-One of the problems with the current metadata format is that apart from the
-magic number in the metadata block, we have no other way of identifying what it
-is supposed to be. We can't even identify if it is the right place. Put simply,
-you can't look at a single metadata block in isolation and say "yes, it is
-supposed to be there and the contents are valid".
-
-Hence most of the time spent on forensic analysis is spent doing basic
-verification of metadata values, looking for values that are in range (and hence
-not detected by automated verification checks) but are not correct. Finding and
-understanding how things like cross linked block lists (e.g. sibling
-pointers in a btree end up with loops in them) are the key to understanding what
-went wrong, but it is impossible to tell what order the blocks were linked into
-each other or written to disk after the fact.
-
-Hence we need to record more information into the metadata to allow us to
-quickly determine if the metadata is intact and can be ignored for the purpose
-of analysis. We can't protect against every possible type of error, but we can
-ensure that common types of errors are easily detectable.  Hence the concept of
-self describing metadata.
-
-The first, fundamental requirement of self describing metadata is that the
-metadata object contains some form of unique identifier in a well known
-location. This allows us to identify the expected contents of the block and
-hence parse and verify the metadata object. IF we can't independently identify
-the type of metadata in the object, then the metadata doesn't describe itself
-very well at all!
-
-Luckily, almost all XFS metadata has magic numbers embedded already - only the
-AGFL, remote symlinks and remote attribute blocks do not contain identifying
-magic numbers. Hence we can change the on-disk format of all these objects to
-add more identifying information and detect this simply by changing the magic
-numbers in the metadata objects. That is, if it has the current magic number,
-the metadata isn't self identifying. If it contains a new magic number, it is
-self identifying and we can do much more expansive automated verification of the
-metadata object at runtime, during forensic analysis or repair.
-
-As a primary concern, self describing metadata needs some form of overall
-integrity checking. We cannot trust the metadata if we cannot verify that it has
-not been changed as a result of external influences. Hence we need some form of
-integrity check, and this is done by adding CRC32c validation to the metadata
-block. If we can verify the block contains the metadata it was intended to
-contain, a large amount of the manual verification work can be skipped.
-
-CRC32c was selected as metadata cannot be more than 64k in length in XFS and
-hence a 32 bit CRC is more than sufficient to detect multi-bit errors in
-metadata blocks. CRC32c is also now hardware accelerated on common CPUs so it is
-fast. So while CRC32c is not the strongest of possible integrity checks that
-could be used, it is more than sufficient for our needs and has relatively
-little overhead. Adding support for larger integrity fields and/or algorithms
-does really provide any extra value over CRC32c, but it does add a lot of
-complexity and so there is no provision for changing the integrity checking
-mechanism.
-
-Self describing metadata needs to contain enough information so that the
-metadata block can be verified as being in the correct place without needing to
-look at any other metadata. This means it needs to contain location information.
-Just adding a block number to the metadata is not sufficient to protect against
-mis-directed writes - a write might be misdirected to the wrong LUN and so be
-written to the "correct block" of the wrong filesystem. Hence location
-information must contain a filesystem identifier as well as a block number.
-
-Another key information point in forensic analysis is knowing who the metadata
-block belongs to. We already know the type, the location, that it is valid
-and/or corrupted, and how long ago it was last modified. Knowing the owner
-of the block is important as it allows us to find other related metadata to
-determine the scope of the corruption. For example, if we have an extent btree
-object, we don't know what inode it belongs to and hence have to walk the entire
-filesystem to find the owner of the block. Worse, the corruption could mean that
-no owner can be found (i.e. it's an orphan block), and so without an owner field
-in the metadata we have no idea of the scope of the corruption. If we have an
-owner field in the metadata object, we can immediately do top down validation to
-determine the scope of the problem.
-
-Different types of metadata have different owner identifiers. For example,
-directory, attribute and extent tree blocks are all owned by an inode, whilst
-freespace btree blocks are owned by an allocation group. Hence the size and
-contents of the owner field are determined by the type of metadata object we are
-looking at.  The owner information can also identify misplaced writes (e.g.
-freespace btree block written to the wrong AG).
-
-Self describing metadata also needs to contain some indication of when it was
-written to the filesystem. One of the key information points when doing forensic
-analysis is how recently the block was modified. Correlation of a set of
-corrupted metadata blocks based on modification times is important as it can
-indicate whether the corruptions are related, whether there have been multiple
-corruption events that led to the eventual failure, and even whether there are
-corruptions
-present that the run-time verification is not detecting.
-
-For example, if a metadata object is still referenced by its owner, we can
-determine whether it is supposed to be free space or still allocated by
-comparing when the free space btree block that contains the block was last
-written with when the metadata object itself was last written. If the free space
-block is more recent than the object and the object's owner, then there is a
-very good chance that the block should have been removed from the owner.
-
-To provide this "written timestamp", each metadata block records the Log
-Sequence Number (LSN) of the most recent transaction that modified it.
-This number will always increase over the life of the filesystem, and the only
-thing that resets it is running xfs_repair on the filesystem. Further, by use of
-the LSN we can tell if the corrupted metadata all belonged to the same log
-checkpoint and hence have some idea of how much modification occurred between
-the first and last instance of corrupt metadata on disk and, further, how much
-modification occurred between the corruption being written and when it was
-detected.
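-
-The free space comparison described above can be sketched as follows. This is
-a rough illustration, not kernel code: the helper names are hypothetical, and
-the sketch assumes an LSN packs the log cycle number into the upper 32 bits and
-the log block number into the lower 32 bits.
-

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed layout: log cycle in the upper 32 bits, log block in the lower. */
static uint32_t lsn_cycle(uint64_t lsn) { return (uint32_t)(lsn >> 32); }
static uint32_t lsn_block(uint64_t lsn) { return (uint32_t)(lsn & 0xFFFFFFFFu); }

/* Returns <0, 0 or >0 if a was written before, together with, or after b. */
static int
lsn_cmp(
	uint64_t	a,
	uint64_t	b)
{
	if (lsn_cycle(a) != lsn_cycle(b))
		return lsn_cycle(a) < lsn_cycle(b) ? -1 : 1;
	if (lsn_block(a) != lsn_block(b))
		return lsn_block(a) < lsn_block(b) ? -1 : 1;
	return 0;
}

/*
 * True if the free space btree block was written after both the object and
 * its owner, i.e. the owner's reference to the object is probably stale.
 */
static bool
reference_probably_stale(
	uint64_t	freesp_lsn,
	uint64_t	object_lsn,
	uint64_t	owner_lsn)
{
	return lsn_cmp(freesp_lsn, object_lsn) > 0 &&
	       lsn_cmp(freesp_lsn, owner_lsn) > 0;
}
```

-
-The result is only a hint for forensic analysis; it does not prove which copy
-of the metadata is correct.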
-
-== Runtime Validation
-
-Validation of self-describing metadata takes place at runtime in two places:
-
-	* immediately after a successful read from disk
-	* immediately prior to write IO submission
-
-The verification is completely stateless - it is done independently of the
-modification process, and seeks only to check that the metadata is what it says
-it is and that the metadata fields are within bounds and internally consistent.
-As such, we cannot catch all types of corruption that can occur within a block
-as there may be certain limitations that operational state enforces on the
-metadata, or there may be corruption of interblock relationships (e.g. corrupted
-sibling pointer lists). Hence we still need stateful checking in the main code
-body, but in general most of the per-field validation is handled by the
-verifiers.
-
-For read verification, the caller needs to specify the expected type of metadata
-that it should see, and the IO completion process verifies that the metadata
-object matches what was expected. If the verification process fails, then it
-marks the object being read as EFSCORRUPTED. The caller needs to catch this
-error (same as for IO errors), and if it needs to take special action due to a
-verification error it can do so by catching the EFSCORRUPTED error value. If we
-need more discrimination of error type at higher levels, we can define new
-error numbers for different errors as necessary.
-
-The first step in read verification is checking the magic number and determining
-whether CRC validation is necessary. If it is, the CRC32c is calculated and
-compared against the value stored in the object itself. Once this is validated,
-further checks are made against the location information, followed by extensive
-object specific metadata validation. If any of these checks fail, then the
-buffer is considered corrupt and the EFSCORRUPTED error is set appropriately.
-
-Write verification is the opposite of the read verification - first the object
-is extensively verified and if it is OK we then update the LSN from the last
-modification made to the object. After this, we calculate the CRC and insert it
-into the object. Once this is done the write IO is allowed to continue. If any
-error occurs during this process, the buffer is again marked with an EFSCORRUPTED
-error for the higher layers to catch.
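-
-The ordering of the two paths can be tried out in user space with a small
-sketch. Everything here is hypothetical: struct foo_hdr, FOO_MAGIC, the
-SKETCH_EFSCORRUPTED value and the helper names are invented for illustration,
-the fields are kept in host byte order for brevity, and the CRC is computed
-with the crc field treated as zero, which is an assumption of this sketch.
-

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define FOO_MAGIC		0x464f4f35u	/* hypothetical magic number */
#define SKETCH_EFSCORRUPTED	117		/* stand-in error value */

/* Host-endian stand-in for the header; real on-disk fields are big-endian. */
struct foo_hdr {
	uint32_t	magic;
	uint32_t	crc;
	uint8_t		uuid[16];
	uint64_t	owner;
	uint64_t	blkno;
	uint64_t	lsn;
};

static uint32_t
crc32c_update(uint32_t crc, const uint8_t *p, size_t len)
{
	int	bit;

	while (len--) {
		crc ^= *p++;
		for (bit = 0; bit < 8; bit++)
			crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
	}
	return crc;
}

/* CRC32c of the whole buffer, treating the crc field itself as zero. */
static uint32_t
foo_cksum(const uint8_t *buf, size_t len)
{
	static const uint8_t	zero[sizeof(uint32_t)];
	size_t			off = offsetof(struct foo_hdr, crc);
	uint32_t		crc = 0xFFFFFFFFu;

	crc = crc32c_update(crc, buf, off);
	crc = crc32c_update(crc, zero, sizeof(zero));
	crc = crc32c_update(crc, buf + off + sizeof(zero),
			    len - off - sizeof(zero));
	return ~crc;
}

/* Stateless checks: type, filesystem, location, owner. */
static bool
foo_fields_ok(const struct foo_hdr *hdr, const uint8_t *fs_uuid, uint64_t blkno)
{
	return hdr->magic == FOO_MAGIC &&
	       memcmp(hdr->uuid, fs_uuid, sizeof(hdr->uuid)) == 0 &&
	       hdr->blkno == blkno &&
	       hdr->owner != 0;
}

/* Read side: magic first, then CRC, then location and owner. */
static int
foo_read_verify(const uint8_t *buf, size_t len, const uint8_t *fs_uuid,
		uint64_t blkno)
{
	struct foo_hdr	hdr;

	assert(len >= sizeof(hdr));
	memcpy(&hdr, buf, sizeof(hdr));
	if (hdr.magic != FOO_MAGIC || hdr.crc != foo_cksum(buf, len) ||
	    !foo_fields_ok(&hdr, fs_uuid, blkno))
		return SKETCH_EFSCORRUPTED;
	return 0;
}

/* Write side: verify first, then stamp the LSN, then compute the CRC. */
static int
foo_write_verify(uint8_t *buf, size_t len, const uint8_t *fs_uuid,
		 uint64_t blkno, uint64_t lsn)
{
	struct foo_hdr	hdr;

	assert(len >= sizeof(hdr));
	memcpy(&hdr, buf, sizeof(hdr));
	if (!foo_fields_ok(&hdr, fs_uuid, blkno))
		return SKETCH_EFSCORRUPTED;
	hdr.lsn = lsn;
	memcpy(buf, &hdr, sizeof(hdr));		/* LSN must be covered by the CRC */
	hdr.crc = foo_cksum(buf, len);
	memcpy(buf, &hdr, sizeof(hdr));
	return 0;
}
```

-
-A buffer that passes foo_write_verify() passes foo_read_verify() for the same
-UUID and block number, and fails it if a bit is flipped or if it is read back
-from a different block number.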
-
-== Structures
-
-A typical on-disk structure needs to contain the following information:
-
-[source ,c]
-----
-struct xfs_ondisk_hdr {
-        __be32  magic;		/* magic number */
-        __be32  crc;		/* CRC, not logged */
-        uuid_t  uuid;		/* filesystem identifier */
-        __be64  owner;		/* parent object */
-        __be64  blkno;		/* location on disk */
-        __be64  lsn;		/* last modification in log, not logged */
-};
-----
-
-Depending on the metadata, this information may be part of a header structure
-separate to the metadata contents, or may be distributed through an existing
-structure. The latter occurs with metadata that already contains some of this
-information, such as the superblock and AG headers.
-
-Other metadata may have different formats for the information, but the same
-level of information is generally provided. For example:
-
-	* short btree blocks have a 32 bit owner (ag number) and a 32 bit block
-	  number for location. The two of these combined provide the same
-	  information as @owner and @blkno in the above structure, but using 8
-	  bytes less space on disk.
-
-	* directory/attribute node blocks have a 16 bit magic number, and the
-	  header that contains the magic number has other information in it as
-	  well. Hence the additional metadata headers change the overall format
-	  of the metadata.
-
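-The space saving mentioned for short btree blocks can be checked with a small
-sketch. The two structures below are hypothetical; they only mirror the field
-widths described above and are not the kernel's btree block definitions.
-

```c
#include <assert.h>
#include <stdint.h>

/* Owner and location as in the generic header: two 64 bit fields. */
struct long_form_ident {
	uint64_t	owner;
	uint64_t	blkno;
};

/* Short btree block form: 32 bit owner (AG number), 32 bit block number. */
struct short_form_ident {
	uint32_t	owner;
	uint32_t	blkno;
};

/* Compile-time check of the 8 byte difference. */
static_assert(sizeof(struct long_form_ident) -
	      sizeof(struct short_form_ident) == 8,
	      "short form should be 8 bytes smaller");
```

-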
-A typical buffer read verifier is structured as follows:
-
-[source ,c]
-----
-#define XFS_FOO_CRC_OFF		offsetof(struct xfs_ondisk_hdr, crc)
-
-static void
-xfs_foo_read_verify(
-	struct xfs_buf	*bp)
-{
-        struct xfs_mount *mp = bp->b_target->bt_mount;
-
-        if ((xfs_sb_version_hascrc(&mp->m_sb) &&
-             !xfs_verify_cksum(bp->b_addr, BBTOB(bp->b_length),
-					XFS_FOO_CRC_OFF)) ||
-            !xfs_foo_verify(bp)) {
-                XFS_CORRUPTION_ERROR(__func__, XFS_ERRLEVEL_LOW, mp, bp->b_addr);
-                xfs_buf_ioerror(bp, EFSCORRUPTED);
-        }
-}
-----
-
-The code ensures that the CRC is only checked if the filesystem has CRCs enabled
-by checking the superblock for the feature bit, and then if the CRC verifies OK
-(or is not needed) it verifies the actual contents of the block.
-
-The verifier function will take a couple of different forms, depending on
-whether the magic number can be used to determine the format of the block. In
-the case it can't, the code is structured as follows:
-
-[source ,c]
-----
-static bool
-xfs_foo_verify(
-	struct xfs_buf		*bp)
-{
-        struct xfs_mount	*mp = bp->b_target->bt_mount;
-        struct xfs_ondisk_hdr	*hdr = bp->b_addr;
-
-        if (hdr->magic != cpu_to_be32(XFS_FOO_MAGIC))
-                return false;
-
-        if (xfs_sb_version_hascrc(&mp->m_sb)) {
-		if (!uuid_equal(&hdr->uuid, &mp->m_sb.sb_uuid))
-			return false;
-		if (bp->b_bn != be64_to_cpu(hdr->blkno))
-			return false;
-		if (hdr->owner == 0)
-			return false;
-	}
-
-	/* object specific verification checks here */
-
-        return true;
-}
-----
-
-If there are different magic numbers for the different formats, the verifier
-will look like:
-
-[source ,c]
-----
-static bool
-xfs_foo_verify(
-	struct xfs_buf		*bp)
-{
-        struct xfs_mount	*mp = bp->b_target->bt_mount;
-        struct xfs_ondisk_hdr	*hdr = bp->b_addr;
-
-        if (hdr->magic == cpu_to_be32(XFS_FOO_CRC_MAGIC)) {
-		if (!uuid_equal(&hdr->uuid, &mp->m_sb.sb_uuid))
-			return false;
-		if (bp->b_bn != be64_to_cpu(hdr->blkno))
-			return false;
-		if (hdr->owner == 0)
-			return false;
-	} else if (hdr->magic != cpu_to_be32(XFS_FOO_MAGIC))
-		return false;
-
-	/* object specific verification checks here */
-
-        return true;
-}
-----
-
-Write verifiers are very similar to the read verifiers; they just do things in
-the opposite order to the read verifiers. A typical write verifier:
-
-[source ,c]
-----
-static void
-xfs_foo_write_verify(
-	struct xfs_buf	*bp)
-{
-	struct xfs_mount	*mp = bp->b_target->bt_mount;
-	struct xfs_buf_log_item	*bip = bp->b_fspriv;
-
-	if (!xfs_foo_verify(bp)) {
-		XFS_CORRUPTION_ERROR(__func__, XFS_ERRLEVEL_LOW, mp, bp->b_addr);
-		xfs_buf_ioerror(bp, EFSCORRUPTED);
-		return;
-	}
-
-	if (!xfs_sb_version_hascrc(&mp->m_sb))
-		return;
-
-	if (bip) {
-		struct xfs_ondisk_hdr	*hdr = bp->b_addr;
-		hdr->lsn = cpu_to_be64(bip->bli_item.li_lsn);
-	}
-	xfs_update_cksum(bp->b_addr, BBTOB(bp->b_length), XFS_FOO_CRC_OFF);
-}
-----
-
-This will verify the internal structure of the metadata before we go any
-further, detecting corruptions that have occurred as the metadata has been
-modified in memory. If the metadata verifies OK, and CRCs are enabled, we then
-update the LSN field (when it was last modified) and calculate the CRC on the
-metadata. Once this is done, we can issue the IO.
-
-== Inodes and Dquots
-
-Inodes and dquots are special snowflakes. They have per-object CRC and
-self-identifiers, but they are packed so that there are multiple objects per
-buffer. Hence we do not use per-buffer verifiers to do the work of per-object
-verification and CRC calculations. The per-buffer verifiers simply perform basic
-identification of the buffer - that they contain inodes or dquots, and that
-there are magic numbers in all the expected spots. All further CRC and
-verification checks are done when each inode is read from or written back to the
-buffer.
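-
-A per-buffer check of this kind can be sketched as a loop over fixed-size
-slots. The function name is hypothetical, and the sketch assumes each inode
-starts with a 16 bit big-endian magic number of 0x494e ("IN") and that the
-buffer holds a whole number of inodes.
-

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_INODE_MAGIC	0x494eu	/* "IN", assumed big-endian on disk */

/*
 * Basic buffer identification only: check that a magic number sits at the
 * start of every inode-sized slot. Per-inode CRC and field checks are left
 * to the code that reads each inode out of the buffer.
 */
static bool
inode_buf_magics_ok(
	const uint8_t	*buf,
	size_t		buf_len,
	size_t		inode_size)
{
	size_t		off;

	assert(buf != NULL);
	assert(inode_size >= 2);

	if (buf_len == 0 || buf_len % inode_size != 0)
		return false;

	for (off = 0; off < buf_len; off += inode_size) {
		uint16_t magic = (uint16_t)((buf[off] << 8) | buf[off + 1]);

		if (magic != SKETCH_INODE_MAGIC)
			return false;
	}
	return true;
}
```

-
-A dquot buffer check would have the same shape with a different magic number
-and slot size.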
-
-The structure of the verifiers and the identifier checks is very similar to the
-buffer code described above. The only difference is where they are called. For
-example, inode read verification is done in xfs_iread() when the inode is first
-read out of the buffer and the struct xfs_inode is instantiated. The inode is
-already extensively verified during writeback in xfs_iflush_int(), so the only
-addition here is to add the LSN and CRC to the inode as it is copied back into
-the buffer.
-