From patchwork Thu Oct 4 04:18:42 2018
X-Patchwork-Submitter: "Darrick J. Wong"
X-Patchwork-Id: 10625543
Subject: [PATCH 03/22] docs: add XFS self-describing metadata integrity doc to DS&A book
From: "Darrick J. Wong"
To: darrick.wong@oracle.com
Cc: linux-xfs@vger.kernel.org, linux-doc@vger.kernel.org, corbet@lwn.net
Date: Wed, 03 Oct 2018 21:18:42 -0700
Message-ID: <153862672289.26427.5818097080898758305.stgit@magnolia>
In-Reply-To: <153862669110.26427.16504658853992750743.stgit@magnolia>
References: <153862669110.26427.16504658853992750743.stgit@magnolia>
X-Mailing-List: linux-xfs@vger.kernel.org

From: Darrick J. Wong

Signed-off-by: Darrick J. Wong
---
 .../filesystems/xfs-data-structures/overview.rst   |    2 
 .../self_describing_metadata.rst                   |  402 ++++++++++++++++++++
 .../filesystems/xfs-self-describing-metadata.txt   |  350 -----------------
 3 files changed, 404 insertions(+), 350 deletions(-)
 create mode 100644 Documentation/filesystems/xfs-data-structures/self_describing_metadata.rst
 delete mode 100644 Documentation/filesystems/xfs-self-describing-metadata.txt

diff --git a/Documentation/filesystems/xfs-data-structures/overview.rst b/Documentation/filesystems/xfs-data-structures/overview.rst
index 43b48f30f7e8..8b3de9abcf39 100644
--- a/Documentation/filesystems/xfs-data-structures/overview.rst
+++ b/Documentation/filesystems/xfs-data-structures/overview.rst
@@ -42,3 +42,5 @@ filesystem operations can be carried out atomically in the case of a crash.
 Furthermore, there is the concept of a real-time device wherein allocations are
 tracked more simply and in larger chunks to reduce jitter in allocation
 latency.
+
+..
include:: self_describing_metadata.rst
diff --git a/Documentation/filesystems/xfs-data-structures/self_describing_metadata.rst b/Documentation/filesystems/xfs-data-structures/self_describing_metadata.rst
new file mode 100644
index 000000000000..f9d41c76e1d5
--- /dev/null
+++ b/Documentation/filesystems/xfs-data-structures/self_describing_metadata.rst
@@ -0,0 +1,402 @@
+.. SPDX-License-Identifier: CC-BY-SA-4.0
+
+Metadata Integrity
+------------------
+
+Introduction
+~~~~~~~~~~~~
+
+The largest scalability problem facing XFS is not one of algorithmic
+scalability, but of verification of the filesystem structure. Scalability of
+the structures and indexes on disk and the algorithms for iterating them are
+adequate for supporting PB scale filesystems with billions of inodes; however,
+it is this very scalability that causes the verification problem.
+
+Almost all metadata on XFS is dynamically allocated. The only fixed location
+metadata is the allocation group headers (SB, AGF, AGFL and AGI), while all
+other metadata structures need to be discovered by walking the filesystem
+structure in different ways. While this is already done by userspace tools for
+validating and repairing the structure, there are limits to what they can
+verify, and this in turn limits the supportable size of an XFS filesystem.
+
+For example, it is entirely possible to manually use xfs\_db and a bit of
+scripting to analyse the structure of a 100TB filesystem when trying to
+determine the root cause of a corruption problem, but it is still mainly a
+manual task of verifying that things like single bit errors or misplaced
+writes weren’t the ultimate cause of a corruption event. It may take a few
+hours to a few days to perform such forensic analysis, so at this scale
+root cause analysis is entirely possible.
+
+However, if we scale the filesystem up to 1PB, we now have 10x as much
+metadata to analyse and so that analysis blows out towards weeks/months of
+forensic work.
Most of the analysis work is slow and tedious, so as the amount
+of analysis goes up, the more likely it is that the cause will be lost in the
+noise. Hence the primary concern for supporting PB scale filesystems is
+minimising the time and effort required for basic forensic analysis of the
+filesystem structure.
+
+Therefore, the version 5 disk format introduced larger headers for all
+metadata types, which enable the filesystem to check information being read
+from the disk more rigorously. Metadata integrity fields now include:
+
+- **Magic** numbers, to classify all types of metadata. This is unchanged
+  from v4.
+
+- A copy of the filesystem **UUID**, to confirm that a given disk block is
+  connected to the superblock.
+
+- The **owner**, to avoid accessing a piece of metadata which belongs to some
+  other part of the filesystem.
+
+- The filesystem **block number**, to detect misplaced writes.
+
+- The **log serial number** of the last write to this block, to avoid
+  replaying obsolete log entries.
+
+- A CRC32c **checksum** of the entire block, to detect minor corruption.
+
+Metadata integrity coverage has been extended to all metadata blocks in the
+filesystem, with the following notes:
+
+- Inodes can have multiple "owners" in the directory tree; therefore the
+  record contains the inode number instead of an owner or a block number.
+
+- Superblocks have no owners.
+
+- The disk quota file has no owner or block numbers.
+
+- Metadata owned by files list the inode number as the owner.
+
+- Per-AG data and B+tree blocks list the AG number as the owner.
+
+- Per-AG header sectors don’t list owners or block numbers, since they have
+  fixed locations.
+
+- Remote attribute blocks are not logged and therefore the LSN must be -1.
+
+This functionality enables XFS to decide that a block’s contents are so
+unexpected that it should stop immediately. Unfortunately, checksums do not
+allow for automatic correction. Please keep regular backups, as always.
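As an illustration of the log serial number field listed above, the sketch below shows how a recovery pass could decide that a logged change is obsolete for a given block. It is a minimal sketch under stated assumptions, not the kernel's implementation: the `demo_` names are hypothetical, and packing an LSN as a 32-bit cycle number above a 32-bit log block number is an assumption made for this example.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: build an LSN from a log cycle number and a block
 * offset within the log. The cycle lives in the high 32 bits (assumption),
 * so a plain integer comparison orders LSNs by cycle first, then by block.
 */
static uint64_t demo_make_lsn(uint32_t cycle, uint32_t block)
{
    return ((uint64_t)cycle << 32) | block;
}

/*
 * A log item is obsolete for a block if the block on disk was already
 * written at or after the point in the log where the item was recorded.
 * Replaying such an item would move the block backwards in time.
 */
static bool demo_log_item_is_obsolete(uint64_t block_lsn, uint64_t item_lsn)
{
    return block_lsn >= item_lsn;
}
```

With these helpers, an item logged at cycle 5, block 40 would be skipped for a metadata block stamped with cycle 5, block 100, while an item from cycle 6 would still be replayed.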
+
+Self Describing Metadata
+~~~~~~~~~~~~~~~~~~~~~~~~
+
+One of the problems with the current metadata format is that apart from the
+magic number in the metadata block, we have no other way of identifying what
+it is supposed to be. We can’t even identify if it is the right place. Put
+simply, you can’t look at a single metadata block in isolation and say "yes,
+it is supposed to be there and the contents are valid".
+
+Hence most of the time spent on forensic analysis is spent doing basic
+verification of metadata values, looking for values that are in range (and
+hence not detected by automated verification checks) but are not correct.
+Finding and understanding things like cross linked block lists (e.g.
+sibling pointers in a btree that end up with loops in them) is the key to
+understanding what went wrong, but it is impossible to tell what order the
+blocks were linked into each other or written to disk after the fact.
+
+Hence we need to record more information into the metadata to allow us to
+quickly determine if the metadata is intact and can be ignored for the purpose
+of analysis. We can’t protect against every possible type of error, but we can
+ensure that common types of errors are easily detectable. Hence the concept of
+self describing metadata.
+
+The first, fundamental requirement of self describing metadata is that the
+metadata object contains some form of unique identifier in a well known
+location. This allows us to identify the expected contents of the block and
+hence parse and verify the metadata object. If we can’t independently identify
+the type of metadata in the object, then the metadata doesn’t describe itself
+very well at all!
+
+Luckily, almost all XFS metadata has magic numbers embedded already - only the
+AGFL, remote symlinks and remote attribute blocks do not contain identifying
+magic numbers.
Hence we can change the on-disk format of all these objects to
+add more identifying information and detect this simply by changing the magic
+numbers in the metadata objects. That is, if it has the current magic number,
+the metadata isn’t self identifying. If it contains a new magic number, it is
+self identifying and we can do much more expansive automated verification of
+the metadata object at runtime, during forensic analysis or repair.
+
+As a primary concern, self describing metadata needs some form of overall
+integrity checking. We cannot trust the metadata if we cannot verify that it
+has not been changed as a result of external influences. Hence we need some
+form of integrity check, and this is done by adding CRC32c validation to the
+metadata block. If we can verify the block contains the metadata it was
+intended to contain, a large amount of the manual verification work can be
+skipped.
+
+CRC32c was selected as metadata cannot be more than 64k in length in XFS and
+hence a 32 bit CRC is more than sufficient to detect multi-bit errors in
+metadata blocks. CRC32c is also now hardware accelerated on common CPUs so it
+is fast. So while CRC32c is not the strongest of possible integrity checks
+that could be used, it is more than sufficient for our needs and has
+relatively little overhead. Adding support for larger integrity fields and/or
+algorithms does not really provide any extra value over CRC32c, but it does
+add a lot of complexity and so there is no provision for changing the
+integrity checking mechanism.
+
+Self describing metadata needs to contain enough information so that the
+metadata block can be verified as being in the correct place without needing
+to look at any other metadata. This means it needs to contain location
+information.
Just adding a block number to the metadata is not sufficient to
+protect against mis-directed writes - a write might be misdirected to the
+wrong LUN and so be written to the "correct block" of the wrong filesystem.
+Hence location information must contain a filesystem identifier as well as a
+block number.
+
+Another key information point in forensic analysis is knowing who the metadata
+block belongs to. We already know the type, the location, that it is valid
+and/or corrupted, and how long ago it was last modified. Knowing the
+owner of the block is important as it allows us to find other related metadata
+to determine the scope of the corruption. For example, if we have an extent
+btree object, we don’t know what inode it belongs to and hence have to walk
+the entire filesystem to find the owner of the block. Worse, the corruption
+could mean that no owner can be found (i.e. it’s an orphan block), and so
+without an owner field in the metadata we have no idea of the scope of the
+corruption. If we have an owner field in the metadata object, we can
+immediately do top down validation to determine the scope of the problem.
+
+Different types of metadata have different owner identifiers. For example,
+directory, attribute and extent tree blocks are all owned by an inode, whilst
+freespace btree blocks are owned by an allocation group. Hence the size and
+contents of the owner field are determined by the type of metadata object we
+are looking at. The owner information can also identify misplaced writes (e.g.
+freespace btree block written to the wrong AG).
+
+Self describing metadata also needs to contain some indication of when it was
+written to the filesystem. One of the key information points when doing
+forensic analysis is how recently the block was modified.
Correlation of a set
+of corrupted metadata blocks based on modification times is important as it
+can indicate whether the corruptions are related, whether there have been
+multiple corruption events that led to the eventual failure, and even whether
+there are corruptions present that the run-time verification is not detecting.
+
+For example, we can determine whether a metadata object is supposed to be free
+space or still allocated if it is still referenced by its owner by looking at
+when the free space btree block that contains the block was last written
+compared to when the metadata object itself was last written. If the free
+space block is more recent than the object and the object’s owner, then there
+is a very good chance that the block should have been removed from the owner.
+
+To provide this "written timestamp", each metadata block gets the Log Sequence
+Number (LSN) of the most recent transaction it was modified on written into
+it. This number will always increase over the life of the filesystem, and the
+only thing that resets it is running xfs\_repair on the filesystem. Further,
+by use of the LSN we can tell if the corrupted metadata all belonged to the
+same log checkpoint and hence have some idea of how much modification occurred
+between the first and last instance of corrupt metadata on disk and, further,
+how much modification occurred between the corruption being written and when
+it was detected.
+
+Runtime Validation
+~~~~~~~~~~~~~~~~~~
+
+Validation of self-describing metadata takes place at runtime in two places:
+
+- immediately after a successful read from disk
+
+- immediately prior to write IO submission
+
+The verification is completely stateless - it is done independently of the
+modification process, and seeks only to check that the metadata is what it
+says it is and that the metadata fields are within bounds and internally
+consistent.
As such, we cannot catch all types of corruption that can occur
+within a block as there may be certain limitations that operational state
+enforces on the metadata, or there may be corruption of interblock
+relationships (e.g. corrupted sibling pointer lists). Hence we still need
+stateful checking in the main code body, but in general most of the per-field
+validation is handled by the verifiers.
+
+For read verification, the caller needs to specify the expected type of
+metadata that it should see, and the IO completion process verifies that the
+metadata object matches what was expected. If the verification process fails,
+then it marks the object being read as EFSCORRUPTED. The caller needs to catch
+this error (same as for IO errors), and if it needs to take special action due
+to a verification error it can do so by catching the EFSCORRUPTED error value.
+If we need more discrimination of error type at higher levels, we can define
+new error numbers for different errors as necessary.
+
+The first step in read verification is checking the magic number and
+determining whether CRC validation is necessary. If it is, the CRC32c is
+calculated and compared against the value stored in the object itself. Once
+this is validated, further checks are made against the location information,
+followed by extensive object specific metadata validation. If any of these
+checks fail, then the buffer is considered corrupt and the EFSCORRUPTED error
+is set appropriately.
+
+Write verification is the opposite of the read verification - first the object
+is extensively verified and if it is OK we then update the LSN from the last
+modification made to the object. After this, we calculate the CRC and insert
+it into the object. Once this is done the write IO is allowed to continue. If
+any error occurs during this process, the buffer is again marked with an
+EFSCORRUPTED error for the higher layers to catch.
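The checksum step of read and write verification described above can be sketched in plain C as follows. This is a minimal, portable sketch rather than the kernel's code: every `demo_` name is hypothetical, the CRC is computed bit by bit instead of with hardware acceleration, and treating the stored CRC field as zero while checksumming and storing the result in little-endian byte order are assumptions made for the example.

```c
#include <stddef.h>
#include <stdint.h>

/* Feed bytes into a running CRC32c (Castagnoli, reflected poly 0x82F63B78). */
static uint32_t demo_crc32c_update(uint32_t crc, const uint8_t *p, size_t len)
{
    while (len--) {
        crc ^= *p++;
        for (int bit = 0; bit < 8; bit++)
            crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
    }
    return crc;
}

/* Standard CRC32c of a buffer: initial value ~0, final inversion. */
static uint32_t demo_crc32c(const uint8_t *p, size_t len)
{
    return ~demo_crc32c_update(0xFFFFFFFFu, p, len);
}

/* Checksum a block as if its 4-byte CRC field at crc_off held zeroes. */
static uint32_t demo_block_cksum(const uint8_t *blk, size_t len, size_t crc_off)
{
    static const uint8_t zero[4] = { 0, 0, 0, 0 };
    uint32_t crc = 0xFFFFFFFFu;

    crc = demo_crc32c_update(crc, blk, crc_off);
    crc = demo_crc32c_update(crc, zero, sizeof(zero));
    crc = demo_crc32c_update(crc, blk + crc_off + 4, len - crc_off - 4);
    return ~crc;
}

/* Write path: compute the checksum last and store it in the block. */
static void demo_update_cksum(uint8_t *blk, size_t len, size_t crc_off)
{
    uint32_t crc = demo_block_cksum(blk, len, crc_off);

    for (int i = 0; i < 4; i++)
        blk[crc_off + i] = (uint8_t)(crc >> (8 * i));
}

/* Read path: recompute the checksum and compare with the stored value. */
static int demo_verify_cksum(const uint8_t *blk, size_t len, size_t crc_off)
{
    uint32_t stored = 0;

    for (int i = 0; i < 4; i++)
        stored |= (uint32_t)blk[crc_off + i] << (8 * i);
    return stored == demo_block_cksum(blk, len, crc_off);
}
```

Because the CRC field is excluded from its own calculation, the write path can stamp the checksum in place as the very last step, and the read path can verify the block without copying it.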
+
+Structures
+~~~~~~~~~~
+
+A typical on-disk structure needs to contain the following information:
+
+.. code:: c
+
+    struct xfs_ondisk_hdr {
+        __be32  magic;      /* magic number */
+        __be32  crc;        /* CRC, not logged */
+        uuid_t  uuid;       /* filesystem identifier */
+        __be64  owner;      /* parent object */
+        __be64  blkno;      /* location on disk */
+        __be64  lsn;        /* last modification in log, not logged */
+    };
+
+Depending on the metadata, this information may be part of a header structure
+separate to the metadata contents, or may be distributed through an existing
+structure. The latter occurs with metadata that already contains some of this
+information, such as the superblock and AG headers.
+
+Other metadata may have different formats for the information, but the same
+level of information is generally provided. For example:
+
+- short btree blocks have a 32 bit owner (ag number) and a 32 bit block
+  number for location. The two of these combined provide the same information
+  as @owner and @blkno in the above structure, but using 8 bytes less space on
+  disk.
+
+- directory/attribute node blocks have a 16 bit magic number, and the header
+  that contains the magic number has other information in it as well. Hence
+  the additional metadata headers change the overall format of the metadata.
+
+A typical buffer read verifier is structured as follows:
+
+..
code:: c
+
+    #define XFS_FOO_CRC_OFF offsetof(struct xfs_ondisk_hdr, crc)
+
+    static void
+    xfs_foo_read_verify(
+        struct xfs_buf *bp)
+    {
+        struct xfs_mount *mp = bp->b_target->bt_mount;
+
+        if ((xfs_sb_version_hascrc(&mp->m_sb) &&
+             !xfs_verify_cksum(bp->b_addr, BBTOB(bp->b_length),
+                               XFS_FOO_CRC_OFF)) ||
+            !xfs_foo_verify(bp)) {
+            XFS_CORRUPTION_ERROR(__func__, XFS_ERRLEVEL_LOW, mp, bp->b_addr);
+            xfs_buf_ioerror(bp, EFSCORRUPTED);
+        }
+    }
+
+The code ensures that the CRC is only checked if the filesystem has CRCs
+enabled by checking the superblock for the feature bit, and then if the CRC
+verifies OK (or is not needed) it verifies the actual contents of the block.
+
+The verifier function will take a couple of different forms, depending on
+whether the magic number can be used to determine the format of the block. In
+the case it can’t, the code is structured as follows:
+
+.. code:: c
+
+    static bool
+    xfs_foo_verify(
+        struct xfs_buf *bp)
+    {
+        struct xfs_mount *mp = bp->b_target->bt_mount;
+        struct xfs_ondisk_hdr *hdr = bp->b_addr;
+
+        if (hdr->magic != cpu_to_be32(XFS_FOO_MAGIC))
+            return false;
+
+        if (xfs_sb_version_hascrc(&mp->m_sb)) {
+            if (!uuid_equal(&hdr->uuid, &mp->m_sb.sb_uuid))
+                return false;
+            if (bp->b_bn != be64_to_cpu(hdr->blkno))
+                return false;
+            if (hdr->owner == 0)
+                return false;
+        }
+
+        /* object specific verification checks here */
+
+        return true;
+    }
+
+If there are different magic numbers for the different formats, the verifier
+will look like:
+
+..
code:: c
+
+    static bool
+    xfs_foo_verify(
+        struct xfs_buf *bp)
+    {
+        struct xfs_mount *mp = bp->b_target->bt_mount;
+        struct xfs_ondisk_hdr *hdr = bp->b_addr;
+
+        if (hdr->magic == cpu_to_be32(XFS_FOO_CRC_MAGIC)) {
+            if (!uuid_equal(&hdr->uuid, &mp->m_sb.sb_uuid))
+                return false;
+            if (bp->b_bn != be64_to_cpu(hdr->blkno))
+                return false;
+            if (hdr->owner == 0)
+                return false;
+        } else if (hdr->magic != cpu_to_be32(XFS_FOO_MAGIC))
+            return false;
+
+        /* object specific verification checks here */
+
+        return true;
+    }
+
+Write verifiers are very similar to the read verifiers; they just do things in
+the opposite order to the read verifiers. A typical write verifier:
+
+.. code:: c
+
+    static void
+    xfs_foo_write_verify(
+        struct xfs_buf *bp)
+    {
+        struct xfs_mount *mp = bp->b_target->bt_mount;
+        struct xfs_buf_log_item *bip = bp->b_fspriv;
+
+        if (!xfs_foo_verify(bp)) {
+            XFS_CORRUPTION_ERROR(__func__, XFS_ERRLEVEL_LOW, mp, bp->b_addr);
+            xfs_buf_ioerror(bp, EFSCORRUPTED);
+            return;
+        }
+
+        if (!xfs_sb_version_hascrc(&mp->m_sb))
+            return;
+
+
+        if (bip) {
+            struct xfs_ondisk_hdr *hdr = bp->b_addr;
+            hdr->lsn = cpu_to_be64(bip->bli_item.li_lsn);
+        }
+        xfs_update_cksum(bp->b_addr, BBTOB(bp->b_length), XFS_FOO_CRC_OFF);
+    }
+
+This will verify the internal structure of the metadata before we go any
+further, detecting corruptions that have occurred as the metadata has been
+modified in memory. If the metadata verifies OK, and CRCs are enabled, we then
+update the LSN field (when it was last modified) and calculate the CRC on the
+metadata. Once this is done, we can issue the IO.
+
+Inodes and Dquots
+~~~~~~~~~~~~~~~~~
+
+Inodes and dquots are special snowflakes. They have per-object CRC and
+self-identifiers, but they are packed so that there are multiple objects per
+buffer. Hence we do not use per-buffer verifiers to do the work of per-object
+verification and CRC calculations.
The per-buffer verifiers simply perform
+basic identification of the buffer - that they contain inodes or dquots, and
+that there are magic numbers in all the expected spots. All further CRC and
+verification checks are done when each inode is read from or written back to
+the buffer.
+
+The structure of the verifiers and the identifier checks is very similar to
+the buffer code described above. The only difference is where they are called.
+For example, inode read verification is done in xfs\_iread() when the inode is
+first read out of the buffer and the struct xfs\_inode is instantiated. The
+inode is already extensively verified during writeback in xfs\_iflush\_int, so
+the only addition here is to add the LSN and CRC to the inode as it is copied
+back into the buffer.
diff --git a/Documentation/filesystems/xfs-self-describing-metadata.txt b/Documentation/filesystems/xfs-self-describing-metadata.txt
deleted file mode 100644
index 05aa455163e3..000000000000
--- a/Documentation/filesystems/xfs-self-describing-metadata.txt
+++ /dev/null
@@ -1,350 +0,0 @@
-XFS Self Describing Metadata
------------------------------
-
-Introduction
-------------
-
-The largest scalability problem facing XFS is not one of algorithmic
-scalability, but of verification of the filesystem structure. Scalabilty of the
-structures and indexes on disk and the algorithms for iterating them are
-adequate for supporting PB scale filesystems with billions of inodes, however it
-is this very scalability that causes the verification problem.
-
-Almost all metadata on XFS is dynamically allocated. The only fixed location
-metadata is the allocation group headers (SB, AGF, AGFL and AGI), while all
-other metadata structures need to be discovered by walking the filesystem
-structure in different ways. While this is already done by userspace tools for
-validating and repairing the structure, there are limits to what they can
-verify, and this in turn limits the supportable size of an XFS filesystem.
-
-For example, it is entirely possible to manually use xfs_db and a bit of
-scripting to analyse the structure of a 100TB filesystem when trying to
-determine the root cause of a corruption problem, but it is still mainly a
-manual task of verifying that things like single bit errors or misplaced writes
-weren't the ultimate cause of a corruption event. It may take a few hours to a
-few days to perform such forensic analysis, so for at this scale root cause
-analysis is entirely possible.
-
-However, if we scale the filesystem up to 1PB, we now have 10x as much metadata
-to analyse and so that analysis blows out towards weeks/months of forensic work.
-Most of the analysis work is slow and tedious, so as the amount of analysis goes
-up, the more likely that the cause will be lost in the noise. Hence the primary
-concern for supporting PB scale filesystems is minimising the time and effort
-required for basic forensic analysis of the filesystem structure.
-
-
-Self Describing Metadata
-------------------------
-
-One of the problems with the current metadata format is that apart from the
-magic number in the metadata block, we have no other way of identifying what it
-is supposed to be. We can't even identify if it is the right place. Put simply,
-you can't look at a single metadata block in isolation and say "yes, it is
-supposed to be there and the contents are valid".
-
-Hence most of the time spent on forensic analysis is spent doing basic
-verification of metadata values, looking for values that are in range (and hence
-not detected by automated verification checks) but are not correct. Finding and
-understanding how things like cross linked block lists (e.g. sibling
-pointers in a btree end up with loops in them) are the key to understanding what
-went wrong, but it is impossible to tell what order the blocks were linked into
-each other or written to disk after the fact.
-
-Hence we need to record more information into the metadata to allow us to
-quickly determine if the metadata is intact and can be ignored for the purpose
-of analysis. We can't protect against every possible type of error, but we can
-ensure that common types of errors are easily detectable. Hence the concept of
-self describing metadata.
-
-The first, fundamental requirement of self describing metadata is that the
-metadata object contains some form of unique identifier in a well known
-location. This allows us to identify the expected contents of the block and
-hence parse and verify the metadata object. IF we can't independently identify
-the type of metadata in the object, then the metadata doesn't describe itself
-very well at all!
-
-Luckily, almost all XFS metadata has magic numbers embedded already - only the
-AGFL, remote symlinks and remote attribute blocks do not contain identifying
-magic numbers. Hence we can change the on-disk format of all these objects to
-add more identifying information and detect this simply by changing the magic
-numbers in the metadata objects. That is, if it has the current magic number,
-the metadata isn't self identifying. If it contains a new magic number, it is
-self identifying and we can do much more expansive automated verification of the
-metadata object at runtime, during forensic analysis or repair.
-
-As a primary concern, self describing metadata needs some form of overall
-integrity checking. We cannot trust the metadata if we cannot verify that it has
-not been changed as a result of external influences. Hence we need some form of
-integrity check, and this is done by adding CRC32c validation to the metadata
-block. If we can verify the block contains the metadata it was intended to
-contain, a large amount of the manual verification work can be skipped.
-
-CRC32c was selected as metadata cannot be more than 64k in length in XFS and
-hence a 32 bit CRC is more than sufficient to detect multi-bit errors in
-metadata blocks. CRC32c is also now hardware accelerated on common CPUs so it is
-fast. So while CRC32c is not the strongest of possible integrity checks that
-could be used, it is more than sufficient for our needs and has relatively
-little overhead. Adding support for larger integrity fields and/or algorithms
-does really provide any extra value over CRC32c, but it does add a lot of
-complexity and so there is no provision for changing the integrity checking
-mechanism.
-
-Self describing metadata needs to contain enough information so that the
-metadata block can be verified as being in the correct place without needing to
-look at any other metadata. This means it needs to contain location information.
-Just adding a block number to the metadata is not sufficient to protect against
-mis-directed writes - a write might be misdirected to the wrong LUN and so be
-written to the "correct block" of the wrong filesystem. Hence location
-information must contain a filesystem identifier as well as a block number.
-
-Another key information point in forensic analysis is knowing who the metadata
-block belongs to. We already know the type, the location, that it is valid
-and/or corrupted, and how long ago that it was last modified. Knowing the owner
-of the block is important as it allows us to find other related metadata to
-determine the scope of the corruption. For example, if we have a extent btree
-object, we don't know what inode it belongs to and hence have to walk the entire
-filesystem to find the owner of the block. Worse, the corruption could mean that
-no owner can be found (i.e. it's an orphan block), and so without an owner field
-in the metadata we have no idea of the scope of the corruption.
If we have an
-owner field in the metadata object, we can immediately do top down validation to
-determine the scope of the problem.
-
-Different types of metadata have different owner identifiers. For example,
-directory, attribute and extent tree blocks are all owned by an inode, whilst
-freespace btree blocks are owned by an allocation group. Hence the size and
-contents of the owner field are determined by the type of metadata object we are
-looking at. The owner information can also identify misplaced writes (e.g.
-freespace btree block written to the wrong AG).
-
-Self describing metadata also needs to contain some indication of when it was
-written to the filesystem. One of the key information points when doing forensic
-analysis is how recently the block was modified. Correlation of set of corrupted
-metadata blocks based on modification times is important as it can indicate
-whether the corruptions are related, whether there's been multiple corruption
-events that lead to the eventual failure, and even whether there are corruptions
-present that the run-time verification is not detecting.
-
-For example, we can determine whether a metadata object is supposed to be free
-space or still allocated if it is still referenced by its owner by looking at
-when the free space btree block that contains the block was last written
-compared to when the metadata object itself was last written. If the free space
-block is more recent than the object and the object's owner, then there is a
-very good chance that the block should have been removed from the owner.
-
-To provide this "written timestamp", each metadata block gets the Log Sequence
-Number (LSN) of the most recent transaction it was modified on written into it.
-This number will always increase over the life of the filesystem, and the only
-thing that resets it is running xfs_repair on the filesystem.
Further, by use of -the LSN we can tell if the corrupted metadata all belonged to the same log -checkpoint and hence have some idea of how much modification occurred between -the first and last instance of corrupt metadata on disk and, further, how much -modification occurred between the corruption being written and when it was -detected. - -Runtime Validation ------------------- - -Validation of self-describing metadata takes place at runtime in two places: - - - immediately after a successful read from disk - - immediately prior to write IO submission - -The verification is completely stateless - it is done independently of the -modification process, and seeks only to check that the metadata is what it says -it is and that the metadata fields are within bounds and internally consistent. -As such, we cannot catch all types of corruption that can occur within a block -as there may be certain limitations that operational state enforces of the -metadata, or there may be corruption of interblock relationships (e.g. corrupted -sibling pointer lists). Hence we still need stateful checking in the main code -body, but in general most of the per-field validation is handled by the -verifiers. - -For read verification, the caller needs to specify the expected type of metadata -that it should see, and the IO completion process verifies that the metadata -object matches what was expected. If the verification process fails, then it -marks the object being read as EFSCORRUPTED. The caller needs to catch this -error (same as for IO errors), and if it needs to take special action due to a -verification error it can do so by catching the EFSCORRUPTED error value. If we -need more discrimination of error type at higher levels, we can define new -error numbers for different errors as necessary. - -The first step in read verification is checking the magic number and determining -whether CRC validating is necessary. 
If it is, the CRC32c is calculated and -compared against the value stored in the object itself. Once this is validated, -further checks are made against the location information, followed by extensive -object specific metadata validation. If any of these checks fail, then the -buffer is considered corrupt and the EFSCORRUPTED error is set appropriately. - -Write verification is the opposite of the read verification - first the object -is extensively verified and if it is OK we then update the LSN from the last -modification made to the object, After this, we calculate the CRC and insert it -into the object. Once this is done the write IO is allowed to continue. If any -error occurs during this process, the buffer is again marked with a EFSCORRUPTED -error for the higher layers to catch. - -Structures ----------- - -A typical on-disk structure needs to contain the following information: - -struct xfs_ondisk_hdr { - __be32 magic; /* magic number */ - __be32 crc; /* CRC, not logged */ - uuid_t uuid; /* filesystem identifier */ - __be64 owner; /* parent object */ - __be64 blkno; /* location on disk */ - __be64 lsn; /* last modification in log, not logged */ -}; - -Depending on the metadata, this information may be part of a header structure -separate to the metadata contents, or may be distributed through an existing -structure. The latter occurs with metadata that already contains some of this -information, such as the superblock and AG headers. - -Other metadata may have different formats for the information, but the same -level of information is generally provided. For example: - - - short btree blocks have a 32 bit owner (ag number) and a 32 bit block - number for location. The two of these combined provide the same - information as @owner and @blkno in eh above structure, but using 8 - bytes less space on disk. - - - directory/attribute node blocks have a 16 bit magic number, and the - header that contains the magic number has other information in it as - well. 
hence the additional metadata headers change the overall format - of the metadata. - -A typical buffer read verifier is structured as follows: - -#define XFS_FOO_CRC_OFF offsetof(struct xfs_ondisk_hdr, crc) - -static void -xfs_foo_read_verify( - struct xfs_buf *bp) -{ - struct xfs_mount *mp = bp->b_target->bt_mount; - - if ((xfs_sb_version_hascrc(&mp->m_sb) && - !xfs_verify_cksum(bp->b_addr, BBTOB(bp->b_length), - XFS_FOO_CRC_OFF)) || - !xfs_foo_verify(bp)) { - XFS_CORRUPTION_ERROR(__func__, XFS_ERRLEVEL_LOW, mp, bp->b_addr); - xfs_buf_ioerror(bp, EFSCORRUPTED); - } -} - -The code ensures that the CRC is only checked if the filesystem has CRCs enabled -by checking the superblock of the feature bit, and then if the CRC verifies OK -(or is not needed) it verifies the actual contents of the block. - -The verifier function will take a couple of different forms, depending on -whether the magic number can be used to determine the format of the block. In -the case it can't, the code is structured as follows: - -static bool -xfs_foo_verify( - struct xfs_buf *bp) -{ - struct xfs_mount *mp = bp->b_target->bt_mount; - struct xfs_ondisk_hdr *hdr = bp->b_addr; - - if (hdr->magic != cpu_to_be32(XFS_FOO_MAGIC)) - return false; - - if (!xfs_sb_version_hascrc(&mp->m_sb)) { - if (!uuid_equal(&hdr->uuid, &mp->m_sb.sb_uuid)) - return false; - if (bp->b_bn != be64_to_cpu(hdr->blkno)) - return false; - if (hdr->owner == 0) - return false; - } - - /* object specific verification checks here */ - - return true; -} - -If there are different magic numbers for the different formats, the verifier -will look like: - -static bool -xfs_foo_verify( - struct xfs_buf *bp) -{ - struct xfs_mount *mp = bp->b_target->bt_mount; - struct xfs_ondisk_hdr *hdr = bp->b_addr; - - if (hdr->magic == cpu_to_be32(XFS_FOO_CRC_MAGIC)) { - if (!uuid_equal(&hdr->uuid, &mp->m_sb.sb_uuid)) - return false; - if (bp->b_bn != be64_to_cpu(hdr->blkno)) - return false; - if (hdr->owner == 0) - return false; - } else if 
(hdr->magic != cpu_to_be32(XFS_FOO_MAGIC)) - return false; - - /* object specific verification checks here */ - - return true; -} - -Write verifiers are very similar to the read verifiers, they just do things in -the opposite order to the read verifiers. A typical write verifier: - -static void -xfs_foo_write_verify( - struct xfs_buf *bp) -{ - struct xfs_mount *mp = bp->b_target->bt_mount; - struct xfs_buf_log_item *bip = bp->b_fspriv; - - if (!xfs_foo_verify(bp)) { - XFS_CORRUPTION_ERROR(__func__, XFS_ERRLEVEL_LOW, mp, bp->b_addr); - xfs_buf_ioerror(bp, EFSCORRUPTED); - return; - } - - if (!xfs_sb_version_hascrc(&mp->m_sb)) - return; - - - if (bip) { - struct xfs_ondisk_hdr *hdr = bp->b_addr; - hdr->lsn = cpu_to_be64(bip->bli_item.li_lsn); - } - xfs_update_cksum(bp->b_addr, BBTOB(bp->b_length), XFS_FOO_CRC_OFF); -} - -This will verify the internal structure of the metadata before we go any -further, detecting corruptions that have occurred as the metadata has been -modified in memory. If the metadata verifies OK, and CRCs are enabled, we then -update the LSN field (when it was last modified) and calculate the CRC on the -metadata. Once this is done, we can issue the IO. - -Inodes and Dquots ------------------ - -Inodes and dquots are special snowflakes. They have per-object CRC and -self-identifiers, but they are packed so that there are multiple objects per -buffer. Hence we do not use per-buffer verifiers to do the work of per-object -verification and CRC calculations. The per-buffer verifiers simply perform basic -identification of the buffer - that they contain inodes or dquots, and that -there are magic numbers in all the expected spots. All further CRC and -verification checks are done when each inode is read from or written back to the -buffer. - -The structure of the verifiers and the identifiers checks is very similar to the -buffer code described above. The only difference is where they are called. 
For -example, inode read verification is done in xfs_iread() when the inode is first -read out of the buffer and the struct xfs_inode is instantiated. The inode is -already extensively verified during writeback in xfs_iflush_int, so the only -addition here is to add the LSN and CRC to the inode as it is copied back into -the buffer. - -XXX: inode unlinked list modification doesn't recalculate the inode CRC! None of -the unlinked list modifications check or update CRCs, neither during unlink nor -log recovery. So, it's gone unnoticed until now. This won't matter immediately - -repair will probably complain about it - but it needs to be fixed. -