From patchwork Thu Oct  4 04:19:40 2018
X-Patchwork-Submitter: "Darrick J. Wong"
X-Patchwork-Id: 10625559
Subject: [PATCH 12/22] docs: add XFS reverse mapping structures to the DS&A book
From: "Darrick J. Wong"
To: darrick.wong@oracle.com
Cc: linux-xfs@vger.kernel.org, linux-doc@vger.kernel.org, corbet@lwn.net
Date: Wed, 03 Oct 2018 21:19:40 -0700
Message-ID: <153862678010.26427.10700488839888247014.stgit@magnolia>
In-Reply-To: <153862669110.26427.16504658853992750743.stgit@magnolia>
References: <153862669110.26427.16504658853992750743.stgit@magnolia>

From: Darrick J. Wong

Signed-off-by: Darrick J. Wong
---
 .../xfs-data-structures/allocation_groups.rst      |    2
 .../filesystems/xfs-data-structures/rmapbt.rst     |  336 ++++++++++++++++++++
 2 files changed, 338 insertions(+)
 create mode 100644 Documentation/filesystems/xfs-data-structures/rmapbt.rst

diff --git a/Documentation/filesystems/xfs-data-structures/allocation_groups.rst b/Documentation/filesystems/xfs-data-structures/allocation_groups.rst
index 30d169ab5cc5..6c0ffd3a170b 100644
--- a/Documentation/filesystems/xfs-data-structures/allocation_groups.rst
+++ b/Documentation/filesystems/xfs-data-structures/allocation_groups.rst
@@ -1379,3 +1379,5 @@ response times that come from metadata operations.

 None of the XFS per-AG B+trees are involved with real time files. It is not
 possible for real time files to share data blocks.
+
+.. include:: rmapbt.rst
diff --git a/Documentation/filesystems/xfs-data-structures/rmapbt.rst b/Documentation/filesystems/xfs-data-structures/rmapbt.rst
new file mode 100644
index 000000000000..eefcee5d4e95
--- /dev/null
+++ b/Documentation/filesystems/xfs-data-structures/rmapbt.rst
@@ -0,0 +1,336 @@
+.. SPDX-License-Identifier: CC-BY-SA-4.0
+
+Reverse-Mapping B+tree
+~~~~~~~~~~~~~~~~~~~~~~
+
+If the feature is enabled, each allocation group has its own reverse
+block-mapping B+tree, which grows in the free space like the free space
+B+trees. As mentioned in the chapter about
+`reconstruction <#metadata-reconstruction>`__, this data structure is another piece of
+the puzzle necessary to reconstruct the data or attribute fork of a file from
+reverse-mapping records; we can also use it to double-check allocations to
+ensure that we are not accidentally cross-linking blocks, which can cause
+severe damage to the filesystem.
+
+This B+tree is only present if the XFS\_SB\_FEAT\_RO\_COMPAT\_RMAPBT feature
+is enabled. The feature requires a version 5 filesystem.
+
+Each record in the reverse-mapping B+tree has the following structure:
+
+.. code:: c
+
+    struct xfs_rmap_rec {
+         __be32                     rm_startblock;
+         __be32                     rm_blockcount;
+         __be64                     rm_owner;
+         __be64                     rm_fork:1;
+         __be64                     rm_bmbt:1;
+         __be64                     rm_unwritten:1;
+         __be64                     rm_unused:7;
+         __be64                     rm_offset:54;
+    };
+
+**rm\_startblock**
+    AG block number of this record.
+
+**rm\_blockcount**
+    The length of this extent.
+
+**rm\_owner**
+    A 64-bit number describing the owner of this extent. This is typically the
+    absolute inode number, but can also correspond to one of the following:
+
+.. list-table::
+   :widths: 28 52
+   :header-rows: 1
+
+   * - Flag
+     - Description
+   * - XFS\_RMAP\_OWN\_NULL
+     - No owner. This should never appear on disk.
+
+   * - XFS\_RMAP\_OWN\_UNKNOWN
+     - Unknown owner; for EFI recovery. This should never appear on disk.
+
+   * - XFS\_RMAP\_OWN\_FS
+     - Allocation group headers.
+
+   * - XFS\_RMAP\_OWN\_LOG
+     - XFS log blocks.
+
+   * - XFS\_RMAP\_OWN\_AG
+     - Per-allocation group B+tree blocks. This means free space B+tree blocks,
+       blocks on the freelist, and reverse-mapping B+tree blocks.
+
+   * - XFS\_RMAP\_OWN\_INOBT
+     - Per-allocation group inode B+tree blocks. This includes free inode
+       B+tree blocks.
+
+   * - XFS\_RMAP\_OWN\_INODES
+     - Inode chunks.
+
+   * - XFS\_RMAP\_OWN\_REFC
+     - Per-allocation group refcount B+tree blocks. This will be used for
+       reflink support.
+
+   * - XFS\_RMAP\_OWN\_COW
+     - Blocks that have been reserved for a copy-on-write operation that has
+       not completed.
+
+Table: Special owner values
+
+**rm\_fork**
+    If rm\_owner describes an inode, this can be 1 if this record is for an
+    attribute fork.
+
+**rm\_bmbt**
+    If rm\_owner describes an inode, this can be 1 to signify that this record
+    is for a block map B+tree block. In this case, rm\_offset has no meaning.
+
+**rm\_unwritten**
+    A flag indicating that the extent is unwritten. This corresponds to the
+    flag in the `extent record <#data-extents>`__ format which means
+    XFS\_EXT\_UNWRITTEN.
+
+**rm\_offset**
+    The 54-bit logical file block offset, if rm\_owner describes an inode.
+    Meaningless otherwise.
+
+    **Note**
+
+    The single-bit flag values rm\_unwritten, rm\_fork, and rm\_bmbt are
+    packed into the larger fields in the C structure definition.
+
+The key has the following structure:
+
+.. code:: c
+
+    struct xfs_rmap_key {
+         __be32                     rm_startblock;
+         __be64                     rm_owner;
+         __be64                     rm_fork:1;
+         __be64                     rm_bmbt:1;
+         __be64                     rm_reserved:1;
+         __be64                     rm_unused:7;
+         __be64                     rm_offset:54;
+    };
+
+For the reverse-mapping B+tree on a filesystem that supports sharing of file
+data blocks, the key definition is larger than the usual AG block number. On a
+classic XFS filesystem, each block has only one owner, which means that
+rm\_startblock is sufficient to uniquely identify each record.
+However, shared block support (reflink) on XFS breaks that assumption; now
+filesystem blocks can be linked to any logical block offset of any file inode.
+Therefore, the key must include the owner and offset information to preserve
+the 1 to 1 relation between key and record.
+
+- As the reverse-mapping B+tree is AG relative, all the block numbers are only
+  32 bits.
+
+- The bb\_magic value is "RMB3" (0x524d4233).
+
+- The xfs\_btree\_sblock\_t header is used for intermediate B+tree nodes as
+  well as the leaves.
+
+- Each pointer is associated with two keys. The first of these is the "low
+  key", which is the key of the smallest record accessible through the
+  pointer. This low key has the same meaning as the key in all other btrees.
+  The second key is the high key, which is the maximum of the largest key
+  that can be used to access a given record underneath the pointer. Recall
+  that each record in the reverse-mapping B+tree describes an interval of
+  physical blocks mapped to an interval of logical file block offsets;
+  therefore, it makes sense that a range of keys can be used to find a
+  record.
+
+xfs\_db rmapbt Example
+^^^^^^^^^^^^^^^^^^^^^^
+
+This example shows a reverse-mapping B+tree from a freshly populated root
+filesystem:
+
+::
+
+    xfs_db> agf 0
+    xfs_db> addr rmaproot
+    xfs_db> p
+    magic = 0x524d4233
+    level = 1
+    numrecs = 43
+    leftsib = null
+    rightsib = null
+    bno = 56
+    lsn = 0x3000004c8
+    uuid = 1977221d-8345-464e-b1f4-aa2ea36895f4
+    owner = 0
+    crc = 0x7cf8be6f (correct)
+    keys[1-43] = [startblock,owner,offset]
+    keys[1-43] = [startblock,owner,offset,attrfork,bmbtblock,startblock_hi,owner_hi,
+                 offset_hi,attrfork_hi,bmbtblock_hi]
+            1:[0,-3,0,0,0,351,4418,66,0,0]
+            2:[417,285,0,0,0,827,4419,2,0,0]
+            3:[829,499,0,0,0,2352,573,55,0,0]
+            4:[1292,710,0,0,0,32168,262923,47,0,0]
+            5:[32215,-5,0,0,0,34655,2365,3411,0,0]
+            6:[34083,1161,0,0,0,34895,265220,1,0,1]
+            7:[34896,256191,0,0,0,36522,-9,0,0,0]
+            ...
+            41:[50998,326734,0,0,0,51430,-5,0,0,0]
+            42:[51431,327010,0,0,0,51600,325722,11,0,0]
+            43:[51611,327112,0,0,0,94063,23522,28375272,0,0]
+    ptrs[1-43] = 1:5 2:6 3:8 4:9 5:10 6:11 7:418 ... 41:46377 42:48784 43:49522
+
+We arbitrarily pick pointer 17 to traverse downwards:
+
+::
+
+    xfs_db> addr ptrs[17]
+    xfs_db> p
+    magic = 0x524d4233
+    level = 0
+    numrecs = 168
+    leftsib = 36284
+    rightsib = 37617
+    bno = 294760
+    lsn = 0x200002761
+    uuid = 1977221d-8345-464e-b1f4-aa2ea36895f4
+    owner = 0
+    crc = 0x2dad3fbe (correct)
+    recs[1-168] = [startblock,blockcount,owner,offset,extentflag,attrfork,bmbtblock]
+            1:[40326,1,259615,0,0,0,0] 2:[40327,1,-5,0,0,0,0]
+            3:[40328,2,259618,0,0,0,0] 4:[40330,1,259619,0,0,0,0]
+            ...
+            127:[40540,1,324266,0,0,0,0] 128:[40541,1,324266,8388608,0,0,0]
+            129:[40542,2,324266,1,0,0,0] 130:[40544,32,-7,0,0,0,0]
+
+Several interesting things pop out here. The first record shows that inode
+259,615 has mapped AG block 40,326 at offset 0. We confirm this by looking at
+the block map for that inode:
+
+::
+
+    xfs_db> inode 259615
+    xfs_db> bmap
+    data offset 0 startblock 40326 (0/40326) count 1 flag 0
+
+Next, notice records 127 and 128, which describe neighboring AG blocks that
+are mapped to non-contiguous logical blocks in inode 324,266.
+Given the logical offset of 8,388,608 we surmise that this is a leaf directory,
+but let us confirm:
+
+::
+
+    xfs_db> inode 324266
+    xfs_db> p core.mode
+    core.mode = 040755
+    xfs_db> bmap
+    data offset 0 startblock 40540 (0/40540) count 1 flag 0
+    data offset 1 startblock 40542 (0/40542) count 2 flag 0
+    data offset 3 startblock 40576 (0/40576) count 1 flag 0
+    data offset 8388608 startblock 40541 (0/40541) count 1 flag 0
+    xfs_db> p core.mode
+    core.mode = 0100644
+    xfs_db> dblock 0
+    xfs_db> p dhdr.hdr.magic
+    dhdr.hdr.magic = 0x58444433
+    xfs_db> dblock 8388608
+    xfs_db> p lhdr.info.hdr.magic
+    lhdr.info.hdr.magic = 0x3df1
+
+Indeed, this inode 324,266 appears to be a leaf directory, as it has regular
+directory data blocks at low offsets, and a single leaf block.
+
+Notice further the two reverse-mapping records with negative owners. An owner
+of -7 corresponds to XFS\_RMAP\_OWN\_INODES, which is an inode chunk, and an
+owner code of -5 corresponds to XFS\_RMAP\_OWN\_AG, which covers free space
+B+tree blocks, blocks on the freelist, and reverse-mapping B+tree blocks.
+Let’s see if block 40,544 is part of an inode chunk:
+
+::
+
+    xfs_db> blockget
+    xfs_db> fsblock 40544
+    xfs_db> blockuse
+    block 40544 (0/40544) type inode
+    xfs_db> stack
+    1:
+            byte offset 166068224, length 4096
+            buffer block 324352 (fsbno 40544), 8 bbs
+            inode 324266, dir inode 324266, type data
+    xfs_db> type inode
+    xfs_db> p
+    core.magic = 0x494e
+
+Our suspicions are confirmed. Let’s also see if block 40,327 belongs to one of
+those per-AG structures:
+
+::
+
+    xfs_db> fsblock 40327
+    xfs_db> blockuse
+    block 40327 (0/40327) type btrmap
+    xfs_db> type rmapbt
+    xfs_db> p
+    magic = 0x524d4233
+
+As you can see, the reverse block-mapping B+tree is an important secondary
+metadata structure, which can be used to reconstruct damaged primary metadata.
+Now let’s look at an extend rmap btree:
+
+::
+
+    xfs_db> agf 0
+    xfs_db> addr rmaproot
+    xfs_db> p
+    magic = 0x34524d42
+    level = 1
+    numrecs = 5
+    leftsib = null
+    rightsib = null
+    bno = 6368
+    lsn = 0x100000d1b
+    uuid = 400f0928-6b88-4c37-af1e-cef1f8911f3f
+    owner = 0
+    crc = 0x8d4ace05 (correct)
+    keys[1-5] = [startblock,owner,offset,attrfork,bmbtblock,startblock_hi,owner_hi,offset_hi,attrfork_hi,bmbtblock_hi]
+    1:[0,-3,0,0,0,705,132,681,0,0]
+    2:[24,5761,0,0,0,548,5761,524,0,0]
+    3:[24,5929,0,0,0,380,5929,356,0,0]
+    4:[24,6097,0,0,0,212,6097,188,0,0]
+    5:[24,6277,0,0,0,807,-7,0,0,0]
+    ptrs[1-5] = 1:5 2:771 3:9 4:10 5:11
+
+The second pointer stores both the low key [24,5761,0,0,0] and the high key
+[548,5761,524,0,0], which means that we can expect block 771 to contain
+records starting at physical block 24, inode 5761, offset zero; and that one
+of the records can be used to find a reverse mapping for physical block 548,
+inode 5761, and offset 524:
+
+::
+
+    xfs_db> addr ptrs[2]
+    xfs_db> p
+    magic = 0x34524d42
+    level = 0
+    numrecs = 168
+    leftsib = 5
+    rightsib = 9
+    bno = 6168
+    lsn = 0x100000d1b
+    uuid = 400f0928-6b88-4c37-af1e-cef1f8911f3f
+    owner = 0
+    crc = 0xd58eff0e (correct)
+    recs[1-168] = [startblock,blockcount,owner,offset,extentflag,attrfork,bmbtblock]
+    1:[24,525,5761,0,0,0,0]
+    2:[24,524,5762,0,0,0,0]
+    3:[24,523,5763,0,0,0,0]
+    ...
+    166:[24,360,5926,0,0,0,0]
+    167:[24,359,5927,0,0,0,0]
+    168:[24,358,5928,0,0,0,0]
+
+Observe that the first record in the block starts at physical block 24, inode
+5761, offset zero, just as we expected. Note that this first record is also
+indexed by the highest key as provided in the node block; physical block 548,
+inode 5761, offset 524 is the very last block mapped by this record.
+Furthermore, note that record 168, despite being the last record in this
+block, has a lower maximum key (physical block 381, inode 5928, offset 357)
+than the first record.
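
The relationship between a record and its high key in the last example can be sketched in a few lines of C. This is only an illustration under stated assumptions: the struct and function names (rmap_rec_view, rmap_key_view, rmap_high_key) are hypothetical, the fields are host-endian values in the order xfs_db prints them rather than the packed big-endian on-disk layout, owners are treated as signed so that the special owner codes appear negative as in the xfs_db output, and blockcount is assumed to be at least 1. It is not the kernel's implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical host-endian view of one rmap record, using the field order
 * that xfs_db prints. This is an illustration, not the on-disk layout.
 */
struct rmap_rec_view {
	uint32_t startblock;	/* first AG block of the extent */
	uint32_t blockcount;	/* extent length in blocks, assumed >= 1 */
	int64_t  owner;		/* inode number, or a negative special owner */
	uint64_t offset;	/* logical file block offset */
	bool     bmbt;		/* record describes a block map B+tree block */
};

/* Hypothetical view of one half (low or high) of a node key. */
struct rmap_key_view {
	uint32_t startblock;
	int64_t  owner;
	uint64_t offset;
};

/*
 * High key of a record: the last physical block it covers and, for inode
 * owners, the last logical offset it maps. For special (negative) owners
 * and bmbt blocks the offset has no meaning, so it is copied unchanged.
 */
struct rmap_key_view rmap_high_key(const struct rmap_rec_view *rec)
{
	struct rmap_key_view key = {
		.startblock = rec->startblock,
		.owner = rec->owner,
		.offset = rec->offset,
	};
	uint32_t adj = rec->blockcount - 1;

	key.startblock += adj;
	if (rec->owner >= 0 && !rec->bmbt)
		key.offset += adj;
	return key;
}
```

Applied to record 1 of the leaf block above, [24,525,5761,0,0,0,0], the sketch yields physical block 548, inode 5761, offset 524, which matches the high key stored for pointer 2 in the node block.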