[17/22] docs: add XFS inode format to the DS&A book

Message ID 153862681216.26427.625170795563446401.stgit@magnolia
State Not Applicable
Series
  • xfs-4.20: major documentation surgery

Commit Message

Darrick J. Wong Oct. 4, 2018, 4:20 a.m. UTC
From: Darrick J. Wong <darrick.wong@oracle.com>

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 .../filesystems/xfs-data-structures/dynamic.rst    |    2 
 .../xfs-data-structures/ondisk_inode.rst           |  558 ++++++++++++++++++++
 2 files changed, 560 insertions(+)
 create mode 100644 Documentation/filesystems/xfs-data-structures/ondisk_inode.rst

Patch

diff --git a/Documentation/filesystems/xfs-data-structures/dynamic.rst b/Documentation/filesystems/xfs-data-structures/dynamic.rst
index 895c94e95889..945b07be2034 100644
--- a/Documentation/filesystems/xfs-data-structures/dynamic.rst
+++ b/Documentation/filesystems/xfs-data-structures/dynamic.rst
@@ -2,3 +2,5 @@ 
 
 Dynamic Allocated Structures
 ============================
+
+.. include:: ondisk_inode.rst
diff --git a/Documentation/filesystems/xfs-data-structures/ondisk_inode.rst b/Documentation/filesystems/xfs-data-structures/ondisk_inode.rst
new file mode 100644
index 000000000000..77ecd2917489
--- /dev/null
+++ b/Documentation/filesystems/xfs-data-structures/ondisk_inode.rst
@@ -0,0 +1,558 @@ 
+.. SPDX-License-Identifier: CC-BY-SA-4.0
+
+On-Disk Inode
+-------------
+
+All files, directories, and links are stored on disk with inodes and descend
+from the root inode, whose number is defined in the
+`superblock <#superblocks>`__. The previous section on `AG Inode
+Management <#ag-inode-management>`__ describes the allocation and management
+of inodes on disk. This section describes the contents of inodes themselves.
+
+An inode is divided into 3 parts:
+
+.. figure:: images/23.png
+   :alt: On-disk inode sections
+
+   On-disk inode sections
+
+-  The core contains what the inode represents, stat data, and information
+   describing the data and attribute forks.
+
+-  The di\_u "data fork" contains normal data related to the inode. Its
+   contents depend on the file type specified by di\_core.di\_mode (e.g.
+   regular file, directory, link, etc.) and on how much information the file
+   contains, which is indicated by di\_core.di\_format. The union that
+   represents this data is declared as follows:
+
+.. code:: c
+
+    union {
+         xfs_bmdr_block_t di_bmbt;
+         xfs_bmbt_rec_t   di_bmx[1];
+         xfs_dir2_sf_t    di_dir2sf;
+         char             di_c[1];
+         xfs_dev_t        di_dev;
+         uuid_t           di_muuid;
+         char             di_symlink[1];
+    } di_u;
+
+-  The di\_a "attribute fork" contains extended attributes. Its layout is
+   determined by the di\_core.di\_aformat value. Its representation is
+   declared as follows:
+
+.. code:: c
+
+    union {
+         xfs_bmdr_block_t     di_abmbt;
+         xfs_bmbt_rec_t       di_abmx[1];
+         xfs_attr_shortform_t di_attrsf;
+    } di_a;
+
+-   The above two unions are rarely used in the XFS code; instead, the
+    structures within them are cast directly depending on the
+    di\_mode/di\_format and di\_aformat values. They are referenced in this
+    document to make it easier to explain the various structures in use
+    within the inode.
+
+The remaining space in the inode after di\_next\_unlinked where the two forks
+are located is called the inode’s "literal area". This starts at offset
+100 (0x64) in a version 1 or 2 inode, and offset 176 (0xb0) in a version 3
+inode.
+
+The space for each of the two forks in the literal area is determined by the
+inode size, and di\_core.di\_forkoff. The data fork is located between the
+start of the literal area and di\_forkoff. The attribute fork is located
+between di\_forkoff and the end of the inode.
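+
+The fork layout described above can be sketched in a few lines of C. This is
+only an illustration under stated assumptions: the helper names are
+hypothetical and are not the kernel's macros, and the only inputs are the
+inode size, the inode version, and di\_forkoff as described in the text. A
+di\_forkoff of zero is treated as "no attribute fork", so the data fork may
+use the whole literal area.
+
```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helpers; only the arithmetic comes from the text above. */

/* The literal area starts at 100 for v1/v2 inodes and 176 for v3 inodes. */
static unsigned int literal_area_offset(int di_version)
{
    return di_version >= 3 ? 176u : 100u;
}

/* Bytes shared by the data and attribute forks. */
static unsigned int literal_area_size(unsigned int inode_size, int di_version)
{
    return inode_size - literal_area_offset(di_version);
}

/*
 * The data fork runs from the start of the literal area to di_forkoff * 8.
 * With di_forkoff == 0 there is no attribute fork, so the data fork can use
 * the entire literal area.
 */
static unsigned int data_fork_size(unsigned int inode_size, int di_version,
                                   uint8_t di_forkoff)
{
    if (di_forkoff == 0)
        return literal_area_size(inode_size, di_version);
    return (unsigned int)di_forkoff * 8u;
}

/* The attribute fork runs from di_forkoff * 8 to the end of the inode. */
static unsigned int attr_fork_size(unsigned int inode_size, int di_version,
                                   uint8_t di_forkoff)
{
    if (di_forkoff == 0)
        return 0;
    return literal_area_size(inode_size, di_version) -
           (unsigned int)di_forkoff * 8u;
}
```
+
+For example, under these assumptions a 512-byte v3 inode with di\_forkoff set
+to 24 has a 336-byte literal area, of which 192 bytes belong to the data fork
+and 144 bytes to the attribute fork.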
+
+Inode Core
+~~~~~~~~~~
+
+The inode’s core is 96 bytes on a V4 filesystem and 176 bytes on a V5
+filesystem. It contains information about the file itself, including most stat
+data, as well as information about the data and attribute forks that follow
+the core within the inode. It uses the following structure:
+
+.. code:: c
+
+    struct xfs_dinode_core {
+         __uint16_t                di_magic;
+         __uint16_t                di_mode;
+         __int8_t                  di_version;
+         __int8_t                  di_format;
+         __uint16_t                di_onlink;
+         __uint32_t                di_uid;
+         __uint32_t                di_gid;
+         __uint32_t                di_nlink;
+         __uint16_t                di_projid;
+         __uint16_t                di_projid_hi;
+         __uint8_t                 di_pad[6];
+         __uint16_t                di_flushiter;
+         xfs_timestamp_t           di_atime;
+         xfs_timestamp_t           di_mtime;
+         xfs_timestamp_t           di_ctime;
+         xfs_fsize_t               di_size;
+         xfs_rfsblock_t            di_nblocks;
+         xfs_extlen_t              di_extsize;
+         xfs_extnum_t              di_nextents;
+         xfs_aextnum_t             di_anextents;
+         __uint8_t                 di_forkoff;
+         __int8_t                  di_aformat;
+         __uint32_t                di_dmevmask;
+         __uint16_t                di_dmstate;
+         __uint16_t                di_flags;
+         __uint32_t                di_gen;
+
+         /* di_next_unlinked is the only non-core field in the old dinode */
+         __be32                    di_next_unlinked;
+
+         /* version 5 filesystem (inode version 3) fields start here */
+         __le32                    di_crc;
+         __be64                    di_changecount;
+         __be64                    di_lsn;
+         __be64                    di_flags2;
+         __be32                    di_cowextsize;
+         __u8                      di_pad2[12];
+         xfs_timestamp_t           di_crtime;
+         __be64                    di_ino;
+         uuid_t                    di_uuid;
+
+    };
+
+**di\_magic**
+    The inode signature; these two bytes are "IN" (0x494e).
+
+**di\_mode**
+    Specifies the mode access bits and type of file using the standard S\_Ixxx
+    values defined in stat.h.
+
+**di\_version**
+    Specifies the inode version which currently can only be 1, 2, or 3. The
+    inode version specifies the usage of the di\_onlink, di\_nlink and
+    di\_projid values in the inode core. Initially, inodes are created as v1
+    but can be converted on the fly to v2 when required. v3 inodes are created
+    only for v5 filesystems.
+
+**di\_format**
+    Specifies the format of the data fork in conjunction with the di\_mode
+    type. This can be one of several values. For directories and links, it can
+    be "local", where all metadata associated with the file is within the
+    inode; "extents", where the inode contains an array of extents to other
+    filesystem blocks which contain the associated metadata or data; or
+    "btree", where the inode contains a B+tree root node which points to
+    filesystem blocks containing the metadata or data. Migration between the
+    formats depends on the amount of metadata associated with the inode.
+    "dev" is used for character and block devices, while "uuid" is currently
+    not used. "rmap" indicates that a reverse-mapping B+tree is rooted in the
+    fork.
+
+.. code:: c
+
+    typedef enum xfs_dinode_fmt {
+         XFS_DINODE_FMT_DEV,
+         XFS_DINODE_FMT_LOCAL,
+         XFS_DINODE_FMT_EXTENTS,
+         XFS_DINODE_FMT_BTREE,
+         XFS_DINODE_FMT_UUID,
+         XFS_DINODE_FMT_RMAP,
+    } xfs_dinode_fmt_t;
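+
+As a rough sketch of how a reader of on-disk inodes might interpret these
+values, the helpers below map a raw format value to the names used in the
+text and check whether a value is acceptable for di\_aformat (which, as
+described later, is restricted to "local", "extents" and "btree"). The
+function names are hypothetical; the numeric values rely on the usual C rule
+that the first enumerator is zero.
+
```c
#include <assert.h>
#include <string.h>

typedef enum xfs_dinode_fmt {
    XFS_DINODE_FMT_DEV,      /* 0 */
    XFS_DINODE_FMT_LOCAL,    /* 1 */
    XFS_DINODE_FMT_EXTENTS,  /* 2 */
    XFS_DINODE_FMT_BTREE,    /* 3 */
    XFS_DINODE_FMT_UUID,     /* 4 */
    XFS_DINODE_FMT_RMAP,     /* 5 */
} xfs_dinode_fmt_t;

/* Hypothetical helper: name a di_format or di_aformat value. */
static const char *dinode_fmt_name(int fmt)
{
    switch (fmt) {
    case XFS_DINODE_FMT_DEV:     return "dev";
    case XFS_DINODE_FMT_LOCAL:   return "local";
    case XFS_DINODE_FMT_EXTENTS: return "extents";
    case XFS_DINODE_FMT_BTREE:   return "btree";
    case XFS_DINODE_FMT_UUID:    return "uuid";
    case XFS_DINODE_FMT_RMAP:    return "rmap";
    default:                     return "unknown";
    }
}

/* Hypothetical helper: the attribute fork only uses three of the formats. */
static int aformat_is_valid(int fmt)
{
    return fmt == XFS_DINODE_FMT_LOCAL ||
           fmt == XFS_DINODE_FMT_EXTENTS ||
           fmt == XFS_DINODE_FMT_BTREE;
}
```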
+
+**di\_onlink**
+    In v1 inodes, this specifies the number of links to the inode from
+    directories. When the number exceeds 65535, the inode is converted to v2
+    and the link count is stored in di\_nlink.
+
+**di\_uid**
+    Specifies the owner’s UID of the inode.
+
+**di\_gid**
+    Specifies the owner’s GID of the inode.
+
+**di\_nlink**
+    Specifies the number of links to the inode from directories. This is
+    maintained for both inode versions for current versions of XFS. Prior to
+    v2 inodes, this field was part of di\_pad.
+
+**di\_projid**
+    Specifies the owner’s project ID in v2 inodes. An inode is converted to v2
+    if the project ID is set. This value must be zero for v1 inodes.
+
+**di\_projid\_hi**
+    Specifies the high 16 bits of the owner’s project ID in v2 inodes, if the
+    XFS\_SB\_VERSION2\_PROJID32BIT feature is set; and zero otherwise.
+
+**di\_pad[6]**
+    Reserved, must be zero.
+
+**di\_flushiter**
+    Incremented on flush.
+
+**di\_atime**
+    Specifies the last access time of the file using UNIX time conventions
+    and the following structure. This value may be undefined if the
+    filesystem is mounted with the "noatime" option. XFS supports timestamps
+    with nanosecond resolution:
+
+.. code:: c
+
+    struct xfs_timestamp {
+         __int32_t                 t_sec;
+         __int32_t                 t_nsec;
+    };
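+
+A minimal sketch of decoding such a timestamp from a raw inode buffer
+follows. It assumes the two fields are stored big-endian on disk, like the
+\_\_be32 and \_\_be64 fields shown in the core structure; the type and
+function names are hypothetical.
+
```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical host-endian view of an on-disk timestamp. */
struct decoded_timestamp {
    int32_t sec;   /* seconds since the UNIX epoch */
    int32_t nsec;  /* nanoseconds within that second */
};

/* Read a 32-bit big-endian value from an unaligned byte buffer. */
static uint32_t get_be32(const unsigned char *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/*
 * Decode the 8 bytes of a timestamp: t_sec first, then t_nsec.
 * Assumption: both fields are big-endian on disk. The casts to int32_t rely
 * on the two's-complement conversion that common compilers provide.
 */
static struct decoded_timestamp decode_timestamp(const unsigned char *raw)
{
    struct decoded_timestamp ts;

    ts.sec = (int32_t)get_be32(raw);
    ts.nsec = (int32_t)get_be32(raw + 4);
    return ts;
}
```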
+
+**di\_mtime**
+    Specifies the last time the file was modified.
+
+**di\_ctime**
+    Specifies when the inode’s status was last changed.
+
+**di\_size**
+    Specifies the EOF of the inode in bytes. This can be larger or smaller
+    than the extent space (and therefore the actual disk space) used for the
+    inode. For regular files, this is the file size in bytes; for
+    directories, the space taken by directory entries; and for links, the
+    length of the symlink.
+
+**di\_nblocks**
+    Specifies the number of filesystem blocks used to store the inode’s data
+    including relevant metadata like B+trees. This does not include blocks
+    used for extended attributes.
+
+**di\_extsize**
+    Specifies the extent size for filesystems with real-time devices or an
+    extent size hint for standard filesystems. For normal filesystems, and
+    with directories, the XFS\_DIFLAG\_EXTSZINHERIT flag must be set in
+    di\_flags if this field is used. Inodes created in these directories will
+    inherit the di\_extsize value and have XFS\_DIFLAG\_EXTSIZE set in their
+    di\_flags. When a file is written to beyond allocated space, XFS will
+    attempt to allocate additional disk space based on this value.
+
+**di\_nextents**
+    Specifies the number of data extents associated with this inode.
+
+**di\_anextents**
+    Specifies the number of extended attribute extents associated with this
+    inode.
+
+**di\_forkoff**
+    Specifies the offset into the inode’s literal area where the extended
+    attribute fork starts. This is an 8-bit value that is multiplied by 8 to
+    determine the actual offset in bytes (ie. attribute data is 64-bit
+    aligned). This also limits the maximum size of the inode to 2048 bytes.
+    This value is initially zero until an extended attribute is created. When
+    an attribute is added, the nature of di\_forkoff depends on the
+    XFS\_SB\_VERSION2\_ATTR2BIT flag in the superblock. Refer to `Extended
+    Attribute Versions <#extended-attribute-versions>`__ for more details.
+
+**di\_aformat**
+    Specifies the format of the attribute fork. This uses the same values as
+    di\_format, but restricted to "local", "extents" and "btree"
+    formats for extended attribute data.
+
+**di\_dmevmask**
+    DMAPI event mask.
+
+**di\_dmstate**
+    DMAPI state.
+
+**di\_flags**
+    Specifies flags associated with the inode. This can be a combination of
+    the following values:
+
+.. list-table::
+   :widths: 28 52
+   :header-rows: 1
+
+   * - Flag
+     - Description
+
+   * - XFS_DIFLAG_REALTIME
+     - The inode's data is located on the real-time device.
+
+   * - XFS_DIFLAG_PREALLOC
+     - The inode's extents have been preallocated.
+
+   * - XFS_DIFLAG_NEWRTBM
+     - Specifies that ``sb_rbmino`` uses the new real-time bitmap format.
+
+   * - XFS_DIFLAG_IMMUTABLE
+     - Specifies the inode cannot be modified.
+
+   * - XFS_DIFLAG_APPEND
+     - The inode is in append only mode.
+
+   * - XFS_DIFLAG_SYNC
+     - The inode is written synchronously.
+
+   * - XFS_DIFLAG_NOATIME
+     - The inode's ``di_atime`` is not updated.
+
+   * - XFS_DIFLAG_NODUMP
+     - Specifies the inode is to be ignored by xfsdump.
+
+   * - XFS_DIFLAG_RTINHERIT
+     - For directory inodes, new inodes inherit the XFS_DIFLAG_REALTIME bit.
+
+   * - XFS_DIFLAG_PROJINHERIT
+     - For directory inodes, new inodes inherit the ``di_projid`` value.
+
+   * - XFS_DIFLAG_NOSYMLINKS
+     - For directory inodes, symlinks cannot be created.
+
+   * - XFS_DIFLAG_EXTSIZE
+     - Specifies the extent size for real-time files or an extent size hint for
+       regular files.
+
+   * - XFS_DIFLAG_EXTSZINHERIT
+     - For directory inodes, new inodes inherit the ``di_extsize`` value.
+
+   * - XFS_DIFLAG_NODEFRAG
+     - Specifies the inode is to be ignored when defragmenting the filesystem.
+
+   * - XFS_DIFLAG_FILESTREAMS
+     - Use the filestream allocator.  The filestreams allocator allows a
+       directory to reserve an entire allocation group for exclusive use by
+       files created in that directory.  Files in other directories cannot use
+       AGs reserved by other directories.
+
+Table: Version 2 Inode flags
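+
+The table gives the flag names but not their bit positions. The sketch below
+shows how di\_flags might be tested. The bit positions are an assumption
+taken from the kernel's xfs\_format.h and should be checked against that
+header; the helper names are hypothetical. The inheritance helper models only
+the rule described under di\_extsize: a file created in a directory with
+XFS\_DIFLAG\_EXTSZINHERIT set gets XFS\_DIFLAG\_EXTSIZE.
+
```c
#include <assert.h>
#include <stdint.h>

/* Assumed bit positions; verify against the kernel's xfs_format.h. */
#define XFS_DIFLAG_REALTIME     (1u << 0)
#define XFS_DIFLAG_PREALLOC     (1u << 1)
#define XFS_DIFLAG_IMMUTABLE    (1u << 3)
#define XFS_DIFLAG_APPEND       (1u << 4)
#define XFS_DIFLAG_EXTSIZE      (1u << 11)
#define XFS_DIFLAG_EXTSZINHERIT (1u << 12)

/* Hypothetical helper: is the given flag set in di_flags? */
static int has_diflag(uint16_t di_flags, unsigned int flag)
{
    return (di_flags & flag) != 0;
}

/*
 * Hypothetical helper: extent-size flags a new file would start with,
 * given the di_flags of the directory it is created in.
 */
static uint16_t child_extsize_flags(uint16_t parent_flags)
{
    uint16_t child = 0;

    if (has_diflag(parent_flags, XFS_DIFLAG_EXTSZINHERIT))
        child = (uint16_t)(child | XFS_DIFLAG_EXTSIZE);
    return child;
}
```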
+
+**di\_gen**
+    A generation number used for inode identification. This is used by tools
+    that do inode scanning such as backup tools and xfsdump. An inode’s
+    generation number can change by unlinking and creating a new file that
+    reuses the inode.
+
+**di\_next\_unlinked**
+    See the section on `unlinked inode pointers <#unlinked-pointer>`__ for
+    more information.
+
+**di\_crc**
+    Checksum of the inode.
+
+**di\_changecount**
+    Counts the number of changes made to the attributes in this inode.
+
+**di\_lsn**
+    Log sequence number of the last inode write.
+
+**di\_flags2**
+    Specifies extended flags associated with a v3 inode.
+
+.. list-table::
+   :widths: 28 52
+   :header-rows: 1
+
+   * - Flag
+     - Description
+
+   * - XFS\_DIFLAG2\_DAX
+     - For a file, enable DAX to increase performance on persistent-memory
+       storage. If set on a directory, files created in the directory will
+       inherit this flag.
+
+   * - XFS\_DIFLAG2\_REFLINK
+     - This inode shares (or has shared) data blocks with another inode.
+
+   * - XFS\_DIFLAG2\_COWEXTSIZE
+     - For files, this is the extent size hint for copy on write operations;
+       see di\_cowextsize for details. For directories, the value in
+       di\_cowextsize will be copied to all newly created files and
+       directories.
+
+Table: Version 3 Inode flags
+
+**di\_cowextsize**
+    Specifies the extent size hint for copy on write operations. When
+    allocating extents for a copy on write operation, the allocator will be
+    asked to align its allocations to either di\_cowextsize blocks or
+    di\_extsize blocks, whichever is greater. The XFS\_DIFLAG2\_COWEXTSIZE
+    flag must be set if this field is used. If this field and its flag are set
+    on a directory file, the value will be copied into any files or
+    directories created within this directory. During a block sharing
+    operation, this value will be copied from the source file to the
+    destination file if the sharing operation completely overwrites the
+    destination file’s contents and the destination file does not already have
+    di\_cowextsize set.
+
+**di\_pad2**
+    Padding for future expansion of the inode.
+
+**di\_crtime**
+    Specifies the time when this inode was created.
+
+**di\_ino**
+    The full inode number of this inode.
+
+**di\_uuid**
+    The UUID of this inode, which must match either sb\_uuid or sb\_meta\_uuid
+    depending on which features are set.
+
+Unlinked Pointer
+~~~~~~~~~~~~~~~~
+
+The di\_next\_unlinked value in the inode is used to track inodes that have
+been unlinked (deleted) but are still open by a program. When an inode is in
+this state, the inode is added to one of the `AGI’s <#ag-inode-management>`__
+agi\_unlinked hash buckets. The AGI unlinked bucket points to an inode and the
+di\_next\_unlinked value points to the next inode in the chain. The last inode
+in the chain has di\_next\_unlinked set to NULL (-1).
+
+Once the last reference is released, the inode is removed from the unlinked
+hash chain and di\_next\_unlinked is set to NULL. In the case of a system
+crash, XFS recovery will complete the unlink process for any inodes found in
+these lists.
+
+The only time the unlinked fields can be seen to be used on disk is either on
+an active filesystem or a crashed system. A cleanly unmounted or recovered
+filesystem will not have any inodes in these unlink hash chains.
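+
+The chain structure can be illustrated with a toy model. In the sketch below
+an array stands in for the di\_next\_unlinked fields of the inodes in one AG,
+indexed by a small AG-relative inode number, and the bucket head plays the
+role of one agi\_unlinked entry. The function name and the array model are
+hypothetical; NULLAGINO is used here for the NULL (-1) value mentioned above.
+
```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NULLAGINO 0xffffffffu  /* the "NULL (-1)" end-of-chain marker */

/*
 * Toy model: next_unlinked[i] holds di_next_unlinked for AG inode i.
 * Returns the number of inodes on the chain starting at bucket_head,
 * or -1 if the chain leaves the array or appears to loop.
 */
static int unlinked_chain_length(uint32_t bucket_head,
                                 const uint32_t *next_unlinked,
                                 size_t ninodes)
{
    size_t steps = 0;
    uint32_t agino = bucket_head;

    while (agino != NULLAGINO) {
        if (agino >= ninodes || steps >= ninodes)
            return -1;  /* out of range, or more hops than inodes */
        agino = next_unlinked[agino];
        steps++;
    }
    return (int)steps;
}
```
+
+An empty bucket (head set to NULL) yields a length of zero, which is the
+state expected on a cleanly unmounted or recovered filesystem.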
+
+.. figure:: images/28.png
+   :alt: Unlinked inode pointer
+
+   Unlinked inode pointer
+
+Data Fork
+~~~~~~~~~
+
+The structure of the inode’s data fork is based on the inode’s type and
+di\_format. The data fork begins at the start of the inode’s "literal area".
+This area starts at offset 100 (0x64), or offset 176 (0xb0) in a v3 inode. The
+size of the data fork is determined by the type and format. The maximum size is
+determined by the inode size and di_forkoff. In code, use the XFS_DFORK_PTR
+macro specifying XFS_DATA_FORK for the "which" parameter. Alternatively,
+the XFS\_DFORK\_DPTR macro can be used.
+
+Each of the following sub-sections summarises the contents of the data fork
+based on the inode type.
+
+Regular Files (S\_IFREG)
+^^^^^^^^^^^^^^^^^^^^^^^^
+
+The data fork specifies the file’s data extents. The extents specify where the
+file’s actual data is located within the filesystem. Extents can have one of
+two formats, as determined by the di\_format value:
+
+-  XFS\_DINODE\_FMT\_EXTENTS: The extent data is fully contained within the
+   inode which contains an array of extents to the filesystem blocks for the
+   file’s data. To access the extents, cast the return value from
+   XFS\_DFORK\_DPTR to xfs\_bmbt\_rec\_t\*.
+
+-  XFS\_DINODE\_FMT\_BTREE: The extent data is contained in the leaves of a
+   B+tree. The inode contains the root node of the tree and is accessed by
+   casting the return value from XFS\_DFORK\_DPTR to xfs\_bmdr\_block\_t\*.
+
+Details for each of these data extent formats are covered in the `Data
+Extents <#data-extents>`__ section later on.
+
+Directories (S\_IFDIR)
+^^^^^^^^^^^^^^^^^^^^^^
+
+The data fork contains the directory’s entries and associated data. The format
+of the entries is also determined by the di\_format value and can be one of 3
+formats:
+
+-  XFS\_DINODE\_FMT\_LOCAL: The directory entries are fully contained within
+   the inode. This is accessed by casting the value from XFS\_DFORK\_DPTR to
+   xfs\_dir2\_sf\_t\*.
+
+-  XFS\_DINODE\_FMT\_EXTENTS: The actual directory entries are located in
+   other filesystem blocks; the inode contains an array of extents to these
+   filesystem blocks (xfs\_bmbt\_rec\_t\*).
+
+-  XFS\_DINODE\_FMT\_BTREE: The directory entries are contained in the leaves
+   of a B+tree. The inode contains the root node (xfs\_bmdr\_block\_t\*).
+
+Details for each of these directory formats are covered in the
+`Directories <#directories>`__ section later on.
+
+Symbolic Links (S\_IFLNK)
+^^^^^^^^^^^^^^^^^^^^^^^^^
+
+The data fork contains the contents of the symbolic link. The format of the
+link is determined by the di\_format value and can be one of 2 formats:
+
+-  XFS\_DINODE\_FMT\_LOCAL: The symbolic link is fully contained within the
+   inode. This is accessed by casting the return value from XFS\_DFORK\_DPTR
+   to char\*.
+
+-  XFS\_DINODE\_FMT\_EXTENTS: The actual symlink is located in another
+   filesystem block; the inode contains the extents to these filesystem blocks
+   (xfs\_bmbt\_rec\_t\*).
+
+Details for symbolic links are covered in the section about `Symbolic
+Links <#symbolic-links>`__.
+
+Other File Types
+^^^^^^^^^^^^^^^^
+
+For character and block devices (S\_IFCHR and S\_IFBLK), cast the value from
+XFS\_DFORK\_DPTR to xfs\_dev\_t\*.
+
+Attribute Fork
+~~~~~~~~~~~~~~
+
+The attribute fork in the inode always contains the location of the extended
+attributes associated with the inode.
+
+The location of the attribute fork in the inode’s literal area is specified by
+the di\_forkoff value in the inode’s core. If this value is zero, the inode
+does not contain any extended attributes. If non-zero, the attribute fork’s
+byte offset into the literal area can be computed from di\_forkoff × 8.
+Attributes must be allocated on a 64-bit boundary on the disk. To access the
+extended attributes in code, use the XFS\_DFORK\_PTR macro specifying
+XFS\_ATTR\_FORK for the "which" parameter. Alternatively, the
+XFS\_DFORK\_APTR macro can be used.
+
+The structure of the attribute fork depends on the di\_aformat value in the
+inode. It can be one of the following values:
+
+-  XFS\_DINODE\_FMT\_LOCAL: The extended attributes are contained entirely
+   within the inode. This is accessed by casting the value from
+   XFS\_DFORK\_APTR to xfs\_attr\_shortform\_t\*.
+
+-  XFS\_DINODE\_FMT\_EXTENTS: The attributes are located in other filesystem
+   blocks; the inode contains an array of pointers to these filesystem blocks.
+   They are accessed by casting the value from XFS\_DFORK\_APTR to
+   xfs\_bmbt\_rec\_t\*.
+
+-  XFS\_DINODE\_FMT\_BTREE: The extents for the attributes are contained in
+   the leaves of a B+tree. The inode contains the root node of the tree and is
+   accessed by casting the value from XFS\_DFORK\_APTR to
+   xfs\_bmdr\_block\_t\*.
+
+Detailed information on the layouts of extended attributes is covered in the
+`Extended Attributes <#extended-attributes>`__ section of this document.
+
+Extended Attribute Versions
+^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Extended attributes come in two versions: "attr1" or "attr2". The
+attribute version is specified by the XFS\_SB\_VERSION2\_ATTR2BIT flag in the
+sb\_features2 field in the superblock. It determines how the inode’s extra
+space is split between the di\_u and di\_a forks, which in turn determines
+how the di\_forkoff value is maintained in the inode’s core.
+
+With "attr1" attributes, the di\_forkoff is set to somewhere in the middle
+of the space between the core and end of the inode and never changes (which
+has the effect of artificially limiting the space for data information). As
+the data fork grows, when it gets to di\_forkoff, it will move the data to the
+next format level (ie. local < extent < btree). If very little space is used
+for either attributes or data, then a good portion of the available inode
+space is wasted with this version.
+
+"attr2" was introduced to maximise the utilisation of the inode’s literal
+area. The di\_forkoff starts at the end of the inode and works its way to the
+data fork as attributes are added. Attr2 is highly recommended if extended
+attributes are used.
+
+The following diagram compares the two versions:
+
+.. figure:: images/30.png
+   :alt: Extended attribute layouts
+
+   Extended attribute layouts
+
+Note that because di\_forkoff is an 8-bit value measuring units of 8 bytes,
+the maximum size of an inode is 2\ :sup:`8` × 2\ :sup:`3` = 2\ :sup:`11` =
+2048 bytes.