[2/2] xfs: Extend xattr extent counter to 32-bits

Message ID	20200404085203.1908-3-chandanrlinux@gmail.com (mailing list archive)
State	New, archived
Headers
Return-Path: <SRS0=tkuD=5U=vger.kernel.org=linux-xfs-owner@kernel.org>
From: Chandan Rajendra <chandanrlinux@gmail.com>
To: linux-xfs@vger.kernel.org
Cc: Chandan Rajendra <chandanrlinux@gmail.com>, david@fromorbit.com, chandan@linux.ibm.com, darrick.wong@oracle.com, bfoster@redhat.com
Subject: [PATCH 2/2] xfs: Extend xattr extent counter to 32-bits
Date: Sat, 4 Apr 2020 14:22:03 +0530
Message-Id: <20200404085203.1908-3-chandanrlinux@gmail.com>
In-Reply-To: <20200404085203.1908-1-chandanrlinux@gmail.com>
References: <20200404085203.1908-1-chandanrlinux@gmail.com>
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
Sender: linux-xfs-owner@vger.kernel.org
Precedence: bulk
Series	Extend xattr extent counter to 32-bits
[0/2] Extend xattr extent counter to 32-bits
[1/2] xfs: Fix log reservation calculation for xattr insert operation
[2/2] xfs: Extend xattr extent counter to 32-bits

Chandan Babu R April 4, 2020, 8:52 a.m. UTC

XFS has a per-inode xattr extent counter which is 16 bits wide. A workload
which
1. Creates 1,000,000 255-byte sized xattrs,
2. Deletes 50% of these xattrs in an alternating manner,
3. Tries to create 400,000 new 255-byte sized xattrs
causes the following message to be printed on the console,

XFS (loop0): xfs_iflush_int: detected corrupt incore inode 131, total extents = -19916, nblocks = 102937, ptr ffff9ce33b098c00
XFS (loop0): xfs_do_force_shutdown(0x8) called from line 3739 of file fs/xfs/xfs_inode.c. Return address = ffffffffa4a94173

This indicates that we overflowed the 16-bits wide xattr extent counter.
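
The negative "total extents" value in the log above is what a count larger than 32767 looks like once it is stored in a signed 16-bit field. A minimal userspace sketch of that arithmetic follows. The helper name `wrap_to_s16` and the sample count 45620 are illustrative assumptions: only -19916 appears in the log, and that figure is the sum of the data fork and attr fork counts.

```c
#include <stdint.h>

/*
 * Reinterpret the low 16 bits of a count as a signed 16-bit value,
 * which is how a counter of type int16_t would end up holding it.
 * Explicit arithmetic is used to avoid implementation-defined casts.
 */
static int32_t wrap_to_s16(uint32_t count)
{
	uint32_t low = count & 0xffffu;

	return low >= 0x8000u ? (int32_t)low - 0x10000 : (int32_t)low;
}
```

With this helper, `wrap_to_s16(45620)` evaluates to -19916, which would match the log if the data fork contributed no extents (an assumption, since the log does not give the split).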

I have been informed that there are instances where a single file has
more than 100 million hardlinks. With parent pointers being stored in
xattrs, we will overflow the 16-bit wide xattr extent counter when a
large number of hardlinks are created.

Hence this commit extends xattr extent counter to 32-bits. It also introduces
an incompat flag to prevent older kernels from mounting newer filesystems with
32-bit wide xattr extent counter.
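
As a rough illustration of how an incompat bit keeps older kernels away, the sketch below masks a superblock's feature word against the set of bits a kernel knows about. The bit positions are the ones in this patch's xfs_format.h hunk; the shortened macro names and the helper `has_unknown_incompat` are hypothetical and are not the kernel's own identifiers.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit positions as defined in the xfs_format.h hunk of this patch. */
#define FEAT_INCOMPAT_FTYPE		(1u << 0)
#define FEAT_INCOMPAT_SPINODES		(1u << 1)
#define FEAT_INCOMPAT_META_UUID		(1u << 2)
#define FEAT_INCOMPAT_32BIT_AEXT_CNTR	(1u << 3)

/* The set an older kernel (without this patch) would recognise. */
#define OLD_KERNEL_INCOMPAT_ALL \
	(FEAT_INCOMPAT_FTYPE | FEAT_INCOMPAT_SPINODES | FEAT_INCOMPAT_META_UUID)

/*
 * Hypothetical helper: true if the superblock carries any incompat
 * bit outside the set this kernel knows, in which case the mount
 * should be refused.
 */
static bool has_unknown_incompat(uint32_t sb_features, uint32_t known)
{
	return (sb_features & ~known) != 0;
}
```

A filesystem with the new bit set then tests as unknown against `OLD_KERNEL_INCOMPAT_ALL`, while one with only the older bits does not.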

Signed-off-by: Chandan Rajendra <chandanrlinux@gmail.com>
---
 fs/xfs/libxfs/xfs_format.h     | 28 +++++++++++++++++++++-------
 fs/xfs/libxfs/xfs_inode_buf.c  | 27 +++++++++++++++++++--------
 fs/xfs/libxfs/xfs_inode_fork.c |  3 ++-
 fs/xfs/libxfs/xfs_log_format.h |  5 +++--
 fs/xfs/libxfs/xfs_types.h      |  4 ++--
 fs/xfs/scrub/inode.c           |  7 ++++---
 fs/xfs/xfs_inode_item.c        |  3 ++-
 fs/xfs/xfs_log_recover.c       | 13 ++++++++++---
 8 files changed, 63 insertions(+), 27 deletions(-)

Brian Foster April 6, 2020, 4:45 p.m. UTC | #1

On Sat, Apr 04, 2020 at 02:22:03PM +0530, Chandan Rajendra wrote:
> XFS has a per-inode xattr extent counter which is 16 bits wide. A workload
> which
> 1. Creates 1,000,000 255-byte sized xattrs,
> 2. Deletes 50% of these xattrs in an alternating manner,
> 3. Tries to create 400,000 new 255-byte sized xattrs
> causes the following message to be printed on the console,
> 
> XFS (loop0): xfs_iflush_int: detected corrupt incore inode 131, total extents = -19916, nblocks = 102937, ptr ffff9ce33b098c00
> XFS (loop0): xfs_do_force_shutdown(0x8) called from line 3739 of file fs/xfs/xfs_inode.c. Return address = ffffffffa4a94173
> 
> This indicates that we overflowed the 16-bits wide xattr extent counter.
> 
> I have been informed that there are instances where a single file has
> more than 100 million hardlinks. With parent pointers being stored in
> xattrs, we will overflow the 16-bit wide xattr extent counter when a
> large number of hardlinks are created.
> 
> Hence this commit extends xattr extent counter to 32-bits. It also introduces
> an incompat flag to prevent older kernels from mounting newer filesystems with
> 32-bit wide xattr extent counter.
> 

Just a couple high level comments on the first pass...

It looks like the feature bit is only set by mkfs. That raises a couple
of questions. First, what about a fix for older/existing filesystems? Even
if we can't exceed the 16-bit extent count, I would think we should be
able to fail more gracefully than allowing a write verifier to fail and
shut down the fs. What happens when/if we run into a data fork extent
count limit, for example?
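
One possible shape for such a graceful failure, sketched here purely as an assumption, is a check made before an operation adds extents, so the caller gets an error instead of a wrapped counter. The helper name, its parameters, and the choice of EFBIG are hypothetical and do not come from this patch.

```c
#include <errno.h>
#include <stdint.h>

#define MAX_AEXTNUM_16	0x7fffu	/* value of MAXAEXTNUM before this patch */

/*
 * Hypothetical pre-check: would adding nr_to_add extents push the
 * fork's extent count past max? Returns 0 if the result still fits,
 * -EFBIG otherwise. Written so the comparison itself cannot overflow.
 */
static int ext_count_check(uint32_t cur, uint32_t nr_to_add, uint32_t max)
{
	if (cur > max || nr_to_add > max - cur)
		return -EFBIG;
	return 0;
}
```

The same helper would cover the data fork case by passing that fork's limit as `max`.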

Second, I also wonder if enabling an incompat feature bit by default in
mkfs is a bit extreme. Perhaps this should be tied to a mkfs flag for a
period of time? Maybe others have thoughts on that, but I'd at minimum
request to introduce and enable said bit by default in separate patches
to make it a bit easier for distro releases to identify and manage the
incompatibility.

Brian

> Signed-off-by: Chandan Rajendra <chandanrlinux@gmail.com>
> ---
>  fs/xfs/libxfs/xfs_format.h     | 28 +++++++++++++++++++++-------
>  fs/xfs/libxfs/xfs_inode_buf.c  | 27 +++++++++++++++++++--------
>  fs/xfs/libxfs/xfs_inode_fork.c |  3 ++-
>  fs/xfs/libxfs/xfs_log_format.h |  5 +++--
>  fs/xfs/libxfs/xfs_types.h      |  4 ++--
>  fs/xfs/scrub/inode.c           |  7 ++++---
>  fs/xfs/xfs_inode_item.c        |  3 ++-
>  fs/xfs/xfs_log_recover.c       | 13 ++++++++++---
>  8 files changed, 63 insertions(+), 27 deletions(-)
> 
> diff --git a/fs/xfs/libxfs/xfs_format.h b/fs/xfs/libxfs/xfs_format.h
> index 045556e78ee2c..0a4266b0d46e1 100644
> --- a/fs/xfs/libxfs/xfs_format.h
> +++ b/fs/xfs/libxfs/xfs_format.h
> @@ -465,10 +465,12 @@ xfs_sb_has_ro_compat_feature(
>  #define XFS_SB_FEAT_INCOMPAT_FTYPE	(1 << 0)	/* filetype in dirent */
>  #define XFS_SB_FEAT_INCOMPAT_SPINODES	(1 << 1)	/* sparse inode chunks */
>  #define XFS_SB_FEAT_INCOMPAT_META_UUID	(1 << 2)	/* metadata UUID */
> +#define XFS_SB_FEAT_INCOMPAT_32BIT_AEXT_CNTR (1 << 3)
>  #define XFS_SB_FEAT_INCOMPAT_ALL \
>  		(XFS_SB_FEAT_INCOMPAT_FTYPE|	\
>  		 XFS_SB_FEAT_INCOMPAT_SPINODES|	\
> -		 XFS_SB_FEAT_INCOMPAT_META_UUID)
> +		 XFS_SB_FEAT_INCOMPAT_META_UUID| \
> +		 XFS_SB_FEAT_INCOMPAT_32BIT_AEXT_CNTR)
>  
>  #define XFS_SB_FEAT_INCOMPAT_UNKNOWN	~XFS_SB_FEAT_INCOMPAT_ALL
>  static inline bool
> @@ -874,7 +876,7 @@ typedef struct xfs_dinode {
>  	__be64		di_nblocks;	/* # of direct & btree blocks used */
>  	__be32		di_extsize;	/* basic/minimum extent size for file */
>  	__be32		di_nextents;	/* number of extents in data fork */
> -	__be16		di_anextents;	/* number of extents in attribute fork*/
> +	__be16		di_anextents_lo;/* lower part of xattr extent count */
>  	__u8		di_forkoff;	/* attr fork offs, <<3 for 64b align */
>  	__s8		di_aformat;	/* format of attr fork's data */
>  	__be32		di_dmevmask;	/* DMIG event mask */
> @@ -891,7 +893,8 @@ typedef struct xfs_dinode {
>  	__be64		di_lsn;		/* flush sequence */
>  	__be64		di_flags2;	/* more random flags */
>  	__be32		di_cowextsize;	/* basic cow extent size for file */
> -	__u8		di_pad2[12];	/* more padding for future expansion */
> +	__be16		di_anextents_hi;/* higher part of xattr extent count */
> +	__u8		di_pad2[10];	/* more padding for future expansion */
>  
>  	/* fields only written to during inode creation */
>  	xfs_timestamp_t	di_crtime;	/* time created */
> @@ -993,10 +996,21 @@ enum xfs_dinode_fmt {
>  	((w) == XFS_DATA_FORK ? \
>  		(dip)->di_format : \
>  		(dip)->di_aformat)
> -#define XFS_DFORK_NEXTENTS(dip,w) \
> -	((w) == XFS_DATA_FORK ? \
> -		be32_to_cpu((dip)->di_nextents) : \
> -		be16_to_cpu((dip)->di_anextents))
> +
> +static inline int32_t XFS_DFORK_NEXTENTS(struct xfs_sb *sbp,
> +					struct xfs_dinode *dip, int whichfork)
> +{
> +	int32_t anextents;
> +
> +	if (whichfork == XFS_DATA_FORK)
> +		return be32_to_cpu((dip)->di_nextents);
> +
> +	anextents = be16_to_cpu((dip)->di_anextents_lo);
> +	if (xfs_sb_version_has_v3inode(sbp))
> +		anextents |= ((u32)(be16_to_cpu((dip)->di_anextents_hi)) << 16);
> +
> +	return anextents;
> +}
>  
>  /*
>   * For block and character special files the 32bit dev_t is stored at the
> diff --git a/fs/xfs/libxfs/xfs_inode_buf.c b/fs/xfs/libxfs/xfs_inode_buf.c
> index 39c5a6e24915c..ced8195bd8c22 100644
> --- a/fs/xfs/libxfs/xfs_inode_buf.c
> +++ b/fs/xfs/libxfs/xfs_inode_buf.c
> @@ -232,7 +232,8 @@ xfs_inode_from_disk(
>  	to->di_nblocks = be64_to_cpu(from->di_nblocks);
>  	to->di_extsize = be32_to_cpu(from->di_extsize);
>  	to->di_nextents = be32_to_cpu(from->di_nextents);
> -	to->di_anextents = be16_to_cpu(from->di_anextents);
> +	to->di_anextents = XFS_DFORK_NEXTENTS(&ip->i_mount->m_sb, from,
> +				XFS_ATTR_FORK);
>  	to->di_forkoff = from->di_forkoff;
>  	to->di_aformat	= from->di_aformat;
>  	to->di_dmevmask	= be32_to_cpu(from->di_dmevmask);
> @@ -282,7 +283,7 @@ xfs_inode_to_disk(
>  	to->di_nblocks = cpu_to_be64(from->di_nblocks);
>  	to->di_extsize = cpu_to_be32(from->di_extsize);
>  	to->di_nextents = cpu_to_be32(from->di_nextents);
> -	to->di_anextents = cpu_to_be16(from->di_anextents);
> +	to->di_anextents_lo = cpu_to_be16((u32)(from->di_anextents) & 0xffff);
>  	to->di_forkoff = from->di_forkoff;
>  	to->di_aformat = from->di_aformat;
>  	to->di_dmevmask = cpu_to_be32(from->di_dmevmask);
> @@ -296,6 +297,8 @@ xfs_inode_to_disk(
>  		to->di_crtime.t_nsec = cpu_to_be32(from->di_crtime.tv_nsec);
>  		to->di_flags2 = cpu_to_be64(from->di_flags2);
>  		to->di_cowextsize = cpu_to_be32(from->di_cowextsize);
> +		to->di_anextents_hi
> +			= cpu_to_be16((u32)(from->di_anextents) >> 16);
>  		to->di_ino = cpu_to_be64(ip->i_ino);
>  		to->di_lsn = cpu_to_be64(lsn);
>  		memset(to->di_pad2, 0, sizeof(to->di_pad2));
> @@ -335,7 +338,7 @@ xfs_log_dinode_to_disk(
>  	to->di_nblocks = cpu_to_be64(from->di_nblocks);
>  	to->di_extsize = cpu_to_be32(from->di_extsize);
>  	to->di_nextents = cpu_to_be32(from->di_nextents);
> -	to->di_anextents = cpu_to_be16(from->di_anextents);
> +	to->di_anextents_lo = cpu_to_be16(from->di_anextents_lo);
>  	to->di_forkoff = from->di_forkoff;
>  	to->di_aformat = from->di_aformat;
>  	to->di_dmevmask = cpu_to_be32(from->di_dmevmask);
> @@ -349,6 +352,7 @@ xfs_log_dinode_to_disk(
>  		to->di_crtime.t_nsec = cpu_to_be32(from->di_crtime.t_nsec);
>  		to->di_flags2 = cpu_to_be64(from->di_flags2);
>  		to->di_cowextsize = cpu_to_be32(from->di_cowextsize);
> +		to->di_anextents_hi = cpu_to_be16(from->di_anextents_hi);
>  		to->di_ino = cpu_to_be64(from->di_ino);
>  		to->di_lsn = cpu_to_be64(from->di_lsn);
>  		memcpy(to->di_pad2, from->di_pad2, sizeof(to->di_pad2));
> @@ -365,7 +369,9 @@ xfs_dinode_verify_fork(
>  	struct xfs_mount	*mp,
>  	int			whichfork)
>  {
> -	uint32_t		di_nextents = XFS_DFORK_NEXTENTS(dip, whichfork);
> +	uint32_t		di_nextents;
> +
> +	di_nextents = XFS_DFORK_NEXTENTS(&mp->m_sb, dip, whichfork);
>  
>  	switch (XFS_DFORK_FORMAT(dip, whichfork)) {
>  	case XFS_DINODE_FMT_LOCAL:
> @@ -436,6 +442,9 @@ xfs_dinode_verify(
>  	uint16_t		flags;
>  	uint64_t		flags2;
>  	uint64_t		di_size;
> +	int32_t			nextents;
> +	int32_t			anextents;
> +	int64_t			nblocks;
>  
>  	if (dip->di_magic != cpu_to_be16(XFS_DINODE_MAGIC))
>  		return __this_address;
> @@ -466,10 +475,12 @@ xfs_dinode_verify(
>  	if ((S_ISLNK(mode) || S_ISDIR(mode)) && di_size == 0)
>  		return __this_address;
>  
> +	nextents = XFS_DFORK_NEXTENTS(&mp->m_sb, dip, XFS_DATA_FORK);
> +	anextents = XFS_DFORK_NEXTENTS(&mp->m_sb, dip, XFS_ATTR_FORK);
> +	nblocks = be64_to_cpu(dip->di_nblocks);
> +
>  	/* Fork checks carried over from xfs_iformat_fork */
> -	if (mode &&
> -	    be32_to_cpu(dip->di_nextents) + be16_to_cpu(dip->di_anextents) >
> -			be64_to_cpu(dip->di_nblocks))
> +	if (mode && nextents + anextents > nblocks)
>  		return __this_address;
>  
>  	if (mode && XFS_DFORK_BOFF(dip) > mp->m_sb.sb_inodesize)
> @@ -526,7 +537,7 @@ xfs_dinode_verify(
>  		default:
>  			return __this_address;
>  		}
> -		if (dip->di_anextents)
> +		if (XFS_DFORK_NEXTENTS(&mp->m_sb, dip, XFS_ATTR_FORK))
>  			return __this_address;
>  	}
>  
> diff --git a/fs/xfs/libxfs/xfs_inode_fork.c b/fs/xfs/libxfs/xfs_inode_fork.c
> index 518c6f0ec3a61..080fd0c156a1e 100644
> --- a/fs/xfs/libxfs/xfs_inode_fork.c
> +++ b/fs/xfs/libxfs/xfs_inode_fork.c
> @@ -207,9 +207,10 @@ xfs_iformat_extents(
>  	int			whichfork)
>  {
>  	struct xfs_mount	*mp = ip->i_mount;
> +	struct xfs_sb		*sb = &mp->m_sb;
>  	struct xfs_ifork	*ifp = XFS_IFORK_PTR(ip, whichfork);
>  	int			state = xfs_bmap_fork_to_state(whichfork);
> -	int			nex = XFS_DFORK_NEXTENTS(dip, whichfork);
> +	int			nex = XFS_DFORK_NEXTENTS(sb, dip, whichfork);
>  	int			size = nex * sizeof(xfs_bmbt_rec_t);
>  	struct xfs_iext_cursor	icur;
>  	struct xfs_bmbt_rec	*dp;
> diff --git a/fs/xfs/libxfs/xfs_log_format.h b/fs/xfs/libxfs/xfs_log_format.h
> index e3400c9c71cdb..5db92aa508bc5 100644
> --- a/fs/xfs/libxfs/xfs_log_format.h
> +++ b/fs/xfs/libxfs/xfs_log_format.h
> @@ -397,7 +397,7 @@ struct xfs_log_dinode {
>  	xfs_rfsblock_t	di_nblocks;	/* # of direct & btree blocks used */
>  	xfs_extlen_t	di_extsize;	/* basic/minimum extent size for file */
>  	xfs_extnum_t	di_nextents;	/* number of extents in data fork */
> -	xfs_aextnum_t	di_anextents;	/* number of extents in attribute fork*/
> +	uint16_t	di_anextents_lo;/* lower part of xattr extent count */
>  	uint8_t		di_forkoff;	/* attr fork offs, <<3 for 64b align */
>  	int8_t		di_aformat;	/* format of attr fork's data */
>  	uint32_t	di_dmevmask;	/* DMIG event mask */
> @@ -414,7 +414,8 @@ struct xfs_log_dinode {
>  	xfs_lsn_t	di_lsn;		/* flush sequence */
>  	uint64_t	di_flags2;	/* more random flags */
>  	uint32_t	di_cowextsize;	/* basic cow extent size for file */
> -	uint8_t		di_pad2[12];	/* more padding for future expansion */
> +	uint16_t	di_anextents_hi;/* higher part of xattr extent count */
> +	uint8_t		di_pad2[10];	/* more padding for future expansion */
>  
>  	/* fields only written to during inode creation */
>  	xfs_ictimestamp_t di_crtime;	/* time created */
> diff --git a/fs/xfs/libxfs/xfs_types.h b/fs/xfs/libxfs/xfs_types.h
> index 397d94775440d..01669aa65745a 100644
> --- a/fs/xfs/libxfs/xfs_types.h
> +++ b/fs/xfs/libxfs/xfs_types.h
> @@ -13,7 +13,7 @@ typedef uint32_t	xfs_agino_t;	/* inode # within allocation grp */
>  typedef uint32_t	xfs_extlen_t;	/* extent length in blocks */
>  typedef uint32_t	xfs_agnumber_t;	/* allocation group number */
>  typedef int32_t		xfs_extnum_t;	/* # of extents in a file */
> -typedef int16_t		xfs_aextnum_t;	/* # extents in an attribute fork */
> +typedef int32_t		xfs_aextnum_t;	/* # extents in an attribute fork */
>  typedef int64_t		xfs_fsize_t;	/* bytes in a file */
>  typedef uint64_t	xfs_ufsize_t;	/* unsigned bytes in a file */
>  
> @@ -60,7 +60,7 @@ typedef void *		xfs_failaddr_t;
>   */
>  #define	MAXEXTLEN	((xfs_extlen_t)0x001fffff)	/* 21 bits */
>  #define	MAXEXTNUM	((xfs_extnum_t)0x7fffffff)	/* signed int */
> -#define	MAXAEXTNUM	((xfs_aextnum_t)0x7fff)		/* signed short */
> +#define	MAXAEXTNUM	((xfs_aextnum_t)0x7fffffff)	/* signed int */
>  
>  /*
>   * Minimum and maximum blocksize and sectorsize.
> diff --git a/fs/xfs/scrub/inode.c b/fs/xfs/scrub/inode.c
> index 6d483ab29e639..3b624e24ae868 100644
> --- a/fs/xfs/scrub/inode.c
> +++ b/fs/xfs/scrub/inode.c
> @@ -371,10 +371,12 @@ xchk_dinode(
>  		break;
>  	}
>  
> +	nextents = XFS_DFORK_NEXTENTS(&mp->m_sb, dip, XFS_ATTR_FORK);
> +
>  	/* di_forkoff */
>  	if (XFS_DFORK_APTR(dip) >= (char *)dip + mp->m_sb.sb_inodesize)
>  		xchk_ino_set_corrupt(sc, ino);
> -	if (dip->di_anextents != 0 && dip->di_forkoff == 0)
> +	if (nextents != 0 && dip->di_forkoff == 0)
>  		xchk_ino_set_corrupt(sc, ino);
>  	if (dip->di_forkoff == 0 && dip->di_aformat != XFS_DINODE_FMT_EXTENTS)
>  		xchk_ino_set_corrupt(sc, ino);
> @@ -386,7 +388,6 @@ xchk_dinode(
>  		xchk_ino_set_corrupt(sc, ino);
>  
>  	/* di_anextents */
> -	nextents = be16_to_cpu(dip->di_anextents);
>  	fork_recs =  XFS_DFORK_ASIZE(dip, mp) / sizeof(struct xfs_bmbt_rec);
>  	switch (dip->di_aformat) {
>  	case XFS_DINODE_FMT_EXTENTS:
> @@ -484,7 +485,7 @@ xchk_inode_xref_bmap(
>  			&nextents, &acount);
>  	if (!xchk_should_check_xref(sc, &error, NULL))
>  		return;
> -	if (nextents != be16_to_cpu(dip->di_anextents))
> +	if (nextents != XFS_DFORK_NEXTENTS(&sc->mp->m_sb, dip, XFS_ATTR_FORK))
>  		xchk_ino_xref_set_corrupt(sc, sc->ip->i_ino);
>  
>  	/* Check nblocks against the inode. */
> diff --git a/fs/xfs/xfs_inode_item.c b/fs/xfs/xfs_inode_item.c
> index 4a3d13d4a0228..dff20f2b368ea 100644
> --- a/fs/xfs/xfs_inode_item.c
> +++ b/fs/xfs/xfs_inode_item.c
> @@ -327,7 +327,7 @@ xfs_inode_to_log_dinode(
>  	to->di_nblocks = from->di_nblocks;
>  	to->di_extsize = from->di_extsize;
>  	to->di_nextents = from->di_nextents;
> -	to->di_anextents = from->di_anextents;
> +	to->di_anextents_lo = ((u32)(from->di_anextents)) & 0xffff;
>  	to->di_forkoff = from->di_forkoff;
>  	to->di_aformat = from->di_aformat;
>  	to->di_dmevmask = from->di_dmevmask;
> @@ -344,6 +344,7 @@ xfs_inode_to_log_dinode(
>  		to->di_crtime.t_nsec = from->di_crtime.tv_nsec;
>  		to->di_flags2 = from->di_flags2;
>  		to->di_cowextsize = from->di_cowextsize;
> +		to->di_anextents_hi = ((u32)(from->di_anextents)) >> 16;
>  		to->di_ino = ip->i_ino;
>  		to->di_lsn = lsn;
>  		memset(to->di_pad2, 0, sizeof(to->di_pad2));
> diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c
> index 11c3502b07b13..ba3fae95b2260 100644
> --- a/fs/xfs/xfs_log_recover.c
> +++ b/fs/xfs/xfs_log_recover.c
> @@ -2922,6 +2922,7 @@ xlog_recover_inode_pass2(
>  	struct xfs_log_dinode	*ldip;
>  	uint			isize;
>  	int			need_free = 0;
> +	uint32_t		nextents;
>  
>  	if (item->ri_buf[0].i_len == sizeof(struct xfs_inode_log_format)) {
>  		in_f = item->ri_buf[0].i_addr;
> @@ -3044,7 +3045,14 @@ xlog_recover_inode_pass2(
>  			goto out_release;
>  		}
>  	}
> -	if (unlikely(ldip->di_nextents + ldip->di_anextents > ldip->di_nblocks)){
> +
> +	nextents = ldip->di_anextents_lo;
> +	if (xfs_sb_version_has_v3inode(&mp->m_sb))
> +		nextents |= ((u32)(ldip->di_anextents_hi) << 16);
> +
> +	nextents += ldip->di_nextents;
> +
> +	if (unlikely(nextents > ldip->di_nblocks)) {
>  		XFS_CORRUPTION_ERROR("xlog_recover_inode_pass2(5)",
>  				     XFS_ERRLEVEL_LOW, mp, ldip,
>  				     sizeof(*ldip));
> @@ -3052,8 +3060,7 @@ xlog_recover_inode_pass2(
>  	"%s: Bad inode log record, rec ptr "PTR_FMT", dino ptr "PTR_FMT", "
>  	"dino bp "PTR_FMT", ino %Ld, total extents = %d, nblocks = %Ld",
>  			__func__, item, dip, bp, in_f->ilf_ino,
> -			ldip->di_nextents + ldip->di_anextents,
> -			ldip->di_nblocks);
> +			nextents, ldip->di_nblocks);
>  		error = -EFSCORRUPTED;
>  		goto out_release;
>  	}
> -- 
> 2.19.1
>

Darrick J. Wong April 6, 2020, 5:06 p.m. UTC | #2

On Sat, Apr 04, 2020 at 02:22:03PM +0530, Chandan Rajendra wrote:
> XFS has a per-inode xattr extent counter which is 16 bits wide. A workload
> which
> 1. Creates 1,000,000 255-byte sized xattrs,
> 2. Deletes 50% of these xattrs in an alternating manner,
> 3. Tries to create 400,000 new 255-byte sized xattrs
> causes the following message to be printed on the console,
> 
> XFS (loop0): xfs_iflush_int: detected corrupt incore inode 131, total extents = -19916, nblocks = 102937, ptr ffff9ce33b098c00
> XFS (loop0): xfs_do_force_shutdown(0x8) called from line 3739 of file fs/xfs/xfs_inode.c. Return address = ffffffffa4a94173
> 
> This indicates that we overflowed the 16-bits wide xattr extent counter.
> 
> I have been informed that there are instances where a single file has
> more than 100 million hardlinks. With parent pointers being stored in
> xattrs, we will overflow the 16-bit wide xattr extent counter when a
> large number of hardlinks are created.
> 
> Hence this commit extends xattr extent counter to 32-bits. It also introduces
> an incompat flag to prevent older kernels from mounting newer filesystems with
> 32-bit wide xattr extent counter.
> 
> Signed-off-by: Chandan Rajendra <chandanrlinux@gmail.com>
> ---
>  fs/xfs/libxfs/xfs_format.h     | 28 +++++++++++++++++++++-------
>  fs/xfs/libxfs/xfs_inode_buf.c  | 27 +++++++++++++++++++--------
>  fs/xfs/libxfs/xfs_inode_fork.c |  3 ++-
>  fs/xfs/libxfs/xfs_log_format.h |  5 +++--
>  fs/xfs/libxfs/xfs_types.h      |  4 ++--
>  fs/xfs/scrub/inode.c           |  7 ++++---
>  fs/xfs/xfs_inode_item.c        |  3 ++-
>  fs/xfs/xfs_log_recover.c       | 13 ++++++++++---
>  8 files changed, 63 insertions(+), 27 deletions(-)
> 
> diff --git a/fs/xfs/libxfs/xfs_format.h b/fs/xfs/libxfs/xfs_format.h
> index 045556e78ee2c..0a4266b0d46e1 100644
> --- a/fs/xfs/libxfs/xfs_format.h
> +++ b/fs/xfs/libxfs/xfs_format.h
> @@ -465,10 +465,12 @@ xfs_sb_has_ro_compat_feature(
>  #define XFS_SB_FEAT_INCOMPAT_FTYPE	(1 << 0)	/* filetype in dirent */
>  #define XFS_SB_FEAT_INCOMPAT_SPINODES	(1 << 1)	/* sparse inode chunks */
>  #define XFS_SB_FEAT_INCOMPAT_META_UUID	(1 << 2)	/* metadata UUID */
> +#define XFS_SB_FEAT_INCOMPAT_32BIT_AEXT_CNTR (1 << 3)

If you're going to introduce an INCOMPAT feature, please also use the
opportunity to convert xattrs to something resembling the dir v3 format,
where we index free space within each block so that we can speed up attr
setting with 100 million attrs.

>  #define XFS_SB_FEAT_INCOMPAT_ALL \
>  		(XFS_SB_FEAT_INCOMPAT_FTYPE|	\
>  		 XFS_SB_FEAT_INCOMPAT_SPINODES|	\
> -		 XFS_SB_FEAT_INCOMPAT_META_UUID)
> +		 XFS_SB_FEAT_INCOMPAT_META_UUID| \
> +		 XFS_SB_FEAT_INCOMPAT_32BIT_AEXT_CNTR)
>  
>  #define XFS_SB_FEAT_INCOMPAT_UNKNOWN	~XFS_SB_FEAT_INCOMPAT_ALL
>  static inline bool
> @@ -874,7 +876,7 @@ typedef struct xfs_dinode {
>  	__be64		di_nblocks;	/* # of direct & btree blocks used */
>  	__be32		di_extsize;	/* basic/minimum extent size for file */
>  	__be32		di_nextents;	/* number of extents in data fork */
> -	__be16		di_anextents;	/* number of extents in attribute fork*/
> +	__be16		di_anextents_lo;/* lower part of xattr extent count */
>  	__u8		di_forkoff;	/* attr fork offs, <<3 for 64b align */
>  	__s8		di_aformat;	/* format of attr fork's data */
>  	__be32		di_dmevmask;	/* DMIG event mask */
> @@ -891,7 +893,8 @@ typedef struct xfs_dinode {
>  	__be64		di_lsn;		/* flush sequence */
>  	__be64		di_flags2;	/* more random flags */
>  	__be32		di_cowextsize;	/* basic cow extent size for file */
> -	__u8		di_pad2[12];	/* more padding for future expansion */
> +	__be16		di_anextents_hi;/* higher part of xattr extent count */

I was expecting you to use di_pad, not di_pad2... :)

> +	__u8		di_pad2[10];	/* more padding for future expansion */
>  
>  	/* fields only written to during inode creation */
>  	xfs_timestamp_t	di_crtime;	/* time created */
> @@ -993,10 +996,21 @@ enum xfs_dinode_fmt {
>  	((w) == XFS_DATA_FORK ? \
>  		(dip)->di_format : \
>  		(dip)->di_aformat)
> -#define XFS_DFORK_NEXTENTS(dip,w) \
> -	((w) == XFS_DATA_FORK ? \
> -		be32_to_cpu((dip)->di_nextents) : \
> -		be16_to_cpu((dip)->di_anextents))
> +
> +static inline int32_t XFS_DFORK_NEXTENTS(struct xfs_sb *sbp,
> +					struct xfs_dinode *dip, int whichfork)
> +{

XFS style indenting, please.

> +	int32_t anextents;

When would we have negative extent count?

(Yes, this is a bug in the xfs_extnum/xfs_aextnum typedefs, bah...)

> +
> +	if (whichfork == XFS_DATA_FORK)
> +		return be32_to_cpu((dip)->di_nextents);
> +
> +	anextents = be16_to_cpu((dip)->di_anextents_lo);
> +	if (xfs_sb_version_has_v3inode(sbp))

v3inode?  I thought this had a separate incompat flag?

> +		anextents |= ((u32)(be16_to_cpu((dip)->di_anextents_hi)) << 16);

/me would have thought you'd do the splitting and endian conversion in
the opposite order, e.g.:

	be32 x = dip->di_anextents_lo;
	if (has32bitattrcount)
		x |= (be32)dip->di_anextents_hi << 16;
	return be32_to_cpu(x);
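
For comparison, the order the patch itself uses (convert each 16-bit half on its own, then combine the halves in CPU byte order) can be sketched in userspace as below. The struct and function names are illustrative assumptions, and `htons`/`ntohs` stand in for the kernel's `cpu_to_be16`/`be16_to_cpu`.

```c
#include <arpa/inet.h>	/* htons/ntohs: CPU <-> big-endian 16-bit */
#include <stdint.h>

/* Illustrative stand-in for the two on-disk fields. */
struct disk_count {
	uint16_t lo_be;	/* low 16 bits, stored big-endian */
	uint16_t hi_be;	/* high 16 bits, stored big-endian */
};

/* Split a CPU-order count into two big-endian halves. */
static struct disk_count count_to_disk(uint32_t count)
{
	struct disk_count d = {
		.lo_be = htons((uint16_t)(count & 0xffffu)),
		.hi_be = htons((uint16_t)(count >> 16)),
	};

	return d;
}

/* Join the halves; the high half is only honoured when has_hi is set. */
static uint32_t count_from_disk(struct disk_count d, int has_hi)
{
	uint32_t count = ntohs(d.lo_be);

	if (has_hi)
		count |= (uint32_t)ntohs(d.hi_be) << 16;
	return count;
}
```

A count written with `count_to_disk()` reads back unchanged when `has_hi` is set, and reads back as its low 16 bits only when it is clear.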

> +
> +	return anextents;
> +}
>  
>  /*
>   * For block and character special files the 32bit dev_t is stored at the
> diff --git a/fs/xfs/libxfs/xfs_inode_buf.c b/fs/xfs/libxfs/xfs_inode_buf.c
> index 39c5a6e24915c..ced8195bd8c22 100644
> --- a/fs/xfs/libxfs/xfs_inode_buf.c
> +++ b/fs/xfs/libxfs/xfs_inode_buf.c
> @@ -232,7 +232,8 @@ xfs_inode_from_disk(
>  	to->di_nblocks = be64_to_cpu(from->di_nblocks);
>  	to->di_extsize = be32_to_cpu(from->di_extsize);
>  	to->di_nextents = be32_to_cpu(from->di_nextents);
> -	to->di_anextents = be16_to_cpu(from->di_anextents);
> +	to->di_anextents = XFS_DFORK_NEXTENTS(&ip->i_mount->m_sb, from,
> +				XFS_ATTR_FORK);
>  	to->di_forkoff = from->di_forkoff;
>  	to->di_aformat	= from->di_aformat;
>  	to->di_dmevmask	= be32_to_cpu(from->di_dmevmask);
> @@ -282,7 +283,7 @@ xfs_inode_to_disk(
>  	to->di_nblocks = cpu_to_be64(from->di_nblocks);
>  	to->di_extsize = cpu_to_be32(from->di_extsize);
>  	to->di_nextents = cpu_to_be32(from->di_nextents);
> -	to->di_anextents = cpu_to_be16(from->di_anextents);
> +	to->di_anextents_lo = cpu_to_be16((u32)(from->di_anextents) & 0xffff);
>  	to->di_forkoff = from->di_forkoff;
>  	to->di_aformat = from->di_aformat;
>  	to->di_dmevmask = cpu_to_be32(from->di_dmevmask);
> @@ -296,6 +297,8 @@ xfs_inode_to_disk(
>  		to->di_crtime.t_nsec = cpu_to_be32(from->di_crtime.tv_nsec);
>  		to->di_flags2 = cpu_to_be64(from->di_flags2);
>  		to->di_cowextsize = cpu_to_be32(from->di_cowextsize);
> +		to->di_anextents_hi
> +			= cpu_to_be16((u32)(from->di_anextents) >> 16);
>  		to->di_ino = cpu_to_be64(ip->i_ino);
>  		to->di_lsn = cpu_to_be64(lsn);
>  		memset(to->di_pad2, 0, sizeof(to->di_pad2));
> @@ -335,7 +338,7 @@ xfs_log_dinode_to_disk(
>  	to->di_nblocks = cpu_to_be64(from->di_nblocks);
>  	to->di_extsize = cpu_to_be32(from->di_extsize);
>  	to->di_nextents = cpu_to_be32(from->di_nextents);
> -	to->di_anextents = cpu_to_be16(from->di_anextents);
> +	to->di_anextents_lo = cpu_to_be16(from->di_anextents_lo);
>  	to->di_forkoff = from->di_forkoff;
>  	to->di_aformat = from->di_aformat;
>  	to->di_dmevmask = cpu_to_be32(from->di_dmevmask);
> @@ -349,6 +352,7 @@ xfs_log_dinode_to_disk(
>  		to->di_crtime.t_nsec = cpu_to_be32(from->di_crtime.t_nsec);
>  		to->di_flags2 = cpu_to_be64(from->di_flags2);
>  		to->di_cowextsize = cpu_to_be32(from->di_cowextsize);
> +		to->di_anextents_hi = cpu_to_be16(from->di_anextents_hi);
>  		to->di_ino = cpu_to_be64(from->di_ino);
>  		to->di_lsn = cpu_to_be64(from->di_lsn);
>  		memcpy(to->di_pad2, from->di_pad2, sizeof(to->di_pad2));
> @@ -365,7 +369,9 @@ xfs_dinode_verify_fork(
>  	struct xfs_mount	*mp,
>  	int			whichfork)
>  {
> -	uint32_t		di_nextents = XFS_DFORK_NEXTENTS(dip, whichfork);
> +	uint32_t		di_nextents;
> +
> +	di_nextents = XFS_DFORK_NEXTENTS(&mp->m_sb, dip, whichfork);
>  
>  	switch (XFS_DFORK_FORMAT(dip, whichfork)) {
>  	case XFS_DINODE_FMT_LOCAL:
> @@ -436,6 +442,9 @@ xfs_dinode_verify(
>  	uint16_t		flags;
>  	uint64_t		flags2;
>  	uint64_t		di_size;
> +	int32_t			nextents;
> +	int32_t			anextents;
> +	int64_t			nblocks;
>  
>  	if (dip->di_magic != cpu_to_be16(XFS_DINODE_MAGIC))
>  		return __this_address;
> @@ -466,10 +475,12 @@ xfs_dinode_verify(
>  	if ((S_ISLNK(mode) || S_ISDIR(mode)) && di_size == 0)
>  		return __this_address;
>  
> +	nextents = XFS_DFORK_NEXTENTS(&mp->m_sb, dip, XFS_DATA_FORK);
> +	anextents = XFS_DFORK_NEXTENTS(&mp->m_sb, dip, XFS_ATTR_FORK);
> +	nblocks = be64_to_cpu(dip->di_nblocks);
> +
>  	/* Fork checks carried over from xfs_iformat_fork */
> -	if (mode &&
> -	    be32_to_cpu(dip->di_nextents) + be16_to_cpu(dip->di_anextents) >
> -			be64_to_cpu(dip->di_nblocks))
> +	if (mode && nextents + anextents > nblocks)
>  		return __this_address;
>  
>  	if (mode && XFS_DFORK_BOFF(dip) > mp->m_sb.sb_inodesize)
> @@ -526,7 +537,7 @@ xfs_dinode_verify(
>  		default:
>  			return __this_address;
>  		}
> -		if (dip->di_anextents)
> +		if (XFS_DFORK_NEXTENTS(&mp->m_sb, dip, XFS_ATTR_FORK))
>  			return __this_address;
>  	}
>  
> diff --git a/fs/xfs/libxfs/xfs_inode_fork.c b/fs/xfs/libxfs/xfs_inode_fork.c
> index 518c6f0ec3a61..080fd0c156a1e 100644
> --- a/fs/xfs/libxfs/xfs_inode_fork.c
> +++ b/fs/xfs/libxfs/xfs_inode_fork.c
> @@ -207,9 +207,10 @@ xfs_iformat_extents(
>  	int			whichfork)
>  {
>  	struct xfs_mount	*mp = ip->i_mount;
> +	struct xfs_sb		*sb = &mp->m_sb;
>  	struct xfs_ifork	*ifp = XFS_IFORK_PTR(ip, whichfork);
>  	int			state = xfs_bmap_fork_to_state(whichfork);
> -	int			nex = XFS_DFORK_NEXTENTS(dip, whichfork);
> +	int			nex = XFS_DFORK_NEXTENTS(sb, dip, whichfork);
>  	int			size = nex * sizeof(xfs_bmbt_rec_t);
>  	struct xfs_iext_cursor	icur;
>  	struct xfs_bmbt_rec	*dp;
> diff --git a/fs/xfs/libxfs/xfs_log_format.h b/fs/xfs/libxfs/xfs_log_format.h
> index e3400c9c71cdb..5db92aa508bc5 100644
> --- a/fs/xfs/libxfs/xfs_log_format.h
> +++ b/fs/xfs/libxfs/xfs_log_format.h
> @@ -397,7 +397,7 @@ struct xfs_log_dinode {
>  	xfs_rfsblock_t	di_nblocks;	/* # of direct & btree blocks used */
>  	xfs_extlen_t	di_extsize;	/* basic/minimum extent size for file */
>  	xfs_extnum_t	di_nextents;	/* number of extents in data fork */
> -	xfs_aextnum_t	di_anextents;	/* number of extents in attribute fork*/
> +	uint16_t	di_anextents_lo;/* lower part of xattr extent count */
>  	uint8_t		di_forkoff;	/* attr fork offs, <<3 for 64b align */
>  	int8_t		di_aformat;	/* format of attr fork's data */
>  	uint32_t	di_dmevmask;	/* DMIG event mask */
> @@ -414,7 +414,8 @@ struct xfs_log_dinode {
>  	xfs_lsn_t	di_lsn;		/* flush sequence */
>  	uint64_t	di_flags2;	/* more random flags */
>  	uint32_t	di_cowextsize;	/* basic cow extent size for file */
> -	uint8_t		di_pad2[12];	/* more padding for future expansion */
> +	uint16_t	di_anextents_hi;/* higher part of xattr extent count */
> +	uint8_t		di_pad2[10];	/* more padding for future expansion */
>  
>  	/* fields only written to during inode creation */
>  	xfs_ictimestamp_t di_crtime;	/* time created */
> diff --git a/fs/xfs/libxfs/xfs_types.h b/fs/xfs/libxfs/xfs_types.h
> index 397d94775440d..01669aa65745a 100644
> --- a/fs/xfs/libxfs/xfs_types.h
> +++ b/fs/xfs/libxfs/xfs_types.h
> @@ -13,7 +13,7 @@ typedef uint32_t	xfs_agino_t;	/* inode # within allocation grp */
>  typedef uint32_t	xfs_extlen_t;	/* extent length in blocks */
>  typedef uint32_t	xfs_agnumber_t;	/* allocation group number */
>  typedef int32_t		xfs_extnum_t;	/* # of extents in a file */
> -typedef int16_t		xfs_aextnum_t;	/* # extents in an attribute fork */
> +typedef int32_t		xfs_aextnum_t;	/* # extents in an attribute fork */
>  typedef int64_t		xfs_fsize_t;	/* bytes in a file */
>  typedef uint64_t	xfs_ufsize_t;	/* unsigned bytes in a file */
>  
> @@ -60,7 +60,7 @@ typedef void *		xfs_failaddr_t;
>   */
>  #define	MAXEXTLEN	((xfs_extlen_t)0x001fffff)	/* 21 bits */
>  #define	MAXEXTNUM	((xfs_extnum_t)0x7fffffff)	/* signed int */
> -#define	MAXAEXTNUM	((xfs_aextnum_t)0x7fff)		/* signed short */
> +#define	MAXAEXTNUM	((xfs_aextnum_t)0x7fffffff)	/* signed int */

Need to preserve both limits so that we can do the correct check for the
given feature set.
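
A minimal sketch of what keeping both limits could look like, assuming a boolean that says whether the 32-bit counter feature is enabled. The macro and function names here are hypothetical; only the two limit values come from the hunk above.

```c
#include <stdbool.h>
#include <stdint.h>

#define MAXAEXTNUM_16BIT	((int32_t)0x7fff)	/* limit before this patch */
#define MAXAEXTNUM_32BIT	((int32_t)0x7fffffff)	/* limit this patch introduces */

/* Pick the attr fork extent limit that matches the feature set. */
static int32_t max_attr_extents(bool has_32bit_aext_cntr)
{
	return has_32bit_aext_cntr ? MAXAEXTNUM_32BIT : MAXAEXTNUM_16BIT;
}

/* Verifier-style check: the count must be non-negative and within the limit. */
static bool attr_extent_count_ok(int32_t count, bool has_32bit_aext_cntr)
{
	return count >= 0 && count <= max_attr_extents(has_32bit_aext_cntr);
}
```

With this arrangement a count of 40000 is accepted only when the feature is enabled, so filesystems without it keep the old check.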

>  
>  /*
>   * Minimum and maximum blocksize and sectorsize.
> diff --git a/fs/xfs/scrub/inode.c b/fs/xfs/scrub/inode.c
> index 6d483ab29e639..3b624e24ae868 100644
> --- a/fs/xfs/scrub/inode.c
> +++ b/fs/xfs/scrub/inode.c
> @@ -371,10 +371,12 @@ xchk_dinode(
>  		break;
>  	}
>  
> +	nextents = XFS_DFORK_NEXTENTS(&mp->m_sb, dip, XFS_ATTR_FORK);
> +
>  	/* di_forkoff */
>  	if (XFS_DFORK_APTR(dip) >= (char *)dip + mp->m_sb.sb_inodesize)
>  		xchk_ino_set_corrupt(sc, ino);
> -	if (dip->di_anextents != 0 && dip->di_forkoff == 0)
> +	if (nextents != 0 && dip->di_forkoff == 0)
>  		xchk_ino_set_corrupt(sc, ino);
>  	if (dip->di_forkoff == 0 && dip->di_aformat != XFS_DINODE_FMT_EXTENTS)
>  		xchk_ino_set_corrupt(sc, ino);
> @@ -386,7 +388,6 @@ xchk_dinode(
>  		xchk_ino_set_corrupt(sc, ino);
>  
>  	/* di_anextents */
> -	nextents = be16_to_cpu(dip->di_anextents);
>  	fork_recs =  XFS_DFORK_ASIZE(dip, mp) / sizeof(struct xfs_bmbt_rec);
>  	switch (dip->di_aformat) {
>  	case XFS_DINODE_FMT_EXTENTS:
> @@ -484,7 +485,7 @@ xchk_inode_xref_bmap(
>  			&nextents, &acount);
>  	if (!xchk_should_check_xref(sc, &error, NULL))
>  		return;
> -	if (nextents != be16_to_cpu(dip->di_anextents))
> +	if (nextents != XFS_DFORK_NEXTENTS(&sc->mp->m_sb, dip, XFS_ATTR_FORK))
>  		xchk_ino_xref_set_corrupt(sc, sc->ip->i_ino);
>  
>  	/* Check nblocks against the inode. */
> diff --git a/fs/xfs/xfs_inode_item.c b/fs/xfs/xfs_inode_item.c
> index 4a3d13d4a0228..dff20f2b368ea 100644
> --- a/fs/xfs/xfs_inode_item.c
> +++ b/fs/xfs/xfs_inode_item.c
> @@ -327,7 +327,7 @@ xfs_inode_to_log_dinode(
>  	to->di_nblocks = from->di_nblocks;
>  	to->di_extsize = from->di_extsize;
>  	to->di_nextents = from->di_nextents;
> -	to->di_anextents = from->di_anextents;
> +	to->di_anextents_lo = ((u32)(from->di_anextents)) & 0xffff;
>  	to->di_forkoff = from->di_forkoff;
>  	to->di_aformat = from->di_aformat;
>  	to->di_dmevmask = from->di_dmevmask;
> @@ -344,6 +344,7 @@ xfs_inode_to_log_dinode(
>  		to->di_crtime.t_nsec = from->di_crtime.tv_nsec;
>  		to->di_flags2 = from->di_flags2;
>  		to->di_cowextsize = from->di_cowextsize;
> +		to->di_anextents_hi = ((u32)(from->di_anextents)) >> 16;
>  		to->di_ino = ip->i_ino;
>  		to->di_lsn = lsn;
>  		memset(to->di_pad2, 0, sizeof(to->di_pad2));
> diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c
> index 11c3502b07b13..ba3fae95b2260 100644
> --- a/fs/xfs/xfs_log_recover.c
> +++ b/fs/xfs/xfs_log_recover.c
> @@ -2922,6 +2922,7 @@ xlog_recover_inode_pass2(
>  	struct xfs_log_dinode	*ldip;
>  	uint			isize;
>  	int			need_free = 0;
> +	uint32_t		nextents;
>  
>  	if (item->ri_buf[0].i_len == sizeof(struct xfs_inode_log_format)) {
>  		in_f = item->ri_buf[0].i_addr;
> @@ -3044,7 +3045,14 @@ xlog_recover_inode_pass2(
>  			goto out_release;
>  		}
>  	}
> -	if (unlikely(ldip->di_nextents + ldip->di_anextents > ldip->di_nblocks)){
> +
> +	nextents = ldip->di_anextents_lo;
> +	if (xfs_sb_version_has_v3inode(&mp->m_sb))
> +		nextents |= ((u32)(ldip->di_anextents_hi) << 16);
> +
> +	nextents += ldip->di_nextents;
> +
> +	if (unlikely(nextents > ldip->di_nblocks)) {
>  		XFS_CORRUPTION_ERROR("xlog_recover_inode_pass2(5)",
>  				     XFS_ERRLEVEL_LOW, mp, ldip,
>  				     sizeof(*ldip));
> @@ -3052,8 +3060,7 @@ xlog_recover_inode_pass2(
>  	"%s: Bad inode log record, rec ptr "PTR_FMT", dino ptr "PTR_FMT", "
>  	"dino bp "PTR_FMT", ino %Ld, total extents = %d, nblocks = %Ld",
>  			__func__, item, dip, bp, in_f->ilf_ino,
> -			ldip->di_nextents + ldip->di_anextents,
> -			ldip->di_nblocks);
> +			nextents, ldip->di_nblocks);
>  		error = -EFSCORRUPTED;
>  		goto out_release;
>  	}
> -- 
> 2.19.1
>

Dave Chinner April 6, 2020, 11:30 p.m. UTC | #3

On Mon, Apr 06, 2020 at 10:06:03AM -0700, Darrick J. Wong wrote:
> On Sat, Apr 04, 2020 at 02:22:03PM +0530, Chandan Rajendra wrote:
> > XFS has a per-inode xattr extent counter which is 16 bits wide. A workload
> > which
> > 1. Creates 1,000,000 255-byte sized xattrs,
> > 2. Deletes 50% of these xattrs in an alternating manner,
> > 3. Tries to create 400,000 new 255-byte sized xattrs
> > causes the following message to be printed on the console,
> > 
> > XFS (loop0): xfs_iflush_int: detected corrupt incore inode 131, total extents = -19916, nblocks = 102937, ptr ffff9ce33b098c00
> > XFS (loop0): xfs_do_force_shutdown(0x8) called from line 3739 of file fs/xfs/xfs_inode.c. Return address = ffffffffa4a94173
> > 
> > This indicates that we overflowed the 16-bits wide xattr extent counter.
> > 
> > I have been informed that there are instances where a single file has
> >  > 100 million hardlinks. With parent pointers being stored in xattr,
> > we will overflow the 16-bits wide xattr extent counter when large
> > number of hardlinks are created.
> > 
> > Hence this commit extends xattr extent counter to 32-bits. It also introduces
> > an incompat flag to prevent older kernels from mounting newer filesystems with
> > 32-bit wide xattr extent counter.
> > 
> > Signed-off-by: Chandan Rajendra <chandanrlinux@gmail.com>
> > ---
> >  fs/xfs/libxfs/xfs_format.h     | 28 +++++++++++++++++++++-------
> >  fs/xfs/libxfs/xfs_inode_buf.c  | 27 +++++++++++++++++++--------
> >  fs/xfs/libxfs/xfs_inode_fork.c |  3 ++-
> >  fs/xfs/libxfs/xfs_log_format.h |  5 +++--
> >  fs/xfs/libxfs/xfs_types.h      |  4 ++--
> >  fs/xfs/scrub/inode.c           |  7 ++++---
> >  fs/xfs/xfs_inode_item.c        |  3 ++-
> >  fs/xfs/xfs_log_recover.c       | 13 ++++++++++---
> >  8 files changed, 63 insertions(+), 27 deletions(-)
> > 
> > diff --git a/fs/xfs/libxfs/xfs_format.h b/fs/xfs/libxfs/xfs_format.h
> > index 045556e78ee2c..0a4266b0d46e1 100644
> > --- a/fs/xfs/libxfs/xfs_format.h
> > +++ b/fs/xfs/libxfs/xfs_format.h
> > @@ -465,10 +465,12 @@ xfs_sb_has_ro_compat_feature(
> >  #define XFS_SB_FEAT_INCOMPAT_FTYPE	(1 << 0)	/* filetype in dirent */
> >  #define XFS_SB_FEAT_INCOMPAT_SPINODES	(1 << 1)	/* sparse inode chunks */
> >  #define XFS_SB_FEAT_INCOMPAT_META_UUID	(1 << 2)	/* metadata UUID */
> > +#define XFS_SB_FEAT_INCOMPAT_32BIT_AEXT_CNTR (1 << 3)
> 
> If you're going to introduce an INCOMPAT feature, please also use the
> opportunity to convert xattrs to something resembling the dir v3 format,
> where we index free space within each block so that we can speed up attr
> setting with 100 million attrs.

Not necessary. Chandan has already spent a lot of time investigating
that - I suggested doing the investigation probably a year ago when
he was looking for stuff to do, knowing that this could be a problem
that parent pointers would hit. Long story short - there's no
degradation in performance in the dabtree out to tens of millions of
records with different fixed size or random sized attributes, nor do
various combinations of insert/lookup/remove/replace operations seem
to impact the tree performance at scale. IOWs, we hit the 16 bit
extent limits of the attribute trees without finding any degradation
in performance.

Hence we concluded that the dabtree structure does not require
significant modification or optimisation to work well with typical
parent pointer attribute demands...

As for free space indexes....

The issue with the directory structure that requires external free
space is that the directory data is not part of the dabtree itself.
The attribute fork stores all the attributes at the leaves of the
dabtree, while the directory structure stores the directory data in
external blocks and the dabtree only contains the name hash index
that points to the external data.

i.e. When we add an attribute to the dabtree, we split/merge leaves
of the tree based on where the name hash index tells us it needs to
be inserted/removed from. i.e. we make space available or collapse
sparse leaves of the dabtree as a side effect of inserting or
removing objects.

The directory structure is very different. The dirents cannot change
location as their logical offset into the dir data segment is used
as the readdir/seekdir/telldir cookie. Therefore that location is
not allowed to change for the life of the dirent and so we can't
store them in the leaves of a dabtree indexed in hash order because
the offset into the tree would change as other entries are inserted
and removed.  Hence when we remove dirents, we must leave holes in
the data segment so the rest of the dirent data does not change
logical offset.

The directory name hash index - the dabtree bit - is in a separate
segment (the 2nd one). Because it only stores pointers to dirents in
the data segment, it doesn't need to leave holes - the dabtree just
merges/splits as required as pointers to the dir data segment are
added/removed - and has no free space tracking.

Hence when we go to add a dirent, we need to find the best free
space in the dir data segment to add that dirent. This requires a
dir data segment free space index, and that is held in the 3rd dir
segment.  Once we've found the best free space via lookup in the
free space index, we go modify the dir data block it points to, then
update the dabtree to point the name hash at that new dirent.

IOWs, the requirement for a free space map in the directory
structure results from storing the dirent data externally to the
dabtree. Attributes are stored directly in the leaves of the
dabtree - except for remote attributes which can be anywhere in the
BMBT address space - and hence do not need external free space
tracking to determine where to best insert them...
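
To illustrate why the directory needs that extra index, here is a
deliberately simplified sketch of a free space lookup. It assumes the
index is just an array holding the largest free region in each dir data
block; the function name and the first-fit policy are assumptions for
illustration and not the actual XFS implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical free space index lookup: best_free[i] is the largest
 * free region (in bytes) in dir data block i. Return the first block
 * that can hold a dirent of 'need' bytes, or -1 if none can and a new
 * data block would have to be added.
 */
static inline int
sketch_find_free_block(const uint16_t *best_free, size_t nblocks,
		       uint16_t need)
{
	assert(best_free != NULL || nblocks == 0);

	for (size_t i = 0; i < nblocks; i++) {
		if (best_free[i] >= need)
			return (int)i;
	}
	return -1;
}
```

An attr fork has no equivalent step: the hash of the name alone decides
which dabtree leaf the attribute lands in.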

Cheers,

Dave.

Dave Chinner April 7, 2020, 1:20 a.m. UTC | #4

On Sat, Apr 04, 2020 at 02:22:03PM +0530, Chandan Rajendra wrote:
> XFS has a per-inode xattr extent counter which is 16 bits wide. A workload
> which
> 1. Creates 1,000,000 255-byte sized xattrs,
> 2. Deletes 50% of these xattrs in an alternating manner,
> 3. Tries to create 400,000 new 255-byte sized xattrs
> causes the following message to be printed on the console,
> 
> XFS (loop0): xfs_iflush_int: detected corrupt incore inode 131, total extents = -19916, nblocks = 102937, ptr ffff9ce33b098c00
> XFS (loop0): xfs_do_force_shutdown(0x8) called from line 3739 of file fs/xfs/xfs_inode.c. Return address = ffffffffa4a94173
> 
> This indicates that we overflowed the 16-bits wide xattr extent counter.
> 
> I have been informed that there are instances where a single file has
>  > 100 million hardlinks. With parent pointers being stored in xattr,
> we will overflow the 16-bits wide xattr extent counter when large
> number of hardlinks are created.
> 
> Hence this commit extends xattr extent counter to 32-bits. It also introduces
> an incompat flag to prevent older kernels from mounting newer filesystems with
> 32-bit wide xattr extent counter.
> 
> Signed-off-by: Chandan Rajendra <chandanrlinux@gmail.com>
> ---
>  fs/xfs/libxfs/xfs_format.h     | 28 +++++++++++++++++++++-------
>  fs/xfs/libxfs/xfs_inode_buf.c  | 27 +++++++++++++++++++--------
>  fs/xfs/libxfs/xfs_inode_fork.c |  3 ++-
>  fs/xfs/libxfs/xfs_log_format.h |  5 +++--
>  fs/xfs/libxfs/xfs_types.h      |  4 ++--
>  fs/xfs/scrub/inode.c           |  7 ++++---
>  fs/xfs/xfs_inode_item.c        |  3 ++-
>  fs/xfs/xfs_log_recover.c       | 13 ++++++++++---
>  8 files changed, 63 insertions(+), 27 deletions(-)
> 
> diff --git a/fs/xfs/libxfs/xfs_format.h b/fs/xfs/libxfs/xfs_format.h
> index 045556e78ee2c..0a4266b0d46e1 100644
> --- a/fs/xfs/libxfs/xfs_format.h
> +++ b/fs/xfs/libxfs/xfs_format.h
> @@ -465,10 +465,12 @@ xfs_sb_has_ro_compat_feature(
>  #define XFS_SB_FEAT_INCOMPAT_FTYPE	(1 << 0)	/* filetype in dirent */
>  #define XFS_SB_FEAT_INCOMPAT_SPINODES	(1 << 1)	/* sparse inode chunks */
>  #define XFS_SB_FEAT_INCOMPAT_META_UUID	(1 << 2)	/* metadata UUID */
> +#define XFS_SB_FEAT_INCOMPAT_32BIT_AEXT_CNTR (1 << 3)
>  #define XFS_SB_FEAT_INCOMPAT_ALL \
>  		(XFS_SB_FEAT_INCOMPAT_FTYPE|	\
>  		 XFS_SB_FEAT_INCOMPAT_SPINODES|	\
> -		 XFS_SB_FEAT_INCOMPAT_META_UUID)
> +		 XFS_SB_FEAT_INCOMPAT_META_UUID| \
> +		 XFS_SB_FEAT_INCOMPAT_32BIT_AEXT_CNTR)
>  
>  #define XFS_SB_FEAT_INCOMPAT_UNKNOWN	~XFS_SB_FEAT_INCOMPAT_ALL
>  static inline bool
> @@ -874,7 +876,7 @@ typedef struct xfs_dinode {
>  	__be64		di_nblocks;	/* # of direct & btree blocks used */
>  	__be32		di_extsize;	/* basic/minimum extent size for file */
>  	__be32		di_nextents;	/* number of extents in data fork */
> -	__be16		di_anextents;	/* number of extents in attribute fork*/
> +	__be16		di_anextents_lo;/* lower part of xattr extent count */
>  	__u8		di_forkoff;	/* attr fork offs, <<3 for 64b align */
>  	__s8		di_aformat;	/* format of attr fork's data */
>  	__be32		di_dmevmask;	/* DMIG event mask */
> @@ -891,7 +893,8 @@ typedef struct xfs_dinode {
>  	__be64		di_lsn;		/* flush sequence */
>  	__be64		di_flags2;	/* more random flags */
>  	__be32		di_cowextsize;	/* basic cow extent size for file */
> -	__u8		di_pad2[12];	/* more padding for future expansion */
> +	__be16		di_anextents_hi;/* higher part of xattr extent count */
> +	__u8		di_pad2[10];	/* more padding for future expansion */

Ok, I think you've limited what we can do here by using this "fill
holes" variable split. I've never liked doing this, and we've only
done it in the past when we haven't had space in the inode to create
a new 32 bit variable.

IOWs, this is a v5 format feature only, so we should just create a
new variable:

	__be32		di_attr_nextents;

With that in place, we can now do what we did extending the v1 inode
link count (16 bits) to the v2 inode link count (32 bits).

That is, when the attribute count is going to overflow, we set an
inode flag on disk to indicate that it now has a 32 bit extent count
and uses that field in the inode, and we set a RO-compat feature
flag in the superblock to indicate that there are 32 bit attr fork
extent counts in use.

Old kernels can still read the filesystem and see the extent count
as "max" (65535), but they can't modify the attr fork and hence can't
corrupt the 32 bit count they know nothing about.

If the kernel sees the RO feature bit set, it can set the inode flag
on inodes it is modifying and update both the old and new counters
appropriately when flushing the inode to disk (i.e. transparent
conversion).

In future, mkfs can then set the RO feature flag by default so all
new filesystems use the 32 bit counter.
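
As a rough sketch of the read side of that scheme: the per-inode flag,
not the superblock, decides which field holds the count. The struct, the
flag name SKETCH_DIFLAG2_ATTR32 and the field names are hypothetical
stand-ins, and the fields are host-endian here to keep the example short.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_DIFLAG2_ATTR32	(1ULL << 0)	/* hypothetical per-inode flag */

/* Hypothetical, host-endian stand-in for the inode fields involved. */
struct sketch_dinode {
	uint64_t	flags2;
	uint16_t	anextents_16;	/* legacy 16 bit counter */
	uint32_t	attr_nextents;	/* proposed 32 bit counter */
};

/* Return the attr fork extent count from whichever field is valid. */
static inline uint32_t
sketch_attr_nextents(const struct sketch_dinode *dip)
{
	assert(dip != NULL);

	if (dip->flags2 & SKETCH_DIFLAG2_ATTR32)
		return dip->attr_nextents;
	return dip->anextents_16;
}
```

Inodes that never get near the old limit keep using the 16 bit field
untouched, which is what makes the conversion transparent.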

>  	/* fields only written to during inode creation */
>  	xfs_timestamp_t	di_crtime;	/* time created */
> @@ -993,10 +996,21 @@ enum xfs_dinode_fmt {
>  	((w) == XFS_DATA_FORK ? \
>  		(dip)->di_format : \
>  		(dip)->di_aformat)
> -#define XFS_DFORK_NEXTENTS(dip,w) \
> -	((w) == XFS_DATA_FORK ? \
> -		be32_to_cpu((dip)->di_nextents) : \
> -		be16_to_cpu((dip)->di_anextents))
> +
> +static inline int32_t XFS_DFORK_NEXTENTS(struct xfs_sb *sbp,

If you are converting a macro to static inline, then all the caller
sites should be converted to lower case at the same time.

> +					struct xfs_dinode *dip, int whichfork)
> +{
> +	int32_t anextents;

Extent counts should be unsigned, as they are on disk.

> +
> +	if (whichfork == XFS_DATA_FORK)
> +		return be32_to_cpu((dip)->di_nextents);
> +
> +	anextents = be16_to_cpu((dip)->di_anextents_lo);
> +	if (xfs_sb_version_has_v3inode(sbp))
> +		anextents |= ((u32)(be16_to_cpu((dip)->di_anextents_hi)) << 16);
> +
> +	return anextents;
> +}

No feature bit to indicate that 32 bit attribute extent counts are
valid?
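
Something along these lines, as a sketch only: the high half is trusted
only when the feature bit is set, rather than for every v3 inode. The
flag value is the one this patch defines; the helper name and its
host-endian arguments are hypothetical, and the endian conversion is
assumed to have been done by the caller.

```c
#include <assert.h>
#include <stdint.h>

/* Incompat flag as defined by the patch under review. */
#define XFS_SB_FEAT_INCOMPAT_32BIT_AEXT_CNTR	(1 << 3)

static_assert(sizeof(uint32_t) == 2 * sizeof(uint16_t),
	      "two 16 bit halves must fill the 32 bit count");

/*
 * Hypothetical helper: combine the two on-disk halves into an unsigned
 * count, ignoring the high half unless the feature bit is set.
 */
static inline uint32_t
sketch_attr_extent_count(uint32_t sb_features_incompat,
			 uint16_t anextents_lo, uint16_t anextents_hi)
{
	uint32_t	count = anextents_lo;

	if (sb_features_incompat & XFS_SB_FEAT_INCOMPAT_32BIT_AEXT_CNTR)
		count |= (uint32_t)anextents_hi << 16;
	return count;
}
```

Without the bit, whatever happens to sit in the old padding bytes never
leaks into the count.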

>  
>  /*
>   * For block and character special files the 32bit dev_t is stored at the
> diff --git a/fs/xfs/libxfs/xfs_inode_buf.c b/fs/xfs/libxfs/xfs_inode_buf.c
> index 39c5a6e24915c..ced8195bd8c22 100644
> --- a/fs/xfs/libxfs/xfs_inode_buf.c
> +++ b/fs/xfs/libxfs/xfs_inode_buf.c
> @@ -232,7 +232,8 @@ xfs_inode_from_disk(
>  	to->di_nblocks = be64_to_cpu(from->di_nblocks);
>  	to->di_extsize = be32_to_cpu(from->di_extsize);
>  	to->di_nextents = be32_to_cpu(from->di_nextents);
> -	to->di_anextents = be16_to_cpu(from->di_anextents);
> +	to->di_anextents = XFS_DFORK_NEXTENTS(&ip->i_mount->m_sb, from,
> +				XFS_ATTR_FORK);

This should be open coded, but I'd prefer a completely separate
variable...

>  	to->di_forkoff = from->di_forkoff;
>  	to->di_aformat	= from->di_aformat;
>  	to->di_dmevmask	= be32_to_cpu(from->di_dmevmask);
> @@ -282,7 +283,7 @@ xfs_inode_to_disk(
>  	to->di_nblocks = cpu_to_be64(from->di_nblocks);
>  	to->di_extsize = cpu_to_be32(from->di_extsize);
>  	to->di_nextents = cpu_to_be32(from->di_nextents);
> -	to->di_anextents = cpu_to_be16(from->di_anextents);
> +	to->di_anextents_lo = cpu_to_be16((u32)(from->di_anextents) & 0xffff);
>  	to->di_forkoff = from->di_forkoff;
>  	to->di_aformat = from->di_aformat;
>  	to->di_dmevmask = cpu_to_be32(from->di_dmevmask);
> @@ -296,6 +297,8 @@ xfs_inode_to_disk(
>  		to->di_crtime.t_nsec = cpu_to_be32(from->di_crtime.tv_nsec);
>  		to->di_flags2 = cpu_to_be64(from->di_flags2);
>  		to->di_cowextsize = cpu_to_be32(from->di_cowextsize);
> +		to->di_anextents_hi
> +			= cpu_to_be16((u32)(from->di_anextents) >> 16);

Again, feature bit for on-disk format modifications needed...

>  		to->di_ino = cpu_to_be64(ip->i_ino);
>  		to->di_lsn = cpu_to_be64(lsn);
>  		memset(to->di_pad2, 0, sizeof(to->di_pad2));
> @@ -335,7 +338,7 @@ xfs_log_dinode_to_disk(
>  	to->di_nblocks = cpu_to_be64(from->di_nblocks);
>  	to->di_extsize = cpu_to_be32(from->di_extsize);
>  	to->di_nextents = cpu_to_be32(from->di_nextents);
> -	to->di_anextents = cpu_to_be16(from->di_anextents);
> +	to->di_anextents_lo = cpu_to_be16(from->di_anextents_lo);
>  	to->di_forkoff = from->di_forkoff;
>  	to->di_aformat = from->di_aformat;
>  	to->di_dmevmask = cpu_to_be32(from->di_dmevmask);
> @@ -349,6 +352,7 @@ xfs_log_dinode_to_disk(
>  		to->di_crtime.t_nsec = cpu_to_be32(from->di_crtime.t_nsec);
>  		to->di_flags2 = cpu_to_be64(from->di_flags2);
>  		to->di_cowextsize = cpu_to_be32(from->di_cowextsize);
> +		to->di_anextents_hi = cpu_to_be16(from->di_anextents_hi);
>  		to->di_ino = cpu_to_be64(from->di_ino);
>  		to->di_lsn = cpu_to_be64(from->di_lsn);
>  		memcpy(to->di_pad2, from->di_pad2, sizeof(to->di_pad2));
> @@ -365,7 +369,9 @@ xfs_dinode_verify_fork(
>  	struct xfs_mount	*mp,
>  	int			whichfork)
>  {
> -	uint32_t		di_nextents = XFS_DFORK_NEXTENTS(dip, whichfork);
> +	uint32_t		di_nextents;
> +
> +	di_nextents = XFS_DFORK_NEXTENTS(&mp->m_sb, dip, whichfork);
>  
>  	switch (XFS_DFORK_FORMAT(dip, whichfork)) {
>  	case XFS_DINODE_FMT_LOCAL:
> @@ -436,6 +442,9 @@ xfs_dinode_verify(
>  	uint16_t		flags;
>  	uint64_t		flags2;
>  	uint64_t		di_size;
> +	int32_t			nextents;
> +	int32_t			anextents;
> +	int64_t			nblocks;

Extent counts need to be converted to unsigned in memory - they are
unsigned on disk....
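
A sketch of the fork check with unsigned counts, assuming both counters
can be up to 32 bits wide. The helper name is hypothetical. The sum is
widened to 64 bits before the comparison because two 32 bit counts added
together can wrap.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

static_assert(sizeof(uint64_t) > sizeof(uint32_t),
	      "the sum needs more room than either count");

/*
 * Hypothetical helper: true if the combined data and attr fork extent
 * counts claim more extents than the inode has blocks.
 */
static inline bool
sketch_extents_exceed_nblocks(uint32_t nextents, uint32_t anextents,
			      uint64_t nblocks)
{
	return (uint64_t)nextents + anextents > nblocks;
}
```

With plain 32 bit addition, 0xffffffff + 1 would wrap to 0 and the check
would pass when it should fail.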

>  
>  	if (dip->di_magic != cpu_to_be16(XFS_DINODE_MAGIC))
>  		return __this_address;
> @@ -466,10 +475,12 @@ xfs_dinode_verify(
>  	if ((S_ISLNK(mode) || S_ISDIR(mode)) && di_size == 0)
>  		return __this_address;
>  
> +	nextents = XFS_DFORK_NEXTENTS(&mp->m_sb, dip, XFS_DATA_FORK);
> +	anextents = XFS_DFORK_NEXTENTS(&mp->m_sb, dip, XFS_ATTR_FORK);
> +	nblocks = be64_to_cpu(dip->di_nblocks);
> +
>  	/* Fork checks carried over from xfs_iformat_fork */
> -	if (mode &&
> -	    be32_to_cpu(dip->di_nextents) + be16_to_cpu(dip->di_anextents) >
> -			be64_to_cpu(dip->di_nblocks))
> +	if (mode && nextents + anextents > nblocks)
>  		return __this_address;
>  
>  	if (mode && XFS_DFORK_BOFF(dip) > mp->m_sb.sb_inodesize)
> @@ -526,7 +537,7 @@ xfs_dinode_verify(
>  		default:
>  			return __this_address;
>  		}
> -		if (dip->di_anextents)
> +		if (XFS_DFORK_NEXTENTS(&mp->m_sb, dip, XFS_ATTR_FORK))
>  			return __this_address;
>  	}
>  
> diff --git a/fs/xfs/libxfs/xfs_inode_fork.c b/fs/xfs/libxfs/xfs_inode_fork.c
> index 518c6f0ec3a61..080fd0c156a1e 100644
> --- a/fs/xfs/libxfs/xfs_inode_fork.c
> +++ b/fs/xfs/libxfs/xfs_inode_fork.c
> @@ -207,9 +207,10 @@ xfs_iformat_extents(
>  	int			whichfork)
>  {
>  	struct xfs_mount	*mp = ip->i_mount;
> +	struct xfs_sb		*sb = &mp->m_sb;
>  	struct xfs_ifork	*ifp = XFS_IFORK_PTR(ip, whichfork);
>  	int			state = xfs_bmap_fork_to_state(whichfork);
> -	int			nex = XFS_DFORK_NEXTENTS(dip, whichfork);
> +	int			nex = XFS_DFORK_NEXTENTS(sb, dip, whichfork);
>  	int			size = nex * sizeof(xfs_bmbt_rec_t);
>  	struct xfs_iext_cursor	icur;
>  	struct xfs_bmbt_rec	*dp;
> diff --git a/fs/xfs/libxfs/xfs_log_format.h b/fs/xfs/libxfs/xfs_log_format.h
> index e3400c9c71cdb..5db92aa508bc5 100644
> --- a/fs/xfs/libxfs/xfs_log_format.h
> +++ b/fs/xfs/libxfs/xfs_log_format.h
> @@ -397,7 +397,7 @@ struct xfs_log_dinode {
>  	xfs_rfsblock_t	di_nblocks;	/* # of direct & btree blocks used */
>  	xfs_extlen_t	di_extsize;	/* basic/minimum extent size for file */
>  	xfs_extnum_t	di_nextents;	/* number of extents in data fork */
> -	xfs_aextnum_t	di_anextents;	/* number of extents in attribute fork*/
> +	uint16_t	di_anextents_lo;/* lower part of xattr extent count */
>  	uint8_t		di_forkoff;	/* attr fork offs, <<3 for 64b align */
>  	int8_t		di_aformat;	/* format of attr fork's data */
>  	uint32_t	di_dmevmask;	/* DMIG event mask */
> @@ -414,7 +414,8 @@ struct xfs_log_dinode {
>  	xfs_lsn_t	di_lsn;		/* flush sequence */
>  	uint64_t	di_flags2;	/* more random flags */
>  	uint32_t	di_cowextsize;	/* basic cow extent size for file */
> -	uint8_t		di_pad2[12];	/* more padding for future expansion */
> +	uint16_t	di_anextents_hi;/* higher part of xattr extent count */

So, unsigned in the log, as on disk...

> +	uint8_t		di_pad2[10];	/* more padding for future expansion */
>  
>  	/* fields only written to during inode creation */
>  	xfs_ictimestamp_t di_crtime;	/* time created */
> diff --git a/fs/xfs/libxfs/xfs_types.h b/fs/xfs/libxfs/xfs_types.h
> index 397d94775440d..01669aa65745a 100644
> --- a/fs/xfs/libxfs/xfs_types.h
> +++ b/fs/xfs/libxfs/xfs_types.h
> @@ -13,7 +13,7 @@ typedef uint32_t	xfs_agino_t;	/* inode # within allocation grp */
>  typedef uint32_t	xfs_extlen_t;	/* extent length in blocks */
>  typedef uint32_t	xfs_agnumber_t;	/* allocation group number */
>  typedef int32_t		xfs_extnum_t;	/* # of extents in a file */
> -typedef int16_t		xfs_aextnum_t;	/* # extents in an attribute fork */
> +typedef int32_t		xfs_aextnum_t;	/* # extents in an attribute fork */

.... but not in memory?

>  typedef int64_t		xfs_fsize_t;	/* bytes in a file */
>  typedef uint64_t	xfs_ufsize_t;	/* unsigned bytes in a file */
>  
> @@ -60,7 +60,7 @@ typedef void *		xfs_failaddr_t;
>   */
>  #define	MAXEXTLEN	((xfs_extlen_t)0x001fffff)	/* 21 bits */
>  #define	MAXEXTNUM	((xfs_extnum_t)0x7fffffff)	/* signed int */
> -#define	MAXAEXTNUM	((xfs_aextnum_t)0x7fff)		/* signed short */
> +#define	MAXAEXTNUM	((xfs_aextnum_t)0x7fffffff)	/* signed int */

What about for older filesystems where MAXAEXTNUM is unchanged?

> diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c
> index 11c3502b07b13..ba3fae95b2260 100644
> --- a/fs/xfs/xfs_log_recover.c
> +++ b/fs/xfs/xfs_log_recover.c
> @@ -2922,6 +2922,7 @@ xlog_recover_inode_pass2(
>  	struct xfs_log_dinode	*ldip;
>  	uint			isize;
>  	int			need_free = 0;
> +	uint32_t		nextents;
>  
>  	if (item->ri_buf[0].i_len == sizeof(struct xfs_inode_log_format)) {
>  		in_f = item->ri_buf[0].i_addr;
> @@ -3044,7 +3045,14 @@ xlog_recover_inode_pass2(
>  			goto out_release;
>  		}
>  	}
> -	if (unlikely(ldip->di_nextents + ldip->di_anextents > ldip->di_nblocks)){
> +
> +	nextents = ldip->di_anextents_lo;
> +	if (xfs_sb_version_has_v3inode(&mp->m_sb))
> +		nextents |= ((u32)(ldip->di_anextents_hi) << 16);

What happens if we are recovering from a filesystem that doesn't
know anything about di_anextents_hi and never wrote anything to
the log for this field?

Cheers,

Dave.

Chandan Rajendra April 8, 2020, 12:40 p.m. UTC | #5

On Monday, April 6, 2020 10:15 PM Brian Foster wrote: 
> On Sat, Apr 04, 2020 at 02:22:03PM +0530, Chandan Rajendra wrote:
> > XFS has a per-inode xattr extent counter which is 16 bits wide. A workload
> > which
> > 1. Creates 1,000,000 255-byte sized xattrs,
> > 2. Deletes 50% of these xattrs in an alternating manner,
> > 3. Tries to create 400,000 new 255-byte sized xattrs
> > causes the following message to be printed on the console,
> > 
> > XFS (loop0): xfs_iflush_int: detected corrupt incore inode 131, total extents = -19916, nblocks = 102937, ptr ffff9ce33b098c00
> > XFS (loop0): xfs_do_force_shutdown(0x8) called from line 3739 of file fs/xfs/xfs_inode.c. Return address = ffffffffa4a94173
> > 
> > This indicates that we overflowed the 16-bits wide xattr extent counter.
> > 
> > I have been informed that there are instances where a single file has
> >  > 100 million hardlinks. With parent pointers being stored in xattr,
> > we will overflow the 16-bits wide xattr extent counter when large
> > number of hardlinks are created.
> > 
> > Hence this commit extends xattr extent counter to 32-bits. It also introduces
> > an incompat flag to prevent older kernels from mounting newer filesystems with
> > 32-bit wide xattr extent counter.
> > 
> 
> Just a couple high level comments on the first pass...
> 
> It looks like the feature bit is only set by mkfs. That raises a couple
> questions. First, what about a fix for older/existing filesystems? Even
> if we can't exceed the 16bit extent count, I would think we should be
> able to fail more gracefully than allowing a write verifier to fail and
> shutdown the fs. What happens when/if we run into a data fork extent
> count limit, for example?

Yes, I agree that for older filesystems I should write a separate patch to
check for the 16-bit overflow case.

This applies to the data fork extent counter as well. Dave was suggesting that
we should change that to a 64-bit value. That would be my next work item.

> 
> Second, I also wonder if enabling an incompat feature bit by default in
> mkfs is a bit extreme. Perhaps this should be tied to a mkfs flag for a
> period of time? Maybe others have thoughts on that, but I'd at minimum
> request to introduce and enable said bit by default in separate patches
> to make it a bit easier for distro releases to identify and manage the
> incompatibility.

Dave has suggested that we should have a new 32-bit field in the inode. When
we are about to overflow the existing 16-bit counter limit, we set a per-inode
flag and also a RO-compat feature flag in the superblock.

When flushing an inode to disk, if the RO-compat feature flag is set, then we
set the corresponding inode flag and move over the 16-bit counter to the new
32-bit counter. Also, the RO feature flag can be set by default by mkfs
sometime in the future.
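
A rough sketch of that flush-time conversion, to check my understanding.
The struct, the flag SKETCH_DIFLAG2_ATTR32 and the function name are
hypothetical, the fields are host-endian for brevity, and saturating the
old 16-bit field at 65535 is my reading of the "max" suggestion rather
than something that has been agreed on.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_DIFLAG2_ATTR32	(1ULL << 0)	/* hypothetical per-inode flag */

/* Hypothetical, host-endian stand-in for the on-disk inode fields. */
struct sketch_dinode {
	uint64_t	flags2;
	uint16_t	anextents_16;	/* legacy 16-bit counter */
	uint32_t	attr_nextents;	/* proposed 32-bit counter */
};

/*
 * Write the in-core attr extent count into the on-disk fields.
 * Returns 0 on success, -1 if the count cannot be represented.
 */
static inline int
sketch_flush_attr_count(struct sketch_dinode *dip, uint32_t incore,
			bool sb_has_ro_feature)
{
	assert(dip != NULL);

	if (!sb_has_ro_feature) {
		if (incore > UINT16_MAX)
			return -1;	/* feature must be enabled first */
		dip->anextents_16 = (uint16_t)incore;
		return 0;
	}

	dip->flags2 |= SKETCH_DIFLAG2_ATTR32;
	dip->attr_nextents = incore;
	/* Assumption: the old field saturates so old kernels see "max". */
	dip->anextents_16 = incore > UINT16_MAX ? UINT16_MAX : (uint16_t)incore;
	return 0;
}
```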

> 
> Brian
> 
> > Signed-off-by: Chandan Rajendra <chandanrlinux@gmail.com>
> > ---
> >  fs/xfs/libxfs/xfs_format.h     | 28 +++++++++++++++++++++-------
> >  fs/xfs/libxfs/xfs_inode_buf.c  | 27 +++++++++++++++++++--------
> >  fs/xfs/libxfs/xfs_inode_fork.c |  3 ++-
> >  fs/xfs/libxfs/xfs_log_format.h |  5 +++--
> >  fs/xfs/libxfs/xfs_types.h      |  4 ++--
> >  fs/xfs/scrub/inode.c           |  7 ++++---
> >  fs/xfs/xfs_inode_item.c        |  3 ++-
> >  fs/xfs/xfs_log_recover.c       | 13 ++++++++++---
> >  8 files changed, 63 insertions(+), 27 deletions(-)
> > 
> > diff --git a/fs/xfs/libxfs/xfs_format.h b/fs/xfs/libxfs/xfs_format.h
> > index 045556e78ee2c..0a4266b0d46e1 100644
> > --- a/fs/xfs/libxfs/xfs_format.h
> > +++ b/fs/xfs/libxfs/xfs_format.h
> > @@ -465,10 +465,12 @@ xfs_sb_has_ro_compat_feature(
> >  #define XFS_SB_FEAT_INCOMPAT_FTYPE	(1 << 0)	/* filetype in dirent */
> >  #define XFS_SB_FEAT_INCOMPAT_SPINODES	(1 << 1)	/* sparse inode chunks */
> >  #define XFS_SB_FEAT_INCOMPAT_META_UUID	(1 << 2)	/* metadata UUID */
> > +#define XFS_SB_FEAT_INCOMPAT_32BIT_AEXT_CNTR (1 << 3)
> >  #define XFS_SB_FEAT_INCOMPAT_ALL \
> >  		(XFS_SB_FEAT_INCOMPAT_FTYPE|	\
> >  		 XFS_SB_FEAT_INCOMPAT_SPINODES|	\
> > -		 XFS_SB_FEAT_INCOMPAT_META_UUID)
> > +		 XFS_SB_FEAT_INCOMPAT_META_UUID| \
> > +		 XFS_SB_FEAT_INCOMPAT_32BIT_AEXT_CNTR)
> >  
> >  #define XFS_SB_FEAT_INCOMPAT_UNKNOWN	~XFS_SB_FEAT_INCOMPAT_ALL
> >  static inline bool
> > @@ -874,7 +876,7 @@ typedef struct xfs_dinode {
> >  	__be64		di_nblocks;	/* # of direct & btree blocks used */
> >  	__be32		di_extsize;	/* basic/minimum extent size for file */
> >  	__be32		di_nextents;	/* number of extents in data fork */
> > -	__be16		di_anextents;	/* number of extents in attribute fork*/
> > +	__be16		di_anextents_lo;/* lower part of xattr extent count */
> >  	__u8		di_forkoff;	/* attr fork offs, <<3 for 64b align */
> >  	__s8		di_aformat;	/* format of attr fork's data */
> >  	__be32		di_dmevmask;	/* DMIG event mask */
> > @@ -891,7 +893,8 @@ typedef struct xfs_dinode {
> >  	__be64		di_lsn;		/* flush sequence */
> >  	__be64		di_flags2;	/* more random flags */
> >  	__be32		di_cowextsize;	/* basic cow extent size for file */
> > -	__u8		di_pad2[12];	/* more padding for future expansion */
> > +	__be16		di_anextents_hi;/* higher part of xattr extent count */
> > +	__u8		di_pad2[10];	/* more padding for future expansion */
> >  
> >  	/* fields only written to during inode creation */
> >  	xfs_timestamp_t	di_crtime;	/* time created */
> > @@ -993,10 +996,21 @@ enum xfs_dinode_fmt {
> >  	((w) == XFS_DATA_FORK ? \
> >  		(dip)->di_format : \
> >  		(dip)->di_aformat)
> > -#define XFS_DFORK_NEXTENTS(dip,w) \
> > -	((w) == XFS_DATA_FORK ? \
> > -		be32_to_cpu((dip)->di_nextents) : \
> > -		be16_to_cpu((dip)->di_anextents))
> > +
> > +static inline int32_t XFS_DFORK_NEXTENTS(struct xfs_sb *sbp,
> > +					struct xfs_dinode *dip, int whichfork)
> > +{
> > +	int32_t anextents;
> > +
> > +	if (whichfork == XFS_DATA_FORK)
> > +		return be32_to_cpu((dip)->di_nextents);
> > +
> > +	anextents = be16_to_cpu((dip)->di_anextents_lo);
> > +	if (xfs_sb_version_has_v3inode(sbp))
> > +		anextents |= ((u32)(be16_to_cpu((dip)->di_anextents_hi)) << 16);
> > +
> > +	return anextents;
> > +}
> >  
> >  /*
> >   * For block and character special files the 32bit dev_t is stored at the
> > diff --git a/fs/xfs/libxfs/xfs_inode_buf.c b/fs/xfs/libxfs/xfs_inode_buf.c
> > index 39c5a6e24915c..ced8195bd8c22 100644
> > --- a/fs/xfs/libxfs/xfs_inode_buf.c
> > +++ b/fs/xfs/libxfs/xfs_inode_buf.c
> > @@ -232,7 +232,8 @@ xfs_inode_from_disk(
> >  	to->di_nblocks = be64_to_cpu(from->di_nblocks);
> >  	to->di_extsize = be32_to_cpu(from->di_extsize);
> >  	to->di_nextents = be32_to_cpu(from->di_nextents);
> > -	to->di_anextents = be16_to_cpu(from->di_anextents);
> > +	to->di_anextents = XFS_DFORK_NEXTENTS(&ip->i_mount->m_sb, from,
> > +				XFS_ATTR_FORK);
> >  	to->di_forkoff = from->di_forkoff;
> >  	to->di_aformat	= from->di_aformat;
> >  	to->di_dmevmask	= be32_to_cpu(from->di_dmevmask);
> > @@ -282,7 +283,7 @@ xfs_inode_to_disk(
> >  	to->di_nblocks = cpu_to_be64(from->di_nblocks);
> >  	to->di_extsize = cpu_to_be32(from->di_extsize);
> >  	to->di_nextents = cpu_to_be32(from->di_nextents);
> > -	to->di_anextents = cpu_to_be16(from->di_anextents);
> > +	to->di_anextents_lo = cpu_to_be16((u32)(from->di_anextents) & 0xffff);
> >  	to->di_forkoff = from->di_forkoff;
> >  	to->di_aformat = from->di_aformat;
> >  	to->di_dmevmask = cpu_to_be32(from->di_dmevmask);
> > @@ -296,6 +297,8 @@ xfs_inode_to_disk(
> >  		to->di_crtime.t_nsec = cpu_to_be32(from->di_crtime.tv_nsec);
> >  		to->di_flags2 = cpu_to_be64(from->di_flags2);
> >  		to->di_cowextsize = cpu_to_be32(from->di_cowextsize);
> > +		to->di_anextents_hi
> > +			= cpu_to_be16((u32)(from->di_anextents) >> 16);
> >  		to->di_ino = cpu_to_be64(ip->i_ino);
> >  		to->di_lsn = cpu_to_be64(lsn);
> >  		memset(to->di_pad2, 0, sizeof(to->di_pad2));
> > @@ -335,7 +338,7 @@ xfs_log_dinode_to_disk(
> >  	to->di_nblocks = cpu_to_be64(from->di_nblocks);
> >  	to->di_extsize = cpu_to_be32(from->di_extsize);
> >  	to->di_nextents = cpu_to_be32(from->di_nextents);
> > -	to->di_anextents = cpu_to_be16(from->di_anextents);
> > +	to->di_anextents_lo = cpu_to_be16(from->di_anextents_lo);
> >  	to->di_forkoff = from->di_forkoff;
> >  	to->di_aformat = from->di_aformat;
> >  	to->di_dmevmask = cpu_to_be32(from->di_dmevmask);
> > @@ -349,6 +352,7 @@ xfs_log_dinode_to_disk(
> >  		to->di_crtime.t_nsec = cpu_to_be32(from->di_crtime.t_nsec);
> >  		to->di_flags2 = cpu_to_be64(from->di_flags2);
> >  		to->di_cowextsize = cpu_to_be32(from->di_cowextsize);
> > +		to->di_anextents_hi = cpu_to_be16(from->di_anextents_hi);
> >  		to->di_ino = cpu_to_be64(from->di_ino);
> >  		to->di_lsn = cpu_to_be64(from->di_lsn);
> >  		memcpy(to->di_pad2, from->di_pad2, sizeof(to->di_pad2));
> > @@ -365,7 +369,9 @@ xfs_dinode_verify_fork(
> >  	struct xfs_mount	*mp,
> >  	int			whichfork)
> >  {
> > -	uint32_t		di_nextents = XFS_DFORK_NEXTENTS(dip, whichfork);
> > +	uint32_t		di_nextents;
> > +
> > +	di_nextents = XFS_DFORK_NEXTENTS(&mp->m_sb, dip, whichfork);
> >  
> >  	switch (XFS_DFORK_FORMAT(dip, whichfork)) {
> >  	case XFS_DINODE_FMT_LOCAL:
> > @@ -436,6 +442,9 @@ xfs_dinode_verify(
> >  	uint16_t		flags;
> >  	uint64_t		flags2;
> >  	uint64_t		di_size;
> > +	int32_t			nextents;
> > +	int32_t			anextents;
> > +	int64_t			nblocks;
> >  
> >  	if (dip->di_magic != cpu_to_be16(XFS_DINODE_MAGIC))
> >  		return __this_address;
> > @@ -466,10 +475,12 @@ xfs_dinode_verify(
> >  	if ((S_ISLNK(mode) || S_ISDIR(mode)) && di_size == 0)
> >  		return __this_address;
> >  
> > +	nextents = XFS_DFORK_NEXTENTS(&mp->m_sb, dip, XFS_DATA_FORK);
> > +	anextents = XFS_DFORK_NEXTENTS(&mp->m_sb, dip, XFS_ATTR_FORK);
> > +	nblocks = be64_to_cpu(dip->di_nblocks);
> > +
> >  	/* Fork checks carried over from xfs_iformat_fork */
> > -	if (mode &&
> > -	    be32_to_cpu(dip->di_nextents) + be16_to_cpu(dip->di_anextents) >
> > -			be64_to_cpu(dip->di_nblocks))
> > +	if (mode && nextents + anextents > nblocks)
> >  		return __this_address;
> >  
> >  	if (mode && XFS_DFORK_BOFF(dip) > mp->m_sb.sb_inodesize)
> > @@ -526,7 +537,7 @@ xfs_dinode_verify(
> >  		default:
> >  			return __this_address;
> >  		}
> > -		if (dip->di_anextents)
> > +		if (XFS_DFORK_NEXTENTS(&mp->m_sb, dip, XFS_ATTR_FORK))
> >  			return __this_address;
> >  	}
> >  
> > diff --git a/fs/xfs/libxfs/xfs_inode_fork.c b/fs/xfs/libxfs/xfs_inode_fork.c
> > index 518c6f0ec3a61..080fd0c156a1e 100644
> > --- a/fs/xfs/libxfs/xfs_inode_fork.c
> > +++ b/fs/xfs/libxfs/xfs_inode_fork.c
> > @@ -207,9 +207,10 @@ xfs_iformat_extents(
> >  	int			whichfork)
> >  {
> >  	struct xfs_mount	*mp = ip->i_mount;
> > +	struct xfs_sb		*sb = &mp->m_sb;
> >  	struct xfs_ifork	*ifp = XFS_IFORK_PTR(ip, whichfork);
> >  	int			state = xfs_bmap_fork_to_state(whichfork);
> > -	int			nex = XFS_DFORK_NEXTENTS(dip, whichfork);
> > +	int			nex = XFS_DFORK_NEXTENTS(sb, dip, whichfork);
> >  	int			size = nex * sizeof(xfs_bmbt_rec_t);
> >  	struct xfs_iext_cursor	icur;
> >  	struct xfs_bmbt_rec	*dp;
> > diff --git a/fs/xfs/libxfs/xfs_log_format.h b/fs/xfs/libxfs/xfs_log_format.h
> > index e3400c9c71cdb..5db92aa508bc5 100644
> > --- a/fs/xfs/libxfs/xfs_log_format.h
> > +++ b/fs/xfs/libxfs/xfs_log_format.h
> > @@ -397,7 +397,7 @@ struct xfs_log_dinode {
> >  	xfs_rfsblock_t	di_nblocks;	/* # of direct & btree blocks used */
> >  	xfs_extlen_t	di_extsize;	/* basic/minimum extent size for file */
> >  	xfs_extnum_t	di_nextents;	/* number of extents in data fork */
> > -	xfs_aextnum_t	di_anextents;	/* number of extents in attribute fork*/
> > +	uint16_t	di_anextents_lo;/* lower part of xattr extent count */
> >  	uint8_t		di_forkoff;	/* attr fork offs, <<3 for 64b align */
> >  	int8_t		di_aformat;	/* format of attr fork's data */
> >  	uint32_t	di_dmevmask;	/* DMIG event mask */
> > @@ -414,7 +414,8 @@ struct xfs_log_dinode {
> >  	xfs_lsn_t	di_lsn;		/* flush sequence */
> >  	uint64_t	di_flags2;	/* more random flags */
> >  	uint32_t	di_cowextsize;	/* basic cow extent size for file */
> > -	uint8_t		di_pad2[12];	/* more padding for future expansion */
> > +	uint16_t	di_anextents_hi;/* higher part of xattr extent count */
> > +	uint8_t		di_pad2[10];	/* more padding for future expansion */
> >  
> >  	/* fields only written to during inode creation */
> >  	xfs_ictimestamp_t di_crtime;	/* time created */
> > diff --git a/fs/xfs/libxfs/xfs_types.h b/fs/xfs/libxfs/xfs_types.h
> > index 397d94775440d..01669aa65745a 100644
> > --- a/fs/xfs/libxfs/xfs_types.h
> > +++ b/fs/xfs/libxfs/xfs_types.h
> > @@ -13,7 +13,7 @@ typedef uint32_t	xfs_agino_t;	/* inode # within allocation grp */
> >  typedef uint32_t	xfs_extlen_t;	/* extent length in blocks */
> >  typedef uint32_t	xfs_agnumber_t;	/* allocation group number */
> >  typedef int32_t		xfs_extnum_t;	/* # of extents in a file */
> > -typedef int16_t		xfs_aextnum_t;	/* # extents in an attribute fork */
> > +typedef int32_t		xfs_aextnum_t;	/* # extents in an attribute fork */
> >  typedef int64_t		xfs_fsize_t;	/* bytes in a file */
> >  typedef uint64_t	xfs_ufsize_t;	/* unsigned bytes in a file */
> >  
> > @@ -60,7 +60,7 @@ typedef void *		xfs_failaddr_t;
> >   */
> >  #define	MAXEXTLEN	((xfs_extlen_t)0x001fffff)	/* 21 bits */
> >  #define	MAXEXTNUM	((xfs_extnum_t)0x7fffffff)	/* signed int */
> > -#define	MAXAEXTNUM	((xfs_aextnum_t)0x7fff)		/* signed short */
> > +#define	MAXAEXTNUM	((xfs_aextnum_t)0x7fffffff)	/* signed int */
> >  
> >  /*
> >   * Minimum and maximum blocksize and sectorsize.
> > diff --git a/fs/xfs/scrub/inode.c b/fs/xfs/scrub/inode.c
> > index 6d483ab29e639..3b624e24ae868 100644
> > --- a/fs/xfs/scrub/inode.c
> > +++ b/fs/xfs/scrub/inode.c
> > @@ -371,10 +371,12 @@ xchk_dinode(
> >  		break;
> >  	}
> >  
> > +	nextents = XFS_DFORK_NEXTENTS(&mp->m_sb, dip, XFS_ATTR_FORK);
> > +
> >  	/* di_forkoff */
> >  	if (XFS_DFORK_APTR(dip) >= (char *)dip + mp->m_sb.sb_inodesize)
> >  		xchk_ino_set_corrupt(sc, ino);
> > -	if (dip->di_anextents != 0 && dip->di_forkoff == 0)
> > +	if (nextents != 0 && dip->di_forkoff == 0)
> >  		xchk_ino_set_corrupt(sc, ino);
> >  	if (dip->di_forkoff == 0 && dip->di_aformat != XFS_DINODE_FMT_EXTENTS)
> >  		xchk_ino_set_corrupt(sc, ino);
> > @@ -386,7 +388,6 @@ xchk_dinode(
> >  		xchk_ino_set_corrupt(sc, ino);
> >  
> >  	/* di_anextents */
> > -	nextents = be16_to_cpu(dip->di_anextents);
> >  	fork_recs =  XFS_DFORK_ASIZE(dip, mp) / sizeof(struct xfs_bmbt_rec);
> >  	switch (dip->di_aformat) {
> >  	case XFS_DINODE_FMT_EXTENTS:
> > @@ -484,7 +485,7 @@ xchk_inode_xref_bmap(
> >  			&nextents, &acount);
> >  	if (!xchk_should_check_xref(sc, &error, NULL))
> >  		return;
> > -	if (nextents != be16_to_cpu(dip->di_anextents))
> > +	if (nextents != XFS_DFORK_NEXTENTS(&sc->mp->m_sb, dip, XFS_ATTR_FORK))
> >  		xchk_ino_xref_set_corrupt(sc, sc->ip->i_ino);
> >  
> >  	/* Check nblocks against the inode. */
> > diff --git a/fs/xfs/xfs_inode_item.c b/fs/xfs/xfs_inode_item.c
> > index 4a3d13d4a0228..dff20f2b368ea 100644
> > --- a/fs/xfs/xfs_inode_item.c
> > +++ b/fs/xfs/xfs_inode_item.c
> > @@ -327,7 +327,7 @@ xfs_inode_to_log_dinode(
> >  	to->di_nblocks = from->di_nblocks;
> >  	to->di_extsize = from->di_extsize;
> >  	to->di_nextents = from->di_nextents;
> > -	to->di_anextents = from->di_anextents;
> > +	to->di_anextents_lo = ((u32)(from->di_anextents)) & 0xffff;
> >  	to->di_forkoff = from->di_forkoff;
> >  	to->di_aformat = from->di_aformat;
> >  	to->di_dmevmask = from->di_dmevmask;
> > @@ -344,6 +344,7 @@ xfs_inode_to_log_dinode(
> >  		to->di_crtime.t_nsec = from->di_crtime.tv_nsec;
> >  		to->di_flags2 = from->di_flags2;
> >  		to->di_cowextsize = from->di_cowextsize;
> > +		to->di_anextents_hi = ((u32)(from->di_anextents)) >> 16;
> >  		to->di_ino = ip->i_ino;
> >  		to->di_lsn = lsn;
> >  		memset(to->di_pad2, 0, sizeof(to->di_pad2));
> > diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c
> > index 11c3502b07b13..ba3fae95b2260 100644
> > --- a/fs/xfs/xfs_log_recover.c
> > +++ b/fs/xfs/xfs_log_recover.c
> > @@ -2922,6 +2922,7 @@ xlog_recover_inode_pass2(
> >  	struct xfs_log_dinode	*ldip;
> >  	uint			isize;
> >  	int			need_free = 0;
> > +	uint32_t		nextents;
> >  
> >  	if (item->ri_buf[0].i_len == sizeof(struct xfs_inode_log_format)) {
> >  		in_f = item->ri_buf[0].i_addr;
> > @@ -3044,7 +3045,14 @@ xlog_recover_inode_pass2(
> >  			goto out_release;
> >  		}
> >  	}
> > -	if (unlikely(ldip->di_nextents + ldip->di_anextents > ldip->di_nblocks)){
> > +
> > +	nextents = ldip->di_anextents_lo;
> > +	if (xfs_sb_version_has_v3inode(&mp->m_sb))
> > +		nextents |= ((u32)(ldip->di_anextents_hi) << 16);
> > +
> > +	nextents += ldip->di_nextents;
> > +
> > +	if (unlikely(nextents > ldip->di_nblocks)) {
> >  		XFS_CORRUPTION_ERROR("xlog_recover_inode_pass2(5)",
> >  				     XFS_ERRLEVEL_LOW, mp, ldip,
> >  				     sizeof(*ldip));
> > @@ -3052,8 +3060,7 @@ xlog_recover_inode_pass2(
> >  	"%s: Bad inode log record, rec ptr "PTR_FMT", dino ptr "PTR_FMT", "
> >  	"dino bp "PTR_FMT", ino %Ld, total extents = %d, nblocks = %Ld",
> >  			__func__, item, dip, bp, in_f->ilf_ino,
> > -			ldip->di_nextents + ldip->di_anextents,
> > -			ldip->di_nblocks);
> > +			nextents, ldip->di_nblocks);
> >  		error = -EFSCORRUPTED;
> >  		goto out_release;
> >  	}
> 
>

Chandan Rajendra April 8, 2020, 12:42 p.m. UTC | #6

On Monday, April 6, 2020 10:36 PM Darrick J. Wong wrote: 
> On Sat, Apr 04, 2020 at 02:22:03PM +0530, Chandan Rajendra wrote:
> > XFS has a per-inode xattr extent counter which is 16 bits wide. A workload
> > which
> > 1. Creates 1,000,000 255-byte sized xattrs,
> > 2. Deletes 50% of these xattrs in an alternating manner,
> > 3. Tries to create 400,000 new 255-byte sized xattrs
> > causes the following message to be printed on the console,
> > 
> > XFS (loop0): xfs_iflush_int: detected corrupt incore inode 131, total extents = -19916, nblocks = 102937, ptr ffff9ce33b098c00
> > XFS (loop0): xfs_do_force_shutdown(0x8) called from line 3739 of file fs/xfs/xfs_inode.c. Return address = ffffffffa4a94173
> > 
> > This indicates that we overflowed the 16-bits wide xattr extent counter.
> > 
> > I have been informed that there are instances where a single file has
> >  > 100 million hardlinks. With parent pointers being stored in xattr,
> > we will overflow the 16-bits wide xattr extent counter when large
> > number of hardlinks are created.
> > 
> > Hence this commit extends xattr extent counter to 32-bits. It also introduces
> > an incompat flag to prevent older kernels from mounting newer filesystems with
> > 32-bit wide xattr extent counter.
> > 
> > Signed-off-by: Chandan Rajendra <chandanrlinux@gmail.com>
> > ---
> >  fs/xfs/libxfs/xfs_format.h     | 28 +++++++++++++++++++++-------
> >  fs/xfs/libxfs/xfs_inode_buf.c  | 27 +++++++++++++++++++--------
> >  fs/xfs/libxfs/xfs_inode_fork.c |  3 ++-
> >  fs/xfs/libxfs/xfs_log_format.h |  5 +++--
> >  fs/xfs/libxfs/xfs_types.h      |  4 ++--
> >  fs/xfs/scrub/inode.c           |  7 ++++---
> >  fs/xfs/xfs_inode_item.c        |  3 ++-
> >  fs/xfs/xfs_log_recover.c       | 13 ++++++++++---
> >  8 files changed, 63 insertions(+), 27 deletions(-)
> > 
> > diff --git a/fs/xfs/libxfs/xfs_format.h b/fs/xfs/libxfs/xfs_format.h
> > index 045556e78ee2c..0a4266b0d46e1 100644
> > --- a/fs/xfs/libxfs/xfs_format.h
> > +++ b/fs/xfs/libxfs/xfs_format.h
> > @@ -465,10 +465,12 @@ xfs_sb_has_ro_compat_feature(
> >  #define XFS_SB_FEAT_INCOMPAT_FTYPE	(1 << 0)	/* filetype in dirent */
> >  #define XFS_SB_FEAT_INCOMPAT_SPINODES	(1 << 1)	/* sparse inode chunks */
> >  #define XFS_SB_FEAT_INCOMPAT_META_UUID	(1 << 2)	/* metadata UUID */
> > +#define XFS_SB_FEAT_INCOMPAT_32BIT_AEXT_CNTR (1 << 3)
> 
> If you're going to introduce an INCOMPAT feature, please also use the
> opportunity to convert xattrs to something resembling the dir v3 format,
> where we index free space within each block so that we can speed up attr
> setting with 100 million attrs.
> 
> >  #define XFS_SB_FEAT_INCOMPAT_ALL \
> >  		(XFS_SB_FEAT_INCOMPAT_FTYPE|	\
> >  		 XFS_SB_FEAT_INCOMPAT_SPINODES|	\
> > -		 XFS_SB_FEAT_INCOMPAT_META_UUID)
> > +		 XFS_SB_FEAT_INCOMPAT_META_UUID| \
> > +		 XFS_SB_FEAT_INCOMPAT_32BIT_AEXT_CNTR)
> >  
> >  #define XFS_SB_FEAT_INCOMPAT_UNKNOWN	~XFS_SB_FEAT_INCOMPAT_ALL
> >  static inline bool
> > @@ -874,7 +876,7 @@ typedef struct xfs_dinode {
> >  	__be64		di_nblocks;	/* # of direct & btree blocks used */
> >  	__be32		di_extsize;	/* basic/minimum extent size for file */
> >  	__be32		di_nextents;	/* number of extents in data fork */
> > -	__be16		di_anextents;	/* number of extents in attribute fork*/
> > +	__be16		di_anextents_lo;/* lower part of xattr extent count */
> >  	__u8		di_forkoff;	/* attr fork offs, <<3 for 64b align */
> >  	__s8		di_aformat;	/* format of attr fork's data */
> >  	__be32		di_dmevmask;	/* DMIG event mask */
> > @@ -891,7 +893,8 @@ typedef struct xfs_dinode {
> >  	__be64		di_lsn;		/* flush sequence */
> >  	__be64		di_flags2;	/* more random flags */
> >  	__be32		di_cowextsize;	/* basic cow extent size for file */
> > -	__u8		di_pad2[12];	/* more padding for future expansion */
> > +	__be16		di_anextents_hi;/* higher part of xattr extent count */
> 
> I was expecting you to use di_pad, not di_pad2... :)

Dave has suggested that a new 32-bit field be introduced. The kernel will
switch over to using this field when we are about to overflow the existing
16-bit counter.

> 
> > +	__u8		di_pad2[10];	/* more padding for future expansion */
> >  
> >  	/* fields only written to during inode creation */
> >  	xfs_timestamp_t	di_crtime;	/* time created */
> > @@ -993,10 +996,21 @@ enum xfs_dinode_fmt {
> >  	((w) == XFS_DATA_FORK ? \
> >  		(dip)->di_format : \
> >  		(dip)->di_aformat)
> > -#define XFS_DFORK_NEXTENTS(dip,w) \
> > -	((w) == XFS_DATA_FORK ? \
> > -		be32_to_cpu((dip)->di_nextents) : \
> > -		be16_to_cpu((dip)->di_anextents))
> > +
> > +static inline int32_t XFS_DFORK_NEXTENTS(struct xfs_sb *sbp,
> > +					struct xfs_dinode *dip, int whichfork)
> > +{
> 
> XFS style indenting, please.

Sorry about that. I will fix it up.

> 
> > +	int32_t anextents;
> 
> When would we have negative extent count?
> 
> (Yes, this is a bug in the xfs_extnum/xfs_aextnum typedefs, bah...)
> 
> > +
> > +	if (whichfork == XFS_DATA_FORK)
> > +		return be32_to_cpu((dip)->di_nextents);
> > +
> > +	anextents = be16_to_cpu((dip)->di_anextents_lo);
> > +	if (xfs_sb_version_has_v3inode(sbp))
> 
> v3inode?  I thought this had a separate incompat flag?
> 
> > +		anextents |= ((u32)(be16_to_cpu((dip)->di_anextents_hi)) << 16);
>

Yes, I will fix this up. With the new ro-compat feature bit and an extra
32-bit field to track the xattr extent counter, the above logic will change.

> /me would have thought you'd do the splitting and endian conversion in
> the opposite order, e.g.:
> 
> 	be32 x = dip->di_anextents_lo;
> 	if (has32bitattrcount)
> 		x |= (be32)dip->di_anextents_hi << 16;
> 	return be32_to_cpu(x);

I actually followed what was being done w.r.t projid i.e.

     to->di_projid = (prid_t)be16_to_cpu(from->di_projid_hi) << 16 |
                             be16_to_cpu(from->di_projid_lo);

But with the new 32-bit extent counter, we won't have to do that either.
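
For reference, the projid-style approach (convert each 16-bit half to CPU
order first, then combine) can be sketched in standalone C roughly as below.
This is only an illustration: struct demo_dinode and all demo_* helpers are
hypothetical stand-ins, not kernel code, and the big-endian halves are
modelled as byte arrays so the example does not depend on host endianness.

```c
#include <stdint.h>

/* Hypothetical stand-in for the two on-disk halves, each stored big-endian. */
struct demo_dinode {
	uint8_t	anextents_lo[2];	/* lower 16 bits of the count */
	uint8_t	anextents_hi[2];	/* upper 16 bits of the count */
};

/* Decode a big-endian 16-bit value from two bytes. */
static uint16_t demo_be16_to_cpu(const uint8_t b[2])
{
	return (uint16_t)((b[0] << 8) | b[1]);
}

/* Encode a 16-bit value as two big-endian bytes. */
static void demo_cpu_to_be16(uint16_t v, uint8_t b[2])
{
	b[0] = (uint8_t)(v >> 8);
	b[1] = (uint8_t)(v & 0xff);
}

/* Split a 32-bit count into its two halves and store each big-endian. */
static void demo_set_anextents(struct demo_dinode *dip, uint32_t count)
{
	demo_cpu_to_be16((uint16_t)(count & 0xffff), dip->anextents_lo);
	demo_cpu_to_be16((uint16_t)(count >> 16), dip->anextents_hi);
}

/* Convert each half to CPU order first, then combine (projid style). */
static uint32_t demo_get_anextents(const struct demo_dinode *dip)
{
	return ((uint32_t)demo_be16_to_cpu(dip->anextents_hi) << 16) |
		demo_be16_to_cpu(dip->anextents_lo);
}
```

Converting before shifting keeps every arithmetic step in CPU byte order,
which is why I found it easier to reason about.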

> 
> > +
> > +	return anextents;
> > +}
> >  
> >  /*
> >   * For block and character special files the 32bit dev_t is stored at the
> > diff --git a/fs/xfs/libxfs/xfs_inode_buf.c b/fs/xfs/libxfs/xfs_inode_buf.c
> > index 39c5a6e24915c..ced8195bd8c22 100644
> > --- a/fs/xfs/libxfs/xfs_inode_buf.c
> > +++ b/fs/xfs/libxfs/xfs_inode_buf.c
> > @@ -232,7 +232,8 @@ xfs_inode_from_disk(
> >  	to->di_nblocks = be64_to_cpu(from->di_nblocks);
> >  	to->di_extsize = be32_to_cpu(from->di_extsize);
> >  	to->di_nextents = be32_to_cpu(from->di_nextents);
> > -	to->di_anextents = be16_to_cpu(from->di_anextents);
> > +	to->di_anextents = XFS_DFORK_NEXTENTS(&ip->i_mount->m_sb, from,
> > +				XFS_ATTR_FORK);
> >  	to->di_forkoff = from->di_forkoff;
> >  	to->di_aformat	= from->di_aformat;
> >  	to->di_dmevmask	= be32_to_cpu(from->di_dmevmask);
> > @@ -282,7 +283,7 @@ xfs_inode_to_disk(
> >  	to->di_nblocks = cpu_to_be64(from->di_nblocks);
> >  	to->di_extsize = cpu_to_be32(from->di_extsize);
> >  	to->di_nextents = cpu_to_be32(from->di_nextents);
> > -	to->di_anextents = cpu_to_be16(from->di_anextents);
> > +	to->di_anextents_lo = cpu_to_be16((u32)(from->di_anextents) & 0xffff);
> >  	to->di_forkoff = from->di_forkoff;
> >  	to->di_aformat = from->di_aformat;
> >  	to->di_dmevmask = cpu_to_be32(from->di_dmevmask);
> > @@ -296,6 +297,8 @@ xfs_inode_to_disk(
> >  		to->di_crtime.t_nsec = cpu_to_be32(from->di_crtime.tv_nsec);
> >  		to->di_flags2 = cpu_to_be64(from->di_flags2);
> >  		to->di_cowextsize = cpu_to_be32(from->di_cowextsize);
> > +		to->di_anextents_hi
> > +			= cpu_to_be16((u32)(from->di_anextents) >> 16);
> >  		to->di_ino = cpu_to_be64(ip->i_ino);
> >  		to->di_lsn = cpu_to_be64(lsn);
> >  		memset(to->di_pad2, 0, sizeof(to->di_pad2));
> > @@ -335,7 +338,7 @@ xfs_log_dinode_to_disk(
> >  	to->di_nblocks = cpu_to_be64(from->di_nblocks);
> >  	to->di_extsize = cpu_to_be32(from->di_extsize);
> >  	to->di_nextents = cpu_to_be32(from->di_nextents);
> > -	to->di_anextents = cpu_to_be16(from->di_anextents);
> > +	to->di_anextents_lo = cpu_to_be16(from->di_anextents_lo);
> >  	to->di_forkoff = from->di_forkoff;
> >  	to->di_aformat = from->di_aformat;
> >  	to->di_dmevmask = cpu_to_be32(from->di_dmevmask);
> > @@ -349,6 +352,7 @@ xfs_log_dinode_to_disk(
> >  		to->di_crtime.t_nsec = cpu_to_be32(from->di_crtime.t_nsec);
> >  		to->di_flags2 = cpu_to_be64(from->di_flags2);
> >  		to->di_cowextsize = cpu_to_be32(from->di_cowextsize);
> > +		to->di_anextents_hi = cpu_to_be16(from->di_anextents_hi);
> >  		to->di_ino = cpu_to_be64(from->di_ino);
> >  		to->di_lsn = cpu_to_be64(from->di_lsn);
> >  		memcpy(to->di_pad2, from->di_pad2, sizeof(to->di_pad2));
> > @@ -365,7 +369,9 @@ xfs_dinode_verify_fork(
> >  	struct xfs_mount	*mp,
> >  	int			whichfork)
> >  {
> > -	uint32_t		di_nextents = XFS_DFORK_NEXTENTS(dip, whichfork);
> > +	uint32_t		di_nextents;
> > +
> > +	di_nextents = XFS_DFORK_NEXTENTS(&mp->m_sb, dip, whichfork);
> >  
> >  	switch (XFS_DFORK_FORMAT(dip, whichfork)) {
> >  	case XFS_DINODE_FMT_LOCAL:
> > @@ -436,6 +442,9 @@ xfs_dinode_verify(
> >  	uint16_t		flags;
> >  	uint64_t		flags2;
> >  	uint64_t		di_size;
> > +	int32_t			nextents;
> > +	int32_t			anextents;
> > +	int64_t			nblocks;
> >  
> >  	if (dip->di_magic != cpu_to_be16(XFS_DINODE_MAGIC))
> >  		return __this_address;
> > @@ -466,10 +475,12 @@ xfs_dinode_verify(
> >  	if ((S_ISLNK(mode) || S_ISDIR(mode)) && di_size == 0)
> >  		return __this_address;
> >  
> > +	nextents = XFS_DFORK_NEXTENTS(&mp->m_sb, dip, XFS_DATA_FORK);
> > +	anextents = XFS_DFORK_NEXTENTS(&mp->m_sb, dip, XFS_ATTR_FORK);
> > +	nblocks = be64_to_cpu(dip->di_nblocks);
> > +
> >  	/* Fork checks carried over from xfs_iformat_fork */
> > -	if (mode &&
> > -	    be32_to_cpu(dip->di_nextents) + be16_to_cpu(dip->di_anextents) >
> > -			be64_to_cpu(dip->di_nblocks))
> > +	if (mode && nextents + anextents > nblocks)
> >  		return __this_address;
> >  
> >  	if (mode && XFS_DFORK_BOFF(dip) > mp->m_sb.sb_inodesize)
> > @@ -526,7 +537,7 @@ xfs_dinode_verify(
> >  		default:
> >  			return __this_address;
> >  		}
> > -		if (dip->di_anextents)
> > +		if (XFS_DFORK_NEXTENTS(&mp->m_sb, dip, XFS_ATTR_FORK))
> >  			return __this_address;
> >  	}
> >  
> > diff --git a/fs/xfs/libxfs/xfs_inode_fork.c b/fs/xfs/libxfs/xfs_inode_fork.c
> > index 518c6f0ec3a61..080fd0c156a1e 100644
> > --- a/fs/xfs/libxfs/xfs_inode_fork.c
> > +++ b/fs/xfs/libxfs/xfs_inode_fork.c
> > @@ -207,9 +207,10 @@ xfs_iformat_extents(
> >  	int			whichfork)
> >  {
> >  	struct xfs_mount	*mp = ip->i_mount;
> > +	struct xfs_sb		*sb = &mp->m_sb;
> >  	struct xfs_ifork	*ifp = XFS_IFORK_PTR(ip, whichfork);
> >  	int			state = xfs_bmap_fork_to_state(whichfork);
> > -	int			nex = XFS_DFORK_NEXTENTS(dip, whichfork);
> > +	int			nex = XFS_DFORK_NEXTENTS(sb, dip, whichfork);
> >  	int			size = nex * sizeof(xfs_bmbt_rec_t);
> >  	struct xfs_iext_cursor	icur;
> >  	struct xfs_bmbt_rec	*dp;
> > diff --git a/fs/xfs/libxfs/xfs_log_format.h b/fs/xfs/libxfs/xfs_log_format.h
> > index e3400c9c71cdb..5db92aa508bc5 100644
> > --- a/fs/xfs/libxfs/xfs_log_format.h
> > +++ b/fs/xfs/libxfs/xfs_log_format.h
> > @@ -397,7 +397,7 @@ struct xfs_log_dinode {
> >  	xfs_rfsblock_t	di_nblocks;	/* # of direct & btree blocks used */
> >  	xfs_extlen_t	di_extsize;	/* basic/minimum extent size for file */
> >  	xfs_extnum_t	di_nextents;	/* number of extents in data fork */
> > -	xfs_aextnum_t	di_anextents;	/* number of extents in attribute fork*/
> > +	uint16_t	di_anextents_lo;/* lower part of xattr extent count */
> >  	uint8_t		di_forkoff;	/* attr fork offs, <<3 for 64b align */
> >  	int8_t		di_aformat;	/* format of attr fork's data */
> >  	uint32_t	di_dmevmask;	/* DMIG event mask */
> > @@ -414,7 +414,8 @@ struct xfs_log_dinode {
> >  	xfs_lsn_t	di_lsn;		/* flush sequence */
> >  	uint64_t	di_flags2;	/* more random flags */
> >  	uint32_t	di_cowextsize;	/* basic cow extent size for file */
> > -	uint8_t		di_pad2[12];	/* more padding for future expansion */
> > +	uint16_t	di_anextents_hi;/* higher part of xattr extent count */
> > +	uint8_t		di_pad2[10];	/* more padding for future expansion */
> >  
> >  	/* fields only written to during inode creation */
> >  	xfs_ictimestamp_t di_crtime;	/* time created */
> > diff --git a/fs/xfs/libxfs/xfs_types.h b/fs/xfs/libxfs/xfs_types.h
> > index 397d94775440d..01669aa65745a 100644
> > --- a/fs/xfs/libxfs/xfs_types.h
> > +++ b/fs/xfs/libxfs/xfs_types.h
> > @@ -13,7 +13,7 @@ typedef uint32_t	xfs_agino_t;	/* inode # within allocation grp */
> >  typedef uint32_t	xfs_extlen_t;	/* extent length in blocks */
> >  typedef uint32_t	xfs_agnumber_t;	/* allocation group number */
> >  typedef int32_t		xfs_extnum_t;	/* # of extents in a file */
> > -typedef int16_t		xfs_aextnum_t;	/* # extents in an attribute fork */
> > +typedef int32_t		xfs_aextnum_t;	/* # extents in an attribute fork */
> >  typedef int64_t		xfs_fsize_t;	/* bytes in a file */
> >  typedef uint64_t	xfs_ufsize_t;	/* unsigned bytes in a file */
> >  
> > @@ -60,7 +60,7 @@ typedef void *		xfs_failaddr_t;
> >   */
> >  #define	MAXEXTLEN	((xfs_extlen_t)0x001fffff)	/* 21 bits */
> >  #define	MAXEXTNUM	((xfs_extnum_t)0x7fffffff)	/* signed int */
> > -#define	MAXAEXTNUM	((xfs_aextnum_t)0x7fff)		/* signed short */
> > +#define	MAXAEXTNUM	((xfs_aextnum_t)0x7fffffff)	/* signed int */
> 
> Need to preserve both limits so that we can do the correct check for the
> given feature set.

True. I will fix that.
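
A minimal sketch of what keeping both limits might look like is below. The
DEMO_* names and the has_32bit_aext parameter are assumptions made for
illustration only; the real check would key off whatever feature bit the
next version of the patch ends up using.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical names: the existing limit and the extended limit. */
#define DEMO_MAXAEXTNUM_16	((uint32_t)0x7fff)	/* signed short */
#define DEMO_MAXAEXTNUM_32	((uint32_t)0x7fffffff)	/* signed int */

/* Pick the limit that applies to the filesystem's feature set. */
static uint32_t demo_max_attr_extents(bool has_32bit_aext)
{
	return has_32bit_aext ? DEMO_MAXAEXTNUM_32 : DEMO_MAXAEXTNUM_16;
}

/* Verifier-style check: the count must fit the limit for this feature set. */
static bool demo_attr_extent_count_ok(uint32_t count, bool has_32bit_aext)
{
	return count <= demo_max_attr_extents(has_32bit_aext);
}
```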

Thank you for providing the above review comments.

Chandan Rajendra April 8, 2020, 12:43 p.m. UTC | #7

On Tuesday, April 7, 2020 5:00 AM Dave Chinner wrote: 
> On Mon, Apr 06, 2020 at 10:06:03AM -0700, Darrick J. Wong wrote:
> > On Sat, Apr 04, 2020 at 02:22:03PM +0530, Chandan Rajendra wrote:
> > > XFS has a per-inode xattr extent counter which is 16 bits wide. A workload
> > > which
> > > 1. Creates 1,000,000 255-byte sized xattrs,
> > > 2. Deletes 50% of these xattrs in an alternating manner,
> > > 3. Tries to create 400,000 new 255-byte sized xattrs
> > > causes the following message to be printed on the console,
> > > 
> > > XFS (loop0): xfs_iflush_int: detected corrupt incore inode 131, total extents = -19916, nblocks = 102937, ptr ffff9ce33b098c00
> > > XFS (loop0): xfs_do_force_shutdown(0x8) called from line 3739 of file fs/xfs/xfs_inode.c. Return address = ffffffffa4a94173
> > > 
> > > This indicates that we overflowed the 16-bits wide xattr extent counter.
> > > 
> > > I have been informed that there are instances where a single file has
> > >  > 100 million hardlinks. With parent pointers being stored in xattr,
> > > we will overflow the 16-bits wide xattr extent counter when large
> > > number of hardlinks are created.
> > > 
> > > Hence this commit extends xattr extent counter to 32-bits. It also introduces
> > > an incompat flag to prevent older kernels from mounting newer filesystems with
> > > 32-bit wide xattr extent counter.
> > > 
> > > Signed-off-by: Chandan Rajendra <chandanrlinux@gmail.com>
> > > ---
> > >  fs/xfs/libxfs/xfs_format.h     | 28 +++++++++++++++++++++-------
> > >  fs/xfs/libxfs/xfs_inode_buf.c  | 27 +++++++++++++++++++--------
> > >  fs/xfs/libxfs/xfs_inode_fork.c |  3 ++-
> > >  fs/xfs/libxfs/xfs_log_format.h |  5 +++--
> > >  fs/xfs/libxfs/xfs_types.h      |  4 ++--
> > >  fs/xfs/scrub/inode.c           |  7 ++++---
> > >  fs/xfs/xfs_inode_item.c        |  3 ++-
> > >  fs/xfs/xfs_log_recover.c       | 13 ++++++++++---
> > >  8 files changed, 63 insertions(+), 27 deletions(-)
> > > 
> > > diff --git a/fs/xfs/libxfs/xfs_format.h b/fs/xfs/libxfs/xfs_format.h
> > > index 045556e78ee2c..0a4266b0d46e1 100644
> > > --- a/fs/xfs/libxfs/xfs_format.h
> > > +++ b/fs/xfs/libxfs/xfs_format.h
> > > @@ -465,10 +465,12 @@ xfs_sb_has_ro_compat_feature(
> > >  #define XFS_SB_FEAT_INCOMPAT_FTYPE	(1 << 0)	/* filetype in dirent */
> > >  #define XFS_SB_FEAT_INCOMPAT_SPINODES	(1 << 1)	/* sparse inode chunks */
> > >  #define XFS_SB_FEAT_INCOMPAT_META_UUID	(1 << 2)	/* metadata UUID */
> > > +#define XFS_SB_FEAT_INCOMPAT_32BIT_AEXT_CNTR (1 << 3)
> > 
> > If you're going to introduce an INCOMPAT feature, please also use the
> > opportunity to convert xattrs to something resembling the dir v3 format,
> > where we index free space within each block so that we can speed up attr
> > setting with 100 million attrs.
> 
> Not necessary. Chandan has already spent a lot of time investigating
> that - I suggested doing the investigation probably a year ago when
> he was looking for stuff to do knowing that this could be a problem
> parent pointers hit. Long story short - there's no degradation in
> performance in the dabtree out to tens of millions of records with
> different fixed size or random sized attributes, nor does various
> combinations of insert/lookup/remove/replace operations seem to
> impact the tree performance at scale. IOWs, we hit the 16 bit extent
> limits of the attribute trees without finding any degradation in
> performance.

My benchmarking was limited to working with a maximum of 1,000,000 xattrs. I
will address the review comments provided on this patchset and then run the
benchmarks once again ... but this time I will increase the upper limit to 100
million xattrs (since we will have a 32-bit extent counter). I will post the
results of the benchmarking (along with the benchmarking programs/scripts) to
the mailing list before I post the patchset itself.

Chandan Rajendra April 8, 2020, 12:45 p.m. UTC | #8

On Tuesday, April 7, 2020 6:50 AM Dave Chinner wrote: 
> On Sat, Apr 04, 2020 at 02:22:03PM +0530, Chandan Rajendra wrote:
> > XFS has a per-inode xattr extent counter which is 16 bits wide. A workload
> > which
> > 1. Creates 1,000,000 255-byte sized xattrs,
> > 2. Deletes 50% of these xattrs in an alternating manner,
> > 3. Tries to create 400,000 new 255-byte sized xattrs
> > causes the following message to be printed on the console,
> > 
> > XFS (loop0): xfs_iflush_int: detected corrupt incore inode 131, total extents = -19916, nblocks = 102937, ptr ffff9ce33b098c00
> > XFS (loop0): xfs_do_force_shutdown(0x8) called from line 3739 of file fs/xfs/xfs_inode.c. Return address = ffffffffa4a94173
> > 
> > This indicates that we overflowed the 16-bits wide xattr extent counter.
> > 
> > I have been informed that there are instances where a single file has
> >  > 100 million hardlinks. With parent pointers being stored in xattr,
> > we will overflow the 16-bits wide xattr extent counter when large
> > number of hardlinks are created.
> > 
> > Hence this commit extends xattr extent counter to 32-bits. It also introduces
> > an incompat flag to prevent older kernels from mounting newer filesystems with
> > 32-bit wide xattr extent counter.
> > 
> > Signed-off-by: Chandan Rajendra <chandanrlinux@gmail.com>
> > ---
> >  fs/xfs/libxfs/xfs_format.h     | 28 +++++++++++++++++++++-------
> >  fs/xfs/libxfs/xfs_inode_buf.c  | 27 +++++++++++++++++++--------
> >  fs/xfs/libxfs/xfs_inode_fork.c |  3 ++-
> >  fs/xfs/libxfs/xfs_log_format.h |  5 +++--
> >  fs/xfs/libxfs/xfs_types.h      |  4 ++--
> >  fs/xfs/scrub/inode.c           |  7 ++++---
> >  fs/xfs/xfs_inode_item.c        |  3 ++-
> >  fs/xfs/xfs_log_recover.c       | 13 ++++++++++---
> >  8 files changed, 63 insertions(+), 27 deletions(-)
> > 
> > diff --git a/fs/xfs/libxfs/xfs_format.h b/fs/xfs/libxfs/xfs_format.h
> > index 045556e78ee2c..0a4266b0d46e1 100644
> > --- a/fs/xfs/libxfs/xfs_format.h
> > +++ b/fs/xfs/libxfs/xfs_format.h
> > @@ -465,10 +465,12 @@ xfs_sb_has_ro_compat_feature(
> >  #define XFS_SB_FEAT_INCOMPAT_FTYPE	(1 << 0)	/* filetype in dirent */
> >  #define XFS_SB_FEAT_INCOMPAT_SPINODES	(1 << 1)	/* sparse inode chunks */
> >  #define XFS_SB_FEAT_INCOMPAT_META_UUID	(1 << 2)	/* metadata UUID */
> > +#define XFS_SB_FEAT_INCOMPAT_32BIT_AEXT_CNTR (1 << 3)
> >  #define XFS_SB_FEAT_INCOMPAT_ALL \
> >  		(XFS_SB_FEAT_INCOMPAT_FTYPE|	\
> >  		 XFS_SB_FEAT_INCOMPAT_SPINODES|	\
> > -		 XFS_SB_FEAT_INCOMPAT_META_UUID)
> > +		 XFS_SB_FEAT_INCOMPAT_META_UUID| \
> > +		 XFS_SB_FEAT_INCOMPAT_32BIT_AEXT_CNTR)
> >  
> >  #define XFS_SB_FEAT_INCOMPAT_UNKNOWN	~XFS_SB_FEAT_INCOMPAT_ALL
> >  static inline bool
> > @@ -874,7 +876,7 @@ typedef struct xfs_dinode {
> >  	__be64		di_nblocks;	/* # of direct & btree blocks used */
> >  	__be32		di_extsize;	/* basic/minimum extent size for file */
> >  	__be32		di_nextents;	/* number of extents in data fork */
> > -	__be16		di_anextents;	/* number of extents in attribute fork*/
> > +	__be16		di_anextents_lo;/* lower part of xattr extent count */
> >  	__u8		di_forkoff;	/* attr fork offs, <<3 for 64b align */
> >  	__s8		di_aformat;	/* format of attr fork's data */
> >  	__be32		di_dmevmask;	/* DMIG event mask */
> > @@ -891,7 +893,8 @@ typedef struct xfs_dinode {
> >  	__be64		di_lsn;		/* flush sequence */
> >  	__be64		di_flags2;	/* more random flags */
> >  	__be32		di_cowextsize;	/* basic cow extent size for file */
> > -	__u8		di_pad2[12];	/* more padding for future expansion */
> > +	__be16		di_anextents_hi;/* higher part of xattr extent count */
> > +	__u8		di_pad2[10];	/* more padding for future expansion */
> 
> Ok, I think you've limited what we can do here by using this "fill
> holes" variable split. I've never liked doing this, and we've only
> done it in the past when we haven't had space in the inode to create
> a new 32 bit variable.
> 
> IOWs, this is a v5 format feature only, so we should just create a
> new variable:
> 
> 	__be32		di_attr_nextents;
> 
> With that in place, we can now do what we did extending the v1 inode
> link count (16 bits) to the v2 inode link count (32 bits).
> 
> > That is, when the attribute count is going to overflow, we set an
> inode flag on disk to indicate that it now has a 32 bit extent count
> and uses that field in the inode, and we set a RO-compat feature
> flag in the superblock to indicate that there are 32 bit attr fork
> extent counts in use.
> 
> Old kernels can still read the filesystem, but see the extent count
> as "max" (65535) but can't modify the attr fork and hence corrupt
> the 32 bit count it knows nothing about.
> 
> If the kernel sees the RO feature bit set, it can set the inode flag
> on inodes it is modifying and update both the old and new counters
> appropriately when flushing the inode to disk (i.e. transparent
> conversion).
> 
> In future, mkfs can then set the RO feature flag by default so all
> new filesystems use the 32 bit counter.

Sure. I will make the changes suggested above.
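
My current understanding of the scheme, as a rough standalone sketch, is
below. All names (struct demo_inode_core, DEMO_DIFLAG_ATTR_NEXT32, the demo_*
helpers) are hypothetical, the fields are kept in CPU order for brevity, and
saturating the legacy counter at 65535 is my reading of "see the extent count
as max" rather than a settled design.

```c
#include <stdbool.h>
#include <stdint.h>

#define DEMO_DIFLAG_ATTR_NEXT32	(1u << 0)	/* hypothetical per-inode flag */

/* Hypothetical inode core holding both counters (CPU byte order). */
struct demo_inode_core {
	uint32_t	flags;
	uint16_t	anextents;	/* legacy 16-bit counter */
	uint32_t	attr_nextents;	/* new 32-bit counter */
};

/* Read side: trust the 32-bit field only when the inode flag is set. */
static uint32_t demo_attr_nextents(const struct demo_inode_core *dic)
{
	if (dic->flags & DEMO_DIFLAG_ATTR_NEXT32)
		return dic->attr_nextents;
	return dic->anextents;
}

/*
 * Flush side: turn on the (assumed) RO-compat superblock feature when the
 * count no longer fits in 16 bits. Once the feature is on, convert the inode
 * transparently and update both counters, saturating the legacy one.
 */
static void demo_flush_attr_nextents(struct demo_inode_core *dic,
				     uint32_t count, bool *sb_feature)
{
	if (count > UINT16_MAX)
		*sb_feature = true;

	if (*sb_feature) {
		dic->flags |= DEMO_DIFLAG_ATTR_NEXT32;
		dic->attr_nextents = count;
		dic->anextents = count > UINT16_MAX ?
				 UINT16_MAX : (uint16_t)count;
	} else {
		dic->anextents = (uint16_t)count;
	}
}
```

In this sketch an inode is never converted back once the flag is set; whether
that is desirable is something I will need to check while implementing it.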

> 
> >  	/* fields only written to during inode creation */
> >  	xfs_timestamp_t	di_crtime;	/* time created */
> > @@ -993,10 +996,21 @@ enum xfs_dinode_fmt {
> >  	((w) == XFS_DATA_FORK ? \
> >  		(dip)->di_format : \
> >  		(dip)->di_aformat)
> > -#define XFS_DFORK_NEXTENTS(dip,w) \
> > -	((w) == XFS_DATA_FORK ? \
> > -		be32_to_cpu((dip)->di_nextents) : \
> > -		be16_to_cpu((dip)->di_anextents))
> > +
> > +static inline int32_t XFS_DFORK_NEXTENTS(struct xfs_sb *sbp,
> 
> If you are converting a macro to static inline, then all the caller
> sites should be converted to lower case at the same time.

Ok.

> 
> > +					struct xfs_dinode *dip, int whichfork)
> > +{
> > +	int32_t anextents;
> 
> Extent counts should be unsigned, as they are on disk.

Ok.

> 
> > +
> > +	if (whichfork == XFS_DATA_FORK)
> > +		return be32_to_cpu((dip)->di_nextents);
> > +
> > +	anextents = be16_to_cpu((dip)->di_anextents_lo);
> > +	if (xfs_sb_version_has_v3inode(sbp))
> > +		anextents |= ((u32)(be16_to_cpu((dip)->di_anextents_hi)) << 16);
> > +
> > +	return anextents;
> > +}
> 
> No feature bit to indicate that 32 bit attribute extent counts are
> valid?

The incompat feature flag (i.e. XFS_SB_FEAT_INCOMPAT_32BIT_AEXT_CNTR) that I
had introduced prevented older kernels from mounting filesystems having
di_anextents_hi field in the inodes.  As you have explained above, this method
is incorrect. I will add appropriate checks once I implement the new "RO
feature bit" method.

> 
> >  
> >  /*
> >   * For block and character special files the 32bit dev_t is stored at the
> > diff --git a/fs/xfs/libxfs/xfs_inode_buf.c b/fs/xfs/libxfs/xfs_inode_buf.c
> > index 39c5a6e24915c..ced8195bd8c22 100644
> > --- a/fs/xfs/libxfs/xfs_inode_buf.c
> > +++ b/fs/xfs/libxfs/xfs_inode_buf.c
> > @@ -232,7 +232,8 @@ xfs_inode_from_disk(
> >  	to->di_nblocks = be64_to_cpu(from->di_nblocks);
> >  	to->di_extsize = be32_to_cpu(from->di_extsize);
> >  	to->di_nextents = be32_to_cpu(from->di_nextents);
> > -	to->di_anextents = be16_to_cpu(from->di_anextents);
> > +	to->di_anextents = XFS_DFORK_NEXTENTS(&ip->i_mount->m_sb, from,
> > +				XFS_ATTR_FORK);
> 
> This should be open coded, but I'd prefer a completely separate
> variable...

Ok. I will change that.

> 
> >  	to->di_forkoff = from->di_forkoff;
> >  	to->di_aformat	= from->di_aformat;
> >  	to->di_dmevmask	= be32_to_cpu(from->di_dmevmask);
> > @@ -282,7 +283,7 @@ xfs_inode_to_disk(
> >  	to->di_nblocks = cpu_to_be64(from->di_nblocks);
> >  	to->di_extsize = cpu_to_be32(from->di_extsize);
> >  	to->di_nextents = cpu_to_be32(from->di_nextents);
> > -	to->di_anextents = cpu_to_be16(from->di_anextents);
> > +	to->di_anextents_lo = cpu_to_be16((u32)(from->di_anextents) & 0xffff);
> >  	to->di_forkoff = from->di_forkoff;
> >  	to->di_aformat = from->di_aformat;
> >  	to->di_dmevmask = cpu_to_be32(from->di_dmevmask);
> > @@ -296,6 +297,8 @@ xfs_inode_to_disk(
> >  		to->di_crtime.t_nsec = cpu_to_be32(from->di_crtime.tv_nsec);
> >  		to->di_flags2 = cpu_to_be64(from->di_flags2);
> >  		to->di_cowextsize = cpu_to_be32(from->di_cowextsize);
> > +		to->di_anextents_hi
> > +			= cpu_to_be16((u32)(from->di_anextents) >> 16);
> 
> Again, feature bit for on-disk format modifications needed...

Sure. I will change this.

> 
> >  		to->di_ino = cpu_to_be64(ip->i_ino);
> >  		to->di_lsn = cpu_to_be64(lsn);
> >  		memset(to->di_pad2, 0, sizeof(to->di_pad2));
> > @@ -335,7 +338,7 @@ xfs_log_dinode_to_disk(
> >  	to->di_nblocks = cpu_to_be64(from->di_nblocks);
> >  	to->di_extsize = cpu_to_be32(from->di_extsize);
> >  	to->di_nextents = cpu_to_be32(from->di_nextents);
> > -	to->di_anextents = cpu_to_be16(from->di_anextents);
> > +	to->di_anextents_lo = cpu_to_be16(from->di_anextents_lo);
> >  	to->di_forkoff = from->di_forkoff;
> >  	to->di_aformat = from->di_aformat;
> >  	to->di_dmevmask = cpu_to_be32(from->di_dmevmask);
> > @@ -349,6 +352,7 @@ xfs_log_dinode_to_disk(
> >  		to->di_crtime.t_nsec = cpu_to_be32(from->di_crtime.t_nsec);
> >  		to->di_flags2 = cpu_to_be64(from->di_flags2);
> >  		to->di_cowextsize = cpu_to_be32(from->di_cowextsize);
> > +		to->di_anextents_hi = cpu_to_be16(from->di_anextents_hi);
> >  		to->di_ino = cpu_to_be64(from->di_ino);
> >  		to->di_lsn = cpu_to_be64(from->di_lsn);
> >  		memcpy(to->di_pad2, from->di_pad2, sizeof(to->di_pad2));
> > @@ -365,7 +369,9 @@ xfs_dinode_verify_fork(
> >  	struct xfs_mount	*mp,
> >  	int			whichfork)
> >  {
> > -	uint32_t		di_nextents = XFS_DFORK_NEXTENTS(dip, whichfork);
> > +	uint32_t		di_nextents;
> > +
> > +	di_nextents = XFS_DFORK_NEXTENTS(&mp->m_sb, dip, whichfork);
> >  
> >  	switch (XFS_DFORK_FORMAT(dip, whichfork)) {
> >  	case XFS_DINODE_FMT_LOCAL:
> > @@ -436,6 +442,9 @@ xfs_dinode_verify(
> >  	uint16_t		flags;
> >  	uint64_t		flags2;
> >  	uint64_t		di_size;
> > +	int32_t			nextents;
> > +	int32_t			anextents;
> > +	int64_t			nblocks;
> 
> Extent counts need to be converted to unsigned in memory - they are
> unsigned on disk....

Ok.

> 
> >  
> >  	if (dip->di_magic != cpu_to_be16(XFS_DINODE_MAGIC))
> >  		return __this_address;
> > @@ -466,10 +475,12 @@ xfs_dinode_verify(
> >  	if ((S_ISLNK(mode) || S_ISDIR(mode)) && di_size == 0)
> >  		return __this_address;
> >  
> > +	nextents = XFS_DFORK_NEXTENTS(&mp->m_sb, dip, XFS_DATA_FORK);
> > +	anextents = XFS_DFORK_NEXTENTS(&mp->m_sb, dip, XFS_ATTR_FORK);
> > +	nblocks = be64_to_cpu(dip->di_nblocks);
> > +
> >  	/* Fork checks carried over from xfs_iformat_fork */
> > -	if (mode &&
> > -	    be32_to_cpu(dip->di_nextents) + be16_to_cpu(dip->di_anextents) >
> > -			be64_to_cpu(dip->di_nblocks))
> > +	if (mode && nextents + anextents > nblocks)
> >  		return __this_address;
> >  
> >  	if (mode && XFS_DFORK_BOFF(dip) > mp->m_sb.sb_inodesize)
> > @@ -526,7 +537,7 @@ xfs_dinode_verify(
> >  		default:
> >  			return __this_address;
> >  		}
> > -		if (dip->di_anextents)
> > +		if (XFS_DFORK_NEXTENTS(&mp->m_sb, dip, XFS_ATTR_FORK))
> >  			return __this_address;
> >  	}
> >  
> > diff --git a/fs/xfs/libxfs/xfs_inode_fork.c b/fs/xfs/libxfs/xfs_inode_fork.c
> > index 518c6f0ec3a61..080fd0c156a1e 100644
> > --- a/fs/xfs/libxfs/xfs_inode_fork.c
> > +++ b/fs/xfs/libxfs/xfs_inode_fork.c
> > @@ -207,9 +207,10 @@ xfs_iformat_extents(
> >  	int			whichfork)
> >  {
> >  	struct xfs_mount	*mp = ip->i_mount;
> > +	struct xfs_sb		*sb = &mp->m_sb;
> >  	struct xfs_ifork	*ifp = XFS_IFORK_PTR(ip, whichfork);
> >  	int			state = xfs_bmap_fork_to_state(whichfork);
> > -	int			nex = XFS_DFORK_NEXTENTS(dip, whichfork);
> > +	int			nex = XFS_DFORK_NEXTENTS(sb, dip, whichfork);
> >  	int			size = nex * sizeof(xfs_bmbt_rec_t);
> >  	struct xfs_iext_cursor	icur;
> >  	struct xfs_bmbt_rec	*dp;
> > diff --git a/fs/xfs/libxfs/xfs_log_format.h b/fs/xfs/libxfs/xfs_log_format.h
> > index e3400c9c71cdb..5db92aa508bc5 100644
> > --- a/fs/xfs/libxfs/xfs_log_format.h
> > +++ b/fs/xfs/libxfs/xfs_log_format.h
> > @@ -397,7 +397,7 @@ struct xfs_log_dinode {
> >  	xfs_rfsblock_t	di_nblocks;	/* # of direct & btree blocks used */
> >  	xfs_extlen_t	di_extsize;	/* basic/minimum extent size for file */
> >  	xfs_extnum_t	di_nextents;	/* number of extents in data fork */
> > -	xfs_aextnum_t	di_anextents;	/* number of extents in attribute fork*/
> > +	uint16_t	di_anextents_lo;/* lower part of xattr extent count */
> >  	uint8_t		di_forkoff;	/* attr fork offs, <<3 for 64b align */
> >  	int8_t		di_aformat;	/* format of attr fork's data */
> >  	uint32_t	di_dmevmask;	/* DMIG event mask */
> > @@ -414,7 +414,8 @@ struct xfs_log_dinode {
> >  	xfs_lsn_t	di_lsn;		/* flush sequence */
> >  	uint64_t	di_flags2;	/* more random flags */
> >  	uint32_t	di_cowextsize;	/* basic cow extent size for file */
> > -	uint8_t		di_pad2[12];	/* more padding for future expansion */
> > +	uint16_t	di_anextents_hi;/* higher part of xattr extent count */
> 
> So, unsigned in the log, as on disk...
> 
> > +	uint8_t		di_pad2[10];	/* more padding for future expansion */
> >  
> >  	/* fields only written to during inode creation */
> >  	xfs_ictimestamp_t di_crtime;	/* time created */
> > diff --git a/fs/xfs/libxfs/xfs_types.h b/fs/xfs/libxfs/xfs_types.h
> > index 397d94775440d..01669aa65745a 100644
> > --- a/fs/xfs/libxfs/xfs_types.h
> > +++ b/fs/xfs/libxfs/xfs_types.h
> > @@ -13,7 +13,7 @@ typedef uint32_t	xfs_agino_t;	/* inode # within allocation grp */
> >  typedef uint32_t	xfs_extlen_t;	/* extent length in blocks */
> >  typedef uint32_t	xfs_agnumber_t;	/* allocation group number */
> >  typedef int32_t		xfs_extnum_t;	/* # of extents in a file */
> > -typedef int16_t		xfs_aextnum_t;	/* # extents in an attribute fork */
> > +typedef int32_t		xfs_aextnum_t;	/* # extents in an attribute fork */
> 
> .... but not in memory?

I will change this. I actually did notice 'data type' inconsistency in the
existing code,

typedef int16_t         xfs_aextnum_t;  /* # extents in an attribute fork */

... and I thought there could be some purpose behind this. I was wrong. I will
fix the data type inconsistencies.
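
As a worked example of why the signed type matters, the sketch below reinterprets the low 16 bits of a count as a signed 16-bit value. It is an assumption, made purely for illustration, that the "total extents = -19916" in the console message comes from such a wrap; 45620 is simply one count that would produce exactly that figure. The helper name `demo_as_signed16` is hypothetical.

```c
#include <stdint.h>

/* Two's-complement view of the low 16 bits of a count, computed with
 * well-defined arithmetic instead of an implementation-defined cast. */
static int32_t demo_as_signed16(uint32_t count)
{
	uint32_t low = count & 0xffffu;

	return low >= 0x8000u ? (int32_t)low - 0x10000 : (int32_t)low;
}
```

With this helper, any count up to 32767 survives unchanged, while 45620 comes back as 45620 - 65536 = -19916.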

> 
> >  typedef int64_t		xfs_fsize_t;	/* bytes in a file */
> >  typedef uint64_t	xfs_ufsize_t;	/* unsigned bytes in a file */
> >  
> > @@ -60,7 +60,7 @@ typedef void *		xfs_failaddr_t;
> >   */
> >  #define	MAXEXTLEN	((xfs_extlen_t)0x001fffff)	/* 21 bits */
> >  #define	MAXEXTNUM	((xfs_extnum_t)0x7fffffff)	/* signed int */
> > -#define	MAXAEXTNUM	((xfs_aextnum_t)0x7fff)		/* signed short */
> > +#define	MAXAEXTNUM	((xfs_aextnum_t)0x7fffffff)	/* signed int */
> 
> What about for older filesystems where MAXAEXTNUM is unchanged?

I had again depended on XFS_SB_FEAT_INCOMPAT_32BIT_AEXT_CNTR incompat feature
flag to not allow older kernels to access a filesystem with 32-bit xattr
extent counter. I will fix this as well.

> 
> > diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c
> > index 11c3502b07b13..ba3fae95b2260 100644
> > --- a/fs/xfs/xfs_log_recover.c
> > +++ b/fs/xfs/xfs_log_recover.c
> > @@ -2922,6 +2922,7 @@ xlog_recover_inode_pass2(
> >  	struct xfs_log_dinode	*ldip;
> >  	uint			isize;
> >  	int			need_free = 0;
> > +	uint32_t		nextents;
> >  
> >  	if (item->ri_buf[0].i_len == sizeof(struct xfs_inode_log_format)) {
> >  		in_f = item->ri_buf[0].i_addr;
> > @@ -3044,7 +3045,14 @@ xlog_recover_inode_pass2(
> >  			goto out_release;
> >  		}
> >  	}
> > -	if (unlikely(ldip->di_nextents + ldip->di_anextents > ldip->di_nblocks)){
> > +
> > +	nextents = ldip->di_anextents_lo;
> > +	if (xfs_sb_version_has_v3inode(&mp->m_sb))
> > +		nextents |= ((u32)(ldip->di_anextents_hi) << 16);
> 
> What happens if we are recovering from a filesystem that doesn't
> know anything about di_anextents_hi and never wrote anything to
> the log for this field?

I had again depended on XFS_SB_FEAT_INCOMPAT_32BIT_AEXT_CNTR incompat feature
flag to not allow older kernels to access a filesystem with 32-bit xattr
extent counter. I will fix this as well.

Thanks for the review comments. I will post the next version which will
address them.

Darrick J. Wong April 8, 2020, 3:38 p.m. UTC | #9

On Wed, Apr 08, 2020 at 06:13:45PM +0530, Chandan Rajendra wrote:
> On Tuesday, April 7, 2020 5:00 AM Dave Chinner wrote: 
> > On Mon, Apr 06, 2020 at 10:06:03AM -0700, Darrick J. Wong wrote:
> > > On Sat, Apr 04, 2020 at 02:22:03PM +0530, Chandan Rajendra wrote:
> > > > @@ -465,10 +465,12 @@ xfs_sb_has_ro_compat_feature(
> > > >  #define XFS_SB_FEAT_INCOMPAT_FTYPE	(1 << 0)	/* filetype in dirent */
> > > >  #define XFS_SB_FEAT_INCOMPAT_SPINODES	(1 << 1)	/* sparse inode chunks */
> > > >  #define XFS_SB_FEAT_INCOMPAT_META_UUID	(1 << 2)	/* metadata UUID */
> > > > +#define XFS_SB_FEAT_INCOMPAT_32BIT_AEXT_CNTR (1 << 3)
> > > 
> > > If you're going to introduce an INCOMPAT feature, please also use the
> > > opportunity to convert xattrs to something resembling the dir v3 format,
> > > where we index free space within each block so that we can speed up attr
> > > setting with 100 million attrs.
> > 
> > Not necessary. Chandan has already spent a lot of time investigating
> > that - I suggested doing the investigation probably a year ago when
> > he was looking for stuff to do knowing that this could be a problem
> > parent pointers hit. Long story short - there's no degradation in
> > performance in the dabtree out to tens of millions of records with
> > different fixed size or random sized attributes, nor does various
> > combinations of insert/lookup/remove/replace operations seem to
> > impact the tree performance at scale. IOWs, we hit the 16 bit extent
> > limits of the attribute trees without finding any degradation in
> > performance.
> 
> My benchmarking was limited to working with a maximum of 1,000,000 xattrs. I
> will address the review comments provided on this patchset and then run the
> benchmarks once again ... but this time I will increase the upper limit to 100
> million xattrs (since we will have a 32-bit extent counter). I will post the
> results of the benchmarking (along with the benchmarking programs/scripts) to
> the mailing list before I post the patchset itself.

Ok.  Thanks for doing that work. :)

--D

> -- 
> chandan
> 
> 
>

Darrick J. Wong April 8, 2020, 3:45 p.m. UTC | #10

On Tue, Apr 07, 2020 at 09:30:02AM +1000, Dave Chinner wrote:
> On Mon, Apr 06, 2020 at 10:06:03AM -0700, Darrick J. Wong wrote:
> > On Sat, Apr 04, 2020 at 02:22:03PM +0530, Chandan Rajendra wrote:
> > > @@ -465,10 +465,12 @@ xfs_sb_has_ro_compat_feature(
> > >  #define XFS_SB_FEAT_INCOMPAT_FTYPE	(1 << 0)	/* filetype in dirent */
> > >  #define XFS_SB_FEAT_INCOMPAT_SPINODES	(1 << 1)	/* sparse inode chunks */
> > >  #define XFS_SB_FEAT_INCOMPAT_META_UUID	(1 << 2)	/* metadata UUID */
> > > +#define XFS_SB_FEAT_INCOMPAT_32BIT_AEXT_CNTR (1 << 3)
> > 
> > If you're going to introduce an INCOMPAT feature, please also use the
> > opportunity to convert xattrs to something resembling the dir v3 format,
> > where we index free space within each block so that we can speed up attr
> > setting with 100 million attrs.
> 
> Not necessary. Chandan has already spent a lot of time investigating
> that - I suggested doing the investigation probably a year ago when
> he was looking for stuff to do knowing that this could be a problem
> parent pointers hit.

Oh, I didn't realize that analysis work had already been done.

Chandan, could you please mention that somewhere in the cover letter?
It does mention that you tried creating 1M xattrs, but I guess it needed
to be more explicit about not uncovering any gigantic performance holes.

> Long story short - there's no degradation in
> performance in the dabtree out to tens of millions of records with
> different fixed size or random sized attributes, nor does various
> combinations of insert/lookup/remove/replace operations seem to
> impact the tree performance at scale. IOWs, we hit the 16 bit extent
> limits of the attribute trees without finding any degradation in
> performance.

Ok.  I'll take "attr v3 upgrade" off my list of things to look out for.

> Hence we concluded that the dabtree structure does not require
> significant modification or optimisation to work well with typical
> parent pointer attribute demands...
> 
> As for free space indexes....
> 
> The issue with the directory structure that requires external free
> space is that the directory data is not part of the dabtree itself.
> The attribute fork stores all the attributes at the leaves of the
> dabtree, while the directory structure stores the directory data in
> external blocks and the dabtree only contains the name hash index
> that points to the external data.
> 
> i.e. When we add an attribute to the dabtree, we split/merge leaves
> of the tree based on where the name hash index tells us it needs to
> be inserted/removed from. i.e. we make space available or collapse
> sparse leaves of the dabtree as a side effect of inserting or
> removing objects.
> 
> The directory structure is very different. The dirents cannot change
> location as their logical offset into the dir data segment is used
> as the readdir/seekdir/telldir cookie. Therefore that location is
> not allowed to change for the life of the dirent and so we can't
> store them in the leaves of a dabtree indexed in hash order because
> the offset into the tree would change as other entries are inserted
> and removed.  Hence when we remove dirents, we must leave holes in
> the data segment so the rest of the dirent data does not change
> logical offset.
> 
> The directory name hash index - the dabtree bit - is in a separate
> segment (the 2nd one). Because it only stores pointers to dirents in
> the data segment, it doesn't need to leave holes - the dabtree just
> merge/splits as required as pointers to the dir data segment are
> added/removed - and has no free space tracking.
> 
> Hence when we go to add a dirent, we need to find the best free
> space in the dir data segment to add that dirent. This requires a
> dir data segment free space index, and that is held in the 3rd dir
> segment.  Once we've found the best free space via lookup in the
> free space index, we go modify the dir data block it points to, then
> update the dabtree to point the name hash at that new dirent.
> 
> IOWs, the requirement for a free space map in the directory
> structure results from storing the dirent data externally to the
> dabtree. Attributes are stored directly in the leaves of the
> dabtree - except for remote attributes which can be anywhere in the
> BMBT address space - and hence do not need external free space
> tracking to determine where to best insert them...

<nod> Got it.  I've suspected this property about the xattr structures
for a long time, so I'm glad to hear someone else echo that. :)

Dave: May I try to rework the above into something suitable for the
ondisk format documentation?
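
To make the contrast concrete, here is a toy sketch of the directory side of that argument. It is an illustration under heavy assumptions, not XFS code: the data segment is a small byte map, an entry's offset stands in for its readdir cookie, and a best-fit scan stands in for the separate free space index (the real format keeps an index precisely so it does not have to scan). All `demo_` names and `DEMO_SEG_SIZE` are hypothetical.

```c
#include <stddef.h>
#include <string.h>

#define DEMO_SEG_SIZE 64

/* Toy directory data segment: used[i] == 1 means byte i belongs to a
 * live entry. Entries never move, so their offsets stay stable. */
struct demo_dir_data {
	unsigned char used[DEMO_SEG_SIZE];
};

/* Best-fit lookup: offset of the smallest hole that can hold len
 * bytes, or -1 if no hole is large enough. */
static int demo_find_free(const struct demo_dir_data *d, size_t len)
{
	int best_off = -1;
	size_t best_len = DEMO_SEG_SIZE + 1;
	size_t i = 0;

	while (i < DEMO_SEG_SIZE) {
		size_t start, run;

		if (d->used[i]) {
			i++;
			continue;
		}
		start = i;
		while (i < DEMO_SEG_SIZE && !d->used[i])
			i++;
		run = i - start;
		if (run >= len && run < best_len) {
			best_len = run;
			best_off = (int)start;
		}
	}
	return best_off;
}

/* Add an entry of len bytes; returns its stable offset or -1. */
static int demo_add(struct demo_dir_data *d, size_t len)
{
	int off = demo_find_free(d, len);

	if (off >= 0)
		memset(&d->used[off], 1, len);
	return off;
}

/* Remove an entry: leave a hole, do not shift any other entry. */
static void demo_remove(struct demo_dir_data *d, int off, size_t len)
{
	memset(&d->used[off], 0, len);
}
```

Removing an entry only punches a hole, so later inserts need some way to find the best hole. Attribute leaves have no such constraint, because entries can move when leaves split or merge.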

--D

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

Dave Chinner April 8, 2020, 10:43 p.m. UTC | #11

On Wed, Apr 08, 2020 at 06:13:45PM +0530, Chandan Rajendra wrote:
> On Tuesday, April 7, 2020 5:00 AM Dave Chinner wrote: 
> > On Mon, Apr 06, 2020 at 10:06:03AM -0700, Darrick J. Wong wrote:
> > > On Sat, Apr 04, 2020 at 02:22:03PM +0530, Chandan Rajendra wrote:
> > > > @@ -465,10 +465,12 @@ xfs_sb_has_ro_compat_feature(
> > > >  #define XFS_SB_FEAT_INCOMPAT_FTYPE	(1 << 0)	/* filetype in dirent */
> > > >  #define XFS_SB_FEAT_INCOMPAT_SPINODES	(1 << 1)	/* sparse inode chunks */
> > > >  #define XFS_SB_FEAT_INCOMPAT_META_UUID	(1 << 2)	/* metadata UUID */
> > > > +#define XFS_SB_FEAT_INCOMPAT_32BIT_AEXT_CNTR (1 << 3)
> > > 
> > > If you're going to introduce an INCOMPAT feature, please also use the
> > > opportunity to convert xattrs to something resembling the dir v3 format,
> > > where we index free space within each block so that we can speed up attr
> > > setting with 100 million attrs.
> > 
> > Not necessary. Chandan has already spent a lot of time investigating
> > that - I suggested doing the investigation probably a year ago when
> > he was looking for stuff to do knowing that this could be a problem
> > parent pointers hit. Long story short - there's no degradation in
> > performance in the dabtree out to tens of millions of records with
> > different fixed size or random sized attributes, nor does various
> > combinations of insert/lookup/remove/replace operations seem to
> > impact the tree performance at scale. IOWs, we hit the 16 bit extent
> > limits of the attribute trees without finding any degradation in
> > performance.
> 
> My benchmarking was limited to working with a maximum of 1,000,000 xattrs. I

/me goes and reviews old emails

Yes, there were a lot of experiments limited to 1M xattrs because
of the 16-bit extent count limitations once the tree modifications
started removing blocks and allocating new ones, but:

| Dave, I have experimented and found that xattr insertion and deletion
| operations consume CPU time in an O(N) manner. Below is a sample of such an
| experiment,
|
| | Nr attributes | Create | Delete |
| |---------------+--------+--------|
| |         10000 |   0.07 |   0.06 |
| |         20000 |   0.14 |   0.13 |
| |        100000 |   0.73 |   0.69 |
| |        200000 |   1.50 |   1.30 |
| |       1000000 |   7.87 |   6.39 |
| |       2000000 |  15.76 |  12.56 |
| |      10000000 |  78.68 |  66.53 |

There's 10M attributes with expected scalability behaviour.
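
A quick way to check the O(N) reading of that table is to divide each time by its attribute count and see how far the per-attribute cost drifts. The sketch below does that with the quoted numbers; the table does not state its time unit, so the code makes no assumption about it, and the `demo_` names are hypothetical.

```c
#include <stddef.h>

struct demo_sample {
	double nr_attrs;
	double create;	/* "Create" column, unit as in the table */
	double del;	/* "Delete" column, unit as in the table */
};

/* Values copied from the quoted table. */
static const struct demo_sample demo_samples[] = {
	{    10000.0,  0.07,  0.06 },
	{    20000.0,  0.14,  0.13 },
	{   100000.0,  0.73,  0.69 },
	{   200000.0,  1.50,  1.30 },
	{  1000000.0,  7.87,  6.39 },
	{  2000000.0, 15.76, 12.56 },
	{ 10000000.0, 78.68, 66.53 },
};

#define DEMO_NR_SAMPLES (sizeof(demo_samples) / sizeof(demo_samples[0]))

/* Ratio of the largest to the smallest per-attribute cost in one
 * column; a value close to 1.0 means the cost grows linearly. */
static double demo_cost_spread(int use_delete)
{
	double lo = 0.0, hi = 0.0;
	size_t i;

	for (i = 0; i < DEMO_NR_SAMPLES; i++) {
		const struct demo_sample *s = &demo_samples[i];
		double c = (use_delete ? s->del : s->create) / s->nr_attrs;

		if (i == 0 || c < lo)
			lo = c;
		if (i == 0 || c > hi)
			hi = c;
	}
	return hi / lo;
}
```

For both columns the spread stays below 1.2 while the attribute count grows by three orders of magnitude, which is consistent with linear scaling.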

Space efficiency for parent-pointer style xattrs out to 10 million
xattrs:

| I extracted some more data from the experiments,
| 
|     1. 13 to 20 bytes name length; Zero length value
|        | Nr xattr | 4kAvg | 4kmin | 4kmax | stddev | Total Nr leaves | Below avg space used | Percentage |
|        |----------+-------+-------+-------+--------+-----------------+----------------------+------------|
|        |    10000 |  3156 |  2100 |  4080 |    978 |             122 |                   56 |      45.90 |
|        |    20000 |  3358 |  2100 |  4080 |    945 |             255 |                  135 |      52.94 |
|        |   100000 |  3469 |  2080 |  4080 |    910 |            1349 |                  802 |      59.45 |
|        |   200000 |  2842 |  2080 |  4080 |    747 |            2649 |                 1264 |      47.72 |
|        |   300000 |  2739 |  2080 |  4080 |    699 |            3907 |                 2045 |      52.34 |
|        |   400000 |  2949 |  2080 |  4080 |    699 |            5349 |                 2692 |      50.33 |
|        |   500000 |  2947 |  2080 |  4080 |    714 |            6795 |                 3709 |      54.58 |
|        |   600000 |  2947 |  2080 |  4080 |    588 |            7726 |                 5214 |      67.49 |
|        |   700000 |  2858 |  2080 |  4088 |    619 |            9331 |                 4821 |      51.67 |
|        |   800000 |  3076 |  2080 |  4088 |    626 |           11148 |                 6241 |      55.98 |
|        |   900000 |  3060 |  2080 |  4088 |    715 |           11355 |                 5907 |      52.02 |
|        |  1000000 |  2726 |  2080 |  4080 |    602 |           11757 |                 5422 |      46.12 |
|        |  2000000 |  2707 |  2080 |  4088 |    530 |           24508 |                10877 |      44.38 |
|        |  3000000 |  2637 |  2080 |  4088 |    506 |           36842 |                15983 |      43.38 |
|        |  4000000 |  2639 |  2080 |  4088 |    509 |           49502 |                22745 |      45.95 |
|        |  5000000 |  2609 |  2080 |  4088 |    504 |           62102 |                28536 |      45.95 |
|        |  6000000 |  2622 |  2080 |  4088 |    525 |           74640 |                34797 |      46.62 |
|        |  7000000 |  2601 |  2080 |  4088 |    511 |           87232 |                40565 |      46.50 |
|        |  8000000 |  2593 |  2080 |  4088 |    513 |           99924 |                47249 |      47.28 |
|        |  9000000 |  2584 |  2080 |  4088 |    511 |          112551 |                48683 |      43.25 |
|        | 10000000 |  2597 |  2080 |  4088 |    527 |          125158 |                54245 |      43.34 |
| 
| 
|     2. 13 to 20 bytes name length; Value length is 13 bytes
|        | Nr xattr | 4kAvg | 4kmin | 4kmax | stddev | Total Nr leaves | Below avg space used | Percentage |
|        |----------+-------+-------+-------+--------+-----------------+----------------------+------------|
|        |    10000 |  2702 |  2096 |  3536 |    564 |              65 |                   30 |      46.15 |
|        |    20000 |  2746 |  2096 |  3968 |    687 |             122 |                   44 |      36.07 |
|        |   100000 |  2718 |  2092 |  3968 |    746 |             590 |                  180 |      30.51 |
|        |   200000 |  2782 |  2092 |  3968 |    690 |            1593 |                 1166 |      73.20 |
|        |   300000 |  2834 |  2092 |  4040 |    708 |            2557 |                 1473 |      57.61 |
|        |   400000 |  2764 |  2092 |  3968 |    536 |            3206 |                 1393 |      43.45 |
|        |   500000 |  2723 |  2092 |  4040 |    651 |            4045 |                 2449 |      60.54 |
|        |   600000 |  2870 |  2092 |  4040 |    594 |            4883 |                 2727 |      55.85 |
|        |   700000 |  2776 |  2092 |  4076 |    564 |            5903 |                 2647 |      44.84 |
|        |   800000 |  2659 |  2092 |  4076 |    510 |            6275 |                 3224 |      51.38 |
|        |   900000 |  2929 |  2092 |  3968 |    491 |            7113 |                 4207 |      59.15 |
|        |  1000000 |  3138 |  2092 |  4076 |    552 |            8916 |                 5746 |      64.45 |
|        |  2000000 |  3016 |  2096 |  4076 |    615 |           18119 |                11540 |      63.69 |
|        |  3000000 |  3010 |  2096 |  4076 |    642 |           27995 |                18411 |      65.77 |
|        |  4000000 |  2988 |  2096 |  4076 |    667 |           37346 |                22439 |      60.08 |
|        |  5000000 |  2977 |  2096 |  4076 |    670 |           47275 |                28745 |      60.80 |
|        |  6000000 |  2973 |  2096 |  4076 |    680 |           56479 |                33075 |      58.56 |
|        |  7000000 |  2968 |  2096 |  4076 |    680 |           66472 |                40288 |      60.61 |
|        |  8000000 |  2961 |  2096 |  4076 |    684 |           76241 |                45640 |      59.86 |
|        |  9000000 |  2958 |  2096 |  4076 |    684 |           86070 |                52306 |      60.77 |
|        | 10000000 |  2956 |  2096 |  4076 |    688 |           95179 |                56395 |      59.25 |

And there are a couple of logarithmic overhead data tables that go out
as far as 37 million xattrs...

> will address the review comments provided on this patchset and then run the
> benchmarks once again ... but this time I will increase the upper limit to 100
> million xattrs (since we will have a 32-bit extent counter). I will post the
> results of the benchmarking (along with the benchmarking programs/scripts) to
> the mailing list before I post the patchset itself.

Sounds good.

Though given the results of what you have done so far, I don't
expect to see any scalability issues until we hit machine memory
limits (i.e. can't cache all the dabtree metadata in memory) or
maximum dabtree depths.

Cheers,

Dave.

Dave Chinner April 8, 2020, 10:45 p.m. UTC | #12

On Wed, Apr 08, 2020 at 08:45:12AM -0700, Darrick J. Wong wrote:
> On Tue, Apr 07, 2020 at 09:30:02AM +1000, Dave Chinner wrote:
> > Long story short - there's no degradation in
> > performance in the dabtree out to tens of millions of records with
> > different fixed size or random sized attributes, nor does various
> > combinations of insert/lookup/remove/replace operations seem to
> > impact the tree performance at scale. IOWs, we hit the 16 bit extent
> > limits of the attribute trees without finding any degradation in
> > performance.
> 
> Ok.  I'll take "attr v3 upgrade" off my list of things to look out for.
> 
> > Hence we concluded that the dabtree structure does not require
> > significant modification or optimisation to work well with typical
> > parent pointer attribute demands...
> > 
> > As for free space indexes....
> > 
> > The issue with the directory structure that requires external free
> > space is that the directory data is not part of the dabtree itself.
> > The attribute fork stores all the attributes at the leaves of the
> > dabtree, while the directory structure stores the directory data in
> > external blocks and the dabtree only contains the name hash index
> > that points to the external data.
> > 
> > i.e. When we add an attribute to the dabtree, we split/merge leaves
> > of the tree based on where the name hash index tells us it needs to
> > be inserted/removed from. i.e. we make space available or collapse
> > sparse leaves of the dabtree as a side effect of inserting or
> > removing objects.
> > 
> > The directory structure is very different. The dirents cannot change
> > location as their logical offset into the dir data segment is used
> > as the readdir/seekdir/telldir cookie. Therefore that location is
> > not allowed to change for the life of the dirent and so we can't
> > store them in the leaves of a dabtree indexed in hash order because
> > the offset into the tree would change as other entries are inserted
> > and removed.  Hence when we remove dirents, we must leave holes in
> > the data segment so the rest of the dirent data does not change
> > logical offset.
> > 
> > The directory name hash index - the dabtree bit - is in a separate
> > segment (the 2nd one). Because it only stores pointers to dirents in
> > the data segment, it doesn't need to leave holes - the dabtree just
> > merge/splits as required as pointers to the dir data segment are
> > added/removed - and has no free space tracking.
> > 
> > Hence when we go to add a dirent, we need to find the best free
> > space in the dir data segment to add that dirent. This requires a
> > dir data segment free space index, and that is held in the 3rd dir
> > segment.  Once we've found the best free space via lookup in the
> > free space index, we go modify the dir data block it points to, then
> > update the dabtree to point the name hash at that new dirent.
> > 
> > IOWs, the requirement for a free space map in the directory
> > structure results from storing the dirent data externally to the
> > dabtree. Attributes are stored directly in the leaves of the
> > dabtree - except for remote attributes which can be anywhere in the
> > BMBT address space - and hence do not need external free space
> > tracking to determine where to best insert them...
> 
> <nod> Got it.  I've suspected this property about the xattr structures
> for a long time, so I'm glad to hear someone else echo that. :)
> 
> Dave: May I try to rework the above into something suitable for the
> ondisk format documentation?

Sure. Anything that helps people understand the complexity of the
directory data structure is a good thing :)

-Dave.

Chandan Rajendra April 10, 2020, 7:46 a.m. UTC | #13

On Tuesday, April 7, 2020 6:50 AM Dave Chinner wrote: 
> On Sat, Apr 04, 2020 at 02:22:03PM +0530, Chandan Rajendra wrote:
> > XFS has a per-inode xattr extent counter which is 16 bits wide. A workload
> > which
> > 1. Creates 1,000,000 255-byte sized xattrs,
> > 2. Deletes 50% of these xattrs in an alternating manner,
> > 3. Tries to create 400,000 new 255-byte sized xattrs
> > causes the following message to be printed on the console,
> > 
> > XFS (loop0): xfs_iflush_int: detected corrupt incore inode 131, total extents = -19916, nblocks = 102937, ptr ffff9ce33b098c00
> > XFS (loop0): xfs_do_force_shutdown(0x8) called from line 3739 of file fs/xfs/xfs_inode.c. Return address = ffffffffa4a94173
> > 
> > This indicates that we overflowed the 16-bits wide xattr extent counter.
> > 
> > I have been informed that there are instances where a single file has
> >  > 100 million hardlinks. With parent pointers being stored in xattr,
> > we will overflow the 16-bits wide xattr extent counter when large
> > number of hardlinks are created.
> > 
> > Hence this commit extends xattr extent counter to 32-bits. It also introduces
> > an incompat flag to prevent older kernels from mounting newer filesystems with
> > 32-bit wide xattr extent counter.
> > 
> > Signed-off-by: Chandan Rajendra <chandanrlinux@gmail.com>
> > ---
> >  fs/xfs/libxfs/xfs_format.h     | 28 +++++++++++++++++++++-------
> >  fs/xfs/libxfs/xfs_inode_buf.c  | 27 +++++++++++++++++++--------
> >  fs/xfs/libxfs/xfs_inode_fork.c |  3 ++-
> >  fs/xfs/libxfs/xfs_log_format.h |  5 +++--
> >  fs/xfs/libxfs/xfs_types.h      |  4 ++--
> >  fs/xfs/scrub/inode.c           |  7 ++++---
> >  fs/xfs/xfs_inode_item.c        |  3 ++-
> >  fs/xfs/xfs_log_recover.c       | 13 ++++++++++---
> >  8 files changed, 63 insertions(+), 27 deletions(-)
> > 
> > diff --git a/fs/xfs/libxfs/xfs_format.h b/fs/xfs/libxfs/xfs_format.h
> > index 045556e78ee2c..0a4266b0d46e1 100644
> > --- a/fs/xfs/libxfs/xfs_format.h
> > +++ b/fs/xfs/libxfs/xfs_format.h
> > @@ -465,10 +465,12 @@ xfs_sb_has_ro_compat_feature(
> >  #define XFS_SB_FEAT_INCOMPAT_FTYPE	(1 << 0)	/* filetype in dirent */
> >  #define XFS_SB_FEAT_INCOMPAT_SPINODES	(1 << 1)	/* sparse inode chunks */
> >  #define XFS_SB_FEAT_INCOMPAT_META_UUID	(1 << 2)	/* metadata UUID */
> > +#define XFS_SB_FEAT_INCOMPAT_32BIT_AEXT_CNTR (1 << 3)
> >  #define XFS_SB_FEAT_INCOMPAT_ALL \
> >  		(XFS_SB_FEAT_INCOMPAT_FTYPE|	\
> >  		 XFS_SB_FEAT_INCOMPAT_SPINODES|	\
> > -		 XFS_SB_FEAT_INCOMPAT_META_UUID)
> > +		 XFS_SB_FEAT_INCOMPAT_META_UUID| \
> > +		 XFS_SB_FEAT_INCOMPAT_32BIT_AEXT_CNTR)
> >  
> >  #define XFS_SB_FEAT_INCOMPAT_UNKNOWN	~XFS_SB_FEAT_INCOMPAT_ALL
> >  static inline bool
> > @@ -874,7 +876,7 @@ typedef struct xfs_dinode {
> >  	__be64		di_nblocks;	/* # of direct & btree blocks used */
> >  	__be32		di_extsize;	/* basic/minimum extent size for file */
> >  	__be32		di_nextents;	/* number of extents in data fork */
> > -	__be16		di_anextents;	/* number of extents in attribute fork*/
> > +	__be16		di_anextents_lo;/* lower part of xattr extent count */
> >  	__u8		di_forkoff;	/* attr fork offs, <<3 for 64b align */
> >  	__s8		di_aformat;	/* format of attr fork's data */
> >  	__be32		di_dmevmask;	/* DMIG event mask */
> > @@ -891,7 +893,8 @@ typedef struct xfs_dinode {
> >  	__be64		di_lsn;		/* flush sequence */
> >  	__be64		di_flags2;	/* more random flags */
> >  	__be32		di_cowextsize;	/* basic cow extent size for file */
> > -	__u8		di_pad2[12];	/* more padding for future expansion */
> > +	__be16		di_anextents_hi;/* higher part of xattr extent count */
> > +	__u8		di_pad2[10];	/* more padding for future expansion */
> 
> Ok, I think you've limited what we can do here by using this "fill
> holes" variable split. I've never liked doing this, and we've only
> done it in the past when we haven't had space in the inode to create
> a new 32 bit variable.
> 
> IOWs, this is a v5 format feature only, so we should just create a
> new variable:
> 
> 	__be32		di_attr_nextents;
> 
> With that in place, we can now do what we did extending the v1 inode
> link count (16 bits) to the v2 inode link count (32 bits).
> 
> That is, when the attribute count is going to overflow, we set an
> inode flag on disk to indicate that it now has a 32 bit extent count
> and uses that field in the inode, and we set a RO-compat feature
> flag in the superblock to indicate that there are 32 bit attr fork
> extent counts in use.
> 
> Old kernels can still read the filesystem, but see the extent count
> as "max" (65535); they can't modify the attr fork and hence can't
> corrupt the 32 bit count they know nothing about.
> 
> If the kernel sees the RO feature bit set, it can set the inode flag
> on inodes it is modifying and update both the old and new counters
> appropriately when flushing the inode to disk (i.e. transparent
> conversion).
> 
> In future, mkfs can then set the RO feature flag by default so all
> new filesystems use the 32 bit counter.
> 
> >  	/* fields only written to during inode creation */
> >  	xfs_timestamp_t	di_crtime;	/* time created */
> > @@ -993,10 +996,21 @@ enum xfs_dinode_fmt {
> >  	((w) == XFS_DATA_FORK ? \
> >  		(dip)->di_format : \
> >  		(dip)->di_aformat)
> > -#define XFS_DFORK_NEXTENTS(dip,w) \
> > -	((w) == XFS_DATA_FORK ? \
> > -		be32_to_cpu((dip)->di_nextents) : \
> > -		be16_to_cpu((dip)->di_anextents))
> > +
> > +static inline int32_t XFS_DFORK_NEXTENTS(struct xfs_sb *sbp,
> 
> If you are converting a macro to static inline, then all the caller
> sites should be converted to lower case at the same time.
> 
> > +					struct xfs_dinode *dip, int whichfork)
> > +{
> > +	int32_t anextents;
> 
> Extent counts should be unsigned, as they are on disk.
> 
> > +
> > +	if (whichfork == XFS_DATA_FORK)
> > +		return be32_to_cpu((dip)->di_nextents);
> > +
> > +	anextents = be16_to_cpu((dip)->di_anextents_lo);
> > +	if (xfs_sb_version_has_v3inode(sbp))
> > +		anextents |= ((u32)(be16_to_cpu((dip)->di_anextents_hi)) << 16);
> > +
> > +	return anextents;
> > +}
> 
> No feature bit to indicate that 32 bit attribute extent counts are
> valid?
> 
> >  
> >  /*
> >   * For block and character special files the 32bit dev_t is stored at the
> > diff --git a/fs/xfs/libxfs/xfs_inode_buf.c b/fs/xfs/libxfs/xfs_inode_buf.c
> > index 39c5a6e24915c..ced8195bd8c22 100644
> > --- a/fs/xfs/libxfs/xfs_inode_buf.c
> > +++ b/fs/xfs/libxfs/xfs_inode_buf.c
> > @@ -232,7 +232,8 @@ xfs_inode_from_disk(
> >  	to->di_nblocks = be64_to_cpu(from->di_nblocks);
> >  	to->di_extsize = be32_to_cpu(from->di_extsize);
> >  	to->di_nextents = be32_to_cpu(from->di_nextents);
> > -	to->di_anextents = be16_to_cpu(from->di_anextents);
> > +	to->di_anextents = XFS_DFORK_NEXTENTS(&ip->i_mount->m_sb, from,
> > +				XFS_ATTR_FORK);
> 
> This should be open coded, but I'd prefer a completely separate
> variable...
> 
> >  	to->di_forkoff = from->di_forkoff;
> >  	to->di_aformat	= from->di_aformat;
> >  	to->di_dmevmask	= be32_to_cpu(from->di_dmevmask);
> > @@ -282,7 +283,7 @@ xfs_inode_to_disk(
> >  	to->di_nblocks = cpu_to_be64(from->di_nblocks);
> >  	to->di_extsize = cpu_to_be32(from->di_extsize);
> >  	to->di_nextents = cpu_to_be32(from->di_nextents);
> > -	to->di_anextents = cpu_to_be16(from->di_anextents);
> > +	to->di_anextents_lo = cpu_to_be16((u32)(from->di_anextents) & 0xffff);
> >  	to->di_forkoff = from->di_forkoff;
> >  	to->di_aformat = from->di_aformat;
> >  	to->di_dmevmask = cpu_to_be32(from->di_dmevmask);
> > @@ -296,6 +297,8 @@ xfs_inode_to_disk(
> >  		to->di_crtime.t_nsec = cpu_to_be32(from->di_crtime.tv_nsec);
> >  		to->di_flags2 = cpu_to_be64(from->di_flags2);
> >  		to->di_cowextsize = cpu_to_be32(from->di_cowextsize);
> > +		to->di_anextents_hi
> > +			= cpu_to_be16((u32)(from->di_anextents) >> 16);
> 
> Again, feature bit for on-disk format modifications needed...
> 
> >  		to->di_ino = cpu_to_be64(ip->i_ino);
> >  		to->di_lsn = cpu_to_be64(lsn);
> >  		memset(to->di_pad2, 0, sizeof(to->di_pad2));
> > @@ -335,7 +338,7 @@ xfs_log_dinode_to_disk(
> >  	to->di_nblocks = cpu_to_be64(from->di_nblocks);
> >  	to->di_extsize = cpu_to_be32(from->di_extsize);
> >  	to->di_nextents = cpu_to_be32(from->di_nextents);
> > -	to->di_anextents = cpu_to_be16(from->di_anextents);
> > +	to->di_anextents_lo = cpu_to_be16(from->di_anextents_lo);
> >  	to->di_forkoff = from->di_forkoff;
> >  	to->di_aformat = from->di_aformat;
> >  	to->di_dmevmask = cpu_to_be32(from->di_dmevmask);
> > @@ -349,6 +352,7 @@ xfs_log_dinode_to_disk(
> >  		to->di_crtime.t_nsec = cpu_to_be32(from->di_crtime.t_nsec);
> >  		to->di_flags2 = cpu_to_be64(from->di_flags2);
> >  		to->di_cowextsize = cpu_to_be32(from->di_cowextsize);
> > +		to->di_anextents_hi = cpu_to_be16(from->di_anextents_hi);
> >  		to->di_ino = cpu_to_be64(from->di_ino);
> >  		to->di_lsn = cpu_to_be64(from->di_lsn);
> >  		memcpy(to->di_pad2, from->di_pad2, sizeof(to->di_pad2));
> > @@ -365,7 +369,9 @@ xfs_dinode_verify_fork(
> >  	struct xfs_mount	*mp,
> >  	int			whichfork)
> >  {
> > -	uint32_t		di_nextents = XFS_DFORK_NEXTENTS(dip, whichfork);
> > +	uint32_t		di_nextents;
> > +
> > +	di_nextents = XFS_DFORK_NEXTENTS(&mp->m_sb, dip, whichfork);
> >  
> >  	switch (XFS_DFORK_FORMAT(dip, whichfork)) {
> >  	case XFS_DINODE_FMT_LOCAL:
> > @@ -436,6 +442,9 @@ xfs_dinode_verify(
> >  	uint16_t		flags;
> >  	uint64_t		flags2;
> >  	uint64_t		di_size;
> > +	int32_t			nextents;
> > +	int32_t			anextents;
> > +	int64_t			nblocks;
> 
> Extent counts need to be converted to unsigned in memory - they are
> unsigned on disk....

In the current code, we have,

#define MAXEXTNUM       ((xfs_extnum_t)0x7fffffff)      /* signed int */
#define MAXAEXTNUM      ((xfs_aextnum_t)0x7fff)         /* signed short */

i.e. the maximum allowed data extent count and xattr extent count are the
maximum possible values of a signed int and a signed short respectively.

Can you please explain why signed maximum values were chosen when the
corresponding on-disk data types are unsigned?
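
For illustration only (this is not code from the patch): the shutdown message
in the commit description prints a negative total, which is what a count above
0x7fff looks like once it is viewed through a signed 16-bit type. The sketch
below does that reinterpretation arithmetically. as_signed16() and
check_wraparound() are hypothetical names, and 45620 is simply the unsigned
value that maps to -19916, not a number measured on the test filesystem.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Interpret a raw 16-bit value as a two's-complement signed number.
 * Done arithmetically so the result does not depend on
 * implementation-defined narrowing conversions.
 */
static int32_t as_signed16(uint16_t raw)
{
	return raw >= 0x8000 ? (int32_t)raw - 0x10000 : (int32_t)raw;
}

/* Hypothetical self-check of the values discussed above. */
static void check_wraparound(void)
{
	assert(as_signed16(0x7fff) == 32767);	/* MAXAEXTNUM is still positive */
	assert(as_signed16(0x8000) == -32768);	/* one more extent goes negative */
	assert(as_signed16(45620) == -19916);	/* value seen in the log message */
}
```

Calling check_wraparound() from a test program passes silently, i.e. an
unsigned count of 45620 and a signed reading of -19916 are the same 16 bits.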

Chandan Rajendra April 12, 2020, 6:34 a.m. UTC | #14

On Friday, April 10, 2020 1:16 PM Chandan Rajendra wrote: 
> On Tuesday, April 7, 2020 6:50 AM Dave Chinner wrote: 
> > On Sat, Apr 04, 2020 at 02:22:03PM +0530, Chandan Rajendra wrote:
> > > XFS has a per-inode xattr extent counter which is 16 bits wide. A workload
> > > which
> > > 1. Creates 1,000,000 255-byte sized xattrs,
> > > 2. Deletes 50% of these xattrs in an alternating manner,
> > > 3. Tries to create 400,000 new 255-byte sized xattrs
> > > causes the following message to be printed on the console,
> > > 
> > > XFS (loop0): xfs_iflush_int: detected corrupt incore inode 131, total extents = -19916, nblocks = 102937, ptr ffff9ce33b098c00
> > > XFS (loop0): xfs_do_force_shutdown(0x8) called from line 3739 of file fs/xfs/xfs_inode.c. Return address = ffffffffa4a94173
> > > 
> > > This indicates that we overflowed the 16-bits wide xattr extent counter.
> > > 
> > > I have been informed that there are instances where a single file has
> > >  > 100 million hardlinks. With parent pointers being stored in xattr,
> > > we will overflow the 16-bits wide xattr extent counter when large
> > > number of hardlinks are created.
> > > 
> > > Hence this commit extends xattr extent counter to 32-bits. It also introduces
> > > an incompat flag to prevent older kernels from mounting newer filesystems with
> > > 32-bit wide xattr extent counter.
> > > 
> > > Signed-off-by: Chandan Rajendra <chandanrlinux@gmail.com>
> > > ---
> > >  fs/xfs/libxfs/xfs_format.h     | 28 +++++++++++++++++++++-------
> > >  fs/xfs/libxfs/xfs_inode_buf.c  | 27 +++++++++++++++++++--------
> > >  fs/xfs/libxfs/xfs_inode_fork.c |  3 ++-
> > >  fs/xfs/libxfs/xfs_log_format.h |  5 +++--
> > >  fs/xfs/libxfs/xfs_types.h      |  4 ++--
> > >  fs/xfs/scrub/inode.c           |  7 ++++---
> > >  fs/xfs/xfs_inode_item.c        |  3 ++-
> > >  fs/xfs/xfs_log_recover.c       | 13 ++++++++++---
> > >  8 files changed, 63 insertions(+), 27 deletions(-)
> > > 
> > > diff --git a/fs/xfs/libxfs/xfs_format.h b/fs/xfs/libxfs/xfs_format.h
> > > index 045556e78ee2c..0a4266b0d46e1 100644
> > > --- a/fs/xfs/libxfs/xfs_format.h
> > > +++ b/fs/xfs/libxfs/xfs_format.h
> > > @@ -465,10 +465,12 @@ xfs_sb_has_ro_compat_feature(
> > >  #define XFS_SB_FEAT_INCOMPAT_FTYPE	(1 << 0)	/* filetype in dirent */
> > >  #define XFS_SB_FEAT_INCOMPAT_SPINODES	(1 << 1)	/* sparse inode chunks */
> > >  #define XFS_SB_FEAT_INCOMPAT_META_UUID	(1 << 2)	/* metadata UUID */
> > > +#define XFS_SB_FEAT_INCOMPAT_32BIT_AEXT_CNTR (1 << 3)
> > >  #define XFS_SB_FEAT_INCOMPAT_ALL \
> > >  		(XFS_SB_FEAT_INCOMPAT_FTYPE|	\
> > >  		 XFS_SB_FEAT_INCOMPAT_SPINODES|	\
> > > -		 XFS_SB_FEAT_INCOMPAT_META_UUID)
> > > +		 XFS_SB_FEAT_INCOMPAT_META_UUID| \
> > > +		 XFS_SB_FEAT_INCOMPAT_32BIT_AEXT_CNTR)
> > >  
> > >  #define XFS_SB_FEAT_INCOMPAT_UNKNOWN	~XFS_SB_FEAT_INCOMPAT_ALL
> > >  static inline bool
> > > @@ -874,7 +876,7 @@ typedef struct xfs_dinode {
> > >  	__be64		di_nblocks;	/* # of direct & btree blocks used */
> > >  	__be32		di_extsize;	/* basic/minimum extent size for file */
> > >  	__be32		di_nextents;	/* number of extents in data fork */
> > > -	__be16		di_anextents;	/* number of extents in attribute fork*/
> > > +	__be16		di_anextents_lo;/* lower part of xattr extent count */
> > >  	__u8		di_forkoff;	/* attr fork offs, <<3 for 64b align */
> > >  	__s8		di_aformat;	/* format of attr fork's data */
> > >  	__be32		di_dmevmask;	/* DMIG event mask */
> > > @@ -891,7 +893,8 @@ typedef struct xfs_dinode {
> > >  	__be64		di_lsn;		/* flush sequence */
> > >  	__be64		di_flags2;	/* more random flags */
> > >  	__be32		di_cowextsize;	/* basic cow extent size for file */
> > > -	__u8		di_pad2[12];	/* more padding for future expansion */
> > > +	__be16		di_anextents_hi;/* higher part of xattr extent count */
> > > +	__u8		di_pad2[10];	/* more padding for future expansion */
> > 
> > Ok, I think you've limited what we can do here by using this "fill
> > holes" variable split. I've never liked doing this, and we've only
> > done it in the past when we haven't had space in the inode to create
> > a new 32 bit variable.
> > 
> > IOWs, this is a v5 format feature only, so we should just create a
> > new variable:
> > 
> > 	__be32		di_attr_nextents;
> > 
> > With that in place, we can now do what we did extending the v1 inode
> > link count (16 bits) to the v2 inode link count (32 bits).
> > 
> > That is, when the attribute count is going to overflow, we set an
> > inode flag on disk to indicate that it now has a 32 bit extent count
> > and uses that field in the inode, and we set a RO-compat feature
> > flag in the superblock to indicate that there are 32 bit attr fork
> > extent counts in use.
> > 
> > Old kernels can still read the filesystem, but see the extent count
> > as "max" (65535); they can't modify the attr fork and hence can't
> > corrupt the 32 bit count they know nothing about.
> > 
> > If the kernel sees the RO feature bit set, it can set the inode flag
> > on inodes it is modifying and update both the old and new counters
> > appropriately when flushing the inode to disk (i.e. transparent
> > conversion).
> > 
> > In future, mkfs can then set the RO feature flag by default so all
> > new filesystems use the 32 bit counter.
> > 
> > >  	/* fields only written to during inode creation */
> > >  	xfs_timestamp_t	di_crtime;	/* time created */
> > > @@ -993,10 +996,21 @@ enum xfs_dinode_fmt {
> > >  	((w) == XFS_DATA_FORK ? \
> > >  		(dip)->di_format : \
> > >  		(dip)->di_aformat)
> > > -#define XFS_DFORK_NEXTENTS(dip,w) \
> > > -	((w) == XFS_DATA_FORK ? \
> > > -		be32_to_cpu((dip)->di_nextents) : \
> > > -		be16_to_cpu((dip)->di_anextents))
> > > +
> > > +static inline int32_t XFS_DFORK_NEXTENTS(struct xfs_sb *sbp,
> > 
> > If you are converting a macro to static inline, then all the caller
> > sites should be converted to lower case at the same time.
> > 
> > > +					struct xfs_dinode *dip, int whichfork)
> > > +{
> > > +	int32_t anextents;
> > 
> > Extent counts should be unsigned, as they are on disk.
> > 
> > > +
> > > +	if (whichfork == XFS_DATA_FORK)
> > > +		return be32_to_cpu((dip)->di_nextents);
> > > +
> > > +	anextents = be16_to_cpu((dip)->di_anextents_lo);
> > > +	if (xfs_sb_version_has_v3inode(sbp))
> > > +		anextents |= ((u32)(be16_to_cpu((dip)->di_anextents_hi)) << 16);
> > > +
> > > +	return anextents;
> > > +}
> > 
> > No feature bit to indicate that 32 bit attribute extent counts are
> > valid?
> > 
> > >  
> > >  /*
> > >   * For block and character special files the 32bit dev_t is stored at the
> > > diff --git a/fs/xfs/libxfs/xfs_inode_buf.c b/fs/xfs/libxfs/xfs_inode_buf.c
> > > index 39c5a6e24915c..ced8195bd8c22 100644
> > > --- a/fs/xfs/libxfs/xfs_inode_buf.c
> > > +++ b/fs/xfs/libxfs/xfs_inode_buf.c
> > > @@ -232,7 +232,8 @@ xfs_inode_from_disk(
> > >  	to->di_nblocks = be64_to_cpu(from->di_nblocks);
> > >  	to->di_extsize = be32_to_cpu(from->di_extsize);
> > >  	to->di_nextents = be32_to_cpu(from->di_nextents);
> > > -	to->di_anextents = be16_to_cpu(from->di_anextents);
> > > +	to->di_anextents = XFS_DFORK_NEXTENTS(&ip->i_mount->m_sb, from,
> > > +				XFS_ATTR_FORK);
> > 
> > This should be open coded, but I'd prefer a completely separate
> > variable...
> > 
> > >  	to->di_forkoff = from->di_forkoff;
> > >  	to->di_aformat	= from->di_aformat;
> > >  	to->di_dmevmask	= be32_to_cpu(from->di_dmevmask);
> > > @@ -282,7 +283,7 @@ xfs_inode_to_disk(
> > >  	to->di_nblocks = cpu_to_be64(from->di_nblocks);
> > >  	to->di_extsize = cpu_to_be32(from->di_extsize);
> > >  	to->di_nextents = cpu_to_be32(from->di_nextents);
> > > -	to->di_anextents = cpu_to_be16(from->di_anextents);
> > > +	to->di_anextents_lo = cpu_to_be16((u32)(from->di_anextents) & 0xffff);
> > >  	to->di_forkoff = from->di_forkoff;
> > >  	to->di_aformat = from->di_aformat;
> > >  	to->di_dmevmask = cpu_to_be32(from->di_dmevmask);
> > > @@ -296,6 +297,8 @@ xfs_inode_to_disk(
> > >  		to->di_crtime.t_nsec = cpu_to_be32(from->di_crtime.tv_nsec);
> > >  		to->di_flags2 = cpu_to_be64(from->di_flags2);
> > >  		to->di_cowextsize = cpu_to_be32(from->di_cowextsize);
> > > +		to->di_anextents_hi
> > > +			= cpu_to_be16((u32)(from->di_anextents) >> 16);
> > 
> > Again, feature bit for on-disk format modifications needed...
> > 
> > >  		to->di_ino = cpu_to_be64(ip->i_ino);
> > >  		to->di_lsn = cpu_to_be64(lsn);
> > >  		memset(to->di_pad2, 0, sizeof(to->di_pad2));
> > > @@ -335,7 +338,7 @@ xfs_log_dinode_to_disk(
> > >  	to->di_nblocks = cpu_to_be64(from->di_nblocks);
> > >  	to->di_extsize = cpu_to_be32(from->di_extsize);
> > >  	to->di_nextents = cpu_to_be32(from->di_nextents);
> > > -	to->di_anextents = cpu_to_be16(from->di_anextents);
> > > +	to->di_anextents_lo = cpu_to_be16(from->di_anextents_lo);
> > >  	to->di_forkoff = from->di_forkoff;
> > >  	to->di_aformat = from->di_aformat;
> > >  	to->di_dmevmask = cpu_to_be32(from->di_dmevmask);
> > > @@ -349,6 +352,7 @@ xfs_log_dinode_to_disk(
> > >  		to->di_crtime.t_nsec = cpu_to_be32(from->di_crtime.t_nsec);
> > >  		to->di_flags2 = cpu_to_be64(from->di_flags2);
> > >  		to->di_cowextsize = cpu_to_be32(from->di_cowextsize);
> > > +		to->di_anextents_hi = cpu_to_be16(from->di_anextents_hi);
> > >  		to->di_ino = cpu_to_be64(from->di_ino);
> > >  		to->di_lsn = cpu_to_be64(from->di_lsn);
> > >  		memcpy(to->di_pad2, from->di_pad2, sizeof(to->di_pad2));
> > > @@ -365,7 +369,9 @@ xfs_dinode_verify_fork(
> > >  	struct xfs_mount	*mp,
> > >  	int			whichfork)
> > >  {
> > > -	uint32_t		di_nextents = XFS_DFORK_NEXTENTS(dip, whichfork);
> > > +	uint32_t		di_nextents;
> > > +
> > > +	di_nextents = XFS_DFORK_NEXTENTS(&mp->m_sb, dip, whichfork);
> > >  
> > >  	switch (XFS_DFORK_FORMAT(dip, whichfork)) {
> > >  	case XFS_DINODE_FMT_LOCAL:
> > > @@ -436,6 +442,9 @@ xfs_dinode_verify(
> > >  	uint16_t		flags;
> > >  	uint64_t		flags2;
> > >  	uint64_t		di_size;
> > > +	int32_t			nextents;
> > > +	int32_t			anextents;
> > > +	int64_t			nblocks;
> > 
> > Extent counts need to be converted to unsigned in memory - they are
> > unsigned on disk....
> 
> In the current code, we have,
> 
> #define MAXEXTNUM       ((xfs_extnum_t)0x7fffffff)      /* signed int */
> #define MAXAEXTNUM      ((xfs_aextnum_t)0x7fff)         /* signed short */
> 
> i.e. the maximum allowed data extent count and xattr extent count are the
> maximum possible values of a signed int and a signed short respectively.
> 
> Can you please explain why signed maximum values were chosen when the
> corresponding on-disk data types are unsigned?
> 
> 

Ok. So the reason I asked that question was because I was wondering if
changing the maximum number of extents for data and attr would cause a change
in the height of the corresponding bmbt trees (which in turn could change the
log reservation values). The following calculations prove otherwise:

- 5 levels deep data bmbt tree.
  |-------+------------------------+-------------------------------|
  | level | number of nodes/leaves | Total Nr recs                 |
  |-------+------------------------+-------------------------------|
  |     0 |                      1 | 3 (max root recs)             |
  |     1 |                      3 | 125 * 3 = 375                 |
  |     2 |                    375 | 125 * 375 = 46875             |
  |     3 |                  46875 | 125 * 46875 = 5859375         |
  |     4 |                5859375 | 125 * 5859375 = 732421875     |
  |     5 |              732421875 | 125 * 732421875 = 91552734375 |
  |-------+------------------------+-------------------------------|

- 3 levels deep attr bmbt tree.
  |-------+------------------------+-----------------------|
  | level | number of nodes/leaves | Total Nr recs         |
  |-------+------------------------+-----------------------|
  |     0 |                      1 | 2 (max root recs)     |
  |     1 |                      2 | 125 * 2 = 250         |
  |     2 |                    250 | 125 * 250 = 31250     |
  |     3 |                  31250 | 125 * 31250 = 3906250 |
  |-------+------------------------+-----------------------|

- Data type to max extents and leaf blocks needed (125 records per block)
  |-----------+-------------+-----------------|
  | data type | max extents | max leaf blocks |
  |-----------+-------------+-----------------|
  | int32     |  2147483647 |        17179870 |
  | uint32    |  4294967295 |        34359739 |
  | int16     |       32767 |             263 |
  | uint16    |       65535 |             525 |
  |-----------+-------------+-----------------|

So data bmbt will still have a height of 5 and attr bmbt will continue to have
a height of 3.
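
The arithmetic behind the tables can be reproduced with a small sketch. It
assumes, as the tables do, 125 records per bmbt block and 3 (data fork) or
2 (attr fork) records in the inode root. leaf_blocks() and leaf_level() are
hypothetical helper names, not kernel functions.

```c
#include <assert.h>
#include <stdint.h>

/* Leaf blocks needed for nr_extents records: ceil(nr_extents / recs_per_block). */
static uint64_t leaf_blocks(uint64_t nr_extents, uint64_t recs_per_block)
{
	assert(recs_per_block > 0);
	return (nr_extents + recs_per_block - 1) / recs_per_block;
}

/*
 * Smallest level (root = level 0) that has enough blocks to serve as
 * the leaf level. Level 1 has root_recs blocks; each further level
 * has recs_per_block times as many blocks as the one above it.
 */
static unsigned int leaf_level(uint64_t nr_extents, uint64_t root_recs,
			       uint64_t recs_per_block)
{
	uint64_t need = leaf_blocks(nr_extents, recs_per_block);
	uint64_t blocks = root_recs;	/* blocks available at level 1 */
	unsigned int level = 1;

	assert(root_recs > 0 && recs_per_block > 1);
	while (blocks < need) {
		blocks *= recs_per_block;
		level++;
	}
	return level;
}
```

Under those assumptions leaf_level() returns 5 for both 0x7fffffff and
0xffffffff data extents (root_recs = 3), and 3 for both 0x7fff and 0xffff
attr extents (root_recs = 2), which matches the tables above.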

Darrick J. Wong April 13, 2020, 6:55 p.m. UTC | #15

On Sun, Apr 12, 2020 at 12:04:13PM +0530, Chandan Rajendra wrote:
> On Friday, April 10, 2020 1:16 PM Chandan Rajendra wrote: 
> > On Tuesday, April 7, 2020 6:50 AM Dave Chinner wrote: 
> > > On Sat, Apr 04, 2020 at 02:22:03PM +0530, Chandan Rajendra wrote:
> > > > XFS has a per-inode xattr extent counter which is 16 bits wide. A workload
> > > > which
> > > > 1. Creates 1,000,000 255-byte sized xattrs,
> > > > 2. Deletes 50% of these xattrs in an alternating manner,
> > > > 3. Tries to create 400,000 new 255-byte sized xattrs
> > > > causes the following message to be printed on the console,
> > > > 
> > > > XFS (loop0): xfs_iflush_int: detected corrupt incore inode 131, total extents = -19916, nblocks = 102937, ptr ffff9ce33b098c00
> > > > XFS (loop0): xfs_do_force_shutdown(0x8) called from line 3739 of file fs/xfs/xfs_inode.c. Return address = ffffffffa4a94173
> > > > 
> > > > This indicates that we overflowed the 16-bits wide xattr extent counter.
> > > > 
> > > > I have been informed that there are instances where a single file has
> > > >  > 100 million hardlinks. With parent pointers being stored in xattr,
> > > > we will overflow the 16-bits wide xattr extent counter when large
> > > > number of hardlinks are created.
> > > > 
> > > > Hence this commit extends xattr extent counter to 32-bits. It also introduces
> > > > an incompat flag to prevent older kernels from mounting newer filesystems with
> > > > 32-bit wide xattr extent counter.
> > > > 
> > > > Signed-off-by: Chandan Rajendra <chandanrlinux@gmail.com>
> > > > ---
> > > >  fs/xfs/libxfs/xfs_format.h     | 28 +++++++++++++++++++++-------
> > > >  fs/xfs/libxfs/xfs_inode_buf.c  | 27 +++++++++++++++++++--------
> > > >  fs/xfs/libxfs/xfs_inode_fork.c |  3 ++-
> > > >  fs/xfs/libxfs/xfs_log_format.h |  5 +++--
> > > >  fs/xfs/libxfs/xfs_types.h      |  4 ++--
> > > >  fs/xfs/scrub/inode.c           |  7 ++++---
> > > >  fs/xfs/xfs_inode_item.c        |  3 ++-
> > > >  fs/xfs/xfs_log_recover.c       | 13 ++++++++++---
> > > >  8 files changed, 63 insertions(+), 27 deletions(-)
> > > > 
> > > > diff --git a/fs/xfs/libxfs/xfs_format.h b/fs/xfs/libxfs/xfs_format.h
> > > > index 045556e78ee2c..0a4266b0d46e1 100644
> > > > --- a/fs/xfs/libxfs/xfs_format.h
> > > > +++ b/fs/xfs/libxfs/xfs_format.h
> > > > @@ -465,10 +465,12 @@ xfs_sb_has_ro_compat_feature(
> > > >  #define XFS_SB_FEAT_INCOMPAT_FTYPE	(1 << 0)	/* filetype in dirent */
> > > >  #define XFS_SB_FEAT_INCOMPAT_SPINODES	(1 << 1)	/* sparse inode chunks */
> > > >  #define XFS_SB_FEAT_INCOMPAT_META_UUID	(1 << 2)	/* metadata UUID */
> > > > +#define XFS_SB_FEAT_INCOMPAT_32BIT_AEXT_CNTR (1 << 3)
> > > >  #define XFS_SB_FEAT_INCOMPAT_ALL \
> > > >  		(XFS_SB_FEAT_INCOMPAT_FTYPE|	\
> > > >  		 XFS_SB_FEAT_INCOMPAT_SPINODES|	\
> > > > -		 XFS_SB_FEAT_INCOMPAT_META_UUID)
> > > > +		 XFS_SB_FEAT_INCOMPAT_META_UUID| \
> > > > +		 XFS_SB_FEAT_INCOMPAT_32BIT_AEXT_CNTR)
> > > >  
> > > >  #define XFS_SB_FEAT_INCOMPAT_UNKNOWN	~XFS_SB_FEAT_INCOMPAT_ALL
> > > >  static inline bool
> > > > @@ -874,7 +876,7 @@ typedef struct xfs_dinode {
> > > >  	__be64		di_nblocks;	/* # of direct & btree blocks used */
> > > >  	__be32		di_extsize;	/* basic/minimum extent size for file */
> > > >  	__be32		di_nextents;	/* number of extents in data fork */
> > > > -	__be16		di_anextents;	/* number of extents in attribute fork*/
> > > > +	__be16		di_anextents_lo;/* lower part of xattr extent count */
> > > >  	__u8		di_forkoff;	/* attr fork offs, <<3 for 64b align */
> > > >  	__s8		di_aformat;	/* format of attr fork's data */
> > > >  	__be32		di_dmevmask;	/* DMIG event mask */
> > > > @@ -891,7 +893,8 @@ typedef struct xfs_dinode {
> > > >  	__be64		di_lsn;		/* flush sequence */
> > > >  	__be64		di_flags2;	/* more random flags */
> > > >  	__be32		di_cowextsize;	/* basic cow extent size for file */
> > > > -	__u8		di_pad2[12];	/* more padding for future expansion */
> > > > +	__be16		di_anextents_hi;/* higher part of xattr extent count */
> > > > +	__u8		di_pad2[10];	/* more padding for future expansion */
> > > 
> > > Ok, I think you've limited what we can do here by using this "fill
> > > holes" variable split. I've never liked doing this, and we've only
> > > done it in the past when we haven't had space in the inode to create
> > > a new 32 bit variable.
> > > 
> > > IOWs, this is a v5 format feature only, so we should just create a
> > > new variable:
> > > 
> > > 	__be32		di_attr_nextents;
> > > 
> > > With that in place, we can now do what we did extending the v1 inode
> > > link count (16 bits) to the v2 inode link count (32 bits).
> > > 
> > > That is, when the attribute count is going to overflow, we set an
> > > inode flag on disk to indicate that it now has a 32 bit extent count
> > > and uses that field in the inode, and we set a RO-compat feature
> > > flag in the superblock to indicate that there are 32 bit attr fork
> > > extent counts in use.
> > > 
> > > Old kernels can still read the filesystem, but see the extent count
> > > as "max" (65535); they can't modify the attr fork and hence can't
> > > corrupt the 32 bit count they know nothing about.
> > > 
> > > If the kernel sees the RO feature bit set, it can set the inode flag
> > > on inodes it is modifying and update both the old and new counters
> > > appropriately when flushing the inode to disk (i.e. transparent
> > > conversion).
> > > 
> > > In future, mkfs can then set the RO feature flag by default so all
> > > new filesystems use the 32 bit counter.
> > > 
> > > >  	/* fields only written to during inode creation */
> > > >  	xfs_timestamp_t	di_crtime;	/* time created */
> > > > @@ -993,10 +996,21 @@ enum xfs_dinode_fmt {
> > > >  	((w) == XFS_DATA_FORK ? \
> > > >  		(dip)->di_format : \
> > > >  		(dip)->di_aformat)
> > > > -#define XFS_DFORK_NEXTENTS(dip,w) \
> > > > -	((w) == XFS_DATA_FORK ? \
> > > > -		be32_to_cpu((dip)->di_nextents) : \
> > > > -		be16_to_cpu((dip)->di_anextents))
> > > > +
> > > > +static inline int32_t XFS_DFORK_NEXTENTS(struct xfs_sb *sbp,
> > > 
> > > If you are converting a macro to static inline, then all the caller
> > > sites should be converted to lower case at the same time.
> > > 
> > > > +					struct xfs_dinode *dip, int whichfork)
> > > > +{
> > > > +	int32_t anextents;
> > > 
> > > Extent counts should be unsigned, as they are on disk.
> > > 
> > > > +
> > > > +	if (whichfork == XFS_DATA_FORK)
> > > > +		return be32_to_cpu((dip)->di_nextents);
> > > > +
> > > > +	anextents = be16_to_cpu((dip)->di_anextents_lo);
> > > > +	if (xfs_sb_version_has_v3inode(sbp))
> > > > +		anextents |= ((u32)(be16_to_cpu((dip)->di_anextents_hi)) << 16);
> > > > +
> > > > +	return anextents;
> > > > +}
> > > 
> > > No feature bit to indicate that 32 bit attribute extent counts are
> > > valid?
> > > 
> > > >  
> > > >  /*
> > > >   * For block and character special files the 32bit dev_t is stored at the
> > > > diff --git a/fs/xfs/libxfs/xfs_inode_buf.c b/fs/xfs/libxfs/xfs_inode_buf.c
> > > > index 39c5a6e24915c..ced8195bd8c22 100644
> > > > --- a/fs/xfs/libxfs/xfs_inode_buf.c
> > > > +++ b/fs/xfs/libxfs/xfs_inode_buf.c
> > > > @@ -232,7 +232,8 @@ xfs_inode_from_disk(
> > > >  	to->di_nblocks = be64_to_cpu(from->di_nblocks);
> > > >  	to->di_extsize = be32_to_cpu(from->di_extsize);
> > > >  	to->di_nextents = be32_to_cpu(from->di_nextents);
> > > > -	to->di_anextents = be16_to_cpu(from->di_anextents);
> > > > +	to->di_anextents = XFS_DFORK_NEXTENTS(&ip->i_mount->m_sb, from,
> > > > +				XFS_ATTR_FORK);
> > > 
> > > This should be open coded, but I'd prefer a completely separate
> > > variable...
> > > 
> > > >  	to->di_forkoff = from->di_forkoff;
> > > >  	to->di_aformat	= from->di_aformat;
> > > >  	to->di_dmevmask	= be32_to_cpu(from->di_dmevmask);
> > > > @@ -282,7 +283,7 @@ xfs_inode_to_disk(
> > > >  	to->di_nblocks = cpu_to_be64(from->di_nblocks);
> > > >  	to->di_extsize = cpu_to_be32(from->di_extsize);
> > > >  	to->di_nextents = cpu_to_be32(from->di_nextents);
> > > > -	to->di_anextents = cpu_to_be16(from->di_anextents);
> > > > +	to->di_anextents_lo = cpu_to_be16((u32)(from->di_anextents) & 0xffff);
> > > >  	to->di_forkoff = from->di_forkoff;
> > > >  	to->di_aformat = from->di_aformat;
> > > >  	to->di_dmevmask = cpu_to_be32(from->di_dmevmask);
> > > > @@ -296,6 +297,8 @@ xfs_inode_to_disk(
> > > >  		to->di_crtime.t_nsec = cpu_to_be32(from->di_crtime.tv_nsec);
> > > >  		to->di_flags2 = cpu_to_be64(from->di_flags2);
> > > >  		to->di_cowextsize = cpu_to_be32(from->di_cowextsize);
> > > > +		to->di_anextents_hi
> > > > +			= cpu_to_be16((u32)(from->di_anextents) >> 16);
> > > 
> > > Again, feature bit for on-disk format modifications needed...
> > > 
> > > >  		to->di_ino = cpu_to_be64(ip->i_ino);
> > > >  		to->di_lsn = cpu_to_be64(lsn);
> > > >  		memset(to->di_pad2, 0, sizeof(to->di_pad2));
> > > > @@ -335,7 +338,7 @@ xfs_log_dinode_to_disk(
> > > >  	to->di_nblocks = cpu_to_be64(from->di_nblocks);
> > > >  	to->di_extsize = cpu_to_be32(from->di_extsize);
> > > >  	to->di_nextents = cpu_to_be32(from->di_nextents);
> > > > -	to->di_anextents = cpu_to_be16(from->di_anextents);
> > > > +	to->di_anextents_lo = cpu_to_be16(from->di_anextents_lo);
> > > >  	to->di_forkoff = from->di_forkoff;
> > > >  	to->di_aformat = from->di_aformat;
> > > >  	to->di_dmevmask = cpu_to_be32(from->di_dmevmask);
> > > > @@ -349,6 +352,7 @@ xfs_log_dinode_to_disk(
> > > >  		to->di_crtime.t_nsec = cpu_to_be32(from->di_crtime.t_nsec);
> > > >  		to->di_flags2 = cpu_to_be64(from->di_flags2);
> > > >  		to->di_cowextsize = cpu_to_be32(from->di_cowextsize);
> > > > +		to->di_anextents_hi = cpu_to_be16(from->di_anextents_hi);
> > > >  		to->di_ino = cpu_to_be64(from->di_ino);
> > > >  		to->di_lsn = cpu_to_be64(from->di_lsn);
> > > >  		memcpy(to->di_pad2, from->di_pad2, sizeof(to->di_pad2));
> > > > @@ -365,7 +369,9 @@ xfs_dinode_verify_fork(
> > > >  	struct xfs_mount	*mp,
> > > >  	int			whichfork)
> > > >  {
> > > > -	uint32_t		di_nextents = XFS_DFORK_NEXTENTS(dip, whichfork);
> > > > +	uint32_t		di_nextents;
> > > > +
> > > > +	di_nextents = XFS_DFORK_NEXTENTS(&mp->m_sb, dip, whichfork);
> > > >  
> > > >  	switch (XFS_DFORK_FORMAT(dip, whichfork)) {
> > > >  	case XFS_DINODE_FMT_LOCAL:
> > > > @@ -436,6 +442,9 @@ xfs_dinode_verify(
> > > >  	uint16_t		flags;
> > > >  	uint64_t		flags2;
> > > >  	uint64_t		di_size;
> > > > +	int32_t			nextents;
> > > > +	int32_t			anextents;
> > > > +	int64_t			nblocks;
> > > 
> > > Extent counts need to be converted to unsigned in memory - they are
> > > unsigned on disk....
> > 
> > In the current code, we have,
> > 
> > #define MAXEXTNUM       ((xfs_extnum_t)0x7fffffff)      /* signed int */
> > #define MAXAEXTNUM      ((xfs_aextnum_t)0x7fff)         /* signed short */
> > 
> > i.e. the maximum allowed data extent counter and xattr extent counter are
> > maximum possible values w.r.t signed int and signed short.
> > 
> > Can you please explain as to why signed maximum values were considered when
> > the corresponding on-disk data types are unsigned?
> > 
> > 
> 
> Ok. So the reason I asked that question was because I was wondering if
> changing the maximum number of extents for data and attr would change the
> height of the corresponding bmbt trees (which in turn could change the log
> reservation values). The following calculations prove otherwise,
> 
> - 5 levels deep data bmbt tree.
>   |-------+------------------------+-------------------------------|
>   | level | number of nodes/leaves | Total Nr recs                 |
>   |-------+------------------------+-------------------------------|
>   |     0 |                      1 | 3 (max root recs)             |
>   |     1 |                      3 | 125 * 3 = 375                 |
>   |     2 |                    375 | 125 * 375 = 46875             |
>   |     3 |                  46875 | 125 * 46875 = 5859375         |
>   |     4 |                5859375 | 125 * 5859375 = 732421875     |
>   |     5 |              732421875 | 125 * 732421875 = 91552734375 |
>   |-------+------------------------+-------------------------------|
> 
> - 3 levels deep attr bmbt tree.
>   |-------+------------------------+-----------------------|
>   | level | number of nodes/leaves | Total Nr recs         |
>   |-------+------------------------+-----------------------|
>   |     0 |                      1 | 2 (max root recs)     |
>   |     1 |                      2 | 125 * 2 = 250         |
>   |     2 |                    250 | 125 * 250 = 31250     |
>   |     3 |                  31250 | 125 * 31250 = 3906250 |
>   |-------+------------------------+-----------------------|
> 
> - Data type to number of records
>   |-----------+-------------+-----------------|
>   | data type | max extents | max leaf blocks |
>   |-----------+-------------+-----------------|
>   | int32     |  2147483647 |        17179870 |
>   | uint32    |  4294967295 |        34359739 |
>   | int16     |       32767 |             263 |
>   | uint16    |       65535 |             525 |
>   |-----------+-------------+-----------------|
> 
> So data bmbt will still have a height of 5 and attr bmbt will continue to have
> a height of 3.

I think extent count variables should be unsigned because there's no
meaning for a negative extent count.  ("You have -3 extents." "Ehh???")

That said, it was very helpful to point out that the current MAXEXTNUM /
MAXAEXTNUM symbols stop short of using all 32 (or 16) bits.

Can we use this new feature flag + inode flag to allow 4294967295
extents in either fork?

--D
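
A minimal C sketch of why a signed view of the counter is misleading,
assuming a 16-bit field that is read back as two's complement. The helper
name as_signed16() is hypothetical and is not part of XFS. The value 45620
is chosen only because it reads back as the -19916 seen in the report; the
reported total may also include data fork extents.

```c
#include <stdint.h>

/*
 * Reinterpret the low 16 bits of an extent count as a two's complement
 * signed value, which is what a signed 16-bit in-memory counter would show.
 * Hypothetical helper, for illustration only.
 */
int32_t as_signed16(uint32_t count)
{
	uint32_t low = count & 0xffffu;

	/* Values with bit 15 set read back as negative numbers. */
	return low >= 0x8000u ? (int32_t)low - 0x10000 : (int32_t)low;
}
```

With this helper, as_signed16(45620) gives -19916 while as_signed16(32767)
stays 32767, so any count above MAXAEXTNUM turns negative once it is viewed
through a signed short.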

> 
> -- 
> chandan
> 
> 
>

Chandan Rajendra April 20, 2020, 4:38 a.m. UTC | #16

On Tuesday, April 14, 2020 12:25 AM Darrick J. Wong wrote: 
> On Sun, Apr 12, 2020 at 12:04:13PM +0530, Chandan Rajendra wrote:
> > On Friday, April 10, 2020 1:16 PM Chandan Rajendra wrote: 
> > > On Tuesday, April 7, 2020 6:50 AM Dave Chinner wrote: 
> > > > On Sat, Apr 04, 2020 at 02:22:03PM +0530, Chandan Rajendra wrote:
> > > > > XFS has a per-inode xattr extent counter which is 16 bits wide. A workload
> > > > > which
> > > > > 1. Creates 1,000,000 255-byte sized xattrs,
> > > > > 2. Deletes 50% of these xattrs in an alternating manner,
> > > > > 3. Tries to create 400,000 new 255-byte sized xattrs
> > > > > causes the following message to be printed on the console,
> > > > > 
> > > > > XFS (loop0): xfs_iflush_int: detected corrupt incore inode 131, total extents = -19916, nblocks = 102937, ptr ffff9ce33b098c00
> > > > > XFS (loop0): xfs_do_force_shutdown(0x8) called from line 3739 of file fs/xfs/xfs_inode.c. Return address = ffffffffa4a94173
> > > > > 
> > > > > This indicates that we overflowed the 16-bits wide xattr extent counter.
> > > > > 
> > > > > I have been informed that there are instances where a single file has
> > > > >  > 100 million hardlinks. With parent pointers being stored in xattr,
> > > > > we will overflow the 16-bits wide xattr extent counter when large
> > > > > number of hardlinks are created.
> > > > > 
> > > > > Hence this commit extends xattr extent counter to 32-bits. It also introduces
> > > > > an incompat flag to prevent older kernels from mounting newer filesystems with
> > > > > 32-bit wide xattr extent counter.
> > > > > 
> > > > > Signed-off-by: Chandan Rajendra <chandanrlinux@gmail.com>
> > > > > ---
> > > > >  fs/xfs/libxfs/xfs_format.h     | 28 +++++++++++++++++++++-------
> > > > >  fs/xfs/libxfs/xfs_inode_buf.c  | 27 +++++++++++++++++++--------
> > > > >  fs/xfs/libxfs/xfs_inode_fork.c |  3 ++-
> > > > >  fs/xfs/libxfs/xfs_log_format.h |  5 +++--
> > > > >  fs/xfs/libxfs/xfs_types.h      |  4 ++--
> > > > >  fs/xfs/scrub/inode.c           |  7 ++++---
> > > > >  fs/xfs/xfs_inode_item.c        |  3 ++-
> > > > >  fs/xfs/xfs_log_recover.c       | 13 ++++++++++---
> > > > >  8 files changed, 63 insertions(+), 27 deletions(-)
> > > > > 
> > > > > diff --git a/fs/xfs/libxfs/xfs_format.h b/fs/xfs/libxfs/xfs_format.h
> > > > > index 045556e78ee2c..0a4266b0d46e1 100644
> > > > > --- a/fs/xfs/libxfs/xfs_format.h
> > > > > +++ b/fs/xfs/libxfs/xfs_format.h
> > > > > @@ -465,10 +465,12 @@ xfs_sb_has_ro_compat_feature(
> > > > >  #define XFS_SB_FEAT_INCOMPAT_FTYPE	(1 << 0)	/* filetype in dirent */
> > > > >  #define XFS_SB_FEAT_INCOMPAT_SPINODES	(1 << 1)	/* sparse inode chunks */
> > > > >  #define XFS_SB_FEAT_INCOMPAT_META_UUID	(1 << 2)	/* metadata UUID */
> > > > > +#define XFS_SB_FEAT_INCOMPAT_32BIT_AEXT_CNTR (1 << 3)
> > > > >  #define XFS_SB_FEAT_INCOMPAT_ALL \
> > > > >  		(XFS_SB_FEAT_INCOMPAT_FTYPE|	\
> > > > >  		 XFS_SB_FEAT_INCOMPAT_SPINODES|	\
> > > > > -		 XFS_SB_FEAT_INCOMPAT_META_UUID)
> > > > > +		 XFS_SB_FEAT_INCOMPAT_META_UUID| \
> > > > > +		 XFS_SB_FEAT_INCOMPAT_32BIT_AEXT_CNTR)
> > > > >  
> > > > >  #define XFS_SB_FEAT_INCOMPAT_UNKNOWN	~XFS_SB_FEAT_INCOMPAT_ALL
> > > > >  static inline bool
> > > > > @@ -874,7 +876,7 @@ typedef struct xfs_dinode {
> > > > >  	__be64		di_nblocks;	/* # of direct & btree blocks used */
> > > > >  	__be32		di_extsize;	/* basic/minimum extent size for file */
> > > > >  	__be32		di_nextents;	/* number of extents in data fork */
> > > > > -	__be16		di_anextents;	/* number of extents in attribute fork*/
> > > > > +	__be16		di_anextents_lo;/* lower part of xattr extent count */
> > > > >  	__u8		di_forkoff;	/* attr fork offs, <<3 for 64b align */
> > > > >  	__s8		di_aformat;	/* format of attr fork's data */
> > > > >  	__be32		di_dmevmask;	/* DMIG event mask */
> > > > > @@ -891,7 +893,8 @@ typedef struct xfs_dinode {
> > > > >  	__be64		di_lsn;		/* flush sequence */
> > > > >  	__be64		di_flags2;	/* more random flags */
> > > > >  	__be32		di_cowextsize;	/* basic cow extent size for file */
> > > > > -	__u8		di_pad2[12];	/* more padding for future expansion */
> > > > > +	__be16		di_anextents_hi;/* higher part of xattr extent count */
> > > > > +	__u8		di_pad2[10];	/* more padding for future expansion */
> > > > 
> > > > Ok, I think you've limited what we can do here by using this "fill
> > > > holes" variable split. I've never liked doing this, and we've only
> > > > done it in the past when we haven't had space in the inode to create
> > > > a new 32 bit variable.
> > > > 
> > > > IOWs, this is a v5 format feature only, so we should just create a
> > > > new variable:
> > > > 
> > > > 	__be32		di_attr_nextents;
> > > > 
> > > > With that in place, we can now do what we did extending the v1 inode
> > > > link count (16 bits) to the v2 inode link count (32 bits).
> > > > 
> > > > That is, when the attribute count is going to overflow, we set an
> > > > inode flag on disk to indicate that it now has a 32 bit extent count
> > > > and uses that field in the inode, and we set a RO-compat feature
> > > > flag in the superblock to indicate that there are 32 bit attr fork
> > > > extent counts in use.
> > > > 
> > > > Old kernels can still read the filesystem, but see the extent count
> > > > as "max" (65535); they can't modify the attr fork and hence can't
> > > > corrupt the 32 bit count they know nothing about.
> > > > 
> > > > If the kernel sees the RO feature bit set, it can set the inode flag
> > > > on inodes it is modifying and update both the old and new counters
> > > > appropriately when flushing the inode to disk (i.e. transparent
> > > > conversion).
> > > > 
> > > > In future, mkfs can then set the RO feature flag by default so all
> > > > new filesystems use the 32 bit counter.
> > > > 
> > > > >  	/* fields only written to during inode creation */
> > > > >  	xfs_timestamp_t	di_crtime;	/* time created */
> > > > > @@ -993,10 +996,21 @@ enum xfs_dinode_fmt {
> > > > >  	((w) == XFS_DATA_FORK ? \
> > > > >  		(dip)->di_format : \
> > > > >  		(dip)->di_aformat)
> > > > > -#define XFS_DFORK_NEXTENTS(dip,w) \
> > > > > -	((w) == XFS_DATA_FORK ? \
> > > > > -		be32_to_cpu((dip)->di_nextents) : \
> > > > > -		be16_to_cpu((dip)->di_anextents))
> > > > > +
> > > > > +static inline int32_t XFS_DFORK_NEXTENTS(struct xfs_sb *sbp,
> > > > 
> > > > If you are converting a macro to static inline, then all the caller
> > > > sites should be converted to lower case at the same time.
> > > > 
> > > > > +					struct xfs_dinode *dip, int whichfork)
> > > > > +{
> > > > > +	int32_t anextents;
> > > > 
> > > > Extent counts should be unsigned, as they are on disk.
> > > > 
> > > > > +
> > > > > +	if (whichfork == XFS_DATA_FORK)
> > > > > +		return be32_to_cpu((dip)->di_nextents);
> > > > > +
> > > > > +	anextents = be16_to_cpu((dip)->di_anextents_lo);
> > > > > +	if (xfs_sb_version_has_v3inode(sbp))
> > > > > +		anextents |= ((u32)(be16_to_cpu((dip)->di_anextents_hi)) << 16);
> > > > > +
> > > > > +	return anextents;
> > > > > +}
> > > > 
> > > > No feature bit to indicate that 32 bit attribute extent counts are
> > > > valid?
> > > > 
> > > > >  
> > > > >  /*
> > > > >   * For block and character special files the 32bit dev_t is stored at the
> > > > > diff --git a/fs/xfs/libxfs/xfs_inode_buf.c b/fs/xfs/libxfs/xfs_inode_buf.c
> > > > > index 39c5a6e24915c..ced8195bd8c22 100644
> > > > > --- a/fs/xfs/libxfs/xfs_inode_buf.c
> > > > > +++ b/fs/xfs/libxfs/xfs_inode_buf.c
> > > > > @@ -232,7 +232,8 @@ xfs_inode_from_disk(
> > > > >  	to->di_nblocks = be64_to_cpu(from->di_nblocks);
> > > > >  	to->di_extsize = be32_to_cpu(from->di_extsize);
> > > > >  	to->di_nextents = be32_to_cpu(from->di_nextents);
> > > > > -	to->di_anextents = be16_to_cpu(from->di_anextents);
> > > > > +	to->di_anextents = XFS_DFORK_NEXTENTS(&ip->i_mount->m_sb, from,
> > > > > +				XFS_ATTR_FORK);
> > > > 
> > > > This should be open coded, but I'd prefer a completely separate
> > > > variable...
> > > > 
> > > > >  	to->di_forkoff = from->di_forkoff;
> > > > >  	to->di_aformat	= from->di_aformat;
> > > > >  	to->di_dmevmask	= be32_to_cpu(from->di_dmevmask);
> > > > > @@ -282,7 +283,7 @@ xfs_inode_to_disk(
> > > > >  	to->di_nblocks = cpu_to_be64(from->di_nblocks);
> > > > >  	to->di_extsize = cpu_to_be32(from->di_extsize);
> > > > >  	to->di_nextents = cpu_to_be32(from->di_nextents);
> > > > > -	to->di_anextents = cpu_to_be16(from->di_anextents);
> > > > > +	to->di_anextents_lo = cpu_to_be16((u32)(from->di_anextents) & 0xffff);
> > > > >  	to->di_forkoff = from->di_forkoff;
> > > > >  	to->di_aformat = from->di_aformat;
> > > > >  	to->di_dmevmask = cpu_to_be32(from->di_dmevmask);
> > > > > @@ -296,6 +297,8 @@ xfs_inode_to_disk(
> > > > >  		to->di_crtime.t_nsec = cpu_to_be32(from->di_crtime.tv_nsec);
> > > > >  		to->di_flags2 = cpu_to_be64(from->di_flags2);
> > > > >  		to->di_cowextsize = cpu_to_be32(from->di_cowextsize);
> > > > > +		to->di_anextents_hi
> > > > > +			= cpu_to_be16((u32)(from->di_anextents) >> 16);
> > > > 
> > > > Again, feature bit for on-disk format modifications needed...
> > > > 
> > > > >  		to->di_ino = cpu_to_be64(ip->i_ino);
> > > > >  		to->di_lsn = cpu_to_be64(lsn);
> > > > >  		memset(to->di_pad2, 0, sizeof(to->di_pad2));
> > > > > @@ -335,7 +338,7 @@ xfs_log_dinode_to_disk(
> > > > >  	to->di_nblocks = cpu_to_be64(from->di_nblocks);
> > > > >  	to->di_extsize = cpu_to_be32(from->di_extsize);
> > > > >  	to->di_nextents = cpu_to_be32(from->di_nextents);
> > > > > -	to->di_anextents = cpu_to_be16(from->di_anextents);
> > > > > +	to->di_anextents_lo = cpu_to_be16(from->di_anextents_lo);
> > > > >  	to->di_forkoff = from->di_forkoff;
> > > > >  	to->di_aformat = from->di_aformat;
> > > > >  	to->di_dmevmask = cpu_to_be32(from->di_dmevmask);
> > > > > @@ -349,6 +352,7 @@ xfs_log_dinode_to_disk(
> > > > >  		to->di_crtime.t_nsec = cpu_to_be32(from->di_crtime.t_nsec);
> > > > >  		to->di_flags2 = cpu_to_be64(from->di_flags2);
> > > > >  		to->di_cowextsize = cpu_to_be32(from->di_cowextsize);
> > > > > +		to->di_anextents_hi = cpu_to_be16(from->di_anextents_hi);
> > > > >  		to->di_ino = cpu_to_be64(from->di_ino);
> > > > >  		to->di_lsn = cpu_to_be64(from->di_lsn);
> > > > >  		memcpy(to->di_pad2, from->di_pad2, sizeof(to->di_pad2));
> > > > > @@ -365,7 +369,9 @@ xfs_dinode_verify_fork(
> > > > >  	struct xfs_mount	*mp,
> > > > >  	int			whichfork)
> > > > >  {
> > > > > -	uint32_t		di_nextents = XFS_DFORK_NEXTENTS(dip, whichfork);
> > > > > +	uint32_t		di_nextents;
> > > > > +
> > > > > +	di_nextents = XFS_DFORK_NEXTENTS(&mp->m_sb, dip, whichfork);
> > > > >  
> > > > >  	switch (XFS_DFORK_FORMAT(dip, whichfork)) {
> > > > >  	case XFS_DINODE_FMT_LOCAL:
> > > > > @@ -436,6 +442,9 @@ xfs_dinode_verify(
> > > > >  	uint16_t		flags;
> > > > >  	uint64_t		flags2;
> > > > >  	uint64_t		di_size;
> > > > > +	int32_t			nextents;
> > > > > +	int32_t			anextents;
> > > > > +	int64_t			nblocks;
> > > > 
> > > > Extent counts need to be converted to unsigned in memory - they are
> > > > unsigned on disk....
> > > 
> > > In the current code, we have,
> > > 
> > > #define MAXEXTNUM       ((xfs_extnum_t)0x7fffffff)      /* signed int */
> > > #define MAXAEXTNUM      ((xfs_aextnum_t)0x7fff)         /* signed short */
> > > 
> > > i.e. the maximum allowed data extent counter and xattr extent counter are
> > > maximum possible values w.r.t signed int and signed short.
> > > 
> > > Can you please explain as to why signed maximum values were considered when
> > > the corresponding on-disk data types are unsigned?
> > > 
> > > 
> > 
> > Ok. So the reason I asked that question was because I was wondering if
> > changing the maximum number of extents for data and attr would change the
> > height of the corresponding bmbt trees (which in turn could change the log
> > reservation values). The following calculations prove otherwise,
> > 
> > - 5 levels deep data bmbt tree.
> >   |-------+------------------------+-------------------------------|
> >   | level | number of nodes/leaves | Total Nr recs                 |
> >   |-------+------------------------+-------------------------------|
> >   |     0 |                      1 | 3 (max root recs)             |
> >   |     1 |                      3 | 125 * 3 = 375                 |
> >   |     2 |                    375 | 125 * 375 = 46875             |
> >   |     3 |                  46875 | 125 * 46875 = 5859375         |
> >   |     4 |                5859375 | 125 * 5859375 = 732421875     |
> >   |     5 |              732421875 | 125 * 732421875 = 91552734375 |
> >   |-------+------------------------+-------------------------------|
> > 
> > - 3 levels deep attr bmbt tree.
> >   |-------+------------------------+-----------------------|
> >   | level | number of nodes/leaves | Total Nr recs         |
> >   |-------+------------------------+-----------------------|
> >   |     0 |                      1 | 2 (max root recs)     |
> >   |     1 |                      2 | 125 * 2 = 250         |
> >   |     2 |                    250 | 125 * 250 = 31250     |
> >   |     3 |                  31250 | 125 * 31250 = 3906250 |
> >   |-------+------------------------+-----------------------|
> > 
> > - Data type to number of records
> >   |-----------+-------------+-----------------|
> >   | data type | max extents | max leaf blocks |
> >   |-----------+-------------+-----------------|
> >   | int32     |  2147483647 |        17179870 |
> >   | uint32    |  4294967295 |        34359739 |
> >   | int16     |       32767 |             263 |
> >   | uint16    |       65535 |             525 |
> >   |-----------+-------------+-----------------|
> > 
> > So data bmbt will still have a height of 5 and attr bmbt will continue to have
> > a height of 3.
> 
> I think extent count variables should be unsigned because there's no
> meaning for a negative extent count.  ("You have -3 extents." "Ehh???")
> 
> That said, it was very helpful to point out that the current MAXEXTNUM /
> MAXAEXTNUM symbols stop short of using all 32 (or 16) bits.
> 
> Can we use this new feature flag + inode flag to allow 4294967295
> extents in either fork?

Sure.

I have already tested that having 4294967295 as the maximum data extent count
does not cause any regressions.

Also, Dave was of the opinion that the data extent counter should be increased
to 64 bits. I think I should include that change along with this feature flag
rather than adding a new flag in the near future.
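
The height tables quoted above can be rechecked with a short C sketch. It
assumes, as the tables do, 125 records per bmbt block and 3 (data fork) or
2 (attr fork) records in the inode root. The function names are hypothetical
and are not XFS code.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Return the first tree level whose total record capacity covers nextents.
 * Level 0 is the root held in the inode; every level below multiplies the
 * capacity by recs_per_block. Intended for counts up to 2^32, where the
 * running capacity cannot overflow 64 bits.
 */
unsigned int bmbt_level_for(uint64_t nextents, uint64_t root_recs,
			    uint64_t recs_per_block)
{
	uint64_t capacity = root_recs;
	unsigned int level = 0;

	assert(root_recs > 0 && recs_per_block > 1);

	while (capacity < nextents) {
		capacity *= recs_per_block;
		level++;
	}
	return level;
}

/* Number of leaf blocks needed to hold nextents records (rounded up). */
uint64_t bmbt_leaf_blocks_for(uint64_t nextents, uint64_t recs_per_block)
{
	return (nextents + recs_per_block - 1) / recs_per_block;
}
```

For both 2147483647 and 4294967295 extents with a 3-record root this yields
level 5, and for both 32767 and 65535 extents with a 2-record root it yields
level 3, which matches the tables.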

Chandan Rajendra April 22, 2020, 9:38 a.m. UTC | #17

On Monday, April 20, 2020 10:08 AM Chandan Rajendra wrote: 
> On Tuesday, April 14, 2020 12:25 AM Darrick J. Wong wrote: 
> > On Sun, Apr 12, 2020 at 12:04:13PM +0530, Chandan Rajendra wrote:
> > > On Friday, April 10, 2020 1:16 PM Chandan Rajendra wrote: 
> > > > On Tuesday, April 7, 2020 6:50 AM Dave Chinner wrote: 
> > > > > On Sat, Apr 04, 2020 at 02:22:03PM +0530, Chandan Rajendra wrote:
> > > > > > XFS has a per-inode xattr extent counter which is 16 bits wide. A workload
> > > > > > which
> > > > > > 1. Creates 1,000,000 255-byte sized xattrs,
> > > > > > 2. Deletes 50% of these xattrs in an alternating manner,
> > > > > > 3. Tries to create 400,000 new 255-byte sized xattrs
> > > > > > causes the following message to be printed on the console,
> > > > > > 
> > > > > > XFS (loop0): xfs_iflush_int: detected corrupt incore inode 131, total extents = -19916, nblocks = 102937, ptr ffff9ce33b098c00
> > > > > > XFS (loop0): xfs_do_force_shutdown(0x8) called from line 3739 of file fs/xfs/xfs_inode.c. Return address = ffffffffa4a94173
> > > > > > 
> > > > > > This indicates that we overflowed the 16-bits wide xattr extent counter.
> > > > > > 
> > > > > > I have been informed that there are instances where a single file has
> > > > > >  > 100 million hardlinks. With parent pointers being stored in xattr,
> > > > > > we will overflow the 16-bits wide xattr extent counter when large
> > > > > > number of hardlinks are created.
> > > > > > 
> > > > > > Hence this commit extends xattr extent counter to 32-bits. It also introduces
> > > > > > an incompat flag to prevent older kernels from mounting newer filesystems with
> > > > > > 32-bit wide xattr extent counter.
> > > > > > 
> > > > > > Signed-off-by: Chandan Rajendra <chandanrlinux@gmail.com>
> > > > > > ---
> > > > > >  fs/xfs/libxfs/xfs_format.h     | 28 +++++++++++++++++++++-------
> > > > > >  fs/xfs/libxfs/xfs_inode_buf.c  | 27 +++++++++++++++++++--------
> > > > > >  fs/xfs/libxfs/xfs_inode_fork.c |  3 ++-
> > > > > >  fs/xfs/libxfs/xfs_log_format.h |  5 +++--
> > > > > >  fs/xfs/libxfs/xfs_types.h      |  4 ++--
> > > > > >  fs/xfs/scrub/inode.c           |  7 ++++---
> > > > > >  fs/xfs/xfs_inode_item.c        |  3 ++-
> > > > > >  fs/xfs/xfs_log_recover.c       | 13 ++++++++++---
> > > > > >  8 files changed, 63 insertions(+), 27 deletions(-)
> > > > > > 
> > > > > > diff --git a/fs/xfs/libxfs/xfs_format.h b/fs/xfs/libxfs/xfs_format.h
> > > > > > index 045556e78ee2c..0a4266b0d46e1 100644
> > > > > > --- a/fs/xfs/libxfs/xfs_format.h
> > > > > > +++ b/fs/xfs/libxfs/xfs_format.h
> > > > > > @@ -465,10 +465,12 @@ xfs_sb_has_ro_compat_feature(
> > > > > >  #define XFS_SB_FEAT_INCOMPAT_FTYPE	(1 << 0)	/* filetype in dirent */
> > > > > >  #define XFS_SB_FEAT_INCOMPAT_SPINODES	(1 << 1)	/* sparse inode chunks */
> > > > > >  #define XFS_SB_FEAT_INCOMPAT_META_UUID	(1 << 2)	/* metadata UUID */
> > > > > > +#define XFS_SB_FEAT_INCOMPAT_32BIT_AEXT_CNTR (1 << 3)
> > > > > >  #define XFS_SB_FEAT_INCOMPAT_ALL \
> > > > > >  		(XFS_SB_FEAT_INCOMPAT_FTYPE|	\
> > > > > >  		 XFS_SB_FEAT_INCOMPAT_SPINODES|	\
> > > > > > -		 XFS_SB_FEAT_INCOMPAT_META_UUID)
> > > > > > +		 XFS_SB_FEAT_INCOMPAT_META_UUID| \
> > > > > > +		 XFS_SB_FEAT_INCOMPAT_32BIT_AEXT_CNTR)
> > > > > >  
> > > > > >  #define XFS_SB_FEAT_INCOMPAT_UNKNOWN	~XFS_SB_FEAT_INCOMPAT_ALL
> > > > > >  static inline bool
> > > > > > @@ -874,7 +876,7 @@ typedef struct xfs_dinode {
> > > > > >  	__be64		di_nblocks;	/* # of direct & btree blocks used */
> > > > > >  	__be32		di_extsize;	/* basic/minimum extent size for file */
> > > > > >  	__be32		di_nextents;	/* number of extents in data fork */
> > > > > > -	__be16		di_anextents;	/* number of extents in attribute fork*/
> > > > > > +	__be16		di_anextents_lo;/* lower part of xattr extent count */
> > > > > >  	__u8		di_forkoff;	/* attr fork offs, <<3 for 64b align */
> > > > > >  	__s8		di_aformat;	/* format of attr fork's data */
> > > > > >  	__be32		di_dmevmask;	/* DMIG event mask */
> > > > > > @@ -891,7 +893,8 @@ typedef struct xfs_dinode {
> > > > > >  	__be64		di_lsn;		/* flush sequence */
> > > > > >  	__be64		di_flags2;	/* more random flags */
> > > > > >  	__be32		di_cowextsize;	/* basic cow extent size for file */
> > > > > > -	__u8		di_pad2[12];	/* more padding for future expansion */
> > > > > > +	__be16		di_anextents_hi;/* higher part of xattr extent count */
> > > > > > +	__u8		di_pad2[10];	/* more padding for future expansion */
> > > > > 
> > > > > Ok, I think you've limited what we can do here by using this "fill
> > > > > holes" variable split. I've never liked doing this, and we've only
> > > > > done it in the past when we haven't had space in the inode to create
> > > > > a new 32 bit variable.
> > > > > 
> > > > > IOWs, this is a v5 format feature only, so we should just create a
> > > > > new variable:
> > > > > 
> > > > > 	__be32		di_attr_nextents;
> > > > > 
> > > > > With that in place, we can now do what we did extending the v1 inode
> > > > > link count (16 bits) to the v2 inode link count (32 bits).
> > > > > 
> > > > > That is, when the attribute count is going to overflow, we set an
> > > > > inode flag on disk to indicate that it now has a 32 bit extent count
> > > > > and uses that field in the inode, and we set a RO-compat feature
> > > > > flag in the superblock to indicate that there are 32 bit attr fork
> > > > > extent counts in use.
> > > > > 
> > > > > Old kernels can still read the filesystem, but see the extent count
> > > > > as "max" (65535); they can't modify the attr fork and hence can't
> > > > > corrupt the 32 bit count they know nothing about.
> > > > > 
> > > > > If the kernel sees the RO feature bit set, it can set the inode flag
> > > > > on inodes it is modifying and update both the old and new counters
> > > > > appropriately when flushing the inode to disk (i.e. transparent
> > > > > conversion).
> > > > > 
> > > > > In future, mkfs can then set the RO feature flag by default so all
> > > > > new filesystems use the 32 bit counter.
> > > > > 
> > > > > >  	/* fields only written to during inode creation */
> > > > > >  	xfs_timestamp_t	di_crtime;	/* time created */
> > > > > > @@ -993,10 +996,21 @@ enum xfs_dinode_fmt {
> > > > > >  	((w) == XFS_DATA_FORK ? \
> > > > > >  		(dip)->di_format : \
> > > > > >  		(dip)->di_aformat)
> > > > > > -#define XFS_DFORK_NEXTENTS(dip,w) \
> > > > > > -	((w) == XFS_DATA_FORK ? \
> > > > > > -		be32_to_cpu((dip)->di_nextents) : \
> > > > > > -		be16_to_cpu((dip)->di_anextents))
> > > > > > +
> > > > > > +static inline int32_t XFS_DFORK_NEXTENTS(struct xfs_sb *sbp,
> > > > > 
> > > > > If you are converting a macro to static inline, then all the caller
> > > > > sites should be converted to lower case at the same time.
> > > > > 
> > > > > > +					struct xfs_dinode *dip, int whichfork)
> > > > > > +{
> > > > > > +	int32_t anextents;
> > > > > 
> > > > > Extent counts should be unsigned, as they are on disk.
> > > > > 
> > > > > > +
> > > > > > +	if (whichfork == XFS_DATA_FORK)
> > > > > > +		return be32_to_cpu((dip)->di_nextents);
> > > > > > +
> > > > > > +	anextents = be16_to_cpu((dip)->di_anextents_lo);
> > > > > > +	if (xfs_sb_version_has_v3inode(sbp))
> > > > > > +		anextents |= ((u32)(be16_to_cpu((dip)->di_anextents_hi)) << 16);
> > > > > > +
> > > > > > +	return anextents;
> > > > > > +}
> > > > > 
> > > > > No feature bit to indicate that 32 bit attribute extent counts are
> > > > > valid?
> > > > > 
> > > > > >  
> > > > > >  /*
> > > > > >   * For block and character special files the 32bit dev_t is stored at the
> > > > > > diff --git a/fs/xfs/libxfs/xfs_inode_buf.c b/fs/xfs/libxfs/xfs_inode_buf.c
> > > > > > index 39c5a6e24915c..ced8195bd8c22 100644
> > > > > > --- a/fs/xfs/libxfs/xfs_inode_buf.c
> > > > > > +++ b/fs/xfs/libxfs/xfs_inode_buf.c
> > > > > > @@ -232,7 +232,8 @@ xfs_inode_from_disk(
> > > > > >  	to->di_nblocks = be64_to_cpu(from->di_nblocks);
> > > > > >  	to->di_extsize = be32_to_cpu(from->di_extsize);
> > > > > >  	to->di_nextents = be32_to_cpu(from->di_nextents);
> > > > > > -	to->di_anextents = be16_to_cpu(from->di_anextents);
> > > > > > +	to->di_anextents = XFS_DFORK_NEXTENTS(&ip->i_mount->m_sb, from,
> > > > > > +				XFS_ATTR_FORK);
> > > > > 
> > > > > This should be open coded, but I'd prefer a completely separate
> > > > > variable...
> > > > > 
> > > > > >  	to->di_forkoff = from->di_forkoff;
> > > > > >  	to->di_aformat	= from->di_aformat;
> > > > > >  	to->di_dmevmask	= be32_to_cpu(from->di_dmevmask);
> > > > > > @@ -282,7 +283,7 @@ xfs_inode_to_disk(
> > > > > >  	to->di_nblocks = cpu_to_be64(from->di_nblocks);
> > > > > >  	to->di_extsize = cpu_to_be32(from->di_extsize);
> > > > > >  	to->di_nextents = cpu_to_be32(from->di_nextents);
> > > > > > -	to->di_anextents = cpu_to_be16(from->di_anextents);
> > > > > > +	to->di_anextents_lo = cpu_to_be16((u32)(from->di_anextents) & 0xffff);
> > > > > >  	to->di_forkoff = from->di_forkoff;
> > > > > >  	to->di_aformat = from->di_aformat;
> > > > > >  	to->di_dmevmask = cpu_to_be32(from->di_dmevmask);
> > > > > > @@ -296,6 +297,8 @@ xfs_inode_to_disk(
> > > > > >  		to->di_crtime.t_nsec = cpu_to_be32(from->di_crtime.tv_nsec);
> > > > > >  		to->di_flags2 = cpu_to_be64(from->di_flags2);
> > > > > >  		to->di_cowextsize = cpu_to_be32(from->di_cowextsize);
> > > > > > +		to->di_anextents_hi
> > > > > > +			= cpu_to_be16((u32)(from->di_anextents) >> 16);
> > > > > 
> > > > > Again, feature bit for on-disk format modifications needed...
> > > > > 
> > > > > >  		to->di_ino = cpu_to_be64(ip->i_ino);
> > > > > >  		to->di_lsn = cpu_to_be64(lsn);
> > > > > >  		memset(to->di_pad2, 0, sizeof(to->di_pad2));
> > > > > > @@ -335,7 +338,7 @@ xfs_log_dinode_to_disk(
> > > > > >  	to->di_nblocks = cpu_to_be64(from->di_nblocks);
> > > > > >  	to->di_extsize = cpu_to_be32(from->di_extsize);
> > > > > >  	to->di_nextents = cpu_to_be32(from->di_nextents);
> > > > > > -	to->di_anextents = cpu_to_be16(from->di_anextents);
> > > > > > +	to->di_anextents_lo = cpu_to_be16(from->di_anextents_lo);
> > > > > >  	to->di_forkoff = from->di_forkoff;
> > > > > >  	to->di_aformat = from->di_aformat;
> > > > > >  	to->di_dmevmask = cpu_to_be32(from->di_dmevmask);
> > > > > > @@ -349,6 +352,7 @@ xfs_log_dinode_to_disk(
> > > > > >  		to->di_crtime.t_nsec = cpu_to_be32(from->di_crtime.t_nsec);
> > > > > >  		to->di_flags2 = cpu_to_be64(from->di_flags2);
> > > > > >  		to->di_cowextsize = cpu_to_be32(from->di_cowextsize);
> > > > > > +		to->di_anextents_hi = cpu_to_be16(from->di_anextents_hi);
> > > > > >  		to->di_ino = cpu_to_be64(from->di_ino);
> > > > > >  		to->di_lsn = cpu_to_be64(from->di_lsn);
> > > > > >  		memcpy(to->di_pad2, from->di_pad2, sizeof(to->di_pad2));
> > > > > > @@ -365,7 +369,9 @@ xfs_dinode_verify_fork(
> > > > > >  	struct xfs_mount	*mp,
> > > > > >  	int			whichfork)
> > > > > >  {
> > > > > > -	uint32_t		di_nextents = XFS_DFORK_NEXTENTS(dip, whichfork);
> > > > > > +	uint32_t		di_nextents;
> > > > > > +
> > > > > > +	di_nextents = XFS_DFORK_NEXTENTS(&mp->m_sb, dip, whichfork);
> > > > > >  
> > > > > >  	switch (XFS_DFORK_FORMAT(dip, whichfork)) {
> > > > > >  	case XFS_DINODE_FMT_LOCAL:
> > > > > > @@ -436,6 +442,9 @@ xfs_dinode_verify(
> > > > > >  	uint16_t		flags;
> > > > > >  	uint64_t		flags2;
> > > > > >  	uint64_t		di_size;
> > > > > > +	int32_t			nextents;
> > > > > > +	int32_t			anextents;
> > > > > > +	int64_t			nblocks;
> > > > > 
> > > > > Extent counts need to be converted to unsigned in memory - they are
> > > > > unsigned on disk....
> > > > 
> > > > In the current code, we have,
> > > > 
> > > > #define MAXEXTNUM       ((xfs_extnum_t)0x7fffffff)      /* signed int */
> > > > #define MAXAEXTNUM      ((xfs_aextnum_t)0x7fff)         /* signed short */
> > > > 
> > > > i.e. the maximum allowed data extent counter and xattr extent counter are
> > > > maximum possible values w.r.t signed int and signed short.
> > > > 
> > > > Can you please explain as to why signed maximum values were considered when
> > > > the corresponding on-disk data types are unsigned?
> > > > 
> > > > 
> > > 
> > > Ok. So the reason I asked that question was because I was wondering if
> > > changing the maximum number of extents for data and attr would cause a change
> > > in the height of the corresponding bmbt trees (which in turn could change the log
> > > reservation values). The following calculations prove otherwise,
> > > 
> > > - 5 levels deep data bmbt tree.
> > >   |-------+------------------------+-------------------------------|
> > >   | level | number of nodes/leaves | Total Nr recs                 |
> > >   |-------+------------------------+-------------------------------|
> > >   |     0 |                      1 | 3 (max root recs)             |
> > >   |     1 |                      3 | 125 * 3 = 375                 |
> > >   |     2 |                    375 | 125 * 375 = 46875             |
> > >   |     3 |                  46875 | 125 * 46875 = 5859375         |
> > >   |     4 |                5859375 | 125 * 5859375 = 732421875     |
> > >   |     5 |              732421875 | 125 * 732421875 = 91552734375 |
> > >   |-------+------------------------+-------------------------------|
> > > 
> > > - 3 levels deep attr bmbt tree.
> > >   |-------+------------------------+-----------------------|
> > >   | level | number of nodes/leaves | Total Nr recs         |
> > >   |-------+------------------------+-----------------------|
> > >   |     0 |                      1 | 2 (max root recs)     |
> > >   |     1 |                      2 | 125 * 2 = 250         |
> > >   |     2 |                    250 | 125 * 250 = 31250     |
> > >   |     3 |                  31250 | 125 * 31250 = 3906250 |
> > >   |-------+------------------------+-----------------------|
> > > 
> > > - Data type to number of records
> > >   |-----------+-------------+-----------------|
> > >   | data type | max extents | max leaf blocks |
> > >   |-----------+-------------+-----------------|
> > >   | int32     |  2147483647 |        17179870 |
> > >   | uint32    |  4294967295 |        34359739 |
> > >   | int16     |       32767 |             263 |
> > >   | uint16    |       65535 |             525 |
> > >   |-----------+-------------+-----------------|
> > > 
> > > So data bmbt will still have a height of 5 and attr bmbt will continue to have
> > > a height of 3.
> > 
> > I think extent count variables should be unsigned because there's no
> > meaning for a negative extent count.  ("You have -3 extents." "Ehh???")
> > 
> > That said, it was very helpful to point out that the current MAXEXTNUM /
> > MAXAEXTNUM symbols stop short of using all 32 (or 16) bits.
> > 
> > Can we use this new feature flag + inode flag to allow 4294967295
> > extents in either fork?
> 
> Sure.
> 
> I have already tested that having 4294967295 as the maximum data extent count
> does not cause any regressions.
> 
> Also, Dave was of the opinion that data extent counter be increased to
> 64-bit. I think I should include that change along with this feature flag
> rather than adding a new one in the near future.
> 
> 

Hello Dave & Darrick,

Can you please look into the following design decision w.r.t using 32-bit and
64-bit unsigned counters for xattr and data extents.

Maximum extent counts.
|-----------------------+----------------------|
| Field width (in bits) |          Max extents |
|-----------------------+----------------------|
|                    32 |           4294967295 |
|                    48 |      281474976710655 |
|                    64 | 18446744073709551615 |
|-----------------------+----------------------|

|-------------------+-----|
| Minimum node recs | 125 |
| Minimum leaf recs | 125 |
|-------------------+-----|

Data bmbt tree height (MINDBTPTRS == 3)
|-------+------------------------+-------------------------|
| Level | Number of nodes/leaves |           Total Nr recs |
|       |                        | (nr nodes/leaves * 125) |
|-------+------------------------+-------------------------|
|     0 |                      1 |                       3 |
|     1 |                      3 |                     375 |
|     2 |                    375 |                   46875 |
|     3 |                  46875 |                 5859375 |
|     4 |                5859375 |               732421875 |
|     5 |              732421875 |             91552734375 |
|     6 |            91552734375 |          11444091796875 |
|     7 |         11444091796875 |        1430511474609375 |
|     8 |       1430511474609375 |      178813934326171875 |
|     9 |     178813934326171875 |    22351741790771484375 |
|-------+------------------------+-------------------------|

For counting data extents, even though we theoretically have 64 bits at our
disposal, I think we should have (2 ** 48) - 1 as the maximum number of
extents. This gives 281474976710655 (i.e. ~281 trillion extents). With this,
bmbt tree's height grows by just two more levels (i.e. it grows from the
current maximum height of 5 to 7). Please let me know your opinion on this.
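The height figures in the tables above can be reproduced with a short helper. This is only a sketch under the same assumptions as the tables: the in-inode root holds `root_recs` entries (3 for the data fork, 2 for the attr fork) and every block below it holds the minimum of 125 records. The function and parameter names are illustrative and are not kernel symbols.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Smallest level whose worst-case record capacity covers max_extents.
 * Level 0 is the in-inode root holding root_recs entries; each level
 * below it multiplies the capacity by min_recs (blocks assumed to be
 * at their minimum fill). Names are hypothetical, not kernel symbols.
 */
static unsigned int bmbt_worst_case_height(uint64_t max_extents,
					   uint64_t root_recs,
					   uint64_t min_recs)
{
	uint64_t capacity = root_recs;
	unsigned int level = 0;

	assert(root_recs > 0 && min_recs > 1);

	while (capacity < max_extents) {
		level++;
		/* The next multiply would exceed 64 bits, so it covers any count. */
		if (capacity > UINT64_MAX / min_recs)
			break;
		capacity *= min_recs;
	}
	return level;
}
```

Calling it with `(1ULL << 48) - 1`, 3 and 125 gives the level at which the data fork table first covers the proposed 48-bit limit; the overflow check is there because 3 * 125^9 no longer fits in a `uint64_t`.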

Attr bmbt tree height (MINABTPTRS == 2)
|-------+------------------------+-------------------------|
| Level | Number of nodes/leaves |           Total Nr recs |
|       |                        | (nr nodes/leaves * 125) |
|-------+------------------------+-------------------------|
|     0 |                      1 |                       2 |
|     1 |                      2 |                     250 |
|     2 |                    250 |                   31250 |
|     3 |                  31250 |                 3906250 |
|     4 |                3906250 |               488281250 |
|     5 |              488281250 |             61035156250 |
|-------+------------------------+-------------------------|

For xattr extents, (2 ** 32) - 1 = 4294967295 (~ 4 billion extents). So this
will cause the corresponding bmbt's maximum height to go from 3 to 5.
This probably won't cause any regression.

Meanwhile, I will work on finding the impact of increasing the height of these
two trees on log reservation.

Dave Chinner April 22, 2020, 10:30 p.m. UTC | #18

On Wed, Apr 22, 2020 at 03:08:00PM +0530, Chandan Rajendra wrote:
> On Monday, April 20, 2020 10:08 AM Chandan Rajendra wrote: 
> > On Tuesday, April 14, 2020 12:25 AM Darrick J. Wong wrote: 
> > > That said, it was very helpful to point out that the current MAXEXTNUM /
> > > MAXAEXTNUM symbols stop short of using all 32 (or 16) bits.
> > > 
> > > Can we use this new feature flag + inode flag to allow 4294967295
> > > extents in either fork?
> > 
> > Sure.
> > 
> > I have already tested that having 4294967295 as the maximum data extent count
> > does not cause any regressions.
> > 
> > Also, Dave was of the opinion that data extent counter be increased to
> > 64-bit. I think I should include that change along with this feature flag
> > rather than adding a new one in the near future.
> > 
> > 
> 
> Hello Dave & Darrick,
> 
> Can you please look into the following design decision w.r.t using 32-bit and
> 64-bit unsigned counters for xattr and data extents.
> 
> Maximum extent counts.
> |-----------------------+----------------------|
> | Field width (in bits) |          Max extents |
> |-----------------------+----------------------|
> |                    32 |           4294967295 |
> |                    48 |      281474976710655 |
> |                    64 | 18446744073709551615 |
> |-----------------------+----------------------|

These huge numbers are impossible to compare visually.  Once numbers
go beyond 7-9 digits, you need to start condensing them in reports.
Humans are, in general, unable to handle strings of digits longer
than 7-9 digits at all well...

Can you condense them by using scientific representation i.e. XEy,
which gives:

|-----------------------+-------------|
| Field width (in bits) | Max extents |
|-----------------------+-------------|
|                    32 |      4.3E09 |
|                    48 |      2.8E14 |
|                    64 |      1.8E19 |
|-----------------------+-------------|

It's much easier to compare differences visually because it's
only 4 digits, not 20. The other alternative is to use k,m,g,t,p,e
suffixes to indicate magnitude (4.3g, 280t, 18e), but using
exponentials makes the numbers easier to do calculations on
directly...
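As a small illustration of the condensed notation, the standard `%E` conversion already produces this form. The helper name below is made up for the example; note that `printf` emits an explicit exponent sign, so the output reads `4.3E+09` rather than `4.3E09`.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Format n with one fractional digit in scientific notation.
 * Hypothetical helper for report generation; returns the snprintf()
 * result, i.e. the number of characters the full string needs.
 */
static int format_extent_count(uint64_t n, char *buf, size_t len)
{
	assert(buf != NULL && len > 0);
	/* The conversion to double loses low-order digits, which is fine here. */
	return snprintf(buf, len, "%.1E", (double)n);
}
```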

> |-------------------+-----|
> | Minimum node recs | 125 |
> | Minimum leaf recs | 125 |
> |-------------------+-----|

Please show your working. I'm assuming this is 50% * 4kB /
sizeof(bmbt_rec), so you are working out limits based on 4kB block
size? Realistically, worst case behaviour will be with the minimum
supported block size, which in this case will be 1kB....
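The working being asked for can be sketched as follows. The 72-byte block header and 16-byte record size are assumptions taken from figures given elsewhere in this thread, and the macro and function names are illustrative only.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed sizes, taken from figures quoted elsewhere in this thread. */
#define BMBT_BLOCK_HDR_BYTES	72	/* header of a bmbt block */
#define BMBT_REC_BYTES		16	/* one extent record */

/* Records held by a half-full bmbt block of the given block size. */
static size_t bmbt_min_recs(size_t block_bytes)
{
	size_t max_recs;

	assert(block_bytes > BMBT_BLOCK_HDR_BYTES);
	max_recs = (block_bytes - BMBT_BLOCK_HDR_BYTES) / BMBT_REC_BYTES;
	return max_recs / 2;
}
```

Under these assumptions a 4kB block gives the 125 used in the tables, while a 1kB block gives a much smaller fan-out, which is why the smallest block size is the worst case for tree height.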

> Data bmbt tree height (MINDBTPTRS == 3)
> |-------+------------------------+-------------------------|
> | Level | Number of nodes/leaves |           Total Nr recs |
> |       |                        | (nr nodes/leaves * 125) |
> |-------+------------------------+-------------------------|
> |     0 |                      1 |                       3 |
> |     1 |                      3 |                     375 |
> |     2 |                    375 |                   46875 |
> |     3 |                  46875 |                 5859375 |
> |     4 |                5859375 |               732421875 |
> |     5 |              732421875 |             91552734375 |
> |     6 |            91552734375 |          11444091796875 |
> |     7 |         11444091796875 |        1430511474609375 |
> |     8 |       1430511474609375 |      178813934326171875 |
> |     9 |     178813934326171875 |    22351741790771484375 |
> |-------+------------------------+-------------------------|
> 
> For counting data extents, even though we theoretically have 64 bits at our
> disposal, I think we should have (2 ** 48) - 1 as the maximum number of
> extents. This gives 281474976710655 (i.e. ~281 trillion extents). With this,
> bmbt tree's height grows by just two more levels (i.e. it grows from the
> current maximum height of 5 to 7). Please let me know your opinion on this.

We shouldn't make up arbitrary limits when we can calculate them exactly.
i.e. 2^63 max file size, 1kB block size (2^10), means max fragments
is 2^53 entries. On a 64kB block size (2^16), we have a max extent
count of 2^47....

i.e. 2^48 would be an acceptable limit for 1kB block size, but it is
not correct for 64kB block size filesystems....
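The exact limit described here is just a subtraction of exponents: with a 2^63 byte maximum file size and every extent being a single block, the extent count cannot exceed 2^(63 - log2(block size)). A minimal sketch, with a hypothetical macro and function name:

```c
#include <assert.h>

#define MAX_FILE_SIZE_LOG2	63	/* 2^63 byte maximum file size, as stated above */

/* log2 of the largest possible extent count when every extent is one block. */
static unsigned int max_extents_log2(unsigned int block_size_log2)
{
	assert(block_size_log2 <= MAX_FILE_SIZE_LOG2);
	return MAX_FILE_SIZE_LOG2 - block_size_log2;
}
```

Passing 10 (1kB blocks) and 16 (64kB blocks) reproduces the two exponents quoted above.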

> Attr bmbt tree height (MINABTPTRS == 2)
> |-------+------------------------+-------------------------|
> | Level | Number of nodes/leaves |           Total Nr recs |
> |       |                        | (nr nodes/leaves * 125) |
> |-------+------------------------+-------------------------|
> |     0 |                      1 |                       2 |
> |     1 |                      2 |                     250 |
> |     2 |                    250 |                   31250 |
> |     3 |                  31250 |                 3906250 |
> |     4 |                3906250 |               488281250 |
> |     5 |              488281250 |             61035156250 |
> |-------+------------------------+-------------------------|
> 
> For xattr extents, (2 ** 32) - 1 = 4294967295 (~ 4 billion extents). So this
> will cause the corresponding bmbt's maximum height to go from 3 to 5.
> This probably won't cause any regression.

We already have the XFS_DA_NODE_MAXDEPTH set to 5, so changing the
attr fork extent count makes no difference to the attribute fork
bmbt reservations. i.e. the bmbt reservations are defined by the
dabtree structure limits, not the maximum extent count the fork can
hold.

Extending the data fork extent count to 64 bits has no impact on the
directory reservations, either, because the number of extents in the directory
is bound by the directory segment size of 32GB. i.e. a directory can
hold, at most, 32GB of dirent data, which means there's a hard limit
on the number of dabtree entries somewhere in the order of a few
hundred million. That's where XFS_DA_NODE_MAXDEPTH comes from - it's
large enough to index a max sized directory, and the BMBT overhead
is derived from that...

> Meanwhile, I will work on finding the impact of increasing the
> height of these two trees on log reservation.

It should not change it substantially - 2 blocks per bmbt
reservation per transaction is what I'd expect from the numbers
presented...

Cheers,

Dave.

Darrick J. Wong April 22, 2020, 10:51 p.m. UTC | #19

On Wed, Apr 22, 2020 at 03:08:00PM +0530, Chandan Rajendra wrote:
> On Monday, April 20, 2020 10:08 AM Chandan Rajendra wrote: 
> > On Tuesday, April 14, 2020 12:25 AM Darrick J. Wong wrote: 
> > > On Sun, Apr 12, 2020 at 12:04:13PM +0530, Chandan Rajendra wrote:
> > > > On Friday, April 10, 2020 1:16 PM Chandan Rajendra wrote: 
> > > > > On Tuesday, April 7, 2020 6:50 AM Dave Chinner wrote: 
> > > > > > On Sat, Apr 04, 2020 at 02:22:03PM +0530, Chandan Rajendra wrote:
> > > > > > > XFS has a per-inode xattr extent counter which is 16 bits wide. A workload
> > > > > > > which
> > > > > > > 1. Creates 1,000,000 255-byte sized xattrs,
> > > > > > > 2. Deletes 50% of these xattrs in an alternating manner,
> > > > > > > 3. Tries to create 400,000 new 255-byte sized xattrs
> > > > > > > causes the following message to be printed on the console,
> > > > > > > 
> > > > > > > XFS (loop0): xfs_iflush_int: detected corrupt incore inode 131, total extents = -19916, nblocks = 102937, ptr ffff9ce33b098c00
> > > > > > > XFS (loop0): xfs_do_force_shutdown(0x8) called from line 3739 of file fs/xfs/xfs_inode.c. Return address = ffffffffa4a94173
> > > > > > > 
> > > > > > > This indicates that we overflowed the 16-bits wide xattr extent counter.
> > > > > > > 
> > > > > > > I have been informed that there are instances where a single file has
> > > > > > >  > 100 million hardlinks. With parent pointers being stored in xattr,
> > > > > > > we will overflow the 16-bits wide xattr extent counter when large
> > > > > > > number of hardlinks are created.
> > > > > > > 
> > > > > > > Hence this commit extends xattr extent counter to 32-bits. It also introduces
> > > > > > > an incompat flag to prevent older kernels from mounting newer filesystems with
> > > > > > > 32-bit wide xattr extent counter.
> > > > > > > 
> > > > > > > Signed-off-by: Chandan Rajendra <chandanrlinux@gmail.com>
> > > > > > > ---
> > > > > > >  fs/xfs/libxfs/xfs_format.h     | 28 +++++++++++++++++++++-------
> > > > > > >  fs/xfs/libxfs/xfs_inode_buf.c  | 27 +++++++++++++++++++--------
> > > > > > >  fs/xfs/libxfs/xfs_inode_fork.c |  3 ++-
> > > > > > >  fs/xfs/libxfs/xfs_log_format.h |  5 +++--
> > > > > > >  fs/xfs/libxfs/xfs_types.h      |  4 ++--
> > > > > > >  fs/xfs/scrub/inode.c           |  7 ++++---
> > > > > > >  fs/xfs/xfs_inode_item.c        |  3 ++-
> > > > > > >  fs/xfs/xfs_log_recover.c       | 13 ++++++++++---
> > > > > > >  8 files changed, 63 insertions(+), 27 deletions(-)
> > > > > > > 
> > > > > > > diff --git a/fs/xfs/libxfs/xfs_format.h b/fs/xfs/libxfs/xfs_format.h
> > > > > > > index 045556e78ee2c..0a4266b0d46e1 100644
> > > > > > > --- a/fs/xfs/libxfs/xfs_format.h
> > > > > > > +++ b/fs/xfs/libxfs/xfs_format.h
> > > > > > > @@ -465,10 +465,12 @@ xfs_sb_has_ro_compat_feature(
> > > > > > >  #define XFS_SB_FEAT_INCOMPAT_FTYPE	(1 << 0)	/* filetype in dirent */
> > > > > > >  #define XFS_SB_FEAT_INCOMPAT_SPINODES	(1 << 1)	/* sparse inode chunks */
> > > > > > >  #define XFS_SB_FEAT_INCOMPAT_META_UUID	(1 << 2)	/* metadata UUID */
> > > > > > > +#define XFS_SB_FEAT_INCOMPAT_32BIT_AEXT_CNTR (1 << 3)
> > > > > > >  #define XFS_SB_FEAT_INCOMPAT_ALL \
> > > > > > >  		(XFS_SB_FEAT_INCOMPAT_FTYPE|	\
> > > > > > >  		 XFS_SB_FEAT_INCOMPAT_SPINODES|	\
> > > > > > > -		 XFS_SB_FEAT_INCOMPAT_META_UUID)
> > > > > > > +		 XFS_SB_FEAT_INCOMPAT_META_UUID| \
> > > > > > > +		 XFS_SB_FEAT_INCOMPAT_32BIT_AEXT_CNTR)
> > > > > > >  
> > > > > > >  #define XFS_SB_FEAT_INCOMPAT_UNKNOWN	~XFS_SB_FEAT_INCOMPAT_ALL
> > > > > > >  static inline bool
> > > > > > > @@ -874,7 +876,7 @@ typedef struct xfs_dinode {
> > > > > > >  	__be64		di_nblocks;	/* # of direct & btree blocks used */
> > > > > > >  	__be32		di_extsize;	/* basic/minimum extent size for file */
> > > > > > >  	__be32		di_nextents;	/* number of extents in data fork */
> > > > > > > -	__be16		di_anextents;	/* number of extents in attribute fork*/
> > > > > > > +	__be16		di_anextents_lo;/* lower part of xattr extent count */
> > > > > > >  	__u8		di_forkoff;	/* attr fork offs, <<3 for 64b align */
> > > > > > >  	__s8		di_aformat;	/* format of attr fork's data */
> > > > > > >  	__be32		di_dmevmask;	/* DMIG event mask */
> > > > > > > @@ -891,7 +893,8 @@ typedef struct xfs_dinode {
> > > > > > >  	__be64		di_lsn;		/* flush sequence */
> > > > > > >  	__be64		di_flags2;	/* more random flags */
> > > > > > >  	__be32		di_cowextsize;	/* basic cow extent size for file */
> > > > > > > -	__u8		di_pad2[12];	/* more padding for future expansion */
> > > > > > > +	__be16		di_anextents_hi;/* higher part of xattr extent count */
> > > > > > > +	__u8		di_pad2[10];	/* more padding for future expansion */
> > > > > > 
> > > > > > Ok, I think you've limited what we can do here by using this "fill
> > > > > > holes" variable split. I've never liked doing this, and we've only
> > > > > > done it in the past when we haven't had space in the inode to create
> > > > > > a new 32 bit variable.
> > > > > > 
> > > > > > IOWs, this is a v5 format feature only, so we should just create a
> > > > > > new variable:
> > > > > > 
> > > > > > 	__be32		di_attr_nextents;
> > > > > > 
> > > > > > With that in place, we can now do what we did extending the v1 inode
> > > > > > link count (16 bits) to the v2 inode link count (32 bits).
> > > > > > 
> > > > > > That is, when the attribute count is going to overflow, we set a
> > > > > > inode flag on disk to indicate that it now has a 32 bit extent count
> > > > > > and uses that field in the inode, and we set a RO-compat feature
> > > > > > flag in the superblock to indicate that there are 32 bit attr fork
> > > > > > extent counts in use.
> > > > > > 
> > > > > > Old kernels can still read the filesystem, but see the extent count
> > > > > > as "max" (65535) but can't modify the attr fork and hence corrupt
> > > > > > the 32 bit count it knows nothing about.
> > > > > > 
> > > > > > If the kernel sees the RO feature bit set, it can set the inode flag
> > > > > > on inodes it is modifying and update both the old and new counters
> > > > > > appropriately when flushing the inode to disk (i.e. transparent
> > > > > > conversion).
> > > > > > 
> > > > > > In future, mkfs can then set the RO feature flag by default so all
> > > > > > new filesystems use the 32 bit counter.
> > > > > > 
> > > > > > >  	/* fields only written to during inode creation */
> > > > > > >  	xfs_timestamp_t	di_crtime;	/* time created */
> > > > > > > @@ -993,10 +996,21 @@ enum xfs_dinode_fmt {
> > > > > > >  	((w) == XFS_DATA_FORK ? \
> > > > > > >  		(dip)->di_format : \
> > > > > > >  		(dip)->di_aformat)
> > > > > > > -#define XFS_DFORK_NEXTENTS(dip,w) \
> > > > > > > -	((w) == XFS_DATA_FORK ? \
> > > > > > > -		be32_to_cpu((dip)->di_nextents) : \
> > > > > > > -		be16_to_cpu((dip)->di_anextents))
> > > > > > > +
> > > > > > > +static inline int32_t XFS_DFORK_NEXTENTS(struct xfs_sb *sbp,
> > > > > > 
> > > > > > If you are converting a macro to static inline, then all the caller
> > > > > > sites should be converted to lower case at the same time.
> > > > > > 
> > > > > > > +					struct xfs_dinode *dip, int whichfork)
> > > > > > > +{
> > > > > > > +	int32_t anextents;
> > > > > > 
> > > > > > Extent counts should be unsigned, as they are on disk.
> > > > > > 
> > > > > > > +
> > > > > > > +	if (whichfork == XFS_DATA_FORK)
> > > > > > > +		return be32_to_cpu((dip)->di_nextents);
> > > > > > > +
> > > > > > > +	anextents = be16_to_cpu((dip)->di_anextents_lo);
> > > > > > > +	if (xfs_sb_version_has_v3inode(sbp))
> > > > > > > +		anextents |= ((u32)(be16_to_cpu((dip)->di_anextents_hi)) << 16);
> > > > > > > +
> > > > > > > +	return anextents;
> > > > > > > +}
> > > > > > 
> > > > > > No feature bit to indicate that 32 bit attribute extent counts are
> > > > > > valid?
> > > > > > 
> > > > > > >  
> > > > > > >  /*
> > > > > > >   * For block and character special files the 32bit dev_t is stored at the
> > > > > > > diff --git a/fs/xfs/libxfs/xfs_inode_buf.c b/fs/xfs/libxfs/xfs_inode_buf.c
> > > > > > > index 39c5a6e24915c..ced8195bd8c22 100644
> > > > > > > --- a/fs/xfs/libxfs/xfs_inode_buf.c
> > > > > > > +++ b/fs/xfs/libxfs/xfs_inode_buf.c
> > > > > > > @@ -232,7 +232,8 @@ xfs_inode_from_disk(
> > > > > > >  	to->di_nblocks = be64_to_cpu(from->di_nblocks);
> > > > > > >  	to->di_extsize = be32_to_cpu(from->di_extsize);
> > > > > > >  	to->di_nextents = be32_to_cpu(from->di_nextents);
> > > > > > > -	to->di_anextents = be16_to_cpu(from->di_anextents);
> > > > > > > +	to->di_anextents = XFS_DFORK_NEXTENTS(&ip->i_mount->m_sb, from,
> > > > > > > +				XFS_ATTR_FORK);
> > > > > > 
> > > > > > This should be open coded, but I'd prefer a completely separate
> > > > > > variable...
> > > > > > 
> > > > > > >  	to->di_forkoff = from->di_forkoff;
> > > > > > >  	to->di_aformat	= from->di_aformat;
> > > > > > >  	to->di_dmevmask	= be32_to_cpu(from->di_dmevmask);
> > > > > > > @@ -282,7 +283,7 @@ xfs_inode_to_disk(
> > > > > > >  	to->di_nblocks = cpu_to_be64(from->di_nblocks);
> > > > > > >  	to->di_extsize = cpu_to_be32(from->di_extsize);
> > > > > > >  	to->di_nextents = cpu_to_be32(from->di_nextents);
> > > > > > > -	to->di_anextents = cpu_to_be16(from->di_anextents);
> > > > > > > +	to->di_anextents_lo = cpu_to_be16((u32)(from->di_anextents) & 0xffff);
> > > > > > >  	to->di_forkoff = from->di_forkoff;
> > > > > > >  	to->di_aformat = from->di_aformat;
> > > > > > >  	to->di_dmevmask = cpu_to_be32(from->di_dmevmask);
> > > > > > > @@ -296,6 +297,8 @@ xfs_inode_to_disk(
> > > > > > >  		to->di_crtime.t_nsec = cpu_to_be32(from->di_crtime.tv_nsec);
> > > > > > >  		to->di_flags2 = cpu_to_be64(from->di_flags2);
> > > > > > >  		to->di_cowextsize = cpu_to_be32(from->di_cowextsize);
> > > > > > > +		to->di_anextents_hi
> > > > > > > +			= cpu_to_be16((u32)(from->di_anextents) >> 16);
> > > > > > 
> > > > > > Again, feature bit for on-disk format modifications needed...
> > > > > > 
> > > > > > >  		to->di_ino = cpu_to_be64(ip->i_ino);
> > > > > > >  		to->di_lsn = cpu_to_be64(lsn);
> > > > > > >  		memset(to->di_pad2, 0, sizeof(to->di_pad2));
> > > > > > > @@ -335,7 +338,7 @@ xfs_log_dinode_to_disk(
> > > > > > >  	to->di_nblocks = cpu_to_be64(from->di_nblocks);
> > > > > > >  	to->di_extsize = cpu_to_be32(from->di_extsize);
> > > > > > >  	to->di_nextents = cpu_to_be32(from->di_nextents);
> > > > > > > -	to->di_anextents = cpu_to_be16(from->di_anextents);
> > > > > > > +	to->di_anextents_lo = cpu_to_be16(from->di_anextents_lo);
> > > > > > >  	to->di_forkoff = from->di_forkoff;
> > > > > > >  	to->di_aformat = from->di_aformat;
> > > > > > >  	to->di_dmevmask = cpu_to_be32(from->di_dmevmask);
> > > > > > > @@ -349,6 +352,7 @@ xfs_log_dinode_to_disk(
> > > > > > >  		to->di_crtime.t_nsec = cpu_to_be32(from->di_crtime.t_nsec);
> > > > > > >  		to->di_flags2 = cpu_to_be64(from->di_flags2);
> > > > > > >  		to->di_cowextsize = cpu_to_be32(from->di_cowextsize);
> > > > > > > +		to->di_anextents_hi = cpu_to_be16(from->di_anextents_hi);
> > > > > > >  		to->di_ino = cpu_to_be64(from->di_ino);
> > > > > > >  		to->di_lsn = cpu_to_be64(from->di_lsn);
> > > > > > >  		memcpy(to->di_pad2, from->di_pad2, sizeof(to->di_pad2));
> > > > > > > @@ -365,7 +369,9 @@ xfs_dinode_verify_fork(
> > > > > > >  	struct xfs_mount	*mp,
> > > > > > >  	int			whichfork)
> > > > > > >  {
> > > > > > > -	uint32_t		di_nextents = XFS_DFORK_NEXTENTS(dip, whichfork);
> > > > > > > +	uint32_t		di_nextents;
> > > > > > > +
> > > > > > > +	di_nextents = XFS_DFORK_NEXTENTS(&mp->m_sb, dip, whichfork);
> > > > > > >  
> > > > > > >  	switch (XFS_DFORK_FORMAT(dip, whichfork)) {
> > > > > > >  	case XFS_DINODE_FMT_LOCAL:
> > > > > > > @@ -436,6 +442,9 @@ xfs_dinode_verify(
> > > > > > >  	uint16_t		flags;
> > > > > > >  	uint64_t		flags2;
> > > > > > >  	uint64_t		di_size;
> > > > > > > +	int32_t			nextents;
> > > > > > > +	int32_t			anextents;
> > > > > > > +	int64_t			nblocks;
> > > > > > 
> > > > > > Extent counts need to be converted to unsigned in memory - they are
> > > > > > unsigned on disk....
> > > > > 
> > > > > In the current code, we have,
> > > > > 
> > > > > #define MAXEXTNUM       ((xfs_extnum_t)0x7fffffff)      /* signed int */
> > > > > #define MAXAEXTNUM      ((xfs_aextnum_t)0x7fff)         /* signed short */
> > > > > 
> > > > > i.e. the maximum allowed data extent counter and xattr extent counter are
> > > > > maximum possible values w.r.t signed int and signed short.
> > > > > 
> > > > > Can you please explain as to why signed maximum values were considered when
> > > > > the corresponding on-disk data types are unsigned?
> > > > > 
> > > > > 
> > > > 
> > > > Ok. So the reason I asked that question was because I was wondering if
> > > > changing the maximum number of extents for data and attr would cause a change
> > > > in the height of the corresponding bmbt trees (which in turn could change the log
> > > > reservation values). The following calculations prove otherwise,
> > > > 
> > > > - 5 levels deep data bmbt tree.
> > > >   |-------+------------------------+-------------------------------|
> > > >   | level | number of nodes/leaves | Total Nr recs                 |
> > > >   |-------+------------------------+-------------------------------|
> > > >   |     0 |                      1 | 3 (max root recs)             |
> > > >   |     1 |                      3 | 125 * 3 = 375                 |
> > > >   |     2 |                    375 | 125 * 375 = 46875             |
> > > >   |     3 |                  46875 | 125 * 46875 = 5859375         |
> > > >   |     4 |                5859375 | 125 * 5859375 = 732421875     |
> > > >   |     5 |              732421875 | 125 * 732421875 = 91552734375 |
> > > >   |-------+------------------------+-------------------------------|
> > > > 
> > > > - 3 levels deep attr bmbt tree.
> > > >   |-------+------------------------+-----------------------|
> > > >   | level | number of nodes/leaves | Total Nr recs         |
> > > >   |-------+------------------------+-----------------------|
> > > >   |     0 |                      1 | 2 (max root recs)     |
> > > >   |     1 |                      2 | 125 * 2 = 250         |
> > > >   |     2 |                    250 | 125 * 250 = 31250     |
> > > >   |     3 |                  31250 | 125 * 31250 = 3906250 |
> > > >   |-------+------------------------+-----------------------|
> > > > 
> > > > - Data type to number of records
> > > >   |-----------+-------------+-----------------|
> > > >   | data type | max extents | max leaf blocks |
> > > >   |-----------+-------------+-----------------|
> > > >   | int32     |  2147483647 |        17179870 |
> > > >   | uint32    |  4294967295 |        34359739 |
> > > >   | int16     |       32767 |             263 |
> > > > >   | uint16    |       65535 |             525 |
> > > >   |-----------+-------------+-----------------|
> > > > 
> > > > So data bmbt will still have a height of 5 and attr bmbt will continue to have
> > > > a height of 3.
> > > 
> > > I think extent count variables should be unsigned because there's no
> > > meaning for a negative extent count.  ("You have -3 extents." "Ehh???")
> > > 
> > > That said, it was very helpful to point out that the current MAXEXTNUM /
> > > MAXAEXTNUM symbols stop short of using all 32 (or 16) bits.
> > > 
> > > Can we use this new feature flag + inode flag to allow 4294967295
> > > extents in either fork?
> > 
> > Sure.
> > 
> > I have already tested that having 4294967295 as the maximum data extent count
> > does not cause any regressions.
> > 
> > Also, Dave was of the opinion that data extent counter be increased to
> > 64-bit. I think I should include that change along with this feature flag
> > rather than adding a new one in the near future.
> > 
> > 
> 
> Hello Dave & Darrick,
> 
> Can you please look into the following design decision w.r.t using 32-bit and
> 64-bit unsigned counters for xattr and data extents.
> 
> Maximum extent counts.
> |-----------------------+----------------------|
> | Field width (in bits) |          Max extents |
> |-----------------------+----------------------|
> |                    32 |           4294967295 |
> |                    48 |      281474976710655 |
> |                    64 | 18446744073709551615 |
> |-----------------------+----------------------|
> 
> |-------------------+-----|
> | Minimum node recs | 125 |
> | Minimum leaf recs | 125 |
> |-------------------+-----|
> 
> Data bmbt tree height (MINDBTPTRS == 3)
> |-------+------------------------+-------------------------|
> | Level | Number of nodes/leaves |           Total Nr recs |
> |       |                        | (nr nodes/leaves * 125) |
> |-------+------------------------+-------------------------|
> |     0 |                      1 |                       3 |
> |     1 |                      3 |                     375 |
> |     2 |                    375 |                   46875 |
> |     3 |                  46875 |                 5859375 |
> |     4 |                5859375 |               732421875 |
> |     5 |              732421875 |             91552734375 |
> |     6 |            91552734375 |          11444091796875 |
> |     7 |         11444091796875 |        1430511474609375 |
> |     8 |       1430511474609375 |      178813934326171875 |
> |     9 |     178813934326171875 |    22351741790771484375 |
> |-------+------------------------+-------------------------|
> 
> For counting data extents, even though we theoretically have 64 bits at our
> disposal, I think we should have (2 ** 48) - 1 as the maximum number of

Why not 2^54-1, since that's the maximum value you can put in
br_startoff?  Granted I might just use a u64 and not have to deal with
bit masking :P

Hmm, so 2^54-1 = 18,014,398,509,481,983.

BMBT blocks have a 72-byte header, so on a 1k block filesystem that's...

(1024-72) = 952 bytes for records, and 16 bytes per record.

Assuming the block is half full, that's ... 952 / (16 * 2) = 29 records
per leaf.

Assuming the max records, that's 621,186,155,499,379 leaf blocks.

Node blocks require 16 bytes per keyptr pair, so they also store 29
records per node block.

Node level 1 would need 21,420,212,258,600 blocks.
Node level 2 would need 738,628,008,918 blocks.
Node level 3 would need 25,469,931,342 blocks.
Node level 4 would need 878,273,495 blocks.
Node level 5 would need 30,285,293 blocks.
Node level 6 would need 1,044,321 blocks.
Node level 7 would need 36,012 blocks.
Node level 8 would need 1,242 blocks.
Node level 9 would need 43 blocks.
Node level 10 would need 2 blocks.
Node level 11 could hold that in the ifork.

So I guess we'd need to bump XFS_BTREE_MAXLEVELS to 11 to support that.
Though we'd run out of global RAM and disk supply long before we
actually hit that, so perhaps we don't care.  Certainly increasing
XFS_BM_MAXLEVELS will make log reservation requirements grow even more.
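
The per-level walk above can be sketched in a few lines of C. This is only
an illustration of the round-up-per-level arithmetic, not the kernel's
xfs_bmap_compute_maxlevels(); the function names and the root_ptrs
parameter (how many block pointers the inode fork root is assumed to hold)
are made up for the example.

```c
#include <assert.h>
#include <stdint.h>

/* Round-up division: blocks needed to hold n items at per_block each. */
static uint64_t blocks_needed(uint64_t n, uint64_t per_block)
{
	assert(per_block > 0);
	return (n + per_block - 1) / per_block;
}

/*
 * Count the on-disk btree levels (leaf level plus node levels) needed
 * for nr_records, stopping once the blocks of the topmost on-disk level
 * can be pointed at from the inode fork root.  root_ptrs is an assumed
 * root capacity, not a value taken from the kernel.
 */
static unsigned int bmbt_disk_levels(uint64_t nr_records, uint64_t per_block,
				     uint64_t root_ptrs)
{
	uint64_t blocks = blocks_needed(nr_records, per_block);	/* leaves */
	unsigned int levels = 1;

	assert(per_block > 1 && root_ptrs > 0);
	while (blocks > root_ptrs) {
		blocks = blocks_needed(blocks, per_block);
		levels++;
	}
	return levels;
}
```

Calling bmbt_disk_levels((1ULL << 54) - 1, 29, 2) walks the same chain as
above: one leaf level, then node levels until two blocks remain for the
ifork root to point at.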

> extents. This gives 281474976710655 (i.e. ~281 trillion extents). With this,
> bmbt tree's height grows by just two more levels (i.e. it grows from the
> current maximum height of 5 to 7). Please let me know your opinion on this.
> 
> Attr bmbt tree height (MINABTPTRS == 2)
> |-------+------------------------+-------------------------|
> | Level | Number of nodes/leaves |           Total Nr recs |
> |       |                        | (nr nodes/leaves * 125) |
> |-------+------------------------+-------------------------|
> |     0 |                      1 |                       2 |
> |     1 |                      2 |                     250 |
> |     2 |                    250 |                   31250 |
> |     3 |                  31250 |                 3906250 |
> |     4 |                3906250 |               488281250 |
> |     5 |              488281250 |             61035156250 |
> |-------+------------------------+-------------------------|
> 
> For xattr extents, (2 ** 32) - 1 = 4294967295 (~ 4 billion extents). So this
> will cause the corresponding bmbt's maximum height to go from 3 to 5.
> This probably won't cause any regression.
>
> Meanwhile, I will work on finding the impact of increasing the height of these
> two trees on log reservation.

Heh.  xfs_db log reservation dump command can be your friend for that. :)

--D

> -- 
> chandan
> 
> 
>

Chandan Rajendra April 25, 2020, 12:07 p.m. UTC | #20

On Thursday, April 23, 2020 4:00 AM Dave Chinner wrote: 
> On Wed, Apr 22, 2020 at 03:08:00PM +0530, Chandan Rajendra wrote:
> > On Monday, April 20, 2020 10:08 AM Chandan Rajendra wrote: 
> > > On Tuesday, April 14, 2020 12:25 AM Darrick J. Wong wrote: 
> > > > That said, it was very helpful to point out that the current MAXEXTNUM /
> > > > MAXAEXTNUM symbols stop short of using all 32 (or 16) bits.
> > > > 
> > > > Can we use this new feature flag + inode flag to allow 4294967295
> > > > extents in either fork?
> > > 
> > > Sure.
> > > 
> > > I have already tested that having 4294967295 as the maximum data extent count
> > > does not cause any regressions.
> > > 
> > > Also, Dave was of the opinion that data extent counter be increased to
> > > 64-bit. I think I should include that change along with this feature flag
> > > rather than adding a new one in the near future.
> > > 
> > > 
> > 
> > Hello Dave & Darrick,
> > 
> > Can you please look into the following design decision w.r.t using 32-bit and
> > 64-bit unsigned counters for xattr and data extents.
> > 
> > Maximum extent counts.
> > |-----------------------+----------------------|
> > | Field width (in bits) |          Max extents |
> > |-----------------------+----------------------|
> > |                    32 |           4294967295 |
> > |                    48 |      281474976710655 |
> > |                    64 | 18446744073709551615 |
> > |-----------------------+----------------------|
> 
> These huge numbers are impossible to compare visually.  Once numbers
> go beyond 7-9 digits, you need to start condensing them in reports.
> Humans are, in general, unable to handle strings of digits longer
> than 7-9 digits at all well...
> 
> Can you condense them by using scientific representation i.e. XEy,
> which gives:
> 
> |-----------------------+-------------|
> | Field width (in bits) | Max extents |
> |-----------------------+-------------|
> |                    32 |      4.3E09 |
> |                    48 |      2.8E14 |
> |                    64 |      1.8E19 |
> |-----------------------+-------------|
> 
> It's much easier to compare differences visually because it's now
> only 4 digits, not 20. The other alternative is to use k,m,g,t,p,e
> suffixes to indicate magnitude (4.3g, 280t, 18e), but using
> exponentials make the numbers easier to do calculations on
> directly...
>

Sorry about that. I will use scientific notation for representing large
numbers.

> > |-------------------+-----|
> > | Minimum node recs | 125 |
> > | Minimum leaf recs | 125 |
> > |-------------------+-----|
>

> Please show your working. I'm assuming this is 50% * 4kB /
> sizeof(bmbt_rec), so you are working out limits based on 4kB block
> size? Realistically, worst case behaviour will be with the minimum
> supported block size, which in this case will be 1kB....

Yes, your assumption of 4k block size is correct. I will include detailed
calculation steps in my future mails.

>
> > Data bmbt tree height (MINDBTPTRS == 3)
> > |-------+------------------------+-------------------------|
> > | Level | Number of nodes/leaves |           Total Nr recs |
> > |       |                        | (nr nodes/leaves * 125) |
> > |-------+------------------------+-------------------------|
> > |     0 |                      1 |                       3 |
> > |     1 |                      3 |                     375 |
> > |     2 |                    375 |                   46875 |
> > |     3 |                  46875 |                 5859375 |
> > |     4 |                5859375 |               732421875 |
> > |     5 |              732421875 |             91552734375 |
> > |     6 |            91552734375 |          11444091796875 |
> > |     7 |         11444091796875 |        1430511474609375 |
> > |     8 |       1430511474609375 |      178813934326171875 |
> > |     9 |     178813934326171875 |    22351741790771484375 |
> > |-------+------------------------+-------------------------|
> > 
> > For counting data extents, even though we theoretically have 64 bits at our
> > disposal, I think we should have (2 ** 48) - 1 as the maximum number of
> > extents. This gives 281474976710655 (i.e. ~281 trillion extents). With this,
> > bmbt tree's height grows by just two more levels (i.e. it grows from the
> > current maximum height of 5 to 7). Please let me know your opinion on this.
> 
> We shouldn't make up arbitrary limits when we can calculate them exactly.
> i.e. 2^63 max file size, 1kB block size (2^10), means max fragments
> is 2^53 entries. On a 64kB block size (2^16), we have a max extent
> count of 2^47....
> 
> i.e. 2^48 would be an acceptable limit for 1kB block size, but it is
> not correct for 64kB block size filesystems....

You are right about this. I will set the max data extent count to 2^47.
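
The bound follows directly from the two exponents quoted above. A minimal
sketch, assuming the 2^63 byte maximum file size from the discussion and
the worst case of one extent per filesystem block; the macro and function
names are hypothetical:

```c
#include <assert.h>
#include <stdint.h>

/* Assumed limit from the discussion: file size is bounded by 2^63 bytes. */
#define MAX_FILE_SIZE_LOG2	63

/* log2 of the largest possible extent count: one extent per block. */
static unsigned int max_extent_count_log2(unsigned int blocklog)
{
	assert(blocklog < MAX_FILE_SIZE_LOG2);
	return MAX_FILE_SIZE_LOG2 - blocklog;
}

/* The same bound as an absolute number of extents. */
static uint64_t max_extent_count(unsigned int blocklog)
{
	return UINT64_C(1) << max_extent_count_log2(blocklog);
}
```

With blocklog = 10 (1kB blocks) this gives 2^53, and with blocklog = 16
(64kB blocks) it gives 2^47, which is why a single fixed limit of 2^48
does not hold for every block size.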

> 
> > Attr bmbt tree height (MINABTPTRS == 2)
> > |-------+------------------------+-------------------------|
> > | Level | Number of nodes/leaves |           Total Nr recs |
> > |       |                        | (nr nodes/leaves * 125) |
> > |-------+------------------------+-------------------------|
> > |     0 |                      1 |                       2 |
> > |     1 |                      2 |                     250 |
> > |     2 |                    250 |                   31250 |
> > |     3 |                  31250 |                 3906250 |
> > |     4 |                3906250 |               488281250 |
> > |     5 |              488281250 |             61035156250 |
> > |-------+------------------------+-------------------------|
> > 
> > For xattr extents, (2 ** 32) - 1 = 4294967295 (~ 4 billion extents). So this
> > will cause the corresponding bmbt's maximum height to go from 3 to 5.
> > This probably won't cause any regression.
> 
> We already have the XFS_DA_NODE_MAXDEPTH set to 5, so changing the
> attr fork extent count makes no difference to the attribute fork
> bmbt reservations. i.e. the bmbt reservations are defined by the
> dabtree structure limits, not the maximum extent count the fork can
> hold.

I think the dabtree structure limits exist because of the following ...

How many levels of dabtree would be needed to hold ~100 million xattrs?
- name len = 16 bytes
  struct xfs_parent_name_rec {
          __be64  p_ino;
          __be32  p_gen;
          __be32  p_diroffset;
  };
  i.e. 64 + 32 + 32 = 128 bits = 16 bytes;
- Value len = file name length = Assume ~40 bytes
- Formula for number of node entries (used in column 3 in the table given
  below) at any level of the dabtree,
  nr_blocks * ((block size - sizeof(struct xfs_da3_node_hdr)) / sizeof(struct
  xfs_da_node_entry))
  i.e. nr_blocks * ((block size - 64) / 8)
- Formula for number of leaf entries (used in column 4 in the table given
  below),
  nr_blocks * ((block size - sizeof(xfs_attr_leaf_hdr_t)) /
  (sizeof(xfs_attr_leaf_entry_t) + valuelen + namelen + nameval))
  i.e. nr_blocks * ((block size - 32) / (8 + 2 + 1 + 16 + 40))

Here I have assumed block size to be 4k.

|-------+------------------+--------------------------+--------------------------|
| Level | Number of blocks | Number of entries (node) | Number of entries (leaf) |
|-------+------------------+--------------------------+--------------------------|
|     0 |              1.0 |                      5e2 |                    6.1e1 |
|     1 |              5e2 |                    2.5e5 |                    3.0e4 |
|     2 |            2.5e5 |                    1.3e8 |                    1.5e7 |
|     3 |            1.3e8 |                   6.6e10 |                    7.9e9 |
|-------+------------------+--------------------------+--------------------------|

Hence we would need a tree of height 3.
Total number of blocks = 1 + 5e2 + 2.5e5 + 1.3e8 = ~1.3e8
... which is < 2^32 (4.3e9)

> 
> The data fork to 64 bits has no impact on the directory
> reservations, either, because the number of extents in the directory
> is bound by the directory segment size of 32GB. i.e. a directory can
> hold, at most, 32GB of dirent data, which means there's a hard limit
> on the number of dabtree entries somewhere in the order of a few
> hundred million. That's where XFS_DA_NODE_MAXDEPTH comes from - it's
> large enough to index a max sized directory, and the BMBT overhead
> is derived from that...

Ok. Thanks for explaining that.

> 
> > Meanwhile, I will work on finding the impact of increasing the
> > height of these two trees on log reservation.
> 
> It should not change it substantially - 2 blocks per bmbt
> reservation per transaction is what I'd expect from the numbers
> presented...

I haven't got to this task yet; I will respond soon. I spent the time
figuring out how directories are organized in XFS and arriving at the
above calculations for the xattr extent counter.

Dave Chinner April 26, 2020, 10:08 p.m. UTC | #21

On Sat, Apr 25, 2020 at 05:37:39PM +0530, Chandan Rajendra wrote:
> On Thursday, April 23, 2020 4:00 AM Dave Chinner wrote: 
> > On Wed, Apr 22, 2020 at 03:08:00PM +0530, Chandan Rajendra wrote:
> > > Attr bmbt tree height (MINABTPTRS == 2)
> > > |-------+------------------------+-------------------------|
> > > | Level | Number of nodes/leaves |           Total Nr recs |
> > > |       |                        | (nr nodes/leaves * 125) |
> > > |-------+------------------------+-------------------------|
> > > |     0 |                      1 |                       2 |
> > > |     1 |                      2 |                     250 |
> > > |     2 |                    250 |                   31250 |
> > > |     3 |                  31250 |                 3906250 |
> > > |     4 |                3906250 |               488281250 |
> > > |     5 |              488281250 |             61035156250 |
> > > |-------+------------------------+-------------------------|
> > > 
> > > For xattr extents, (2 ** 32) - 1 = 4294967295 (~ 4 billion extents). So this
> > > will cause the corresponding bmbt's maximum height to go from 3 to 5.
> > > This probably won't cause any regression.
> > 
> > We already have the XFS_DA_NODE_MAXDEPTH set to 5, so changing the
> > attr fork extent count makes no difference to the attribute fork
> > bmbt reservations. i.e. the bmbt reservations are defined by the
> > dabtree structure limits, not the maximum extent count the fork can
> > hold.
> 
> I think the dabtree structure limits is because of the following ...
> 
> How many levels of dabtree would be needed to hold ~100 million xattrs?
> - name len = 16 bytes
>          struct xfs_parent_name_rec {
>                __be64  p_ino;
>                __be32  p_gen;
>                __be32  p_diroffset;
>        };
>   i.e. 64 + 32 + 32 = 128 bits = 16 bytes;
> - Value len = file name length = Assume ~40 bytes

That's quite long for a file name, but let's run with it...

> - Formula for number of node entries (used in column 3 in the table given
>   below) at any level of the dabtree,
>   nr_blocks * ((block size - sizeof(struct xfs_da3_node_hdr)) / sizeof(struct
>   xfs_da_node_entry))
>   i.e. nr_blocks * ((block size - 64) / 8)
> - Formula for number of leaf entries (used in column 4 in the table given
>   below),
>   (block size - sizeof(xfs_attr_leaf_hdr_t)) /
>   (sizeof(xfs_attr_leaf_entry_t) + valuelen + namelen + nameval)
>   i.e. nr_blocks * ((block size - 32) / (8 + 2 + 1 + 16 + 40))
> 
> Here I have assumed block size to be 4k.
> 
> |-------+------------------+--------------------------+--------------------------|
> | Level | Number of blocks | Number of entries (node) | Number of entries (leaf) |
> |-------+------------------+--------------------------+--------------------------|
> |     0 |              1.0 |                      5e2 |                    6.1e1 |
> |     1 |              5e2 |                    2.5e5 |                    3.0e4 |
> |     2 |            2.5e5 |                    1.3e8 |                    1.5e7 |
> |     3 |            1.3e8 |                   6.6e10 |                    7.9e9 |
> |-------+------------------+--------------------------+--------------------------|

I'm not sure what this table actually represents.

> 
> Hence we would need a tree of height 3.
> Total number of blocks = 1 + 5e2 + 2.5e5 + 1.3e8 = ~1.3e8

130 million blocks to hold 100 million xattrs? That doesn't pass the
smell test.

I think you are trying to do these calculations from the wrong
direction. Calculate the number of leaf blocks needed to hold the
xattr data first, then work out the height of the pointer tree from
that. e.g:

If we need 100m xattrs, we need this many 100% full 4k blocks to
hold them all:

blocks	= 100m / entries per leaf
	= 100m / 61
	= 1.64m

and if we assume 37% for the least populated (because magic
split/merge number), multiply by 3, so blocks ~= 5m for 100m xattrs
in 4k blocks.

That makes a lot more sense. Now the tree itself:

ptrs per node ^ N = 5m
ptrs per node ^ (N-1) = 5m / 500 = 10k
ptrs per node ^ (N-2) = 10k / 500 = 20
ptrs per node ^ (N-3) = 20 / 500 = 1

So, N-3 = level 0, so we've got a tree of height 4 for 100m xattrs,
and the pointer tree requires ~10000 blocks which is noise compared
to the number of leaf blocks...

As for the bmbt, we've got ~5m extents worst case, which is

ptrs per node ^ N = 5m
ptrs per node ^ (N-1) = 5m / 125 = 40k
ptrs per node ^ (N-2) = 40k / 125 = 320
ptrs per node ^ (N-3) = 320 / 125 = 3

As 3 bmbt records should fit in the inode fork, we'd only need a 4
level bmbt tree to hold this, too. It's at the lower limit of a 4
level tree, but 100m xattrs is the extreme case we are talking about
here...
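
The bottom-up estimate described above (leaf blocks first, then the
pointer tree) can be sketched as follows. This is an illustration only:
the helper names are hypothetical, the fill percentage is passed in as a
plain integer, and the walk stops at a single root block, which matches
the dabtree case but not a bmbt whose root lives in the inode fork.

```c
#include <assert.h>
#include <stdint.h>

static uint64_t div_round_up(uint64_t n, uint64_t d)
{
	assert(d > 0);
	return (n + d - 1) / d;
}

/* Leaf blocks needed when each leaf is only fill_pct percent full. */
static uint64_t leaf_blocks(uint64_t nr_entries, uint64_t max_per_leaf,
			    unsigned int fill_pct)
{
	uint64_t per_leaf = max_per_leaf * fill_pct / 100;

	assert(per_leaf > 0);
	return div_round_up(nr_entries, per_leaf);
}

/*
 * Walk upwards from the leaf level until a single root block remains.
 * Returns the number of non-leaf blocks; *height counts every level,
 * leaves included.
 */
static uint64_t index_blocks(uint64_t leaves, uint64_t fanout,
			     unsigned int *height)
{
	uint64_t blocks = leaves;
	uint64_t total = 0;

	assert(fanout > 1);
	*height = 1;
	while (blocks > 1) {
		blocks = div_round_up(blocks, fanout);
		total += blocks;
		(*height)++;
	}
	return total;
}
```

Feeding index_blocks() the ~5m leaf blocks and ~500 pointers per node used
above gives a tree of height 4, with the pointer blocks a small fraction of
the leaf count.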

FWIW, repeat this with a directory data segment size of 32GB w/ 40
byte names, and the numbers aren't much different to a worst case
xattr tree of this shape. You'll see the reason for the dabtree
height being limited to 5, and that neither the directory structure
nor the xattr structure is anywhere near the 2^32 extent count
limit...

Cheers,

Dave.

Christoph Hellwig April 27, 2020, 7:39 a.m. UTC | #22

FYI, I have had a series in the works for a while but not quite 
finished yet that moves the in-memory nextents and format fields
into the ifork structure.  I feared this might conflict badly, but
so far this seems relatively harmless.  Note that your patch creates
some not so nice layout in struct xfs_icdinode, so maybe I need to
rush and finish that series ASAP.

> +static inline int32_t XFS_DFORK_NEXTENTS(struct xfs_sb *sbp,
> +					struct xfs_dinode *dip, int whichfork)
> +{
> +	int32_t anextents;
> +
> +	if (whichfork == XFS_DATA_FORK)
> +		return be32_to_cpu((dip)->di_nextents);
> +
> +	anextents = be16_to_cpu((dip)->di_anextents_lo);
> +	if (xfs_sb_version_has_v3inode(sbp))
> +		anextents |= ((u32)(be16_to_cpu((dip)->di_anextents_hi)) << 16);
> +
> +	return anextents;

No need for any of the parentheses around dip.  Also this function really
deserves a proper lower case name now, and probably should be moved out
of line.

>  typedef uint32_t	xfs_extlen_t;	/* extent length in blocks */
>  typedef uint32_t	xfs_agnumber_t;	/* allocation group number */
>  typedef int32_t		xfs_extnum_t;	/* # of extents in a file */
> -typedef int16_t		xfs_aextnum_t;	/* # extents in an attribute fork */
> +typedef int32_t		xfs_aextnum_t;	/* # extents in an attribute fork */

We can just retire xfs_aextnum_t.  It only has 4 uses anyway.

> @@ -327,7 +327,7 @@ xfs_inode_to_log_dinode(
>  	to->di_nblocks = from->di_nblocks;
>  	to->di_extsize = from->di_extsize;
>  	to->di_nextents = from->di_nextents;
> -	to->di_anextents = from->di_anextents;
> +	to->di_anextents_lo = ((u32)(from->di_anextents)) & 0xffff;

No need for any of the casting here.

> @@ -3044,7 +3045,14 @@ xlog_recover_inode_pass2(
>  			goto out_release;
>  		}
>  	}
> -	if (unlikely(ldip->di_nextents + ldip->di_anextents > ldip->di_nblocks)){
> +
> +	nextents = ldip->di_anextents_lo;
> +	if (xfs_sb_version_has_v3inode(&mp->m_sb))
> +		nextents |= ((u32)(ldip->di_anextents_hi) << 16);
> +
> +	nextents += ldip->di_nextents;

Little helpers to get/set the attr extents in the log inode would be nice.
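
A rough sketch of what such get/set helpers could look like. The struct
below is a stand-in with only the two 16-bit halves from the patch, the
helper names are hypothetical, and the "wide" argument stands in for the
xfs_sb_version_has_v3inode() check so that the example compiles outside
the kernel tree.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the relevant part of the log inode. */
struct log_dinode_sketch {
	uint16_t di_anextents_lo;
	uint16_t di_anextents_hi;
};

/* Combine the two halves; the high half only counts when "wide" is set. */
static uint32_t log_dinode_get_anextents(const struct log_dinode_sketch *ld,
					 bool wide)
{
	uint32_t nextents = ld->di_anextents_lo;

	if (wide)
		nextents |= (uint32_t)ld->di_anextents_hi << 16;
	return nextents;
}

/* Split a count into the two halves; narrow inodes must fit in 16 bits. */
static void log_dinode_set_anextents(struct log_dinode_sketch *ld,
				     uint32_t nextents, bool wide)
{
	ld->di_anextents_lo = (uint16_t)(nextents & 0xffff);
	if (wide)
		ld->di_anextents_hi = (uint16_t)(nextents >> 16);
	else
		assert(nextents <= UINT16_MAX);
}
```

With helpers of this shape, the check in xlog_recover_inode_pass2() would
reduce to adding the helper's return value to di_nextents instead of
open-coding the shift and mask.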


Last but not least:  This seems like a feature flag we could just lazily
set once needed, similar to attr2.

Christoph Hellwig April 27, 2020, 7:42 a.m. UTC | #23

On Tue, Apr 07, 2020 at 11:20:00AM +1000, Dave Chinner wrote:
> Ok, I think you've limited what we can do here by using this "fill
> holes" variable split. I've never liked doing this, and we've only
> done it in the past when we haven't had space in the inode to create
> a new 32 bit variable.
> 
> IOWs, this is a v5 format feature only, so we should just create a
> new variable:
> 
> 	__be32		di_attr_nextents;
> 
> With that in place, we can now do what we did extending the v1 inode
> link count (16 bits) to the v2 inode link count (32 bits).
> 
> That is, when the attribute count is going to overflow, we set an
> inode flag on disk to indicate that it now has a 32 bit extent count
> and uses that field in the inode, and we set a RO-compat feature
> flag in the superblock to indicate that there are 32 bit attr fork
> extent counts in use.
> 
> Old kernels can still read the filesystem, but see the extent count
> as "max" (65535) and can't modify the attr fork, and hence can't
> corrupt the 32 bit count they know nothing about.
> 
> If the kernel sees the RO feature bit set, it can set the inode flag
> on inodes it is modifying and update both the old and new counters
> appropriately when flushing the inode to disk (i.e. transparent
> conversion).
> 
> In future, mkfs can then set the RO feature flag by default so all
> new filesystems use the 32 bit counter.

I don't like just moving to a new counter.  This wastes precious
space that is going to be really confusing to reuse later, and doesn't
really help with performance.  And we can do the RO_COMPAT trick
even without that.

Chandan Rajendra April 29, 2020, 3:35 p.m. UTC | #24

On Monday, April 27, 2020 3:38 AM Dave Chinner wrote: 
> On Sat, Apr 25, 2020 at 05:37:39PM +0530, Chandan Rajendra wrote:
> > On Thursday, April 23, 2020 4:00 AM Dave Chinner wrote: 
> > > On Wed, Apr 22, 2020 at 03:08:00PM +0530, Chandan Rajendra wrote:
> > > > Attr bmbt tree height (MINABTPTRS == 2)
> > > > |-------+------------------------+-------------------------|
> > > > | Level | Number of nodes/leaves |           Total Nr recs |
> > > > |       |                        | (nr nodes/leaves * 125) |
> > > > |-------+------------------------+-------------------------|
> > > > |     0 |                      1 |                       2 |
> > > > |     1 |                      2 |                     250 |
> > > > |     2 |                    250 |                   31250 |
> > > > |     3 |                  31250 |                 3906250 |
> > > > |     4 |                3906250 |               488281250 |
> > > > |     5 |              488281250 |             61035156250 |
> > > > |-------+------------------------+-------------------------|
> > > > 
> > > > For xattr extents, (2 ** 32) - 1 = 4294967295 (~ 4 billion extents). So this
> > > > will cause the corresponding bmbt's maximum height to go from 3 to 5.
> > > > This probably won't cause any regression.
> > > 
> > > We already have the XFS_DA_NODE_MAXDEPTH set to 5, so changing the
> > > attr fork extent count makes no difference to the attribute fork
> > > bmbt reservations. i.e. the bmbt reservations are defined by the
> > > dabtree structure limits, not the maximum extent count the fork can
> > > hold.
> > 
> > I think the dabtree structure limits is because of the following ...
> > 
> > How many levels of dabtree would be needed to hold ~100 million xattrs?
> > - name len = 16 bytes
> >          struct xfs_parent_name_rec {
> >                __be64  p_ino;
> >                __be32  p_gen;
> >                __be32  p_diroffset;
> >        };
> >   i.e. 64 + 32 + 32 = 128 bits = 16 bytes;
> > - Value len = file name length = Assume ~40 bytes
> 
> That's quite long for a file name, but let's run with it...
> 
> > - Formula for number of node entries (used in column 3 in the table given
> >   below) at any level of the dabtree,
> >   nr_blocks * ((block size - sizeof(struct xfs_da3_node_hdr)) / sizeof(struct
> >   xfs_da_node_entry))
> >   i.e. nr_blocks * ((block size - 64) / 8)
> > - Formula for number of leaf entries (used in column 4 in the table given
> >   below),
> >   (block size - sizeof(xfs_attr_leaf_hdr_t)) /
> >   (sizeof(xfs_attr_leaf_entry_t) + valuelen + namelen + nameval)
> >   i.e. nr_blocks * ((block size - 32) / (8 + 2 + 1 + 16 + 40))
> > 
> > Here I have assumed block size to be 4k.
> > 
> > |-------+------------------+--------------------------+--------------------------|
> > | Level | Number of blocks | Number of entries (node) | Number of entries (leaf) |
> > |-------+------------------+--------------------------+--------------------------|
> > |     0 |              1.0 |                      5e2 |                    6.1e1 |
> > |     1 |              5e2 |                    2.5e5 |                    3.0e4 |
> > |     2 |            2.5e5 |                    1.3e8 |                    1.5e7 |
> > |     3 |            1.3e8 |                   6.6e10 |                    7.9e9 |
> > |-------+------------------+--------------------------+--------------------------|
> 
> I'm not sure what this table actually represents.
> 
> > 
> > Hence we would need a tree of height 3.
> > Total number of blocks = 1 + 5e2 + 2.5e5 + 1.3e8 = ~1.3e8
> 
> 130 million blocks to hold 100 million xattrs? That doesn't pass the
> smell test.
> 
> I think you are trying to do these calculations from the wrong
> direction.

You are right. Btrees grow in height by adding a new root
node. Hence the btree space usage should be calculated in bottom-to-top
direction.

> Calculate the number of leaf blocks needed to hold the
> xattr data first, then work out the height of the pointer tree from
> that. e.g:
> 
> If we need 100m xattrs, we need this many 100% full 4k blocks to
> hold them all:
> 
> blocks	= 100m / entries per leaf
> 	= 100m / 61
> 	= 1.64m
> 
> and if we assume 37% for the least populated (because magic
> split/merge number), multiply by 3, so blocks ~= 5m for 100m xattrs
> in 4k blocks.
> 
> That makes a lot more sense. Now the tree itself:
> 
> ptrs per node ^ N = 5m
> ptrs per node ^ (N-1) = 5m / 500 = 10k
> ptrs per node ^ (N-2) = 10k / 500 = 20
> ptrs per node ^ (N-3) = 20 / 500 = 1
> 
> So, N-3 = level 0, so we've got a tree of height 4 for 100m xattrs,
> and the pointer tree requires ~10000 blocks which is noise compared
> to the number of leaf blocks...
> 
> As for the bmbt, we've got ~5m extents worst case, which is
> 
> ptrs per node ^ N = 5m
> ptrs per node ^ (N-1) = 5m / 125 = 40k
> ptrs per node ^ (N-2) = 40k / 125 = 320
> ptrs per node ^ (N-3) = 320 / 125 = 3
> 
> As 3 bmbt records should fit in the inode fork, we'd only need a 4
> level bmbt tree to hold this, too. It's at the lower limit of a 4
> level tree, but 100m xattrs is the extreme case we are talking about
> here...
> 
> FWIW, repeat this with a directory data segment size of 32GB w/ 40
> byte names, and the numbers aren't much different to a worst case
> xattr tree of this shape. You'll see the reason for the dabtree
> height being limited to 5, and that neither the directory structure
> nor the xattr structure is anywhere near the 2^32 extent count
> limit...

Directory segment size is 32GiB
  - Number of directory entries required for indexing 32GiB.
    - 32GiB is divided into 4k data blocks.
    - Number of 4k blocks = 32GiB / 4k = 8m
    - Each 4k data block has,
      - struct xfs_dir3_data_hdr = 64 bytes
      - struct xfs_dir2_data_entry = 12 bytes (metadata) + 40 bytes (name)
                                   = 52 bytes
      - Number of 'struct xfs_dir2_data_entry' in a 4k block
        (4096 - 64) / 52 = 78
    - Number of 'struct xfs_dir2_data_entry' in 32-GiB space
      8m * 78 = 654m
  - Contents of a single dabtree leaf
    - struct xfs_dir3_leaf_hdr = 64 bytes
    - struct xfs_dir2_leaf_entry = 8 bytes
    - Number of 'struct xfs_dir2_leaf_entry' = (4096 - 64) / 8 = 504
    - 37% of 504 = 186 entries
  - Contents of a single dabtree node
    - struct xfs_da3_node_hdr = 64 bytes
    - struct xfs_da_node_entry = 8 bytes
    - Number of 'struct xfs_da_node_entry' = (4096 - 64) / 8 = 504
  - Nr leaves
    Level (N) = 654m / 186 = 3m leaves
    Level (N-1) = 3m / 504 = 6k
    Level (N-2) = 6k / 504 = 12
    Level (N-3) = 12 / 504 = 1
    Dabtree having 4 levels is sufficient.

Hence a dabtree with 5 levels should be more than enough to index a 32GiB
directory segment containing directory entries with even shorter names.
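
The directory estimate above can be reproduced with a short sketch. The
sizes are the ones assumed in the list (64-byte headers, 52-byte data
entries, 504 leaf entries at 37% fill) and the function names are
hypothetical. Note that integer division gives 77 data entries per 4k
block where the list rounds to 78; the resulting leaf count is still
about 3m, so the conclusion does not change.

```c
#include <assert.h>
#include <stdint.h>

/* Directory entries that fit in a data segment of seg_bytes. */
static uint64_t dirents_in_segment(uint64_t seg_bytes, uint64_t blocksize,
				   uint64_t hdr_bytes, uint64_t entry_bytes)
{
	uint64_t per_block;

	assert(blocksize > hdr_bytes && entry_bytes > 0);
	per_block = (blocksize - hdr_bytes) / entry_bytes;
	return (seg_bytes / blocksize) * per_block;
}

/* Leaf blocks needed to index dirents at the given leaf fill percentage. */
static uint64_t leaves_for_dirents(uint64_t dirents, uint64_t max_leaf_entries,
				   unsigned int fill_pct)
{
	uint64_t per_leaf = max_leaf_entries * fill_pct / 100;

	assert(per_leaf > 0);
	return (dirents + per_leaf - 1) / per_leaf;
}
```

For a 32GiB segment with 4k blocks, dirents_in_segment(1ULL << 35, 4096,
64, 52) yields roughly 646m entries, and leaves_for_dirents() with 504
entries at 37% fill yields roughly 3.5m leaf blocks.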

Even with 5m extents (used in xattr tree example above) consumed by a da
btree, this is still much less than the limit imposed by 2^32 (i.e. ~4
billion) extents.

Hence the actual log space consumed for logging bmbt blocks is limited by the
height of da btree.

My experiment with changing the values of MAXEXTNUM and MAXAEXTNUM to 2^47 and
2^32 respectively, gave me the following results,
- For 1k block size, bmbt tree height increased by 3.
- For 4k block size, bmbt tree height increased by 2.

This happens because xfs_bmap_compute_maxlevels() calculates the BMBT tree
height by assuming that there will be MAXEXTNUM/MAXAEXTNUM worth of leaf
entries in the worst case.

For the attr fork bmbt, do you think the calculation should be changed to
consider the number of extents occupied by a dabtree holding > 100 million
xattrs?

The new increase in Bmbt height in turn causes the static reservation values
to increase. In the worst case, the maximum increase observed was 118k bytes
(4k block size, reflink=0, tr_rename).

The experiment was executed after applying "xfsprogs: Fix log reservation
calculation for xattr insert operation" patch
(https://lore.kernel.org/linux-xfs/20200404085229.2034-2-chandanrlinux@gmail.com/)

I am attaching the output of "xfs_db -c logres <dev>" executed on the
following configurations of the XFS filesystem.
- -b size=1k -m reflink=0
- -b size=1k -m rmapbt=1,reflink=1
- -b size=4k -m reflink=0
- -b size=4k -m rmapbt=1,reflink=1
- -b size=1k -m crc=0
- -b size=4k -m crc=0

I will go through the code which calculates the log reservations of the
entries which have a drastic increase in their values.

Chandan Rajendra April 30, 2020, 2:29 a.m. UTC | #25

On Monday, April 27, 2020 1:09 PM Christoph Hellwig wrote: 
> FYI, I have had a series in the works for a while but not quite 
> finished yet that moves the in-memory nextents and format fields
> into the ifork structure.  I feared this might conflict badly, but
> so far this seems relatively harmless.  Note that your patch creates
> some not so nice layout in struct xfs_icdinode, so maybe I need to
> rush and finish that series ASAP.
> 
> > +static inline int32_t XFS_DFORK_NEXTENTS(struct xfs_sb *sbp,
> > +					struct xfs_dinode *dip, int whichfork)
> > +{
> > +	int32_t anextents;
> > +
> > +	if (whichfork == XFS_DATA_FORK)
> > +		return be32_to_cpu((dip)->di_nextents);
> > +
> > +	anextents = be16_to_cpu((dip)->di_anextents_lo);
> > +	if (xfs_sb_version_has_v3inode(sbp))
> > +		anextents |= ((u32)(be16_to_cpu((dip)->di_anextents_hi)) << 16);
> > +
> > +	return anextents;
> 
> No need for any of the parentheses around dip.  Also this function really
> deserves a proper lower case name now, and probably should be moved out
> of line.

Sure, I will implement that.

> 
> >  typedef uint32_t	xfs_extlen_t;	/* extent length in blocks */
> >  typedef uint32_t	xfs_agnumber_t;	/* allocation group number */
> >  typedef int32_t		xfs_extnum_t;	/* # of extents in a file */
> > -typedef int16_t		xfs_aextnum_t;	/* # extents in an attribute fork */
> > +typedef int32_t		xfs_aextnum_t;	/* # extents in an attribute fork */
> 
> We can just retire xfs_aextnum_t.  It only has 4 uses anyway.
> 
> > @@ -327,7 +327,7 @@ xfs_inode_to_log_dinode(
> >  	to->di_nblocks = from->di_nblocks;
> >  	to->di_extsize = from->di_extsize;
> >  	to->di_nextents = from->di_nextents;
> > -	to->di_anextents = from->di_anextents;
> > +	to->di_anextents_lo = ((u32)(from->di_anextents)) & 0xffff;
> 
> No need for any of the casting here.

Ok.

> 
> > @@ -3044,7 +3045,14 @@ xlog_recover_inode_pass2(
> >  			goto out_release;
> >  		}
> >  	}
> > -	if (unlikely(ldip->di_nextents + ldip->di_anextents > ldip->di_nblocks)){
> > +
> > +	nextents = ldip->di_anextents_lo;
> > +	if (xfs_sb_version_has_v3inode(&mp->m_sb))
> > +		nextents |= ((u32)(ldip->di_anextents_hi) << 16);
> > +
> > +	nextents += ldip->di_nextents;
> 
> Little helpers to get/set the attr extents in the log inode would be nice.
>

Ok. I will implement the helper functions.

> 
> Last but not least:  This seems like a feature flag we could just lazily
> set once needed, similar to attr2.
> 

Yes, I will implement this change before posting the next version of the
patchset.
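
To make the idea concrete, a minimal sketch of lazily setting such a flag,
assuming a single feature bit in a superblock-like structure. The names
(demo_sb, DEMO_FEAT_ATTR_EXT32, demo_note_attr_extents) are made up for this
example; the real change would also have to log the superblock update.

```c
#include <stdbool.h>
#include <stdint.h>

#define DEMO_FEAT_ATTR_EXT32	(1u << 0)	/* hypothetical feature bit */

/* Hypothetical stand-in for the superblock feature field. */
struct demo_sb {
	uint32_t features;
};

/*
 * Turn the feature on the first time an attr fork extent count no longer
 * fits in 16 bits. Returns true only when the flag was newly set, i.e.
 * when the caller would have to write the superblock change out.
 */
static inline bool demo_note_attr_extents(struct demo_sb *sb,
					  uint32_t anextents)
{
	if (anextents <= UINT16_MAX)
		return false;
	if (sb->features & DEMO_FEAT_ATTR_EXT32)
		return false;

	sb->features |= DEMO_FEAT_ATTR_EXT32;
	return true;
}
```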

Chandan Rajendra May 1, 2020, 7:08 a.m. UTC | #26

On Wednesday, April 29, 2020 9:05 PM Chandan Rajendra wrote: 
> On Monday, April 27, 2020 3:38 AM Dave Chinner wrote: 
> > On Sat, Apr 25, 2020 at 05:37:39PM +0530, Chandan Rajendra wrote:
> > > On Thursday, April 23, 2020 4:00 AM Dave Chinner wrote: 
> > > > On Wed, Apr 22, 2020 at 03:08:00PM +0530, Chandan Rajendra wrote:
> > > > > Attr bmbt tree height (MINABTPTRS == 2)
> > > > > |-------+------------------------+-------------------------|
> > > > > | Level | Number of nodes/leaves |           Total Nr recs |
> > > > > |       |                        | (nr nodes/leaves * 125) |
> > > > > |-------+------------------------+-------------------------|
> > > > > |     0 |                      1 |                       2 |
> > > > > |     1 |                      2 |                     250 |
> > > > > |     2 |                    250 |                   31250 |
> > > > > |     3 |                  31250 |                 3906250 |
> > > > > |     4 |                3906250 |               488281250 |
> > > > > |     5 |              488281250 |             61035156250 |
> > > > > |-------+------------------------+-------------------------|
> > > > > 
> > > > > For xattr extents, (2 ** 32) - 1 = 4294967295 (~ 4 billion extents). So this
> > > > > will cause the corresponding bmbt's maximum height to go from 3 to 5.
> > > > > This probably won't cause any regression.
> > > > 
> > > > We already have the XFS_DA_NODE_MAXDEPTH set to 5, so changing the
> > > > attr fork extent count makes no difference to the attribute fork
> > > > bmbt reservations. i.e. the bmbt reservations are defined by the
> > > > dabtree structure limits, not the maximum extent count the fork can
> > > > hold.
> > > 
> > > I think the dabtree structure limits exist because of the following ...
> > > 
> > > How many levels of dabtree would be needed to hold ~100 million xattrs?
> > > - name len = 16 bytes
> > >          struct xfs_parent_name_rec {
> > >                __be64  p_ino;
> > >                __be32  p_gen;
> > >                __be32  p_diroffset;
> > >        };
> > >   i.e. 64 + 32 + 32 = 128 bits = 16 bytes;
> > > - Value len = file name length = Assume ~40 bytes
> > 
> > That's quite long for a file name, but let's run with it...
> > 
> > > - Formula for number of node entries (used in column 3 in the table given
> > >   below) at any level of the dabtree,
> > >   nr_blocks * ((block size - sizeof(struct xfs_da3_node_hdr)) / sizeof(struct
> > >   xfs_da_node_entry))
> > >   i.e. nr_blocks * ((block size - 64) / 8)
> > > - Formula for number of leaf entries (used in column 4 in the table given
> > >   below),
> > >   nr_blocks * ((block size - sizeof(xfs_attr_leaf_hdr_t)) /
> > >   (sizeof(xfs_attr_leaf_entry_t) + valuelen + namelen + nameval))
> > >   i.e. nr_blocks * ((block size - 32) / (8 + 2 + 1 + 16 + 40))
> > > 
> > > Here I have assumed block size to be 4k.
> > > 
> > > |-------+------------------+--------------------------+--------------------------|
> > > | Level | Number of blocks | Number of entries (node) | Number of entries (leaf) |
> > > |-------+------------------+--------------------------+--------------------------|
> > > |     0 |              1.0 |                      5e2 |                    6.1e1 |
> > > |     1 |              5e2 |                    2.5e5 |                    3.0e4 |
> > > |     2 |            2.5e5 |                    1.3e8 |                    1.5e7 |
> > > |     3 |            1.3e8 |                   6.6e10 |                    7.9e9 |
> > > |-------+------------------+--------------------------+--------------------------|
> > 
> > I'm not sure what this table actually represents.
> > 
> > > 
> > > Hence we would need a tree of height 3.
> > > Total number of blocks = 1 + 5e2 + 2.5e5 + 1.3e8 = ~1.3e8
> > 
> > 130 million blocks to hold 100 million xattrs? That doesn't pass the
> > smell test.
> > 
> > I think you are trying to do these calculations from the wrong
> > direction.
> 
> You are right. Btrees grow in height by adding a new root
> node. Hence the btree space usage should be calculated in bottom-to-top
> direction.
> 
> > Calculate the number of leaf blocks needed to hold the
> > xattr data first, then work out the height of the pointer tree from
> > that. e.g:
> > 
> > If we need 100m xattrs, we need this many 100% full 4k blocks to
> > hold them all:
> > 
> > blocks	= 100m / entries per leaf
> > 	= 100m / 61
> > 	= 1.64m
> > 
> > and if we assume 37% for the least populated (because magic
> > split/merge number), multiply by 3, so blocks ~= 5m for 100m xattrs
> > in 4k blocks.
> > 
> > That makes a lot more sense. Now the tree itself:
> > 
> > ptrs per node ^ N = 5m
> > ptrs per node ^ (N-1) = 5m / 500 = 10k
> > ptrs per node ^ (N-2) = 10k / 500 = 20
> > ptrs per node ^ (N-3) = 20 / 500 = 1
> > 
> > So, N-3 = level 0, so we've got a tree of height 4 for 100m xattrs,
> > and the pointer tree requires ~12000 blocks which is noise compared
> > to the number of leaf blocks...
> > 
> > As for the bmbt, we've got ~5m extents worst case, which is
> > 
> > ptrs per node ^ N = 5m
> > ptrs per node ^ (N-1) = 5m / 125 = 40k
> > ptrs per node ^ (N-2) = 40k / 125 = 320
> > ptrs per node ^ (N-3) = 320 / 125 = 3
> > 
> > As 3 bmbt records should fit in the inode fork, we'd only need a 4
> > level bmbt tree to hold this, too. It's at the lower limit of a 4
> > level tree, but 100m xattrs is the extreme case we are talking about
> > here...
> > 
> > FWIW, repeat this with a directory data segment size of 32GB w/ 40
> > byte names, and the numbers aren't much different to a worst case
> > xattr tree of this shape. You'll see the reason for the dabtree
> > height being limited to 5, and that neither the directory structure
> > nor the xattr structure is anywhere near the 2^32 extent count
> > limit...
> 
> Directory segment size is 32GiB
>   - Number of directory entries required for indexing 32GiB.
>     - 32GiB is divided into 4k data blocks.
>     - Number of 4k blocks = 32GiB / 4k = 8m
>     - Each 4k data block has,
>       - struct xfs_dir3_data_hdr = 64 bytes
>       - struct xfs_dir2_data_entry = 12 bytes (metadata) + 40 bytes (name)
>                                    = 52 bytes
>       - Number of 'struct xfs_dir2_data_entry' in a 4k block
>         (4096 - 64) / 52 = 78
>     - Number of 'struct xfs_dir2_data_entry' in 32-GiB space
>       8m * 78 = 654m
>   - Contents of a single dabtree leaf
>     - struct xfs_dir3_leaf_hdr = 64 bytes
>     - struct xfs_dir2_leaf_entry = 8 bytes
>     - Number of 'struct xfs_dir2_leaf_entry' = (4096 - 64) / 8 = 504
>     - 37% of 504 = 186 entries
>   - Contents of a single dabtree node
>     - struct xfs_da3_node_hdr = 64 bytes
>     - struct xfs_da_node_entry = 8 bytes
>     - Number of 'struct xfs_da_node_entry' = (4096 - 64) / 8 = 504
>   - Nr leaves
>     Level (N) = 654m / 186 = 3m leaves
>     Level (N-1) = 3m / 504 = 6k
>     Level (N-2) = 6k / 504 = 12
>     Level (N-3) = 12 / 504 = 1
>     Dabtree having 4 levels is sufficient.
> 
> Hence a dabtree with 5 levels should be more than enough to index a 32GiB
> directory segment containing directory entries with even shorter names.
> 
> Even with 5m extents (used in xattr tree example above) consumed by a da
> btree, this is still much less than the limit imposed by 2^32 (i.e. ~4
> billion) extents.
> 
> Hence the actual log space consumed for logging bmbt blocks is limited by the
> height of da btree.
> 
> My experiment with changing the values of MAXEXTNUM and MAXAEXTNUM to 2^47 and
> 2^32 respectively, gave me the following results,
> - For 1k block size, bmbt tree height increased by 3.
> - For 4k block size, bmbt tree height increased by 2.
> 
> This happens because xfs_bmap_compute_maxlevels() calculates the BMBT tree
> height by assuming that there will be MAXEXTNUM/MAXAEXTNUM worth of leaf
> entries in the worst case.
> 
> For the attr fork bmbt, do you think the calculation should be changed to
> consider the number of extents occupied by a dabtree holding > 100 million
> xattrs?
> 
> The increase in bmbt height in turn causes the static reservation values
> to increase. In the worst case, the maximum increase observed was 118k bytes
> (4k block size, reflink=0, tr_rename).
> 
> The experiment was executed after applying "xfsprogs: Fix log reservation
> calculation for xattr insert operation" patch
> (https://lore.kernel.org/linux-xfs/20200404085229.2034-2-chandanrlinux@gmail.com/)
> 
> I am attaching the output of "xfs_db -c logres <dev>" executed on the
> following configurations of the XFS filesystem.
> - -b size=1k -m reflink=0
> - -b size=1k -m rmapbt=1,reflink=1
> - -b size=4k -m reflink=0
> - -b size=4k -m rmapbt=1,reflink=1
> - -b size=1k -m crc=0
> - -b size=4k -m crc=0
> 
> I will go through the code which calculates the log reservations of the
> entries which have a drastic increase in their values.
> 
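
As a sanity check on the numbers quoted above, here is a minimal sketch of the
bottom-up height calculation. It is not the kernel's
xfs_bmap_compute_maxlevels(); tree_height() and div_round_up() are
hypothetical names, and the fanouts (500, 504, 125) and root capacities are
the rounded figures used in this thread.

```c
#include <stdint.h>

static uint64_t div_round_up(uint64_t n, uint64_t d)
{
	return (n + d - 1) / d;
}

/*
 * Count tree levels from the leaves upwards.
 *
 * leaf_blocks:   number of blocks at the lowest level
 * ptrs_per_node: pointers held by one interior block (must be >= 2)
 * root_ptrs:     pointers that fit in the root; pass 1 when the root is an
 *                ordinary block, or the inode fork capacity for a bmbt
 */
unsigned int tree_height(uint64_t leaf_blocks, uint64_t ptrs_per_node,
			 uint64_t root_ptrs)
{
	uint64_t blocks = leaf_blocks;
	unsigned int height = 1;

	while (blocks > 1) {
		if (blocks <= root_ptrs)
			blocks = 1;	/* remaining pointers fit in the root */
		else
			blocks = div_round_up(blocks, ptrs_per_node);
		height++;
	}
	return height;
}
```

With these inputs, tree_height(5000000, 500, 1) gives 4 for the dabtree and
tree_height(div_round_up(5000000, 125), 125, 3) gives 4 for the attr fork
bmbt, matching the estimates above.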

The highest increase (i.e. an increase of 118k) in log reservation was
associated with the rename operation,

STATIC uint
xfs_calc_rename_reservation(
        struct xfs_mount        *mp)
{
        return XFS_DQUOT_LOGRES(mp) +
                max((xfs_calc_inode_res(mp, 4) +
                     xfs_calc_buf_res(2 * XFS_DIROP_LOG_COUNT(mp),
                                      XFS_FSB_TO_B(mp, 1))),
                    (xfs_calc_buf_res(7, mp->m_sb.sb_sectsize) +
                     xfs_calc_buf_res(xfs_allocfree_log_count(mp, 3),
                                      XFS_FSB_TO_B(mp, 1))));
}

The first argument to max() contributes the highest value.

xfs_calc_inode_res(mp, 4) + xfs_calc_buf_res(2 * XFS_DIROP_LOG_COUNT(mp),XFS_FSB_TO_B(mp, 1))

The inode reservation part is a constant.

The number of blocks computed by the second operand of the '+' operator is,

2 * ((XFS_DA_NODE_MAXDEPTH + 2) + ((XFS_DA_NODE_MAXDEPTH + 2) * (bmbt_height - 1)))

= 2 * ((5 + 2) + ((5 + 2) * (bmbt_height - 1)))

When bmbt height is 5 (i.e. when using the original 2^31 extent count limit) this
evaluates to,

2 * ((5 + 2) + ((5 + 2) * (5 - 1)))
= 70 blocks

When bmbt height is 7 (i.e. when using the new 2^47 extent count limit) this
evaluates to,

2 * ((5 + 2) + ((5 + 2) * (7 - 1)))
= 98 blocks
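
The same arithmetic as a small sketch. The function names are hypothetical,
and the 128 bytes of per-buffer log overhead passed to buf_res_bytes() below
is an assumption on my part rather than something stated in this thread.

```c
#include <stdint.h>

#define DEMO_DA_NODE_MAXDEPTH	5	/* XFS_DA_NODE_MAXDEPTH per the thread */

/* Blocks logged by one directory operation for a given bmbt height. */
unsigned int dirop_log_blocks(unsigned int bmbt_height)
{
	unsigned int da_blocks = DEMO_DA_NODE_MAXDEPTH + 2;

	return da_blocks + da_blocks * (bmbt_height - 1);
}

/* Rename touches two directories, hence twice the per-directory count. */
unsigned int rename_log_blocks(unsigned int bmbt_height)
{
	return 2 * dirop_log_blocks(bmbt_height);
}

/* Bytes reserved for nbufs buffers, each with a fixed logging overhead. */
uint64_t buf_res_bytes(unsigned int nbufs, unsigned int blocksize,
		       unsigned int per_buf_overhead)
{
	return (uint64_t)nbufs * (blocksize + per_buf_overhead);
}
```

rename_log_blocks(5) is 70 and rename_log_blocks(7) is 98. The 28 extra
blocks at a 4k block size, plus the assumed 128 bytes of overhead per buffer,
come to 118272 bytes, which lines up with the 118k increase mentioned above.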

However, I don't see any extraneous space reserved by the above calculation
that could be removed. Also, IMHO an increase by 118k is most likely not going
to introduce any bugs. I will execute xfstests to make sure that no
regressions get added.

Darrick J. Wong May 12, 2020, 11:53 p.m. UTC | #27

On Fri, May 01, 2020 at 12:38:30PM +0530, Chandan Rajendra wrote:
> However, I don't see any extraneous space reserved by the above calculation
> that could be removed. Also, IMHO an increase by 118k is most likely not going
> to introduce any bugs. I will execute xfstests to make sure that no
> regressions get added.

(Did fstests pass?)

--D

> -- 
> chandan
> 
> 
>

Chandan Rajendra May 13, 2020, 12:19 p.m. UTC | #28

On Wednesday, May 13, 2020 5:23 AM Darrick J. Wong wrote: 
> On Fri, May 01, 2020 at 12:38:30PM +0530, Chandan Rajendra wrote:
> > On Wednesday, April 29, 2020 9:05 PM Chandan Rajendra wrote: 
> > > On Monday, April 27, 2020 3:38 AM Dave Chinner wrote: 
> > > > On Sat, Apr 25, 2020 at 05:37:39PM +0530, Chandan Rajendra wrote:
> > > > > On Thursday, April 23, 2020 4:00 AM Dave Chinner wrote: 
> > > > > > On Wed, Apr 22, 2020 at 03:08:00PM +0530, Chandan Rajendra wrote:
> > > > > > > Attr bmbt tree height (MINABTPTRS == 2)
> > > > > > > |-------+------------------------+-------------------------|
> > > > > > > | Level | Number of nodes/leaves |           Total Nr recs |
> > > > > > > |       |                        | (nr nodes/leaves * 125) |
> > > > > > > |-------+------------------------+-------------------------|
> > > > > > > |     0 |                      1 |                       2 |
> > > > > > > |     1 |                      2 |                     250 |
> > > > > > > |     2 |                    250 |                   31250 |
> > > > > > > |     3 |                  31250 |                 3906250 |
> > > > > > > |     4 |                3906250 |               488281250 |
> > > > > > > |     5 |              488281250 |             61035156250 |
> > > > > > > |-------+------------------------+-------------------------|
> > > > > > > 
> > > > > > > For xattr extents, (2 ** 32) - 1 = 4294967295 (~ 4 billion extents). So this
> > > > > > > will cause the corresponding bmbt's maximum height to go from 3 to 5.
> > > > > > > This probably won't cause any regression.
> > > > > > 
> > > > > > We already have the XFS_DA_NODE_MAXDEPTH set to 5, so changing the
> > > > > > attr fork extent count makes no difference to the attribute fork
> > > > > > bmbt reservations. i.e. the bmbt reservations are defined by the
> > > > > > dabtree structure limits, not the maximum extent count the fork can
> > > > > > hold.
> > > > > 
> > > > > I think the dabtree structure limits is because of the following ...
> > > > > 
> > > > > How many levels of dabtree would be needed to hold ~100 million xattrs?
> > > > > - name len = 16 bytes
> > > > >          struct xfs_parent_name_rec {
> > > > >                __be64  p_ino;
> > > > >                __be32  p_gen;
> > > > >                __be32  p_diroffset;
> > > > >        };
> > > > >   i.e. 64 + 32 + 32 = 128 bits = 16 bytes;
> > > > > - Value len = file name length = Assume ~40 bytes
> > > > 
> > > > That's quite long for a file name, but lets run with it...
> > > > 
> > > > > - Formula for number of node entries (used in column 3 in the table given
> > > > >   below) at any level of the dabtree,
> > > > >   nr_blocks * ((block size - sizeof(struct xfs_da3_node_hdr)) / sizeof(struct
> > > > >   xfs_da_node_entry))
> > > > >   i.e. nr_blocks * ((block size - 64) / 8)
> > > > > - Formula for number of leaf entries (used in column 4 in the table given
> > > > >   below),
> > > > >   (block size - sizeof(xfs_attr_leaf_hdr_t)) /
> > > > >   (sizeof(xfs_attr_leaf_entry_t) + valuelen + namelen + nameval)
> > > > >   i.e. nr_blocks * ((block size - 32) / (8 + 2 + 1 + 16 + 40))
> > > > > 
> > > > > Here I have assumed block size to be 4k.
> > > > > 
> > > > > |-------+------------------+--------------------------+--------------------------|
> > > > > | Level | Number of blocks | Number of entries (node) | Number of entries (leaf) |
> > > > > |-------+------------------+--------------------------+--------------------------|
> > > > > |     0 |              1.0 |                      5e2 |                    6.1e1 |
> > > > > |     1 |              5e2 |                    2.5e5 |                    3.0e4 |
> > > > > |     2 |            2.5e5 |                    1.3e8 |                    1.5e7 |
> > > > > |     3 |            1.3e8 |                   6.6e10 |                    7.9e9 |
> > > > > |-------+------------------+--------------------------+--------------------------|
> > > > 
> > > > I'm not sure what this table actually represents.
> > > > 
> > > > > 
> > > > > Hence we would need a tree of height 3.
> > > > > Total number of blocks = 1 + 5e2 + 2.5e5 + 1.3e8 = ~1.3e8
> > > > 
> > > > 130 million blocks to hold 100 million xattrs? That doesn't pass the
> > > > smell test.
> > > > 
> > > > I think you are trying to do these calculations from the wrong
> > > > direction.
> > > 
> > > You are right. Btrees grow in height by adding a new root
> > > node. Hence the btree space usage should be calculated in bottom-to-top
> > > direction.
> > > 
> > > > Calculate the number of leaf blocks needed to hold the
> > > > xattr data first, then work out the height of the pointer tree from
> > > > that. e.g:
> > > > 
> > > > If we need 100m xattrs, we need this many 100% full 4k blocks to
> > > > hold them all:
> > > > 
> > > > blocks	= 100m / entries per leaf
> > > > 	= 100m / 61
> > > > 	= 1.64m
> > > > 
> > > > and if we assume 37% for the least populated (because magic
> > > > split/merge number), multiply by 3, so blocks ~= 5m for 100m xattrs
> > > > in 4k blocks.
> > > > 
> > > > That makes a lot more sense. Now the tree itself:
> > > > 
> > > > ptrs per node ^ N = 5m
> > > > ptrs per node ^ (N-1) = 5m / 500 = 10k
> > > > ptrs per node ^ (N-2) = 10k / 500 = 200
> > > > ptrs per node ^ (N-3) = 200 / 500 = 1
> > > > 
> > > > So, N-3 = level 0, so we've got a tree of height 4 for 100m xattrs,
> > > > and the pointer tree requires ~12000 blocks which is noise compared
> > > > to the number of leaf blocks...
> > > > 
> > > > As for the bmbt, we've got ~5m extents worst case, which is
> > > > 
> > > > ptrs per node ^ N = 5m
> > > > ptrs per node ^ (N-1) = 5m / 125 = 40k
> > > > ptrs per node ^ (N-2) = 40k / 125 = 320
> > > > ptrs per node ^ (N-3) = 320 / 125 = 3
> > > > 
> > > > As 3 bmbt records should fit in the inode fork, we'd only need a 4
> > > > level bmbt tree to hold this, too. It's at the lower limit of a 4
> > > > level tree, but 100m xattrs is the extreme case we are talking about
> > > > here...
> > > > 
> > > > FWIW, repeat this with a directory data segment size of 32GB w/ 40
> > > > byte names, and the numbers aren't much different to a worst case
> > > > xattr tree of this shape. You'll see the reason for the dabtree
> > > > height being limited to 5, and that neither the directory structure
> > > > nor the xattr structure is anywhere near the 2^32 bit extent count
> > > > limit...
> > > 
> > > Directory segment size is 32 GB                                                                                                                                  
> > >   - Number of directory entries required for indexing 32GiB.
> > >     - 32GiB is divided into 4k data blocks. 
> > >     - Number of 4k blocks = 32GB / 4k = 8M
> > >     - Each 4k data block has,
> > >       - struct xfs_dir3_data_hdr = 64 bytes
> > >       - struct xfs_dir2_data_entry = 12 bytes (metadata) + 40 bytes (name)
> > >                                    = 52 bytes
> > >       - Number of 'struct xfs_dir2_data_entry' in a 4k block
> > >         (4096 - 64) / 52 = 78
> > >     - Number of 'struct xfs_dir2_data_entry' in 32-GiB space
> > >       8m * 78 = 654m
> > >   - Contents of a single dabtree leaf
> > >     - struct xfs_dir3_leaf_hdr = 64 bytes
> > >     - struct xfs_dir2_leaf_entry = 8 bytes
> > >     - Number of 'struct xfs_dir2_leaf_entry' = (4096 - 64) / 8 = 504
> > >     - 37% of 504 = 186 entries
> > >   - Contents of a single dabtree node
> > >     - struct xfs_da3_node_hdr = 64 bytes
> > >     - struct xfs_da_node_entry = 8 bytes
> > >     - Number of 'struct xfs_da_node_entry' = (4096 - 64) / 8 = 504
> > >   - Nr leaves
> > >     Level (N) = 654m / 186 = 3m leaves
> > >     Level (N-1) = 3m / 504 = 6k
> > >     Level (N-2) = 6k / 504 = 12
> > >     Level (N-3) = 12 / 504 = 1
> > >     Dabtree having 4 levels is sufficient.
> > > 
> > > Hence a dabtree with 5 levels should be more than enough to index a 32GiB
> > > directory segment containing directory entries with even shorter names.
> > > 
> > > Even with 5m extents (used in xattr tree example above) consumed by a da
> > > btree, this is still much less than the limit imposed by 2^32 (i.e. ~4
> > > billion) extents.
> > > 
> > > Hence the actual log space consumed for logging bmbt blocks is limited by the
> > > height of da btree.
> > > 
> > > My experiment with changing the values of MAXEXTNUM and MAXAEXTNUM to 2^47 and
> > > 2^32 respectively, gave me the following results,
> > > - For 1k block size, bmbt tree height increased by 3.
> > > - For 4k block size, bmbt tree height increased by 2.
> > > 
> > > This happens because xfs_bmap_compute_maxlevels() calculates the BMBT tree
> > > height by assuming that there will be MAXEXTNUM/MAXAEXTNUM worth of leaf
> > > entries in the worst case.
> > > 
> > > For Attr fork Bmbt , Do you think the calculation should be changed to
> > > consider the number of extents occupied by a dabtree holding > 100 million
> > > xattrs?
> > > 
> > > The new increase in Bmbt height in turn causes the static reservation values
> > > to increase. In the worst case, the maximum increase observed was 118k bytes
> > > (4k block size, reflink=0, tr_rename).
> > > 
> > > The experiment was executed after applying "xfsprogs: Fix log reservation
> > > calculation for xattr insert operation" patch
> > > (https://lore.kernel.org/linux-xfs/20200404085229.2034-2-chandanrlinux@gmail.com/)
> > > 
> > > I am attaching the output of "xfs_db -c logres <dev>" executed on the
> > > following configurations of the XFS filesystem.
> > > - -b size=1k -m reflink=0
> > > - -b size=1k -m rmapbt=1reflink=1
> > > - -b size=4k -m reflink=0
> > > - -b size=4k -m rmapbt=1reflink=1
> > > - -b size=1k -m crc=0
> > > - -b size=4k -m crc=0
> > > 
> > > I will go through the code which calculates the log reservations of the
> > > entries which have a drastic increase in their values.
> > > 
> > 
> > The highest increase (i.e. an increase of 118k) in log reservation was
> > associated with the rename operation,
> > 
> > STATIC uint
> > xfs_calc_rename_reservation(
> >         struct xfs_mount        *mp)
> > {
> >         return XFS_DQUOT_LOGRES(mp) +
> >                 max((xfs_calc_inode_res(mp, 4) +
> >                      xfs_calc_buf_res(2 * XFS_DIROP_LOG_COUNT(mp),
> >                                       XFS_FSB_TO_B(mp, 1))),
> >                     (xfs_calc_buf_res(7, mp->m_sb.sb_sectsize) +
> >                      xfs_calc_buf_res(xfs_allocfree_log_count(mp, 3),
> >                                       XFS_FSB_TO_B(mp, 1))));
> > }
> > 
> > The first argument to max() contributes the highest value.
> > 
> > xfs_calc_inode_res(mp, 4) + xfs_calc_buf_res(2 * XFS_DIROP_LOG_COUNT(mp),XFS_FSB_TO_B(mp, 1))
> > 
> > The inode reservation part is a constant.
> > 
> > The number of blocks computed by the second operand of the '+' operator is,
> > 
> > 2 * ((XFS_DA_NODE_MAXDEPTH + 2) + ((XFS_DA_NODE_MAXDEPTH + 2) * (bmbt_height - 1)))
> > 
> > = 2 * ((5 + 2) + ((5 + 2) * (bmbt_height - 1)))
> > 
> > When bmbt height is 5 (i.e. when using the original 2^31 extent count limit) this
> > evaluates to,
> > 
> > 2 * ((5 + 2) + ((5 + 2) * (5 - 1)))
> > = 70 blocks
> > 
> > When bmbt height is 7 (i.e. when using the new 2^47 extent count limit) this
> > evaluates to,
> > 
> > 2 * ((5 + 2) + ((5 + 2) * (7 - 1)))
> > = 98 blocks
> > 
> > However, I don't see any extraneous space reserved by the above calculation
> > that could be removed. Also, IMHO an increase by 118k is most likely not going
> > to introduce any bugs. I will execute xfstests to make sure that no
> > regressions get added.
> 
> (Did fstests pass?)
>

On Wednesday, May 13, 2020 5:23:22 AM IST you wrote:
> On Fri, May 01, 2020 at 12:38:30PM +0530, Chandan Rajendra wrote:
> > On Wednesday, April 29, 2020 9:05 PM Chandan Rajendra wrote: 
> > > On Monday, April 27, 2020 3:38 AM Dave Chinner wrote: 
> > > > On Sat, Apr 25, 2020 at 05:37:39PM +0530, Chandan Rajendra wrote:
> > > > > On Thursday, April 23, 2020 4:00 AM Dave Chinner wrote: 
> > > > > > On Wed, Apr 22, 2020 at 03:08:00PM +0530, Chandan Rajendra wrote:
> > > > > > > Attr bmbt tree height (MINABTPTRS == 2)
> > > > > > > |-------+------------------------+-------------------------|
> > > > > > > | Level | Number of nodes/leaves |           Total Nr recs |
> > > > > > > |       |                        | (nr nodes/leaves * 125) |
> > > > > > > |-------+------------------------+-------------------------|
> > > > > > > |     0 |                      1 |                       2 |
> > > > > > > |     1 |                      2 |                     250 |
> > > > > > > |     2 |                    250 |                   31250 |
> > > > > > > |     3 |                  31250 |                 3906250 |
> > > > > > > |     4 |                3906250 |               488281250 |
> > > > > > > |     5 |              488281250 |             61035156250 |
> > > > > > > |-------+------------------------+-------------------------|
> > > > > > > 
> > > > > > > For xattr extents, (2 ** 32) - 1 = 4294967295 (~ 4 billion extents). So this
> > > > > > > will cause the corresponding bmbt's maximum height to go from 3 to 5.
> > > > > > > This probably won't cause any regression.
> > > > > > 
> > > > > > We already have the XFS_DA_NODE_MAXDEPTH set to 5, so changing the
> > > > > > attr fork extent count makes no difference to the attribute fork
> > > > > > bmbt reservations. i.e. the bmbt reservations are defined by the
> > > > > > dabtree structure limits, not the maximum extent count the fork can
> > > > > > hold.
> > > > > 
> > > > > > I think the dabtree structure limit exists because of the following ...
> > > > > 
> > > > > How many levels of dabtree would be needed to hold ~100 million xattrs?
> > > > > - name len = 16 bytes
> > > > >          struct xfs_parent_name_rec {
> > > > >                __be64  p_ino;
> > > > >                __be32  p_gen;
> > > > >                __be32  p_diroffset;
> > > > >        };
> > > > >   i.e. 64 + 32 + 32 = 128 bits = 16 bytes;
> > > > > - Value len = file name length = Assume ~40 bytes
> > > > 
> > > > That's quite long for a file name, but lets run with it...
> > > > 
> > > > > - Formula for number of node entries (used in column 3 in the table given
> > > > >   below) at any level of the dabtree,
> > > > >   nr_blocks * ((block size - sizeof(struct xfs_da3_node_hdr)) / sizeof(struct
> > > > >   xfs_da_node_entry))
> > > > >   i.e. nr_blocks * ((block size - 64) / 8)
> > > > > > - Formula for number of leaf entries (used in column 4 in the table given
> > > > > >   below),
> > > > > >   nr_blocks * ((block size - sizeof(xfs_attr_leaf_hdr_t)) /
> > > > > >   (sizeof(xfs_attr_leaf_entry_t) + valuelen + namelen + nameval))
> > > > > >   i.e. nr_blocks * ((block size - 32) / (8 + 2 + 1 + 16 + 40))
> > > > > 
> > > > > Here I have assumed block size to be 4k.
> > > > > 
> > > > > |-------+------------------+--------------------------+--------------------------|
> > > > > | Level | Number of blocks | Number of entries (node) | Number of entries (leaf) |
> > > > > |-------+------------------+--------------------------+--------------------------|
> > > > > |     0 |              1.0 |                      5e2 |                    6.1e1 |
> > > > > |     1 |              5e2 |                    2.5e5 |                    3.0e4 |
> > > > > |     2 |            2.5e5 |                    1.3e8 |                    1.5e7 |
> > > > > |     3 |            1.3e8 |                   6.6e10 |                    7.9e9 |
> > > > > |-------+------------------+--------------------------+--------------------------|
> > > > 
> > > > I'm not sure what this table actually represents.
> > > > 
> > > > > 
> > > > > Hence we would need a tree of height 3.
> > > > > Total number of blocks = 1 + 5e2 + 2.5e5 + 1.3e8 = ~1.3e8
> > > > 
> > > > 130 million blocks to hold 100 million xattrs? That doesn't pass the
> > > > smell test.
> > > > 
> > > > I think you are trying to do these calculations from the wrong
> > > > direction.
> > > 
> > > You are right. Btrees grow in height by adding a new root
> > > node. Hence the btree space usage should be calculated in bottom-to-top
> > > direction.
> > > 
> > > > Calculate the number of leaf blocks needed to hold the
> > > > xattr data first, then work out the height of the pointer tree from
> > > > that. e.g:
> > > > 
> > > > If we need 100m xattrs, we need this many 100% full 4k blocks to
> > > > hold them all:
> > > > 
> > > > blocks	= 100m / entries per leaf
> > > > 	= 100m / 61
> > > > 	= 1.64m
> > > > 
> > > > and if we assume 37% for the least populated (because magic
> > > > split/merge number), multiply by 3, so blocks ~= 5m for 100m xattrs
> > > > in 4k blocks.
> > > > 
> > > > That makes a lot more sense. Now the tree itself:
> > > > 
> > > > ptrs per node ^ N = 5m
> > > > ptrs per node ^ (N-1) = 5m / 500 = 10k
> > > > ptrs per node ^ (N-2) = 10k / 500 = 200
> > > > ptrs per node ^ (N-3) = 200 / 500 = 1
> > > > 
> > > > So, N-3 = level 0, so we've got a tree of height 4 for 100m xattrs,
> > > > and the pointer tree requires ~12000 blocks which is noise compared
> > > > to the number of leaf blocks...
> > > > 
> > > > As for the bmbt, we've got ~5m extents worst case, which is
> > > > 
> > > > ptrs per node ^ N = 5m
> > > > ptrs per node ^ (N-1) = 5m / 125 = 40k
> > > > ptrs per node ^ (N-2) = 40k / 125 = 320
> > > > ptrs per node ^ (N-3) = 320 / 125 = 3
> > > > 
> > > > As 3 bmbt records should fit in the inode fork, we'd only need a 4
> > > > level bmbt tree to hold this, too. It's at the lower limit of a 4
> > > > level tree, but 100m xattrs is the extreme case we are talking about
> > > > here...
> > > > 
> > > > FWIW, repeat this with a directory data segment size of 32GB w/ 40
> > > > byte names, and the numbers aren't much different to a worst case
> > > > xattr tree of this shape. You'll see the reason for the dabtree
> > > > height being limited to 5, and that neither the directory structure
> > > > nor the xattr structure is anywhere near the 2^32 bit extent count
> > > > limit...
> > > 
> > > Directory segment size is 32GiB
> > >   - Number of directory entries required for indexing 32GiB.
> > >     - 32GiB is divided into 4k data blocks.
> > >     - Number of 4k blocks = 32GiB / 4k = 8m
> > >     - Each 4k data block has,
> > >       - struct xfs_dir3_data_hdr = 64 bytes
> > >       - struct xfs_dir2_data_entry = 12 bytes (metadata) + 40 bytes (name)
> > >                                    = 52 bytes
> > >       - Number of 'struct xfs_dir2_data_entry' in a 4k block
> > >         (4096 - 64) / 52 = 78
> > >     - Number of 'struct xfs_dir2_data_entry' in 32-GiB space
> > >       8m * 78 = 654m
> > >   - Contents of a single dabtree leaf
> > >     - struct xfs_dir3_leaf_hdr = 64 bytes
> > >     - struct xfs_dir2_leaf_entry = 8 bytes
> > >     - Number of 'struct xfs_dir2_leaf_entry' = (4096 - 64) / 8 = 504
> > >     - 37% of 504 = 186 entries
> > >   - Contents of a single dabtree node
> > >     - struct xfs_da3_node_hdr = 64 bytes
> > >     - struct xfs_da_node_entry = 8 bytes
> > >     - Number of 'struct xfs_da_node_entry' = (4096 - 64) / 8 = 504
> > >   - Nr leaves
> > >     Level (N) = 654m / 186 = 3m leaves
> > >     Level (N-1) = 3m / 504 = 6k
> > >     Level (N-2) = 6k / 504 = 12
> > >     Level (N-3) = 12 / 504 = 1
> > >     Dabtree having 4 levels is sufficient.
> > > 
> > > Hence a dabtree with 5 levels should be more than enough to index a 32GiB
> > > directory segment containing directory entries with even shorter names.
> > > 
> > > Even with 5m extents (used in xattr tree example above) consumed by a da
> > > btree, this is still much less than the limit imposed by 2^32 (i.e. ~4
> > > billion) extents.
> > > 
> > > Hence the actual log space consumed for logging bmbt blocks is limited by the
> > > height of da btree.
> > > 
> > > My experiment with changing the values of MAXEXTNUM and MAXAEXTNUM to 2^47 and
> > > 2^32 respectively, gave me the following results,
> > > - For 1k block size, bmbt tree height increased by 3.
> > > - For 4k block size, bmbt tree height increased by 2.
> > > 
> > > This happens because xfs_bmap_compute_maxlevels() calculates the BMBT tree
> > > height by assuming that there will be MAXEXTNUM/MAXAEXTNUM worth of leaf
> > > entries in the worst case.
> > > 
> > > For the attr fork bmbt, do you think the calculation should be changed to
> > > consider the number of extents occupied by a dabtree holding > 100 million
> > > xattrs?
> > > 
> > > The new increase in Bmbt height in turn causes the static reservation values
> > > to increase. In the worst case, the maximum increase observed was 118k bytes
> > > (4k block size, reflink=0, tr_rename).
> > > 
> > > The experiment was executed after applying "xfsprogs: Fix log reservation
> > > calculation for xattr insert operation" patch
> > > (https://lore.kernel.org/linux-xfs/20200404085229.2034-2-chandanrlinux@gmail.com/)
> > > 
> > > I am attaching the output of "xfs_db -c logres <dev>" executed on the
> > > following configurations of the XFS filesystem.
> > > - -b size=1k -m reflink=0
> > > - -b size=1k -m rmapbt=1,reflink=1
> > > - -b size=4k -m reflink=0
> > > - -b size=4k -m rmapbt=1,reflink=1
> > > - -b size=1k -m crc=0
> > > - -b size=4k -m crc=0
> > > 
> > > I will go through the code which calculates the log reservations of the
> > > entries which have a drastic increase in their values.
> > > 
> > 
> > The highest increase (i.e. an increase of 118k) in log reservation was
> > associated with the rename operation,
> > 
> > STATIC uint
> > xfs_calc_rename_reservation(
> >         struct xfs_mount        *mp)
> > {
> >         return XFS_DQUOT_LOGRES(mp) +
> >                 max((xfs_calc_inode_res(mp, 4) +
> >                      xfs_calc_buf_res(2 * XFS_DIROP_LOG_COUNT(mp),
> >                                       XFS_FSB_TO_B(mp, 1))),
> >                     (xfs_calc_buf_res(7, mp->m_sb.sb_sectsize) +
> >                      xfs_calc_buf_res(xfs_allocfree_log_count(mp, 3),
> >                                       XFS_FSB_TO_B(mp, 1))));
> > }
> > 
> > The first argument to max() contributes the highest value.
> > 
> > xfs_calc_inode_res(mp, 4) + xfs_calc_buf_res(2 * XFS_DIROP_LOG_COUNT(mp),XFS_FSB_TO_B(mp, 1))
> > 
> > The inode reservation part is a constant.
> > 
> > The number of blocks computed by the second operand of the '+' operator is,
> > 
> > 2 * ((XFS_DA_NODE_MAXDEPTH + 2) + ((XFS_DA_NODE_MAXDEPTH + 2) * (bmbt_height - 1)))
> > 
> > = 2 * ((5 + 2) + ((5 + 2) * (bmbt_height - 1)))
> > 
> > When bmbt height is 5 (i.e. when using the original 2^31 extent count limit) this
> > evaluates to,
> > 
> > 2 * ((5 + 2) + ((5 + 2) * (5 - 1)))
> > = 70 blocks
> > 
> > When bmbt height is 7 (i.e. when using the new 2^47 extent count limit) this
> > evaluates to,
> > 
> > 2 * ((5 + 2) + ((5 + 2) * (7 - 1)))
> > = 98 blocks
> > 
> > However, I don't see any extraneous space reserved by the above calculation
> > that could be removed. Also, IMHO an increase by 118k is most likely not going
> > to introduce any bugs. I will execute xfstests to make sure that no
> > regressions get added.
> 
> (Did fstests pass?)

I had executed fstests with 5 different configurations, i.e.
1. -m crc=0 -b size=1k
2. -m crc=0 -b size=4k
3. -m crc=0 -b size=512
4. -m rmapbt=1,reflink=1 -b size=1k
5. -m rmapbt=1,reflink=1 -b size=4k

The only test that regressed was xfs/306.  It failed when using "-m
rmapbt=1,reflink=1 -b size=1k" mkfs configuration.

The changes were made only to the kernel and I had used upstream xfsprogs since
the newer kernel is supposed to mount older filesystems as well.

The dmesg log had the following,

[  702.273340] XFS (loop0): Mounting V5 Filesystem
[  702.275511] XFS (loop0): Log size 8906 blocks too small, minimum size is 9075 blocks
[  702.277764] XFS (loop0): AAIEEE! Log failed size checks. Abort!
[  702.279615] XFS: Assertion failed: 0, file: fs/xfs/xfs_log.c, line: 711
[  702.283679] ------------[ cut here ]------------
[  702.285170] WARNING: CPU: 0 PID: 12821 at fs/xfs/xfs_message.c:112 assfail+0x25/0x28
[  702.287651] Modules linked in:
[  702.288654] CPU: 0 PID: 12821 Comm: mount Tainted: G        W         5.6.0-rc6-next-20200320-chandan-00003-g071c2af3f4de #1
[  702.291995] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
[  702.294159] RIP: 0010:assfail+0x25/0x28
[  702.295176] Code: ff ff 0f 0b c3 0f 1f 44 00 00 41 89 c8 48 89 d1 48 89 f2 48 c7 c6 40 b7 4b b3 e8 82 f9 ff ff 80 3d 83 d6 64 01 00 74 02 0f $
[  702.300079] RSP: 0018:ffffb05b414cbd78 EFLAGS: 00010246
[  702.301463] RAX: 0000000000000000 RBX: ffff9d9d501d5000 RCX: 0000000000000000
[  702.303293] RDX: 00000000ffffffc0 RSI: 000000000000000a RDI: ffffffffb346dc65
[  702.304976] RBP: ffff9da444b49a80 R08: 0000000000000000 R09: 0000000000000000
[  702.306747] R10: 000000000000000a R11: f000000000000000 R12: 00000000ffffffea
[  702.308417] R13: 000000000000000e R14: 0000000000004594 R15: ffff9d9d501d5628
[  702.310138] FS:  00007fd6c5d17c80(0000) GS:ffff9da44d800000(0000) knlGS:0000000000000000
[  702.312078] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  702.313421] CR2: 0000000000000002 CR3: 00000008a48c0000 CR4: 00000000000006f0
[  702.315210] Call Trace:
[  702.315807]  xfs_log_mount+0xf8/0x300
[  702.316741]  xfs_mountfs+0x46e/0x950
[  702.317640]  xfs_fc_fill_super+0x318/0x510
[  702.318739]  ? xfs_mount_free+0x30/0x30
[  702.319669]  get_tree_bdev+0x15c/0x250
[  702.320579]  vfs_get_tree+0x25/0xb0
[  702.321417]  do_mount+0x740/0x9b0
[  702.322220]  ? memdup_user+0x41/0x80
[  702.323135]  __x64_sys_mount+0x8e/0xd0
[  702.324033]  do_syscall_64+0x48/0x110
[  702.324918]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  702.326133] RIP: 0033:0x7fd6c5f2ccda
[  702.327105] Code: 48 8b 0d b9 e1 0b 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 49 89 ca b8 a5 00 00 00 0f $
[  702.331596] RSP: 002b:00007ffe00dfb9f8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a5
[  702.333430] RAX: ffffffffffffffda RBX: 0000560c1aaa92c0 RCX: 00007fd6c5f2ccda
[  702.335146] RDX: 0000560c1aaae110 RSI: 0000560c1aaad040 RDI: 0000560c1aaa94d0
[  702.336843] RBP: 00007fd6c607d204 R08: 0000000000000000 R09: 0000560c1aaadde0
[  702.338618] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
[  702.340314] R13: 0000000000000000 R14: 0000560c1aaa94d0 R15: 0000560c1aaae110
[  702.342039] ---[ end trace 6436391b468bc652 ]---
[  702.343308] XFS (loop0): log mount failed

xfs/306 has,

_scratch_mkfs_xfs -d size=20m -n size=64k >> $seqres.full 2>&1

i.e. it creates a filesystem of size 20MiB, with a data block size of 1KiB and
a directory block size of 64KiB. Filesystems of size < 1GiB can have a log
smaller than 10MiB (please refer to calculate_log_size() in xfsprogs).

The largest reservation was that of tr_rename. The calculation is done
by xfs_calc_rename_reservation(). In this case, the value returned by this
function was accounted for by

xfs_calc_inode_res(mp, 4)
+ xfs_calc_buf_res(2 * XFS_DIROP_LOG_COUNT(mp), XFS_FSB_TO_B(mp, 1))

xfs_calc_inode_res(mp, 4) returns a constant value (i.e. 3040).

The largest contribution to the value returned by the above
calculation was by 2 * XFS_DIROP_LOG_COUNT(mp).

XFS_DIROP_LOG_COUNT() is a sum of
1. The maximum number of dabtree blocks that need to be logged
   i.e. XFS_DAENTER_BLOCKS() = XFS_DAENTER_1B(mp,w) * XFS_DAENTER_DBS(mp,w).
   For directories, this evaluates to (64 * (XFS_DA_NODE_MAXDEPTH + 2)) = (64
   * (5 + 2)) = 448.
   NOTE: I still don't know why we add the "2" to XFS_DA_NODE_MAXDEPTH in the
   above calculation.
2. The corresponding maximum number of bmap btree blocks that need to be
   logged i.e. XFS_DAENTER_BMAPS() = XFS_DAENTER_DBS(mp,w) *
   XFS_DAENTER_BMAP1B(mp,w)

   XFS_DAENTER_DBS(mp,w) = XFS_DA_NODE_MAXDEPTH + 2 = 7
   XFS_DAENTER_BMAP1B(mp,w)
   = XFS_NEXTENTADD_SPACE_RES(mp, XFS_DAENTER_1B(mp, w), w)
   = XFS_NEXTENTADD_SPACE_RES(mp, 64, w)
   = ((64 + XFS_MAX_CONTIG_EXTENTS_PER_BLOCK(mp) - 1) /
   XFS_MAX_CONTIG_EXTENTS_PER_BLOCK(mp)) * XFS_EXTENTADD_SPACE_RES(mp, w)

   XFS_MAX_CONTIG_EXTENTS_PER_BLOCK(mp) = ((mp)->m_alloc_mxr[0]) -
   ((mp)->m_alloc_mnr[0]) = 121 - 60 = 61

   XFS_DAENTER_BMAP1B(mp,w) = ((64 + XFS_MAX_CONTIG_EXTENTS_PER_BLOCK(mp) - 1) /
   XFS_MAX_CONTIG_EXTENTS_PER_BLOCK(mp)) * XFS_EXTENTADD_SPACE_RES(mp, w)
   = ((64 + 61 - 1) / 61) * XFS_EXTENTADD_SPACE_RES(mp, w)
   = 2 * XFS_EXTENTADD_SPACE_RES(mp, w)
   = 2 * (XFS_BM_MAXLEVELS(mp,w) - 1)
   = 2 * (8 - 1) ;; Notice that the height of the bmap btree has increased to 8.
   = 14

   With 2^31 as the maximum extent count, the maximum height of the bmap btree
   was 7. Now, with a maximum extent count of 2^47, the height is 8.

   Therefore, XFS_DAENTER_BMAPS() = 7 * 14 = 98.
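
The arithmetic in item 2 can be reproduced with a small stand-alone C sketch.
This is not kernel code: the function names are made up for illustration, and
the constants are simply the figures quoted above for the 1k block size / 64k
directory block size case, not values read from a mounted filesystem.

```c
#include <assert.h>

/* Figures quoted in this mail (assumptions for -b size=1k -n size=64k). */
#define DA_NODE_MAXDEPTH     5   /* XFS_DA_NODE_MAXDEPTH */
#define FSB_PER_DIRBLOCK     64  /* XFS_DAENTER_1B(): 64k dir block / 1k fs block */
#define MAX_CONTIG_EXTENTS   61  /* XFS_MAX_CONTIG_EXTENTS_PER_BLOCK(): 121 - 60 */

/* Mirrors XFS_DAENTER_DBS(): XFS_DA_NODE_MAXDEPTH + 2. */
unsigned int daenter_dbs(void)
{
	return DA_NODE_MAXDEPTH + 2;
}

/* Mirrors XFS_DAENTER_BLOCKS(): fs blocks per dir block * dabtree blocks. */
unsigned int daenter_blocks(void)
{
	return FSB_PER_DIRBLOCK * daenter_dbs();
}

/*
 * Mirrors XFS_DAENTER_BMAP1B() as expanded above:
 * ceil(64 / 61) * (bmap btree max levels - 1).
 */
unsigned int daenter_bmap1b(unsigned int bm_maxlevels)
{
	unsigned int nr_extent_adds;

	assert(bm_maxlevels >= 1);	/* avoid unsigned underflow below */
	nr_extent_adds = (FSB_PER_DIRBLOCK + MAX_CONTIG_EXTENTS - 1) /
			 MAX_CONTIG_EXTENTS;
	return nr_extent_adds * (bm_maxlevels - 1);
}

/* Mirrors XFS_DAENTER_BMAPS(): dabtree blocks * bmap blocks per dir block. */
unsigned int daenter_bmaps(unsigned int bm_maxlevels)
{
	return daenter_dbs() * daenter_bmap1b(bm_maxlevels);
}
```

With these inputs, daenter_blocks() gives 448, daenter_bmaps(8) gives 98 and
daenter_bmaps(7) gives 84, so one extra bmap btree level costs 14 fs blocks
per XFS_DAENTER_BMAPS() in this configuration.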

Also, XFS_DIROP_LOG_COUNT() = 448 + 98 = 546.
2 * XFS_DIROP_LOG_COUNT() = 2 * 546 = 1092.

With a 2^31 max extent count, XFS_DIROP_LOG_COUNT() evaluates to 533. Hence 2 *
XFS_DIROP_LOG_COUNT() = 2 * 533 = 1066.

This small difference of 1092 - 1066 = 26 fs blocks is sufficient to trip us
over the minimum log size check.

I could not find a way to reduce the number of blocks that gets logged.

Hence I thought of the following alternate approach.

The maximum number of extents that can be occupied by a directory is ~
2^27. The following steps show this (I assumed the fs block size to be
512 bytes, since that is the block size which can produce a bmap btree of
the maximum possible height).

Maximum number of extents in data space = 32GiB (i.e. XFS_DIR2_SPACE_SIZE) / 2^9 = 2^26.
Maximum number (theoretically) of extents in leaf space = 32GiB / 2^9 = 2^26.

Maximum number of entries in a free space index block
= (512 - sizeof(struct xfs_dir3_free_hdr)) / sizeof(xfs_dir2_data_off_t)
= (512 - 64) / 2 = 224
Maximum number of extents in free space index = (Maximum number of extents in
data segment) / 224 = (2^26) / 224 = ~2^18

Maximum number of extents in a directory = 2^26 + 2^26 + 2^18 = ~2^27
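
As a cross-check of the estimate above, here is a minimal C sketch of the same
arithmetic. The helper names are hypothetical, and the sizes (32GiB per
directory segment, a 64-byte free index header, 2-byte free index entries) are
the ones assumed in this mail; the worst case modelled is one extent per fs
block.

```c
#include <assert.h>
#include <stdint.h>

#define DIR_SPACE_SIZE   (1ULL << 35)  /* XFS_DIR2_SPACE_SIZE: 32GiB per segment */
#define FREE_HDR_SIZE    64            /* sizeof(struct xfs_dir3_free_hdr), as assumed above */
#define FREE_ENTRY_SIZE  2             /* sizeof(xfs_dir2_data_off_t) */

/* Worst case for the data or leaf segment: every fs block is its own extent. */
uint64_t dir_segment_max_extents(uint32_t blocksize)
{
	assert(blocksize > 0);
	return DIR_SPACE_SIZE / blocksize;
}

/* Free index blocks needed to cover every data block, rounded up. */
uint64_t dir_free_max_extents(uint32_t blocksize)
{
	uint64_t entries_per_block;
	uint64_t data_blocks;

	assert(blocksize > FREE_HDR_SIZE);
	entries_per_block = (blocksize - FREE_HDR_SIZE) / FREE_ENTRY_SIZE;
	data_blocks = dir_segment_max_extents(blocksize);
	return (data_blocks + entries_per_block - 1) / entries_per_block;
}

/* Data segment + leaf segment + free space index. */
uint64_t dir_max_extents(uint32_t blocksize)
{
	return 2 * dir_segment_max_extents(blocksize) +
	       dir_free_max_extents(blocksize);
}
```

For a 512-byte block size this yields 2^26 extents for each of the data and
leaf segments plus 299594 (roughly 2^18) for the free space index, i.e. a
total just above 2^27 and well below 2^28.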

Hence my idea was to have a new entry in xfs_mount->m_bm_maxlevels[]
array to hold the maximum height of a bmap btree belonging to a
directory and use that for calculating reservations associated with
directories.
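
To illustrate how a smaller worst-case extent count translates into a
shorter bmap btree, the following C sketch derives a tree height from a
maximum leaf entry count, in the spirit of what xfs_bmap_compute_maxlevels()
is described as doing earlier in this thread. It is only a sketch: the
function name and all four parameters are placeholders, and the record counts
used with it would have to come from the real on-disk geometry.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Height of a btree that must index maxleafents records in the worst case,
 * assuming every leaf holds only minleafrecs records, every interior node
 * only minnoderecs pointers, and the root (in the inode fork) can hold up
 * to maxrootrecs pointers.
 */
unsigned int bmbt_maxlevels(uint64_t maxleafents, unsigned int minleafrecs,
			    unsigned int minnoderecs, unsigned int maxrootrecs)
{
	uint64_t maxblocks;
	unsigned int level;

	/* minnoderecs >= 2 guarantees that the loop below terminates. */
	assert(minleafrecs > 0 && minnoderecs > 1);

	/* Number of leaf blocks when each one is minimally populated. */
	maxblocks = (maxleafents + minleafrecs - 1) / minleafrecs;

	/* Add levels until the remaining pointers fit in a single root. */
	for (level = 1; maxblocks > 1; level++) {
		if (maxblocks <= maxrootrecs)
			maxblocks = 1;
		else
			maxblocks = (maxblocks + minnoderecs - 1) / minnoderecs;
	}
	return level;
}
```

Calling this once with a directory-specific limit (about 2^27 extents) and
once with the 2^47 data fork limit, using the same record counts, shows by how
many levels a separate per-directory entry could reduce the height that the
directory reservations are based on.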

Please let me know your opinion on this.

PS: I had started making the changes in the kernel and was planning to
test the changes before posting this idea on the mailing list.
