diff mbox series

[2/3] xfs: allow sparse inode records at the end of runt AGs

Message ID 20241024025142.4082218-3-david@fromorbit.com (mailing list archive)
State New
Headers show
Series xfs: sparse inodes overlap end of filesystem | expand

Commit Message

Dave Chinner Oct. 24, 2024, 2:51 a.m. UTC
From: Dave Chinner <dchinner@redhat.com>

Due to the failure to correctly limit sparse inode chunk allocation
in runt AGs, we now have many production filesystems with sparse
inode chunks allocated across the end of the runt AG. xfs_repair
or a growfs is needed to fix this situation, neither of which are
particularly appealing.

The on disk layout from the metadump shows AG 12 as a runt that is
1031 blocks in length and the last inode chunk allocated on disk at
agino 8192.

$ xfs_db -c "agi 12" -c "p length" -c "p newino"a \
> -c "convert agno 12 agino 8192 agbno" \
> -c "a free_root" -c p /mnt/scratch/t.img
length = 1031
newino = 8192
0x400 (1024)
magic = 0x46494233
level = 0
numrecs = 3
leftsib = null
rightsib = null
bno = 62902208
lsn = 0xb5500001849
uuid = e941c927-8697-4c16-a828-bc98e3878f7d
owner = 12
crc = 0xfe0a5c41 (correct)
recs[1-3] = [startino,holemask,count,freecount,free]
1:[128,0,64,11,0xc1ff00]
2:[256,0,64,3,0xb]
3:[8192,0xff00,32,32,0xffffffffffffffff]

The agbno of the inode chunk is 0x400 (1024), but there are only 7
blocks from there to the end of the AG. No inode cluster should have
been allocated there, but the bug fixed in the previous commit
allowed that. We can see from the finobt record #3 that there is a
sparse inode record at agbno 1024 that is for 32 inodes - 4 blocks
worth of inodes. Hence we have a valid inode cluster from agbno
1024-1027 on disk, and we are trying to allocation inodes from it.

This is due to the sparse inode feature requiring sb->sb_spino_align
being set to the inode cluster size, whilst the sb->sb_inoalignmt is
set to the full chunk size.  The args.max_agbno bug in sparse inode
alignment allows an inode cluster at the start of the irec which is
sb_spino_align aligned and sized, but the remainder of the irec to
be beyond EOAG.

There is actually nothing wrong with having a sparse inode cluster
that ends up overlapping the end of the runt AG - it just means that
attempts to make it non-sparse will fail because there's no
contiguous space available to fill out the chunk. However, we can't
even get that far because xfs_inobt_get_rec() will validate the
irec->ir_startino and xfs_verify_agino() will fail on an irec that
spans beyond the end of the AG:

XFS (loop0): finobt record corruption in AG 12 detected at xfs_inobt_check_irec+0x44/0xb0!
XFS (loop0): start inode 0x2000, count 0x20, free 0x20 freemask 0xffffffffffffffff, holemask 0xff00

Hence the actual maximum agino we could allocate is the size of the
AG rounded down by the size of of an inode cluster, not the size of
a full inode chunk. Modify __xfs_agino_range() code to take this
sparse inode case into account and hence allow us of the already
allocated sparse inode chunk at the end of a runt AG.

That change, alone, however, is not sufficient, as
xfs_inobt_get_rec() hard codes the maximum inode number in the chunk
and attempts to verify the last inode number in the chunk.  This
fails because the of the sparse inode record is beyond the end of
the AG. Hence we have to look at the hole mask in the sparse inode
record to determine where the highest allocated inode is. We then
use the calculated high inode number to determine if the allocated
sparse inode cluster fits within the AG.

With this, inode allocation on a sparse inode cluster at the end
of a runt AG now succeeds. Hence any filesystem that has allocated a
cluster in this location will no longer fail allocation and issue
corruption warnings.

Fixes: 56d1115c9bc7 ("xfs: allocate sparse inode chunks on full chunk allocation failure")
Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/libxfs/xfs_ag.c     | 47 ++++++++++++++++++++++++++++++--------
 fs/xfs/libxfs/xfs_ialloc.c | 20 +++++++++++++---
 2 files changed, 54 insertions(+), 13 deletions(-)

Comments

Darrick J. Wong Oct. 24, 2024, 5 p.m. UTC | #1
On Thu, Oct 24, 2024 at 01:51:04PM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Due to the failure to correctly limit sparse inode chunk allocation
> in runt AGs, we now have many production filesystems with sparse
> inode chunks allocated across the end of the runt AG. xfs_repair
> or a growfs is needed to fix this situation, neither of which are
> particularly appealing.
> 
> The on disk layout from the metadump shows AG 12 as a runt that is
> 1031 blocks in length and the last inode chunk allocated on disk at
> agino 8192.

Does this problem also happen on non-runt AGs?  If the only free space
that could be turned into a sparse cluster is unaligned space at the
end of AG 0, would you still get the same corruption error?

--D

> $ xfs_db -c "agi 12" -c "p length" -c "p newino"a \
> > -c "convert agno 12 agino 8192 agbno" \
> > -c "a free_root" -c p /mnt/scratch/t.img
> length = 1031
> newino = 8192
> 0x400 (1024)
> magic = 0x46494233
> level = 0
> numrecs = 3
> leftsib = null
> rightsib = null
> bno = 62902208
> lsn = 0xb5500001849
> uuid = e941c927-8697-4c16-a828-bc98e3878f7d
> owner = 12
> crc = 0xfe0a5c41 (correct)
> recs[1-3] = [startino,holemask,count,freecount,free]
> 1:[128,0,64,11,0xc1ff00]
> 2:[256,0,64,3,0xb]
> 3:[8192,0xff00,32,32,0xffffffffffffffff]
> 
> The agbno of the inode chunk is 0x400 (1024), but there are only 7
> blocks from there to the end of the AG. No inode cluster should have
> been allocated there, but the bug fixed in the previous commit
> allowed that. We can see from the finobt record #3 that there is a
> sparse inode record at agbno 1024 that is for 32 inodes - 4 blocks
> worth of inodes. Hence we have a valid inode cluster from agbno
> 1024-1027 on disk, and we are trying to allocation inodes from it.
> 
> This is due to the sparse inode feature requiring sb->sb_spino_align
> being set to the inode cluster size, whilst the sb->sb_inoalignmt is
> set to the full chunk size.  The args.max_agbno bug in sparse inode
> alignment allows an inode cluster at the start of the irec which is
> sb_spino_align aligned and sized, but the remainder of the irec to
> be beyond EOAG.
> 
> There is actually nothing wrong with having a sparse inode cluster
> that ends up overlapping the end of the runt AG - it just means that
> attempts to make it non-sparse will fail because there's no
> contiguous space available to fill out the chunk. However, we can't
> even get that far because xfs_inobt_get_rec() will validate the
> irec->ir_startino and xfs_verify_agino() will fail on an irec that
> spans beyond the end of the AG:
> 
> XFS (loop0): finobt record corruption in AG 12 detected at xfs_inobt_check_irec+0x44/0xb0!
> XFS (loop0): start inode 0x2000, count 0x20, free 0x20 freemask 0xffffffffffffffff, holemask 0xff00
> 
> Hence the actual maximum agino we could allocate is the size of the
> AG rounded down by the size of of an inode cluster, not the size of
> a full inode chunk. Modify __xfs_agino_range() code to take this
> sparse inode case into account and hence allow us of the already
> allocated sparse inode chunk at the end of a runt AG.
> 
> That change, alone, however, is not sufficient, as
> xfs_inobt_get_rec() hard codes the maximum inode number in the chunk
> and attempts to verify the last inode number in the chunk.  This
> fails because the of the sparse inode record is beyond the end of
> the AG. Hence we have to look at the hole mask in the sparse inode
> record to determine where the highest allocated inode is. We then
> use the calculated high inode number to determine if the allocated
> sparse inode cluster fits within the AG.
> 
> With this, inode allocation on a sparse inode cluster at the end
> of a runt AG now succeeds. Hence any filesystem that has allocated a
> cluster in this location will no longer fail allocation and issue
> corruption warnings.
> 
> Fixes: 56d1115c9bc7 ("xfs: allocate sparse inode chunks on full chunk allocation failure")
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/libxfs/xfs_ag.c     | 47 ++++++++++++++++++++++++++++++--------
>  fs/xfs/libxfs/xfs_ialloc.c | 20 +++++++++++++---
>  2 files changed, 54 insertions(+), 13 deletions(-)
> 
> diff --git a/fs/xfs/libxfs/xfs_ag.c b/fs/xfs/libxfs/xfs_ag.c
> index 5ca8d0106827..33290af6ab01 100644
> --- a/fs/xfs/libxfs/xfs_ag.c
> +++ b/fs/xfs/libxfs/xfs_ag.c
> @@ -238,15 +238,36 @@ xfs_ag_block_count(
>  			mp->m_sb.sb_dblocks);
>  }
>  
> -/* Calculate the first and last possible inode number in an AG. */
> +/*
> + * Calculate the first and last possible inode number in an AG.
> + *
> + * Due to a bug in sparse inode allocation for the runt AG at the end of the
> + * filesystem, we can have a valid sparse inode chunk on disk that spans beyond
> + * the end of the AG. Sparse inode chunks have special alignment - the full
> + * chunk must always be naturally aligned, and the regions that are allocated
> + * sparsely are cluster sized and aligned.
> + *
> + * The result of this is that for sparse inode setups, sb->sb_inoalignmt is
> + * always the size of the chunk, and that means M_IGEO(mp)->cluster_align isn't
> + * actually cluster alignment, it is chunk alignment. That means a sparse inode
> + * cluster that overlaps the end of the AG can never be valid based on "cluster
> + * alignment" even though all the inodes allocated within the sparse chunk at
> + * within the valid bounds of the AG and so can be used.
> + *
> + * Hence for the runt AG, the valid maximum inode number is based on sparse
> + * inode cluster alignment (sb->sb_spino_align) and not the "cluster alignment"
> + * value.
> + */
>  static void
>  __xfs_agino_range(
>  	struct xfs_mount	*mp,
> +	xfs_agnumber_t		agno,
>  	xfs_agblock_t		eoag,
>  	xfs_agino_t		*first,
>  	xfs_agino_t		*last)
>  {
>  	xfs_agblock_t		bno;
> +	xfs_agblock_t		end_align;
>  
>  	/*
>  	 * Calculate the first inode, which will be in the first
> @@ -259,7 +280,12 @@ __xfs_agino_range(
>  	 * Calculate the last inode, which will be at the end of the
>  	 * last (aligned) cluster that can be allocated in the AG.
>  	 */
> -	bno = round_down(eoag, M_IGEO(mp)->cluster_align);
> +	if (xfs_has_sparseinodes(mp) && agno == mp->m_sb.sb_agcount - 1)
> +		end_align = mp->m_sb.sb_spino_align;
> +	else
> +		end_align = M_IGEO(mp)->cluster_align;
> +
> +	bno = round_down(eoag, end_align);
>  	*last = XFS_AGB_TO_AGINO(mp, bno) - 1;
>  }
>  
> @@ -270,7 +296,8 @@ xfs_agino_range(
>  	xfs_agino_t		*first,
>  	xfs_agino_t		*last)
>  {
> -	return __xfs_agino_range(mp, xfs_ag_block_count(mp, agno), first, last);
> +	return __xfs_agino_range(mp, agno, xfs_ag_block_count(mp, agno),
> +			first, last);
>  }
>  
>  int
> @@ -284,7 +311,7 @@ xfs_update_last_ag_size(
>  		return -EFSCORRUPTED;
>  	pag->block_count = __xfs_ag_block_count(mp, prev_agcount - 1,
>  			mp->m_sb.sb_agcount, mp->m_sb.sb_dblocks);
> -	__xfs_agino_range(mp, pag->block_count, &pag->agino_min,
> +	__xfs_agino_range(mp, pag->pag_agno, pag->block_count, &pag->agino_min,
>  			&pag->agino_max);
>  	xfs_perag_rele(pag);
>  	return 0;
> @@ -345,8 +372,8 @@ xfs_initialize_perag(
>  		pag->block_count = __xfs_ag_block_count(mp, index, new_agcount,
>  				dblocks);
>  		pag->min_block = XFS_AGFL_BLOCK(mp);
> -		__xfs_agino_range(mp, pag->block_count, &pag->agino_min,
> -				&pag->agino_max);
> +		__xfs_agino_range(mp, pag->pag_agno, pag->block_count,
> +				&pag->agino_min, &pag->agino_max);
>  	}
>  
>  	index = xfs_set_inode_alloc(mp, new_agcount);
> @@ -932,8 +959,8 @@ xfs_ag_shrink_space(
>  
>  	/* Update perag geometry */
>  	pag->block_count -= delta;
> -	__xfs_agino_range(pag->pag_mount, pag->block_count, &pag->agino_min,
> -				&pag->agino_max);
> +	__xfs_agino_range(mp, pag->pag_agno, pag->block_count,
> +				&pag->agino_min, &pag->agino_max);
>  
>  	xfs_ialloc_log_agi(*tpp, agibp, XFS_AGI_LENGTH);
>  	xfs_alloc_log_agf(*tpp, agfbp, XFS_AGF_LENGTH);
> @@ -1003,8 +1030,8 @@ xfs_ag_extend_space(
>  
>  	/* Update perag geometry */
>  	pag->block_count = be32_to_cpu(agf->agf_length);
> -	__xfs_agino_range(pag->pag_mount, pag->block_count, &pag->agino_min,
> -				&pag->agino_max);
> +	__xfs_agino_range(pag->pag_mount, pag->pag_agno, pag->block_count,
> +				&pag->agino_min, &pag->agino_max);
>  	return 0;
>  }
>  
> diff --git a/fs/xfs/libxfs/xfs_ialloc.c b/fs/xfs/libxfs/xfs_ialloc.c
> index 6258527315f2..d68b53334990 100644
> --- a/fs/xfs/libxfs/xfs_ialloc.c
> +++ b/fs/xfs/libxfs/xfs_ialloc.c
> @@ -108,22 +108,36 @@ xfs_inobt_rec_freecount(
>  	return hweight64(realfree);
>  }
>  
> +/* Compute the highest allocated inode in an incore inode record. */
> +static xfs_agino_t
> +xfs_inobt_rec_highino(
> +	const struct xfs_inobt_rec_incore	*irec)
> +{
> +	if (xfs_inobt_issparse(irec->ir_holemask))
> +		return xfs_highbit64(xfs_inobt_irec_to_allocmask(irec));
> +	return XFS_INODES_PER_CHUNK;
> +}
> +
>  /* Simple checks for inode records. */
>  xfs_failaddr_t
>  xfs_inobt_check_irec(
>  	struct xfs_perag			*pag,
>  	const struct xfs_inobt_rec_incore	*irec)
>  {
> +	xfs_agino_t	high_ino = xfs_inobt_rec_highino(irec);
> +
>  	/* Record has to be properly aligned within the AG. */
>  	if (!xfs_verify_agino(pag, irec->ir_startino))
>  		return __this_address;
> -	if (!xfs_verify_agino(pag,
> -				irec->ir_startino + XFS_INODES_PER_CHUNK - 1))
> +
> +	if (!xfs_verify_agino(pag, irec->ir_startino + high_ino - 1))
>  		return __this_address;
> +
>  	if (irec->ir_count < XFS_INODES_PER_HOLEMASK_BIT ||
>  	    irec->ir_count > XFS_INODES_PER_CHUNK)
>  		return __this_address;
> -	if (irec->ir_freecount > XFS_INODES_PER_CHUNK)
> +
> +	if (irec->ir_freecount > irec->ir_count)
>  		return __this_address;
>  
>  	if (xfs_inobt_rec_freecount(irec) != irec->ir_freecount)
> -- 
> 2.45.2
> 
>
Dave Chinner Oct. 25, 2024, 6:43 a.m. UTC | #2
On Thu, Oct 24, 2024 at 10:00:38AM -0700, Darrick J. Wong wrote:
> On Thu, Oct 24, 2024 at 01:51:04PM +1100, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > Due to the failure to correctly limit sparse inode chunk allocation
> > in runt AGs, we now have many production filesystems with sparse
> > inode chunks allocated across the end of the runt AG. xfs_repair
> > or a growfs is needed to fix this situation, neither of which are
> > particularly appealing.
> > 
> > The on disk layout from the metadump shows AG 12 as a runt that is
> > 1031 blocks in length and the last inode chunk allocated on disk at
> > agino 8192.
> 
> Does this problem also happen on non-runt AGs?

No. The highest agbno an inode chunk can be allocated at in a full
size AG is aligned by rounding down from sb_agblocks.  Hence
sb_agblocks can be unaligned and nothing will go wrong. The problem
is purely that the runt AG being shorter than sb_agblocks and so
this highest agbno allocation guard is set beyond the end of the
AG...

> If the only free space
> that could be turned into a sparse cluster is unaligned space at the
> end of AG 0, would you still get the same corruption error?

It will only happen if AG 0 is a runt AG, and then the same error
would occur. We don't currently allow single AG filesystems, nor
when they are set up  do we create them as a runt - the are always
full size. So current single AG filesystems made by mkfs won't have
this problem.

That said, the proposed single AG cloud image filesystems that set
AG 0 up as a runt (i.e. dblocks smaller than sb_agblocks) to allow
the AG 0 size to grow along with the size of the filesystem could
definitely have this problem. i.e. sb_dblocks needs to be inode
chunk aligned in this sort of setup, or these filesystems need to
be restricted to fixed kernels....

-Dave.
Darrick J. Wong Oct. 25, 2024, 10:19 p.m. UTC | #3
On Fri, Oct 25, 2024 at 05:43:41PM +1100, Dave Chinner wrote:
> On Thu, Oct 24, 2024 at 10:00:38AM -0700, Darrick J. Wong wrote:
> > On Thu, Oct 24, 2024 at 01:51:04PM +1100, Dave Chinner wrote:
> > > From: Dave Chinner <dchinner@redhat.com>
> > > 
> > > Due to the failure to correctly limit sparse inode chunk allocation
> > > in runt AGs, we now have many production filesystems with sparse
> > > inode chunks allocated across the end of the runt AG. xfs_repair
> > > or a growfs is needed to fix this situation, neither of which are
> > > particularly appealing.
> > > 
> > > The on disk layout from the metadump shows AG 12 as a runt that is
> > > 1031 blocks in length and the last inode chunk allocated on disk at
> > > agino 8192.
> > 
> > Does this problem also happen on non-runt AGs?
> 
> No. The highest agbno an inode chunk can be allocated at in a full
> size AG is aligned by rounding down from sb_agblocks.  Hence
> sb_agblocks can be unaligned and nothing will go wrong. The problem
> is purely that the runt AG being shorter than sb_agblocks and so
> this highest agbno allocation guard is set beyond the end of the
> AG...

Ah, right, and we don't want sparse inode chunks to cross EOAG because
then you'd have a chunk whose clusters would cross into the next AG, at
least in the linear LBA space.  That's why (for sparse inode fses) it
makes sense that we want to round last_agino down by the chunk for
non-last AGs, and round it down by only the cluster for the last AG.

Waitaminute, what if the last AG is less than a chunk but more than a
cluster's worth of blocks short of sb_agblocks?  Or what if sb_agblocks
doesn't align with a chunk boundary?  I think the new code:

	if (xfs_has_sparseinodes(mp) && agno == mp->m_sb.sb_agcount - 1)
		end_align = mp->m_sb.sb_spino_align;
	else
		end_align = M_IGEO(mp)->cluster_align;
	bno = round_down(eoag, end_align);
	*last = XFS_AGB_TO_AGINO(mp, bno) - 1;

will allow a sparse chunk that (erroneously) crosses sb_agblocks, right?
Let's say sb_spino_align == 4, sb_inoalignmt == 8, sb_agcount == 2,
sb_agblocks == 100,007, and sb_dblocks == 200,014.

For AG 0, eoag is 100007, end_align == cluster_align == 8, so bno is
rounded down to 100000.  *last is thus set to the inode at the end of
block 99999.

For AG 1, eoag is also 100007, but now end_align == 4.  bno is rounded
down to 100,004.  *last is set to the inode at the end of block 100003,
not 99999.

But now let's say we growfs another 100007 blocks onto the filesystem.
Now we have 3x AGs, each with 100007 blocks.  But now *last for AG 1
becomes 99999 even though we might've allocated an inode in block
100000 before the growfs.  That will cause a corruption error too,
right?

IOWs, don't we want something more like this?

	/*
	 * The preferred inode cluster allocation size cannot ever cross
	 * sb_agblocks.  cluster_align is one of the following:
	 *
	 * - For sparse inodes, this is an inode chunk.
	 * - For aligned non-sparse inodes, this is an inode cluster.
	 */
	bno = round_down(sb_agblocks, cluster_align);
	if (xfs_has_sparseinodes(mp) &&
	    agno == mp->m_sb.sb_agcount - 1) {
		/*
		 * For a filesystem with sparse inodes, an inode chunk
		 * still cannot cross sb_agblocks, but it can cross eoag
		 * if eoag < agblocks.  Inode clusters cannot cross eoag.
		 */
		last_clus_bno = round_down(eoag, sb_spino_align);
		bno = min(bno, last_clus_bno);
	}
	*last = XFS_AGB_TO_AGINO(mp, bno) - 1;

This preserves the invariant that inode chunks cannot span sb_agblocks,
while permitting sparse clusters going right up to EOAG so long as the
chunk doesn't cross sb_agblocks.

> > If the only free space
> > that could be turned into a sparse cluster is unaligned space at the
> > end of AG 0, would you still get the same corruption error?
> 
> It will only happen if AG 0 is a runt AG, and then the same error
> would occur. We don't currently allow single AG filesystems, nor
> when they are set up  do we create them as a runt - the are always
> full size. So current single AG filesystems made by mkfs won't have
> this problem.

Hmm, do you have a quick means to simulate this last-AG unaligned
icluster situation?

> That said, the proposed single AG cloud image filesystems that set
> AG 0 up as a runt (i.e. dblocks smaller than sb_agblocks) to allow
> the AG 0 size to grow along with the size of the filesystem could
> definitely have this problem. i.e. sb_dblocks needs to be inode
> chunk aligned in this sort of setup, or these filesystems need to
> be restricted to fixed kernels....

I wonder if we *should* have a compat flag for these cloud filesystems
just as a warning sign to us all. :)

--D

> -Dave.
> 
> -- 
> Dave Chinner
> david@fromorbit.com
>
Dave Chinner Oct. 26, 2024, 9:47 p.m. UTC | #4
On Fri, Oct 25, 2024 at 03:19:19PM -0700, Darrick J. Wong wrote:
> On Fri, Oct 25, 2024 at 05:43:41PM +1100, Dave Chinner wrote:
> > On Thu, Oct 24, 2024 at 10:00:38AM -0700, Darrick J. Wong wrote:
> > > On Thu, Oct 24, 2024 at 01:51:04PM +1100, Dave Chinner wrote:
> > > > From: Dave Chinner <dchinner@redhat.com>
> > > > 
> > > > Due to the failure to correctly limit sparse inode chunk allocation
> > > > in runt AGs, we now have many production filesystems with sparse
> > > > inode chunks allocated across the end of the runt AG. xfs_repair
> > > > or a growfs is needed to fix this situation, neither of which are
> > > > particularly appealing.
> > > > 
> > > > The on disk layout from the metadump shows AG 12 as a runt that is
> > > > 1031 blocks in length and the last inode chunk allocated on disk at
> > > > agino 8192.
> > > 
> > > Does this problem also happen on non-runt AGs?
> > 
> > No. The highest agbno an inode chunk can be allocated at in a full
> > size AG is aligned by rounding down from sb_agblocks.  Hence
> > sb_agblocks can be unaligned and nothing will go wrong. The problem
> > is purely that the runt AG being shorter than sb_agblocks and so
> > this highest agbno allocation guard is set beyond the end of the
> > AG...
> 
> Ah, right, and we don't want sparse inode chunks to cross EOAG because
> then you'd have a chunk whose clusters would cross into the next AG, at
> least in the linear LBA space.  That's why (for sparse inode fses) it
> makes sense that we want to round last_agino down by the chunk for
> non-last AGs, and round it down by only the cluster for the last AG.
> 
> Waitaminute, what if the last AG is less than a chunk but more than a
> cluster's worth of blocks short of sb_agblocks?  Or what if sb_agblocks
> doesn't align with a chunk boundary?  I think the new code:
> 
> 	if (xfs_has_sparseinodes(mp) && agno == mp->m_sb.sb_agcount - 1)
> 		end_align = mp->m_sb.sb_spino_align;
> 	else
> 		end_align = M_IGEO(mp)->cluster_align;
> 	bno = round_down(eoag, end_align);
> 	*last = XFS_AGB_TO_AGINO(mp, bno) - 1;
> 
> will allow a sparse chunk that (erroneously) crosses sb_agblocks, right?
> Let's say sb_spino_align == 4, sb_inoalignmt == 8, sb_agcount == 2,
> sb_agblocks == 100,007, and sb_dblocks == 200,014.
> 
> For AG 0, eoag is 100007, end_align == cluster_align == 8, so bno is
> rounded down to 100000.  *last is thus set to the inode at the end of
> block 99999.
> 
> For AG 1, eoag is also 100007, but now end_align == 4.  bno is rounded
> down to 100,004.  *last is set to the inode at the end of block 100003,
> not 99999.
> 
> But now let's say we growfs another 100007 blocks onto the filesystem.
> Now we have 3x AGs, each with 100007 blocks.  But now *last for AG 1
> becomes 99999 even though we might've allocated an inode in block
> 100000 before the growfs.  That will cause a corruption error too,
> right?

Yes, I overlooked that case. Good catch.

> IOWs, don't we want something more like this?
> 
> 	/*
> 	 * The preferred inode cluster allocation size cannot ever cross
> 	 * sb_agblocks.  cluster_align is one of the following:
> 	 *
> 	 * - For sparse inodes, this is an inode chunk.
> 	 * - For aligned non-sparse inodes, this is an inode cluster.
> 	 */
> 	bno = round_down(sb_agblocks, cluster_align);
> 	if (xfs_has_sparseinodes(mp) &&
> 	    agno == mp->m_sb.sb_agcount - 1) {
> 		/*
> 		 * For a filesystem with sparse inodes, an inode chunk
> 		 * still cannot cross sb_agblocks, but it can cross eoag
> 		 * if eoag < agblocks.  Inode clusters cannot cross eoag.
> 		 */
> 		last_clus_bno = round_down(eoag, sb_spino_align);
> 		bno = min(bno, last_clus_bno);
> 	}
> 	*last = XFS_AGB_TO_AGINO(mp, bno) - 1;

Yes, something like that is needed.

> > > If the only free space
> > > that could be turned into a sparse cluster is unaligned space at the
> > > end of AG 0, would you still get the same corruption error?
> > 
> > It will only happen if AG 0 is a runt AG, and then the same error
> > would occur. We don't currently allow single AG filesystems, nor
> > when they are set up  do we create them as a runt - the are always
> > full size. So current single AG filesystems made by mkfs won't have
> > this problem.
> 
> Hmm, do you have a quick means to simulate this last-AG unaligned
> icluster situation?

No, I haven't been able to reproduce it on demand - nothing I've
tried has specifically landed a sparse inode cluster in exactly the
right position to trigger this. I typically get ENOSPC when I think
it should trigger and it's not immediately obvious what I'm missing
in way of pre-conditions to trigger it. I've been able to test the
fixes on a metadump that has the sparse chunk already on disk
(which came from one of the production systems hitting this).

-Dave.
diff mbox series

Patch

diff --git a/fs/xfs/libxfs/xfs_ag.c b/fs/xfs/libxfs/xfs_ag.c
index 5ca8d0106827..33290af6ab01 100644
--- a/fs/xfs/libxfs/xfs_ag.c
+++ b/fs/xfs/libxfs/xfs_ag.c
@@ -238,15 +238,36 @@  xfs_ag_block_count(
 			mp->m_sb.sb_dblocks);
 }
 
-/* Calculate the first and last possible inode number in an AG. */
+/*
+ * Calculate the first and last possible inode number in an AG.
+ *
+ * Due to a bug in sparse inode allocation for the runt AG at the end of the
+ * filesystem, we can have a valid sparse inode chunk on disk that spans beyond
+ * the end of the AG. Sparse inode chunks have special alignment - the full
+ * chunk must always be naturally aligned, and the regions that are allocated
+ * sparsely are cluster sized and aligned.
+ *
+ * The result of this is that for sparse inode setups, sb->sb_inoalignmt is
+ * always the size of the chunk, and that means M_IGEO(mp)->cluster_align isn't
+ * actually cluster alignment, it is chunk alignment. That means a sparse inode
+ * cluster that overlaps the end of the AG can never be valid based on "cluster
+ * alignment" even though all the inodes allocated within the sparse chunk at
+ * within the valid bounds of the AG and so can be used.
+ *
+ * Hence for the runt AG, the valid maximum inode number is based on sparse
+ * inode cluster alignment (sb->sb_spino_align) and not the "cluster alignment"
+ * value.
+ */
 static void
 __xfs_agino_range(
 	struct xfs_mount	*mp,
+	xfs_agnumber_t		agno,
 	xfs_agblock_t		eoag,
 	xfs_agino_t		*first,
 	xfs_agino_t		*last)
 {
 	xfs_agblock_t		bno;
+	xfs_agblock_t		end_align;
 
 	/*
 	 * Calculate the first inode, which will be in the first
@@ -259,7 +280,12 @@  __xfs_agino_range(
 	 * Calculate the last inode, which will be at the end of the
 	 * last (aligned) cluster that can be allocated in the AG.
 	 */
-	bno = round_down(eoag, M_IGEO(mp)->cluster_align);
+	if (xfs_has_sparseinodes(mp) && agno == mp->m_sb.sb_agcount - 1)
+		end_align = mp->m_sb.sb_spino_align;
+	else
+		end_align = M_IGEO(mp)->cluster_align;
+
+	bno = round_down(eoag, end_align);
 	*last = XFS_AGB_TO_AGINO(mp, bno) - 1;
 }
 
@@ -270,7 +296,8 @@  xfs_agino_range(
 	xfs_agino_t		*first,
 	xfs_agino_t		*last)
 {
-	return __xfs_agino_range(mp, xfs_ag_block_count(mp, agno), first, last);
+	return __xfs_agino_range(mp, agno, xfs_ag_block_count(mp, agno),
+			first, last);
 }
 
 int
@@ -284,7 +311,7 @@  xfs_update_last_ag_size(
 		return -EFSCORRUPTED;
 	pag->block_count = __xfs_ag_block_count(mp, prev_agcount - 1,
 			mp->m_sb.sb_agcount, mp->m_sb.sb_dblocks);
-	__xfs_agino_range(mp, pag->block_count, &pag->agino_min,
+	__xfs_agino_range(mp, pag->pag_agno, pag->block_count, &pag->agino_min,
 			&pag->agino_max);
 	xfs_perag_rele(pag);
 	return 0;
@@ -345,8 +372,8 @@  xfs_initialize_perag(
 		pag->block_count = __xfs_ag_block_count(mp, index, new_agcount,
 				dblocks);
 		pag->min_block = XFS_AGFL_BLOCK(mp);
-		__xfs_agino_range(mp, pag->block_count, &pag->agino_min,
-				&pag->agino_max);
+		__xfs_agino_range(mp, pag->pag_agno, pag->block_count,
+				&pag->agino_min, &pag->agino_max);
 	}
 
 	index = xfs_set_inode_alloc(mp, new_agcount);
@@ -932,8 +959,8 @@  xfs_ag_shrink_space(
 
 	/* Update perag geometry */
 	pag->block_count -= delta;
-	__xfs_agino_range(pag->pag_mount, pag->block_count, &pag->agino_min,
-				&pag->agino_max);
+	__xfs_agino_range(mp, pag->pag_agno, pag->block_count,
+				&pag->agino_min, &pag->agino_max);
 
 	xfs_ialloc_log_agi(*tpp, agibp, XFS_AGI_LENGTH);
 	xfs_alloc_log_agf(*tpp, agfbp, XFS_AGF_LENGTH);
@@ -1003,8 +1030,8 @@  xfs_ag_extend_space(
 
 	/* Update perag geometry */
 	pag->block_count = be32_to_cpu(agf->agf_length);
-	__xfs_agino_range(pag->pag_mount, pag->block_count, &pag->agino_min,
-				&pag->agino_max);
+	__xfs_agino_range(pag->pag_mount, pag->pag_agno, pag->block_count,
+				&pag->agino_min, &pag->agino_max);
 	return 0;
 }
 
diff --git a/fs/xfs/libxfs/xfs_ialloc.c b/fs/xfs/libxfs/xfs_ialloc.c
index 6258527315f2..d68b53334990 100644
--- a/fs/xfs/libxfs/xfs_ialloc.c
+++ b/fs/xfs/libxfs/xfs_ialloc.c
@@ -108,22 +108,36 @@  xfs_inobt_rec_freecount(
 	return hweight64(realfree);
 }
 
+/* Compute the highest allocated inode in an incore inode record. */
+static xfs_agino_t
+xfs_inobt_rec_highino(
+	const struct xfs_inobt_rec_incore	*irec)
+{
+	if (xfs_inobt_issparse(irec->ir_holemask))
+		return xfs_highbit64(xfs_inobt_irec_to_allocmask(irec));
+	return XFS_INODES_PER_CHUNK;
+}
+
 /* Simple checks for inode records. */
 xfs_failaddr_t
 xfs_inobt_check_irec(
 	struct xfs_perag			*pag,
 	const struct xfs_inobt_rec_incore	*irec)
 {
+	xfs_agino_t	high_ino = xfs_inobt_rec_highino(irec);
+
 	/* Record has to be properly aligned within the AG. */
 	if (!xfs_verify_agino(pag, irec->ir_startino))
 		return __this_address;
-	if (!xfs_verify_agino(pag,
-				irec->ir_startino + XFS_INODES_PER_CHUNK - 1))
+
+	if (!xfs_verify_agino(pag, irec->ir_startino + high_ino - 1))
 		return __this_address;
+
 	if (irec->ir_count < XFS_INODES_PER_HOLEMASK_BIT ||
 	    irec->ir_count > XFS_INODES_PER_CHUNK)
 		return __this_address;
-	if (irec->ir_freecount > XFS_INODES_PER_CHUNK)
+
+	if (irec->ir_freecount > irec->ir_count)
 		return __this_address;
 
 	if (xfs_inobt_rec_freecount(irec) != irec->ir_freecount)