[04/13] xfs: arrange all unlinked inodes into one list

Message ID	20200812092556.2567285-5-david@fromorbit.com (mailing list archive)
State	Accepted
Headers	Return-Path: <SRS0=vsMz=BW=vger.kernel.org=linux-xfs-owner@kernel.org>
	From: Dave Chinner <david@fromorbit.com>
	To: linux-xfs@vger.kernel.org
	Subject: [PATCH 04/13] xfs: arrange all unlinked inodes into one list
	Date: Wed, 12 Aug 2020 19:25:47 +1000
	Message-Id: <20200812092556.2567285-5-david@fromorbit.com>
	In-Reply-To: <20200812092556.2567285-1-david@fromorbit.com>
	References: <20200812092556.2567285-1-david@fromorbit.com>
	MIME-Version: 1.0
	Content-Transfer-Encoding: 8bit
	Sender: linux-xfs-owner@vger.kernel.org
	Precedence: bulk
Series	xfs: in memory inode unlink log items
	[00/13] xfs: in memory inode unlink log items
	[01/13] xfs: xfs_iflock is no longer a completion
	[02/13] xfs: add log item precommit operation
	[03/13] xfs: factor the xfs_iunlink functions
	[04/13] xfs: arrange all unlinked inodes into one list
	[05/13] xfs: add unlink list pointers to xfs_inode
	[06/13] xfs: replace iunlink backref lookups with list lookups
	[07/13] xfs: mapping unlinked inodes is now redundant
	[09/13] xfs: validate the unlinked list pointer on update
	[10/13] xfs: re-order AGI updates in unlink list updates
	[11/13] xfs: combine iunlink inode update functions
	[12/13] xfs: add in-memory iunlink log item
	[13/13] xfs: reorder iunlink remove operation in xfs_ifree

Dave Chinner Aug. 12, 2020, 9:25 a.m. UTC

From: Gao Xiang <hsiangkao@redhat.com>

We currently keep unlinked lists short on disk by hashing the inodes
across multiple buckets. We don't need to keep them short anymore
as we no longer need to traverse the entire list to remove an inode
from it. The in-memory back reference index provides the previous
inode in the list for us instead.

Log recovery still has to handle existing filesystems that use all
64 on-disk buckets, so we detect and handle this case specially so
that inode eviction can still work properly in recovery.

[dchinner: imported into parent patch series early on and modified
to fit cleanly. ]

Signed-off-by: Gao Xiang <hsiangkao@redhat.com>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_inode.c | 49 +++++++++++++++++++++++++++-------------------
 1 file changed, 29 insertions(+), 20 deletions(-)
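
The bucket selection described in the commit message can be sketched as a small standalone model. This is only an illustration under stated assumptions, not the kernel code: `NBUCKETS`, `NULL_AGINO`, `old_bucket()` and `pick_bucket()` are hypothetical names standing in for XFS_AGI_UNLINKED_BUCKETS, NULLAGINO and the logic inside xfs_iunlink_update_bucket(), and the `in_recovery` flag stands in for the log state test.

```c
#include <stdbool.h>
#include <stdint.h>

#define NBUCKETS   64u         /* stand-in for XFS_AGI_UNLINKED_BUCKETS */
#define NULL_AGINO UINT32_MAX  /* stand-in for NULLAGINO */

/* Old scheme: hash the AG inode number across all on-disk buckets. */
static unsigned int old_bucket(uint32_t agino)
{
	return agino % NBUCKETS;
}

/*
 * New scheme (simplified model): always use bucket 0, except during
 * recovery when the expected old head is not the head of bucket 0.
 * In that case the inode must have been placed by the old hashing
 * scheme, so fall back to the hashed bucket. The patch asserts that
 * old_agino is not null here; this model just guards against it.
 */
static unsigned int pick_bucket(const uint32_t heads[NBUCKETS],
				uint32_t old_agino, bool in_recovery)
{
	if (in_recovery && old_agino != NULL_AGINO && heads[0] != old_agino)
		return old_agino % NBUCKETS;
	return 0;
}
```

Outside recovery the model always returns bucket 0, so every newly unlinked inode ends up on a single list.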

Darrick J. Wong Aug. 18, 2020, 11:59 p.m. UTC | #1

On Wed, Aug 12, 2020 at 07:25:47PM +1000, Dave Chinner wrote:
> From: Gao Xiang <hsiangkao@redhat.com>
> 
> We currently keep unlinked lists short on disk by hashing the inodes
> across multiple buckets. We don't need to keep them short anymore
> as we no longer need to traverse the entire list to remove an inode
> from it. The in-memory back reference index provides the previous
> inode in the list for us instead.
> 
> Log recovery still has to handle existing filesystems that use all
> 64 on-disk buckets, so we detect and handle this case specially so
> that inode eviction can still work properly in recovery.
> 
> [dchinner: imported into parent patch series early on and modified
> to fit cleanly. ]
> 
> Signed-off-by: Gao Xiang <hsiangkao@redhat.com>
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/xfs_inode.c | 49 +++++++++++++++++++++++++++-------------------
>  1 file changed, 29 insertions(+), 20 deletions(-)
> 
> diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> index f2f502b65691..fa92bdf6e0da 100644
> --- a/fs/xfs/xfs_inode.c
> +++ b/fs/xfs/xfs_inode.c
> @@ -33,6 +33,7 @@
>  #include "xfs_symlink.h"
>  #include "xfs_trans_priv.h"
>  #include "xfs_log.h"
> +#include "xfs_log_priv.h"
>  #include "xfs_bmap_btree.h"
>  #include "xfs_reflink.h"
>  
> @@ -2092,25 +2093,32 @@ xfs_iunlink_update_bucket(
>  	struct xfs_trans	*tp,
>  	xfs_agnumber_t		agno,
>  	struct xfs_buf		*agibp,
> -	unsigned int		bucket_index,
> +	xfs_agino_t		old_agino,
>  	xfs_agino_t		new_agino)
>  {
> +	struct xlog		*log = tp->t_mountp->m_log;
>  	struct xfs_agi		*agi = agibp->b_addr;
>  	xfs_agino_t		old_value;
> -	int			offset;
> +	unsigned int		bucket_index;
> +	int                     offset;
>  
>  	ASSERT(xfs_verify_agino_or_null(tp->t_mountp, agno, new_agino));
>  
> +	bucket_index = 0;
> +	/* During recovery, the old multiple bucket index can be applied */
> +	if (!log || log->l_flags & XLOG_RECOVERY_NEEDED) {

Does the flag test need parentheses?

It feels a little funny that we pass in old_agino (having gotten it from
agi_unlinked) and then compare it with agi_unlinked, but as the commit
log points out, I guess this is a wart of having to support the old
unlinked list behavior.  It makes sense to me that if we're going to
change the unlinked list behavior we could be a little more careful
about double-checking things.

Question: if a newer kernel crashes with a super-long unlinked list and
the fs gets recovered on an old kernel, will this lead to insanely high
recovery times?  I think the answer is no, because recovery is single
threaded and the hash only existed to reduce AGI contention during
normal unlinking operations?

--D
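
On the parentheses question: in C, bitwise `&` binds tighter than logical `||`, so the test as posted already groups the way the parenthesised form does, and adding parentheses is a readability matter. A minimal sketch follows; `struct fake_log` and the `RECOVERY_NEEDED` bit value are hypothetical and only stand in for struct xlog and XLOG_RECOVERY_NEEDED.

```c
#include <stdbool.h>
#include <stddef.h>

#define RECOVERY_NEEDED 0x2u  /* hypothetical bit value, illustration only */

struct fake_log {
	unsigned int flags;   /* stand-in for l_flags */
};

/* As posted: relies on '&' having higher precedence than '||'. */
static bool test_unparenthesised(const struct fake_log *log)
{
	return !log || log->flags & RECOVERY_NEEDED;
}

/* With explicit grouping: same result, intent is easier to read. */
static bool test_parenthesised(const struct fake_log *log)
{
	return !log || (log->flags & RECOVERY_NEEDED);
}
```

Both helpers return the same value for a NULL log, a log with the bit set, and a log with the bit clear.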

> +		ASSERT(old_agino != NULLAGINO);
> +
> +		if (be32_to_cpu(agi->agi_unlinked[0]) != old_agino)
> +			bucket_index = old_agino % XFS_AGI_UNLINKED_BUCKETS;
> +	}
> +
>  	old_value = be32_to_cpu(agi->agi_unlinked[bucket_index]);
>  	trace_xfs_iunlink_update_bucket(tp->t_mountp, agno, bucket_index,
>  			old_value, new_agino);
>  
> -	/*
> -	 * We should never find the head of the list already set to the value
> -	 * passed in because either we're adding or removing ourselves from the
> -	 * head of the list.
> -	 */
> -	if (old_value == new_agino) {
> +	/* check if the old agi_unlinked head is as expected */
> +	if (old_value != old_agino) {
>  		xfs_buf_mark_corrupt(agibp);
>  		return -EFSCORRUPTED;
>  	}
> @@ -2216,17 +2224,18 @@ xfs_iunlink_insert_inode(
>  	xfs_agino_t		next_agino;
>  	xfs_agino_t		agino = XFS_INO_TO_AGINO(mp, ip->i_ino);
>  	xfs_agnumber_t		agno = XFS_INO_TO_AGNO(mp, ip->i_ino);
> -	short			bucket_index = agino % XFS_AGI_UNLINKED_BUCKETS;
>  	int			error;
>  
>  	agi = agibp->b_addr;
>  
>  	/*
> -	 * Get the index into the agi hash table for the list this inode will
> -	 * go on.  Make sure the pointer isn't garbage and that this inode
> -	 * isn't already on the list.
> +	 * We don't need to traverse the on disk unlinked list to find the
> +	 * previous inode in the list when removing inodes anymore, so we don't
> +	 * need multiple on-disk lists anymore. Hence we always use bucket 0.
> +	 * Make sure the pointer isn't garbage and that this inode isn't already
> +	 * on the list.
>  	 */
> -	next_agino = be32_to_cpu(agi->agi_unlinked[bucket_index]);
> +	next_agino = be32_to_cpu(agi->agi_unlinked[0]);
>  	if (next_agino == agino ||
>  	    !xfs_verify_agino_or_null(mp, agno, next_agino)) {
>  		xfs_buf_mark_corrupt(agibp);
> @@ -2256,7 +2265,7 @@ xfs_iunlink_insert_inode(
>  	}
>  
>  	/* Point the head of the list to point to this inode. */
> -	return xfs_iunlink_update_bucket(tp, agno, agibp, bucket_index, agino);
> +	return xfs_iunlink_update_bucket(tp, agno, agibp, next_agino, agino);
>  }
>  
>  /*
> @@ -2416,16 +2425,17 @@ xfs_iunlink_remove_inode(
>  	xfs_agnumber_t		agno = XFS_INO_TO_AGNO(mp, ip->i_ino);
>  	xfs_agino_t		next_agino;
>  	xfs_agino_t		head_agino;
> -	short			bucket_index = agino % XFS_AGI_UNLINKED_BUCKETS;
>  	int			error;
>  
>  	agi = agibp->b_addr;
>  
>  	/*
> -	 * Get the index into the agi hash table for the list this inode will
> -	 * go on.  Make sure the head pointer isn't garbage.
> +	 * We don't need to traverse the on disk unlinked list to find the
> +	 * previous inode in the list when removing inodes anymore, so we don't
> +	 * need multiple on-disk lists anymore. Hence we always use bucket 0.
> +	 * Make sure the head pointer isn't garbage.
>  	 */
> -	head_agino = be32_to_cpu(agi->agi_unlinked[bucket_index]);
> +	head_agino = be32_to_cpu(agi->agi_unlinked[0]);
>  	if (!xfs_verify_agino(mp, agno, head_agino)) {
>  		XFS_CORRUPTION_ERROR(__func__, XFS_ERRLEVEL_LOW, mp,
>  				agi, sizeof(*agi));
> @@ -2483,8 +2493,7 @@ xfs_iunlink_remove_inode(
>  	}
>  
>  	/* Point the head of the list to the next unlinked inode. */
> -	return xfs_iunlink_update_bucket(tp, agno, agibp, bucket_index,
> -			next_agino);
> +	return xfs_iunlink_update_bucket(tp, agno, agibp, agino, next_agino);
>  }
>  
>  /*
> -- 
> 2.26.2.761.g0e0b3e54be
>

Dave Chinner Aug. 19, 2020, 12:45 a.m. UTC | #2

On Tue, Aug 18, 2020 at 04:59:59PM -0700, Darrick J. Wong wrote:
> On Wed, Aug 12, 2020 at 07:25:47PM +1000, Dave Chinner wrote:
> > From: Gao Xiang <hsiangkao@redhat.com>
> > 
> > We currently keep unlinked lists short on disk by hashing the inodes
> > across multiple buckets. We don't need to keep them short anymore
> > as we no longer need to traverse the entire list to remove an inode
> > from it. The in-memory back reference index provides the previous
> > inode in the list for us instead.
> > 
> > Log recovery still has to handle existing filesystems that use all
> > 64 on-disk buckets, so we detect and handle this case specially so
> > that inode eviction can still work properly in recovery.
> > 
> > [dchinner: imported into parent patch series early on and modified
> > to fit cleanly. ]
> > 
> > Signed-off-by: Gao Xiang <hsiangkao@redhat.com>
> > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > ---
> >  fs/xfs/xfs_inode.c | 49 +++++++++++++++++++++++++++-------------------
> >  1 file changed, 29 insertions(+), 20 deletions(-)
> > 
> > diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> > index f2f502b65691..fa92bdf6e0da 100644
> > --- a/fs/xfs/xfs_inode.c
> > +++ b/fs/xfs/xfs_inode.c
> > @@ -33,6 +33,7 @@
> >  #include "xfs_symlink.h"
> >  #include "xfs_trans_priv.h"
> >  #include "xfs_log.h"
> > +#include "xfs_log_priv.h"
> >  #include "xfs_bmap_btree.h"
> >  #include "xfs_reflink.h"
> >  
> > @@ -2092,25 +2093,32 @@ xfs_iunlink_update_bucket(
> >  	struct xfs_trans	*tp,
> >  	xfs_agnumber_t		agno,
> >  	struct xfs_buf		*agibp,
> > -	unsigned int		bucket_index,
> > +	xfs_agino_t		old_agino,
> >  	xfs_agino_t		new_agino)
> >  {
> > +	struct xlog		*log = tp->t_mountp->m_log;
> >  	struct xfs_agi		*agi = agibp->b_addr;
> >  	xfs_agino_t		old_value;
> > -	int			offset;
> > +	unsigned int		bucket_index;
> > +	int                     offset;
> >  
> >  	ASSERT(xfs_verify_agino_or_null(tp->t_mountp, agno, new_agino));
> >  
> > +	bucket_index = 0;
> > +	/* During recovery, the old multiple bucket index can be applied */
> > +	if (!log || log->l_flags & XLOG_RECOVERY_NEEDED) {
> 
> Does the flag test need parentheses?

Yes, will fix.

> It feels a little funny that we pass in old_agino (having gotten it from
> agi_unlinked) and then compare it with agi_unlinked, but as the commit
> log points out, I guess this is a wart of having to support the old
> unlinked list behavior.  It makes sense to me that if we're going to
> change the unlinked list behavior we could be a little more careful
> about double-checking things.
> 
> Question: if a newer kernel crashes with a super-long unlinked list and
> the fs gets recovered on an old kernel, will this lead to insanely high
> recovery times?  I think the answer is no, because recovery is single
> threaded and the hash only existed to reduce AGI contention during
> normal unlinking operations?

Right, the answer is no because log recovery even on old kernels
always recovers the inode at the head of the list. It does no
traversal, so it doesn't matter if it's recovering one list or 64
lists, the recovery time is the same.

Cheers,

Dave.
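
The point about recovery time can be illustrated with a toy model in which recovery only ever removes the inode at the head of a list. This is a sketch under simplifying assumptions, not the kernel's recovery code: `struct model_agi` and the `model_*` helpers are hypothetical, and inode numbers are assumed to be smaller than `MAX_INODES`.

```c
#include <stddef.h>
#include <stdint.h>

#define NBUCKETS   64u         /* stand-in for XFS_AGI_UNLINKED_BUCKETS */
#define NULL_AGINO UINT32_MAX  /* stand-in for NULLAGINO */
#define MAX_INODES 1024u       /* model limit, not an XFS constant */

struct model_agi {
	uint32_t heads[NBUCKETS];   /* head of each unlinked list */
	uint32_t next[MAX_INODES];  /* next-unlinked pointer per inode */
};

static void model_init(struct model_agi *agi)
{
	for (size_t i = 0; i < NBUCKETS; i++)
		agi->heads[i] = NULL_AGINO;
	for (size_t i = 0; i < MAX_INODES; i++)
		agi->next[i] = NULL_AGINO;
}

/* Push an inode onto the head of the chosen bucket. */
static void model_insert(struct model_agi *agi, unsigned int bucket,
			 uint32_t agino)
{
	agi->next[agino] = agi->heads[bucket];
	agi->heads[bucket] = agino;
}

/*
 * Recovery model: keep removing the head inode of each bucket until
 * every bucket is empty. Each removal is one step and never walks the
 * list, so the step count depends only on the number of inodes.
 */
static unsigned int model_recover(struct model_agi *agi)
{
	unsigned int steps = 0;

	for (unsigned int b = 0; b < NBUCKETS; b++) {
		while (agi->heads[b] != NULL_AGINO) {
			uint32_t agino = agi->heads[b];

			agi->heads[b] = agi->next[agino];
			agi->next[agino] = NULL_AGINO;
			steps++;
		}
	}
	return steps;
}
```

In this model, 200 inodes take 200 steps to recover whether they are hashed across all buckets or all placed on bucket 0.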

Gao Xiang Aug. 19, 2020, 12:58 a.m. UTC | #3

On Tue, Aug 18, 2020 at 04:59:59PM -0700, Darrick J. Wong wrote:

...

> > +	bucket_index = 0;
> > +	/* During recovery, the old multiple bucket index can be applied */
> > +	if (!log || log->l_flags & XLOG_RECOVERY_NEEDED) {
> 
> Does the flag test need parentheses?

Yeah, that would be better.

> 
> It feels a little funny that we pass in old_agino (having gotten it from
> agi_unlinked) and then compare it with agi_unlinked, but as the commit
> log points out, I guess this is a wart of having to support the old
> unlinked list behavior.  It makes sense to me that if we're going to
> change the unlinked list behavior we could be a little more careful
> about double-checking things.
> 
> Question: if a newer kernel crashes with a super-long unlinked list and
> the fs gets recovered on an old kernel, will this lead to insanely high
> recovery times?  I think the answer is no, because recovery is single
> threaded and the hash only existed to reduce AGI contention during
> normal unlinking operations?

btw, if my understanding is correct, as I have mentioned since my v1,
this new feature isn't forward compatible, since old kernels hardcode
agino % XFS_AGI_UNLINKED_BUCKETS rather than tracking the original
bucket_index in their log recovery code. So yeah, it is a bit awkward
given the original design...

Thanks,
Gao Xiang
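
The compatibility concern can be shown with a toy lookup that, like the removed lines in the diff, always starts from the hashed bucket. This is a simplified model, not old-kernel code: `hashed_lookup()` and the constants are hypothetical, and inode numbers are assumed to be smaller than `MAX_INODES`.

```c
#include <stdbool.h>
#include <stdint.h>

#define NBUCKETS   64u         /* stand-in for XFS_AGI_UNLINKED_BUCKETS */
#define NULL_AGINO UINT32_MAX  /* stand-in for NULLAGINO */
#define MAX_INODES 256u        /* model limit, not an XFS constant */

/*
 * Model of a lookup that hardcodes the bucket as agino % NBUCKETS and
 * walks that list. If the inode was actually placed on bucket 0 by the
 * single-list scheme, the walk starts in the wrong bucket and fails.
 */
static bool hashed_lookup(const uint32_t heads[NBUCKETS],
			  const uint32_t next[MAX_INODES], uint32_t agino)
{
	uint32_t cur = heads[agino % NBUCKETS];

	while (cur != NULL_AGINO) {
		if (cur == agino)
			return true;
		cur = next[cur];
	}
	return false;
}
```

With inodes 5, 130 and 64 all chained on bucket 0, the lookup only finds 64, because 64 is the only one whose hash is also 0.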

Christoph Hellwig Aug. 22, 2020, 9:01 a.m. UTC | #4

On Wed, Aug 19, 2020 at 08:58:30AM +0800, Gao Xiang wrote:
> btw, if my understanding is correct, as I have mentioned since my v1,
> this new feature isn't forward compatible, since old kernels hardcode
> agino % XFS_AGI_UNLINKED_BUCKETS rather than tracking the original
> bucket_index in their log recovery code. So yeah, it is a bit awkward
> given the original design...

I think we should add a log_incompat feature just to be safe.
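
In general terms, a log-incompat feature is a flag that a kernel must understand before it is allowed to replay the log. A minimal sketch of that kind of gate follows; the bit name `MODEL_LOG_INCOMPAT_SINGLE_IUNLINK` and the helper `can_recover_log()` are hypothetical and do not correspond to any flag defined in this series.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical feature bit, for illustration only. */
#define MODEL_LOG_INCOMPAT_SINGLE_IUNLINK (1u << 0)

/*
 * A kernel may replay the log only if every log-incompat bit set on
 * the filesystem is in the set of bits that kernel knows how to handle.
 */
static bool can_recover_log(uint32_t fs_log_incompat,
			    uint32_t kernel_supported)
{
	return (fs_log_incompat & ~kernel_supported) == 0;
}
```

In this model, a kernel without the bit in its supported mask refuses to recover a log that has the bit set, which avoids the hashed-bucket mismatch described above.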

Gao Xiang Aug. 23, 2020, 5:24 p.m. UTC | #5

Hi Christoph,

On Sat, Aug 22, 2020 at 10:01:45AM +0100, Christoph Hellwig wrote:
> On Wed, Aug 19, 2020 at 08:58:30AM +0800, Gao Xiang wrote:
> > btw, if my understanding is correct, as I have mentioned since my v1,
> > this new feature isn't forward compatible, since old kernels hardcode
> > agino % XFS_AGI_UNLINKED_BUCKETS rather than tracking the original
> > bucket_index in their log recovery code. So yeah, it is a bit awkward
> > given the original design...
> 
> I think we should add a log_incompat feature just to be safe.

Thanks for your suggestion.
Okay, if there are no other concerns, I will try to look into that tomorrow...

Thanks,
Gao Xiang

>
