
xfs: Don't block in xfs_extent_busy_flush

Message ID 20230519171829.4108-1-wen.gang.wang@oracle.com (mailing list archive)
State New, archived
Series xfs: Don't block in xfs_extent_busy_flush

Commit Message

Wengang Wang May 19, 2023, 5:18 p.m. UTC
The following calltrace is seen:
	#0	context_switch() kernel/sched/core.c:3881
	#1	__schedule() kernel/sched/core.c:5111
	#2	schedule() kernel/sched/core.c:5186
	#3	xfs_extent_busy_flush() fs/xfs/xfs_extent_busy.c:598
	#4	xfs_alloc_ag_vextent_size() fs/xfs/libxfs/xfs_alloc.c:1641
	#5	xfs_alloc_ag_vextent() fs/xfs/libxfs/xfs_alloc.c:828
	#6	xfs_alloc_fix_freelist() fs/xfs/libxfs/xfs_alloc.c:2362
	#7	xfs_free_extent_fix_freelist() fs/xfs/libxfs/xfs_alloc.c:3029
	#8	__xfs_free_extent() fs/xfs/libxfs/xfs_alloc.c:3067
	#9	xfs_trans_free_extent() fs/xfs/xfs_extfree_item.c:370
	#10	xfs_efi_recover() fs/xfs/xfs_extfree_item.c:626
	#11	xlog_recover_process_efi() fs/xfs/xfs_log_recover.c:4605
	#12	xlog_recover_process_intents() fs/xfs/xfs_log_recover.c:4893
	#13	xlog_recover_finish() fs/xfs/xfs_log_recover.c:5824
	#14	xfs_log_mount_finish() fs/xfs/xfs_log.c:764
	#15	xfs_mountfs() fs/xfs/xfs_mount.c:978
	#16	xfs_fs_fill_super() fs/xfs/xfs_super.c:1908
	#17	mount_bdev() fs/super.c:1417
	#18	xfs_fs_mount() fs/xfs/xfs_super.c:1985
	#19	legacy_get_tree() fs/fs_context.c:647
	#20	vfs_get_tree() fs/super.c:1547
	#21	do_new_mount() fs/namespace.c:2843
	#22	do_mount() fs/namespace.c:3163
	#23	ksys_mount() fs/namespace.c:3372
	#24	__do_sys_mount() fs/namespace.c:3386
	#25	__se_sys_mount() fs/namespace.c:3383
	#26	__x64_sys_mount() fs/namespace.c:3383
	#27	do_syscall_64() arch/x86/entry/common.c:296
	#28	entry_SYSCALL_64() arch/x86/entry/entry_64.S:180

During the processing of the 2nd and subsequent records in an EFI, we
are waiting for the busy blocks to be cleaned, but the only busy extent
is still held in the current xfs_trans->t_busy. That busy extent was
added when processing the previous EFI record, and because it is not
committed yet, it can't be cleaned.

To avoid the above deadlock, don't block in xfs_extent_busy_flush() when
allocating AGFL blocks; instead, return -EAGAIN. On receiving -EAGAIN we
are able to retry that EFI record with a new transaction after committing
the old transaction. With the old transaction committed, the busy extent
attached to it gets the chance to be cleaned. On the retry, there are no
existing busy extents in the new transaction, thus no deadlock.
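As a rough illustration of the retry pattern (a sketch only, not the
literal patch code; xfs_trans_roll() is used here as one way to commit
the old transaction and continue with a clean one):

	/*
	 * Retry an extent free that failed with -EAGAIN. Rolling the
	 * transaction commits the busy extents held in the old
	 * transaction's t_busy list, so they can be cleaned before
	 * the retry.
	 */
	do {
		error = xfs_trans_free_extent(tp, efdp, xefi);
		if (error != -EAGAIN)
			break;
		error = xfs_trans_roll(&tp);
	} while (!error);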

Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com>
---
 fs/xfs/libxfs/xfs_alloc.c | 30 ++++++++++++++++++++++++------
 fs/xfs/libxfs/xfs_alloc.h |  2 ++
 fs/xfs/scrub/repair.c     |  4 ++--
 fs/xfs/xfs_extent_busy.c  | 34 +++++++++++++++++++++++++++++-----
 fs/xfs/xfs_extent_busy.h  |  6 +++---
 fs/xfs/xfs_extfree_item.c | 37 ++++++++++++++++++++++++++++++++++++-
 fs/xfs/xfs_log_recover.c  | 23 ++++++++++-------------
 fs/xfs/xfs_trans_ail.c    |  2 +-
 fs/xfs/xfs_trans_priv.h   |  1 +
 9 files changed, 108 insertions(+), 31 deletions(-)

Comments

Darrick J. Wong May 20, 2023, 9:56 p.m. UTC | #1
On Fri, May 19, 2023 at 10:18:29AM -0700, Wengang Wang wrote:
> The following calltrace is seen:
> 	#0	context_switch() kernel/sched/core.c:3881
> 	#1	__schedule() kernel/sched/core.c:5111
> 	#2	schedule() kernel/sched/core.c:5186
> 	#3	xfs_extent_busy_flush() fs/xfs/xfs_extent_busy.c:598
> 	#4	xfs_alloc_ag_vextent_size() fs/xfs/libxfs/xfs_alloc.c:1641
> 	#5	xfs_alloc_ag_vextent() fs/xfs/libxfs/xfs_alloc.c:828
> 	#6	xfs_alloc_fix_freelist() fs/xfs/libxfs/xfs_alloc.c:2362
> 	#7	xfs_free_extent_fix_freelist() fs/xfs/libxfs/xfs_alloc.c:3029
> 	#8	__xfs_free_extent() fs/xfs/libxfs/xfs_alloc.c:3067
> 	#9	xfs_trans_free_extent() fs/xfs/xfs_extfree_item.c:370
> 	#10	xfs_efi_recover() fs/xfs/xfs_extfree_item.c:626
> 	#11	xlog_recover_process_efi() fs/xfs/xfs_log_recover.c:4605
> 	#12	xlog_recover_process_intents() fs/xfs/xfs_log_recover.c:4893
> 	#13	xlog_recover_finish() fs/xfs/xfs_log_recover.c:5824
> 	#14	xfs_log_mount_finish() fs/xfs/xfs_log.c:764
> 	#15	xfs_mountfs() fs/xfs/xfs_mount.c:978
> 	#16	xfs_fs_fill_super() fs/xfs/xfs_super.c:1908
> 	#17	mount_bdev() fs/super.c:1417
> 	#18	xfs_fs_mount() fs/xfs/xfs_super.c:1985
> 	#19	legacy_get_tree() fs/fs_context.c:647
> 	#20	vfs_get_tree() fs/super.c:1547
> 	#21	do_new_mount() fs/namespace.c:2843
> 	#22	do_mount() fs/namespace.c:3163
> 	#23	ksys_mount() fs/namespace.c:3372
> 	#24	__do_sys_mount() fs/namespace.c:3386
> 	#25	__se_sys_mount() fs/namespace.c:3383
> 	#26	__x64_sys_mount() fs/namespace.c:3383
> 	#27	do_syscall_64() arch/x86/entry/common.c:296
> 	#28	entry_SYSCALL_64() arch/x86/entry/entry_64.S:180
> 
> During the processing of the 2nd and subsequent records in an EFI, we
> are waiting for the busy blocks to be cleaned, but the only busy extent
> is still held in the current xfs_trans->t_busy. That busy extent was
> added when processing the previous EFI record, and because it is not
> committed yet, it can't be cleaned.
> 
> To avoid the above deadlock, don't block in xfs_extent_busy_flush() when
> allocating AGFL blocks; instead, return -EAGAIN. On receiving -EAGAIN we
> are able to retry that EFI record with a new transaction after committing
> the old transaction. With the old transaction committed, the busy extent
> attached to it gets the chance to be cleaned. On the retry, there are no
> existing busy extents in the new transaction, thus no deadlock.
> 
> Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com>
> ---
>  fs/xfs/libxfs/xfs_alloc.c | 30 ++++++++++++++++++++++++------
>  fs/xfs/libxfs/xfs_alloc.h |  2 ++
>  fs/xfs/scrub/repair.c     |  4 ++--
>  fs/xfs/xfs_extent_busy.c  | 34 +++++++++++++++++++++++++++++-----
>  fs/xfs/xfs_extent_busy.h  |  6 +++---
>  fs/xfs/xfs_extfree_item.c | 37 ++++++++++++++++++++++++++++++++++++-
>  fs/xfs/xfs_log_recover.c  | 23 ++++++++++-------------
>  fs/xfs/xfs_trans_ail.c    |  2 +-
>  fs/xfs/xfs_trans_priv.h   |  1 +
>  9 files changed, 108 insertions(+), 31 deletions(-)
> 
> diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c
> index 203f16c48c19..abfd2acb3053 100644
> --- a/fs/xfs/libxfs/xfs_alloc.c
> +++ b/fs/xfs/libxfs/xfs_alloc.c
> @@ -1491,6 +1491,7 @@ STATIC int
>  xfs_alloc_ag_vextent_near(
>  	struct xfs_alloc_arg	*args)
>  {
> +	int			flags = args->flags | XFS_ALLOC_FLAG_TRYFLUSH;
>  	struct xfs_alloc_cur	acur = {};
>  	int			error;		/* error code */
>  	int			i;		/* result code, temporary */
> @@ -1564,8 +1565,11 @@ xfs_alloc_ag_vextent_near(
>  	if (!acur.len) {
>  		if (acur.busy) {
>  			trace_xfs_alloc_near_busy(args);
> -			xfs_extent_busy_flush(args->mp, args->pag,
> -					      acur.busy_gen);
> +			error = xfs_extent_busy_flush(args->tp, args->pag,
> +					      acur.busy_gen, flags);
> +			if (error)
> +				goto out;
> +			flags &= ~XFS_ALLOC_FLAG_TRYFLUSH;
>  			goto restart;
>  		}
>  		trace_xfs_alloc_size_neither(args);
> @@ -1592,6 +1596,7 @@ STATIC int				/* error */
>  xfs_alloc_ag_vextent_size(
>  	xfs_alloc_arg_t	*args)		/* allocation argument structure */
>  {
> +	int		flags = args->flags | XFS_ALLOC_FLAG_TRYFLUSH;

This variable holds XFS_ALLOC_FLAG_* values that are passed only to
xfs_extent_busy_flush, right?  Could this be named "busyflags" or
similar to make that clearer?

(Clearer to 6hr^W6mo from now Darrick who won't remember this at all.)

>  	struct xfs_agf	*agf = args->agbp->b_addr;
>  	struct xfs_btree_cur *bno_cur;	/* cursor for bno btree */
>  	struct xfs_btree_cur *cnt_cur;	/* cursor for cnt btree */
> @@ -1670,8 +1675,13 @@ xfs_alloc_ag_vextent_size(
>  				xfs_btree_del_cursor(cnt_cur,
>  						     XFS_BTREE_NOERROR);
>  				trace_xfs_alloc_size_busy(args);
> -				xfs_extent_busy_flush(args->mp,
> -							args->pag, busy_gen);
> +				error = xfs_extent_busy_flush(args->tp, args->pag,
> +						busy_gen, flags);
> +				if (error) {
> +					cnt_cur = NULL;
> +					goto error0;
> +				}
> +				flags &= ~XFS_ALLOC_FLAG_TRYFLUSH;
>  				goto restart;
>  			}
>  		}
> @@ -1755,7 +1765,13 @@ xfs_alloc_ag_vextent_size(
>  		if (busy) {
>  			xfs_btree_del_cursor(cnt_cur, XFS_BTREE_NOERROR);
>  			trace_xfs_alloc_size_busy(args);
> -			xfs_extent_busy_flush(args->mp, args->pag, busy_gen);
> +			error = xfs_extent_busy_flush(args->tp, args->pag,
> +					busy_gen, flags);
> +			if (error) {
> +				cnt_cur = NULL;
> +				goto error0;
> +			}
> +			flags &= ~XFS_ALLOC_FLAG_TRYFLUSH;
>  			goto restart;
>  		}
>  		goto out_nominleft;
> @@ -2629,6 +2645,7 @@ xfs_alloc_fix_freelist(
>  	targs.agno = args->agno;
>  	targs.alignment = targs.minlen = targs.prod = 1;
>  	targs.pag = pag;
> +	targs.flags = args->flags & XFS_ALLOC_FLAG_FREEING;

Hmm, I guess this propagates FREEING into the allocation that lengthens
the AGFL.  Right?

>  	error = xfs_alloc_read_agfl(pag, tp, &agflbp);
>  	if (error)
>  		goto out_agbp_relse;
> @@ -3572,6 +3589,7 @@ xfs_free_extent_fix_freelist(
>  	args.mp = tp->t_mountp;
>  	args.agno = pag->pag_agno;
>  	args.pag = pag;
> +	args.flags = XFS_ALLOC_FLAG_FREEING;
>  
>  	/*
>  	 * validate that the block number is legal - the enables us to detect
> @@ -3580,7 +3598,7 @@ xfs_free_extent_fix_freelist(
>  	if (args.agno >= args.mp->m_sb.sb_agcount)
>  		return -EFSCORRUPTED;
>  
> -	error = xfs_alloc_fix_freelist(&args, XFS_ALLOC_FLAG_FREEING);
> +	error = xfs_alloc_fix_freelist(&args, args.flags);

Hm.  Why stuff XFS_ALLOC_FLAG_FREEING into args.flags here?  I /think/
in this case it doesn't matter because nothing under
xfs_alloc_fix_freelist accesses args->flags...?

AFAICT in xfs_alloc_fix_freelist, the difference between @args->flags
and @flags is that @args is the allocation that we're trying to do,
whereas @flags are for any allocation that might need to be done to fix
the freelist.  IOWS, the only allocation we're doing *is* to fix the
freelist, so they are one and the same.  Maybe?  Otherwise I can't tell
why we'd convey two sets of XFS_ALLOC flags to xfs_alloc_fix_freelist.

So while I think this isn't incorrect, the flag mixing bothers me a bit
because setting fields in a struct can have weird, not-so-obvious side
effects.

(Still trying to make sense of the addition of a @flags field to struct
xfs_alloc_arg.)

>  	if (error)
>  		return error;
>  
> diff --git a/fs/xfs/libxfs/xfs_alloc.h b/fs/xfs/libxfs/xfs_alloc.h
> index 2b246d74c189..5038fba87784 100644
> --- a/fs/xfs/libxfs/xfs_alloc.h
> +++ b/fs/xfs/libxfs/xfs_alloc.h
> @@ -24,6 +24,7 @@ unsigned int xfs_agfl_size(struct xfs_mount *mp);
>  #define	XFS_ALLOC_FLAG_NORMAP	0x00000004  /* don't modify the rmapbt */
>  #define	XFS_ALLOC_FLAG_NOSHRINK	0x00000008  /* don't shrink the freelist */
>  #define	XFS_ALLOC_FLAG_CHECK	0x00000010  /* test only, don't modify args */
> +#define	XFS_ALLOC_FLAG_TRYFLUSH	0x00000020  /* don't block in busyextent flush*/
>  
>  /*
>   * Argument structure for xfs_alloc routines.
> @@ -57,6 +58,7 @@ typedef struct xfs_alloc_arg {
>  #ifdef DEBUG
>  	bool		alloc_minlen_only; /* allocate exact minlen extent */
>  #endif
> +	int		flags;		/* XFS_ALLOC_FLAG_* */
>  } xfs_alloc_arg_t;
>  
>  /*
> diff --git a/fs/xfs/scrub/repair.c b/fs/xfs/scrub/repair.c
> index 1b71174ec0d6..2ba28e4257fe 100644
> --- a/fs/xfs/scrub/repair.c
> +++ b/fs/xfs/scrub/repair.c
> @@ -496,9 +496,9 @@ xrep_fix_freelist(
>  	args.agno = sc->sa.pag->pag_agno;
>  	args.alignment = 1;
>  	args.pag = sc->sa.pag;
> +	args.flags = can_shrink ? 0 : XFS_ALLOC_FLAG_NOSHRINK;
>  
> -	return xfs_alloc_fix_freelist(&args,
> -			can_shrink ? 0 : XFS_ALLOC_FLAG_NOSHRINK);
> +	return xfs_alloc_fix_freelist(&args, args.flags);
>  }
>  
>  /*
> diff --git a/fs/xfs/xfs_extent_busy.c b/fs/xfs/xfs_extent_busy.c
> index f3d328e4a440..ea1c1857bf5b 100644
> --- a/fs/xfs/xfs_extent_busy.c
> +++ b/fs/xfs/xfs_extent_busy.c
> @@ -567,18 +567,41 @@ xfs_extent_busy_clear(
>  /*
>   * Flush out all busy extents for this AG.
>   */
> -void
> +int
>  xfs_extent_busy_flush(
> -	struct xfs_mount	*mp,
> +	struct xfs_trans	*tp,
>  	struct xfs_perag	*pag,
> -	unsigned		busy_gen)
> +	unsigned		busy_gen,
> +	int			flags)
>  {
>  	DEFINE_WAIT		(wait);
>  	int			error;
>  
> -	error = xfs_log_force(mp, XFS_LOG_SYNC);
> +	error = xfs_log_force(tp->t_mountp, XFS_LOG_SYNC);
>  	if (error)
> -		return;
> +		return error;
> +
> +	/*
> +	 * If we are holding busy extents, the caller may not want to block
> +	 * straight away. If we are being told just to try a flush or progress
> +	 * has been made since we last skipped a busy extent, return
> +	 * immediately to allow the caller to try again. If we are freeing
> +	 * extents, we might actually be holding the only free extents in the
> +	 * transaction busy list and the log force won't resolve that
> +	 * situation. In that case, return -EAGAIN to tell the
> +	 * caller it needs to commit the busy extents it holds before retrying
> +	 * the extent free operation.
> +	 */
> +	if (!list_empty(&tp->t_busy)) {
> +		if (flags & XFS_ALLOC_FLAG_TRYFLUSH)
> +			return 0;
> +
> +		if (busy_gen != READ_ONCE(pag->pagb_gen))
> +			return 0;
> +
> +		if (flags & XFS_ALLOC_FLAG_FREEING)
> +			return -EAGAIN;
> +	}
>  
>  	do {
>  		prepare_to_wait(&pag->pagb_wait, &wait, TASK_KILLABLE);
> @@ -588,6 +611,7 @@ xfs_extent_busy_flush(
>  	} while (1);
>  
>  	finish_wait(&pag->pagb_wait, &wait);
> +	return 0;

This part looks good.

>  }
>  
>  void
> diff --git a/fs/xfs/xfs_extent_busy.h b/fs/xfs/xfs_extent_busy.h
> index 4a118131059f..edeedb92e0df 100644
> --- a/fs/xfs/xfs_extent_busy.h
> +++ b/fs/xfs/xfs_extent_busy.h
> @@ -51,9 +51,9 @@ bool
>  xfs_extent_busy_trim(struct xfs_alloc_arg *args, xfs_agblock_t *bno,
>  		xfs_extlen_t *len, unsigned *busy_gen);
>  
> -void
> -xfs_extent_busy_flush(struct xfs_mount *mp, struct xfs_perag *pag,
> -	unsigned busy_gen);
> +int
> +xfs_extent_busy_flush(struct xfs_trans *tp, struct xfs_perag *pag,
> +	unsigned busy_gen, int flags);
>  
>  void
>  xfs_extent_busy_wait_all(struct xfs_mount *mp);
> diff --git a/fs/xfs/xfs_extfree_item.c b/fs/xfs/xfs_extfree_item.c
> index 011b50469301..3c5a9e9952ec 100644
> --- a/fs/xfs/xfs_extfree_item.c
> +++ b/fs/xfs/xfs_extfree_item.c
> @@ -336,6 +336,25 @@ xfs_trans_get_efd(
>  	return efdp;
>  }
>  
> +/*
> + * Fill the EFD with all extents from the EFI and set the counter.
> + * Note: the EFD should contain at least one extent already.
> + */
> +static void xfs_fill_efd_with_efi(struct xfs_efd_log_item *efdp)
> +{
> +	struct xfs_efi_log_item *efip = efdp->efd_efip;
> +	uint                    i;
> +
> +	if (efdp->efd_next_extent == efip->efi_format.efi_nextents)
> +		return;
> +
> +	for (i = 0; i < efip->efi_format.efi_nextents; i++) {
> +	       efdp->efd_format.efd_extents[i] =
> +		       efip->efi_format.efi_extents[i];
> +	}
> +	efdp->efd_next_extent = efip->efi_format.efi_nextents;
> +}
> +
>  /*
>   * Free an extent and log it to the EFD. Note that the transaction is marked
>   * dirty regardless of whether the extent free succeeds or fails to support the
> @@ -369,6 +388,10 @@ xfs_trans_free_extent(
>  	error = __xfs_free_extent(tp, xefi->xefi_startblock,
>  			xefi->xefi_blockcount, &oinfo, XFS_AG_RESV_NONE,
>  			xefi->xefi_flags & XFS_EFI_SKIP_DISCARD);
> +	if (error == -EAGAIN) {
> +		xfs_fill_efd_with_efi(efdp);
> +		return error;
> +	}
>  	/*
>  	 * Mark the transaction dirty, even on error. This ensures the
>  	 * transaction is aborted, which:
> @@ -476,7 +499,8 @@ xfs_extent_free_finish_item(
>  	xefi = container_of(item, struct xfs_extent_free_item, xefi_list);
>  
>  	error = xfs_trans_free_extent(tp, EFD_ITEM(done), xefi);
> -	kmem_cache_free(xfs_extfree_item_cache, xefi);
> +	if (error != -EAGAIN)
> +		kmem_cache_free(xfs_extfree_item_cache, xefi);
>  	return error;
>  }
>  
> @@ -633,6 +657,17 @@ xfs_efi_item_recover(
>  		fake.xefi_blockcount = extp->ext_len;
>  
>  		error = xfs_trans_free_extent(tp, efdp, &fake);
> +		if (error == -EAGAIN) {

Curious, is this patch based off of 6.4-rc?  There should be a
xfs_extent_free_get_group / xfs_extent_free_put_group here.

> +			xfs_free_extent_later(tp, fake.xefi_startblock,
> +				fake.xefi_blockcount, &XFS_RMAP_OINFO_ANY_OWNER);
> +			/*
> +			 * try to free as many extents as possible with current
> +			 * transaction
> +			 */
> +			error = 0;
> +			continue;
> +		};

If we get EAGAIN, why not let the unfinished parts of the recovered EFI
work get turned into a new EFI?

> +
>  		if (error == -EFSCORRUPTED)
>  			XFS_CORRUPTION_ERROR(__func__, XFS_ERRLEVEL_LOW, mp,
>  					extp, sizeof(*extp));
> diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c
> index 322eb2ee6c55..00bfe9683fa8 100644
> --- a/fs/xfs/xfs_log_recover.c
> +++ b/fs/xfs/xfs_log_recover.c
> @@ -2540,30 +2540,27 @@ xlog_recover_process_intents(
>  	struct xfs_log_item	*lip;
>  	struct xfs_ail		*ailp;
>  	int			error = 0;
> -#if defined(DEBUG) || defined(XFS_WARN)
> -	xfs_lsn_t		last_lsn;
> -#endif
> +	xfs_lsn_t		threshold_lsn;
>  
>  	ailp = log->l_ailp;
> +	threshold_lsn = xfs_ail_max_lsn(ailp);
>  	spin_lock(&ailp->ail_lock);
> -#if defined(DEBUG) || defined(XFS_WARN)
> -	last_lsn = xlog_assign_lsn(log->l_curr_cycle, log->l_curr_block);
> -#endif
> +
>  	for (lip = xfs_trans_ail_cursor_first(ailp, &cur, 0);
>  	     lip != NULL;
>  	     lip = xfs_trans_ail_cursor_next(ailp, &cur)) {
>  		const struct xfs_item_ops	*ops;
> +		/*
> +		 * Original redo EFI could be split into new EFIs. Those
> +		 * new EFIs are supposed to be processed in capture_list.
> +		 * Stop here when original redo intents are done.
> +		 */
> +		if (XFS_LSN_CMP(threshold_lsn, lip->li_lsn) < 0)
> +			break;

I'm not sure what this change accomplishes?  Is it because of the
continue; statement in xfs_efi_item_recover?

--D

>  
>  		if (!xlog_item_is_intent(lip))
>  			break;
>  
> -		/*
> -		 * We should never see a redo item with a LSN higher than
> -		 * the last transaction we found in the log at the start
> -		 * of recovery.
> -		 */
> -		ASSERT(XFS_LSN_CMP(last_lsn, lip->li_lsn) >= 0);
> -
>  		/*
>  		 * NOTE: If your intent processing routine can create more
>  		 * deferred ops, you /must/ attach them to the capture list in
> diff --git a/fs/xfs/xfs_trans_ail.c b/fs/xfs/xfs_trans_ail.c
> index 7d4109af193e..2825f55eca88 100644
> --- a/fs/xfs/xfs_trans_ail.c
> +++ b/fs/xfs/xfs_trans_ail.c
> @@ -137,7 +137,7 @@ xfs_ail_min_lsn(
>  /*
>   * Return the maximum lsn held in the AIL, or zero if the AIL is empty.
>   */
> -static xfs_lsn_t
> +xfs_lsn_t
>  xfs_ail_max_lsn(
>  	struct xfs_ail		*ailp)
>  {
> diff --git a/fs/xfs/xfs_trans_priv.h b/fs/xfs/xfs_trans_priv.h
> index d5400150358e..86b4f29b2a6e 100644
> --- a/fs/xfs/xfs_trans_priv.h
> +++ b/fs/xfs/xfs_trans_priv.h
> @@ -106,6 +106,7 @@ void			xfs_ail_push_all(struct xfs_ail *);
>  void			xfs_ail_push_all_sync(struct xfs_ail *);
>  struct xfs_log_item	*xfs_ail_min(struct xfs_ail  *ailp);
>  xfs_lsn_t		xfs_ail_min_lsn(struct xfs_ail *ailp);
> +xfs_lsn_t		xfs_ail_max_lsn(struct xfs_ail *ailp);
>  
>  struct xfs_log_item *	xfs_trans_ail_cursor_first(struct xfs_ail *ailp,
>  					struct xfs_ail_cursor *cur,
> -- 
> 2.21.0 (Apple Git-122.2)
>
Dave Chinner May 22, 2023, 1:17 a.m. UTC | #2
On Fri, May 19, 2023 at 10:18:29AM -0700, Wengang Wang wrote:
> The following calltrace is seen:
> 	#0	context_switch() kernel/sched/core.c:3881
> 	#1	__schedule() kernel/sched/core.c:5111
> 	#2	schedule() kernel/sched/core.c:5186
> 	#3	xfs_extent_busy_flush() fs/xfs/xfs_extent_busy.c:598
> 	#4	xfs_alloc_ag_vextent_size() fs/xfs/libxfs/xfs_alloc.c:1641
> 	#5	xfs_alloc_ag_vextent() fs/xfs/libxfs/xfs_alloc.c:828
> 	#6	xfs_alloc_fix_freelist() fs/xfs/libxfs/xfs_alloc.c:2362
> 	#7	xfs_free_extent_fix_freelist() fs/xfs/libxfs/xfs_alloc.c:3029
> 	#8	__xfs_free_extent() fs/xfs/libxfs/xfs_alloc.c:3067
> 	#9	xfs_trans_free_extent() fs/xfs/xfs_extfree_item.c:370
> 	#10	xfs_efi_recover() fs/xfs/xfs_extfree_item.c:626
> 	#11	xlog_recover_process_efi() fs/xfs/xfs_log_recover.c:4605
> 	#12	xlog_recover_process_intents() fs/xfs/xfs_log_recover.c:4893
> 	#13	xlog_recover_finish() fs/xfs/xfs_log_recover.c:5824
> 	#14	xfs_log_mount_finish() fs/xfs/xfs_log.c:764
> 	#15	xfs_mountfs() fs/xfs/xfs_mount.c:978
> 	#16	xfs_fs_fill_super() fs/xfs/xfs_super.c:1908
> 	#17	mount_bdev() fs/super.c:1417
> 	#18	xfs_fs_mount() fs/xfs/xfs_super.c:1985
> 	#19	legacy_get_tree() fs/fs_context.c:647
> 	#20	vfs_get_tree() fs/super.c:1547
> 	#21	do_new_mount() fs/namespace.c:2843
> 	#22	do_mount() fs/namespace.c:3163
> 	#23	ksys_mount() fs/namespace.c:3372
> 	#24	__do_sys_mount() fs/namespace.c:3386
> 	#25	__se_sys_mount() fs/namespace.c:3383
> 	#26	__x64_sys_mount() fs/namespace.c:3383
> 	#27	do_syscall_64() arch/x86/entry/common.c:296
> 	#28	entry_SYSCALL_64() arch/x86/entry/entry_64.S:180
> 
> During the processing of the 2nd and subsequent records in an EFI, we
> are waiting for the busy blocks to be cleaned, but the only busy extent
> is still held in the current xfs_trans->t_busy. That busy extent was
> added when processing the previous EFI record, and because it is not
> committed yet, it can't be cleaned.
> 
> To avoid the above deadlock, don't block in xfs_extent_busy_flush() when
> allocating AGFL blocks; instead, return -EAGAIN. On receiving -EAGAIN we
> are able to retry that EFI record with a new transaction after committing
> the old transaction. With the old transaction committed, the busy extent
> attached to it gets the chance to be cleaned. On the retry, there are no
> existing busy extents in the new transaction, thus no deadlock.
> 
> Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com>

There are multiple changes in this patchset where it isn't immediately
obvious why they have been done. Hence this needs to be split into
one set of changes per patch.

> ---
>  fs/xfs/libxfs/xfs_alloc.c | 30 ++++++++++++++++++++++++------
>  fs/xfs/libxfs/xfs_alloc.h |  2 ++
>  fs/xfs/scrub/repair.c     |  4 ++--
>  fs/xfs/xfs_extent_busy.c  | 34 +++++++++++++++++++++++++++++-----
>  fs/xfs/xfs_extent_busy.h  |  6 +++---
>  fs/xfs/xfs_extfree_item.c | 37 ++++++++++++++++++++++++++++++++++++-
>  fs/xfs/xfs_log_recover.c  | 23 ++++++++++-------------
>  fs/xfs/xfs_trans_ail.c    |  2 +-
>  fs/xfs/xfs_trans_priv.h   |  1 +
>  9 files changed, 108 insertions(+), 31 deletions(-)
> 
> diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c
> index 203f16c48c19..abfd2acb3053 100644
> --- a/fs/xfs/libxfs/xfs_alloc.c
> +++ b/fs/xfs/libxfs/xfs_alloc.c
> @@ -1491,6 +1491,7 @@ STATIC int
>  xfs_alloc_ag_vextent_near(
>  	struct xfs_alloc_arg	*args)
>  {
> +	int			flags = args->flags | XFS_ALLOC_FLAG_TRYFLUSH;

The first set of changes is encoding allocation flags into the
struct xfs_alloc_arg. I can sort of see why you did this, but it's
done in an inconsistent manner, and I don't see why some of the
changes were made and not others. This conversion belongs in its
own patch.

>  	struct xfs_alloc_cur	acur = {};
>  	int			error;		/* error code */
>  	int			i;		/* result code, temporary */
> @@ -1564,8 +1565,11 @@ xfs_alloc_ag_vextent_near(
>  	if (!acur.len) {
>  		if (acur.busy) {
>  			trace_xfs_alloc_near_busy(args);
> -			xfs_extent_busy_flush(args->mp, args->pag,
> -					      acur.busy_gen);
> +			error = xfs_extent_busy_flush(args->tp, args->pag,
> +					      acur.busy_gen, flags);
> +			if (error)
> +				goto out;
> +			flags &= ~XFS_ALLOC_FLAG_TRYFLUSH;
>  			goto restart;
>  		}
>  		trace_xfs_alloc_size_neither(args);
> @@ -1592,6 +1596,7 @@ STATIC int				/* error */
>  xfs_alloc_ag_vextent_size(
>  	xfs_alloc_arg_t	*args)		/* allocation argument structure */
>  {
> +	int		flags = args->flags | XFS_ALLOC_FLAG_TRYFLUSH;
>  	struct xfs_agf	*agf = args->agbp->b_addr;
>  	struct xfs_btree_cur *bno_cur;	/* cursor for bno btree */
>  	struct xfs_btree_cur *cnt_cur;	/* cursor for cnt btree */
> @@ -1670,8 +1675,13 @@ xfs_alloc_ag_vextent_size(
>  				xfs_btree_del_cursor(cnt_cur,
>  						     XFS_BTREE_NOERROR);
>  				trace_xfs_alloc_size_busy(args);
> -				xfs_extent_busy_flush(args->mp,
> -							args->pag, busy_gen);
> +				error = xfs_extent_busy_flush(args->tp, args->pag,
> +						busy_gen, flags);
> +				if (error) {
> +					cnt_cur = NULL;
> +					goto error0;
> +				}
> +				flags &= ~XFS_ALLOC_FLAG_TRYFLUSH;
>  				goto restart;
>  			}
>  		}

The cursor handling for the error case could be improved here by
moving the xfs_btree_del_cursor(cnt_cur...) call here.

			if (i == 0) {
				/* ... */
				trace_xfs_alloc_size_busy(args);
				error = xfs_extent_busy_flush(args->tp, args->pag,
						busy_gen, flags);
				if (error)
					goto error0;

				xfs_btree_del_cursor(cnt_cur, XFS_BTREE_NOERROR);
				flags &= ~XFS_ALLOC_FLAG_TRYFLUSH;
				goto restart;
			}

> @@ -1755,7 +1765,13 @@ xfs_alloc_ag_vextent_size(
>  		if (busy) {
>  			xfs_btree_del_cursor(cnt_cur, XFS_BTREE_NOERROR);
>  			trace_xfs_alloc_size_busy(args);
> -			xfs_extent_busy_flush(args->mp, args->pag, busy_gen);
> +			error = xfs_extent_busy_flush(args->tp, args->pag,
> +					busy_gen, flags);
> +			if (error) {
> +				cnt_cur = NULL;
> +				goto error0;
> +			}
> +			flags &= ~XFS_ALLOC_FLAG_TRYFLUSH;
>  			goto restart;
>  		}
>  		goto out_nominleft;

Same here.

I'd also enhance the trace_xfs_alloc_size_error() call to
trace the error that occurred so we can actually see when an EAGAIN
error occurs while we are tracing...


> @@ -2629,6 +2645,7 @@ xfs_alloc_fix_freelist(
>  	targs.agno = args->agno;
>  	targs.alignment = targs.minlen = targs.prod = 1;
>  	targs.pag = pag;
> +	targs.flags = args->flags & XFS_ALLOC_FLAG_FREEING;
>  	error = xfs_alloc_read_agfl(pag, tp, &agflbp);
>  	if (error)
>  		goto out_agbp_relse;

This is a change in behaviour for AGFL filling. This should be in
its own patch, explaining why the change of behaviour is necessary.

Also, it could just be:

	targs.flags = flags & XFS_ALLOC_FLAG_FREEING;

> @@ -3572,6 +3589,7 @@ xfs_free_extent_fix_freelist(
>  	args.mp = tp->t_mountp;
>  	args.agno = pag->pag_agno;
>  	args.pag = pag;
> +	args.flags = XFS_ALLOC_FLAG_FREEING;
>  
>  	/*
>  	 * validate that the block number is legal - the enables us to detect
> @@ -3580,7 +3598,7 @@ xfs_free_extent_fix_freelist(
>  	if (args.agno >= args.mp->m_sb.sb_agcount)
>  		return -EFSCORRUPTED;
>  
> -	error = xfs_alloc_fix_freelist(&args, XFS_ALLOC_FLAG_FREEING);
> +	error = xfs_alloc_fix_freelist(&args, args.flags);
>  	if (error)
>  		return error;

This is where I see inconsistency in the usage of args.flags. If we
are going to move the XFS_ALLOC_FLAG_* state into args->flags, then
it should not be passed as a separate parameter to functions that
take a struct xfs_alloc_arg. i.e. we either use args->flags
everywhere for everything, or we pass a separate flags argument
everywhere as we currently do.
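
For example, the args-only variant would read something like this at the
call site (a sketch, assuming xfs_alloc_fix_freelist() is reworked to
read args->flags internally):

	args.flags = XFS_ALLOC_FLAG_FREEING;
	...
	/* no separate flags parameter any more */
	error = xfs_alloc_fix_freelist(&args);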


> diff --git a/fs/xfs/libxfs/xfs_alloc.h b/fs/xfs/libxfs/xfs_alloc.h
> index 2b246d74c189..5038fba87784 100644
> --- a/fs/xfs/libxfs/xfs_alloc.h
> +++ b/fs/xfs/libxfs/xfs_alloc.h
> @@ -24,6 +24,7 @@ unsigned int xfs_agfl_size(struct xfs_mount *mp);
>  #define	XFS_ALLOC_FLAG_NORMAP	0x00000004  /* don't modify the rmapbt */
>  #define	XFS_ALLOC_FLAG_NOSHRINK	0x00000008  /* don't shrink the freelist */
>  #define	XFS_ALLOC_FLAG_CHECK	0x00000010  /* test only, don't modify args */
> +#define	XFS_ALLOC_FLAG_TRYFLUSH	0x00000020  /* don't block in busyextent flush*/

Convert all these to (1U << x) format while you are adding a new
flag...
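
i.e. something like this (a sketch; TRYLOCK and FREEING are the existing
flags not shown in this hunk, assumed here to be 0x1 and 0x2):

	#define	XFS_ALLOC_FLAG_TRYLOCK	(1U << 0)  /* only try to lock fs */
	#define	XFS_ALLOC_FLAG_FREEING	(1U << 1)  /* caller is freeing extents */
	#define	XFS_ALLOC_FLAG_NORMAP	(1U << 2)  /* don't modify the rmapbt */
	#define	XFS_ALLOC_FLAG_NOSHRINK	(1U << 3)  /* don't shrink the freelist */
	#define	XFS_ALLOC_FLAG_CHECK	(1U << 4)  /* test only, don't modify args */
	#define	XFS_ALLOC_FLAG_TRYFLUSH	(1U << 5)  /* don't block in busy extent flush */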

>  
>  /*
>   * Argument structure for xfs_alloc routines.
> @@ -57,6 +58,7 @@ typedef struct xfs_alloc_arg {
>  #ifdef DEBUG
>  	bool		alloc_minlen_only; /* allocate exact minlen extent */
>  #endif
> +	int		flags;		/* XFS_ALLOC_FLAG_* */

And flags fields should be unsigned.

>  } xfs_alloc_arg_t;
>  
>  /*
> diff --git a/fs/xfs/scrub/repair.c b/fs/xfs/scrub/repair.c
> index 1b71174ec0d6..2ba28e4257fe 100644
> --- a/fs/xfs/scrub/repair.c
> +++ b/fs/xfs/scrub/repair.c
> @@ -496,9 +496,9 @@ xrep_fix_freelist(
>  	args.agno = sc->sa.pag->pag_agno;
>  	args.alignment = 1;
>  	args.pag = sc->sa.pag;
> +	args.flags = can_shrink ? 0 : XFS_ALLOC_FLAG_NOSHRINK;
>  
> -	return xfs_alloc_fix_freelist(&args,
> -			can_shrink ? 0 : XFS_ALLOC_FLAG_NOSHRINK);
> +	return xfs_alloc_fix_freelist(&args, args.flags);

Same comment as above about just passing separate flags everywhere
or just passing args. This conversion is a separate patch...

> diff --git a/fs/xfs/xfs_extent_busy.c b/fs/xfs/xfs_extent_busy.c
> index f3d328e4a440..ea1c1857bf5b 100644
> --- a/fs/xfs/xfs_extent_busy.c
> +++ b/fs/xfs/xfs_extent_busy.c
> @@ -567,18 +567,41 @@ xfs_extent_busy_clear(
>  /*
>   * Flush out all busy extents for this AG.
>   */
> -void
> +int
>  xfs_extent_busy_flush(
> -	struct xfs_mount	*mp,
> +	struct xfs_trans	*tp,
>  	struct xfs_perag	*pag,
> -	unsigned		busy_gen)
> +	unsigned		busy_gen,
> +	int			flags)
>  {
>  	DEFINE_WAIT		(wait);
>  	int			error;
>  
> -	error = xfs_log_force(mp, XFS_LOG_SYNC);
> +	error = xfs_log_force(tp->t_mountp, XFS_LOG_SYNC);
>  	if (error)
> -		return;
> +		return error;
> +
> +	/*
> +	 * If we are holding busy extents, the caller may not want to block
> +	 * straight away. If we are being told just to try a flush or progress
> +	 * has been made since we last skipped a busy extent, return
> +	 * immediately to allow the caller to try again. If we are freeing
> +	 * extents, we might actually be holding the only free extents in the
> +	 * transaction busy list and the log force won't resolve that
> +	 * situation. In that case, return -EAGAIN to tell the
> +	 * caller it needs to commit the busy extents it holds before retrying
> +	 * the extent free operation.
> +	 */
> +	if (!list_empty(&tp->t_busy)) {
> +		if (flags & XFS_ALLOC_FLAG_TRYFLUSH)
> +			return 0;
> +
> +		if (busy_gen != READ_ONCE(pag->pagb_gen))
> +			return 0;
> +
> +		if (flags & XFS_ALLOC_FLAG_FREEING)
> +			return -EAGAIN;
> +	}
>  
>  	do {
>  		prepare_to_wait(&pag->pagb_wait, &wait, TASK_KILLABLE);
> @@ -588,6 +611,7 @@ xfs_extent_busy_flush(
>  	} while (1);
>  
>  	finish_wait(&pag->pagb_wait, &wait);
> +	return 0;
>  }
>  
>  void
> diff --git a/fs/xfs/xfs_extent_busy.h b/fs/xfs/xfs_extent_busy.h
> index 4a118131059f..edeedb92e0df 100644
> --- a/fs/xfs/xfs_extent_busy.h
> +++ b/fs/xfs/xfs_extent_busy.h
> @@ -51,9 +51,9 @@ bool
>  xfs_extent_busy_trim(struct xfs_alloc_arg *args, xfs_agblock_t *bno,
>  		xfs_extlen_t *len, unsigned *busy_gen);
>  
> -void
> -xfs_extent_busy_flush(struct xfs_mount *mp, struct xfs_perag *pag,
> -	unsigned busy_gen);
> +int
> +xfs_extent_busy_flush(struct xfs_trans *tp, struct xfs_perag *pag,
> +	unsigned busy_gen, int flags);
>  
>  void
>  xfs_extent_busy_wait_all(struct xfs_mount *mp);
> diff --git a/fs/xfs/xfs_extfree_item.c b/fs/xfs/xfs_extfree_item.c
> index 011b50469301..3c5a9e9952ec 100644
> --- a/fs/xfs/xfs_extfree_item.c
> +++ b/fs/xfs/xfs_extfree_item.c
> @@ -336,6 +336,25 @@ xfs_trans_get_efd(
>  	return efdp;
>  }
>  
> +/*
> + * Fill the EFD with all extents from the EFI and set the counter.
> + * Note: the EFD should contain at least one extent already.
> + */
> +static void xfs_fill_efd_with_efi(struct xfs_efd_log_item *efdp)
> +{
> +	struct xfs_efi_log_item *efip = efdp->efd_efip;
> +	uint                    i;
> +
> +	if (efdp->efd_next_extent == efip->efi_format.efi_nextents)
> +		return;
> +
> +	for (i = 0; i < efip->efi_format.efi_nextents; i++) {
> +	       efdp->efd_format.efd_extents[i] =
> +		       efip->efi_format.efi_extents[i];
> +	}
> +	efdp->efd_next_extent = efip->efi_format.efi_nextents;
> +}
> +

Ok, but it doesn't dirty the transaction or the EFD, which means....

> @@ -369,6 +388,10 @@ xfs_trans_free_extent(
>  	error = __xfs_free_extent(tp, xefi->xefi_startblock,
>  			xefi->xefi_blockcount, &oinfo, XFS_AG_RESV_NONE,
>  			xefi->xefi_flags & XFS_EFI_SKIP_DISCARD);
> +	if (error == -EAGAIN) {
> +		xfs_fill_efd_with_efi(efdp);
> +		return error;
> +	}

.... this is incorrectly placed.

The very next lines say:

>  	/*
>  	 * Mark the transaction dirty, even on error. This ensures the
>  	 * transaction is aborted, which:

i.e. we have to make the transaction and EFD log item dirty even if
we have an error. In this case, the error is not fatal, but we still
have to ensure that we commit the EFD when we roll the transaction.
Hence the transaction and EFD still need to be dirtied on -EAGAIN...
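
i.e. keep the dirtying unconditional and only then check for -EAGAIN,
something like this (a sketch; the two dirtying lines are the existing
ones from xfs_trans_free_extent(), only the placement of the -EAGAIN
check moves):

	error = __xfs_free_extent(tp, xefi->xefi_startblock,
			xefi->xefi_blockcount, &oinfo, XFS_AG_RESV_NONE,
			xefi->xefi_flags & XFS_EFI_SKIP_DISCARD);

	/* Mark the transaction and the EFD dirty, even on -EAGAIN. */
	tp->t_flags |= XFS_TRANS_DIRTY;
	set_bit(XFS_LI_DIRTY, &efdp->efd_item.li_flags);

	if (error == -EAGAIN) {
		xfs_fill_efd_with_efi(efdp);
		return error;
	}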

> @@ -476,7 +499,8 @@ xfs_extent_free_finish_item(
>  	xefi = container_of(item, struct xfs_extent_free_item, xefi_list);
>  
>  	error = xfs_trans_free_extent(tp, EFD_ITEM(done), xefi);
> -	kmem_cache_free(xfs_extfree_item_cache, xefi);
> +	if (error != -EAGAIN)
> +		kmem_cache_free(xfs_extfree_item_cache, xefi);
>  	return error;
>  }

Ok, that's a bit nasty. Let's do that the same way as
xfs_refcount_update_finish_item():

	error = xfs_trans_free_extent(tp, EFD_ITEM(done), xefi);

	/*
	 * Do we need to roll the transaction and retry to avoid a busy
	 * extent deadlock?
	 */
	if (error == -EAGAIN)
		return error;

	kmem_cache_free(xfs_extfree_item_cache, xefi);
	return error;

>  
> @@ -633,6 +657,17 @@ xfs_efi_item_recover(
>  		fake.xefi_blockcount = extp->ext_len;
>  
>  		error = xfs_trans_free_extent(tp, efdp, &fake);
> +		if (error == -EAGAIN) {
> +			xfs_free_extent_later(tp, fake.xefi_startblock,
> +				fake.xefi_blockcount, &XFS_RMAP_OINFO_ANY_OWNER);
> +			/*
> +			 * try to free as many extents as possible with current
> +			 * transaction
> +			 */
> +			error = 0;
> +			continue;
> +		};

This looks like it may be problematic - I'd prefer this code to be
obviously correct before we try to do any fancy optimisations to
minimise transaction counts. That is, once we get a -EAGAIN error,
we should defer the rest of the extents and roll the transaction
same as is done in xfs_cui_item_recover().

That is, when we get -EAGAIN, set a boolean "requeue_only" flag
and if that is set simply call xfs_free_extent_later() directly
until the EFI is exhausted.
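
In the loop body that would look something like the following (a sketch;
requeue_only is a new local boolean, and the xfs_free_extent_later()
call is the one from the patch):

	if (requeue_only) {
		xfs_free_extent_later(tp, fake.xefi_startblock,
				fake.xefi_blockcount,
				&XFS_RMAP_OINFO_ANY_OWNER);
		continue;
	}

	error = xfs_trans_free_extent(tp, efdp, &fake);
	if (error == -EAGAIN) {
		requeue_only = true;
		xfs_free_extent_later(tp, fake.xefi_startblock,
				fake.xefi_blockcount,
				&XFS_RMAP_OINFO_ANY_OWNER);
		error = 0;
		continue;
	}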

> diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c
> index 322eb2ee6c55..00bfe9683fa8 100644
> --- a/fs/xfs/xfs_log_recover.c
> +++ b/fs/xfs/xfs_log_recover.c
> @@ -2540,30 +2540,27 @@ xlog_recover_process_intents(
>  	struct xfs_log_item	*lip;
>  	struct xfs_ail		*ailp;
>  	int			error = 0;
> -#if defined(DEBUG) || defined(XFS_WARN)
> -	xfs_lsn_t		last_lsn;
> -#endif
> +	xfs_lsn_t		threshold_lsn;
>  
>  	ailp = log->l_ailp;
> +	threshold_lsn = xfs_ail_max_lsn(ailp);
>  	spin_lock(&ailp->ail_lock);
> -#if defined(DEBUG) || defined(XFS_WARN)
> -	last_lsn = xlog_assign_lsn(log->l_curr_cycle, log->l_curr_block);

xfs_ail_max_lsn() and l_curr_cycle/l_curr_block are not the same
thing.  max_lsn points to the lsn of the last entry in the AIL (in
memory state), whilst curr_cycle/block points to the current
physical location of the log head in the on-disk journal.

In this case, we can't use in-memory state to determine where to
stop the initial intent replay - recovery of other items may have
inserted new intents beyond the end of the physical region being
recovered, in which case using xfs_ail_max_lsn() will result in
incorrect behaviour here.

> -#endif
> +
>  	for (lip = xfs_trans_ail_cursor_first(ailp, &cur, 0);
>  	     lip != NULL;
>  	     lip = xfs_trans_ail_cursor_next(ailp, &cur)) {
>  		const struct xfs_item_ops	*ops;
> +		/*
> +		 * Original redo EFI could be split into new EFIs. Those
> +		 * new EFIs are supposed to be processed in capture_list.
> +		 * Stop here when original redo intents are done.
> +		 */
> +		if (XFS_LSN_CMP(threshold_lsn, lip->li_lsn) < 0)
> +			break;
>  
>  		if (!xlog_item_is_intent(lip))
>  			break;
>  
> -		/*
> -		 * We should never see a redo item with a LSN higher than
> -		 * the last transaction we found in the log at the start
> -		 * of recovery.
> -		 */
> -		ASSERT(XFS_LSN_CMP(last_lsn, lip->li_lsn) >= 0);
> -
>  		/*
>  		 * NOTE: If your intent processing routine can create more
>  		 * deferred ops, you /must/ attach them to the capture list in

As it is, I don't understand why this change is necessary and it is
not explained anywhere. That is, new intents should go to the end of
the AIL past the LSN of any intent that has been read from the
region of the log we are recovering. The first intent that is
processed will place non-intent items (e.g. a buffer or inode) in
the CIL/AIL before any new intent item, so this check:

		if (!xlog_item_is_intent(lip))
			break;

should always trigger before we would start processing newly logged
(deferred) intents.

If this change is actually necessary, it needs to be in its own
commit and explain what the problem is that it is fixing. If there
is a bug here that needs fixing, it's something we need to know
about because it will affect all previous kernels, too, not just
ones with this patch that can relog EFIs in recovery.

Cheers,

Dave.
Wengang Wang May 22, 2023, 5:25 p.m. UTC | #3
> On May 20, 2023, at 2:56 PM, Darrick J. Wong <djwong@kernel.org> wrote:
> 
> On Fri, May 19, 2023 at 10:18:29AM -0700, Wengang Wang wrote:
>> The following calltrace is seen:
>> #0 context_switch() kernel/sched/core.c:3881
>> #1 __schedule() kernel/sched/core.c:5111
>> #2 schedule() kernel/sched/core.c:5186
>> #3 xfs_extent_busy_flush() fs/xfs/xfs_extent_busy.c:598
>> #4 xfs_alloc_ag_vextent_size() fs/xfs/libxfs/xfs_alloc.c:1641
>> #5 xfs_alloc_ag_vextent() fs/xfs/libxfs/xfs_alloc.c:828
>> #6 xfs_alloc_fix_freelist() fs/xfs/libxfs/xfs_alloc.c:2362
>> #7 xfs_free_extent_fix_freelist() fs/xfs/libxfs/xfs_alloc.c:3029
>> #8 __xfs_free_extent() fs/xfs/libxfs/xfs_alloc.c:3067
>> #9 xfs_trans_free_extent() fs/xfs/xfs_extfree_item.c:370
>> #10 xfs_efi_recover() fs/xfs/xfs_extfree_item.c:626
>> #11 xlog_recover_process_efi() fs/xfs/xfs_log_recover.c:4605
>> #12 xlog_recover_process_intents() fs/xfs/xfs_log_recover.c:4893
>> #13 xlog_recover_finish() fs/xfs/xfs_log_recover.c:5824
>> #14 xfs_log_mount_finish() fs/xfs/xfs_log.c:764
>> #15 xfs_mountfs() fs/xfs/xfs_mount.c:978
>> #16 xfs_fs_fill_super() fs/xfs/xfs_super.c:1908
>> #17 mount_bdev() fs/super.c:1417
>> #18 xfs_fs_mount() fs/xfs/xfs_super.c:1985
>> #19 legacy_get_tree() fs/fs_context.c:647
>> #20 vfs_get_tree() fs/super.c:1547
>> #21 do_new_mount() fs/namespace.c:2843
>> #22 do_mount() fs/namespace.c:3163
>> #23 ksys_mount() fs/namespace.c:3372
>> #24 __do_sys_mount() fs/namespace.c:3386
>> #25 __se_sys_mount() fs/namespace.c:3383
>> #26 __x64_sys_mount() fs/namespace.c:3383
>> #27 do_syscall_64() arch/x86/entry/common.c:296
>> #28 entry_SYSCALL_64() arch/x86/entry/entry_64.S:180
>> 
>> During the processing of the 2nd and subsequent records in an EFI, we
>> are waiting for the busy blocks to be cleaned, but the only busy extent
>> is still held in the current xfs_trans->t_busy. That busy extent was
>> added when processing the previous EFI record, and because it is not
>> committed yet, it can't be cleaned.
>> 
>> To avoid the above deadlock, don't block in xfs_extent_busy_flush() when
>> allocating AGFL blocks; instead, return -EAGAIN. On receiving -EAGAIN we
>> are able to retry that EFI record with a new transaction after committing
>> the old transaction. With the old transaction committed, the busy extent
>> attached to it gets the chance to be cleaned. On the retry, there are no
>> existing busy extents in the new transaction, thus no deadlock.
>> 
>> Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com>
>> ---
>> fs/xfs/libxfs/xfs_alloc.c | 30 ++++++++++++++++++++++++------
>> fs/xfs/libxfs/xfs_alloc.h |  2 ++
>> fs/xfs/scrub/repair.c     |  4 ++--
>> fs/xfs/xfs_extent_busy.c  | 34 +++++++++++++++++++++++++++++-----
>> fs/xfs/xfs_extent_busy.h  |  6 +++---
>> fs/xfs/xfs_extfree_item.c | 37 ++++++++++++++++++++++++++++++++++++-
>> fs/xfs/xfs_log_recover.c  | 23 ++++++++++-------------
>> fs/xfs/xfs_trans_ail.c    |  2 +-
>> fs/xfs/xfs_trans_priv.h   |  1 +
>> 9 files changed, 108 insertions(+), 31 deletions(-)
>> 
>> diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c
>> index 203f16c48c19..abfd2acb3053 100644
>> --- a/fs/xfs/libxfs/xfs_alloc.c
>> +++ b/fs/xfs/libxfs/xfs_alloc.c
>> @@ -1491,6 +1491,7 @@ STATIC int
>> xfs_alloc_ag_vextent_near(
>> struct xfs_alloc_arg *args)
>> {
>> + int flags = args->flags | XFS_ALLOC_FLAG_TRYFLUSH;
>> struct xfs_alloc_cur acur = {};
>> int error; /* error code */
>> int i; /* result code, temporary */
>> @@ -1564,8 +1565,11 @@ xfs_alloc_ag_vextent_near(
>> if (!acur.len) {
>> if (acur.busy) {
>> trace_xfs_alloc_near_busy(args);
>> - xfs_extent_busy_flush(args->mp, args->pag,
>> -       acur.busy_gen);
>> + error = xfs_extent_busy_flush(args->tp, args->pag,
>> +       acur.busy_gen, flags);
>> + if (error)
>> + goto out;
>> + flags &= ~XFS_ALLOC_FLAG_TRYFLUSH;
>> goto restart;
>> }
>> trace_xfs_alloc_size_neither(args);
>> @@ -1592,6 +1596,7 @@ STATIC int /* error */
>> xfs_alloc_ag_vextent_size(
>> xfs_alloc_arg_t *args) /* allocation argument structure */
>> {
>> + int flags = args->flags | XFS_ALLOC_FLAG_TRYFLUSH;
> 
> This variable holds XFS_ALLOC_FLAG_* values that are passed only to
> xfs_extent_busy_flush, right?  Could this be named "busyflags" or
> similar to make that clearer?
> 
> (Clearer to 6hr^W6mo from now Darrick who won't remember this at all.)

Yes, the ‘flags’ is only passed to xfs_extent_busy_flush(). It will be
‘busyflags’ in the next drop.

> 
>> struct xfs_agf *agf = args->agbp->b_addr;
>> struct xfs_btree_cur *bno_cur; /* cursor for bno btree */
>> struct xfs_btree_cur *cnt_cur; /* cursor for cnt btree */
>> @@ -1670,8 +1675,13 @@ xfs_alloc_ag_vextent_size(
>> xfs_btree_del_cursor(cnt_cur,
>>      XFS_BTREE_NOERROR);
>> trace_xfs_alloc_size_busy(args);
>> - xfs_extent_busy_flush(args->mp,
>> - args->pag, busy_gen);
>> + error = xfs_extent_busy_flush(args->tp, args->pag,
>> + busy_gen, flags);
>> + if (error) {
>> + cnt_cur = NULL;
>> + goto error0;
>> + }
>> + flags &= ~XFS_ALLOC_FLAG_TRYFLUSH;
>> goto restart;
>> }
>> }
>> @@ -1755,7 +1765,13 @@ xfs_alloc_ag_vextent_size(
>> if (busy) {
>> xfs_btree_del_cursor(cnt_cur, XFS_BTREE_NOERROR);
>> trace_xfs_alloc_size_busy(args);
>> - xfs_extent_busy_flush(args->mp, args->pag, busy_gen);
>> + error = xfs_extent_busy_flush(args->tp, args->pag,
>> + busy_gen, flags);
>> + if (error) {
>> + cnt_cur = NULL;
>> + goto error0;
>> + }
>> + flags &= ~XFS_ALLOC_FLAG_TRYFLUSH;
>> goto restart;
>> }
>> goto out_nominleft;
>> @@ -2629,6 +2645,7 @@ xfs_alloc_fix_freelist(
>> targs.agno = args->agno;
>> targs.alignment = targs.minlen = targs.prod = 1;
>> targs.pag = pag;
>> + targs.flags = args->flags & XFS_ALLOC_FLAG_FREEING;
> 
> Hmm, I guess this propagates FREEING into the allocation that lengthens
> the AGFL.  Right?

Yes, the original call path is like this:
xfs_free_extent_fix_freelist()
  xfs_alloc_fix_freelist(args, XFS_ALLOC_FLAG_FREEING as flags)  
    xfs_alloc_ag_vextent_size(targs)
      xfs_extent_busy_flush(tp, pag, busy_gen, flags)

We don’t have a ‘flags’ parameter to xfs_alloc_ag_vextent_size().
Considering that args itself is used for the allocation, the ‘flags’ is for the same
purpose, so I moved ‘flags’ into args (struct xfs_alloc_arg), and thus
xfs_alloc_fix_freelist() doesn’t need the ‘flags’ parameter any longer since
the ‘flags’ is in ‘args’. In my draft code (not posted to this email list), I
removed the ‘flags’ parameter from xfs_alloc_fix_freelist(). But when I was
reviewing the code change, I found that ‘flags’ is used heavily in that function,
and accessing ‘args->flags’ instead of ‘flags’ may cost some extra CPU cycles.
So in the patch (for review) I kept the ‘flags’ parameter, which makes the code
ugly :D

> 
>> error = xfs_alloc_read_agfl(pag, tp, &agflbp);
>> if (error)
>> goto out_agbp_relse;
>> @@ -3572,6 +3589,7 @@ xfs_free_extent_fix_freelist(
>> args.mp = tp->t_mountp;
>> args.agno = pag->pag_agno;
>> args.pag = pag;
>> + args.flags = XFS_ALLOC_FLAG_FREEING;
>> 
>> /*
>>  * validate that the block number is legal - the enables us to detect
>> @@ -3580,7 +3598,7 @@ xfs_free_extent_fix_freelist(
>> if (args.agno >= args.mp->m_sb.sb_agcount)
>> return -EFSCORRUPTED;
>> 
>> - error = xfs_alloc_fix_freelist(&args, XFS_ALLOC_FLAG_FREEING);
>> + error = xfs_alloc_fix_freelist(&args, args.flags);
> 
> Hm.  Why stuff XFS_ALLOC_FLAG_FREEING into args.flags here?  I /think/
> in this case it doesn't matter because nothing under
> xfs_alloc_fix_freelist accesses args->flags...?
> 
> AFAICT in xfs_alloc_fix_freelist, the difference between @args->flags
> and @flags is that @args is the allocation that we're trying to do,
> whereas @flags are for any allocation that might need to be done to fix
> the freelist.  IOWS, the only allocation we're doing *is* to fix the
> freelist, so they are one in the same.  Maybe?  Otherwise I can't tell
> why we'd convey two sets of XFS_ALLOC flags to xfs_alloc_fix_freelist.
> 
> So while I think this isn't incorrect, the flag mixing bothers me a bit
> because setting fields in a struct can have weird not so obvious side
> effects.
> 
> (Still trying to make sense of the addition of a @flags field to struct
> xfs_alloc_arg.)

Right, see my explanation above.

> 
>> if (error)
>> return error;
>> 
>> diff --git a/fs/xfs/libxfs/xfs_alloc.h b/fs/xfs/libxfs/xfs_alloc.h
>> index 2b246d74c189..5038fba87784 100644
>> --- a/fs/xfs/libxfs/xfs_alloc.h
>> +++ b/fs/xfs/libxfs/xfs_alloc.h
>> @@ -24,6 +24,7 @@ unsigned int xfs_agfl_size(struct xfs_mount *mp);
>> #define XFS_ALLOC_FLAG_NORMAP 0x00000004  /* don't modify the rmapbt */
>> #define XFS_ALLOC_FLAG_NOSHRINK 0x00000008  /* don't shrink the freelist */
>> #define XFS_ALLOC_FLAG_CHECK 0x00000010  /* test only, don't modify args */
>> +#define XFS_ALLOC_FLAG_TRYFLUSH 0x00000020  /* don't block in busyextent flush*/
>> 
>> /*
>>  * Argument structure for xfs_alloc routines.
>> @@ -57,6 +58,7 @@ typedef struct xfs_alloc_arg {
>> #ifdef DEBUG
>> bool alloc_minlen_only; /* allocate exact minlen extent */
>> #endif
>> + int flags; /* XFS_ALLOC_FLAG_* */
>> } xfs_alloc_arg_t;
>> 
>> /*
>> diff --git a/fs/xfs/scrub/repair.c b/fs/xfs/scrub/repair.c
>> index 1b71174ec0d6..2ba28e4257fe 100644
>> --- a/fs/xfs/scrub/repair.c
>> +++ b/fs/xfs/scrub/repair.c
>> @@ -496,9 +496,9 @@ xrep_fix_freelist(
>> args.agno = sc->sa.pag->pag_agno;
>> args.alignment = 1;
>> args.pag = sc->sa.pag;
>> + args.flags = can_shrink ? 0 : XFS_ALLOC_FLAG_NOSHRINK;
>> 
>> - return xfs_alloc_fix_freelist(&args,
>> - can_shrink ? 0 : XFS_ALLOC_FLAG_NOSHRINK);
>> + return xfs_alloc_fix_freelist(&args, args.flags);
>> }
>> 
>> /*
>> diff --git a/fs/xfs/xfs_extent_busy.c b/fs/xfs/xfs_extent_busy.c
>> index f3d328e4a440..ea1c1857bf5b 100644
>> --- a/fs/xfs/xfs_extent_busy.c
>> +++ b/fs/xfs/xfs_extent_busy.c
>> @@ -567,18 +567,41 @@ xfs_extent_busy_clear(
>> /*
>>  * Flush out all busy extents for this AG.
>>  */
>> -void
>> +int
>> xfs_extent_busy_flush(
>> - struct xfs_mount *mp,
>> + struct xfs_trans *tp,
>> struct xfs_perag *pag,
>> - unsigned busy_gen)
>> + unsigned busy_gen,
>> + int flags)
>> {
>> DEFINE_WAIT (wait);
>> int error;
>> 
>> - error = xfs_log_force(mp, XFS_LOG_SYNC);
>> + error = xfs_log_force(tp->t_mountp, XFS_LOG_SYNC);
>> if (error)
>> - return;
>> + return error;
>> +
>> + /*
>> +  * If we are holding busy extents, the caller may not want to block
>> +  * straight away. If we are being told just to try a flush or progress
>> +  * has been made since we last skipped a busy extent, return
>> +  * immediately to allow the caller to try again. If we are freeing
>> +  * extents, we might actually be holding the only free extents in the
>> +  * transaction busy list and the log force won't resolve that
>> +  * situation. In that case, return -EAGAIN to tell the
>> +  * caller it needs to commit the busy extents it holds before retrying
>> +  * the extent free operation.
>> +  */
>> + if (!list_empty(&tp->t_busy)) {
>> + if (flags & XFS_ALLOC_FLAG_TRYFLUSH)
>> + return 0;
>> +
>> + if (busy_gen != READ_ONCE(pag->pagb_gen))
>> + return 0;
>> +
>> + if (flags & XFS_ALLOC_FLAG_FREEING)
>> + return -EAGAIN;
>> + }
>> 
>> do {
>> prepare_to_wait(&pag->pagb_wait, &wait, TASK_KILLABLE);
>> @@ -588,6 +611,7 @@ xfs_extent_busy_flush(
>> } while (1);
>> 
>> finish_wait(&pag->pagb_wait, &wait);
>> + return 0;
> 
> This part looks good.
> 
>> }
>> 
>> void
>> diff --git a/fs/xfs/xfs_extent_busy.h b/fs/xfs/xfs_extent_busy.h
>> index 4a118131059f..edeedb92e0df 100644
>> --- a/fs/xfs/xfs_extent_busy.h
>> +++ b/fs/xfs/xfs_extent_busy.h
>> @@ -51,9 +51,9 @@ bool
>> xfs_extent_busy_trim(struct xfs_alloc_arg *args, xfs_agblock_t *bno,
>> xfs_extlen_t *len, unsigned *busy_gen);
>> 
>> -void
>> -xfs_extent_busy_flush(struct xfs_mount *mp, struct xfs_perag *pag,
>> - unsigned busy_gen);
>> +int
>> +xfs_extent_busy_flush(struct xfs_trans *tp, struct xfs_perag *pag,
>> + unsigned busy_gen, int flags);
>> 
>> void
>> xfs_extent_busy_wait_all(struct xfs_mount *mp);
>> diff --git a/fs/xfs/xfs_extfree_item.c b/fs/xfs/xfs_extfree_item.c
>> index 011b50469301..3c5a9e9952ec 100644
>> --- a/fs/xfs/xfs_extfree_item.c
>> +++ b/fs/xfs/xfs_extfree_item.c
>> @@ -336,6 +336,25 @@ xfs_trans_get_efd(
>> return efdp;
>> }
>> 
>> +/*
>> + * Fill the EFD with all extents from the EFI and set the counter.
>> + * Note: the EFD should contain at least one extent already.
>> + */
>> +static void xfs_fill_efd_with_efi(struct xfs_efd_log_item *efdp)
>> +{
>> + struct xfs_efi_log_item *efip = efdp->efd_efip;
>> + uint                    i;
>> +
>> + if (efdp->efd_next_extent == efip->efi_format.efi_nextents)
>> + return;
>> +
>> + for (i = 0; i < efip->efi_format.efi_nextents; i++) {
>> +        efdp->efd_format.efd_extents[i] =
>> +        efip->efi_format.efi_extents[i];
>> + }
>> + efdp->efd_next_extent = efip->efi_format.efi_nextents;
>> +}
>> +
>> /*
>>  * Free an extent and log it to the EFD. Note that the transaction is marked
>>  * dirty regardless of whether the extent free succeeds or fails to support the
>> @@ -369,6 +388,10 @@ xfs_trans_free_extent(
>> error = __xfs_free_extent(tp, xefi->xefi_startblock,
>> xefi->xefi_blockcount, &oinfo, XFS_AG_RESV_NONE,
>> xefi->xefi_flags & XFS_EFI_SKIP_DISCARD);
>> + if (error == -EAGAIN) {
>> + xfs_fill_efd_with_efi(efdp);
>> + return error;
>> + }
>> /*
>>  * Mark the transaction dirty, even on error. This ensures the
>>  * transaction is aborted, which:
>> @@ -476,7 +499,8 @@ xfs_extent_free_finish_item(
>> xefi = container_of(item, struct xfs_extent_free_item, xefi_list);
>> 
>> error = xfs_trans_free_extent(tp, EFD_ITEM(done), xefi);
>> - kmem_cache_free(xfs_extfree_item_cache, xefi);
>> + if (error != -EAGAIN)
>> + kmem_cache_free(xfs_extfree_item_cache, xefi);
>> return error;
>> }
>> 
>> @@ -633,6 +657,17 @@ xfs_efi_item_recover(
>> fake.xefi_blockcount = extp->ext_len;
>> 
>> error = xfs_trans_free_extent(tp, efdp, &fake);
>> + if (error == -EAGAIN) {
> 
> Curious, is this patch based off of 6.4-rc?  There should be a
> xfs_extent_free_get_group / xfs_extent_free_put_group here.

It’s still 6.3.0-rc6, the newest when I started to work on the deadlock issue.
I will rebase to the current version for the next drop.

> 
>> + xfs_free_extent_later(tp, fake.xefi_startblock,
>> + fake.xefi_blockcount, &XFS_RMAP_OINFO_ANY_OWNER);
>> + /*
>> +  * try to free as many extents as possible with current
>> +  * transaction
>> +  */
>> + error = 0;
>> + continue;
>> + };
> 
> If we get EAGAIN, why not let the unfinished parts of the recovered EFI
> work get turned into a new EFI?

That is an optimization. I wanted to finish as many records as possible with the
current transaction. Say we hit the first EAGAIN on (AG1, bno2, len2) in this case:
EFI {(AG1, bno1, len1), (AG1, bno2, len2), (AG2, bno3, len3), (AG2, bno4, len4)}
With the current patch, we need 1 extra EFI:
orig EFI {(AG1, bno1, len1), (AG2, bno3, len3)} # records shown are the really done ones
new EFI1 {(AG1, bno2, len2), (AG2, bno4, len4)}

Otherwise, if we move all the records (from the current one onwards) to a new EFI
on the first EAGAIN, we would need:
orig EFI {(AG1, bno1, len1)}
new EFI1 {(AG1, bno2, len2), (AG2, bno3, len3)}
new EFI2 {(AG2, bno4, len4)}

In the above case, we save one EFI with the optimization.

> 
>> +
>> if (error == -EFSCORRUPTED)
>> XFS_CORRUPTION_ERROR(__func__, XFS_ERRLEVEL_LOW, mp,
>> extp, sizeof(*extp));
>> diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c
>> index 322eb2ee6c55..00bfe9683fa8 100644
>> --- a/fs/xfs/xfs_log_recover.c
>> +++ b/fs/xfs/xfs_log_recover.c
>> @@ -2540,30 +2540,27 @@ xlog_recover_process_intents(
>> struct xfs_log_item *lip;
>> struct xfs_ail *ailp;
>> int error = 0;
>> -#if defined(DEBUG) || defined(XFS_WARN)
>> - xfs_lsn_t last_lsn;
>> -#endif
>> + xfs_lsn_t threshold_lsn;
>> 
>> ailp = log->l_ailp;
>> + threshold_lsn = xfs_ail_max_lsn(ailp);
>> spin_lock(&ailp->ail_lock);
>> -#if defined(DEBUG) || defined(XFS_WARN)
>> - last_lsn = xlog_assign_lsn(log->l_curr_cycle, log->l_curr_block);
>> -#endif
>> +
>> for (lip = xfs_trans_ail_cursor_first(ailp, &cur, 0);
>>      lip != NULL;
>>      lip = xfs_trans_ail_cursor_next(ailp, &cur)) {
>> const struct xfs_item_ops *ops;
>> + /*
>> +  * Original redo EFI could be split into new EFIs. Those
>> +  * new EFIs are supposed to be processed in capture_list.
>> +  * Stop here when original redo intents are done.
>> +  */
>> + if (XFS_LSN_CMP(threshold_lsn, lip->li_lsn) < 0)
>> + break;
> 
> I'm not sure what this change accomplishes?  Is it because of the
> continue; statement in xfs_efi_item_recover?

The function xlog_recover_process_intents() iterates over all the intent
items in the AIL and processes them. According to the comments:

2519  * When this is called, all of the log intent items which did not have
2520  * corresponding log done items should be in the AIL.  What we do now is update
2521  * the data structures associated with each one.

I am thinking xlog_recover_process_intents() should iterate only over the ones
which are already in the AIL. If we don't set a threshold here, it will continue
to process new intents added while we are processing the existing intents. In
this case, we add new EFIs, and they can be inserted into the AIL by the commit
callbacks before we finish the iteration here. The new EFIs are then picked up
by xlog_recover_process_intents(). If xlog_recover_process_intents() continues
to process these new EFIs, it will double-free the extents in those EFIs (they
are also freed via xlog_finish_defer_ops()), thus causing problems.

thanks,
wengang
Wengang Wang May 22, 2023, 6:20 p.m. UTC | #4
> On May 21, 2023, at 6:17 PM, Dave Chinner <david@fromorbit.com> wrote:
> 
> On Fri, May 19, 2023 at 10:18:29AM -0700, Wengang Wang wrote:
>> The following calltrace is seen:
>> #0 context_switch() kernel/sched/core.c:3881
>> #1 __schedule() kernel/sched/core.c:5111
>> #2 schedule() kernel/sched/core.c:5186
>> #3 xfs_extent_busy_flush() fs/xfs/xfs_extent_busy.c:598
>> #4 xfs_alloc_ag_vextent_size() fs/xfs/libxfs/xfs_alloc.c:1641
>> #5 xfs_alloc_ag_vextent() fs/xfs/libxfs/xfs_alloc.c:828
>> #6 xfs_alloc_fix_freelist() fs/xfs/libxfs/xfs_alloc.c:2362
>> #7 xfs_free_extent_fix_freelist() fs/xfs/libxfs/xfs_alloc.c:3029
>> #8 __xfs_free_extent() fs/xfs/libxfs/xfs_alloc.c:3067
>> #9 xfs_trans_free_extent() fs/xfs/xfs_extfree_item.c:370
>> #10 xfs_efi_recover() fs/xfs/xfs_extfree_item.c:626
>> #11 xlog_recover_process_efi() fs/xfs/xfs_log_recover.c:4605
>> #12 xlog_recover_process_intents() fs/xfs/xfs_log_recover.c:4893
>> #13 xlog_recover_finish() fs/xfs/xfs_log_recover.c:5824
>> #14 xfs_log_mount_finish() fs/xfs/xfs_log.c:764
>> #15 xfs_mountfs() fs/xfs/xfs_mount.c:978
>> #16 xfs_fs_fill_super() fs/xfs/xfs_super.c:1908
>> #17 mount_bdev() fs/super.c:1417
>> #18 xfs_fs_mount() fs/xfs/xfs_super.c:1985
>> #19 legacy_get_tree() fs/fs_context.c:647
>> #20 vfs_get_tree() fs/super.c:1547
>> #21 do_new_mount() fs/namespace.c:2843
>> #22 do_mount() fs/namespace.c:3163
>> #23 ksys_mount() fs/namespace.c:3372
>> #24 __do_sys_mount() fs/namespace.c:3386
>> #25 __se_sys_mount() fs/namespace.c:3383
>> #26 __x64_sys_mount() fs/namespace.c:3383
>> #27 do_syscall_64() arch/x86/entry/common.c:296
>> #28 entry_SYSCALL_64() arch/x86/entry/entry_64.S:180
>> 
>> During the processing of the 2nd and subsequent records in an EFI, we
>> are waiting for the busy blocks to be cleaned, but the only busy extent
>> is still held in the current xfs_trans->t_busy. That busy extent was
>> added when processing the previous EFI record, and because it is not
>> committed yet, it can't be cleaned.
>> 
>> To avoid the above deadlock, don't block in xfs_extent_busy_flush() when
>> allocating AGFL blocks; instead, return -EAGAIN. On receiving -EAGAIN we
>> are able to retry that EFI record with a new transaction after committing
>> the old transaction. With the old transaction committed, the busy extent
>> attached to it gets the chance to be cleaned. On the retry, there are no
>> existing busy extents in the new transaction, thus no deadlock.
>> 
>> Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com>
> 
> There are multiple changes in this patchset where it isn't immediately
> obvious why they have been done. Hence this needs to be split into
> one set of changes per patch.

Yep, I was thinking the same, but wanted to hear what you'd say :D
Will split the patch in the next drop.

> 
>> ---
>> fs/xfs/libxfs/xfs_alloc.c | 30 ++++++++++++++++++++++++------
>> fs/xfs/libxfs/xfs_alloc.h |  2 ++
>> fs/xfs/scrub/repair.c     |  4 ++--
>> fs/xfs/xfs_extent_busy.c  | 34 +++++++++++++++++++++++++++++-----
>> fs/xfs/xfs_extent_busy.h  |  6 +++---
>> fs/xfs/xfs_extfree_item.c | 37 ++++++++++++++++++++++++++++++++++++-
>> fs/xfs/xfs_log_recover.c  | 23 ++++++++++-------------
>> fs/xfs/xfs_trans_ail.c    |  2 +-
>> fs/xfs/xfs_trans_priv.h   |  1 +
>> 9 files changed, 108 insertions(+), 31 deletions(-)
>> 
>> diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c
>> index 203f16c48c19..abfd2acb3053 100644
>> --- a/fs/xfs/libxfs/xfs_alloc.c
>> +++ b/fs/xfs/libxfs/xfs_alloc.c
>> @@ -1491,6 +1491,7 @@ STATIC int
>> xfs_alloc_ag_vextent_near(
>> struct xfs_alloc_arg *args)
>> {
>> + int flags = args->flags | XFS_ALLOC_FLAG_TRYFLUSH;
> 
> The first set of changes is encoding allocation flags into the
> struct xfs_alloc_arg. I can sort of see why you did this, but it's
> done in an inconsistent manner, and I don't see why some of the
> changes were made and not others. This conversion belongs in its
> own patch.

yep, please see my reply to Darrick.

> 
>> struct xfs_alloc_cur acur = {};
>> int error; /* error code */
>> int i; /* result code, temporary */
>> @@ -1564,8 +1565,11 @@ xfs_alloc_ag_vextent_near(
>> if (!acur.len) {
>> if (acur.busy) {
>> trace_xfs_alloc_near_busy(args);
>> - xfs_extent_busy_flush(args->mp, args->pag,
>> -       acur.busy_gen);
>> + error = xfs_extent_busy_flush(args->tp, args->pag,
>> +       acur.busy_gen, flags);
>> + if (error)
>> + goto out;
>> + flags &= ~XFS_ALLOC_FLAG_TRYFLUSH;
>> goto restart;
>> }
>> trace_xfs_alloc_size_neither(args);
>> @@ -1592,6 +1596,7 @@ STATIC int /* error */
>> xfs_alloc_ag_vextent_size(
>> xfs_alloc_arg_t *args) /* allocation argument structure */
>> {
>> + int flags = args->flags | XFS_ALLOC_FLAG_TRYFLUSH;
>> struct xfs_agf *agf = args->agbp->b_addr;
>> struct xfs_btree_cur *bno_cur; /* cursor for bno btree */
>> struct xfs_btree_cur *cnt_cur; /* cursor for cnt btree */
>> @@ -1670,8 +1675,13 @@ xfs_alloc_ag_vextent_size(
>> xfs_btree_del_cursor(cnt_cur,
>>      XFS_BTREE_NOERROR);
>> trace_xfs_alloc_size_busy(args);
>> - xfs_extent_busy_flush(args->mp,
>> - args->pag, busy_gen);
>> + error = xfs_extent_busy_flush(args->tp, args->pag,
>> + busy_gen, flags);
>> + if (error) {
>> + cnt_cur = NULL;
>> + goto error0;
>> + }
>> + flags &= ~XFS_ALLOC_FLAG_TRYFLUSH;
>> goto restart;
>> }
>> }
> 
> The cursor handling for the error case could be improved by
> moving the xfs_btree_del_cursor(cnt_cur...) call, like this:
> 
> if (i == 0) {
> /* ... */
> trace_xfs_alloc_size_busy(args);
> error = xfs_extent_busy_flush(args->tp, args->pag,
> busy_gen, flags);
> if (error)
> goto error0;
> 
> xfs_btree_del_cursor(cnt_cur, XFS_BTREE_NOERROR);
> flags &= ~XFS_ALLOC_FLAG_TRYFLUSH;
> goto restart;
> }
> 

yep, looks better.

>> @@ -1755,7 +1765,13 @@ xfs_alloc_ag_vextent_size(
>> if (busy) {
>> xfs_btree_del_cursor(cnt_cur, XFS_BTREE_NOERROR);
>> trace_xfs_alloc_size_busy(args);
>> - xfs_extent_busy_flush(args->mp, args->pag, busy_gen);
>> + error = xfs_extent_busy_flush(args->tp, args->pag,
>> + busy_gen, flags);
>> + if (error) {
>> + cnt_cur = NULL;
>> + goto error0;
>> + }
>> + flags &= ~XFS_ALLOC_FLAG_TRYFLUSH;
>> goto restart;
>> }
>> goto out_nominleft;
> 
> Same here.
> 

yep.

> I'd also enhance the trace_xfs_alloc_size_error() call to
> trace the error that occurred so we can actually see when an EAGAIN
> error occurs when we are tracing...

will do.

> 
> 
>> @@ -2629,6 +2645,7 @@ xfs_alloc_fix_freelist(
>> targs.agno = args->agno;
>> targs.alignment = targs.minlen = targs.prod = 1;
>> targs.pag = pag;
>> + targs.flags = args->flags & XFS_ALLOC_FLAG_FREEING;
>> error = xfs_alloc_read_agfl(pag, tp, &agflbp);
>> if (error)
>> goto out_agbp_relse;
> 
> This is a change in behaviour for AGFL filling. This should be in
> its own patch, explaining why the change of behaviour is necessary.
> 
> Also, it could just be:
> 
> targs.flags = flags & XFS_ALLOC_FLAG_FREEING;

I actually wanted to remove the ‘flags’ parameter as it's already in args,
but I had a slight performance concern about doing so; I want the code to be
as good as possible. See my reply to Darrick. What's your take?
If ‘flags’ is removed, it would still be
targs.flags = args->flags & XFS_ALLOC_FLAG_FREEING;

> 
>> @@ -3572,6 +3589,7 @@ xfs_free_extent_fix_freelist(
>> args.mp = tp->t_mountp;
>> args.agno = pag->pag_agno;
>> args.pag = pag;
>> + args.flags = XFS_ALLOC_FLAG_FREEING;
>> 
>> /*
>>  * validate that the block number is legal - this enables us to detect
>> @@ -3580,7 +3598,7 @@ xfs_free_extent_fix_freelist(
>> if (args.agno >= args.mp->m_sb.sb_agcount)
>> return -EFSCORRUPTED;
>> 
>> - error = xfs_alloc_fix_freelist(&args, XFS_ALLOC_FLAG_FREEING);
>> + error = xfs_alloc_fix_freelist(&args, args.flags);
>> if (error)
>> return error;
> 
> This is where I see inconsistency in the usage of args.flags. If we
> are going to move the XFS_ALLOC_FLAG_* state into args->flags, then
> it should not be passed as a separate parameter to functions that
> take a struct xfs_alloc_arg. i.e. we either use args->flags
> everywhere for everything, or we pass a separate flags argument
> everywhere as we currently do.
> 

Yes, please see the same reply to Darrick for that part.

> 
>> diff --git a/fs/xfs/libxfs/xfs_alloc.h b/fs/xfs/libxfs/xfs_alloc.h
>> index 2b246d74c189..5038fba87784 100644
>> --- a/fs/xfs/libxfs/xfs_alloc.h
>> +++ b/fs/xfs/libxfs/xfs_alloc.h
>> @@ -24,6 +24,7 @@ unsigned int xfs_agfl_size(struct xfs_mount *mp);
>> #define XFS_ALLOC_FLAG_NORMAP 0x00000004  /* don't modify the rmapbt */
>> #define XFS_ALLOC_FLAG_NOSHRINK 0x00000008  /* don't shrink the freelist */
>> #define XFS_ALLOC_FLAG_CHECK 0x00000010  /* test only, don't modify args */
>> +#define XFS_ALLOC_FLAG_TRYFLUSH 0x00000020  /* don't block in busyextent flush*/
> 
> Convert all these to (1U << x) format while you are adding a new
> flag...

OK.
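
Something like this, I suppose (a sketch only; TRYLOCK and FREEING aren't
quoted above, so I'm assuming their current bit positions carry over):

#define XFS_ALLOC_FLAG_TRYLOCK	(1U << 0)  /* use trylock for buffer locking */
#define XFS_ALLOC_FLAG_FREEING	(1U << 1)  /* indicate caller is freeing extents */
#define XFS_ALLOC_FLAG_NORMAP	(1U << 2)  /* don't modify the rmapbt */
#define XFS_ALLOC_FLAG_NOSHRINK	(1U << 3)  /* don't shrink the freelist */
#define XFS_ALLOC_FLAG_CHECK	(1U << 4)  /* test only, don't modify args */
#define XFS_ALLOC_FLAG_TRYFLUSH	(1U << 5)  /* don't block in busy extent flush */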

> 
>> 
>> /*
>>  * Argument structure for xfs_alloc routines.
>> @@ -57,6 +58,7 @@ typedef struct xfs_alloc_arg {
>> #ifdef DEBUG
>> bool alloc_minlen_only; /* allocate exact minlen extent */
>> #endif
>> + int flags; /* XFS_ALLOC_FLAG_* */
> 
> And flags fields should be unsigned.

OK, that follows from changing the defines to unsigned ('U') values.

> 
>> } xfs_alloc_arg_t;
>> 
>> /*
>> diff --git a/fs/xfs/scrub/repair.c b/fs/xfs/scrub/repair.c
>> index 1b71174ec0d6..2ba28e4257fe 100644
>> --- a/fs/xfs/scrub/repair.c
>> +++ b/fs/xfs/scrub/repair.c
>> @@ -496,9 +496,9 @@ xrep_fix_freelist(
>> args.agno = sc->sa.pag->pag_agno;
>> args.alignment = 1;
>> args.pag = sc->sa.pag;
>> + args.flags = can_shrink ? 0 : XFS_ALLOC_FLAG_NOSHRINK;
>> 
>> - return xfs_alloc_fix_freelist(&args,
>> - can_shrink ? 0 : XFS_ALLOC_FLAG_NOSHRINK);
>> + return xfs_alloc_fix_freelist(&args, args.flags);
> 
> Same comment as above about just passing separate flags everywhere
> or just passing args. This conversion is a separate patch...

Yes.

> 
>> diff --git a/fs/xfs/xfs_extent_busy.c b/fs/xfs/xfs_extent_busy.c
>> index f3d328e4a440..ea1c1857bf5b 100644
>> --- a/fs/xfs/xfs_extent_busy.c
>> +++ b/fs/xfs/xfs_extent_busy.c
>> @@ -567,18 +567,41 @@ xfs_extent_busy_clear(
>> /*
>>  * Flush out all busy extents for this AG.
>>  */
>> -void
>> +int
>> xfs_extent_busy_flush(
>> - struct xfs_mount *mp,
>> + struct xfs_trans *tp,
>> struct xfs_perag *pag,
>> - unsigned busy_gen)
>> + unsigned busy_gen,
>> + int flags)
>> {
>> DEFINE_WAIT (wait);
>> int error;
>> 
>> - error = xfs_log_force(mp, XFS_LOG_SYNC);
>> + error = xfs_log_force(tp->t_mountp, XFS_LOG_SYNC);
>> if (error)
>> - return;
>> + return error;
>> +
>> + /*
>> +  * If we are holding busy extents, the caller may not want to block
>> +  * straight away. If we are being told just to try a flush or progress
>> +  * has been made since we last skipped a busy extent, return
>> +  * immediately to allow the caller to try again. If we are freeing
>> +  * extents, we might actually be holding the only free extents in the
>> +  * transaction busy list and the log force won't resolve that
>> +  * situation. In this case, return -EAGAIN to tell the
>> +  * caller it needs to commit the busy extents it holds before retrying
>> +  * the extent free operation.
>> +  */
>> + if (!list_empty(&tp->t_busy)) {
>> + if (flags & XFS_ALLOC_FLAG_TRYFLUSH)
>> + return 0;
>> +
>> + if (busy_gen != READ_ONCE(pag->pagb_gen))
>> + return 0;
>> +
>> + if (flags & XFS_ALLOC_FLAG_FREEING)
>> + return -EAGAIN;
>> + }
>> 
>> do {
>> prepare_to_wait(&pag->pagb_wait, &wait, TASK_KILLABLE);
>> @@ -588,6 +611,7 @@ xfs_extent_busy_flush(
>> } while (1);
>> 
>> finish_wait(&pag->pagb_wait, &wait);
>> + return 0;
>> }
>> 
>> void
>> diff --git a/fs/xfs/xfs_extent_busy.h b/fs/xfs/xfs_extent_busy.h
>> index 4a118131059f..edeedb92e0df 100644
>> --- a/fs/xfs/xfs_extent_busy.h
>> +++ b/fs/xfs/xfs_extent_busy.h
>> @@ -51,9 +51,9 @@ bool
>> xfs_extent_busy_trim(struct xfs_alloc_arg *args, xfs_agblock_t *bno,
>> xfs_extlen_t *len, unsigned *busy_gen);
>> 
>> -void
>> -xfs_extent_busy_flush(struct xfs_mount *mp, struct xfs_perag *pag,
>> - unsigned busy_gen);
>> +int
>> +xfs_extent_busy_flush(struct xfs_trans *tp, struct xfs_perag *pag,
>> + unsigned busy_gen, int flags);
>> 
>> void
>> xfs_extent_busy_wait_all(struct xfs_mount *mp);
>> diff --git a/fs/xfs/xfs_extfree_item.c b/fs/xfs/xfs_extfree_item.c
>> index 011b50469301..3c5a9e9952ec 100644
>> --- a/fs/xfs/xfs_extfree_item.c
>> +++ b/fs/xfs/xfs_extfree_item.c
>> @@ -336,6 +336,25 @@ xfs_trans_get_efd(
>> return efdp;
>> }
>> 
>> +/*
>> + * Fill the EFD with all extents from the EFI and set the counter.
>> + * Note: the EFD should contain at least one extent already.
>> + */
>> +static void xfs_fill_efd_with_efi(struct xfs_efd_log_item *efdp)
>> +{
>> + struct xfs_efi_log_item *efip = efdp->efd_efip;
>> + uint                    i;
>> +
>> + if (efdp->efd_next_extent == efip->efi_format.efi_nextents)
>> + return;
>> +
>> + for (i = 0; i < efip->efi_format.efi_nextents; i++) {
>> +        efdp->efd_format.efd_extents[i] =
>> +        efip->efi_format.efi_extents[i];
>> + }
>> + efdp->efd_next_extent = efip->efi_format.efi_nextents;
>> +}
>> +
> 
> Ok, but it doesn't dirty the transaction or the EFD, which means....

Actually, EAGAIN shouldn't happen with the first record in an EFI because
trans->t_busy is empty during AGFL block allocation for the first record.
So the dirtying work should already be done with the first one.

> 
>> @@ -369,6 +388,10 @@ xfs_trans_free_extent(
>> error = __xfs_free_extent(tp, xefi->xefi_startblock,
>> xefi->xefi_blockcount, &oinfo, XFS_AG_RESV_NONE,
>> xefi->xefi_flags & XFS_EFI_SKIP_DISCARD);
>> + if (error == -EAGAIN) {
>> + xfs_fill_efd_with_efi(efdp);
>> + return error;
>> + }
> 
> .... this is incorrectly placed.
> 
> The very next lines say:
> 
>> /*
>>  * Mark the transaction dirty, even on error. This ensures the
>>  * transaction is aborted, which:
> 
> i.e. we have to make the transaction and EFD log item dirty even if
> we have an error. In this case, the error is not fatal, but we still
> have to ensure that we commit the EFD when we roll the transaction.
> Hence the transaction and EFD still need to be dirtied on -EAGAIN...

see above.

> 
>> @@ -476,7 +499,8 @@ xfs_extent_free_finish_item(
>> xefi = container_of(item, struct xfs_extent_free_item, xefi_list);
>> 
>> error = xfs_trans_free_extent(tp, EFD_ITEM(done), xefi);
>> - kmem_cache_free(xfs_extfree_item_cache, xefi);
>> + if (error != -EAGAIN)
>> + kmem_cache_free(xfs_extfree_item_cache, xefi);
>> return error;
>> }
> 
> Ok, that's a bit nasty. Let's do that the same way as
> xfs_refcount_update_finish_item():
> 
> error = xfs_trans_free_extent(tp, EFD_ITEM(done), xefi);
> 
> /*
>  * Do we need to roll the transaction and retry to avoid a busy
>  * extent deadlock?
>  */
> if (error == -EAGAIN)
> return error;
> 
> kmem_cache_free(xfs_extfree_item_cache, xefi);
> return error;
> 

OK.

>> 
>> @@ -633,6 +657,17 @@ xfs_efi_item_recover(
>> fake.xefi_blockcount = extp->ext_len;
>> 
>> error = xfs_trans_free_extent(tp, efdp, &fake);
>> + if (error == -EAGAIN) {
>> + xfs_free_extent_later(tp, fake.xefi_startblock,
>> + fake.xefi_blockcount, &XFS_RMAP_OINFO_ANY_OWNER);
>> + /*
>> +  * try to free as many extents as possible with current
>> +  * transaction
>> +  */
>> + error = 0;
>> + continue;
>> + };
> 
> This looks like it may be problematic - I'd prefer this code to be
> obviously correct before we try to do any fancy optimisations to
> minimise transaction counts. That is, once we get a -EAGAIN error,
> we should defer the rest of the extents and roll the transaction
> same as is done in xfs_cui_item_recover().
> 
> That is, when we get -EAGAIN, set a boolean "requeue_only" flag
> and if that is set simply call xfs_free_extent_later() directly
> until the EFI is exhausted.

OK.
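
So the recovery loop would look roughly like this (a sketch only, modeled on
xfs_cui_item_recover(); 'requeue_only' is a local flag I'm assuming here):

	bool	requeue_only = false;

	for (i = 0; i < efip->efi_format.efi_nextents; i++) {
		extp = &efip->efi_format.efi_extents[i];
		fake.xefi_startblock = extp->ext_start;
		fake.xefi_blockcount = extp->ext_len;

		if (!requeue_only) {
			error = xfs_trans_free_extent(tp, efdp, &fake);
			if (error == -EAGAIN) {
				/* defer this and all remaining extents */
				error = 0;
				requeue_only = true;
			} else if (error)
				break;
		}
		if (requeue_only)
			xfs_free_extent_later(tp, fake.xefi_startblock,
					fake.xefi_blockcount,
					&XFS_RMAP_OINFO_ANY_OWNER);
	}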

> 
>> diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c
>> index 322eb2ee6c55..00bfe9683fa8 100644
>> --- a/fs/xfs/xfs_log_recover.c
>> +++ b/fs/xfs/xfs_log_recover.c
>> @@ -2540,30 +2540,27 @@ xlog_recover_process_intents(
>> struct xfs_log_item *lip;
>> struct xfs_ail *ailp;
>> int error = 0;
>> -#if defined(DEBUG) || defined(XFS_WARN)
>> - xfs_lsn_t last_lsn;
>> -#endif
>> + xfs_lsn_t threshold_lsn;
>> 
>> ailp = log->l_ailp;
>> + threshold_lsn = xfs_ail_max_lsn(ailp);
>> spin_lock(&ailp->ail_lock);
>> -#if defined(DEBUG) || defined(XFS_WARN)
>> - last_lsn = xlog_assign_lsn(log->l_curr_cycle, log->l_curr_block);
> 
> xfs_ail_max_lsn() and l_curr_cycle/l_curr_block are not the same
> thing.  max_lsn points to the lsn of the last entry in the AIL (in
> memory state), whilst curr_cycle/block points to the current
> physical location of the log head in the on-disk journal.
> 

Yes, I intended to use the lsn of the last entry in the AIL.
For the problem with xlog_recover_process_intents(), please see my reply to
Darrick. On seeing the problem, my first attempt was to use “last_lsn” to stop
the iteration, but that didn't help: last_lsn was found to be quite a bit bigger than even
the new EFI's lsn. Using xfs_ail_max_lsn() instead solved the problem.


> In this case, we can't use in-memory state to determine where to
> stop the initial intent replay - recovery of other items may have
> inserted new intents beyond the end of the physical region being
> recovered, in which case using xfs_ail_max_lsn() will result in
> incorrect behaviour here.

Yes, this patch is one of those (if any others exist) that introduce new intents (EFIs here).
We add the new intents to the transaction first (xfs_defer_create_intent()), then add
the deferred operations to ‘capture_list’. And finally, the deferred operations in
‘capture_list’ are processed after the intent iteration on the AIL.

For other existing cases (if there are any) where new intents are added,
do they not use the capture_list for deferred operations? Do you have an example?
If so, I think we should follow their approach instead of adding the deferred operations
(but relying on the intents on the AIL).

If we don't have an existing example, what's the best way to solve the double free issue
in xlog_recover_process_intents()? Should we only add the new EFIs to the transaction
(xfs_defer_create_intent()), but not actually add the deferred operations to capture_list
(via xfs_defer_ops_capture_and_commit())? That looks tricky to me.


> 
>> -#endif
>> +
>> for (lip = xfs_trans_ail_cursor_first(ailp, &cur, 0);
>>      lip != NULL;
>>      lip = xfs_trans_ail_cursor_next(ailp, &cur)) {
>> const struct xfs_item_ops *ops;
>> + /*
>> +  * Original redo EFI could be split into new EFIs. Those
>> +  * new EFIs are supposed to be processed in capture_list.
>> +  * Stop here when original redo intents are done.
>> +  */
>> + if (XFS_LSN_CMP(threshold_lsn, lip->li_lsn) < 0)
>> + break;
>> 
>> if (!xlog_item_is_intent(lip))
>> break;
>> 
>> - /*
>> -  * We should never see a redo item with a LSN higher than
>> -  * the last transaction we found in the log at the start
>> -  * of recovery.
>> -  */
>> - ASSERT(XFS_LSN_CMP(last_lsn, lip->li_lsn) >= 0);
>> -
>> /*
>>  * NOTE: If your intent processing routine can create more
>>  * deferred ops, you /must/ attach them to the capture list in
> 
> As it is, I don't understand why this change is necessary and it is
> not explained anywhere. That is, new intents should go to the end of
> the AIL past the LSN of any intent that has been read from the
> region of the log we are recovering. The first intent that is
> processed will place non-intent items (e.g. a buffer or inode) in
> the CIL/AIL before any new intent item, so this check:
> 
> if (!xlog_item_is_intent(lip))
> break;
> 
> should always trigger before we would start processing newly logged
> (deferred) intents.
> 
> If this change is actually necessary, it needs to be in its own
> commit and explain what the problem is that it is fixing. If there
> is a bug here that needs fixing, it's something we need to know
> about because it will affect all previous kernels, too, not just
> ones with this patch that can relog EFIs in recovery.

This is also related to my fix for the problem in xlog_recover_process_intents().
If we use ‘threshold_lsn’ from xfs_ail_max_lsn(), which is smaller than
‘last_lsn’, and we stop when hitting threshold_lsn, the assertion on last_lsn
becomes meaningless. That's why I just removed it.

thanks,
wengang
Dave Chinner May 22, 2023, 10:40 p.m. UTC | #5
On Mon, May 22, 2023 at 06:20:11PM +0000, Wengang Wang wrote:
> > On May 21, 2023, at 6:17 PM, Dave Chinner <david@fromorbit.com> wrote:
> > On Fri, May 19, 2023 at 10:18:29AM -0700, Wengang Wang wrote:
> >> diff --git a/fs/xfs/xfs_extfree_item.c b/fs/xfs/xfs_extfree_item.c
> >> index 011b50469301..3c5a9e9952ec 100644
> >> --- a/fs/xfs/xfs_extfree_item.c
> >> +++ b/fs/xfs/xfs_extfree_item.c
> >> @@ -336,6 +336,25 @@ xfs_trans_get_efd(
> >> return efdp;
> >> }
> >> 
> >> +/*
> >> + * Fill the EFD with all extents from the EFI and set the counter.
> >> + * Note: the EFD should contain at least one extent already.
> >> + */
> >> +static void xfs_fill_efd_with_efi(struct xfs_efd_log_item *efdp)
> >> +{
> >> + struct xfs_efi_log_item *efip = efdp->efd_efip;
> >> + uint                    i;
> >> +
> >> + if (efdp->efd_next_extent == efip->efi_format.efi_nextents)
> >> + return;
> >> +
> >> + for (i = 0; i < efip->efi_format.efi_nextents; i++) {
> >> +        efdp->efd_format.efd_extents[i] =
> >> +        efip->efi_format.efi_extents[i];
> >> + }
> >> + efdp->efd_next_extent = efip->efi_format.efi_nextents;
> >> +}
> >> +
> > 
> > Ok, but it doesn't dirty the transaction or the EFD, which means....
> 
> Actually, EAGAIN shouldn't happen with the first record in an EFI because
> trans->t_busy is empty during AGFL block allocation for the first record.
> So the dirtying work should already be done with the first one.

You're assuming that the only thing we are going to want to return
-EAGAIN for freeing attempts for is busy extents. Being able to
restart btree operations by "commit and retry" opens up a
whole new set of performance optimisations we can make to the
btree code.

IOWs, I want this functionality to be generic in nature, not
tailored specifically to one situation where an -EAGAIN needs to be
returned to trigger a commit and retry.

> >> @@ -369,6 +388,10 @@ xfs_trans_free_extent(
> >> error = __xfs_free_extent(tp, xefi->xefi_startblock,
> >> xefi->xefi_blockcount, &oinfo, XFS_AG_RESV_NONE,
> >> xefi->xefi_flags & XFS_EFI_SKIP_DISCARD);
> >> + if (error == -EAGAIN) {
> >> + xfs_fill_efd_with_efi(efdp);
> >> + return error;
> >> + }
> > 
> > .... this is incorrectly placed.
> > 
> > The very next lines say:
> > 
> >> /*
> >>  * Mark the transaction dirty, even on error. This ensures the
> >>  * transaction is aborted, which:
> > 
> > i.e. we have to make the transaction and EFD log item dirty even if
> > we have an error. In this case, the error is not fatal, but we still
> > have to ensure that we commit the EFD when we roll the transaction.
> > Hence the transaction and EFD still need to be dirtied on -EAGAIN...
> 
> see above.

See above :)

> >> diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c
> >> index 322eb2ee6c55..00bfe9683fa8 100644
> >> --- a/fs/xfs/xfs_log_recover.c
> >> +++ b/fs/xfs/xfs_log_recover.c
> >> @@ -2540,30 +2540,27 @@ xlog_recover_process_intents(
> >> struct xfs_log_item *lip;
> >> struct xfs_ail *ailp;
> >> int error = 0;
> >> -#if defined(DEBUG) || defined(XFS_WARN)
> >> - xfs_lsn_t last_lsn;
> >> -#endif
> >> + xfs_lsn_t threshold_lsn;
> >> 
> >> ailp = log->l_ailp;
> >> + threshold_lsn = xfs_ail_max_lsn(ailp);
> >> spin_lock(&ailp->ail_lock);
> >> -#if defined(DEBUG) || defined(XFS_WARN)
> >> - last_lsn = xlog_assign_lsn(log->l_curr_cycle, log->l_curr_block);
> > 
> > xfs_ail_max_lsn() and l_curr_cycle/l_curr_block are not the same
> > thing.  max_lsn points to the lsn of the last entry in the AIL (in
> > memory state), whilst curr_cycle/block points to the current
> > physical location of the log head in the on-disk journal.
> > 
> 
> Yes, I intended to use the lsn of the last entry in the AIL.

Again, they are not the same thing: using the last entry in the
AIL here is incorrect. We want to replay all the items in the AIL
that were active in the log, not up to the last item in the AIL. The
actively recovered log region ends at last_lsn as per above, whilst
xfs_ail_max_lsn() is not guaranteed to be less than last_lsn before
we start walking it.

> For the problem with xlog_recover_process_intents(), please see my reply to
> Darrick. On seeing the problem, my first attempt was to use “last_lsn” to stop
> the iteration, but that didn't help: last_lsn was found to be quite a bit bigger than even
> the new EFI's lsn. Using xfs_ail_max_lsn() instead solved the problem.

In what case are we queuing a *new* intent into the AIL that has a
LSN less than xlog_assign_lsn(log->l_curr_cycle, log->l_curr_block)?
If we are doing that *anywhere*, then we have a likely journal
corruption bug in the code because it indicates we committed that
item to the journal over something in the log we are currently
replaying.

> > In this case, we can't use in-memory state to determine where to
> > stop the initial intent replay - recovery of other items may have
> > inserted new intents beyond the end of the physical region being
> > recovered, in which case using xfs_ail_max_lsn() will result in
> > incorrect behaviour here.
> 
> Yes, this patch is one of those (if any others exist) that introduce new intents (EFIs here).
> We add the new intents to the transaction first (xfs_defer_create_intent()), then add
> the deferred operations to ‘capture_list’. And finally, the deferred operations in
> ‘capture_list’ are processed after the intent iteration on the AIL.

The changes made in that transaction, including the newly logged
EFI, get committed before the rest of the work gets deferred via
xfs_defer_ops_capture_and_commit(). That commits the new efi (along
with all the changes that have already been made in the transaction)
to the CIL, and eventually the journal checkpoints and the new EFI
gets inserted into the AIL at the LSN of the checkpoint.

The LSN of the checkpoint is curr_cycle/block - the log head -
because that's where the start record of the checkpoint is
physically written.  As each iclog is filled, the log head moves
forward - it always points at the location that the next journal
write will be written to. At the end of a checkpoint, the LSN of the
start record is used for AIL insertion.

Hence if a new log item created by recovery has a LSN less than
last_lsn, then we have a serious bug somewhere that needs to be
found and fixed. The use of last_lsn tells us something has gone
badly wrong during recovery, the use of xfs_ail_max_lsn() removes
the detection of the issue and now we don't know that something has
gone badly wrong...

> For other existing cases (if there are any) where new intents are added,
> do they not use the capture_list for deferred operations? Do you have an example?
> If so, I think we should follow their approach instead of adding the deferred operations
> (but relying on the intents on the AIL).

All of the intent recovery stuff uses
xfs_defer_ops_capture_and_commit() to commit the intent being
replayed and cause all further new intent processing in that chain
to be defered until after all the intents recovered from the journal
have been iterated. All those new intents end up in the AIL at a LSN
index >= last_lsn.

Cheers,

Dave.
Darrick J. Wong May 23, 2023, 12:31 a.m. UTC | #6
On Mon, May 22, 2023 at 05:25:01PM +0000, Wengang Wang wrote:
> 
> 
> > On May 20, 2023, at 2:56 PM, Darrick J. Wong <djwong@kernel.org> wrote:
> > 
> > On Fri, May 19, 2023 at 10:18:29AM -0700, Wengang Wang wrote:
> >> The following calltrace is seen:
> >> #0 context_switch() kernel/sched/core.c:3881
> >> #1 __schedule() kernel/sched/core.c:5111
> >> #2 schedule() kernel/sched/core.c:5186
> >> #3 xfs_extent_busy_flush() fs/xfs/xfs_extent_busy.c:598
> >> #4 xfs_alloc_ag_vextent_size() fs/xfs/libxfs/xfs_alloc.c:1641
> >> #5 xfs_alloc_ag_vextent() fs/xfs/libxfs/xfs_alloc.c:828
> >> #6 xfs_alloc_fix_freelist() fs/xfs/libxfs/xfs_alloc.c:2362
> >> #7 xfs_free_extent_fix_freelist() fs/xfs/libxfs/xfs_alloc.c:3029
> >> #8 __xfs_free_extent() fs/xfs/libxfs/xfs_alloc.c:3067
> >> #9 xfs_trans_free_extent() fs/xfs/xfs_extfree_item.c:370
> >> #10 xfs_efi_recover() fs/xfs/xfs_extfree_item.c:626
> >> #11 xlog_recover_process_efi() fs/xfs/xfs_log_recover.c:4605
> >> #12 xlog_recover_process_intents() fs/xfs/xfs_log_recover.c:4893
> >> #13 xlog_recover_finish() fs/xfs/xfs_log_recover.c:5824
> >> #14 xfs_log_mount_finish() fs/xfs/xfs_log.c:764
> >> #15 xfs_mountfs() fs/xfs/xfs_mount.c:978
> >> #16 xfs_fs_fill_super() fs/xfs/xfs_super.c:1908
> >> #17 mount_bdev() fs/super.c:1417
> >> #18 xfs_fs_mount() fs/xfs/xfs_super.c:1985
> >> #19 legacy_get_tree() fs/fs_context.c:647
> >> #20 vfs_get_tree() fs/super.c:1547
> >> #21 do_new_mount() fs/namespace.c:2843
> >> #22 do_mount() fs/namespace.c:3163
> >> #23 ksys_mount() fs/namespace.c:3372
> >> #24 __do_sys_mount() fs/namespace.c:3386
> >> #25 __se_sys_mount() fs/namespace.c:3383
> >> #26 __x64_sys_mount() fs/namespace.c:3383
> >> #27 do_syscall_64() arch/x86/entry/common.c:296
> >> #28 entry_SYSCALL_64() arch/x86/entry/entry_64.S:180
> >> 
> >> During the processing of the 2nd and subsequent records in an EFI,
> >> it is waiting for the busy blocks to be cleaned, but the only busy extent
> >> is still held in the current xfs_trans->t_busy. That busy extent was added when
> >> processing the previous EFI record. And because that busy extent is not committed
> >> yet, it can't be cleaned.
> >> 
> >> To avoid the above deadlock, we don't block in xfs_extent_busy_flush() when
> >> allocating AGFL blocks; instead it returns -EAGAIN. On receiving -EAGAIN
> >> we are able to retry that EFI record with a new transaction after committing
> >> the old transaction. With the old transaction committed, the busy extent attached
> >> to the old transaction gets the chance to be cleaned. On the retry, there are
> >> no existing busy extents in the new transaction, thus no deadlock.
> >> 
> >> Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com>
> >> ---
> >> fs/xfs/libxfs/xfs_alloc.c | 30 ++++++++++++++++++++++++------
> >> fs/xfs/libxfs/xfs_alloc.h |  2 ++
> >> fs/xfs/scrub/repair.c     |  4 ++--
> >> fs/xfs/xfs_extent_busy.c  | 34 +++++++++++++++++++++++++++++-----
> >> fs/xfs/xfs_extent_busy.h  |  6 +++---
> >> fs/xfs/xfs_extfree_item.c | 37 ++++++++++++++++++++++++++++++++++++-
> >> fs/xfs/xfs_log_recover.c  | 23 ++++++++++-------------
> >> fs/xfs/xfs_trans_ail.c    |  2 +-
> >> fs/xfs/xfs_trans_priv.h   |  1 +
> >> 9 files changed, 108 insertions(+), 31 deletions(-)
> >> 
> >> diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c
> >> index 203f16c48c19..abfd2acb3053 100644
> >> --- a/fs/xfs/libxfs/xfs_alloc.c
> >> +++ b/fs/xfs/libxfs/xfs_alloc.c
> >> @@ -1491,6 +1491,7 @@ STATIC int
> >> xfs_alloc_ag_vextent_near(
> >> struct xfs_alloc_arg *args)
> >> {
> >> + int flags = args->flags | XFS_ALLOC_FLAG_TRYFLUSH;
> >> struct xfs_alloc_cur acur = {};
> >> int error; /* error code */
> >> int i; /* result code, temporary */
> >> @@ -1564,8 +1565,11 @@ xfs_alloc_ag_vextent_near(
> >> if (!acur.len) {
> >> if (acur.busy) {
> >> trace_xfs_alloc_near_busy(args);
> >> - xfs_extent_busy_flush(args->mp, args->pag,
> >> -       acur.busy_gen);
> >> + error = xfs_extent_busy_flush(args->tp, args->pag,
> >> +       acur.busy_gen, flags);
> >> + if (error)
> >> + goto out;
> >> + flags &= ~XFS_ALLOC_FLAG_TRYFLUSH;
> >> goto restart;
> >> }
> >> trace_xfs_alloc_size_neither(args);
> >> @@ -1592,6 +1596,7 @@ STATIC int /* error */
> >> xfs_alloc_ag_vextent_size(
> >> xfs_alloc_arg_t *args) /* allocation argument structure */
> >> {
> >> + int flags = args->flags | XFS_ALLOC_FLAG_TRYFLUSH;
> > 
> > This variable holds XFS_ALLOC_FLAG_* values that are passed only to
> > xfs_extent_busy_flush, right?  Could this be named "busyflags" or
> > similar to make that clearer?
> > 
> > (Clearer to 6hr^W6mo from now Darrick who won't remember this at all.)
> 
> Yes, the ‘flags’ is only passed to xfs_extent_busy_flush(); it will be ‘busyflags'
> in the next drop.

Ok.

> > 
> >> struct xfs_agf *agf = args->agbp->b_addr;
> >> struct xfs_btree_cur *bno_cur; /* cursor for bno btree */
> >> struct xfs_btree_cur *cnt_cur; /* cursor for cnt btree */
> >> @@ -1670,8 +1675,13 @@ xfs_alloc_ag_vextent_size(
> >> xfs_btree_del_cursor(cnt_cur,
> >>      XFS_BTREE_NOERROR);
> >> trace_xfs_alloc_size_busy(args);
> >> - xfs_extent_busy_flush(args->mp,
> >> - args->pag, busy_gen);
> >> + error = xfs_extent_busy_flush(args->tp, args->pag,
> >> + busy_gen, flags);
> >> + if (error) {
> >> + cnt_cur = NULL;
> >> + goto error0;
> >> + }
> >> + flags &= ~XFS_ALLOC_FLAG_TRYFLUSH;
> >> goto restart;
> >> }
> >> }
> >> @@ -1755,7 +1765,13 @@ xfs_alloc_ag_vextent_size(
> >> if (busy) {
> >> xfs_btree_del_cursor(cnt_cur, XFS_BTREE_NOERROR);
> >> trace_xfs_alloc_size_busy(args);
> >> - xfs_extent_busy_flush(args->mp, args->pag, busy_gen);
> >> + error = xfs_extent_busy_flush(args->tp, args->pag,
> >> + busy_gen, flags);
> >> + if (error) {
> >> + cnt_cur = NULL;
> >> + goto error0;
> >> + }
> >> + flags &= ~XFS_ALLOC_FLAG_TRYFLUSH;
> >> goto restart;
> >> }
> >> goto out_nominleft;
> >> @@ -2629,6 +2645,7 @@ xfs_alloc_fix_freelist(
> >> targs.agno = args->agno;
> >> targs.alignment = targs.minlen = targs.prod = 1;
> >> targs.pag = pag;
> >> + targs.flags = args->flags & XFS_ALLOC_FLAG_FREEING;
> > 
> > Hmm, I guess this propagates FREEING into the allocation that lengthens
> > the AGFL.  Right?
> 
> Yes, the original call path is like this:
> xfs_free_extent_fix_freelist()
>   xfs_alloc_fix_freelist(args, XFS_ALLOC_FLAG_FREEING as flags)  
>     xfs_alloc_ag_vextent_size(targs)
>       xfs_extent_busy_flush(tp, pag, busy_gen, flags)
> 
> We don't have a ‘flags’ parameter to xfs_alloc_ag_vextent_size().
> Considering that args itself is used for the allocation, the ‘flags’ serves the same
> purpose. I moved the ‘flags’ into args (struct xfs_alloc_arg), and thus
> xfs_alloc_fix_freelist() doesn't need the ‘flags’ parameter any longer since
> the 'flags' is in ‘args’. In my draft code (not posted to this email list), I
> removed the ‘flags’ parameter from xfs_alloc_fix_freelist(). But when I was
> reviewing the code change, I found that ‘flags’ is used heavily in that function, and
> accessing 'args->flags' instead of ‘flags’ may cost some extra CPU cycles.
> So in the patch (for review) I kept the ‘flags’ parameter, which made the code
> ugly :D

I prefer either passing @flags around as a function parameter like we
always have; or stuffing it in xfs_alloc_arg.  Not both, that makes it
much harder to understand.

> > 
> >> error = xfs_alloc_read_agfl(pag, tp, &agflbp);
> >> if (error)
> >> goto out_agbp_relse;
> >> @@ -3572,6 +3589,7 @@ xfs_free_extent_fix_freelist(
> >> args.mp = tp->t_mountp;
> >> args.agno = pag->pag_agno;
> >> args.pag = pag;
> >> + args.flags = XFS_ALLOC_FLAG_FREEING;
> >> 
> >> /*
> >>  * validate that the block number is legal - this enables us to detect
> >> @@ -3580,7 +3598,7 @@ xfs_free_extent_fix_freelist(
> >> if (args.agno >= args.mp->m_sb.sb_agcount)
> >> return -EFSCORRUPTED;
> >> 
> >> - error = xfs_alloc_fix_freelist(&args, XFS_ALLOC_FLAG_FREEING);
> >> + error = xfs_alloc_fix_freelist(&args, args.flags);
> > 
> > Hm.  Why stuff XFS_ALLOC_FLAG_FREEING into args.flags here?  I /think/
> > in this case it doesn't matter because nothing under
> > xfs_alloc_fix_freelist accesses args->flags...?
> > 
> > AFAICT in xfs_alloc_fix_freelist, the difference between @args->flags
> > and @flags is that @args is the allocation that we're trying to do,
> > whereas @flags are for any allocation that might need to be done to fix
> > the freelist.  IOWS, the only allocation we're doing *is* to fix the
> freelist, so they are one and the same.  Maybe?  Otherwise I can't tell
> > why we'd convey two sets of XFS_ALLOC flags to xfs_alloc_fix_freelist.
> > 
> > So while I think this isn't incorrect, the flag mixing bothers me a bit
> because setting fields in a struct can have weird, not-so-obvious side
> > effects.
> > 
> > (Still trying to make sense of the addition of a @flags field to struct
> > xfs_alloc_arg.)
> 
> Right, see my explanation above.
> 
> > 
> >> if (error)
> >> return error;
> >> 
> >> diff --git a/fs/xfs/libxfs/xfs_alloc.h b/fs/xfs/libxfs/xfs_alloc.h
> >> index 2b246d74c189..5038fba87784 100644
> >> --- a/fs/xfs/libxfs/xfs_alloc.h
> >> +++ b/fs/xfs/libxfs/xfs_alloc.h
> >> @@ -24,6 +24,7 @@ unsigned int xfs_agfl_size(struct xfs_mount *mp);
> >> #define XFS_ALLOC_FLAG_NORMAP 0x00000004  /* don't modify the rmapbt */
> >> #define XFS_ALLOC_FLAG_NOSHRINK 0x00000008  /* don't shrink the freelist */
> >> #define XFS_ALLOC_FLAG_CHECK 0x00000010  /* test only, don't modify args */
> >> +#define XFS_ALLOC_FLAG_TRYFLUSH 0x00000020  /* don't block in busyextent flush*/
> >> 
> >> /*
> >>  * Argument structure for xfs_alloc routines.
> >> @@ -57,6 +58,7 @@ typedef struct xfs_alloc_arg {
> >> #ifdef DEBUG
> >> bool alloc_minlen_only; /* allocate exact minlen extent */
> >> #endif
> >> + int flags; /* XFS_ALLOC_FLAG_* */
> >> } xfs_alloc_arg_t;
> >> 
> >> /*
> >> diff --git a/fs/xfs/scrub/repair.c b/fs/xfs/scrub/repair.c
> >> index 1b71174ec0d6..2ba28e4257fe 100644
> >> --- a/fs/xfs/scrub/repair.c
> >> +++ b/fs/xfs/scrub/repair.c
> >> @@ -496,9 +496,9 @@ xrep_fix_freelist(
> >> args.agno = sc->sa.pag->pag_agno;
> >> args.alignment = 1;
> >> args.pag = sc->sa.pag;
> >> + args.flags = can_shrink ? 0 : XFS_ALLOC_FLAG_NOSHRINK;
> >> 
> >> - return xfs_alloc_fix_freelist(&args,
> >> - can_shrink ? 0 : XFS_ALLOC_FLAG_NOSHRINK);
> >> + return xfs_alloc_fix_freelist(&args, args.flags);
> >> }
> >> 
> >> /*
> >> diff --git a/fs/xfs/xfs_extent_busy.c b/fs/xfs/xfs_extent_busy.c
> >> index f3d328e4a440..ea1c1857bf5b 100644
> >> --- a/fs/xfs/xfs_extent_busy.c
> >> +++ b/fs/xfs/xfs_extent_busy.c
> >> @@ -567,18 +567,41 @@ xfs_extent_busy_clear(
> >> /*
> >>  * Flush out all busy extents for this AG.
> >>  */
> >> -void
> >> +int
> >> xfs_extent_busy_flush(
> >> - struct xfs_mount *mp,
> >> + struct xfs_trans *tp,
> >> struct xfs_perag *pag,
> >> - unsigned busy_gen)
> >> + unsigned busy_gen,
> >> + int flags)
> >> {
> >> DEFINE_WAIT (wait);
> >> int error;
> >> 
> >> - error = xfs_log_force(mp, XFS_LOG_SYNC);
> >> + error = xfs_log_force(tp->t_mountp, XFS_LOG_SYNC);
> >> if (error)
> >> - return;
> >> + return error;
> >> +
> >> + /*
> >> +  * If we are holding busy extents, the caller may not want to block
> >> +  * straight away. If we are being told just to try a flush or progress
> >> +  * has been made since we last skipped a busy extent, return
> >> +  * immediately to allow the caller to try again. If we are freeing
> >> +  * extents, we might actually be holding the only free extents in the
> >> +  * transaction busy list and the log force won't resolve that
> >> +  * situation. In this case, return -EAGAIN to tell the
> >> +  * caller it needs to commit the busy extents it holds before retrying
> >> +  * the extent free operation.
> >> +  */
> >> + if (!list_empty(&tp->t_busy)) {
> >> + if (flags & XFS_ALLOC_FLAG_TRYFLUSH)
> >> + return 0;
> >> +
> >> + if (busy_gen != READ_ONCE(pag->pagb_gen))
> >> + return 0;
> >> +
> >> + if (flags & XFS_ALLOC_FLAG_FREEING)
> >> + return -EAGAIN;
> >> + }
> >> 
> >> do {
> >> prepare_to_wait(&pag->pagb_wait, &wait, TASK_KILLABLE);
> >> @@ -588,6 +611,7 @@ xfs_extent_busy_flush(
> >> } while (1);
> >> 
> >> finish_wait(&pag->pagb_wait, &wait);
> >> + return 0;
> > 
> > This part looks good.
> > 
> >> }
> >> 
> >> void
> >> diff --git a/fs/xfs/xfs_extent_busy.h b/fs/xfs/xfs_extent_busy.h
> >> index 4a118131059f..edeedb92e0df 100644
> >> --- a/fs/xfs/xfs_extent_busy.h
> >> +++ b/fs/xfs/xfs_extent_busy.h
> >> @@ -51,9 +51,9 @@ bool
> >> xfs_extent_busy_trim(struct xfs_alloc_arg *args, xfs_agblock_t *bno,
> >> xfs_extlen_t *len, unsigned *busy_gen);
> >> 
> >> -void
> >> -xfs_extent_busy_flush(struct xfs_mount *mp, struct xfs_perag *pag,
> >> - unsigned busy_gen);
> >> +int
> >> +xfs_extent_busy_flush(struct xfs_trans *tp, struct xfs_perag *pag,
> >> + unsigned busy_gen, int flags);
> >> 
> >> void
> >> xfs_extent_busy_wait_all(struct xfs_mount *mp);
> >> diff --git a/fs/xfs/xfs_extfree_item.c b/fs/xfs/xfs_extfree_item.c
> >> index 011b50469301..3c5a9e9952ec 100644
> >> --- a/fs/xfs/xfs_extfree_item.c
> >> +++ b/fs/xfs/xfs_extfree_item.c
> >> @@ -336,6 +336,25 @@ xfs_trans_get_efd(
> >> return efdp;
> >> }
> >> 
> >> +/*
> >> + * Fill the EFD with all extents from the EFI and set the counter.
> >> + * Note: the EFD should contain at least one extent already.
> >> + */
> >> +static void xfs_fill_efd_with_efi(struct xfs_efd_log_item *efdp)
> >> +{
> >> + struct xfs_efi_log_item *efip = efdp->efd_efip;
> >> + uint                    i;
> >> +
> >> + if (efdp->efd_next_extent == efip->efi_format.efi_nextents)
> >> + return;
> >> +
> >> + for (i = 0; i < efip->efi_format.efi_nextents; i++) {
> >> +        efdp->efd_format.efd_extents[i] =
> >> +        efip->efi_format.efi_extents[i];
> >> + }
> >> + efdp->efd_next_extent = efip->efi_format.efi_nextents;
> >> +}
> >> +
> >> /*
> >>  * Free an extent and log it to the EFD. Note that the transaction is marked
> >>  * dirty regardless of whether the extent free succeeds or fails to support the
> >> @@ -369,6 +388,10 @@ xfs_trans_free_extent(
> >> error = __xfs_free_extent(tp, xefi->xefi_startblock,
> >> xefi->xefi_blockcount, &oinfo, XFS_AG_RESV_NONE,
> >> xefi->xefi_flags & XFS_EFI_SKIP_DISCARD);
> >> + if (error == -EAGAIN) {
> >> + xfs_fill_efd_with_efi(efdp);
> >> + return error;
> >> + }
> >> /*
> >>  * Mark the transaction dirty, even on error. This ensures the
> >>  * transaction is aborted, which:
> >> @@ -476,7 +499,8 @@ xfs_extent_free_finish_item(
> >> xefi = container_of(item, struct xfs_extent_free_item, xefi_list);
> >> 
> >> error = xfs_trans_free_extent(tp, EFD_ITEM(done), xefi);
> >> - kmem_cache_free(xfs_extfree_item_cache, xefi);
> >> + if (error != -EAGAIN)
> >> + kmem_cache_free(xfs_extfree_item_cache, xefi);
> >> return error;
> >> }
> >> 
> >> @@ -633,6 +657,17 @@ xfs_efi_item_recover(
> >> fake.xefi_blockcount = extp->ext_len;
> >> 
> >> error = xfs_trans_free_extent(tp, efdp, &fake);
> >> + if (error == -EAGAIN) {
> > 
> > Curious, is this patch based off of 6.4-rc?  There should be a
> > xfs_extent_free_get_group / xfs_extent_free_put_group here.
> 
> It's still 6.3.0-rc6, the newest when I started working on the deadlock issue.
> Will rebase to the current version for the next drop.

Ok.

> > 
> >> + xfs_free_extent_later(tp, fake.xefi_startblock,
> >> + fake.xefi_blockcount, &XFS_RMAP_OINFO_ANY_OWNER);
> >> + /*
> >> +  * try to free as many extents as possible with current
> >> +  * transaction
> >> +  */
> >> + error = 0;
> >> + continue;
> >> + };
> > 
> > If we get EAGAIN, why not let the unfinished parts of the recovered EFI
> > work get turned into a new EFI?
> 
> That is an optimization. I wanted to finish more records with the current transaction
> when possible. Say for this case (we hit the first EAGAIN on (AG1, bno2, len2)):
> EFI {(AG1, bno1, len1), (AG1, bno2, len2), (AG2, bno3, len3), (AG2, bno4, len4)}
> with the current patch, we need 1 extra EFI, so
> orig EFI {(AG1, bno1, len1), (AG2, bno3, len3)} # records shown are the ones actually done
> new EFI1 {(AG1, bno2, len2), (AG2, bno4, len4)}
> 
> Otherwise, if we move all the records (from the current one onward) to a new EFI on the
> first EAGAIN, we would need
> orig EFI {(AG1, bno1, len1)}
> new EFI1 {(AG1, bno2, len2), (AG2, bno3, len3)}
> new EFI3 {(AG2, bno4, len4)}
> 
> In the above case, we saved one EFI with the optimization.

The trouble here is that now we're performing updates during log
recovery in a different order than they would have been done by the
regular fs code.  I /think/ for EFIs this doesn't matter (it definitely
matters for the other intent items!) but we (maintainers) prefer not to
have to think about two different orderings.

> > 
> >> +
> >> if (error == -EFSCORRUPTED)
> >> XFS_CORRUPTION_ERROR(__func__, XFS_ERRLEVEL_LOW, mp,
> >> extp, sizeof(*extp));
> >> diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c
> >> index 322eb2ee6c55..00bfe9683fa8 100644
> >> --- a/fs/xfs/xfs_log_recover.c
> >> +++ b/fs/xfs/xfs_log_recover.c
> >> @@ -2540,30 +2540,27 @@ xlog_recover_process_intents(
> >> struct xfs_log_item *lip;
> >> struct xfs_ail *ailp;
> >> int error = 0;
> >> -#if defined(DEBUG) || defined(XFS_WARN)
> >> - xfs_lsn_t last_lsn;
> >> -#endif
> >> + xfs_lsn_t threshold_lsn;
> >> 
> >> ailp = log->l_ailp;
> >> + threshold_lsn = xfs_ail_max_lsn(ailp);
> >> spin_lock(&ailp->ail_lock);
> >> -#if defined(DEBUG) || defined(XFS_WARN)
> >> - last_lsn = xlog_assign_lsn(log->l_curr_cycle, log->l_curr_block);
> >> -#endif
> >> +
> >> for (lip = xfs_trans_ail_cursor_first(ailp, &cur, 0);
> >>      lip != NULL;
> >>      lip = xfs_trans_ail_cursor_next(ailp, &cur)) {
> >> const struct xfs_item_ops *ops;
> >> + /*
> >> +  * Original redo EFI could be split into new EFIs. Those
> >> +  * new EFIs are supposed to be processed in capture_list.
> >> +  * Stop here when original redo intents are done.
> >> +  */
> >> + if (XFS_LSN_CMP(threshold_lsn, lip->li_lsn) < 0)
> >> + break;
> > 
> > I'm not sure what this change accomplishes?  Is it because of the
> > continue; statement in xfs_efi_item_recover?
> 
> The function xlog_recover_process_intents() iterates over all the intent
> items in the AIL and processes them. According to the comments:
> 
> 2519  * When this is called, all of the log intent items which did not have
> 2520  * corresponding log done items should be in the AIL.  What we do now is update
> 2521  * the data structures associated with each one.
> 
> I am thinking xlog_recover_process_intents() should iterate only over the ones
> which are already on the AIL when it starts. If we don't set a threshold here, it will continue
> to process new intents added while we are processing the existing intents. In this
> case, we add new EFIs, and they are added to the AIL quickly, via the commit callbacks, before
> we finish the iteration here. The new EFIs are then picked up by
> xlog_recover_process_intents(). If xlog_recover_process_intents() continues to
> process these new EFIs, it will double free the extents in those EFIs together with xlog_finish_defer_ops(),
> thus causing problems.

Each ->iop_recover function is supposed to do at least one unit of work,
which should generate new non-intent log items.  These new items are
attached to the end of the AIL's list before any new intent items
created to handle an EAGAIN return.  Hence the list order ought to be:

<recovered EFI> -> <buffer log items for bnobt updates> ->
<new EFI to continue because busy flush didn't move busy gen>

Are you seeing an order other than this?

--D

> thanks,
> wengang
> 
>
Wengang Wang May 23, 2023, 2:59 a.m. UTC | #7
> On May 22, 2023, at 3:40 PM, Dave Chinner <david@fromorbit.com> wrote:
> 
> On Mon, May 22, 2023 at 06:20:11PM +0000, Wengang Wang wrote:
>>> On May 21, 2023, at 6:17 PM, Dave Chinner <david@fromorbit.com> wrote:
>>> On Fri, May 19, 2023 at 10:18:29AM -0700, Wengang Wang wrote:
>>>> diff --git a/fs/xfs/xfs_extfree_item.c b/fs/xfs/xfs_extfree_item.c
>>>> index 011b50469301..3c5a9e9952ec 100644
>>>> --- a/fs/xfs/xfs_extfree_item.c
>>>> +++ b/fs/xfs/xfs_extfree_item.c
>>>> @@ -336,6 +336,25 @@ xfs_trans_get_efd(
>>>> return efdp;
>>>> }
>>>> 
>>>> +/*
>>>> + * Fill the EFD with all extents from the EFI and set the counter.
>>>> + * Note: the EFD should contain at least one extent already.
>>>> + */
>>>> +static void xfs_fill_efd_with_efi(struct xfs_efd_log_item *efdp)
>>>> +{
>>>> + struct xfs_efi_log_item *efip = efdp->efd_efip;
>>>> + uint                    i;
>>>> +
>>>> + if (efdp->efd_next_extent == efip->efi_format.efi_nextents)
>>>> + return;
>>>> +
>>>> + for (i = 0; i < efip->efi_format.efi_nextents; i++) {
>>>> +        efdp->efd_format.efd_extents[i] =
>>>> +        efip->efi_format.efi_extents[i];
>>>> + }
>>>> + efdp->efd_next_extent = efip->efi_format.efi_nextents;
>>>> +}
>>>> +
>>> 
>>> Ok, but it doesn't dirty the transaction or the EFD, which means....
>> 
>> Actually, EAGAIN shouldn't happen with the first record in an EFI because
>> trans->t_busy is empty during AGFL block allocation for the first record.
>> So the dirtying work should already be done with the first one.
> 
> You're assuming that the only thing we are going to want to return
> -EAGAIN for freeing attempts for is busy extents. Being able to
> restart btree operations by "commit and retry" opens up a
> whole new set of performance optimisations we can make to the
> btree code.
> 
> IOWs, I want this functionality to be generic in nature, not
> tailored specifically to one situation where an -EAGAIN needs to be
> returned to trigger a commit and retry.

Yes, I assumed that because I didn't see relevant EAGAIN handlers
in the existing code. It's reasonable to make it generic for existing or planned
EAGAINs.
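
So for the dirtying, I'd move the -EAGAIN check below the existing dirtying
in xfs_trans_free_extent(), roughly like this (a sketch; the dirtying lines
are the existing ones, just left above the new return):

	error = __xfs_free_extent(tp, xefi->xefi_startblock,
			xefi->xefi_blockcount, &oinfo, XFS_AG_RESV_NONE,
			xefi->xefi_flags & XFS_EFI_SKIP_DISCARD);

	/*
	 * Mark the transaction dirty even on error, including -EAGAIN,
	 * so the EFD is committed when the transaction rolls.
	 */
	tp->t_flags |= XFS_TRANS_DIRTY;
	set_bit(XFS_LI_DIRTY, &efdp->efd_item.li_flags);

	if (error == -EAGAIN) {
		xfs_fill_efd_with_efi(efdp);
		return error;
	}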

> 
>>>> @@ -369,6 +388,10 @@ xfs_trans_free_extent(
>>>> error = __xfs_free_extent(tp, xefi->xefi_startblock,
>>>> xefi->xefi_blockcount, &oinfo, XFS_AG_RESV_NONE,
>>>> xefi->xefi_flags & XFS_EFI_SKIP_DISCARD);
>>>> + if (error == -EAGAIN) {
>>>> + xfs_fill_efd_with_efi(efdp);
>>>> + return error;
>>>> + }
>>> 
>>> .... this is incorrectly placed.
>>> 
>>> The very next lines say:
>>> 
>>>> /*
>>>> * Mark the transaction dirty, even on error. This ensures the
>>>> * transaction is aborted, which:
>>> 
>>> i.e. we have to make the transaction and EFD log item dirty even if
>>> we have an error. In this case, the error is not fatal, but we still
>>> have to ensure that we commit the EFD when we roll the transaction.
>>> Hence the transaction and EFD still need to be dirtied on -EAGAIN...
>> 
>> see above.
> 
> See above :)
> 
>>>> diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c
>>>> index 322eb2ee6c55..00bfe9683fa8 100644
>>>> --- a/fs/xfs/xfs_log_recover.c
>>>> +++ b/fs/xfs/xfs_log_recover.c
>>>> @@ -2540,30 +2540,27 @@ xlog_recover_process_intents(
>>>> struct xfs_log_item *lip;
>>>> struct xfs_ail *ailp;
>>>> int error = 0;
>>>> -#if defined(DEBUG) || defined(XFS_WARN)
>>>> - xfs_lsn_t last_lsn;
>>>> -#endif
>>>> + xfs_lsn_t threshold_lsn;
>>>> 
>>>> ailp = log->l_ailp;
>>>> + threshold_lsn = xfs_ail_max_lsn(ailp);
>>>> spin_lock(&ailp->ail_lock);
>>>> -#if defined(DEBUG) || defined(XFS_WARN)
>>>> - last_lsn = xlog_assign_lsn(log->l_curr_cycle, log->l_curr_block);
>>> 
>>> xfs_ail_max_lsn() and l_curr_cycle/l_curr_block are not the same
>>> thing.  max_lsn points to the lsn of the last entry in the AIL (in
>>> memory state), whilst curr_cycle/block points to the current
>>> physical location of the log head in the on-disk journal.
>>> 
>> 
>> Yes, I intended to use the lsn of the last entry in the AIL.
> 
> Again, they are not the same thing: using the last entry in the
> AIL here is incorrect. We want to replay all the items in the AIL
> that were active in the log, not up to the last item in the AIL. The
> actively recovered log region ends at last_lsn as per above, whilst
> xfs_ail_max_lsn() is not guaranteed to be less than last_lsn before
> we start walking it.

OK, got it.

> 
>> For the problem with xlog_recover_process_intents(), please see my reply to
>> Darrick. On seeing the problem, my first attempt was to use “last_lsn” to stop
>> the iteration, but that didn't help: last_lsn was found to be quite a bit bigger than even
>> the new EFI's lsn. Using xfs_ail_max_lsn() instead solved the problem.
> 
> In what case are we queuing a *new* intent into the AIL that has a
> LSN less than xlog_assign_lsn(log->l_curr_cycle, log->l_curr_block)?
> If we are doing that *anywhere*, then we have a likely journal
> corruption bug in the code because it indicates we committed that
> item to the journal over something in the log we are currently
> replaying.

I made a mistake in saying that last_lsn is quite a bit bigger than the new EFI's lsn;
it's actually much bigger than max_lsn (because the cycle increased).
The lsn of the new EFI is exactly the same as last_lsn.
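
(If I read xlog_assign_lsn() right, the cycle sits in the high 32 bits:

	lsn = ((xfs_lsn_t)cycle << 32) | block;

so once the cycle increases, the comparison is dominated by it and the
block number no longer matters.)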

> 
>>> In this case, we can't use in-memory state to determine where to
>>> stop the initial intent replay - recovery of other items may have
>>> inserted new intents beyond the end of the physical region being
>>> recovered, in which case using xfs_ail_max_lsn() will result in
>>> incorrect behaviour here.
>> 
>> Yes, this patch is one of those (if any others exist) that introduce new intents (EFIs here).
>> We add the new intents to the transaction first (xfs_defer_create_intent()), then add
>> the deferred operations to ‘capture_list’. And finally, the deferred operations in
>> ‘capture_list’ are processed after the intent iteration on the AIL.
> 
> The changes made in that transaction, including the newly logged
> EFI, get committed before the rest of the work gets deferred via
> xfs_defer_ops_capture_and_commit(). That commits the new efi (along
> with all the changes that have already been made in the transaction)
> to the CIL, and eventually the journal checkpoints and the new EFI
> gets inserted into the AIL at the LSN of the checkpoint.
> 
> The LSN of the checkpoint is curr_cycle/block - the log head -
> because that's where the start record of the checkpoint is
> physically written.  As each iclog is filled, the log head moves
> forward - it always points at the location that the next journal
> write will be written to. At the end of a checkpoint, the LSN of the
> start record is used for AIL insertion.
> 

Thanks for the explanation!

> Hence if a new log item created by recovery has a LSN less than
> last_lsn, then we have a serious bug somewhere that needs to be
> found and fixed. The use of last_lsn tells us something has gone
> badly wrong during recovery, the use of xfs_ail_max_lsn() removes
> the detection of the issue and now we don't know that something has
> gone badly wrong...
> 

I made a mistake: the (first) new EFI's lsn is the same as last_lsn. Sorry
for the confusion.

>> For other existing cases (if there are any) where new intents are added,
>> do they not use the capture_list for deferred operations? Do you have an example?
>> If so, I think we should follow their approach instead of adding the deferred operations
>> (but relying on the intents on the AIL).
> 
> All of the intent recovery stuff uses
> xfs_defer_ops_capture_and_commit() to commit the intent being
> replayed and cause all further new intent processing in that chain
> to be defered until after all the intents recovered from the journal
> have been iterated. All those new intents end up in the AIL at a LSN
> index >= last_lsn.

Yes. So we break the AIL iteration on seeing an intent with an lsn equal to
or bigger than last_lsn, and skip iop_recover() for that item?
And shall we put this change into a separate patch, since it fixes
an existing problem (not introduced by my patch)?
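
e.g. something like this in xlog_recover_process_intents() (a sketch,
keeping last_lsn built unconditionally and assuming new intents land in the
AIL at an lsn >= last_lsn, as you described):

	last_lsn = xlog_assign_lsn(log->l_curr_cycle, log->l_curr_block);

	for (lip = xfs_trans_ail_cursor_first(ailp, &cur, 0);
	     lip != NULL;
	     lip = xfs_trans_ail_cursor_next(ailp, &cur)) {
		if (!xlog_item_is_intent(lip))
			break;

		/*
		 * Intents at or beyond last_lsn were logged by recovery
		 * itself; they are finished via the capture list instead.
		 */
		if (XFS_LSN_CMP(lip->li_lsn, last_lsn) >= 0)
			break;
		...
	}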

thanks,
wengang
Wengang Wang May 23, 2023, 3:09 a.m. UTC | #8
> On May 22, 2023, at 5:31 PM, Darrick J. Wong <djwong@kernel.org> wrote:
> 
> On Mon, May 22, 2023 at 05:25:01PM +0000, Wengang Wang wrote:
>> 
>> 
>>> On May 20, 2023, at 2:56 PM, Darrick J. Wong <djwong@kernel.org> wrote:
>>> 
>>> On Fri, May 19, 2023 at 10:18:29AM -0700, Wengang Wang wrote:
>>>> The following calltrace is seen:
>>>> #0 context_switch() kernel/sched/core.c:3881
>>>> #1 __schedule() kernel/sched/core.c:5111
>>>> #2 schedule() kernel/sched/core.c:5186
>>>> #3 xfs_extent_busy_flush() fs/xfs/xfs_extent_busy.c:598
>>>> #4 xfs_alloc_ag_vextent_size() fs/xfs/libxfs/xfs_alloc.c:1641
>>>> #5 xfs_alloc_ag_vextent() fs/xfs/libxfs/xfs_alloc.c:828
>>>> #6 xfs_alloc_fix_freelist() fs/xfs/libxfs/xfs_alloc.c:2362
>>>> #7 xfs_free_extent_fix_freelist() fs/xfs/libxfs/xfs_alloc.c:3029
>>>> #8 __xfs_free_extent() fs/xfs/libxfs/xfs_alloc.c:3067
>>>> #9 xfs_trans_free_extent() fs/xfs/xfs_extfree_item.c:370
>>>> #10 xfs_efi_recover() fs/xfs/xfs_extfree_item.c:626
>>>> #11 xlog_recover_process_efi() fs/xfs/xfs_log_recover.c:4605
>>>> #12 xlog_recover_process_intents() fs/xfs/xfs_log_recover.c:4893
>>>> #13 xlog_recover_finish() fs/xfs/xfs_log_recover.c:5824
>>>> #14 xfs_log_mount_finish() fs/xfs/xfs_log.c:764
>>>> #15 xfs_mountfs() fs/xfs/xfs_mount.c:978
>>>> #16 xfs_fs_fill_super() fs/xfs/xfs_super.c:1908
>>>> #17 mount_bdev() fs/super.c:1417
>>>> #18 xfs_fs_mount() fs/xfs/xfs_super.c:1985
>>>> #19 legacy_get_tree() fs/fs_context.c:647
>>>> #20 vfs_get_tree() fs/super.c:1547
>>>> #21 do_new_mount() fs/namespace.c:2843
>>>> #22 do_mount() fs/namespace.c:3163
>>>> #23 ksys_mount() fs/namespace.c:3372
>>>> #24 __do_sys_mount() fs/namespace.c:3386
>>>> #25 __se_sys_mount() fs/namespace.c:3383
>>>> #26 __x64_sys_mount() fs/namespace.c:3383
>>>> #27 do_syscall_64() arch/x86/entry/common.c:296
>>>> #28 entry_SYSCALL_64() arch/x86/entry/entry_64.S:180
>>>> 
>>>> During the processing of the 2nd and subsequent records in an EFI,
>>>> it is waiting for the busy blocks to be cleaned, but the only busy extent
>>>> is still held in the current xfs_trans->t_busy. That busy extent was added when
>>>> processing the previous EFI record. And because that busy extent is not committed
>>>> yet, it can't be cleaned.
>>>> 
>>>> To avoid the above deadlock, we don't block in xfs_extent_busy_flush() when
>>>> allocating AGFL blocks; instead it returns -EAGAIN. On receiving -EAGAIN
>>>> we are able to retry that EFI record with a new transaction after committing
>>>> the old transaction. With the old transaction committed, the busy extent attached
>>>> to the old transaction gets the chance to be cleaned. On the retry, there are
>>>> no existing busy extents in the new transaction, thus no deadlock.
>>>> 
>>>> Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com>
>>>> ---
>>>> fs/xfs/libxfs/xfs_alloc.c | 30 ++++++++++++++++++++++++------
>>>> fs/xfs/libxfs/xfs_alloc.h |  2 ++
>>>> fs/xfs/scrub/repair.c     |  4 ++--
>>>> fs/xfs/xfs_extent_busy.c  | 34 +++++++++++++++++++++++++++++-----
>>>> fs/xfs/xfs_extent_busy.h  |  6 +++---
>>>> fs/xfs/xfs_extfree_item.c | 37 ++++++++++++++++++++++++++++++++++++-
>>>> fs/xfs/xfs_log_recover.c  | 23 ++++++++++-------------
>>>> fs/xfs/xfs_trans_ail.c    |  2 +-
>>>> fs/xfs/xfs_trans_priv.h   |  1 +
>>>> 9 files changed, 108 insertions(+), 31 deletions(-)
>>>> 
>>>> diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c
>>>> index 203f16c48c19..abfd2acb3053 100644
>>>> --- a/fs/xfs/libxfs/xfs_alloc.c
>>>> +++ b/fs/xfs/libxfs/xfs_alloc.c
>>>> @@ -1491,6 +1491,7 @@ STATIC int
>>>> xfs_alloc_ag_vextent_near(
>>>> struct xfs_alloc_arg *args)
>>>> {
>>>> + int flags = args->flags | XFS_ALLOC_FLAG_TRYFLUSH;
>>>> struct xfs_alloc_cur acur = {};
>>>> int error; /* error code */
>>>> int i; /* result code, temporary */
>>>> @@ -1564,8 +1565,11 @@ xfs_alloc_ag_vextent_near(
>>>> if (!acur.len) {
>>>> if (acur.busy) {
>>>> trace_xfs_alloc_near_busy(args);
>>>> - xfs_extent_busy_flush(args->mp, args->pag,
>>>> -       acur.busy_gen);
>>>> + error = xfs_extent_busy_flush(args->tp, args->pag,
>>>> +       acur.busy_gen, flags);
>>>> + if (error)
>>>> + goto out;
>>>> + flags &= ~XFS_ALLOC_FLAG_TRYFLUSH;
>>>> goto restart;
>>>> }
>>>> trace_xfs_alloc_size_neither(args);
>>>> @@ -1592,6 +1596,7 @@ STATIC int /* error */
>>>> xfs_alloc_ag_vextent_size(
>>>> xfs_alloc_arg_t *args) /* allocation argument structure */
>>>> {
>>>> + int flags = args->flags | XFS_ALLOC_FLAG_TRYFLUSH;
>>> 
>>> This variable holds XFS_ALLOC_FLAG_* values that are passed only to
>>> xfs_extent_busy_flush, right?  Could this be named "busyflags" or
>>> similar to make that clearer?
>>> 
>>> (Clearer to 6hr^W6mo from now Darrick who won't remember this at all.)
>> 
>> Yes, the ‘flags’ is only passed to xfs_extent_busy_flush(). It will be ‘busyflags’
>> in the next drop.
> 
> Ok.
> 
>>> 
>>>> struct xfs_agf *agf = args->agbp->b_addr;
>>>> struct xfs_btree_cur *bno_cur; /* cursor for bno btree */
>>>> struct xfs_btree_cur *cnt_cur; /* cursor for cnt btree */
>>>> @@ -1670,8 +1675,13 @@ xfs_alloc_ag_vextent_size(
>>>> xfs_btree_del_cursor(cnt_cur,
>>>>     XFS_BTREE_NOERROR);
>>>> trace_xfs_alloc_size_busy(args);
>>>> - xfs_extent_busy_flush(args->mp,
>>>> - args->pag, busy_gen);
>>>> + error = xfs_extent_busy_flush(args->tp, args->pag,
>>>> + busy_gen, flags);
>>>> + if (error) {
>>>> + cnt_cur = NULL;
>>>> + goto error0;
>>>> + }
>>>> + flags &= ~XFS_ALLOC_FLAG_TRYFLUSH;
>>>> goto restart;
>>>> }
>>>> }
>>>> @@ -1755,7 +1765,13 @@ xfs_alloc_ag_vextent_size(
>>>> if (busy) {
>>>> xfs_btree_del_cursor(cnt_cur, XFS_BTREE_NOERROR);
>>>> trace_xfs_alloc_size_busy(args);
>>>> - xfs_extent_busy_flush(args->mp, args->pag, busy_gen);
>>>> + error = xfs_extent_busy_flush(args->tp, args->pag,
>>>> + busy_gen, flags);
>>>> + if (error) {
>>>> + cnt_cur = NULL;
>>>> + goto error0;
>>>> + }
>>>> + flags &= ~XFS_ALLOC_FLAG_TRYFLUSH;
>>>> goto restart;
>>>> }
>>>> goto out_nominleft;
>>>> @@ -2629,6 +2645,7 @@ xfs_alloc_fix_freelist(
>>>> targs.agno = args->agno;
>>>> targs.alignment = targs.minlen = targs.prod = 1;
>>>> targs.pag = pag;
>>>> + targs.flags = args->flags & XFS_ALLOC_FLAG_FREEING;
>>> 
>>> Hmm, I guess this propagates FREEING into the allocation that lengthens
>>> the AGFL.  Right?
>> 
>> Yes, the original call path is like this:
>> xfs_free_extent_fix_freelist()
>>  xfs_alloc_fix_freelist(args, XFS_ALLOC_FLAG_FREEING as flags)  
>>    xfs_alloc_ag_vextent_size(targs)
>>      xfs_extent_busy_flush(tp, pag, busy_gen, flags)
>> 
>> We don’t have a ‘flags’ parameter to xfs_alloc_ag_vextent_size().
>> Considering that args itself is used for the allocation, the ‘flags’ serves the
>> same purpose, so I moved the ‘flags’ into args (struct xfs_alloc_arg), and thus
>> xfs_alloc_fix_freelist() doesn’t need the ‘flags’ parameter any longer since
>> the ‘flags’ is in ‘args’. In my draft code (not posted to this email list), I
>> removed the ‘flags’ parameter from xfs_alloc_fix_freelist(). But when I was
>> reviewing the code change, I found that ‘flags’ is used heavily in that function,
>> and accessing ‘args->flags’ instead of ‘flags’ may cost some extra CPU cycles.
>> So in the patch (for review) I kept the ‘flags’ parameter, which makes the code
>> ugly :D
> 
> I prefer either passing @flags around as a function parameter like we
> always have; or stuffing it in xfs_alloc_arg.  Not both, that makes it
> much harder to understand.
> 

Sure, so I will use the flags in xfs_alloc_arg, roughly as sketched below.
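
For illustration, a minimal hand-written sketch of that shape (this is not from
the posted patch, just what the consolidated call site might look like):

	struct xfs_alloc_arg	args = {
		.tp	= tp,
		.mp	= tp->t_mountp,
		.pag	= pag,
		/* behaviour flags travel inside args, not as a parameter */
		.flags	= XFS_ALLOC_FLAG_FREEING,
	};

	error = xfs_alloc_fix_freelist(&args);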

>>> 
>>>> error = xfs_alloc_read_agfl(pag, tp, &agflbp);
>>>> if (error)
>>>> goto out_agbp_relse;
>>>> @@ -3572,6 +3589,7 @@ xfs_free_extent_fix_freelist(
>>>> args.mp = tp->t_mountp;
>>>> args.agno = pag->pag_agno;
>>>> args.pag = pag;
>>>> + args.flags = XFS_ALLOC_FLAG_FREEING;
>>>> 
>>>> /*
>>>> * validate that the block number is legal - the enables us to detect
>>>> @@ -3580,7 +3598,7 @@ xfs_free_extent_fix_freelist(
>>>> if (args.agno >= args.mp->m_sb.sb_agcount)
>>>> return -EFSCORRUPTED;
>>>> 
>>>> - error = xfs_alloc_fix_freelist(&args, XFS_ALLOC_FLAG_FREEING);
>>>> + error = xfs_alloc_fix_freelist(&args, args.flags);
>>> 
>>> Hm.  Why stuff XFS_ALLOC_FLAG_FREEING into args.flags here?  I /think/
>>> in this case it doesn't matter because nothing under
>>> xfs_alloc_fix_freelist accesses args->flags...?
>>> 
>>> AFAICT in xfs_alloc_fix_freelist, the difference between @args->flags
>>> and @flags is that @args is the allocation that we're trying to do,
>>> whereas @flags are for any allocation that might need to be done to fix
>>> the freelist.  IOWS, the only allocation we're doing *is* to fix the
>>> freelist, so they are one in the same.  Maybe?  Otherwise I can't tell
>>> why we'd convey two sets of XFS_ALLOC flags to xfs_alloc_fix_freelist.
>>> 
>>> So while I think this isn't incorrect, the flag mixing bothers me a bit
>>> because setting fields in a struct can have weird not so obvious side
>>> effects.
>>> 
>>> (Still trying to make sense of the addition of a @flags field to struct
>>> xfs_alloc_arg.)
>> 
>> Right, see my explanation above.
>> 
>>> 
>>>> if (error)
>>>> return error;
>>>> 
>>>> diff --git a/fs/xfs/libxfs/xfs_alloc.h b/fs/xfs/libxfs/xfs_alloc.h
>>>> index 2b246d74c189..5038fba87784 100644
>>>> --- a/fs/xfs/libxfs/xfs_alloc.h
>>>> +++ b/fs/xfs/libxfs/xfs_alloc.h
>>>> @@ -24,6 +24,7 @@ unsigned int xfs_agfl_size(struct xfs_mount *mp);
>>>> #define XFS_ALLOC_FLAG_NORMAP 0x00000004  /* don't modify the rmapbt */
>>>> #define XFS_ALLOC_FLAG_NOSHRINK 0x00000008  /* don't shrink the freelist */
>>>> #define XFS_ALLOC_FLAG_CHECK 0x00000010  /* test only, don't modify args */
>>>> +#define XFS_ALLOC_FLAG_TRYFLUSH 0x00000020  /* don't block in busy extent flush */
>>>> 
>>>> /*
>>>> * Argument structure for xfs_alloc routines.
>>>> @@ -57,6 +58,7 @@ typedef struct xfs_alloc_arg {
>>>> #ifdef DEBUG
>>>> bool alloc_minlen_only; /* allocate exact minlen extent */
>>>> #endif
>>>> + int flags; /* XFS_ALLOC_FLAG_* */
>>>> } xfs_alloc_arg_t;
>>>> 
>>>> /*
>>>> diff --git a/fs/xfs/scrub/repair.c b/fs/xfs/scrub/repair.c
>>>> index 1b71174ec0d6..2ba28e4257fe 100644
>>>> --- a/fs/xfs/scrub/repair.c
>>>> +++ b/fs/xfs/scrub/repair.c
>>>> @@ -496,9 +496,9 @@ xrep_fix_freelist(
>>>> args.agno = sc->sa.pag->pag_agno;
>>>> args.alignment = 1;
>>>> args.pag = sc->sa.pag;
>>>> + args.flags = can_shrink ? 0 : XFS_ALLOC_FLAG_NOSHRINK;
>>>> 
>>>> - return xfs_alloc_fix_freelist(&args,
>>>> - can_shrink ? 0 : XFS_ALLOC_FLAG_NOSHRINK);
>>>> + return xfs_alloc_fix_freelist(&args, args.flags);
>>>> }
>>>> 
>>>> /*
>>>> diff --git a/fs/xfs/xfs_extent_busy.c b/fs/xfs/xfs_extent_busy.c
>>>> index f3d328e4a440..ea1c1857bf5b 100644
>>>> --- a/fs/xfs/xfs_extent_busy.c
>>>> +++ b/fs/xfs/xfs_extent_busy.c
>>>> @@ -567,18 +567,41 @@ xfs_extent_busy_clear(
>>>> /*
>>>> * Flush out all busy extents for this AG.
>>>> */
>>>> -void
>>>> +int
>>>> xfs_extent_busy_flush(
>>>> - struct xfs_mount *mp,
>>>> + struct xfs_trans *tp,
>>>> struct xfs_perag *pag,
>>>> - unsigned busy_gen)
>>>> + unsigned busy_gen,
>>>> + int flags)
>>>> {
>>>> DEFINE_WAIT (wait);
>>>> int error;
>>>> 
>>>> - error = xfs_log_force(mp, XFS_LOG_SYNC);
>>>> + error = xfs_log_force(tp->t_mountp, XFS_LOG_SYNC);
>>>> if (error)
>>>> - return;
>>>> + return error;
>>>> +
>>>> + /*
>>>> +  * If we are holding busy extents, the caller may not want to block
>>>> +  * straight away. If we are being told just to try a flush or progress
>>>> +  * has been made since we last skipped a busy extent, return
>>>> +  * immediately to allow the caller to try again. If we are freeing
>>>> +  * extents, we might actually be holding the only free extents in the
>>>> +  * transaction busy list and the log force won't resolve that
>>>> +  * situation. In that case, return -EAGAIN to tell the caller it
>>>> +  * needs to commit the busy extents it holds before retrying the
>>>> +  * extent free operation.
>>>> +  */
>>>> + if (!list_empty(&tp->t_busy)) {
>>>> + if (flags & XFS_ALLOC_FLAG_TRYFLUSH)
>>>> + return 0;
>>>> +
>>>> + if (busy_gen != READ_ONCE(pag->pagb_gen))
>>>> + return 0;
>>>> +
>>>> + if (flags & XFS_ALLOC_FLAG_FREEING)
>>>> + return -EAGAIN;
>>>> + }
>>>> 
>>>> do {
>>>> prepare_to_wait(&pag->pagb_wait, &wait, TASK_KILLABLE);
>>>> @@ -588,6 +611,7 @@ xfs_extent_busy_flush(
>>>> } while (1);
>>>> 
>>>> finish_wait(&pag->pagb_wait, &wait);
>>>> + return 0;
>>> 
>>> This part looks good.
>>> 
>>>> }
>>>> 
>>>> void
>>>> diff --git a/fs/xfs/xfs_extent_busy.h b/fs/xfs/xfs_extent_busy.h
>>>> index 4a118131059f..edeedb92e0df 100644
>>>> --- a/fs/xfs/xfs_extent_busy.h
>>>> +++ b/fs/xfs/xfs_extent_busy.h
>>>> @@ -51,9 +51,9 @@ bool
>>>> xfs_extent_busy_trim(struct xfs_alloc_arg *args, xfs_agblock_t *bno,
>>>> xfs_extlen_t *len, unsigned *busy_gen);
>>>> 
>>>> -void
>>>> -xfs_extent_busy_flush(struct xfs_mount *mp, struct xfs_perag *pag,
>>>> - unsigned busy_gen);
>>>> +int
>>>> +xfs_extent_busy_flush(struct xfs_trans *tp, struct xfs_perag *pag,
>>>> + unsigned busy_gen, int flags);
>>>> 
>>>> void
>>>> xfs_extent_busy_wait_all(struct xfs_mount *mp);
>>>> diff --git a/fs/xfs/xfs_extfree_item.c b/fs/xfs/xfs_extfree_item.c
>>>> index 011b50469301..3c5a9e9952ec 100644
>>>> --- a/fs/xfs/xfs_extfree_item.c
>>>> +++ b/fs/xfs/xfs_extfree_item.c
>>>> @@ -336,6 +336,25 @@ xfs_trans_get_efd(
>>>> return efdp;
>>>> }
>>>> 
>>>> +/*
>>>> + * Fill the EFD with all extents from the EFI and set the counter.
>>>> + * Note: the EFD should contain at least one extent already.
>>>> + */
>>>> +static void xfs_fill_efd_with_efi(struct xfs_efd_log_item *efdp)
>>>> +{
>>>> + struct xfs_efi_log_item *efip = efdp->efd_efip;
>>>> + uint                    i;
>>>> +
>>>> + if (efdp->efd_next_extent == efip->efi_format.efi_nextents)
>>>> + return;
>>>> +
>>>> + for (i = 0; i < efip->efi_format.efi_nextents; i++) {
>>>> +        efdp->efd_format.efd_extents[i] =
>>>> +        efip->efi_format.efi_extents[i];
>>>> + }
>>>> + efdp->efd_next_extent = efip->efi_format.efi_nextents;
>>>> +}
>>>> +
>>>> /*
>>>> * Free an extent and log it to the EFD. Note that the transaction is marked
>>>> * dirty regardless of whether the extent free succeeds or fails to support the
>>>> @@ -369,6 +388,10 @@ xfs_trans_free_extent(
>>>> error = __xfs_free_extent(tp, xefi->xefi_startblock,
>>>> xefi->xefi_blockcount, &oinfo, XFS_AG_RESV_NONE,
>>>> xefi->xefi_flags & XFS_EFI_SKIP_DISCARD);
>>>> + if (error == -EAGAIN) {
>>>> + xfs_fill_efd_with_efi(efdp);
>>>> + return error;
>>>> + }
>>>> /*
>>>> * Mark the transaction dirty, even on error. This ensures the
>>>> * transaction is aborted, which:
>>>> @@ -476,7 +499,8 @@ xfs_extent_free_finish_item(
>>>> xefi = container_of(item, struct xfs_extent_free_item, xefi_list);
>>>> 
>>>> error = xfs_trans_free_extent(tp, EFD_ITEM(done), xefi);
>>>> - kmem_cache_free(xfs_extfree_item_cache, xefi);
>>>> + if (error != -EAGAIN)
>>>> + kmem_cache_free(xfs_extfree_item_cache, xefi);
>>>> return error;
>>>> }
>>>> 
>>>> @@ -633,6 +657,17 @@ xfs_efi_item_recover(
>>>> fake.xefi_blockcount = extp->ext_len;
>>>> 
>>>> error = xfs_trans_free_extent(tp, efdp, &fake);
>>>> + if (error == -EAGAIN) {
>>> 
>>> Curious, is this patch based off of 6.4-rc?  There should be a
>>> xfs_extent_free_get_group / xfs_extent_free_put_group here.
>> 
>> It’s still 6.3.0-rc6, the newest when I started working on the deadlock issue.
>> Will rebase to the current version for the next drop.
> 
> Ok.
> 
>>> 
>>>> + xfs_free_extent_later(tp, fake.xefi_startblock,
>>>> + fake.xefi_blockcount, &XFS_RMAP_OINFO_ANY_OWNER);
>>>> + /*
>>>> +  * try to free as many extents as possible with current
>>>> +  * transaction
>>>> +  */
>>>> + error = 0;
>>>> + continue;
>>>> + };
>>> 
>>> If we get EAGAIN, why not let the unfinished parts of the recovered EFI
>>> work get turned into a new EFI?
>> 
>> That is an optimization. I wanted to finish more records with the current
>> transaction when possible. Say for this case, where we hit the first EAGAIN
>> on (AG1, bno2, len2):
>> EFI {(AG1, bno1, len1), (AG1, bno2, len2), (AG2, bno3, len3), (AG2, bno4, len4)}
>> with the current patch we need 1 extra EFI, so
>> orig EFI {(AG1, bno1, len1), (AG2, bno3, len3)} # records shown are the ones really done
>> new EFI1 {(AG1, bno2, len2), (AG2, bno4, len4)}
>> 
>> Otherwise, if we move all the records (from the current one on) to a new EFI
>> on the first EAGAIN, we would need
>> orig EFI {(AG1, bno1, len1)}
>> new EFI1 {(AG1, bno2, len2), (AG2, bno3, len3)}
>> new EFI2 {(AG2, bno4, len4)}
>> 
>> In the above case, the optimization saves one EFI.
> 
> The trouble here is that now we're performing updates during log
> recovery in a different order than they would have been done by the
> regular fs code.  I /think/ for EFIs this doesn't matter (it definitely
> matters for the other intent items!) but we (maintainers) prefer not to
> have to think about two different orderings.

OK.

> 
>>> 
>>>> +
>>>> if (error == -EFSCORRUPTED)
>>>> XFS_CORRUPTION_ERROR(__func__, XFS_ERRLEVEL_LOW, mp,
>>>> extp, sizeof(*extp));
>>>> diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c
>>>> index 322eb2ee6c55..00bfe9683fa8 100644
>>>> --- a/fs/xfs/xfs_log_recover.c
>>>> +++ b/fs/xfs/xfs_log_recover.c
>>>> @@ -2540,30 +2540,27 @@ xlog_recover_process_intents(
>>>> struct xfs_log_item *lip;
>>>> struct xfs_ail *ailp;
>>>> int error = 0;
>>>> -#if defined(DEBUG) || defined(XFS_WARN)
>>>> - xfs_lsn_t last_lsn;
>>>> -#endif
>>>> + xfs_lsn_t threshold_lsn;
>>>> 
>>>> ailp = log->l_ailp;
>>>> + threshold_lsn = xfs_ail_max_lsn(ailp);
>>>> spin_lock(&ailp->ail_lock);
>>>> -#if defined(DEBUG) || defined(XFS_WARN)
>>>> - last_lsn = xlog_assign_lsn(log->l_curr_cycle, log->l_curr_block);
>>>> -#endif
>>>> +
>>>> for (lip = xfs_trans_ail_cursor_first(ailp, &cur, 0);
>>>>     lip != NULL;
>>>>     lip = xfs_trans_ail_cursor_next(ailp, &cur)) {
>>>> const struct xfs_item_ops *ops;
>>>> + /*
>>>> +  * An original redo EFI could be split into new EFIs. Those
>>>> +  * new EFIs are supposed to be processed via the capture_list.
>>>> +  * Stop here when the original redo intents are done.
>>>> +  */
>>>> + if (XFS_LSN_CMP(threshold_lsn, lip->li_lsn) < 0)
>>>> + break;
>>> 
>>> I'm not sure what this change accomplishes?  Is it because of the
>>> continue; statement in xfs_efi_item_recover?
>> 
>> The function xlog_recover_process_intents() iterates over all the intent
>> items in the AIL and processes them. According to the comments:
>> 
>> 2519  * When this is called, all of the log intent items which did not have
>> 2520  * corresponding log done items should be in the AIL.  What we do now is update
>> 2521  * the data structures associated with each one.
>> 
>> I am thinking xlog_recover_process_intents() should iterate only the ones
>> which are already on the AIL. If we don’t set a threshold here, it will
>> continue to process new intents added while we are processing the existing
>> ones. In this case, we add new EFIs, and they are added to the AIL in commit
>> callbacks fast enough to show up before we finish the iteration here. The new
>> EFIs are then picked up by xlog_recover_process_intents(); if it continues to
>> process them, the extents in those EFIs get freed a second time by
>> xlog_finish_defer_ops(), which causes problems.
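>> 
>> Roughly, the suspected sequence (a hand-drawn illustration on my side, not a
>> verified trace) is:
>> 
>> 	xlog_recover_process_intents()
>> 	  AIL iteration:
>> 	    recovered EFI -> xfs_efi_item_recover() -> -EAGAIN path:
>> 	                     new EFI logged, remainder captured in capture_list
>> 	    new EFI       -> also visited by this same iteration (extents freed once)
>> 	  xlog_finish_defer_ops()
>> 	    -> replays the captured work (extents freed a second time)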
> 
> Each ->iop_recover function is supposed to do at least one unit of work,
> which should generate new non-intent log items.  

I am not quite sure about that. There are cases where new intents are added
in iop_recover(), for example xfs_attri_item_recover():

632         error = xfs_xattri_finish_update(attr, done_item);
633         if (error == -EAGAIN) {
634                 /*
635                  * There's more work to do, so add the intent item to this
636                  * transaction so that we can continue it later.
637                  */
638                 xfs_defer_add(tp, XFS_DEFER_OPS_TYPE_ATTR, &attr->xattri_list);
639                 error = xfs_defer_ops_capture_and_commit(tp, capture_list);
640                 if (error)
641                         goto out_unlock;
642
643                 xfs_iunlock(ip, XFS_ILOCK_EXCL);
644                 xfs_irele(ip);
645                 return 0;
646         }

I think lines 638 and 639 are doing so.

> These new items are
> attached to the end of the AIL's list before any new intent items
> created to handle an EAGAIN return.  Hence the list order ought to be:
> 
> <recovered EFI> -> <buffer log items for bnobt updates> ->
> <new EFI to continue because busy flush didn't move busy gen>
> 
> Are you seeing an order other than this?

The log order looks good.
It just seems that we should fix xlog_recover_process_intents() not to run
iop_recover() for those items with an lsn equal to or bigger than last_lsn,
to avoid the double-free issue. Please see also my reply to Dave about the
double free.
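
A minimal hand-written sketch of the bound I have in mind (reusing the existing
last_lsn computation; this is not the posted patch):

	last_lsn = xlog_assign_lsn(log->l_curr_cycle, log->l_curr_block);

	for (lip = xfs_trans_ail_cursor_first(ailp, &cur, 0);
	     lip != NULL;
	     lip = xfs_trans_ail_cursor_next(ailp, &cur)) {
		/*
		 * Intents at or beyond last_lsn were logged by recovery
		 * itself and will be finished via the capture_list, so
		 * don't run ->iop_recover() on them again here.
		 */
		if (XFS_LSN_CMP(lip->li_lsn, last_lsn) >= 0)
			break;

		/* existing per-intent recovery continues here */
	}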

thanks,
wengang
Darrick J. Wong May 24, 2023, 12:27 a.m. UTC | #9
On Tue, May 23, 2023 at 02:59:40AM +0000, Wengang Wang wrote:
> 
> 
> > On May 22, 2023, at 3:40 PM, Dave Chinner <david@fromorbit.com> wrote:
> > 
> > On Mon, May 22, 2023 at 06:20:11PM +0000, Wengang Wang wrote:
> >>> On May 21, 2023, at 6:17 PM, Dave Chinner <david@fromorbit.com> wrote:
> >>> On Fri, May 19, 2023 at 10:18:29AM -0700, Wengang Wang wrote:
> >>>> diff --git a/fs/xfs/xfs_extfree_item.c b/fs/xfs/xfs_extfree_item.c
> >>>> index 011b50469301..3c5a9e9952ec 100644
> >>>> --- a/fs/xfs/xfs_extfree_item.c
> >>>> +++ b/fs/xfs/xfs_extfree_item.c
> >>>> @@ -336,6 +336,25 @@ xfs_trans_get_efd(
> >>>> return efdp;
> >>>> }
> >>>> 
> >>>> +/*
> >>>> + * Fill the EFD with all extents from the EFI and set the counter.
> >>>> + * Note: the EFD should comtain at least one extents already.
> >>>> + */
> >>>> +static void xfs_fill_efd_with_efi(struct xfs_efd_log_item *efdp)
> >>>> +{
> >>>> + struct xfs_efi_log_item *efip = efdp->efd_efip;
> >>>> + uint                    i;
> >>>> +
> >>>> + if (efdp->efd_next_extent == efip->efi_format.efi_nextents)
> >>>> + return;
> >>>> +
> >>>> + for (i = 0; i < efip->efi_format.efi_nextents; i++) {
> >>>> +        efdp->efd_format.efd_extents[i] =
> >>>> +        efip->efi_format.efi_extents[i];
> >>>> + }
> >>>> + efdp->efd_next_extent = efip->efi_format.efi_nextents;
> >>>> +}
> >>>> +
> >>> 
> >>> Ok, but it doesn't dirty the transaction or the EFD, which means....
> >> 
> >> Actually EAGAIN shouldn’t happen with the first record in EFIs because
> >> the trans->t_busy is empty in AGFL block allocation for the first record.
> >> So the dirtying work should already be done with the first one.
> > 
> > You're assuming that the only thing we are going to want to return
> > -EAGAIN for freeing attempts is busy extents. Being able to
> > restart btree operations by "commit and retry" opens up
> > a whole new set of performance optimisations we can make to the
> > btree code.
> > 
> > IOWs, I want this functionality to be generic in nature, not
> > tailored specifically to one situation where an -EAGAIN needs to be
> > returned to trigger a commit and retry.
> 
> Yes, I assumed that because I didn’t see relevant EAGAIN handlers
> in existing code. It’s reasonable to make it generic for existing or planned
> EAGAINs.
> 
> > 
> >>>> @@ -369,6 +388,10 @@ xfs_trans_free_extent(
> >>>> error = __xfs_free_extent(tp, xefi->xefi_startblock,
> >>>> xefi->xefi_blockcount, &oinfo, XFS_AG_RESV_NONE,
> >>>> xefi->xefi_flags & XFS_EFI_SKIP_DISCARD);
> >>>> + if (error == -EAGAIN) {
> >>>> + xfs_fill_efd_with_efi(efdp);
> >>>> + return error;
> >>>> + }
> >>> 
> >>> .... this is incorrectly placed.
> >>> 
> >>> The very next lines say:
> >>> 
> >>>> /*
> >>>> * Mark the transaction dirty, even on error. This ensures the
> >>>> * transaction is aborted, which:
> >>> 
> >>> i.e. we have to make the transaction and EFD log item dirty even if
> >>> we have an error. In this case, the error is not fatal, but we still
> >>> have to ensure that we commit the EFD when we roll the transaction.
> >>> Hence the transaction and EFD still need to be dirtied on -EAGAIN...
> >> 
> >> see above.
> > 
> > See above :)
> > 
> >>>> diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c
> >>>> index 322eb2ee6c55..00bfe9683fa8 100644
> >>>> --- a/fs/xfs/xfs_log_recover.c
> >>>> +++ b/fs/xfs/xfs_log_recover.c
> >>>> @@ -2540,30 +2540,27 @@ xlog_recover_process_intents(
> >>>> struct xfs_log_item *lip;
> >>>> struct xfs_ail *ailp;
> >>>> int error = 0;
> >>>> -#if defined(DEBUG) || defined(XFS_WARN)
> >>>> - xfs_lsn_t last_lsn;
> >>>> -#endif
> >>>> + xfs_lsn_t threshold_lsn;
> >>>> 
> >>>> ailp = log->l_ailp;
> >>>> + threshold_lsn = xfs_ail_max_lsn(ailp);
> >>>> spin_lock(&ailp->ail_lock);
> >>>> -#if defined(DEBUG) || defined(XFS_WARN)
> >>>> - last_lsn = xlog_assign_lsn(log->l_curr_cycle, log->l_curr_block);
> >>> 
> >>> xfs_ail_max_lsn() and l_curr_cycle/l_curr_block are not the same
> >>> thing.  max_lsn points to the lsn of the last entry in the AIL (in
> >>> memory state), whilst curr_cycle/block points to the current
> >>> physical location of the log head in the on-disk journal.
> >>> 
> >> 
> >> Yes, I intended to use the lsn of the last entry in the AIL.
> > 
> > Again, they are not the same thing: using the last entry in the
> > AIL here is incorrect. We want to replay all the items in the AIL
> > that were active in the log, not up to the last item in the AIL. The
> > actively recovered log region ends at last_lsn as per above, whilst
> > xfs_ail_max_lsn() is not guaranteed to be less than last_lsn before
> > we start walking it.
> 
> OK, got it.
> 
> > 
> >> For the problem with xlog_recover_process_intents(), please see my reply to
> >> Darrick. On seeing the problem, my first try was to use “last_lsn” to stop
>> the iteration but that didn’t help: last_lsn was found to be quite a bit bigger
>> than even the new EFI lsn, while using xfs_ail_max_lsn() solved the problem.
> > 
> > In what case are we queuing a *new* intent into the AIL that has a
> > LSN less than xlog_assign_lsn(log->l_curr_cycle, log->l_curr_block)?
> > If we are doing that *anywhere*, then we have a likely journal
> > corruption bug in the code because it indicates we committed that
> > item to the journal over something in the log we are currently
> > replaying.
> 
> I made a mistake saying last_lsn is quite a bit bigger than the new EFI lsn;
> it’s actually much bigger than max_lsn (because the cycle increased).
> The lsn of the new EFI is exactly the same as last_lsn.
> 
> > 
> >>> In this case, we can't use in-memory state to determine where to
> >>> stop the initial intent replay - recovery of other items may have
> >>> inserted new intents beyond the end of the physical region being
> >>> recovered, in which case using xfs_ail_max_lsn() will result in
> >>> incorrect behaviour here.
> >> 
>> Yes, this patch is one of those (if some exist) that introduce new intents (EFIs here).
>> We add the new intents to the transaction first (xfs_defer_create_intent()), add
>> the deferred operations to ‘capture_list’. And finally the deferred operations in
>> ‘capture_list’ are processed after the intent-iteration on the AIL.
> > 
> > The changes made in that transaction, including the newly logged
> > EFI, get committed before the rest of the work gets deferred via
> > xfs_defer_ops_capture_and_commit(). That commits the new efi (along
> > with all the changes that have already been made in the transaction)
> > to the CIL, and eventually the journal checkpoints and the new EFI
> > gets inserted into the AIL at the LSN of the checkpoint.
> > 
> > The LSN of the checkpoint is curr_cycle/block - the log head -
> > because that's where the start record of the checkpoint is
> > physically written.  As each iclog is filled, the log head moves
> > forward - it always points at the location that the next journal
> > write will be written to. At the end of a checkpoint, the LSN of the
> > start record is used for AIL insertion.
> > 
> 
>> Thanks for the explanation!
> 
> > Hence if a new log item created by recovery has a LSN less than
> > last_lsn, then we have a serious bug somewhere that needs to be
> > found and fixed. The use of last_lsn tells us something has gone
> > badly wrong during recovery, the use of xfs_ail_max_lsn() removes
> > the detection of the issue and now we don't know that something has
> > gone badly wrong...
> > 
> 
> I made a mistake: the (first) new EFI lsn is the same as last_lsn, sorry
> for the confusion.
> 
> >> For existing other cases (if there are) where new intents are added,
> >> they don’t use the capture_list for delayed operations? Do you have example then? 
> >> if so I think we should follow their way instead of adding the defer operations
> >> (but reply on the intents on AIL).
> > 
> > All of the intent recovery stuff uses
> > xfs_defer_ops_capture_and_commit() to commit the intent being
> > replayed and cause all further new intent processing in that chain
> > to be defered until after all the intents recovered from the journal
> > have been iterated. All those new intents end up in the AIL at a LSN
> > index >= last_lsn.
> 
> Yes. So we break the AIL iteration on seeing an intent with lsn equal to
> or bigger than last_lsn and skip the iop_recover() for that item?
> and shall we put this change to another separated patch as it is to fix
> an existing problem (not introduced by my patch)?

Intent replay creates non-intent log items (like buffers or inodes) that
are added to the AIL with an LSN higher than last_lsn.  I suppose it
would be possible to break log recovery if an intent's iop_recover
method immediately logged a new intent and returned EAGAIN to roll the
transaction, but none of them do that; and I think the ASSERT you moved
would detect such a thing.

--D

> thanks,
> wengang
Wengang Wang May 24, 2023, 4:27 p.m. UTC | #10
> On May 23, 2023, at 5:27 PM, Darrick J. Wong <djwong@kernel.org> wrote:
> 
> On Tue, May 23, 2023 at 02:59:40AM +0000, Wengang Wang wrote:
>> 
>> 
>>> On May 22, 2023, at 3:40 PM, Dave Chinner <david@fromorbit.com> wrote:
>>> 
>>> On Mon, May 22, 2023 at 06:20:11PM +0000, Wengang Wang wrote:
>>>>> On May 21, 2023, at 6:17 PM, Dave Chinner <david@fromorbit.com> wrote:
>>>>> On Fri, May 19, 2023 at 10:18:29AM -0700, Wengang Wang wrote:
>>>>>> diff --git a/fs/xfs/xfs_extfree_item.c b/fs/xfs/xfs_extfree_item.c
>>>>>> index 011b50469301..3c5a9e9952ec 100644
>>>>>> --- a/fs/xfs/xfs_extfree_item.c
>>>>>> +++ b/fs/xfs/xfs_extfree_item.c
>>>>>> @@ -336,6 +336,25 @@ xfs_trans_get_efd(
>>>>>> return efdp;
>>>>>> }
>>>>>> 
>>>>>> +/*
>>>>>> + * Fill the EFD with all extents from the EFI and set the counter.
>>>>>> + * Note: the EFD should contain at least one extent already.
>>>>>> + */
>>>>>> +static void xfs_fill_efd_with_efi(struct xfs_efd_log_item *efdp)
>>>>>> +{
>>>>>> + struct xfs_efi_log_item *efip = efdp->efd_efip;
>>>>>> + uint                    i;
>>>>>> +
>>>>>> + if (efdp->efd_next_extent == efip->efi_format.efi_nextents)
>>>>>> + return;
>>>>>> +
>>>>>> + for (i = 0; i < efip->efi_format.efi_nextents; i++) {
>>>>>> +        efdp->efd_format.efd_extents[i] =
>>>>>> +        efip->efi_format.efi_extents[i];
>>>>>> + }
>>>>>> + efdp->efd_next_extent = efip->efi_format.efi_nextents;
>>>>>> +}
>>>>>> +
>>>>> 
>>>>> Ok, but it doesn't dirty the transaction or the EFD, which means....
>>>> 
>>>> Actually EAGAIN shouldn’t happen with the first record in EFIs because
>>>> the trans->t_busy is empty in AGFL block allocation for the first record.
>>>> So the dirtying work should already be done with the first one.
>>> 
>>> You're assuming that the only thing we are going to want to return
>>> -EAGAIN for freeing attempts is busy extents. Being able to
>>> restart btree operations by "commit and retry" opens up
>>> a whole new set of performance optimisations we can make to the
>>> btree code.
>>> 
>>> IOWs, I want this functionality to be generic in nature, not
>>> tailored specifically to one situation where an -EAGAIN needs to be
>>> returned to trigger a commit and retry.
>> 
>> Yes, I assumed that because I didn’t see relevant EAGAIN handlers
>> in existing code. It’s reasonable to make it generic for existing or planned
>> EAGAINs.
>> 
>>> 
>>>>>> @@ -369,6 +388,10 @@ xfs_trans_free_extent(
>>>>>> error = __xfs_free_extent(tp, xefi->xefi_startblock,
>>>>>> xefi->xefi_blockcount, &oinfo, XFS_AG_RESV_NONE,
>>>>>> xefi->xefi_flags & XFS_EFI_SKIP_DISCARD);
>>>>>> + if (error == -EAGAIN) {
>>>>>> + xfs_fill_efd_with_efi(efdp);
>>>>>> + return error;
>>>>>> + }
>>>>> 
>>>>> .... this is incorrectly placed.
>>>>> 
>>>>> The very next lines say:
>>>>> 
>>>>>> /*
>>>>>> * Mark the transaction dirty, even on error. This ensures the
>>>>>> * transaction is aborted, which:
>>>>> 
>>>>> i.e. we have to make the transaction and EFD log item dirty even if
>>>>> we have an error. In this case, the error is not fatal, but we still
>>>>> have to ensure that we commit the EFD when we roll the transaction.
>>>>> Hence the transaction and EFD still need to be dirtied on -EAGAIN...
>>>> 
>>>> see above.
>>> 
>>> See above :)
>>> 
>>>>>> diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c
>>>>>> index 322eb2ee6c55..00bfe9683fa8 100644
>>>>>> --- a/fs/xfs/xfs_log_recover.c
>>>>>> +++ b/fs/xfs/xfs_log_recover.c
>>>>>> @@ -2540,30 +2540,27 @@ xlog_recover_process_intents(
>>>>>> struct xfs_log_item *lip;
>>>>>> struct xfs_ail *ailp;
>>>>>> int error = 0;
>>>>>> -#if defined(DEBUG) || defined(XFS_WARN)
>>>>>> - xfs_lsn_t last_lsn;
>>>>>> -#endif
>>>>>> + xfs_lsn_t threshold_lsn;
>>>>>> 
>>>>>> ailp = log->l_ailp;
>>>>>> + threshold_lsn = xfs_ail_max_lsn(ailp);
>>>>>> spin_lock(&ailp->ail_lock);
>>>>>> -#if defined(DEBUG) || defined(XFS_WARN)
>>>>>> - last_lsn = xlog_assign_lsn(log->l_curr_cycle, log->l_curr_block);
>>>>> 
>>>>> xfs_ail_max_lsn() and l_curr_cycle/l_curr_block are not the same
>>>>> thing.  max_lsn points to the lsn of the last entry in the AIL (in
>>>>> memory state), whilst curr_cycle/block points to the current
>>>>> physical location of the log head in the on-disk journal.
>>>>> 
>>>> 
>>>> Yes, I intended to use the lsn of the last entry in the AIL.
>>> 
>>> Again, they are not the same thing: using the last entry in the
>>> AIL here is incorrect. We want to replay all the items in the AIL
>>> that were active in the log, not up to the last item in the AIL. The
>>> actively recovered log region ends at last_lsn as per above, whilst
>>> xfs_ail_max_lsn() is not guaranteed to be less than last_lsn before
>>> we start walking it.
>> 
>> OK, got it.
>> 
>>> 
>>>> For the problem with xlog_recover_process_intents(), please see my reply to
>>>> Darrick. On seeing the problem, my first try was to use “last_lsn” to stop
>>>> the iteration but that didn’t help: last_lsn was found to be quite a bit bigger
>>>> than even the new EFI lsn, while using xfs_ail_max_lsn() solved the problem.
>>> 
>>> In what case are we queuing a *new* intent into the AIL that has a
>>> LSN less than xlog_assign_lsn(log->l_curr_cycle, log->l_curr_block)?
>>> If we are doing that *anywhere*, then we have a likely journal
>>> corruption bug in the code because it indicates we committed that
>>> item to the journal over something in the log we are currently
>>> replaying.
>> 
>> I made a mistake saying last_lsn is quite a bit bigger than the new EFI lsn;
>> it’s actually much bigger than max_lsn (because the cycle increased).
>> The lsn of the new EFI is exactly the same as last_lsn.
>> 
>>> 
>>>>> In this case, we can't use in-memory state to determine where to
>>>>> stop the initial intent replay - recovery of other items may have
>>>>> inserted new intents beyond the end of the physical region being
>>>>> recovered, in which case using xfs_ail_max_lsn() will result in
>>>>> incorrect behaviour here.
>>>> 
>>>> Yes, this patch is one of those (if some exist) that introduce new intents (EFIs here).
>>>> We add the new intents to the transaction first (xfs_defer_create_intent()), add
>>>> the deferred operations to ‘capture_list’. And finally the deferred operations in
>>>> ‘capture_list’ are processed after the intent-iteration on the AIL.
>>> 
>>> The changes made in that transaction, including the newly logged
>>> EFI, get committed before the rest of the work gets deferred via
>>> xfs_defer_ops_capture_and_commit(). That commits the new efi (along
>>> with all the changes that have already been made in the transaction)
>>> to the CIL, and eventually the journal checkpoints and the new EFI
>>> gets inserted into the AIL at the LSN of the checkpoint.
>>> 
>>> The LSN of the checkpoint is curr_cycle/block - the log head -
>>> because that's where the start record of the checkpoint is
>>> physically written.  As each iclog is filled, the log head moves
>>> forward - it always points at the location that the next journal
>>> write will be written to. At the end of a checkpoint, the LSN of the
>>> start record is used for AIL insertion.
>>> 
>> 
>> Thanks for the explanation!
>> 
>>> Hence if a new log item created by recovery has a LSN less than
>>> last_lsn, then we have a serious bug somewhere that needs to be
>>> found and fixed. The use of last_lsn tells us something has gone
>>> badly wrong during recovery, the use of xfs_ail_max_lsn() removes
>>> the detection of the issue and now we don't know that something has
>>> gone badly wrong...
>>> 
>> 
>> I made a mistake: the (first) new EFI lsn is the same as last_lsn, sorry
>> for the confusion.
>> 
>>>> For existing other cases (if there are) where new intents are added,
>>>> they don’t use the capture_list for delayed operations? Do you have example then? 
>>>> if so I think we should follow their way instead of adding the defer operations
>>>> (but reply on the intents on AIL).
>>> 
>>> All of the intent recovery stuff uses
>>> xfs_defer_ops_capture_and_commit() to commit the intent being
>>> replayed and cause all further new intent processing in that chain
>>> to be defered until after all the intents recovered from the journal
>>> have been iterated. All those new intents end up in the AIL at a LSN
>>> index >= last_lsn.
>> 
>> Yes. So we break the AIL iteration on seeing an intent with lsn equal to
>> or bigger than last_lsn and skip the iop_recover() for that item?
>> and shall we put this change to another separated patch as it is to fix
>> an existing problem (not introduced by my patch)?
> 
> Intent replay creates non-intent log items (like buffers or inodes) that
> are added to the AIL with an LSN higher than last_lsn.  I suppose it
> would be possible to break log recovery if an intent's iop_recover
> method immediately logged a new intent and returned EAGAIN to roll the
> transaction, but none of them do that;

I am not quite sure about that. There are cases where new intents are added
in iop_recover(), for example xfs_attri_item_recover():

632         error = xfs_xattri_finish_update(attr, done_item);
633         if (error == -EAGAIN) {
634                 /*
635                  * There's more work to do, so add the intent item to this
636                  * transaction so that we can continue it later.
637                  */
638                 xfs_defer_add(tp, XFS_DEFER_OPS_TYPE_ATTR, &attr->xattri_list);
639                 error = xfs_defer_ops_capture_and_commit(tp, capture_list);
640                 if (error)
641                         goto out_unlock;
642
643                 xfs_iunlock(ip, XFS_ILOCK_EXCL);
644                 xfs_irele(ip);
645                 return 0;
646         }

I think lines 638 and 639 are doing so.

> and I think the ASSERT you moved
> would detect such a thing.

ASSERT is a no-op in production kernels, so it has less chance to detect things.
And because this is log recovery, whether iop_recover() logs new intents depends
heavily on the active log and on how the FS was being used. I suspect that’s why
the problem in xlog_recover_process_intents() wasn’t detected previously.
But looking at the existing source code, I believe that
xlog_recover_process_intents() would double-process new intent(s) which
are added by xfs_attri_item_recover() when it runs into the -EAGAIN case with the
original active-log ATTRI. The first pass is within the AIL intents iteration, the other
is via the capture_list.

thanks,
wengang
Dave Chinner May 25, 2023, 2:20 a.m. UTC | #11
On Wed, May 24, 2023 at 04:27:19PM +0000, Wengang Wang wrote:
> > On May 23, 2023, at 5:27 PM, Darrick J. Wong <djwong@kernel.org> wrote:
> > On Tue, May 23, 2023 at 02:59:40AM +0000, Wengang Wang wrote:
> >>> On May 22, 2023, at 3:40 PM, Dave Chinner <david@fromorbit.com> wrote:
> >>> On Mon, May 22, 2023 at 06:20:11PM +0000, Wengang Wang wrote:
> >>>>> On May 21, 2023, at 6:17 PM, Dave Chinner <david@fromorbit.com> wrote:
> >>>>> On Fri, May 19, 2023 at 10:18:29AM -0700, Wengang Wang wrote:
> >>>>>> diff --git a/fs/xfs/xfs_extfree_item.c b/fs/xfs/xfs_extfree_item.c
> >>>> For existing other cases (if there are) where new intents are added,
> >>>> they don’t use the capture_list for delayed operations? Do you have example then? 
> >>>> if so I think we should follow their way instead of adding the defer operations
> >>>> (but reply on the intents on AIL).
> >>> 
> >>> All of the intent recovery stuff uses
> >>> xfs_defer_ops_capture_and_commit() to commit the intent being
> >>> replayed and cause all further new intent processing in that chain
> >>> to be defered until after all the intents recovered from the journal
> >>> have been iterated. All those new intents end up in the AIL at a LSN
> >>> index >= last_lsn.
> >> 
> >> Yes. So we break the AIL iteration on seeing an intent with lsn equal to
> >> or bigger than last_lsn and skip the iop_recover() for that item?
> >> and shall we put this change to another separated patch as it is to fix
> >> an existing problem (not introduced by my patch)?
> > 
> > Intent replay creates non-intent log items (like buffers or inodes) that
> > are added to the AIL with an LSN higher than last_lsn.  I suppose it
> > would be possible to break log recovery if an intent's iop_recover
> > method immediately logged a new intent and returned EAGAIN to roll the
> > transaction, but none of them do that;
> 
> I am not quite sure about that. There are cases where new intents are added
> in iop_recover(), for example xfs_attri_item_recover():
> 
> 632         error = xfs_xattri_finish_update(attr, done_item);
> 633         if (error == -EAGAIN) {
> 634                 /*
> 635                  * There's more work to do, so add the intent item to this
> 636                  * transaction so that we can continue it later.
> 637                  */
> 638                 xfs_defer_add(tp, XFS_DEFER_OPS_TYPE_ATTR, &attr->xattri_list);
> 639                 error = xfs_defer_ops_capture_and_commit(tp, capture_list);
> 640                 if (error)
> 641                         goto out_unlock;
> 642
> 643                 xfs_iunlock(ip, XFS_ILOCK_EXCL);
> 644                 xfs_irele(ip);
> 645                 return 0;
> 646         }
> 
> I think lines 638 and 639 are doing so.

I don't think so. @attrip is the attribute information recovered
from the original intent - we allocated @attr in
xfs_attri_item_recover() to enable the operation indicated by
@attrip to be executed.  We copy stuff from @attrip (the item we are
recovering) to the new @attr structure (the work we need to do) and
run it through the attr modification state machine. If there's more
work to be done, we then add @attr to the defer list.

If there is deferred work to be continued, we still need a new
intent to indicate what attr operation needs to be performed next in
the chain.  When we commit the transaction (639), it will create a
new intent log item for the deferred attribute operation in the
->create_intent callback before the transaction is committed.

IOWs, we never re-use the incoming intent item that we are
recovering. The new log item will end up in the AIL at/beyond
last_lsn when the CIL is committed. It does not get further
processing work done until all the intents in the log that need
recovery have had their initial processing performed and the log
space they consume has been freed up.

> > and I think the ASSERT you moved would detect such a thing.
> 
> ASSERT is a no-op in production kernels, so it has less chance to
> detect things.

Please understand that we do actually know how the ASSERT
infrastructure works and we utilise it to our advantage in many
ways.  We often use asserts to document design/code constraints and
use debug kernels to perform runtime design rule violation
detection.

Indeed, we really don't want design/code constraints being checked on
production kernels, largely because we know they are never going to
be tripped in normal production system operation. IOWs, these checks
are unnecessary in production systems because we've already done all
the constraint/correctness checking of the code on debug kernels
before we release the software to production.

If a particular situation is a production concern, we code it as an
error check, not an ASSERT. If it's a design or implementation
constraint check, it's an ASSERT. The last_lsn ASSERT is checking
that the code is behaving according to design constraints (i.e.
CIL/AIL ordering has not been screwed up, intent recovery has not
changed behaviour, and new objects always appear at >= last_lsn).
None of these things should ever occur in a production system - if
any of them do occur then we'll have much, much worse problems to
address than log recovery maybe running an intent too early.

-Dave.
Wengang Wang May 25, 2023, 5:13 p.m. UTC | #12
> On May 24, 2023, at 7:20 PM, Dave Chinner <david@fromorbit.com> wrote:
> 
> On Wed, May 24, 2023 at 04:27:19PM +0000, Wengang Wang wrote:
>>> On May 23, 2023, at 5:27 PM, Darrick J. Wong <djwong@kernel.org> wrote:
>>> On Tue, May 23, 2023 at 02:59:40AM +0000, Wengang Wang wrote:
>>>>> On May 22, 2023, at 3:40 PM, Dave Chinner <david@fromorbit.com> wrote:
>>>>> On Mon, May 22, 2023 at 06:20:11PM +0000, Wengang Wang wrote:
>>>>>>> On May 21, 2023, at 6:17 PM, Dave Chinner <david@fromorbit.com> wrote:
>>>>>>> On Fri, May 19, 2023 at 10:18:29AM -0700, Wengang Wang wrote:
>>>>>>>> diff --git a/fs/xfs/xfs_extfree_item.c b/fs/xfs/xfs_extfree_item.c
>>>>>> For existing other cases (if there are) where new intents are added,
>>>>>> they don’t use the capture_list for delayed operations? Do you have example then? 
>>>>>> if so I think we should follow their way instead of adding the defer operations
>>>>>> (but reply on the intents on AIL).
>>>>> 
>>>>> All of the intent recovery stuff uses
>>>>> xfs_defer_ops_capture_and_commit() to commit the intent being
>>>>> replayed and cause all further new intent processing in that chain
>>>>> to be defered until after all the intents recovered from the journal
>>>>> have been iterated. All those new intents end up in the AIL at a LSN
>>>>> index >= last_lsn.
>>>> 
>>>> Yes. So we break the AIL iteration on seeing an intent with lsn equal to
>>>> or bigger than last_lsn and skip the iop_recover() for that item?
>>>> and shall we put this change to another separated patch as it is to fix
>>>> an existing problem (not introduced by my patch)?
>>> 
>>> Intent replay creates non-intent log items (like buffers or inodes) that
>>> are added to the AIL with an LSN higher than last_lsn.  I suppose it
>>> would be possible to break log recovery if an intent's iop_recover
>>> method immediately logged a new intent and returned EAGAIN to roll the
>>> transaction, but none of them do that;
>> 
>> I am not quite sure about that. There are cases where new intents are added
>> in iop_recover(), for example xfs_attri_item_recover():
>> 
>> 632         error = xfs_xattri_finish_update(attr, done_item);
>> 633         if (error == -EAGAIN) {
>> 634                 /*
>> 635                  * There's more work to do, so add the intent item to this
>> 636                  * transaction so that we can continue it later.
>> 637                  */
>> 638                 xfs_defer_add(tp, XFS_DEFER_OPS_TYPE_ATTR, &attr->xattri_list);
>> 639                 error = xfs_defer_ops_capture_and_commit(tp, capture_list);
>> 640                 if (error)
>> 641                         goto out_unlock;
>> 642
>> 643                 xfs_iunlock(ip, XFS_ILOCK_EXCL);
>> 644                 xfs_irele(ip);
>> 645                 return 0;
>> 646         }
>> 
>> I think lines 638 and 639 are doing so.
> 
> I don't think so. @attrip is the attribute information recovered
> from the original intent - we allocated @attr in
> xfs_attri_item_recover() to enable the operation indicated by
> @attrip to be executed.  We copy stuff from @attrip (the item we are
> recovering) to the new @attr structure (the work we need to do) and
> run it through the attr modification state machine. If there's more
> work to be done, we then add @attr to the defer list.

Assuming “if there’s more work to be done” is the -EAGAIN case...

> 
> If there is deferred work to be continued, we still need a new
> intent to indicate what attr operation needs to be performed next in
> the chain.  When we commit the transaction (639), it will create a
> new intent log item for the deferred attribute operation in the
> ->create_intent callback before the transaction is committed.

Yes, that’s true. But it doesn’t conflict with what I was saying. The point I
wanted to make is this:
the new intent log item (introduced by xfs_attri_item_recover()) could appear on
the AIL before the AIL iteration ends in xlog_recover_process_intents().
Do you agree with the above?

In the following, I will talk about the new intent log item.

iop_recover() is then executed on the new intent during the AIL intents
iteration; I mean this loop:

2548         for (lip = xfs_trans_ail_cursor_first(ailp, &cur, 0);
2549              lip != NULL;
2550              lip = xfs_trans_ail_cursor_next(ailp, &cur)) {
….
2584         }
Do you agree with the above?

And the lsn of the new intent is equal to or bigger than last_lsn.
Do you agree with the above?

In the above case, the iop_recover() is xfs_attri_item_recover(). The latter
creates an xfs_attr_intent, attr1, copying things from the new ATTRI into attr1,
and processes attr1.
Do you agree with the above?

The above xfs_attri_item_recover() runs successfully and the AIL intent iteration ends.
Now it’s time to process the capture_list with xlog_finish_defer_ops(). The
capture_list contains the deferred operation that was added at line 639 with type
XFS_DEFER_OPS_TYPE_ATTR.
Do you agree with the above?

In xlog_finish_defer_ops(), a new transaction is created by xfs_trans_alloc(),
the deferred XFS_DEFER_OPS_TYPE_ATTR operation is attached to that new
transaction by xfs_defer_ops_continue(), and the new transaction is committed by
xfs_trans_commit().
Do you agree with the above?

During xfs_trans_commit(), xfs_defer_finish_noroll() is called, and
xfs_defer_finish_one() is called inside it. For the
deferred XFS_DEFER_OPS_TYPE_ATTR operation, xfs_attr_finish_item()
is called.
Do you agree with the above?

In xfs_attr_finish_item(), an xfs_attr_intent (attr2) is taken out of the above new ATTRI
and is processed. attr2 and attr1 contain exactly the same thing because they are both
from the new ATTRI, so the processing of attr2 is purely a duplication of the processing
of attr1. So the new ATTRI is actually double-processed.
Do you agree with the above?


> 
> IOWs, we never re-use the incoming intent item that we are
> recovering. The new log item will end up in the AIL at/beyond
> last_lsn when the CIL is committed. It does not get further
> processing work done until all the intents in the log that need
> recovery have had their initial processing performed and the log
> space they consume has been freed up.
> 
>>> and I think the ASSERT you moved would detect such a thing.
>> 
>> ASSERT is nothing in production kernel, so it has less chance to
>> detect things.
> 
> Please understand that we do actually know how the ASSERT
> infrastructure works and we utilise it to our advantage in many
> ways.  We often use asserts to document design/code constraints and
> use debug kernels to perform runtime design rule violation
> detection.
> 
> Indeed, we really don't want design/code constraints being checked on
> production kernels, largely because we know they are never going to
> be tripped in normal production system operation. IOWs, these checks
> are unnecessary in production systems because we've already done all
> the constraint/correctness checking of the code on debug kernels
> before we release the software to production.
> 
> If a particular situation is a production concern, we code it as an
> error check, not an ASSERT. If it's a design or implementation
> constraint check, it's an ASSERT. The last_lsn ASSERT is checking
> that the code is behaving according to design constraints (i.e.
> CIL/AIL ordering has not been screwed up, intent recovery has not
> changed behaviour, and new objects always appear at >= last_lsn).
> None of these things should ever occur in a production system - if
> any of them occur do then we'll have much, much worse problems to
> address than log recovery maybe running an intent too early.

Thanks for the explanation of the ASSERT; I have no doubt about it. I was trying
to say why it didn’t catch the problem in xlog_recover_process_intents().

thanks,
wengang
Dave Chinner May 26, 2023, 12:26 a.m. UTC | #13
On Thu, May 25, 2023 at 05:13:41PM +0000, Wengang Wang wrote:
> 
> 
> > On May 24, 2023, at 7:20 PM, Dave Chinner <david@fromorbit.com> wrote:
> > 
> > On Wed, May 24, 2023 at 04:27:19PM +0000, Wengang Wang wrote:
> >>> On May 23, 2023, at 5:27 PM, Darrick J. Wong <djwong@kernel.org> wrote:
> >>> On Tue, May 23, 2023 at 02:59:40AM +0000, Wengang Wang wrote:
> >>>>> On May 22, 2023, at 3:40 PM, Dave Chinner <david@fromorbit.com> wrote:
> >>>>> On Mon, May 22, 2023 at 06:20:11PM +0000, Wengang Wang wrote:
> >>>>>>> On May 21, 2023, at 6:17 PM, Dave Chinner <david@fromorbit.com> wrote:
> >>>>>>> On Fri, May 19, 2023 at 10:18:29AM -0700, Wengang Wang wrote:
> >>>>>>>> diff --git a/fs/xfs/xfs_extfree_item.c b/fs/xfs/xfs_extfree_item.c
> >>>>>> For existing other cases (if there are) where new intents are added,
> >>>>>> they don’t use the capture_list for delayed operations? Do you have example then? 
> >>>>>> if so I think we should follow their way instead of adding the defer operations
> >>>>>> (but reply on the intents on AIL).
> >>>>> 
> >>>>> All of the intent recovery stuff uses
> >>>>> xfs_defer_ops_capture_and_commit() to commit the intent being
> >>>>> replayed and cause all further new intent processing in that chain
> >>>>> to be defered until after all the intents recovered from the journal
> >>>>> have been iterated. All those new intents end up in the AIL at a LSN
> >>>>> index >= last_lsn.
> >>>> 
> >>>> Yes. So we break the AIL iteration on seeing an intent with lsn equal to
> >>>> or bigger than last_lsn and skip the iop_recover() for that item?
> >>>> and shall we put this change to another separated patch as it is to fix
> >>>> an existing problem (not introduced by my patch)?
> >>> 
> >>> Intent replay creates non-intent log items (like buffers or inodes) that
> >>> are added to the AIL with an LSN higher than last_lsn.  I suppose it
> >>> would be possible to break log recovery if an intent's iop_recover
> >>> method immediately logged a new intent and returned EAGAIN to roll the
> >>> transaction, but none of them do that;
> >> 
> >> I am not quite sure for above. There are cases that new intents are added
> >> in iop_recover(), for example xfs_attri_item_recover():
> >> 
> >> 632         error = xfs_xattri_finish_update(attr, done_item);
> >> 633         if (error == -EAGAIN) {
> >> 634                 /*
> >> 635                  * There's more work to do, so add the intent item to this
> >> 636                  * transaction so that we can continue it later.
> >> 637                  */
> >> 638                 xfs_defer_add(tp, XFS_DEFER_OPS_TYPE_ATTR, &attr->xattri_list);
> >> 639                 error = xfs_defer_ops_capture_and_commit(tp, capture_list);
> >> 640                 if (error)
> >> 641                         goto out_unlock;
> >> 642
> >> 643                 xfs_iunlock(ip, XFS_ILOCK_EXCL);
> >> 644                 xfs_irele(ip);
> >> 645                 return 0;
> >> 646         }
> >> 
> >> I am thinking line 638 and 639 are doing so.
> > 
> > I don't think so. @attrip is the attribute information recovered
> > from the original intent - we allocated @attr in
> > xfs_attri_item_recover() to enable the operation indicated by
> > @attrip to be executed.  We copy stuff from @attrip (the item we are
> > recovering) to the new @attr structure (the work we need to do) and
> > run it through the attr modification state machine. If there's more
> > work to be done, we then add @attr to the defer list.
> 
> Assuming “if there’s more work to be done” is the -EAGAIN case...
> 
> > 
> > If there is deferred work to be continued, we still need a new
> > intent to indicate what attr operation needs to be performed next in
> > the chain.  When we commit the transaction (639), it will create a
> > new intent log item for the deferred attribute operation in the
> > ->create_intent callback before the transaction is committed.
> 
> Yes, that’s true. But it has no conflict with what I was saying. The thing I wanted
> to say is that:
> The new intent log item (introduced by xfs_attri_item_recover()) could appear on
> AIL before the AIL iteration ends in xlog_recover_process_intents(). 
> Do you agree with above?

Recovery of intents has *always* been able to do this. What I don't
understand is why you think this is a problem.

> In the following I will talk about the new intent log item. 
> 
> The iop_recover() is then executed on the new intent during the AIL intents iteration.
> I meant this loop:
> 
> 2548         for (lip = xfs_trans_ail_cursor_first(ailp, &cur, 0);
> 2549              lip != NULL;
> 2550              lip = xfs_trans_ail_cursor_next(ailp, &cur)) {
> ….
> 2584         }
> Do you agree with above?

Repeatedly asserting that I must agree that your logic is correct
comes across as unnecessarily aggressive and combative and is not
acceptable behaviour.

Aggressively asserting that you are right doesn't make your logic
any more compelling. It just makes you look really bad when,
inevitably, you get politely told that you've made yet another
mistake.

> And the lsn for the new intent is equal to or bigger than last_lsn.
> Do you agree with above?
> 
> In above case the iop_recover() is xfs_attri_item_recover(). The later
> creates a xfs_attr_intent, attr1, copying things from the new ATTRI to attr1
> and process attr1.
> Do you agree with above?
> 
> Above xfs_attri_item_recover() runs successfully and the AIL intent iteration ends.

We ran xfs_xattri_finish_update(attr, done_item) which returned
-EAGAIN. This means we ran xfs_xattri_finish_update() and modified
-something- before we logged the new intent. That, however, doesn't
dictate what we need to log in the new intent - that is determined
by the overall intent chain and recovery architecture the subsystem
uses. And the attribute modification architecture is extremely
subtle, complex and full of unexpected loops.

> Now it’s time to process the capture_list with xlog_finish_defer_ops(). The
> capture_list contains the deferred operation which was added at line 639 with type
> XFS_DEFER_OPS_TYPE_ATTR. 
> Do you agree with above?
> 
> In xlog_finish_defer_ops(), a new transaction is created by xfs_trans_alloc().
> the deferred XFS_DEFER_OPS_TYPE_ATTR operation is attached to that new
> transaction by xfs_defer_ops_continue(). Then the new transaction is committed by
> xfs_trans_commit().
> Do you agree with above?

So at this point we are logging the same intent *information*
again. But it's not the same intent - we've retired the original
intent and we have a new intent chain for the operations we are
going to perform. But why are we logging the same information in a
new intent?

Well, the attr modification state machine was -carefully- designed
to work this way: if we crash part way through an intent chain, we
always need to restart it in recovery with a "garbage collection"
step that undoes the partially complete changes that had been made
before we crashed. Only once we've done that garbage collection step
can we restart the original operation the intent described.

Indeed, the operations we perform recovering an attr intent are
actually different to the operations we perform with those intents
at runtime. Recovery has to perform garbage collection as a first
step, yet runtime does not.

IOWs, xfs_attri_item_recover() -> xfs_xattri_finish_update() is
typically performing garbage collection in the first recovery
transaction.  If it does perform a gc operation, then it is
guaranteed to return -EAGAIN so that the actual operation recovery
needs to perform can have new intents logged, be deferred and then
eventually replayed.

If we don't need to perform gc, then the operation we perform may
still require multiple transactions to complete and we get -EAGAIN
in that case, too. When that happens, we need to ensure if we crash
before recovery completes that the next mount will perform the
correct GC steps before it replays the intent from scratch. So we
always need to log an intent for the xattr operation that will
trigger the GC step on recovery.

Either way, we need to log new intents that are identical to the
original one at each step of the process so that log recovery will
always start the intent recovery process with the correct garbage
collection operation....
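
To put that in code terms, here is a rough sketch of the recovery side
(simplified from the lines 632-646 snippet quoted earlier in this
thread; not the literal kernel code):

	error = xfs_xattri_finish_update(attr, done_item);
	if (error == -EAGAIN) {
		/*
		 * This step (typically the GC step) is done, but more
		 * work remains: defer @attr so a new, identical ATTRI
		 * is logged, and let the capture list continue the
		 * chain after all recovered intents have been iterated.
		 */
		xfs_defer_add(tp, XFS_DEFER_OPS_TYPE_ATTR, &attr->xattri_list);
		return xfs_defer_ops_capture_and_commit(tp, capture_list);
	}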

> During the xfs_trans_commit(), xfs_defer_finish_noroll() is called. and
> xfs_defer_finish_one() is called inside xfs_defer_finish_noroll(). For the
> deferred XFS_DEFER_OPS_TYPE_ATTR operation, xfs_attr_finish_item()
> is called.
> Do you agree with above?
> 
> In xfs_attr_finish_item(), xfs_attr_intent (attr2) is taken out from above new ATTRI
> and is processed.  attr2 and attr1 contain exactly same thing because they are both
> from the new ATTRI.  So the processing on attr2 is pure a duplication of the processing
> on attr1.  So actually the new ATTRI are double-processed.
> Do you agree with above? 

No, that's wrong. There are no "attr1" and "attr2" structures - they
are one and the same. i.e. the xfs_attr_intent used in
xfs_attr_finish_item() is the same one created in
xfs_attri_item_recover() that we called
xfs_defer_add(&attr->xattri_list) with way back at line 638. 

xfs_attr_finish_item(
        struct xfs_trans                *tp,
        struct xfs_log_item             *done,
        struct list_head                *item,
        struct xfs_btree_cur            **state)
{
        struct xfs_attr_intent          *attr;
        struct xfs_attrd_log_item       *done_item = NULL;
        int                             error;

>>>>>   attr = container_of(item, struct xfs_attr_intent, xattri_list);
        if (done)
                done_item = ATTRD_ITEM(done);

xfs_attr_finish_item() extracts the @attr structure from the
listhead that is linking it into the deferred op chain, and we
run xfs_attr_set_iter() again to run the next step(s) in the xattr
modification state machine. If this returns -EAGAIN (and it can)
we go around the "defer(attr->attri_list), create new intent,
commit, ... ->finish_item() -> xfs_attr_finish_item()" loop again.
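
Schematically (an illustrative sketch of that cycle, not the actual
code):

	for (;;) {
		/* run the next step(s) of the xattr state machine */
		error = xfs_attr_set_iter(attr);
		if (error != -EAGAIN)
			break;
		/* more steps remain: defer the same @attr again */
		xfs_defer_add(tp, XFS_DEFER_OPS_TYPE_ATTR, &attr->xattri_list);
		/*
		 * The commit/roll then logs a fresh ATTRI via
		 * ->create_intent, and ->finish_item() ->
		 * xfs_attr_finish_item() brings us back here with the
		 * very same @attr structure.
		 */
	}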

From this, I am guessing that you are confusing the per-defer-cycle
xfs_attri_log_item with the overall xfs_attr_intent context. i.e. We
allocate a new xfs_attri_log_item (and xfs_attrd_log_item) every
time we create new intents in the deferred operation chain. Each
xfs_attri_log_item is created from the state that is carried
across the entire attr recovery chain in the xfs_attr_intent
structure....
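
Sketching the ownership (illustrative declarations, not the real ones):

	struct xfs_attr_intent {
		struct list_head	xattri_list;	/* defer chain linkage */
		/* attr op state, carried across the whole recovery chain */
	};

	/*
	 * Per defer cycle, ->create_intent builds a fresh
	 * struct xfs_attri_log_item from that state, and a matching
	 * struct xfs_attrd_log_item marks the step done. The
	 * xfs_attr_intent itself lives on until the chain completes.
	 */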

Cheers,

Dave.
Wengang Wang May 26, 2023, 6:59 p.m. UTC | #14
> On May 25, 2023, at 5:26 PM, Dave Chinner <david@fromorbit.com> wrote:
> 
> On Thu, May 25, 2023 at 05:13:41PM +0000, Wengang Wang wrote:
>> 
>> 
>>> On May 24, 2023, at 7:20 PM, Dave Chinner <david@fromorbit.com> wrote:
>>> 
>>> On Wed, May 24, 2023 at 04:27:19PM +0000, Wengang Wang wrote:
>>>>> On May 23, 2023, at 5:27 PM, Darrick J. Wong <djwong@kernel.org> wrote:
>>>>> On Tue, May 23, 2023 at 02:59:40AM +0000, Wengang Wang wrote:
>>>>>>> On May 22, 2023, at 3:40 PM, Dave Chinner <david@fromorbit.com> wrote:
>>>>>>> On Mon, May 22, 2023 at 06:20:11PM +0000, Wengang Wang wrote:
>>>>>>>>> On May 21, 2023, at 6:17 PM, Dave Chinner <david@fromorbit.com> wrote:
>>>>>>>>> On Fri, May 19, 2023 at 10:18:29AM -0700, Wengang Wang wrote:
>>>>>>>>>> diff --git a/fs/xfs/xfs_extfree_item.c b/fs/xfs/xfs_extfree_item.c
>>>>>>>> For existing other cases (if there are) where new intents are added,
>>>>>>>> they don’t use the capture_list for delayed operations? Do you have example then? 
>>>>>>>> if so I think we should follow their way instead of adding the defer operations
>>>>>>>> (but reply on the intents on AIL).
>>>>>>> 
>>>>>>> All of the intent recovery stuff uses
>>>>>>> xfs_defer_ops_capture_and_commit() to commit the intent being
>>>>>>> replayed and cause all further new intent processing in that chain
>>>>>>> to be defered until after all the intents recovered from the journal
>>>>>>> have been iterated. All those new intents end up in the AIL at a LSN
>>>>>>> index >= last_lsn.
>>>>>> 
>>>>>> Yes. So we break the AIL iteration on seeing an intent with lsn equal to
>>>>>> or bigger than last_lsn and skip the iop_recover() for that item?
>>>>>> and shall we put this change to another separated patch as it is to fix
>>>>>> an existing problem (not introduced by my patch)?
>>>>> 
>>>>> Intent replay creates non-intent log items (like buffers or inodes) that
>>>>> are added to the AIL with an LSN higher than last_lsn.  I suppose it
>>>>> would be possible to break log recovery if an intent's iop_recover
>>>>> method immediately logged a new intent and returned EAGAIN to roll the
>>>>> transaction, but none of them do that;
>>>> 
>>>> I am not quite sure for above. There are cases that new intents are added
>>>> in iop_recover(), for example xfs_attri_item_recover():
>>>> 
>>>> 632         error = xfs_xattri_finish_update(attr, done_item);
>>>> 633         if (error == -EAGAIN) {
>>>> 634                 /*
>>>> 635                  * There's more work to do, so add the intent item to this
>>>> 636                  * transaction so that we can continue it later.
>>>> 637                  */
>>>> 638                 xfs_defer_add(tp, XFS_DEFER_OPS_TYPE_ATTR, &attr->xattri_list);
>>>> 639                 error = xfs_defer_ops_capture_and_commit(tp, capture_list);
>>>> 640                 if (error)
>>>> 641                         goto out_unlock;
>>>> 642
>>>> 643                 xfs_iunlock(ip, XFS_ILOCK_EXCL);
>>>> 644                 xfs_irele(ip);
>>>> 645                 return 0;
>>>> 646         }
>>>> 
>>>> I am thinking line 638 and 639 are doing so.
>>> 
>>> I don't think so. @attrip is the attribute information recovered
>>> from the original intent - we allocated @attr in
>>> xfs_attri_item_recover() to enable the operation indicated by
>>> @attrip to be executed.  We copy stuff from @attrip (the item we are
>>> recovering) to the new @attr structure (the work we need to do) and
>>> run it through the attr modification state machine. If there's more
>>> work to be done, we then add @attr to the defer list.
>> 
>> Assuming “if there’s more work to be done” is the -EAGAIN case...
>> 
>>> 
>>> If there is deferred work to be continued, we still need a new
>>> intent to indicate what attr operation needs to be performed next in
>>> the chain.  When we commit the transaction (639), it will create a
>>> new intent log item for the deferred attribute operation in the
>>> ->create_intent callback before the transaction is committed.
>> 
>> Yes, that’s true. But it has no conflict with what I was saying. The thing I wanted
>> to say is that:
>> The new intent log item (introduced by xfs_attri_item_recover()) could appear on
>> AIL before the AIL iteration ends in xlog_recover_process_intents(). 
>> Do you agree with above?
> 
> Recovery of intents has *always* been able to do this. What I don't
> understand is why you think this is a problem.

Adding a new intent is not a problem.
The problem, I think, is that for the NEW intents
we should’d run both:
1. iop_recover() during the AIL intents iteration and
2. the ops->finish_item() in xlog_finish_defer_ops()'s sub-calls

I am thinking the iop_recover() and the finish_item() are doing the same
thing. I am not quite sure about ATTRI, but I am sure about EFI. For EFI,
both xfs_efi_item_recover() (as iop_recover()) and
xfs_extent_free_finish_item() (as finish_item()) call
xfs_trans_free_extent() to free the same extent(s).
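
That is, for the one new EFI both paths converge (as I read the code):

	AIL iteration:	xfs_efi_item_recover()        -> xfs_trans_free_extent()
	capture_list:	xfs_extent_free_finish_item() -> xfs_trans_free_extent()

so the same extent ends up freed twice.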

I have a metadump with which I can reproduce the original AGFL deadlock
problem during log recovery.  When I tested my patch (without the changes
in xlog_recover_process_intents()), I found the records in the NEW EFI are
freed twice: once during the AIL intents iteration with iop_recover(), and
again inside xlog_finish_defer_ops() with finish_item().

That needs to be fixed. The best way to fix it, I thought, is to fix
xlog_recover_process_intents(). For log recovery, I am thinking
iop_recover() should be applied to the original on-disk redo intents only.
Those original redo intents have no deferred operation attached. And for
the NEW intents introduced during the processing of the original intents,
I don't think iop_recover() should run for them because
1. they are NEW ones, not the original on-disk redo intents, and
2. they have a deferred operation attached which processes the new intents
   later (see the sketch below).
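
In code terms, the check I have in mind (and which the patch below
implements in xlog_recover_process_intents()) is roughly:

	threshold_lsn = xfs_ail_max_lsn(ailp);	/* sampled before iterating */
	...
	/* NEW intents logged during replay have li_lsn beyond threshold_lsn */
	if (XFS_LSN_CMP(threshold_lsn, lip->li_lsn) < 0)
		break;	/* leave them to the capture_list processing */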

And I am also thinking we can't depend on xlog_recover_process_intents()
to run iop_recover() on the new intents. That's because the new intents
could already be inserted into the AIL before the AIL intents iteration
ends, in which case we have the chance to run iop_recover() on the new
ones; but the new intents could also be inserted into the AIL after the
AIL intents iteration ends, in which case we have no chance to run
iop_recover() on them. That is an unstable thing.


> 
>> In the following I will talk about the new intent log item. 
>> 
>> The iop_recover() is then executed on the new intent during the AIL intents iteration.
>> I meant this loop:
>> 
>> 2548         for (lip = xfs_trans_ail_cursor_first(ailp, &cur, 0);
>> 2549              lip != NULL;
>> 2550              lip = xfs_trans_ail_cursor_next(ailp, &cur)) {
>> ….
>> 2584         }
>> Do you agree with above?
> 
> Repeatedly asserting that I must agree that your logic is correct
> comes across as unnecessarily aggressive and combative and is not
> acceptable behaviour.
> 
> Aggressively asserting that you are right doesn't make your logic
> any more compelling. It just makes you look really bad when,
> inevitably, you get politely told that you've made yet another
> mistake.

Nope, I didn't want to be aggressive. Since this has already been
discussed a lot but we still don't have an agreement, I wanted to find out
where the disagreement is, little by little. I thought it would be a good
way to make things clear. If it made you unhappy, sorry, I didn't mean
that.

> 
>> And the lsn for the new intent is equal to or bigger than last_lsn.
>> Do you agree with above?
>> 
>> In above case the iop_recover() is xfs_attri_item_recover(). The later
>> creates a xfs_attr_intent, attr1, copying things from the new ATTRI to attr1
>> and process attr1.
>> Do you agree with above?
>> 
>> Above xfs_attri_item_recover() runs successfully and the AIL intent iteration ends.
> 
> We ran xfs_xattri_finish_update(attr, done_item) which returned
> -EAGAIN. This means we ran xfs_xattri_finish_update() and modified
> -something- before we logged the new intent. That, however, doesn't
> dictate what we need to log in the new intent - that is determined
> by the overall intent chain and recovery architecture the subsystem
> uses. And the attribute modification architecture is extremely
> subtle, complex and full of unexpected loops.

I am a bit confused here.
I know that xfs_efi_item_recover() and xfs_extent_free_finish_item() are
doing the same thing (freeing the extent specified in the EFI record(s)).
So whatever would be logged in the new intents, I am thinking that for the
same new ATTRI intent, xfs_attri_item_recover() and xfs_attr_finish_item()
should do the same thing. If they don't, ATTRI has inconsistent behavior
with EFI.

> 
>> Now it’s time to process the capture_list with xlog_finish_defer_ops(). The
>> capture_list contains the deferred operation which was added at line 639 with type
>> XFS_DEFER_OPS_TYPE_ATTR. 
>> Do you agree with above?
>> 
>> In xlog_finish_defer_ops(), a new transaction is created by xfs_trans_alloc().
>> the deferred XFS_DEFER_OPS_TYPE_ATTR operation is attached to that new
>> transaction by xfs_defer_ops_continue(). Then the new transaction is committed by
>> xfs_trans_commit().
>> Do you agree with above?
> 
> So at this point we are logging the same intent *information*
> again. But it's not the same intent - we've retired the original
> intent and we have a new intent chain for the operations we are
> going to perform. But why are we logging the same information in a
> new intent?
> 
> Well, the attr modification state machine was -carefully- designed
> to work this way: if we crash part way through an intent chain, we
> always need to restart it in recovery with a "garbage collection"
> step that undoes the partially complete changes that had been made
> before we crashed. Only once we've done that garbage collection step
> can we restart the original operation the intent described.
> 
> Indeed, the operations we perform recovering an attr intent are
> actually different to the operations we perform with those intents
> at runtime. Recovery has to perform garbage collection as a first
> step, yet runtime does not.
> 
> IOWs, xfs_attri_item_recover() -> xfs_xattri_finish_update() is
> typically performing garbage collection in the first recovery
> transaction.  If it does perform a gc operation, then it is
> guaranteed to return -EAGAIN so that the actual operation recovery
> needs to perform can have new intents logged, be deferred and then
> eventually replayed.
> 
> If we don't need to perform gc, then the operation we perform may
> still require multiple transactions to complete and we get -EAGAIN
> in that case, too. When that happens, we need to ensure if we crash
> before recovery completes that the next mount will perform the
> correct GC steps before it replays the intent from scratch. So we
> always need to log an intent for the xattr operation that will
> trigger the GC step on recovery.
> 
> Either way, we need to log new intents that are identical to the
> original one at each step of the process so that log recovery will
> always start the intent recovery process with the correct garbage
> collection operation....
> 
>> During the xfs_trans_commit(), xfs_defer_finish_noroll() is called. and
>> xfs_defer_finish_one() is called inside xfs_defer_finish_noroll(). For the
>> deferred XFS_DEFER_OPS_TYPE_ATTR operation, xfs_attr_finish_item()
>> is called.
>> Do you agree with above?
>> 
>> In xfs_attr_finish_item(), xfs_attr_intent (attr2) is taken out from above new ATTRI
>> and is processed.  attr2 and attr1 contain exactly same thing because they are both
>> from the new ATTRI.  So the processing on attr2 is pure a duplication of the processing
>> on attr1.  So actually the new ATTRI are double-processed.
>> Do you agree with above? 
> 
> No, that's wrong. There are no "attr1" and "attr2" structures - they
> are one and the same. i.e. the xfs_attr_intent used in
> xfs_attr_finish_item() is the same one created in
> xfs_attri_item_recover() that we called
> xfs_defer_add(&attr->xattri_list) with way back at line 638. 
> 
> xfs_attr_finish_item(
>        struct xfs_trans                *tp,
>        struct xfs_log_item             *done,
>        struct list_head                *item,
>        struct xfs_btree_cur            **state)
> {
>        struct xfs_attr_intent          *attr;
>        struct xfs_attrd_log_item       *done_item = NULL;
>        int                             error;
> 
>>>>>>  attr = container_of(item, struct xfs_attr_intent, xattri_list);
>        if (done)
>                done_item = ATTRD_ITEM(done);
> 
> xfs_attr_finish_item() extracts the @attr structure from the
> listhead that is linking it into the deferred op chain, and we
> run xfs_attr_set_iter() again to run the next step(s) in the xattr
> modification state machine. If this returns -EAGAIN (and it can)
> we go around the "defer(attr->attri_list), create new intent,
> commit, ... ->finish_item() -> xfs_attr_finish_item()" loop again.
> 
> From this, I am guessing that you are confusing the per-defer-cycle
> xfs_attri_log_item with the overall xfs_attr_intent context. i.e. We
> allocate a new xfs_attri_log_item (and xfs_attrd_log_item) every
> time we create new intents in the deferred operation chain. Each
> xfs_attri_log_item is created from the state that is carried
> across the entire attr recovery chain in the xfs_attr_intent
> structure....

OK, ATTRI really looks complex. And it looks like
xfs_attri_item_recover() and xfs_attr_finish_item() can do different
things according to the current state. Thus running both
xfs_attri_item_recover() and xfs_attr_finish_item() is no problem.  I
didn't pick a good example :(.

But the original
ASSERT(XFS_LSN_CMP(last_lsn, lip->li_lsn) >= 0);
in xlog_recover_process_intents() is incorrect, right? For example, if we
introduced two different new ATTRIs, the 2nd would have an li_lsn bigger
than last_lsn.

Anyway, let's go back to the EFI. EFI is different from ATTRI: it has no
state machine. xfs_efi_item_recover() and xfs_extent_free_finish_item()
are doing the same thing: freeing the extent specified by the EFI
record(s). For the new EFI, xfs_efi_item_recover() goes well, but
xfs_extent_free_finish_item() fails because it wants to free the extent
which was already freed by xfs_efi_item_recover().
As a workaround, shall we pass some information (indicating it is already
done) from xfs_efi_item_recover() to xfs_extent_free_finish_item() somehow,
so that xfs_extent_free_finish_item() won't do the 2nd free? What's your
idea?
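
Something like this, perhaps (purely a sketch; XFS_EFI_RECOVERED is a
made-up flag bit in xefi_flags):

	/* in xfs_efi_item_recover(): mark the work item once the
	 * extent has really been freed during AIL replay */
	xefi->xefi_flags |= XFS_EFI_RECOVERED;

	/* in xfs_extent_free_finish_item(): */
	if (xefi->xefi_flags & XFS_EFI_RECOVERED) {
		kmem_cache_free(xfs_extfree_item_cache, xefi);
		return 0;	/* don't free the extent a 2nd time */
	}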

thanks,
wengang
Wengang Wang May 26, 2023, 9:31 p.m. UTC | #15
> On May 26, 2023, at 11:59 AM, Wengang Wang <wen.gang.wang@oracle.com> wrote:
> […]
> we should’d run both:
I meant “shouldn’t”

thanks,
wengang

> 1. iop_recover() during the AIL intents iteration and
> 2. the ops->finish_item() in xlog_finish_defer_ops()’s sub calls()
> 
> I am thinking the iop_recover() and the finish_item() are doing same thing.
> I am not quite sure about ATTRI but  I am sure about EFI. For EFI,
> both xfs_efi_item_recover() (as iop_recover()) and xfs_extent_free_finish_item()
> (as finish_item()) calls xfs_trans_free_extent() to free the same extent(s).
> 
> I have a metadump with which I can reproduce the original AGFL deadlock problem
> during log recovery.  When I tested my patch (without the changes in
> xlog_recover_process_intents()), I found the records in the NEW EFI are freed twice,
> one is during the AIL intents iteration with iop_recover(), the other is inside
> xlog_finish_defer_ops() with finish_item().
> 
> That needs to be fixed. The best way to fix that I thought is to fix xlog_recover_process_intents().
> As log recovery, I am thinking the iop_cover() should only applied the original on disk redo intents
> only. Those original redo intents has no deferred option attached.  And for the NEW intents
> introduced during the processing of the original intents, I don’t think iop_recover() should
> run for them becuase
> 1. they are NEW ones, not the original on disk redo intents and 
> 2. they have deferred operation attached which process the new intents later.
> 
> And I am also thinking we also depend  xlog_recover_process_intents() to run iop_recover()
> on the new intents. That’s because the new intents could be already inserted to AIL
> before the AIL intents iteration ends, in that case have the change to run iop_recover() on the
> new ones.  but the new intents also could be inserted to AIL after the AIL intents iteration ends,
> in that case we have no change to run iop_recover() on the new intents. — that’s an unstable thing.
> 
> 
>> 
>>> In the following I will talk about the new intent log item. 
>>> 
>>> The iop_recover() is then executed on the new intent during the AIL intents iteration.
>>> I meant this loop:
>>> 
>>> 2548         for (lip = xfs_trans_ail_cursor_first(ailp, &cur, 0);
>>> 2549              lip != NULL;
>>> 2550              lip = xfs_trans_ail_cursor_next(ailp, &cur)) {
>>> ….
>>> 2584         }
>>> Do you agree with above?
>> 
>> Repeatedly asserting that I must agree that your logic is correct
>> comes across as unecessarily aggressive and combative and is not
>> acceptible behaviour.
>> 
>> Aggressively asserting that you are right doesn't make your logic
>> any more compelling. It just makes you look really bad when,
>> inevitably, you get politely told that you've made yet another
>> mistake.
> 
> Nope, I didn’t want to be aggressive. Since this is already discussed a lot
> but we still don’t have an agreement. So I wanted to know where is the disagreement,
> little by little. I thought it would be a good way to make things clear,
> if it made you unhappy, sorry, and I didn’t mean that.
> 
>> 
>>> And the lsn for the new intent is equal to or bigger than last_lsn.
>>> Do you agree with above?
>>> 
>>> In above case the iop_recover() is xfs_attri_item_recover(). The later
>>> creates a xfs_attr_intent, attr1, copying things from the new ATTRI to attr1
>>> and process attr1.
>>> Do you agree with above?
>>> 
>>> Above xfs_attri_item_recover() runs successfully and the AIL intent iteration ends.
>> 
>> We ran xfs_xattri_finish_update(attr, done_item) which returned
>> -EAGAIN. This means we ran xfs_xattri_finish_update() and modified
>> -something- before we logged the new intent. That, however, doesn't
>> dictate what we need to log in the new intent - that is determined
>> by the overall intent chain and recovery architecture the subsystem
>> uses. And the attribute modification architecture is extremely
>> subtle, complex and full of unexpected loops.
> 
> I am a bit confused here. 
> I know that xfs_efi_item_recover() and xfs_extent_free_finish_item() are doing same thing
> (free the extent specified in the EFI record(s)). So whatever the new would be logged int
> the new intents, I am thinking the for the same new ATTRI intent, xfs_attri_item_recover()
> and xfs_attr_finish_item() should do the same thing. If they don’t not,  ATTIR has inconsistent
> behavior with EFI.
> 
>> 
>>> Now it’s time to process the capture_list with xlog_finish_defer_ops(). The
>>> capture_list contains the deferred operation which was added at line 639 with type
>>> XFS_DEFER_OPS_TYPE_ATTR. 
>>> Do you agree with above?
>>> 
>>> In xlog_finish_defer_ops(), a new transaction is created by xfs_trans_alloc().
>>> the deferred XFS_DEFER_OPS_TYPE_ATTR operation is attached to that new
>>> transaction by xfs_defer_ops_continue(). Then the new transaction is committed by
>>> xfs_trans_commit().
>>> Do you agree with above?
>> 
>> So at this point it we are logging the same intent *information*
>> again. But it's not the same intent - we've retired the original
>> intent and we have a new intent chain for the operations we are
>> going to perform. But why are we logging the same information in a
>> new intent?
>> 
>> Well, the attr modification state machine was -carefully- designed
>> to work this way: if we crash part way through an intent chain, we
>> always need to restart it in recovery with a "garbage collection"
>> step that undoes the partially complete changes that had been made
>> before we crashed. Only once we've done that garbage collection step
>> can we restart the original operation the intent described.
>> 
>> Indeed, the operations we perform recovering an attr intent are
>> actually different to the operations we perform with those intents
>> at runtime. Recovery has to perform garbage collection as a first
>> step, yet runtime does not.
>> 
>> IOWs, xfs_attri_item_recover() -> xfs_xattri_finish_update() is
>> typically performing garbage collection in the first recovery
>> transaction.  If it does perform a gc operation, then it is
>> guaranteed to return -EAGAIN so that the actual operation recovery
>> needs to perform can have new intents logged, be deferred and then
>> eventually replayed.
>> 
>> If we don't need to perform gc, then the operation we perform may
>> still require multiple transactions to complete and we get -EAGAIN
>> in that case, too. When that happens, we need to ensure if we crash
>> before recovery completes that the next mount will perform the
>> correct GC steps before it replays the intent from scratch. So we
>> always need to log an intent for the xattr operation that will
>> trigger the GC step on recovery.
>> 
>> Either way, we need to log new intents that are identical to the
>> original one at each step of the process so that log recovery will
>> always start the intent recovery process with the correct garbage
>> collection operation....
>> 
>>> During the xfs_trans_commit(), xfs_defer_finish_noroll() is called. and
>>> xfs_defer_finish_one() is called inside xfs_defer_finish_noroll(). For the
>>> deferred XFS_DEFER_OPS_TYPE_ATTR operation, xfs_attr_finish_item()
>>> is called.
>>> Do you agree with above?
>>> 
>>> In xfs_attr_finish_item(), xfs_attr_intent (attr2) is taken out from above new ATTRI
>>> and is processed.  attr2 and attr1 contain exactly same thing because they are both
>>> from the new ATTRI.  So the processing on attr2 is pure a duplication of the processing
>>> on attr1.  So actually the new ATTRI are double-processed.
>>> Do you agree with above? 
>> 
>> No, that's wrong. There is now "attr1 and attr2" structures - they
>> are one and the same. i.e. the xfs_attr_intent used in
>> xfs_attr_finish_item() is the same one created in
>> xfs_attri_item_recover() that we called
>> xfs_defer_add(&attr->xattri_list) with way back at line 638. 
>> 
>> xfs_attr_finish_item(
>>       struct xfs_trans                *tp,
>>       struct xfs_log_item             *done,
>>       struct list_head                *item,
>>       struct xfs_btree_cur            **state)
>> {
>>       struct xfs_attr_intent          *attr;
>>       struct xfs_attrd_log_item       *done_item = NULL;
>>       int                             error;
>> 
>>>>>>> attr = container_of(item, struct xfs_attr_intent, xattri_list);
>>       if (done)
>>               done_item = ATTRD_ITEM(done);
>> 
>> xfs_attr_finish_item() extracts the @attr structure from the
>> listhead that is linking it into the deferred op chain, and we
>> run xfs_attr_set_iter() again to run the next step(s) in the xattr
>> modification state machine. If this returns -EAGAIN (and it can)
>> we go around the "defer(attr->attri_list), create new intent,
>> commit, ... ->finish_item() -> xfs_attr_finish_item()" loop again.
>> 
>> From this, I am guessing that you are confusing the per-defer-cycle
>> xfs_attri_log_item with the overall xfs_attr_intent context. i.e. We
>> allocate a new xfs_attri_log_item (and xfs_attrd_log_item) every
>> time we create new intents in the deferred operation chain. Each
>> xfs_attri_log_item is created from the state that is carried
>> across entire the attr recovery chain in the xfs_attr_intent
>> structure....
> 
> OK, ATTRI really looks complex. And it looks like the
> xfs_attri_item_recover() and xfs_attr_finish_item() can do different things
> according the current state. Thus running both xfs_attri_item_recover()
> and xfs_attr_finish_item() has no problem.  I didn’t take a good example :(. 
> 
> But, the original 
> ASSERT(XFS_LSN_CMP(last_lsn, lip->li_lsn) >= 0);
> in xlog_recover_process_intents() is incorrect, right? For example if we
> introduced two different new ATTRIs and the 2nd should have li_lsn bigger
> than last_lsn?
> 
> Anyway, let’s go back to the EFI. EFI is different from ATTRI, it has no state machine.
> xfs_efi_item_recover() and xfs_extent_free_finish_item() are doing the same thing —
> free the extent specified by the EFI record(s).  For the new EFI, the
> xfs_efi_item_recover() goes well, but xfs_extent_free_finish_item() fails because it
> want to free the extent which is already freed by the xfs_efi_item_recover(). 
> We pass some information (indicating it already done) from xfs_efi_item_recover() to
> xfs_extent_free_finish_item() somehow so that xfs_extent_free_finish_item() won’t do
> the 2nd free as a workaround? What’s your idea?
> 
> thanks,
> wengang
Wengang Wang June 3, 2023, 12:26 a.m. UTC | #16
Hi Dave and Darrick,

What do you think?

thanks,
wengang

diff mbox series

Patch

diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c
index 203f16c48c19..abfd2acb3053 100644
--- a/fs/xfs/libxfs/xfs_alloc.c
+++ b/fs/xfs/libxfs/xfs_alloc.c
@@ -1491,6 +1491,7 @@  STATIC int
 xfs_alloc_ag_vextent_near(
 	struct xfs_alloc_arg	*args)
 {
+	int			flags = args->flags | XFS_ALLOC_FLAG_TRYFLUSH;
 	struct xfs_alloc_cur	acur = {};
 	int			error;		/* error code */
 	int			i;		/* result code, temporary */
@@ -1564,8 +1565,11 @@  xfs_alloc_ag_vextent_near(
 	if (!acur.len) {
 		if (acur.busy) {
 			trace_xfs_alloc_near_busy(args);
-			xfs_extent_busy_flush(args->mp, args->pag,
-					      acur.busy_gen);
+			error = xfs_extent_busy_flush(args->tp, args->pag,
+					      acur.busy_gen, flags);
+			if (error)
+				goto out;
+			flags &= ~XFS_ALLOC_FLAG_TRYFLUSH;
 			goto restart;
 		}
 		trace_xfs_alloc_size_neither(args);
@@ -1592,6 +1596,7 @@  STATIC int				/* error */
 xfs_alloc_ag_vextent_size(
 	xfs_alloc_arg_t	*args)		/* allocation argument structure */
 {
+	int		flags = args->flags | XFS_ALLOC_FLAG_TRYFLUSH;
 	struct xfs_agf	*agf = args->agbp->b_addr;
 	struct xfs_btree_cur *bno_cur;	/* cursor for bno btree */
 	struct xfs_btree_cur *cnt_cur;	/* cursor for cnt btree */
@@ -1670,8 +1675,13 @@  xfs_alloc_ag_vextent_size(
 				xfs_btree_del_cursor(cnt_cur,
 						     XFS_BTREE_NOERROR);
 				trace_xfs_alloc_size_busy(args);
-				xfs_extent_busy_flush(args->mp,
-							args->pag, busy_gen);
+				error = xfs_extent_busy_flush(args->tp, args->pag,
+						busy_gen, flags);
+				if (error) {
+					cnt_cur = NULL;
+					goto error0;
+				}
+				flags &= ~XFS_ALLOC_FLAG_TRYFLUSH;
 				goto restart;
 			}
 		}
@@ -1755,7 +1765,13 @@  xfs_alloc_ag_vextent_size(
 		if (busy) {
 			xfs_btree_del_cursor(cnt_cur, XFS_BTREE_NOERROR);
 			trace_xfs_alloc_size_busy(args);
-			xfs_extent_busy_flush(args->mp, args->pag, busy_gen);
+			error = xfs_extent_busy_flush(args->tp, args->pag,
+					busy_gen, flags);
+			if (error) {
+				cnt_cur = NULL;
+				goto error0;
+			}
+			flags &= ~XFS_ALLOC_FLAG_TRYFLUSH;
 			goto restart;
 		}
 		goto out_nominleft;
@@ -2629,6 +2645,7 @@  xfs_alloc_fix_freelist(
 	targs.agno = args->agno;
 	targs.alignment = targs.minlen = targs.prod = 1;
 	targs.pag = pag;
+	targs.flags = args->flags & XFS_ALLOC_FLAG_FREEING;
 	error = xfs_alloc_read_agfl(pag, tp, &agflbp);
 	if (error)
 		goto out_agbp_relse;
@@ -3572,6 +3589,7 @@  xfs_free_extent_fix_freelist(
 	args.mp = tp->t_mountp;
 	args.agno = pag->pag_agno;
 	args.pag = pag;
+	args.flags = XFS_ALLOC_FLAG_FREEING;
 
 	/*
 	 * validate that the block number is legal - the enables us to detect
@@ -3580,7 +3598,7 @@  xfs_free_extent_fix_freelist(
 	if (args.agno >= args.mp->m_sb.sb_agcount)
 		return -EFSCORRUPTED;
 
-	error = xfs_alloc_fix_freelist(&args, XFS_ALLOC_FLAG_FREEING);
+	error = xfs_alloc_fix_freelist(&args, args.flags);
 	if (error)
 		return error;
 
diff --git a/fs/xfs/libxfs/xfs_alloc.h b/fs/xfs/libxfs/xfs_alloc.h
index 2b246d74c189..5038fba87784 100644
--- a/fs/xfs/libxfs/xfs_alloc.h
+++ b/fs/xfs/libxfs/xfs_alloc.h
@@ -24,6 +24,7 @@  unsigned int xfs_agfl_size(struct xfs_mount *mp);
 #define	XFS_ALLOC_FLAG_NORMAP	0x00000004  /* don't modify the rmapbt */
 #define	XFS_ALLOC_FLAG_NOSHRINK	0x00000008  /* don't shrink the freelist */
 #define	XFS_ALLOC_FLAG_CHECK	0x00000010  /* test only, don't modify args */
+#define	XFS_ALLOC_FLAG_TRYFLUSH	0x00000020  /* don't block in busy extent flush */
 
 /*
  * Argument structure for xfs_alloc routines.
@@ -57,6 +58,7 @@  typedef struct xfs_alloc_arg {
 #ifdef DEBUG
 	bool		alloc_minlen_only; /* allocate exact minlen extent */
 #endif
+	int		flags;		/* XFS_ALLOC_FLAG_* */
 } xfs_alloc_arg_t;
 
 /*
diff --git a/fs/xfs/scrub/repair.c b/fs/xfs/scrub/repair.c
index 1b71174ec0d6..2ba28e4257fe 100644
--- a/fs/xfs/scrub/repair.c
+++ b/fs/xfs/scrub/repair.c
@@ -496,9 +496,9 @@  xrep_fix_freelist(
 	args.agno = sc->sa.pag->pag_agno;
 	args.alignment = 1;
 	args.pag = sc->sa.pag;
+	args.flags = can_shrink ? 0 : XFS_ALLOC_FLAG_NOSHRINK;
 
-	return xfs_alloc_fix_freelist(&args,
-			can_shrink ? 0 : XFS_ALLOC_FLAG_NOSHRINK);
+	return xfs_alloc_fix_freelist(&args, args.flags);
 }
 
 /*
diff --git a/fs/xfs/xfs_extent_busy.c b/fs/xfs/xfs_extent_busy.c
index f3d328e4a440..ea1c1857bf5b 100644
--- a/fs/xfs/xfs_extent_busy.c
+++ b/fs/xfs/xfs_extent_busy.c
@@ -567,18 +567,41 @@  xfs_extent_busy_clear(
 /*
  * Flush out all busy extents for this AG.
  */
-void
+int
 xfs_extent_busy_flush(
-	struct xfs_mount	*mp,
+	struct xfs_trans	*tp,
 	struct xfs_perag	*pag,
-	unsigned		busy_gen)
+	unsigned		busy_gen,
+	int			flags)
 {
 	DEFINE_WAIT		(wait);
 	int			error;
 
-	error = xfs_log_force(mp, XFS_LOG_SYNC);
+	error = xfs_log_force(tp->t_mountp, XFS_LOG_SYNC);
 	if (error)
-		return;
+		return error;
+
+	/*
+	 * If we are holding busy extents, the caller may not want to block
+	 * straight away. If we are being told just to try a flush or progress
+	 * has been made since we last skipped a busy extent, return
+	 * immediately to allow the caller to try again. If we are freeing
+	 * extents, we might actually be holding the only free extents in the
+	 * transaction busy list and the log force won't resolve that
+	 * situation. In that case, return -EAGAIN to tell the
+	 * caller it needs to commit the busy extents it holds before retrying
+	 * the extent free operation.
+	 */
+	if (!list_empty(&tp->t_busy)) {
+		if (flags & XFS_ALLOC_FLAG_TRYFLUSH)
+			return 0;
+
+		if (busy_gen != READ_ONCE(pag->pagb_gen))
+			return 0;
+
+		if (flags & XFS_ALLOC_FLAG_FREEING)
+			return -EAGAIN;
+	}
 
 	do {
 		prepare_to_wait(&pag->pagb_wait, &wait, TASK_KILLABLE);
@@ -588,6 +611,7 @@  xfs_extent_busy_flush(
 	} while (1);
 
 	finish_wait(&pag->pagb_wait, &wait);
+	return 0;
 }
 
 void
diff --git a/fs/xfs/xfs_extent_busy.h b/fs/xfs/xfs_extent_busy.h
index 4a118131059f..edeedb92e0df 100644
--- a/fs/xfs/xfs_extent_busy.h
+++ b/fs/xfs/xfs_extent_busy.h
@@ -51,9 +51,9 @@  bool
 xfs_extent_busy_trim(struct xfs_alloc_arg *args, xfs_agblock_t *bno,
 		xfs_extlen_t *len, unsigned *busy_gen);
 
-void
-xfs_extent_busy_flush(struct xfs_mount *mp, struct xfs_perag *pag,
-	unsigned busy_gen);
+int
+xfs_extent_busy_flush(struct xfs_trans *tp, struct xfs_perag *pag,
+	unsigned busy_gen, int flags);
 
 void
 xfs_extent_busy_wait_all(struct xfs_mount *mp);
diff --git a/fs/xfs/xfs_extfree_item.c b/fs/xfs/xfs_extfree_item.c
index 011b50469301..3c5a9e9952ec 100644
--- a/fs/xfs/xfs_extfree_item.c
+++ b/fs/xfs/xfs_extfree_item.c
@@ -336,6 +336,25 @@  xfs_trans_get_efd(
 	return efdp;
 }
 
+/*
+ * Fill the EFD with all extents from the EFI and set the counter.
+ * Note: the EFD should contain at least one extent already.
+ */
+static void xfs_fill_efd_with_efi(struct xfs_efd_log_item *efdp)
+{
+	struct xfs_efi_log_item *efip = efdp->efd_efip;
+	uint                    i;
+
+	if (efdp->efd_next_extent == efip->efi_format.efi_nextents)
+		return;
+
+	for (i = 0; i < efip->efi_format.efi_nextents; i++) {
+	       efdp->efd_format.efd_extents[i] =
+		       efip->efi_format.efi_extents[i];
+	}
+	efdp->efd_next_extent = efip->efi_format.efi_nextents;
+}
+
 /*
  * Free an extent and log it to the EFD. Note that the transaction is marked
  * dirty regardless of whether the extent free succeeds or fails to support the
@@ -369,6 +388,10 @@  xfs_trans_free_extent(
 	error = __xfs_free_extent(tp, xefi->xefi_startblock,
 			xefi->xefi_blockcount, &oinfo, XFS_AG_RESV_NONE,
 			xefi->xefi_flags & XFS_EFI_SKIP_DISCARD);
+	if (error == -EAGAIN) {
+		xfs_fill_efd_with_efi(efdp);
+		return error;
+	}
 	/*
 	 * Mark the transaction dirty, even on error. This ensures the
 	 * transaction is aborted, which:
@@ -476,7 +499,8 @@  xfs_extent_free_finish_item(
 	xefi = container_of(item, struct xfs_extent_free_item, xefi_list);
 
 	error = xfs_trans_free_extent(tp, EFD_ITEM(done), xefi);
-	kmem_cache_free(xfs_extfree_item_cache, xefi);
+	if (error != -EAGAIN)
+		kmem_cache_free(xfs_extfree_item_cache, xefi);
 	return error;
 }
 
@@ -633,6 +657,17 @@  xfs_efi_item_recover(
 		fake.xefi_blockcount = extp->ext_len;
 
 		error = xfs_trans_free_extent(tp, efdp, &fake);
+		if (error == -EAGAIN) {
+			xfs_free_extent_later(tp, fake.xefi_startblock,
+				fake.xefi_blockcount, &XFS_RMAP_OINFO_ANY_OWNER);
+			/*
+			 * try to free as many extents as possible with current
+			 * transaction
+			 */
+			error = 0;
+			continue;
+		}
+
 		if (error == -EFSCORRUPTED)
 			XFS_CORRUPTION_ERROR(__func__, XFS_ERRLEVEL_LOW, mp,
 					extp, sizeof(*extp));
diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c
index 322eb2ee6c55..00bfe9683fa8 100644
--- a/fs/xfs/xfs_log_recover.c
+++ b/fs/xfs/xfs_log_recover.c
@@ -2540,30 +2540,27 @@  xlog_recover_process_intents(
 	struct xfs_log_item	*lip;
 	struct xfs_ail		*ailp;
 	int			error = 0;
-#if defined(DEBUG) || defined(XFS_WARN)
-	xfs_lsn_t		last_lsn;
-#endif
+	xfs_lsn_t		threshold_lsn;
 
 	ailp = log->l_ailp;
+	threshold_lsn = xfs_ail_max_lsn(ailp);
 	spin_lock(&ailp->ail_lock);
-#if defined(DEBUG) || defined(XFS_WARN)
-	last_lsn = xlog_assign_lsn(log->l_curr_cycle, log->l_curr_block);
-#endif
+
 	for (lip = xfs_trans_ail_cursor_first(ailp, &cur, 0);
 	     lip != NULL;
 	     lip = xfs_trans_ail_cursor_next(ailp, &cur)) {
 		const struct xfs_item_ops	*ops;
+		/*
+		 * Original redo EFIs could be split into new EFIs. Those
+		 * new EFIs are supposed to be processed in capture_list.
+		 * Stop here when original redo intents are done.
+		 */
+		if (XFS_LSN_CMP(threshold_lsn, lip->li_lsn) < 0)
+			break;
 
 		if (!xlog_item_is_intent(lip))
 			break;
 
-		/*
-		 * We should never see a redo item with a LSN higher than
-		 * the last transaction we found in the log at the start
-		 * of recovery.
-		 */
-		ASSERT(XFS_LSN_CMP(last_lsn, lip->li_lsn) >= 0);
-
 		/*
 		 * NOTE: If your intent processing routine can create more
 		 * deferred ops, you /must/ attach them to the capture list in
diff --git a/fs/xfs/xfs_trans_ail.c b/fs/xfs/xfs_trans_ail.c
index 7d4109af193e..2825f55eca88 100644
--- a/fs/xfs/xfs_trans_ail.c
+++ b/fs/xfs/xfs_trans_ail.c
@@ -137,7 +137,7 @@  xfs_ail_min_lsn(
 /*
  * Return the maximum lsn held in the AIL, or zero if the AIL is empty.
  */
-static xfs_lsn_t
+xfs_lsn_t
 xfs_ail_max_lsn(
 	struct xfs_ail		*ailp)
 {
diff --git a/fs/xfs/xfs_trans_priv.h b/fs/xfs/xfs_trans_priv.h
index d5400150358e..86b4f29b2a6e 100644
--- a/fs/xfs/xfs_trans_priv.h
+++ b/fs/xfs/xfs_trans_priv.h
@@ -106,6 +106,7 @@  void			xfs_ail_push_all(struct xfs_ail *);
 void			xfs_ail_push_all_sync(struct xfs_ail *);
 struct xfs_log_item	*xfs_ail_min(struct xfs_ail  *ailp);
 xfs_lsn_t		xfs_ail_min_lsn(struct xfs_ail *ailp);
+xfs_lsn_t		xfs_ail_max_lsn(struct xfs_ail *ailp);
 
 struct xfs_log_item *	xfs_trans_ail_cursor_first(struct xfs_ail *ailp,
 					struct xfs_ail_cursor *cur,