
[3/3] xfs: periodically relog deferred intent items

Message ID 160083919968.1401135.1020138085396332868.stgit@magnolia (mailing list archive)
State Superseded
Series xfs: fix some log stalling problems in defer ops

Commit Message

Darrick J. Wong Sept. 23, 2020, 5:33 a.m. UTC
From: Darrick J. Wong <darrick.wong@oracle.com>

There's a subtle design flaw in the deferred log item code that can lead
to pinning the log tail.  Taking up the defer ops chain examples from
the previous commit, we can get trapped in sequences like this:

Caller hands us a transaction t0 with D0-D3 attached.  The defer ops
chain will look like the following if the transaction rolls succeed:

t1: D0(t0), D1(t0), D2(t0), D3(t0)
t2: d4(t1), d5(t1), D1(t0), D2(t0), D3(t0)
t3: d5(t1), D1(t0), D2(t0), D3(t0)
...
t9: d9(t7), D3(t0)
t10: D3(t0)
t11: d10(t10), d11(t10)
t12: d11(t10)

In transaction 9, we finish d9 and try to roll to t10 while holding onto
an intent item for D3 that we logged in t0.

The previous commit changed the order in which we place new defer ops in
the defer ops processing chain to reduce the maximum chain length.  Now
make xfs_defer_finish_noroll capable of relogging the entire chain
periodically so that we can always move the log tail forward.  Most
chains will never get relogged, except for operations that generate very
long chains (large extents containing many blocks with different sharing
levels) or are on filesystems with small logs and a lot of ongoing
metadata updates.

Callers are now required to ensure that the transaction reservation is
large enough to handle logging done items and new intent items for the
maximum possible chain length.  Most callers are careful to keep the
chain lengths low, so the overhead should be minimal.

The decision to relog an intent item is made based on whether or not the
intent was added to the current checkpoint.  If so, the checkpoint is
still open and there's no point in relogging.  Otherwise, the old
checkpoint is closed and we relog the intent to add it to the current
one.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/libxfs/xfs_defer.c  |   52 ++++++++++++++++++++++++++++++++++++++++++++
 fs/xfs/xfs_bmap_item.c     |   27 +++++++++++++++++++++++
 fs/xfs/xfs_extfree_item.c  |   29 +++++++++++++++++++++++++
 fs/xfs/xfs_refcount_item.c |   27 +++++++++++++++++++++++
 fs/xfs/xfs_rmap_item.c     |   27 +++++++++++++++++++++++
 fs/xfs/xfs_trace.h         |    1 +
 fs/xfs/xfs_trans.h         |   10 ++++++++
 7 files changed, 173 insertions(+)

Comments

Brian Foster Sept. 25, 2020, 11:15 a.m. UTC | #1
On Tue, Sep 22, 2020 at 10:33:19PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <darrick.wong@oracle.com>
> 
> There's a subtle design flaw in the deferred log item code that can lead
> to pinning the log tail.  Taking up the defer ops chain examples from
> the previous commit, we can get trapped in sequences like this:
> 
> Caller hands us a transaction t0 with D0-D3 attached.  The defer ops
> chain will look like the following if the transaction rolls succeed:
> 
> t1: D0(t0), D1(t0), D2(t0), D3(t0)
> t2: d4(t1), d5(t1), D1(t0), D2(t0), D3(t0)
> t3: d5(t1), D1(t0), D2(t0), D3(t0)
> ...
> t9: d9(t7), D3(t0)
> t10: D3(t0)
> t11: d10(t10), d11(t10)
> t12: d11(t10)
> 
> In transaction 9, we finish d9 and try to roll to t10 while holding onto
> an intent item for D3 that we logged in t0.
> 
> The previous commit changed the order in which we place new defer ops in
> the defer ops processing chain to reduce the maximum chain length.  Now
> make xfs_defer_finish_noroll capable of relogging the entire chain
> periodically so that we can always move the log tail forward.  Most
> chains will never get relogged, except for operations that generate very
> long chains (large extents containing many blocks with different sharing
> levels) or are on filesystems with small logs and a lot of ongoing
> metadata updates.
> 
> Callers are now required to ensure that the transaction reservation is
> large enough to handle logging done items and new intent items for the
> maximum possible chain length.  Most callers are careful to keep the
> chain lengths low, so the overhead should be minimal.
> 
> The decision to relog an intent item is made based on whether or not the
> intent was added to the current checkpoint.  If so, the checkpoint is
> still open and there's no point in relogging.  Otherwise, the old
> checkpoint is closed and we relog the intent to add it to the current
> one.
> 

Looks like this last bit needs to be updated.

> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> ---
>  fs/xfs/libxfs/xfs_defer.c  |   52 ++++++++++++++++++++++++++++++++++++++++++++
>  fs/xfs/xfs_bmap_item.c     |   27 +++++++++++++++++++++++
>  fs/xfs/xfs_extfree_item.c  |   29 +++++++++++++++++++++++++
>  fs/xfs/xfs_refcount_item.c |   27 +++++++++++++++++++++++
>  fs/xfs/xfs_rmap_item.c     |   27 +++++++++++++++++++++++
>  fs/xfs/xfs_trace.h         |    1 +
>  fs/xfs/xfs_trans.h         |   10 ++++++++
>  7 files changed, 173 insertions(+)
> 
> 
> diff --git a/fs/xfs/libxfs/xfs_defer.c b/fs/xfs/libxfs/xfs_defer.c
> index 84a70edd0da1..c601cc2af254 100644
> --- a/fs/xfs/libxfs/xfs_defer.c
> +++ b/fs/xfs/libxfs/xfs_defer.c
...
> @@ -447,6 +494,11 @@ xfs_defer_finish_noroll(
>  		if (error)
>  			goto out_shutdown;
>  
> +		/* Every few rolls we relog all the intent items. */

Stale comment?

Those nits aside:

Reviewed-by: Brian Foster <bfoster@redhat.com>

> +		error = xfs_defer_relog(tp, &dop_pending);
> +		if (error)
> +			goto out_shutdown;
> +
>  		dfp = list_first_entry(&dop_pending, struct xfs_defer_pending,
>  				       dfp_list);
>  		error = xfs_defer_finish_one(*tp, dfp);
> diff --git a/fs/xfs/xfs_bmap_item.c b/fs/xfs/xfs_bmap_item.c
> index 008436ef5bce..6317fdf4a3a0 100644
> --- a/fs/xfs/xfs_bmap_item.c
> +++ b/fs/xfs/xfs_bmap_item.c
> @@ -532,6 +532,32 @@ xfs_bui_item_match(
>  	return BUI_ITEM(lip)->bui_format.bui_id == intent_id;
>  }
>  
> +/* Relog an intent item to push the log tail forward. */
> +static struct xfs_log_item *
> +xfs_bui_item_relog(
> +	struct xfs_log_item		*intent,
> +	struct xfs_trans		*tp)
> +{
> +	struct xfs_bud_log_item		*budp;
> +	struct xfs_bui_log_item		*buip;
> +	struct xfs_map_extent		*extp;
> +	unsigned int			count;
> +
> +	count = BUI_ITEM(intent)->bui_format.bui_nextents;
> +	extp = BUI_ITEM(intent)->bui_format.bui_extents;
> +
> +	tp->t_flags |= XFS_TRANS_DIRTY;
> +	budp = xfs_trans_get_bud(tp, BUI_ITEM(intent));
> +	set_bit(XFS_LI_DIRTY, &budp->bud_item.li_flags);
> +
> +	buip = xfs_bui_init(tp->t_mountp);
> +	memcpy(buip->bui_format.bui_extents, extp, count * sizeof(*extp));
> +	atomic_set(&buip->bui_next_extent, count);
> +	xfs_trans_add_item(tp, &buip->bui_item);
> +	set_bit(XFS_LI_DIRTY, &buip->bui_item.li_flags);
> +	return &buip->bui_item;
> +}
> +
>  static const struct xfs_item_ops xfs_bui_item_ops = {
>  	.iop_size	= xfs_bui_item_size,
>  	.iop_format	= xfs_bui_item_format,
> @@ -539,6 +565,7 @@ static const struct xfs_item_ops xfs_bui_item_ops = {
>  	.iop_release	= xfs_bui_item_release,
>  	.iop_recover	= xfs_bui_item_recover,
>  	.iop_match	= xfs_bui_item_match,
> +	.iop_relog	= xfs_bui_item_relog,
>  };
>  
>  /*
> diff --git a/fs/xfs/xfs_extfree_item.c b/fs/xfs/xfs_extfree_item.c
> index 337fae015bca..30a53f3913d1 100644
> --- a/fs/xfs/xfs_extfree_item.c
> +++ b/fs/xfs/xfs_extfree_item.c
> @@ -642,6 +642,34 @@ xfs_efi_item_match(
>  	return EFI_ITEM(lip)->efi_format.efi_id == intent_id;
>  }
>  
> +/* Relog an intent item to push the log tail forward. */
> +static struct xfs_log_item *
> +xfs_efi_item_relog(
> +	struct xfs_log_item		*intent,
> +	struct xfs_trans		*tp)
> +{
> +	struct xfs_efd_log_item		*efdp;
> +	struct xfs_efi_log_item		*efip;
> +	struct xfs_extent		*extp;
> +	unsigned int			count;
> +
> +	count = EFI_ITEM(intent)->efi_format.efi_nextents;
> +	extp = EFI_ITEM(intent)->efi_format.efi_extents;
> +
> +	tp->t_flags |= XFS_TRANS_DIRTY;
> +	efdp = xfs_trans_get_efd(tp, EFI_ITEM(intent), count);
> +	efdp->efd_next_extent = count;
> +	memcpy(efdp->efd_format.efd_extents, extp, count * sizeof(*extp));
> +	set_bit(XFS_LI_DIRTY, &efdp->efd_item.li_flags);
> +
> +	efip = xfs_efi_init(tp->t_mountp, count);
> +	memcpy(efip->efi_format.efi_extents, extp, count * sizeof(*extp));
> +	atomic_set(&efip->efi_next_extent, count);
> +	xfs_trans_add_item(tp, &efip->efi_item);
> +	set_bit(XFS_LI_DIRTY, &efip->efi_item.li_flags);
> +	return &efip->efi_item;
> +}
> +
>  static const struct xfs_item_ops xfs_efi_item_ops = {
>  	.iop_size	= xfs_efi_item_size,
>  	.iop_format	= xfs_efi_item_format,
> @@ -649,6 +677,7 @@ static const struct xfs_item_ops xfs_efi_item_ops = {
>  	.iop_release	= xfs_efi_item_release,
>  	.iop_recover	= xfs_efi_item_recover,
>  	.iop_match	= xfs_efi_item_match,
> +	.iop_relog	= xfs_efi_item_relog,
>  };
>  
>  /*
> diff --git a/fs/xfs/xfs_refcount_item.c b/fs/xfs/xfs_refcount_item.c
> index c78412755b8a..cf0205fdc607 100644
> --- a/fs/xfs/xfs_refcount_item.c
> +++ b/fs/xfs/xfs_refcount_item.c
> @@ -560,6 +560,32 @@ xfs_cui_item_match(
>  	return CUI_ITEM(lip)->cui_format.cui_id == intent_id;
>  }
>  
> +/* Relog an intent item to push the log tail forward. */
> +static struct xfs_log_item *
> +xfs_cui_item_relog(
> +	struct xfs_log_item		*intent,
> +	struct xfs_trans		*tp)
> +{
> +	struct xfs_cud_log_item		*cudp;
> +	struct xfs_cui_log_item		*cuip;
> +	struct xfs_phys_extent		*extp;
> +	unsigned int			count;
> +
> +	count = CUI_ITEM(intent)->cui_format.cui_nextents;
> +	extp = CUI_ITEM(intent)->cui_format.cui_extents;
> +
> +	tp->t_flags |= XFS_TRANS_DIRTY;
> +	cudp = xfs_trans_get_cud(tp, CUI_ITEM(intent));
> +	set_bit(XFS_LI_DIRTY, &cudp->cud_item.li_flags);
> +
> +	cuip = xfs_cui_init(tp->t_mountp, count);
> +	memcpy(cuip->cui_format.cui_extents, extp, count * sizeof(*extp));
> +	atomic_set(&cuip->cui_next_extent, count);
> +	xfs_trans_add_item(tp, &cuip->cui_item);
> +	set_bit(XFS_LI_DIRTY, &cuip->cui_item.li_flags);
> +	return &cuip->cui_item;
> +}
> +
>  static const struct xfs_item_ops xfs_cui_item_ops = {
>  	.iop_size	= xfs_cui_item_size,
>  	.iop_format	= xfs_cui_item_format,
> @@ -567,6 +593,7 @@ static const struct xfs_item_ops xfs_cui_item_ops = {
>  	.iop_release	= xfs_cui_item_release,
>  	.iop_recover	= xfs_cui_item_recover,
>  	.iop_match	= xfs_cui_item_match,
> +	.iop_relog	= xfs_cui_item_relog,
>  };
>  
>  /*
> diff --git a/fs/xfs/xfs_rmap_item.c b/fs/xfs/xfs_rmap_item.c
> index c1a1fd62ca74..3237243d375d 100644
> --- a/fs/xfs/xfs_rmap_item.c
> +++ b/fs/xfs/xfs_rmap_item.c
> @@ -583,6 +583,32 @@ xfs_rui_item_match(
>  	return RUI_ITEM(lip)->rui_format.rui_id == intent_id;
>  }
>  
> +/* Relog an intent item to push the log tail forward. */
> +static struct xfs_log_item *
> +xfs_rui_item_relog(
> +	struct xfs_log_item		*intent,
> +	struct xfs_trans		*tp)
> +{
> +	struct xfs_rud_log_item		*rudp;
> +	struct xfs_rui_log_item		*ruip;
> +	struct xfs_map_extent		*extp;
> +	unsigned int			count;
> +
> +	count = RUI_ITEM(intent)->rui_format.rui_nextents;
> +	extp = RUI_ITEM(intent)->rui_format.rui_extents;
> +
> +	tp->t_flags |= XFS_TRANS_DIRTY;
> +	rudp = xfs_trans_get_rud(tp, RUI_ITEM(intent));
> +	set_bit(XFS_LI_DIRTY, &rudp->rud_item.li_flags);
> +
> +	ruip = xfs_rui_init(tp->t_mountp, count);
> +	memcpy(ruip->rui_format.rui_extents, extp, count * sizeof(*extp));
> +	atomic_set(&ruip->rui_next_extent, count);
> +	xfs_trans_add_item(tp, &ruip->rui_item);
> +	set_bit(XFS_LI_DIRTY, &ruip->rui_item.li_flags);
> +	return &ruip->rui_item;
> +}
> +
>  static const struct xfs_item_ops xfs_rui_item_ops = {
>  	.iop_size	= xfs_rui_item_size,
>  	.iop_format	= xfs_rui_item_format,
> @@ -590,6 +616,7 @@ static const struct xfs_item_ops xfs_rui_item_ops = {
>  	.iop_release	= xfs_rui_item_release,
>  	.iop_recover	= xfs_rui_item_recover,
>  	.iop_match	= xfs_rui_item_match,
> +	.iop_relog	= xfs_rui_item_relog,
>  };
>  
>  /*
> diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
> index a3a35a2d8ed9..362c155be525 100644
> --- a/fs/xfs/xfs_trace.h
> +++ b/fs/xfs/xfs_trace.h
> @@ -2533,6 +2533,7 @@ DEFINE_DEFER_PENDING_EVENT(xfs_defer_create_intent);
>  DEFINE_DEFER_PENDING_EVENT(xfs_defer_cancel_list);
>  DEFINE_DEFER_PENDING_EVENT(xfs_defer_pending_finish);
>  DEFINE_DEFER_PENDING_EVENT(xfs_defer_pending_abort);
> +DEFINE_DEFER_PENDING_EVENT(xfs_defer_relog_intent);
>  
>  #define DEFINE_BMAP_FREE_DEFERRED_EVENT DEFINE_PHYS_EXTENT_DEFERRED_EVENT
>  DEFINE_BMAP_FREE_DEFERRED_EVENT(xfs_bmap_free_defer);
> diff --git a/fs/xfs/xfs_trans.h b/fs/xfs/xfs_trans.h
> index 995c1513693c..e838e8327510 100644
> --- a/fs/xfs/xfs_trans.h
> +++ b/fs/xfs/xfs_trans.h
> @@ -78,6 +78,8 @@ struct xfs_item_ops {
>  	int (*iop_recover)(struct xfs_log_item *lip,
>  			   struct xfs_defer_capture **dfcp);
>  	bool (*iop_match)(struct xfs_log_item *item, uint64_t id);
> +	struct xfs_log_item *(*iop_relog)(struct xfs_log_item *intent,
> +			struct xfs_trans *tp);
>  };
>  
>  /*
> @@ -239,4 +241,12 @@ void		xfs_trans_buf_copy_type(struct xfs_buf *dst_bp,
>  
>  extern kmem_zone_t	*xfs_trans_zone;
>  
> +static inline struct xfs_log_item *
> +xfs_trans_item_relog(
> +	struct xfs_log_item	*lip,
> +	struct xfs_trans	*tp)
> +{
> +	return lip->li_ops->iop_relog(lip, tp);
> +}
> +
>  #endif	/* __XFS_TRANS_H__ */
>
Darrick J. Wong Sept. 25, 2020, 7:06 p.m. UTC | #2
On Fri, Sep 25, 2020 at 07:15:40AM -0400, Brian Foster wrote:
> On Tue, Sep 22, 2020 at 10:33:19PM -0700, Darrick J. Wong wrote:
> > From: Darrick J. Wong <darrick.wong@oracle.com>
> > 
> > There's a subtle design flaw in the deferred log item code that can lead
> > to pinning the log tail.  Taking up the defer ops chain examples from
> > the previous commit, we can get trapped in sequences like this:
> > 
> > Caller hands us a transaction t0 with D0-D3 attached.  The defer ops
> > chain will look like the following if the transaction rolls succeed:
> > 
> > t1: D0(t0), D1(t0), D2(t0), D3(t0)
> > t2: d4(t1), d5(t1), D1(t0), D2(t0), D3(t0)
> > t3: d5(t1), D1(t0), D2(t0), D3(t0)
> > ...
> > t9: d9(t7), D3(t0)
> > t10: D3(t0)
> > t11: d10(t10), d11(t10)
> > t12: d11(t10)
> > 
> > In transaction 9, we finish d9 and try to roll to t10 while holding onto
> > an intent item for D3 that we logged in t0.
> > 
> > The previous commit changed the order in which we place new defer ops in
> > the defer ops processing chain to reduce the maximum chain length.  Now
> > make xfs_defer_finish_noroll capable of relogging the entire chain
> > periodically so that we can always move the log tail forward.  Most
> > chains will never get relogged, except for operations that generate very
> > long chains (large extents containing many blocks with different sharing
> > levels) or are on filesystems with small logs and a lot of ongoing
> > metadata updates.
> > 
> > Callers are now required to ensure that the transaction reservation is
> > large enough to handle logging done items and new intent items for the
> > maximum possible chain length.  Most callers are careful to keep the
> > chain lengths low, so the overhead should be minimal.
> > 
> > The decision to relog an intent item is made based on whether or not the
> > intent was added to the current checkpoint.  If so, the checkpoint is
> > still open and there's no point in relogging.  Otherwise, the old
> > checkpoint is closed and we relog the intent to add it to the current
> > one.
> > 
> 
> Looks like this last bit needs to be updated.

Yep.  The next version will replace that paragraph with the following:

"The decision to relog an intent item is made based on two factors: (a)
whether the intent was logged in a previous checkpoint, and (b) if the
log has hit the 75% full threshold.  There's no point in relogging an
intent into the same checkpoint, and there's no need to relog if there's
plenty of free space in the log."
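
In code form that works out to something like the sketch below.  This
isn't the literal hunk from the next revision, just an illustration:
I'm assuming the existing CIL helper xfs_log_item_in_current_chkpt()
for factor (a), the xlog_grant_push_threshold() call this patch already
adds covers factor (b), and the final roll only happens if we actually
dirtied the transaction:

/* Illustrative sketch only, not the posted patch hunk. */
static int
xfs_defer_relog(
	struct xfs_trans		**tpp,
	struct list_head		*dfops)
{
	struct xfs_defer_pending	*dfp;
	xfs_lsn_t			threshold_lsn;

	ASSERT((*tpp)->t_flags & XFS_TRANS_PERM_LOG_RES);

	/* Factor (b): only bother once the log is under pressure. */
	threshold_lsn = xlog_grant_push_threshold((*tpp)->t_mountp->m_log, 0);
	if (threshold_lsn == NULLCOMMITLSN)
		return 0;

	list_for_each_entry(dfp, dfops, dfp_list) {
		/* Factor (a): still in the open checkpoint?  Nothing to gain. */
		if (dfp->dfp_intent == NULL ||
		    xfs_log_item_in_current_chkpt(dfp->dfp_intent))
			continue;

		trace_xfs_defer_relog_intent((*tpp)->t_mountp, dfp);
		dfp->dfp_intent = xfs_trans_item_relog(dfp->dfp_intent, *tpp);
	}

	if ((*tpp)->t_flags & XFS_TRANS_DIRTY)
		return xfs_defer_trans_roll(tpp);
	return 0;
}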

> > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > ---
> >  fs/xfs/libxfs/xfs_defer.c  |   52 ++++++++++++++++++++++++++++++++++++++++++++
> >  fs/xfs/xfs_bmap_item.c     |   27 +++++++++++++++++++++++
> >  fs/xfs/xfs_extfree_item.c  |   29 +++++++++++++++++++++++++
> >  fs/xfs/xfs_refcount_item.c |   27 +++++++++++++++++++++++
> >  fs/xfs/xfs_rmap_item.c     |   27 +++++++++++++++++++++++
> >  fs/xfs/xfs_trace.h         |    1 +
> >  fs/xfs/xfs_trans.h         |   10 ++++++++
> >  7 files changed, 173 insertions(+)
> > 
> > 
> > diff --git a/fs/xfs/libxfs/xfs_defer.c b/fs/xfs/libxfs/xfs_defer.c
> > index 84a70edd0da1..c601cc2af254 100644
> > --- a/fs/xfs/libxfs/xfs_defer.c
> > +++ b/fs/xfs/libxfs/xfs_defer.c
> ...
> > @@ -447,6 +494,11 @@ xfs_defer_finish_noroll(
> >  		if (error)
> >  			goto out_shutdown;
> >  
> > +		/* Every few rolls we relog all the intent items. */
> 
> Stale comment?

Yes.  Change that to:

/* Possibly relog intent items to keep the log tail moving. */

> Those nits aside:
> 
> Reviewed-by: Brian Foster <bfoster@redhat.com>

Thanks for the review!

--D

> > +		error = xfs_defer_relog(tp, &dop_pending);
> > +		if (error)
> > +			goto out_shutdown;
> > +
> >  		dfp = list_first_entry(&dop_pending, struct xfs_defer_pending,
> >  				       dfp_list);
> >  		error = xfs_defer_finish_one(*tp, dfp);
> > diff --git a/fs/xfs/xfs_bmap_item.c b/fs/xfs/xfs_bmap_item.c
> > index 008436ef5bce..6317fdf4a3a0 100644
> > --- a/fs/xfs/xfs_bmap_item.c
> > +++ b/fs/xfs/xfs_bmap_item.c
> > @@ -532,6 +532,32 @@ xfs_bui_item_match(
> >  	return BUI_ITEM(lip)->bui_format.bui_id == intent_id;
> >  }
> >  
> > +/* Relog an intent item to push the log tail forward. */
> > +static struct xfs_log_item *
> > +xfs_bui_item_relog(
> > +	struct xfs_log_item		*intent,
> > +	struct xfs_trans		*tp)
> > +{
> > +	struct xfs_bud_log_item		*budp;
> > +	struct xfs_bui_log_item		*buip;
> > +	struct xfs_map_extent		*extp;
> > +	unsigned int			count;
> > +
> > +	count = BUI_ITEM(intent)->bui_format.bui_nextents;
> > +	extp = BUI_ITEM(intent)->bui_format.bui_extents;
> > +
> > +	tp->t_flags |= XFS_TRANS_DIRTY;
> > +	budp = xfs_trans_get_bud(tp, BUI_ITEM(intent));
> > +	set_bit(XFS_LI_DIRTY, &budp->bud_item.li_flags);
> > +
> > +	buip = xfs_bui_init(tp->t_mountp);
> > +	memcpy(buip->bui_format.bui_extents, extp, count * sizeof(*extp));
> > +	atomic_set(&buip->bui_next_extent, count);
> > +	xfs_trans_add_item(tp, &buip->bui_item);
> > +	set_bit(XFS_LI_DIRTY, &buip->bui_item.li_flags);
> > +	return &buip->bui_item;
> > +}
> > +
> >  static const struct xfs_item_ops xfs_bui_item_ops = {
> >  	.iop_size	= xfs_bui_item_size,
> >  	.iop_format	= xfs_bui_item_format,
> > @@ -539,6 +565,7 @@ static const struct xfs_item_ops xfs_bui_item_ops = {
> >  	.iop_release	= xfs_bui_item_release,
> >  	.iop_recover	= xfs_bui_item_recover,
> >  	.iop_match	= xfs_bui_item_match,
> > +	.iop_relog	= xfs_bui_item_relog,
> >  };
> >  
> >  /*
> > diff --git a/fs/xfs/xfs_extfree_item.c b/fs/xfs/xfs_extfree_item.c
> > index 337fae015bca..30a53f3913d1 100644
> > --- a/fs/xfs/xfs_extfree_item.c
> > +++ b/fs/xfs/xfs_extfree_item.c
> > @@ -642,6 +642,34 @@ xfs_efi_item_match(
> >  	return EFI_ITEM(lip)->efi_format.efi_id == intent_id;
> >  }
> >  
> > +/* Relog an intent item to push the log tail forward. */
> > +static struct xfs_log_item *
> > +xfs_efi_item_relog(
> > +	struct xfs_log_item		*intent,
> > +	struct xfs_trans		*tp)
> > +{
> > +	struct xfs_efd_log_item		*efdp;
> > +	struct xfs_efi_log_item		*efip;
> > +	struct xfs_extent		*extp;
> > +	unsigned int			count;
> > +
> > +	count = EFI_ITEM(intent)->efi_format.efi_nextents;
> > +	extp = EFI_ITEM(intent)->efi_format.efi_extents;
> > +
> > +	tp->t_flags |= XFS_TRANS_DIRTY;
> > +	efdp = xfs_trans_get_efd(tp, EFI_ITEM(intent), count);
> > +	efdp->efd_next_extent = count;
> > +	memcpy(efdp->efd_format.efd_extents, extp, count * sizeof(*extp));
> > +	set_bit(XFS_LI_DIRTY, &efdp->efd_item.li_flags);
> > +
> > +	efip = xfs_efi_init(tp->t_mountp, count);
> > +	memcpy(efip->efi_format.efi_extents, extp, count * sizeof(*extp));
> > +	atomic_set(&efip->efi_next_extent, count);
> > +	xfs_trans_add_item(tp, &efip->efi_item);
> > +	set_bit(XFS_LI_DIRTY, &efip->efi_item.li_flags);
> > +	return &efip->efi_item;
> > +}
> > +
> >  static const struct xfs_item_ops xfs_efi_item_ops = {
> >  	.iop_size	= xfs_efi_item_size,
> >  	.iop_format	= xfs_efi_item_format,
> > @@ -649,6 +677,7 @@ static const struct xfs_item_ops xfs_efi_item_ops = {
> >  	.iop_release	= xfs_efi_item_release,
> >  	.iop_recover	= xfs_efi_item_recover,
> >  	.iop_match	= xfs_efi_item_match,
> > +	.iop_relog	= xfs_efi_item_relog,
> >  };
> >  
> >  /*
> > diff --git a/fs/xfs/xfs_refcount_item.c b/fs/xfs/xfs_refcount_item.c
> > index c78412755b8a..cf0205fdc607 100644
> > --- a/fs/xfs/xfs_refcount_item.c
> > +++ b/fs/xfs/xfs_refcount_item.c
> > @@ -560,6 +560,32 @@ xfs_cui_item_match(
> >  	return CUI_ITEM(lip)->cui_format.cui_id == intent_id;
> >  }
> >  
> > +/* Relog an intent item to push the log tail forward. */
> > +static struct xfs_log_item *
> > +xfs_cui_item_relog(
> > +	struct xfs_log_item		*intent,
> > +	struct xfs_trans		*tp)
> > +{
> > +	struct xfs_cud_log_item		*cudp;
> > +	struct xfs_cui_log_item		*cuip;
> > +	struct xfs_phys_extent		*extp;
> > +	unsigned int			count;
> > +
> > +	count = CUI_ITEM(intent)->cui_format.cui_nextents;
> > +	extp = CUI_ITEM(intent)->cui_format.cui_extents;
> > +
> > +	tp->t_flags |= XFS_TRANS_DIRTY;
> > +	cudp = xfs_trans_get_cud(tp, CUI_ITEM(intent));
> > +	set_bit(XFS_LI_DIRTY, &cudp->cud_item.li_flags);
> > +
> > +	cuip = xfs_cui_init(tp->t_mountp, count);
> > +	memcpy(cuip->cui_format.cui_extents, extp, count * sizeof(*extp));
> > +	atomic_set(&cuip->cui_next_extent, count);
> > +	xfs_trans_add_item(tp, &cuip->cui_item);
> > +	set_bit(XFS_LI_DIRTY, &cuip->cui_item.li_flags);
> > +	return &cuip->cui_item;
> > +}
> > +
> >  static const struct xfs_item_ops xfs_cui_item_ops = {
> >  	.iop_size	= xfs_cui_item_size,
> >  	.iop_format	= xfs_cui_item_format,
> > @@ -567,6 +593,7 @@ static const struct xfs_item_ops xfs_cui_item_ops = {
> >  	.iop_release	= xfs_cui_item_release,
> >  	.iop_recover	= xfs_cui_item_recover,
> >  	.iop_match	= xfs_cui_item_match,
> > +	.iop_relog	= xfs_cui_item_relog,
> >  };
> >  
> >  /*
> > diff --git a/fs/xfs/xfs_rmap_item.c b/fs/xfs/xfs_rmap_item.c
> > index c1a1fd62ca74..3237243d375d 100644
> > --- a/fs/xfs/xfs_rmap_item.c
> > +++ b/fs/xfs/xfs_rmap_item.c
> > @@ -583,6 +583,32 @@ xfs_rui_item_match(
> >  	return RUI_ITEM(lip)->rui_format.rui_id == intent_id;
> >  }
> >  
> > +/* Relog an intent item to push the log tail forward. */
> > +static struct xfs_log_item *
> > +xfs_rui_item_relog(
> > +	struct xfs_log_item		*intent,
> > +	struct xfs_trans		*tp)
> > +{
> > +	struct xfs_rud_log_item		*rudp;
> > +	struct xfs_rui_log_item		*ruip;
> > +	struct xfs_map_extent		*extp;
> > +	unsigned int			count;
> > +
> > +	count = RUI_ITEM(intent)->rui_format.rui_nextents;
> > +	extp = RUI_ITEM(intent)->rui_format.rui_extents;
> > +
> > +	tp->t_flags |= XFS_TRANS_DIRTY;
> > +	rudp = xfs_trans_get_rud(tp, RUI_ITEM(intent));
> > +	set_bit(XFS_LI_DIRTY, &rudp->rud_item.li_flags);
> > +
> > +	ruip = xfs_rui_init(tp->t_mountp, count);
> > +	memcpy(ruip->rui_format.rui_extents, extp, count * sizeof(*extp));
> > +	atomic_set(&ruip->rui_next_extent, count);
> > +	xfs_trans_add_item(tp, &ruip->rui_item);
> > +	set_bit(XFS_LI_DIRTY, &ruip->rui_item.li_flags);
> > +	return &ruip->rui_item;
> > +}
> > +
> >  static const struct xfs_item_ops xfs_rui_item_ops = {
> >  	.iop_size	= xfs_rui_item_size,
> >  	.iop_format	= xfs_rui_item_format,
> > @@ -590,6 +616,7 @@ static const struct xfs_item_ops xfs_rui_item_ops = {
> >  	.iop_release	= xfs_rui_item_release,
> >  	.iop_recover	= xfs_rui_item_recover,
> >  	.iop_match	= xfs_rui_item_match,
> > +	.iop_relog	= xfs_rui_item_relog,
> >  };
> >  
> >  /*
> > diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
> > index a3a35a2d8ed9..362c155be525 100644
> > --- a/fs/xfs/xfs_trace.h
> > +++ b/fs/xfs/xfs_trace.h
> > @@ -2533,6 +2533,7 @@ DEFINE_DEFER_PENDING_EVENT(xfs_defer_create_intent);
> >  DEFINE_DEFER_PENDING_EVENT(xfs_defer_cancel_list);
> >  DEFINE_DEFER_PENDING_EVENT(xfs_defer_pending_finish);
> >  DEFINE_DEFER_PENDING_EVENT(xfs_defer_pending_abort);
> > +DEFINE_DEFER_PENDING_EVENT(xfs_defer_relog_intent);
> >  
> >  #define DEFINE_BMAP_FREE_DEFERRED_EVENT DEFINE_PHYS_EXTENT_DEFERRED_EVENT
> >  DEFINE_BMAP_FREE_DEFERRED_EVENT(xfs_bmap_free_defer);
> > diff --git a/fs/xfs/xfs_trans.h b/fs/xfs/xfs_trans.h
> > index 995c1513693c..e838e8327510 100644
> > --- a/fs/xfs/xfs_trans.h
> > +++ b/fs/xfs/xfs_trans.h
> > @@ -78,6 +78,8 @@ struct xfs_item_ops {
> >  	int (*iop_recover)(struct xfs_log_item *lip,
> >  			   struct xfs_defer_capture **dfcp);
> >  	bool (*iop_match)(struct xfs_log_item *item, uint64_t id);
> > +	struct xfs_log_item *(*iop_relog)(struct xfs_log_item *intent,
> > +			struct xfs_trans *tp);
> >  };
> >  
> >  /*
> > @@ -239,4 +241,12 @@ void		xfs_trans_buf_copy_type(struct xfs_buf *dst_bp,
> >  
> >  extern kmem_zone_t	*xfs_trans_zone;
> >  
> > +static inline struct xfs_log_item *
> > +xfs_trans_item_relog(
> > +	struct xfs_log_item	*lip,
> > +	struct xfs_trans	*tp)
> > +{
> > +	return lip->li_ops->iop_relog(lip, tp);
> > +}
> > +
> >  #endif	/* __XFS_TRANS_H__ */
> > 
>
Dave Chinner Sept. 27, 2020, 11:08 p.m. UTC | #3
On Tue, Sep 22, 2020 at 10:33:19PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <darrick.wong@oracle.com>
> 
> There's a subtle design flaw in the deferred log item code that can lead
> to pinning the log tail.  Taking up the defer ops chain examples from
> the previous commit, we can get trapped in sequences like this:
> 
> Caller hands us a transaction t0 with D0-D3 attached.  The defer ops
> chain will look like the following if the transaction rolls succeed:
> 
> t1: D0(t0), D1(t0), D2(t0), D3(t0)
> t2: d4(t1), d5(t1), D1(t0), D2(t0), D3(t0)
> t3: d5(t1), D1(t0), D2(t0), D3(t0)
> ...
> t9: d9(t7), D3(t0)
> t10: D3(t0)
> t11: d10(t10), d11(t10)
> t12: d11(t10)
> 
> In transaction 9, we finish d9 and try to roll to t10 while holding onto
> an intent item for D3 that we logged in t0.
> 
> The previous commit changed the order in which we place new defer ops in
> the defer ops processing chain to reduce the maximum chain length.  Now
> make xfs_defer_finish_noroll capable of relogging the entire chain
> periodically so that we can always move the log tail forward.  Most
> chains will never get relogged, except for operations that generate very
> long chains (large extents containing many blocks with different sharing
> levels) or are on filesystems with small logs and a lot of ongoing
> metadata updates.
> 
> Callers are now required to ensure that the transaction reservation is
> large enough to handle logging done items and new intent items for the
> maximum possible chain length.  Most callers are careful to keep the
> chain lengths low, so the overhead should be minimal.
> 
> The decision to relog an intent item is made based on whether or not the
> intent was added to the current checkpoint.  If so, the checkpoint is
> still open and there's no point in relogging.  Otherwise, the old
> checkpoint is closed and we relog the intent to add it to the current
> one.
> 
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> ---
>  fs/xfs/libxfs/xfs_defer.c  |   52 ++++++++++++++++++++++++++++++++++++++++++++
>  fs/xfs/xfs_bmap_item.c     |   27 +++++++++++++++++++++++
>  fs/xfs/xfs_extfree_item.c  |   29 +++++++++++++++++++++++++
>  fs/xfs/xfs_refcount_item.c |   27 +++++++++++++++++++++++
>  fs/xfs/xfs_rmap_item.c     |   27 +++++++++++++++++++++++
>  fs/xfs/xfs_trace.h         |    1 +
>  fs/xfs/xfs_trans.h         |   10 ++++++++
>  7 files changed, 173 insertions(+)
> 
> 
> diff --git a/fs/xfs/libxfs/xfs_defer.c b/fs/xfs/libxfs/xfs_defer.c
> index 84a70edd0da1..c601cc2af254 100644
> --- a/fs/xfs/libxfs/xfs_defer.c
> +++ b/fs/xfs/libxfs/xfs_defer.c
> @@ -17,6 +17,7 @@
>  #include "xfs_inode_item.h"
>  #include "xfs_trace.h"
>  #include "xfs_icache.h"
> +#include "xfs_log.h"
>  
>  /*
>   * Deferred Operations in XFS
> @@ -361,6 +362,52 @@ xfs_defer_cancel_list(
>  	}
>  }
>  
> +/*
> + * Prevent a log intent item from pinning the tail of the log by logging a
> + * done item to release the intent item; and then log a new intent item.
> + * The caller should provide a fresh transaction and roll it after we're done.
> + */
> +static int
> +xfs_defer_relog(
> +	struct xfs_trans		**tpp,
> +	struct list_head		*dfops)
> +{
> +	struct xfs_defer_pending	*dfp;
> +	xfs_lsn_t			threshold_lsn;
> +
> +	ASSERT((*tpp)->t_flags & XFS_TRANS_PERM_LOG_RES);
> +
> +	/*
> +	 * Figure out where we need the tail to be in order to maintain the
> +	 * minimum required free space in the log.
> +	 */
> +	threshold_lsn = xlog_grant_push_threshold((*tpp)->t_mountp->m_log, 0);
> +	if (threshold_lsn == NULLCOMMITLSN)
> +		return 0;

This smells of premature optimisation.

When we are in a tail-pushing scenario (i.e. any sort of
sustained metadata workload) this will always return true, and so we
will relog every intent that isn't in the current checkpoint every
time this is called.  Under light load, we don't care if we add a
little bit of relogging overhead as the CIL slowly flushes/pushes -
it will have negligible impact on performance because there is little
load on the journal.

However, when we are under heavy load the code will now be reading
the grant head and log position accounting variables during every
commit, hence greatly increasing the number and temporal
distribution of accesses to the hottest cachelines in the log. We
currently never access these cache lines during commit unless the
unit reservation has run out and we have to regrant physical log
space for the transaction to continue (i.e. we are into slow path
commit code). IOWs, this is likely to cause far more than double the
number of accesses to the grant head, the log tail, the
last_sync_lsn, etc, all of which are unnecessary exactly when we care
about minimising contention on the log space accounting variables...

Given that it is a redundant check under heavy journal load
when access to the log grant/head/tail is already contended,
I think we should just be checking the "in current checkpoint" logic
and not making it conditional on the log being near full.

Cheers,

Dave.
Darrick J. Wong Sept. 27, 2020, 11:30 p.m. UTC | #4
On Mon, Sep 28, 2020 at 09:08:23AM +1000, Dave Chinner wrote:
> On Tue, Sep 22, 2020 at 10:33:19PM -0700, Darrick J. Wong wrote:
> > From: Darrick J. Wong <darrick.wong@oracle.com>
> > 
> > There's a subtle design flaw in the deferred log item code that can lead
> > to pinning the log tail.  Taking up the defer ops chain examples from
> > the previous commit, we can get trapped in sequences like this:
> > 
> > Caller hands us a transaction t0 with D0-D3 attached.  The defer ops
> > chain will look like the following if the transaction rolls succeed:
> > 
> > t1: D0(t0), D1(t0), D2(t0), D3(t0)
> > t2: d4(t1), d5(t1), D1(t0), D2(t0), D3(t0)
> > t3: d5(t1), D1(t0), D2(t0), D3(t0)
> > ...
> > t9: d9(t7), D3(t0)
> > t10: D3(t0)
> > t11: d10(t10), d11(t10)
> > t12: d11(t10)
> > 
> > In transaction 9, we finish d9 and try to roll to t10 while holding onto
> > an intent item for D3 that we logged in t0.
> > 
> > The previous commit changed the order in which we place new defer ops in
> > the defer ops processing chain to reduce the maximum chain length.  Now
> > make xfs_defer_finish_noroll capable of relogging the entire chain
> > periodically so that we can always move the log tail forward.  Most
> > chains will never get relogged, except for operations that generate very
> > long chains (large extents containing many blocks with different sharing
> > levels) or are on filesystems with small logs and a lot of ongoing
> > metadata updates.
> > 
> > Callers are now required to ensure that the transaction reservation is
> > large enough to handle logging done items and new intent items for the
> > maximum possible chain length.  Most callers are careful to keep the
> > chain lengths low, so the overhead should be minimal.
> > 
> > The decision to relog an intent item is made based on whether or not the
> > intent was added to the current checkpoint.  If so, the checkpoint is
> > still open and there's no point in relogging.  Otherwise, the old
> > checkpoint is closed and we relog the intent to add it to the current
> > one.
> > 
> > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > ---
> >  fs/xfs/libxfs/xfs_defer.c  |   52 ++++++++++++++++++++++++++++++++++++++++++++
> >  fs/xfs/xfs_bmap_item.c     |   27 +++++++++++++++++++++++
> >  fs/xfs/xfs_extfree_item.c  |   29 +++++++++++++++++++++++++
> >  fs/xfs/xfs_refcount_item.c |   27 +++++++++++++++++++++++
> >  fs/xfs/xfs_rmap_item.c     |   27 +++++++++++++++++++++++
> >  fs/xfs/xfs_trace.h         |    1 +
> >  fs/xfs/xfs_trans.h         |   10 ++++++++
> >  7 files changed, 173 insertions(+)
> > 
> > 
> > diff --git a/fs/xfs/libxfs/xfs_defer.c b/fs/xfs/libxfs/xfs_defer.c
> > index 84a70edd0da1..c601cc2af254 100644
> > --- a/fs/xfs/libxfs/xfs_defer.c
> > +++ b/fs/xfs/libxfs/xfs_defer.c
> > @@ -17,6 +17,7 @@
> >  #include "xfs_inode_item.h"
> >  #include "xfs_trace.h"
> >  #include "xfs_icache.h"
> > +#include "xfs_log.h"
> >  
> >  /*
> >   * Deferred Operations in XFS
> > @@ -361,6 +362,52 @@ xfs_defer_cancel_list(
> >  	}
> >  }
> >  
> > +/*
> > + * Prevent a log intent item from pinning the tail of the log by logging a
> > + * done item to release the intent item; and then log a new intent item.
> > + * The caller should provide a fresh transaction and roll it after we're done.
> > + */
> > +static int
> > +xfs_defer_relog(
> > +	struct xfs_trans		**tpp,
> > +	struct list_head		*dfops)
> > +{
> > +	struct xfs_defer_pending	*dfp;
> > +	xfs_lsn_t			threshold_lsn;
> > +
> > +	ASSERT((*tpp)->t_flags & XFS_TRANS_PERM_LOG_RES);
> > +
> > +	/*
> > +	 * Figure out where we need the tail to be in order to maintain the
> > +	 * minimum required free space in the log.
> > +	 */
> > +	threshold_lsn = xlog_grant_push_threshold((*tpp)->t_mountp->m_log, 0);
> > +	if (threshold_lsn == NULLCOMMITLSN)
> > +		return 0;
> 
> This smells of premature optimisation.
> 
> When we are in a tail-pushing scenario (i.e. any sort of
> sustained metadata workload) this will always return true, and so we
> will relog every intent that isn't in the current checkpoint every
> time this is called.  Under light load, we don't care if we add a
> little bit of relogging overhead as the CIL slowly flushes/pushes -
> it will have neglible impact on performance because there is little
> load on the journal.
> 
> However, when we are under heavy load the code will now be reading
> the grant head and log position accounting variables during every
> commit, hence greatly increasing the number and temporal
> distribution of accesses to the hotest cachelines in the log. We
> currently never access these cache lines during commit unless the
> unit reservation has run out and we have to regrant physical log
> space for the transaction to continue (i.e. we are into slow path
> commit code). IOWs, this is like causing far more than double the
> number of accesses to the grant head, the log tail, the
> last_sync_lsn, etc, all of which is unnecessary exactly when we care
> about minimising contention on the log space accounting variables...
> 
> Given that it is a redundant check under heavy load journal load
> when access to the log grant/head/tail are already contended,
> I think we should just be checking the "in current checkpoint" logic
> and not making it conditional on the log being near full.

<nod> FWIW I broke this patch up again into the first part that
only does relogging if the checkpoints don't match, and a second part
that does the LSN push target check to see if I could observe any
difference.

Across a ~4h fstests run I noticed that there were roughly 20% fewer
relogs, but OTOH the total runtime didn't change noticeably.  I kind of
wondered if the increased cacheline contention would at least slow down
the frontend a bit to give the log a chance to push things out, but
haven't had time to dig any further than "ran fstests, recorded runtimes
and grep | wc -l'd the ftrace log".

Anyway, I was about to resend with all these patches rebased against
something resembling the 5.10 branch, so expect to see this broken out a
bit.

--D

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
Dave Chinner Sept. 28, 2020, 1 a.m. UTC | #5
On Sun, Sep 27, 2020 at 04:30:25PM -0700, Darrick J. Wong wrote:
> On Mon, Sep 28, 2020 at 09:08:23AM +1000, Dave Chinner wrote:
> > On Tue, Sep 22, 2020 at 10:33:19PM -0700, Darrick J. Wong wrote:
> > > @@ -361,6 +362,52 @@ xfs_defer_cancel_list(
> > >  	}
> > >  }
> > >  
> > > +/*
> > > + * Prevent a log intent item from pinning the tail of the log by logging a
> > > + * done item to release the intent item; and then log a new intent item.
> > > + * The caller should provide a fresh transaction and roll it after we're done.
> > > + */
> > > +static int
> > > +xfs_defer_relog(
> > > +	struct xfs_trans		**tpp,
> > > +	struct list_head		*dfops)
> > > +{
> > > +	struct xfs_defer_pending	*dfp;
> > > +	xfs_lsn_t			threshold_lsn;
> > > +
> > > +	ASSERT((*tpp)->t_flags & XFS_TRANS_PERM_LOG_RES);
> > > +
> > > +	/*
> > > +	 * Figure out where we need the tail to be in order to maintain the
> > > +	 * minimum required free space in the log.
> > > +	 */
> > > +	threshold_lsn = xlog_grant_push_threshold((*tpp)->t_mountp->m_log, 0);
> > > +	if (threshold_lsn == NULLCOMMITLSN)
> > > +		return 0;
> > 
> > This smells of premature optimisation.
> > 
> > When we are in a tail-pushing scenario (i.e. any sort of
> > sustained metadata workload) this will always return true, and so we
> > will relog every intent that isn't in the current checkpoint every
> > time this is called.  Under light load, we don't care if we add a
> > little bit of relogging overhead as the CIL slowly flushes/pushes -
> > it will have neglible impact on performance because there is little
> > load on the journal.
> > 
> > However, when we are under heavy load the code will now be reading
> > the grant head and log position accounting variables during every
> > commit, hence greatly increasing the number and temporal
> > distribution of accesses to the hotest cachelines in the log. We
> > currently never access these cache lines during commit unless the
> > unit reservation has run out and we have to regrant physical log
> > space for the transaction to continue (i.e. we are into slow path
> > commit code). IOWs, this is like causing far more than double the
> > number of accesses to the grant head, the log tail, the
> > last_sync_lsn, etc, all of which is unnecessary exactly when we care
> > about minimising contention on the log space accounting variables...
> > 
> > Given that it is a redundant check under heavy load journal load
> > when access to the log grant/head/tail are already contended,
> > I think we should just be checking the "in current checkpoint" logic
> > and not making it conditional on the log being near full.
> 
> <nod> FWIW I broke this patch up again into the first part that
> only does relogging if the checkpoints don't match, and a second part
> that does the LSN push target check to see if I could observe any
> difference.
> 
> Across a ~4h fstests run I noticed that there was about ~20% fewer
> relogs, but OTOH the total runtime didn't change noticeably.  I kind of
> wondered if the increased cacheline contention would at least slow down
> the frontend a bit to give the log a chance to push things out, but
> haven't had time to dig any further than "ran fstests, recorded runtimes
> and grep | wc -l'd the ftrace log".

fstests doesn't generate anywhere near the load necessary to measure
log space accounting cacheline contention. That's one of the things
I use fsmark workloads for.

One of the things that "20% reduction" tells me, though, is that 80%
of the time that relogging is happening with this patch we are in a
tail pushing situation. i.e. the log is more often full than it is
empty when we are relogging.

That tends to support my statements that this is optimising the
wrong case.

FWIW, it seems to me that we need real stats for the deferops (i.e.
exposed in /proc/fs/xfs/stat) so we can monitor in realtime the
breakdown of work that is being done. Being able to see things like
how many defer rolls a transaction takes under a given workload is
very interesting information, and would give us insight into things
like whether the log count for transaction reservations is
reasonable or not....
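
Even a bare counter would be a start.  Something like the fragment
below hung off the relog path would let us watch the relog rate in
real time; the field name here is made up and the /proc/fs/xfs/stat
formatting table would need a matching entry, but the per-cpu stats
machinery already exists:

/* fs/xfs/xfs_stats.h: new counter in struct __xfsstats (name illustrative) */
	uint32_t		xs_defer_relog;

/* fs/xfs/libxfs/xfs_defer.c: bump it each time an intent gets relogged */
	XFS_STATS_INC((*tpp)->t_mountp, xs_defer_relog);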

Cheers,

Dave.
Brian Foster Sept. 28, 2020, 3:16 p.m. UTC | #6
On Mon, Sep 28, 2020 at 09:08:23AM +1000, Dave Chinner wrote:
> On Tue, Sep 22, 2020 at 10:33:19PM -0700, Darrick J. Wong wrote:
> > From: Darrick J. Wong <darrick.wong@oracle.com>
> > 
> > There's a subtle design flaw in the deferred log item code that can lead
> > to pinning the log tail.  Taking up the defer ops chain examples from
> > the previous commit, we can get trapped in sequences like this:
> > 
> > Caller hands us a transaction t0 with D0-D3 attached.  The defer ops
> > chain will look like the following if the transaction rolls succeed:
> > 
> > t1: D0(t0), D1(t0), D2(t0), D3(t0)
> > t2: d4(t1), d5(t1), D1(t0), D2(t0), D3(t0)
> > t3: d5(t1), D1(t0), D2(t0), D3(t0)
> > ...
> > t9: d9(t7), D3(t0)
> > t10: D3(t0)
> > t11: d10(t10), d11(t10)
> > t12: d11(t10)
> > 
> > In transaction 9, we finish d9 and try to roll to t10 while holding onto
> > an intent item for D3 that we logged in t0.
> > 
> > The previous commit changed the order in which we place new defer ops in
> > the defer ops processing chain to reduce the maximum chain length.  Now
> > make xfs_defer_finish_noroll capable of relogging the entire chain
> > periodically so that we can always move the log tail forward.  Most
> > chains will never get relogged, except for operations that generate very
> > long chains (large extents containing many blocks with different sharing
> > levels) or are on filesystems with small logs and a lot of ongoing
> > metadata updates.
> > 
> > Callers are now required to ensure that the transaction reservation is
> > large enough to handle logging done items and new intent items for the
> > maximum possible chain length.  Most callers are careful to keep the
> > chain lengths low, so the overhead should be minimal.
> > 
> > The decision to relog an intent item is made based on whether or not the
> > intent was added to the current checkpoint.  If so, the checkpoint is
> > still open and there's no point in relogging.  Otherwise, the old
> > checkpoint is closed and we relog the intent to add it to the current
> > one.
> > 
> > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > ---
> >  fs/xfs/libxfs/xfs_defer.c  |   52 ++++++++++++++++++++++++++++++++++++++++++++
> >  fs/xfs/xfs_bmap_item.c     |   27 +++++++++++++++++++++++
> >  fs/xfs/xfs_extfree_item.c  |   29 +++++++++++++++++++++++++
> >  fs/xfs/xfs_refcount_item.c |   27 +++++++++++++++++++++++
> >  fs/xfs/xfs_rmap_item.c     |   27 +++++++++++++++++++++++
> >  fs/xfs/xfs_trace.h         |    1 +
> >  fs/xfs/xfs_trans.h         |   10 ++++++++
> >  7 files changed, 173 insertions(+)
> > 
> > 
> > diff --git a/fs/xfs/libxfs/xfs_defer.c b/fs/xfs/libxfs/xfs_defer.c
> > index 84a70edd0da1..c601cc2af254 100644
> > --- a/fs/xfs/libxfs/xfs_defer.c
> > +++ b/fs/xfs/libxfs/xfs_defer.c
> > @@ -17,6 +17,7 @@
> >  #include "xfs_inode_item.h"
> >  #include "xfs_trace.h"
> >  #include "xfs_icache.h"
> > +#include "xfs_log.h"
> >  
> >  /*
> >   * Deferred Operations in XFS
> > @@ -361,6 +362,52 @@ xfs_defer_cancel_list(
> >  	}
> >  }
> >  
> > +/*
> > + * Prevent a log intent item from pinning the tail of the log by logging a
> > + * done item to release the intent item; and then log a new intent item.
> > + * The caller should provide a fresh transaction and roll it after we're done.
> > + */
> > +static int
> > +xfs_defer_relog(
> > +	struct xfs_trans		**tpp,
> > +	struct list_head		*dfops)
> > +{
> > +	struct xfs_defer_pending	*dfp;
> > +	xfs_lsn_t			threshold_lsn;
> > +
> > +	ASSERT((*tpp)->t_flags & XFS_TRANS_PERM_LOG_RES);
> > +
> > +	/*
> > +	 * Figure out where we need the tail to be in order to maintain the
> > +	 * minimum required free space in the log.
> > +	 */
> > +	threshold_lsn = xlog_grant_push_threshold((*tpp)->t_mountp->m_log, 0);
> > +	if (threshold_lsn == NULLCOMMITLSN)
> > +		return 0;
> 
> This smells of premature optimisation.
> 
> When we are in a tail-pushing scenario (i.e. any sort of
> sustained metadata workload) this will always return true, and so we
> will relog every intent that isn't in the current checkpoint every
> time this is called.  Under light load, we don't care if we add a
> little bit of relogging overhead as the CIL slowly flushes/pushes -
> it will have neglible impact on performance because there is little
> load on the journal.
> 

That seems like an overly broad and not necessarily correct assumption.
The threshold above is tail relative and hardcoded (with the zero param)
to the default AIL pushing threshold, which is 25% of the log. If we
assume sustained tail pushing conditions, a committed intent doesn't
trigger a relog again until it falls within that threshold in the on-disk
log. The current CIL (nonblocking) size threshold is half that at 12.5%.
So relative to a given rolling transaction processing an intent chain,
there's a decent number of full CIL checkpoints that can occur before
any of those intents need to relog again (if at all), regardless of log
size.

Without that logic, we're changing behavior to relog the entire chain in
every CIL checkpoint. That's a fairly significant change in behavior
with less predictable breakdown conditions. Do we know how long a single
chain is going to be? Do we know how many CPUs are concurrently
processing "long chain" operations? Do we know whether an external
workload is causing even more frequent checkpoints than governed by the
CIL size limit? The examples in the commit logs in this series refer to
chains of hundreds of intents with hundreds of transaction rolls.
Those types of scenarios are hard enough to reason about, particularly
when we consider boxes with hundreds of CPUs, so I'm somewhat skeptical
of us accurately predicting performance/behavior over an implementation
that bounds processing more explicitly.

ISTM that these are all essentially undefined contributing factors. For
those reasons, I don't really consider this an optimization at all.
Rather IMO it's a component of sane/predictable functional behavior. I
don't think concerns over cacheline contention on the log tail field
justify removing it entirely.

> However, when we are under heavy load the code will now be reading
> the grant head and log position accounting variables during every
> commit, hence greatly increasing the number and temporal
> distribution of accesses to the hotest cachelines in the log. We
> currently never access these cache lines during commit unless the
> unit reservation has run out and we have to regrant physical log
> space for the transaction to continue (i.e. we are into slow path
> commit code). IOWs, this is like causing far more than double the
> number of accesses to the grant head, the log tail, the
> last_sync_lsn, etc, all of which is unnecessary exactly when we care
> about minimising contention on the log space accounting variables...
> 

If we're under sustained tail pushing pressure, blocking transactions
have already hit this cacheline as well, FWIW.

But if we're concerned about cacheline access in the commit path
specifically, I'm sure we could come up with any number of optimizations
to address that directly. E.g., we could sample the threshold
occasionally instead of on every roll, wait until the current
transaction has consumed all logres units (where we'll be hitting that
line anyways), shift state tracking over to xfsaild via setting a flag
on log items with an ->iop_relog() handler set that happen to pin the
tail, etc. etc. That said, I do think any such added complexity should
be justified with some accompanying profiling data (and I like the idea
of new stats counters mentioned in the separate mail).
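
(To pick the first of those as an example, the sampling could be as
simple as the sketch below.  The helper name, the every-eight-rolls
heuristic and the cached-lsn parameter are all made up for
illustration; the cached target would live in the caller for the life
of the dfops chain and be handed to xfs_defer_relog() instead of
reading the grant head on every roll.)

/* Sketch only: refresh the push target every few rolls, not every roll. */
static xfs_lsn_t
xfs_defer_push_target(
	struct xfs_mount	*mp,
	xfs_lsn_t		*cached,
	unsigned int		nr_rolls)
{
	/* Touch the hot grant head cachelines only occasionally. */
	if ((nr_rolls & 7) == 0)
		*cached = xlog_grant_push_threshold(mp->m_log, 0);
	return *cached;
}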

Brian

> Given that it is a redundant check under heavy load journal load
> when access to the log grant/head/tail are already contended,
> I think we should just be checking the "in current checkpoint" logic
> and not making it conditional on the log being near full.
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
>
Dave Chinner Sept. 28, 2020, 11:09 p.m. UTC | #7
On Mon, Sep 28, 2020 at 11:16:27AM -0400, Brian Foster wrote:
> On Mon, Sep 28, 2020 at 09:08:23AM +1000, Dave Chinner wrote:
> > On Tue, Sep 22, 2020 at 10:33:19PM -0700, Darrick J. Wong wrote:
> > > +xfs_defer_relog(
> > > +	struct xfs_trans		**tpp,
> > > +	struct list_head		*dfops)
> > > +{
> > > +	struct xfs_defer_pending	*dfp;
> > > +	xfs_lsn_t			threshold_lsn;
> > > +
> > > +	ASSERT((*tpp)->t_flags & XFS_TRANS_PERM_LOG_RES);
> > > +
> > > +	/*
> > > +	 * Figure out where we need the tail to be in order to maintain the
> > > +	 * minimum required free space in the log.
> > > +	 */
> > > +	threshold_lsn = xlog_grant_push_threshold((*tpp)->t_mountp->m_log, 0);
> > > +	if (threshold_lsn == NULLCOMMITLSN)
> > > +		return 0;
> > 
> > This smells of premature optimisation.
> > 
> > When we are in a tail-pushing scenario (i.e. any sort of
> > sustained metadata workload) this will always return true, and so we
> > will relog every intent that isn't in the current checkpoint every
> > time this is called.  Under light load, we don't care if we add a
> > little bit of relogging overhead as the CIL slowly flushes/pushes -
> > it will have neglible impact on performance because there is little
> > load on the journal.
> > 
> 
> That seems like an overly broad and not necessarily correct assumption.
> The threshold above is tail relative and hardcoded (with the zero param)
> to the default AIL pushing threshold, which is 25% of the log. If we

The pushing threshold is that 75% of the log space has been reserved by
the grant head, not that there's only 25% of the log free. The
reservation is typically much, much larger than the change that is
actually written to the log, and so we can be tail pushing even when
the log is largely empty and the CIL hasn't reached it's threshold
for pushing.

Think about it - if we have a log with a 256kB log stripe unit,
an itruncate reservation is somewhere around 1.5MB in size (700kB *
2), and if we have reflink/rmap enabled, it's closer to 6MB in size.
Yet a typical modification will not use any of the 512kB*8 space
reserved for the LSU (because CIL!) and might only use 10kB of new
log space in the CIL because that's all that was newly dirtied.

That means we reserve 6MB of log space for each itruncate
transaction in flight (directory ops are almost as heavyweight), so
it only takes 16 transactions in flight to consume 100MB of reserve
grant head space. If we've got a log smaller than 130MB in this
situation, we are into tail pushing -even if the physical log is
empty-.

IOWs, tail pushing is not triggered by how much physical log space
has been consumed - it is triggered by reservation pressure. And we
can trivially drive the system into reservation pressure with
concurrent workloads.

> assume sustained tail pushing conditions, a committed intent doesn't
> trigger relog again until it falls within that threshold in the on-disk
> log. The current CIL (nonblocking) size threshold is half that at 12.5%.

Or smaller. For small logs it is 12.5% of the log size. For larger
logs, it caps at 32MB (8 * (256k << 4)).

> So relative to a given rolling transaction processing an intent chain,
> there's a decent number of full CIL checkpoints that can occur before
> any of those intents need to relog again (if at all), regardless of log
> size.

Your logic is inverted because reservation pressure does not reflect
CIL pressure.  Log space consumption and reservation pressure were
decoupled by the in-memory relogging that the CIL enables, hence
given a specific intent chain, the likelihood is that it
will not span more than a single CIL context. i.e. it will complete
rolling entirely within a single CIL commit.

As the log gets larger and the CIL increases in size, the
probability that an intent chain completes within a single CIL
commit goes up (e.g. 6MB reservation vs 32MB CIL push threshold).
The lighter the load, the more likely it is that a transaction chain
will complete within a single CIL checkpoint, even on small logs.

> Without that logic, we're changing behavior to relog the entire chain in
> every CIL checkpoint. That's a fairly significant change in behavior
> with less predictable breakdown conditions. Do we know how long a single
> chain is going to be?

Yes: the typical worst case chain length is defined by the
transaction reservation log count. That is supposed to cover >99%
of typical log space requirements for that operation.

For any chain that is longer than this, we end up in the slow
regrant path that reserves a unit of log space at a time, and these
are the cases where the transaction must be careful about not
pinning the tail of the log. i.e. long chains are exactly the case
where we want to relog the intents in every new CIL context. This is
no different to the way we relog the inode in -every transaction
roll- whether it is modified or not so that we can guarantee it
doesn't pin the tail of the log when we need to regrant space on
commit....

> Do we know how many CPUs are concurrently
> processing "long chain" operations?

Yes: as many as the grant head reservations will allow. That's
always been the bound on maximum transaction concurrency we can
support.

> Do we know whether an external
> workload is causing even more frequent checkpoints than governed by the
> CIL size limit?

Irrelevant, because checkpoint frequency affects the relogging
efficiency of -everything- that running transactions touch. i.e. the
impact is not isolated to relogging a few small intents, it affects
relogging of all metadata, from AG headers to btree blocks,
directories and inodes. IOWs, the overhead of relogging intents
will be lost in the overhead of completely relogging entire dirty
metadata blocks that the transactions also touch, even though they
may have only newly dirtied a few bytes in each buffer.

> The examples in the commit logs in this series refer to
> chains of hundreds of intents with hundreds of transaction rolls.

Truncating a file with 10 million extents will roll the transaction
10 million times. Having a long intent chain is no different from a
"don't pin the tail of the log" perspective than truncating a
badly fragmented file. Both need to relog their outstanding items
that could pin the tail of the log as frequently as they commit...

This is a basic principle of permanent transactions: they must
guarantee forwards progress by ensuring they can't pin the tail of
the log at any time. That is, if you hold an item that *may* pin the
tail of the log when you need to regrant log space, you need to move
that item to the head of the log -before- you attempt to regrant
log space. We do not do that with intents, and conditional relogging
doesn't provide that guarantee, either.  Relogging the item once the
item is no longer at the head of the log provides that forwards
progress guarantee.

I don't want to slap a band-aid over the problem in the name of
"performance". Correctness comes first, then if we find a
performance problem we'll address that without breaking correctness.

> Those types of scenarios are hard enough to reason about,
> particularly when we consider boxes with hundreds of CPUs, so I'm
> somewhat skeptical of us accurately predicting
> performance/behavior over an implementation that bounds processing
> more explicitly.  ISTM that these are all essentially undefined
> contributing factors.

I don't find these things hard to reason about, nor do I consider
them undefined factors. This is largely because I've been dealing
with journalling and AIL issues at this scale for the best part of
15 years. Everything I think about in this area starts with the
requirement "this needs to scale to 10,000+ concurrent transactions
across thousands of CPUs". The reason for that can be found in
ancient history, such as:

commit 249a8c1124653fa90f3a3afff869095a31bc229f
Author: David Chinner <dgc@sgi.com>
Date:   Tue Feb 5 12:13:32 2008 +1100

    [XFS] Move AIL pushing into it's own thread
    
    When many hundreds to thousands of threads all try to do simultaneous
    transactions and the log is in a tail-pushing situation (i.e. full), we
    can get multiple threads walking the AIL list and contending on the AIL
    lock.
....

commit 28f1bc4aabea316d01ca7a59c38d619bc17e3550
Author: Dave Chinner <dgc@sgi.com>
Date:   Fri Feb 15 15:19:57 2008 +0000

    Prevent AIL lock contention during transaction completion
    
    When hundreds of processors attempt to commit transactions at the
    same time, they can contend on the AIL lock when updating the tail
    LSN held in the in-core log structure.
....

The workload that led to the lockless grant head reservation code,
the single threaded, cursor based AIL traversals, the atomic tail
accounting, etc was a MPI workload on a 2048p machine doing an MPI
synchronised close on ~6 fds per CPU on all CPUs at once. The
lockless log code reduced this part of the workload from 4 hours to
4 seconds....

Servers these days are still small compared to the SGI machines that
XFS ran on in the mid-2000s....

> > However, when we are under heavy load the code will now be reading
> > the grant head and log position accounting variables during every
> > commit, hence greatly increasing the number and temporal
> > distribution of accesses to the hotest cachelines in the log. We
> > currently never access these cache lines during commit unless the
> > unit reservation has run out and we have to regrant physical log
> > space for the transaction to continue (i.e. we are into slow path
> > commit code). IOWs, this is like causing far more than double the
> > number of accesses to the grant head, the log tail, the
> > last_sync_lsn, etc, all of which is unnecessary exactly when we care
> > about minimising contention on the log space accounting variables...
> > 
> 
> If we're under sustained tail pushing pressure, blocking transactions
> have already hit this cacheline as well, FWIW.

No, tail pushing != blocking. If the AIL pushing is keeping up with
the targets that are being set (as frequently happens with large
logs and fast storage), then we rarely run out of log space and we
do not block new reservations waiting for reservation space.

> But if we're concerned about cacheline access in the commit path
> specifically, I'm sure we could come up with any number of optimizations
> to address that directly. E.g., we could sample the threshold
> occasionally instead of on every roll, wait until the current
> transaction has consumed all logres units (where we'll be hitting that
> line anyways), shift state tracking over to xfsaild via setting a flag
> on log items with an ->iop_relog() handler set that happen to pin the
> tail, etc. etc. That said, I do think any such added complexity should
> be justified with some accompanying profiling data (and I like the idea
> of new stats counters mentioned in the separate mail).

Yes, there's lots of ways the pushing threshold could be done
differently, but to focus on that misses the point that intents
violate forwards progress guarantees rolling transactions are
supposed to provide the log. I simply made the "access overhead"
argument as a supporting point that this change is also likely to
have unintended side effects...

Cheers,

Dave.
Brian Foster Sept. 29, 2020, 12:27 p.m. UTC | #8
On Tue, Sep 29, 2020 at 09:09:10AM +1000, Dave Chinner wrote:
> On Mon, Sep 28, 2020 at 11:16:27AM -0400, Brian Foster wrote:
> > On Mon, Sep 28, 2020 at 09:08:23AM +1000, Dave Chinner wrote:
> > > On Tue, Sep 22, 2020 at 10:33:19PM -0700, Darrick J. Wong wrote:
> > > > +xfs_defer_relog(
> > > > +	struct xfs_trans		**tpp,
> > > > +	struct list_head		*dfops)
> > > > +{
> > > > +	struct xfs_defer_pending	*dfp;
> > > > +	xfs_lsn_t			threshold_lsn;
> > > > +
> > > > +	ASSERT((*tpp)->t_flags & XFS_TRANS_PERM_LOG_RES);
> > > > +
> > > > +	/*
> > > > +	 * Figure out where we need the tail to be in order to maintain the
> > > > +	 * minimum required free space in the log.
> > > > +	 */
> > > > +	threshold_lsn = xlog_grant_push_threshold((*tpp)->t_mountp->m_log, 0);
> > > > +	if (threshold_lsn == NULLCOMMITLSN)
> > > > +		return 0;
> > > 
> > > This smells of premature optimisation.
> > > 
> > > When we are in a tail-pushing scenario (i.e. any sort of
> > > sustained metadata workload) this will always return true, and so we
> > > will relog every intent that isn't in the current checkpoint every
> > > time this is called.  Under light load, we don't care if we add a
> > > little bit of relogging overhead as the CIL slowly flushes/pushes -
> > > it will have neglible impact on performance because there is little
> > > load on the journal.
> > > 
> > 
> > That seems like an overly broad and not necessarily correct assumption.
> > The threshold above is tail relative and hardcoded (with the zero param)
> > to the default AIL pushing threshold, which is 25% of the log. If we
> 
> The pushing threshold is 75% of the log space has been reserved by
> the grant head, not that there's only 25% of the log free. The
> reservation is typically much, much larger than the change that is
> actually written to the log, and so we can be tail pushing even when
> the log is largely empty and the CIL hasn't reached it's threshold
> for pushing.
> 
> Think about it - if we have a log with a 256kB log stripe unit,
> a itruncate reservation is somewhere around 1.5MB in size (700kB *
> 2), and if we have reflink/rmap enabled, it's closer to 6MB in size.
> Yet a typical modification will not use any of the 512kB*8 space
> reserved for the LSU (because CIL!) and might only use 10kB of new
> log space in the CIL because that's all that was newly dirtied.
> 
> That means we reserve 6MB of log space for each itruncate
> transaction in flight, (directory ops are almost as heavyweight) so
> it only takes 16 transactions in flight to consume 100MB of reserve
> grant head space. If we've got a log smaller than 130MB in this
> situation, we are into tail pushing -even if the physical log is
> empty-.
> 
> IOWs, tail pushing is not triggered by how much physical log space
> has been consumed - it is triggered by reservation pressure. And we
> can trivially drive the system into reservation pressure with
> concurrent workloads
> 

Yes, I understand. What I'm saying is that once we get past the
reservation head checks, there is a threshold lsn check that requires
that an intent sit in that 25% range from the on-disk tail of the log
for it to be relogged. That _might_ unnecessarily relog in certain cases
(i.e. when the active range of the log is small), but it is not always
the case and I don't think it results in per-CIL-context relogging in
practice.
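
(Concretely, the filter in the patch only relogs when both conditions
hold - i.e. roughly:

	if (XFS_LSN_CMP(dfp->dfp_intent->li_lsn, threshold_lsn) < 0 &&
	    !xfs_log_item_in_current_chkpt(dfp->dfp_intent))
		dfp->dfp_intent = xfs_trans_item_relog(dfp->dfp_intent, *tpp);

so an intent has to be both behind the push threshold and outside the
current checkpoint before we touch it.)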

> > assume sustained tail pushing conditions, a committed intent doesn't
> > trigger relog again until it falls within that threshold in the on-disk
> > log. The current CIL (nonblocking) size threshold is half that at 12.5%.
> 
> Or smaller. For small logs it is 12.5% of the log size. For larger
> logs, it caps at 32MB (8 * (256k << 4)).
> 
> > So relative to a given rolling transaction processing an intent chain,
> > there's a decent number of full CIL checkpoints that can occur before
> > any of those intents need to relog again (if at all), regardless of log
> > size.
> 
> Your logic is inverted because reservation pressure does not reflect
> CIL pressure.  Log space consumption and reservation pressure were
> decoupled by the in-memory relogging that the CIL enables, hence
> given a specific intent chain, the likely probability is that it
> will not span more than a single CIL context. i.e. it will complete
> rolling entirely within a single CIL commit.
> 
> As the log gets larger and the CIL increases in size, the
> probability that an intent chain completes within a single CIL
> commit goes up (e.g. 6MB reservation vs 32MB CIL push threshold).
> The lighter the load, the more likely it is that a transaction chain
> will complete within a single CIL checkpoint, even on small logs.
> 

Sure, I agree with that. That's not a major factor for intent chains
that pin the log tail long enough to cause deadlocks.

> > Without that logic, we're changing behavior to relog the entire chain in
> > every CIL checkpoint. That's a fairly significant change in behavior
> > with less predictable breakdown conditions. Do we know how long a single
> > chain is going to be?
> 
> Yes: the typical worst case chain length is defined by the
> transaction reservation log count. That is supposed to cover > %99
> of typical log space requirements for that operation.
> 

I don't think reservation log counts define worst case conditions for
intent chains. Darrick has documented cases in the commit logs that far
exceed those values. I'm assuming those are legitimate/observed
scenarios, but could be mistaken.

> For any chain that is longer than this, we end up in the slow
> regrant path that reserves a unit of log space at a time, and these
> are the cases where the transaction must be careful about not
> pinning the tail of the log. i.e. long chains are exactly the case
> where we want to relog the intents in every new CIL context. This is
> no different to the way we relog the inode in -every transaction
> roll- whether it is modified or not so that we can guarantee it
> doesn't pin the tail of the log when we need to regrant space on
> commit....
> 

Yes, the relogging approach is the same. The difference is that in one
case we're logging a small, fixed size (inode) object repeatedly that is
already factored into the transaction reservation and in the other case,
we're logging a dynamic and non-deterministic number of (intent) objects
a non-deterministic number of times.

> > Do we know how many CPUs are concurrently
> > processing "long chain" operations?
> 
> Yes: as many as the grant head reservations will allow. That's
> always been the bound maximum transaction concurrency we can
> support.
> 
> > Do we know whether an external
> > workload is causing even more frequent checkpoints than governed by the
> > CIL size limit?
> 
> Irrelevant, because checkpoint frequency affects the relogging
> efficiency of -everything- that running transactions. i.e. the
> impact is not isolated to relogging a few small intents, it affects
> relogging of all metadata, from AG headers to btree blocks,
> directories and inodes. IOWs, the overhead of relogging intents
> will be lost in the overhead of completely relogging entire dirty
> metadata blocks that the transactions also touch, even though they
> may have only newly dirtied a few bytes in each buffer.
> 
> > The examples in the commit logs in this series refer to
> > chains of hundreds of intents with hundreds of transaction rolls.
> 
> Truncating a file with 10 million extents will roll the transaction
> 10 million times. Having a long intent chain is no different from a
> "don't pin the tail of the log" perspective than truncating a
> badly fragmented file. Both need to relog their outstanding items
> that could pin the tail of the log as frequently as they commit...
> 
> This is a basic principle of permanent transactions: they must
> guarantee forwards progress by ensuring they can't pin the tail of
> the log at any time. That is, if you hold an item that *may* pin the
> tail of the log when you need to regrant log space, you need to move
> those items to the head of the log -before- you attempt to regrant
> log space. We do not do that with intents, and conditional relogging
> doesn't provide that guarantee, either.  Relogging the item once the
> item is no longer at the head of the log provides that forwards
> progress guarantee.
> 

It sounds like there's a correctness argument buried in here. If you
think the tail lsn threshold approach is not fundamentally correct, can
you please elaborate on that? I think fundamental correctness is far
more important at this early stage than the prospective impact of
cacheline contention caused by an implementation detail.

> I don't want to slap a band-aid over the problem in the name of
> "performance". Correctness comes first, then if we find a
> performance problem we'll address that without breaking correctness.
> 
...
> > > However, when we are under heavy load the code will now be reading
> > > the grant head and log position accounting variables during every
> > > commit, hence greatly increasing the number and temporal
> > > distribution of accesses to the hotest cachelines in the log. We
> > > currently never access these cache lines during commit unless the
> > > unit reservation has run out and we have to regrant physical log
> > > space for the transaction to continue (i.e. we are into slow path
> > > commit code). IOWs, this is like causing far more than double the
> > > number of accesses to the grant head, the log tail, the
> > > last_sync_lsn, etc, all of which is unnecessary exactly when we care
> > > about minimising contention on the log space accounting variables...
> > > 
> > 
> > If we're under sustained tail pushing pressure, blocking transactions
> > have already hit this cacheline as well, FWIW.
> 
> No, tail pushing != blocking. If the AIL pushing is keeping up with
> the targets that are being set (as frequently happens with large
> logs and fast storage), then we rarely run out of log space and we
> do not block new reservations waiting for reservation space.
> 

Just pointing out that there's another prospective path for a
transaction to hit the cacheline. As you've already pointed out, it's
quite easy to reach reservation blocking conditions before the AIL is
even involved.

> > But if we're concerned about cacheline access in the commit path
> > specifically, I'm sure we could come up with any number of optimizations
> > to address that directly. E.g., we could sample the threshold
> > occasionally instead of on every roll, wait until the current
> > transaction has consumed all logres units (where we'll be hitting that
> > line anyways), shift state tracking over to xfsaild via setting a flag
> > on log items with an ->iop_relog() handler set that happen to pin the
> > tail, etc. etc. That said, I do think any such added complexity should
> > be justified with some accompanying profiling data (and I like the idea
> > of new stats counters mentioned in the separate mail).
> 
> Yes, there's lots of ways the pushing threshold could be done
> differently, but to focus on that misses the point that intents
> violate forwards progress guarantees rolling transactions are
> supposed to provide the log. I simply made the "access overhead"
> argument as a supporting point that this change is also likely to
> have unintended side effects...
> 

The argument stated was that the relog filtering was redundant logic. I
think that misinterprets the logic. If there's a correctness issue, then
I'd much prefer to focus on that..

Brian

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
>
Darrick J. Wong Sept. 29, 2020, 5:01 p.m. UTC | #9
On Tue, Sep 29, 2020 at 08:27:39AM -0400, Brian Foster wrote:
> On Tue, Sep 29, 2020 at 09:09:10AM +1000, Dave Chinner wrote:
> > On Mon, Sep 28, 2020 at 11:16:27AM -0400, Brian Foster wrote:
> > > On Mon, Sep 28, 2020 at 09:08:23AM +1000, Dave Chinner wrote:
> > > > On Tue, Sep 22, 2020 at 10:33:19PM -0700, Darrick J. Wong wrote:
> > > > > +xfs_defer_relog(
> > > > > +	struct xfs_trans		**tpp,
> > > > > +	struct list_head		*dfops)
> > > > > +{
> > > > > +	struct xfs_defer_pending	*dfp;
> > > > > +	xfs_lsn_t			threshold_lsn;
> > > > > +
> > > > > +	ASSERT((*tpp)->t_flags & XFS_TRANS_PERM_LOG_RES);
> > > > > +
> > > > > +	/*
> > > > > +	 * Figure out where we need the tail to be in order to maintain the
> > > > > +	 * minimum required free space in the log.
> > > > > +	 */
> > > > > +	threshold_lsn = xlog_grant_push_threshold((*tpp)->t_mountp->m_log, 0);
> > > > > +	if (threshold_lsn == NULLCOMMITLSN)
> > > > > +		return 0;
> > > > 
> > > > This smells of premature optimisation.
> > > > 
> > > > When we are in a tail-pushing scenario (i.e. any sort of
> > > > sustained metadata workload) this will always return true, and so we
> > > > will relog every intent that isn't in the current checkpoint every
> > > > time this is called.  Under light load, we don't care if we add a
> > > > little bit of relogging overhead as the CIL slowly flushes/pushes -
> > > > it will have neglible impact on performance because there is little
> > > > load on the journal.
> > > > 
> > > 
> > > That seems like an overly broad and not necessarily correct assumption.
> > > The threshold above is tail relative and hardcoded (with the zero param)
> > > to the default AIL pushing threshold, which is 25% of the log. If we
> > 
> > The pushing threshold is 75% of the log space has been reserved by
> > the grant head, not that there's only 25% of the log free. The
> > reservation is typically much, much larger than the change that is
> > actually written to the log, and so we can be tail pushing even when
> > the log is largely empty and the CIL hasn't reached it's threshold
> > for pushing.
> > 
> > Think about it - if we have a log with a 256kB log stripe unit,
> > a itruncate reservation is somewhere around 1.5MB in size (700kB *
> > 2), and if we have reflink/rmap enabled, it's closer to 6MB in size.
> > Yet a typical modification will not use any of the 512kB*8 space
> > reserved for the LSU (because CIL!) and might only use 10kB of new
> > log space in the CIL because that's all that was newly dirtied.
> > 
> > That means we reserve 6MB of log space for each itruncate
> > transaction in flight, (directory ops are almost as heavyweight) so
> > it only takes 16 transactions in flight to consume 100MB of reserve
> > grant head space. If we've got a log smaller than 130MB in this
> > situation, we are into tail pushing -even if the physical log is
> > empty-.
> > 
> > IOWs, tail pushing is not triggered by how much physical log space
> > has been consumed - it is triggered by reservation pressure. And we
> > can trivially drive the system into reservation pressure with
> > concurrent workloads
> > 
> 
> Yes, I understand. What I'm saying is that once we get past the
> reservation head checks, there is a threshold lsn check that requires an
> intent sit in that 25% range from the on-disk tail of the log for it to
> be relogged. That _might_ unnecessarily relog in certain cases (i.e.
> when the active range of the log is small), but it is not always the
> case and I don't think it results in per CIL context relogging in
> practice.

FWIW, I added a defer_relog stats counter and re-ran my nightly fstests
runs (which at this point consist of):

A) upstream V5 with all features enabled
B) same, but with 1k block size
C) V4 filesystem
D) V5 with DAX

With the first of the relogging patches enabled, we relog any time we
find that an intent item is not in the current checkpoint.

A) defer_relog 26237
B) defer_relog 81500
C) defer_relog 15
D) defer_relog 2660

With both enabled, we relog if the intent item is below the push
threshold and the intent item isn't in the current checkpoint.

A) defer_relog 11345
B) defer_relog 1
C) defer_relog 945
D) defer_relog 1805
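
(The counter is nothing fancy, just a bump next to the existing
tracepoint in xfs_defer_relog() plus the usual xfs_stats.h plumbing;
the field name here is just what I picked locally:

	trace_xfs_defer_relog_intent((*tpp)->t_mountp, dfp);
	XFS_STATS_INC((*tpp)->t_mountp, defer_relog);
	dfp->dfp_intent = xfs_trans_item_relog(dfp->dfp_intent, *tpp);
)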

The runtime was about the same, though oddly it has gone from 3.5h
(last week) to 6h (this week); then again, I did just rebase fstests
and... wow, there are some new tests that take forever to run.

From the ftrace data, the longest chains I see are from evil reflink
tests that create a large extent, share it, punch out every other block
of one of the copies, and then delete the other copy to stress CUI/EFI
creation.

Those chains can run into the hundreds (~400 on the (A) config) though
sadly they don't test the relogging mechanism since the removal is
single-threaded, and the log doesn't checkpoint or hit the push
thresholds.

> > > assume sustained tail pushing conditions, a committed intent doesn't
> > > trigger relog again until it falls within that threshold in the on-disk
> > > log. The current CIL (nonblocking) size threshold is half that at 12.5%.
> > 
> > Or smaller. For small logs it is 12.5% of the log size. For larger
> > logs, it caps at 32MB (8 * (256k << 4)).
> > 
> > > So relative to a given rolling transaction processing an intent chain,
> > > there's a decent number of full CIL checkpoints that can occur before
> > > any of those intents need to relog again (if at all), regardless of log
> > > size.
> > 
> > Your logic is inverted because reservation pressure does not reflect
> > CIL pressure.  Log space consumption and reservation pressure were
> > decoupled by the in-memory relogging that the CIL enables, hence
> > given a specific intent chain, the likely probability is that it
> > will not span more than a single CIL context. i.e. it will complete
> > rolling entirely within a single CIL commit.
> > 
> > As the log gets larger and the CIL increases in size, the
> > probability that an intent chain completes within a single CIL
> > commit goes up (e.g. 6MB reservation vs 32MB CIL push threshold).
> > The lighter the load, the more likely it is that a transaction chain
> > will complete within a single CIL checkpoint, even on small logs.
> > 
> 
> Sure, I agree with that. That's not a major factor for intent chains
> that pin the log tail long enough to cause deadlocks.
> 
> > > Without that logic, we're changing behavior to relog the entire chain in
> > > every CIL checkpoint. That's a fairly significant change in behavior
> > > with less predictable breakdown conditions. Do we know how long a single
> > > chain is going to be?
> > 
> > Yes: the typical worst case chain length is defined by the
> > transaction reservation log count. That is supposed to cover > %99
> > of typical log space requirements for that operation.
> > 
> 
> I don't think reservation log counts define worst case conditions for
> intent chains. Darrick has documented cases in the commit logs that far
> exceed those values. I'm assuming those are legitimate/observed
> scenarios, but could be mistaken.

They're rare, but definitely the kinds of things (see above) that
unprivileged userspace can create.

> > For any chain that is longer than this, we end up in the slow
> > regrant path that reserves a unit of log space at a time, and these
> > are the cases where the transaction must be careful about not
> > pinning the tail of the log. i.e. long chains are exactly the case
> > where we want to relog the intents in every new CIL context. This is
> > no different to the way we relog the inode in -every transaction
> > roll- whether it is modified or not so that we can guarantee it
> > doesn't pin the tail of the log when we need to regrant space on
> > commit....
> > 
> 
> Yes, the relogging approach is the same. The difference is that in one
> case we're logging a small, fixed size (inode) object repeatedly that is
> already factored into the transaction reservation and in the other case,
> we're logging a dynamic and non-deterministic number of (intent) objects
> a non-deterministic number of times.
> 
> > > Do we know how many CPUs are concurrently
> > > processing "long chain" operations?
> > 
> > Yes: as many as the grant head reservations will allow. That's
> > always been the bound maximum transaction concurrency we can
> > support.
> > 
> > > Do we know whether an external
> > > workload is causing even more frequent checkpoints than governed by the
> > > CIL size limit?
> > 
> > Irrelevant, because checkpoint frequency affects the relogging
> > efficiency of -everything- that running transactions. i.e. the
> > impact is not isolated to relogging a few small intents, it affects
> > relogging of all metadata, from AG headers to btree blocks,
> > directories and inodes. IOWs, the overhead of relogging intents
> > will be lost in the overhead of completely relogging entire dirty
> > metadata blocks that the transactions also touch, even though they
> > may have only newly dirtied a few bytes in each buffer.

/me is confused, does this refer to transactions relogging dirty
buffers because they dirtied something (aka xfs_trans_log_buf)?  Or does
the checkpointing itself force relogging into the new checkpoint?

> > 
> > > The examples in the commit logs in this series refer to
> > > chains of hundreds of intents with hundreds of transaction rolls.
> > 
> > Truncating a file with 10 million extents will roll the transaction
> > 10 million times. Having a long intent chain is no different from a
> > "don't pin the tail of the log" perspective than truncating a
> > badly fragmented file. Both need to relog their outstanding items
> > that could pin the tail of the log as frequently as they commit...
> > 
> > This is a basic principle of permanent transactions: they must
> > guarantee forwards progress by ensuring they can't pin the tail of
> > the log at any time. That is, if you hold an item that *may* pin the
> > tail of the log when you need to regrant log space, you need to move
> > those items to the head of the log -before- you attempt to regrant
> > log space. We do not do that with intents, and conditional relogging
> > doesn't provide that guarantee, either.  Relogging the item once the
> > item is no longer at the head of the log provides that forwards
> > progress guarantee.
> > 
> 
> It sounds like there's a correctness argument buried in here. If you
> think the tail lsn threshold approach is not fundamentally correct, can
> you please elaborate on that? I think fundamental correctness is far
> more important at this early stage than the prospective impact of
> cacheline contention caused by an implementation detail.

I've been wondering if this discussion could be summarized like this:

We don't want logged inode items to pin the tail on long running
operations, so we tell everyone that if they're going to keep the inode
locked across transaction rolls, they must relog the inode into the new
transaction to keep it moving forward.  This is easy to spot and easy to
think about, since we generally only operate on one inode at a time and
the core is a fixed size.

For log intent items, the same general principle applies (anything that
we want to hold onto across a roll must be relogged to avoid pinning the
tail), but a transaction can have an arbitrary number of arbitrarily
sized intent items, and the overhead of relogging (aka logging an intent
done item and a new intent item) is much higher than relogging an inode.

Therefore, we add a new rule (A) that intent items must be relogged
across checkpoints.  There's less overhead than doing it *every
transaction* but this keeps the intents moving along.  This is still a
bright line rule, albeit buried in the defer ops code, because we know
exactly when it will trigger relogging.

However, adding (B) the push threshold thing now makes it /harder/ to
think about the intent item relogging because now we sort of have to
merge two not-quite-the-same concepts (the push threshold of the actual
log space usage vs. space usage according to the reservation grants)
which just makes relogging less predictable.

From my POV, we don't currently do any relogging of deferred items, so
applying either (A) or (B) seems like a benefit because at least we
close off a theoretical deadlock vector.  (B) on its own makes me
somewhat nervous because it sounds like the sort of thing that ends in
"infrequently used code path explodes in subtle ways on $config that
nobody tests", whereas (A) at least does it regularly enough that
everyone will know its there.  OTOH, if checkpoints are small, that
could result in a lot of unnecessary relogging, hence the attraction of
(B) or (A+B).

(I don't know anything about cacheline contention: I don't have quite
enough fast hardware and perf-fu (or free time) to study that in depth.
Since we're kind of lazy about locking around the grant head, it's
probably fine to be even lazier and only read that stuff once per
go-around, and only if we find an item that wasn't logged in the current
checkpoint.  Anyone got numbers?)
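
Something like this, maybe - an untested sketch that only reads the
grant/tail state once we've actually found an intent outside the
current checkpoint:

	threshold_lsn = NULLCOMMITLSN;
	list_for_each_entry(dfp, dfops, dfp_list) {
		if (dfp->dfp_intent == NULL ||
		    xfs_log_item_in_current_chkpt(dfp->dfp_intent))
			continue;
		if (threshold_lsn == NULLCOMMITLSN) {
			threshold_lsn = xlog_grant_push_threshold(
					(*tpp)->t_mountp->m_log, 0);
			if (threshold_lsn == NULLCOMMITLSN)
				break;
		}
		if (XFS_LSN_CMP(dfp->dfp_intent->li_lsn, threshold_lsn) >= 0)
			continue;
		dfp->dfp_intent = xfs_trans_item_relog(dfp->dfp_intent, *tpp);
	}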

> > I don't want to slap a band-aid over the problem in the name of
> > "performance". Correctness comes first, then if we find a
> > performance problem we'll address that without breaking correctness.
> > 
> ...
> > > > However, when we are under heavy load the code will now be reading
> > > > the grant head and log position accounting variables during every
> > > > commit, hence greatly increasing the number and temporal
> > > > distribution of accesses to the hotest cachelines in the log. We
> > > > currently never access these cache lines during commit unless the
> > > > unit reservation has run out and we have to regrant physical log
> > > > space for the transaction to continue (i.e. we are into slow path
> > > > commit code). IOWs, this is like causing far more than double the
> > > > number of accesses to the grant head, the log tail, the
> > > > last_sync_lsn, etc, all of which is unnecessary exactly when we care
> > > > about minimising contention on the log space accounting variables...
> > > > 
> > > 
> > > If we're under sustained tail pushing pressure, blocking transactions
> > > have already hit this cacheline as well, FWIW.
> > 
> > No, tail pushing != blocking. If the AIL pushing is keeping up with
> > the targets that are being set (as frequently happens with large
> > logs and fast storage), then we rarely run out of log space and we
> > do not block new reservations waiting for reservation space.
> > 
> 
> Just pointing out that there's another prospective path for a
> transaction to hit the cacheline. As you've already pointed out, it's
> quite easy to reach reservation blocking conditions before the AIL is
> even involved.
> 
> > > But if we're concerned about cacheline access in the commit path
> > > specifically, I'm sure we could come up with any number of optimizations
> > > to address that directly. E.g., we could sample the threshold
> > > occasionally instead of on every roll, wait until the current
> > > transaction has consumed all logres units (where we'll be hitting that
> > > line anyways), shift state tracking over to xfsaild via setting a flag
> > > on log items with an ->iop_relog() handler set that happen to pin the

I'm not sure what I think about the AIL reaching out to touch intent
items that higher level code could also be accessing at the same time.
Wouldn't that require more coordination?

> > > tail, etc. etc. That said, I do think any such added complexity should
> > > be justified with some accompanying profiling data (and I like the idea
> > > of new stats counters mentioned in the separate mail).
> > 
> > Yes, there's lots of ways the pushing threshold could be done
> > differently, but to focus on that misses the point that intents
> > violate forwards progress guarantees rolling transactions are
> > supposed to provide the log. I simply made the "access overhead"
> > argument as a supporting point that this change is also likely to
> > have unintended side effects...
> > 
> 
> The argument stated was that the relog filtering was redundant logic. I
> think that misinterprets the logic. If there's a correctness issue, then
> I'd much prefer to focus on that..

I hope I didn't just stumble in and muddy things further. :)

--D

> Brian
> 
> > Cheers,
> > 
> > Dave.
> > -- 
> > Dave Chinner
> > david@fromorbit.com
> > 
>
Brian Foster Sept. 29, 2020, 6:45 p.m. UTC | #10
On Tue, Sep 29, 2020 at 10:01:13AM -0700, Darrick J. Wong wrote:
> On Tue, Sep 29, 2020 at 08:27:39AM -0400, Brian Foster wrote:
> > On Tue, Sep 29, 2020 at 09:09:10AM +1000, Dave Chinner wrote:
> > > On Mon, Sep 28, 2020 at 11:16:27AM -0400, Brian Foster wrote:
> > > > On Mon, Sep 28, 2020 at 09:08:23AM +1000, Dave Chinner wrote:
> > > > > On Tue, Sep 22, 2020 at 10:33:19PM -0700, Darrick J. Wong wrote:
> > > > > > +xfs_defer_relog(
> > > > > > +	struct xfs_trans		**tpp,
> > > > > > +	struct list_head		*dfops)
> > > > > > +{
> > > > > > +	struct xfs_defer_pending	*dfp;
> > > > > > +	xfs_lsn_t			threshold_lsn;
> > > > > > +
> > > > > > +	ASSERT((*tpp)->t_flags & XFS_TRANS_PERM_LOG_RES);
> > > > > > +
> > > > > > +	/*
> > > > > > +	 * Figure out where we need the tail to be in order to maintain the
> > > > > > +	 * minimum required free space in the log.
> > > > > > +	 */
> > > > > > +	threshold_lsn = xlog_grant_push_threshold((*tpp)->t_mountp->m_log, 0);
> > > > > > +	if (threshold_lsn == NULLCOMMITLSN)
> > > > > > +		return 0;
> > > > > 
> > > > > This smells of premature optimisation.
> > > > > 
> > > > > When we are in a tail-pushing scenario (i.e. any sort of
> > > > > sustained metadata workload) this will always return true, and so we
> > > > > will relog every intent that isn't in the current checkpoint every
> > > > > time this is called.  Under light load, we don't care if we add a
> > > > > little bit of relogging overhead as the CIL slowly flushes/pushes -
> > > > > it will have neglible impact on performance because there is little
> > > > > load on the journal.
> > > > > 
> > > > 
> > > > That seems like an overly broad and not necessarily correct assumption.
> > > > The threshold above is tail relative and hardcoded (with the zero param)
> > > > to the default AIL pushing threshold, which is 25% of the log. If we
> > > 
> > > The pushing threshold is 75% of the log space has been reserved by
> > > the grant head, not that there's only 25% of the log free. The
> > > reservation is typically much, much larger than the change that is
> > > actually written to the log, and so we can be tail pushing even when
> > > the log is largely empty and the CIL hasn't reached it's threshold
> > > for pushing.
> > > 
> > > Think about it - if we have a log with a 256kB log stripe unit,
> > > a itruncate reservation is somewhere around 1.5MB in size (700kB *
> > > 2), and if we have reflink/rmap enabled, it's closer to 6MB in size.
> > > Yet a typical modification will not use any of the 512kB*8 space
> > > reserved for the LSU (because CIL!) and might only use 10kB of new
> > > log space in the CIL because that's all that was newly dirtied.
> > > 
> > > That means we reserve 6MB of log space for each itruncate
> > > transaction in flight, (directory ops are almost as heavyweight) so
> > > it only takes 16 transactions in flight to consume 100MB of reserve
> > > grant head space. If we've got a log smaller than 130MB in this
> > > situation, we are into tail pushing -even if the physical log is
> > > empty-.
> > > 
> > > IOWs, tail pushing is not triggered by how much physical log space
> > > has been consumed - it is triggered by reservation pressure. And we
> > > can trivially drive the system into reservation pressure with
> > > concurrent workloads
> > > 
> > 
> > Yes, I understand. What I'm saying is that once we get past the
> > reservation head checks, there is a threshold lsn check that requires an
> > intent sit in that 25% range from the on-disk tail of the log for it to
> > be relogged. That _might_ unnecessarily relog in certain cases (i.e.
> > when the active range of the log is small), but it is not always the
> > case and I don't think it results in per CIL context relogging in
> > practice.
> 
> FWIW, I added a defer_relog stats counter and re-ran my nightly fstests
> runs (which at this point consist of):
> 
> A) upstream V5 with all features enabled
> B) same, but with 1k block size
> C) V4 filesystem
> D) V5 with DAX
> 
> With the first of the relogging patches enabled, we relog any time we
> find that an intent item is not in the current checkpoint.
> 
> A) defer_relog 26237
> B) defer_relog 81500
> C) defer_relog 15
> D) defer_relog 2660
> 
> With both enabled, we relog if the intent item is below the push
> threshold and the intent item isn't in the current checkpoint.
> 
> A) defer_relog 11345
> B) defer_relog 1
> C) defer_relog 945
> D) defer_relog 1805
> 

Those B and C numbers are definitely interesting. It's hard to say what
might be going on there since fstests can do so many crazy things
(including explicitly formatting filesystems with a version other than
what is specified in the config, IIRC).

> The runtime was about the same, though oddly the runtime has gone from
> 3.5h (last week) to 6h (this week) though I did just rebase fstests
> and... wow, there are some new tests that take forever to run.
> 
> From the ftrace data, the longest chains I see are from evil reflink
> tests that create a large extent, share it, punch out every other block
> of one of the copies, and then delete the other copy to stress CUI/EFI
> creation.
> 
> Those chains can run into the hundreds (~400 on the (A) config) though
> sadly they don't test the relogging mechanism since the removal is
> single-threaded, and the log doesn't checkpoint or hit the push
> thresholds.
> 

Ok, good to know. Any idea how large of a file generates that length of
chain?

Just some quick thoughts on the rest, since Dave should probably chime
in on some of the other bits..

> > > > assume sustained tail pushing conditions, a committed intent doesn't
> > > > trigger relog again until it falls within that threshold in the on-disk
> > > > log. The current CIL (nonblocking) size threshold is half that at 12.5%.
> > > 
> > > Or smaller. For small logs it is 12.5% of the log size. For larger
> > > logs, it caps at 32MB (8 * (256k << 4)).
> > > 
> > > > So relative to a given rolling transaction processing an intent chain,
> > > > there's a decent number of full CIL checkpoints that can occur before
> > > > any of those intents need to relog again (if at all), regardless of log
> > > > size.
> > > 
> > > Your logic is inverted because reservation pressure does not reflect
> > > CIL pressure.  Log space consumption and reservation pressure were
> > > decoupled by the in-memory relogging that the CIL enables, hence
> > > given a specific intent chain, the likely probability is that it
> > > will not span more than a single CIL context. i.e. it will complete
> > > rolling entirely within a single CIL commit.
> > > 
> > > As the log gets larger and the CIL increases in size, the
> > > probability that an intent chain completes within a single CIL
> > > commit goes up (e.g. 6MB reservation vs 32MB CIL push threshold).
> > > The lighter the load, the more likely it is that a transaction chain
> > > will complete within a single CIL checkpoint, even on small logs.
> > > 
> > 
> > Sure, I agree with that. That's not a major factor for intent chains
> > that pin the log tail long enough to cause deadlocks.
> > 
> > > > Without that logic, we're changing behavior to relog the entire chain in
> > > > every CIL checkpoint. That's a fairly significant change in behavior
> > > > with less predictable breakdown conditions. Do we know how long a single
> > > > chain is going to be?
> > > 
> > > Yes: the typical worst case chain length is defined by the
> > > transaction reservation log count. That is supposed to cover > %99
> > > of typical log space requirements for that operation.
> > > 
> > 
> > I don't think reservation log counts define worst case conditions for
> > intent chains. Darrick has documented cases in the commit logs that far
> > exceed those values. I'm assuming those are legitimate/observed
> > scenarios, but could be mistaken.
> 
> They're rare, but definitely the kinds of things (see above) that
> unprivileged userspace can create.
> 
> > > For any chain that is longer than this, we end up in the slow
> > > regrant path that reserves a unit of log space at a time, and these
> > > are the cases where the transaction must be careful about not
> > > pinning the tail of the log. i.e. long chains are exactly the case
> > > where we want to relog the intents in every new CIL context. This is
> > > no different to the way we relog the inode in -every transaction
> > > roll- whether it is modified or not so that we can guarantee it
> > > doesn't pin the tail of the log when we need to regrant space on
> > > commit....
> > > 
> > 
> > Yes, the relogging approach is the same. The difference is that in one
> > case we're logging a small, fixed size (inode) object repeatedly that is
> > already factored into the transaction reservation and in the other case,
> > we're logging a dynamic and non-deterministic number of (intent) objects
> > a non-deterministic number of times.
> > 
> > > > Do we know how many CPUs are concurrently
> > > > processing "long chain" operations?
> > > 
> > > Yes: as many as the grant head reservations will allow. That's
> > > always been the bound maximum transaction concurrency we can
> > > support.
> > > 
> > > > Do we know whether an external
> > > > workload is causing even more frequent checkpoints than governed by the
> > > > CIL size limit?
> > > 
> > > Irrelevant, because checkpoint frequency affects the relogging
> > > efficiency of -everything- that running transactions. i.e. the
> > > impact is not isolated to relogging a few small intents, it affects
> > > relogging of all metadata, from AG headers to btree blocks,
> > > directories and inodes. IOWs, the overhead of relogging intents
> > > will be lost in the overhead of completely relogging entire dirty
> > > metadata blocks that the transactions also touch, even though they
> > > may have only newly dirtied a few bytes in each buffer.
> 
> /me is confused, does this refer to transactions relogging dirty
> buffers because they dirtied something (aka xfs_trans_log_buf)?  Or does
> the checkpointing itself force relogging into the new checkpoint?
> 
> > > 
> > > > The examples in the commit logs in this series refer to
> > > > chains of hundreds of intents with hundreds of transaction rolls.
> > > 
> > > Truncating a file with 10 million extents will roll the transaction
> > > 10 million times. Having a long intent chain is no different from a
> > > "don't pin the tail of the log" perspective than truncating a
> > > badly fragmented file. Both need to relog their outstanding items
> > > that could pin the tail of the log as frequently as they commit...
> > > 
> > > This is a basic principle of permanent transactions: they must
> > > guarantee forwards progress by ensuring they can't pin the tail of
> > > the log at any time. That is, if you hold an item that *may* pin the
> > > tail of the log when you need to regrant log space, you need to move
> > > those items to the head of the log -before- you attempt to regrant
> > > log space. We do not do that with intents, and conditional relogging
> > > doesn't provide that guarantee, either.  Relogging the item once the
> > > item is no longer at the head of the log provides that forwards
> > > progress guarantee.
> > > 
> > 
> > It sounds like there's a correctness argument buried in here. If you
> > think the tail lsn threshold approach is not fundamentally correct, can
> > you please elaborate on that? I think fundamental correctness is far
> > more important at this early stage than the prospective impact of
> > cacheline contention caused by an implementation detail.
> 
> I've been wondering if this discussion could be summarized like this:
> 
> We don't want logged inode items to pin the tail on long running
> operations, so we tell everyone that if they're going to keep the inode
> locked across transaction rolls, they must relog the inode into the new
> transaction to keep it moving forward.  This is easy to spot and easy to
> think about, since we generally only operate on one inode at a time and
> the core is a fixed size.
> 
> For log intent items, the same general principle applies (anything that
> we want to hold onto across a roll must be relogged to avoid pinning the
> tail), but a transaction can have an arbitrary number of arbitrarily
> sized intent items, and the overhead of relogging (aka logging an intent
> done item and a new intent item) is much higher than relogging an inode.
> 
> Therefore, we add a new rule (A) that intent items must be relogged
> across checkpoints.  There's less overhead than doing it *every
> transaction* but this keeps the intents moving along.  This is still a
> bright line rule, albeit buried in the defer ops code, because we know
> exactly when it will trigger relogging.
> 

That all sounds about right to me.

> However, adding (B) the push threshold thing now makes it /harder/ to
> think about the intent item relogging because now we sort of have to
> merge two not-quite-the-same concepts (the push threshold of the actual
> log space usage vs. space usage according to the reservation grants)
> which just makes relogging less predictable.
> 

I'm not sure relogging is more or less predictable either way, given the
external factors that can change behavior. The same intent chain might
relog many times if contending with 100 CPUs committing unrelated
transactions, doing fsyncs, etc., and not at all on an otherwise idle fs.
I'm also not sure how much that really matters since it's just a
characteristic of how the underlying log subsystem (delayed logging)
already works.

> From my POV, we don't currently do any relogging of deferred items, so
> applying either (A) or (B) seems like a benefit because at least we
> close off a theoretical deadlock vector.  (B) on its own makes me
> somewhat nervous because it sounds like the sort of thing that ends in
> "infrequently used code path explodes in subtle ways on $config that
> nobody tests", whereas (A) at least does it regularly enough that
> everyone will know its there.  OTOH, if checkpoints are small, that
> could result in a lot of unnecessary relogging, hence the attraction of
> (B) or (A+B).
> 

I think that last bit applies to both cases, TBH. As Dave mentioned
earlier, the common case is probably going to be that a chain is logged
entirely in a single checkpoint. I think the more important question is
whether we have sufficient tests to stress the underlying scheme.

E.g., if you guys prefer to just relog the entire chain (per
checkpoint), then I think we should at least make sure we have test
coverage for scenarios like long intent chains that result in excessive
relogging. Perhaps we could accomplish that with some creative
combination of error injection tags and variants of the existing tests
(i.e., set a tag to control relog frequency and run one or more of those
evil reflink workloads, or just combine the latter with an fsstress
workload, etc.). That would at least provide some tangible data on
questions like whether we're going to start blowing transaction
reservations or burning up CPUs processing huge intent chains.
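
Roughly what I have in mind for the tag (tag name made up, not a real
errortag today) is something that overrides the filtering in
xfs_defer_relog() so a test can force relogging on every roll:

	bool force = XFS_TEST_ERROR(false, (*tpp)->t_mountp,
				    XFS_ERRTAG_DEFER_RELOG);

	threshold_lsn = xlog_grant_push_threshold((*tpp)->t_mountp->m_log, 0);
	if (threshold_lsn == NULLCOMMITLSN && !force)
		return 0;

	list_for_each_entry(dfp, dfops, dfp_list) {
		if (dfp->dfp_intent == NULL ||
		    xfs_log_item_in_current_chkpt(dfp->dfp_intent))
			continue;
		if (!force &&
		    XFS_LSN_CMP(dfp->dfp_intent->li_lsn, threshold_lsn) >= 0)
			continue;
		dfp->dfp_intent = xfs_trans_item_relog(dfp->dfp_intent, *tpp);
	}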

> (I don't know anything about cacheline contention: I don't have quite
> enough fast hardware and perf-fu (or free time) to study that in depth.
> Since we're kind of lazy about locking around the grant head, it's
> probably fine to be even lazier and only read that stuff once per
> go-around, and only if we find an item that wasn't logged in the current
> checkpoint.  Anyone got numbers?)
> 
> > > I don't want to slap a band-aid over the problem in the name of
> > > "performance". Correctness comes first, then if we find a
> > > performance problem we'll address that without breaking correctness.
> > > 
> > ...
> > > > > However, when we are under heavy load the code will now be reading
> > > > > the grant head and log position accounting variables during every
> > > > > commit, hence greatly increasing the number and temporal
> > > > > distribution of accesses to the hotest cachelines in the log. We
> > > > > currently never access these cache lines during commit unless the
> > > > > unit reservation has run out and we have to regrant physical log
> > > > > space for the transaction to continue (i.e. we are into slow path
> > > > > commit code). IOWs, this is like causing far more than double the
> > > > > number of accesses to the grant head, the log tail, the
> > > > > last_sync_lsn, etc, all of which is unnecessary exactly when we care
> > > > > about minimising contention on the log space accounting variables...
> > > > > 
> > > > 
> > > > If we're under sustained tail pushing pressure, blocking transactions
> > > > have already hit this cacheline as well, FWIW.
> > > 
> > > No, tail pushing != blocking. If the AIL pushing is keeping up with
> > > the targets that are being set (as frequently happens with large
> > > logs and fast storage), then we rarely run out of log space and we
> > > do not block new reservations waiting for reservation space.
> > > 
> > 
> > Just pointing out that there's another prospective path for a
> > transaction to hit the cacheline. As you've already pointed out, it's
> > quite easy to reach reservation blocking conditions before the AIL is
> > even involved.
> > 
> > > > But if we're concerned about cacheline access in the commit path
> > > > specifically, I'm sure we could come up with any number of optimizations
> > > > to address that directly. E.g., we could sample the threshold
> > > > occasionally instead of on every roll, wait until the current
> > > > transaction has consumed all logres units (where we'll be hitting that
> > > > line anyways), shift state tracking over to xfsaild via setting a flag
> > > > on log items with an ->iop_relog() handler set that happen to pin the
> 
> I'm not sure what I think about the AIL reaching out to touch intent
> items that higher level code could also be accessing at the same time.
> Wouldn't that require more coordination?
> 

This already happens today. More traditional items (inodes, bufs, etc.)
mediate access through item-specific locks, but the AIL does list
processing, flag updates, etc. outside of that. I don't think anything
prevents the AIL from setting an XFS_LI_AIL_PUSHED flag or some such
that the intent owner could use to identify an intent to relog. That
said, that was just a quick/handwavy thought wrt avoiding l_tail_lsn
access and may or may not be a viable option.
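
Roughly the sort of thing I mean (completely hypothetical - the flag
name doesn't exist and the details are handwavy):

	/* xfsaild: mark relog-capable items stuck behind the push target */
	if (lip->li_ops->iop_relog &&
	    XFS_LSN_CMP(lip->li_lsn, ailp->ail_target) < 0)
		set_bit(XFS_LI_AIL_PUSHED, &lip->li_flags);

	/* defer ops side, instead of sampling the grant/tail state: */
	if (dfp->dfp_intent &&
	    test_and_clear_bit(XFS_LI_AIL_PUSHED, &dfp->dfp_intent->li_flags))
		dfp->dfp_intent = xfs_trans_item_relog(dfp->dfp_intent, *tpp);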

Brian

> > > > tail, etc. etc. That said, I do think any such added complexity should
> > > > be justified with some accompanying profiling data (and I like the idea
> > > > of new stats counters mentioned in the separate mail).
> > > 
> > > Yes, there's lots of ways the pushing threshold could be done
> > > differently, but to focus on that misses the point that intents
> > > violate forwards progress guarantees rolling transactions are
> > > supposed to provide the log. I simply made the "access overhead"
> > > argument as a supporting point that this change is also likely to
> > > have unintended side effects...
> > > 
> > 
> > The argument stated was that the relog filtering was redundant logic. I
> > think that misinterprets the logic. If there's a correctness issue, then
> > I'd much prefer to focus on that..
> 
> I hope I didn't just stumble in and muddy things further. :)
> 
> --D
> 
> > Brian
> > 
> > > Cheers,
> > > 
> > > Dave.
> > > -- 
> > > Dave Chinner
> > > david@fromorbit.com
> > > 
> > 
>
diff mbox series

Patch

diff --git a/fs/xfs/libxfs/xfs_defer.c b/fs/xfs/libxfs/xfs_defer.c
index 84a70edd0da1..c601cc2af254 100644
--- a/fs/xfs/libxfs/xfs_defer.c
+++ b/fs/xfs/libxfs/xfs_defer.c
@@ -17,6 +17,7 @@ 
 #include "xfs_inode_item.h"
 #include "xfs_trace.h"
 #include "xfs_icache.h"
+#include "xfs_log.h"
 
 /*
  * Deferred Operations in XFS
@@ -361,6 +362,52 @@  xfs_defer_cancel_list(
 	}
 }
 
+/*
+ * Prevent a log intent item from pinning the tail of the log by logging a
+ * done item to release the intent item; and then log a new intent item.
+ * The caller should provide a fresh transaction and roll it after we're done.
+ */
+static int
+xfs_defer_relog(
+	struct xfs_trans		**tpp,
+	struct list_head		*dfops)
+{
+	struct xfs_defer_pending	*dfp;
+	xfs_lsn_t			threshold_lsn;
+
+	ASSERT((*tpp)->t_flags & XFS_TRANS_PERM_LOG_RES);
+
+	/*
+	 * Figure out where we need the tail to be in order to maintain the
+	 * minimum required free space in the log.
+	 */
+	threshold_lsn = xlog_grant_push_threshold((*tpp)->t_mountp->m_log, 0);
+	if (threshold_lsn == NULLCOMMITLSN)
+		return 0;
+
+	list_for_each_entry(dfp, dfops, dfp_list) {
+		/*
+		 * If the log intent item for this deferred op is behind the
+		 * desired log tail threshold and is not a part of the current
+		 * log checkpoint, relog the intent item to keep the log tail
+		 * moving forward.  We're ok with this being racy because an
+		 * incorrect decision means we'll be a little slower at pushing
+		 * the tail.
+		 */
+		if (dfp->dfp_intent == NULL ||
+		    XFS_LSN_CMP(dfp->dfp_intent->li_lsn, threshold_lsn) >= 0 ||
+		    xfs_log_item_in_current_chkpt(dfp->dfp_intent))
+			continue;
+
+		trace_xfs_defer_relog_intent((*tpp)->t_mountp, dfp);
+		dfp->dfp_intent = xfs_trans_item_relog(dfp->dfp_intent, *tpp);
+	}
+
+	if ((*tpp)->t_flags & XFS_TRANS_DIRTY)
+		return xfs_defer_trans_roll(tpp);
+	return 0;
+}
+
 /*
  * Log an intent-done item for the first pending intent, and finish the work
  * items.
@@ -447,6 +494,11 @@  xfs_defer_finish_noroll(
 		if (error)
 			goto out_shutdown;
 
+		/* Every few rolls we relog all the intent items. */
+		error = xfs_defer_relog(tp, &dop_pending);
+		if (error)
+			goto out_shutdown;
+
 		dfp = list_first_entry(&dop_pending, struct xfs_defer_pending,
 				       dfp_list);
 		error = xfs_defer_finish_one(*tp, dfp);
diff --git a/fs/xfs/xfs_bmap_item.c b/fs/xfs/xfs_bmap_item.c
index 008436ef5bce..6317fdf4a3a0 100644
--- a/fs/xfs/xfs_bmap_item.c
+++ b/fs/xfs/xfs_bmap_item.c
@@ -532,6 +532,32 @@  xfs_bui_item_match(
 	return BUI_ITEM(lip)->bui_format.bui_id == intent_id;
 }
 
+/* Relog an intent item to push the log tail forward. */
+static struct xfs_log_item *
+xfs_bui_item_relog(
+	struct xfs_log_item		*intent,
+	struct xfs_trans		*tp)
+{
+	struct xfs_bud_log_item		*budp;
+	struct xfs_bui_log_item		*buip;
+	struct xfs_map_extent		*extp;
+	unsigned int			count;
+
+	count = BUI_ITEM(intent)->bui_format.bui_nextents;
+	extp = BUI_ITEM(intent)->bui_format.bui_extents;
+
+	tp->t_flags |= XFS_TRANS_DIRTY;
+	budp = xfs_trans_get_bud(tp, BUI_ITEM(intent));
+	set_bit(XFS_LI_DIRTY, &budp->bud_item.li_flags);
+
+	buip = xfs_bui_init(tp->t_mountp);
+	memcpy(buip->bui_format.bui_extents, extp, count * sizeof(*extp));
+	atomic_set(&buip->bui_next_extent, count);
+	xfs_trans_add_item(tp, &buip->bui_item);
+	set_bit(XFS_LI_DIRTY, &buip->bui_item.li_flags);
+	return &buip->bui_item;
+}
+
 static const struct xfs_item_ops xfs_bui_item_ops = {
 	.iop_size	= xfs_bui_item_size,
 	.iop_format	= xfs_bui_item_format,
@@ -539,6 +565,7 @@  static const struct xfs_item_ops xfs_bui_item_ops = {
 	.iop_release	= xfs_bui_item_release,
 	.iop_recover	= xfs_bui_item_recover,
 	.iop_match	= xfs_bui_item_match,
+	.iop_relog	= xfs_bui_item_relog,
 };
 
 /*
diff --git a/fs/xfs/xfs_extfree_item.c b/fs/xfs/xfs_extfree_item.c
index 337fae015bca..30a53f3913d1 100644
--- a/fs/xfs/xfs_extfree_item.c
+++ b/fs/xfs/xfs_extfree_item.c
@@ -642,6 +642,34 @@  xfs_efi_item_match(
 	return EFI_ITEM(lip)->efi_format.efi_id == intent_id;
 }
 
+/* Relog an intent item to push the log tail forward. */
+static struct xfs_log_item *
+xfs_efi_item_relog(
+	struct xfs_log_item		*intent,
+	struct xfs_trans		*tp)
+{
+	struct xfs_efd_log_item		*efdp;
+	struct xfs_efi_log_item		*efip;
+	struct xfs_extent		*extp;
+	unsigned int			count;
+
+	count = EFI_ITEM(intent)->efi_format.efi_nextents;
+	extp = EFI_ITEM(intent)->efi_format.efi_extents;
+
+	tp->t_flags |= XFS_TRANS_DIRTY;
+	efdp = xfs_trans_get_efd(tp, EFI_ITEM(intent), count);
+	efdp->efd_next_extent = count;
+	memcpy(efdp->efd_format.efd_extents, extp, count * sizeof(*extp));
+	set_bit(XFS_LI_DIRTY, &efdp->efd_item.li_flags);
+
+	efip = xfs_efi_init(tp->t_mountp, count);
+	memcpy(efip->efi_format.efi_extents, extp, count * sizeof(*extp));
+	atomic_set(&efip->efi_next_extent, count);
+	xfs_trans_add_item(tp, &efip->efi_item);
+	set_bit(XFS_LI_DIRTY, &efip->efi_item.li_flags);
+	return &efip->efi_item;
+}
+
 static const struct xfs_item_ops xfs_efi_item_ops = {
 	.iop_size	= xfs_efi_item_size,
 	.iop_format	= xfs_efi_item_format,
@@ -649,6 +677,7 @@  static const struct xfs_item_ops xfs_efi_item_ops = {
 	.iop_release	= xfs_efi_item_release,
 	.iop_recover	= xfs_efi_item_recover,
 	.iop_match	= xfs_efi_item_match,
+	.iop_relog	= xfs_efi_item_relog,
 };
 
 /*
diff --git a/fs/xfs/xfs_refcount_item.c b/fs/xfs/xfs_refcount_item.c
index c78412755b8a..cf0205fdc607 100644
--- a/fs/xfs/xfs_refcount_item.c
+++ b/fs/xfs/xfs_refcount_item.c
@@ -560,6 +560,32 @@  xfs_cui_item_match(
 	return CUI_ITEM(lip)->cui_format.cui_id == intent_id;
 }
 
+/* Relog an intent item to push the log tail forward. */
+static struct xfs_log_item *
+xfs_cui_item_relog(
+	struct xfs_log_item		*intent,
+	struct xfs_trans		*tp)
+{
+	struct xfs_cud_log_item		*cudp;
+	struct xfs_cui_log_item		*cuip;
+	struct xfs_phys_extent		*extp;
+	unsigned int			count;
+
+	count = CUI_ITEM(intent)->cui_format.cui_nextents;
+	extp = CUI_ITEM(intent)->cui_format.cui_extents;
+
+	tp->t_flags |= XFS_TRANS_DIRTY;
+	cudp = xfs_trans_get_cud(tp, CUI_ITEM(intent));
+	set_bit(XFS_LI_DIRTY, &cudp->cud_item.li_flags);
+
+	cuip = xfs_cui_init(tp->t_mountp, count);
+	memcpy(cuip->cui_format.cui_extents, extp, count * sizeof(*extp));
+	atomic_set(&cuip->cui_next_extent, count);
+	xfs_trans_add_item(tp, &cuip->cui_item);
+	set_bit(XFS_LI_DIRTY, &cuip->cui_item.li_flags);
+	return &cuip->cui_item;
+}
+
 static const struct xfs_item_ops xfs_cui_item_ops = {
 	.iop_size	= xfs_cui_item_size,
 	.iop_format	= xfs_cui_item_format,
@@ -567,6 +593,7 @@  static const struct xfs_item_ops xfs_cui_item_ops = {
 	.iop_release	= xfs_cui_item_release,
 	.iop_recover	= xfs_cui_item_recover,
 	.iop_match	= xfs_cui_item_match,
+	.iop_relog	= xfs_cui_item_relog,
 };
 
 /*
diff --git a/fs/xfs/xfs_rmap_item.c b/fs/xfs/xfs_rmap_item.c
index c1a1fd62ca74..3237243d375d 100644
--- a/fs/xfs/xfs_rmap_item.c
+++ b/fs/xfs/xfs_rmap_item.c
@@ -583,6 +583,32 @@  xfs_rui_item_match(
 	return RUI_ITEM(lip)->rui_format.rui_id == intent_id;
 }
 
+/* Relog an intent item to push the log tail forward. */
+static struct xfs_log_item *
+xfs_rui_item_relog(
+	struct xfs_log_item		*intent,
+	struct xfs_trans		*tp)
+{
+	struct xfs_rud_log_item		*rudp;
+	struct xfs_rui_log_item		*ruip;
+	struct xfs_map_extent		*extp;
+	unsigned int			count;
+
+	count = RUI_ITEM(intent)->rui_format.rui_nextents;
+	extp = RUI_ITEM(intent)->rui_format.rui_extents;
+
+	tp->t_flags |= XFS_TRANS_DIRTY;
+	rudp = xfs_trans_get_rud(tp, RUI_ITEM(intent));
+	set_bit(XFS_LI_DIRTY, &rudp->rud_item.li_flags);
+
+	ruip = xfs_rui_init(tp->t_mountp, count);
+	memcpy(ruip->rui_format.rui_extents, extp, count * sizeof(*extp));
+	atomic_set(&ruip->rui_next_extent, count);
+	xfs_trans_add_item(tp, &ruip->rui_item);
+	set_bit(XFS_LI_DIRTY, &ruip->rui_item.li_flags);
+	return &ruip->rui_item;
+}
+
 static const struct xfs_item_ops xfs_rui_item_ops = {
 	.iop_size	= xfs_rui_item_size,
 	.iop_format	= xfs_rui_item_format,
@@ -590,6 +616,7 @@  static const struct xfs_item_ops xfs_rui_item_ops = {
 	.iop_release	= xfs_rui_item_release,
 	.iop_recover	= xfs_rui_item_recover,
 	.iop_match	= xfs_rui_item_match,
+	.iop_relog	= xfs_rui_item_relog,
 };
 
 /*
diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
index a3a35a2d8ed9..362c155be525 100644
--- a/fs/xfs/xfs_trace.h
+++ b/fs/xfs/xfs_trace.h
@@ -2533,6 +2533,7 @@  DEFINE_DEFER_PENDING_EVENT(xfs_defer_create_intent);
 DEFINE_DEFER_PENDING_EVENT(xfs_defer_cancel_list);
 DEFINE_DEFER_PENDING_EVENT(xfs_defer_pending_finish);
 DEFINE_DEFER_PENDING_EVENT(xfs_defer_pending_abort);
+DEFINE_DEFER_PENDING_EVENT(xfs_defer_relog_intent);
 
 #define DEFINE_BMAP_FREE_DEFERRED_EVENT DEFINE_PHYS_EXTENT_DEFERRED_EVENT
 DEFINE_BMAP_FREE_DEFERRED_EVENT(xfs_bmap_free_defer);
diff --git a/fs/xfs/xfs_trans.h b/fs/xfs/xfs_trans.h
index 995c1513693c..e838e8327510 100644
--- a/fs/xfs/xfs_trans.h
+++ b/fs/xfs/xfs_trans.h
@@ -78,6 +78,8 @@  struct xfs_item_ops {
 	int (*iop_recover)(struct xfs_log_item *lip,
 			   struct xfs_defer_capture **dfcp);
 	bool (*iop_match)(struct xfs_log_item *item, uint64_t id);
+	struct xfs_log_item *(*iop_relog)(struct xfs_log_item *intent,
+			struct xfs_trans *tp);
 };
 
 /*
@@ -239,4 +241,12 @@  void		xfs_trans_buf_copy_type(struct xfs_buf *dst_bp,
 
 extern kmem_zone_t	*xfs_trans_zone;
 
+static inline struct xfs_log_item *
+xfs_trans_item_relog(
+	struct xfs_log_item	*lip,
+	struct xfs_trans	*tp)
+{
+	return lip->li_ops->iop_relog(lip, tp);
+}
+
 #endif	/* __XFS_TRANS_H__ */