[7/7] xfs: push the grant head when the log head moves forward

Message ID 20190904042451.9314-8-david@fromorbit.com (mailing list archive)
State Superseded
Series xfs: log race fixes and cleanups

Commit Message

Dave Chinner Sept. 4, 2019, 4:24 a.m. UTC
From: Dave Chinner <dchinner@redhat.com>

When the log fills up, we can get into the state where the
outstanding items in the CIL being committed and aggregated are
larger than the range that the reservation grant head tail pushing
will attempt to clean. This can result in the tail pushing range
being trimmed back to the log head (l_last_sync_lsn), and so the push
target may not actually move at all.

When the iclogs associated with the CIL commit finally land, the
log head moves forward, and this removes the restriction on the AIL
push target. However, if we already have transactions sleeping on
the grant head, and there's nothing in the AIL still to flush from
the current push target, then nothing will move the tail of the log
and trigger a log reservation wakeup.

Hence there is nothing that will trigger xlog_grant_push_ail()
to recalculate the AIL push target and start pushing on the AIL
again to write back the metadata objects that pin the tail of the
log and hence free up space and allow the transaction reservations
to be woken and make progress.

Hence we need to push on the grant head when we move the log head
forward, as this may be the only trigger we have that can move the
AIL push target forwards in this situation.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_log.c | 72 +++++++++++++++++++++++++++++++-----------------
 1 file changed, 47 insertions(+), 25 deletions(-)
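
For illustration only, and not part of the patch: a minimal user-space
sketch of why the push target can get stuck. It assumes a simplified LSN
(cycle in the high 32 bits, block in the low 32) and a push target that
aims a fixed number of blocks past the tail and is then clamped to
l_last_sync_lsn. The names lsn_t, make_lsn() and push_target() are
hypothetical; the real calculation lives in xlog_grant_push_ail() and
differs in detail.

```c
#include <assert.h>
#include <stdint.h>

/* Simplified LSN: cycle number in the high 32 bits, block in the low 32. */
typedef int64_t lsn_t;

static lsn_t make_lsn(uint32_t cycle, uint32_t block)
{
	return (lsn_t)(((uint64_t)cycle << 32) | block);
}

static uint32_t lsn_cycle(lsn_t lsn) { return (uint32_t)((uint64_t)lsn >> 32); }
static uint32_t lsn_block(lsn_t lsn) { return (uint32_t)lsn; }

/*
 * Hypothetical model of the push target: aim `threshold` blocks past the
 * tail (wrapping into the next cycle at the end of the log), then clamp
 * the result so it never passes the last LSN known to be on disk.
 */
static lsn_t push_target(lsn_t tail, lsn_t last_sync,
			 uint32_t log_blocks, uint32_t threshold)
{
	uint32_t cycle = lsn_cycle(tail);
	uint32_t block;
	lsn_t target;

	assert(lsn_block(tail) < log_blocks);
	assert(threshold <= log_blocks);

	block = lsn_block(tail) + threshold;
	if (block >= log_blocks) {
		block -= log_blocks;
		cycle++;
	}
	target = make_lsn(cycle, block);

	/* This clamp is what pins the target while the log head is stale. */
	if (target > last_sync)
		target = last_sync;
	return target;
}
```

In this model, with the tail at block 100 and the last synced LSN at
block 120, a 250-block push is clamped to block 120. Once the log head
moves to block 900 the same call yields block 350, but only if something
calls it again, which is the trigger this patch adds.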

Comments

Christoph Hellwig Sept. 4, 2019, 6:45 a.m. UTC | #1
> +	ASSERT(XFS_LSN_CMP(atomic64_read(&log->l_last_sync_lsn), header_lsn) <= 0);

This adds an > 80 char line.

Otherwise this looks sensible to me.
Brian Foster Sept. 4, 2019, 7:34 p.m. UTC | #2
On Wed, Sep 04, 2019 at 02:24:51PM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> When the log fills up, we can get into the state where the
> outstanding items in the CIL being committed and aggregated are
> larger than the range that the reservation grant head tail pushing
> will attempt to clean. This can result in the tail pushing range
> being trimmed back to the log head (l_last_sync_lsn), and so the push
> target may not actually move at all.
> 
> When the iclogs associated with the CIL commit finally land, the
> log head moves forward, and this removes the restriction on the AIL
> push target. However, if we already have transactions sleeping on
> the grant head, and there's nothing in the AIL still to flush from
> the current push target, then nothing will move the tail of the log
> and trigger a log reservation wakeup.
> 
> Hence there is nothing that will trigger xlog_grant_push_ail()
> to recalculate the AIL push target and start pushing on the AIL
> again to write back the metadata objects that pin the tail of the
> log and hence free up space and allow the transaction reservations
> to be woken and make progress.
> 
> Hence we need to push on the grant head when we move the log head
> forward, as this may be the only trigger we have that can move the
> AIL push target forwards in this situation.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/xfs_log.c | 72 +++++++++++++++++++++++++++++++-----------------
>  1 file changed, 47 insertions(+), 25 deletions(-)
> 
> diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
> index 6a59d71d4c60..733693e1ac9f 100644
> --- a/fs/xfs/xfs_log.c
> +++ b/fs/xfs/xfs_log.c
> @@ -2632,6 +2632,46 @@ xlog_get_lowest_lsn(
>  	return lowest_lsn;
>  }
>  
> +/*
> + * Completion of a iclog IO does not imply that a transaction has completed, as
> + * transactions can be large enough to span many iclogs. We cannot change the
> + * tail of the log half way through a transaction as this may be the only
> + * transaction in the log and moving the tail to point to the middle of it
> + * will prevent recovery from finding the start of the transaction. Hence we
> + * should only update the last_sync_lsn if this iclog contains transaction
> + * completion callbacks on it.
> + *
> + * We have to do this before we drop the icloglock to ensure we are the only one
> + * that can update it.
> + *
> + * If we are moving the last_sync_lsn forwards, we also need to ensure we kick
> + * the reservation grant head pushing. This is due to the fact that the push
> + * target is bound by the current last_sync_lsn value. Hence if we have a large
> + * amount of log space bound up in this committing transaction then the
> + * last_sync_lsn value may be the limiting factor preventing tail pushing from
> + * freeing space in the log. Hence once we've updated the last_sync_lsn we
> + * should push the AIL to ensure the push target (and hence the grant head) is
> + * no longer bound by the old log head location and can move forwards and make
> + * progress again.
> + */
> +static void
> +xlog_state_set_callback(
> +	struct xlog		*log,
> +	struct xlog_in_core	*iclog,
> +	xfs_lsn_t		header_lsn)
> +{
> +	iclog->ic_state = XLOG_STATE_CALLBACK;
> +
> +	ASSERT(XFS_LSN_CMP(atomic64_read(&log->l_last_sync_lsn), header_lsn) <= 0);
> +
> +	if (list_empty_careful(&iclog->ic_callbacks))
> +		return;
> +
> +	atomic64_set(&log->l_last_sync_lsn, header_lsn);
> +	xlog_grant_push_ail(log, 0);
> +

Nit: extra whitespace line above.

This still seems racy to me, FWIW. What if the AIL is empty (i.e. the
push is skipped)? What if xfsaild completes this push before the
associated log items land in the AIL or we race with xfsaild emptying
the AIL? Why not just reuse/update the existing grant head wake up logic
in the iclog callback itself? E.g., something like the following
(untested):

@@ -740,12 +740,10 @@ xfs_trans_ail_update_bulk(
 	if (mlip_changed) {
 		if (!XFS_FORCED_SHUTDOWN(ailp->ail_mount))
 			xlog_assign_tail_lsn_locked(ailp->ail_mount);
-		spin_unlock(&ailp->ail_lock);
-
-		xfs_log_space_wake(ailp->ail_mount);
-	} else {
-		spin_unlock(&ailp->ail_lock);
 	}
+
+	spin_unlock(&ailp->ail_lock);
+	xfs_log_space_wake(ailp->ail_mount);
}

... seems to solve the same prospective problem without being racy and
with less, simpler code. Hm?

Brian

> +}
> +
>  /*
>   * Return true if we need to stop processing, false to continue to the next
>   * iclog. The caller will need to run callbacks if the iclog is returned in the
> @@ -2644,6 +2684,7 @@ xlog_state_iodone_process_iclog(
>  	struct xlog_in_core	*completed_iclog)
>  {
>  	xfs_lsn_t		lowest_lsn;
> +	xfs_lsn_t		header_lsn;
>  
>  	/* Skip all iclogs in the ACTIVE & DIRTY states */
>  	if (iclog->ic_state & (XLOG_STATE_ACTIVE|XLOG_STATE_DIRTY))
> @@ -2681,34 +2722,15 @@ xlog_state_iodone_process_iclog(
>  	 * callbacks) see the above if.
>  	 *
>  	 * We will do one more check here to see if we have chased our tail
> -	 * around.
> +	 * around. If this is not the lowest lsn iclog, then we will leave it
> +	 * for another completion to process.
>  	 */
> +	header_lsn = be64_to_cpu(iclog->ic_header.h_lsn);
>  	lowest_lsn = xlog_get_lowest_lsn(log);
> -	if (lowest_lsn &&
> -	    XFS_LSN_CMP(lowest_lsn, be64_to_cpu(iclog->ic_header.h_lsn)) < 0)
> -		return false; /* Leave this iclog for another thread */
> -
> -	iclog->ic_state = XLOG_STATE_CALLBACK;
> -
> -	/*
> -	 * Completion of a iclog IO does not imply that a transaction has
> -	 * completed, as transactions can be large enough to span many iclogs.
> -	 * We cannot change the tail of the log half way through a transaction
> -	 * as this may be the only transaction in the log and moving th etail to
> -	 * point to the middle of it will prevent recovery from finding the
> -	 * start of the transaction.  Hence we should only update the
> -	 * last_sync_lsn if this iclog contains transaction completion callbacks
> -	 * on it.
> -	 *
> -	 * We have to do this before we drop the icloglock to ensure we are the
> -	 * only one that can update it.
> -	 */
> -	ASSERT(XFS_LSN_CMP(atomic64_read(&log->l_last_sync_lsn),
> -			be64_to_cpu(iclog->ic_header.h_lsn)) <= 0);
> -	if (!list_empty_careful(&iclog->ic_callbacks))
> -		atomic64_set(&log->l_last_sync_lsn,
> -			be64_to_cpu(iclog->ic_header.h_lsn));
> +	if (lowest_lsn && XFS_LSN_CMP(lowest_lsn, header_lsn) < 0)
> +		return false;
>  
> +	xlog_state_set_callback(log, iclog, header_lsn);
>  	return false;
>  
>  }
> -- 
> 2.23.0.rc1
>
Dave Chinner Sept. 4, 2019, 9:49 p.m. UTC | #3
On Tue, Sep 03, 2019 at 11:45:10PM -0700, Christoph Hellwig wrote:
> > +	ASSERT(XFS_LSN_CMP(atomic64_read(&log->l_last_sync_lsn), header_lsn) <= 0);
> 
> This adds an > 80 char line.

Fixed.
Dave Chinner Sept. 4, 2019, 10:50 p.m. UTC | #4
On Wed, Sep 04, 2019 at 03:34:42PM -0400, Brian Foster wrote:
> On Wed, Sep 04, 2019 at 02:24:51PM +1000, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > +/*
> > + * Completion of a iclog IO does not imply that a transaction has completed, as
> > + * transactions can be large enough to span many iclogs. We cannot change the
> > + * tail of the log half way through a transaction as this may be the only
> > + * transaction in the log and moving the tail to point to the middle of it
> > + * will prevent recovery from finding the start of the transaction. Hence we
> > + * should only update the last_sync_lsn if this iclog contains transaction
> > + * completion callbacks on it.
> > + *
> > + * We have to do this before we drop the icloglock to ensure we are the only one
> > + * that can update it.
> > + *
> > + * If we are moving the last_sync_lsn forwards, we also need to ensure we kick
> > + * the reservation grant head pushing. This is due to the fact that the push
> > + * target is bound by the current last_sync_lsn value. Hence if we have a large
> > + * amount of log space bound up in this committing transaction then the
> > + * last_sync_lsn value may be the limiting factor preventing tail pushing from
> > + * freeing space in the log. Hence once we've updated the last_sync_lsn we
> > + * should push the AIL to ensure the push target (and hence the grant head) is
> > + * no longer bound by the old log head location and can move forwards and make
> > + * progress again.
> > + */
> > +static void
> > +xlog_state_set_callback(
> > +	struct xlog		*log,
> > +	struct xlog_in_core	*iclog,
> > +	xfs_lsn_t		header_lsn)
> > +{
> > +	iclog->ic_state = XLOG_STATE_CALLBACK;
> > +
> > +	ASSERT(XFS_LSN_CMP(atomic64_read(&log->l_last_sync_lsn), header_lsn) <= 0);
> > +
> > +	if (list_empty_careful(&iclog->ic_callbacks))
> > +		return;
> > +
> > +	atomic64_set(&log->l_last_sync_lsn, header_lsn);
> > +	xlog_grant_push_ail(log, 0);
> > +
> 
> Nit: extra whitespace line above.

Fixed.

> This still seems racy to me, FWIW. What if the AIL is empty (i.e. the
> push is skipped)?

If the AIL is empty, then it's a no-op because pushing on the AIL is
not going to make more log space become free. Besides, that's not
the problem being solved here - reservation wakeups on first insert
into the AIL are already handled by xfs_trans_ail_update_bulk() and
hence the first patch in the series. This patch is addressing the
situation where the bulk insert that occurs from the callbacks that
are about to run -does not modify the tail of the log-. i.e. the
commit moved the head but not the tail, so we have to update the AIL
push target to take into account the new log head....

i.e. the AIL is for moving the tail of the log - this code moves the
head of the log. But both affect the AIL push target (it is based on
the distance between the head and tail), so we need to update the
push target just in case this commit does not move the tail...

> What if xfsaild completes this push before the
> associated log items land in the AIL or we race with xfsaild emptying
> the AIL? Why not just reuse/update the existing grant head wake up logic
> in the iclog callback itself? E.g., something like the following
> (untested):
> 
> @@ -740,12 +740,10 @@ xfs_trans_ail_update_bulk(
>  	if (mlip_changed) {
>  		if (!XFS_FORCED_SHUTDOWN(ailp->ail_mount))
>  			xlog_assign_tail_lsn_locked(ailp->ail_mount);
> -		spin_unlock(&ailp->ail_lock);
> -
> -		xfs_log_space_wake(ailp->ail_mount);
> -	} else {
> -		spin_unlock(&ailp->ail_lock);
>  	}
> +
> +	spin_unlock(&ailp->ail_lock);
> +	xfs_log_space_wake(ailp->ail_mount);

Two things that I see straight away:

1. if the AIL is empty, the first insert does not set mlip_changed =
true and so there will be no wakeup in the scenario you are
posing. This would be easy to fix - if (!mlip || changed) - so that
a wakeup is triggered in this case.

2. if we have not moved the tail, then calling xfs_log_space_wake()
will, at best, just burn CPU. At worst, it will cause hundreds of
thousands of spurious wakeups a second because the waiting
transaction reservation will be woken continuously when there isn't
space available and there hasn't been any space made available.

So, from #1 we see that unconditional wakeups are not necessary in
the scenario you pose, and from #2 it's not a viable solution even
if it was required.

However, #1 indicates other problems if a xfs_log_space_wake() call
is necessary in this case. No reservation space and an empty AIL
implies that the CIL pins the entire log - a pending commit that
hasn't finished flushing and the current context that is
aggregating. This implies we've violated a much more important rule
of the on-disk log format: finding the head and tail of the log
requires no individual commit be larger than 50% of the log.

So if we are actually stalling on transaction reservations with an
empty AIL and an uncommitted CIL, screwing around with tail pushing
wakeups does not address the bigger problem being seen...

Cheers,

Dave.
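
For illustration only: a small user-space model of the two wakeup
policies discussed above, the unconditional wakeup from the proposed
diff and the filtered "if (!mlip || changed)" variant. It only counts
how often waiters would be woken over a sequence of AIL bulk inserts.
struct insert_event, sample_events and count_wakeups() are hypothetical
names, and this does not model the real xfs_trans_ail_update_bulk().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* One bulk insert into the AIL, reduced to the two facts that matter here. */
struct insert_event {
	bool ail_was_empty;	/* no minimum item existed before the insert */
	bool min_item_moved;	/* the item at the tail of the AIL was relogged */
};

/* Sample sequence: first insert into an empty AIL, three head-only
 * inserts, then one insert that relogs the tail item. */
static const struct insert_event sample_events[] = {
	{ true,  false },
	{ false, false },
	{ false, false },
	{ false, false },
	{ false, true  },
};

/* Count wakeups issued under either policy. */
static unsigned int
count_wakeups(const struct insert_event *ev, size_t nr, bool unconditional)
{
	unsigned int wakeups = 0;
	size_t i;

	assert(ev != NULL || nr == 0);

	for (i = 0; i < nr; i++) {
		if (unconditional || ev[i].ail_was_empty || ev[i].min_item_moved)
			wakeups++;
	}
	return wakeups;
}
```

Every insert that neither starts from an empty AIL nor moves the tail
item is a wakeup that cannot have freed any log space, so the
unconditional policy wakes waiters on each of them for nothing.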
Brian Foster Sept. 5, 2019, 4:25 p.m. UTC | #5
On Thu, Sep 05, 2019 at 08:50:56AM +1000, Dave Chinner wrote:
> On Wed, Sep 04, 2019 at 03:34:42PM -0400, Brian Foster wrote:
> > On Wed, Sep 04, 2019 at 02:24:51PM +1000, Dave Chinner wrote:
> > > From: Dave Chinner <dchinner@redhat.com>
> > > +/*
> > > + * Completion of a iclog IO does not imply that a transaction has completed, as
> > > + * transactions can be large enough to span many iclogs. We cannot change the
> > > + * tail of the log half way through a transaction as this may be the only
> > > + * transaction in the log and moving the tail to point to the middle of it
> > > + * will prevent recovery from finding the start of the transaction. Hence we
> > > + * should only update the last_sync_lsn if this iclog contains transaction
> > > + * completion callbacks on it.
> > > + *
> > > + * We have to do this before we drop the icloglock to ensure we are the only one
> > > + * that can update it.
> > > + *
> > > + * If we are moving the last_sync_lsn forwards, we also need to ensure we kick
> > > + * the reservation grant head pushing. This is due to the fact that the push
> > > + * target is bound by the current last_sync_lsn value. Hence if we have a large
> > > + * amount of log space bound up in this committing transaction then the
> > > + * last_sync_lsn value may be the limiting factor preventing tail pushing from
> > > + * freeing space in the log. Hence once we've updated the last_sync_lsn we
> > > + * should push the AIL to ensure the push target (and hence the grant head) is
> > > + * no longer bound by the old log head location and can move forwards and make
> > > + * progress again.
> > > + */
> > > +static void
> > > +xlog_state_set_callback(
> > > +	struct xlog		*log,
> > > +	struct xlog_in_core	*iclog,
> > > +	xfs_lsn_t		header_lsn)
> > > +{
> > > +	iclog->ic_state = XLOG_STATE_CALLBACK;
> > > +
> > > +	ASSERT(XFS_LSN_CMP(atomic64_read(&log->l_last_sync_lsn), header_lsn) <= 0);
> > > +
> > > +	if (list_empty_careful(&iclog->ic_callbacks))
> > > +		return;
> > > +
> > > +	atomic64_set(&log->l_last_sync_lsn, header_lsn);
> > > +	xlog_grant_push_ail(log, 0);
> > > +
> > 
> > Nit: extra whitespace line above.
> 
> Fixed.
> 
> > This still seems racy to me, FWIW. What if the AIL is empty (i.e. the
> > push is skipped)?
> 
> If the AIL is empty, then it's a no-op because pushing on the AIL is
> not going to make more log space become free. Besides, that's not
> the problem being solved here - reservation wakeups on first insert
> into the AIL are already handled by xfs_trans_ail_update_bulk() and
> hence the first patch in the series. This patch is addressing the

Nothing currently wakes up reservation waiters on first AIL insertion. I
pointed this out in the original thread along with the fact that the
push is a no-op for an empty AIL. What wasn't clear to me is whether it
matters for the problem this patch is trying to fix. It sounds like not,
but that's a separate question from whether this is a problem itself.

> situation where the bulk insert that occurs from the callbacks that
> are about to run -does not modify the tail of the log-. i.e. the
> commit moved the head but not the tail, so we have to update the AIL
> push target to take into account the new log head....
> 

Ok, I figured based on process of elimination. xfs_ail_push() ignores
the push on an empty AIL and we obviously already have wakeups on tail
updates.

> i.e. the AIL is for moving the tail of the log - this code moves the
> head of the log. But both impact on the AIL push target (it is based on
> the distance between the head and tail), so we need
> to update the push target just in case this commit does not move
> the tail...
> 
> > What if xfsaild completes this push before the
> > associated log items land in the AIL or we race with xfsaild emptying
> > the AIL? Why not just reuse/update the existing grant head wake up logic
> > in the iclog callback itself? E.g., something like the following
> > (untested):
> > 

And the raciness concerns..? AFAICT this still opens a race window where
the AIL can idle on the push target before AIL insertion.

> > @@ -740,12 +740,10 @@ xfs_trans_ail_update_bulk(
> >  	if (mlip_changed) {
> >  		if (!XFS_FORCED_SHUTDOWN(ailp->ail_mount))
> >  			xlog_assign_tail_lsn_locked(ailp->ail_mount);
> > -		spin_unlock(&ailp->ail_lock);
> > -
> > -		xfs_log_space_wake(ailp->ail_mount);
> > -	} else {
> > -		spin_unlock(&ailp->ail_lock);
> >  	}
> > +
> > +	spin_unlock(&ailp->ail_lock);
> > +	xfs_log_space_wake(ailp->ail_mount);
> 
> Two things that I see straight away:
> 
> 1. if the AIL is empty, the first insert does not set mlip_changed =
> true and so there will be no wakeup in the scenario you are
> posing. This would be easy to fix - if (!mlip || changed) - so that
> a wakeup is triggered in this case.
> 

This (again) was what I suggested originally in Chandan's thread for the
empty AIL case.

> 2. if we have not moved the tail, then calling xfs_log_space_wake()
> will, at best, just burn CPU. At worst, it will cause hundreds of
> thousands of spurious wakeups a second because the waiting
> transaction reservation will be woken continuously when there isn't
> space available and there hasn't been any space made available.
> 

Yes, I can see how that would be problematic with the diff I posted
above. It's also something that can be easily fixed. Note that I think
there's another potential side effect of that diff in terms of
amplifying pressure on the AIL because we don't know whether the waiters
were blocked because of pent up in-core reservation consumption or
simply because the tail is pinned. That said, I think both patches share
that particular quirk.

Either way, this doesn't address the raciness concern I have with this
patch. If you're wedded to this particular approach, then the simplest
fix is probably to just reorder the xlog_grant_push_ail() call properly
after processing iclog callbacks. A more appropriate fix, IMO, would be
to either export this logic to where the AIL update happens and/or
enhance the existing log space wake up logic to filter wakeups in the
scenarios where it is not necessary (i.e. no tail update &&
xa_push_target == max_lsn), but this is more subjective...

> So, from #1 we see that unconditional wakeups are not necessary in
> the scenario you pose, and from #2 it's not a viable solution even
> if it was required.
> 
> However, #1 indicates other problems if a xfs_log_space_wake() call
> is necessary in this case. No reservation space and an empty AIL
> implies that the CIL pins the entire log - a pending commit that
> hasn't finished flushing and the current context that is
> aggregating. This implies we've violated a much more important rule
> of the on-disk log format: finding the head and tail of the log
> requires no individual commit be larger than 50% of the log.
> 

I described this exact problem days ago in the original thread. There's
no need to rehash it here. FWIW, I can reproduce much worse than 50% log
consumption aggregated outside of the AIL with the current code and it
doesn't depend on a nonpreemptible kernel (though the workqueue fix
looks legit to me).

Brian

> So if we are actually stalling on transaction reservations with an
> empty AIL and an uncommitted CIL, screwing around with tail pushing
> wakeups does not address the bigger problem being seen...
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
Dave Chinner Sept. 6, 2019, 12:02 a.m. UTC | #6
On Thu, Sep 05, 2019 at 12:25:33PM -0400, Brian Foster wrote:
> On Thu, Sep 05, 2019 at 08:50:56AM +1000, Dave Chinner wrote:
> > On Wed, Sep 04, 2019 at 03:34:42PM -0400, Brian Foster wrote:
> > > On Wed, Sep 04, 2019 at 02:24:51PM +1000, Dave Chinner wrote:
> > > > From: Dave Chinner <dchinner@redhat.com>
> > > > +/*
> > > > + * Completion of a iclog IO does not imply that a transaction has completed, as
> > > > + * transactions can be large enough to span many iclogs. We cannot change the
> > > > + * tail of the log half way through a transaction as this may be the only
> > > > + * transaction in the log and moving the tail to point to the middle of it
> > > > + * will prevent recovery from finding the start of the transaction. Hence we
> > > > + * should only update the last_sync_lsn if this iclog contains transaction
> > > > + * completion callbacks on it.
> > > > + *
> > > > + * We have to do this before we drop the icloglock to ensure we are the only one
> > > > + * that can update it.
> > > > + *
> > > > + * If we are moving the last_sync_lsn forwards, we also need to ensure we kick
> > > > + * the reservation grant head pushing. This is due to the fact that the push
> > > > + * target is bound by the current last_sync_lsn value. Hence if we have a large
> > > > + * amount of log space bound up in this committing transaction then the
> > > > + * last_sync_lsn value may be the limiting factor preventing tail pushing from
> > > > + * freeing space in the log. Hence once we've updated the last_sync_lsn we
> > > > + * should push the AIL to ensure the push target (and hence the grant head) is
> > > > + * no longer bound by the old log head location and can move forwards and make
> > > > + * progress again.
> > > > + */
> > > > +static void
> > > > +xlog_state_set_callback(
> > > > +	struct xlog		*log,
> > > > +	struct xlog_in_core	*iclog,
> > > > +	xfs_lsn_t		header_lsn)
> > > > +{
> > > > +	iclog->ic_state = XLOG_STATE_CALLBACK;
> > > > +
> > > > +	ASSERT(XFS_LSN_CMP(atomic64_read(&log->l_last_sync_lsn), header_lsn) <= 0);
> > > > +
> > > > +	if (list_empty_careful(&iclog->ic_callbacks))
> > > > +		return;
> > > > +
> > > > +	atomic64_set(&log->l_last_sync_lsn, header_lsn);
> > > > +	xlog_grant_push_ail(log, 0);
> > > > +
> > > 
> > > Nit: extra whitespace line above.
> > 
> > Fixed.
> > 
> > > This still seems racy to me, FWIW. What if the AIL is empty (i.e. the
> > > push is skipped)?
> > 
> > If the AIL is empty, then it's a no-op because pushing on the AIL is
> > not going to make more log space become free. Besides, that's not
> > the problem being solved here - reservation wakeups on first insert
> > into the AIL are already handled by xfs_trans_ail_update_bulk() and
> > hence the first patch in the series. This patch is addressing the
> 
> Nothing currently wakes up reservation waiters on first AIL insertion.

Nor should it be necessary - it's the removal from the AIL that
frees up log space, not insertion. The update operation is a
remove followed by an insert - the remove part of that operation is
what may free up log space, not the insert.

Hence if we need to wake the log reservation waiters on first AIL
insert to fix a bug, we haven't found the underlying problem that is
preventing log space from being freed...
>
> > i.e. the AIL is for moving the tail of the log - this code moves the
> > head of the log. But both impact on the AIL push target (it is based on
> > the distance between the head and tail), so we need
> > to update the push target just in case this commit does not move
> > the tail...
> > 
> > > What if xfsaild completes this push before the
> > > associated log items land in the AIL or we race with xfsaild emptying
> > > the AIL? Why not just reuse/update the existing grant head wake up logic
> > > in the iclog callback itself? E.g., something like the following
> > > (untested):
> > > 
> 
> And the raciness concerns..? AFAICT this still opens a race window where
> the AIL can idle on the push target before AIL insertion.

I don't know what race you see - if the AIL completes a push before
we insert new objects at the head from the current commit, then it
does not matter one bit because the items are being inserted at the
log head, not the log tail where the pushing occurs. If we are
inserting objects into the AIL within the push target window, then
there is something else very wrong going on, because when the log is
out of space the push target should be nowhere near the LSN we are
inserting new objects into the AIL at. (i.e. they should be 3/4s of
the log apart...)
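
For illustration only: a worked version of the "3/4s of the log apart"
arithmetic. It assumes the push threshold is the largest of the
requested space, a quarter of the log, and a 256-block floor, and that
a full log has its head one complete lap ahead of the tail. The helper
names are hypothetical and this is a user-space sketch, not kernel code.

```c
#include <assert.h>
#include <stdint.h>

static uint32_t max_u32(uint32_t a, uint32_t b)
{
	return a > b ? a : b;
}

/*
 * Blocks the pusher tries to keep free past the tail: the largest of the
 * request, a quarter of the log, and a 256-block floor (assumed here).
 */
static uint32_t push_threshold_blocks(uint32_t log_blocks, uint32_t need_blocks)
{
	return max_u32(max_u32(need_blocks, log_blocks / 4), 256);
}

/*
 * Distance in blocks from the push target to the log head when the log is
 * completely full, i.e. the head is log_blocks ahead of the tail.
 */
static uint32_t target_to_head_gap(uint32_t log_blocks, uint32_t need_blocks)
{
	uint32_t threshold = push_threshold_blocks(log_blocks, need_blocks);

	assert(threshold <= log_blocks);
	return log_blocks - threshold;
}
```

For a 4096-block log and a small reservation, the threshold is 1024
blocks, which leaves 3072 blocks (three quarters of the log) between the
push target and the LSN at which new items are inserted.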

> > So, from #1 we see that unconditional wakeups are not necessary in
> > the scenario you pose, and from #2 it's not a viable solution even
> > if it was required.
> > 
> > However, #1 indicates other problems if a xfs_log_space_wake() call
> > is necessary in this case. No reservation space and an empty AIL
> > implies that the CIL pins the entire log - a pending commit that
> > hasn't finished flushing and the current context that is
> > aggregating. This implies we've violated a much more important rule
> > of the on-disk log format: finding the head and tail of the log
> > requires no individual commit be larger than 50% of the log.
> > 
> 
> I described this exact problem days ago in the original thread. There's
> no need to rehash it here. FWIW, I can reproduce much worse than 50% log
> consumption aggregated outside of the AIL with the current code and it
> doesn't depend on a nonpreemptible kernel (though the workqueue fix
> looks legit to me).

I'm not rehashing anything intentionally - I'm responding to the
questions you are asking me directly in this thread. Maybe I am
going over something you've already mentioned in a previous thread,
and maybe that hasn't occurred to me because you didn't reference it
and the similarities didn't occur to me because I've spent more time
looking at the code trying to understand how this "impossible
situation" was occurring than reading mailing list discussions.

I've been certain that what we were seeing was some fundamental rule
being violated to cause this "log full, AIL empty" state, but I couldn't
work out exactly what it was. I was even questioning whether I
understood the basic operation of the log because there was no way I
could see that the CIL would not push during log recovery until the log
was full.  I said this to Darrick yesterday morning on #xfs:

[5/9/19 12:56] <dchinner> there's something bothering me about this
log head update thing and I can't put my finger on what it is....

It didn't click until Chandan's trace showed me the CPU hold-off
problem with the CIL workqueue. A couple of hours later, after I'd seen
Chandan's trace:

[5/9/19 14:26] <dchinner> oooohhhh
[5/9/19 14:27] <dchinner> this isn't a premeptible kernel, is it?

And that was the thing that I couldn't put my finger on - I couldn't
work out how a CIL push was being delayed so long on a multi-cpu
system with lots of idle CPU that we had a completely empty AIL when
we ran out of reservation space.  IOWs, I didn't know the right
question to ask until I saw the answer in front of me.

I've never seen a "CIL checkpoint too large" issue manifest in the
real world, but it's been there since delayed logging was
introduced. I knew about this issue right from the start, but it was
largely a theoretical concern because workqueue scheduling preempts
userspace and so is mostly only ever delayed by the number of
transactions in a single syscall. And for large, ongoing
transactions like a truncate, it will yield the moment we have to
pull in metadata from disk.

What's new in recent kernels is that the in-core inode unlinked
processing mechanisms have changed the way both the syscall and log
recovery mechanisms work (merged in 5.1, IIRC), and it looks like it
no longer blocks in log recovery like it used to. Given Christoph
first reported this generic/530 issue in May there's a fair
correlation indicating that the two are linked.

i.e. we changed the unlinked inode processing in a way that
the kernel can now run tens of thousands of unlink transactions
without yielding the CPU. That violated the "CIL push work will run
within a few transactions of the background push occurring"
mechanism the workqueue provided us with and that, fundamentally, is
the underlying issue here. It's not a CIL vs empty AIL vs log
reservation exhaustion race condition - that's just an observable
symptom.

To that end, I have been prototyping patches to fix this exact
problem as part of the non-blocking inode reclaim series. I've been
looking at this because the CIL pins so much memory on large logs
and I wanted to put an upper bound on it that wasn't measured in GBs
of RAM. Hence I'm planning to pull these out into a separate series
now as it's clear that non-preemptible kernels and workqueues do not
play well together and that the more we use workqueues for async
processing, the more we introduce a potential real-world vector for
CIL overruns...

Cheers,

Dave.
Brian Foster Sept. 6, 2019, 1:10 p.m. UTC | #7
On Fri, Sep 06, 2019 at 10:02:05AM +1000, Dave Chinner wrote:
> On Thu, Sep 05, 2019 at 12:25:33PM -0400, Brian Foster wrote:
> > On Thu, Sep 05, 2019 at 08:50:56AM +1000, Dave Chinner wrote:
> > > On Wed, Sep 04, 2019 at 03:34:42PM -0400, Brian Foster wrote:
> > > > On Wed, Sep 04, 2019 at 02:24:51PM +1000, Dave Chinner wrote:
> > > > > From: Dave Chinner <dchinner@redhat.com>
> > > > > +/*
> > > > > + * Completion of a iclog IO does not imply that a transaction has completed, as
> > > > > + * transactions can be large enough to span many iclogs. We cannot change the
> > > > > + * tail of the log half way through a transaction as this may be the only
> > > > > + * transaction in the log and moving the tail to point to the middle of it
> > > > > + * will prevent recovery from finding the start of the transaction. Hence we
> > > > > + * should only update the last_sync_lsn if this iclog contains transaction
> > > > > + * completion callbacks on it.
> > > > > + *
> > > > > + * We have to do this before we drop the icloglock to ensure we are the only one
> > > > > + * that can update it.
> > > > > + *
> > > > > + * If we are moving the last_sync_lsn forwards, we also need to ensure we kick
> > > > > + * the reservation grant head pushing. This is due to the fact that the push
> > > > > + * target is bound by the current last_sync_lsn value. Hence if we have a large
> > > > > + * amount of log space bound up in this committing transaction then the
> > > > > + * last_sync_lsn value may be the limiting factor preventing tail pushing from
> > > > > + * freeing space in the log. Hence once we've updated the last_sync_lsn we
> > > > > + * should push the AIL to ensure the push target (and hence the grant head) is
> > > > > + * no longer bound by the old log head location and can move forwards and make
> > > > > + * progress again.
> > > > > + */
> > > > > +static void
> > > > > +xlog_state_set_callback(
> > > > > +	struct xlog		*log,
> > > > > +	struct xlog_in_core	*iclog,
> > > > > +	xfs_lsn_t		header_lsn)
> > > > > +{
> > > > > +	iclog->ic_state = XLOG_STATE_CALLBACK;
> > > > > +
> > > > > +	ASSERT(XFS_LSN_CMP(atomic64_read(&log->l_last_sync_lsn), header_lsn) <= 0);
> > > > > +
> > > > > +	if (list_empty_careful(&iclog->ic_callbacks))
> > > > > +		return;
> > > > > +
> > > > > +	atomic64_set(&log->l_last_sync_lsn, header_lsn);
> > > > > +	xlog_grant_push_ail(log, 0);
> > > > > +
> > > > 
> > > > Nit: extra whitespace line above.
> > > 
> > > Fixed.
> > > 
> > > > This still seems racy to me, FWIW. What if the AIL is empty (i.e. the
> > > > push is skipped)?
> > > 
> > > If the AIL is empty, then it's a no-op because pushing on the AIL is
> > > not going to make more log space become free. Besides, that's not
> > > the problem being solved here - reservation wakeups on first insert
> > > into the AIL are already handled by xfs_trans_ail_update_bulk() and
> > > hence the first patch in the series. This patch is addressing the
> > 
> > Nothing currently wakes up reservation waiters on first AIL insertion.
> 
> Nor should it be necessary - it's the removal from the AIL that
> frees up log space, not insertion. The update operation is a
> remove followed by an insert - the remove part of that operation is
> what may free up log space, not the insert.
> 

Just above you wrote: "reservation wakeups on first insert into the AIL
are already handled by xfs_trans_ail_update_bulk()". My reply was just
to point out that there are no wakeups in that case.

> Hence if we need to wake the log reservation waiters on first AIL
> insert to fix a bug, we haven't found the underlying problem that is
> preventing log space from being freed...
> >
> > > i.e. the AIL is for moving the tail of the log - this code moves the
> > > head of the log. But both impact on the AIL push target (it is based on
> > > the distance between the head and tail), so we need
> > > to update the push target just in case this commit does not move
> > > the tail...
> > > 
> > > > What if xfsaild completes this push before the
> > > > associated log items land in the AIL or we race with xfsaild emptying
> > > > the AIL? Why not just reuse/update the existing grant head wake up logic
> > > > in the iclog callback itself? E.g., something like the following
> > > > (untested):
> > > > 
> > 
> > And the raciness concerns..? AFAICT this still opens a race window where
> > the AIL can idle on the push target before AIL insertion.
> 
> I don't know what race you see - if the AIL completes a push before
> we insert new objects at the head from the current commit, then it
> does not matter one bit because the items are being inserted at the
> log head, not the log tail where the pushing occurs at. If we are
> inserting objects into the AIL within the push target window, then
> there is something else very wrong going on, because when the log is
> out of space the push target should be nowhere near the LSN we are
> inserting new objects into the AIL at. (i.e. they should be 3/4s of
> the log apart...)
> 

I'm not following your reasoning. It sounds to me that you're arguing it
doesn't matter that the AIL is not populated from the current commit
because the push target should be much farther behind the head. If
that's the case, why does this patch order the AIL push after a
->l_last_sync_lsn update? That's the LSN of the most recent commit
record to hit the log and hence the new physical log head.

Side note: I think the LSN of the commit record iclog is different and
actually ahead of the LSN associated with AIL insertion. I don't
necessarily think that's a problem given how the log subsystem behaves
today, but it's another subtle/undocumented (and easily avoidable) quirk
that may not always be so benign.

> > > So, from #1 we see that unconditional wakeups are not necessary in
> > > the scenario you pose, and from #2 it's not a viable solution even
> > > if it was required.
> > > 
> > > However, #1 indicates other problems if a xfs_log_space_wake() call
> > > is necessary in this case. No reservation space and an empty AIL
> > > implies that the CIL pins the entire log - a pending commit that
> > > hasn't finished flushing and the current context that is
> > > aggregating. This implies we've violated a much more important rule
> > > of the on-disk log format: finding the head and tail of the log
> > > requires no individual commit be larger than 50% of the log.
> > > 
> > 
> > I described this exact problem days ago in the original thread. There's
> > no need to rehash it here. FWIW, I can reproduce much worse than 50% log
> > consumption aggregated outside of the AIL with the current code and it
> > doesn't depend on a nonpreemptible kernel (though the workqueue fix
> > looks legit to me).
> 
...
> 
> i.e. we changed the unlinked inode processing in a way that
> the kernel can now run tens of thousands of unlink transactions
> without yielding the CPU. That violated the "CIL push work will run
> within a few transactions of the background push occurring"
> mechanism the workqueue provided us with and that, fundamentally, is
> the underlying issue here. It's not a CIL vs empty AIL vs log
> reservation exhaustion race condition - that's just an observable
> symptom.
> 

Yes, but the point is that's not the only thing that can delay CIL push
work. Since the AIL is not populated until the commit record iclog is
written out, and background CIL pushing doesn't flush the commit record
for the associated checkpoint before it completes, and CIL pushing
itself is serialized, a stalled commit record iclog I/O is enough to
create "log full, empty AIL" conditions.

> To that end, I have been prototyping patches to fix this exact problem
> as part of the non-blocking inode reclaim series. I've been looking at
> this because the CIL pins so much memory on large logs and I wanted to
> put an upper bound on it that wasn't measured in GBs of RAM. Hence I'm
> planning to pull these out into a separate series now as it's clear
> that non-preemptible kernels and workqueues do not play well together
> and that the more we use workqueues for async processing, the more we
> introduce a potential real-world vector for CIL overruns...
> 

Yes, I think a separate series for CIL management makes sense.

Brian

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
Brian Foster Sept. 7, 2019, 3:10 p.m. UTC | #8
On Fri, Sep 06, 2019 at 09:10:14AM -0400, Brian Foster wrote:
> On Fri, Sep 06, 2019 at 10:02:05AM +1000, Dave Chinner wrote:
> > On Thu, Sep 05, 2019 at 12:25:33PM -0400, Brian Foster wrote:
> > > On Thu, Sep 05, 2019 at 08:50:56AM +1000, Dave Chinner wrote:
> > > > On Wed, Sep 04, 2019 at 03:34:42PM -0400, Brian Foster wrote:
> > > > > On Wed, Sep 04, 2019 at 02:24:51PM +1000, Dave Chinner wrote:
> > > > > > From: Dave Chinner <dchinner@redhat.com>
> > > > > > +/*
> > > > > > + * Completion of a iclog IO does not imply that a transaction has completed, as
> > > > > > + * transactions can be large enough to span many iclogs. We cannot change the
> > > > > > + * tail of the log half way through a transaction as this may be the only
> > > > > > + * transaction in the log and moving the tail to point to the middle of it
> > > > > > + * will prevent recovery from finding the start of the transaction. Hence we
> > > > > > + * should only update the last_sync_lsn if this iclog contains transaction
> > > > > > + * completion callbacks on it.
> > > > > > + *
> > > > > > + * We have to do this before we drop the icloglock to ensure we are the only one
> > > > > > + * that can update it.
> > > > > > + *
> > > > > > + * If we are moving the last_sync_lsn forwards, we also need to ensure we kick
> > > > > > + * the reservation grant head pushing. This is due to the fact that the push
> > > > > > + * target is bound by the current last_sync_lsn value. Hence if we have a large
> > > > > > + * amount of log space bound up in this committing transaction then the
> > > > > > + * last_sync_lsn value may be the limiting factor preventing tail pushing from
> > > > > > + * freeing space in the log. Hence once we've updated the last_sync_lsn we
> > > > > > + * should push the AIL to ensure the push target (and hence the grant head) is
> > > > > > + * no longer bound by the old log head location and can move forwards and make
> > > > > > + * progress again.
> > > > > > + */
> > > > > > +static void
> > > > > > +xlog_state_set_callback(
> > > > > > +	struct xlog		*log,
> > > > > > +	struct xlog_in_core	*iclog,
> > > > > > +	xfs_lsn_t		header_lsn)
> > > > > > +{
> > > > > > +	iclog->ic_state = XLOG_STATE_CALLBACK;
> > > > > > +
> > > > > > +	ASSERT(XFS_LSN_CMP(atomic64_read(&log->l_last_sync_lsn), header_lsn) <= 0);
> > > > > > +
> > > > > > +	if (list_empty_careful(&iclog->ic_callbacks))
> > > > > > +		return;
> > > > > > +
> > > > > > +	atomic64_set(&log->l_last_sync_lsn, header_lsn);
> > > > > > +	xlog_grant_push_ail(log, 0);
> > > > > > +
> > > > > 
> > > > > Nit: extra whitespace line above.
> > > > 
> > > > Fixed.
> > > > 
> > > > > This still seems racy to me, FWIW. What if the AIL is empty (i.e. the
> > > > > push is skipped)?
> > > > 
> > > > If the AIL is empty, then it's a no-op because pushing on the AIL is
> > > > not going to make more log space become free. Besides, that's not
> > > > the problem being solved here - reservation wakeups on first insert
> > > > into the AIL are already handled by xfs_trans_ail_update_bulk() and
> > > > hence the first patch in the series. This patch is addressing the
> > > 
> > > Nothing currently wakes up reservation waiters on first AIL insertion.
> > 
> > Nor should it be necessary - it's the removal from the AIL that
> > frees up log space, not insertion. The update operation is a
> > remove followed by an insert - the remove part of that operation is
> > what may free up log space, not the insert.
> > 
> 
> Just above you wrote: "reservation wakeups on first insert into the AIL
> are already handled by xfs_trans_ail_update_bulk()". My reply was just
> to point out that there are no wakeups in that case.
> 
> > Hence if we need to wake the log reservation waiters on first AIL
> > insert to fix a bug, we haven't found the underlying problem that is
> > preventing log space from being freed...
> > >
> > > > i.e. the AIL is for moving the tail of the log - this code moves the
> > > > head of the log. But both impact on the AIL push target (it is based on
> > > > the distance between the head and tail), so we need
> > > > to update the push target just in case this commit does not move
> > > > the tail...
> > > > 
> > > > > What if xfsaild completes this push before the
> > > > > associated log items land in the AIL or we race with xfsaild emptying
> > > > > the AIL? Why not just reuse/update the existing grant head wake up logic
> > > > > in the iclog callback itself? E.g., something like the following
> > > > > (untested):
> > > > > 
> > > 
> > > And the raciness concerns..? AFAICT this still opens a race window where
> > > the AIL can idle on the push target before AIL insertion.
> > 
> > I don't know what race you see - if the AIL completes a push before
> > we insert new objects at the head from the current commit, then it
> > does not matter one bit because the items are being inserted at the
> > log head, not the log tail where the pushing occurs at. If we are
> > inserting objects into the AIL within the push target window, then
> > there is something else very wrong going on, because when the log is
> > out of space the push target should be nowhere near the LSN we are
> > inserting new objects into the AIL at. (i.e. they should be 3/4s of
> > the log apart...)
> > 
> 
> I'm not following your reasoning. It sounds to me that you're arguing it
> doesn't matter that the AIL is not populated from the current commit
> because the push target should be much farther behind the head. If
> that's the case, why does this patch order the AIL push after a
> ->l_last_sync_lsn update? That's the LSN of the most recent commit
> record to hit the log and hence the new physical log head.
> 
> Side note: I think the LSN of the commit record iclog is different and
> actually ahead of the LSN associated with AIL insertion. I don't
> necessarily think that's a problem given how the log subsystem behaves
> today, but it's another subtle/undocumented (and easily avoidable) quirk
> that may not always be so benign.
> 

Just to put a finer point on this (and since this seems to be the only
way I can get you to consider nontrivial feedback to your patches):

kworker/0:1H-220   [000] ...1  3869.403829: xlog_state_do_callback: 2691: l_last_sync_lsn 0x15000021f6
kworker/0:1H-220   [000] ...1  3869.403864: xfs_ail_push: 639: ail_target 0x15000021f6
      <...>-215246 [002] ...1  3875.568561: xfsaild: 403: empty (target 0x15000021f6)
      <...>-215246 [002] ....  3875.568649: xfsaild: 589: idle
kworker/0:1H-220   [000] ...1  3889.843872: xfs_trans_ail_update_bulk: 746: inserted lsn 0x1500001bf6

This is an instance of xfsaild going idle between the time this new AIL
push sets the target based on the iclog about to be committed and AIL
insertion of the associated log items, reproduced via a bit of timing
instrumentation. Don't be distracted by the timestamps or the fact that
the LSNs do not match because the log items in the AIL end up indexed by
the start lsn of the CIL checkpoint (whereas last_sync_lsn refers to the
commit record). The point is simply that xfsaild has completed a push of
a target that hasn't been inserted yet.

A couple additional notes.. I don't see further side effects in the
variant I reproduced, I suspect because we have other wakeups that
squash this transient state created by the race, but I'm not totally
sure of that. I'm also not totally convinced this is the only vector to
this problem, FWIW. It wouldn't surprise me a ton if we had some other
scenario that could result in the same problem with actual side effects,
but this is beside the point.

Brian

> > > > So, from #1 we see that unconditional wakeups are not necessary in
> > > > the scenario you pose, and from #2 it's not a viable solution even
> > > > if it was required.
> > > > 
> > > > However, #1 indicates other problems if a xfs_log_space_wake() call
> > > > is necessary in this case. No reservation space and an empty AIL
> > > > implies that the CIL pins the entire log - a pending commit that
> > > > hasn't finished flushing and the current context that is
> > > > aggregating. This implies we've violated a much more important rule
> > > > of the on-disk log format: finding the head and tail of the log
> > > > requires no individual commit be larger than 50% of the log.
> > > > 
> > > 
> > > I described this exact problem days ago in the original thread. There's
> > > no need to rehash it here. FWIW, I can reproduce much worse than 50% log
> > > consumption aggregated outside of the AIL with the current code and it
> > > doesn't depend on a nonpreemptible kernel (though the workqueue fix
> > > looks legit to me).
> > 
> ...
> > 
> > i.e. we changed the unlinked inode processing in a way that
> > the kernel can now run tens of thousands of unlink transactions
> > without yielding the CPU. That violated the "CIL push work will run
> > within a few transactions of the background push occurring"
> > mechanism the workqueue provided us with and that, fundamentally, is
> > the underlying issue here. It's not a CIL vs empty AIL vs log
> > reservation exhaustion race condition - that's just an observable
> > symptom.
> > 
> 
> Yes, but the point is that's not the only thing that can delay CIL push
> work. Since the AIL is not populated until the commit record iclog is
> written out, and background CIL pushing doesn't flush the commit record
> for the associated checkpoint before it completes, and CIL pushing
> itself is serialized, a stalled commit record iclog I/O is enough to
> create "log full, empty AIL" conditions.
> 
> > To that end, I have been prototyping patches to fix this exact problem
> > as part of the non-blocking inode reclaim series. I've been looking at
> > this because the CIL pins so much memory on large logs and I wanted to
> > put an upper bound on it that wasn't measured in GBs of RAM. Hence I'm
> > planning to pull these out into a separate series now as it's clear
> > that non-preemptible kernels and workqueues do not play well together
> > and that the more we use workqueues for async processing, the more we
> > introduce a potential real-world vector for CIL overruns...
> > 
> 
> Yes, I think a separate series for CIL management makes sense.
> 
> Brian
> 
> > Cheers,
> > 
> > Dave.
> > -- 
> > Dave Chinner
> > david@fromorbit.com
Dave Chinner Sept. 8, 2019, 11:26 p.m. UTC | #9
On Sat, Sep 07, 2019 at 11:10:50AM -0400, Brian Foster wrote:
> On Fri, Sep 06, 2019 at 09:10:14AM -0400, Brian Foster wrote:
> > On Fri, Sep 06, 2019 at 10:02:05AM +1000, Dave Chinner wrote:
> > > > And the raciness concerns..? AFAICT this still opens a race window where
> > > > the AIL can idle on the push target before AIL insertion.
> > > 
> > > I don't know what race you see - if the AIL completes a push before
> > > we insert new objects at the head from the current commit, then it
> > > does not matter one bit because the items are being inserted at the
> > > log head, not the log tail where the pushing occurs at. If we are
> > > inserting objects into the AIL within the push target window, then
> > > there is something else very wrong going on, because when the log is
> > > out of space the push target should be nowhere near the LSN we are
> > > inserting new objects into the AIL at. (i.e. they should be 3/4s of
> > > the log apart...)
> > > 
> > 
> > I'm not following your reasoning. It sounds to me that you're arguing it
> > doesn't matter that the AIL is not populated from the current commit
> > because the push target should be much farther behind the head. If
> > that's the case, why does this patch order the AIL push after a
> > ->l_last_sync_lsn update? That's the LSN of the most recent commit
> > record to hit the log and hence the new physical log head.
> > 
> > Side note: I think the LSN of the commit record iclog is different and
> > actually ahead of the LSN associated with AIL insertion. I don't
> > necessarily think that's a problem given how the log subsystem behaves
> > today, but it's another subtle/undocumented (and easily avoidable) quirk
> > that may not always be so benign.
> > 
> 
> Just to put a finer point on this (and since this seems to be the only
> way I can get you to consider nontrivial feedback to your patches):

If I can't make head or tail of the problem you are describing,
exactly how am I supposed to respond? If I'm unable to get my point
across, I'd much prefer to spend my time on patches than on going
around in circles. I'm not interested in winning arguments. I'm not
interested in spending lots of time discussing theoretical problems
with the current set of fixes that don't exist once the root cause
we've already identified is fixed. My time is much better spent
fixing that root cause...

Keep in mind that I also have a lot of different, complex things
going on at once that all require total focus while I'm looking at
them, so it can take days for me to cycle through everything and get
back to past topics. Delay doesn't mean I haven't read your response
or taken it on board, it just means I don't have time to write a
-meaningful response- straight away.

> kworker/0:1H-220   [000] ...1  3869.403829: xlog_state_do_callback: 2691: l_last_sync_lsn 0x15000021f6
> kworker/0:1H-220   [000] ...1  3869.403864: xfs_ail_push: 639: ail_target 0x15000021f6

Which implies that the log has less than 25% of space free because
we've issued a push, and that the distance we push is bound by the
log head.

>       <...>-215246 [002] ...1  3875.568561: xfsaild: 403: empty (target 0x15000021f6)
>       <...>-215246 [002] ....  3875.568649: xfsaild: 589: idle

has an empty AIL. IOWs, you are creating the situation where the CIL
has not been allowed to run and hence has violated the >50% log size
limit on transactions. This goes away once we prevent the CIL from
doing this.

> kworker/0:1H-220   [000] ...1  3889.843872: xfs_trans_ail_update_bulk: 746: inserted lsn 0x1500001bf6

Ok, so what you see here is somewhat intentional, based on the fact
that the LSN used for items is different to the LSN used for the
commit record (start of commit vs end of commit).  We don't want to
push the currently committing items instantly to disk as that defeats
the "delayed write" behaviour the AIL uses to allow efficient
relogging to occur.

The next commit will do a similar push with the new
l_last_sync_lsn, which causes the target to point at the new
last_sync_lsn and so all the items in the AIL from the previous
commit that haven't been relogged (pinned) in the current commit
will get pushed. i.e. commit N will cause commit (N - 1) to get
pushed.

This will continue while we are in a situation where the current log
head location is limiting the push target and we are completely out
of log reservation space. Once we get to the point where the
physical head of the log is more than 25% of the log away from the
tail, the push target will stop being limited by the l_last_sync_lsn
and we'll go back to triggering push target updates via the tail of
the log moving forwards as we currently do.  IOWs, this "log head
pushing" behaviour is likely only necessary for the first 2-3 CIL
commits of a workload, then we fall back into the normal tail
pushing scenario.
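
As a rough illustration of the clamping described above, here is a
minimal userspace sketch of the push target calculation. It is only a
model: lsn_t, mk_lsn() and push_target() are hypothetical names, sizes
are in log blocks, and it assumes the usual LSN layout of cycle in the
upper 32 bits and block in the lower 32 bits. The real
xlog_grant_push_ail() has further details that are left out here.

```c
#include <stdint.h>

/* Stand-in for xfs_lsn_t: cycle in the high 32 bits, block in the low 32. */
typedef int64_t lsn_t;

static lsn_t mk_lsn(uint32_t cycle, uint32_t block)
{
	return (lsn_t)(((uint64_t)cycle << 32) | block);
}

/*
 * Hypothetical model of the AIL push target calculation.
 *
 * log_blocks:    total size of the log
 * free_blocks:   space currently free in the log
 * need_blocks:   space the waiting reservation needs
 * tail_lsn:      current tail of the log
 * last_sync_lsn: LSN of the last commit record on disk (the log head)
 *
 * Returns 0 when no push is needed, otherwise the LSN to push to.
 */
static lsn_t push_target(uint32_t log_blocks, uint32_t free_blocks,
			 uint32_t need_blocks, lsn_t tail_lsn,
			 lsn_t last_sync_lsn)
{
	/* Aim to keep at least 25% of the log free. */
	uint32_t threshold = log_blocks / 4;
	uint32_t cycle = (uint32_t)((uint64_t)tail_lsn >> 32);
	uint32_t block = (uint32_t)((uint64_t)tail_lsn & 0xffffffffu);
	lsn_t target;

	if (need_blocks > threshold)
		threshold = need_blocks;
	if (free_blocks >= threshold)
		return 0;

	/* Target is the tail plus the threshold, wrapping the log. */
	block += threshold;
	if (block >= log_blocks) {
		block -= log_blocks;
		cycle++;
	}
	target = mk_lsn(cycle, block);

	/* Never push past the log head: this is the limiting clamp. */
	if (target > last_sync_lsn)
		target = last_sync_lsn;
	return target;
}
```

With a 1000-block log, a tail at 1/100 and only 100 blocks free, the
unclamped target would be 1/350; if last_sync_lsn is still at 1/200 the
target is trimmed back to 1/200 and only moves once the log head does.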

> This is an instance of xfsaild going idle between the time this
> new AIL push sets the target based on the iclog about to be
> committed and AIL insertion of the associated log items,
> reproduced via a bit of timing instrumentation.  Don't be
> distracted by the timestamps or the fact that the LSNs do not
> match because the log items in the AIL end up indexed by the start
> lsn of the CIL checkpoint (whereas last_sync_lsn refers to the
> commit record). The point is simply that xfsaild has completed a
> push of a target that hasn't been inserted yet.

AFAICT, what you are showing requires delaying of the CIL push to the
point it violates a fundamental assumption about commit sizes, which
is why I largely think it's irrelevant.

> A couple additional notes.. I don't see further side effects in the
> variant I reproduced, I suspect because we have other wakeups that
> squash this transient state created by the race,

Right, if we do run out of log space, the log reservation tail
pushing mechanism takes over and does the right thing.

> > > i.e. we changed the unlinked inode processing in a way that
> > > the kernel can now run tens of thousands of unlink transactions
> > > without yielding the CPU. That violated the "CIL push work will run
> > > within a few transactions of the background push occurring"
> > > mechanism the workqueue provided us with and that, fundamentally, is
> > > the underlying issue here. It's not a CIL vs empty AIL vs log
> > > reservation exhaustion race condition - that's just an observable
> > > symptom.
> > > 
> > 
> > Yes, but the point is that's not the only thing that can delay CIL push
> > work. Since the AIL is not populated until the commit record iclog is
> > written out, and background CIL pushing doesn't flush the commit record
> > for the associated checkpoint before it completes, and CIL pushing
> > itself is serialized, a stalled commit record iclog I/O is enough to
> > create "log full, empty AIL" conditions.

CIL pushing is not actually serialised. Ordered, yes, serialised,
no. And stalling an iclog with a commit record should not cause the
log to fill completely - the next CIL push when it overflows should
get it moving long before the log runs out of reservation space.

Cheers,

Dave.
Brian Foster Sept. 10, 2019, 9:56 a.m. UTC | #10
On Mon, Sep 09, 2019 at 09:26:32AM +1000, Dave Chinner wrote:
> On Sat, Sep 07, 2019 at 11:10:50AM -0400, Brian Foster wrote:
> > On Fri, Sep 06, 2019 at 09:10:14AM -0400, Brian Foster wrote:
> > > On Fri, Sep 06, 2019 at 10:02:05AM +1000, Dave Chinner wrote:
> > > > > And the raciness concerns..? AFAICT this still opens a race window where
> > > > > the AIL can idle on the push target before AIL insertion.
> > > > 
> > > > I don't know what race you see - if the AIL completes a push before
> > > > we insert new objects at the head from the current commit, then it
> > > > does not matter one bit because the items are being inserted at the
> > > > log head, not the log tail where the pushing occurs at. If we are
> > > > inserting objects into the AIL within the push target window, then
> > > > there is something else very wrong going on, because when the log is
> > > > out of space the push target should be nowhere near the LSN we are
> > > > inserting new objects into the AIL at. (i.e. they should be 3/4s of
> > > > the log apart...)
> > > > 
> > > 
> > > I'm not following your reasoning. It sounds to me that you're arguing it
> > > doesn't matter that the AIL is not populated from the current commit
> > > because the push target should be much farther behind the head. If
> > > that's the case, why does this patch order the AIL push after a
> > > ->l_last_sync_lsn update? That's the LSN of the most recent commit
> > > record to hit the log and hence the new physical log head.
> > > 
> > > Side note: I think the LSN of the commit record iclog is different and
> > > actually ahead of the LSN associated with AIL insertion. I don't
> > > necessarily think that's a problem given how the log subsystem behaves
> > > today, but it's another subtle/undocumented (and easily avoidable) quirk
> > > that may not always be so benign.
> > > 
> > 
> > Just to put a finer point on this (and since this seems to be the only
> > way I can get you to consider nontrivial feedback to your patches):
> 
> If I can't make head or tail of the problem you are describing,
> exactly how am I supposed to respond? If I'm unable to get my point
> across, I'd much prefer to spend my time on patches than on going
> around in circles. I'm not interested in winning arguments. I'm not
> interested in spending lots of time discussing theoretical problems
> with the current set of fixes that don't exist once the root cause
> we've already identified is fixed. My time is much better spent
> fixing that root cause...
> 
> Keep in mind that I also have a lot of different, complex things
> going on at once that all require total focus while I'm looking at
> them, so it can take days for me to cycle through everything and get
> back to past topics. Delay doesn't mean I haven't read your response
> or taken it on board, it just means I don't have time to write a
> -meaningful response- straight away.
> 
> > kworker/0:1H-220   [000] ...1  3869.403829: xlog_state_do_callback: 2691: l_last_sync_lsn 0x15000021f6
> > kworker/0:1H-220   [000] ...1  3869.403864: xfs_ail_push: 639: ail_target 0x15000021f6
> 
> Which implies that the log has less than 25% of space free because
> we've issued a push, and that the distance we push is bound by the
> log head.
> 
> >       <...>-215246 [002] ...1  3875.568561: xfsaild: 403: empty (target 0x15000021f6)
> >       <...>-215246 [002] ....  3875.568649: xfsaild: 589: idle
> 
> has an empty AIL. IOWs, you are creating the situation where the CIL
> has not been allowed to run and hence has violated the >50% log size
> limit on transactions. This goes away once we prevent the CIL from
> doing this.
> 
> > kworker/0:1H-220   [000] ...1  3889.843872: xfs_trans_ail_update_bulk: 746: inserted lsn 0x1500001bf6
> 
> Ok, so what you see here is somewhat intentional, based on the fact
> that the LSN used for items is different to the LSN used for the
> commit record (start of commit vs end of commit).  We don't want to
> push the currently committing items instantly to disk as that defeats
> the "delayed write" behaviour the AIL uses to allow efficient
> relogging to occur.
> 
> The next commit will do a similar push with the new
> l_last_sync_lsn, which causes the target to point at the new
> last_sync_lsn and so all the items in the AIL from the previous
> commit that haven't been relogged (pinned) in the current commit
> will get pushed. i.e. commit N will cause commit (N - 1) to get
> pushed.
> 
> This will continue while we are in a situation where the current log
> head location is limiting the push target and we are completely out
> of log reservation space. Once we get to the point where the
> physical head of the log is more than 25% of the log away from the
> tail, the push target will stop being limited by the l_last_sync_lsn
> and we'll go back to triggering push target updates via the tail of
> the log moving forwards as we currently do.  IOWs, this "log head
> pushing" behaviour is likely only necessary for the first 2-3 CIL
> commits of a workload, then we fall back into the normal tail
> pushing scenario.
> 
> > This is an instance of xfsaild going idle between the time this
> > new AIL push sets the target based on the iclog about to be
> > committed and AIL insertion of the associated log items,
> > reproduced via a bit of timing instrumentation.  Don't be
> > distracted by the timestamps or the fact that the LSNs do not
> > match because the log items in the AIL end up indexed by the start
> > lsn of the CIL checkpoint (whereas last_sync_lsn refers to the
> > commit record). The point is simply that xfsaild has completed a
> > push of a target that hasn't been inserted yet.
> 
> AFAICT, what you are showing requires delaying of the CIL push to the
> point it violates a fundamental assumption about commit sizes, which
> is why I largely think it's irrelevant.
> 

The CIL checkpoint size is an unrelated side effect of the test I
happened to use, not a fundamental cause of the problem it demonstrates.
Fixing CIL checkpoint size issues won't change anything. Here's a
different variant of the same problem with a small enough number of log
items such that background CIL pushing is not a factor:

       <...>-79670 [000] ...1 56126.015522: xfs_log_force: dev 253:4 lsn 0x0 caller xfs_log_worker+0x2f/0xf0 [xfs]
kworker/0:1H-220   [000] ...1 56126.030587: __xlog_grant_push_ail: 1596: threshold_lsn 0x1000032e4
	...
       <...>-81293 [000] ...2 56126.032647: xfs_ail_delete: dev 253:4 lip 00000000cbe82125 old lsn 1/13026 new lsn 1/13026 type XFS_LI_INODE flags IN_AIL
       <...>-81633 [000] .... 56126.053544: xfsaild: 588: idle ->ail_target 0x1000032e4
kworker/0:1H-220   [000] ...2 56127.038835: xfs_ail_insert: dev 253:4 lip 00000000a44ab1ef old lsn 0/0 new lsn 1/13028 type XFS_LI_INODE flags IN_AIL
kworker/0:1H-220   [000] ...2 56127.038911: xfs_ail_insert: dev 253:4 lip 0000000028d2061f old lsn 0/0 new lsn 1/13028 type XFS_LI_INODE flags IN_AIL
	....

This sequence starts with one log item in the AIL and some number of
items in the CIL such that a checkpoint executes from the background log
worker. The worker forces the CIL and log I/O completion issues an AIL
push that is truncated by the recently updated ->l_last_sync_lsn due to
outstanding transaction reservation and small AIL size. This push races
with completion of a previous push that empties the AIL, and the iclog
callbacks then insert log items for the current checkpoint at the LSN
target xfsaild just idled at.
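
For anyone decoding the LSNs in the traces by hand, here is a minimal
sketch of the arithmetic, assuming the usual XFS layout of cycle in the
upper 32 bits and block in the lower 32 bits. The helper names are made
up for illustration and are not the kernel macros.

```c
#include <stdint.h>

/* Cycle number: upper 32 bits of the LSN (assumed layout). */
static uint32_t lsn_cycle(uint64_t lsn)
{
	return (uint32_t)(lsn >> 32);
}

/* Block number within the log: lower 32 bits of the LSN. */
static uint32_t lsn_block(uint64_t lsn)
{
	return (uint32_t)(lsn & 0xffffffffu);
}

/* Returns <0, 0 or >0 if a is behind, equal to or ahead of b. */
static int lsn_cmp(uint64_t a, uint64_t b)
{
	if (lsn_cycle(a) != lsn_cycle(b))
		return lsn_cycle(a) < lsn_cycle(b) ? -1 : 1;
	if (lsn_block(a) != lsn_block(b))
		return lsn_block(a) < lsn_block(b) ? -1 : 1;
	return 0;
}
```

Under that assumption the threshold_lsn 0x1000032e4 decodes to 1/13028,
the same LSN as the xfs_ail_insert events, and in the earlier trace the
inserted lsn 0x1500001bf6 (21/7158) sits behind the ail_target
0x15000021f6 (21/8694).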

Brian

> > A couple additional notes.. I don't see further side effects in the
> > variant I reproduced, I suspect because we have other wakeups that
> > squash this transient state created by the race,
> 
> Right, if we do run out of log space, the log reservation tail
> pushing mechanism takes over and does the right thing.
> 
> > > > i.e. we changed the unlinked inode processing in a way that
> > > > the kernel can now run tens of thousands of unlink transactions
> > > > without yielding the CPU. That violated the "CIL push work will run
> > > > within a few transactions of the background push occurring"
> > > > mechanism the workqueue provided us with and that, fundamentally, is
> > > > the underlying issue here. It's not a CIL vs empty AIL vs log
> > > > reservation exhaustion race condition - that's just an observable
> > > > symptom.
> > > > 
> > > 
> > > Yes, but the point is that's not the only thing that can delay CIL push
> > > work. Since the AIL is not populated until the commit record iclog is
> > > written out, and background CIL pushing doesn't flush the commit record
> > > for the associated checkpoint before it completes, and CIL pushing
> > > itself is serialized, a stalled commit record iclog I/O is enough to
> > > create "log full, empty AIL" conditions.
> 
> CIL pushing is not actually serialised. Ordered, yes, serialised,
> no. And stalling an iclog with a commit record should not cause the
> log to fill completely - the next CIL push when it overflows should
> get it moving long before the log runs out of reservation space.
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
Dave Chinner Sept. 10, 2019, 11:38 p.m. UTC | #11
On Tue, Sep 10, 2019 at 05:56:28AM -0400, Brian Foster wrote:
> On Mon, Sep 09, 2019 at 09:26:32AM +1000, Dave Chinner wrote:
> > On Sat, Sep 07, 2019 at 11:10:50AM -0400, Brian Foster wrote:
> > > This is an instance of xfsaild going idle between the time this
> > > new AIL push sets the target based on the iclog about to be
> > > committed and AIL insertion of the associated log items,
> > > reproduced via a bit of timing instrumentation.  Don't be
> > > distracted by the timestamps or the fact that the LSNs do not
> > > match because the log items in the AIL end up indexed by the start
> > > lsn of the CIL checkpoint (whereas last_sync_lsn refers to the
> > > commit record). The point is simply that xfsaild has completed a
> > > push of a target that hasn't been inserted yet.
> > 
> > AFAICT, what you are showing requires delaying of the CIL push to the
> > point it violates a fundamental assumption about commit sizes, which
> > is why I largely think it's irrelevant.
> > 
> 
> The CIL checkpoint size is an unrelated side effect of the test I
> happened to use, not a fundamental cause of the problem it demonstrates.
> Fixing CIL checkpoint size issues won't change anything. Here's a
> different variant of the same problem with a small enough number of log
> items such that background CIL pushing is not a factor:
> 
>        <...>-79670 [000] ...1 56126.015522: xfs_log_force: dev 253:4 lsn 0x0 caller xfs_log_worker+0x2f/0xf0 [xfs]
> kworker/0:1H-220   [000] ...1 56126.030587: __xlog_grant_push_ail: 1596: threshold_lsn 0x1000032e4
> 	...
>        <...>-81293 [000] ...2 56126.032647: xfs_ail_delete: dev 253:4 lip 00000000cbe82125 old lsn 1/13026 new lsn 1/13026 type XFS_LI_INODE flags IN_AIL
>        <...>-81633 [000] .... 56126.053544: xfsaild: 588: idle ->ail_target 0x1000032e4
> kworker/0:1H-220   [000] ...2 56127.038835: xfs_ail_insert: dev 253:4 lip 00000000a44ab1ef old lsn 0/0 new lsn 1/13028 type XFS_LI_INODE flags IN_AIL
> kworker/0:1H-220   [000] ...2 56127.038911: xfs_ail_insert: dev 253:4 lip 0000000028d2061f old lsn 0/0 new lsn 1/13028 type XFS_LI_INODE flags IN_AIL
> 	....
>
> This sequence starts with one log item in the AIL and some number of
> items in the CIL such that a checkpoint executes from the background log
> worker. The worker forces the CIL and log I/O completion issues an AIL
> push that is truncated by the recently updated ->l_last_sync_lsn due to
> outstanding transaction reservation and small AIL size. This push races
> with completion of a previous push that empties the AIL and iclog
> callbacks insert log items for the current checkpoint at the LSN target
> xfsaild just idled at.

I'm just not seeing what the problem here is. The behaviour you are
describing has been around since day zero and doesn't require the
addition of an ail push from iclog completion to trigger.  Prior to
this series, it would be:

process 1	reservation	log completion	xfsaild
<completes metadata IO>
  xfs_ail_delete()
    mlip_changed
    xlog_assign_tail_lsn_locked()
      ail empty, sets l_last_sync = 0x1000032e2
    xfs_log_space_wake()
				xlog_state_do_callback
				  sets CALLBACK
				  sets last_sync_lsn to iclog head
				    -> 0x1000032e4
				  <drops icloglock, gets preempted>
		<wakes>
		xlog_grant_head_wait
		  free_bytes < need_bytes
		    xlog_grant_push_ail()
		      xlog_push_ail()
		        ->ail_target 0x1000032e4
		<sleeps>
						<wakes>
						sets prev target to 0x1000032e4
						sees empty AIL
						<sleeps>
				    <runs again>
				    runs callbacks
				      xfs_ail_insert(lsn = 0x1000032e4)

and now we have the AIL push thread asleep with items in it at the
push threshold.  IOWs, what you describe has always been possible,
and before the CIL was introduced this sort of thing happened quite
a bit because iclog completions freed up much less space in the log
than a CIL commit completion.

It's not a problem, however, because if we are out of transaction
reservation space we must have transactions in progress, and as long
as they make progress then the commit of each transaction will end
up calling xlog_ungrant_log_space() to return the unused portion of
the transaction reservation. That calls xfs_log_space_wake() to
allow reservation waiters to try to make progress.

If there's still not enough space reservation after the transaction
in progress has released its reservation, then it goes back to
sleep. As long as we have active transactions in progress while
there are transaction reservations waiting on reservation space,
there will be a wakeup vector for the reservation independent of
the CIL, iclogs and AIL behaviour.

[ Yes, there was a bug here, in the case xfs_log_space_wake() did
not issue a wakeup because of not enough space being available and
the push target was limited by the old log head location. i.e.
nothing ever updated the push target to reflect the new log head and
so the tail might never get moved now. That particular bug was fixed
by an earlier patch in the series, so we can ignore it here. ]

IOWs, if the AIL is empty, the CIL cannot consume more than 25% of
the log space, and we have transactions waiting on log reservation
space, then we must have enough transactions in progress to cover at
least 75% of the log space. Completion of those transactions will
wake waiters and, if necessary, push the AIL again to keep the log
tail moving appropriately. This handles the AIL empty and "insert
before target" situations you are concerned about just fine, as long
as we have a guarantee of forwards progress. Bounding the CIL size
provides that forwards progress guarantee for the CIL...
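
To put a number on the 25%/75% split: a back-of-the-envelope sketch, assuming (as the paragraph above does) that the AIL is empty, the log is fully reserved because waiters exist, and the CIL is bounded at a fixed percentage of the log. The function name and the 64MiB log size are hypothetical, chosen only to make the arithmetic concrete.

```c
#include <stdint.h>

/*
 * Illustration only: lower bound on the log space that must be held by
 * in-progress transaction reservations when the AIL pins nothing and
 * the CIL holds at most cil_pct percent of the log.
 */
static uint64_t min_active_reservation(uint64_t log_bytes, unsigned int cil_pct)
{
	uint64_t cil_max = log_bytes * cil_pct / 100;

	return log_bytes - cil_max;
}
```

For a 64MiB log with a 25% CIL bound, min_active_reservation(64ULL << 20, 25) comes out at 48MiB, which is the "at least 75%" figure. The bound only holds while those assumptions do.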

Cheers,

Dave.
Brian Foster Sept. 12, 2019, 1:46 p.m. UTC | #12
On Wed, Sep 11, 2019 at 09:38:58AM +1000, Dave Chinner wrote:
> On Tue, Sep 10, 2019 at 05:56:28AM -0400, Brian Foster wrote:
> > On Mon, Sep 09, 2019 at 09:26:32AM +1000, Dave Chinner wrote:
> > > On Sat, Sep 07, 2019 at 11:10:50AM -0400, Brian Foster wrote:
> > > > This is an instance of xfsaild going idle between the time this
> > > > new AIL push sets the target based on the iclog about to be
> > > > committed and AIL insertion of the associated log items,
> > > > reproduced via a bit of timing instrumentation.  Don't be
> > > > distracted by the timestamps or the fact that the LSNs do not
> > > > match because the log items in the AIL end up indexed by the start
> > > > lsn of the CIL checkpoint (whereas last_sync_lsn refers to the
> > > > commit record). The point is simply that xfsaild has completed a
> > > > push of a target that hasn't been inserted yet.
> > > 
> > > AFAICT, what you are showing requires delaying of the CIL push to the
> > > point it violates a fundamental assumption about commit sizes, which
> > > is why I largely think it's irrelevant.
> > > 
> > 
> > The CIL checkpoint size is an unrelated side effect of the test I
> > happened to use, not a fundamental cause of the problem it demonstrates.
> > Fixing CIL checkpoint size issues won't change anything. Here's a
> > different variant of the same problem with a small enough number of log
> > items such that background CIL pushing is not a factor:
> > 
> >        <...>-79670 [000] ...1 56126.015522: xfs_log_force: dev 253:4 lsn 0x0 caller xfs_log_worker+0x2f/0xf0 [xfs]
> > kworker/0:1H-220   [000] ...1 56126.030587: __xlog_grant_push_ail: 1596: threshold_lsn 0x1000032e4
> > 	...
> >        <...>-81293 [000] ...2 56126.032647: xfs_ail_delete: dev 253:4 lip 00000000cbe82125 old lsn 1/13026 new lsn 1/13026 type XFS_LI_INODE flags IN_AIL
> >        <...>-81633 [000] .... 56126.053544: xfsaild: 588: idle ->ail_target 0x1000032e4
> > kworker/0:1H-220   [000] ...2 56127.038835: xfs_ail_insert: dev 253:4 lip 00000000a44ab1ef old lsn 0/0 new lsn 1/13028 type XFS_LI_INODE flags IN_AIL
> > kworker/0:1H-220   [000] ...2 56127.038911: xfs_ail_insert: dev 253:4 lip 0000000028d2061f old lsn 0/0 new lsn 1/13028 type XFS_LI_INODE flags IN_AIL
> > 	....
> >
> > This sequence starts with one log item in the AIL and some number of
> > items in the CIL such that a checkpoint executes from the background log
> > worker. The worker forces the CIL and log I/O completion issues an AIL
> > push that is truncated by the recently updated ->l_last_sync_lsn due to
> > outstanding transaction reservation and small AIL size. This push races
> > with completion of a previous push that empties the AIL and iclog
> > callbacks insert log items for the current checkpoint at the LSN target
> > xfsaild just idled at.
> 
> I'm just not seeing what the problem here is. The behaviour you are
> describing has been around since day zero and doesn't require the
> addition of an ail push from iclog completion to trigger.  Prior to
> this series, it would be:
> 

A few days ago you said that if we're inserting log items before the
push target, "something is very wrong." Since this was what I was
concerned about, I attempted to manufacture the issue to demonstrate.
You suggested the first reproducer I came up with was a separate problem
(related to CIL size issues), so I came up with the one above to avoid
that distraction. Now you're telling me this has always happened and is
fine..

While I don't think this is quite accurate (more below), I do find this
reasoning somewhat amusing in that it essentially implies that this
patch itself is dubious. If this new AIL push is required to fix a real
issue, and this race is essentially manifest as implied, then this patch
can't possibly reliably fix the original problem. Anyways, that is
neither here nor there..

All of the details of this particular issue aside, I do think there's a
development process problem here. It shouldn't require an extended game
of whack-a-mole with this kind of inconsistent reasoning just to request
a trivial change to a patch (you also implied in a previous response it
was me wasting your time on this topic) that closes an obvious race and
otherwise has no negative effect. Someone is being unreasonable here and
I don't think it's me. More importantly, discussion of open issues
shouldn't be a race against the associated patch being merged. :/

> process 1	reservation	log completion	xfsaild
> <completes metadata IO>
>   xfs_ail_delete()
>     mlip_changed
>     xlog_assign_tail_lsn_locked()
>       ail empty, sets l_last_sync = 0x1000032e2
>     xfs_log_space_wake()
> 				xlog_state_do_callback
> 				  sets CALLBACK
> 				  sets last_sync_lsn to iclog head
> 				    -> 0x1000032e4
> 				  <drops icloglock, gets preempted>
> 		<wakes>
> 		xlog_grant_head_wait
> 		  free_bytes < need_bytes
> 		    xlog_grant_push_ail()
> 		      xlog_push_ail()
> 		        ->ail_target 0x1000032e4
> 		<sleeps>
> 						<wakes>
> 						sets prev target to 0x1000032e4
> 						sees empty AIL
> 						<sleeps>
> 				    <runs again>
> 				    runs callbacks
> 				      xfs_ail_insert(lsn = 0x1000032e4)
> 
> and now we have the AIL push thread asleep with items in it at the
> push threshold.  IOWs, what you describe has always been possible,
> and before the CIL was introduced this sort of thing happened quite
> a bit because iclog completions freed up much less space in the log
> than a CIL commit completion.
> 

I was suspicious that this could occur prior to this change but I hadn't
confirmed. The scenario documented above cannot occur because a push on
an empty AIL has no effect. The target doesn't move and the task isn't
woken. That said, I still suspect the race can occur with the current
code via a race between a grant head waiter, AIL emptying and iclog completion.

This just speaks to the frequency of the problem, though. I'm not
convinced it's something that happens "quite a bit" given the nature of
the 3-way race. I also don't agree that existence of a historical
problem somehow excuses introduction of a new variant of the same problem.
Instead, if this patch exposes a historical problem that simply had no
noticeable impact to this point, we should probably look into whether it
needs fixing too.

> It's not a problem, however, because if we are out of transaction
> reservation space we must have transactions in progress, and as long
> as they make progress then the commit of each transaction will end
> up calling xlog_ungrant_log_space() to return the unused portion of
> the transaction reservation. That calls xfs_log_space_wake() to
> allow reservation waiters to try to make progress.
> 

Yes, this is why I don't see immediate side effects in the tests I've
run so far. The assumptions you're basing this off are not always true,
however. Particularly on smaller (<= 1GB) filesystems, it's relatively
easy to produce conditions where the entire reservation space is
consumed by open transactions that don't ultimately commit anything to
the log subsystem and thus generate no forward progress.

> If there's still not enough space reservation after the transaction
> in progress has released its reservation, then it goes back to
> sleep. As long as we have active transactions in progress while
> there are transaction reservations waiting on reservation space,
> there will be a wakeup vector for the reservation independent of
> the CIL, iclogs and AIL behaviour.
> 

We do have clean transaction cancel and error scenarios, existing log
deadlock vectors, increasing reliance on long running transactions via
deferred ops, scrub, etc. Also consider the fact that open transactions
consume considerably more reservation than committed transactions on
average.

I'm not saying it's likely for a real world workload to consume the
entirety of log reservation space via open transactions and then release
it without filesystem modification (and then race with log I/O and AIL
emptying), but from the perspective of proving the existence of a bug
it's really not that difficult to produce. I've not seen a real world
workload that reproduces the problems fixed by any of these patches
either, but we still fix them.

> [ Yes, there was a bug here, in the case xfs_log_space_wake() did
> not issue a wakeup because of not enough space being available and
> the push target was limited by the old log head location. i.e.
> nothing ever updated the push target to reflect the new log head and
> so the tail might never get moved now. That particular bug was fixed
> by an earlier patch in the series, so we can ignore it here. ]
> 
> IOWs, if the AIL is empty, the CIL cannot consume more than 25% of
> the log space, and we have transactions waiting on log reservation
> space, then we must have enough transactions in progress to cover at
> least 75% of the log space. Completion of those transactions will
> wake waiters and, if necessary, push the AIL again to keep the log
> tail moving appropriately. This handles the AIL empty and "insert
> before target" situations you are concerned about just fine, as long
> as we have a guarantee of forwards progress. Bounding the CIL size
> provides that forwards progress guarantee for the CIL...
> 

I think you have some tunnel vision or something going on here with
regard to the higher level architectural view of how things are supposed
to operate in a normal running/steady state vs simply what can and
cannot happen in the code. I can't really tell why/how, but the only
suggestion I can make is to perhaps separate from this high level view
of things and take a closer look at the code. This is a simple code bug,
not some grand architectural flaw. The context here is way out of whack.
The repeated unrelated and overblown architectural assertions come off
as indication of lack of any real argument to allow this race to live.
There is simply no such guarantee of forward progress in all scenarios
that produce the conditions that can cause this race.

Yet another example:

           <...>-369   [002] ...2   220.055746: xfs_ail_insert: dev 253:4 lip 00000000ddb123f2 old lsn 0/0 new lsn 1/248 type XFS_LI_INODE flags IN_AIL
           <...>-27    [003] ...1   224.753110: xfs_log_force: dev 253:4 lsn 0x0 caller xfs_log_worker+0x2f/0xf0 [xfs]
           <...>-404   [003] ...1   224.775551: __xlog_grant_push_ail: 1596: threshold_lsn 0x1000000fa
     kworker/3:1-39    [003] ...2   224.777953: xfs_ail_delete: dev 253:4 lip 00000000ddb123f2 old lsn 1/248 new lsn 1/248 type XFS_LI_INODE flags IN_AIL
    xfsaild/dm-4-1034  [000] ....   224.797919: xfsaild: 588: idle ->ail_target 0x1000000fa
    kworker/3:1H-404   [003] ...2   225.841198: xfs_ail_insert: dev 253:4 lip 000000006845aeed old lsn 0/0 new lsn 1/250 type XFS_LI_INODE flags IN_AIL
     kworker/3:1-39    [003] ...1   254.962822: xfs_log_force: dev 253:4 lsn 0x0 caller xfs_log_worker+0x2f/0xf0 [xfs]
	...
     kworker/3:2-1920  [003] ...1  3759.291275: xfs_log_force: dev 253:4 lsn 0x0 caller xfs_log_worker+0x2f/0xf0 [xfs]


# cat /sys/fs/xfs/dm-4/log/log_*lsn
1:252
1:250

This instance of the race uses the same serialization instrumentation to
control execution timing and whatnot as before (i.e. no functional
changes). First, an item is inserted into the AIL. Immediately after AIL
insertion, another transaction commits to the CIL (not shown in the
trace). The background log worker comes around a few seconds later and
forces the log/CIL. The checkpoint for this log force races with an AIL
delete and idle (same as before). AIL insertion occurs at the push
target xfsaild just idled at, but this time reservation pressure
relieves and the filesystem goes idle.

At this point, nothing occurs on the fs except for continuous background
log worker jobs. Note the timestamp difference between the first
post-race log force and the last in the trace. The log worker runs at
the default 30s interval and has run repeatedly for almost an hour while
failing to push the AIL and subsequently cover the log. To confirm the
AIL is populated, see the log head/tail LSNs reported via sysfs. This
state persists indefinitely so long as the fs is idle. This is a bug.
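
The interleaving in the trace above can be replayed in a toy single-threaded model. This is a sketch under stated assumptions, not the kernel's AIL code: the struct and function names are invented, a push on an empty AIL is treated as a no-op (as described earlier in this mail), AIL insertion is assumed not to wake the pusher, and the whole AIL is emptied in one step because the trace only ever has one item in it.

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy model of the AIL push state; all names are illustrative only. */
struct ail_model {
	uint64_t	target;		/* current push target (LSN) */
	uint64_t	min_lsn;	/* lowest LSN in the AIL, valid if nr_items > 0 */
	int		nr_items;	/* number of items in the AIL */
	bool		idle;		/* pusher has gone to sleep */
};

/* Grant head push: no effect on an empty AIL or a stale threshold. */
static void model_push(struct ail_model *a, uint64_t threshold)
{
	if (a->nr_items == 0 || threshold <= a->target)
		return;
	a->target = threshold;
	a->idle = false;		/* pusher is woken */
}

/* Writeback completion removes everything currently in the AIL. */
static void model_empty(struct ail_model *a)
{
	a->nr_items = 0;
}

/* One pusher pass: nothing at or below the target, so go idle. */
static void model_pusher_pass(struct ail_model *a)
{
	if (a->nr_items == 0 || a->min_lsn > a->target)
		a->idle = true;
}

/* iclog callback inserts an item; assumed not to wake the pusher. */
static void model_insert(struct ail_model *a, uint64_t lsn)
{
	if (a->nr_items == 0 || lsn < a->min_lsn)
		a->min_lsn = lsn;
	a->nr_items++;
}

/* Pushable work exists at or below the target, but the pusher sleeps. */
static bool model_stalled(const struct ail_model *a)
{
	return a->idle && a->nr_items > 0 && a->min_lsn <= a->target;
}

/* Replay the traced ordering; returns true if the model ends up stalled. */
static bool model_replay_trace(void)
{
	struct ail_model a = { 0 };

	model_insert(&a, 0x1000000f8ULL);	/* xfs_ail_insert 1/248 */
	model_push(&a, 0x1000000faULL);		/* threshold_lsn 1/250 */
	model_empty(&a);			/* xfs_ail_delete 1/248 */
	model_pusher_pass(&a);			/* xfsaild idles at target */
	model_insert(&a, 0x1000000faULL);	/* xfs_ail_insert 1/250 */
	return model_stalled(&a);
}
```

In this model the final state has one item at the push target and an idle pusher, and nothing further changes unless something issues a new push with a threshold beyond the current target.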

Brian

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
Darrick J. Wong Sept. 17, 2019, 4:31 a.m. UTC | #13
On Thu, Sep 12, 2019 at 09:46:06AM -0400, Brian Foster wrote:
> On Wed, Sep 11, 2019 at 09:38:58AM +1000, Dave Chinner wrote:
> > On Tue, Sep 10, 2019 at 05:56:28AM -0400, Brian Foster wrote:
> > > On Mon, Sep 09, 2019 at 09:26:32AM +1000, Dave Chinner wrote:
> > > > On Sat, Sep 07, 2019 at 11:10:50AM -0400, Brian Foster wrote:
> > > > > This is an instance of xfsaild going idle between the time this
> > > > > new AIL push sets the target based on the iclog about to be
> > > > > committed and AIL insertion of the associated log items,
> > > > > reproduced via a bit of timing instrumentation.  Don't be
> > > > > distracted by the timestamps or the fact that the LSNs do not
> > > > > match because the log items in the AIL end up indexed by the start
> > > > > lsn of the CIL checkpoint (whereas last_sync_lsn refers to the
> > > > > commit record). The point is simply that xfsaild has completed a
> > > > > push of a target that hasn't been inserted yet.
> > > > 
> > > > AFAICT, what you are showing requires delaying of the CIL push to the
> > > > point it violates a fundamental assumption about commit sizes, which
> > > > is why I largely think it's irrelevant.
> > > > 
> > > 
> > > The CIL checkpoint size is an unrelated side effect of the test I
> > > happened to use, not a fundamental cause of the problem it demonstrates.
> > > Fixing CIL checkpoint size issues won't change anything. Here's a
> > > different variant of the same problem with a small enough number of log
> > > items such that background CIL pushing is not a factor:
> > > 
> > >        <...>-79670 [000] ...1 56126.015522: xfs_log_force: dev 253:4 lsn 0x0 caller xfs_log_worker+0x2f/0xf0 [xfs]
> > > kworker/0:1H-220   [000] ...1 56126.030587: __xlog_grant_push_ail: 1596: threshold_lsn 0x1000032e4
> > > 	...
> > >        <...>-81293 [000] ...2 56126.032647: xfs_ail_delete: dev 253:4 lip 00000000cbe82125 old lsn 1/13026 new lsn 1/13026 type XFS_LI_INODE flags IN_AIL
> > >        <...>-81633 [000] .... 56126.053544: xfsaild: 588: idle ->ail_target 0x1000032e4
> > > kworker/0:1H-220   [000] ...2 56127.038835: xfs_ail_insert: dev 253:4 lip 00000000a44ab1ef old lsn 0/0 new lsn 1/13028 type XFS_LI_INODE flags IN_AIL
> > > kworker/0:1H-220   [000] ...2 56127.038911: xfs_ail_insert: dev 253:4 lip 0000000028d2061f old lsn 0/0 new lsn 1/13028 type XFS_LI_INODE flags IN_AIL
> > > 	....
> > >
> > > This sequence starts with one log item in the AIL and some number of
> > > items in the CIL such that a checkpoint executes from the background log
> > > worker. The worker forces the CIL and log I/O completion issues an AIL
> > > push that is truncated by the recently updated ->l_last_sync_lsn due to
> > > outstanding transaction reservation and small AIL size. This push races
> > > with completion of a previous push that empties the AIL and iclog
> > > callbacks insert log items for the current checkpoint at the LSN target
> > > xfsaild just idled at.
> > 
> > I'm just not seeing what the problem here is. The behaviour you are
> > describing has been around since day zero and doesn't require the
> > addition of an ail push from iclog completion to trigger.  Prior to
> > this series, it would be:
> > 
> 
> A few days ago you said that if we're inserting log items before the
> push target, "something is very wrong." Since this was what I was
> concerned about, I attempted to manufacture the issue to demonstrate.
> You suggested the first reproducer I came up with was a separate problem
> (related to CIL size issues), so I came up with the one above to avoid
> that distraction. Now you're telling me this has always happened and is
> fine..
> 
> While I don't think this is quite accurate (more below), I do find this
> reasoning somewhat amusing in that it essentially implies that this
> patch itself is dubious. If this new AIL push is required to fix a real
> issue, and this race is essentially manifest as implied, then this patch
> can't possibly reliably fix the original problem. Anyways, that is
> neither here nor there..
> 
> All of the details of this particular issue aside, I do think there's a
> development process problem here. It shouldn't require an extended game
> of whack-a-mole with this kind of inconsistent reasoning just to request
> a trivial change to a patch (you also implied in a previous response it
> was me wasting your time on this topic) that closes an obvious race and
> otherwise has no negative effect. Someone is being unreasonable here and
> I don't think it's me. More importantly, discussion of open issues
> shouldn't be a race against the associated patch being merged. :/
> 
> > process 1	reservation	log completion	xfsaild
> > <completes metadata IO>
> >   xfs_ail_delete()
> >     mlip_changed
> >     xlog_assign_tail_lsn_locked()
> >       ail empty, sets l_last_sync = 0x1000032e2
> >     xfs_log_space_wake()
> > 				xlog_state_do_callback
> > 				  sets CALLBACK
> > 				  sets last_sync_lsn to iclog head
> > 				    -> 0x1000032e4
> > 				  <drops icloglock, gets preempted>
> > 		<wakes>
> > 		xlog_grant_head_wait
> > 		  free_bytes < need_bytes
> > 		    xlog_grant_push_ail()
> > 		      xlog_push_ail()
> > 		        ->ail_target 0x1000032e4
> > 		<sleeps>
> > 						<wakes>
> > 						sets prev target to 0x1000032e4
> > 						sees empty AIL
> > 						<sleeps>
> > 				    <runs again>
> > 				    runs callbacks
> > 				      xfs_ail_insert(lsn = 0x1000032e4)
> > 
> > and now we have the AIL push thread asleep with items in it at the
> > push threshold.  IOWs, what you describe has always been possible,
> > and before the CIL was introduced this sort of thing happened quite
> > a bit because iclog completions freed up much less space in the log
> > than a CIL commit completion.
> > 
> 
> I was suspicious that this could occur prior to this change but I hadn't
> confirmed. The scenario documented above cannot occur because a push on
> an empty AIL has no effect. The target doesn't move and the task isn't
> woken. That said, I still suspect the race can occur with the current
> code via a race between a grant head waiter, AIL emptying and iclog completion.
> 
> This just speaks to the frequency of the problem, though. I'm not
> convinced it's something that happens "quite a bit" given the nature of
> the 3-way race. I also don't agree that existence of a historical
> problem somehow excuses introduction of a new variant of the same problem.
> Instead, if this patch exposes a historical problem that simply had no
> noticeable impact to this point, we should probably look into whether it
> needs fixing too.
> 
> > It's not a problem, however, because if we are out of transaction
> > reservation space we must have transactions in progress, and as long
> > as they make progress then the commit of each transaction will end
> > up calling xlog_ungrant_log_space() to return the unused portion of
> > the transaction reservation. That calls xfs_log_space_wake() to
> > allow reservation waiters to try to make progress.
> > 
> 
> Yes, this is why I don't see immediate side effects in the tests I've
> run so far. The assumptions you're basing this off are not always true,
> however. Particularly on smaller (<= 1GB) filesystems, it's relatively
> easy to produce conditions where the entire reservation space is
> consumed by open transactions that don't ultimately commit anything to
> the log subsystem and thus generate no forward progress.
> 
> > If there's still not enough space reservation after the transaction
> > in progress has released its reservation, then it goes back to
> > sleep. As long as we have active transactions in progress while
> > there are transaction reservations waiting on reservation space,
> > there will be a wakeup vector for the reservation independent of
> > the CIL, iclogs and AIL behaviour.
> > 
> 
> We do have clean transaction cancel and error scenarios, existing log
> deadlock vectors, increasing reliance on long running transactions via
> deferred ops, scrub, etc. Also consider the fact that open transactions
> consume considerably more reservation than committed transactions on
> average.
> 
> I'm not saying it's likely for a real world workload to consume the
> entirety of log reservation space via open transactions and then release
> it without filesystem modification (and then race with log I/O and AIL
> emptying), but from the perspective of proving the existence of a bug
> it's really not that difficult to produce. I've not seen a real world
> workload that reproduces the problems fixed by any of these patches
> either, but we still fix them.
> 
> > [ Yes, there was a bug here, in the case xfs_log_space_wake() did
> > not issue a wakeup because of not enough space being available and
> > the push target was limited by the old log head location. i.e.
> > nothing ever updated the push target to reflect the new log head and
> > so the tail might never get moved now. That particular bug was fixed
> > by an earlier patch in the series, so we can ignore it here. ]
> > 
> > IOWs, if the AIL is empty, the CIL cannot consume more than 25% of
> > the log space, and we have transactions waiting on log reservation
> > space, then we must have enough transactions in progress to cover at
> > least 75% of the log space. Completion of those transactions will
> > wake waiters and, if necessary, push the AIL again to keep the log
> > tail moving appropriately. This handles the AIL empty and "insert
> > before target" situations you are concerned about just fine, as long
> > as we have a guarantee of forwards progress. Bounding the CIL size
> > provides that forwards progress guarantee for the CIL...
> > 
> 
> I think you have some tunnel vision or something going on here with
> regard to the higher level architectural view of how things are supposed
> to operate in a normal running/steady state vs simply what can and
> cannot happen in the code. I can't really tell why/how, but the only
> suggestion I can make is to perhaps separate from this high level view
> of things and take a closer look at the code. This is a simple code bug,
> not some grand architectural flaw. The context here is way out of whack.
> The repeated unrelated and overblown architectural assertions come off
> as indication of lack of any real argument to allow this race to live.
> There is simply no such guarantee of forward progress in all scenarios
> that produce the conditions that can cause this race.
> 
> Yet another example:
> 
>            <...>-369   [002] ...2   220.055746: xfs_ail_insert: dev 253:4 lip 00000000ddb123f2 old lsn 0/0 new lsn 1/248 type XFS_LI_INODE flags IN_AIL
>            <...>-27    [003] ...1   224.753110: xfs_log_force: dev 253:4 lsn 0x0 caller xfs_log_worker+0x2f/0xf0 [xfs]
>            <...>-404   [003] ...1   224.775551: __xlog_grant_push_ail: 1596: threshold_lsn 0x1000000fa
>      kworker/3:1-39    [003] ...2   224.777953: xfs_ail_delete: dev 253:4 lip 00000000ddb123f2 old lsn 1/248 new lsn 1/248 type XFS_LI_INODE flags IN_AIL
>     xfsaild/dm-4-1034  [000] ....   224.797919: xfsaild: 588: idle ->ail_target 0x1000000fa
>     kworker/3:1H-404   [003] ...2   225.841198: xfs_ail_insert: dev 253:4 lip 000000006845aeed old lsn 0/0 new lsn 1/250 type XFS_LI_INODE flags IN_AIL
>      kworker/3:1-39    [003] ...1   254.962822: xfs_log_force: dev 253:4 lsn 0x0 caller xfs_log_worker+0x2f/0xf0 [xfs]
> 	...
>      kworker/3:2-1920  [003] ...1  3759.291275: xfs_log_force: dev 253:4 lsn 0x0 caller xfs_log_worker+0x2f/0xf0 [xfs]
> 
> 
> # cat /sys/fs/xfs/dm-4/log/log_*lsn
> 1:252
> 1:250
> 
> This instance of the race uses the same serialization instrumentation to
> control execution timing and whatnot as before (i.e. no functional
> changes). First, an item is inserted into the AIL. Immediately after AIL
> insertion, another transaction commits to the CIL (not shown in the
> trace). The background log worker comes around a few seconds later and
> forces the log/CIL. The checkpoint for this log force races with an AIL
> delete and idle (same as before). AIL insertion occurs at the push
> target xfsaild just idled at, but this time reservation pressure
> relieves and the filesystem goes idle.
> 
> At this point, nothing occurs on the fs except for continuous background
> log worker jobs. Note the timestamp difference between the first
> post-race log force and the last in the trace. The log worker runs at
> the default 30s interval and has run repeatedly for almost an hour while
> failing to push the AIL and subsequently cover the log. To confirm the
> AIL is populated, see the log head/tail LSNs reported via sysfs. This
> state persists indefinitely so long as the fs is idle. This is a bug.

/me stumbles back in after ~2wks, and has a few questions:

1) Are these concerns a reason to hold up this series, or are they a
separate bug lurking in the code being touched by the series?  AFAICT I
think it's the second, but <shrug> my brain is still mush.

2) Er... how do you get the log stuck like this?  I see things earlier
in the thread like "open transactions that don't ultimately commit
anything to the log subsystem" and think "OH, you mean xfs_scrub!"

--D

> Brian
> 
> > Cheers,
> > 
> > Dave.
> > -- 
> > Dave Chinner
> > david@fromorbit.com
Brian Foster Sept. 17, 2019, 12:48 p.m. UTC | #14
On Mon, Sep 16, 2019 at 09:31:56PM -0700, Darrick J. Wong wrote:
> On Thu, Sep 12, 2019 at 09:46:06AM -0400, Brian Foster wrote:
> > On Wed, Sep 11, 2019 at 09:38:58AM +1000, Dave Chinner wrote:
> > > On Tue, Sep 10, 2019 at 05:56:28AM -0400, Brian Foster wrote:
> > > > On Mon, Sep 09, 2019 at 09:26:32AM +1000, Dave Chinner wrote:
> > > > > On Sat, Sep 07, 2019 at 11:10:50AM -0400, Brian Foster wrote:
> > > > > > This is an instance of xfsaild going idle between the time this
> > > > > > new AIL push sets the target based on the iclog about to be
> > > > > > committed and AIL insertion of the associated log items,
> > > > > > reproduced via a bit of timing instrumentation.  Don't be
> > > > > > distracted by the timestamps or the fact that the LSNs do not
> > > > > > match because the log items in the AIL end up indexed by the start
> > > > > > lsn of the CIL checkpoint (whereas last_sync_lsn refers to the
> > > > > > commit record). The point is simply that xfsaild has completed a
> > > > > > push of a target that hasn't been inserted yet.
> > > > > 
> > > > > AFAICT, what you are showing requires delaying of the CIL push to the
> > > > > point it violates a fundamental assumption about commit sizes, which
> > > > > is why I largely think it's irrelevant.
> > > > > 
> > > > 
> > > > The CIL checkpoint size is an unrelated side effect of the test I
> > > > happened to use, not a fundamental cause of the problem it demonstrates.
> > > > Fixing CIL checkpoint size issues won't change anything. Here's a
> > > > different variant of the same problem with a small enough number of log
> > > > items such that background CIL pushing is not a factor:
> > > > 
> > > >        <...>-79670 [000] ...1 56126.015522: xfs_log_force: dev 253:4 lsn 0x0 caller xfs_log_worker+0x2f/0xf0 [xfs]
> > > > kworker/0:1H-220   [000] ...1 56126.030587: __xlog_grant_push_ail: 1596: threshold_lsn 0x1000032e4
> > > > 	...
> > > >        <...>-81293 [000] ...2 56126.032647: xfs_ail_delete: dev 253:4 lip 00000000cbe82125 old lsn 1/13026 new lsn 1/13026 type XFS_LI_INODE flags IN_AIL
> > > >        <...>-81633 [000] .... 56126.053544: xfsaild: 588: idle ->ail_target 0x1000032e4
> > > > kworker/0:1H-220   [000] ...2 56127.038835: xfs_ail_insert: dev 253:4 lip 00000000a44ab1ef old lsn 0/0 new lsn 1/13028 type XFS_LI_INODE flags IN_AIL
> > > > kworker/0:1H-220   [000] ...2 56127.038911: xfs_ail_insert: dev 253:4 lip 0000000028d2061f old lsn 0/0 new lsn 1/13028 type XFS_LI_INODE flags IN_AIL
> > > > 	....
> > > >
> > > > This sequence starts with one log item in the AIL and some number of
> > > > items in the CIL such that a checkpoint executes from the background log
> > > > worker. The worker forces the CIL and log I/O completion issues an AIL
> > > > push that is truncated by the recently updated ->l_last_sync_lsn due to
> > > > outstanding transaction reservation and small AIL size. This push races
> > > > with completion of a previous push that empties the AIL and iclog
> > > > callbacks insert log items for the current checkpoint at the LSN target
> > > > xfsaild just idled at.
> > > 
> > > I'm just not seeing what the problem here is. The behaviour you are
> > > describing has been around since day zero and doesn't require the
> > > addition of an ail push from iclog completion to trigger.  Prior to
> > > this series, it would be:
> > > 
> > 
> > A few days ago you said that if we're inserting log items before the
> > push target, "something is very wrong." Since this was what I was
> > concerned about, I attempted to manufacture the issue to demonstrate.
> > You suggested the first reproducer I came up with was a separate problem
> > (related to CIL size issues), so I came up with the one above to avoid
> > that distraction. Now you're telling me this has always happened and is
> > fine..
> > 
> > While I don't think this is quite accurate (more below), I do find this
> > reasoning somewhat amusing in that it essentially implies that this
> > patch itself is dubious. If this new AIL push is required to fix a real
> > issue, and this race is essentially manifest as implied, then this patch
> > can't possibly reliably fix the original problem. Anyways, that is
> > neither here nor there..
> > 
> > All of the details of this particular issue aside, I do think there's a
> > development process problem here. It shouldn't require an extended game
> > of whack-a-mole with this kind of inconsistent reasoning just to request
> > a trivial change to a patch (you also implied in a previous response it
> > was me wasting your time on this topic) that closes an obvious race and
> > otherwise has no negative effect. Someone is being unreasonable here and
> > I don't think it's me. More importantly, discussion of open issues
> > shouldn't be a race against the associated patch being merged. :/
> > 
> > > process 1	reservation	log completion	xfsaild
> > > <completes metadata IO>
> > >   xfs_ail_delete()
> > >     mlip_changed
> > >     xlog_assign_tail_lsn_locked()
> > >       ail empty, sets l_last_sync = 0x1000032e2
> > >     xfs_log_space_wake()
> > > 				xlog_state_do_callback
> > > 				  sets CALLBACK
> > > 				  sets last_sync_lsn to iclog head
> > > 				    -> 0x1000032e4
> > > 				  <drops icloglock, gets preempted>
> > > 		<wakes>
> > > 		xlog_grant_head_wait
> > > 		  free_bytes < need_bytes
> > > 		    xlog_grant_push_ail()
> > > 		      xlog_push_ail()
> > > 		        ->ail_target 0x1000032e4
> > > 		<sleeps>
> > > 						<wakes>
> > > 						sets prev target to 0x1000032e4
> > > 						sees empty AIL
> > > 						<sleeps>
> > > 				    <runs again>
> > > 				    runs callbacks
> > > 				      xfs_ail_insert(lsn = 0x1000032e4)
> > > 
> > > and now we have the AIL push thread asleep with items in it at the
> > > push threshold.  IOWs, what you describe has always been possible,
> > > and before the CIL was introduced this sort of thing happened quite
> > > a bit because iclog completions freed up much less space in the log
> > > than a CIL commit completion.
> > > 
> > 
> > I was suspicious that this could occur prior to this change but I hadn't
> > confirmed. The scenario documented above cannot occur because a push on
> > an empty AIL has no effect. The target doesn't move and the task isn't
> > woken. That said, I still suspect the race can occur with the current
> > code via between a grant head waiter, AIL emptying and iclog completion.
> > 
> > This just speaks to the frequency of the problem, though. I'm not
> > convinced it's something that happens "quite a bit" given the nature of
> > the 3-way race. I also don't agree that existence of a historical
> > problem somehow excuses introduction a new variant of the same problem.
> > Instead, if this patch exposes a historical problem that simply had no
> > noticeable impact to this point, we should probably look into whether it
> > needs fixing too.
> > 
> > > It's not a problem, however, because if we are out of transaction
> > > reservation space we must have transactions in progress, and as long
> > > as they make progress then the commit of each transaction will end
> > > up calling xlog_ungrant_log_space() to return the unused portion of
> > > the transaction reservation. That calls xfs_log_space_wake() to
> > > allow reservation waiters to try to make progress.
> > > 
> > 
> > Yes, this is why I don't see immediate side effects in the tests I've
> > run so far. The assumptions you're basing this off are not always true,
> > however. Particularly on smaller (<= 1GB) filesystems, it's relatively
> > easy to produce conditions where the entire reservation space is
> > consumed by open transactions that don't ultimately commit anything to
> > the log subsystem and thus generate no forward progress.
> > 
> > > If there's still not enough space reservation after the transaction
> > > in progress has released it's reservation, then it goes back to
> > > sleep. As long as we have active transactions in progress while
> > > there are transaction reservations waiting on reservation space,
> > > there will be a wakeup vector for the reservation independent of
> > > the CIL, iclogs and AIL behaviour.
> > > 
> > 
> > We do have clean transaction cancel and error scenarios, existing log
> > deadlock vectors, increasing reliance on long running transactions via
> > deferred ops, scrub, etc. Also consider the fact that open transactions
> > consume considerably more reservation than committed transactions on
> > average.
> > 
> > I'm not saying it's likely for a real world workload to consume the
> > entirety of log reservation space via open transactions and then release
> > it without filesystem modification (and then race with log I/O and AIL
> > emptying), but from the perspective of proving the existence of a bug
> > it's really not that difficult to produce. I've not seen a real world
> > workload that reproduces the problems fixed by any of these patches
> > either, but we still fix them.
> > 
> > > [ Yes, there was a bug here, in the case xfs_log_space_wake() did
> > > not issue a wakeup because of not enough space being availble and
> > > the push target was limited by the old log head location. i.e.
> > > nothing ever updated the push target to reflect the new log head and
> > > so the tail might never get moved now. That particular bug was fixed
> > > by a an earlier patch in the series, so we can ignore it here. ]
> > > 
> > > IOWs, if the AIL is empty, the CIL cannot consume more than 25% of
> > > the log space, and we have transactions waiting on log reservation
> > > space, then we must have enough transactions in progress to cover at
> > > least 75% of the log space. Completion of those transactions will
> > > wake waiters and, if necessary, push the AIL again to keep the log
> > > tail moving appropriately. This handles the AIL empty and "insert
> > > before target" situations you are concerned about just fine, as long
> > > as we have a guarantee of forwards progress. Bounding the CIL size
> > > provides that forwards progress guarantee for the CIL...
> > > 
> > 
> > I think you have some tunnel vision or something going on here with
> > regard to the higher level architectural view of how things are supposed
> > to operate in a normal running/steady state vs simply what can and
> > cannot happen in the code. I can't really tell why/how, but the only
> > suggestion I can make is to perhaps separate from this high level view
> > of things and take a closer look at the code. This is a simple code bug,
> > not some grand architectural flaw. The context here is way out of whack.
> > The repeated unrelated and overblown architectural assertions come off
> > as indication of lack of any real argument to allow this race to live.
> > There is simply no such guarantee of forward progress in all scenarios
> > that produce the conditions that can cause this race.
> > 
> > Yet another example:
> > 
> >            <...>-369   [002] ...2   220.055746: xfs_ail_insert: dev 253:4 lip 00000000ddb123f2 old lsn 0/0 new lsn 1/248 type XFS_LI_INODE flags IN_AIL
> >            <...>-27    [003] ...1   224.753110: xfs_log_force: dev 253:4 lsn 0x0 caller xfs_log_worker+0x2f/0xf0 [xfs]
> >            <...>-404   [003] ...1   224.775551: __xlog_grant_push_ail: 1596: threshold_lsn 0x1000000fa
> >      kworker/3:1-39    [003] ...2   224.777953: xfs_ail_delete: dev 253:4 lip 00000000ddb123f2 old lsn 1/248 new lsn 1/248 type XFS_LI_INODE flags IN_AIL
> >     xfsaild/dm-4-1034  [000] ....   224.797919: xfsaild: 588: idle ->ail_target 0x1000000fa
> >     kworker/3:1H-404   [003] ...2   225.841198: xfs_ail_insert: dev 253:4 lip 000000006845aeed old lsn 0/0 new lsn 1/250 type XFS_LI_INODE flags IN_AIL
> >      kworker/3:1-39    [003] ...1   254.962822: xfs_log_force: dev 253:4 lsn 0x0 caller xfs_log_worker+0x2f/0xf0 [xfs]
> > 	...
> >      kworker/3:2-1920  [003] ...1  3759.291275: xfs_log_force: dev 253:4 lsn 0x0 caller xfs_log_worker+0x2f/0xf0 [xfs]
> > 
> > 
> > # cat /sys/fs/xfs/dm-4/log/log_*lsn
> > 1:252
> > 1:250
> > 
> > This instance of the race uses the same serialization instrumentation to
> > control execution timing and whatnot as before (i.e. no functional
> > changes). First, an item is inserted into the AIL. Immediately after AIL
> > insertion, another transaction commits to the CIL (not shown in the
> > trace). The background log worker comes around a few seconds later and
> > forces the log/CIL. The checkpoint for this log force races with an AIL
> > delete and idle (same as before). AIL insertion occurs at the push
> > target xfsaild just idled at, but this time reservation pressure
> > relieves and the filesystem goes idle.
> > 
> > At this point, nothing occurs on the fs except for continuous background
> > log worker jobs. Note the timestamp difference between the first
> > post-race log force and the last in the trace. The log worker runs at
> > the default 30s interval and has run repeatedly for almost an hour while
> > failing to push the AIL and subsequently cover the log. To confirm the
> > AIL is populated, see the log head/tail LSNs reported via sysfs. This
> > state persists indefinitely so long as the fs is idle. This is a bug.
> 
> /me stumbles back in after ~2wks, and has a few questions:
> 

Heh, welcome back.. ;)

> 1) Are these concerns a reason to hold up this series, or are they a
> separate bug lurking in the code being touched by the series?  AFAICT I
> think it's the second, but <shrug> my brain is still mush.
> 

A little of both I guess. To Dave's earlier point, I think this
technically can happen in the existing code as a 3-way race between the
aforementioned tasks (just not the way it was described). OTOH, I'm not
sure what this has to do with the fact that the new code being added is
racy on its own (or since when discovery of some old bug justifies
adding new ones..?). The examples shown above are fundamentally races
between log I/O completion and xfsaild. The last one shows the log
remaining uncovered indefinitely on an idle fs (which is not a corruption
or anything, but certainly a bug) simply because that's the easiest side
effect to reproduce. I'm fairly confident at this point that one could
be manufactured into a similar log deadlock if we really wanted to try,
but that would be much more difficult and TBH I'm tired of burning
myself out on these kinds of objections to obvious and easily addressed
landmines. How likely is it that somebody would hit these problems?
Probably highly unlikely. How likely is it somebody would hit this
problem before whatever problem this patch fixes? *shrug*
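
For anyone matching up the trace lines above: the push tracepoints print
the target as a raw 64-bit value (e.g. 0x1000000fa) while the AIL
insert/delete tracepoints print cycle/block (e.g. 1/250). A minimal
sketch of the decoding, assuming the usual layout of cycle in the upper
32 bits and block in the lower 32 bits. The toy_* names are made up for
illustration and are not the kernel helpers:

```c
#include <stdint.h>

/* Stand-in for the kernel's xfs_lsn_t; all names here are illustrative. */
typedef int64_t toy_lsn_t;

/* Upper 32 bits: log cycle number. */
static inline uint32_t toy_lsn_cycle(toy_lsn_t lsn)
{
	return (uint32_t)((uint64_t)lsn >> 32);
}

/* Lower 32 bits: block offset within the log. */
static inline uint32_t toy_lsn_block(toy_lsn_t lsn)
{
	return (uint32_t)((uint64_t)lsn & 0xffffffffu);
}

/* Build an LSN from the cycle/block pair printed by the AIL tracepoints. */
static inline toy_lsn_t toy_lsn_make(uint32_t cycle, uint32_t block)
{
	return (toy_lsn_t)(((uint64_t)cycle << 32) | block);
}

/* Three-way compare, cycle first, then block; returns <0, 0 or >0. */
static inline int toy_lsn_cmp(toy_lsn_t a, toy_lsn_t b)
{
	if (toy_lsn_cycle(a) != toy_lsn_cycle(b))
		return toy_lsn_cycle(a) < toy_lsn_cycle(b) ? -1 : 1;
	if (toy_lsn_block(a) != toy_lsn_block(b))
		return toy_lsn_block(a) < toy_lsn_block(b) ? -1 : 1;
	return 0;
}
```

With that, 0x1000000fa decodes to 1/250 and 0x1000032e4 to 1/13028, i.e.
in both traces the late xfs_ail_insert lands exactly at the target
xfsaild had already idled at.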

I don't think it's a reason to hold up the series, but at the same time
this patch is unrelated to the original problem. IIRC, it fell out of
some other issue reproduced with a different experimental hack/fix (that
was eventually replaced) to the original problem. FWIW, I'm annoyed with
the lazy approach to review here more than anything. In hindsight, if I
knew the feedback was going to be dismissed and the patchset rolled
forward and merged, perhaps I should have just nacked the subsequent
reposts to make the objection clear.

I dunno, not my call on what to do with it now. Feel free to add my
Nacked-by: to the upstream commit I guess so I at least remember this
when/if considering whether to backport it anywhere. :/

> 2) Er... how do you get the log stuck like this?  I see things earlier
> in the thread like "open transactions that don't ultimately commit
> anything to the log subsystem" and think "OH, you mean xfs_scrub!"
> 

That's one thing I was thinking about but I didn't end up looking into
it (does scrub actually acquire log reservation?). For a simpler
example, consider a bunch of threads running into quota block allocation
failures where a system is also under memory pressure. On filesystems
with smaller logs, it only takes a handful of such threads to bash the
reservation grant head against the log tail even though the log is empty
(and doing so without ever committing anything to the log).

Note that this by itself isn't what gets the log "stuck" in the most
recent example (note: not deadlocked), but rather if we're in a state
where the grant head is close enough to the log head (such that we AIL
push the items associated with the current checkpoint before it inserts)
when log I/O completion happens to race with AIL emptying as described.
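
The ordering can be sketched as a toy userspace model. This is not the
kernel code: struct toy_ail and the toy_* functions are hypothetical,
and the only behaviour borrowed from the real thing is my understanding
that a push request is a no-op when the AIL is empty or when the
requested threshold is not beyond the current target. LSNs are compared
as plain integers, which is fine here since cycle sits above block.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, heavily simplified stand-in for the AIL and its pusher. */
struct toy_ail {
	int64_t	target;		/* last push target handed to the pusher */
	int64_t	min_lsn;	/* lowest LSN in the AIL, valid if nitems > 0 */
	int	nitems;		/* number of items currently in the AIL */
	bool	pusher_awake;	/* is the pusher thread running? */
};

static void toy_ail_insert(struct toy_ail *ail, int64_t lsn)
{
	if (ail->nitems == 0 || lsn < ail->min_lsn)
		ail->min_lsn = lsn;
	ail->nitems++;
}

/* Model writeback completion removing everything from the AIL. */
static void toy_ail_empty(struct toy_ail *ail)
{
	ail->nitems = 0;
}

/* Request a push; returns true only if the pusher was actually woken. */
static bool toy_ail_push(struct toy_ail *ail, int64_t threshold)
{
	if (ail->nitems == 0 || threshold <= ail->target)
		return false;	/* nothing to push or target already covers it */
	ail->target = threshold;
	ail->pusher_awake = true;
	return true;
}

/* One pass of the pusher: with nothing at or below the target, go idle. */
static void toy_pusher_run(struct toy_ail *ail)
{
	if (ail->nitems == 0 || ail->min_lsn > ail->target)
		ail->pusher_awake = false;
}

/* Replay the ordering from the last trace; true if an item is stranded. */
static bool toy_race_strands_item(void)
{
	struct toy_ail ail = { 0 };
	bool woken;

	toy_ail_insert(&ail, 0x1000000f8LL);	/* item at 1/248 */
	toy_ail_push(&ail, 0x1000000faLL);	/* iclog completion pushes first */
	toy_ail_empty(&ail);			/* previous push finishes, AIL empties */
	toy_pusher_run(&ail);			/* pusher idles at target 1/250 */
	toy_ail_insert(&ail, 0x1000000faLL);	/* callbacks insert at 1/250 */
	woken = toy_ail_push(&ail, 0x1000000faLL); /* same threshold: no-op */

	return ail.nitems > 0 && !ail.pusher_awake && !woken;
}
```

In this model toy_race_strands_item() reports the stranded item: the AIL
is populated, the pusher is asleep, and re-requesting the same target
does not wake it. Only a push beyond 1/250 would.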

Brian

> --D
> 
> > Brian
> > 
> > > Cheers,
> > > 
> > > Dave.
> > > -- 
> > > Dave Chinner
> > > david@fromorbit.com
Darrick J. Wong Sept. 24, 2019, 5:16 p.m. UTC | #15
On Tue, Sep 17, 2019 at 08:48:27AM -0400, Brian Foster wrote:
> On Mon, Sep 16, 2019 at 09:31:56PM -0700, Darrick J. Wong wrote:
> > On Thu, Sep 12, 2019 at 09:46:06AM -0400, Brian Foster wrote:
> > > On Wed, Sep 11, 2019 at 09:38:58AM +1000, Dave Chinner wrote:
> > > > On Tue, Sep 10, 2019 at 05:56:28AM -0400, Brian Foster wrote:
> > > > > On Mon, Sep 09, 2019 at 09:26:32AM +1000, Dave Chinner wrote:
> > > > > > On Sat, Sep 07, 2019 at 11:10:50AM -0400, Brian Foster wrote:
> > > > > > > This is an instance of xfsaild going idle between the time this
> > > > > > > new AIL push sets the target based on the iclog about to be
> > > > > > > committed and AIL insertion of the associated log items,
> > > > > > > reproduced via a bit of timing instrumentation.  Don't be
> > > > > > > distracted by the timestamps or the fact that the LSNs do not
> > > > > > > match because the log items in the AIL end up indexed by the start
> > > > > > > lsn of the CIL checkpoint (whereas last_sync_lsn refers to the
> > > > > > > commit record). The point is simply that xfsaild has completed a
> > > > > > > push of a target that hasn't been inserted yet.
> > > > > > 
> > > > > > AFAICT, what you are showing requires delaying of the CIL push to the
> > > > > > point it violates a fundamental assumption about commit sizes, which
> > > > > > is why I largely think it's irrelevant.
> > > > > > 
> > > > > 
> > > > > The CIL checkpoint size is an unrelated side effect of the test I
> > > > > happened to use, not a fundamental cause of the problem it demonstrates.
> > > > > Fixing CIL checkpoint size issues won't change anything. Here's a
> > > > > different variant of the same problem with a small enough number of log
> > > > > items such that background CIL pushing is not a factor:
> > > > > 
> > > > >        <...>-79670 [000] ...1 56126.015522: xfs_log_force: dev 253:4 lsn 0x0 caller xfs_log_worker+0x2f/0xf0 [xfs]
> > > > > kworker/0:1H-220   [000] ...1 56126.030587: __xlog_grant_push_ail: 1596: threshold_lsn 0x1000032e4
> > > > > 	...
> > > > >        <...>-81293 [000] ...2 56126.032647: xfs_ail_delete: dev 253:4 lip 00000000cbe82125 old lsn 1/13026 new lsn 1/13026 type XFS_LI_INODE flags IN_AIL
> > > > >        <...>-81633 [000] .... 56126.053544: xfsaild: 588: idle ->ail_target 0x1000032e4
> > > > > kworker/0:1H-220   [000] ...2 56127.038835: xfs_ail_insert: dev 253:4 lip 00000000a44ab1ef old lsn 0/0 new lsn 1/13028 type XFS_LI_INODE flags IN_AIL
> > > > > kworker/0:1H-220   [000] ...2 56127.038911: xfs_ail_insert: dev 253:4 lip 0000000028d2061f old lsn 0/0 new lsn 1/13028 type XFS_LI_INODE flags IN_AIL
> > > > > 	....
> > > > >
> > > > > This sequence starts with one log item in the AIL and some number of
> > > > > items in the CIL such that a checkpoint executes from the background log
> > > > > worker. The worker forces the CIL and log I/O completion issues an AIL
> > > > > push that is truncated by the recently updated ->l_last_sync_lsn due to
> > > > > outstanding transaction reservation and small AIL size. This push races
> > > > > with completion of a previous push that empties the AIL and iclog
> > > > > callbacks insert log items for the current checkpoint at the LSN target
> > > > > xfsaild just idled at.
> > > > 
> > > > I'm just not seeing what the problem here is. The behaviour you are
> > > > describing has been around since day zero and doesn't require the
> > > > addition of an ail push from iclog completion to trigger.  Prior to
> > > > this series, it would be:
> > > > 
> > > 
> > > A few days ago you said that if we're inserting log items before the
> > > push target, "something is very wrong." Since this was what I was
> > > concerned about, I attempted to manufacture the issue to demonstrate.
> > > You suggested the first reproducer I came up with was a separate problem
> > > (related to CIL size issues), so I came up with the one above to avoid
> > > that distraction. Now you're telling me this has always happened and is
> > > fine..
> > > 
> > > While I don't think this is quite accurate (more below), I do find this
> > > reasoning somewhat amusing in that it essentially implies that this
> > > patch itself is dubious. If this new AIL push is required to fix a real
> > > issue, and this race is essentially manifest as implied, then this patch
> > > can't possibly reliably fix the original problem. Anyways, that is
> > > neither here nor there..
> > > 
> > > All of the details of this particular issue aside, I do think there's a
> > > development process problem here. It shouldn't require an extended game
> > > of whack-a-mole with this kind of inconsistent reasoning just to request
> > > a trivial change to a patch (you also implied in a previous response it
> > > was me wasting your time on this topic) that closes an obvious race and
> > > otherwise has no negative effect. Someone is being unreasonable here and
> > > I don't think it's me. More importantly, discussion of open issues
> > > shouldn't be a race against the associated patch being merged. :/
> > > 
> > > > process 1	reservation	log completion	xfsaild
> > > > <completes metadata IO>
> > > >   xfs_ail_delete()
> > > >     mlip_changed
> > > >     xlog_assign_tail_lsn_locked()
> > > >       ail empty, sets l_last_sync = 0x1000032e2
> > > >     xfs_log_space_wake()
> > > > 				xlog_state_do_callback
> > > > 				  sets CALLBACK
> > > > 				  sets last_sync_lsn to iclog head
> > > > 				    -> 0x1000032e4
> > > > 				  <drops icloglock, gets preempted>
> > > > 		<wakes>
> > > > 		xlog_grant_head_wait
> > > > 		  free_bytes < need_bytes
> > > > 		    xlog_grant_push_ail()
> > > > 		      xlog_push_ail()
> > > > 		        ->ail_target 0x1000032e4
> > > > 		<sleeps>
> > > > 						<wakes>
> > > > 						sets prev target to 0x1000032e4
> > > > 						sees empty AIL
> > > > 						<sleeps>
> > > > 				    <runs again>
> > > > 				    runs callbacks
> > > > 				      xfs_ail_insert(lsn = 0x1000032e4)
> > > > 
> > > > and now we have the AIL push thread asleep with items in it at the
> > > > push threshold.  IOWs, what you describe has always been possible,
> > > > and before the CIL was introduced this sort of thing happened quite
> > > > a bit because iclog completions freed up much less space in the log
> > > > than a CIL commit completion.
> > > > 
> > > 
> > > I was suspicious that this could occur prior to this change but I hadn't
> > > confirmed. The scenario documented above cannot occur because a push on
> > > an empty AIL has no effect. The target doesn't move and the task isn't
> > > woken. That said, I still suspect the race can occur with the current
> > > code via between a grant head waiter, AIL emptying and iclog completion.
> > > 
> > > This just speaks to the frequency of the problem, though. I'm not
> > > convinced it's something that happens "quite a bit" given the nature of
> > > the 3-way race. I also don't agree that existence of a historical
> > > problem somehow excuses introduction a new variant of the same problem.
> > > Instead, if this patch exposes a historical problem that simply had no
> > > noticeable impact to this point, we should probably look into whether it
> > > needs fixing too.
> > > 
> > > > It's not a problem, however, because if we are out of transaction
> > > > reservation space we must have transactions in progress, and as long
> > > > as they make progress then the commit of each transaction will end
> > > > up calling xlog_ungrant_log_space() to return the unused portion of
> > > > the transaction reservation. That calls xfs_log_space_wake() to
> > > > allow reservation waiters to try to make progress.
> > > > 
> > > 
> > > Yes, this is why I don't see immediate side effects in the tests I've
> > > run so far. The assumptions you're basing this off are not always true,
> > > however. Particularly on smaller (<= 1GB) filesystems, it's relatively
> > > easy to produce conditions where the entire reservation space is
> > > consumed by open transactions that don't ultimately commit anything to
> > > the log subsystem and thus generate no forward progress.
> > > 
> > > > If there's still not enough space reservation after the transaction
> > > > in progress has released it's reservation, then it goes back to
> > > > sleep. As long as we have active transactions in progress while
> > > > there are transaction reservations waiting on reservation space,
> > > > there will be a wakeup vector for the reservation independent of
> > > > the CIL, iclogs and AIL behaviour.
> > > > 
> > > 
> > > We do have clean transaction cancel and error scenarios, existing log
> > > deadlock vectors, increasing reliance on long running transactions via
> > > deferred ops, scrub, etc. Also consider the fact that open transactions
> > > consume considerably more reservation than committed transactions on
> > > average.
> > > 
> > > I'm not saying it's likely for a real world workload to consume the
> > > entirety of log reservation space via open transactions and then release
> > > it without filesystem modification (and then race with log I/O and AIL
> > > emptying), but from the perspective of proving the existence of a bug
> > > it's really not that difficult to produce. I've not seen a real world
> > > workload that reproduces the problems fixed by any of these patches
> > > either, but we still fix them.
> > > 
> > > > [ Yes, there was a bug here, in the case xfs_log_space_wake() did
> > > > not issue a wakeup because of not enough space being availble and
> > > > the push target was limited by the old log head location. i.e.
> > > > nothing ever updated the push target to reflect the new log head and
> > > > so the tail might never get moved now. That particular bug was fixed
> > > > by a an earlier patch in the series, so we can ignore it here. ]
> > > > 
> > > > IOWs, if the AIL is empty, the CIL cannot consume more than 25% of
> > > > the log space, and we have transactions waiting on log reservation
> > > > space, then we must have enough transactions in progress to cover at
> > > > least 75% of the log space. Completion of those transactions will
> > > > wake waiters and, if necessary, push the AIL again to keep the log
> > > > tail moving appropriately. This handles the AIL empty and "insert
> > > > before target" situations you are concerned about just fine, as long
> > > > as we have a guarantee of forwards progress. Bounding the CIL size
> > > > provides that forwards progress guarantee for the CIL...
> > > > 
> > > 
> > > I think you have some tunnel vision or something going on here with
> > > regard to the higher level architectural view of how things are supposed
> > > to operate in a normal running/steady state vs simply what can and
> > > cannot happen in the code. I can't really tell why/how, but the only
> > > suggestion I can make is to perhaps separate from this high level view
> > > of things and take a closer look at the code. This is a simple code bug,
> > > not some grand architectural flaw. The context here is way out of whack.
> > > The repeated unrelated and overblown architectural assertions come off
> > > as indication of lack of any real argument to allow this race to live.
> > > There is simply no such guarantee of forward progress in all scenarios
> > > that produce the conditions that can cause this race.
> > > 
> > > Yet another example:
> > > 
> > >            <...>-369   [002] ...2   220.055746: xfs_ail_insert: dev 253:4 lip 00000000ddb123f2 old lsn 0/0 new lsn 1/248 type XFS_LI_INODE flags IN_AIL
> > >            <...>-27    [003] ...1   224.753110: xfs_log_force: dev 253:4 lsn 0x0 caller xfs_log_worker+0x2f/0xf0 [xfs]
> > >            <...>-404   [003] ...1   224.775551: __xlog_grant_push_ail: 1596: threshold_lsn 0x1000000fa
> > >      kworker/3:1-39    [003] ...2   224.777953: xfs_ail_delete: dev 253:4 lip 00000000ddb123f2 old lsn 1/248 new lsn 1/248 type XFS_LI_INODE flags IN_AIL
> > >     xfsaild/dm-4-1034  [000] ....   224.797919: xfsaild: 588: idle ->ail_target 0x1000000fa
> > >     kworker/3:1H-404   [003] ...2   225.841198: xfs_ail_insert: dev 253:4 lip 000000006845aeed old lsn 0/0 new lsn 1/250 type XFS_LI_INODE flags IN_AIL
> > >      kworker/3:1-39    [003] ...1   254.962822: xfs_log_force: dev 253:4 lsn 0x0 caller xfs_log_worker+0x2f/0xf0 [xfs]
> > > 	...
> > >      kworker/3:2-1920  [003] ...1  3759.291275: xfs_log_force: dev 253:4 lsn 0x0 caller xfs_log_worker+0x2f/0xf0 [xfs]
> > > 
> > > 
> > > # cat /sys/fs/xfs/dm-4/log/log_*lsn
> > > 1:252
> > > 1:250
> > > 
> > > This instance of the race uses the same serialization instrumentation to
> > > control execution timing and whatnot as before (i.e. no functional
> > > changes). First, an item is inserted into the AIL. Immediately after AIL
> > > insertion, another transaction commits to the CIL (not shown in the
> > > trace). The background log worker comes around a few seconds later and
> > > forces the log/CIL. The checkpoint for this log force races with an AIL
> > > delete and idle (same as before). AIL insertion occurs at the push
> > > target xfsaild just idled at, but this time reservation pressure
> > > relieves and the filesystem goes idle.
> > > 
> > > At this point, nothing occurs on the fs except for continuous background
> > > log worker jobs. Note the timestamp difference between the first
> > > post-race log force and the last in the trace. The log worker runs at
> > > the default 30s interval and has run repeatedly for almost an hour while
> > > failing to push the AIL and subsequently cover the log. To confirm the
> > > AIL is populated, see the log head/tail LSNs reported via sysfs. This
> > > state persists indefinitely so long as the fs is idle. This is a bug.
> > 
> > /me stumbles back in after ~2wks, and has a few questions:
> > 
> 
> Heh, welcome back.. ;)
> 
> > 1) Are these concerns a reason to hold up this series, or are they a
> > separate bug lurking in the code being touched by the series?  AFAICT I
> > think it's the second, but <shrug> my brain is still mush.
> > 
> 
> A little of both I guess. To Dave's earlier point, I think this
> technically can happen in the existing code as a 3-way race between the
> aforementioned tasks (just not the way it was described). OTOH, I'm not
> sure what this has to do with the fact that the new code being added is
> racy on its own (or since when discovery of some old bug justifies
> adding new ones..?). The examples shown above are fundamentally races
> between log I/O completion and xfsaild. The last one shows the log
> remaining uncovered indefinitely on an idle fs (which is not a corruption
> or anything, but certainly a bug) simply because that's the easiest side
> effect to reproduce. I'm fairly confident at this point that one could
> be manufactured into a similar log deadlock if we really wanted to try,
> but that would be much more difficult and TBH I'm tired of burning
> myself out on these kind of objections to obvious and easily addressed
> landmines. How likely is it that somebody would hit these problems?
> Probably highly unlikely. How likely is it somebody would hit this
> problem before whatever problem this patch fixes? *shrug*
> 
> I don't think it's a reason to hold up the series, but at the same time
> this patch is unrelated to the original problem. IIRC, it fell out of
> some other issue reproduced with a different experimental hack/fix (that
> was eventually replaced) to the original problem. FWIW, I'm annoyed with
> the lazy approach to review here more than anything. In hindsight, if I
> knew the feedback was going to be dismissed and the patchset rolled
> forward and merged, perhaps I should have just nacked the subsequent
> reposts to make the objection clear.

:(

I'm sorry you feel that way.  I myself don't feel that my own handling
of this merge window has been good, between feeling pressured to get the
branches ready to go before vacation and for-next becoming intermittent
right around the same time.  Those both decrease my certainty about
what's going in the next merge and increase my own anxieties, and it
becomes a competition in my head between "I can add it now and revert it
later as a regression fix" vs. "if I don't add it I'll wonder if it was
necessary".

Anyway, I /think/ the end result is that if the AIL gets stuck /and/ the
system goes down before it becomes unstuck, then there'll be more work
for log recovery to do, because we failed to checkpoint everything that
we possibly could have before the crash?

So AFAICT it's not a critical disaster bug but I would like to study
this situation some more, particularly now that we have ~2mos for
stabilizing things.

> I dunno, not my call on what to do with it now. Feel free to add my
> Nacked-by: to the upstream commit I guess so I at least remember this
> when/if considering whether to backport it anywhere. :/

(/me continues to wish there was an easy way to add tagging to a commit,
particularly when it comes well after the fact.)

> > 2) Er... how do you get the log stuck like this?  I see things earlier
> > in the thread like "open transactions that don't ultimately commit
> > anything to the log subsystem" and think "OH, you mean xfs_scrub!"
> > 
> 
> That's one thing I was thinking about but I didn't end up looking into
> it (does scrub actually acquire log reservation?).

If you invoke the scrub ioctl with IFLAG_REPAIR set, it allocates a
non-empty transaction (itruncate, iirc) to do the check and rebuild the
data structure.  If the item is ok then it'll cancel the transaction.

> For a more simple
> example, consider a bunch of threads running into quota block allocation
> failures where a system is also under memory pressure. On filesystems
> with smaller logs, it only takes a handful of such threads to bash the
> reservation grant head against the log tail even though the log is empty
> (and doing so without ever committing anything to the log).
> 
> Note that this by itself isn't what gets the log "stuck" in the most
> recent example (note: not deadlocked), but rather if we're in a state
> where the grant head is close enough to the log head (such that we AIL
> push the items associated with the current checkpoint before it inserts)
> when log I/O completion happens to race with AIL emptying as described.

Hmm... I wonder if we could reproduce this by formatting a filesystem
with a small log; running a slow moving thread that touches a file once
per second (or something to generate a moderate amount of workload) and
monitors the log to watch its progress; and then kicking off dozens of
threads to invoke IFLAG_REPAIR scrubbers on some other non-corrupt part
of the filesystem?

--D

> Brian
> 
> > --D
> > 
> > > Brian
> > > 
> > > > Cheers,
> > > > 
> > > > Dave.
> > > > -- 
> > > > Dave Chinner
> > > > david@fromorbit.com
Brian Foster Sept. 26, 2019, 1:19 p.m. UTC | #16
On Tue, Sep 24, 2019 at 10:16:09AM -0700, Darrick J. Wong wrote:
> On Tue, Sep 17, 2019 at 08:48:27AM -0400, Brian Foster wrote:
> > On Mon, Sep 16, 2019 at 09:31:56PM -0700, Darrick J. Wong wrote:
> > > On Thu, Sep 12, 2019 at 09:46:06AM -0400, Brian Foster wrote:
> > > > On Wed, Sep 11, 2019 at 09:38:58AM +1000, Dave Chinner wrote:
> > > > > On Tue, Sep 10, 2019 at 05:56:28AM -0400, Brian Foster wrote:
> > > > > > On Mon, Sep 09, 2019 at 09:26:32AM +1000, Dave Chinner wrote:
> > > > > > > On Sat, Sep 07, 2019 at 11:10:50AM -0400, Brian Foster wrote:
> > > > > > > > This is an instance of xfsaild going idle between the time this
> > > > > > > > new AIL push sets the target based on the iclog about to be
> > > > > > > > committed and AIL insertion of the associated log items,
> > > > > > > > reproduced via a bit of timing instrumentation.  Don't be
> > > > > > > > distracted by the timestamps or the fact that the LSNs do not
> > > > > > > > match because the log items in the AIL end up indexed by the start
> > > > > > > > lsn of the CIL checkpoint (whereas last_sync_lsn refers to the
> > > > > > > > commit record). The point is simply that xfsaild has completed a
> > > > > > > > push of a target that hasn't been inserted yet.
> > > > > > > 
> > > > > > > AFAICT, what you are showing requires delaying of the CIL push to the
> > > > > > > point it violates a fundamental assumption about commit sizes, which
> > > > > > > is why I largely think it's irrelevant.
> > > > > > > 
> > > > > > 
> > > > > > The CIL checkpoint size is an unrelated side effect of the test I
> > > > > > happened to use, not a fundamental cause of the problem it demonstrates.
> > > > > > Fixing CIL checkpoint size issues won't change anything. Here's a
> > > > > > different variant of the same problem with a small enough number of log
> > > > > > items such that background CIL pushing is not a factor:
> > > > > > 
> > > > > >        <...>-79670 [000] ...1 56126.015522: xfs_log_force: dev 253:4 lsn 0x0 caller xfs_log_worker+0x2f/0xf0 [xfs]
> > > > > > kworker/0:1H-220   [000] ...1 56126.030587: __xlog_grant_push_ail: 1596: threshold_lsn 0x1000032e4
> > > > > > 	...
> > > > > >        <...>-81293 [000] ...2 56126.032647: xfs_ail_delete: dev 253:4 lip 00000000cbe82125 old lsn 1/13026 new lsn 1/13026 type XFS_LI_INODE flags IN_AIL
> > > > > >        <...>-81633 [000] .... 56126.053544: xfsaild: 588: idle ->ail_target 0x1000032e4
> > > > > > kworker/0:1H-220   [000] ...2 56127.038835: xfs_ail_insert: dev 253:4 lip 00000000a44ab1ef old lsn 0/0 new lsn 1/13028 type XFS_LI_INODE flags IN_AIL
> > > > > > kworker/0:1H-220   [000] ...2 56127.038911: xfs_ail_insert: dev 253:4 lip 0000000028d2061f old lsn 0/0 new lsn 1/13028 type XFS_LI_INODE flags IN_AIL
> > > > > > 	....
> > > > > >
> > > > > > This sequence starts with one log item in the AIL and some number of
> > > > > > items in the CIL such that a checkpoint executes from the background log
> > > > > > worker. The worker forces the CIL and log I/O completion issues an AIL
> > > > > > push that is truncated by the recently updated ->l_last_sync_lsn due to
> > > > > > outstanding transaction reservation and small AIL size. This push races
> > > > > > with completion of a previous push that empties the AIL and iclog
> > > > > > callbacks insert log items for the current checkpoint at the LSN target
> > > > > > xfsaild just idled at.
> > > > > 
> > > > > I'm just not seeing what the problem here is. The behaviour you are
> > > > > describing has been around since day zero and doesn't require the
> > > > > addition of an ail push from iclog completion to trigger.  Prior to
> > > > > this series, it would be:
> > > > > 
> > > > 
> > > > A few days ago you said that if we're inserting log items before the
> > > > push target, "something is very wrong." Since this was what I was
> > > > concerned about, I attempted to manufacture the issue to demonstrate.
> > > > You suggested the first reproducer I came up with was a separate problem
> > > > (related to CIL size issues), so I came up with the one above to avoid
> > > > that distraction. Now you're telling me this has always happened and is
> > > > fine..
> > > > 
> > > > While I don't think this is quite accurate (more below), I do find this
> > > > reasoning somewhat amusing in that it essentially implies that this
> > > > patch itself is dubious. If this new AIL push is required to fix a real
> > > > issue, and this race is essentially manifest as implied, then this patch
> > > > can't possibly reliably fix the original problem. Anyways, that is
> > > > neither here nor there..
> > > > 
> > > > All of the details of this particular issue aside, I do think there's a
> > > > development process problem here. It shouldn't require an extended game
> > > > of whack-a-mole with this kind of inconsistent reasoning just to request
> > > > a trivial change to a patch (you also implied in a previous response it
> > > > was me wasting your time on this topic) that closes an obvious race and
> > > > otherwise has no negative effect. Someone is being unreasonable here and
> > > > I don't think it's me. More importantly, discussion of open issues
> > > > shouldn't be a race against the associated patch being merged. :/
> > > > 
> > > > > process 1	reservation	log completion	xfsaild
> > > > > <completes metadata IO>
> > > > >   xfs_ail_delete()
> > > > >     mlip_changed
> > > > >     xlog_assign_tail_lsn_locked()
> > > > >       ail empty, sets l_last_sync = 0x1000032e2
> > > > >     xfs_log_space_wake()
> > > > > 				xlog_state_do_callback
> > > > > 				  sets CALLBACK
> > > > > 				  sets last_sync_lsn to iclog head
> > > > > 				    -> 0x1000032e4
> > > > > 				  <drops icloglock, gets preempted>
> > > > > 		<wakes>
> > > > > 		xlog_grant_head_wait
> > > > > 		  free_bytes < need_bytes
> > > > > 		    xlog_grant_push_ail()
> > > > > 		      xlog_push_ail()
> > > > > 		        ->ail_target 0x1000032e4
> > > > > 		<sleeps>
> > > > > 						<wakes>
> > > > > 						sets prev target to 0x1000032e4
> > > > > 						sees empty AIL
> > > > > 						<sleeps>
> > > > > 				    <runs again>
> > > > > 				    runs callbacks
> > > > > 				      xfs_ail_insert(lsn = 0x1000032e4)
> > > > > 
> > > > > and now we have the AIL push thread asleep with items in it at the
> > > > > push threshold.  IOWs, what you describe has always been possible,
> > > > > and before the CIL was introduced this sort of thing happened quite
> > > > > a bit because iclog completions freed up much less space in the log
> > > > > than a CIL commit completion.
> > > > > 
> > > > 
> > > > I was suspicious that this could occur prior to this change but I hadn't
> > > > confirmed. The scenario documented above cannot occur because a push on
> > > > an empty AIL has no effect. The target doesn't move and the task isn't
> > > > woken. That said, I still suspect the race can occur with the current
> > > > code between a grant head waiter, AIL emptying and iclog completion.
> > > > 
> > > > This just speaks to the frequency of the problem, though. I'm not
> > > > convinced it's something that happens "quite a bit" given the nature of
> > > > the 3-way race. I also don't agree that existence of a historical
> > > > problem somehow excuses introduction of a new variant of the same problem.
> > > > Instead, if this patch exposes a historical problem that simply had no
> > > > noticeable impact to this point, we should probably look into whether it
> > > > needs fixing too.
> > > > 
> > > > > It's not a problem, however, because if we are out of transaction
> > > > > reservation space we must have transactions in progress, and as long
> > > > > as they make progress then the commit of each transaction will end
> > > > > up calling xlog_ungrant_log_space() to return the unused portion of
> > > > > the transaction reservation. That calls xfs_log_space_wake() to
> > > > > allow reservation waiters to try to make progress.
> > > > > 
> > > > 
> > > > Yes, this is why I don't see immediate side effects in the tests I've
> > > > run so far. The assumptions you're basing this off are not always true,
> > > > however. Particularly on smaller (<= 1GB) filesystems, it's relatively
> > > > easy to produce conditions where the entire reservation space is
> > > > consumed by open transactions that don't ultimately commit anything to
> > > > the log subsystem and thus generate no forward progress.
> > > > 
> > > > > If there's still not enough space reservation after the transaction
> > > > > in progress has released its reservation, then it goes back to
> > > > > sleep. As long as we have active transactions in progress while
> > > > > there are transaction reservations waiting on reservation space,
> > > > > there will be a wakeup vector for the reservation independent of
> > > > > the CIL, iclogs and AIL behaviour.
> > > > > 
> > > > 
> > > > We do have clean transaction cancel and error scenarios, existing log
> > > > deadlock vectors, increasing reliance on long running transactions via
> > > > deferred ops, scrub, etc. Also consider the fact that open transactions
> > > > consume considerably more reservation than committed transactions on
> > > > average.
> > > > 
> > > > I'm not saying it's likely for a real world workload to consume the
> > > > entirety of log reservation space via open transactions and then release
> > > > it without filesystem modification (and then race with log I/O and AIL
> > > > emptying), but from the perspective of proving the existence of a bug
> > > > it's really not that difficult to produce. I've not seen a real world
> > > > workload that reproduces the problems fixed by any of these patches
> > > > either, but we still fix them.
> > > > 
> > > > > [ Yes, there was a bug here, in the case xfs_log_space_wake() did
> > > > > not issue a wakeup because of not enough space being available and
> > > > > the push target was limited by the old log head location. i.e.
> > > > > nothing ever updated the push target to reflect the new log head and
> > > > > so the tail might never get moved now. That particular bug was fixed
> > > > > by an earlier patch in the series, so we can ignore it here. ]
> > > > > 
> > > > > IOWs, if the AIL is empty, the CIL cannot consume more than 25% of
> > > > > the log space, and we have transactions waiting on log reservation
> > > > > space, then we must have enough transactions in progress to cover at
> > > > > least 75% of the log space. Completion of those transactions will
> > > > > wake waiters and, if necessary, push the AIL again to keep the log
> > > > > tail moving appropriately. This handles the AIL empty and "insert
> > > > > before target" situations you are concerned about just fine, as long
> > > > > as we have a guarantee of forwards progress. Bounding the CIL size
> > > > > provides that forwards progress guarantee for the CIL...
> > > > > 
> > > > 
> > > > I think you have some tunnel vision or something going on here with
> > > > regard to the higher level architectural view of how things are supposed
> > > > to operate in a normal running/steady state vs simply what can and
> > > > cannot happen in the code. I can't really tell why/how, but the only
> > > > suggestion I can make is to perhaps separate from this high level view
> > > > of things and take a closer look at the code. This is a simple code bug,
> > > > not some grand architectural flaw. The context here is way out of whack.
> > > > The repeated unrelated and overblown architectural assertions come off
> > > > as indication of lack of any real argument to allow this race to live.
> > > > There is simply no such guarantee of forward progress in all scenarios
> > > > that produce the conditions that can cause this race.
> > > > 
> > > > Yet another example:
> > > > 
> > > >            <...>-369   [002] ...2   220.055746: xfs_ail_insert: dev 253:4 lip 00000000ddb123f2 old lsn 0/0 new lsn 1/248 type XFS_LI_INODE flags IN_AIL
> > > >            <...>-27    [003] ...1   224.753110: xfs_log_force: dev 253:4 lsn 0x0 caller xfs_log_worker+0x2f/0xf0 [xfs]
> > > >            <...>-404   [003] ...1   224.775551: __xlog_grant_push_ail: 1596: threshold_lsn 0x1000000fa
> > > >      kworker/3:1-39    [003] ...2   224.777953: xfs_ail_delete: dev 253:4 lip 00000000ddb123f2 old lsn 1/248 new lsn 1/248 type XFS_LI_INODE flags IN_AIL
> > > >     xfsaild/dm-4-1034  [000] ....   224.797919: xfsaild: 588: idle ->ail_target 0x1000000fa
> > > >     kworker/3:1H-404   [003] ...2   225.841198: xfs_ail_insert: dev 253:4 lip 000000006845aeed old lsn 0/0 new lsn 1/250 type XFS_LI_INODE flags IN_AIL
> > > >      kworker/3:1-39    [003] ...1   254.962822: xfs_log_force: dev 253:4 lsn 0x0 caller xfs_log_worker+0x2f/0xf0 [xfs]
> > > > 	...
> > > >      kworker/3:2-1920  [003] ...1  3759.291275: xfs_log_force: dev 253:4 lsn 0x0 caller xfs_log_worker+0x2f/0xf0 [xfs]
> > > > 
> > > > 
> > > > # cat /sys/fs/xfs/dm-4/log/log_*lsn
> > > > 1:252
> > > > 1:250
> > > > 
> > > > This instance of the race uses the same serialization instrumentation to
> > > > control execution timing and whatnot as before (i.e. no functional
> > > > changes). First, an item is inserted into the AIL. Immediately after AIL
> > > > insertion, another transaction commits to the CIL (not shown in the
> > > > trace). The background log worker comes around a few seconds later and
> > > > forces the log/CIL. The checkpoint for this log force races with an AIL
> > > > delete and idle (same as before). AIL insertion occurs at the push
> > > > target xfsaild just idled at, but this time reservation pressure
> > > > relieves and the filesystem goes idle.
> > > > 
> > > > At this point, nothing occurs on the fs except for continuous background
> > > > log worker jobs. Note the timestamp difference between the first
> > > > post-race log force and the last in the trace. The log worker runs at
> > > > the default 30s interval and has run repeatedly for almost an hour while
> > > > failing to push the AIL and subsequently cover the log. To confirm the
> > > > AIL is populated, see the log head/tail LSNs reported via sysfs. This
> > > > state persists indefinitely so long as the fs is idle. This is a bug.
> > > 
> > > /me stumbles back in after ~2wks, and has a few questions:
> > > 
> > 
> > Heh, welcome back.. ;)
> > 
> > > 1) Are these concerns a reason to hold up this series, or are they a
> > > separate bug lurking in the code being touched by the series?  AFAICT I
> > > think it's the second, but <shrug> my brain is still mush.
> > > 
> > 
> > A little of both I guess. To Dave's earlier point, I think this
> > technically can happen in the existing code as a 3-way race between the
> > aforementioned tasks (just not the way it was described). OTOH, I'm not
> > sure what this has to do with the fact that the new code being added is
> > racy on its own (or since when discovery of some old bug justifies
> > adding new ones..?). The examples shown above are fundamentally races
> > between log I/O completion and xfsaild. The last one shows the log
> > remain uncovered indefinitely on an idle fs (which is not a corruption
> > or anything, but certainly a bug) simply because that's the easiest side
> > effect to reproduce. I'm fairly confident at this point that one could
> > be manufactured into a similar log deadlock if we really wanted to try,
> > but that would be much more difficult and TBH I'm tired of burning
> > myself out on these kind of objections to obvious and easily addressed
> > landmines. How likely is it that somebody would hit these problems?
> > Probably highly unlikely. How likely is it somebody would hit this
> > problem before whatever problem this patch fixes? *shrug*
> > 
> > I don't think it's a reason to hold up the series, but at the same time
> > this patch is unrelated to the original problem. IIRC, it fell out of
> > some other issue reproduced with a different experimental hack/fix (that
> > was eventually replaced) to the original problem. FWIW, I'm annoyed with
> > the lazy approach to review here more than anything. In hindsight, if I
> > knew the feedback was going to be dismissed and the patchset rolled
> > forward and merged, perhaps I should have just nacked the subsequent
> > reposts to make the objection clear.
> 
> :(
> 
> I'm sorry you feel that way.  I myself don't feel that my own handling
> of this merge window has been good, between feeling pressured to get the
> branches ready to go before vacation and for-next becoming intermittent
> right around the same time.  Those both decrease my certainty about
> what's going in the next merge and increase my own anxieties, and it
> becomes a competition in my head between "I can add it now and revert it
> later as a regression fix" vs. "if I don't add it I'll wonder if it was
> necessary".
> 

Eh, it is what it is. I don't expect to always agree on everything. I
could/should have noted the objection more clearly on subsequent posts
and will try to do so in the future. Conversely, a quick "there appears
to be a disagreement, the maintainer is making a decision" note on the
list would be appreciated should that play out in the future.

Just my .02 (and not saying this is some kind of pattern or anything),
but I do think the whole "merge it since we can revert it from for-next
later if it causes problems" thing kind of sets a low bar and a bad
precedent. What's the point of reviewing patches with that approach?
Beyond discouraging review, it also diminishes sense of responsibility
for the quality of affected areas of code IMO.

> Anyway, I /think/ the end result is that if the AIL gets stuck /and/ the
> system goes down before it becomes unstuck, then there'll be more work
> for log recovery to do, because we failed to checkpoint everything that
> we possibly could have before the crash?
> 

Yep, in this particular variant at least.

> So AFAICT it's not a critical disaster bug but I would like to study
> this situation some more, particularly now that we have ~2mos for
> stabilizing things.
> 

Right, this is definitely not a critical bug. AFAICT neither is the bug
fixed by the patch.

Note that just on principle I'm not going to spend a whole lot of time
testing for things post-merge that I consider and/or call out as
problems during review and end up ignored. The point of addressing such
things during review is to avoid those problems in the first place. In
this particular case, why spend time on that when it is so relatively
simple to relocate the xlog_grant_push_ail() call to after callback
processing? I still don't understand that tbh.

> > I dunno, not my call on what to do with it now. Feel free to add my
> > Nacked-by: to the upstream commit I guess so I at least remember this
> > when/if considering whether to backport it anywhere. :/
> 
> (/me continues to wish there was an easy way to add tagging to a commit,
> particularly when it comes well after the fact.)
> 

No big deal, this was all in hindsight.

> > > 2) Er... how do you get the log stuck like this?  I see things earlier
> > > in the thread like "open transactions that don't ultimately commit
> > > anything to the log subsystem" and think "OH, you mean xfs_scrub!"
> > > 
> > 
> > That's one thing I was thinking about but I didn't end up looking into
> > it (does scrub actually acquire log reservation?).
> 
> If you invoke the scrub ioctl with IFLAG_REPAIR set, it allocates a
> non-empty transaction (itruncate, iirc) to do the check and rebuild the
> data structure.  If the item is ok then it'll cancel the transaction.
> 

Ah, tr_itruncate is actually larger than tr_write too (which I assume is
why it's used here)...

> > For a more simple
> > example, consider a bunch of threads running into quota block allocation
> > failures where a system is also under memory pressure. On filesystems
> > with smaller logs, it only takes a handful of such threads to bash the
> > reservation grant head against the log tail even though the log is empty
> > (and doing so without ever committing anything to the log).
> > 
> > Note that this by itself isn't what gets the log "stuck" in the most
> > recent example (note: not deadlocked), but rather if we're in a state
> > where the grant head is close enough to the log head (such that we AIL
> > push the items associated with the current checkpoint before it inserts)
> > when log I/O completion happens to race with AIL emptying as described.
> 
> Hmm... I wonder if we could reproduce this by formatting a filesystem
> with a small log; running a slow moving thread that touches a file once
> per second (or something to generate a moderate amount of workload) and
> monitors the log to watch its progress; and then kicking off dozens of
> threads to invoke IFLAG_REPAIR scrubbers on some other non-corrupt part
> of the filesystem?
> 

Yeah, given the above and if we're able to kick off enough concurrent
scrubbers such that they do nontrivial work and don't block each other
before transaction allocation, that looks like it could result in
similar behavior wrt reservation. Concurrent allocbt scrubs perhaps?
If so, on an otherwise default sized 100g fs, concurrent repair scrubs
to anything more than 18 or so AGs (based on tr_itruncate size) would be
enough to cause AIL pushing from the log subsystem (with a minimum size
log, I think something like 3 AGs would be enough) without any guarantee
that any of those transactions commit. The rest is just a simple race
between AIL pushing a fabricated target and xfsaild.

BTW, another thing I noticed with regard to timing is that on 32-bit
systems the push target is updated under ->ail_lock, which I think means
the xfs_ail_push() call from log I/O completion can serialize behind an
xfsaild push in progress (after the former has checked for a !empty
AIL). I'm not sure that's enough to make the race easy to reproduce
without explicit delay injection, but it's a step in that direction..

Brian

> --D
> 
> > Brian
> > 
> > > --D
> > > 
> > > > Brian
> > > > 
> > > > > Cheers,
> > > > > 
> > > > > Dave.
> > > > > -- 
> > > > > Dave Chinner
> > > > > david@fromorbit.com

Patch

diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
index 6a59d71d4c60..733693e1ac9f 100644
--- a/fs/xfs/xfs_log.c
+++ b/fs/xfs/xfs_log.c
@@ -2632,6 +2632,46 @@  xlog_get_lowest_lsn(
 	return lowest_lsn;
 }
 
+/*
+ * Completion of an iclog IO does not imply that a transaction has completed, as
+ * transactions can be large enough to span many iclogs. We cannot change the
+ * tail of the log half way through a transaction as this may be the only
+ * transaction in the log and moving the tail to point to the middle of it
+ * will prevent recovery from finding the start of the transaction. Hence we
+ * should only update the last_sync_lsn if this iclog contains transaction
+ * completion callbacks on it.
+ *
+ * We have to do this before we drop the icloglock to ensure we are the only one
+ * that can update it.
+ *
+ * If we are moving the last_sync_lsn forwards, we also need to ensure we kick
+ * the reservation grant head pushing. This is due to the fact that the push
+ * target is bound by the current last_sync_lsn value. Hence if we have a large
+ * amount of log space bound up in this committing transaction then the
+ * last_sync_lsn value may be the limiting factor preventing tail pushing from
+ * freeing space in the log. Hence once we've updated the last_sync_lsn we
+ * should push the AIL to ensure the push target (and hence the grant head) is
+ * no longer bound by the old log head location and can move forwards and make
+ * progress again.
+ */
+static void
+xlog_state_set_callback(
+	struct xlog		*log,
+	struct xlog_in_core	*iclog,
+	xfs_lsn_t		header_lsn)
+{
+	iclog->ic_state = XLOG_STATE_CALLBACK;
+
+	ASSERT(XFS_LSN_CMP(atomic64_read(&log->l_last_sync_lsn),
+			   header_lsn) <= 0);
+
+	if (list_empty_careful(&iclog->ic_callbacks))
+		return;
+
+	atomic64_set(&log->l_last_sync_lsn, header_lsn);
+	xlog_grant_push_ail(log, 0);
+}
+
 /*
  * Return true if we need to stop processing, false to continue to the next
  * iclog. The caller will need to run callbacks if the iclog is returned in the
@@ -2644,6 +2684,7 @@  xlog_state_iodone_process_iclog(
 	struct xlog_in_core	*completed_iclog)
 {
 	xfs_lsn_t		lowest_lsn;
+	xfs_lsn_t		header_lsn;
 
 	/* Skip all iclogs in the ACTIVE & DIRTY states */
 	if (iclog->ic_state & (XLOG_STATE_ACTIVE|XLOG_STATE_DIRTY))
@@ -2681,34 +2722,15 @@  xlog_state_iodone_process_iclog(
 	 * callbacks) see the above if.
 	 *
 	 * We will do one more check here to see if we have chased our tail
-	 * around.
+	 * around. If this is not the lowest lsn iclog, then we will leave it
+	 * for another completion to process.
 	 */
+	header_lsn = be64_to_cpu(iclog->ic_header.h_lsn);
 	lowest_lsn = xlog_get_lowest_lsn(log);
-	if (lowest_lsn &&
-	    XFS_LSN_CMP(lowest_lsn, be64_to_cpu(iclog->ic_header.h_lsn)) < 0)
-		return false; /* Leave this iclog for another thread */
-
-	iclog->ic_state = XLOG_STATE_CALLBACK;
-
-	/*
-	 * Completion of a iclog IO does not imply that a transaction has
-	 * completed, as transactions can be large enough to span many iclogs.
-	 * We cannot change the tail of the log half way through a transaction
-	 * as this may be the only transaction in the log and moving th etail to
-	 * point to the middle of it will prevent recovery from finding the
-	 * start of the transaction.  Hence we should only update the
-	 * last_sync_lsn if this iclog contains transaction completion callbacks
-	 * on it.
-	 *
-	 * We have to do this before we drop the icloglock to ensure we are the
-	 * only one that can update it.
-	 */
-	ASSERT(XFS_LSN_CMP(atomic64_read(&log->l_last_sync_lsn),
-			be64_to_cpu(iclog->ic_header.h_lsn)) <= 0);
-	if (!list_empty_careful(&iclog->ic_callbacks))
-		atomic64_set(&log->l_last_sync_lsn,
-			be64_to_cpu(iclog->ic_header.h_lsn));
+	if (lowest_lsn && XFS_LSN_CMP(lowest_lsn, header_lsn) < 0)
+		return false;
 
+	xlog_state_set_callback(log, iclog, header_lsn);
 	return false;
 
 }