
[2/2] xfs: Throttle commits on delayed background CIL push

Message ID 20190930060344.14561-3-david@fromorbit.com (mailing list archive)
State Superseded
Series xfs: limit CIL push sizes

Commit Message

Dave Chinner Sept. 30, 2019, 6:03 a.m. UTC
From: Dave Chinner <dchinner@redhat.com>

In certain situations the background CIL push can be indefinitely
delayed. While we now have workarounds for the obvious cases, they
don't solve the underlying issue: there is no upper limit on the CIL
at which we will either force or wait for a background push to
start, hence allowing the CIL to grow without bound until it
consumes all log space.

To fix this, add a new wait queue to the CIL which allows background
pushes to wait for the CIL context to be switched out. This happens
when the push starts, so it allows us to block incoming transaction
commit completion until the push has started. This only affects
processes that are running modifications, and only when the CIL
threshold has been significantly overrun.

This has no apparent impact on performance, and doesn't even trigger
until over 45 million inodes have been created in a 16-way fsmark
test on a 2GB log. That was limiting at 64MB of log space used, so
the active CIL size is only about 3% of the total log in that case.
The concurrent removal of those files did not trigger the background
sleep at all.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_log_cil.c  | 37 +++++++++++++++++++++++++++++++++----
 fs/xfs/xfs_log_priv.h | 24 ++++++++++++++++++++++++
 fs/xfs/xfs_trace.h    |  1 +
 3 files changed, 58 insertions(+), 4 deletions(-)
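
In outline, the mechanism this patch adds reduces to the following sketch
(condensed from the diff below; the existing background-push logic and the
locking details are elided):

	/* double the background push threshold (xfs_log_priv.h hunk) */
	#define XLOG_CIL_BLOCKING_SPACE_LIMIT(log)	\
		(XLOG_CIL_SPACE_LIMIT(log) * 2)

	/* xlog_cil_push(): wake throttled committers once the push begins */
	wake_up_all(&ctx->push_wait);

	/*
	 * xlog_cil_push_background(): after queueing the push work, throttle
	 * committers that have pushed the context well over the limit.
	 */
	if (cil->xc_ctx->space_used >= XLOG_CIL_BLOCKING_SPACE_LIMIT(log)) {
		/* sleeps on push_wait, dropping xc_push_lock */
		xlog_wait(&cil->xc_ctx->push_wait, &cil->xc_push_lock);
		return;
	}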

Comments

Brian Foster Sept. 30, 2019, 5:03 p.m. UTC | #1
On Mon, Sep 30, 2019 at 04:03:44PM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> In certain situations the background CIL push can be indefinitely
> delayed. While we have workarounds from the obvious cases now, it
> doesn't solve the underlying issue. This issue is that there is no
> upper limit on the CIL where we will either force or wait for
> a background push to start, hence allowing the CIL to grow without
> bound until it consumes all log space.
> 
> To fix this, add a new wait queue to the CIL which allows background
> pushes to wait for the CIL context to be switched out. This happens
> when the push starts, so it will allow us to block incoming
> transaction commit completion until the push has started. This will
> only affect processes that are running modifications, and only when
> the CIL threshold has been significantly overrun.
> 
> This has no apparent impact on performance, and doesn't even trigger
> until over 45 million inodes had been created in a 16-way fsmark
> test on a 2GB log. That was limiting at 64MB of log space used, so
> the active CIL size is only about 3% of the total log in that case.
> The concurrent removal of those files did not trigger the background
> sleep at all.
> 

Have you done similar testing for small/minimum sized logs?

Also, if this is so limited in occurrence, had you given any thought to
something even more simple like flushing the CIL push workqueue when
over the throttle threshold? That would wait longer to fill the iclogs,
but would eliminate the need for another waitqueue that's apparently
only used for throttling. It may also facilitate reuse of
xlog_cil_push_now() in the >= XLOG_CIL_BLOCKING_SPACE_LIMIT() case
(actually, I think this could facilitate the elimination of
xlog_cil_push_background() entirely).
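
As a purely hypothetical sketch of that alternative (it reuses the existing
m_cil_workqueue and flush_workqueue(); it is not what the patch implements):

	/* hypothetical: throttle by draining the CIL push workqueue */
	if (cil->xc_ctx->space_used >= XLOG_CIL_BLOCKING_SPACE_LIMIT(log)) {
		/*
		 * flush_workqueue() waits for every queued CIL push to run
		 * to completion, not just for the context switch - the
		 * drawback Dave raises later in the thread.
		 */
		flush_workqueue(log->l_mp->m_cil_workqueue);
	}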

That aside...

> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/xfs_log_cil.c  | 37 +++++++++++++++++++++++++++++++++----
>  fs/xfs/xfs_log_priv.h | 24 ++++++++++++++++++++++++
>  fs/xfs/xfs_trace.h    |  1 +
>  3 files changed, 58 insertions(+), 4 deletions(-)
> 
> diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
> index ef652abd112c..4a09d50e1368 100644
> --- a/fs/xfs/xfs_log_cil.c
> +++ b/fs/xfs/xfs_log_cil.c
> @@ -670,6 +670,11 @@ xlog_cil_push(
>  	push_seq = cil->xc_push_seq;
>  	ASSERT(push_seq <= ctx->sequence);
>  
> +	/*
> +	 * Wake up any background push waiters now this context is being pushed.
> +	 */
> +	wake_up_all(&ctx->push_wait);
> +
>  	/*
>  	 * Check if we've anything to push. If there is nothing, then we don't
>  	 * move on to a new sequence number and so we have to be able to push
> @@ -746,6 +751,7 @@ xlog_cil_push(
>  	 */
>  	INIT_LIST_HEAD(&new_ctx->committing);
>  	INIT_LIST_HEAD(&new_ctx->busy_extents);
> +	init_waitqueue_head(&new_ctx->push_wait);
>  	new_ctx->sequence = ctx->sequence + 1;
>  	new_ctx->cil = cil;
>  	cil->xc_ctx = new_ctx;
> @@ -900,7 +906,7 @@ xlog_cil_push_work(
>   */
>  static void
>  xlog_cil_push_background(
> -	struct xlog	*log)
> +	struct xlog	*log) __releases(cil->xc_ctx_lock)
>  {
>  	struct xfs_cil	*cil = log->l_cilp;
>  
> @@ -914,14 +920,36 @@ xlog_cil_push_background(
>  	 * don't do a background push if we haven't used up all the
>  	 * space available yet.
>  	 */
> -	if (cil->xc_ctx->space_used < XLOG_CIL_SPACE_LIMIT(log))
> +	if (cil->xc_ctx->space_used < XLOG_CIL_SPACE_LIMIT(log)) {
> +		up_read(&cil->xc_ctx_lock);
>  		return;
> +	}
>  
>  	spin_lock(&cil->xc_push_lock);
>  	if (cil->xc_push_seq < cil->xc_current_sequence) {
>  		cil->xc_push_seq = cil->xc_current_sequence;
>  		queue_work(log->l_mp->m_cil_workqueue, &cil->xc_push_work);
>  	}
> +
> +	/*
> +	 * Drop the context lock now, we can't hold that if we need to sleep
> +	 * because we are over the blocking threshold. The push_lock is still
> +	 * held, so blocking threshold sleep/wakeup is still correctly
> +	 * serialised here.
> +	 */
> +	up_read(&cil->xc_ctx_lock);
> +
> +	/*
> +	 * If we are well over the space limit, throttle the work that is being
> +	 * done until the push work on this context has begun.
> +	 */
> +	if (cil->xc_ctx->space_used >= XLOG_CIL_BLOCKING_SPACE_LIMIT(log)) {

Can we use consistent logic with the rest of the function? I.e.

	up_read(..);
	if (space_used < XLOG_CIL_BLOCKING_SPACE_LIMIT(log)) {
		spin_unlock(..);
		return;
	}

	...
	xlog_wait(...);

> +		trace_xfs_log_cil_wait(log, cil->xc_ctx->ticket);
> +		ASSERT(cil->xc_ctx->space_used < log->l_logsize);
> +		xlog_wait(&cil->xc_ctx->push_wait, &cil->xc_push_lock);
> +		return;
> +	}
> +

Also, I find it slightly annoying that even with all of this locking
quirkiness we still manage to read ->space_used unprotected (now twice).
IMO, the simple thing would be to let xlog_cil_insert_items() return the
size immediately after the current transaction inserts and pass that
into this function as a parameter.
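
As a hypothetical sketch of that suggestion (the changed return value and
extra parameter are illustrative, not part of the patch):

	/* xlog_cil_insert_items() could return the post-insert CIL size... */
	space_used = xlog_cil_insert_items(log, tp);
	...
	/*
	 * ...which xfs_log_commit_cil() then passes down, so the throttle
	 * check doesn't have to re-read ->space_used racily.
	 */
	xlog_cil_push_background(log, space_used);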

>  	spin_unlock(&cil->xc_push_lock);
>  
>  }
> @@ -1038,9 +1066,9 @@ xfs_log_commit_cil(
>  		if (lip->li_ops->iop_committing)
>  			lip->li_ops->iop_committing(lip, xc_commit_lsn);
>  	}
> -	xlog_cil_push_background(log);
>  
> -	up_read(&cil->xc_ctx_lock);
> +	/* xlog_cil_push_background() releases cil->xc_ctx_lock */
> +	xlog_cil_push_background(log);
>  }
>  
>  /*
> @@ -1199,6 +1227,7 @@ xlog_cil_init(
>  
>  	INIT_LIST_HEAD(&ctx->committing);
>  	INIT_LIST_HEAD(&ctx->busy_extents);
> +	init_waitqueue_head(&ctx->push_wait);
>  	ctx->sequence = 1;
>  	ctx->cil = cil;
>  	cil->xc_ctx = ctx;
> diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h
> index a3cc8a9a16d9..f231b7dfaeab 100644
> --- a/fs/xfs/xfs_log_priv.h
> +++ b/fs/xfs/xfs_log_priv.h
> @@ -247,6 +247,7 @@ struct xfs_cil_ctx {
>  	struct xfs_log_vec	*lv_chain;	/* logvecs being pushed */
>  	struct list_head	iclog_entry;
>  	struct list_head	committing;	/* ctx committing list */
> +	wait_queue_head_t	push_wait;	/* background push throttle */
>  	struct work_struct	discard_endio_work;
>  };
>  
> @@ -344,10 +345,33 @@ struct xfs_cil {
>   *   buffer window (32MB) as measurements have shown this to be roughly the
>   *   point of diminishing performance increases under highly concurrent
>   *   modification workloads.
> + *
> + * To prevent the CIL from overflowing upper commit size bounds, we introduce a
> + * new threshold at which we block committing transactions until the background
> + * CIL commit commences and switches to a new context. While this is not a hard
> + * limit, it forces the process committing a transaction to the CIL to block and
> + * yeild the CPU, giving the CIL push work a chance to be scheduled and start
> + * work. This prevents a process running lots of transactions from overfilling
> + * the CIL because it is not yielding the CPU. We set the blocking limit at
> + * twice the background push space threshold so we keep in line with the AIL
> + * push thresholds.
> + *
> + * Note: this is not a -hard- limit as blocking is applied after the transaction
> + * is inserted into the CIL and the push has been triggered. It is largely a
> + * throttling mechanism that allows the CIL push to be scheduled and run. A hard
> + * limit will be difficult to implement without introducing global serialisation
> + * in the CIL commit fast path, and it's not at all clear that we actually need
> + * such hard limits given the ~7 years we've run without a hard limit before
> + * finding the first situation where a checkpoint size overflow actually
> + * occurred. Hence the simple throttle, and an ASSERT check to tell us that
> + * we've overrun the max size.
>   */

I appreciate the extra documentation here, but I think most of the
second paragraph is better commit log description material than
something worth lengthening this already huge comment for. I'd suggest
something like the following, but feel free to rework of course:

 *
 * Since the CIL push threshold only triggers a background push, a
 * second threshold triggers transaction commit blocking until the
 * background push initiates and switches the CIL context. The blocking
 * threshold is set to twice the background push threshold to keep in
 * line with AIL push thresholds. Note that this is not a CIL context
 * size limit. This is a throttle threshold to slow the growth of the
 * context and yield the CPU for a background push under overload
 * conditions.
 */

Brian

>  #define XLOG_CIL_SPACE_LIMIT(log)	\
>  	min_t(int, (log)->l_logsize >> 3, BBTOB(XLOG_TOTAL_REC_SHIFT(log)) << 4)
>  
> +#define XLOG_CIL_BLOCKING_SPACE_LIMIT(log)	\
> +	(XLOG_CIL_SPACE_LIMIT(log) * 2)
> +
>  /*
>   * ticket grant locks, queues and accounting have their own cachlines
>   * as these are quite hot and can be operated on concurrently.
> diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
> index eaae275ed430..e7087ede2662 100644
> --- a/fs/xfs/xfs_trace.h
> +++ b/fs/xfs/xfs_trace.h
> @@ -1011,6 +1011,7 @@ DEFINE_LOGGRANT_EVENT(xfs_log_regrant_reserve_sub);
>  DEFINE_LOGGRANT_EVENT(xfs_log_ungrant_enter);
>  DEFINE_LOGGRANT_EVENT(xfs_log_ungrant_exit);
>  DEFINE_LOGGRANT_EVENT(xfs_log_ungrant_sub);
> +DEFINE_LOGGRANT_EVENT(xfs_log_cil_wait);
>  
>  DECLARE_EVENT_CLASS(xfs_log_item_class,
>  	TP_PROTO(struct xfs_log_item *lip),
> -- 
> 2.23.0.rc1
>
Dave Chinner Sept. 30, 2019, 9:53 p.m. UTC | #2
On Mon, Sep 30, 2019 at 01:03:58PM -0400, Brian Foster wrote:
> On Mon, Sep 30, 2019 at 04:03:44PM +1000, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > In certain situations the background CIL push can be indefinitely
> > delayed. While we have workarounds from the obvious cases now, it
> > doesn't solve the underlying issue. This issue is that there is no
> > upper limit on the CIL where we will either force or wait for
> > a background push to start, hence allowing the CIL to grow without
> > bound until it consumes all log space.
> > 
> > To fix this, add a new wait queue to the CIL which allows background
> > pushes to wait for the CIL context to be switched out. This happens
> > when the push starts, so it will allow us to block incoming
> > transaction commit completion until the push has started. This will
> > only affect processes that are running modifications, and only when
> > the CIL threshold has been significantly overrun.
> > 
> > This has no apparent impact on performance, and doesn't even trigger
> > until over 45 million inodes had been created in a 16-way fsmark
> > test on a 2GB log. That was limiting at 64MB of log space used, so
> > the active CIL size is only about 3% of the total log in that case.
> > The concurrent removal of those files did not trigger the background
> > sleep at all.
> > 
> 
> Have you done similar testing for small/minimum sized logs?

Yes. I've had the tracepoint active during xfstests runs on test
filesystems using default log sizes on 5-15GB filesystems. The only
test in all of xfstests that has triggered it is generic/017, and it
only triggered once.

e.g.

# trace-cmd start -e xfs_log_cil_wait
<run xfstests>
# trace-cmd show
# tracer: nop
#
# entries-in-buffer/entries-written: 1/1   #P:4
#
#                              _-----=> irqs-off
#                             / _----=> need-resched
#                            | / _---=> hardirq/softirq
#                            || / _--=> preempt-depth
#                            ||| /     delay
#           TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
#              | |       |   ||||       |         |
          xfs_io-2158  [001] ...1   309.285959: xfs_log_cil_wait: dev 8:96 t_ocnt 1 t_cnt 1 t_curr_res 67956 t_unit_res 67956 t_flags XLOG_TIC_INITED reserveq empty writeq empty grant_reserve_cycle 75 grant_reserve_bytes 12878480 grant_write_cycle 75 grant_write_bytes 12878480 curr_cycle 75 curr_block 10448 tail_cycle 75 tail_block 3560
#

And the timestamp matched the time that generic/017 was running.

> Also, if this is so limited in occurrence, had you given any thought to
> something even more simple like flushing the CIL push workqueue when
> over the throttle threshold?

Yes, I've tried that - flush_workqueue() blocks /everything/ until
all the queued work is complete. That means it waits for the CIL to
flush to disk and write a commit record, and every modification in
the filesystem is stopped dead in its tracks until the CIL push is
complete.

The problem is that flushing workqueues is a synchronous operation,
and it can't wait for partial work completion. We only need to wait
for the CIL context to be swapped out - this is done by the push
code before it starts all the expensive iclog formatting and waiting
for iclog space...

> That would wait longer to fill the iclogs,
> but would eliminate the need for another waitqueue that's apparently
> only used for throttling. It may also facilitate reuse of
> xlog_cil_push_now() in the >= XLOG_CIL_SPACE_BLOCKING_LIMIT() case
> (actually, I think this could facilitate the elimination of
> xlog_cil_push_background() entirely).

xlog_cil_push_now() uses flush_work() to push any pending work
before it queues up the CIL flush that the caller is about to wait
for.  i.e. the callers of xlog_cil_push_now() must ensure that all
CIL contexts are flushed for the purposes of a log force as they are
going to wait for all pending CIL flushes to complete. If we've
already pushed the CIL to the sequence that we are asking to push
to, we still have to wait for that previous push to be
done. This is what the flush_work() call in xlog_cil_push_now()
achieves.

> That aside...
> 
> > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > ---
> >  fs/xfs/xfs_log_cil.c  | 37 +++++++++++++++++++++++++++++++++----
> >  fs/xfs/xfs_log_priv.h | 24 ++++++++++++++++++++++++
> >  fs/xfs/xfs_trace.h    |  1 +
> >  3 files changed, 58 insertions(+), 4 deletions(-)
> > 
> > diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
> > index ef652abd112c..4a09d50e1368 100644
> > --- a/fs/xfs/xfs_log_cil.c
> > +++ b/fs/xfs/xfs_log_cil.c
> > @@ -670,6 +670,11 @@ xlog_cil_push(
> >  	push_seq = cil->xc_push_seq;
> >  	ASSERT(push_seq <= ctx->sequence);
> >  
> > +	/*
> > +	 * Wake up any background push waiters now this context is being pushed.
> > +	 */
> > +	wake_up_all(&ctx->push_wait);
> > +
> >  	/*
> >  	 * Check if we've anything to push. If there is nothing, then we don't
> >  	 * move on to a new sequence number and so we have to be able to push
> > @@ -746,6 +751,7 @@ xlog_cil_push(
> >  	 */
> >  	INIT_LIST_HEAD(&new_ctx->committing);
> >  	INIT_LIST_HEAD(&new_ctx->busy_extents);
> > +	init_waitqueue_head(&new_ctx->push_wait);
> >  	new_ctx->sequence = ctx->sequence + 1;
> >  	new_ctx->cil = cil;
> >  	cil->xc_ctx = new_ctx;
> > @@ -900,7 +906,7 @@ xlog_cil_push_work(
> >   */
> >  static void
> >  xlog_cil_push_background(
> > -	struct xlog	*log)
> > +	struct xlog	*log) __releases(cil->xc_ctx_lock)
> >  {
> >  	struct xfs_cil	*cil = log->l_cilp;
> >  
> > @@ -914,14 +920,36 @@ xlog_cil_push_background(
> >  	 * don't do a background push if we haven't used up all the
> >  	 * space available yet.
> >  	 */
> > -	if (cil->xc_ctx->space_used < XLOG_CIL_SPACE_LIMIT(log))
> > +	if (cil->xc_ctx->space_used < XLOG_CIL_SPACE_LIMIT(log)) {
> > +		up_read(&cil->xc_ctx_lock);
> >  		return;
> > +	}
> >  
> >  	spin_lock(&cil->xc_push_lock);
> >  	if (cil->xc_push_seq < cil->xc_current_sequence) {
> >  		cil->xc_push_seq = cil->xc_current_sequence;
> >  		queue_work(log->l_mp->m_cil_workqueue, &cil->xc_push_work);
> >  	}
> > +
> > +	/*
> > +	 * Drop the context lock now, we can't hold that if we need to sleep
> > +	 * because we are over the blocking threshold. The push_lock is still
> > +	 * held, so blocking threshold sleep/wakeup is still correctly
> > +	 * serialised here.
> > +	 */
> > +	up_read(&cil->xc_ctx_lock);
> > +
> > +	/*
> > +	 * If we are well over the space limit, throttle the work that is being
> > +	 * done until the push work on this context has begun.
> > +	 */
> > +	if (cil->xc_ctx->space_used >= XLOG_CIL_BLOCKING_SPACE_LIMIT(log)) {
> 
> Can we use consistent logic with the rest of the function? I.e.
> 
> 	up_read(..);
> 	if (space_used < XLOG_CIL_BLOCKING_SPACE_LIMIT(log)) {
> 		spin_unlock(..);
> 		return;
> 	}

I did it this way because code inside the if() statement is
typically pushed out of line by the compiler if it finishes with a
return statement (i.e. the compiler makes an "unlikely" static
prediction). Hence it's best to put the slow path code inside the
if(), not the fast path code.

I can change it, but then we're into ugly likely()/unlikely() macro
territory and I much prefer to structure the code so things like
that are completely unnecessary.
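
To make the trade-off concrete, the two layouts being compared are roughly
as follows (a generic illustration with made-up helper names, not XFS code):

	/*
	 * Slow path in the if() body: the compiler statically predicts the
	 * branch untaken and tends to push the body out of line.
	 */
	if (space_used >= blocking_limit) {
		throttle();		/* cold path */
		return;
	}
	/* fast path falls straight through */

	/*
	 * Inverted to match the rest of the function, the fast path ends up
	 * in the body, so keeping the same prediction needs a hint.
	 */
	if (likely(space_used < blocking_limit)) {
		unlock();
		return;
	}
	throttle();			/* cold path */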

> 
> 	...
> 	xlog_wait(...);
> 
> > +		trace_xfs_log_cil_wait(log, cil->xc_ctx->ticket);
> > +		ASSERT(cil->xc_ctx->space_used < log->l_logsize);
> > +		xlog_wait(&cil->xc_ctx->push_wait, &cil->xc_push_lock);
> > +		return;
> > +	}
> > +
> 
> Also, I find it slightly annoying that even with all of this locking
> quirkiness we still manage to read ->space_used unprotected (now twice).
> IMO, the simple thing would be to let xlog_cil_insert_items() return the
> size immediately after the current transaction inserts and pass that
> into this function as a parameter.

Yes, I noticed this and thought about it a bit, but it doesn't
really matter if the read is racy. It never really has mattered
because it's not a hard limit where accuracy is vitally important
for correct functioning. Locking and/or atomics are really only
necessary if we care about exact accuracy, but we care only about
avoiding fast path locking overhead as much as possible here.

I'll deal with that in a separate patchset when I have some spare
time.

> > + * To prevent the CIL from overflowing upper commit size bounds, we introduce a
> > + * new threshold at which we block committing transactions until the background
> > + * CIL commit commences and switches to a new context. While this is not a hard
> > + * limit, it forces the process committing a transaction to the CIL to block and
> > + * yeild the CPU, giving the CIL push work a chance to be scheduled and start
> > + * work. This prevents a process running lots of transactions from overfilling
> > + * the CIL because it is not yielding the CPU. We set the blocking limit at
> > + * twice the background push space threshold so we keep in line with the AIL
> > + * push thresholds.
> > + *
> > + * Note: this is not a -hard- limit as blocking is applied after the transaction
> > + * is inserted into the CIL and the push has been triggered. It is largely a
> > + * throttling mechanism that allows the CIL push to be scheduled and run. A hard
> > + * limit will be difficult to implement without introducing global serialisation
> > + * in the CIL commit fast path, and it's not at all clear that we actually need
> > + * such hard limits given the ~7 years we've run without a hard limit before
> > + * finding the first situation where a checkpoint size overflow actually
> > + * occurred. Hence the simple throttle, and an ASSERT check to tell us that
> > + * we've overrun the max size.
> >   */
> 
> I appreciate the extra documentation here, but I think most of the
> second paragraph is better commit log description material than
> something worth lengthening this already huge comment for. I'd suggest
> something like the following, but feel free to rework of course:

Darrick asked me to include some of the information I had in the
commit description directly in the comment, so I did exactly what he
asked....

>  *
>  * Since the CIL push threshold only triggers a background push, a
>  * second threshold triggers transaction commit blocking until the
>  * background push initiates and switches the CIL context. The blocking
>  * threshold is set to twice the background push threshold to keep in
>  * line with AIL push thresholds. Note that this is not a CIL context
>  * size limit. This is a throttle threshold to slow the growth of the
>  * context and yield the CPU for a background push under overload
>  * conditions.

Sure, but that doesn't tell us *why* it is implemented as a throttle
threshold instead of a hard limit. This text tells us what the code
does, not why it was implemented that way. I'll let darrick decide
on this.

Cheers,

Dave.


>  */
> 
> Brian
> 
> >  #define XLOG_CIL_SPACE_LIMIT(log)	\
> >  	min_t(int, (log)->l_logsize >> 3, BBTOB(XLOG_TOTAL_REC_SHIFT(log)) << 4)
> >  
> > +#define XLOG_CIL_BLOCKING_SPACE_LIMIT(log)	\
> > +	(XLOG_CIL_SPACE_LIMIT(log) * 2)
> > +
> >  /*
> >   * ticket grant locks, queues and accounting have their own cachlines
> >   * as these are quite hot and can be operated on concurrently.
> > diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
> > index eaae275ed430..e7087ede2662 100644
> > --- a/fs/xfs/xfs_trace.h
> > +++ b/fs/xfs/xfs_trace.h
> > @@ -1011,6 +1011,7 @@ DEFINE_LOGGRANT_EVENT(xfs_log_regrant_reserve_sub);
> >  DEFINE_LOGGRANT_EVENT(xfs_log_ungrant_enter);
> >  DEFINE_LOGGRANT_EVENT(xfs_log_ungrant_exit);
> >  DEFINE_LOGGRANT_EVENT(xfs_log_ungrant_sub);
> > +DEFINE_LOGGRANT_EVENT(xfs_log_cil_wait);
> >  
> >  DECLARE_EVENT_CLASS(xfs_log_item_class,
> >  	TP_PROTO(struct xfs_log_item *lip),
> > -- 
> > 2.23.0.rc1
> > 
>
Dave Chinner Oct. 1, 2019, 3:42 a.m. UTC | #3
On Tue, Oct 01, 2019 at 07:53:36AM +1000, Dave Chinner wrote:
> On Mon, Sep 30, 2019 at 01:03:58PM -0400, Brian Foster wrote:
> > On Mon, Sep 30, 2019 at 04:03:44PM +1000, Dave Chinner wrote:
> > > From: Dave Chinner <dchinner@redhat.com>
> > > 
> > > In certain situations the background CIL push can be indefinitely
> > > delayed. While we have workarounds from the obvious cases now, it
> > > doesn't solve the underlying issue. This issue is that there is no
> > > upper limit on the CIL where we will either force or wait for
> > > a background push to start, hence allowing the CIL to grow without
> > > bound until it consumes all log space.
> > > 
> > > To fix this, add a new wait queue to the CIL which allows background
> > > pushes to wait for the CIL context to be switched out. This happens
> > > when the push starts, so it will allow us to block incoming
> > > transaction commit completion until the push has started. This will
> > > only affect processes that are running modifications, and only when
> > > the CIL threshold has been significantly overrun.
> > > 
> > > This has no apparent impact on performance, and doesn't even trigger
> > > until over 45 million inodes had been created in a 16-way fsmark
> > > test on a 2GB log. That was limiting at 64MB of log space used, so
> > > the active CIL size is only about 3% of the total log in that case.
> > > The concurrent removal of those files did not trigger the background
> > > sleep at all.
> > > 
> > 
> > Have you done similar testing for small/minimum sized logs?
> 
> Yes. I've had the tracepoint active during xfstests runs on test
> filesystems using default log sizes on 5-15GB filesystems. The only
> test in all of xfstests that has triggered it is generic/017, and it
> only triggered once.
> 
> e.g.
> 
> # trace-cmd start -e xfs_log_cil_wait
> <run xfstests>
> # trace-cmd show
> # tracer: nop
> #
> # entries-in-buffer/entries-written: 1/1   #P:4
> #
> #                              _-----=> irqs-off
> #                             / _----=> need-resched
> #                            | / _---=> hardirq/softirq
> #                            || / _--=> preempt-depth
> #                            ||| /     delay
> #           TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
> #              | |       |   ||||       |         |
>           xfs_io-2158  [001] ...1   309.285959: xfs_log_cil_wait: dev 8:96 t_ocnt 1 t_cnt 1 t_curr_res 67956 t_unit_res 67956 t_flags XLOG_TIC_INITED reserveq empty writeq empty grant_reserve_cycle 75 grant_reserve_bytes 12878480 grant_write_cycle 75 grant_write_bytes 12878480 curr_cycle 75 curr_block 10448 tail_cycle 75 tail_block 3560
> #
> 
> And the timestamp matched the time that generic/017 was running.

So I've run this on my typical 16-way fsmark workload with different
size logs. It barely triggers on log sizes larger than 64MB; on 32MB
logs I can see it capturing all 16 fsmark processes while waiting
for the CIL context to switch. This will give you an idea of the
log cycles the capture is occurring on, and the count of processes
being captured:

$ sudo trace-cmd show | awk -e '/^ / {print $23}' | sort -n |uniq -c
     16 251
     32 475
     16 494
     32 870
     15 1132
     15 1166
     14 1221
      1 1222
     16 1223
      7 1307
      8 1308
     16 1315
     16 1738
     16 1832
      9 2167
      7 2168
     16 2200
     16 2375
     16 2383
     16 2700
     16 2797
     16 2798
     16 2892
$

So typically groups of captures are hundreds of log cycles apart
(100 cycles x 32MB = ~3GB of log writes), then there will be a
stutter where the CIL dispatch is delayed, and then everything
continues on. These all show the log is always around 75% full
(the AIL tail pushing threshold) but the reservation grant wait lists are
always empty so we're not running out of reservation space here.

If I make the log even smaller - 16MB - the log is always full, the
AIL is always tail pushing, and there is a constant stream of log
forces (30-40/s) because tail pushing is hitting pinned items
several thousand times a second.  IOWs, the frequency of the log
forces means the CIL is almost never growing large enough to do a
background push, let alone overrun the blocking threshold. Same
trace for the same workload as above:

$ sudo trace-cmd show | awk -e '/^ / {print $23}' | sort -n |uniq -c
     16 1400
     16 5284
     16 5624
     16 5778
     16 6159
     10 6477
$

So when we have lots of concurrency and modification, tiny logs
appear to be less susceptible to CIL overruns than small logs
because they are constantly tail pushing and issuing log forces that
trigger flushes of the CIL before an overrun can occur.

Cheers,

Dave.
Brian Foster Oct. 1, 2019, 1:13 p.m. UTC | #4
On Tue, Oct 01, 2019 at 07:53:36AM +1000, Dave Chinner wrote:
> On Mon, Sep 30, 2019 at 01:03:58PM -0400, Brian Foster wrote:
> > On Mon, Sep 30, 2019 at 04:03:44PM +1000, Dave Chinner wrote:
> > > From: Dave Chinner <dchinner@redhat.com>
> > > 
> > > In certain situations the background CIL push can be indefinitely
> > > delayed. While we have workarounds from the obvious cases now, it
> > > doesn't solve the underlying issue. This issue is that there is no
> > > upper limit on the CIL where we will either force or wait for
> > > a background push to start, hence allowing the CIL to grow without
> > > bound until it consumes all log space.
> > > 
> > > To fix this, add a new wait queue to the CIL which allows background
> > > pushes to wait for the CIL context to be switched out. This happens
> > > when the push starts, so it will allow us to block incoming
> > > transaction commit completion until the push has started. This will
> > > only affect processes that are running modifications, and only when
> > > the CIL threshold has been significantly overrun.
> > > 
> > > This has no apparent impact on performance, and doesn't even trigger
> > > until over 45 million inodes had been created in a 16-way fsmark
> > > test on a 2GB log. That was limiting at 64MB of log space used, so
> > > the active CIL size is only about 3% of the total log in that case.
> > > The concurrent removal of those files did not trigger the background
> > > sleep at all.
> > > 
> > 
> > Have you done similar testing for small/minimum sized logs?
> 
> Yes. I've had the tracepoint active during xfstests runs on test
> filesystems using default log sizes on 5-15GB filesystems. The only
> test in all of xfstests that has triggered it is generic/017, and it
> only triggered once.
> 

Ok, interesting. I guess it would be nice if we had a test that somehow
or another more effectively exercised this codepath.

> e.g.
> 
> # trace-cmd start -e xfs_log_cil_wait
> <run xfstests>
> # trace-cmd show
> # tracer: nop
> #
> # entries-in-buffer/entries-written: 1/1   #P:4
> #
> #                              _-----=> irqs-off
> #                             / _----=> need-resched
> #                            | / _---=> hardirq/softirq
> #                            || / _--=> preempt-depth
> #                            ||| /     delay
> #           TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
> #              | |       |   ||||       |         |
>           xfs_io-2158  [001] ...1   309.285959: xfs_log_cil_wait: dev 8:96 t_ocnt 1 t_cnt 1 t_curr_res 67956 t_unit_res 67956 t_flags XLOG_TIC_INITED reserveq empty writeq empty grant_reserve_cycle 75 grant_reserve_bytes 12878480 grant_write_cycle 75 grant_write_bytes 12878480 curr_cycle 75 curr_block 10448 tail_cycle 75 tail_block 3560
> #
> 
> And the timestamp matched the time that generic/017 was running.
> 
> > Also, if this is so limited in occurrence, had you given any thought to
> > something even more simple like flushing the CIL push workqueue when
> > over the throttle threshold?
> 
> Yes, I've tried that - flush_workqueue() blocks /everything/ until
> all the queued work is complete. That means it waits for the CIL to
> flush to disk and write a commit record, and every modification in
> the filesystem is stopped dead in it's tracks until the CIL push is
> complete.
> 
> The problem is that flushing workqueues is a synchronous operation,
> and it can't wait for partial work completion. We only need to wait
> for the CIL context to be swapped out - this is done by the push
> code before it starts all the expensive iclog formating and waiting
> for iclog space...
> 

I know it waits on the work to complete. I poked around a bit for an
interface to "kick" a workqueue, so to speak (as opposed to flush), but
I didn't see anything (not surprisingly, since it probably wouldn't be a
broadly useful mechanism).

That aside, where would this wait on the CIL to flush to disk? AFAICT
the only thing that might happen here is log buffer submission. That
only happens when the log buffer is full (regardless of the current CIL
push writing its commit record). In fact if we did wait on I/O anywhere
in here, then that suggests potential problems with async log force.

> > That would wait longer to fill the iclogs,
> > but would eliminate the need for another waitqueue that's apparently
> > only used for throttling. It may also facilitate reuse of
> > xlog_cil_push_now() in the >= XLOG_CIL_SPACE_BLOCKING_LIMIT() case
> > (actually, I think this could facilitate the elimination of
> > xlog_cil_push_background() entirely).
> 
> xlog_cil_push_now() uses flush_work() to push any pending work
> before it queues up the CIL flush that the caller is about to wait
> for.  i.e. the callers of xlog_cil_push_now() must ensure that all
> CIL contexts are flushed for the purposes of a log force as they are
> going to wait for all pending CIL flushes to complete. If we've
> already pushed the CIL to the sequence that we are asking to push
> to, we still have to wait for that previous push to be
> done. This is what the flush_work() call in xlog_cil_push_now()
> acheives.
> 

Yes, I'm just exploring potential to reuse this code..

> > That aside...
> > 
> > > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > > ---
> > >  fs/xfs/xfs_log_cil.c  | 37 +++++++++++++++++++++++++++++++++----
> > >  fs/xfs/xfs_log_priv.h | 24 ++++++++++++++++++++++++
> > >  fs/xfs/xfs_trace.h    |  1 +
> > >  3 files changed, 58 insertions(+), 4 deletions(-)
> > > 
> > > diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
> > > index ef652abd112c..4a09d50e1368 100644
> > > --- a/fs/xfs/xfs_log_cil.c
> > > +++ b/fs/xfs/xfs_log_cil.c
> > > @@ -670,6 +670,11 @@ xlog_cil_push(
> > >  	push_seq = cil->xc_push_seq;
> > >  	ASSERT(push_seq <= ctx->sequence);
> > >  
> > > +	/*
> > > +	 * Wake up any background push waiters now this context is being pushed.
> > > +	 */
> > > +	wake_up_all(&ctx->push_wait);
> > > +
> > >  	/*
> > >  	 * Check if we've anything to push. If there is nothing, then we don't
> > >  	 * move on to a new sequence number and so we have to be able to push
> > > @@ -746,6 +751,7 @@ xlog_cil_push(
> > >  	 */
> > >  	INIT_LIST_HEAD(&new_ctx->committing);
> > >  	INIT_LIST_HEAD(&new_ctx->busy_extents);
> > > +	init_waitqueue_head(&new_ctx->push_wait);
> > >  	new_ctx->sequence = ctx->sequence + 1;
> > >  	new_ctx->cil = cil;
> > >  	cil->xc_ctx = new_ctx;
> > > @@ -900,7 +906,7 @@ xlog_cil_push_work(
> > >   */
> > >  static void
> > >  xlog_cil_push_background(
> > > -	struct xlog	*log)
> > > +	struct xlog	*log) __releases(cil->xc_ctx_lock)
> > >  {
> > >  	struct xfs_cil	*cil = log->l_cilp;
> > >  
> > > @@ -914,14 +920,36 @@ xlog_cil_push_background(
> > >  	 * don't do a background push if we haven't used up all the
> > >  	 * space available yet.
> > >  	 */
> > > -	if (cil->xc_ctx->space_used < XLOG_CIL_SPACE_LIMIT(log))
> > > +	if (cil->xc_ctx->space_used < XLOG_CIL_SPACE_LIMIT(log)) {
> > > +		up_read(&cil->xc_ctx_lock);
> > >  		return;
> > > +	}
> > >  
> > >  	spin_lock(&cil->xc_push_lock);
> > >  	if (cil->xc_push_seq < cil->xc_current_sequence) {
> > >  		cil->xc_push_seq = cil->xc_current_sequence;
> > >  		queue_work(log->l_mp->m_cil_workqueue, &cil->xc_push_work);
> > >  	}
> > > +
> > > +	/*
> > > +	 * Drop the context lock now, we can't hold that if we need to sleep
> > > +	 * because we are over the blocking threshold. The push_lock is still
> > > +	 * held, so blocking threshold sleep/wakeup is still correctly
> > > +	 * serialised here.
> > > +	 */
> > > +	up_read(&cil->xc_ctx_lock);
> > > +
> > > +	/*
> > > +	 * If we are well over the space limit, throttle the work that is being
> > > +	 * done until the push work on this context has begun.
> > > +	 */
> > > +	if (cil->xc_ctx->space_used >= XLOG_CIL_BLOCKING_SPACE_LIMIT(log)) {
> > 
> > Can we use consistent logic with the rest of the function? I.e.
> > 
> > 	up_read(..);
> > 	if (space_used < XLOG_CIL_BLOCKING_SPACE_LIMIT(log)) {
> > 		spin_unlock(..);
> > 		return;
> > 	}
> 
> I did it this way because code inside the if() statement is
> typically pushed out of line by the compiler if it finished with a
> return statement (i.e. the compiler makes an "unlikely" static
> prediction). Hence it's best to put the slow path code inside thei
> if(), not the fast path code.
> 

It looks to me like that occurs regardless of whether there's a return
statement or not. AFAICT, this branching already happens in the fast exit
path of this function as well as for the queue_work() branch. So if it
matters that much, perhaps the rest of the function should change
appropriately...

> I can change it, but then we're into ugly likely()/unlikely() macro
> territory and I much prefer to structure the code so things like
> that are completely unnecessary.
> 

Hasn't the [un]likely() stuff been considered bogus for a while now?

Either way, we're already under spinlock at this point so this isn't a
fast exit path. We're talking about branching to an exit path that
still requires a function call (i.e. branch) to unlock the spinlock
before we return. Combine that with the rest of the function and I'm
having a hard time seeing how the factoring affects performance in
practice. Maybe it does, but I guess I'd want to see some data to grok
that at the very least.

> > 
> > 	...
> > 	xlog_wait(...);
> > 
> > > +		trace_xfs_log_cil_wait(log, cil->xc_ctx->ticket);
> > > +		ASSERT(cil->xc_ctx->space_used < log->l_logsize);
> > > +		xlog_wait(&cil->xc_ctx->push_wait, &cil->xc_push_lock);
> > > +		return;
> > > +	}
> > > +
> > 
> > Also, I find it slightly annoying that even with all of this locking
> > quirkiness we still manage to read ->space_used unprotected (now twice).
> > IMO, the simple thing would be to let xlog_cil_insert_items() return the
> > size immediately after the current transaction inserts and pass that
> > into this function as a parameter.
> 
> Yes, I noticed this and thought about it a bit, but it doesn't
> really matter if the read is racy. It never really has mattered
> because it's not a hard limit where accuracy is vitally important
> for correct functioning. Locking and/or atomics are really only
> necessary if we care about exact accuracy, but we care only about
> avoiding fast path locking overhead as much as possible here.
> 
> I'll deal with that in a separate patchset when I have some spare
> time.
> 

Fair enough, it's not technically a problem with this patch.

> > > + * To prevent the CIL from overflowing upper commit size bounds, we introduce a
> > > + * new threshold at which we block committing transactions until the background
> > > + * CIL commit commences and switches to a new context. While this is not a hard
> > > + * limit, it forces the process committing a transaction to the CIL to block and
> > > + * yeild the CPU, giving the CIL push work a chance to be scheduled and start
> > > + * work. This prevents a process running lots of transactions from overfilling
> > > + * the CIL because it is not yielding the CPU. We set the blocking limit at
> > > + * twice the background push space threshold so we keep in line with the AIL
> > > + * push thresholds.
> > > + *
> > > + * Note: this is not a -hard- limit as blocking is applied after the transaction
> > > + * is inserted into the CIL and the push has been triggered. It is largely a
> > > + * throttling mechanism that allows the CIL push to be scheduled and run. A hard
> > > + * limit will be difficult to implement without introducing global serialisation
> > > + * in the CIL commit fast path, and it's not at all clear that we actually need
> > > + * such hard limits given the ~7 years we've run without a hard limit before
> > > + * finding the first situation where a checkpoint size overflow actually
> > > + * occurred. Hence the simple throttle, and an ASSERT check to tell us that
> > > + * we've overrun the max size.
> > >   */
> > 
> > I appreciate the extra documentation here, but I think most of the
> > second paragraph is better commit log description material than
> > something worth lengthening this already huge comment for. I'd suggest
> > something like the following, but feel free to rework of course:
> 
> Darrick asked for me include some of the information I had in the
> commit description directly in the comment, so I did exaclty what he
> asked....
> 

I don't see any reasoning about why this implementation was chosen over
a hard limit in the commit log description (of this patch or the
previous version). FWIW, I think it makes perfect sense to include it in
the commit log description.

> >  *
> >  * Since the CIL push threshold only triggers a background push, a
> >  * second threshold triggers transaction commit blocking until the
> >  * background push initiates and switches the CIL context. The blocking
> >  * threshold is set to twice the background push threshold to keep in
> >  * line with AIL push thresholds. Note that this is not a CIL context
> >  * size limit. This is a throttle threshold to slow the growth of the
> >  * context and yield the CPU for a background push under overload
> >  * conditions.
> 
> Sure, but that doesn't tell us *why* it is implemented as a throttle
> threshold instead of a hard limit. This text tells us what the code
> does, not why it was implemented that way. I'll let darrick decide
> on this.
> 

I generally don't find value in code comments that explain why we didn't
do X or Y in code that ultimately does Z (that kind of thing belongs in
the commit log description IMO), but if somebody finds the content
useful then it's not a big deal to me.

Brian

> Cheers,
> 
> Dave.
> 
> 
> >  */
> > 
> > Brian
> > 
> > >  #define XLOG_CIL_SPACE_LIMIT(log)	\
> > >  	min_t(int, (log)->l_logsize >> 3, BBTOB(XLOG_TOTAL_REC_SHIFT(log)) << 4)
> > >  
> > > +#define XLOG_CIL_BLOCKING_SPACE_LIMIT(log)	\
> > > +	(XLOG_CIL_SPACE_LIMIT(log) * 2)
> > > +
> > >  /*
> > >   * ticket grant locks, queues and accounting have their own cachlines
> > >   * as these are quite hot and can be operated on concurrently.
> > > diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
> > > index eaae275ed430..e7087ede2662 100644
> > > --- a/fs/xfs/xfs_trace.h
> > > +++ b/fs/xfs/xfs_trace.h
> > > @@ -1011,6 +1011,7 @@ DEFINE_LOGGRANT_EVENT(xfs_log_regrant_reserve_sub);
> > >  DEFINE_LOGGRANT_EVENT(xfs_log_ungrant_enter);
> > >  DEFINE_LOGGRANT_EVENT(xfs_log_ungrant_exit);
> > >  DEFINE_LOGGRANT_EVENT(xfs_log_ungrant_sub);
> > > +DEFINE_LOGGRANT_EVENT(xfs_log_cil_wait);
> > >  
> > >  DECLARE_EVENT_CLASS(xfs_log_item_class,
> > >  	TP_PROTO(struct xfs_log_item *lip),
> > > -- 
> > > 2.23.0.rc1
> > > 
> > 
> 
> -- 
> Dave Chinner
> david@fromorbit.com
Brian Foster Oct. 1, 2019, 1:13 p.m. UTC | #5
On Tue, Oct 01, 2019 at 01:42:07PM +1000, Dave Chinner wrote:
> On Tue, Oct 01, 2019 at 07:53:36AM +1000, Dave Chinner wrote:
> > On Mon, Sep 30, 2019 at 01:03:58PM -0400, Brian Foster wrote:
> > > On Mon, Sep 30, 2019 at 04:03:44PM +1000, Dave Chinner wrote:
> > > > From: Dave Chinner <dchinner@redhat.com>
> > > > 
> > > > In certain situations the background CIL push can be indefinitely
> > > > delayed. While we have workarounds from the obvious cases now, it
> > > > doesn't solve the underlying issue. This issue is that there is no
> > > > upper limit on the CIL where we will either force or wait for
> > > > a background push to start, hence allowing the CIL to grow without
> > > > bound until it consumes all log space.
> > > > 
> > > > To fix this, add a new wait queue to the CIL which allows background
> > > > pushes to wait for the CIL context to be switched out. This happens
> > > > when the push starts, so it will allow us to block incoming
> > > > transaction commit completion until the push has started. This will
> > > > only affect processes that are running modifications, and only when
> > > > the CIL threshold has been significantly overrun.
> > > > 
> > > > This has no apparent impact on performance, and doesn't even trigger
> > > > until over 45 million inodes had been created in a 16-way fsmark
> > > > test on a 2GB log. That was limiting at 64MB of log space used, so
> > > > the active CIL size is only about 3% of the total log in that case.
> > > > The concurrent removal of those files did not trigger the background
> > > > sleep at all.
> > > > 
> > > 
> > > Have you done similar testing for small/minimum sized logs?
> > 
> > Yes. I've had the tracepoint active during xfstests runs on test
> > filesystems using default log sizes on 5-15GB filesystems. The only
> > test in all of xfstests that has triggered it is generic/017, and it
> > only triggered once.
> > 
> > e.g.
> > 
> > # trace-cmd start -e xfs_log_cil_wait
> > <run xfstests>
> > # trace-cmd show
> > # tracer: nop
> > #
> > # entries-in-buffer/entries-written: 1/1   #P:4
> > #
> > #                              _-----=> irqs-off
> > #                             / _----=> need-resched
> > #                            | / _---=> hardirq/softirq
> > #                            || / _--=> preempt-depth
> > #                            ||| /     delay
> > #           TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
> > #              | |       |   ||||       |         |
> >           xfs_io-2158  [001] ...1   309.285959: xfs_log_cil_wait: dev 8:96 t_ocnt 1 t_cnt 1 t_curr_res 67956 t_unit_res 67956 t_flags XLOG_TIC_INITED reserveq empty writeq empty grant_reserve_cycle 75 grant_reserve_bytes 12878480 grant_write_cycle 75 grant_write_bytes 12878480 curr_cycle 75 curr_block 10448 tail_cycle 75 tail_block 3560
> > #
> > 
> > And the timestamp matched the time that generic/017 was running.
> 
> SO I've run this on my typical 16-way fsmark workload with different
> size logs. It barely triggers on log sizes larger than 64MB, on 32MB
> logs I can see it capturing all 16 fsmark processes while waiting
> for the CIL context to switch. This will give you an idea of the
> log cycles the capture is occuring on, and the count of processes
> being captured:
> 
> $ sudo trace-cmd show | awk -e '/^ / {print $23}' | sort -n |uniq -c
>      16 251
>      32 475
>      16 494
>      32 870
>      15 1132
>      15 1166
>      14 1221
>       1 1222
>      16 1223
>       7 1307
>       8 1308
>      16 1315
>      16 1738
>      16 1832
>       9 2167
>       7 2168
>      16 2200
>      16 2375
>      16 2383
>      16 2700
>      16 2797
>      16 2798
>      16 2892
> $

Thanks. I assume I'm looking at cycle numbers and event counts here?

> 
> So typically groups of captures are hundreds of log cycles apart
> (100 cycles x 32MB = ~3GB of log writes), then there will be a
> stutter where the CIL dispatch is delayed, and then everything
> continues on. These all show the log is always around the 75% full
> (AIL tail pushing theshold) but the reservation grant wait lists are
> always empty so we're not running out of reservation space here.
> 

It's somewhat interesting that we manage to block every thread most of
the time before the CIL push task starts. I wonder a bit if that pattern
would hold for a system/workload with more CPUs (and if so, if there are
any odd side effects of stalling and waking hundreds of tasks at the
same time vs. our traditional queuing behavior).

> If I make the log even smaller - 16MB - the log is always full, the
> AIL is always tail pushing, and there is a constant stream of log
> forces (30-40/s) because tail pushing is hitting pinned items
> several thousand times a second.  IOWs, the frequency of the log
> forces means that CIL is almost never growing large enough to do a
> background push, let alone overrun the blocking threshold. Same
> trace for the same workload as above:
> 
> $ sudo trace-cmd show | awk -e '/^ / {print $23}' | sort -n |uniq -c
>      16 1400
>      16 5284
>      16 5624
>      16 5778
>      16 6159
>      10 6477
> $
> 
> So when we have lots of concurrency and modification, tiny logs
> appear to be less susceptible to CIL overruns than small logs
> because they are constantly tail pushing and issuing log forces that
> trigger trigger flushes of the CIL before an overruns could occur.
> 

Makes sense, thanks.

Brian

> Cheers,
> 
> Dave.
> 
> -- 
> Dave Chinner
> david@fromorbit.com
Dave Chinner Oct. 1, 2019, 10:31 p.m. UTC | #6
On Tue, Oct 01, 2019 at 09:13:04AM -0400, Brian Foster wrote:
> On Tue, Oct 01, 2019 at 07:53:36AM +1000, Dave Chinner wrote:
> > On Mon, Sep 30, 2019 at 01:03:58PM -0400, Brian Foster wrote:
> > > Have you done similar testing for small/minimum sized logs?
> > 
> > Yes. I've had the tracepoint active during xfstests runs on test
> > filesystems using default log sizes on 5-15GB filesystems. The only
> > test in all of xfstests that has triggered it is generic/017, and it
> > only triggered once.
> > 
> 
> Ok, interesting. I guess it would be nice if we had a test that somehow
> or another more effectively exercised this codepath.

*nod*

But it's essentially difficult to predict in any way because
behaviour is not just a function of log size. :/

> > > Also, if this is so limited in occurrence, had you given any thought to
> > > something even more simple like flushing the CIL push workqueue when
> > > over the throttle threshold?
> > 
> > Yes, I've tried that - flush_workqueue() blocks /everything/ until
> > all the queued work is complete. That means it waits for the CIL to
> > flush to disk and write a commit record, and every modification in
> > the filesystem is stopped dead in it's tracks until the CIL push is
> > complete.
> > 
> > The problem is that flushing workqueues is a synchronous operation,
> > and it can't wait for partial work completion. We only need to wait
> > for the CIL context to be swapped out - this is done by the push
> > code before it starts all the expensive iclog formating and waiting
> > for iclog space...
> > 
> 
> I know it waits on the work to complete. I poked around a bit for an
> interface to "kick" a workqueue, so to speak (as opposed to flush), but
> I didn't see anything (not surprisingly, since it probably wouldn't be a
> broadly useful mechanism).

*nod*

> That aside, where would this wait on the CIL to flush to disk? AFAICT
> the only thing that might happen here is log buffer submission. That
> only happens when the log buffer is full (regardless of the current CIL
> push writing its commit record). In fact if we did wait on I/O anywhere
> in here, then that suggests potential problems with async log force.

There is no such thing as an "async log force". The log force always
waits on the CIL flush - XFS_LOG_SYNC only defines whether it waits
on all iclogbufs post CIL flush to hit the disk.

Further, when the CIL flushes, it's normally flushing more metadata than we
can hold in 8x iclogbufs. The default is 32kB iclogbufs, so if we've
got more than 256kB of checkpoint data to be written, then we end up
waiting on iclogbuf completion to write more than 256kB.

Typically I see a sustained ratio of roughly 4 iclog writes to 1
noiclogbufs in the metric graphs on small logs - just measured 700
log writes/s, 250 noiclogs/s for a 64MB log and 256kB logbuf size.
IOWs, CIL flushes are regularly waiting in xlog_get_iclog_space() on
iclog IO completion to occur...

> > > That would wait longer to fill the iclogs,
> > > but would eliminate the need for another waitqueue that's apparently
> > > only used for throttling. It may also facilitate reuse of
> > > xlog_cil_push_now() in the >= XLOG_CIL_SPACE_BLOCKING_LIMIT() case
> > > (actually, I think this could facilitate the elimination of
> > > xlog_cil_push_background() entirely).
> > 
> > xlog_cil_push_now() uses flush_work() to push any pending work
> > before it queues up the CIL flush that the caller is about to wait
> > for.  i.e. the callers of xlog_cil_push_now() must ensure that all
> > CIL contexts are flushed for the purposes of a log force as they are
> > going to wait for all pending CIL flushes to complete. If we've
> > already pushed the CIL to the sequence that we are asking to push
> > to, we still have to wait for that previous push to be
> > done. This is what the flush_work() call in xlog_cil_push_now()
> > acheives.
> > 
> 
> Yes, I'm just exploring potential to reuse this code..

Yeah, I have a few prototype patches for revamping this, including
an actual async CIL flush. I did some work here, but it didn't solve
any of the problems I needed to fix so I put it aside. See below.

> > > > +	/*
> > > > +	 * If we are well over the space limit, throttle the work that is being
> > > > +	 * done until the push work on this context has begun.
> > > > +	 */
> > > > +	if (cil->xc_ctx->space_used >= XLOG_CIL_BLOCKING_SPACE_LIMIT(log)) {
> > > 
> > > Can we use consistent logic with the rest of the function? I.e.
> > > 
> > > 	up_read(..);
> > > 	if (space_used < XLOG_CIL_BLOCKING_SPACE_LIMIT(log)) {
> > > 		spin_unlock(..);
> > > 		return;
> > > 	}
> > 
> > I did it this way because code inside the if() statement is
> > typically pushed out of line by the compiler if it finished with a
> > return statement (i.e. the compiler makes an "unlikely" static
> > prediction). Hence it's best to put the slow path code inside thei
> > if(), not the fast path code.
> > 
> 
> It looks to me that occurs regardless of whether there's a return
> statement or not. AFAICT, this branching aready happens in the fast exit
> path of this function as well as for the queue_work() branch. So if it
> matters that much, perhaps the rest of the function should change
> appropriately..
> 
> > I can change it, but then we're into ugly likely()/unlikely() macro
> > territory and I much prefer to structure the code so things like
> > that are completely unnecessary.
> > 
> 
> Hasn't the [un]likely() stuff been considered bogus for a while now?

Yes, mainly because they are often wrong and often the compiler and
hardware do a much better job with static and/or dynamic prediction.
Nobody ever profiles the branch predictions to check the manual
annotations are correct and valid, and code can change over time
meaning the static prediction is wrong...
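
For reference, the annotations being dismissed here are the kernel's branch
prediction hints from <linux/compiler.h>, e.g.:

	/* tell the compiler this branch is expected to be rarely taken */
	if (unlikely(space_used >= blocking_limit))
		throttle();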

> Either way, we're already under spinlock at this point so this isn't a
> fast exit path. We're talking about branching to an exit path that
> still requires a function call (i.e. branch) to unlock the spinlock
> before we return. Combine that with the rest of the function and I'm
> having a hard time seeing how the factoring affects performance in
> practice. Maybe it does, but I guess I'd want to see some data to grok
> that the very least.

I'm not wedded to a specific layout, just explaining the reasoning.
I'll change it for the next revision...

Cheers,

Dave.
Dave Chinner Oct. 1, 2019, 11:14 p.m. UTC | #7
On Tue, Oct 01, 2019 at 09:13:36AM -0400, Brian Foster wrote:
> On Tue, Oct 01, 2019 at 01:42:07PM +1000, Dave Chinner wrote:
> > On Tue, Oct 01, 2019 at 07:53:36AM +1000, Dave Chinner wrote:
> > > On Mon, Sep 30, 2019 at 01:03:58PM -0400, Brian Foster wrote:
> > > > Have you done similar testing for small/minimum sized logs?
> > > 
> > > Yes. I've had the tracepoint active during xfstests runs on test
> > > filesystems using default log sizes on 5-15GB filesystems. The only
> > > test in all of xfstests that has triggered it is generic/017, and it
> > > only triggered once.
> > > 
> > > e.g.
> > > 
> > > # trace-cmd start -e xfs_log_cil_wait
> > > <run xfstests>
> > > # trace-cmd show
> > > # tracer: nop
> > > #
> > > # entries-in-buffer/entries-written: 1/1   #P:4
> > > #
> > > #                              _-----=> irqs-off
> > > #                             / _----=> need-resched
> > > #                            | / _---=> hardirq/softirq
> > > #                            || / _--=> preempt-depth
> > > #                            ||| /     delay
> > > #           TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
> > > #              | |       |   ||||       |         |
> > >           xfs_io-2158  [001] ...1   309.285959: xfs_log_cil_wait: dev 8:96 t_ocnt 1 t_cnt 1 t_curr_res 67956 t_unit_res 67956 t_flags XLOG_TIC_INITED reserveq empty writeq empty grant_reserve_cycle 75 grant_reserve_bytes 12878480 grant_write_cycle 75 grant_write_bytes 12878480 curr_cycle 75 curr_block 10448 tail_cycle 75 tail_block 3560
> > > #
> > > 
> > > And the timestamp matched the time that generic/017 was running.
> > 
> > SO I've run this on my typical 16-way fsmark workload with different
> > size logs. It barely triggers on log sizes larger than 64MB, on 32MB
> > logs I can see it capturing all 16 fsmark processes while waiting
> > for the CIL context to switch. This will give you an idea of the
> > log cycles the capture is occuring on, and the count of processes
> > being captured:
> > 
> > $ sudo trace-cmd show | awk -e '/^ / {print $23}' | sort -n |uniq -c
> >      16 251
[snip]
> >      16 2892
> > $
> 
> Thanks. I assume I'm looking at cycle numbers and event counts here?

Yes.

> > So typically groups of captures are hundreds of log cycles apart
> > (100 cycles x 32MB = ~3GB of log writes), then there will be a
> > stutter where the CIL dispatch is delayed, and then everything
> > continues on. These all show the log is always around the 75% full
> > (AIL tail pushing threshold) but the reservation grant wait lists are
> > always empty so we're not running out of reservation space here.
> > 
> 
> It's somewhat interesting that we manage to block every thread most of
> the time before the CIL push task starts. I wonder a bit if that pattern
> would hold for a system/workload with more CPUs (and if so, if there are
> any odd side effects of stalling and waking hundreds of tasks at the
> same time vs. our traditional queuing behavior).

If I increase the concurrency (e.g. 16->32 threads for fsmark on a
64MB log), we hammer the spinlock on the grant head -hard-. i.e. CPU
usage goes up by 40%, performance goes down by 50%, and all that CPU
time is spent spinning on the reserve grant head lock. Basically,
the log reservation space runs out, and we end up queuing on the
reservation grant head and then we get reminded of just how bad
having a serialisation point in the reservation fast path actually
is for scalability...

+   41.05%    37.93%  [kernel]            [k] __pv_queued_spin_lock_slowpath
....
-   30.37%     0.02%  [kernel]            [k] xlog_grant_head_check
   - 17.98% xlog_grant_head_check
      - 14.17% _raw_spin_lock
         + 14.47% __pv_queued_spin_lock_slowpath
      + 8.39% xlog_grant_head_wait
      + 1.17% xlog_grant_head_wake
+   26.47%     0.10%  [kernel]            [k] __xfs_trans_commit
-   26.06%     1.15%  [kernel]            [k] xfs_log_commit_cil
   - 15.69% xfs_log_commit_cil
      - 9.90% xfs_log_done
         + 9.83% xfs_log_space_wake
      + 2.28% xfs_buf_item_format
      + 0.57% _raw_spin_lock
-   16.50%     0.01%  [kernel]            [k] xfs_log_done
   - 9.90% xfs_log_done
      - 9.84% xfs_log_space_wake
         + 13.63% _raw_spin_lock
         + 1.39% xlog_grant_head_wake
-   16.22%     0.02%  [kernel]            [k] xfs_log_space_wake
   - 9.83% xfs_log_space_wake
      + 13.64% _raw_spin_lock
      + 1.40% xlog_grant_head_wake

So, essentially, increasing the concurrency on a small log quickly
hits the log reservation limits (because tail pushing!) and that
largely prevents the CIL from overrunning, too. Compared to the
16-way/64MB log output I snipped above, here's the 32-way CIL
blocking:

$ sudo trace-cmd show | awk -e '/^ / {print $23}' | sort -n |uniq -c
     30 994
     28 1610
      4 1611
$

Much less, but it still tends to block most threads when it occurs.
I think we can largely ignore the potential thundering herd here
because it seems quite rare compared to the amount of throttling
being done by the grant head...

Larger logs block more threads on the CIL throttle, but the 32MB CIL
window can soak up hundreds of max sized transaction reservations
before overflowing so even running several hundred concurrent
modification threads I haven't been able to drive enough concurrency
through the CIL to see any sort of adverse behaviour.  And the
workloads are running pretty consistently at less than 5,000 context
switches/sec so there's no evidence of repeated thundering herd
wakeup problems, either.
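
For reference, the shape of the throttle being discussed is roughly the
following. This is a stripped-down sketch rather than the exact patch -
the wait queue field name (xc_push_wait here) and the surrounding
background push queueing are simplified:

/*
 * Sketch only: once the current CIL context is well over the blocking
 * limit, sleep until the push worker has swapped the context out.
 */
static void
xlog_cil_push_background(
	struct xlog	*log)
{
	struct xfs_cil	*cil = log->l_cilp;

	spin_lock(&cil->xc_push_lock);

	/* ... queueing of the normal background push work elided ... */

	if (cil->xc_ctx->space_used >= XLOG_CIL_BLOCKING_SPACE_LIMIT(log)) {
		trace_xfs_log_cil_wait(log, cil->xc_ctx->ticket);
		/* xlog_wait() drops xc_push_lock before sleeping */
		xlog_wait(&cil->xc_push_wait, &cil->xc_push_lock);
		return;
	}

	spin_unlock(&cil->xc_push_lock);
}

The wakeup comes from the push worker as soon as it has swapped in a
new context, which is why the captured sleeps above are only a couple
of hundred microseconds long.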

Cheers,

Dave.
Brian Foster Oct. 2, 2019, 12:40 p.m. UTC | #8
On Wed, Oct 02, 2019 at 08:31:07AM +1000, Dave Chinner wrote:
> On Tue, Oct 01, 2019 at 09:13:04AM -0400, Brian Foster wrote:
> > On Tue, Oct 01, 2019 at 07:53:36AM +1000, Dave Chinner wrote:
> > > On Mon, Sep 30, 2019 at 01:03:58PM -0400, Brian Foster wrote:
> > > > Have you done similar testing for small/minimum sized logs?
> > > 
> > > Yes. I've had the tracepoint active during xfstests runs on test
> > > filesystems using default log sizes on 5-15GB filesystems. The only
> > > test in all of xfstests that has triggered it is generic/017, and it
> > > only triggered once.
> > > 
> > 
> > Ok, interesting. I guess it would be nice if we had a test that somehow
> > or another more effectively exercised this codepath.
> 
> *nod*
> 
> But it's essentially difficult to predict in any way because
> behaviour is not just a function of log size. :/
> 
> > > > Also, if this is so limited in occurrence, had you given any thought to
> > > > something even more simple like flushing the CIL push workqueue when
> > > > over the throttle threshold?
> > > 
> > > Yes, I've tried that - flush_workqueue() blocks /everything/ until
> > > all the queued work is complete. That means it waits for the CIL to
> > > flush to disk and write a commit record, and every modification in
> > > the filesystem is stopped dead in its tracks until the CIL push is
> > > complete.
> > > 
> > > The problem is that flushing workqueues is a synchronous operation,
> > > and it can't wait for partial work completion. We only need to wait
> > > for the CIL context to be swapped out - this is done by the push
> > > code before it starts all the expensive iclog formatting and waiting
> > > for iclog space...
> > > 
> > 
> > I know it waits on the work to complete. I poked around a bit for an
> > interface to "kick" a workqueue, so to speak (as opposed to flush), but
> > I didn't see anything (not surprisingly, since it probably wouldn't be a
> > broadly useful mechanism).
> 
> *nod*
> 
> > That aside, where would this wait on the CIL to flush to disk? AFAICT
> > the only thing that might happen here is log buffer submission. That
> > only happens when the log buffer is full (regardless of the current CIL
> > push writing its commit record). In fact if we did wait on I/O anywhere
> > in here, then that suggests potential problems with async log force.
> 
> There is no such thing as a "async log force". The log force always
> waits on the CIL flush - XFS_LOG_SYNC only defines whether it waits
> on all iclogbufs post CIL flush to hit the disk.
> 

I'm just referring to the semantics/naming of the existing interface. I
suppose I could have used "a log force that doesn't wait on all
iclogbufs to hit the disk," but that doesn't quite roll off the tongue.
;)

> Further, when the CIL flushes, it's normally flushing more metadata than we
> can hold in 8x iclogbufs. The default is 32kB iclogbufs, so if we've
> got more than 256kB of checkpoint data to be written, then we end up
> waiting on iclogbuf completion to write more than 256kB.
> 
> Typically I see a sustained ratio of roughly 4 iclog writes to 1
> noiclogbufs in the metric graphs on small logs - just measured 700
> log writes/s, 250 noiclogs/s for a 64MB log and 256kB logbuf size.
> IOWs, CIL flushes are regularly waiting in xlog_get_iclog_space() on
> iclog IO completion to occur...
> 

Ok, that's not quite what I was concerned about when you mentioned
waiting on the CIL to flush to disk. No matter, the important bit here
is the performance cost of including the extra blocking on log I/O (to
cycle iclogs back to active for reuse) in the throttle.

I'm curious about how noticeable this extra blocking would be because
it's one likely cause of the CIL pressure buildup in the first place. My
original tests reproduced huge CIL checkpoints purely based on one CIL
push being blocked behind the processing of another, the latter taking
relatively more time due to log I/O.

This is not to say there aren't other causes of excessively sized
checkpoints. Rather, if we're at a point where we've blocked
transactions on this new threshold, that means we've doubled the
background threshold in the time we've first triggered a background CIL
push and the push actually started. From that, it seems fairly likely
that we could replenish the CIL to the background threshold once
threads are unblocked but before the previous push completes.

The question is: can we get all the way to the blocking threshold before
that happens? That doesn't seem unrealistic to me, but it's hard to
reason about without having tested it. If so, I think it means we end up
blocking on completion of the first push to some degree anyways.

> > > > That would wait longer to fill the iclogs,
> > > > but would eliminate the need for another waitqueue that's apparently
> > > > only used for throttling. It may also facilitate reuse of
> > > > xlog_cil_push_now() in the >= XLOG_CIL_SPACE_BLOCKING_LIMIT() case
> > > > (actually, I think this could facilitate the elimination of
> > > > xlog_cil_push_background() entirely).
> > > 
> > > xlog_cil_push_now() uses flush_work() to push any pending work
> > > before it queues up the CIL flush that the caller is about to wait
> > > for.  i.e. the callers of xlog_cil_push_now() must ensure that all
> > > CIL contexts are flushed for the purposes of a log force as they are
> > > going to wait for all pending CIL flushes to complete. If we've
> > > already pushed the CIL to the sequence that we are asking to push
> > > to, we still have to wait for that previous push to be
> > > done. This is what the flush_work() call in xlog_cil_push_now()
> > > achieves.
> > > 
> > 
> > Yes, I'm just exploring potential to reuse this code..
> 
> Yeah, I have a few prototype patches for revamping this, including
> an actual async CIL flush. I did some work here, but it didn't solve
> any of the problems I needed to fix so I put it aside. See below.
> 

That sounds more involved than what I was thinking. My thought is that
this throttle is already not predictable or deterministic (i.e. we're
essentially waiting on a scheduler event) and so might not require the
extra complexity of a new waitqueue. It certainly could be the case that
blocking on the entire push is just too long in practice, but since this
is already based on empirical evidence and subject to unpredictability,
ISTM that testing is the only way to know for sure. For reference, I
hacked up something to reuse xlog_cil_push_now() for background pushing
and throttling that ends up removing 20 or so lines of code by the time
it's in place, but I haven't given it any testing.
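
Roughly, the shape of what I hacked up is the following (untested, and
just a sketch of the idea - macro names and the exact locking context
are glossed over; in particular this assumes the CIL context lock has
already been dropped before we block):

/*
 * Untested sketch: reuse xlog_cil_push_now() for the background push
 * and throttle by flushing the push work once we're well over the
 * blocking threshold, so the committer waits on the whole push rather
 * than on a dedicated wait queue.
 */
static void
xlog_cil_push_background(
	struct xlog	*log)
{
	struct xfs_cil	*cil = log->l_cilp;
	bool		throttle;

	if (cil->xc_ctx->space_used < XLOG_CIL_SPACE_LIMIT(log))
		return;

	throttle = cil->xc_ctx->space_used >=
					XLOG_CIL_BLOCKING_SPACE_LIMIT(log);

	/* kick the push work for the current sequence */
	xlog_cil_push_now(log, cil->xc_current_sequence);

	/* well over the limit: wait for the push work to complete */
	if (throttle)
		flush_work(&cil->xc_push_work);
}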

That said, this is just an observation and an idea. I'm fine with the
proposed implementation with the other nits and whatnot fixed up.

Brian

> > > > > +	/*
> > > > > +	 * If we are well over the space limit, throttle the work that is being
> > > > > +	 * done until the push work on this context has begun.
> > > > > +	 */
> > > > > +	if (cil->xc_ctx->space_used >= XLOG_CIL_BLOCKING_SPACE_LIMIT(log)) {
> > > > 
> > > > Can we use consistent logic with the rest of the function? I.e.
> > > > 
> > > > 	up_read(..);
> > > > 	if (space_used < XLOG_CIL_BLOCKING_SPACE_LIMIT(log)) {
> > > > 		spin_unlock(..);
> > > > 		return;
> > > > 	}
> > > 
> > > I did it this way because code inside the if() statement is
> > > typically pushed out of line by the compiler if it finished with a
> > > return statement (i.e. the compiler makes an "unlikely" static
> > > prediction). Hence it's best to put the slow path code inside the
> > > if(), not the fast path code.
> > > 
> > 
> > It looks to me that occurs regardless of whether there's a return
> > statement or not. AFAICT, this branching already happens in the fast exit
> > path of this function as well as for the queue_work() branch. So if it
> > matters that much, perhaps the rest of the function should change
> > appropriately..
> > 
> > > I can change it, but then we're into ugly likely()/unlikely() macro
> > > territory and I much prefer to structure the code so things like
> > > that are completely unnecessary.
> > > 
> > 
> > Hasn't the [un]likely() stuff been considered bogus for a while now?
> 
> Yes, mainly because they are often wrong, and often the compiler and
> hardware do a much better job with static and/or dynamic prediction.
> Nobody ever profiles the branch predictions to check the manual
> annotations are correct and valid, and code can change over time
> meaning the static prediction is wrong...
> 
> > Either way, we're already under spinlock at this point so this isn't a
> > fast exit path. We're talking about branching to an exit path that
> > still requires a function call (i.e. branch) to unlock the spinlock
> > before we return. Combine that with the rest of the function and I'm
> > having a hard time seeing how the factoring affects performance in
> > practice. Maybe it does, but I guess I'd want to see some data to grok
> > that the very least.
> 
> I'm not wedded to a specific layout, just explaining the reasoning.
> I'll change it for the next revision...
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> 
> xfs: Don't block tail pushing on CIL forces
> 
> From: Dave Chinner <dchinner@redhat.com>
> 
> When AIL tail pushing encounters a pinned item, it forces the log
> and waits for it to complete before it starts walking items again.
> However, in forcing the log it waits for all the dirty objects in
> memory to be written to the log, hence waiting for (potentially) a
> lot of log IO to complete when the CIL contains thousands of dirty
> objects. This can block processing of the AIL for seconds and
> prevent the tail of the log from being moved forward while the AIL
> is blocked.
> 
> To prevent the log tail pushing from waiting on log flushes, just
> trigger a CIL push from the xfsaild instead of a full log force.
> This will get the majority of the items pinned moving via the CIL
> workqueue and so won't cause the xfsaild to wait for it to complete
> before it can do its own work.
> 
> Further, the AIL has its own backoff waits, so move the CIL push
> to before the AIL backoff wait rather than afterwards, so the CIL
> has time to do some work before the AIL will start scanning again.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/xfs_log_cil.c   | 13 ++++++++-----
>  fs/xfs/xfs_log_priv.h  | 22 ++++++++++++++++++----
>  fs/xfs/xfs_trans_ail.c | 29 +++++++++++++++--------------
>  3 files changed, 41 insertions(+), 23 deletions(-)
> 
> diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
> index 2f4cfd5b707e..a713f58e9e86 100644
> --- a/fs/xfs/xfs_log_cil.c
> +++ b/fs/xfs/xfs_log_cil.c
> @@ -931,7 +931,7 @@ xlog_cil_push_background(
>   * @push_seq, but it won't be completed. The caller is expected to do any
>   * waiting for push_seq to complete if it is required.
>   */
> -static void
> +void
>  xlog_cil_push_now(
>  	struct xlog	*log,
>  	xfs_lsn_t	push_seq)
> @@ -941,21 +941,24 @@ xlog_cil_push_now(
>  	if (!cil)
>  		return;
>  
> -	ASSERT(push_seq && push_seq <= cil->xc_current_sequence);
> -
>  	/*
>  	 * If the CIL is empty or we've already pushed the sequence then
> -	 * there's no work we need to do.
> +	 * there's no work we need to do. If push_seq is not set, someone is
> +	 * just giving us the hurry up (e.g. to unpin objects without otherwise
> +	 * waiting on completion) so just push the current context.
>  	 */
>  	spin_lock(&cil->xc_push_lock);
> +	if (!push_seq)
> +		push_seq = cil->xc_current_sequence;
> +
>  	if (list_empty(&cil->xc_cil) || push_seq <= cil->xc_push_seq) {
>  		spin_unlock(&cil->xc_push_lock);
>  		return;
>  	}
>  
>  	cil->xc_push_seq = push_seq;
> -	spin_unlock(&cil->xc_push_lock);
>  	queue_work(log->l_mp->m_cil_workqueue, &cil->xc_push_work);
> +	spin_unlock(&cil->xc_push_lock);
>  }
>  
>  bool
> diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h
> index f66043853310..7842f923b3a1 100644
> --- a/fs/xfs/xfs_log_priv.h
> +++ b/fs/xfs/xfs_log_priv.h
> @@ -520,11 +520,25 @@ bool	xlog_cil_empty(struct xlog *log);
>  
>  /*
>   * CIL force routines
> + *
> + * If you just want to schedule a CIL context flush to get things moving but
> + * not wait for anything (not an integrity operation!), use
> + *
> + *		xlog_cil_push_now(log, 0);
> + *
> + * Blocking or integrity based CIL forces should not be called directly, but
> + * instead be marshalled through xfs_log_force(). i.e. if you want to:
> + *
> + *	- flush and wait for the entire CIL to be submitted to the log, use
> + *		xfs_log_force(mp, 0);
> + *
> + *	- flush and wait for the CIL to be stable on disk and all items to be
> + *	  inserted into the AIL, use
> + *		xfs_log_force(mp, XFS_LOG_SYNC);
>   */
> -xfs_lsn_t
> -xlog_cil_force_lsn(
> -	struct xlog *log,
> -	xfs_lsn_t sequence);
> +void xlog_cil_push_now(struct xlog *log, xfs_lsn_t sequence);
> +
> +xfs_lsn_t xlog_cil_force_lsn(struct xlog *log, xfs_lsn_t sequence);
>  
>  static inline void
>  xlog_cil_force(struct xlog *log)
> diff --git a/fs/xfs/xfs_trans_ail.c b/fs/xfs/xfs_trans_ail.c
> index 5802139f786b..c9d8d05c3594 100644
> --- a/fs/xfs/xfs_trans_ail.c
> +++ b/fs/xfs/xfs_trans_ail.c
> @@ -13,6 +13,7 @@
>  #include "xfs_mount.h"
>  #include "xfs_trans.h"
>  #include "xfs_trans_priv.h"
> +#include "xfs_log_priv.h"
>  #include "xfs_trace.h"
>  #include "xfs_errortag.h"
>  #include "xfs_error.h"
> @@ -381,20 +382,6 @@ xfsaild_push(
>  	int			flushing = 0;
>  	int			count = 0;
>  
> -	/*
> -	 * If we encountered pinned items or did not finish writing out all
> -	 * buffers the last time we ran, force the log first and wait for it
> -	 * before pushing again.
> -	 */
> -	if (ailp->ail_log_flush && ailp->ail_last_pushed_lsn == 0 &&
> -	    (!list_empty_careful(&ailp->ail_buf_list) ||
> -	     xfs_ail_min_lsn(ailp))) {
> -		ailp->ail_log_flush = 0;
> -
> -		XFS_STATS_INC(mp, xs_push_ail_flush);
> -		xfs_log_force(mp, XFS_LOG_SYNC);
> -	}
> -
>  	spin_lock(&ailp->ail_lock);
>  
>  	/* barrier matches the ail_target update in xfs_ail_push() */
> @@ -528,6 +515,20 @@ xfsaild_push(
>  		tout = 10;
>  	}
>  
> +	/*
> +	 * If we encountered pinned items or did not finish writing out all
> +	 * buffers the last time we ran, give the CIL a nudge to start
> +	 * unpinning the blocked items while we wait for a while...
> +	 */
> +	if (ailp->ail_log_flush && ailp->ail_last_pushed_lsn == 0 &&
> +	    (!list_empty_careful(&ailp->ail_buf_list) ||
> +	     xfs_ail_min_lsn(ailp))) {
> +		ailp->ail_log_flush = 0;
> +
> +		XFS_STATS_INC(mp, xs_push_ail_flush);
> +		xlog_cil_push_now(mp->m_log, 0);
> +	}
> +
>  	return tout;
>  }
>
Brian Foster Oct. 2, 2019, 12:41 p.m. UTC | #9
On Wed, Oct 02, 2019 at 09:14:33AM +1000, Dave Chinner wrote:
> On Tue, Oct 01, 2019 at 09:13:36AM -0400, Brian Foster wrote:
> > On Tue, Oct 01, 2019 at 01:42:07PM +1000, Dave Chinner wrote:
> > > On Tue, Oct 01, 2019 at 07:53:36AM +1000, Dave Chinner wrote:
> > > > On Mon, Sep 30, 2019 at 01:03:58PM -0400, Brian Foster wrote:
> > > > > Have you done similar testing for small/minimum sized logs?
> > > > 
> > > > Yes. I've had the tracepoint active during xfstests runs on test
> > > > filesystems using default log sizes on 5-15GB filesystems. The only
> > > > test in all of xfstests that has triggered it is generic/017, and it
> > > > only triggered once.
> > > > 
> > > > e.g.
> > > > 
> > > > # trace-cmd start -e xfs_log_cil_wait
> > > > <run xfstests>
> > > > # trace-cmd show
> > > > # tracer: nop
> > > > #
> > > > # entries-in-buffer/entries-written: 1/1   #P:4
> > > > #
> > > > #                              _-----=> irqs-off
> > > > #                             / _----=> need-resched
> > > > #                            | / _---=> hardirq/softirq
> > > > #                            || / _--=> preempt-depth
> > > > #                            ||| /     delay
> > > > #           TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
> > > > #              | |       |   ||||       |         |
> > > >           xfs_io-2158  [001] ...1   309.285959: xfs_log_cil_wait: dev 8:96 t_ocnt 1 t_cnt 1 t_curr_res 67956 t_unit_res 67956 t_flags XLOG_TIC_INITED reserveq empty writeq empty grant_reserve_cycle 75 grant_reserve_bytes 12878480 grant_write_cycle 75 grant_write_bytes 12878480 curr_cycle 75 curr_block 10448 tail_cycle 75 tail_block 3560
> > > > #
> > > > 
> > > > And the timestamp matched the time that generic/017 was running.
> > > 
> > > SO I've run this on my typical 16-way fsmark workload with different
> > > size logs. It barely triggers on log sizes larger than 64MB, on 32MB
> > > logs I can see it capturing all 16 fsmark processes while waiting
> > > for the CIL context to switch. This will give you an idea of the
> > > log cycles the capture is occurring on, and the count of processes
> > > being captured:
> > > 
> > > $ sudo trace-cmd show | awk -e '/^ / {print $23}' | sort -n |uniq -c
> > >      16 251
> [snip]
> > >      16 2892
> > > $
> > 
> > Thanks. I assume I'm looking at cycle numbers and event counts here?
> 
> Yes.
> 
> > > So typically groups of captures are hundreds of log cycles apart
> > > (100 cycles x 32MB = ~3GB of log writes), then there will be a
> > > stutter where the CIL dispatch is delayed, and then everything
> > > continues on. These all show the log is always around the 75% full
> > > (AIL tail pushing threshold) but the reservation grant wait lists are
> > > always empty so we're not running out of reservation space here.
> > > 
> > 
> > It's somewhat interesting that we manage to block every thread most of
> > the time before the CIL push task starts. I wonder a bit if that pattern
> > would hold for a system/workload with more CPUs (and if so, if there are
> > any odd side effects of stalling and waking hundreds of tasks at the
> > same time vs. our traditional queuing behavior).
> 
> If I increase the concurrency (e.g. 16->32 threads for fsmark on a
> 64MB log), we hammer the spinlock on the grant head -hard-. i.e. CPU
> usage goes up by 40%, performance goes down by 50%, and all that CPU
> time is spent spinning on the reserve grant head lock. Basically,
> the log reservation space runs out, and we end up queuing on the
> reservation grant head and then we get reminded of just how bad
> having a serialisation point in the reservation fast path actually
> is for scalability...
> 

The small log case is not really what I'm wondering about. Does this
behavior translate to a similar test with a maximum sized log?

...
> 
> Larger logs block more threads on the CIL throttle, but the 32MB CIL
> window can soak up hundreds of max sized transaction reservations
> before overflowing so even running several hundred concurrent
> modification threads I haven't been able to drive enough concurrency
> through the CIL to see any sort of adverse behaviour.  And the
> workloads are running pretty consistently at less than 5,000 context
> switches/sec so there's no evidence of repeated thundering herd
> wakeup problems, either.
> 

That speaks to the rarity of the throttle, which is good. But I'm
wondering, for example, what might happen on systems where we could have
hundreds of physical CPUs committing to the CIL, we block them all on
the throttle and then wake them all at once. IOW, can we potentially
create the contention conditions you reproduce above in scenarios where
they might not have existed before?

Brian

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
Dave Chinner Oct. 3, 2019, 12:53 a.m. UTC | #10
On Wed, Oct 02, 2019 at 08:40:56AM -0400, Brian Foster wrote:
> On Wed, Oct 02, 2019 at 08:31:07AM +1000, Dave Chinner wrote:
> > On Tue, Oct 01, 2019 at 09:13:04AM -0400, Brian Foster wrote:
> > > On Tue, Oct 01, 2019 at 07:53:36AM +1000, Dave Chinner wrote:
> > > > On Mon, Sep 30, 2019 at 01:03:58PM -0400, Brian Foster wrote:
> > > > > Have you done similar testing for small/minimum sized logs?
> > > > 
> > > > Yes. I've had the tracepoint active during xfstests runs on test
> > > > filesystems using default log sizes on 5-15GB filesystems. The only
> > > > test in all of xfstests that has triggered it is generic/017, and it
> > > > only triggered once.
> > > > 
> > > 
> > > Ok, interesting. I guess it would be nice if we had a test that somehow
> > > or another more effectively exercised this codepath.
> > 
> > *nod*
> > 
> > But it's essentially difficult to predict in any way because
> > behaviour is not just a function of log size. :/
> > 
> > > > > Also, if this is so limited in occurrence, had you given any thought to
> > > > > something even more simple like flushing the CIL push workqueue when
> > > > > over the throttle threshold?
> > > > 
> > > > Yes, I've tried that - flush_workqueue() blocks /everything/ until
> > > > all the queued work is complete. That means it waits for the CIL to
> > > > flush to disk and write a commit record, and every modification in
> > > > the filesystem is stopped dead in its tracks until the CIL push is
> > > > complete.
> > > > 
> > > > The problem is that flushing workqueues is a synchronous operation,
> > > > and it can't wait for partial work completion. We only need to wait
> > > > for the CIL context to be swapped out - this is done by the push
> > > > code before it starts all the expensive iclog formatting and waiting
> > > > for iclog space...
> > > > 
> > > 
> > > I know it waits on the work to complete. I poked around a bit for an
> > > interface to "kick" a workqueue, so to speak (as opposed to flush), but
> > > I didn't see anything (not surprisingly, since it probably wouldn't be a
> > > broadly useful mechanism).
> > 
> > *nod*
> > 
> > > That aside, where would this wait on the CIL to flush to disk? AFAICT
> > > the only thing that might happen here is log buffer submission. That
> > > only happens when the log buffer is full (regardless of the current CIL
> > > push writing its commit record). In fact if we did wait on I/O anywhere
> > > in here, then that suggests potential problems with async log force.
> > 
> > There is no such thing as a "async log force". The log force always
> > waits on the CIL flush - XFS_LOG_SYNC only defines whether it waits
> > on all iclogbufs post CIL flush to hit the disk.
> > 
> 
> I'm just referring to the semantics/naming of the existing interface. I
> suppose I could have used "a log force that doesn't wait on all
> iclogbufs to hit the disk," but that doesn't quite roll off the tongue.
> ;)

*nod*

Just wanted to make sure we were both talking about the same thing
:)

> 
> > Further, when the CIL flushes, it's normally flushing more metadata than we
> > can hold in 8x iclogbufs. The default is 32kB iclogbufs, so if we've
> > got more than 256kB of checkpoint data to be written, then we end up
> > waiting on iclogbuf completion to write more than 256kB.
> > 
> > Typically I see a sustained ratio of roughly 4 iclog writes to 1
> > noiclogbufs in the metric graphs on small logs - just measured 700
> > log writes/s, 250 noiclogs/s for a 64MB log and 256kB logbuf size.
> > IOWs, CIL flushes are regularly waiting in xlog_get_iclog_space() on
> > iclog IO completion to occur...
> > 
> 
> Ok, that's not quite what I was concerned about when you mentioned
> waiting on the CIL to flush to disk. No matter, the important bit here
> is the performance cost of including the extra blocking on log I/O (to
> cycle iclogs back to active for reuse) in the throttle.
> 
> I'm curious about how noticeable this extra blocking would be because
> it's one likely cause of the CIL pressure buildup in the first place. My
> original tests reproduced huge CIL checkpoints purely based on one CIL
> push being blocked behind the processing of another, the latter taking
> relatively more time due to log I/O.

It's very noticeable. I dropped that as a throttle mechanism because
even on a SSD at 200us per write, a 32MB CIL flush takes
20-30ms. And it stops everything dead for that time because it
stalls all async transaction commits while the log flushes.
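
Back of the envelope with the 256kB logbufs from the numbers above: a
32MB checkpoint is ~128 iclog writes, and because the formatting code
regularly stalls waiting for iclogs to complete (the noiclog rate I
quoted earlier), those writes end up largely serialised. 128 writes at
~200us each is ~25ms, which is where the 20-30ms comes from.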

When several CIL flushes a second occur (pushing 100-150MB/s to the
log), I start seeing a substantial amount of additional time
(15-20% of total CPU time) being spent idle just waiting for CIL
flush completion. And it only gets worse as the storage gets
slower...

The throttle that these patches implement typically only stalls
incoming transactions for a couple of hundred microseconds.
The timestamps from the groups of log events show that the blocking of
all threads happens very close together in time, and the push work to
switch over to the new context to allow them to unblock and continue
happens within another 200-300 microseconds. And in the worst case I've
seen, this is happening once or twice a second. IOWs, the blocking
time of the throttle is very short and largely unnoticeable, and the
difference to performance it causes is far, far less than the noise
threshold of the benchmarks.

> This is not to say there aren't other causes of excessively sized
> checkpoints. Rather, if we're at a point where we've blocked
> transactions on this new threshold, that means we've doubled the
> background threshold in the time we've first triggered a background CIL
> push and the push actually started. From that, it seems fairly likely
> that we could replenish the CIL to the background threshold once
> threads are unblocked but before the previous push completes.

That's just fine. The CIL is actually designed to allow that sort of
overlap between multiple committing background contexts and the
current aggregating context. As long as each individual CIL commit
size doesn't go over half the log and there's always a complete
commit in the log (which there should be via head/tail limits),
the only limit on the number of concurrent committing CIL contexts
on the fly at once is the amount of log space we have available....

> The question is: can we get all the way to the blocking threshold before
> that happens? That doesn't seem unrealistic to me, but it's hard to
> reason about without having tested it. If so, I think it means we end up
> blocking on completion of the first push to some degree anyways.

Yes, we can fill and push the current sequence before the previous
sequence has finished committing.

> > > > xlog_cil_push_now() uses flush_work() to push any pending work
> > > > before it queues up the CIL flush that the caller is about to wait
> > > > for.  i.e. the callers of xlog_cil_push_now() must ensure that all
> > > > CIL contexts are flushed for the purposes of a log force as they are
> > > > going to wait for all pending CIL flushes to complete. If we've
> > > > already pushed the CIL to the sequence that we are asking to push
> > > > to, we still have to wait for that previous push to be
> > > > done. This is what the flush_work() call in xlog_cil_push_now()
> > > > achieves.
> > > > 
> > > 
> > > Yes, I'm just exploring potential to reuse this code..
> > 
> > Yeah, I have a few prototype patches for revamping this, including
> > an actual async CIL flush. I did some work here, but it didn't solve
> > any of the problems I needed to fix so I put it aside. See below.
> > 
> 
> That sounds more involved than what I was thinking. My thought is that
> this throttle is already not predictable or deterministic (i.e. we're
> essentially waiting on a scheduler event) and so might not require the
> extra complexity of a new waitqueue. It certainly could be the case that
> blocking on the entire push is just too long in practice, but since this

I've made that sort of change here and measured the regression.
Then I discarded it as an unworkable option and looked for other
solutions.

> is already based on empirical evidence and subject to unpredictability,
> ISTM that testing is the only way to know for sure. For reference, I
> hacked up something to reuse xlog_cil_push_now() for background pushing
> and throttling that ends up removing 20 or so lines of code by the time
> it's in place, but I haven't given it any testing.
>
> That said, this is just an observation and an idea. I'm fine with the
> proposed implementation with the other nits and whatnot fixed up.

No worries. It's still worth exploring all the alternatives. :)

Cheers,

Dave.
Dave Chinner Oct. 3, 2019, 1:25 a.m. UTC | #11
On Wed, Oct 02, 2019 at 08:41:39AM -0400, Brian Foster wrote:
> On Wed, Oct 02, 2019 at 09:14:33AM +1000, Dave Chinner wrote:
> > On Tue, Oct 01, 2019 at 09:13:36AM -0400, Brian Foster wrote:
> > > On Tue, Oct 01, 2019 at 01:42:07PM +1000, Dave Chinner wrote:
> > > > So typically groups of captures are hundreds of log cycles apart
> > > > (100 cycles x 32MB = ~3GB of log writes), then there will be a
> > > > stutter where the CIL dispatch is delayed, and then everything
> > > > continues on. These all show the log is always around the 75% full
> > > > (AIL tail pushing threshold) but the reservation grant wait lists are
> > > > always empty so we're not running out of reservation space here.
> > > > 
> > > 
> > > It's somewhat interesting that we manage to block every thread most of
> > > the time before the CIL push task starts. I wonder a bit if that pattern
> > > would hold for a system/workload with more CPUs (and if so, if there are
> > > any odd side effects of stalling and waking hundreds of tasks at the
> > > same time vs. our traditional queuing behavior).
> > 
> > If I increase the concurrency (e.g. 16->32 threads for fsmark on a
> > 64MB log), we hammer the spinlock on the grant head -hard-. i.e. CPU
> > usage goes up by 40%, performance goes down by 50%, and all that CPU
> > time is spent spinning on the reserve grant head lock. Basically,
> > the log reservation space runs out, and we end up queuing on the
> > reservation grant head and then we get reminded of just how bad
> > having a serialisation point in the reservation fast path actually
> > is for scalability...
> > 
> 
> The small log case is not really what I'm wondering about. Does this
> behavior translate to a similar test with a maximum sized log?

Nope, the transactions all hit the CIL throttle within a couple of
hundred microseconds of each other, then the CIL push schedules, and
then a couple of hundred microseconds later they are unblocked
because the CIL push has started.

> ...
> > 
> > Larger logs block more threads on the CIL throttle, but the 32MB CIL
> > window can soak up hundreds of max sized transaction reservations
> > before overflowing so even running several hundred concurrent
> > modification threads I haven't been able to drive enough concurrency
> > through the CIL to see any sort of adverse behaviour.  And the
> > workloads are running pretty consistently at less than 5,000 context
> > switches/sec so there's no evidence of repeated thundering herd
> > wakeup problems, either.
> > 
> 
> That speaks to the rarity of the throttle, which is good. But I'm
> wondering, for example, what might happen on systems where we could have
> hundreds of physical CPUs committing to the CIL, we block them all on
> the throttle and then wake them all at once. IOW, can we potentially
> create the contention conditions you reproduce above in scenarios where
> they might not have existed before?

I don't think it will create any new contention points - the
contention I described above can be triggered without the CIL
throttle in place, too. It just requires enough concurrent
transactions to exhaust the entire log reservation, and then we go
from a lockless grant head reservation algorithm to a spinlock
serialised waiting algorithm.  i.e. the contention starts when we
have enough concurrency to fall off the lockless fast path.

So with a 2GB log and fast storage, we likely need a sustained
workload of tens of thousands of concurrent transaction reservations
to exhaust log space and drive us into this situation. We generally
don't have applications that have this sort of concurrency
capability...
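
To make "falling off the lockless fast path" concrete, here's a toy
userspace model of the shape of the algorithm. This is not the XFS
code - just the structure: a lock-free space check plus an atomic add
while space is available, and lock-serialised waiting once it isn't:

#include <pthread.h>
#include <stdatomic.h>

#define LOG_SPACE	(64L * 1024 * 1024)	/* total reservation space */

static atomic_long grant_head;			/* bytes currently reserved */
static pthread_mutex_t grant_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t grant_wait = PTHREAD_COND_INITIALIZER;

static void log_reserve(long need)
{
	/* lockless fast path: check space, then atomically move the head */
	if (LOG_SPACE - atomic_load(&grant_head) >= need) {
		atomic_fetch_add(&grant_head, need);
		return;
	}

	/* slow path: every reserver now serialises on one lock */
	pthread_mutex_lock(&grant_lock);
	while (LOG_SPACE - atomic_load(&grant_head) < need)
		pthread_cond_wait(&grant_wait, &grant_lock);
	atomic_fetch_add(&grant_head, need);
	pthread_mutex_unlock(&grant_lock);
}

static void log_release(long bytes)
{
	atomic_fetch_sub(&grant_head, bytes);
	pthread_cond_broadcast(&grant_wait);	/* wake queued reservers */
}

As long as the workload stays on the first branch the reservers never
touch a shared lock; once space is exhausted they all pile onto
grant_lock, which is the contention shown in the profile earlier in
the thread.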

Cheers,

Dave.
Brian Foster Oct. 3, 2019, 2:39 p.m. UTC | #12
On Thu, Oct 03, 2019 at 10:53:10AM +1000, Dave Chinner wrote:
> On Wed, Oct 02, 2019 at 08:40:56AM -0400, Brian Foster wrote:
> > On Wed, Oct 02, 2019 at 08:31:07AM +1000, Dave Chinner wrote:
> > > On Tue, Oct 01, 2019 at 09:13:04AM -0400, Brian Foster wrote:
> > > > On Tue, Oct 01, 2019 at 07:53:36AM +1000, Dave Chinner wrote:
> > > > > On Mon, Sep 30, 2019 at 01:03:58PM -0400, Brian Foster wrote:
> > > > > > Have you done similar testing for small/minimum sized logs?
> > > > > 
> > > > > Yes. I've had the tracepoint active during xfstests runs on test
> > > > > filesystems using default log sizes on 5-15GB filesystems. The only
> > > > > test in all of xfstests that has triggered it is generic/017, and it
> > > > > only triggered once.
> > > > > 
> > > > 
> > > > Ok, interesting. I guess it would be nice if we had a test that somehow
> > > > or another more effectively exercised this codepath.
> > > 
> > > *nod*
> > > 
> > > But it's essentially difficult to predict in any way because
> > > behaviour is not just a function of log size. :/
> > > 
> > > > > > Also, if this is so limited in occurrence, had you given any thought to
> > > > > > something even more simple like flushing the CIL push workqueue when
> > > > > > over the throttle threshold?
> > > > > 
> > > > > Yes, I've tried that - flush_workqueue() blocks /everything/ until
> > > > > all the queued work is complete. That means it waits for the CIL to
> > > > > flush to disk and write a commit record, and every modification in
> > > > > the filesystem is stopped dead in its tracks until the CIL push is
> > > > > complete.
> > > > > 
> > > > > The problem is that flushing workqueues is a synchronous operation,
> > > > > and it can't wait for partial work completion. We only need to wait
> > > > > for the CIL context to be swapped out - this is done by the push
> > > > > code before it starts all the expensive iclog formatting and waiting
> > > > > for iclog space...
> > > > > 
> > > > 
> > > > I know it waits on the work to complete. I poked around a bit for an
> > > > interface to "kick" a workqueue, so to speak (as opposed to flush), but
> > > > I didn't see anything (not surprisingly, since it probably wouldn't be a
> > > > broadly useful mechanism).
> > > 
> > > *nod*
> > > 
> > > > That aside, where would this wait on the CIL to flush to disk? AFAICT
> > > > the only thing that might happen here is log buffer submission. That
> > > > only happens when the log buffer is full (regardless of the current CIL
> > > > push writing its commit record). In fact if we did wait on I/O anywhere
> > > > in here, then that suggests potential problems with async log force.
> > > 
> > > There is no such thing as a "async log force". The log force always
> > > waits on the CIL flush - XFS_LOG_SYNC only defines whether it waits
> > > on all iclogbufs post CIL flush to hit the disk.
> > > 
> > 
> > I'm just referring to the semantics/naming of the existing interface. I
> > suppose I could have used "a log force that doesn't wait on all
> > iclogbufs to hit the disk," but that doesn't quite roll off the tongue.
> > ;)
> 
> *nod*
> 
> Just wanted to make sure we were both talking about the same thing
> :)
> 
> > 
> > > Further, when the CIL flushes, it's normally flushing more metadata than we
> > > can hold in 8x iclogbufs. The default is 32kB iclogbufs, so if we've
> > > got more than 256kB of checkpoint data to be written, then we end up
> > > waiting on iclogbuf completion to write more than 256kB.
> > > 
> > > Typically I see a sustained ratio of roughly 4 iclog writes to 1
> > > noiclogbufs in the metric graphs on small logs - just measured 700
> > > log writes/s, 250 noiclogs/s for a 64MB log and 256kB logbuf size.
> > > IOWs, CIL flushes are regularly waiting in xlog_get_iclog_space() on
> > > iclog IO completion to occur...
> > > 
> > 
> > Ok, that's not quite what I was concerned about when you mentioned
> > waiting on the CIL to flush to disk. No matter, the important bit here
> > is the performance cost of including the extra blocking on log I/O (to
> > cycle iclogs back to active for reuse) in the throttle.
> > 
> > I'm curious about how noticeable this extra blocking would be because
> > it's one likely cause of the CIL pressure buildup in the first place. My
> > original tests reproduced huge CIL checkpoints purely based on one CIL
> > push being blocked behind the processing of another, the latter taking
> > relatively more time due to log I/O.
> 
> It's very noticeable. I dropped that as a throttle mechanism because
> even on a SSD at 200us per write, a 32MB CIL flush takes
> 20-30ms. And it stops everything dead for that time because it
> stalls all async transaction commits while the log flushes.
> 
> When several CIL flushes a second occur (pushing 100-150MB/s to the
> log), I start seeing a substantial amount of additional time
> (15-20% of total CPU time) being spent idle just waiting for CIL
> flush completion. And it only gets worse as the storage gets
> slower...
> 
> The throttle that these patches implement typically only stalls
> incoming transactions for a couple of hundred microseconds.
> The timestamps from the groups of log events show that the blocking of
> all threads happens very close together in time, and the push work to
> switch over to the new context to allow them to unblock and continue
> happens within another 200-300 microseconds. And in the worst case I've
> seen, this is happening once or twice a second. IOWs, the blocking
> time of the throttle is very short and largely unnoticeable, and the
> difference to performance it causes is far, far less than the noise
> threshold of the benchmarks.
> 

Ok, thanks for the info. Is that 200-300us delay reliable on a sustained
workload (where the throttle triggers multiple times)? If so, I suppose
that longer delay I was wondering about is not as likely in practice.

> > This is not to say there aren't other causes of excessively sized
> > checkpoints. Rather, if we're at a point where we've blocked
> > transactions on this new threshold, that means we've doubled the
> > background threshold in the time we've first triggered a background CIL
> > push and the push actually started. From that, it seems fairly likely
> > that we could replenish the CIL to the background threshold once
> > threads are unblocked but before the previous push completes.
> 
> That's just fine. The CIL is actually designed to allow that sort of
> overlap between multiple committing background contexts and the
> current aggregating context. As long as each individual CIL commit
> size doesn't go over half the log and there's always a complete
> commit in the log (which there should be via head/tail limits),
> the only limit on the number of concurrent committing CIL contexts
> on the fly at once is the amount of log space we have available....
> 

Right, but if we're blocking transactions on the start of a CIL push and
the start of a CIL push blocks on completion of the previous push, it's
possible to indirectly block transactions on completion of a CIL push.
There is some staggering there compared to blocking a transaction
directly on a CIL push start to finish, however, so perhaps that keeps
the delay down in practice.

> > The question is: can we get all the way to the blocking threshold before
> > that happens? That doesn't seem unrealistic to me, but it's hard to
> > reason about without having tested it. If so, I think it means we end up
> > blocking on completion of the first push to some degree anyways.
> 
> Yes, we can fill and push the current sequence before the previous
> sequence has finished committing.
> 
> > > > > xlog_cil_push_now() uses flush_work() to push any pending work
> > > > > before it queues up the CIL flush that the caller is about to wait
> > > > > for.  i.e. the callers of xlog_cil_push_now() must ensure that all
> > > > > CIL contexts are flushed for the purposes of a log force as they are
> > > > > going to wait for all pending CIL flushes to complete. If we've
> > > > > already pushed the CIL to the sequence that we are asking to push
> > > > > to, we still have to wait for that previous push to be
> > > > > done. This is what the flush_work() call in xlog_cil_push_now()
> > > > > achieves.
> > > > > 
> > > > 
> > > > Yes, I'm just exploring potential to reuse this code..
> > > 
> > > Yeah, I have a few prototype patches for revamping this, including
> > > an actual async CIL flush. I did some work here, but it didn't solve
> > > any of the problems I needed to fix so I put it aside. See below.
> > > 
> > 
> > That sounds more involved than what I was thinking. My thought is that
> > this throttle is already not predictable or deterministic (i.e. we're
> > essentially waiting on a scheduler event) and so might not require the
> > extra complexity of a new waitqueue. It certainly could be the case that
> > blocking on the entire push is just too long in practice, but since this
> 
> I've made that sort of change here and measured the regression.
> Then I discarded it as an unworkable option and looked for other
> solutions.
> 

Ok. What was the regression that resulted from the longer throttle
delay, out of curiosity?

Brian

> > is already based on empirical evidence and subject to unpredictability,
> > ISTM that testing is the only way to know for sure. For reference, I
> > hacked up something to reuse xlog_cil_push_now() for background pushing
> > and throttling that ends up removing 20 or so lines of code by the time
> > it's in place, but I haven't given it any testing.
> >
> > That said, this is just an observation and an idea. I'm fine with the
> > proposed implementation with the other nits and whatnot fixed up.
> 
> No worries. It's still worth exploring all the alternatives. :)
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
Brian Foster Oct. 3, 2019, 2:41 p.m. UTC | #13
On Thu, Oct 03, 2019 at 11:25:56AM +1000, Dave Chinner wrote:
> On Wed, Oct 02, 2019 at 08:41:39AM -0400, Brian Foster wrote:
> > On Wed, Oct 02, 2019 at 09:14:33AM +1000, Dave Chinner wrote:
> > > On Tue, Oct 01, 2019 at 09:13:36AM -0400, Brian Foster wrote:
> > > > On Tue, Oct 01, 2019 at 01:42:07PM +1000, Dave Chinner wrote:
> > > > > So typically groups of captures are hundreds of log cycles apart
> > > > > (100 cycles x 32MB = ~3GB of log writes), then there will be a
> > > > > stutter where the CIL dispatch is delayed, and then everything
> > > > > continues on. These all show the log is always around the 75% full
> > > > > (AIL tail pushing threshold) but the reservation grant wait lists are
> > > > > always empty so we're not running out of reservation space here.
> > > > > 
> > > > 
> > > > It's somewhat interesting that we manage to block every thread most of
> > > > the time before the CIL push task starts. I wonder a bit if that pattern
> > > > would hold for a system/workload with more CPUs (and if so, if there are
> > > > any odd side effects of stalling and waking hundreds of tasks at the
> > > > same time vs. our traditional queuing behavior).
> > > 
> > > If I increase the concurrency (e.g. 16->32 threads for fsmark on a
> > > 64MB log), we hammer the spinlock on the grant head -hard-. i.e. CPU
> > > usage goes up by 40%, performance goes down by 50%, and all that CPU
> > > time is spent spinning on the reserve grant head lock. Basically,
> > > the log reservation space runs out, and we end up queuing on the
> > > reservation grant head and then we get reminded of just how bad
> > > having a serialisation point in the reservation fast path actually
> > > is for scalability...
> > > 
> > 
> > The small log case is not really what I'm wondering about. Does this
> > behavior translate to a similar test with a maximum sized log?
> 
> Nope, the transactions all hit the CIL throttle within a couple of
> hundred microseconds of each other, then the CIL push schedules, and
> then a couple of hundred microseconds later they are unblocked
> because the CIL push has started.
> 
> > ...
> > > 
> > > Larger logs block more threads on the CIL throttle, but the 32MB CIL
> > > window can soak up hundreds of max sized transaction reservations
> > > before overflowing so even running several hundred concurrent
> > > modification threads I haven't been able to drive enough concurrency
> > > through the CIL to see any sort of adverse behaviour.  And the
> > > workloads are running pretty consistently at less than 5,000 context
> > > switches/sec so there's no evidence of repeated thundering herd
> > > wakeup problems, either.
> > > 
> > 
> > That speaks to the rarity of the throttle, which is good. But I'm
> > wondering, for example, what might happen on systems where we could have
> > hundreds of physical CPUs committing to the CIL, we block them all on
> > the throttle and then wake them all at once. IOW, can we potentially
> > create the contention conditions you reproduce above in scenarios where
> > they might not have existed before?
> 
> I don't think it will create any new contention points - the
> contention I described above can be triggered without the CIL
> throttle in place, too. It just requires enough concurrent
> transactions to exhaust the entire log reservation, and then we go
> from a lockless grant head reservation algorithm to a spinlock
> serialised waiting algorithm.  i.e. the contention starts when we
> have enough concurrency to fall off the lockless fast path.
> 
> So with a 2GB log and fast storage, we likely need a sustained
> workload of tens of thousands of concurrent transaction reservations
> to exhaust log space and drive us into this situation. We generally
> don't have applications that have this sort of concurrency
> capability...
> 

That there are some systems/configurations out there that are fast
enough to avoid this condition doesn't really answer the question. If
you assume something like a 1TB fs and 500MB log, with 1/4 the log
consumed in the AIL and 64MB in the CIL (such that transaction commits
start to block), the remaining log reservation can easily be consumed by
something on the order of 100 open transactions.
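
(Back of the envelope, assuming worst case permanent reservations of
roughly 3MB each: 500MB - ~125MB in the AIL - 64MB in the CIL leaves
about 300MB of grant space, and 300MB / ~3MB per open transaction is on
the order of 100.)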

Hmm, I'm also not sure the lockless reservation algorithm is totally
immune to increased concurrency in this regard. What prevents multiple
tasks from racing through xlog_grant_head_check() and blowing past the
log head, for example?

I gave this a quick test out of curiosity and with a 15GB fs with a 10MB
log, I should only be able to send 5 or so truncate transactions through
xfs_log_reserve() before blocking. With a couple injected delays, I'm
easily able to send 32 into the grant space modification code and that
eventually results in something like this:

  truncate-1233  [002] ...1  1520.396545: xfs_log_reserve_exit: dev 253:4 t_ocnt 8 t_cnt 8 t_curr_res 266260 t_unit_res 266260 t_flags XLOG_TIC_INITED|XLOG_TIC_PERM_RESERV reserveq empty writeq empty grant_reserve_cycle 7 grant_reserve_bytes 5306880 grant_write_cycle 7 grant_write_bytes 5306880 curr_cycle 1 curr_block 115 tail_cycle 1 tail_block 115

... where the grant heads have not only blown the tail, but cycled
around the log multiple times. :/

Brian

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
Dave Chinner Oct. 4, 2019, 2:27 a.m. UTC | #14
On Thu, Oct 03, 2019 at 10:41:14AM -0400, Brian Foster wrote:
> On Thu, Oct 03, 2019 at 11:25:56AM +1000, Dave Chinner wrote:
> > On Wed, Oct 02, 2019 at 08:41:39AM -0400, Brian Foster wrote:
> > > On Wed, Oct 02, 2019 at 09:14:33AM +1000, Dave Chinner wrote:
> > > > On Tue, Oct 01, 2019 at 09:13:36AM -0400, Brian Foster wrote:
> > > > > On Tue, Oct 01, 2019 at 01:42:07PM +1000, Dave Chinner wrote:
> > > > > > So typically groups of captures are hundreds of log cycles apart
> > > > > > (100 cycles x 32MB = ~3GB of log writes), then there will be a
> > > > > > stutter where the CIL dispatch is delayed, and then everything
> > > > > > continues on. These all show the log is always around the 75% full
> > > > > > (AIL tail pushing threshold) but the reservation grant wait lists are
> > > > > > always empty so we're not running out of reservation space here.
> > > > > > 
> > > > > 
> > > > > It's somewhat interesting that we manage to block every thread most of
> > > > > the time before the CIL push task starts. I wonder a bit if that pattern
> > > > > would hold for a system/workload with more CPUs (and if so, if there are
> > > > > any odd side effects of stalling and waking hundreds of tasks at the
> > > > > same time vs. our traditional queuing behavior).
> > > > 
> > > > If I increase the concurrency (e.g. 16->32 threads for fsmark on a
> > > > 64MB log), we hammer the spinlock on the grant head -hard-. i.e. CPU
> > > > usage goes up by 40%, performance goes down by 50%, and all that CPU
> > > > time is spent spinning on the reserve grant head lock. Basically,
> > > > the log reservation space runs out, and we end up queuing on the
> > > > reservation grant head and then we get reminded of just how bad
> > > > having a serialisation point in the reservation fast path actually
> > > > is for scalability...
> > > > 
> > > 
> > > The small log case is not really what I'm wondering about. Does this
> > > behavior translate to a similar test with a maximum sized log?
> > 
> > Nope, the transactions all hit the CIL throttle within a couple of
> > hundred microseconds of each other, then the CIL push schedules, and
> > then a couple of hundred microseconds later they are unblocked
> > because the CIL push has started.
> > 
> > > ...
> > > > 
> > > > Larger logs block more threads on the CIL throttle, but the 32MB CIL
> > > > window can soak up hundreds of max sized transaction reservations
> > > > before overflowing so even running several hundred concurrent
> > > > modification threads I haven't been able to drive enough concurrency
> > > > through the CIL to see any sort of adverse behaviour.  And the
> > > > workloads are running pretty consistently at less than 5,000 context
> > > > switches/sec so there's no evidence of repeated thundering herd
> > > > wakeup problems, either.
> > > > 
> > > 
> > > That speaks to the rarity of the throttle, which is good. But I'm
> > > wondering, for example, what might happen on systems where we could have
> > > hundreds of physical CPUs committing to the CIL, we block them all on
> > > the throttle and then wake them all at once. IOW, can we potentially
> > > create the contention conditions you reproduce above in scenarios where
> > > they might not have existed before?
> > 
> > I don't think it will create any new contention points - the
> > contention I described above can be triggered without the CIL
> > throttle in place, too. It just requires enough concurrent
> > transactions to exhaust the entire log reservation, and then we go
> > from a lockless grant head reservation algorithm to a spinlock
> > serialised waiting algorithm.  i.e. the contention starts when we
> > have enough concurrency to fall off the lockless fast path.
> > 
> > So with a 2GB log and fast storage, we likely need a sustained
> > workload of tens of thousands of concurrent transaction reservations
> > to exhaust log space and drive us into this situation. We generally
> > don't have applications that have this sort of concurrency
> > capability...
> > 
> 
> That there are some systems/configurations out there that are fast
> enough to avoid this condition doesn't really answer the question. If
> you assume something like a 1TB fs and 500MB log, with 1/4 the log
> consumed in the AIL and 64MB in the CIL (such that transaction commits
> start to block), the remaining log reservation can easily be consumed by
> something on the order of 100 open transactions.

If individual reservations are in the order of 3MB on a 4kB
filesystem, then we've screwed reservations up really badly.

/me goes looking

Reservation sizes for 1TB AGs and reflink enabled show:

 xfs_trans_resv_calc:  dev 253:32 type 0 logres 201976 logcount 8 flags 0x4
 xfs_trans_resv_calc:  dev 253:32 type 1 logres 361344 logcount 8 flags 0x4
 xfs_trans_resv_calc:  dev 253:32 type 2 logres 307936 logcount 2 flags 0x4
 xfs_trans_resv_calc:  dev 253:32 type 3 logres 187760 logcount 2 flags 0x4
 xfs_trans_resv_calc:  dev 253:32 type 4 logres 197760 logcount 2 flags 0x4
 xfs_trans_resv_calc:  dev 253:32 type 5 logres 303616 logcount 3 flags 0x4
 xfs_trans_resv_calc:  dev 253:32 type 6 logres 302464 logcount 2 flags 0x4
 xfs_trans_resv_calc:  dev 253:32 type 7 logres 319488 logcount 2 flags 0x4
 xfs_trans_resv_calc:  dev 253:32 type 8 logres 302464 logcount 3 flags 0x4
 xfs_trans_resv_calc:  dev 253:32 type 9 logres 337784 logcount 2 flags 0x4
 xfs_trans_resv_calc:  dev 253:32 type 10 logres 2168 logcount 0 flags 0x0
 xfs_trans_resv_calc:  dev 253:32 type 11 logres 90624 logcount 2 flags 0x4
 xfs_trans_resv_calc:  dev 253:32 type 12 logres 116856 logcount 2 flags 0x4
 xfs_trans_resv_calc:  dev 253:32 type 13 logres 760 logcount 0 flags 0x0
 xfs_trans_resv_calc:  dev 253:32 type 14 logres 360576 logcount 1 flags 0x4
 xfs_trans_resv_calc:  dev 253:32 type 15 logres 23288 logcount 3 flags 0x4
 xfs_trans_resv_calc:  dev 253:32 type 16 logres 13312 logcount 0 flags 0x0
 xfs_trans_resv_calc:  dev 253:32 type 17 logres 181376 logcount 3 flags 0x4
 xfs_trans_resv_calc:  dev 253:32 type 18 logres 640 logcount 0 flags 0x0
 xfs_trans_resv_calc:  dev 253:32 type 19 logres 111864 logcount 2 flags 0x4
 xfs_trans_resv_calc:  dev 253:32 type 20 logres 4224 logcount 0 flags 0x0
 xfs_trans_resv_calc:  dev 253:32 type 21 logres 6512 logcount 0 flags 0x0
 xfs_trans_resv_calc:  dev 253:32 type 22 logres 232 logcount 1 flags 0x0
 xfs_trans_resv_calc:  dev 253:32 type 23 logres 206199 logcount 8 flags 0x4
 xfs_trans_resv_calc:  dev 253:32 type 24 logres 976 logcount 1 flags 0x0
 xfs_trans_resv_calc:  dev 253:32 type 25 logres 336 logcount 1 flags 0x0
 xfs_trans_resv_calc:  dev 253:32 type 26 logres 640 logcount 1 flags 0x0
 xfs_trans_resv_calc:  dev 253:32 type 27 logres 760 logcount 0 flags 0x0
 xfs_trans_resv_calc:  dev 253:32 type -1 logres 361344 logcount 8 flags 0x4

Oh, wow, those are actually massive. A write reservation should
not need a 200kB reservation - it used to be in the order of 100kB
and with a log count of 3. And that truncate reservation @ 360kB is
way more than it needs to be.

I'm going to have a deeper look into this, because I strongly
suspect we've screwed things up here with respect to rmap/reflink
and deferred ops.

> Hmm, I'm also not sure the lockless reservation algorithm is totally
> immune to increased concurrency in this regard. What prevents multiple
> tasks from racing through xlog_grant_head_check() and blowing past the
> log head, for example?

Nothing. Debug kernels even emit a "xlog_verify_grant_tail: space >
BBTOB(tail_blocks)" messages when that happens. It's pretty
difficult to do this in real world conditions, even when there is
lots of concurrency being used.
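
The lockless fast path is just "sample the space, then add our
reservation" as two separate atomic operations, so two tasks can
both sample the same free space and both decide they fit before
either one adds its reservation. In simplified sketch form (not the
exact code):

	/* both tasks can pass this check at the same time... */
	free_bytes = xlog_space_left(log, &log->l_reserve_head.grant);
	if (free_bytes < need_bytes)
		return xlog_grant_head_wait(....);	/* slow path, spinlock serialised */

	/* ...and then both add their reservation here, overrunning the head */
	xlog_grant_add_space(log, &log->l_reserve_head.grant, need_bytes);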

But here's the rub: it's not actually the end of the world because
the reservation doesn't actually determine how much of the log is
currently being used by running transactions - the reservation is
for a maximal rolling iteration of a permanent transaction, not the
initial transaction that will be running. Hence if we overrun
occasionally we don't immediately run out of log space and corrupt
the log.

Yes, if none of the rolling transactions complete and they all need
to use their entire reservation, and the tail of the log cannot be
moved forward because it is pinned by one of the transactions that
is running, then we'll likely get a log hang on a regrant on the
write head. But if any of the transactions don't use all of their
reservation, then the overrun gets soaked up by the unused parts of
the transactions that are completed and returned to reservation
head, and nobody even notices that there was a temporary overrun of
the grant head space.
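
To put rough numbers on that: a single truncate reservation from the
table above is ~360kB with a log count of 8, so ~2.8MB of grant
space. If one of those slips past the check, it only takes a couple
of other transactions committing with most of their multi-megabyte
reservation unused - the overwhelmingly common case - for that space
to be returned to the grant head and soak up the overrun before
anything needs it.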

Hence occasional overruns on the reservation head before they start
blocking aren't really a problem in practice, because the probability
of the entire reservation of every running transaction being
required to make forwards progress is extremely small.

Basically, we gave up "perfect reservation space grant accounting"
because performance was extremely important and risk of log hangs as
a result of overruns was considered to be extremely low and worth
taking for the benefits the algorithm provided. This was just a
simple, pragmatic risk based engineering decision.

> I gave this a quick test out of curiosity and with a 15GB fs with a 10MB
> log, I should only be able to send 5 or so truncate transactions through
> xfs_log_reserve() before blocking. With a couple injected delays, I'm
> easily able to send 32 into the grant space modification code and that
> eventually results in something like this:

Yup, we can break lots of things by injecting artificial delays, but
that doesn't mean there is a practical real world problem with the
algorithm we need to fix....

Cheers,

Dave.
Brian Foster Oct. 4, 2019, 11:50 a.m. UTC | #15
On Fri, Oct 04, 2019 at 12:27:55PM +1000, Dave Chinner wrote:
> On Thu, Oct 03, 2019 at 10:41:14AM -0400, Brian Foster wrote:
> > On Thu, Oct 03, 2019 at 11:25:56AM +1000, Dave Chinner wrote:
> > > On Wed, Oct 02, 2019 at 08:41:39AM -0400, Brian Foster wrote:
> > > > On Wed, Oct 02, 2019 at 09:14:33AM +1000, Dave Chinner wrote:
> > > > > On Tue, Oct 01, 2019 at 09:13:36AM -0400, Brian Foster wrote:
> > > > > > On Tue, Oct 01, 2019 at 01:42:07PM +1000, Dave Chinner wrote:
> > > > > > > So typically groups of captures are hundreds of log cycles apart
> > > > > > > (100 cycles x 32MB = ~3GB of log writes), then there will be a
> > > > > > > stutter where the CIL dispatch is delayed, and then everything
> > > > > > > continues on. These all show the log is always around the 75% full
> > > > > > > (AIL tail pushing threshold) but the reservation grant wait lists are
> > > > > > > always empty so we're not running out of reservation space here.
> > > > > > > 
> > > > > > 
> > > > > > It's somewhat interesting that we manage to block every thread most of
> > > > > > the time before the CIL push task starts. I wonder a bit if that pattern
> > > > > > would hold for a system/workload with more CPUs (and if so, if there are
> > > > > > any odd side effects of stalling and waking hundreds of tasks at the
> > > > > > same time vs. our traditional queuing behavior).
> > > > > 
> > > > > If I increase the concurrency (e.g. 16->32 threads for fsmark on a
> > > > > 64MB log), we hammer the spinlock on the grant head -hard-. i.e. CPU
> > > > > usage goes up by 40%, performance goes down by 50%, and all that CPU
> > > > > time is spent spinning on the reserve grant head lock. Basically,
> > > > > the log reservation space runs out, and we end up queuing on the
> > > > > reservation grant head and then we get reminded of just how bad
> > > > > having a serialisation point in the reservation fast path actually
> > > > > is for scalability...
> > > > > 
> > > > 
> > > > The small log case is not really what I'm wondering about. Does this
> > > > behavior translate to a similar test with a maximum sized log?
> > > 
> > > Nope, the transactions all hit the CIL throttle within a couple of
> > > hundred microseconds of each other, then the CIL push schedules, and
> > > then a couple of hundred microseconds later they are unblocked
> > > because the CIL push has started.
> > > 
> > > > ...
> > > > > 
> > > > > Larger logs block more threads on the CIL throttle, but the 32MB CIL
> > > > > window can soak up hundreds of max sized transaction reservations
> > > > > before overflowing so even running several hundred concurrent
> > > > > modification threads I haven't been able to drive enough concurrency
> > > > > through the CIL to see any sort of adverse behaviour.  And the
> > > > > workloads are running pretty consistently at less than 5,000 context
> > > > > switches/sec so there's no evidence of repeated thundering herd
> > > > > wakeup problems, either.
> > > > > 
> > > > 
> > > > That speaks to the rarity of the throttle, which is good. But I'm
> > > > wondering, for example, what might happen on systems where we could have
> > > > hundreds of physical CPUs committing to the CIL, we block them all on
> > > > the throttle and then wake them all at once. IOW, can we potentially
> > > > create the contention conditions you reproduce above in scenarios where
> > > > they might not have existed before?
> > > 
> > > I don't think it will create any new contention points - the
> > > contention I described above can be triggered without the CIL
> > > throttle in place, too. It just requires enough concurrent
> > > transactions to exhaust the entire log reservation, and then we go
> > > from a lockless grant head reservation algorithm to a spinlock
> > > serialised waiting algorithm.  i.e. the contention starts when we
> > > have enough concurrency to fall off the lockless fast path.
> > > 
> > > So with a 2GB log and fast storage, we likely need a sustained
> > > workload of tens of thousands of concurrent transaction reservations
> > > to exhaust log space and drive us into this situation. We generally
> > > don't have applications that have this sort of concurrency
> > > capability...
> > > 
> > 
> > That there are some systems/configurations out there that are fast
> > enough to avoid this condition doesn't really answer the question. If
> > you assume something like a 1TB fs and 500MB log, with 1/4 the log
> > consumed in the AIL and 64MB in the CIL (such that transaction commits
> > start to block), the remaining log reservation can easily be consumed by
> > something on the order of 100 open transactions.
> 
> If individual reservations are in the order of 3MB on a 4kB
> filesystem, then we've screwed reservations up really badly.
> 
> /me goes looking
> 
> Reservation sizes for 1TB AGs and reflink enabled show:
> 
>  xfs_trans_resv_calc:  dev 253:32 type 0 logres 201976 logcount 8 flags 0x4
>  xfs_trans_resv_calc:  dev 253:32 type 1 logres 361344 logcount 8 flags 0x4
>  xfs_trans_resv_calc:  dev 253:32 type 2 logres 307936 logcount 2 flags 0x4
>  xfs_trans_resv_calc:  dev 253:32 type 3 logres 187760 logcount 2 flags 0x4
>  xfs_trans_resv_calc:  dev 253:32 type 4 logres 197760 logcount 2 flags 0x4
>  xfs_trans_resv_calc:  dev 253:32 type 5 logres 303616 logcount 3 flags 0x4
>  xfs_trans_resv_calc:  dev 253:32 type 6 logres 302464 logcount 2 flags 0x4
>  xfs_trans_resv_calc:  dev 253:32 type 7 logres 319488 logcount 2 flags 0x4
>  xfs_trans_resv_calc:  dev 253:32 type 8 logres 302464 logcount 3 flags 0x4
>  xfs_trans_resv_calc:  dev 253:32 type 9 logres 337784 logcount 2 flags 0x4
>  xfs_trans_resv_calc:  dev 253:32 type 10 logres 2168 logcount 0 flags 0x0
>  xfs_trans_resv_calc:  dev 253:32 type 11 logres 90624 logcount 2 flags 0x4
>  xfs_trans_resv_calc:  dev 253:32 type 12 logres 116856 logcount 2 flags 0x4
>  xfs_trans_resv_calc:  dev 253:32 type 13 logres 760 logcount 0 flags 0x0
>  xfs_trans_resv_calc:  dev 253:32 type 14 logres 360576 logcount 1 flags 0x4
>  xfs_trans_resv_calc:  dev 253:32 type 15 logres 23288 logcount 3 flags 0x4
>  xfs_trans_resv_calc:  dev 253:32 type 16 logres 13312 logcount 0 flags 0x0
>  xfs_trans_resv_calc:  dev 253:32 type 17 logres 181376 logcount 3 flags 0x4
>  xfs_trans_resv_calc:  dev 253:32 type 18 logres 640 logcount 0 flags 0x0
>  xfs_trans_resv_calc:  dev 253:32 type 19 logres 111864 logcount 2 flags 0x4
>  xfs_trans_resv_calc:  dev 253:32 type 20 logres 4224 logcount 0 flags 0x0
>  xfs_trans_resv_calc:  dev 253:32 type 21 logres 6512 logcount 0 flags 0x0
>  xfs_trans_resv_calc:  dev 253:32 type 22 logres 232 logcount 1 flags 0x0
>  xfs_trans_resv_calc:  dev 253:32 type 23 logres 206199 logcount 8 flags 0x4
>  xfs_trans_resv_calc:  dev 253:32 type 24 logres 976 logcount 1 flags 0x0
>  xfs_trans_resv_calc:  dev 253:32 type 25 logres 336 logcount 1 flags 0x0
>  xfs_trans_resv_calc:  dev 253:32 type 26 logres 640 logcount 1 flags 0x0
>  xfs_trans_resv_calc:  dev 253:32 type 27 logres 760 logcount 0 flags 0x0
>  xfs_trans_resv_calc:  dev 253:32 type -1 logres 361344 logcount 8 flags 0x4
> 
> Oh, wow, those are actually massive. A write reservation should
> not need a 200kB reservation - it used to be in the order of 100kB
> and with a log count of 3. And that truncate reservation @ 360kB is
> way more than it needs to be.
> 

Yeah, reflink extends some of the log counts quite a bit.

> I'm going to have a deeper look into this, because I strongly
> suspect we've screwed things up here with respect to rmap/reflink
> and deferred ops.
> 

Ok.

> > Hmm, I'm also not sure the lockless reservation algorithm is totally
> > immune to increased concurrency in this regard. What prevents multiple
> > tasks from racing through xlog_grant_head_check() and blowing past the
> > log head, for example?
> 
> Nothing. Debug kernels even emit a "xlog_verify_grant_tail: space >
> BBTOB(tail_blocks)" messages when that happens. It's pretty
> difficult to do this in real world conditions, even when there is
> lots of concurrency being used.
> 

Hm, Ok. Though I've seen that alert enough times that I
(unintentionally) ignore it at this point, so it can't be that hard to
reproduce. ;) That is usually during fstests however, and not a typical
workload that I recall. Of course, there's a difference between
reproducing the basic condition and taking it to the point where it
manifests into a problem.

> But here's the rub: it's not actually the end of the world because
> the reservation doesn't actually determine how much of the log is
> currently being used by running transactions - the reservation is
> for a maximal rolling iteration of a permanent transaction, not the
> initial transaction that will be running. Hence if we overrun
> occasionally we don't immediately run out of log space and corrupt
> the log.
> 

Ok, that much is evident from the amount of time this mechanism has been
in place without any notable issues.

> Yes, if none of the rolling transactions complete and they all need
> to use their entire reservation, and the tail of the log cannot be
> moved forward because it is pinned by one of the transactions that
> is running, then we'll likely get a log hang on a regrant on the
> write head. But if any of the transactions don't use all of their
> reservation, then the overrun gets soaked up by the unused parts of
> the transactions that are completed and returned to reservation
> head, and nobody even notices that there was a temporary overrun of
> the grant head space.
> 

Ok, I didn't expect this to be some catastrophic problem or really a
problem with your patch simply based on the lifetime of the code and how
the grant heads are actually used. I was going to suggest an assert or
something to detect whether batching behavior as a side effect of the
commit throttle would ever increase likelihood of this situation, but it
looks like the grant verify function somewhat serves that purpose
already.

I'd _prefer_ to see something, at least in DEBUG mode, that indicates
the frequency of the fundamental incorrect accounting condition as
opposed to just the side effect of blowing the tail (because the latter
depends on other difficult to reproduce factors), but I'd have to think
about that some more as it would need to balance against normal/expected
execution flow. Thanks for the background.

> Hence occasional overruns on the reservation head before they start
> blocking aren't really a problem in practice, because the probability
> of the entire reservation of every running transaction being
> required to make forwards progress is extremely small.
> 
> Basically, we gave up "perfect reservation space grant accounting"
> because performance was extremely important and risk of log hangs as
> a result of overruns was considered to be extremely low and worth
> taking for the benefits the algorithm provided. This was just a
> simple, pragmatic risk based engineering decision.
> 

FWIW, the comment for xlog_verify_tail() also suggests the potential for
false positives and references a panic tag, which all seems kind of
erratic and misleading compared to what you explain here. Unless I've
missed it somewhere, it would be nice if this intention were documented
somewhere more clearly. Perhaps I'll look into rewriting that comment
too..

> > I gave this a quick test out of curiosity and with a 15GB fs with a 10MB
> > log, I should only be able to send 5 or so truncate transactions through
> > xfs_log_reserve() before blocking. With a couple injected delays, I'm
> > easily able to send 32 into the grant space modification code and that
> > eventually results in something like this:
> 
> Yup, we can break lots of things by injecting artificial delays, but
> that doesn't mean there is a practical real world problem with the
> algorithm we need to fix....
> 

Delay (or error) injection is just a debugging technique to control and
observe execution flow. It doesn't necessarily speak to whether
something is practically reproducible or a bug or not, merely whether
it's possible, which is precisely why I try to point out behavior that
is manufactured vs reproduced via standard operations. Any reasoning
beyond that depends on context.

Brian

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
Dave Chinner Oct. 8, 2019, 2:51 a.m. UTC | #16
On Fri, Oct 04, 2019 at 07:50:01AM -0400, Brian Foster wrote:
> On Fri, Oct 04, 2019 at 12:27:55PM +1000, Dave Chinner wrote:
> > On Thu, Oct 03, 2019 at 10:41:14AM -0400, Brian Foster wrote:
> > > Hmm, I'm also not sure the lockless reservation algorithm is totally
> > > immune to increased concurrency in this regard. What prevents multiple
> > > tasks from racing through xlog_grant_head_check() and blowing past the
> > > log head, for example?
> > 
> > Nothing. Debug kernels even emit a "xlog_verify_grant_tail: space >
> > BBTOB(tail_blocks)" messages when that happens. It's pretty
> > difficult to do this in real world conditions, even when there is
> > lots of concurrency being used.
> > 
> 
> Hm, Ok. Though I've seen that alert enough times that I
> (unintentionally) ignore it at this point, so it can't be that hard to
> reproduce. ;) That is usually during fstests however, and not a typical
> workload that I recall.

I can't say I've seen it for a long time now - I want to say "years"
but I may well have simply missed it on the rare occasion it has
occurred and fstests hasn't captured it. i.e. fstests is supposed to
capture unusual things like this appearing in dmesg during a
test....

> Of course, there's a difference between
> reproducing the basic condition and taking it to the point where it
> manifests into a problem.

*nod*

> > But here's the rub: it's not actually the end of the world because
> > the reservation doesn't actually determine how much of the log is
> > currently being used by running transactions - the reservation is
> > for a maximal rolling iteration of a permanent transaction, not the
> > initial transaction that will be running. Hence if we overrun
> > occasionally we don't immediately run out of log space and corrupt
> > the log.
> > 
> 
> Ok, that much is evident from the amount of time this mechanism has been
> in place without any notable issues.
> 
> > Yes, if none of the rolling transactions complete and they all need
> > to use their entire reservation, and the tail of the log cannot be
> > moved forward because it is pinned by one of the transactions that
> > is running, then we'll likely get a log hang on a regrant on the
> > write head. But if any of the transactions don't use all of their
> > reservation, then the overrun gets soaked up by the unused parts of
> > the transactions that are completed and returned to reservation
> > head, and nobody even notices that there was a temporary overrun of
> > the grant head space.
> > 
> 
> Ok, I didn't expect this to be some catastrophic problem or really a
> problem with your patch simply based on the lifetime of the code and how
> the grant heads are actually used. I was going to suggest an assert or
> something to detect whether batching behavior as a side effect of the
> commit throttle would ever increase likelihood of this situation, but it
> looks like the grant verify function somewhat serves that purpose
> already.

Yeah - xlog_verify_grant_tail() will report reservation
overruns, but the serious log space problems (i.e. head overwriting
the tail) are detected by xlog_verify_tail_lsn() when we stamp the
tail_lsn into the current iclog header. That's still done under the
icloglock and the AIL lock, so the comparison of the tail with the
current log head is still completely serialised.

> I'd _prefer_ to see something, at least in DEBUG mode, that indicates
> the frequency of the fundamental incorrect accounting condition as
> opposed to just the side effect of blowing the tail (because the latter
> depends on other difficult to reproduce factors), but I'd have to think
> about that some more as it would need to balance against normal/expected
> execution flow. Thanks for the background.

You can test that just by removing the XLOG_TAIL_WARN flag setting,
then it will warn on every reservation overrun rather than just the
first.
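
i.e. in xlog_verify_grant_tail(), drop the XLOG_TAIL_WARN checks
around the alert - roughly this (from memory, untested):

	if (space > BBTOB(tail_blocks)) {
		xfs_alert_tag(log->l_mp, XFS_PTAG_LOGRES,
			"%s: space > BBTOB(tail_blocks)", __func__);
		/* don't set XLOG_TAIL_WARN, so it keeps warning */
	}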

> > Hence occasional overruns on the reservation head before they start
> > blocking aren't really a problem in practice, because the probability
> > of the entire reservation of every running transaction being
> > required to make forwards progress is extremely small.
> > 
> > Basically, we gave up "perfect reservation space grant accounting"
> > because performance was extremely important and risk of log hangs as
> > a result of overruns was considered to be extremely low and worth
> > taking for the benefits the algorithm provided. This was just a
> > simple, pragmatic risk based engineering decision.
> > 
> 
> FWIW, the comment for xlog_verify_tail() also suggests the potential for
> false positives and references a panic tag, which all seems kind of
> erratic and misleading compared to what you explain here.

Well, it's fundamentally an unserialised check, so it can race with
other reservation grants, commits that release grant space and tail
lsn updates. Hence it's not a 100% reliable debug check.

It also used to be run at all times, not just under
CONFIG_XFS_DEBUG=y, which is why it has a panic tag associated with
it. When we first deployed it, we weren't 100% sure there weren't
customer workloads that would trip over this and hang the log, so
we gave ourselves a way of triggering kernel dumps the instant an
overrun was detected. Hence a site that had log hangs with this
message in the logs could turn on the panic tag and we'd get a
kernel dump to analyse...

Since then, this code has been relegated to debug code but the panic
tag still exists. It could be turned back into an ASSERT now, but
it's still useful the way it is as it means debug kernels don't fall
over the moment a spurious overrun occurs...

Cheers,

Dave.
Dave Chinner Oct. 8, 2019, 3:34 a.m. UTC | #17
On Thu, Oct 03, 2019 at 10:39:34AM -0400, Brian Foster wrote:
> On Thu, Oct 03, 2019 at 10:53:10AM +1000, Dave Chinner wrote:
> > On Wed, Oct 02, 2019 at 08:40:56AM -0400, Brian Foster wrote:
> > > Ok, that's not quite what I was concerned about when you mentioned
> > > waiting on the CIL to flush to disk. No matter, the important bit here
> > > is the performance cost of including the extra blocking on log I/O (to
> > > cycle iclogs back to active for reuse) in the throttle.
> > > 
> > > I'm curious about how noticeable this extra blocking would be because
> > > it's one likely cause of the CIL pressure buildup in the first place. My
> > > original tests reproduced huge CIL checkpoints purely based on one CIL
> > > push being blocked behind the processing of another, the latter taking
> > > relatively more time due to log I/O.
> > 
> > It's very noticeable. I dropped that as a throttle mechanism because
> > even on a SSD at 200us per write, a 32MB CIL flush takes
> > 20-30ms. And it stops everything dead for that time because it
> > stalls all async transaction commits while the log flushes.
> > 
> > When several CIL flushes a second occur (pushing 100-150MB/s to the
> > log), I start seeing a substantial amount of additional time
> > (15-20% of total CPU time) being spent idle just waiting for CIL
> > flush completion. And it only gets worse as the storage gets
> > slower...
> > 
> > The throttle that these patches implement is typically only
> > stalling incoming transactions for a couple of hundred microseconds.
> > The timestamps from the groups of log events show that the blocking of
> > all threads is very close together in time, and the push work to
> > switch over to the new context to allow them to unblock and continue
> > happens within another 200-300 microseconds. And the worst case I've
> > seen, this is happening once or twice a second. IOWs, the blocking
> > time of the throttle is very short and largely unnoticeable, and the
> > difference to performance it causes is far, far less than the noise
> > threshold of the benchmarks.
> > 
> 
> Ok, thanks for the info. Is that 200-300us delay reliable on a sustained
> workload (where the throttle triggers multiple times)? If so, I suppose
> that longer delay I was wondering about is not as likely in practice.

The more threads there are, the longer the capture and delay time,
but I'm not seeing delays that are long enough to measurably reduce
the CPU consumption of the workloads nor the throughput of the
benchmark outside of the normal noise floor.

> > > This is not to say there aren't other causes of excessively sized
> > > checkpoints. Rather, if we're at a point where we've blocked
> > > transactions on this new threshold, that means we've doubled the
> > > background threshold in the time we've first triggered a background CIL
> > > push and the push actually started. From that, it seems fairly likely
> > > that we could replenish the CIL to the background threshold once
> > > threads are unblocked but before the previous push completes.
> > 
> > That's just fine. The CIL is actually designed to allow that sort of
> > overlap between multiple committing background contexts and the
> > current aggregating context. As long as each individual CIL commit
> > size doesn't go over half the log and there's always a complete
> > commit in the log (which there should be via head/tail limits),
> > the only limit on the number of concurrent committing CIL contexts
> > on the fly at once is the amount of log space we have available....
> > 
> 
> Right, but if we're blocking transactions on the start of a CIL push and
> the start of a CIL push blocks on completion of the previous push, it's
> possible to indirectly block transactions on completion of a CIL push.
> There is some staggering there compared to blocking a transaction
> directly on a CIL push start to finish, however, so perhaps that keeps
> the delay down in practice.

*nod*

The idea is that this all works as a multi-stage pipeline so there
is always overlap and work being done, even if there are short
stalls in individual work units in the pipeline:

reservation
  modification
    commit
      cil insert
	Cil push
	  iclog write
	    iclog completion
	      AIL insert
	        AIL push
		  metadata IO completion
		    AIL removal
		      log tail updates

Given that CPUs are generally much faster than the IO part of the
pipeline, short CPU scheduling delays don't introduce stalls in the
IO parts of the pipeline and so largely don't affect throughput.
It's when delays become long enough to fill the buffers of the
previous stage and stall multiple stages that performance goes down.
Ideally, there should be no stalls or pipeline delays after the
transaction has a reservation as those are bound by the final stage
throughput (log tail updates).

What we are doing here is adding more feedback between smaller parts
of the pipeline to manage the flow of modifications within the
pipeline. In this case, it is between the iclog writes and the CIL
insert stages so that we don't buffer the entire pipeline's work on
the CIL insert stage when CIL pushes stall on iclog writes due to
slowness in iclog completion and (perhaps) AIL insertion stages.

So, yeah, we can commit and insert in the CIL while a push is
occurring, and mostly they will happen at the same time and the CIL
is large enough that a typical push will complete at the same time
the cil inserts fill a new queue to be pushed....

> > > That sounds more involved than what I was thinking. My thought is that
> > > this throttle is already not predictable or deterministic (i.e. we're
> > > essentially waiting on a scheduler event) and so might not require the
> > > extra complexity of a new waitqueue. It certainly could be the case that
> > > blocking on the entire push is just too long in practice, but since this
> > 
> > I've made that sort of change here and measured the regression.
> > Then I discarded it as an unworkable option and looked for other
> > solutions.
> 
> Ok. What was the regression that resulted from the longer throttle
> delay, out of curiosity?

Well, I tried two things. The first was throttling by waiting for
the entire CIL push by issuing a flush_work() similar to the
xlog_cil_push_now() code. That introduced substantial pipeline
stalls at the CIL insert point, and performance went way down.
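
In sketch form, that first variant was basically this in
xlog_cil_push_background() (reconstructed, not the actual patch I
tested):

	spin_lock(&cil->xc_push_lock);
	if (cil->xc_push_seq < cil->xc_current_sequence) {
		cil->xc_push_seq = cil->xc_current_sequence;
		queue_work(log->l_mp->m_cil_workqueue, &cil->xc_push_work);
	}
	spin_unlock(&cil->xc_push_lock);

	/* can't hold xc_ctx_lock here - the push work takes it for write */
	up_read(&cil->xc_ctx_lock);

	/* wait for the whole push: formatting, iclog writes, commit record */
	if (cil->xc_ctx->space_used >= XLOG_CIL_BLOCKING_SPACE_LIMIT(log))
		flush_work(&cil->xc_push_work);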

The second was moving the up_write() in xlog_cil_push() down to
after the CIL had been formatted and written to the iclogs
entirely. This didn't wait on the commit record write, so removed
that part of the serialisation from the throttling. i.e. this stalls
incoming inserts (commits) on the down_read() of the context lock
until the CIL push had processed the CIL. This didn't make much
difference, because all the work is in formatting the thousands of
CIL logvecs into the iclogs and writing them to disk.

Basically, blocking the CIL insert on CIL pushes and iclog IO
creates pipeline stalls and prevents us from committing new changes
while the CIL is flushing. Hence we reduce concurrency in the
transaction commit path, and performance goes down accordingly...

Cheers,

Dave.
Brian Foster Oct. 8, 2019, 1:22 p.m. UTC | #18
On Tue, Oct 08, 2019 at 01:51:57PM +1100, Dave Chinner wrote:
> On Fri, Oct 04, 2019 at 07:50:01AM -0400, Brian Foster wrote:
> > On Fri, Oct 04, 2019 at 12:27:55PM +1000, Dave Chinner wrote:
> > > On Thu, Oct 03, 2019 at 10:41:14AM -0400, Brian Foster wrote:
> > > > Hmm, I'm also not sure the lockless reservation algorithm is totally
> > > > immune to increased concurrency in this regard. What prevents multiple
> > > > tasks from racing through xlog_grant_head_check() and blowing past the
> > > > log head, for example?
> > > 
> > > Nothing. Debug kernels even emit a "xlog_verify_grant_tail: space >
> > > BBTOB(tail_blocks)" messages when that happens. It's pretty
> > > difficult to do this in real world conditions, even when there is
> > > lots of concurrency being used.
> > > 
> > 
> > Hm, Ok. Though I've seen that alert enough times that I
> > (unintentionally) ignore it at this point, so it can't be that hard to
> > reproduce. ;) That is usually during fstests however, and not a typical
> > workload that I recall.
> 
> I can't say I've seen it for a long time now - I want to say "years"
> but I may well have simply missed it on the rare occasion it has
> occurred and fstests hasn't captured it. i.e. fstests is supposed to
> capture unusual things like this appearing in dmesg during a
> test....
> 
> > Of course, there's a difference between
> > reproducing the basic condition and taking it to the point where it
> > manifests into a problem.
> 
> *nod*
> 
> > > But here's the rub: it's not actually the end of the world because
> > > the reservation doesn't actually determine how much of the log is
> > > currently being used by running transactions - the reservation is
> > > for a maximal rolling iteration of a permanent transaction, not the
> > > initial transaction that will be running. Hence if we overrun
> > > occasionally we don't immediately run out of log space and corrupt
> > > the log.
> > > 
> > 
> > Ok, that much is evident from the amount of time this mechanism has been
> > in place without any notable issues.
> > 
> > > Yes, if none of the rolling transactions complete and they all need
> > > to use their entire reservation, and the tail of the log cannot be
> > > moved forward because it is pinned by one of the transactions that
> > > is running, then we'll likely get a log hang on a regrant on the
> > > write head. But if any of the transactions don't use all of their
> > > reservation, then the overrun gets soaked up by the unused parts of
> > > the transactions that are completed and returned to reservation
> > > head, and nobody even notices that there was a temporary overrun of
> > > the grant head space.
> > > 
> > 
> > Ok, I didn't expect this to be some catastrophic problem or really a
> > problem with your patch simply based on the lifetime of the code and how
> > the grant heads are actually used. I was going to suggest an assert or
> > something to detect whether batching behavior as a side effect of the
> > commit throttle would ever increase likelihood of this situation, but it
> > looks like the grant verify function somewhat serves that purpose
> > already.
> 
> Yeah - xlog_verify_grant_tail() will report reservation
> overruns, but the serious log space problems (i.e. head overwriting
> the tail) are detected by xlog_verify_tail_lsn() when we stamp the
> tail_lsn into the current iclog header. That's still done under the
> icloglock and the AIL lock, so the comparison of the tail with the
> current log head is still completely serialised.
> 
> > I'd _prefer_ to see something, at least in DEBUG mode, that indicates
> > the frequency of the fundamental incorrect accounting condition as
> > opposed to just the side effect of blowing the tail (because the latter
> > depends on other difficult to reproduce factors), but I'd have to think
> > about that some more as it would need to balance against normal/expected
> > execution flow. Thanks for the background.
> 
> You can test that just by removing the XLOG_TAIL_WARN flag setting,
> then it will warn on every reservation overrun rather than just the
> first.
> 

That's not quite what I'm after. That just removes the oneshot nature of
the existing check. The current debug check looks for a side effect of
the current algorithm in the form of overrunning the tail. What I would
like to see, somehow or another, is something that provides earlier and
more useful information on the frequency/scale of occurrence where
multiple reservations have been made based on the same grant baseline.

This is not so much an error check so not something that should be an
assert or warning, but rather more of a metric that provides a base for
comparison whenever we might have code changes or workloads that
potentially change this behavior. For example, consider a debug mode
stat or sysfs file that could be used to dump out a counter of "same
value" grant head samples after a benchmark workload or fstests run. The
fact that this value might go up or down is not necessarily a problem.
But that would provide some debug mode data to identify and analyze
potentially unintended side effects like transient spikes in concurrent
grant head checks due to blocking/scheduling changes in log reservation
tasks.

See the appended RFC for a quick idea of what I mean. This is slightly
racy, but I think conservative enough to provide valid values with
respect to already racy reservation implementation. On my slow 4xcpu vm,
I see occasional sample counts of 2 running a -p8 fsstress. If I stick a
smallish delay in xfs_log_reserve(), the frequency of the output
increases and I see occasional bumps to 3, a spike of 8 in one case,
etc. Of course I'd probably factor the atomic calls into DEBUG only
inline functions that can be compiled out on production kernels and
replace the tracepoint with a global counter (exported via stats or
sysfs knob), but this just illustrates the idea. The global counter
could also be replaced with (or kept in addition to) something that
tracks a max concurrency value if that is more useful. Any thoughts on
something like this?

> > > Hence occasional overruns on the reservation head before they start
> > > blocking aren't really a problem in practice, because the probability
> > > of the entire reservation of every running transaction being
> > > required to make forwards progress is extremely small.
> > > 
> > > Basically, we gave up "perfect reservation space grant accounting"
> > > because performance was extremely important and risk of log hangs as
> > > a result of overruns was considered to be extremely low and worth
> > > taking for the benefits the algorithm provided. This was just a
> > > simple, pragmatic risk based engineering decision.
> > > 
> > 
> > FWIW, the comment for xlog_verify_tail() also suggests the potential for
> > false positives and references a panic tag, which all seems kind of
> > erratic and misleading compared to what you explain here.
> 
> Well, it's fundamentally an unserialised check, so it can race with
> other reservation grants, commits that release grant space and tail
> lsn updates. Hence it's not a 100% reliable debug check.
> 

Right. Yet there is a panic knob...

> It also used to be run at all times, not just under
> CONFIG_XFS_DEBUG=y, which is why it has a panic tag associated with
> it. When we first deployed it, we weren't 100% sure there weren't
> customer workloads that would trip over this and hang the log, so
> we gave ourselves a way of triggering kernel dumps the instant an
> overrun was detected. Hence a site that had log hangs with this
> message in the logs could turn on the panic tag and we'd get a
> kernel dump to analyse...
> 

Ok, makes sense. This kind of speaks to the earlier point around having
more useful data. While this isn't necessarily a problem right now, we
have no real data to tell us whether some particular code change alters
this behavior. If this was enough of a concern when the change was first
put in place to insert a panic hook, then it stands to reason it's
something we should at least be cognizant of when making changes that
could impact its behavior.

> Since then, this code has been relegated to debug code but the panic
> tag still exists. It could be turned back into an ASSERT now, but
> it's still useful the way it is as it means debug kernels don't fall
> over the moment a spurious overrun occurs...
> 

Yeah, ISTM the panic bit could be removed at this point. The warning (as
opposed to an assert) is probably reasonable so long as the check itself
is racy so as to not cause false positive panics with fatal asserts
enabled.

Brian

--- 8< ---

RFC: crude concurrency stat on reserve grant head

diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
index a2beee9f74da..1d3056176e6e 100644
--- a/fs/xfs/xfs_log.c
+++ b/fs/xfs/xfs_log.c
@@ -175,6 +175,7 @@ xlog_grant_head_init(
 	xlog_assign_grant_head(&head->grant, 1, 0);
 	INIT_LIST_HEAD(&head->waiters);
 	spin_lock_init(&head->lock);
+	atomic64_set(&head->sample_cnt, 0);
 }
 
 STATIC void
@@ -446,6 +447,7 @@ xfs_log_reserve(
 	struct xlog_ticket	*tic;
 	int			need_bytes;
 	int			error = 0;
+	int64_t			sample_cnt;
 
 	ASSERT(client == XFS_TRANSACTION || client == XFS_LOG);
 
@@ -465,13 +467,19 @@ xfs_log_reserve(
 
 	error = xlog_grant_head_check(log, &log->l_reserve_head, tic,
 				      &need_bytes);
+	atomic64_inc(&log->l_reserve_head.sample_cnt);
 	if (error)
 		goto out_error;
 
+	sample_cnt = atomic64_xchg(&log->l_reserve_head.sample_cnt, 0);
 	xlog_grant_add_space(log, &log->l_reserve_head.grant, need_bytes);
 	xlog_grant_add_space(log, &log->l_write_head.grant, need_bytes);
 	trace_xfs_log_reserve_exit(log, tic);
 	xlog_verify_grant_tail(log);
+
+	if (sample_cnt > 1)
+		trace_printk("%d: %lld\n", __LINE__, sample_cnt);
+
 	return 0;
 
 out_error:
diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h
index b880c23cb6e4..62e4949f91c4 100644
--- a/fs/xfs/xfs_log_priv.h
+++ b/fs/xfs/xfs_log_priv.h
@@ -339,6 +339,7 @@ struct xlog_grant_head {
 	spinlock_t		lock ____cacheline_aligned_in_smp;
 	struct list_head	waiters;
 	atomic64_t		grant;
+	atomic64_t		sample_cnt;
 };
 
 /*
Brian Foster Oct. 8, 2019, 5:34 p.m. UTC | #19
On Tue, Oct 08, 2019 at 09:22:40AM -0400, Brian Foster wrote:
> On Tue, Oct 08, 2019 at 01:51:57PM +1100, Dave Chinner wrote:
> > On Fri, Oct 04, 2019 at 07:50:01AM -0400, Brian Foster wrote:
> > > On Fri, Oct 04, 2019 at 12:27:55PM +1000, Dave Chinner wrote:
> > > > On Thu, Oct 03, 2019 at 10:41:14AM -0400, Brian Foster wrote:
> > > > > Hmm, I'm also not sure the lockless reservation algorithm is totally
> > > > > immune to increased concurrency in this regard. What prevents multiple
> > > > > tasks from racing through xlog_grant_head_check() and blowing past the
> > > > > log head, for example?
> > > > 
> > > > Nothing. Debug kernels even emit a "xlog_verify_grant_tail: space >
> > > > BBTOB(tail_blocks)" messages when that happens. It's pretty
> > > > difficult to do this in real world conditions, even when there is
> > > > lots of concurrency being used.
> > > > 
> > > 
> > > Hm, Ok. Though I've seen that alert enough times that I
> > > (unintentionally) ignore it at this point, so it can't be that hard to
> > > reproduce. ;) That is usually during fstests however, and not a typical
> > > workload that I recall.
> > 
> > I can't say I've seen it for a long time now - I want to say "years"
> > but I may well have simply missed it on the rare occasion it has
> > occurred and fstests hasn't captured it. i.e. fstests is supposed to
> > capture unusual things like this appearing in dmesg during a
> > test....
> > 
> > > Of course, there's a difference between
> > > reproducing the basic condition and taking it to the point where it
> > > manifests into a problem.
> > 
> > *nod*
> > 
> > > > But here's the rub: it's not actually the end of the world because
> > > > the reservation doesn't actually determine how much of the log is
> > > > currently being used by running transactions - the reservation is
> > > > for a maximal rolling iteration of a permanent transaction, not the
> > > > initial transaction that will be running. Hence if we overrun
> > > > occasionally we don't immediately run out of log space and corrupt
> > > > the log.
> > > > 
> > > 
> > > Ok, that much is evident from the amount of time this mechanism has been
> > > in place without any notable issues.
> > > 
> > > > Yes, if none of the rolling transactions complete and they all need
> > > > to use their entire reservation, and the tail of the log cannot be
> > > > moved forward because it is pinned by one of the transactions that
> > > > is running, then we'll likely get a log hang on a regrant on the
> > > > write head. But if any of the transactions don't use all of their
> > > > reservation, then the overrun gets soaked up by the unused parts of
> > > > the transactions that are completed and returned to reservation
> > > > head, and nobody even notices that there was a temporary overrun of
> > > > the grant head space.
> > > > 
> > > 
> > > Ok, I didn't expect this to be some catastrophic problem or really a
> > > problem with your patch simply based on the lifetime of the code and how
> > > the grant heads are actually used. I was going to suggest an assert or
> > > something to detect whether batching behavior as a side effect of the
> > > commit throttle would ever increase likelihood of this situation, but it
> > > looks like the grant verify function somewhat serves that purpose
> > > already.
> > 
> > Yeah - xlog_verify_grant_tail() will report reservation
> > overruns, but the serious log space problems (i.e. head overwriting
> > the tail) are detected by xlog_verify_tail_lsn() when we stamp the
> > tail_lsn into the current iclog header. That's still done under the
> > icloglock and the AIL lock, so the comparison of the tail with the
> > current log head is still completely serialised.
> > 
> > > I'd _prefer_ to see something, at least in DEBUG mode, that indicates
> > > the frequency of the fundamental incorrect accounting condition as
> > > opposed to just the side effect of blowing the tail (because the latter
> > > depends on other difficult to reproduce factors), but I'd have to think
> > > about that some more as it would need to balance against normal/expected
> > > execution flow. Thanks for the background.
> > 
> > You can test that just by removing the XLOG_TAIL_WARN flag setting,
> > then it will warn on every reservation overrun rather than just the
> > first.
> > 
> 
> That's not quite what I'm after. That just removes the oneshot nature of
> the existing check. The current debug check looks for a side effect of
> the current algorithm in the form of overrunning the tail. What I would
> like to see, somehow or another, is something that provides earlier and
> more useful information on the frequency/scale of occurrence where
> multiple reservations have been made based on the same grant baseline.
> 
> This is not so much an error check so not something that should be an
> assert or warning, but rather more of a metric that provides a base for
> comparison whenever we might have code changes or workloads that
> potentially change this behavior. For example, consider a debug mode
> stat or sysfs file that could be used to dump out a counter of "same
> value" grant head samples after a benchmark workload or fstests run. The
> fact that this value might go up or down is not necessarily a problem.
> But that would provide some debug mode data to identify and analyze
> potentially unintended side effects like transient spikes in concurrent
> grant head checks due to blocking/scheduling changes in log reservation
> tasks.
> 
> See the appended RFC for a quick idea of what I mean. This is slightly
> racy, but I think conservative enough to provide valid values with
> respect to already racy reservation implementation. On my slow 4xcpu vm,
> I see occasional sample counts of 2 running a -p8 fsstress. If I stick a
> smallish delay in xfs_log_reserve(), the frequency of the output
> increases and I see occasional bumps to 3, a spike of 8 in one case,
> etc. Of course I'd probably factor the atomic calls into DEBUG only
> inline functions that can be compiled out on production kernels and
> replace the tracepoint with a global counter (exported via stats or
> sysfs knob), but this just illustrates the idea. The global counter
> could also be replaced with (or kept in addition to) something that
> tracks a max concurrency value if that is more useful. Any thoughts on
> something like this?
> 

BTW, this doesn't necessarily have to be a task based counter. Another
approach could be to keep a "lockless reservation slop" counter that is
incremented by the ticket reservation size down in
xlog_grant_head_check() (only in the lockless case) and
flagged/decremented once the reservation is added to the grant head.
Somewhere between the atomic add/sub we check the max slop value we've
encountered and either track it and dump it in a sysfs file and/or just
assert that it never reaches something unexpected/outrageous (based on
some % of the log size, for example).
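
Roughly something like this (untested sketch, made up names), shown
at the xfs_log_reserve() level for brevity even though the increment
would really live in the lockless branch of xlog_grant_head_check():

	int64_t		slop;

	error = xlog_grant_head_check(log, &log->l_reserve_head, tic,
				      &need_bytes);
	if (error)
		goto out_error;

	/* in-flight lockless reservation, not yet added to the grant head */
	slop = atomic64_add_return(need_bytes, &log->l_reserve_head.res_slop);
	/* racy max tracking is fine for a debug-only stat */
	if (slop > log->l_reserve_head.res_slop_max)
		log->l_reserve_head.res_slop_max = slop;

	xlog_grant_add_space(log, &log->l_reserve_head.grant, need_bytes);
	xlog_grant_add_space(log, &log->l_write_head.grant, need_bytes);
	atomic64_sub(need_bytes, &log->l_reserve_head.res_slop);

res_slop_max could then be dumped via a debug-only sysfs/stats file,
or we just ASSERT that slop stays under some % of log->l_logsize.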

Brian

> > > > Hence occasional overruns on the reservation head before they start
> > > > blocking aren't really a problem in practice, because the probability
> > > > of the entire reservation of every running transaction being
> > > > required to make forwards progress is extremely small.
> > > > 
> > > > Basically, we gave up "perfect reservation space grant accounting"
> > > > because performance was extremely important and risk of log hangs as
> > > > a result of overruns was considered to be extremely low and worth
> > > > taking for the benefits the algorithm provided. This was just a
> > > > simple, pragmatic risk based engineering decision.
> > > > 
> > > 
> > > FWIW, the comment for xlog_verify_tail() also suggests the potential for
> > > false positives and references a panic tag, which all seems kind of
> > > erratic and misleading compared to what you explain here.
> > 
> > Well, it's fundamentally an unserialised check, so it can race with
> > other reservation grants, commits that release grant space and tail
> > lsn updates. Hence it's not a 100% reliable debug check.
> > 
> 
> Right. Yet there is a panic knob...
> 
> > It also used to be run at all times, not just under
> > CONFIG_XFS_DEBUG=y, which is why it has a panic tag associated with
> > it. When we first deployed it, we weren't 100% sure there weren't
> > customer workloads that would trip over this and hang the log, so
> > we gave ourselves a way of triggering kernel dumps the instant an
> > overrun was detected. Hence a site that had log hangs with this
> > message in the logs could turn on the panic tag and we'd get a
> > kernel dump to analyse...
> > 
> 
> Ok, makes sense. This kind of speaks to the earlier point around having
> more useful data. While this isn't necessarily a problem right now, we
> have no real data to tell us whether some particular code change alters
> this behavior. If this was enough of a concern when the change was first
> put in place to insert a panic hook, then it stands to reason it's
> something we should at least be cognizant of when making changes that
> could impact its behavior.
> 
> > Since then, this code has been relegated to debug code but the panic
> > tag still exists. It could be turned back into an ASSERT now, but
> > it's still useful the way it is as it means debug kernels don't fall
> > over the moment a spurious overrun occurs...
> > 
> 
> Yeah, ISTM the panic bit could be removed at this point. The warning (as
> opposed to an assert) is probably reasonable so long as the check itself
> is racy so as to not cause false positive panics with fatal asserts
> enabled.
> 
> Brian
> 
> --- 8< ---
> 
> RFC: crude concurrency stat on reserve grant head
> 
> diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
> index a2beee9f74da..1d3056176e6e 100644
> --- a/fs/xfs/xfs_log.c
> +++ b/fs/xfs/xfs_log.c
> @@ -175,6 +175,7 @@ xlog_grant_head_init(
>  	xlog_assign_grant_head(&head->grant, 1, 0);
>  	INIT_LIST_HEAD(&head->waiters);
>  	spin_lock_init(&head->lock);
> +	atomic64_set(&head->sample_cnt, 0);
>  }
>  
>  STATIC void
> @@ -446,6 +447,7 @@ xfs_log_reserve(
>  	struct xlog_ticket	*tic;
>  	int			need_bytes;
>  	int			error = 0;
> +	int64_t			sample_cnt;
>  
>  	ASSERT(client == XFS_TRANSACTION || client == XFS_LOG);
>  
> @@ -465,13 +467,19 @@ xfs_log_reserve(
>  
>  	error = xlog_grant_head_check(log, &log->l_reserve_head, tic,
>  				      &need_bytes);
> +	atomic64_inc(&log->l_reserve_head.sample_cnt);
>  	if (error)
>  		goto out_error;
>  
> +	sample_cnt = atomic64_xchg(&log->l_reserve_head.sample_cnt, 0);
>  	xlog_grant_add_space(log, &log->l_reserve_head.grant, need_bytes);
>  	xlog_grant_add_space(log, &log->l_write_head.grant, need_bytes);
>  	trace_xfs_log_reserve_exit(log, tic);
>  	xlog_verify_grant_tail(log);
> +
> +	if (sample_cnt > 1)
> +		trace_printk("%d: %lld\n", __LINE__, sample_cnt);
> +
>  	return 0;
>  
>  out_error:
> diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h
> index b880c23cb6e4..62e4949f91c4 100644
> --- a/fs/xfs/xfs_log_priv.h
> +++ b/fs/xfs/xfs_log_priv.h
> @@ -339,6 +339,7 @@ struct xlog_grant_head {
>  	spinlock_t		lock ____cacheline_aligned_in_smp;
>  	struct list_head	waiters;
>  	atomic64_t		grant;
> +	atomic64_t		sample_cnt;
>  };
>  
>  /*
diff mbox series

Patch

diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
index ef652abd112c..4a09d50e1368 100644
--- a/fs/xfs/xfs_log_cil.c
+++ b/fs/xfs/xfs_log_cil.c
@@ -670,6 +670,11 @@  xlog_cil_push(
 	push_seq = cil->xc_push_seq;
 	ASSERT(push_seq <= ctx->sequence);
 
+	/*
+	 * Wake up any background push waiters now this context is being pushed.
+	 */
+	wake_up_all(&ctx->push_wait);
+
 	/*
 	 * Check if we've anything to push. If there is nothing, then we don't
 	 * move on to a new sequence number and so we have to be able to push
@@ -746,6 +751,7 @@  xlog_cil_push(
 	 */
 	INIT_LIST_HEAD(&new_ctx->committing);
 	INIT_LIST_HEAD(&new_ctx->busy_extents);
+	init_waitqueue_head(&new_ctx->push_wait);
 	new_ctx->sequence = ctx->sequence + 1;
 	new_ctx->cil = cil;
 	cil->xc_ctx = new_ctx;
@@ -900,7 +906,7 @@  xlog_cil_push_work(
  */
 static void
 xlog_cil_push_background(
-	struct xlog	*log)
+	struct xlog	*log) __releases(cil->xc_ctx_lock)
 {
 	struct xfs_cil	*cil = log->l_cilp;
 
@@ -914,14 +920,36 @@  xlog_cil_push_background(
 	 * don't do a background push if we haven't used up all the
 	 * space available yet.
 	 */
-	if (cil->xc_ctx->space_used < XLOG_CIL_SPACE_LIMIT(log))
+	if (cil->xc_ctx->space_used < XLOG_CIL_SPACE_LIMIT(log)) {
+		up_read(&cil->xc_ctx_lock);
 		return;
+	}
 
 	spin_lock(&cil->xc_push_lock);
 	if (cil->xc_push_seq < cil->xc_current_sequence) {
 		cil->xc_push_seq = cil->xc_current_sequence;
 		queue_work(log->l_mp->m_cil_workqueue, &cil->xc_push_work);
 	}
+
+	/*
+	 * Drop the context lock now, we can't hold that if we need to sleep
+	 * because we are over the blocking threshold. The push_lock is still
+	 * held, so blocking threshold sleep/wakeup is still correctly
+	 * serialised here.
+	 */
+	up_read(&cil->xc_ctx_lock);
+
+	/*
+	 * If we are well over the space limit, throttle the work that is being
+	 * done until the push work on this context has begun.
+	 */
+	if (cil->xc_ctx->space_used >= XLOG_CIL_BLOCKING_SPACE_LIMIT(log)) {
+		trace_xfs_log_cil_wait(log, cil->xc_ctx->ticket);
+		ASSERT(cil->xc_ctx->space_used < log->l_logsize);
+		xlog_wait(&cil->xc_ctx->push_wait, &cil->xc_push_lock);
+		return;
+	}
+
 	spin_unlock(&cil->xc_push_lock);
 
 }
@@ -1038,9 +1066,9 @@  xfs_log_commit_cil(
 		if (lip->li_ops->iop_committing)
 			lip->li_ops->iop_committing(lip, xc_commit_lsn);
 	}
-	xlog_cil_push_background(log);
 
-	up_read(&cil->xc_ctx_lock);
+	/* xlog_cil_push_background() releases cil->xc_ctx_lock */
+	xlog_cil_push_background(log);
 }
 
 /*
@@ -1199,6 +1227,7 @@  xlog_cil_init(
 
 	INIT_LIST_HEAD(&ctx->committing);
 	INIT_LIST_HEAD(&ctx->busy_extents);
+	init_waitqueue_head(&ctx->push_wait);
 	ctx->sequence = 1;
 	ctx->cil = cil;
 	cil->xc_ctx = ctx;
diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h
index a3cc8a9a16d9..f231b7dfaeab 100644
--- a/fs/xfs/xfs_log_priv.h
+++ b/fs/xfs/xfs_log_priv.h
@@ -247,6 +247,7 @@  struct xfs_cil_ctx {
 	struct xfs_log_vec	*lv_chain;	/* logvecs being pushed */
 	struct list_head	iclog_entry;
 	struct list_head	committing;	/* ctx committing list */
+	wait_queue_head_t	push_wait;	/* background push throttle */
 	struct work_struct	discard_endio_work;
 };
 
@@ -344,10 +345,33 @@  struct xfs_cil {
  *   buffer window (32MB) as measurements have shown this to be roughly the
  *   point of diminishing performance increases under highly concurrent
  *   modification workloads.
+ *
+ * To prevent the CIL from overflowing upper commit size bounds, we introduce a
+ * new threshold at which we block committing transactions until the background
+ * CIL commit commences and switches to a new context. While this is not a hard
+ * limit, it forces the process committing a transaction to the CIL to block and
+ * yield the CPU, giving the CIL push work a chance to be scheduled and start
+ * work. This prevents a process running lots of transactions from overfilling
+ * the CIL because it is not yielding the CPU. We set the blocking limit at
+ * twice the background push space threshold so we keep in line with the AIL
+ * push thresholds.
+ *
+ * Note: this is not a -hard- limit as blocking is applied after the transaction
+ * is inserted into the CIL and the push has been triggered. It is largely a
+ * throttling mechanism that allows the CIL push to be scheduled and run. A hard
+ * limit will be difficult to implement without introducing global serialisation
+ * in the CIL commit fast path, and it's not at all clear that we actually need
+ * such hard limits given the ~7 years we've run without a hard limit before
+ * finding the first situation where a checkpoint size overflow actually
+ * occurred. Hence the simple throttle, and an ASSERT check to tell us that
+ * we've overrun the max size.
  */
 #define XLOG_CIL_SPACE_LIMIT(log)	\
 	min_t(int, (log)->l_logsize >> 3, BBTOB(XLOG_TOTAL_REC_SHIFT(log)) << 4)
 
+#define XLOG_CIL_BLOCKING_SPACE_LIMIT(log)	\
+	(XLOG_CIL_SPACE_LIMIT(log) * 2)
+
 /*
  * ticket grant locks, queues and accounting have their own cachlines
  * as these are quite hot and can be operated on concurrently.
diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
index eaae275ed430..e7087ede2662 100644
--- a/fs/xfs/xfs_trace.h
+++ b/fs/xfs/xfs_trace.h
@@ -1011,6 +1011,7 @@  DEFINE_LOGGRANT_EVENT(xfs_log_regrant_reserve_sub);
 DEFINE_LOGGRANT_EVENT(xfs_log_ungrant_enter);
 DEFINE_LOGGRANT_EVENT(xfs_log_ungrant_exit);
 DEFINE_LOGGRANT_EVENT(xfs_log_ungrant_sub);
+DEFINE_LOGGRANT_EVENT(xfs_log_cil_wait);
 
 DECLARE_EVENT_CLASS(xfs_log_item_class,
 	TP_PROTO(struct xfs_log_item *lip),