[3/5,RFC] xfs: use percpu counters for CIL context counters

Message ID 20200512092811.1846252-4-david@fromorbit.com (mailing list archive)
State Deferred, archived
Series xfs: fix a couple of performance issues

Commit Message

Dave Chinner May 12, 2020, 9:28 a.m. UTC
From: Dave Chinner <dchinner@redhat.com>

With the m_active_trans atomic bottleneck out of the way, the CIL
xc_cil_lock is the next bottleneck that causes cacheline contention.
This protects several things, the first of which is the CIL context
reservation ticket and space usage counters.

We can lift them out of the xc_cil_lock by converting them to
percpu counters. This involves two things, the first of which is
lifting calculations and samples that don't actually need race
protection out from under the xc_cil_lock.

The second is converting the counters to percpu counters and lifting
them outside the lock. This requires a couple of tricky things to
minimise initial state races and to ensure we take split
reservations into account. We do this by erring on the "take the
reservation just in case" side, which is largely lost in the noise
of many frequent large transactions.

We use a trick with percpu_counter_add_batch() to ensure the global
sum is updated immediately on the first reservation, allowing us to
use fast counter reads everywhere to determine if the CIL is empty
or not, rather than using the list itself. This is important for
later patches where the CIL is moved to percpu lists and hence
cannot use list_empty() to detect an empty CIL. It provides a low
overhead, lockless mechanism for determining whether the CIL is
empty. All other percpu counter updates use a large batch count so
they aggregate on the local CPU and minimise global sum updates.
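
A minimal sketch (kernel-style C, not part of the posted patch) of the
lockless emptiness hint this batching trick enables; the helper name is
invented for illustration, and only the percpu_counter calls themselves
appear in the patch below:

	/*
	 * Hedged sketch: percpu_counter_read() only looks at the global count,
	 * and percpu_counter_add_batch() folds a per-cpu delta into that
	 * global count once the delta reaches 'batch'. Adding ctx_res with a
	 * batch of ctx_res - 1 therefore updates the global sum on the very
	 * first insert, so a zero read here means "CIL empty" outside a tiny
	 * first-insert race.
	 */
	static inline bool xlog_cil_empty_hint(struct xfs_cil *cil)
	{
		return percpu_counter_read(&cil->xc_curr_res) == 0;
	}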

The xc_ctx_lock rwsem protects draining the percpu counters to the
context's ticket, similar to the way it allows access to the CIL
without using the xc_cil_lock. i.e. the CIL push has exclusive
access to the CIL, the context and the percpu counters while holding
the xc_ctx_lock. This ensures that we can sum and zero the counters
atomically from the perspective of the transaction commit side of
the push. i.e. they reset to zero atomically with the CIL context
swap and hence we don't need to have the percpu counters attached to
the CIL context.

Performance-wise, this increases the transaction rate from
~620,000/s to around 750,000/s. Using a 32-way concurrent
create instead of 16-way on a 32p/16GB virtual machine:

		create time	rate		unlink time
unpatched	  2m03s      472k/s+/-9k/s	 3m6s
patched		  1m56s	     533k/s+/-28k/s	 2m34s

Notably, the system time for the create went from 44m20s down to
38m37s, whilst going faster. There is more variance, but I think
that is from the cacheline contention having inconsistent overhead.

XXX: probably should split into two patches

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_log_cil.c  | 99 ++++++++++++++++++++++++++++++-------------
 fs/xfs/xfs_log_priv.h |  2 +
 2 files changed, 72 insertions(+), 29 deletions(-)

Comments

Brian Foster May 12, 2020, 2:05 p.m. UTC | #1
On Tue, May 12, 2020 at 07:28:09PM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> With the m_active_trans atomic bottleneck out of the way, the CIL
> xc_cil_lock is the next bottleneck that causes cacheline contention.
> This protects several things, the first of which is the CIL context
> reservation ticket and space usage counters.
> 
> We can lift them out of the xc_cil_lock by converting them to
> percpu counters. THis involves two things, the first of which is
> lifting calculations and samples that don't actually need protecting
> from races outside the xc_cil lock.
> 
> The second is converting the counters to percpu counters and lifting
> them outside the lock. This requires a couple of tricky things to
> minimise initial state races and to ensure we take into account
> split reservations. We do this by erring on the "take the
> reservation just in case" side, which largely lost in the noise of
> many frequent large transactions.
> 
> We use a trick with percpu_counter_add_batch() to ensure the global
> sum is updated immediately on first reservation, hence allowing us
> to use fast counter reads everywhere to determine if the CIL is
> empty or not, rather than using the list itself. This is important
> for later patches where the CIL is moved to percpu lists
> and hence cannot use list_empty() to detect an empty CIL. Hence we
> provide a low overhead, lockless mechanism for determining if the
> CIL is empty or not via this mechanisms. All other percpu counter
> updates use a large batch count so they aggregate on the local CPU
> and minimise global sum updates.
> 
> The xc_ctx_lock rwsem protects draining the percpu counters to the
> context's ticket, similar to the way it allows access to the CIL
> without using the xc_cil_lock. i.e. the CIL push has exclusive
> access to the CIL, the context and the percpu counters while holding
> the xc_ctx_lock. This ensures that we can sum and zero the counters
> atomically from the perspective of the transaction commit side of
> the push. i.e. they reset to zero atomically with the CIL context
> swap and hence we don't need to have the percpu counters attached to
> the CIL context.
> 
> Performance wise, this increases the transaction rate from
> ~620,000/s to around 750,000/second. Using a 32-way concurrent
> create instead of 16-way on a 32p/16GB virtual machine:
> 
> 		create time	rate		unlink time
> unpatched	  2m03s      472k/s+/-9k/s	 3m6s
> patched		  1m56s	     533k/s+/-28k/s	 2m34
> 
> Notably, the system time for the create went from 44m20s down to
> 38m37s, whilst going faster. There is more variance, but I think
> that is from the cacheline contention having inconsistent overhead.
> 
> XXX: probably should split into two patches
> 

Yes please. :)

> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/xfs_log_cil.c  | 99 ++++++++++++++++++++++++++++++-------------
>  fs/xfs/xfs_log_priv.h |  2 +
>  2 files changed, 72 insertions(+), 29 deletions(-)
> 
> diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
> index b43f0e8f43f2e..746c841757ed1 100644
> --- a/fs/xfs/xfs_log_cil.c
> +++ b/fs/xfs/xfs_log_cil.c
> @@ -393,7 +393,7 @@ xlog_cil_insert_items(
>  	struct xfs_log_item	*lip;
>  	int			len = 0;
>  	int			diff_iovecs = 0;
> -	int			iclog_space;
> +	int			iclog_space, space_used;
>  	int			iovhdr_res = 0, split_res = 0, ctx_res = 0;
>  

fs/xfs//xfs_log_cil.c: In function ‘xlog_cil_insert_items’:
fs/xfs//xfs_log_cil.c:396:21: warning: unused variable ‘space_used’ [-Wunused-variable]

>  	ASSERT(tp);
> @@ -403,17 +403,16 @@ xlog_cil_insert_items(
>  	 * are done so it doesn't matter exactly how we update the CIL.
>  	 */
>  	xlog_cil_insert_format_items(log, tp, &len, &diff_iovecs);
> -
> -	spin_lock(&cil->xc_cil_lock);
> -
>  	/* account for space used by new iovec headers  */
> +
>  	iovhdr_res = diff_iovecs * sizeof(xlog_op_header_t);
>  	len += iovhdr_res;
>  	ctx->nvecs += diff_iovecs;
>  
> -	/* attach the transaction to the CIL if it has any busy extents */
> -	if (!list_empty(&tp->t_busy))
> -		list_splice_init(&tp->t_busy, &ctx->busy_extents);
> +	/*
> +	 * The ticket can't go away from us here, so we can do racy sampling
> +	 * and precalculate everything.
> +	 */
>  
>  	/*
>  	 * Now transfer enough transaction reservation to the context ticket
> @@ -421,27 +420,28 @@ xlog_cil_insert_items(
>  	 * reservation has to grow as well as the current reservation as we
>  	 * steal from tickets so we can correctly determine the space used
>  	 * during the transaction commit.
> +	 *
> +	 * We use percpu_counter_add_batch() here to force the addition into the
> +	 * global sum immediately. This will result in percpu_counter_read() now
> +	 * always returning a non-zero value, and hence we'll only ever have a
> +	 * very short race window on new contexts.
>  	 */
> -	if (ctx->ticket->t_curr_res == 0) {
> +	if (percpu_counter_read(&cil->xc_curr_res) == 0) {
>  		ctx_res = ctx->ticket->t_unit_res;
> -		ctx->ticket->t_curr_res = ctx_res;
>  		tp->t_ticket->t_curr_res -= ctx_res;
> +		percpu_counter_add_batch(&cil->xc_curr_res, ctx_res, ctx_res - 1);
>  	}

Ok, so we open a race here at the cost of stealing more reservation than
necessary from the transaction. Seems harmless, but I would like to see
some quantification/analysis on what a 'very short race window' is in
this context. Particularly as it relates to percpu functionality. Does
the window scale with cpu count, for example? It might not matter either
way because we expect any given transaction to accommodate the ctx res,
but it would be good to understand the behavior here so we can think
about potential side effects, if any.

>  
>  	/* do we need space for more log record headers? */
> -	iclog_space = log->l_iclog_size - log->l_iclog_hsize;
> -	if (len > 0 && (ctx->space_used / iclog_space !=
> -				(ctx->space_used + len) / iclog_space)) {
> +	if (len > 0 && !ctx_res) {
> +		iclog_space = log->l_iclog_size - log->l_iclog_hsize;
>  		split_res = (len + iclog_space - 1) / iclog_space;
>  		/* need to take into account split region headers, too */
>  		split_res *= log->l_iclog_hsize + sizeof(struct xlog_op_header);
> -		ctx->ticket->t_unit_res += split_res;
> -		ctx->ticket->t_curr_res += split_res;
>  		tp->t_ticket->t_curr_res -= split_res;
>  		ASSERT(tp->t_ticket->t_curr_res >= len);
>  	}

Similarly here, assume additional split reservation for every context
rather than checking each commit. Seems reasonable in principle, but
just from a cursory glance this doesn't cover the case of the context
expanding beyond more than two iclogs. IOW, the current logic adds
split_res if the size increase from the current transaction expands the
ctx into another iclog than before the transaction. The new logic only
seems to add split_res for the first transaction into the ctx. Also note
that len seems to be a factor in the calculation of split_res, but it's
not immediately clear to me what impact filtering the split_res
calculation as such has in that regard.

(BTW the comment above this hunk needs an update if we do end up with
some special logic here.)

Other than those bits this seems fairly sane to me.

Brian

>  	tp->t_ticket->t_curr_res -= len;
> -	ctx->space_used += len;
>  
>  	/*
>  	 * If we've overrun the reservation, dump the tx details before we move
> @@ -458,6 +458,15 @@ xlog_cil_insert_items(
>  		xlog_print_trans(tp);
>  	}
>  
> +	percpu_counter_add_batch(&cil->xc_curr_res, split_res, 1000 * 1000);
> +	percpu_counter_add_batch(&cil->xc_space_used, len, 1000 * 1000);
> +
> +	spin_lock(&cil->xc_cil_lock);
> +
> +	/* attach the transaction to the CIL if it has any busy extents */
> +	if (!list_empty(&tp->t_busy))
> +		list_splice_init(&tp->t_busy, &ctx->busy_extents);
> +
>  	/*
>  	 * Now (re-)position everything modified at the tail of the CIL.
>  	 * We do this here so we only need to take the CIL lock once during
> @@ -741,6 +750,18 @@ xlog_cil_push_work(
>  		num_iovecs += lv->lv_niovecs;
>  	}
>  
> +	/*
> +	 * Drain per cpu counters back to context so they can be re-initialised
> +	 * to zero before we allow commits to the new context we are about to
> +	 * switch to.
> +	 */
> +	ctx->space_used = percpu_counter_sum(&cil->xc_space_used);
> +	ctx->ticket->t_curr_res = percpu_counter_sum(&cil->xc_curr_res);
> +	ctx->ticket->t_unit_res = ctx->ticket->t_curr_res;
> +	percpu_counter_set(&cil->xc_space_used, 0);
> +	percpu_counter_set(&cil->xc_curr_res, 0);
> +
> +
>  	/*
>  	 * initialise the new context and attach it to the CIL. Then attach
>  	 * the current context to the CIL committing lsit so it can be found
> @@ -900,6 +921,7 @@ xlog_cil_push_background(
>  	struct xlog	*log) __releases(cil->xc_ctx_lock)
>  {
>  	struct xfs_cil	*cil = log->l_cilp;
> +	s64		space_used = percpu_counter_read(&cil->xc_space_used);
>  
>  	/*
>  	 * The cil won't be empty because we are called while holding the
> @@ -911,7 +933,7 @@ xlog_cil_push_background(
>  	 * don't do a background push if we haven't used up all the
>  	 * space available yet.
>  	 */
> -	if (cil->xc_ctx->space_used < XLOG_CIL_SPACE_LIMIT(log)) {
> +	if (space_used < XLOG_CIL_SPACE_LIMIT(log)) {
>  		up_read(&cil->xc_ctx_lock);
>  		return;
>  	}
> @@ -934,9 +956,9 @@ xlog_cil_push_background(
>  	 * If we are well over the space limit, throttle the work that is being
>  	 * done until the push work on this context has begun.
>  	 */
> -	if (cil->xc_ctx->space_used >= XLOG_CIL_BLOCKING_SPACE_LIMIT(log)) {
> +	if (space_used >= XLOG_CIL_BLOCKING_SPACE_LIMIT(log)) {
>  		trace_xfs_log_cil_wait(log, cil->xc_ctx->ticket);
> -		ASSERT(cil->xc_ctx->space_used < log->l_logsize);
> +		ASSERT(space_used < log->l_logsize);
>  		xlog_wait(&cil->xc_ctx->push_wait, &cil->xc_push_lock);
>  		return;
>  	}
> @@ -1200,16 +1222,23 @@ xlog_cil_init(
>  {
>  	struct xfs_cil	*cil;
>  	struct xfs_cil_ctx *ctx;
> +	int		error = -ENOMEM;
>  
>  	cil = kmem_zalloc(sizeof(*cil), KM_MAYFAIL);
>  	if (!cil)
> -		return -ENOMEM;
> +		return error;
>  
>  	ctx = kmem_zalloc(sizeof(*ctx), KM_MAYFAIL);
> -	if (!ctx) {
> -		kmem_free(cil);
> -		return -ENOMEM;
> -	}
> +	if (!ctx)
> +		goto out_free_cil;
> +
> +	error = percpu_counter_init(&cil->xc_space_used, 0, GFP_KERNEL);
> +	if (error)
> +		goto out_free_ctx;
> +
> +	error = percpu_counter_init(&cil->xc_curr_res, 0, GFP_KERNEL);
> +	if (error)
> +		goto out_free_space;
>  
>  	INIT_WORK(&cil->xc_push_work, xlog_cil_push_work);
>  	INIT_LIST_HEAD(&cil->xc_cil);
> @@ -1230,19 +1259,31 @@ xlog_cil_init(
>  	cil->xc_log = log;
>  	log->l_cilp = cil;
>  	return 0;
> +
> +out_free_space:
> +	percpu_counter_destroy(&cil->xc_space_used);
> +out_free_ctx:
> +	kmem_free(ctx);
> +out_free_cil:
> +	kmem_free(cil);
> +	return error;
>  }
>  
>  void
>  xlog_cil_destroy(
>  	struct xlog	*log)
>  {
> -	if (log->l_cilp->xc_ctx) {
> -		if (log->l_cilp->xc_ctx->ticket)
> -			xfs_log_ticket_put(log->l_cilp->xc_ctx->ticket);
> -		kmem_free(log->l_cilp->xc_ctx);
> +	struct xfs_cil  *cil = log->l_cilp;
> +
> +	if (cil->xc_ctx) {
> +		if (cil->xc_ctx->ticket)
> +			xfs_log_ticket_put(cil->xc_ctx->ticket);
> +		kmem_free(cil->xc_ctx);
>  	}
> +	percpu_counter_destroy(&cil->xc_space_used);
> +	percpu_counter_destroy(&cil->xc_curr_res);
>  
> -	ASSERT(list_empty(&log->l_cilp->xc_cil));
> -	kmem_free(log->l_cilp);
> +	ASSERT(list_empty(&cil->xc_cil));
> +	kmem_free(cil);
>  }
>  
> diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h
> index ec22c7a3867f1..f5e79a7d44c8e 100644
> --- a/fs/xfs/xfs_log_priv.h
> +++ b/fs/xfs/xfs_log_priv.h
> @@ -262,6 +262,8 @@ struct xfs_cil_ctx {
>   */
>  struct xfs_cil {
>  	struct xlog		*xc_log;
> +	struct percpu_counter	xc_space_used;
> +	struct percpu_counter	xc_curr_res;
>  	struct list_head	xc_cil;
>  	spinlock_t		xc_cil_lock;
>  
> -- 
> 2.26.1.301.g55bc3eb7cb9
>
Dave Chinner May 12, 2020, 11:36 p.m. UTC | #2
On Tue, May 12, 2020 at 10:05:44AM -0400, Brian Foster wrote:
> On Tue, May 12, 2020 at 07:28:09PM +1000, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > With the m_active_trans atomic bottleneck out of the way, the CIL
> > xc_cil_lock is the next bottleneck that causes cacheline contention.
> > This protects several things, the first of which is the CIL context
> > reservation ticket and space usage counters.
> > 
> > We can lift them out of the xc_cil_lock by converting them to
> > percpu counters. THis involves two things, the first of which is
> > lifting calculations and samples that don't actually need protecting
> > from races outside the xc_cil lock.
> > 
> > The second is converting the counters to percpu counters and lifting
> > them outside the lock. This requires a couple of tricky things to
> > minimise initial state races and to ensure we take into account
> > split reservations. We do this by erring on the "take the
> > reservation just in case" side, which largely lost in the noise of
> > many frequent large transactions.
> > 
> > We use a trick with percpu_counter_add_batch() to ensure the global
> > sum is updated immediately on first reservation, hence allowing us
> > to use fast counter reads everywhere to determine if the CIL is
> > empty or not, rather than using the list itself. This is important
> > for later patches where the CIL is moved to percpu lists
> > and hence cannot use list_empty() to detect an empty CIL. Hence we
> > provide a low overhead, lockless mechanism for determining if the
> > CIL is empty or not via this mechanisms. All other percpu counter
> > updates use a large batch count so they aggregate on the local CPU
> > and minimise global sum updates.
> > 
> > The xc_ctx_lock rwsem protects draining the percpu counters to the
> > context's ticket, similar to the way it allows access to the CIL
> > without using the xc_cil_lock. i.e. the CIL push has exclusive
> > access to the CIL, the context and the percpu counters while holding
> > the xc_ctx_lock. This ensures that we can sum and zero the counters
> > atomically from the perspective of the transaction commit side of
> > the push. i.e. they reset to zero atomically with the CIL context
> > swap and hence we don't need to have the percpu counters attached to
> > the CIL context.
> > 
> > Performance wise, this increases the transaction rate from
> > ~620,000/s to around 750,000/second. Using a 32-way concurrent
> > create instead of 16-way on a 32p/16GB virtual machine:
> > 
> > 		create time	rate		unlink time
> > unpatched	  2m03s      472k/s+/-9k/s	 3m6s
> > patched		  1m56s	     533k/s+/-28k/s	 2m34
> > 
> > Notably, the system time for the create went from 44m20s down to
> > 38m37s, whilst going faster. There is more variance, but I think
> > that is from the cacheline contention having inconsistent overhead.
> > 
> > XXX: probably should split into two patches
> > 
> 
> Yes please. :)
> 
> > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > ---
> >  fs/xfs/xfs_log_cil.c  | 99 ++++++++++++++++++++++++++++++-------------
> >  fs/xfs/xfs_log_priv.h |  2 +
> >  2 files changed, 72 insertions(+), 29 deletions(-)
> > 
> > diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
> > index b43f0e8f43f2e..746c841757ed1 100644
> > --- a/fs/xfs/xfs_log_cil.c
> > +++ b/fs/xfs/xfs_log_cil.c
> > @@ -393,7 +393,7 @@ xlog_cil_insert_items(
> >  	struct xfs_log_item	*lip;
> >  	int			len = 0;
> >  	int			diff_iovecs = 0;
> > -	int			iclog_space;
> > +	int			iclog_space, space_used;
> >  	int			iovhdr_res = 0, split_res = 0, ctx_res = 0;
> >  
> 
> fs/xfs//xfs_log_cil.c: In function ‘xlog_cil_insert_items’:
> fs/xfs//xfs_log_cil.c:396:21: warning: unused variable ‘space_used’ [-Wunused-variable]
> 
> >  	ASSERT(tp);
> > @@ -403,17 +403,16 @@ xlog_cil_insert_items(
> >  	 * are done so it doesn't matter exactly how we update the CIL.
> >  	 */
> >  	xlog_cil_insert_format_items(log, tp, &len, &diff_iovecs);
> > -
> > -	spin_lock(&cil->xc_cil_lock);
> > -
> >  	/* account for space used by new iovec headers  */
> > +
> >  	iovhdr_res = diff_iovecs * sizeof(xlog_op_header_t);
> >  	len += iovhdr_res;
> >  	ctx->nvecs += diff_iovecs;
> >  
> > -	/* attach the transaction to the CIL if it has any busy extents */
> > -	if (!list_empty(&tp->t_busy))
> > -		list_splice_init(&tp->t_busy, &ctx->busy_extents);
> > +	/*
> > +	 * The ticket can't go away from us here, so we can do racy sampling
> > +	 * and precalculate everything.
> > +	 */
> >  
> >  	/*
> >  	 * Now transfer enough transaction reservation to the context ticket
> > @@ -421,27 +420,28 @@ xlog_cil_insert_items(
> >  	 * reservation has to grow as well as the current reservation as we
> >  	 * steal from tickets so we can correctly determine the space used
> >  	 * during the transaction commit.
> > +	 *
> > +	 * We use percpu_counter_add_batch() here to force the addition into the
> > +	 * global sum immediately. This will result in percpu_counter_read() now
> > +	 * always returning a non-zero value, and hence we'll only ever have a
> > +	 * very short race window on new contexts.
> >  	 */
> > -	if (ctx->ticket->t_curr_res == 0) {
> > +	if (percpu_counter_read(&cil->xc_curr_res) == 0) {
> >  		ctx_res = ctx->ticket->t_unit_res;
> > -		ctx->ticket->t_curr_res = ctx_res;
> >  		tp->t_ticket->t_curr_res -= ctx_res;
> > +		percpu_counter_add_batch(&cil->xc_curr_res, ctx_res, ctx_res - 1);
> >  	}
> 
> Ok, so we open a race here at the cost of stealing more reservation than
> necessary from the transaction. Seems harmless, but I would like to see
> some quantification/analysis on what a 'very short race window' is in
> this context.

About 20 instructions when the value is zero. The number of racers
will be dependent on how many threads are blocked in commit on the
xc_ctx_lock while a CIL push is in progress. The unit reservation
stolen here for the CIL is:

	xfs_log_calc_unit_res(mp, 0)
		~= 4 * sizeof(xlog_op_header) +
		   sizeof(xlog_trans_header) +
		   sizeof(sector) +
		   log stripe unit roundoff

So for a typical 4k log sector, we are talking about ~6kB of unit
reservation per thread that races here. For a 256k log stripe unit,
then it's going to be about 520kB per racing thread.
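
A rough, self-contained model of that arithmetic (the structure sizes
and the roundoff term below are assumptions for illustration only, not
the real xfs_log_calc_unit_res(); it is only meant to show why the
per-thread cost tracks the sector size without a log stripe unit and
the LSU with one):

	#include <stdio.h>

	/* hedged back-of-envelope model, not the kernel implementation */
	static long cil_steal_estimate(long sector, long lsunit)
	{
		long op_hdr = 12;	/* assumed sizeof(struct xlog_op_header) */
		long trans_hdr = 16;	/* assumed sizeof(struct xlog_trans_header) */
		long roundoff = lsunit ? 2 * lsunit : sector;	/* assumed worst case */

		return 4 * op_hdr + trans_hdr + sector + roundoff;
	}

	int main(void)
	{
		printf("4k sector, no LSU  : ~%ld bytes\n", cil_steal_estimate(4096, 0));
		printf("4k sector, 256k LSU: ~%ld bytes\n", cil_steal_estimate(4096, 262144));
		return 0;
	}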

That said, every thread that races has this reservation available,
and the amount reserved adds to the space used in the CIL. Hence the
only downside of racing here is that the CIL background pushes
earlier because it hits the threshold sooner. That isn't a huge
issue - if we can't push immediately then the CIL will
run to the hard limit and block commits there; that overrun space is
larger than amount of space "wasted" by racing commits here.

> Particularly as it relates to percpu functionality. Does
> the window scale with cpu count, for example? It might not matter either

Not really. We need a thundering herd to cause issues, and this
occurs after formatting an item so we won't get a huge thundering
herd even when lots of threads block on the xc_ctx_lock waiting for
a push to complete.

> way because we expect any given transaction to accommodate the ctx res,
> but it would be good to understand the behavior here so we can think
> about potential side effects, if any.

I haven't been able to come up with any adverse side effects except
for "performance might drop a bit if we reserve too much and push
early", but that is tempered by the fact that performance goes up
much more than we might lose by getting rid of the xc_cil_lock
bottleneck.

> >  	/* do we need space for more log record headers? */
> > -	iclog_space = log->l_iclog_size - log->l_iclog_hsize;
> > -	if (len > 0 && (ctx->space_used / iclog_space !=
> > -				(ctx->space_used + len) / iclog_space)) {
> > +	if (len > 0 && !ctx_res) {
> > +		iclog_space = log->l_iclog_size - log->l_iclog_hsize;
> >  		split_res = (len + iclog_space - 1) / iclog_space;
> >  		/* need to take into account split region headers, too */
> >  		split_res *= log->l_iclog_hsize + sizeof(struct xlog_op_header);
> > -		ctx->ticket->t_unit_res += split_res;
> > -		ctx->ticket->t_curr_res += split_res;
> >  		tp->t_ticket->t_curr_res -= split_res;
> >  		ASSERT(tp->t_ticket->t_curr_res >= len);
> >  	}
> 
> Similarly here, assume additional split reservation for every
> context rather than checking each commit. Seems reasonable in
> principle, but just from a cursory glance this doesn't cover the
> case of the context expanding beyond more than two iclogs.  IOW,
> the current logic adds split_res if the size increase from the
> current transaction expands the ctx into another iclog than before
> the transaction. The new logic only seems to add split_res for the
> first transaction into the ctx. Also note

No, I changed it to apply to any vector length longer than a single
iclog except for transactions that have taken the unit reservation
above.

Sampling the space used isn't accurate here, and we do not want to
be doing an accurate sum across all CPUs, hence trying to detect a
reservation crossing an iclog boundary is difficult. Hence I just
took the reservation for anything that is guaranteed to cross an
iclog boundary. 

> that len seems to be a factor in the calculation of split_res, but it's
> not immediately clear to me what impact filtering the split_res
> calculation as such has in that regard.

The len used in the split_res calc guarantees that we always have at
least 1 log and op header accounted for by a split, and if the
vector length is greater than a single iclog it will include a
header for every iclog that the vector may span. i.e. if len >
iclog_space, it will reserve 2 extra iclog headers and op headers as
it may split across 3 iclogs....
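
A quick worked example of that ceiling arithmetic (the geometry values
are assumed for illustration; the formula itself is the one in the
patch hunk quoted above):

	#include <stdio.h>

	int main(void)
	{
		/* assumed example geometry: 32k iclogs with a 512 byte header */
		int l_iclog_size  = 32768;
		int l_iclog_hsize = 512;
		int op_hdr        = 12;	/* assumed sizeof(struct xlog_op_header) */
		int iclog_space   = l_iclog_size - l_iclog_hsize;
		int lens[] = { 4096, iclog_space + 1, 2 * iclog_space + 1 };

		for (int i = 0; i < 3; i++) {
			int len = lens[i];
			int nhdrs = (len + iclog_space - 1) / iclog_space;
			int split_res = nhdrs * (l_iclog_hsize + op_hdr);

			printf("len=%6d -> %d header(s), split_res=%d bytes\n",
			       len, nhdrs, split_res);
		}
		return 0;
	}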

> (BTW the comment above this hunk needs an update if we do end up with
> some special logic here.)

Definitely. :)

Cheers,

Dave.
Brian Foster May 13, 2020, 12:09 p.m. UTC | #3
On Wed, May 13, 2020 at 09:36:27AM +1000, Dave Chinner wrote:
> On Tue, May 12, 2020 at 10:05:44AM -0400, Brian Foster wrote:
> > On Tue, May 12, 2020 at 07:28:09PM +1000, Dave Chinner wrote:
> > > From: Dave Chinner <dchinner@redhat.com>
> > > 
> > > With the m_active_trans atomic bottleneck out of the way, the CIL
> > > xc_cil_lock is the next bottleneck that causes cacheline contention.
> > > This protects several things, the first of which is the CIL context
> > > reservation ticket and space usage counters.
> > > 
> > > We can lift them out of the xc_cil_lock by converting them to
> > > percpu counters. THis involves two things, the first of which is
> > > lifting calculations and samples that don't actually need protecting
> > > from races outside the xc_cil lock.
> > > 
> > > The second is converting the counters to percpu counters and lifting
> > > them outside the lock. This requires a couple of tricky things to
> > > minimise initial state races and to ensure we take into account
> > > split reservations. We do this by erring on the "take the
> > > reservation just in case" side, which largely lost in the noise of
> > > many frequent large transactions.
> > > 
> > > We use a trick with percpu_counter_add_batch() to ensure the global
> > > sum is updated immediately on first reservation, hence allowing us
> > > to use fast counter reads everywhere to determine if the CIL is
> > > empty or not, rather than using the list itself. This is important
> > > for later patches where the CIL is moved to percpu lists
> > > and hence cannot use list_empty() to detect an empty CIL. Hence we
> > > provide a low overhead, lockless mechanism for determining if the
> > > CIL is empty or not via this mechanisms. All other percpu counter
> > > updates use a large batch count so they aggregate on the local CPU
> > > and minimise global sum updates.
> > > 
> > > The xc_ctx_lock rwsem protects draining the percpu counters to the
> > > context's ticket, similar to the way it allows access to the CIL
> > > without using the xc_cil_lock. i.e. the CIL push has exclusive
> > > access to the CIL, the context and the percpu counters while holding
> > > the xc_ctx_lock. This ensures that we can sum and zero the counters
> > > atomically from the perspective of the transaction commit side of
> > > the push. i.e. they reset to zero atomically with the CIL context
> > > swap and hence we don't need to have the percpu counters attached to
> > > the CIL context.
> > > 
> > > Performance wise, this increases the transaction rate from
> > > ~620,000/s to around 750,000/second. Using a 32-way concurrent
> > > create instead of 16-way on a 32p/16GB virtual machine:
> > > 
> > > 		create time	rate		unlink time
> > > unpatched	  2m03s      472k/s+/-9k/s	 3m6s
> > > patched		  1m56s	     533k/s+/-28k/s	 2m34
> > > 
> > > Notably, the system time for the create went from 44m20s down to
> > > 38m37s, whilst going faster. There is more variance, but I think
> > > that is from the cacheline contention having inconsistent overhead.
> > > 
> > > XXX: probably should split into two patches
> > > 
> > 
> > Yes please. :)
> > 
> > > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > > ---
> > >  fs/xfs/xfs_log_cil.c  | 99 ++++++++++++++++++++++++++++++-------------
> > >  fs/xfs/xfs_log_priv.h |  2 +
> > >  2 files changed, 72 insertions(+), 29 deletions(-)
> > > 
> > > diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
> > > index b43f0e8f43f2e..746c841757ed1 100644
> > > --- a/fs/xfs/xfs_log_cil.c
> > > +++ b/fs/xfs/xfs_log_cil.c
> > > @@ -393,7 +393,7 @@ xlog_cil_insert_items(
> > >  	struct xfs_log_item	*lip;
> > >  	int			len = 0;
> > >  	int			diff_iovecs = 0;
> > > -	int			iclog_space;
> > > +	int			iclog_space, space_used;
> > >  	int			iovhdr_res = 0, split_res = 0, ctx_res = 0;
> > >  
> > 
> > fs/xfs//xfs_log_cil.c: In function ‘xlog_cil_insert_items’:
> > fs/xfs//xfs_log_cil.c:396:21: warning: unused variable ‘space_used’ [-Wunused-variable]
> > 
> > >  	ASSERT(tp);
> > > @@ -403,17 +403,16 @@ xlog_cil_insert_items(
> > >  	 * are done so it doesn't matter exactly how we update the CIL.
> > >  	 */
> > >  	xlog_cil_insert_format_items(log, tp, &len, &diff_iovecs);
> > > -
> > > -	spin_lock(&cil->xc_cil_lock);
> > > -
> > >  	/* account for space used by new iovec headers  */
> > > +
> > >  	iovhdr_res = diff_iovecs * sizeof(xlog_op_header_t);
> > >  	len += iovhdr_res;
> > >  	ctx->nvecs += diff_iovecs;
> > >  
> > > -	/* attach the transaction to the CIL if it has any busy extents */
> > > -	if (!list_empty(&tp->t_busy))
> > > -		list_splice_init(&tp->t_busy, &ctx->busy_extents);
> > > +	/*
> > > +	 * The ticket can't go away from us here, so we can do racy sampling
> > > +	 * and precalculate everything.
> > > +	 */
> > >  
> > >  	/*
> > >  	 * Now transfer enough transaction reservation to the context ticket
> > > @@ -421,27 +420,28 @@ xlog_cil_insert_items(
> > >  	 * reservation has to grow as well as the current reservation as we
> > >  	 * steal from tickets so we can correctly determine the space used
> > >  	 * during the transaction commit.
> > > +	 *
> > > +	 * We use percpu_counter_add_batch() here to force the addition into the
> > > +	 * global sum immediately. This will result in percpu_counter_read() now
> > > +	 * always returning a non-zero value, and hence we'll only ever have a
> > > +	 * very short race window on new contexts.
> > >  	 */
> > > -	if (ctx->ticket->t_curr_res == 0) {
> > > +	if (percpu_counter_read(&cil->xc_curr_res) == 0) {
> > >  		ctx_res = ctx->ticket->t_unit_res;
> > > -		ctx->ticket->t_curr_res = ctx_res;
> > >  		tp->t_ticket->t_curr_res -= ctx_res;
> > > +		percpu_counter_add_batch(&cil->xc_curr_res, ctx_res, ctx_res - 1);
> > >  	}
> > 
> > Ok, so we open a race here at the cost of stealing more reservation than
> > necessary from the transaction. Seems harmless, but I would like to see
> > some quantification/analysis on what a 'very short race window' is in
> > this context.
> 
> About 20 instructions when the value is zero. The number of racers
> will be dependent on how many threads are blocked in commit on the
> xc_ctx_lock while a CIL push is in progress. The unit reservation
> stolen here for the CIL is:
> 

Well we're also in the transaction commit path, which these changes are
intended to improve. The post push situation makes sense for a worst
case scenario, though, as that can probably line up a bunch of CPUs
behind the ctx lock.

> 	xfs_log_calc_unit_res(mp, 0)
> 		~= 4 * sizeof(xlog_op_header) +
> 		   sizeof(xlog_trans_header) +
> 		   sizeof(sector) +
> 		   log stripe unit roundoff
> 
> So for a typical 4k log sector, we are talking about ~6kB of unit
> reservation per thread that races here. For a 256k log stripe unit,
> then it's going to be about 520kB per racing thread.
> 
> That said, every thread that races has this reservation available,
> and the amount reserved adds to the space used in the CIL. Hence the
> only downside of racing here is that the CIL background pushes
> earlier because it hits the threshold sooner. That isn't a huge
> issue - if we can't push immediately then the CIL will
> run to the hard limit and block commits there; that overrun space is
> larger than amount of space "wasted" by racing commits here.
> 

Right, it's more of a tradeoff of holding on to unused reservation a bit
longer for performance. Note that the background CIL push is based on
context size, not reservation size, so I don't see how that would be
affected by this particular race.

> > Particularly as it relates to percpu functionality. Does
> > the window scale with cpu count, for example? It might not matter either
> 
> Not really. We need a thundering herd to cause issues, and this
> occurs after formatting an item so we won't get a huge thundering
> herd even when lots of threads block on the xc_ctx_lock waiting for
> a push to complete.
> 

It would be nice to have some debug code somewhere that somehow or
another asserts or warns if the CIL reservation exceeds some
insane/unexpected heuristic based on the current size of the context. I
don't know what that code or heuristic looks like (i.e. multiple factors
of the ctx size?) so I'm obviously handwaving. Just something to think
about if we can come up with a way to accomplish that opportunistically.

> > way because we expect any given transaction to accommodate the ctx res,
> > but it would be good to understand the behavior here so we can think
> > about potential side effects, if any.
> 
> I haven't been able to come up with any adverse side effects except
> for "performance might drop a bit if we reserve too much and push
> early", but that is tempered by the fact that performance goes up
> much more than we might lose by getting rid of the xc_cil_lock
> bottleneck.
> 

FWIW, a more extreme test vector could be to steal the remainder of the
transaction reservation for the CIL ticket and see how that affects
things. That's probably more suited for a local test than something to
live in the upstream code, though.

> > >  	/* do we need space for more log record headers? */
> > > -	iclog_space = log->l_iclog_size - log->l_iclog_hsize;
> > > -	if (len > 0 && (ctx->space_used / iclog_space !=
> > > -				(ctx->space_used + len) / iclog_space)) {
> > > +	if (len > 0 && !ctx_res) {
> > > +		iclog_space = log->l_iclog_size - log->l_iclog_hsize;
> > >  		split_res = (len + iclog_space - 1) / iclog_space;
> > >  		/* need to take into account split region headers, too */
> > >  		split_res *= log->l_iclog_hsize + sizeof(struct xlog_op_header);
> > > -		ctx->ticket->t_unit_res += split_res;
> > > -		ctx->ticket->t_curr_res += split_res;
> > >  		tp->t_ticket->t_curr_res -= split_res;
> > >  		ASSERT(tp->t_ticket->t_curr_res >= len);
> > >  	}
> > 
> > Similarly here, assume additional split reservation for every
> > context rather than checking each commit. Seems reasonable in
> > principle, but just from a cursory glance this doesn't cover the
> > case of the context expanding beyond more than two iclogs.  IOW,
> > the current logic adds split_res if the size increase from the
> > current transaction expands the ctx into another iclog than before
> > the transaction. The new logic only seems to add split_res for the
> > first transaction into the ctx. Also note
> 
> No, I changed it to apply to any vector length longer than a single
> iclog except for transactions that have taken the unit reservation
> above.
> 

Ok, I had the ctx_res logic inverted in my head. So it's not that
split_res is only added for the first transaction, but rather we treat
every transaction that didn't contribute unit res as if it crosses an
iclog boundary. That seems much more reasonable, though it does add to
the "overreservation" of the ticket so I'll reemphasize the request for
some form of debug/trace check that helps analyze runtime CIL ticket
reservation accounting. ;)

OTOH, this skips the split_res in the case where a single large
multi-iclog transaction happens to be the first in the ctx, right? That
doesn't seem that unlikely a scenario considering minimum iclog and
worst case transaction unit res sizes. It actually makes me wonder what
happens if the CIL ticket underruns.. :P

> Sampling the space used isn't accurate here, and we do not want to
> be doing an accurate sum across all CPUs, hence trying to detect a
> reservation crossing an iclog boundary is difficult. Hence I just
> took the reservation for anything that is guaranteed to cross an
> iclog boundary. 
> 

Right.. though it looks more like we take the reservation for anything
that might possibly cross an iclog boundary vs. anything that is
guaranteed to do so, because we can't really establish the latter
without batching up the current size.

Brian

> > that len seems to be a factor in the calculation of split_res, but it's
> > not immediately clear to me what impact filtering the split_res
> > calculation as such has in that regard.
> 
> The len used in the split_res calc guarantees that we always have at
> least 1 log and op header accounted for by a split, and if the
> vector length is greater than a single iclog it will include a
> header for every iclog that the vector may span. i.e. if len >
> iclog_space, it will reserve 2 extra iclog headers and op headers as
> it may split across 3 iclogs....
> 
> > (BTW the comment above this hunk needs an update if we do end up with
> > some special logic here.)
> 
> Definitely. :)
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
>
Dave Chinner May 13, 2020, 9:52 p.m. UTC | #4
On Wed, May 13, 2020 at 08:09:59AM -0400, Brian Foster wrote:
> On Wed, May 13, 2020 at 09:36:27AM +1000, Dave Chinner wrote:
> > On Tue, May 12, 2020 at 10:05:44AM -0400, Brian Foster wrote:
> > > Particularly as it relates to percpu functionality. Does
> > > the window scale with cpu count, for example? It might not matter either
> > 
> > Not really. We need a thundering herd to cause issues, and this
> > occurs after formatting an item so we won't get a huge thundering
> > herd even when lots of threads block on the xc_ctx_lock waiting for
> > a push to complete.
> > 
> 
> It would be nice to have some debug code somewhere that somehow or
> another asserts or warns if the CIL reservation exceeds some
> insane/unexpected heuristic based on the current size of the context. I
> don't know what that code or heuristic looks like (i.e. multiple factors
> of the ctx size?) so I'm obviously handwaving. Just something to think
> about if we can come up with a way to accomplish that opportunistically.

I don't think there is a reliable mechanism that can be used here.
At one end of the scale we have the valid case of a synchronous
inode modification on a log with a 256k stripe unit. So it's valid
to have a CIL reservation of ~550kB for a single item that consumes
~700 bytes of log space.

OTOH, we might be freeing extents on a massively fragmented file and
filesystem, so we're pushing 200kB+ transactions into the CIL for
every rolling transaction. On a filesystem with a 512 byte log
sector size and no LSU, the CIL reservations are dwarfed by the
actual metadata being logged...

I'd suggest that looking at the ungrant trace for the CIL ticket
once it has committed will tell us exactly how much the reservation
was over-estimated, as the unused portion of the reservation will be
returned to the reserve grant head at this point in time.

> > > way because we expect any given transaction to accommodate the ctx res,
> > > but it would be good to understand the behavior here so we can think
> > > about potential side effects, if any.
> > 
> > I haven't been able to come up with any adverse side effects except
> > for "performance might drop a bit if we reserve too much and push
> > early", but that is tempered by the fact that performance goes up
> > much more than we might lose by getting rid of the xc_cil_lock
> > bottleneck.
> > 
> 
> FWIW, a more extreme test vector could be to steal the remainder of the
> transaction reservation for the CIL ticket and see how that affects
> things. That's probably more suited for a local test than something to
> live in the upstream code, though.

It will only affect performance. Worst case is that it degrades the
CIL to behave like the original logging code that wrote direct to
iclogs. i.e. it defeats the in memory aggregation of delayed logging
and effectively writes directly to the iclogs.

Which, really, is largely what using -o wsync or -o dirsync does,
by making a large number of transactions synchronous.

In short, performance goes down if we reserve too much, but nothing
incorrect should occur.

> 
> > > >  	/* do we need space for more log record headers? */
> > > > -	iclog_space = log->l_iclog_size - log->l_iclog_hsize;
> > > > -	if (len > 0 && (ctx->space_used / iclog_space !=
> > > > -				(ctx->space_used + len) / iclog_space)) {
> > > > +	if (len > 0 && !ctx_res) {
> > > > +		iclog_space = log->l_iclog_size - log->l_iclog_hsize;
> > > >  		split_res = (len + iclog_space - 1) / iclog_space;
> > > >  		/* need to take into account split region headers, too */
> > > >  		split_res *= log->l_iclog_hsize + sizeof(struct xlog_op_header);
> > > > -		ctx->ticket->t_unit_res += split_res;
> > > > -		ctx->ticket->t_curr_res += split_res;
> > > >  		tp->t_ticket->t_curr_res -= split_res;
> > > >  		ASSERT(tp->t_ticket->t_curr_res >= len);
> > > >  	}
> > > 
> > > Similarly here, assume additional split reservation for every
> > > context rather than checking each commit. Seems reasonable in
> > > principle, but just from a cursory glance this doesn't cover the
> > > case of the context expanding beyond more than two iclogs.  IOW,
> > > the current logic adds split_res if the size increase from the
> > > current transaction expands the ctx into another iclog than before
> > > the transaction. The new logic only seems to add split_res for the
> > > first transaction into the ctx. Also note
> > 
> > No, I changed it to apply to any vector length longer than a single
> > iclog except for transactions that have taken the unit reservation
> > above.
> > 
> 
> Ok, I had the ctx_res logic inverted in my head. So it's not that
> split_res is only added for the first transaction, but rather we treat
> every transaction that didn't contribute unit res as if it crosses an
> iclog boundary. That seems much more reasonable, though it does add to
> the "overreservation" of the ticket so I'll reemphasize the request for
> some form of debug/trace check that helps analyze runtime CIL ticket
> reservation accounting. ;)

I can add some trace points, but I think we already have the
tracepoints we need to understand how much overestimation occurs.
i.e. trace_xfs_log_ticket_ungrant() from the CIL push worker
context.

> OTOH, this skips the split_res in the case where a single large
> multi-iclog transaction happens to be the first in the ctx, right?

True, that's easy enough to fix.

> That
> doesn't seem that unlikely a scenario considering minimum iclog and
> worst case transaction unit res sizes. It actually makes me wonder what
> happens if the CIL ticket underruns.. :P

xlog_write() will catch the underrun when we go to write the commit
record via xlog_commit_record(), dump the ticket and shutdown the
filesystem.

Cheers,

Dave.
Dave Chinner May 14, 2020, 1:50 a.m. UTC | #5
On Thu, May 14, 2020 at 07:52:41AM +1000, Dave Chinner wrote:
> On Wed, May 13, 2020 at 08:09:59AM -0400, Brian Foster wrote:
> > On Wed, May 13, 2020 at 09:36:27AM +1000, Dave Chinner wrote:
> > > On Tue, May 12, 2020 at 10:05:44AM -0400, Brian Foster wrote:
> > > > Particularly as it relates to percpu functionality. Does
> > > > the window scale with cpu count, for example? It might not matter either
> > > 
> > > Not really. We need a thundering herd to cause issues, and this
> > > occurs after formatting an item so we won't get a huge thundering
> > > herd even when lots of threads block on the xc_ctx_lock waiting for
> > > a push to complete.
> > > 
> > 
> > It would be nice to have some debug code somewhere that somehow or
> > another asserts or warns if the CIL reservation exceeds some
> > insane/unexpected heuristic based on the current size of the context. I
> > don't know what that code or heuristic looks like (i.e. multiple factors
> > of the ctx size?) so I'm obviously handwaving. Just something to think
> > about if we can come up with a way to accomplish that opportunistically.
> 
> I don't think there is a reliable mechanism that can be used here.
> At one end of the scale we have the valid case of a synchronous
> inode modification on a log with a 256k stripe unit. So it's valid
> to have a CIL reservation of ~550kB for a single item that consumes
> ~700 bytes of log space.
> 
> OTOH, we might be freeing extents on a massively fragmented file and
> filesystem, so we're pushing 200kB+ transactions into the CIL for
> every rolling transaction. On a filesystem with a 512 byte log
> sector size and no LSU, the CIL reservations are dwarfed by the
> actual metadata being logged...
> 
> I'd suggest that looking at the ungrant trace for the CIL ticket
> once it has committed will tell us exactly how much the reservation
> was over-estimated, as the unused portion of the reservation will be
> returned to the reserve grant head at this point in time.

Typical for this workload is a CIL ticket that looks like this at
ungrant time:

t_curr_res 13408 t_unit_res 231100
t_curr_res 9240 t_unit_res 140724
t_curr_res 46284 t_unit_res 263964
t_curr_res 29780 t_unit_res 190020
t_curr_res 38044 t_unit_res 342016
t_curr_res 21636 t_unit_res 321476
t_curr_res 21576 t_unit_res 263964
t_curr_res 42200 t_unit_res 411852
t_curr_res 21636 t_unit_res 292720
t_curr_res 62740 t_unit_res 514552
t_curr_res 17456 t_unit_res 284504
t_curr_res 29852 t_unit_res 411852
t_curr_res 13384 t_unit_res 206452
t_curr_res 70956 t_unit_res 518660
t_curr_res 70908 t_unit_res 333800
t_curr_res 50404 t_unit_res 518660
t_curr_res 17480 t_unit_res 321476
t_curr_res 33948 t_unit_res 436500
t_curr_res 17492 t_unit_res 317368
t_curr_res 50392 t_unit_res 489904
t_curr_res 13360 t_unit_res 325584
t_curr_res 66812 t_unit_res 506336
t_curr_res 33924 t_unit_res 366664
t_curr_res 70932 t_unit_res 551524
t_curr_res 29852 t_unit_res 374880
t_curr_res 25720 t_unit_res 494012
t_curr_res 42152 t_unit_res 506336
t_curr_res 21684 t_unit_res 543308
t_curr_res 29840 t_unit_res 440608
t_curr_res 46320 t_unit_res 551524
t_curr_res 21624 t_unit_res 387204
t_curr_res 29840 t_unit_res 522768

So we are looking at a reservation of up to 500KB, and typically
using all but a few 10s of KB of it.

I'll use this as the ballpark for the lockless code.

Cheers,

Dave.
Dave Chinner May 14, 2020, 2:49 a.m. UTC | #6
On Thu, May 14, 2020 at 11:50:55AM +1000, Dave Chinner wrote:
> On Thu, May 14, 2020 at 07:52:41AM +1000, Dave Chinner wrote:
> > On Wed, May 13, 2020 at 08:09:59AM -0400, Brian Foster wrote:
> > > On Wed, May 13, 2020 at 09:36:27AM +1000, Dave Chinner wrote:
> > > > On Tue, May 12, 2020 at 10:05:44AM -0400, Brian Foster wrote:
> > > > > Particularly as it relates to percpu functionality. Does
> > > > > the window scale with cpu count, for example? It might not matter either
> > > > 
> > > > Not really. We need a thundering herd to cause issues, and this
> > > > occurs after formatting an item so we won't get a huge thundering
> > > > herd even when lots of threads block on the xc_ctx_lock waiting for
> > > > a push to complete.
> > > > 
> > > 
> > > It would be nice to have some debug code somewhere that somehow or
> > > another asserts or warns if the CIL reservation exceeds some
> > > insane/unexpected heuristic based on the current size of the context. I
> > > don't know what that code or heuristic looks like (i.e. multiple factors
> > > of the ctx size?) so I'm obviously handwaving. Just something to think
> > > about if we can come up with a way to accomplish that opportunistically.
> > 
> > I don't think there is a reliable mechanism that can be used here.
> > At one end of the scale we have the valid case of a synchronous
> > inode modification on a log with a 256k stripe unit. So it's valid
> > to have a CIL reservation of ~550kB for a single item that consumes
> > ~700 bytes of log space.
> > 
> > OTOH, we might be freeing extents on a massively fragmented file and
> > filesystem, so we're pushing 200kB+ transactions into the CIL for
> > every rolling transaction. On a filesystem with a 512 byte log
> > sector size and no LSU, the CIL reservations are dwarfed by the
> > actual metadata being logged...
> > 
> > I'd suggest that looking at the ungrant trace for the CIL ticket
> > once it has committed will tell us exactly how much the reservation
> > was over-estimated, as the unused portion of the reservation will be
> > returned to the reserve grant head at this point in time.
> 
> Typical for this workload is a CIl ticket that looks like this at
> ungrant time:
> 
> t_curr_res 13408 t_unit_res 231100
> t_curr_res 9240 t_unit_res 140724
> t_curr_res 46284 t_unit_res 263964
> t_curr_res 29780 t_unit_res 190020
> t_curr_res 38044 t_unit_res 342016
> t_curr_res 21636 t_unit_res 321476
> t_curr_res 21576 t_unit_res 263964
> t_curr_res 42200 t_unit_res 411852
> t_curr_res 21636 t_unit_res 292720
> t_curr_res 62740 t_unit_res 514552
> t_curr_res 17456 t_unit_res 284504
> t_curr_res 29852 t_unit_res 411852
> t_curr_res 13384 t_unit_res 206452
> t_curr_res 70956 t_unit_res 518660
> t_curr_res 70908 t_unit_res 333800
> t_curr_res 50404 t_unit_res 518660
> t_curr_res 17480 t_unit_res 321476
> t_curr_res 33948 t_unit_res 436500
> t_curr_res 17492 t_unit_res 317368
> t_curr_res 50392 t_unit_res 489904
> t_curr_res 13360 t_unit_res 325584
> t_curr_res 66812 t_unit_res 506336
> t_curr_res 33924 t_unit_res 366664
> t_curr_res 70932 t_unit_res 551524
> t_curr_res 29852 t_unit_res 374880
> t_curr_res 25720 t_unit_res 494012
> t_curr_res 42152 t_unit_res 506336
> t_curr_res 21684 t_unit_res 543308
> t_curr_res 29840 t_unit_res 440608
> t_curr_res 46320 t_unit_res 551524
> t_curr_res 21624 t_unit_res 387204
> t_curr_res 29840 t_unit_res 522768
> 
> So we are looking at a reservation of up to 500KB, and typically
> using all but a few 10s of KB of it.
> 
> I'll use this as the ballpark for the lockless code.

And, yeah, just reserving a split res for everything blows it out by
a factor of 100.

I've worked out via experimentation and measurement that taking a
split reservation when the current transaction is near the iclog
threshold results in roughly doubling the above reservation, but I
haven't had it go anywhere near an underrun at this point....

I'll work with this heuristic for the moment, and spend some more
time thinking about stuff I've just learnt getting to this point.
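
A minimal sketch of one plausible form of that heuristic (assuming it
reuses the original iclog-boundary test against a racy
percpu_counter_read() sample; this is an interpretation, not code from
Dave's tree):

	/*
	 * Hedged sketch: only steal split_res when this commit might push the
	 * (racily sampled) CIL usage across an iclog boundary. space_used is
	 * a fast percpu_counter_read(), so the test is approximate by design.
	 */
	space_used = percpu_counter_read(&cil->xc_space_used);
	if (len > 0 && (space_used / iclog_space !=
			(space_used + len) / iclog_space)) {
		/* take split_res from tp->t_ticket as in the hunk above */
	}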

Cheers,

Dave.
Brian Foster May 14, 2020, 1:43 p.m. UTC | #7
On Thu, May 14, 2020 at 07:52:41AM +1000, Dave Chinner wrote:
> On Wed, May 13, 2020 at 08:09:59AM -0400, Brian Foster wrote:
> > On Wed, May 13, 2020 at 09:36:27AM +1000, Dave Chinner wrote:
> > > On Tue, May 12, 2020 at 10:05:44AM -0400, Brian Foster wrote:
> > > > Particularly as it relates to percpu functionality. Does
> > > > the window scale with cpu count, for example? It might not matter either
> > > 
> > > Not really. We need a thundering herd to cause issues, and this
> > > occurs after formatting an item so we won't get a huge thundering
> > > herd even when lots of threads block on the xc_ctx_lock waiting for
> > > a push to complete.
> > > 
> > 
> > It would be nice to have some debug code somewhere that somehow or
> > another asserts or warns if the CIL reservation exceeds some
> > insane/unexpected heuristic based on the current size of the context. I
> > don't know what that code or heuristic looks like (i.e. multiple factors
> > of the ctx size?) so I'm obviously handwaving. Just something to think
> > about if we can come up with a way to accomplish that opportunistically.
> 
> I don't think there is a reliable mechanism that can be used here.
> At one end of the scale we have the valid case of a synchronous
> inode modification on a log with a 256k stripe unit. So it's valid
> to have a CIL reservation of ~550kB for a single item that consumes
> ~700 bytes of log space.
> 

Perhaps ctx size isn't a sufficient baseline by itself. That said, log
stripe unit is still fixed and afaict centralized to the base unit res
calculation of the CIL ctx and regular transaction tickets. So I'm not
convinced we couldn't come up with something useful on the push side
that factors lsunit into the metric. It doesn't have to be perfect, just
something conservative enough to catch consuming reservation beyond
expected worst case conditions without causing false positives.
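
Purely as a strawman of the kind of opportunistic check being suggested
here (the helper name, the factor-of-four threshold and the lsunit slack
term are all made up for illustration, not proposed code):

	#ifdef DEBUG
	/*
	 * Warn if the CIL ticket holds far more reservation than the context
	 * size plus a worst-case log stripe unit allowance would explain.
	 */
	static void xlog_cil_res_check_sketch(struct xlog *log,
					      struct xfs_cil_ctx *ctx)
	{
		int	lsu_slack = 2 * log->l_mp->m_sb.sb_logsunit;
		int	expected = ctx->space_used + lsu_slack;

		WARN_ON_ONCE(expected > 0 &&
			     ctx->ticket->t_unit_res > 4 * expected);
	}
	#endif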

Re: your followups, I'll put more thought into it when the percpu
algorithm is more settled..

> OTOH, we might be freeing extents on a massively fragmented file and
> filesystem, so we're pushing 200kB+ transactions into the CIL for
> every rolling transaction. On a filesystem with a 512 byte log
> sector size and no LSU, the CIL reservations are dwarfed by the
> actual metadata being logged...
> 
> I'd suggest that looking at the ungrant trace for the CIL ticket
> once it has committed will tell us exactly how much the reservation
> was over-estimated, as the unused portion of the reservation will be
> returned to the reserve grant head at this point in time.
> 

Yeah, that works for me.

> > > > way because we expect any given transaction to accommodate the ctx res,
> > > > but it would be good to understand the behavior here so we can think
> > > > about potential side effects, if any.
> > > 
> > > I haven't been able to come up with any adverse side effects except
> > > for "performance might drop a bit if we reserve too much and push
> > > early", but that is tempered by the fact that performance goes up
> > > much more than we might lose by getting rid of the xc_cil_lock
> > > bottleneck.
> > > 
> > 
> > FWIW, a more extreme test vector could be to steal the remainder of the
> > transaction reservation for the CIL ticket and see how that affects
> > things. That's probably more suited for a local test than something to
> > live in the upstream code, though.
> 
> It will only affect performance. Worst case is that it degrades the
> CIL to behave like the original logging code that wrote direct to
> iclogs. i.e. it defeats the in memory aggregation of delayed logging
> and effectively writes directly to the iclogs.
> 
> Which, really, is what using the -o wsync or -o dirsync largely do
> by making a large number of transactions synchrnonous.
> 
> In short, performance goes down if we reserve too much, but nothing
> incorrect should occur.
> 

Ok.

> > 
> > > > >  	/* do we need space for more log record headers? */
> > > > > -	iclog_space = log->l_iclog_size - log->l_iclog_hsize;
> > > > > -	if (len > 0 && (ctx->space_used / iclog_space !=
> > > > > -				(ctx->space_used + len) / iclog_space)) {
> > > > > +	if (len > 0 && !ctx_res) {
> > > > > +		iclog_space = log->l_iclog_size - log->l_iclog_hsize;
> > > > >  		split_res = (len + iclog_space - 1) / iclog_space;
> > > > >  		/* need to take into account split region headers, too */
> > > > >  		split_res *= log->l_iclog_hsize + sizeof(struct xlog_op_header);
> > > > > -		ctx->ticket->t_unit_res += split_res;
> > > > > -		ctx->ticket->t_curr_res += split_res;
> > > > >  		tp->t_ticket->t_curr_res -= split_res;
> > > > >  		ASSERT(tp->t_ticket->t_curr_res >= len);
> > > > >  	}
> > > > 
> > > > Similarly here, assume additional split reservation for every
> > > > context rather than checking each commit. Seems reasonable in
> > > > principle, but just from a cursory glance this doesn't cover the
> > > > case of the context expanding beyond more than two iclogs.  IOW,
> > > > the current logic adds split_res if the size increase from the
> > > > current transaction expands the ctx into another iclog than before
> > > > the transaction. The new logic only seems to add split_res for the
> > > > first transaction into the ctx. Also note
> > > 
> > > No, I changed it to apply to any vector length longer than a single
> > > iclog except for transactions that have taken the unit reservation
> > > above.
> > > 
> > 
> > Ok, I had the ctx_res logic inverted in my head. So it's not that
> > split_res is only added for the first transaction, but rather we treat
> > every transaction that didn't contribute unit res as if it crosses an
> > iclog boundary. That seems much more reasonable, though it does add to
> > the "overreservation" of the ticket so I'll reemphasize the request for
> > some form of debug/trace check that helps analyze runtime CIL ticket
> > reservation accounting. ;)
> 
> I can add some trace points, but I think we already have the
> tracepoints we need to understand how much overestimation occurs.
> i.e. trace_xfs_log_ticket_ungrant() from the CIL push worker
> context.
> 

Sure..
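
As a rough sketch of the kind of debug aid being asked for, a push-time
report could compare what the CIL ticket collected against what the
checkpoint actually used. The field names below match the posted patch,
but the helper itself is hypothetical and not part of the series; it
would be called from xlog_cil_push_work() right after the percpu sums
are drained back into the context:

/*
 * Hypothetical debug aid: once the percpu counters have been folded back
 * into the context, t_curr_res is the total reservation the CIL took from
 * committing transactions and space_used is what the checkpoint actually
 * consumed. The difference is the over-reservation this approach trades
 * for removing the xc_cil_lock from the commit path.
 */
static void
xlog_cil_report_overres(
	struct xfs_cil_ctx	*ctx)
{
	struct xlog_ticket	*tic = ctx->ticket;

	trace_printk("CIL seq %llu: reserved %d used %d excess %d\n",
			(unsigned long long)ctx->sequence,
			tic->t_curr_res, ctx->space_used,
			tic->t_curr_res - ctx->space_used);
}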

> > OTOH, this skips the split_res in the case where a single large
> > multi-iclog transaction happens to be the first in the ctx, right?
> 
> True, that's easy enough to fix.
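
For context, the fix could be as small as widening the condition so that
the commit which donates the context reservation still pays split_res when
its own length spans more than one iclog. This is a sketch of one possible
shape, not necessarily what the actual fix looks like:

	/* do we need space for more log record headers? */
	iclog_space = log->l_iclog_size - log->l_iclog_hsize;
	if (len > 0 && (!ctx_res || len > iclog_space)) {
		split_res = (len + iclog_space - 1) / iclog_space;
		/* need to take into account split region headers, too */
		split_res *= log->l_iclog_hsize + sizeof(struct xlog_op_header);
		tp->t_ticket->t_curr_res -= split_res;
		ASSERT(tp->t_ticket->t_curr_res >= len);
	}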
> 
> > That
> > doesn't seem that unlikely a scenario considering minimum iclog and
> > worst case transaction unit res sizes. It actually makes me wonder what
> > happens if the CIL ticket underruns.. :P
> 
> xlog_write() will catch the underrun when we go to write the commit
> record via xlog_commit_record(), dump the ticket and shut down the
> filesystem.
> 

Ah, right. Note that's not the last place we take from ->t_curr_res in
the xlog_write() path. xlog_state_get_iclog_space() steals a bit as well
if we're at the top of an iclog, so it might be worth bolstering that
check.
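
To make that concrete, the bolstered check could mirror the existing
overrun handling in xlog_write(), placed where xlog_state_get_iclog_space()
deducts the record header space from the ticket. This is an illustrative
sketch only, not code from this series:

	/*
	 * Illustrative only: after stealing the remainder of the current
	 * iclog for the record header, catch a CIL ticket underrun here as
	 * well rather than letting t_curr_res silently go negative.
	 */
	if (ticket->t_curr_res < 0) {
		xfs_alert(log->l_mp,
			"CIL ticket reservation underrun: curr_res %d",
			ticket->t_curr_res);
		xlog_print_tic_res(log->l_mp, ticket);
		xfs_force_shutdown(log->l_mp, SHUTDOWN_LOG_IO_ERROR);
	}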

Brian

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
>

Patch

diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
index b43f0e8f43f2e..746c841757ed1 100644
--- a/fs/xfs/xfs_log_cil.c
+++ b/fs/xfs/xfs_log_cil.c
@@ -393,7 +393,7 @@  xlog_cil_insert_items(
 	struct xfs_log_item	*lip;
 	int			len = 0;
 	int			diff_iovecs = 0;
-	int			iclog_space;
+	int			iclog_space, space_used;
 	int			iovhdr_res = 0, split_res = 0, ctx_res = 0;
 
 	ASSERT(tp);
@@ -403,17 +403,16 @@  xlog_cil_insert_items(
 	 * are done so it doesn't matter exactly how we update the CIL.
 	 */
 	xlog_cil_insert_format_items(log, tp, &len, &diff_iovecs);
-
-	spin_lock(&cil->xc_cil_lock);
-
 	/* account for space used by new iovec headers  */
+
 	iovhdr_res = diff_iovecs * sizeof(xlog_op_header_t);
 	len += iovhdr_res;
 	ctx->nvecs += diff_iovecs;
 
-	/* attach the transaction to the CIL if it has any busy extents */
-	if (!list_empty(&tp->t_busy))
-		list_splice_init(&tp->t_busy, &ctx->busy_extents);
+	/*
+	 * The ticket can't go away from us here, so we can do racy sampling
+	 * and precalculate everything.
+	 */
 
 	/*
 	 * Now transfer enough transaction reservation to the context ticket
@@ -421,27 +420,28 @@  xlog_cil_insert_items(
 	 * reservation has to grow as well as the current reservation as we
 	 * steal from tickets so we can correctly determine the space used
 	 * during the transaction commit.
+	 *
+	 * We use percpu_counter_add_batch() here to force the addition into the
+	 * global sum immediately. This will result in percpu_counter_read() now
+	 * always returning a non-zero value, and hence we'll only ever have a
+	 * very short race window on new contexts.
 	 */
-	if (ctx->ticket->t_curr_res == 0) {
+	if (percpu_counter_read(&cil->xc_curr_res) == 0) {
 		ctx_res = ctx->ticket->t_unit_res;
-		ctx->ticket->t_curr_res = ctx_res;
 		tp->t_ticket->t_curr_res -= ctx_res;
+		percpu_counter_add_batch(&cil->xc_curr_res, ctx_res, ctx_res - 1);
 	}
 
 	/* do we need space for more log record headers? */
-	iclog_space = log->l_iclog_size - log->l_iclog_hsize;
-	if (len > 0 && (ctx->space_used / iclog_space !=
-				(ctx->space_used + len) / iclog_space)) {
+	if (len > 0 && !ctx_res) {
+		iclog_space = log->l_iclog_size - log->l_iclog_hsize;
 		split_res = (len + iclog_space - 1) / iclog_space;
 		/* need to take into account split region headers, too */
 		split_res *= log->l_iclog_hsize + sizeof(struct xlog_op_header);
-		ctx->ticket->t_unit_res += split_res;
-		ctx->ticket->t_curr_res += split_res;
 		tp->t_ticket->t_curr_res -= split_res;
 		ASSERT(tp->t_ticket->t_curr_res >= len);
 	}
 	tp->t_ticket->t_curr_res -= len;
-	ctx->space_used += len;
 
 	/*
 	 * If we've overrun the reservation, dump the tx details before we move
@@ -458,6 +458,15 @@  xlog_cil_insert_items(
 		xlog_print_trans(tp);
 	}
 
+	percpu_counter_add_batch(&cil->xc_curr_res, split_res, 1000 * 1000);
+	percpu_counter_add_batch(&cil->xc_space_used, len, 1000 * 1000);
+
+	spin_lock(&cil->xc_cil_lock);
+
+	/* attach the transaction to the CIL if it has any busy extents */
+	if (!list_empty(&tp->t_busy))
+		list_splice_init(&tp->t_busy, &ctx->busy_extents);
+
 	/*
 	 * Now (re-)position everything modified at the tail of the CIL.
 	 * We do this here so we only need to take the CIL lock once during
@@ -741,6 +750,18 @@  xlog_cil_push_work(
 		num_iovecs += lv->lv_niovecs;
 	}
 
+	/*
+	 * Drain per cpu counters back to context so they can be re-initialised
+	 * to zero before we allow commits to the new context we are about to
+	 * switch to.
+	 */
+	ctx->space_used = percpu_counter_sum(&cil->xc_space_used);
+	ctx->ticket->t_curr_res = percpu_counter_sum(&cil->xc_curr_res);
+	ctx->ticket->t_unit_res = ctx->ticket->t_curr_res;
+	percpu_counter_set(&cil->xc_space_used, 0);
+	percpu_counter_set(&cil->xc_curr_res, 0);
+
+
 	/*
 	 * initialise the new context and attach it to the CIL. Then attach
 	 * the current context to the CIL committing lsit so it can be found
@@ -900,6 +921,7 @@  xlog_cil_push_background(
 	struct xlog	*log) __releases(cil->xc_ctx_lock)
 {
 	struct xfs_cil	*cil = log->l_cilp;
+	s64		space_used = percpu_counter_read(&cil->xc_space_used);
 
 	/*
 	 * The cil won't be empty because we are called while holding the
@@ -911,7 +933,7 @@  xlog_cil_push_background(
 	 * don't do a background push if we haven't used up all the
 	 * space available yet.
 	 */
-	if (cil->xc_ctx->space_used < XLOG_CIL_SPACE_LIMIT(log)) {
+	if (space_used < XLOG_CIL_SPACE_LIMIT(log)) {
 		up_read(&cil->xc_ctx_lock);
 		return;
 	}
@@ -934,9 +956,9 @@  xlog_cil_push_background(
 	 * If we are well over the space limit, throttle the work that is being
 	 * done until the push work on this context has begun.
 	 */
-	if (cil->xc_ctx->space_used >= XLOG_CIL_BLOCKING_SPACE_LIMIT(log)) {
+	if (space_used >= XLOG_CIL_BLOCKING_SPACE_LIMIT(log)) {
 		trace_xfs_log_cil_wait(log, cil->xc_ctx->ticket);
-		ASSERT(cil->xc_ctx->space_used < log->l_logsize);
+		ASSERT(space_used < log->l_logsize);
 		xlog_wait(&cil->xc_ctx->push_wait, &cil->xc_push_lock);
 		return;
 	}
@@ -1200,16 +1222,23 @@  xlog_cil_init(
 {
 	struct xfs_cil	*cil;
 	struct xfs_cil_ctx *ctx;
+	int		error = -ENOMEM;
 
 	cil = kmem_zalloc(sizeof(*cil), KM_MAYFAIL);
 	if (!cil)
-		return -ENOMEM;
+		return error;
 
 	ctx = kmem_zalloc(sizeof(*ctx), KM_MAYFAIL);
-	if (!ctx) {
-		kmem_free(cil);
-		return -ENOMEM;
-	}
+	if (!ctx)
+		goto out_free_cil;
+
+	error = percpu_counter_init(&cil->xc_space_used, 0, GFP_KERNEL);
+	if (error)
+		goto out_free_ctx;
+
+	error = percpu_counter_init(&cil->xc_curr_res, 0, GFP_KERNEL);
+	if (error)
+		goto out_free_space;
 
 	INIT_WORK(&cil->xc_push_work, xlog_cil_push_work);
 	INIT_LIST_HEAD(&cil->xc_cil);
@@ -1230,19 +1259,31 @@  xlog_cil_init(
 	cil->xc_log = log;
 	log->l_cilp = cil;
 	return 0;
+
+out_free_space:
+	percpu_counter_destroy(&cil->xc_space_used);
+out_free_ctx:
+	kmem_free(ctx);
+out_free_cil:
+	kmem_free(cil);
+	return error;
 }
 
 void
 xlog_cil_destroy(
 	struct xlog	*log)
 {
-	if (log->l_cilp->xc_ctx) {
-		if (log->l_cilp->xc_ctx->ticket)
-			xfs_log_ticket_put(log->l_cilp->xc_ctx->ticket);
-		kmem_free(log->l_cilp->xc_ctx);
+	struct xfs_cil  *cil = log->l_cilp;
+
+	if (cil->xc_ctx) {
+		if (cil->xc_ctx->ticket)
+			xfs_log_ticket_put(cil->xc_ctx->ticket);
+		kmem_free(cil->xc_ctx);
 	}
+	percpu_counter_destroy(&cil->xc_space_used);
+	percpu_counter_destroy(&cil->xc_curr_res);
 
-	ASSERT(list_empty(&log->l_cilp->xc_cil));
-	kmem_free(log->l_cilp);
+	ASSERT(list_empty(&cil->xc_cil));
+	kmem_free(cil);
 }
 
diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h
index ec22c7a3867f1..f5e79a7d44c8e 100644
--- a/fs/xfs/xfs_log_priv.h
+++ b/fs/xfs/xfs_log_priv.h
@@ -262,6 +262,8 @@  struct xfs_cil_ctx {
  */
 struct xfs_cil {
 	struct xlog		*xc_log;
+	struct percpu_counter	xc_space_used;
+	struct percpu_counter	xc_curr_res;
 	struct list_head	xc_cil;
 	spinlock_t		xc_cil_lock;