
[15/45] xfs: CIL work is serialised, not pipelined

Message ID 20210305051143.182133-16-david@fromorbit.com (mailing list archive)
State Superseded
Series xfs: consolidated log and optimisation changes

Commit Message

Dave Chinner March 5, 2021, 5:11 a.m. UTC
From: Dave Chinner <dchinner@redhat.com>

Because we use a single work structure attached to the CIL rather
than the CIL context, we can only queue a single work item at a
time. This results in the CIL being single threaded and limits
performance when it becomes CPU bound.

The design of the CIL is that it is pipelined and multiple commits
can be running concurrently, but the way the work is currently
implemented means that it is not pipelining as it was intended. The
critical work to switch the CIL context can take a few milliseconds
to run, but the rest of the CIL context flush can take hundreds of
milliseconds to complete. The context switching is the serialisation
point of the CIL; once the context has been switched, the rest of the
context push can run asynchronously with all other context pushes.

Hence we can move the work to the CIL context so that we can run
multiple CIL pushes at the same time and spread the majority of
the work out over multiple CPUs. We can keep the per-cpu CIL commit
state on the CIL rather than the context, because the context is
pinned to the CIL until the switch is done and we aggregate and
drain the per-cpu state held on the CIL during the context switch.

However, because we no longer serialise the CIL work, we can have
effectively unlimited CIL pushes in progress. We don't want to do
this - not only does it create contention on the iclogs and the
state machine locks, it can also run the log right out of space with
outstanding pushes. Instead, limit the work concurrency to 4
concurrent works being processed at a time. This is enough
concurrency to remove the CIL from being a CPU bound bottleneck but
not enough to create new contention points or unbound concurrency
issues.
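
In outline (condensed from the diff below, not a complete listing),
the push work now hangs off the checkpoint context rather than the
CIL, and the workqueue's max_active argument provides the bound on
in-flight pushes:

	/* Each checkpoint context carries its own push work item. */
	INIT_WORK(&ctx->push_work, xlog_cil_push_work);

	/* Pushes now queue the current context's work item. */
	queue_work(log->l_mp->m_cil_workqueue, &cil->xc_ctx->push_work);

	/* max_active = 4 caps the number of push works running at once. */
	mp->m_cil_workqueue = alloc_workqueue("xfs-cil/%s",
			XFS_WQFLAGS(WQ_FREEZABLE | WQ_MEM_RECLAIM | WQ_UNBOUND),
			4, mp->m_super->s_id);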

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_log_cil.c  | 80 +++++++++++++++++++++++--------------------
 fs/xfs/xfs_log_priv.h |  2 +-
 fs/xfs/xfs_super.c    |  2 +-
 3 files changed, 44 insertions(+), 40 deletions(-)

Comments

Darrick J. Wong March 8, 2021, 11:14 p.m. UTC | #1
On Fri, Mar 05, 2021 at 04:11:13PM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Because we use a single work structure attached to the CIL rather
> than the CIL context, we can only queue a single work item at a
> time. This results in the CIL being single threaded and limits
> performance when it becomes CPU bound.
> 
> The design of the CIL is that it is pipelined and multiple commits
> can be running concurrently, but the way the work is currently
> implemented means that it is not pipelining as it was intended. The
> critical work to switch the CIL context can take a few milliseconds
> to run, but the rest of the CIL context flush can take hundreds of
> milliseconds to complete. The context switching is the serialisation
> point of the CIL, once the context has been switched the rest of the
> context push can run asynchronously with all other context pushes.
> 
> Hence we can move the work to the CIL context so that we can run
> multiple CIL pushes at the same time and spread the majority of
> the work out over multiple CPUs. We can keep the per-cpu CIL commit
> state on the CIL rather than the context, because the context is
> pinned to the CIL until the switch is done and we aggregate and
> drain the per-cpu state held on the CIL during the context switch.
> 
> However, because we no longer serialise the CIL work, we can have
> effectively unlimited CIL pushes in progress. We don't want to do
> this - not only does it create contention on the iclogs and the
> state machine locks, we can run the log right out of space with
> outstanding pushes. Instead, limit the work concurrency to 4
> concurrent works being processed at a time. This is enough

Four?  Was that determined experimentally, or is that a fundamental
limit of how many cil checkpoints we can be working on at a time?  The
current one, the previous one, and ... something else that was already
in progress?

I think the rest of the patch looks reasonable, FWIW.

--D

> concurrency to remove the CIL from being a CPU bound bottleneck but
> not enough to create new contention points or unbound concurrency
> issues.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/xfs_log_cil.c  | 80 +++++++++++++++++++++++--------------------
>  fs/xfs/xfs_log_priv.h |  2 +-
>  fs/xfs/xfs_super.c    |  2 +-
>  3 files changed, 44 insertions(+), 40 deletions(-)
> 
> diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
> index b101c25cc9a9..dfc9ef692a80 100644
> --- a/fs/xfs/xfs_log_cil.c
> +++ b/fs/xfs/xfs_log_cil.c
> @@ -47,6 +47,34 @@ xlog_cil_ticket_alloc(
>  	return tic;
>  }
>  
> +/*
> + * Unavoidable forward declaration - xlog_cil_push_work() calls
> + * xlog_cil_ctx_alloc() itself.
> + */
> +static void xlog_cil_push_work(struct work_struct *work);
> +
> +static struct xfs_cil_ctx *
> +xlog_cil_ctx_alloc(void)
> +{
> +	struct xfs_cil_ctx	*ctx;
> +
> +	ctx = kmem_zalloc(sizeof(*ctx), KM_NOFS);
> +	INIT_LIST_HEAD(&ctx->committing);
> +	INIT_LIST_HEAD(&ctx->busy_extents);
> +	INIT_WORK(&ctx->push_work, xlog_cil_push_work);
> +	return ctx;
> +}
> +
> +static void
> +xlog_cil_ctx_switch(
> +	struct xfs_cil		*cil,
> +	struct xfs_cil_ctx	*ctx)
> +{
> +	ctx->sequence = ++cil->xc_current_sequence;
> +	ctx->cil = cil;
> +	cil->xc_ctx = ctx;
> +}
> +
>  /*
>   * After the first stage of log recovery is done, we know where the head and
>   * tail of the log are. We need this log initialisation done before we can
> @@ -641,11 +669,11 @@ static void
>  xlog_cil_push_work(
>  	struct work_struct	*work)
>  {
> -	struct xfs_cil		*cil =
> -		container_of(work, struct xfs_cil, xc_push_work);
> +	struct xfs_cil_ctx	*ctx =
> +		container_of(work, struct xfs_cil_ctx, push_work);
> +	struct xfs_cil		*cil = ctx->cil;
>  	struct xlog		*log = cil->xc_log;
>  	struct xfs_log_vec	*lv;
> -	struct xfs_cil_ctx	*ctx;
>  	struct xfs_cil_ctx	*new_ctx;
>  	struct xlog_in_core	*commit_iclog;
>  	struct xlog_ticket	*tic;
> @@ -660,11 +688,10 @@ xlog_cil_push_work(
>  	DECLARE_COMPLETION_ONSTACK(bdev_flush);
>  	bool			commit_iclog_sync = false;
>  
> -	new_ctx = kmem_zalloc(sizeof(*new_ctx), KM_NOFS);
> +	new_ctx = xlog_cil_ctx_alloc();
>  	new_ctx->ticket = xlog_cil_ticket_alloc(log);
>  
>  	down_write(&cil->xc_ctx_lock);
> -	ctx = cil->xc_ctx;
>  
>  	spin_lock(&cil->xc_push_lock);
>  	push_seq = cil->xc_push_seq;
> @@ -696,7 +723,7 @@ xlog_cil_push_work(
>  
>  
>  	/* check for a previously pushed sequence */
> -	if (push_seq < cil->xc_ctx->sequence) {
> +	if (push_seq < ctx->sequence) {
>  		spin_unlock(&cil->xc_push_lock);
>  		goto out_skip;
>  	}
> @@ -767,19 +794,7 @@ xlog_cil_push_work(
>  	}
>  
>  	/*
> -	 * initialise the new context and attach it to the CIL. Then attach
> -	 * the current context to the CIL committing list so it can be found
> -	 * during log forces to extract the commit lsn of the sequence that
> -	 * needs to be forced.
> -	 */
> -	INIT_LIST_HEAD(&new_ctx->committing);
> -	INIT_LIST_HEAD(&new_ctx->busy_extents);
> -	new_ctx->sequence = ctx->sequence + 1;
> -	new_ctx->cil = cil;
> -	cil->xc_ctx = new_ctx;
> -
> -	/*
> -	 * The switch is now done, so we can drop the context lock and move out
> +	 * Switch the contexts so we can drop the context lock and move out
>  	 * of a shared context. We can't just go straight to the commit record,
>  	 * though - we need to synchronise with previous and future commits so
>  	 * that the commit records are correctly ordered in the log to ensure
> @@ -804,7 +819,7 @@ xlog_cil_push_work(
>  	 * deferencing a freed context pointer.
>  	 */
>  	spin_lock(&cil->xc_push_lock);
> -	cil->xc_current_sequence = new_ctx->sequence;
> +	xlog_cil_ctx_switch(cil, new_ctx);
>  	spin_unlock(&cil->xc_push_lock);
>  	up_write(&cil->xc_ctx_lock);
>  
> @@ -968,7 +983,7 @@ xlog_cil_push_background(
>  	spin_lock(&cil->xc_push_lock);
>  	if (cil->xc_push_seq < cil->xc_current_sequence) {
>  		cil->xc_push_seq = cil->xc_current_sequence;
> -		queue_work(log->l_mp->m_cil_workqueue, &cil->xc_push_work);
> +		queue_work(log->l_mp->m_cil_workqueue, &cil->xc_ctx->push_work);
>  	}
>  
>  	/*
> @@ -1034,7 +1049,7 @@ xlog_cil_push_now(
>  
>  	/* start on any pending background push to minimise wait time on it */
>  	if (sync)
> -		flush_work(&cil->xc_push_work);
> +		flush_workqueue(log->l_mp->m_cil_workqueue);
>  
>  	/*
>  	 * If the CIL is empty or we've already pushed the sequence then
> @@ -1049,7 +1064,7 @@ xlog_cil_push_now(
>  	cil->xc_push_seq = push_seq;
>  	if (!sync)
>  		cil->xc_push_async = true;
> -	queue_work(log->l_mp->m_cil_workqueue, &cil->xc_push_work);
> +	queue_work(log->l_mp->m_cil_workqueue, &cil->xc_ctx->push_work);
>  	spin_unlock(&cil->xc_push_lock);
>  }
>  
> @@ -1286,13 +1301,6 @@ xlog_cil_init(
>  	if (!cil)
>  		return -ENOMEM;
>  
> -	ctx = kmem_zalloc(sizeof(*ctx), KM_MAYFAIL);
> -	if (!ctx) {
> -		kmem_free(cil);
> -		return -ENOMEM;
> -	}
> -
> -	INIT_WORK(&cil->xc_push_work, xlog_cil_push_work);
>  	INIT_LIST_HEAD(&cil->xc_cil);
>  	INIT_LIST_HEAD(&cil->xc_committing);
>  	spin_lock_init(&cil->xc_cil_lock);
> @@ -1300,16 +1308,12 @@ xlog_cil_init(
>  	init_waitqueue_head(&cil->xc_push_wait);
>  	init_rwsem(&cil->xc_ctx_lock);
>  	init_waitqueue_head(&cil->xc_commit_wait);
> -
> -	INIT_LIST_HEAD(&ctx->committing);
> -	INIT_LIST_HEAD(&ctx->busy_extents);
> -	ctx->sequence = 1;
> -	ctx->cil = cil;
> -	cil->xc_ctx = ctx;
> -	cil->xc_current_sequence = ctx->sequence;
> -
>  	cil->xc_log = log;
>  	log->l_cilp = cil;
> +
> +	ctx = xlog_cil_ctx_alloc();
> +	xlog_cil_ctx_switch(cil, ctx);
> +
>  	return 0;
>  }
>  
> diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h
> index a4e46258b2aa..bb5fa6b71114 100644
> --- a/fs/xfs/xfs_log_priv.h
> +++ b/fs/xfs/xfs_log_priv.h
> @@ -245,6 +245,7 @@ struct xfs_cil_ctx {
>  	struct list_head	iclog_entry;
>  	struct list_head	committing;	/* ctx committing list */
>  	struct work_struct	discard_endio_work;
> +	struct work_struct	push_work;
>  };
>  
>  /*
> @@ -277,7 +278,6 @@ struct xfs_cil {
>  	struct list_head	xc_committing;
>  	wait_queue_head_t	xc_commit_wait;
>  	xfs_csn_t		xc_current_sequence;
> -	struct work_struct	xc_push_work;
>  	wait_queue_head_t	xc_push_wait;	/* background push throttle */
>  } ____cacheline_aligned_in_smp;
>  
> diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
> index ca2cb0448b5e..962f03a541e7 100644
> --- a/fs/xfs/xfs_super.c
> +++ b/fs/xfs/xfs_super.c
> @@ -502,7 +502,7 @@ xfs_init_mount_workqueues(
>  
>  	mp->m_cil_workqueue = alloc_workqueue("xfs-cil/%s",
>  			XFS_WQFLAGS(WQ_FREEZABLE | WQ_MEM_RECLAIM | WQ_UNBOUND),
> -			0, mp->m_super->s_id);
> +			4, mp->m_super->s_id);
>  	if (!mp->m_cil_workqueue)
>  		goto out_destroy_unwritten;
>  
> -- 
> 2.28.0
>
Dave Chinner March 8, 2021, 11:38 p.m. UTC | #2
On Mon, Mar 08, 2021 at 03:14:32PM -0800, Darrick J. Wong wrote:
> On Fri, Mar 05, 2021 at 04:11:13PM +1100, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > Because we use a single work structure attached to the CIL rather
> > than the CIL context, we can only queue a single work item at a
> > time. This results in the CIL being single threaded and limits
> > performance when it becomes CPU bound.
> > 
> > The design of the CIL is that it is pipelined and multiple commits
> > can be running concurrently, but the way the work is currently
> > implemented means that it is not pipelining as it was intended. The
> > critical work to switch the CIL context can take a few milliseconds
> > to run, but the rest of the CIL context flush can take hundreds of
> > milliseconds to complete. The context switching is the serialisation
> > point of the CIL, once the context has been switched the rest of the
> > context push can run asynchronously with all other context pushes.
> > 
> > Hence we can move the work to the CIL context so that we can run
> > multiple CIL pushes at the same time and spread the majority of
> > the work out over multiple CPUs. We can keep the per-cpu CIL commit
> > state on the CIL rather than the context, because the context is
> > pinned to the CIL until the switch is done and we aggregate and
> > drain the per-cpu state held on the CIL during the context switch.
> > 
> > However, because we no longer serialise the CIL work, we can have
> > effectively unlimited CIL pushes in progress. We don't want to do
> > this - not only does it create contention on the iclogs and the
> > state machine locks, we can run the log right out of space with
> > outstanding pushes. Instead, limit the work concurrency to 4
> > concurrent works being processed at a time. This is enough
> 
> Four?  Was that determined experimentally, or is that a fundamental
> > limit of how many cil checkpoints we can be working on at a time?  The
> current one, the previous one, and ... something else that was already
> in progress?

No fundamental limit, but....

> > concurrency to remove the CIL from being a CPU bound bottleneck but
> > not enough to create new contention points or unbound concurrency
> > issues.

spinlocks in well written code scale linearly to 3-4 CPUs banging on
them frequently.  Beyond that they start to show non-linear
behaviour before they break down completely at somewhere between
8-16 threads banging on them. If we have 4 CIL writes going on, we
have 4 CPUs banging on the log->l_icloglock through xlog_write()
through xlog_state_get_iclog_space() and then releasing the iclogs
when they are full. We then have iclog IO completion banging on the
icloglock to serialise completion so it can change iclog state on
completion.

Hence, at 4 CIL push works, we're starting to get back to the point
where the icloglock will start to see non-linear access cost. This
was a problem before delayed logging removed the icloglock from the
front end transaction commit path where it could see unbound
concurrency and was the hottest lock in the log.

Allowing only a limited amount of concurrency prevents wasteful,
performance-limiting lock contention from occurring unnecessarily.
And given that I'm only hitting the single CPU limit of the CIL
push when there are 31 other CPUs all running transactions flat
out, having 4 CPUs to run the same work is more than enough.
Especially as those 31 other CPUs running transactions are already
pushing VFS level spinlocks (sb->s_inode_list_lock, dentry ref
count locking, etc) to breakdown point, so we're not going to be
able to push enough change into the
CIL to keep 4 CPUs fully busy any time soon.

Cheers,

Dave.
Darrick J. Wong March 9, 2021, 1:55 a.m. UTC | #3
On Tue, Mar 09, 2021 at 10:38:19AM +1100, Dave Chinner wrote:
> On Mon, Mar 08, 2021 at 03:14:32PM -0800, Darrick J. Wong wrote:
> > On Fri, Mar 05, 2021 at 04:11:13PM +1100, Dave Chinner wrote:
> > > From: Dave Chinner <dchinner@redhat.com>
> > > 
> > > Because we use a single work structure attached to the CIL rather
> > > than the CIL context, we can only queue a single work item at a
> > > time. This results in the CIL being single threaded and limits
> > > performance when it becomes CPU bound.
> > > 
> > > The design of the CIL is that it is pipelined and multiple commits
> > > can be running concurrently, but the way the work is currently
> > > implemented means that it is not pipelining as it was intended. The
> > > critical work to switch the CIL context can take a few milliseconds
> > > to run, but the rest of the CIL context flush can take hundreds of
> > > milliseconds to complete. The context switching is the serialisation
> > > point of the CIL, once the context has been switched the rest of the
> > > context push can run asynchronously with all other context pushes.
> > > 
> > > Hence we can move the work to the CIL context so that we can run
> > > multiple CIL pushes at the same time and spread the majority of
> > > the work out over multiple CPUs. We can keep the per-cpu CIL commit
> > > state on the CIL rather than the context, because the context is
> > > pinned to the CIL until the switch is done and we aggregate and
> > > drain the per-cpu state held on the CIL during the context switch.
> > > 
> > > However, because we no longer serialise the CIL work, we can have
> > > effectively unlimited CIL pushes in progress. We don't want to do
> > > this - not only does it create contention on the iclogs and the
> > > state machine locks, we can run the log right out of space with
> > > outstanding pushes. Instead, limit the work concurrency to 4
> > > concurrent works being processed at a time. This is enough
> > 
> > Four?  Was that determined experimentally, or is that a fundamental
> > > limit of how many cil checkpoints we can be working on at a time?  The
> > current one, the previous one, and ... something else that was already
> > in progress?
> 
> No fundamental limit, but....
> 
> > > concurrency to remove the CIL from being a CPU bound bottleneck but
> > > not enough to create new contention points or unbound concurrency
> > > issues.
> 
> spinlocks in well written code scale linearly to 3-4 CPUs banging on
> them frequently.  Beyond that they start to show non-linear
> behaviour before they break down completely at somewhere between
> 8-16 threads banging on them. If we have 4 CIL writes going on, we
> have 4 CPUs banging on the log->l_icloglock through xlog_write()
> through xlog_state_get_iclog_space() and then releasing the iclogs
> when they are full. We then have iclog IO completion banging on the
> icloglock to serialise completion so it can change iclog state on
> completion.
> 
> Hence, at 4 CIL push works, we're starting to get back to the point
> where the icloglock will start to see non-linear access cost. This
> was a problem before delayed logging removed the icloglock from the
> front end transaction commit path where it could see unbound
> concurrency and was the hottest lock in the log.
> 
> Allowing a limited amount of concurrency prevents us from
> unnecessarily allowing wasteful and performance limiting lock
> contention from occurring. And given that I'm only hitting the
> single CPU limit of the CIL push when there's 31 other CPUs all
> running transactions flat out, having 4 CPUs to run the same work is
> more than enough. Especially as those 31 other CPUs running
> transactions are already pushing VFS level spinlocks
> (sb->s_inode_list_lock, dentry ref count locking, etc) to breakdown
> point so we're not going to be able to push enough change into the
> CIL to keep 4 CPUs fully busy any time soon.

It might be nice to leave that as a breadcrumb, then, in case the
spinlock scalability problems ever get solved.

	/*
	 * Limit ourselves to 4 CIL push workers per log to avoid
	 * excessive contention of the icloglock spinlock.
	 */
	error = alloc_workqueue(..., 4, ...);
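
(Illustrative only - applied to the alloc_workqueue() call in this
patch's xfs_super.c hunk, that would read:)

	/*
	 * Limit ourselves to 4 CIL push workers per log to avoid
	 * excessive contention of the icloglock spinlock.
	 */
	mp->m_cil_workqueue = alloc_workqueue("xfs-cil/%s",
			XFS_WQFLAGS(WQ_FREEZABLE | WQ_MEM_RECLAIM | WQ_UNBOUND),
			4, mp->m_super->s_id);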

--D

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
Andi Kleen March 9, 2021, 10:35 p.m. UTC | #4
"Darrick J. Wong" <djwong@kernel.org> writes:
> It might be nice to leave that as a breadcrumb, then, in case the
> spinlock scalability problems ever get solved.

It might already be solved, depending on whether Dave's rule of thumb
was determined before the Linux spinlocks switched to MCS locks or not.

In my experience spinlock scalability depends a lot on how long the
critical section is (that is very important, short sections are a lot
worse than long sections), as well as if the contention is inside a
socket or over sockets, and the actual hardware behaves differently too.

So I would be quite surprised if the "rule of 4" generally holds.

-Andi
Dave Chinner March 10, 2021, 6:11 a.m. UTC | #5
On Tue, Mar 09, 2021 at 02:35:42PM -0800, Andi Kleen wrote:
> "Darrick J. Wong" <djwong@kernel.org> writes:
> > It might be nice to leave that as a breadcrumb, then, in case the
> > spinlock scalability problems ever get solved.
> 
> It might be already solved, depending on if Dave's rule of thumb
> was determined before the Linux spinlocks switched to MCS locks or not.

It's what I see on my current 2-socket, 32p/64t machine with a
handful of optane DC4800 SSDs attached to it running 5.12-rc1+.  MCS
doesn't make spin contention go away, just stops spinlocks from
bouncing the same cacheline all over the machine.

i.e. if you've got more than a single CPU's worth of critical
section to execute, then spinlocks are going to spin, no matter how
they are implemented. So AFAICT the contention I'm measuring is not
cacheline bouncing, but just the cost of multiple CPUs
spinning while queued waiting for the spinlock...

> In my experience spinlock scalability depends a lot on how long the
> critical section is (that is very important, short sections are a lot
> worse than long sections), as well as if the contention is inside a
> socket or over sockets, and the actual hardware behaves differently too.

Yup, and most of the critical sections that the icloglock is used
to protect are quite short.

> So I would be quite surprised if the "rule of 4" generally holds.

It's served me well for the past couple of decades, especially when
working with machines that have thousands of CPUs that can turn even
lightly trafficked spin locks into highly contended locks. That was
the lesson I learnt from this commit:

commit 249a8c1124653fa90f3a3afff869095a31bc229f
Author: David Chinner <dgc@sgi.com>
Date:   Tue Feb 5 12:13:32 2008 +1100

    [XFS] Move AIL pushing into it's own thread
    
    When many hundreds to thousands of threads all try to do simultaneous
    transactions and the log is in a tail-pushing situation (i.e. full), we
    can get multiple threads walking the AIL list and contending on the AIL
    lock.
 
With half a dozen simultaneous AIL pushes, the AIL spinlock
would break down and burn an entire 2048p machine for half a day
doing what should only take half a second. Unbound concurrency
-always- finds spinlocks to contend on. And if the machine is large
enough, it will then block up the entire machine as more and more
CPUs hit the serialisation point.

As long as I've worked with 500+ cpu machines (since 2002),
scalability has always been about either removing spinlocks from
hot paths or controlling concurrency to a level below where a
spinlock breaks down. You see it again and again in XFS commit logs
where I've either changed something to be lockless or to strictly
constrain concurrency to be less than 4-8p across known hot and/or
contended spinlocks and mutexes.

And I've used it outside XFS, too. It was the basic concept behind
the NUMA aware shrinker infrastructure and the per-node LRU lists
that it uses. Even the internal spinlocks on those lists start to
break down when bashing on the inode and dentry caches on systems
with per-node CPU counts of 8-16...

Oh, another "rule of 4" I came across a couple of days ago. My test
machine has 4 nodes, so 4 kswapd threads. One buffered IO reader,
running 100% cpu bound at 2GB/s from a 6GB/s capable block device.
The reader was burning 12% of that CPU on the mapping spinlock
inserting pages into the page cache.  The kswapds were each burning
12% of a CPU on the same mapping spinlock reclaiming page cache
pages. So, overall, the system was burning over 50% of a CPU
spinning on the mapping spinlock and really only doing about half a
CPU worth of real work.

Same workload with one kswapd (i.e. single node)? Contention on the
mapping lock is barely measurable.  IOWs, at just 5 concurrent
threads doing repeated fast accesses to the inode mapping spinlock
we have reached lock breakdown conditions.

Perhaps we should consider spinlocks harmful these days...

Cheers,

Dave.

Patch

diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
index b101c25cc9a9..dfc9ef692a80 100644
--- a/fs/xfs/xfs_log_cil.c
+++ b/fs/xfs/xfs_log_cil.c
@@ -47,6 +47,34 @@  xlog_cil_ticket_alloc(
 	return tic;
 }
 
+/*
+ * Unavoidable forward declaration - xlog_cil_push_work() calls
+ * xlog_cil_ctx_alloc() itself.
+ */
+static void xlog_cil_push_work(struct work_struct *work);
+
+static struct xfs_cil_ctx *
+xlog_cil_ctx_alloc(void)
+{
+	struct xfs_cil_ctx	*ctx;
+
+	ctx = kmem_zalloc(sizeof(*ctx), KM_NOFS);
+	INIT_LIST_HEAD(&ctx->committing);
+	INIT_LIST_HEAD(&ctx->busy_extents);
+	INIT_WORK(&ctx->push_work, xlog_cil_push_work);
+	return ctx;
+}
+
+static void
+xlog_cil_ctx_switch(
+	struct xfs_cil		*cil,
+	struct xfs_cil_ctx	*ctx)
+{
+	ctx->sequence = ++cil->xc_current_sequence;
+	ctx->cil = cil;
+	cil->xc_ctx = ctx;
+}
+
 /*
  * After the first stage of log recovery is done, we know where the head and
  * tail of the log are. We need this log initialisation done before we can
@@ -641,11 +669,11 @@  static void
 xlog_cil_push_work(
 	struct work_struct	*work)
 {
-	struct xfs_cil		*cil =
-		container_of(work, struct xfs_cil, xc_push_work);
+	struct xfs_cil_ctx	*ctx =
+		container_of(work, struct xfs_cil_ctx, push_work);
+	struct xfs_cil		*cil = ctx->cil;
 	struct xlog		*log = cil->xc_log;
 	struct xfs_log_vec	*lv;
-	struct xfs_cil_ctx	*ctx;
 	struct xfs_cil_ctx	*new_ctx;
 	struct xlog_in_core	*commit_iclog;
 	struct xlog_ticket	*tic;
@@ -660,11 +688,10 @@  xlog_cil_push_work(
 	DECLARE_COMPLETION_ONSTACK(bdev_flush);
 	bool			commit_iclog_sync = false;
 
-	new_ctx = kmem_zalloc(sizeof(*new_ctx), KM_NOFS);
+	new_ctx = xlog_cil_ctx_alloc();
 	new_ctx->ticket = xlog_cil_ticket_alloc(log);
 
 	down_write(&cil->xc_ctx_lock);
-	ctx = cil->xc_ctx;
 
 	spin_lock(&cil->xc_push_lock);
 	push_seq = cil->xc_push_seq;
@@ -696,7 +723,7 @@  xlog_cil_push_work(
 
 
 	/* check for a previously pushed sequence */
-	if (push_seq < cil->xc_ctx->sequence) {
+	if (push_seq < ctx->sequence) {
 		spin_unlock(&cil->xc_push_lock);
 		goto out_skip;
 	}
@@ -767,19 +794,7 @@  xlog_cil_push_work(
 	}
 
 	/*
-	 * initialise the new context and attach it to the CIL. Then attach
-	 * the current context to the CIL committing list so it can be found
-	 * during log forces to extract the commit lsn of the sequence that
-	 * needs to be forced.
-	 */
-	INIT_LIST_HEAD(&new_ctx->committing);
-	INIT_LIST_HEAD(&new_ctx->busy_extents);
-	new_ctx->sequence = ctx->sequence + 1;
-	new_ctx->cil = cil;
-	cil->xc_ctx = new_ctx;
-
-	/*
-	 * The switch is now done, so we can drop the context lock and move out
+	 * Switch the contexts so we can drop the context lock and move out
 	 * of a shared context. We can't just go straight to the commit record,
 	 * though - we need to synchronise with previous and future commits so
 	 * that the commit records are correctly ordered in the log to ensure
@@ -804,7 +819,7 @@  xlog_cil_push_work(
 	 * deferencing a freed context pointer.
 	 */
 	spin_lock(&cil->xc_push_lock);
-	cil->xc_current_sequence = new_ctx->sequence;
+	xlog_cil_ctx_switch(cil, new_ctx);
 	spin_unlock(&cil->xc_push_lock);
 	up_write(&cil->xc_ctx_lock);
 
@@ -968,7 +983,7 @@  xlog_cil_push_background(
 	spin_lock(&cil->xc_push_lock);
 	if (cil->xc_push_seq < cil->xc_current_sequence) {
 		cil->xc_push_seq = cil->xc_current_sequence;
-		queue_work(log->l_mp->m_cil_workqueue, &cil->xc_push_work);
+		queue_work(log->l_mp->m_cil_workqueue, &cil->xc_ctx->push_work);
 	}
 
 	/*
@@ -1034,7 +1049,7 @@  xlog_cil_push_now(
 
 	/* start on any pending background push to minimise wait time on it */
 	if (sync)
-		flush_work(&cil->xc_push_work);
+		flush_workqueue(log->l_mp->m_cil_workqueue);
 
 	/*
 	 * If the CIL is empty or we've already pushed the sequence then
@@ -1049,7 +1064,7 @@  xlog_cil_push_now(
 	cil->xc_push_seq = push_seq;
 	if (!sync)
 		cil->xc_push_async = true;
-	queue_work(log->l_mp->m_cil_workqueue, &cil->xc_push_work);
+	queue_work(log->l_mp->m_cil_workqueue, &cil->xc_ctx->push_work);
 	spin_unlock(&cil->xc_push_lock);
 }
 
@@ -1286,13 +1301,6 @@  xlog_cil_init(
 	if (!cil)
 		return -ENOMEM;
 
-	ctx = kmem_zalloc(sizeof(*ctx), KM_MAYFAIL);
-	if (!ctx) {
-		kmem_free(cil);
-		return -ENOMEM;
-	}
-
-	INIT_WORK(&cil->xc_push_work, xlog_cil_push_work);
 	INIT_LIST_HEAD(&cil->xc_cil);
 	INIT_LIST_HEAD(&cil->xc_committing);
 	spin_lock_init(&cil->xc_cil_lock);
@@ -1300,16 +1308,12 @@  xlog_cil_init(
 	init_waitqueue_head(&cil->xc_push_wait);
 	init_rwsem(&cil->xc_ctx_lock);
 	init_waitqueue_head(&cil->xc_commit_wait);
-
-	INIT_LIST_HEAD(&ctx->committing);
-	INIT_LIST_HEAD(&ctx->busy_extents);
-	ctx->sequence = 1;
-	ctx->cil = cil;
-	cil->xc_ctx = ctx;
-	cil->xc_current_sequence = ctx->sequence;
-
 	cil->xc_log = log;
 	log->l_cilp = cil;
+
+	ctx = xlog_cil_ctx_alloc();
+	xlog_cil_ctx_switch(cil, ctx);
+
 	return 0;
 }
 
diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h
index a4e46258b2aa..bb5fa6b71114 100644
--- a/fs/xfs/xfs_log_priv.h
+++ b/fs/xfs/xfs_log_priv.h
@@ -245,6 +245,7 @@  struct xfs_cil_ctx {
 	struct list_head	iclog_entry;
 	struct list_head	committing;	/* ctx committing list */
 	struct work_struct	discard_endio_work;
+	struct work_struct	push_work;
 };
 
 /*
@@ -277,7 +278,6 @@  struct xfs_cil {
 	struct list_head	xc_committing;
 	wait_queue_head_t	xc_commit_wait;
 	xfs_csn_t		xc_current_sequence;
-	struct work_struct	xc_push_work;
 	wait_queue_head_t	xc_push_wait;	/* background push throttle */
 } ____cacheline_aligned_in_smp;
 
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index ca2cb0448b5e..962f03a541e7 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -502,7 +502,7 @@  xfs_init_mount_workqueues(
 
 	mp->m_cil_workqueue = alloc_workqueue("xfs-cil/%s",
 			XFS_WQFLAGS(WQ_FREEZABLE | WQ_MEM_RECLAIM | WQ_UNBOUND),
-			0, mp->m_super->s_id);
+			4, mp->m_super->s_id);
 	if (!mp->m_cil_workqueue)
 		goto out_destroy_unwritten;