From patchwork Tue May 12 09:28:07 2020
X-Patchwork-Submitter: Dave Chinner
X-Patchwork-Id: 11542767
From: Dave Chinner
To: linux-xfs@vger.kernel.org
Subject: [PATCH 1/5] xfs: separate read-only variables in struct xfs_mount
Date: Tue, 12 May 2020 19:28:07 +1000
Message-Id: <20200512092811.1846252-2-david@fromorbit.com>
In-Reply-To: <20200512092811.1846252-1-david@fromorbit.com>
References: <20200512092811.1846252-1-david@fromorbit.com>

From: Dave Chinner

Seeing massive CPU usage from xfs_agino_range() on one machine;
instruction-level profiles look similar to another machine running the
same workload, but one machine is consuming 10x as much CPU as the
other and going much slower. The only real difference between the two
machines is core count per socket. Both are running identical 16p/16GB
virtual machine configurations.

Machine A:

  25.83%  [k] xfs_agino_range
  12.68%  [k] __xfs_dir3_data_check
   6.95%  [k] xfs_verify_ino
   6.78%  [k] xfs_dir2_data_entry_tag_p
   3.56%  [k] xfs_buf_find
   2.31%  [k] xfs_verify_dir_ino
   2.02%  [k] xfs_dabuf_map.constprop.0
   1.65%  [k] xfs_ag_block_count

And takes around 13 minutes to remove 50 million inodes.

Machine B:

  13.90%  [k] __pv_queued_spin_lock_slowpath
   3.76%  [k] do_raw_spin_lock
   2.83%  [k] xfs_dir3_leaf_check_int
   2.75%  [k] xfs_agino_range
   2.51%  [k] __raw_callee_save___pv_queued_spin_unlock
   2.18%  [k] __xfs_dir3_data_check
   2.02%  [k] xfs_log_commit_cil

And takes around 5m30s to remove 50 million inodes.

The suspect is cacheline contention on m_sectbb_log, which is used in
one of the macros in xfs_agino_range. This is a read-only variable,
but it shares a cacheline with m_active_trans, a global atomic that
gets bounced all around the machine.
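In miniature, the false sharing problem and the fix being applied here
look like this (a toy struct with invented field names, not the actual
xfs_mount layout; ____cacheline_aligned is the same annotation the
patch uses):

    #include <linux/cache.h>
    #include <linux/atomic.h>

    struct toy_mount {
            /*
             * Read-only after mount: every CPU can hold a shared copy
             * of this cacheline indefinitely.
             */
            int             ro_sectbb_log ____cacheline_aligned;
            int             ro_agno_log;

            /*
             * Write-hot: starting a new cacheline here means the
             * atomic only bounces itself, not the read-only fields
             * above it.
             */
            atomic_t        rw_active_trans ____cacheline_aligned;
    };

If the two groups share a cacheline, every atomic update invalidates
the read-only fields in every other CPU's cache, which is the innocent
bystander effect described below.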
The workload is trying to run hundreds of thousands of transactions
per second, so cacheline contention will be occurring on this atomic
counter. Hence xfs_agino_range() is likely just an innocent bystander
as the cache coherency protocol fights over the cacheline between CPU
cores and sockets.

On machine A, this rearrangement of the struct xfs_mount results in
the profile changing to:

   9.77%  [kernel]  [k] xfs_agino_range
   6.27%  [kernel]  [k] __xfs_dir3_data_check
   5.31%  [kernel]  [k] __pv_queued_spin_lock_slowpath
   4.54%  [kernel]  [k] xfs_buf_find
   3.79%  [kernel]  [k] do_raw_spin_lock
   3.39%  [kernel]  [k] xfs_verify_ino
   2.73%  [kernel]  [k] __raw_callee_save___pv_queued_spin_unlock

Vastly less CPU usage in xfs_agino_range(), but still 3x that of
machine B, and the workload still runs substantially slower than it
should.

Current rm -rf of 50 million files:

                vanilla         patched
machine A       13m20s          8m30s
machine B        5m30s          5m02s

It's an improvement, indicating that separation and further
optimisation of read-only global filesystem data is worthwhile, but it
clearly isn't the underlying issue causing this specific performance
degradation.

Signed-off-by: Dave Chinner
Reviewed-by: Brian Foster
Reviewed-by: Darrick J. Wong
---
 fs/xfs/xfs_mount.h | 50 +++++++++++++++++++++++++++-------------------
 1 file changed, 29 insertions(+), 21 deletions(-)

diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
index aba5a15792792..712b3e2583316 100644
--- a/fs/xfs/xfs_mount.h
+++ b/fs/xfs/xfs_mount.h
@@ -88,21 +88,12 @@ typedef struct xfs_mount {
 	struct xfs_buf		*m_sb_bp;	/* buffer for superblock */
 	char			*m_rtname;	/* realtime device name */
 	char			*m_logname;	/* external log device name */
-	int			m_bsize;	/* fs logical block size */
 	xfs_agnumber_t		m_agfrotor;	/* last ag where space found */
 	xfs_agnumber_t		m_agirotor;	/* last ag dir inode alloced */
 	spinlock_t		m_agirotor_lock;/* .. and lock protecting it */
-	xfs_agnumber_t		m_maxagi;	/* highest inode alloc group */
-	uint			m_allocsize_log;/* min write size log bytes */
-	uint			m_allocsize_blocks; /* min write size blocks */
 	struct xfs_da_geometry	*m_dir_geo;	/* directory block geometry */
 	struct xfs_da_geometry	*m_attr_geo;	/* attribute block geometry */
 	struct xlog		*m_log;		/* log specific stuff */
-	struct xfs_ino_geometry	m_ino_geo;	/* inode geometry */
-	int			m_logbufs;	/* number of log buffers */
-	int			m_logbsize;	/* size of each log buffer */
-	uint			m_rsumlevels;	/* rt summary levels */
-	uint			m_rsumsize;	/* size of rt summary, bytes */
 	/*
 	 * Optional cache of rt summary level per bitmap block with the
 	 * invariant that m_rsum_cache[bbno] <= the minimum i for which
@@ -117,9 +108,15 @@ typedef struct xfs_mount {
 	xfs_buftarg_t		*m_ddev_targp;	/* saves taking the address */
 	xfs_buftarg_t		*m_logdev_targp;/* ptr to log device */
 	xfs_buftarg_t		*m_rtdev_targp;	/* ptr to rt device */
+
+	/*
+	 * Read-only variables that are pre-calculated at mount time.
+	 */
+	int ____cacheline_aligned m_bsize;	/* fs logical block size */
 	uint8_t			m_blkbit_log;	/* blocklog + NBBY */
 	uint8_t			m_blkbb_log;	/* blocklog - BBSHIFT */
 	uint8_t			m_agno_log;	/* log #ag's */
+	uint8_t			m_sectbb_log;	/* sectlog - BBSHIFT */
 	uint			m_blockmask;	/* sb_blocksize-1 */
 	uint			m_blockwsize;	/* sb_blocksize in words */
 	uint			m_blockwmask;	/* blockwsize-1 */
@@ -138,20 +135,35 @@ typedef struct xfs_mount {
 	xfs_extlen_t		m_ag_prealloc_blocks; /* reserved ag blocks */
 	uint			m_alloc_set_aside; /* space we can't use */
 	uint			m_ag_max_usable; /* max space per AG */
-	struct radix_tree_root	m_perag_tree;	/* per-ag accounting info */
-	spinlock_t		m_perag_lock;	/* lock for m_perag_tree */
-	struct mutex		m_growlock;	/* growfs mutex */
+	int			m_dalign;	/* stripe unit */
+	int			m_swidth;	/* stripe width */
+	xfs_agnumber_t		m_maxagi;	/* highest inode alloc group */
+	uint			m_allocsize_log;/* min write size log bytes */
+	uint			m_allocsize_blocks; /* min write size blocks */
+	int			m_logbufs;	/* number of log buffers */
+	int			m_logbsize;	/* size of each log buffer */
+	uint			m_rsumlevels;	/* rt summary levels */
+	uint			m_rsumsize;	/* size of rt summary, bytes */
 	int			m_fixedfsid[2];	/* unchanged for life of FS */
-	uint64_t		m_flags;	/* global mount flags */
-	bool			m_finobt_nores; /* no per-AG finobt resv. */
 	uint			m_qflags;	/* quota status flags */
+	uint64_t		m_flags;	/* global mount flags */
+	int64_t			m_low_space[XFS_LOWSP_MAX];
+	struct xfs_ino_geometry	m_ino_geo;	/* inode geometry */
 	struct xfs_trans_resv	m_resv;		/* precomputed res values */
+						/* low free space thresholds */
+	bool			m_always_cow;
+	bool			m_fail_unmount;
+	bool			m_finobt_nores;	/* no per-AG finobt resv. */
+	/*
+	 * End of pre-calculated read-only variables
+	 */
+
+	struct radix_tree_root	m_perag_tree;	/* per-ag accounting info */
+	spinlock_t		m_perag_lock;	/* lock for m_perag_tree */
+	struct mutex		m_growlock;	/* growfs mutex */
 	uint64_t		m_resblks;	/* total reserved blocks */
 	uint64_t		m_resblks_avail;/* available reserved blocks */
 	uint64_t		m_resblks_save;	/* reserved blks @ remount,ro */
-	int			m_dalign;	/* stripe unit */
-	int			m_swidth;	/* stripe width */
-	uint8_t			m_sectbb_log;	/* sectlog - BBSHIFT */
 	atomic_t		m_active_trans;	/* number trans frozen */
 	struct xfs_mru_cache	*m_filestream;	/* per-mount filestream data */
 	struct delayed_work	m_reclaim_work;	/* background inode reclaim */
@@ -160,8 +172,6 @@ typedef struct xfs_mount {
 	struct delayed_work	m_cowblocks_work; /* background cow blocks
 						     trimming */
 	bool			m_update_sb;	/* sb needs update in mount */
-	int64_t			m_low_space[XFS_LOWSP_MAX];
-						/* low free space thresholds */
 	struct xfs_kobj		m_kobj;
 	struct xfs_kobj		m_error_kobj;
 	struct xfs_kobj		m_error_meta_kobj;
@@ -191,8 +201,6 @@ typedef struct xfs_mount {
 	 */
 	uint32_t		m_generation;
 
-	bool			m_always_cow;
-	bool			m_fail_unmount;
 #ifdef DEBUG
 	/*
 	 * Frequency with which errors are injected. Replaces xfs_etest; the
From patchwork Tue May 12 09:28:08 2020
X-Patchwork-Submitter: Dave Chinner
X-Patchwork-Id: 11542769
From: Dave Chinner
To: linux-xfs@vger.kernel.org
Subject: [PATCH 2/5] xfs: convert m_active_trans counter to per-cpu
Date: Tue, 12 May 2020 19:28:08 +1000
Message-Id: <20200512092811.1846252-3-david@fromorbit.com>
In-Reply-To: <20200512092811.1846252-1-david@fromorbit.com>
References: <20200512092811.1846252-1-david@fromorbit.com>

From: Dave Chinner

It's a global atomic counter, and we are hitting it at a rate of half
a million transactions a second, so it's bouncing the counter
cacheline all over the place on large machines. Convert it to a
per-cpu counter.

And .... oh wow, that was unexpected!

Concurrent create, 50 million inodes, identical 16p/16GB virtual
machines on different physical hosts. Machine A has twice the CPU
cores per socket of machine B:

                unpatched       patched
machine A:      3m45s           2m27s
machine B:      4m13s           4m14s

Create rates:

                unpatched       patched
machine A:      246k+/-15k      384k+/-10k
machine B:      225k+/-13k      223k+/-11k

Concurrent rm of same 50 million inodes:

                unpatched       patched
machine A:      8m30s           3m09s
machine B:      5m02s           4m51s

The transaction rate on the fast machine went from about 250k/sec to
over 600k/sec, which indicates just how much of a bottleneck this
atomic counter was.
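For reference, the kernel percpu counter API that replaces the atomic
here follows this pattern (a minimal standalone sketch, not the actual
diff; the counter and function names are illustrative):

    #include <linux/percpu_counter.h>

    static struct percpu_counter active_trans;

    static int counters_setup(void)
    {
            /* allocates per-CPU storage, so this can fail */
            return percpu_counter_init(&active_trans, 0, GFP_KERNEL);
    }

    static void counters_use(void)
    {
            percpu_counter_inc(&active_trans);  /* CPU-local, no bouncing */
            percpu_counter_dec(&active_trans);

            /*
             * A precise read has to sum every CPU's delta, which is
             * expensive, so it is reserved for slow paths such as
             * freeze/quiesce.
             */
            if (percpu_counter_sum(&active_trans) == 0)
                    pr_info("all transactions drained\n");
    }

    static void counters_teardown(void)
    {
            percpu_counter_destroy(&active_trans);
    }

The trade-off is visible in the diff below: cheap updates on the
transaction hot path, while the quiesce path pays for the precise sum.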
Signed-off-by: Dave Chinner
Reviewed-by: Brian Foster
---
 fs/xfs/xfs_mount.h |  2 +-
 fs/xfs/xfs_super.c | 12 +++++++++---
 fs/xfs/xfs_trans.c |  6 +++---
 3 files changed, 13 insertions(+), 7 deletions(-)

diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
index 712b3e2583316..af3d8b71e9591 100644
--- a/fs/xfs/xfs_mount.h
+++ b/fs/xfs/xfs_mount.h
@@ -84,6 +84,7 @@ typedef struct xfs_mount {
 	 * extents or anything related to the rt device.
 	 */
 	struct percpu_counter	m_delalloc_blks;
+	struct percpu_counter	m_active_trans;	/* in progress xact counter */
 
 	struct xfs_buf		*m_sb_bp;	/* buffer for superblock */
 	char			*m_rtname;	/* realtime device name */
@@ -164,7 +165,6 @@ typedef struct xfs_mount {
 	uint64_t		m_resblks;	/* total reserved blocks */
 	uint64_t		m_resblks_avail;/* available reserved blocks */
 	uint64_t		m_resblks_save;	/* reserved blks @ remount,ro */
-	atomic_t		m_active_trans;	/* number trans frozen */
 	struct xfs_mru_cache	*m_filestream;	/* per-mount filestream data */
 	struct delayed_work	m_reclaim_work;	/* background inode reclaim */
 	struct delayed_work	m_eofblocks_work; /* background eof blocks
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index e80bd2c4c279e..bc4853525ce18 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -883,7 +883,7 @@ xfs_quiesce_attr(
 	int	error = 0;
 
 	/* wait for all modifications to complete */
-	while (atomic_read(&mp->m_active_trans) > 0)
+	while (percpu_counter_sum(&mp->m_active_trans) > 0)
 		delay(100);
 
 	/* force the log to unpin objects from the now complete transactions */
@@ -902,7 +902,7 @@ xfs_quiesce_attr(
 	 * Just warn here till VFS can correctly support
 	 * read-only remount without racing.
 	 */
-	WARN_ON(atomic_read(&mp->m_active_trans) != 0);
+	WARN_ON(percpu_counter_sum(&mp->m_active_trans) != 0);
 
 	xfs_log_quiesce(mp);
 }
@@ -1027,8 +1027,14 @@ xfs_init_percpu_counters(
 	if (error)
 		goto free_fdblocks;
 
+	error = percpu_counter_init(&mp->m_active_trans, 0, GFP_KERNEL);
+	if (error)
+		goto free_delalloc_blocks;
+
 	return 0;
 
+free_delalloc_blocks:
+	percpu_counter_destroy(&mp->m_delalloc_blks);
 free_fdblocks:
 	percpu_counter_destroy(&mp->m_fdblocks);
 free_ifree:
@@ -1057,6 +1063,7 @@ xfs_destroy_percpu_counters(
 	ASSERT(XFS_FORCED_SHUTDOWN(mp) ||
 	       percpu_counter_sum(&mp->m_delalloc_blks) == 0);
 	percpu_counter_destroy(&mp->m_delalloc_blks);
+	percpu_counter_destroy(&mp->m_active_trans);
 }
 
 static void
@@ -1792,7 +1799,6 @@ static int xfs_init_fs_context(
 	INIT_RADIX_TREE(&mp->m_perag_tree, GFP_ATOMIC);
 	spin_lock_init(&mp->m_perag_lock);
 	mutex_init(&mp->m_growlock);
-	atomic_set(&mp->m_active_trans, 0);
 	INIT_WORK(&mp->m_flush_inodes_work, xfs_flush_inodes_worker);
 	INIT_DELAYED_WORK(&mp->m_reclaim_work, xfs_reclaim_worker);
 	INIT_DELAYED_WORK(&mp->m_eofblocks_work, xfs_eofblocks_worker);
diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c
index 28b983ff8b113..636df5017782e 100644
--- a/fs/xfs/xfs_trans.c
+++ b/fs/xfs/xfs_trans.c
@@ -68,7 +68,7 @@ xfs_trans_free(
 	xfs_extent_busy_clear(tp->t_mountp, &tp->t_busy, false);
 
 	trace_xfs_trans_free(tp, _RET_IP_);
-	atomic_dec(&tp->t_mountp->m_active_trans);
+	percpu_counter_dec(&tp->t_mountp->m_active_trans);
 	if (!(tp->t_flags & XFS_TRANS_NO_WRITECOUNT))
 		sb_end_intwrite(tp->t_mountp->m_super);
 	xfs_trans_free_dqinfo(tp);
@@ -126,7 +126,7 @@ xfs_trans_dup(
 
 	xfs_trans_dup_dqinfo(tp, ntp);
 
-	atomic_inc(&tp->t_mountp->m_active_trans);
+	percpu_counter_inc(&tp->t_mountp->m_active_trans);
 	return ntp;
 }
 
@@ -275,7 +275,7 @@ xfs_trans_alloc(
 	 */
 	WARN_ON(resp->tr_logres > 0 &&
 		mp->m_super->s_writers.frozen == SB_FREEZE_COMPLETE);
-	atomic_inc(&mp->m_active_trans);
+	percpu_counter_inc(&mp->m_active_trans);
 
 	tp->t_magic = XFS_TRANS_HEADER_MAGIC;
 	tp->t_flags = flags;

From patchwork Tue May 12 09:28:09 2020
X-Patchwork-Submitter: Dave Chinner
X-Patchwork-Id: 11542763
From: Dave Chinner
To: linux-xfs@vger.kernel.org
Subject: [PATCH 3/5] [RFC] xfs: use percpu counters for CIL context counters
Date: Tue, 12 May 2020 19:28:09 +1000
Message-Id: <20200512092811.1846252-4-david@fromorbit.com>
In-Reply-To: <20200512092811.1846252-1-david@fromorbit.com>
References: <20200512092811.1846252-1-david@fromorbit.com>

From: Dave Chinner

With the m_active_trans atomic bottleneck out of the way, the CIL
xc_cil_lock is the next bottleneck that causes cacheline contention.
This lock protects several things, the first of which is the CIL
context reservation ticket and space usage counters.

We can lift them out of the xc_cil_lock by converting them to percpu
counters. This involves two things: the first is lifting calculations
and samples that don't actually need protecting from races outside the
xc_cil_lock. The second is converting the counters to percpu counters
and lifting them outside the lock.

This requires a couple of tricky things to minimise initial state
races and to ensure we take into account split reservations. We do
this by erring on the "take the reservation just in case" side, which
is largely lost in the noise of many frequent large transactions.

We use a trick with percpu_counter_add_batch() to ensure the global
sum is updated immediately on first reservation, hence allowing us to
use fast counter reads everywhere to determine if the CIL is empty or
not, rather than using the list itself.
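The add_batch trick in isolation (sketch only; the helper name is
invented): percpu_counter_add_batch() folds a CPU's local delta into
the global count once the delta reaches the batch size, so passing a
batch smaller than the amount being added forces that fold to happen
immediately:

    /* make the first context reservation globally visible at once */
    static void ctx_res_take(struct percpu_counter *curr_res, int ctx_res)
    {
            /* batch == ctx_res - 1 < amount: global sum updates now */
            percpu_counter_add_batch(curr_res, ctx_res, ctx_res - 1);
    }

After this, percpu_counter_read() on any CPU returns a non-zero value
without locking or summing all CPUs, which is what makes it usable as
the "is the CIL empty" test.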
This is important for later patches where the CIL is moved to percpu
lists and hence cannot use list_empty() to detect an empty CIL. Hence
this provides a low-overhead, lockless mechanism for determining
whether the CIL is empty.

All other percpu counter updates use a large batch count so they
aggregate on the local CPU and minimise global sum updates.

The xc_ctx_lock rwsem protects draining the percpu counters to the
context's ticket, similar to the way it allows access to the CIL
without using the xc_cil_lock. i.e. the CIL push has exclusive access
to the CIL, the context and the percpu counters while holding the
xc_ctx_lock. This ensures that we can sum and zero the counters
atomically from the perspective of the transaction commit side of the
push. i.e. they reset to zero atomically with the CIL context swap and
hence we don't need to have the percpu counters attached to the CIL
context.

Performance-wise, this increases the transaction rate from ~620,000/s
to around 750,000/s. Using a 32-way concurrent create instead of
16-way on a 32p/16GB virtual machine:

                create time     rate            unlink time
unpatched       2m03s           472k/s+/-9k/s   3m6s
patched         1m56s           533k/s+/-28k/s  2m34s

Notably, the system time for the create went from 44m20s down to
38m37s, whilst going faster. There is more variance, but I think that
is from the cacheline contention having inconsistent overhead.

XXX: probably should split into two patches

Signed-off-by: Dave Chinner
---
 fs/xfs/xfs_log_cil.c  | 99 ++++++++++++++++++++++++++++++-------------
 fs/xfs/xfs_log_priv.h |  2 +
 2 files changed, 72 insertions(+), 29 deletions(-)

diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
index b43f0e8f43f2e..746c841757ed1 100644
--- a/fs/xfs/xfs_log_cil.c
+++ b/fs/xfs/xfs_log_cil.c
@@ -393,7 +393,7 @@ xlog_cil_insert_items(
 	struct xfs_log_item	*lip;
 	int			len = 0;
 	int			diff_iovecs = 0;
-	int			iclog_space;
+	int			iclog_space, space_used;
 	int			iovhdr_res = 0, split_res = 0, ctx_res = 0;
 
 	ASSERT(tp);
@@ -403,17 +403,16 @@ xlog_cil_insert_items(
 	 * are done so it doesn't matter exactly how we update the CIL.
 	 */
 	xlog_cil_insert_format_items(log, tp, &len, &diff_iovecs);
-
-	spin_lock(&cil->xc_cil_lock);
-
 	/* account for space used by new iovec headers  */
+
 	iovhdr_res = diff_iovecs * sizeof(xlog_op_header_t);
 	len += iovhdr_res;
 	ctx->nvecs += diff_iovecs;
 
-	/* attach the transaction to the CIL if it has any busy extents */
-	if (!list_empty(&tp->t_busy))
-		list_splice_init(&tp->t_busy, &ctx->busy_extents);
+	/*
+	 * The ticket can't go away from us here, so we can do racy sampling
+	 * and precalculate everything.
+	 */
 
 	/*
 	 * Now transfer enough transaction reservation to the context ticket
@@ -421,27 +420,28 @@ xlog_cil_insert_items(
 	 * reservation has to grow as well as the current reservation as we
 	 * steal from tickets so we can correctly determine the space used
 	 * during the transaction commit.
+	 *
+	 * We use percpu_counter_add_batch() here to force the addition into the
+	 * global sum immediately. This will result in percpu_counter_read() now
+	 * always returning a non-zero value, and hence we'll only ever have a
+	 * very short race window on new contexts.
 	 */
-	if (ctx->ticket->t_curr_res == 0) {
+	if (percpu_counter_read(&cil->xc_curr_res) == 0) {
 		ctx_res = ctx->ticket->t_unit_res;
-		ctx->ticket->t_curr_res = ctx_res;
 		tp->t_ticket->t_curr_res -= ctx_res;
+		percpu_counter_add_batch(&cil->xc_curr_res, ctx_res, ctx_res - 1);
 	}
 
 	/* do we need space for more log record headers? */
-	iclog_space = log->l_iclog_size - log->l_iclog_hsize;
-	if (len > 0 && (ctx->space_used / iclog_space !=
-				(ctx->space_used + len) / iclog_space)) {
+	if (len > 0 && !ctx_res) {
+		iclog_space = log->l_iclog_size - log->l_iclog_hsize;
 		split_res = (len + iclog_space - 1) / iclog_space;
 		/* need to take into account split region headers, too */
 		split_res *= log->l_iclog_hsize + sizeof(struct xlog_op_header);
-		ctx->ticket->t_unit_res += split_res;
-		ctx->ticket->t_curr_res += split_res;
 		tp->t_ticket->t_curr_res -= split_res;
 		ASSERT(tp->t_ticket->t_curr_res >= len);
 	}
 	tp->t_ticket->t_curr_res -= len;
-	ctx->space_used += len;
 
 	/*
 	 * If we've overrun the reservation, dump the tx details before we move
@@ -458,6 +458,15 @@ xlog_cil_insert_items(
 		xlog_print_trans(tp);
 	}
 
+	percpu_counter_add_batch(&cil->xc_curr_res, split_res, 1000 * 1000);
+	percpu_counter_add_batch(&cil->xc_space_used, len, 1000 * 1000);
+
+	spin_lock(&cil->xc_cil_lock);
+
+	/* attach the transaction to the CIL if it has any busy extents */
+	if (!list_empty(&tp->t_busy))
+		list_splice_init(&tp->t_busy, &ctx->busy_extents);
+
 	/*
 	 * Now (re-)position everything modified at the tail of the CIL.
 	 * We do this here so we only need to take the CIL lock once during
@@ -741,6 +750,18 @@ xlog_cil_push_work(
 		num_iovecs += lv->lv_niovecs;
 	}
 
+	/*
+	 * Drain per cpu counters back to context so they can be re-initialised
+	 * to zero before we allow commits to the new context we are about to
+	 * switch to.
+	 */
+	ctx->space_used = percpu_counter_sum(&cil->xc_space_used);
+	ctx->ticket->t_curr_res = percpu_counter_sum(&cil->xc_curr_res);
+	ctx->ticket->t_unit_res = ctx->ticket->t_curr_res;
+	percpu_counter_set(&cil->xc_space_used, 0);
+	percpu_counter_set(&cil->xc_curr_res, 0);
+
+
 	/*
 	 * initialise the new context and attach it to the CIL. Then attach
 	 * the current context to the CIL committing lsit so it can be found
@@ -900,6 +921,7 @@ xlog_cil_push_background(
 	struct xlog	*log) __releases(cil->xc_ctx_lock)
 {
 	struct xfs_cil	*cil = log->l_cilp;
+	s64		space_used = percpu_counter_read(&cil->xc_space_used);
 
 	/*
 	 * The cil won't be empty because we are called while holding the
@@ -911,7 +933,7 @@ xlog_cil_push_background(
 	 * don't do a background push if we haven't used up all the
 	 * space available yet.
 	 */
-	if (cil->xc_ctx->space_used < XLOG_CIL_SPACE_LIMIT(log)) {
+	if (space_used < XLOG_CIL_SPACE_LIMIT(log)) {
 		up_read(&cil->xc_ctx_lock);
 		return;
 	}
@@ -934,9 +956,9 @@ xlog_cil_push_background(
 	 * If we are well over the space limit, throttle the work that is being
 	 * done until the push work on this context has begun.
 	 */
-	if (cil->xc_ctx->space_used >= XLOG_CIL_BLOCKING_SPACE_LIMIT(log)) {
+	if (space_used >= XLOG_CIL_BLOCKING_SPACE_LIMIT(log)) {
 		trace_xfs_log_cil_wait(log, cil->xc_ctx->ticket);
-		ASSERT(cil->xc_ctx->space_used < log->l_logsize);
+		ASSERT(space_used < log->l_logsize);
 		xlog_wait(&cil->xc_ctx->push_wait, &cil->xc_push_lock);
 		return;
 	}
@@ -1200,16 +1222,23 @@ xlog_cil_init(
 {
 	struct xfs_cil	*cil;
 	struct xfs_cil_ctx *ctx;
+	int		error = -ENOMEM;
 
 	cil = kmem_zalloc(sizeof(*cil), KM_MAYFAIL);
 	if (!cil)
-		return -ENOMEM;
+		return error;
 
 	ctx = kmem_zalloc(sizeof(*ctx), KM_MAYFAIL);
-	if (!ctx) {
-		kmem_free(cil);
-		return -ENOMEM;
-	}
+	if (!ctx)
+		goto out_free_cil;
+
+	error = percpu_counter_init(&cil->xc_space_used, 0, GFP_KERNEL);
+	if (error)
+		goto out_free_ctx;
+
+	error = percpu_counter_init(&cil->xc_curr_res, 0, GFP_KERNEL);
+	if (error)
+		goto out_free_space;
 
 	INIT_WORK(&cil->xc_push_work, xlog_cil_push_work);
 	INIT_LIST_HEAD(&cil->xc_cil);
@@ -1230,19 +1259,31 @@ xlog_cil_init(
 	cil->xc_log = log;
 	log->l_cilp = cil;
 	return 0;
+
+out_free_space:
+	percpu_counter_destroy(&cil->xc_space_used);
+out_free_ctx:
+	kmem_free(ctx);
+out_free_cil:
+	kmem_free(cil);
+	return error;
 }
 
 void
 xlog_cil_destroy(
 	struct xlog	*log)
 {
-	if (log->l_cilp->xc_ctx) {
-		if (log->l_cilp->xc_ctx->ticket)
-			xfs_log_ticket_put(log->l_cilp->xc_ctx->ticket);
-		kmem_free(log->l_cilp->xc_ctx);
+	struct xfs_cil	*cil = log->l_cilp;
+
+	if (cil->xc_ctx) {
+		if (cil->xc_ctx->ticket)
+			xfs_log_ticket_put(cil->xc_ctx->ticket);
+		kmem_free(cil->xc_ctx);
 	}
+	percpu_counter_destroy(&cil->xc_space_used);
+	percpu_counter_destroy(&cil->xc_curr_res);
 
-	ASSERT(list_empty(&log->l_cilp->xc_cil));
-	kmem_free(log->l_cilp);
+	ASSERT(list_empty(&cil->xc_cil));
+	kmem_free(cil);
 }
diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h
index ec22c7a3867f1..f5e79a7d44c8e 100644
--- a/fs/xfs/xfs_log_priv.h
+++ b/fs/xfs/xfs_log_priv.h
@@ -262,6 +262,8 @@ struct xfs_cil_ctx {
  */
 struct xfs_cil {
 	struct xlog		*xc_log;
+	struct percpu_counter	xc_space_used;
+	struct percpu_counter	xc_curr_res;
 	struct list_head	xc_cil;
 	spinlock_t		xc_cil_lock;

From patchwork Tue May 12 09:28:10 2020
X-Patchwork-Submitter: Dave Chinner
X-Patchwork-Id: 11542771
From: Dave Chinner
To: linux-xfs@vger.kernel.org
Subject: [PATCH 4/5] [RFC] xfs: per-cpu CIL lists
Date: Tue, 12 May 2020 19:28:10 +1000
Message-Id: <20200512092811.1846252-5-david@fromorbit.com>
In-Reply-To: <20200512092811.1846252-1-david@fromorbit.com>
References: <20200512092811.1846252-1-david@fromorbit.com>

From: Dave Chinner

Next on the list for getting rid of the xc_cil_lock is making the CIL
itself per-cpu. This requires a trade-off: we no longer move items
forward in the CIL; once they are on the CIL they remain there, as we
treat the percpu lists as lockless.

XXX: preempt_disable() around the list operations to ensure they stay
local to the CPU.

XXX: this needs CPU hotplug notifiers to clean up when CPUs go
offline.

Performance now increases substantially - the transaction rate goes
from 750,000/s to 1.05M/s, and the unlink rate is over 500,000/s for
the first time.

Using a 32-way concurrent create/unlink on a 32p/16GB virtual machine:

                create time     rate            unlink time
unpatched       1m56s           533k/s+/-28k/s  2m34s
patched         1m49s           523k/s+/-14k/s  2m00s

Notably, the system time for the create went up, while variance went
down. This indicates we're starting to hit some other contention limit
as we reduce the amount of time we spend contending on the
xc_cil_lock.
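Factored out of the patch, the per-cpu list pattern looks like this (a
standalone sketch with invented names; like the patch itself, it
ignores CPU hotplug, per the XXX notes above):

    #include <linux/percpu.h>
    #include <linux/list.h>
    #include <linux/preempt.h>

    static struct list_head __percpu *pcp_lists;

    static int pcp_lists_init(void)
    {
            int cpu;

            pcp_lists = alloc_percpu(struct list_head);
            if (!pcp_lists)
                    return -ENOMEM;
            for_each_possible_cpu(cpu)
                    INIT_LIST_HEAD(per_cpu_ptr(pcp_lists, cpu));
            return 0;
    }

    /* commit path: append to this CPU's list; no shared lock taken */
    static void pcp_lists_add(struct list_head *item)
    {
            preempt_disable();      /* keep the operation CPU-local */
            list_add_tail(item, this_cpu_ptr(pcp_lists));
            preempt_enable();
    }

    /*
     * push path: the caller must already exclude all adders (the CIL
     * push holds xc_ctx_lock exclusively), after which the per-cpu
     * lists can be spliced into one local list and walked unlocked.
     */
    static void pcp_lists_drain(struct list_head *out)
    {
            int cpu;

            for_each_online_cpu(cpu)
                    list_splice_tail_init(per_cpu_ptr(pcp_lists, cpu), out);
    }

The cost of this scheme is the trade-off stated above: an item stays
on whichever CPU's list it was first committed on, so it can no longer
be repositioned at a global tail on recommit.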
Signed-off-by: Dave Chinner
---
 fs/xfs/xfs_log_cil.c  | 66 ++++++++++++++++++++++++++++---------------
 fs/xfs/xfs_log_priv.h |  2 +-
 2 files changed, 45 insertions(+), 23 deletions(-)

diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
index 746c841757ed1..af444bc69a7cd 100644
--- a/fs/xfs/xfs_log_cil.c
+++ b/fs/xfs/xfs_log_cil.c
@@ -467,28 +467,28 @@ xlog_cil_insert_items(
 	if (!list_empty(&tp->t_busy))
 		list_splice_init(&tp->t_busy, &ctx->busy_extents);
 
+	spin_unlock(&cil->xc_cil_lock);
+
 	/*
 	 * Now (re-)position everything modified at the tail of the CIL.
 	 * We do this here so we only need to take the CIL lock once during
 	 * the transaction commit.
+	 * Move everything to the tail of the local per-cpu CIL list.
 	 */
 	list_for_each_entry(lip, &tp->t_items, li_trans) {
-
 		/* Skip items which aren't dirty in this transaction. */
 		if (!test_bit(XFS_LI_DIRTY, &lip->li_flags))
 			continue;
 
 		/*
-		 * Only move the item if it isn't already at the tail. This is
-		 * to prevent a transient list_empty() state when reinserting
-		 * an item that is already the only item in the CIL.
+		 * If the item is already in the CIL, don't try to reposition it
+		 * because we don't know what per-cpu list it is on.
 		 */
-		if (!list_is_last(&lip->li_cil, &cil->xc_cil))
-			list_move_tail(&lip->li_cil, &cil->xc_cil);
+		if (!list_empty(&lip->li_cil))
+			continue;
+		list_add_tail(&lip->li_cil, this_cpu_ptr(cil->xc_cil));
 	}
 
-	spin_unlock(&cil->xc_cil_lock);
-
 	if (tp->t_ticket->t_curr_res < 0)
 		xfs_force_shutdown(log->l_mp, SHUTDOWN_LOG_IO_ERROR);
 }
@@ -666,6 +666,8 @@ xlog_cil_push_work(
 	struct xfs_log_vec	lvhdr = { NULL };
 	xfs_lsn_t		commit_lsn;
 	xfs_lsn_t		push_seq;
+	LIST_HEAD	(cil_items);
+	int		cpu;
 
 	new_ctx = kmem_zalloc(sizeof(*new_ctx), KM_NOFS);
 	new_ctx->ticket = xlog_cil_ticket_alloc(log);
@@ -687,7 +689,7 @@ xlog_cil_push_work(
 	 * move on to a new sequence number and so we have to be able to push
 	 * this sequence again later.
 	 */
-	if (list_empty(&cil->xc_cil)) {
+	if (percpu_counter_read(&cil->xc_curr_res) == 0) {
 		cil->xc_push_seq = 0;
 		spin_unlock(&cil->xc_push_lock);
 		goto out_skip;
@@ -728,17 +730,21 @@ xlog_cil_push_work(
 	spin_unlock(&cil->xc_push_lock);
 
 	/*
-	 * pull all the log vectors off the items in the CIL, and
-	 * remove the items from the CIL. We don't need the CIL lock
-	 * here because it's only needed on the transaction commit
-	 * side which is currently locked out by the flush lock.
+	 * Remove the items from the per-cpu CIL lists and then pull all the
+	 * log vectors off the items. We hold the xc_ctx_lock exclusively here,
+	 * so nothing can be adding or removing from the per-cpu lists here.
 	 */
+	/* XXX: hotplug! */
+	for_each_online_cpu(cpu) {
+		list_splice_tail_init(per_cpu_ptr(cil->xc_cil, cpu), &cil_items);
+	}
+
 	lv = NULL;
 	num_iovecs = 0;
-	while (!list_empty(&cil->xc_cil)) {
+	while (!list_empty(&cil_items)) {
 		struct xfs_log_item	*item;
 
-		item = list_first_entry(&cil->xc_cil,
+		item = list_first_entry(&cil_items,
 					struct xfs_log_item, li_cil);
 		list_del_init(&item->li_cil);
 		if (!ctx->lv_chain)
@@ -927,7 +933,7 @@ xlog_cil_push_background(
 	 * The cil won't be empty because we are called while holding the
 	 * context lock so whatever we added to the CIL will still be there
 	 */
-	ASSERT(!list_empty(&cil->xc_cil));
+	ASSERT(space_used != 0);
 
 	/*
 	 * don't do a background push if we haven't used up all the
@@ -993,7 +999,8 @@ xlog_cil_push_now(
 	 * there's no work we need to do.
 	 */
 	spin_lock(&cil->xc_push_lock);
-	if (list_empty(&cil->xc_cil) || push_seq <= cil->xc_push_seq) {
+	if (percpu_counter_read(&cil->xc_curr_res) == 0 ||
+	    push_seq <= cil->xc_push_seq) {
 		spin_unlock(&cil->xc_push_lock);
 		return;
 	}
@@ -1011,7 +1018,7 @@ xlog_cil_empty(
 	bool		empty = false;
 
 	spin_lock(&cil->xc_push_lock);
-	if (list_empty(&cil->xc_cil))
+	if (percpu_counter_read(&cil->xc_curr_res) == 0)
 		empty = true;
 	spin_unlock(&cil->xc_push_lock);
 	return empty;
@@ -1163,7 +1170,7 @@ xlog_cil_force_lsn(
 	 * we would have found the context on the committing list.
 	 */
 	if (sequence == cil->xc_current_sequence &&
-	    !list_empty(&cil->xc_cil)) {
+	    percpu_counter_read(&cil->xc_curr_res) != 0) {
 		spin_unlock(&cil->xc_push_lock);
 		goto restart;
 	}
@@ -1223,6 +1230,7 @@ xlog_cil_init(
 	struct xfs_cil	*cil;
 	struct xfs_cil_ctx *ctx;
 	int		error = -ENOMEM;
+	int		cpu;
 
 	cil = kmem_zalloc(sizeof(*cil), KM_MAYFAIL);
 	if (!cil)
@@ -1232,16 +1240,24 @@ xlog_cil_init(
 	if (!ctx)
 		goto out_free_cil;
 
+	/* XXX: CPU hotplug!!! */
+	cil->xc_cil = alloc_percpu_gfp(struct list_head, GFP_KERNEL);
+	if (!cil->xc_cil)
+		goto out_free_ctx;
+
+	for_each_possible_cpu(cpu) {
+		INIT_LIST_HEAD(per_cpu_ptr(cil->xc_cil, cpu));
+	}
+
 	error = percpu_counter_init(&cil->xc_space_used, 0, GFP_KERNEL);
 	if (error)
-		goto out_free_ctx;
+		goto out_free_pcp_cil;
 
 	error = percpu_counter_init(&cil->xc_curr_res, 0, GFP_KERNEL);
 	if (error)
 		goto out_free_space;
 
 	INIT_WORK(&cil->xc_push_work, xlog_cil_push_work);
-	INIT_LIST_HEAD(&cil->xc_cil);
 	INIT_LIST_HEAD(&cil->xc_committing);
 	spin_lock_init(&cil->xc_cil_lock);
 	spin_lock_init(&cil->xc_push_lock);
@@ -1262,6 +1278,8 @@ xlog_cil_init(
 
 out_free_space:
 	percpu_counter_destroy(&cil->xc_space_used);
+out_free_pcp_cil:
+	free_percpu(cil->xc_cil);
 out_free_ctx:
 	kmem_free(ctx);
 out_free_cil:
@@ -1274,6 +1292,7 @@ xlog_cil_destroy(
 	struct xlog	*log)
 {
 	struct xfs_cil	*cil = log->l_cilp;
+	int		cpu;
 
 	if (cil->xc_ctx) {
 		if (cil->xc_ctx->ticket)
@@ -1283,7 +1302,10 @@ xlog_cil_destroy(
 	percpu_counter_destroy(&cil->xc_space_used);
 	percpu_counter_destroy(&cil->xc_curr_res);
 
-	ASSERT(list_empty(&cil->xc_cil));
+	for_each_possible_cpu(cpu) {
+		ASSERT(list_empty(per_cpu_ptr(cil->xc_cil, cpu)));
+	}
+	free_percpu(cil->xc_cil);
 	kmem_free(cil);
 }
diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h
index f5e79a7d44c8e..0bb982920d070 100644
--- a/fs/xfs/xfs_log_priv.h
+++ b/fs/xfs/xfs_log_priv.h
@@ -264,7 +264,7 @@ struct xfs_cil {
 	struct xlog		*xc_log;
 	struct percpu_counter	xc_space_used;
 	struct percpu_counter	xc_curr_res;
-	struct list_head	xc_cil;
+	struct list_head	__percpu *xc_cil;
 	spinlock_t		xc_cil_lock;
 
 	struct rw_semaphore	xc_ctx_lock ____cacheline_aligned_in_smp;

From patchwork Tue May 12 09:28:11 2020
X-Patchwork-Submitter: Dave Chinner
X-Patchwork-Id: 11542765
From: Dave Chinner
To: linux-xfs@vger.kernel.org
Subject: [PATCH 5/5] [RFC] xfs: make CIL busy extent lists per-cpu
Date: Tue, 12 May 2020 19:28:11 +1000
Message-Id: <20200512092811.1846252-6-david@fromorbit.com>
In-Reply-To: <20200512092811.1846252-1-david@fromorbit.com>
References: <20200512092811.1846252-1-david@fromorbit.com>

From: Dave Chinner

We use the same percpu list trick with the busy extents as we do with
the CIL lists, and this gets rid of the last use of the xc_cil_lock in
the commit fast path.

As noted in the previous patch, it looked like we were approaching
another bottleneck, and that can be seen by the fact that performance
didn't substantially increase even though there is now no lock
contention in the commit path. The transaction rate only increases
slightly, to 1.12M/s.

Using a 32-way concurrent create/unlink on a 32p/16GB virtual machine:

                create time     rate            unlink time
unpatched       1m49s           523k/s+/-14k/s  2m00s
patched         1m48s           535k/s+/-24k/s  1m51s

So variance went back up, and performance improved slightly.
Profiling at this point indicates spinlock contention at the VFS level
(inode_sb_list_add() and dentry cache pathwalking), so significant
further gains will require VFS surgery.

Signed-off-by: Dave Chinner
---
 fs/xfs/xfs_log_cil.c  | 35 ++++++++++++++++++-----------------
 fs/xfs/xfs_log_priv.h | 12 +++++++++++-
 2 files changed, 29 insertions(+), 18 deletions(-)

diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
index af444bc69a7cd..d3a5f8478d64a 100644
--- a/fs/xfs/xfs_log_cil.c
+++ b/fs/xfs/xfs_log_cil.c
@@ -461,13 +461,9 @@ xlog_cil_insert_items(
 	percpu_counter_add_batch(&cil->xc_curr_res, split_res, 1000 * 1000);
 	percpu_counter_add_batch(&cil->xc_space_used, len, 1000 * 1000);
 
-	spin_lock(&cil->xc_cil_lock);
-
 	/* attach the transaction to the CIL if it has any busy extents */
 	if (!list_empty(&tp->t_busy))
-		list_splice_init(&tp->t_busy, &ctx->busy_extents);
-
-	spin_unlock(&cil->xc_cil_lock);
+		list_splice_tail_init(&tp->t_busy, pcp_busy(cil));
 
 	/*
 	 * Now (re-)position everything modified at the tail of the CIL.
@@ -486,7 +482,7 @@ xlog_cil_insert_items(
 		 */
 		if (!list_empty(&lip->li_cil))
 			continue;
-		list_add_tail(&lip->li_cil, this_cpu_ptr(cil->xc_cil));
+		list_add_tail(&lip->li_cil, pcp_cil(cil));
 	}
 
 	if (tp->t_ticket->t_curr_res < 0)
@@ -733,10 +729,14 @@ xlog_cil_push_work(
 	 * Remove the items from the per-cpu CIL lists and then pull all the
 	 * log vectors off the items. We hold the xc_ctx_lock exclusively here,
 	 * so nothing can be adding or removing from the per-cpu lists here.
+	 *
+	 * Also splice the busy extents onto the context while we are walking
+	 * the percpu structure.
 	 */
 	/* XXX: hotplug! */
 	for_each_online_cpu(cpu) {
-		list_splice_tail_init(per_cpu_ptr(cil->xc_cil, cpu), &cil_items);
+		list_splice_tail_init(pcp_cil_cpu(cil, cpu), &cil_items);
+		list_splice_tail_init(pcp_busy_cpu(cil, cpu), &ctx->busy_extents);
 	}
 
 	lv = NULL;
@@ -933,7 +933,7 @@ xlog_cil_push_background(
 	 * The cil won't be empty because we are called while holding the
 	 * context lock so whatever we added to the CIL will still be there
 	 */
-	ASSERT(space_used != 0);
+	ASSERT(percpu_counter_read(&cil->xc_curr_res) != 0);
 
 	/*
 	 * don't do a background push if we haven't used up all the
@@ -1241,17 +1241,18 @@ xlog_cil_init(
 		goto out_free_cil;
 
 	/* XXX: CPU hotplug!!! */
-	cil->xc_cil = alloc_percpu_gfp(struct list_head, GFP_KERNEL);
-	if (!cil->xc_cil)
+	cil->xc_pcp = alloc_percpu_gfp(struct xfs_cil_pcpu, GFP_KERNEL);
+	if (!cil->xc_pcp)
 		goto out_free_ctx;
 
 	for_each_possible_cpu(cpu) {
-		INIT_LIST_HEAD(per_cpu_ptr(cil->xc_cil, cpu));
+		INIT_LIST_HEAD(pcp_cil_cpu(cil, cpu));
+		INIT_LIST_HEAD(pcp_busy_cpu(cil, cpu));
 	}
 
 	error = percpu_counter_init(&cil->xc_space_used, 0, GFP_KERNEL);
 	if (error)
-		goto out_free_pcp_cil;
+		goto out_free_pcp;
 
 	error = percpu_counter_init(&cil->xc_curr_res, 0, GFP_KERNEL);
 	if (error)
@@ -1259,7 +1260,6 @@ xlog_cil_init(
 
 	INIT_WORK(&cil->xc_push_work, xlog_cil_push_work);
 	INIT_LIST_HEAD(&cil->xc_committing);
-	spin_lock_init(&cil->xc_cil_lock);
 	spin_lock_init(&cil->xc_push_lock);
 	init_rwsem(&cil->xc_ctx_lock);
 	init_waitqueue_head(&cil->xc_commit_wait);
@@ -1278,8 +1278,8 @@ xlog_cil_init(
 
 out_free_space:
 	percpu_counter_destroy(&cil->xc_space_used);
-out_free_pcp_cil:
-	free_percpu(cil->xc_cil);
+out_free_pcp:
+	free_percpu(cil->xc_pcp);
 out_free_ctx:
 	kmem_free(ctx);
 out_free_cil:
@@ -1303,9 +1303,10 @@ xlog_cil_destroy(
 	percpu_counter_destroy(&cil->xc_curr_res);
 
 	for_each_possible_cpu(cpu) {
-		ASSERT(list_empty(per_cpu_ptr(cil->xc_cil, cpu)));
+		ASSERT(list_empty(pcp_cil_cpu(cil, cpu)));
+		ASSERT(list_empty(pcp_busy_cpu(cil, cpu)));
 	}
-	free_percpu(cil->xc_cil);
+	free_percpu(cil->xc_pcp);
 	kmem_free(cil);
 }
 
diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h
index 0bb982920d070..cfc22c9482ea4 100644
--- a/fs/xfs/xfs_log_priv.h
+++ b/fs/xfs/xfs_log_priv.h
@@ -260,11 +260,16 @@ struct xfs_cil_ctx {
  * the commit LSN to be determined as well. This should make synchronous
 * operations almost as efficient as the old logging methods.
 */
+struct xfs_cil_pcpu {
+	struct list_head	p_cil;
+	struct list_head	p_busy_extents;
+};
+
 struct xfs_cil {
 	struct xlog		*xc_log;
 	struct percpu_counter	xc_space_used;
 	struct percpu_counter	xc_curr_res;
-	struct list_head	__percpu *xc_cil;
+	struct xfs_cil_pcpu	__percpu *xc_pcp;
 	spinlock_t		xc_cil_lock;
 
 	struct rw_semaphore	xc_ctx_lock ____cacheline_aligned_in_smp;
@@ -278,6 +283,11 @@ struct xfs_cil {
 	struct work_struct	xc_push_work;
 } ____cacheline_aligned_in_smp;
 
+#define pcp_cil(cil)		&(this_cpu_ptr(cil->xc_pcp)->p_cil)
+#define pcp_cil_cpu(cil, cpu)	&(per_cpu_ptr(cil->xc_pcp, cpu)->p_cil)
+#define pcp_busy(cil)		&(this_cpu_ptr(cil->xc_pcp)->p_busy_extents)
+#define pcp_busy_cpu(cil, cpu)	&(per_cpu_ptr(cil->xc_pcp, cpu)->p_busy_extents)
+
 /*
 * The amount of log space we allow the CIL to aggregate is difficult to size.
 * Whatever we choose, we have to make sure we can get a reservation for the