From patchwork Tue May 12 09:28:07 2020
X-Patchwork-Submitter: Dave Chinner
X-Patchwork-Id: 11542767
From: Dave Chinner
To: linux-xfs@vger.kernel.org
Subject: [PATCH 1/5] xfs: separate read-only variables in struct xfs_mount
Date: Tue, 12 May 2020 19:28:07 +1000
Message-Id: <20200512092811.1846252-2-david@fromorbit.com>
In-Reply-To: <20200512092811.1846252-1-david@fromorbit.com>
References: <20200512092811.1846252-1-david@fromorbit.com>

From: Dave Chinner

Seeing massive CPU usage from xfs_agino_range() on one machine;
instruction-level profiles look similar to another machine running the
same workload, but one machine is consuming 10x as much CPU as the
other and going much slower. The only real difference between the two
machines is core count per socket. Both are running identical 16p/16GB
virtual machine configurations.

Machine A:

  25.83%  [k] xfs_agino_range
  12.68%  [k] __xfs_dir3_data_check
   6.95%  [k] xfs_verify_ino
   6.78%  [k] xfs_dir2_data_entry_tag_p
   3.56%  [k] xfs_buf_find
   2.31%  [k] xfs_verify_dir_ino
   2.02%  [k] xfs_dabuf_map.constprop.0
   1.65%  [k] xfs_ag_block_count

And takes around 13 minutes to remove 50 million inodes.

Machine B:

  13.90%  [k] __pv_queued_spin_lock_slowpath
   3.76%  [k] do_raw_spin_lock
   2.83%  [k] xfs_dir3_leaf_check_int
   2.75%  [k] xfs_agino_range
   2.51%  [k] __raw_callee_save___pv_queued_spin_unlock
   2.18%  [k] __xfs_dir3_data_check
   2.02%  [k] xfs_log_commit_cil

And takes around 5m30s to remove 50 million inodes.

The suspect is cacheline contention on m_sectbb_log, which is used in
one of the macros in xfs_agino_range. This is a read-only variable,
but it shares a cacheline with m_active_trans, a global atomic that
gets bounced all around the machine.
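In miniature, the false sharing problem and the fix being applied here
look like this (a toy struct with invented field names, not the actual
xfs_mount layout; ____cacheline_aligned is the same annotation the
patch uses):

    #include <linux/cache.h>
    #include <linux/atomic.h>

    struct toy_mount {
            /*
             * Read-only after mount: every CPU can hold a shared copy
             * of this cacheline indefinitely.
             */
            int             ro_sectbb_log ____cacheline_aligned;
            int             ro_agno_log;

            /*
             * Write-hot: starting a new cacheline here means the
             * atomic only bounces itself, not the read-only fields
             * above it.
             */
            atomic_t        rw_active_trans ____cacheline_aligned;
    };

If the two groups share a cacheline, every atomic update invalidates
the read-only fields in every other CPU's cache, which is the innocent
bystander effect described below.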
The workload is trying to run hundreds of thousands of transactions
per second, so cacheline contention will be occurring on this atomic
counter. Hence xfs_agino_range() is likely just an innocent bystander
as the cache coherency protocol fights over the cacheline between CPU
cores and sockets.

On machine A, this rearrangement of the struct xfs_mount results in
the profile changing to:

   9.77%  [kernel]  [k] xfs_agino_range
   6.27%  [kernel]  [k] __xfs_dir3_data_check
   5.31%  [kernel]  [k] __pv_queued_spin_lock_slowpath
   4.54%  [kernel]  [k] xfs_buf_find
   3.79%  [kernel]  [k] do_raw_spin_lock
   3.39%  [kernel]  [k] xfs_verify_ino
   2.73%  [kernel]  [k] __raw_callee_save___pv_queued_spin_unlock

Vastly less CPU usage in xfs_agino_range(), but still 3x that of
machine B, and the workload still runs substantially slower than it
should.

Current rm -rf of 50 million files:

                vanilla         patched
machine A       13m20s          8m30s
machine B        5m30s          5m02s

It's an improvement, indicating that separation and further
optimisation of read-only global filesystem data is worthwhile, but it
clearly isn't the underlying issue causing this specific performance
degradation.

Signed-off-by: Dave Chinner
Reviewed-by: Brian Foster
Reviewed-by: Darrick J. Wong
---
 fs/xfs/xfs_mount.h | 50 +++++++++++++++++++++++++++-------------------
 1 file changed, 29 insertions(+), 21 deletions(-)

diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
index aba5a15792792..712b3e2583316 100644
--- a/fs/xfs/xfs_mount.h
+++ b/fs/xfs/xfs_mount.h
@@ -88,21 +88,12 @@ typedef struct xfs_mount {
 	struct xfs_buf		*m_sb_bp;	/* buffer for superblock */
 	char			*m_rtname;	/* realtime device name */
 	char			*m_logname;	/* external log device name */
-	int			m_bsize;	/* fs logical block size */
 	xfs_agnumber_t		m_agfrotor;	/* last ag where space found */
 	xfs_agnumber_t		m_agirotor;	/* last ag dir inode alloced */
 	spinlock_t		m_agirotor_lock;/* .. and lock protecting it */
-	xfs_agnumber_t		m_maxagi;	/* highest inode alloc group */
-	uint			m_allocsize_log;/* min write size log bytes */
-	uint			m_allocsize_blocks; /* min write size blocks */
 	struct xfs_da_geometry	*m_dir_geo;	/* directory block geometry */
 	struct xfs_da_geometry	*m_attr_geo;	/* attribute block geometry */
 	struct xlog		*m_log;		/* log specific stuff */
-	struct xfs_ino_geometry	m_ino_geo;	/* inode geometry */
-	int			m_logbufs;	/* number of log buffers */
-	int			m_logbsize;	/* size of each log buffer */
-	uint			m_rsumlevels;	/* rt summary levels */
-	uint			m_rsumsize;	/* size of rt summary, bytes */
 	/*
 	 * Optional cache of rt summary level per bitmap block with the
 	 * invariant that m_rsum_cache[bbno] <= the minimum i for which
@@ -117,9 +108,15 @@ typedef struct xfs_mount {
 	xfs_buftarg_t		*m_ddev_targp;	/* saves taking the address */
 	xfs_buftarg_t		*m_logdev_targp;/* ptr to log device */
 	xfs_buftarg_t		*m_rtdev_targp;	/* ptr to rt device */
+
+	/*
+	 * Read-only variables that are pre-calculated at mount time.
+	 */
+	int ____cacheline_aligned m_bsize;	/* fs logical block size */
 	uint8_t			m_blkbit_log;	/* blocklog + NBBY */
 	uint8_t			m_blkbb_log;	/* blocklog - BBSHIFT */
 	uint8_t			m_agno_log;	/* log #ag's */
+	uint8_t			m_sectbb_log;	/* sectlog - BBSHIFT */
 	uint			m_blockmask;	/* sb_blocksize-1 */
 	uint			m_blockwsize;	/* sb_blocksize in words */
 	uint			m_blockwmask;	/* blockwsize-1 */
@@ -138,20 +135,35 @@ typedef struct xfs_mount {
 	xfs_extlen_t		m_ag_prealloc_blocks; /* reserved ag blocks */
 	uint			m_alloc_set_aside; /* space we can't use */
 	uint			m_ag_max_usable; /* max space per AG */
-	struct radix_tree_root	m_perag_tree;	/* per-ag accounting info */
-	spinlock_t		m_perag_lock;	/* lock for m_perag_tree */
-	struct mutex		m_growlock;	/* growfs mutex */
+	int			m_dalign;	/* stripe unit */
+	int			m_swidth;	/* stripe width */
+	xfs_agnumber_t		m_maxagi;	/* highest inode alloc group */
+	uint			m_allocsize_log;/* min write size log bytes */
+	uint			m_allocsize_blocks; /* min write size blocks */
+	int			m_logbufs;	/* number of log buffers */
+	int			m_logbsize;	/* size of each log buffer */
+	uint			m_rsumlevels;	/* rt summary levels */
+	uint			m_rsumsize;	/* size of rt summary, bytes */
 	int			m_fixedfsid[2];	/* unchanged for life of FS */
-	uint64_t		m_flags;	/* global mount flags */
-	bool			m_finobt_nores; /* no per-AG finobt resv. */
 	uint			m_qflags;	/* quota status flags */
+	uint64_t		m_flags;	/* global mount flags */
+	int64_t			m_low_space[XFS_LOWSP_MAX];
+	struct xfs_ino_geometry	m_ino_geo;	/* inode geometry */
 	struct xfs_trans_resv	m_resv;		/* precomputed res values */
+						/* low free space thresholds */
+	bool			m_always_cow;
+	bool			m_fail_unmount;
+	bool			m_finobt_nores;	/* no per-AG finobt resv. */
+	/*
+	 * End of pre-calculated read-only variables
+	 */
+
+	struct radix_tree_root	m_perag_tree;	/* per-ag accounting info */
+	spinlock_t		m_perag_lock;	/* lock for m_perag_tree */
+	struct mutex		m_growlock;	/* growfs mutex */
 	uint64_t		m_resblks;	/* total reserved blocks */
 	uint64_t		m_resblks_avail;/* available reserved blocks */
 	uint64_t		m_resblks_save;	/* reserved blks @ remount,ro */
-	int			m_dalign;	/* stripe unit */
-	int			m_swidth;	/* stripe width */
-	uint8_t			m_sectbb_log;	/* sectlog - BBSHIFT */
 	atomic_t		m_active_trans;	/* number trans frozen */
 	struct xfs_mru_cache	*m_filestream;	/* per-mount filestream data */
 	struct delayed_work	m_reclaim_work;	/* background inode reclaim */
@@ -160,8 +172,6 @@ typedef struct xfs_mount {
 	struct delayed_work	m_cowblocks_work; /* background cow blocks
 						     trimming */
 	bool			m_update_sb;	/* sb needs update in mount */
-	int64_t			m_low_space[XFS_LOWSP_MAX];
-						/* low free space thresholds */
 	struct xfs_kobj		m_kobj;
 	struct xfs_kobj		m_error_kobj;
 	struct xfs_kobj		m_error_meta_kobj;
@@ -191,8 +201,6 @@ typedef struct xfs_mount {
 	 */
 	uint32_t		m_generation;
 
-	bool			m_always_cow;
-	bool			m_fail_unmount;
 #ifdef DEBUG
 	/*
 	 * Frequency with which errors are injected. Replaces xfs_etest; the
From patchwork Tue May 12 09:28:08 2020
X-Patchwork-Submitter: Dave Chinner
X-Patchwork-Id: 11542769
From: Dave Chinner
To: linux-xfs@vger.kernel.org
Subject: [PATCH 2/5] xfs: convert m_active_trans counter to per-cpu
Date: Tue, 12 May 2020 19:28:08 +1000
Message-Id: <20200512092811.1846252-3-david@fromorbit.com>
In-Reply-To: <20200512092811.1846252-1-david@fromorbit.com>
References: <20200512092811.1846252-1-david@fromorbit.com>

From: Dave Chinner

It's a global atomic counter, and we are hitting it at a rate of half
a million transactions a second, so it's bouncing the counter
cacheline all over the place on large machines. Convert it to a
per-cpu counter.

And .... oh wow, that was unexpected!

Concurrent create, 50 million inodes, identical 16p/16GB virtual
machines on different physical hosts. Machine A has twice the CPU
cores per socket of machine B:

                unpatched       patched
machine A:      3m45s           2m27s
machine B:      4m13s           4m14s

Create rates:

                unpatched       patched
machine A:      246k+/-15k      384k+/-10k
machine B:      225k+/-13k      223k+/-11k

Concurrent rm of same 50 million inodes:

                unpatched       patched
machine A:      8m30s           3m09s
machine B:      5m02s           4m51s

The transaction rate on the fast machine went from about 250k/sec to
over 600k/sec, which indicates just how much of a bottleneck this
atomic counter was.
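For reference, the kernel percpu counter API that replaces the atomic
here follows this pattern (a minimal standalone sketch, not the actual
diff; the counter and function names are illustrative):

    #include <linux/percpu_counter.h>

    static struct percpu_counter active_trans;

    static int counters_setup(void)
    {
            /* allocates per-CPU storage, so this can fail */
            return percpu_counter_init(&active_trans, 0, GFP_KERNEL);
    }

    static void counters_use(void)
    {
            percpu_counter_inc(&active_trans);  /* CPU-local, no bouncing */
            percpu_counter_dec(&active_trans);

            /*
             * A precise read has to sum every CPU's delta, which is
             * expensive, so it is reserved for slow paths such as
             * freeze/quiesce.
             */
            if (percpu_counter_sum(&active_trans) == 0)
                    pr_info("all transactions drained\n");
    }

    static void counters_teardown(void)
    {
            percpu_counter_destroy(&active_trans);
    }

The trade-off is visible in the diff below: cheap updates on the
transaction hot path, while the quiesce path pays for the precise sum.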
Signed-off-by: Dave Chinner
Reviewed-by: Brian Foster
---
 fs/xfs/xfs_mount.h |  2 +-
 fs/xfs/xfs_super.c | 12 +++++++++---
 fs/xfs/xfs_trans.c |  6 +++---
 3 files changed, 13 insertions(+), 7 deletions(-)

diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
index 712b3e2583316..af3d8b71e9591 100644
--- a/fs/xfs/xfs_mount.h
+++ b/fs/xfs/xfs_mount.h
@@ -84,6 +84,7 @@ typedef struct xfs_mount {
 	 * extents or anything related to the rt device.
 	 */
 	struct percpu_counter	m_delalloc_blks;
+	struct percpu_counter	m_active_trans;	/* in progress xact counter */
 
 	struct xfs_buf		*m_sb_bp;	/* buffer for superblock */
 	char			*m_rtname;	/* realtime device name */
@@ -164,7 +165,6 @@ typedef struct xfs_mount {
 	uint64_t		m_resblks;	/* total reserved blocks */
 	uint64_t		m_resblks_avail;/* available reserved blocks */
 	uint64_t		m_resblks_save;	/* reserved blks @ remount,ro */
-	atomic_t		m_active_trans;	/* number trans frozen */
 	struct xfs_mru_cache	*m_filestream;	/* per-mount filestream data */
 	struct delayed_work	m_reclaim_work;	/* background inode reclaim */
 	struct delayed_work	m_eofblocks_work; /* background eof blocks
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index e80bd2c4c279e..bc4853525ce18 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -883,7 +883,7 @@ xfs_quiesce_attr(
 	int	error = 0;
 
 	/* wait for all modifications to complete */
-	while (atomic_read(&mp->m_active_trans) > 0)
+	while (percpu_counter_sum(&mp->m_active_trans) > 0)
 		delay(100);
 
 	/* force the log to unpin objects from the now complete transactions */
@@ -902,7 +902,7 @@ xfs_quiesce_attr(
 	 * Just warn here till VFS can correctly support
 	 * read-only remount without racing.
 	 */
-	WARN_ON(atomic_read(&mp->m_active_trans) != 0);
+	WARN_ON(percpu_counter_sum(&mp->m_active_trans) != 0);
 
 	xfs_log_quiesce(mp);
 }
@@ -1027,8 +1027,14 @@ xfs_init_percpu_counters(
 	if (error)
 		goto free_fdblocks;
 
+	error = percpu_counter_init(&mp->m_active_trans, 0, GFP_KERNEL);
+	if (error)
+		goto free_delalloc_blocks;
+
 	return 0;
 
+free_delalloc_blocks:
+	percpu_counter_destroy(&mp->m_delalloc_blks);
 free_fdblocks:
 	percpu_counter_destroy(&mp->m_fdblocks);
 free_ifree:
@@ -1057,6 +1063,7 @@ xfs_destroy_percpu_counters(
 	ASSERT(XFS_FORCED_SHUTDOWN(mp) ||
 	       percpu_counter_sum(&mp->m_delalloc_blks) == 0);
 	percpu_counter_destroy(&mp->m_delalloc_blks);
+	percpu_counter_destroy(&mp->m_active_trans);
 }
 
 static void
@@ -1792,7 +1799,6 @@ static int xfs_init_fs_context(
 	INIT_RADIX_TREE(&mp->m_perag_tree, GFP_ATOMIC);
 	spin_lock_init(&mp->m_perag_lock);
 	mutex_init(&mp->m_growlock);
-	atomic_set(&mp->m_active_trans, 0);
 	INIT_WORK(&mp->m_flush_inodes_work, xfs_flush_inodes_worker);
 	INIT_DELAYED_WORK(&mp->m_reclaim_work, xfs_reclaim_worker);
 	INIT_DELAYED_WORK(&mp->m_eofblocks_work, xfs_eofblocks_worker);
diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c
index 28b983ff8b113..636df5017782e 100644
--- a/fs/xfs/xfs_trans.c
+++ b/fs/xfs/xfs_trans.c
@@ -68,7 +68,7 @@ xfs_trans_free(
 	xfs_extent_busy_clear(tp->t_mountp, &tp->t_busy, false);
 
 	trace_xfs_trans_free(tp, _RET_IP_);
-	atomic_dec(&tp->t_mountp->m_active_trans);
+	percpu_counter_dec(&tp->t_mountp->m_active_trans);
 	if (!(tp->t_flags & XFS_TRANS_NO_WRITECOUNT))
 		sb_end_intwrite(tp->t_mountp->m_super);
 	xfs_trans_free_dqinfo(tp);
@@ -126,7 +126,7 @@ xfs_trans_dup(
 
 	xfs_trans_dup_dqinfo(tp, ntp);
 
-	atomic_inc(&tp->t_mountp->m_active_trans);
+	percpu_counter_inc(&tp->t_mountp->m_active_trans);
 	return ntp;
 }
 
@@ -275,7 +275,7 @@ xfs_trans_alloc(
 	 */
 	WARN_ON(resp->tr_logres > 0 &&
 		mp->m_super->s_writers.frozen == SB_FREEZE_COMPLETE);
-	atomic_inc(&mp->m_active_trans);
+	percpu_counter_inc(&mp->m_active_trans);
 
 	tp->t_magic = XFS_TRANS_HEADER_MAGIC;
 	tp->t_flags = flags;

From patchwork Tue May 12 09:28:09 2020
X-Patchwork-Submitter: Dave Chinner
X-Patchwork-Id: 11542763
From: Dave Chinner
To: linux-xfs@vger.kernel.org
Subject: [PATCH 3/5] [RFC] xfs: use percpu counters for CIL context counters
Date: Tue, 12 May 2020 19:28:09 +1000
Message-Id: <20200512092811.1846252-4-david@fromorbit.com>
In-Reply-To: <20200512092811.1846252-1-david@fromorbit.com>
References: <20200512092811.1846252-1-david@fromorbit.com>

From: Dave Chinner

With the m_active_trans atomic bottleneck out of the way, the CIL
xc_cil_lock is the next bottleneck that causes cacheline contention.
This lock protects several things, the first of which is the CIL
context reservation ticket and space usage counters.

We can lift them out of the xc_cil_lock by converting them to percpu
counters. This involves two things: the first is lifting calculations
and samples that don't actually need protecting from races outside the
xc_cil_lock. The second is converting the counters to percpu counters
and lifting them outside the lock.

This requires a couple of tricky things to minimise initial state
races and to ensure we take into account split reservations. We do
this by erring on the "take the reservation just in case" side, which
is largely lost in the noise of many frequent large transactions.

We use a trick with percpu_counter_add_batch() to ensure the global
sum is updated immediately on first reservation, hence allowing us to
use fast counter reads everywhere to determine if the CIL is empty or
not, rather than using the list itself.
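The add_batch trick in isolation (sketch only; the helper name is
invented): percpu_counter_add_batch() folds a CPU's local delta into
the global count once the delta reaches the batch size, so passing a
batch smaller than the amount being added forces that fold to happen
immediately:

    /* make the first context reservation globally visible at once */
    static void ctx_res_take(struct percpu_counter *curr_res, int ctx_res)
    {
            /* batch == ctx_res - 1 < amount: global sum updates now */
            percpu_counter_add_batch(curr_res, ctx_res, ctx_res - 1);
    }

After this, percpu_counter_read() on any CPU returns a non-zero value
without locking or summing all CPUs, which is what makes it usable as
the "is the CIL empty" test.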
This is important for later patches where the CIL is moved to percpu
lists and hence cannot use list_empty() to detect an empty CIL. Hence
this provides a low-overhead, lockless mechanism for determining
whether the CIL is empty.

All other percpu counter updates use a large batch count so they
aggregate on the local CPU and minimise global sum updates.

The xc_ctx_lock rwsem protects draining the percpu counters to the
context's ticket, similar to the way it allows access to the CIL
without using the xc_cil_lock. i.e. the CIL push has exclusive access
to the CIL, the context and the percpu counters while holding the
xc_ctx_lock. This ensures that we can sum and zero the counters
atomically from the perspective of the transaction commit side of the
push. i.e. they reset to zero atomically with the CIL context swap and
hence we don't need to have the percpu counters attached to the CIL
context.

Performance-wise, this increases the transaction rate from ~620,000/s
to around 750,000/s. Using a 32-way concurrent create instead of
16-way on a 32p/16GB virtual machine:

                create time     rate            unlink time
unpatched       2m03s           472k/s+/-9k/s   3m6s
patched         1m56s           533k/s+/-28k/s  2m34s

Notably, the system time for the create went from 44m20s down to
38m37s, whilst going faster. There is more variance, but I think that
is from the cacheline contention having inconsistent overhead.

XXX: probably should split into two patches

Signed-off-by: Dave Chinner
---
 fs/xfs/xfs_log_cil.c  | 99 ++++++++++++++++++++++++++++++-------------
 fs/xfs/xfs_log_priv.h |  2 +
 2 files changed, 72 insertions(+), 29 deletions(-)

diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
index b43f0e8f43f2e..746c841757ed1 100644
--- a/fs/xfs/xfs_log_cil.c
+++ b/fs/xfs/xfs_log_cil.c
@@ -393,7 +393,7 @@ xlog_cil_insert_items(
 	struct xfs_log_item	*lip;
 	int			len = 0;
 	int			diff_iovecs = 0;
-	int			iclog_space;
+	int			iclog_space, space_used;
 	int			iovhdr_res = 0, split_res = 0, ctx_res = 0;
 
 	ASSERT(tp);
@@ -403,17 +403,16 @@ xlog_cil_insert_items(
 	 * are done so it doesn't matter exactly how we update the CIL.
 	 */
 	xlog_cil_insert_format_items(log, tp, &len, &diff_iovecs);
-
-	spin_lock(&cil->xc_cil_lock);
-
 	/* account for space used by new iovec headers  */
+
 	iovhdr_res = diff_iovecs * sizeof(xlog_op_header_t);
 	len += iovhdr_res;
 	ctx->nvecs += diff_iovecs;
 
-	/* attach the transaction to the CIL if it has any busy extents */
-	if (!list_empty(&tp->t_busy))
-		list_splice_init(&tp->t_busy, &ctx->busy_extents);
+	/*
+	 * The ticket can't go away from us here, so we can do racy sampling
+	 * and precalculate everything.
+	 */
 
 	/*
 	 * Now transfer enough transaction reservation to the context ticket
@@ -421,27 +420,28 @@ xlog_cil_insert_items(
 	 * reservation has to grow as well as the current reservation as we
 	 * steal from tickets so we can correctly determine the space used
 	 * during the transaction commit.
+	 *
+	 * We use percpu_counter_add_batch() here to force the addition into the
+	 * global sum immediately. This will result in percpu_counter_read() now
+	 * always returning a non-zero value, and hence we'll only ever have a
+	 * very short race window on new contexts.
 	 */
-	if (ctx->ticket->t_curr_res == 0) {
+	if (percpu_counter_read(&cil->xc_curr_res) == 0) {
 		ctx_res = ctx->ticket->t_unit_res;
-		ctx->ticket->t_curr_res = ctx_res;
 		tp->t_ticket->t_curr_res -= ctx_res;
+		percpu_counter_add_batch(&cil->xc_curr_res, ctx_res, ctx_res - 1);
 	}
 
 	/* do we need space for more log record headers? */
-	iclog_space = log->l_iclog_size - log->l_iclog_hsize;
-	if (len > 0 && (ctx->space_used / iclog_space !=
-				(ctx->space_used + len) / iclog_space)) {
+	if (len > 0 && !ctx_res) {
+		iclog_space = log->l_iclog_size - log->l_iclog_hsize;
 		split_res = (len + iclog_space - 1) / iclog_space;
 		/* need to take into account split region headers, too */
 		split_res *= log->l_iclog_hsize + sizeof(struct xlog_op_header);
-		ctx->ticket->t_unit_res += split_res;
-		ctx->ticket->t_curr_res += split_res;
 		tp->t_ticket->t_curr_res -= split_res;
 		ASSERT(tp->t_ticket->t_curr_res >= len);
 	}
 	tp->t_ticket->t_curr_res -= len;
-	ctx->space_used += len;
 
 	/*
 	 * If we've overrun the reservation, dump the tx details before we move
@@ -458,6 +458,15 @@ xlog_cil_insert_items(
 		xlog_print_trans(tp);
 	}
 
+	percpu_counter_add_batch(&cil->xc_curr_res, split_res, 1000 * 1000);
+	percpu_counter_add_batch(&cil->xc_space_used, len, 1000 * 1000);
+
+	spin_lock(&cil->xc_cil_lock);
+
+	/* attach the transaction to the CIL if it has any busy extents */
+	if (!list_empty(&tp->t_busy))
+		list_splice_init(&tp->t_busy, &ctx->busy_extents);
+
 	/*
 	 * Now (re-)position everything modified at the tail of the CIL.
 	 * We do this here so we only need to take the CIL lock once during
@@ -741,6 +750,18 @@ xlog_cil_push_work(
 		num_iovecs += lv->lv_niovecs;
 	}
 
+	/*
+	 * Drain per cpu counters back to context so they can be re-initialised
+	 * to zero before we allow commits to the new context we are about to
+	 * switch to.
+	 */
+	ctx->space_used = percpu_counter_sum(&cil->xc_space_used);
+	ctx->ticket->t_curr_res = percpu_counter_sum(&cil->xc_curr_res);
+	ctx->ticket->t_unit_res = ctx->ticket->t_curr_res;
+	percpu_counter_set(&cil->xc_space_used, 0);
+	percpu_counter_set(&cil->xc_curr_res, 0);
+
+
 	/*
 	 * initialise the new context and attach it to the CIL. Then attach
 	 * the current context to the CIL committing lsit so it can be found
@@ -900,6 +921,7 @@ xlog_cil_push_background(
 	struct xlog	*log) __releases(cil->xc_ctx_lock)
 {
 	struct xfs_cil	*cil = log->l_cilp;
+	s64		space_used = percpu_counter_read(&cil->xc_space_used);
 
 	/*
 	 * The cil won't be empty because we are called while holding the
@@ -911,7 +933,7 @@ xlog_cil_push_background(
 	 * don't do a background push if we haven't used up all the
 	 * space available yet.
 	 */
-	if (cil->xc_ctx->space_used < XLOG_CIL_SPACE_LIMIT(log)) {
+	if (space_used < XLOG_CIL_SPACE_LIMIT(log)) {
 		up_read(&cil->xc_ctx_lock);
 		return;
 	}
@@ -934,9 +956,9 @@ xlog_cil_push_background(
 	 * If we are well over the space limit, throttle the work that is being
 	 * done until the push work on this context has begun.
 	 */
-	if (cil->xc_ctx->space_used >= XLOG_CIL_BLOCKING_SPACE_LIMIT(log)) {
+	if (space_used >= XLOG_CIL_BLOCKING_SPACE_LIMIT(log)) {
 		trace_xfs_log_cil_wait(log, cil->xc_ctx->ticket);
-		ASSERT(cil->xc_ctx->space_used < log->l_logsize);
+		ASSERT(space_used < log->l_logsize);
 		xlog_wait(&cil->xc_ctx->push_wait, &cil->xc_push_lock);
 		return;
 	}
@@ -1200,16 +1222,23 @@ xlog_cil_init(
 {
 	struct xfs_cil	*cil;
 	struct xfs_cil_ctx *ctx;
+	int		error = -ENOMEM;
 
 	cil = kmem_zalloc(sizeof(*cil), KM_MAYFAIL);
 	if (!cil)
-		return -ENOMEM;
+		return error;
 
 	ctx = kmem_zalloc(sizeof(*ctx), KM_MAYFAIL);
-	if (!ctx) {
-		kmem_free(cil);
-		return -ENOMEM;
-	}
+	if (!ctx)
+		goto out_free_cil;
+
+	error = percpu_counter_init(&cil->xc_space_used, 0, GFP_KERNEL);
+	if (error)
+		goto out_free_ctx;
+
+	error = percpu_counter_init(&cil->xc_curr_res, 0, GFP_KERNEL);
+	if (error)
+		goto out_free_space;
 
 	INIT_WORK(&cil->xc_push_work, xlog_cil_push_work);
 	INIT_LIST_HEAD(&cil->xc_cil);
@@ -1230,19 +1259,31 @@ xlog_cil_init(
 	cil->xc_log = log;
 	log->l_cilp = cil;
 	return 0;
+
+out_free_space:
+	percpu_counter_destroy(&cil->xc_space_used);
+out_free_ctx:
+	kmem_free(ctx);
+out_free_cil:
+	kmem_free(cil);
+	return error;
 }
 
 void
 xlog_cil_destroy(
 	struct xlog	*log)
 {
-	if (log->l_cilp->xc_ctx) {
-		if (log->l_cilp->xc_ctx->ticket)
-			xfs_log_ticket_put(log->l_cilp->xc_ctx->ticket);
-		kmem_free(log->l_cilp->xc_ctx);
+	struct xfs_cil	*cil = log->l_cilp;
+
+	if (cil->xc_ctx) {
+		if (cil->xc_ctx->ticket)
+			xfs_log_ticket_put(cil->xc_ctx->ticket);
+		kmem_free(cil->xc_ctx);
 	}
+	percpu_counter_destroy(&cil->xc_space_used);
+	percpu_counter_destroy(&cil->xc_curr_res);
 
-	ASSERT(list_empty(&log->l_cilp->xc_cil));
-	kmem_free(log->l_cilp);
+	ASSERT(list_empty(&cil->xc_cil));
+	kmem_free(cil);
 }
diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h
index ec22c7a3867f1..f5e79a7d44c8e 100644
--- a/fs/xfs/xfs_log_priv.h
+++ b/fs/xfs/xfs_log_priv.h
@@ -262,6 +262,8 @@ struct xfs_cil_ctx {
  */
 struct xfs_cil {
 	struct xlog		*xc_log;
+	struct percpu_counter	xc_space_used;
+	struct percpu_counter	xc_curr_res;
 	struct list_head	xc_cil;
 	spinlock_t		xc_cil_lock;

From patchwork Tue May 12 09:28:10 2020
X-Patchwork-Submitter: Dave Chinner
X-Patchwork-Id: 11542771
From: Dave Chinner
To: linux-xfs@vger.kernel.org
Subject: [PATCH 4/5] [RFC] xfs: per-cpu CIL lists
Date: Tue, 12 May 2020 19:28:10 +1000
Message-Id: <20200512092811.1846252-5-david@fromorbit.com>
In-Reply-To: <20200512092811.1846252-1-david@fromorbit.com>
References: <20200512092811.1846252-1-david@fromorbit.com>

From: Dave Chinner

Next on the list for getting rid of the xc_cil_lock is making the CIL
itself per-cpu. This requires a trade-off: we no longer move items
forward in the CIL; once they are on the CIL they remain there, as we
treat the percpu lists as lockless.

XXX: preempt_disable() around the list operations to ensure they stay
local to the CPU.

XXX: this needs CPU hotplug notifiers to clean up when CPUs go
offline.

Performance now increases substantially - the transaction rate goes
from 750,000/s to 1.05M/s, and the unlink rate is over 500,000/s for
the first time.

Using a 32-way concurrent create/unlink on a 32p/16GB virtual machine:

                create time     rate            unlink time
unpatched       1m56s           533k/s+/-28k/s  2m34s
patched         1m49s           523k/s+/-14k/s  2m00s

Notably, the system time for the create went up, while variance went
down. This indicates we're starting to hit some other contention limit
as we reduce the amount of time we spend contending on the
xc_cil_lock.
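Factored out of the patch, the per-cpu list pattern looks like this (a
standalone sketch with invented names; like the patch itself, it
ignores CPU hotplug, per the XXX notes above):

    #include <linux/percpu.h>
    #include <linux/list.h>
    #include <linux/preempt.h>

    static struct list_head __percpu *pcp_lists;

    static int pcp_lists_init(void)
    {
            int cpu;

            pcp_lists = alloc_percpu(struct list_head);
            if (!pcp_lists)
                    return -ENOMEM;
            for_each_possible_cpu(cpu)
                    INIT_LIST_HEAD(per_cpu_ptr(pcp_lists, cpu));
            return 0;
    }

    /* commit path: append to this CPU's list; no shared lock taken */
    static void pcp_lists_add(struct list_head *item)
    {
            preempt_disable();      /* keep the operation CPU-local */
            list_add_tail(item, this_cpu_ptr(pcp_lists));
            preempt_enable();
    }

    /*
     * push path: the caller must already exclude all adders (the CIL
     * push holds xc_ctx_lock exclusively), after which the per-cpu
     * lists can be spliced into one local list and walked unlocked.
     */
    static void pcp_lists_drain(struct list_head *out)
    {
            int cpu;

            for_each_online_cpu(cpu)
                    list_splice_tail_init(per_cpu_ptr(pcp_lists, cpu), out);
    }

The cost of this scheme is the trade-off stated above: an item stays
on whichever CPU's list it was first committed on, so it can no longer
be repositioned at a global tail on recommit.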
Signed-off-by: Dave Chinner
---
 fs/xfs/xfs_log_cil.c  | 66 ++++++++++++++++++++++++++++---------------
 fs/xfs/xfs_log_priv.h |  2 +-
 2 files changed, 45 insertions(+), 23 deletions(-)

diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
index 746c841757ed1..af444bc69a7cd 100644
--- a/fs/xfs/xfs_log_cil.c
+++ b/fs/xfs/xfs_log_cil.c
@@ -467,28 +467,28 @@ xlog_cil_insert_items(
 	if (!list_empty(&tp->t_busy))
 		list_splice_init(&tp->t_busy, &ctx->busy_extents);
 
+	spin_unlock(&cil->xc_cil_lock);
+
 	/*
 	 * Now (re-)position everything modified at the tail of the CIL.
 	 * We do this here so we only need to take the CIL lock once during
 	 * the transaction commit.
+	 * Move everything to the tail of the local per-cpu CIL list.
 	 */
 	list_for_each_entry(lip, &tp->t_items, li_trans) {
-
 		/* Skip items which aren't dirty in this transaction. */
 		if (!test_bit(XFS_LI_DIRTY, &lip->li_flags))
 			continue;
 
 		/*
-		 * Only move the item if it isn't already at the tail. This is
-		 * to prevent a transient list_empty() state when reinserting
-		 * an item that is already the only item in the CIL.
+		 * If the item is already in the CIL, don't try to reposition it
+		 * because we don't know what per-cpu list it is on.
 		 */
-		if (!list_is_last(&lip->li_cil, &cil->xc_cil))
-			list_move_tail(&lip->li_cil, &cil->xc_cil);
+		if (!list_empty(&lip->li_cil))
+			continue;
+		list_add_tail(&lip->li_cil, this_cpu_ptr(cil->xc_cil));
 	}
 
-	spin_unlock(&cil->xc_cil_lock);
-
 	if (tp->t_ticket->t_curr_res < 0)
 		xfs_force_shutdown(log->l_mp, SHUTDOWN_LOG_IO_ERROR);
 }
@@ -666,6 +666,8 @@ xlog_cil_push_work(
 	struct xfs_log_vec	lvhdr = { NULL };
 	xfs_lsn_t		commit_lsn;
 	xfs_lsn_t		push_seq;
+	LIST_HEAD	(cil_items);
+	int		cpu;
 
 	new_ctx = kmem_zalloc(sizeof(*new_ctx), KM_NOFS);
 	new_ctx->ticket = xlog_cil_ticket_alloc(log);
@@ -687,7 +689,7 @@ xlog_cil_push_work(
 	 * move on to a new sequence number and so we have to be able to push
 	 * this sequence again later.
 	 */
-	if (list_empty(&cil->xc_cil)) {
+	if (percpu_counter_read(&cil->xc_curr_res) == 0) {
 		cil->xc_push_seq = 0;
 		spin_unlock(&cil->xc_push_lock);
 		goto out_skip;
@@ -728,17 +730,21 @@ xlog_cil_push_work(
 	spin_unlock(&cil->xc_push_lock);
 
 	/*
-	 * pull all the log vectors off the items in the CIL, and
-	 * remove the items from the CIL. We don't need the CIL lock
-	 * here because it's only needed on the transaction commit
-	 * side which is currently locked out by the flush lock.
+	 * Remove the items from the per-cpu CIL lists and then pull all the
+	 * log vectors off the items. We hold the xc_ctx_lock exclusively here,
+	 * so nothing can be adding or removing from the per-cpu lists here.
 	 */
+	/* XXX: hotplug! */
+	for_each_online_cpu(cpu) {
+		list_splice_tail_init(per_cpu_ptr(cil->xc_cil, cpu), &cil_items);
+	}
+
 	lv = NULL;
 	num_iovecs = 0;
-	while (!list_empty(&cil->xc_cil)) {
+	while (!list_empty(&cil_items)) {
 		struct xfs_log_item	*item;
 
-		item = list_first_entry(&cil->xc_cil,
+		item = list_first_entry(&cil_items,
 					struct xfs_log_item, li_cil);
 		list_del_init(&item->li_cil);
 		if (!ctx->lv_chain)
@@ -927,7 +933,7 @@ xlog_cil_push_background(
 	 * The cil won't be empty because we are called while holding the
 	 * context lock so whatever we added to the CIL will still be there
 	 */
-	ASSERT(!list_empty(&cil->xc_cil));
+	ASSERT(space_used != 0);
 
 	/*
 	 * don't do a background push if we haven't used up all the
@@ -993,7 +999,8 @@ xlog_cil_push_now(
 	 * there's no work we need to do.
 	 */
 	spin_lock(&cil->xc_push_lock);
-	if (list_empty(&cil->xc_cil) || push_seq <= cil->xc_push_seq) {
+	if (percpu_counter_read(&cil->xc_curr_res) == 0 ||
+	    push_seq <= cil->xc_push_seq) {
 		spin_unlock(&cil->xc_push_lock);
 		return;
 	}
@@ -1011,7 +1018,7 @@ xlog_cil_empty(
 	bool		empty = false;
 
 	spin_lock(&cil->xc_push_lock);
-	if (list_empty(&cil->xc_cil))
+	if (percpu_counter_read(&cil->xc_curr_res) == 0)
 		empty = true;
 	spin_unlock(&cil->xc_push_lock);
 	return empty;
@@ -1163,7 +1170,7 @@ xlog_cil_force_lsn(
 	 * we would have found the context on the committing list.
 	 */
 	if (sequence == cil->xc_current_sequence &&
-	    !list_empty(&cil->xc_cil)) {
+	    percpu_counter_read(&cil->xc_curr_res) != 0) {
 		spin_unlock(&cil->xc_push_lock);
 		goto restart;
 	}
@@ -1223,6 +1230,7 @@ xlog_cil_init(
 	struct xfs_cil	*cil;
 	struct xfs_cil_ctx *ctx;
 	int		error = -ENOMEM;
+	int		cpu;
 
 	cil = kmem_zalloc(sizeof(*cil), KM_MAYFAIL);
 	if (!cil)
@@ -1232,16 +1240,24 @@ xlog_cil_init(
 	if (!ctx)
 		goto out_free_cil;
 
+	/* XXX: CPU hotplug!!! */
+	cil->xc_cil = alloc_percpu_gfp(struct list_head, GFP_KERNEL);
+	if (!cil->xc_cil)
+		goto out_free_ctx;
+
+	for_each_possible_cpu(cpu) {
+		INIT_LIST_HEAD(per_cpu_ptr(cil->xc_cil, cpu));
+	}
+
 	error = percpu_counter_init(&cil->xc_space_used, 0, GFP_KERNEL);
 	if (error)
-		goto out_free_ctx;
+		goto out_free_pcp_cil;
 
 	error = percpu_counter_init(&cil->xc_curr_res, 0, GFP_KERNEL);
 	if (error)
 		goto out_free_space;
 
 	INIT_WORK(&cil->xc_push_work, xlog_cil_push_work);
-	INIT_LIST_HEAD(&cil->xc_cil);
 	INIT_LIST_HEAD(&cil->xc_committing);
 	spin_lock_init(&cil->xc_cil_lock);
 	spin_lock_init(&cil->xc_push_lock);
@@ -1262,6 +1278,8 @@ xlog_cil_init(
 
 out_free_space:
 	percpu_counter_destroy(&cil->xc_space_used);
+out_free_pcp_cil:
+	free_percpu(cil->xc_cil);
 out_free_ctx:
 	kmem_free(ctx);
 out_free_cil:
@@ -1274,6 +1292,7 @@ xlog_cil_destroy(
 	struct xlog	*log)
 {
 	struct xfs_cil	*cil = log->l_cilp;
+	int		cpu;
 
 	if (cil->xc_ctx) {
 		if (cil->xc_ctx->ticket)
@@ -1283,7 +1302,10 @@ xlog_cil_destroy(
 	percpu_counter_destroy(&cil->xc_space_used);
 	percpu_counter_destroy(&cil->xc_curr_res);
 
-	ASSERT(list_empty(&cil->xc_cil));
+	for_each_possible_cpu(cpu) {
+		ASSERT(list_empty(per_cpu_ptr(cil->xc_cil, cpu)));
+	}
+	free_percpu(cil->xc_cil);
 	kmem_free(cil);
 }
diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h
index f5e79a7d44c8e..0bb982920d070 100644
--- a/fs/xfs/xfs_log_priv.h
+++ b/fs/xfs/xfs_log_priv.h
@@ -264,7 +264,7 @@ struct xfs_cil {
 	struct xlog		*xc_log;
 	struct percpu_counter	xc_space_used;
 	struct percpu_counter	xc_curr_res;
-	struct list_head	xc_cil;
+	struct list_head	__percpu *xc_cil;
 	spinlock_t		xc_cil_lock;
 
 	struct rw_semaphore	xc_ctx_lock ____cacheline_aligned_in_smp;

From patchwork Tue May 12 09:28:11 2020
X-Patchwork-Submitter: Dave Chinner
X-Patchwork-Id: 11542765
From: Dave Chinner
To: linux-xfs@vger.kernel.org
Subject: [PATCH 5/5] [RFC] xfs: make CIL busy extent lists per-cpu
Date: Tue, 12 May 2020 19:28:11 +1000
Message-Id: <20200512092811.1846252-6-david@fromorbit.com>
In-Reply-To: <20200512092811.1846252-1-david@fromorbit.com>
References: <20200512092811.1846252-1-david@fromorbit.com>

From: Dave Chinner

We use the same percpu list trick with the busy extents as we do with
the CIL lists, and this gets rid of the last use of the xc_cil_lock in
the commit fast path.

As noted in the previous patch, it looked like we were approaching
another bottleneck, and that can be seen by the fact that performance
didn't substantially increase even though there is now no lock
contention in the commit path. The transaction rate only increases
slightly, to 1.12M/s.

Using a 32-way concurrent create/unlink on a 32p/16GB virtual machine:

                create time     rate            unlink time
unpatched       1m49s           523k/s+/-14k/s  2m00s
patched         1m48s           535k/s+/-24k/s  1m51s

So variance went back up, and performance improved slightly.
Profiling at this point indicates spinlock contention at the VFS level
(inode_sb_list_add() and dentry cache pathwalking), so significant
further gains will require VFS surgery.

Signed-off-by: Dave Chinner
---
 fs/xfs/xfs_log_cil.c  | 35 ++++++++++++++++++-----------------
 fs/xfs/xfs_log_priv.h | 12 +++++++++++-
 2 files changed, 29 insertions(+), 18 deletions(-)

diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
index af444bc69a7cd..d3a5f8478d64a 100644
--- a/fs/xfs/xfs_log_cil.c
+++ b/fs/xfs/xfs_log_cil.c
@@ -461,13 +461,9 @@ xlog_cil_insert_items(
 	percpu_counter_add_batch(&cil->xc_curr_res, split_res, 1000 * 1000);
 	percpu_counter_add_batch(&cil->xc_space_used, len, 1000 * 1000);
 
-	spin_lock(&cil->xc_cil_lock);
-
 	/* attach the transaction to the CIL if it has any busy extents */
 	if (!list_empty(&tp->t_busy))
-		list_splice_init(&tp->t_busy, &ctx->busy_extents);
-
-	spin_unlock(&cil->xc_cil_lock);
+		list_splice_tail_init(&tp->t_busy, pcp_busy(cil));
 
 	/*
 	 * Now (re-)position everything modified at the tail of the CIL.
@@ -486,7 +482,7 @@ xlog_cil_insert_items(
 		 */
 		if (!list_empty(&lip->li_cil))
 			continue;
-		list_add_tail(&lip->li_cil, this_cpu_ptr(cil->xc_cil));
+		list_add_tail(&lip->li_cil, pcp_cil(cil));
 	}
 
 	if (tp->t_ticket->t_curr_res < 0)
@@ -733,10 +729,14 @@ xlog_cil_push_work(
 	 * Remove the items from the per-cpu CIL lists and then pull all the
 	 * log vectors off the items. We hold the xc_ctx_lock exclusively here,
 	 * so nothing can be adding or removing from the per-cpu lists here.
+	 *
+	 * Also splice the busy extents onto the context while we are walking
+	 * the percpu structure.
 	 */
 	/* XXX: hotplug! */
 	for_each_online_cpu(cpu) {
-		list_splice_tail_init(per_cpu_ptr(cil->xc_cil, cpu), &cil_items);
+		list_splice_tail_init(pcp_cil_cpu(cil, cpu), &cil_items);
+		list_splice_tail_init(pcp_busy_cpu(cil, cpu), &ctx->busy_extents);
 	}
 
 	lv = NULL;
@@ -933,7 +933,7 @@ xlog_cil_push_background(
 	 * The cil won't be empty because we are called while holding the
 	 * context lock so whatever we added to the CIL will still be there
 	 */
-	ASSERT(space_used != 0);
+	ASSERT(percpu_counter_read(&cil->xc_curr_res) != 0);
 
 	/*
 	 * don't do a background push if we haven't used up all the
@@ -1241,17 +1241,18 @@ xlog_cil_init(
 		goto out_free_cil;
 
 	/* XXX: CPU hotplug!!! */
-	cil->xc_cil = alloc_percpu_gfp(struct list_head, GFP_KERNEL);
-	if (!cil->xc_cil)
+	cil->xc_pcp = alloc_percpu_gfp(struct xfs_cil_pcpu, GFP_KERNEL);
+	if (!cil->xc_pcp)
 		goto out_free_ctx;
 
 	for_each_possible_cpu(cpu) {
-		INIT_LIST_HEAD(per_cpu_ptr(cil->xc_cil, cpu));
+		INIT_LIST_HEAD(pcp_cil_cpu(cil, cpu));
+		INIT_LIST_HEAD(pcp_busy_cpu(cil, cpu));
 	}
 
 	error = percpu_counter_init(&cil->xc_space_used, 0, GFP_KERNEL);
 	if (error)
-		goto out_free_pcp_cil;
+		goto out_free_pcp;
 
 	error = percpu_counter_init(&cil->xc_curr_res, 0, GFP_KERNEL);
 	if (error)
@@ -1259,7 +1260,6 @@ xlog_cil_init(
 
 	INIT_WORK(&cil->xc_push_work, xlog_cil_push_work);
 	INIT_LIST_HEAD(&cil->xc_committing);
-	spin_lock_init(&cil->xc_cil_lock);
 	spin_lock_init(&cil->xc_push_lock);
 	init_rwsem(&cil->xc_ctx_lock);
 	init_waitqueue_head(&cil->xc_commit_wait);
@@ -1278,8 +1278,8 @@ xlog_cil_init(
 
 out_free_space:
 	percpu_counter_destroy(&cil->xc_space_used);
-out_free_pcp_cil:
-	free_percpu(cil->xc_cil);
+out_free_pcp:
+	free_percpu(cil->xc_pcp);
 out_free_ctx:
 	kmem_free(ctx);
 out_free_cil:
@@ -1303,9 +1303,10 @@ xlog_cil_destroy(
 	percpu_counter_destroy(&cil->xc_curr_res);
 
 	for_each_possible_cpu(cpu) {
-		ASSERT(list_empty(per_cpu_ptr(cil->xc_cil, cpu)));
+		ASSERT(list_empty(pcp_cil_cpu(cil, cpu)));
+		ASSERT(list_empty(pcp_busy_cpu(cil, cpu)));
 	}
-	free_percpu(cil->xc_cil);
+	free_percpu(cil->xc_pcp);
 	kmem_free(cil);
 }
 
diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h
index 0bb982920d070..cfc22c9482ea4 100644
--- a/fs/xfs/xfs_log_priv.h
+++ b/fs/xfs/xfs_log_priv.h
@@ -260,11 +260,16 @@ struct xfs_cil_ctx {
  * the commit LSN to be determined as well. This should make synchronous
 * operations almost as efficient as the old logging methods.
 */
+struct xfs_cil_pcpu {
+	struct list_head	p_cil;
+	struct list_head	p_busy_extents;
+};
+
 struct xfs_cil {
 	struct xlog		*xc_log;
 	struct percpu_counter	xc_space_used;
 	struct percpu_counter	xc_curr_res;
-	struct list_head	__percpu *xc_cil;
+	struct xfs_cil_pcpu	__percpu *xc_pcp;
 	spinlock_t		xc_cil_lock;
 
 	struct rw_semaphore	xc_ctx_lock ____cacheline_aligned_in_smp;
@@ -278,6 +283,11 @@ struct xfs_cil {
 	struct work_struct	xc_push_work;
 } ____cacheline_aligned_in_smp;
 
+#define pcp_cil(cil)		&(this_cpu_ptr(cil->xc_pcp)->p_cil)
+#define pcp_cil_cpu(cil, cpu)	&(per_cpu_ptr(cil->xc_pcp, cpu)->p_cil)
+#define pcp_busy(cil)		&(this_cpu_ptr(cil->xc_pcp)->p_busy_extents)
+#define pcp_busy_cpu(cil, cpu)	&(per_cpu_ptr(cil->xc_pcp, cpu)->p_busy_extents)
+
 /*
 * The amount of log space we allow the CIL to aggregate is difficult to size.
 * Whatever we choose, we have to make sure we can get a reservation for the