[00/14,v6] xfs: improve CIL scalability

Message ID 20211109015240.1547991-1-david@fromorbit.com (mailing list archive)
Message

Dave Chinner Nov. 9, 2021, 1:52 a.m. UTC
Time to try again to get this code merged.

This series aims to improve the scalability of XFS transaction
commits on large CPU count machines. My 32p machine hits contention
limits in xlog_cil_commit() at about 700,000 transaction commits a
second. It hits this at 16 thread workloads, and 32 thread
workloads go no faster and just burn CPU on the CIL spinlocks.

This patchset gets rid of spinlocks and global serialisation points
in the xlog_cil_commit() path. It does this by moving to a
combination of per-cpu counters, unordered per-cpu lists and
post-ordered per-cpu lists.
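
As a rough sketch of the shape this takes, using generic kernel
percpu and list primitives (the structure and field names below are
illustrative only, not the actual CIL data structures):

#include <linux/percpu.h>
#include <linux/list.h>
#include <linux/types.h>

struct log_item {			/* illustrative stand-in */
	struct list_head	li_cil;
	u64			li_order_id;
};

struct cil_pcp {
	long			space_used;	/* local byte count */
	struct list_head	log_items;	/* unordered local list */
};

static DEFINE_PER_CPU(struct cil_pcp, cil_pcp);

/* commit fast path: touches only this CPU's counter and list */
static void cil_insert(struct log_item *lip, int len)
{
	struct cil_pcp *cilpcp = get_cpu_ptr(&cil_pcp);

	cilpcp->space_used += len;
	if (list_empty(&lip->li_cil))
		list_add_tail(&lip->li_cil, &cilpcp->log_items);
	put_cpu_ptr(&cil_pcp);
}

/* push slow path: fold all the per-cpu state back into one list */
static void cil_aggregate(struct list_head *all, long *space)
{
	int cpu;

	for_each_possible_cpu(cpu) {
		struct cil_pcp *cilpcp = per_cpu_ptr(&cil_pcp, cpu);

		*space += cilpcp->space_used;
		cilpcp->space_used = 0;
		list_splice_init(&cilpcp->log_items, all);
	}
}

Aggregation only happens in the (already serialised) CIL push work,
so the commit fast path never sums counters or walks other CPUs'
lists.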

This results in transaction commit rates exceeding 1.6 million
commits/s under certain unlink workloads, and while the log lock
contention is largely gone there is still significant lock
contention at the VFS at 600,000 transactions/s:

  19.39%  [kernel]  [k] __pv_queued_spin_lock_slowpath
   6.40%  [kernel]  [k] do_raw_spin_lock
   4.07%  [kernel]  [k] __raw_callee_save___pv_queued_spin_unlock
   3.08%  [kernel]  [k] memcpy_erms
   1.93%  [kernel]  [k] xfs_buf_find
   1.69%  [kernel]  [k] xlog_cil_commit
   1.50%  [kernel]  [k] syscall_exit_to_user_mode
   1.18%  [kernel]  [k] memset_erms


-   64.23%     0.22%  [kernel]            [k] path_openat
   - 64.01% path_openat
      - 48.69% xfs_vn_create
         - 48.60% xfs_generic_create
            - 40.96% xfs_create
               - 20.39% xfs_dir_ialloc
                  - 7.05% xfs_setup_inode
>>>>>                - 6.87% inode_sb_list_add
                        - 6.54% _raw_spin_lock
                           - 6.53% do_raw_spin_lock
                                6.08% __pv_queued_spin_lock_slowpath
.....
               - 11.27% xfs_trans_commit
                  - 11.23% __xfs_trans_commit
                     - 10.85% xlog_cil_commit
                          2.47% memcpy_erms
                        - 1.77% xfs_buf_item_committing
                           - 1.70% xfs_buf_item_release
                              - 0.79% xfs_buf_unlock
                                   0.68% up
                                0.61% xfs_buf_rele
                          0.80% xfs_buf_item_format
                          0.73% xfs_inode_item_format
                          0.68% xfs_buf_item_size
                        - 0.55% kmem_alloc_large
                           - 0.55% kmem_alloc
                                0.52% __kmalloc
.....
            - 7.08% d_instantiate
               - 6.66% security_d_instantiate
>>>>>>            - 6.63% selinux_d_instantiate
                     - 6.48% inode_doinit_with_dentry
                        - 6.11% _raw_spin_lock
                           - 6.09% do_raw_spin_lock
                                5.60% __pv_queued_spin_lock_slowpath
....
      - 1.77% terminate_walk
>>>>>>   - 1.69% dput
            - 1.55% _raw_spin_lock
               - do_raw_spin_lock
                    1.19% __pv_queued_spin_lock_slowpath


But when we extend out to 1.5M commits/s we see that the contention
starts to shift to the atomics in the lockless log reservation path:

  14.81%  [kernel]  [k] __pv_queued_spin_lock_slowpath
   7.88%  [kernel]  [k] xlog_grant_add_space
   7.18%  [kernel]  [k] xfs_log_ticket_ungrant
   4.82%  [kernel]  [k] do_raw_spin_lock
   3.58%  [kernel]  [k] xlog_space_left
   3.51%  [kernel]  [k] xlog_cil_commit

There's still substantial spin lock contention occurring at the VFS,
too, but this indicates that the multiple atomic variable updates per
transaction reservation/commit pair are starting to reach scalability
limits here.
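
For context, those grant heads are already lockless: each is a
{cycle, bytes} pair packed into a single atomic64_t and updated with
a compare-and-exchange retry loop, roughly like the simplified sketch
below (the packing details and helper name are illustrative, not the
exact xlog_grant_add_space() code). Every reservation/ungrant pair
does several such RMW operations on shared cachelines, which is where
the atomics above start to hurt.

#include <linux/atomic.h>
#include <linux/types.h>

/* Illustrative sketch of a lockless {cycle, bytes} grant head update. */
static void grant_head_add(atomic64_t *head, int bytes, int log_size)
{
	s64 old, new;

	do {
		int cycle, space;

		old = atomic64_read(head);
		cycle = old >> 32;		/* high half: cycle count */
		space = old & 0xffffffff;	/* low half: bytes into log */

		space += bytes;
		if (space >= log_size) {	/* wrapped the physical log */
			space -= log_size;
			cycle++;
		}
		new = ((s64)cycle << 32) | space;
		/* lost races retry; the retries are the contention */
	} while (atomic64_cmpxchg(head, old, new) != old);
}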

This is largely a re-implementation of past RFC patchsets. While
those were good enough as proofs of concept for perf testing, they did not
preserve transaction order correctly and failed shutdown tests all
the time. The changes to the CIL accounting and behaviour, combined
with the structural changes to xlog_write() in prior patchsets make
the per-cpu restructuring possible and sane.

Instead of trying to account for continuation log opheaders on a
"growth" basis, we pre-calculate how many iclogs we'll need to write
out a maximally sized CIL checkpoint and just reserve that space,
one header per commit, until the CIL has a full reservation. If we
ever run a commit when we are already at the hard limit (because of
post-throttling) we simply take an extra reservation from each
commit that is run while over the limit. Hence we don't need to do
space usage math in the fast path and so never need to sum the
per-cpu counters in this path.
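
A minimal sketch of that fast path check, with illustrative names and
a made-up header size (not the actual xlog_cil_insert_items() logic):

#include <linux/atomic.h>

#define ICLOG_HDR_SIZE	512	/* made-up value for illustration */

struct cil_ctx {			/* illustrative, not xfs_cil_ctx */
	atomic_t	space_reserved;	/* header space stolen so far */
	int		hdr_res_needed;	/* pre-calculated worst case */
	bool		over_hard_limit;
};

struct log_ticket {
	int		t_curr_res;	/* transaction's unused reservation */
};

/* constant-time fast path: no summing of per-cpu counters */
static void cil_steal_hdr_res(struct cil_ctx *ctx, struct log_ticket *tic)
{
	if (atomic_read(&ctx->space_reserved) >= ctx->hdr_res_needed &&
	    !ctx->over_hard_limit)
		return;		/* fully reserved and under the limit */

	/* take one iclog header's worth from this commit's ticket */
	atomic_add(ICLOG_HDR_SIZE, &ctx->space_reserved);
	tic->t_curr_res -= ICLOG_HDR_SIZE;
}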

Similarly, per-cpu lists have the problem of ordering - we can't
remove an item from a per-cpu list if we want to move it forward in
the CIL. We solve this problem by using an atomic counter to give
every commit a sequence number that is copied into the log items in
that transaction. Hence relogging items just overwrites the sequence
number in the log item, and does not move it in the per-cpu lists.
Once we reaggregate the per-cpu lists back into a single list in the
CIL push work, we can run it through list_sort() and reorder it back
into a globally ordered list. This costs a bit of CPU time, but now
that the CIL can run multiple works and pipelines properly, this is
not a limiting factor for performance. It does increase fsync
latency when the CIL is full, but workloads issuing large numbers of
fsync()s or sync transactions end up with very small CILs and so the
latency impact of sorting is not measurable for such workloads.
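
Sketching that ordering scheme (reusing the illustrative struct
log_item from the earlier sketch; the comparison function shows what
a list_sort() based reorder looks like, not the exact XFS code):

#include <linux/list_sort.h>
#include <linux/atomic.h>

static atomic64_t cil_order_id;	/* global commit sequence counter */

/*
 * Commit fast path: stamp the transaction's sequence number into each
 * log item. Relogging just overwrites li_order_id in place; the item
 * never moves between per-cpu lists.
 */
static void cil_tag_item(struct log_item *lip, u64 order_id)
{
	lip->li_order_id = order_id;
}

/* push slow path: restore global commit order on the merged list */
static int cil_order_cmp(void *priv, const struct list_head *a,
			 const struct list_head *b)
{
	struct log_item *ia = container_of(a, struct log_item, li_cil);
	struct log_item *ib = container_of(b, struct log_item, li_cil);

	return ia->li_order_id > ib->li_order_id;
}

static void cil_reorder(struct list_head *merged)
{
	list_sort(NULL, merged, cil_order_cmp);
}

A transaction would call atomic64_inc_return(&cil_order_id) once at
commit time and stamp that value into all its items, so a later relog
of an item simply moves its order id forward without touching any
list.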

git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs.git xfs-cil-scale-3

Version 6:
- split out from aggregated patchset
- rebase on linux-xfs/for-next + dgc/xlog-write-rework

Version 5:
- https://lore.kernel.org/linux-xfs/20210603052240.171998-1-david@fromorbit.com/

Comments

Dave Chinner Nov. 18, 2021, 11:15 p.m. UTC | #1
FYI: I've just rebased the git tree branch containing this code
on the V7 version of the xlog_write() rework patch set I just
posted.

No changes to this series were made in the rebase.

Cheers,

Dave.
