[GIT,PULL] xfs: Improve CIL scalability

Message ID 20220707233347.GO227878@dread.disaster.area (mailing list archive)
State New, archived

Pull-request

git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs tags/xfs-cil-scale-5.20

Message

Dave Chinner July 7, 2022, 11:33 p.m. UTC
Hi Darrick,

Can you please pull the CIL scalability improvements for 5.20 from
the tag below? This branch is based on the linux-xfs/for-next branch
as of 2 days ago, so should apply without any merge issues at all.

Cheers,

Dave.

The following changes since commit 7561cea5dbb97fecb952548a0fb74fb105bf4664:

  xfs: prevent a UAF when log IO errors race with unmount (2022-07-01 09:09:52 -0700)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs tags/xfs-cil-scale-5.20

for you to fetch changes up to 51a117edff133a1ea8cb0fcbc599b8d5a34414e9:

  xfs: expanding delayed logging design with background material (2022-07-07 18:56:09 +1000)

----------------------------------------------------------------
xfs: improve CIL scalability

This series aims to improve the scalability of XFS transaction
commits on large CPU count machines. My 32p machine hits contention
limits in xlog_cil_commit() at about 700,000 transaction commits a
second. It hits this at 16 thread workloads, and 32 thread
workloads go no faster and just burn CPU on the CIL spinlocks.

This patchset gets rid of spinlocks and global serialisation points
in the xlog_cil_commit() path. It does this by moving to a
combination of per-cpu counters, unordered per-cpu lists and
post-ordered per-cpu lists.
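
As a rough illustration of the per-cpu counter idea above, the following
userspace C sketch keeps one counter slot per CPU, so the commit fast path
touches only its own cache line and the slots are summed only on a slow
path. The names pcp_counter, pcp_add and pcp_sum, the NR_CPUS value and the
64-byte alignment are assumptions made for this example. It is not the
kernel's per-cpu API and it is not the code in this series.

```c
#include <stdatomic.h>

#define NR_CPUS 4 /* hypothetical CPU count for the demo */

/* One slot per CPU, padded so two CPUs never share a cache line. */
struct pcp_slot {
	_Alignas(64) atomic_long count;
};

struct pcp_counter {
	struct pcp_slot slot[NR_CPUS];
};

/* Fast path: update only the calling CPU's slot, no lock taken. */
static void pcp_add(struct pcp_counter *c, int cpu, long delta)
{
	atomic_fetch_add_explicit(&c->slot[cpu].count, delta,
				  memory_order_relaxed);
}

/* Slow path: walk every slot; only approximate while updates are running. */
static long pcp_sum(struct pcp_counter *c)
{
	long sum = 0;

	for (int cpu = 0; cpu < NR_CPUS; cpu++)
		sum += atomic_load_explicit(&c->slot[cpu].count,
					    memory_order_relaxed);
	return sum;
}
```

The point of the split is that pcp_sum() is the expensive operation, so a
design built on it only scales if the fast path never has to call it.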

This results in transaction commit rates exceeding 1.4 million
commits/s under certain unlink workloads, and while the log lock
contention is largely gone there is still significant lock
contention in the VFS (dentry cache, inode cache and security layers)
at >600,000 transactions/s that still limits scalability.

The changes to the CIL accounting and behaviour, combined with the
structural changes to xlog_write() in prior patchsets, make the
per-cpu restructuring possible and sane. This allows us to move to
precalculated reservation requirements that allow for reservation
stealing to be accounted across multiple CPUs accurately.

That is, instead of trying to account for continuation log opheaders
on a "growth" basis, we pre-calculate how many iclogs we'll need to
write out a maximally sized CIL checkpoint and steal that reserved
space one commit at a time until the CIL has a full
reservation. If we ever run a commit when we are already at the hard
limit (because of post-throttling) we simply take an extra reservation
from each commit that is run when over the limit. Hence we don't
need to do space usage math in the fast path and so never need to
sum the per-cpu counters in this fast path.
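
The precalculated reservation scheme above can be sketched in userspace C
roughly as follows. This is a simplified model under stated assumptions, not
the code in this series: the struct cil_sketch type, the cil_init() and
cil_steal_hdrs() helpers and all three size constants are hypothetical and
were chosen only to make the arithmetic easy to follow.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Illustrative sizes only; these are not the values XFS uses. */
#define ICLOG_BYTES          (32 * 1024)   /* payload per iclog */
#define ICLOG_HDR_BYTES      512           /* header cost per iclog */
#define MAX_CHECKPOINT_BYTES (1024 * 1024) /* maximally sized checkpoint */

struct cil_sketch {
	atomic_int hdrs_needed; /* iclog headers not yet paid for */
};

/* Done once per checkpoint: how many iclog headers could we need? */
static void cil_init(struct cil_sketch *cil)
{
	int iclogs = (MAX_CHECKPOINT_BYTES + ICLOG_BYTES - 1) / ICLOG_BYTES;

	atomic_init(&cil->hdrs_needed, iclogs);
}

/*
 * Called on every commit. commit_hdrs is the number of iclog headers the
 * committing transaction reserved for itself. Returns the bytes stolen.
 */
static int cil_steal_hdrs(struct cil_sketch *cil, int commit_hdrs,
			  bool over_hard_limit)
{
	/* Fully reserved and under the limit: nothing to take, no summing. */
	if (atomic_load(&cil->hdrs_needed) <= 0 && !over_hard_limit)
		return 0;

	atomic_fetch_sub(&cil->hdrs_needed, commit_hdrs);
	return commit_hdrs * ICLOG_HDR_BYTES;
}
```

With these made-up sizes the checkpoint needs 32 headers, so sixteen commits
carrying two header reservations each fill it. After that, commits donate
nothing unless the hard limit has been hit. The fast path reads a single
atomic and never sums per-cpu space counters.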

Similarly, per-cpu lists have the problem of ordering - we can't
remove an item from a per-cpu list if we want to move it forward in
the CIL. We solve this problem by using an atomic counter to give
every commit a sequence number that is copied into the log items in
that transaction. Hence relogging items just overwrites the sequence
number in the log item, and does not move it in the per-cpu lists.
Once we reaggregate the per-cpu lists back into a single list in the
CIL push work, we can run it through list_sort() and reorder it back
into a globally ordered list. This costs a bit of CPU time, but now
that the CIL can run multiple works and pipelines properly, this is
not a limiting factor for performance. It does increase fsync
latency when the CIL is full, but workloads issuing large numbers of
fsync()s or sync transactions end up with very small CILs and so the
latency impact of sorting is not measurable for such workloads.
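
A minimal userspace sketch of the ordering scheme described above: each
commit takes a sequence number from an atomic counter and stamps it into its
log items, relogging only overwrites the stamp, and the push side gathers the
per-cpu lists and sorts by stamp. All names here (log_item, order_id,
cil_commit, cil_push, the fixed-size arrays) are assumptions for the example,
and qsort() stands in for the kernel's list_sort(). Unlike list_sort(),
qsort() makes no stability guarantee.

```c
#include <stdatomic.h>
#include <stdlib.h>

#define NR_CPUS   2 /* hypothetical sizes for the demo */
#define MAX_ITEMS 8

struct log_item {
	int id;
	unsigned long order_id; /* sequence of the last commit that logged it */
	int in_cil;             /* already on some per-cpu list? */
};

struct pcp_list {
	struct log_item *items[MAX_ITEMS];
	int nr;
};

static atomic_ulong commit_seq;
static struct pcp_list pcp[NR_CPUS];

/* Commit n items on the given CPU under one new sequence number. */
static void cil_commit(int cpu, struct log_item **items, int n)
{
	unsigned long seq = atomic_fetch_add(&commit_seq, 1) + 1;

	for (int i = 0; i < n; i++) {
		items[i]->order_id = seq; /* relog: overwrite, do not move */
		if (!items[i]->in_cil) {
			pcp[cpu].items[pcp[cpu].nr++] = items[i];
			items[i]->in_cil = 1;
		}
	}
}

static int cmp_order(const void *a, const void *b)
{
	const struct log_item *x = *(struct log_item *const *)a;
	const struct log_item *y = *(struct log_item *const *)b;

	return (x->order_id > y->order_id) - (x->order_id < y->order_id);
}

/* Push: gather every per-cpu list into out[], then restore global order. */
static int cil_push(struct log_item **out)
{
	int n = 0;

	for (int cpu = 0; cpu < NR_CPUS; cpu++) {
		for (int i = 0; i < pcp[cpu].nr; i++) {
			pcp[cpu].items[i]->in_cil = 0;
			out[n++] = pcp[cpu].items[i];
		}
		pcp[cpu].nr = 0;
	}
	qsort(out, n, sizeof(out[0]), cmp_order);
	return n;
}
```

An item that is first committed on one CPU and later relogged from another
stays on the first CPU's list. Only its order_id changes, which is what
moves it to the tail once the gathered list is sorted.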

Overall, this pushes the transaction commit bottleneck out to the
lockless reservation grant head updates. These atomic updates don't
start to be a limiting factor until > 1.5 million transactions/s are
being run, at which point the accounting functions start to show up
in profiles as the highest CPU users. Still, this series doubles
transaction throughput without increasing CPU usage before we get
to that cacheline contention breakdown point...

Signed-off-by: Dave Chinner <dchinner@redhat.com>
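
To make the remaining bottleneck concrete, a lockless grant head update of
the kind mentioned above can be modelled as a compare-and-exchange loop on
one shared word. The sketch below is a userspace approximation under
assumptions: LOG_SIZE, grant_pack() and grant_add_space() are names invented
for this example, and the packing of a cycle number and a byte offset into
64 bits is a simplification rather than a statement of the exact kernel
layout.

```c
#include <stdatomic.h>
#include <stdint.h>

#define LOG_SIZE 1000u /* hypothetical log size in bytes */

/* Pack a cycle number and a byte offset into one 64-bit word. */
static uint64_t grant_pack(uint32_t cycle, uint32_t space)
{
	return ((uint64_t)cycle << 32) | space;
}

/* Advance the grant head by bytes (bytes <= LOG_SIZE), wrapping the log. */
static void grant_add_space(_Atomic uint64_t *head, uint32_t bytes)
{
	uint64_t old = atomic_load(head);
	uint64_t next;

	do {
		uint32_t cycle = (uint32_t)(old >> 32);
		uint32_t space = (uint32_t)old;
		uint32_t left = LOG_SIZE - space;

		if (left > bytes) {
			space += bytes;
		} else {
			space = bytes - left; /* wrapped past the end */
			cycle++;
		}
		next = grant_pack(cycle, space);
		/* On failure old is reloaded and the update is recomputed. */
	} while (!atomic_compare_exchange_weak(head, &old, next));
}
```

No lock is taken, but every CPU still retries against the same cache line,
which is consistent with the accounting functions showing up in profiles
once the commit rate gets high enough.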

----------------------------------------------------------------
Dave Chinner (14):
      xfs: use the CIL space used counter for emptiness checks
      xfs: lift init CIL reservation out of xc_cil_lock
      xfs: rework per-iclog header CIL reservation
      xfs: introduce per-cpu CIL tracking structure
      xfs: implement percpu cil space used calculation
      xfs: track CIL ticket reservation in percpu structure
      xfs: convert CIL busy extents to per-cpu
      xfs: Add order IDs to log items in CIL
      xfs: convert CIL to unordered per cpu lists
      xfs: convert log vector chain to use list heads
      xfs: move CIL ordering to the logvec chain
      xfs: avoid cil push lock if possible
      xfs: xlog_sync() manually adjusts grant head space
      xfs: expanding delayed logging design with background material

 Documentation/filesystems/xfs-delayed-logging-design.rst | 361 +++++++++++++++++++++++++++++++++++++++++++++++------
 fs/xfs/xfs_log.c                                         |  55 ++++++---
 fs/xfs/xfs_log.h                                         |   3 +-
 fs/xfs/xfs_log_cil.c                                     | 472 +++++++++++++++++++++++++++++++++++++++++++++++++++++-----------------
 fs/xfs/xfs_log_priv.h                                    |  58 ++++++---
 fs/xfs/xfs_super.c                                       |   1 +
 fs/xfs/xfs_trans.c                                       |   4 +-
 fs/xfs/xfs_trans.h                                       |   1 +
 fs/xfs/xfs_trans_priv.h                                  |   3 +-
 9 files changed, 768 insertions(+), 190 deletions(-)

Comments

Darrick J. Wong May 12, 2023, 1:28 a.m. UTC | #1
On Fri, Jul 08, 2022 at 09:33:47AM +1000, Dave Chinner wrote:
> Hi Darrick,
> 
> Can you please pull the CIL scalability improvements for 5.20 from
> the tag below? This branch is based on the linux-xfs/for-next branch
> as of 2 days ago, so should apply without any merge issues at all.
> 
> Cheers,
> 
> Dave.
> 
> The following changes since commit 7561cea5dbb97fecb952548a0fb74fb105bf4664:
> 
>   xfs: prevent a UAF when log IO errors race with unmount (2022-07-01 09:09:52 -0700)
> 
> are available in the Git repository at:
> 
>   git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs tags/xfs-cil-scale-5.20
> 
> for you to fetch changes up to 51a117edff133a1ea8cb0fcbc599b8d5a34414e9:
> 
>   xfs: expanding delayed logging design with background material (2022-07-07 18:56:09 +1000)
> 
> ----------------------------------------------------------------
> xfs: improve CIL scalability
> 
> This series aims to improve the scalability of XFS transaction
> commits on large CPU count machines. My 32p machine hits contention
> limits in xlog_cil_commit() at about 700,000 transaction commits a
> second. It hits this at 16 thread workloads, and 32 thread
> workloads go no faster and just burn CPU on the CIL spinlocks.
> 
> This patchset gets rid of spinlocks and global serialisation points
> in the xlog_cil_commit() path. It does this by moving to a
> combination of per-cpu counters, unordered per-cpu lists and
> post-ordered per-cpu lists.

FWIW, I have (rather infrequently) seen things like this in the 10 months
or so that this has been in mainline:

run fstests generic/650 at 2023-05-10 19:17:09
XFS (sda3): EXPERIMENTAL Large extent counts feature in use. Use at your own risk!
XFS (sda3): Mounting V5 Filesystem 75c42b12-8a39-4ecd-aac4-6b6ab0e384bd
XFS (sda3): Ending clean mount
smpboot: CPU 1 is now offline
x86: Booting SMP configuration:
smpboot: Booting Node 0 Processor 1 APIC 0x1
smpboot: CPU 1 is now offline
smpboot: CPU 3 is now offline
x86: Booting SMP configuration:
smpboot: Booting Node 0 Processor 1 APIC 0x1
smpboot: Booting Node 0 Processor 3 APIC 0x3
smpboot: CPU 3 is now offline
smpboot: Booting Node 0 Processor 3 APIC 0x3
smpboot: CPU 2 is now offline
smpboot: CPU 3 is now offline
XFS (sda3): ctx ticket reservation ran out. Need to up reservation
XFS (sda3): ticket reservation summary:
XFS (sda3):   unit res    = 9268 bytes
XFS (sda3):   current res = -40 bytes
XFS (sda3):   original count  = 1
XFS (sda3):   remaining count = 1
XFS (sda3): Filesystem has been shut down due to log error (0x2).
XFS (sda3): Please unmount the filesystem and rectify the problem(s).

Not sure what that's about, but given the recent discussions about
percpu counters not quite working correctly when racing with cpu
hotremove, I figured this would be a good time to capture one of the
failures and report it to the list.

--D
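
For readers unfamiliar with the class of problem mentioned above, here is a
hedged userspace sketch of how a per-cpu count can be missed while a CPU goes
offline. It is not a diagnosis of the shutdown reported here, and it does not
reproduce any kernel code: hp_counter, sum_online(), cpu_mark_offline() and
cpu_fold_dead() are hypothetical names, and the two-step offline sequence is
an assumption made to show the window.

```c
#include <stdbool.h>

#define NR_CPUS 4 /* hypothetical CPU count for the demo */

struct hp_counter {
	long global;          /* counts folded in from dead CPUs */
	long count[NR_CPUS];  /* per-cpu contributions */
	bool online[NR_CPUS]; /* stand-in for an online CPU mask */
};

/* Sum that only looks at CPUs currently marked online. */
static long sum_online(const struct hp_counter *c)
{
	long sum = c->global;

	for (int cpu = 0; cpu < NR_CPUS; cpu++)
		if (c->online[cpu])
			sum += c->count[cpu];
	return sum;
}

/* Step 1 of offlining: the CPU leaves the online mask. */
static void cpu_mark_offline(struct hp_counter *c, int cpu)
{
	c->online[cpu] = false;
}

/* Step 2, some time later: fold the dead CPU's count into the global. */
static void cpu_fold_dead(struct hp_counter *c, int cpu)
{
	c->global += c->count[cpu];
	c->count[cpu] = 0;
}
```

Any sum taken between the two steps leaves out the departing CPU's
contribution. A reservation total computed in that window would come up
short.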
