
[GIT,PULL] xfs: xlog_write rework and CIL scalability

Message ID 20211210000956.GO449541@dread.disaster.area (mailing list archive)
State New, archived

Pull-request

git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs.git tags/xfs-cil-scale-3-tag

Message

Dave Chinner Dec. 10, 2021, 12:09 a.m. UTC
Hi Darrick,

Can you please pull the following changes from the tag listed below
for the XFS dev tree?

Cheers,

Dave.

The following changes since commit 0fcfb00b28c0b7884635dacf38e46d60bf3d4eb1:

  Linux 5.16-rc4 (2021-12-05 14:08:22 -0800)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs.git tags/xfs-cil-scale-3-tag

for you to fetch changes up to 3b5181b310e0f2064f2aafb6143cdb0e920f5858:

  xfs: expanding delayed logging design with background material (2021-12-09 10:22:36 +1100)

----------------------------------------------------------------
xfs: CIL and log scalability improvements

xlog_write() is code that causes severe eye bleeding. It's extremely
difficult to understand the way it is structured, and extremely easy
to break because of all the weird parameters it passes between
functions that do very non-obvious things. State set in
xlog_write_finish_copy() is carried across both outer and inner
loop iterations and consumed by xlog_write_setup_copy(), which also
sets state that xlog_write_finish_copy() needs. The way iclog space
was obtained affects the accounting logic that ends up being passed
to xlog_state_finish_copy(). The code that handles commit iclogs is
spread over multiple functions and is obfuscated by the set/finish
copy code.

It's just a mess.

It's also extremely inefficient.

That's why I've rewritten the code. I think the code I've written is
much easier to understand and there's less of it.  The compiled code
is smaller and faster. It has far fewer subtleties and outside
dependencies, and is easier to reason about and modify.

Built on top of this is the CIL scalability improvements. My 32p
machine hits lock contention limits in xlog_cil_commit() at about
700,000 transaction commits a second. It hits this at 16 thread
workloads, and 32 thread workloads go no faster and just burn CPU on
the CIL spinlocks.

This patchset gets rid of spinlocks and global serialisation points
in the xlog_cil_commit() path. It does this by moving to a
combination of per-cpu counters, unordered per-cpu lists and
post-ordered per-cpu lists, and is built upon the xlog_write()
simplifications introduced earlier in the rewrite of that function.

This results in transaction commit rates exceeding 2 million
commits/s under certain unlink workloads, but in general the
improvements are smaller than this as the scalability limitations
simply move from xlog_cil_commit() to global VFS lock contexts.

----------------------------------------------------------------
Christoph Hellwig (2):
      xfs: change the type of ic_datap
      xfs: remove xlog_verify_dest_ptr

Dave Chinner (28):
      xfs: factor out the CIL transaction header building
      xfs: only CIL pushes require a start record
      xfs: embed the xlog_op_header in the unmount record
      xfs: embed the xlog_op_header in the commit record
      xfs: log tickets don't need log client id
      xfs: move log iovec alignment to preparation function
      xfs: reserve space and initialise xlog_op_header in item formatting
      xfs: log ticket region debug is largely useless
      xfs: pass lv chain length into xlog_write()
      xfs: introduce xlog_write_full()
      xfs: introduce xlog_write_partial()
      xfs: xlog_write() no longer needs contwr state
      xfs: xlog_write() doesn't need optype anymore
      xfs: CIL context doesn't need to count iovecs
      xfs: use the CIL space used counter for emptiness checks
      xfs: lift init CIL reservation out of xc_cil_lock
      xfs: rework per-iclog header CIL reservation
      xfs: introduce per-cpu CIL tracking structure
      xfs: implement percpu cil space used calculation
      xfs: track CIL ticket reservation in percpu structure
      xfs: convert CIL busy extents to per-cpu
      xfs: Add order IDs to log items in CIL
      xfs: convert CIL to unordered per cpu lists
      xfs: convert log vector chain to use list heads
      xfs: move CIL ordering to the logvec chain
      xfs: avoid cil push lock if possible
      xfs: xlog_sync() manually adjusts grant head space
      xfs: expanding delayed logging design with background material

 Documentation/filesystems/xfs-delayed-logging-design.rst | 361 +++++++++++++++++++++++++++++++++----
 fs/xfs/libxfs/xfs_log_format.h                           |   1 -
 fs/xfs/xfs_log.c                                         | 809 ++++++++++++++++++++++++++++++++++++----------------------------------------------
 fs/xfs/xfs_log.h                                         |  58 ++----
 fs/xfs/xfs_log_cil.c                                     | 550 +++++++++++++++++++++++++++++++++++++++-----------------
 fs/xfs/xfs_log_priv.h                                    | 103 +++++------
 fs/xfs/xfs_super.c                                       |   1 +
 fs/xfs/xfs_trans.c                                       |  10 +-
 fs/xfs/xfs_trans.h                                       |   1 +
 fs/xfs/xfs_trans_priv.h                                  |   3 +-
 10 files changed, 1134 insertions(+), 763 deletions(-)

Comments

Darrick J. Wong Jan. 6, 2022, 9:40 p.m. UTC | #1
On Fri, Dec 10, 2021 at 11:09:56AM +1100, Dave Chinner wrote:
> Hi Darrick,
> 
> Can you please pull the following changes from the tag listed below
> for the XFS dev tree?

Hi Dave,

I tried, but the regressions with generic/017 persist.  It trips the
ticket reservation pretty consistently within 45-60 seconds of starting,
at least on the OCI VM that I created.  /dev/sd[ab] are (software
defined) disks that can sustain reads of ~50MB/s and ~5000iops; and
writes of about half those numbers.

 run fstests generic/017 at 2022-01-06 13:18:59
 XFS (sda4): Mounting V5 Filesystem
 XFS (sda4): Ending clean mount
 XFS (sda4): Quotacheck needed: Please wait.
 XFS (sda4): Quotacheck: Done.
 XFS (sda4): ctx ticket reservation ran out. Need to up reservation
 XFS (sda4): ticket reservation summary:
 XFS (sda4):   unit res    = 548636 bytes
 XFS (sda4):   current res = -76116 bytes
 XFS (sda4):   original count  = 1
 XFS (sda4):   remaining count = 1
 XFS (sda4): Log I/O Error (0x2) detected at xlog_write+0x5ee/0x660 [xfs] (fs/xfs/xfs_log.c:2499).  Shutting down filesystem.
 XFS (sda4): Please unmount the filesystem and rectify the problem(s)
 XFS (sda3): Unmounting Filesystem
 XFS (sda4): Unmounting Filesystem

blockdev/fs configuration:

# xfs_info /opt
meta-data=/dev/sda4              isize=512    agcount=4, agsize=2183680 blks
         =                       sectsz=4096  attr=2, projid32bit=1
         =                       crc=1        finobt=1, sparse=1, rmapbt=1
         =                       reflink=1    bigtime=1 inobtcount=1
data     =                       bsize=4096   blocks=8734720, imaxpct=25
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
log      =/dev/sdb2              bsize=4096   blocks=32768, version=2
         =                       sectsz=512   sunit=0 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0

# lsblk /dev/sda4 /dev/sdb2
NAME MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
sda4   8:4    0 33.3G  0 part /opt
sdb2   8:18   0  128M  0 part 

# lsblk -O -J /dev/sda4 /dev/sdb2
{
  "blockdevices": [
    {
      "name": "sda4",
      "kname": "sda4",
      "path": "/dev/sda4",
      "maj:min": "8:4",
      "fsavail": null,
      "fssize": null,
      "fstype": "xfs",
      "fsused": null,
      "fsuse%": null,
      "mountpoint": null,
      "label": null,
      "uuid": "6a8ac780-016a-4970-b357-cbe7bf05bca5",
      "ptuuid": "39357d95-9980-b546-b64b-0deac1bd0d1c",
      "pttype": "gpt",
      "parttype": "0fc63daf-8483-4772-8e79-3d69d8477de4",
      "partlabel": "SCRATCH_DEV",
      "partuuid": "2a8ccc9e-c9d0-3542-8fce-ce819dd0d601",
      "partflags": null,
      "ra": 2048,
      "ro": false,
      "rm": false,
      "hotplug": false,
      "model": null,
      "serial": null,
      "size": "33.3G",
      "state": null,
      "owner": "root",
      "group": "disk",
      "mode": "brw-rw----",
      "alignment": 0,
      "min-io": 4096,
      "opt-io": 1048576,
      "phy-sec": 4096,
      "log-sec": 512,
      "rota": true,
      "sched": "bfq",
      "rq-size": 256,
      "type": "part",
      "disc-aln": 0,
      "disc-gran": "0B",
      "disc-max": "0B",
      "disc-zero": false,
      "wsame": "0B",
      "wwn": "0x6024f38ce0184097914648e6d5e1ac3a",
      "rand": true,
      "pkname": "sda",
      "hctl": null,
      "tran": null,
      "subsystems": "block:scsi:virtio:pci",
      "rev": null,
      "vendor": null,
      "zoned": "none"
    },
    {
      "name": "sdb2",
      "kname": "sdb2",
      "path": "/dev/sdb2",
      "maj:min": "8:18",
      "fsavail": null,
      "fssize": null,
      "fstype": "xfs_external_log",
      "fsused": null,
      "fsuse%": null,
      "mountpoint": null,
      "label": null,
      "uuid": null,
      "ptuuid": "007b1bcb-34a8-8342-a655-1e69030e5db5",
      "pttype": "gpt",
      "parttype": "0fc63daf-8483-4772-8e79-3d69d8477de4",
      "partlabel": "SCRATCH_LOGDEV",
      "partuuid": "7a47a7f1-6d59-f943-a06f-b4600d32f20a",
      "partflags": null,
      "ra": 2048,
      "ro": false,
      "rm": false,
      "hotplug": false,
      "model": null,
      "serial": null,
      "size": "128M",
      "state": null,
      "owner": "root",
      "group": "disk",
      "mode": "brw-rw----",
      "alignment": 0,
      "min-io": 4096,
      "opt-io": 1048576,
      "phy-sec": 4096,
      "log-sec": 512,
      "rota": true,
      "sched": "bfq",
      "rq-size": 256,
      "type": "part",
      "disc-aln": 0,
      "disc-gran": "0B",
      "disc-max": "0B",
      "disc-zero": false,
      "wsame": "0B",
      "wwn": "0x60dcbb518a874fe99abcb8399b82c740",
      "rand": true,
      "pkname": "sdb",
      "hctl": null,
      "tran": null,
      "subsystems": "block:scsi:virtio:pci",
      "rev": null,
      "vendor": null,
      "zoned": "none"
    }
  ]
}

fstests config:

FSTYP         -- xfs (debug)
PLATFORM      -- Linux/x86_64 oci-mtr24 5.16.0-rc5-djwx #rc5 SMP PREEMPT Wed Dec 15 12:33:55 PST 2021
MKFS_OPTIONS  -- -f -llogdev=/dev/sdb2 /dev/sda4
MOUNT_OPTIONS -- -o usrquota,grpquota,prjquota, -ologdev=/dev/sdb2 /dev/sda4 /opt

I bisected this to "xfs: implement percpu cil space used calculation".
Not sure what's really going on there, though everything up through
"xfs: rework per-iclog header CIL reservation" seems solid enough.

--D

> Cheers,
> 
> Dave.
> 
> The following changes since commit 0fcfb00b28c0b7884635dacf38e46d60bf3d4eb1:
> 
>   Linux 5.16-rc4 (2021-12-05 14:08:22 -0800)
> 
> are available in the Git repository at:
> 
>   git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs.git tags/xfs-cil-scale-3-tag
> 
> for you to fetch changes up to 3b5181b310e0f2064f2aafb6143cdb0e920f5858:
> 
>   xfs: expanding delayed logging design with background material (2021-12-09 10:22:36 +1100)
> 
> ----------------------------------------------------------------
> xfs: CIL and log scalability improvements
> 
> xlog_write() is code that causes severe eye bleeding. It's extremely
> difficult to understand the way it is structured, and extremely easy
> to break because of all the weird parameters it passes between
> functions that do very non-obvious things. State set in
> xlog_write_finish_copy() is carried across both outer and inner
> loop iterations and consumed by xlog_write_setup_copy(), which also
> sets state that xlog_write_finish_copy() needs. The way iclog space
> was obtained affects the accounting logic that ends up being passed
> to xlog_state_finish_copy(). The code that handles commit iclogs is
> spread over multiple functions and is obfuscated by the set/finish
> copy code.
> 
> It's just a mess.
> 
> It's also extremely inefficient.
> 
> That's why I've rewritten the code. I think the code I've written is
> much easier to understand and there's less of it.  The compiled code
> is smaller and faster. It has far fewer subtleties and outside
> dependencies, and is easier to reason about and modify.
> 
> Built on top of this is the CIL scalability improvements. My 32p
> machine hits lock contention limits in xlog_cil_commit() at about
> 700,000 transaction commits a second. It hits this at 16 thread
> workloads, and 32 thread workloads go no faster and just burn CPU on
> the CIL spinlocks.
> 
> This patchset gets rid of spinlocks and global serialisation points
> in the xlog_cil_commit() path. It does this by moving to a
> combination of per-cpu counters, unordered per-cpu lists and
> post-ordered per-cpu lists, and is built upon the xlog_write()
> simplifications introduced earlier in the rewrite of that function.
> 
> This results in transaction commit rates exceeding 2 million
> commits/s under certain unlink workloads, but in general the
> improvements are smaller than this as the scalability limitations
> simply move from xlog_cil_commit() to global VFS lock contexts.
> 
> ----------------------------------------------------------------
> Christoph Hellwig (2):
>       xfs: change the type of ic_datap
>       xfs: remove xlog_verify_dest_ptr
> 
> Dave Chinner (28):
>       xfs: factor out the CIL transaction header building
>       xfs: only CIL pushes require a start record
>       xfs: embed the xlog_op_header in the unmount record
>       xfs: embed the xlog_op_header in the commit record
>       xfs: log tickets don't need log client id
>       xfs: move log iovec alignment to preparation function
>       xfs: reserve space and initialise xlog_op_header in item formatting
>       xfs: log ticket region debug is largely useless
>       xfs: pass lv chain length into xlog_write()
>       xfs: introduce xlog_write_full()
>       xfs: introduce xlog_write_partial()
>       xfs: xlog_write() no longer needs contwr state
>       xfs: xlog_write() doesn't need optype anymore
>       xfs: CIL context doesn't need to count iovecs
>       xfs: use the CIL space used counter for emptiness checks
>       xfs: lift init CIL reservation out of xc_cil_lock
>       xfs: rework per-iclog header CIL reservation
>       xfs: introduce per-cpu CIL tracking structure
>       xfs: implement percpu cil space used calculation
>       xfs: track CIL ticket reservation in percpu structure
>       xfs: convert CIL busy extents to per-cpu
>       xfs: Add order IDs to log items in CIL
>       xfs: convert CIL to unordered per cpu lists
>       xfs: convert log vector chain to use list heads
>       xfs: move CIL ordering to the logvec chain
>       xfs: avoid cil push lock if possible
>       xfs: xlog_sync() manually adjusts grant head space
>       xfs: expanding delayed logging design with background material
> 
>  Documentation/filesystems/xfs-delayed-logging-design.rst | 361 +++++++++++++++++++++++++++++++++----
>  fs/xfs/libxfs/xfs_log_format.h                           |   1 -
>  fs/xfs/xfs_log.c                                         | 809 ++++++++++++++++++++++++++++++++++++----------------------------------------------
>  fs/xfs/xfs_log.h                                         |  58 ++----
>  fs/xfs/xfs_log_cil.c                                     | 550 +++++++++++++++++++++++++++++++++++++++-----------------
>  fs/xfs/xfs_log_priv.h                                    | 103 +++++------
>  fs/xfs/xfs_super.c                                       |   1 +
>  fs/xfs/xfs_trans.c                                       |  10 +-
>  fs/xfs/xfs_trans.h                                       |   1 +
>  fs/xfs/xfs_trans_priv.h                                  |   3 +-
>  10 files changed, 1134 insertions(+), 763 deletions(-)
> 
> -- 
> Dave Chinner
> david@fromorbit.com
Dave Chinner Jan. 11, 2022, 5:04 a.m. UTC | #2
On Thu, Jan 06, 2022 at 01:40:33PM -0800, Darrick J. Wong wrote:
> On Fri, Dec 10, 2021 at 11:09:56AM +1100, Dave Chinner wrote:
> > Hi Darrick,
> > 
> > Can you please pull the following changes from the tag listed below
> > for the XFS dev tree?
> 
> Hi Dave,
> 
> I tried, but the regressions with generic/017 persist.  It trips the
> ticket reservation pretty consistently within 45-60 seconds of starting,
> at least on the OCI VM that I created.  /dev/sd[ab] are (software
> defined) disks that can sustain reads of ~50MB/s and ~5000iops; and
> writes of about half those numbers.
> 
>  run fstests generic/017 at 2022-01-06 13:18:59
>  XFS (sda4): Mounting V5 Filesystem
>  XFS (sda4): Ending clean mount
>  XFS (sda4): Quotacheck needed: Please wait.
>  XFS (sda4): Quotacheck: Done.
>  XFS (sda4): ctx ticket reservation ran out. Need to up reservation
>  XFS (sda4): ticket reservation summary:
>  XFS (sda4):   unit res    = 548636 bytes
>  XFS (sda4):   current res = -76116 bytes
>  XFS (sda4):   original count  = 1
>  XFS (sda4):   remaining count = 1
>  XFS (sda4): Log I/O Error (0x2) detected at xlog_write+0x5ee/0x660 [xfs] (fs/xfs/xfs_log.c:2499).  Shutting down filesystem.
>  XFS (sda4): Please unmount the filesystem and rectify the problem(s)
>  XFS (sda3): Unmounting Filesystem
>  XFS (sda4): Unmounting Filesystem

Ok, I *think* I've worked out what was going on here. The patch
below has survived several hundred iterations of g/017 with an
external log on two different fs/log size configurations that
typically reproduced the failure within 10 cycles.

Essentially, the problem is largely caused by using
XLOG_CIL_BLOCKING_SPACE_LIMIT() instead of XLOG_CIL_SPACE_LIMIT()
when determining how much used space we can allow the percpu
counters to accumulate before aggregating them back into the global
counter. Using the hard limit meant that we could accumulate almost
the entire hard limit before we aggregate even a single percpu value
back into the global counter, resulting in failing to trigger either
condition for aggregation until we'd effectively blown through the
hard limit.

This then meant the extra reservations that need to be taken for
space used beyond the hard limit didn't get stolen for the ctx
ticket, and it then overruns.

It also means that we could overrun the hard limit substantially
before throttling kicked in. With the percpu aggregation threshold
brought back down to the (soft limit / num online cpus) we are
guaranteed to always start aggregation back into the global counter
before or at the point in time the soft limit should be hit, meaning
that we start updating the global counter much sooner and it tracks
the actual space used much more closely once we are over the soft
limit.

Darrick, can you rerun the branch with the patch below also included, and
see if it reproduces on your setup? If it does, can you grab a trace
of the trace_printk() calls I left in the patch?

Note that this change does not make the algorithm fully correct - we
can still have accumulation on other CPUs that isn't folded back
into the global value. What I want is feedback on whether it makes
the problem largely go away on configs other than my own before
spending more time coming up with a better lockless aggregation
algorithm...

Cheers,

Dave.
Darrick J. Wong Jan. 11, 2022, 5:58 p.m. UTC | #3
On Tue, Jan 11, 2022 at 04:04:37PM +1100, Dave Chinner wrote:
> On Thu, Jan 06, 2022 at 01:40:33PM -0800, Darrick J. Wong wrote:
> > On Fri, Dec 10, 2021 at 11:09:56AM +1100, Dave Chinner wrote:
> > > Hi Darrick,
> > > 
> > > Can you please pull the following changes from the tag listed below
> > > for the XFS dev tree?
> > 
> > Hi Dave,
> > 
> > I tried, but the regressions with generic/017 persist.  It trips the
> > ticket reservation pretty consistently within 45-60 seconds of starting,
> > at least on the OCI VM that I created.  /dev/sd[ab] are (software
> > defined) disks that can sustain reads of ~50MB/s and ~5000iops; and
> > writes of about half those numbers.
> > 
> >  run fstests generic/017 at 2022-01-06 13:18:59
> >  XFS (sda4): Mounting V5 Filesystem
> >  XFS (sda4): Ending clean mount
> >  XFS (sda4): Quotacheck needed: Please wait.
> >  XFS (sda4): Quotacheck: Done.
> >  XFS (sda4): ctx ticket reservation ran out. Need to up reservation
> >  XFS (sda4): ticket reservation summary:
> >  XFS (sda4):   unit res    = 548636 bytes
> >  XFS (sda4):   current res = -76116 bytes
> >  XFS (sda4):   original count  = 1
> >  XFS (sda4):   remaining count = 1
> >  XFS (sda4): Log I/O Error (0x2) detected at xlog_write+0x5ee/0x660 [xfs] (fs/xfs/xfs_log.c:2499).  Shutting down filesystem.
> >  XFS (sda4): Please unmount the filesystem and rectify the problem(s)
> >  XFS (sda3): Unmounting Filesystem
> >  XFS (sda4): Unmounting Filesystem
> 
> Ok, I *think* I've worked out what was going on here. The patch
> below has survived several hundred iterations of g/017 with an
> external log on two different fs/log size configurations that
> typically reproduced the failure within 10 cycles.
> 
> Essentially, the problem is largely caused by using
> XLOG_CIL_BLOCKING_SPACE_LIMIT() instead of XLOG_CIL_SPACE_LIMIT()
> when determining how much used space we can allow the percpu
> counters to accumulate before aggregating them back into the global
> counter. Using the hard limit meant that we could accumulate almost
> the entire hard limit before we aggregate even a single percpu value
> back into the global counter, resulting in failing to trigger either
> condition for aggregation until we'd effectively blown through the
> hard limit.
> 
> This then meant the extra reservations that need to be taken for
> space used beyond the hard limit didn't get stolen for the ctx
> ticket, and it then overruns.
> 
> It also means that we could overrun the hard limit substantially
> before throttling kicked in. With the percpu aggregation threshold
> brought back down to the (soft limit / num online cpus) we are
> guaranteed to always start aggregation back into the global counter
> before or at the point in time the soft limit should be hit, meaning
> that we start updating the global counter much sooner and it tracks
> the actual space used much more closely once we are over the soft
> limit.
> 
> Darrick, can you rerun the branch with the patch below also included, and
> see if it reproduces on your setup? If it does, can you grab a trace
> of the trace_printk() calls I left in the patch?

Ok, I'll do that and report back.

> Note that this change does not make the algorithm fully correct - we
> can still have accumulation on other CPUs that isn't folded back
> into the global value. What I want is feedback on whether it makes
> the problem largely go away on configs other than my own before
> spending more time coming up with a better lockless aggregation
> algorithm...

--D

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> 
> ---
>  fs/xfs/xfs_log_cil.c | 18 ++++++++++++++----
>  1 file changed, 14 insertions(+), 4 deletions(-)
> 
> diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
> index 366c0aaad640..47d46d6e15b3 100644
> --- a/fs/xfs/xfs_log_cil.c
> +++ b/fs/xfs/xfs_log_cil.c
> @@ -96,6 +96,9 @@ xlog_cil_pcp_aggregate(
>  		ctx->ticket->t_curr_res += cilpcp->space_reserved;
>  		ctx->ticket->t_unit_res += cilpcp->space_reserved;
>  		cilpcp->space_reserved = 0;
> +	trace_printk("cilpcp space used %d, reserved %d unit-res %d cur-res %d",
> +			cilpcp->space_used, cilpcp->space_reserved,
> +			ctx->ticket->t_unit_res, ctx->ticket->t_curr_res);
>  
>  		if (!list_empty(&cilpcp->busy_extents)) {
>  			list_splice_init(&cilpcp->busy_extents,
> @@ -515,11 +518,16 @@ xlog_cil_insert_items(
>  	 *
>  	 * This can steal more than we need, but that's OK.
>  	 */
> -	space_used = atomic_read(&ctx->space_used);
> +	space_used = atomic_read(&ctx->space_used) + len;
>  	if (atomic_read(&cil->xc_iclog_hdrs) > 0 ||
> -	    space_used + len >= XLOG_CIL_BLOCKING_SPACE_LIMIT(log)) {
> +	    space_used >= XLOG_CIL_BLOCKING_SPACE_LIMIT(log)) {
>  		int	split_res = log->l_iclog_hsize +
>  					sizeof(struct xlog_op_header);
> +
> +	trace_printk("space used %d, len %d iclog hdrs %d, slim %d, hlim %d",
> +			space_used, len, atomic_read(&cil->xc_iclog_hdrs),
> +			XLOG_CIL_SPACE_LIMIT(log),
> +			XLOG_CIL_BLOCKING_SPACE_LIMIT(log));
>  		if (ctx_res)
>  			ctx_res += split_res * (tp->t_ticket->t_iclog_hdrs - 1);
>  		else
> @@ -540,8 +548,9 @@ xlog_cil_insert_items(
>  	cilpcp->space_used += len;
>  	if (space_used >= XLOG_CIL_SPACE_LIMIT(log) ||
>  	    cilpcp->space_used >
> -			((XLOG_CIL_BLOCKING_SPACE_LIMIT(log) - space_used) /
> -					num_online_cpus())) {
> +			(XLOG_CIL_SPACE_LIMIT(log) / num_online_cpus())) {
> +	trace_printk("cilpcp space used %d, reserved %d ctxres %d",
> +			cilpcp->space_used, cilpcp->space_reserved, ctx_res);
>  		atomic_add(cilpcp->space_used, &ctx->space_used);
>  		cilpcp->space_used = 0;
>  	}
> @@ -1331,6 +1340,7 @@ xlog_cil_push_background(
>  
>  	spin_lock(&cil->xc_push_lock);
>  	if (cil->xc_push_seq < cil->xc_current_sequence) {
> > +		trace_printk("push space used %d", space_used);
>  		cil->xc_push_seq = cil->xc_current_sequence;
>  		queue_work(cil->xc_push_wq, &cil->xc_ctx->push_work);
>  	}
Darrick J. Wong Jan. 12, 2022, 11:56 p.m. UTC | #4
On Tue, Jan 11, 2022 at 09:58:29AM -0800, Darrick J. Wong wrote:
> On Tue, Jan 11, 2022 at 04:04:37PM +1100, Dave Chinner wrote:
> > On Thu, Jan 06, 2022 at 01:40:33PM -0800, Darrick J. Wong wrote:
> > > On Fri, Dec 10, 2021 at 11:09:56AM +1100, Dave Chinner wrote:
> > > > Hi Darrick,
> > > > 
> > > > Can you please pull the following changes from the tag listed below
> > > > for the XFS dev tree?
> > > 
> > > Hi Dave,
> > > 
> > > I tried, but the regressions with generic/017 persist.  It trips the
> > > ticket reservation pretty consistently within 45-60 seconds of starting,
> > > at least on the OCI VM that I created.  /dev/sd[ab] are (software
> > > defined) disks that can sustain reads of ~50MB/s and ~5000iops; and
> > > writes of about half those numbers.
> > > 
> > >  run fstests generic/017 at 2022-01-06 13:18:59
> > >  XFS (sda4): Mounting V5 Filesystem
> > >  XFS (sda4): Ending clean mount
> > >  XFS (sda4): Quotacheck needed: Please wait.
> > >  XFS (sda4): Quotacheck: Done.
> > >  XFS (sda4): ctx ticket reservation ran out. Need to up reservation
> > >  XFS (sda4): ticket reservation summary:
> > >  XFS (sda4):   unit res    = 548636 bytes
> > >  XFS (sda4):   current res = -76116 bytes
> > >  XFS (sda4):   original count  = 1
> > >  XFS (sda4):   remaining count = 1
> > >  XFS (sda4): Log I/O Error (0x2) detected at xlog_write+0x5ee/0x660 [xfs] (fs/xfs/xfs_log.c:2499).  Shutting down filesystem.
> > >  XFS (sda4): Please unmount the filesystem and rectify the problem(s)
> > >  XFS (sda3): Unmounting Filesystem
> > >  XFS (sda4): Unmounting Filesystem
> > 
> > Ok, I *think* I've worked out what was going on here. The patch
> > below has survived several hundred iterations of g/017 with an
> > external log on two different fs/log size configurations that
> > typically reproduced the failure within 10 cycles.
> > 
> > Essentially, the problem is largely caused by using
> > XLOG_CIL_BLOCKING_SPACE_LIMIT() instead of XLOG_CIL_SPACE_LIMIT()
> > when determining how much used space we can allow the percpu
> > counters to accumulate before aggregating them back into the global
> > counter. Using the hard limit meant that we could accumulate almost
> > the entire hard limit before we aggregate even a single percpu value
> > back into the global counter, resulting in failing to trigger either
> > condition for aggregation until we'd effectively blown through the
> > hard limit.
> > 
> > This then meant the extra reservations that need to be taken for
> > space used beyond the hard limit didn't get stolen for the ctx
> > ticket, and it then overruns.
> > 
> > It also means that we could overrun the hard limit substantially
> > before throttling kicked in. With the percpu aggregation threshold
> > brought back down to the (soft limit / num online cpus) we are
> > guaranteed to always start aggregation back into the global counter
> > before or at the point in time the soft limit should be hit, meaning
> > that we start updating the global counter much sooner and it tracks
> > the actual space used much more closely once we are over the soft
> > limit.
> > 
> > Darrick, can you rerun the branch with the patch below also included, and
> > see if it reproduces on your setup? If it does, can you grab a trace
> > of the trace_printk() calls I left in the patch?
> 
> Ok, I'll do that and report back.

...everything passes now, except for generic/650 on the same machine
that has a 128M external log:

[21310.267037] run fstests generic/650 at 2022-01-11 22:41:45
[21311.121539] XFS (sda3): Mounting V5 Filesystem
[21312.295609] XFS (sda3): Ending clean mount
[21314.160622] smpboot: CPU 2 is now offline
[21314.737842] smpboot: Booting Node 0 Processor 2 APIC 0x2
[21314.740726] kvm-clock: cpu 2, msr 43f608081, secondary cpu clock
[21314.787354] kvm-guest: stealtime: cpu 2, msr 43fd1b040
[21315.917154] smpboot: CPU 1 is now offline
[21317.993809] x86: Booting SMP configuration:
[21317.996484] smpboot: Booting Node 0 Processor 1 APIC 0x1
[21318.001940] kvm-clock: cpu 1, msr 43f608041, secondary cpu clock
[21318.020207] kvm-guest: stealtime: cpu 1, msr 43fc9b040
[21320.126617] smpboot: CPU 3 is now offline
[21320.127744] XFS (sda3): ctx ticket reservation ran out. Need to up reservation
[21320.153944] XFS (sda3): ticket reservation summary:
[21320.158868] XFS (sda3):   unit res    = 2100 bytes
[21320.163064] XFS (sda3):   current res = -40 bytes
[21320.167323] XFS (sda3):   original count  = 1
[21320.170436] XFS (sda3):   remaining count = 1
[21320.171742] XFS (sda3): Log I/O Error (0x2) detected at xlog_write+0x5f3/0x670 [xfs] (fs/xfs/xfs_log.c:2512).  Shutting down filesystem.
[21320.176445] XFS (sda3): Please unmount the filesystem and rectify the problem(s)
[21320.179719] potentially unexpected fatal signal 6.
[21320.182490] potentially unexpected fatal signal 6.
[21320.182632] potentially unexpected fatal signal 6.
[21320.183842] CPU: 0 PID: 3460987 Comm: fsstress Tainted: G        W         5.16.0-rc5-djwx #rc5 79050ab45c4cbd1b9fbe98125ec0eea3a2cdfa1d

--D

> > Note that this change does not make the algorithm fully correct - we
> > can still have accumulation on other CPUs that isn't folded back
> > into the global value. What I want is feedback on whether it makes
> > the problem largely go away on configs other than my own before
> > spending more time coming up with a better lockless aggregation
> > algorithm...
> 
> --D
> 
> > Cheers,
> > 
> > Dave.
> > -- 
> > Dave Chinner
> > david@fromorbit.com
> > 
> > ---
> >  fs/xfs/xfs_log_cil.c | 18 ++++++++++++++----
> >  1 file changed, 14 insertions(+), 4 deletions(-)
> > 
> > diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
> > index 366c0aaad640..47d46d6e15b3 100644
> > --- a/fs/xfs/xfs_log_cil.c
> > +++ b/fs/xfs/xfs_log_cil.c
> > @@ -96,6 +96,9 @@ xlog_cil_pcp_aggregate(
> >  		ctx->ticket->t_curr_res += cilpcp->space_reserved;
> >  		ctx->ticket->t_unit_res += cilpcp->space_reserved;
> >  		cilpcp->space_reserved = 0;
> > +	trace_printk("cilpcp space used %d, reserved %d unit-res %d cur-res %d",
> > +			cilpcp->space_used, cilpcp->space_reserved,
> > +			ctx->ticket->t_unit_res, ctx->ticket->t_curr_res);
> >  
> >  		if (!list_empty(&cilpcp->busy_extents)) {
> >  			list_splice_init(&cilpcp->busy_extents,
> > @@ -515,11 +518,16 @@ xlog_cil_insert_items(
> >  	 *
> >  	 * This can steal more than we need, but that's OK.
> >  	 */
> > -	space_used = atomic_read(&ctx->space_used);
> > +	space_used = atomic_read(&ctx->space_used) + len;
> >  	if (atomic_read(&cil->xc_iclog_hdrs) > 0 ||
> > -	    space_used + len >= XLOG_CIL_BLOCKING_SPACE_LIMIT(log)) {
> > +	    space_used >= XLOG_CIL_BLOCKING_SPACE_LIMIT(log)) {
> >  		int	split_res = log->l_iclog_hsize +
> >  					sizeof(struct xlog_op_header);
> > +
> > +	trace_printk("space used %d, len %d iclog hdrs %d, slim %d, hlim %d",
> > +			space_used, len, atomic_read(&cil->xc_iclog_hdrs),
> > +			XLOG_CIL_SPACE_LIMIT(log),
> > +			XLOG_CIL_BLOCKING_SPACE_LIMIT(log));
> >  		if (ctx_res)
> >  			ctx_res += split_res * (tp->t_ticket->t_iclog_hdrs - 1);
> >  		else
> > @@ -540,8 +548,9 @@ xlog_cil_insert_items(
> >  	cilpcp->space_used += len;
> >  	if (space_used >= XLOG_CIL_SPACE_LIMIT(log) ||
> >  	    cilpcp->space_used >
> > -			((XLOG_CIL_BLOCKING_SPACE_LIMIT(log) - space_used) /
> > -					num_online_cpus())) {
> > +			(XLOG_CIL_SPACE_LIMIT(log) / num_online_cpus())) {
> > +	trace_printk("cilpcp space used %d, reserved %d ctxres %d",
> > +			cilpcp->space_used, cilpcp->space_reserved, ctx_res);
> >  		atomic_add(cilpcp->space_used, &ctx->space_used);
> >  		cilpcp->space_used = 0;
> >  	}
> > @@ -1331,6 +1340,7 @@ xlog_cil_push_background(
> >  
> >  	spin_lock(&cil->xc_push_lock);
> >  	if (cil->xc_push_seq < cil->xc_current_sequence) {
> > +		trace_printk("push space used %d", space_used);
> >  		cil->xc_push_seq = cil->xc_current_sequence;
> >  		queue_work(cil->xc_push_wq, &cil->xc_ctx->push_work);
> >  	}
Dave Chinner Jan. 13, 2022, 3:46 a.m. UTC | #5
On Wed, Jan 12, 2022 at 03:56:04PM -0800, Darrick J. Wong wrote:
> On Tue, Jan 11, 2022 at 09:58:29AM -0800, Darrick J. Wong wrote:
> > On Tue, Jan 11, 2022 at 04:04:37PM +1100, Dave Chinner wrote:
> > > On Thu, Jan 06, 2022 at 01:40:33PM -0800, Darrick J. Wong wrote:
> > > > On Fri, Dec 10, 2021 at 11:09:56AM +1100, Dave Chinner wrote:
> > > > > Hi Darrick,
> > > > > 
> > > > > Can you please pull the following changes from the tag listed below
> > > > > for the XFS dev tree?
> > > > 
> > > > Hi Dave,
> > > > 
> > > > I tried, but the regressions with generic/017 persist.  It trips the
> > > > ticket reservation pretty consistently within 45-60 seconds of starting,
> > > > at least on the OCI VM that I created.  /dev/sd[ab] are (software
> > > > defined) disks that can sustain reads of ~50MB/s and ~5000iops; and
> > > > writes of about half those numbers.
> > > > 
> > > >  run fstests generic/017 at 2022-01-06 13:18:59
> > > >  XFS (sda4): Mounting V5 Filesystem
> > > >  XFS (sda4): Ending clean mount
> > > >  XFS (sda4): Quotacheck needed: Please wait.
> > > >  XFS (sda4): Quotacheck: Done.
> > > >  XFS (sda4): ctx ticket reservation ran out. Need to up reservation
> > > >  XFS (sda4): ticket reservation summary:
> > > >  XFS (sda4):   unit res    = 548636 bytes
> > > >  XFS (sda4):   current res = -76116 bytes
> > > >  XFS (sda4):   original count  = 1
> > > >  XFS (sda4):   remaining count = 1
> > > >  XFS (sda4): Log I/O Error (0x2) detected at xlog_write+0x5ee/0x660 [xfs] (fs/xfs/xfs_log.c:2499).  Shutting down filesystem.
> > > >  XFS (sda4): Please unmount the filesystem and rectify the problem(s)
> > > >  XFS (sda3): Unmounting Filesystem
> > > >  XFS (sda4): Unmounting Filesystem
> > > 
> > > Ok, I *think* I've worked out what was going on here. The patch
> > > below has run several hundred iterations of g/017 with an external
> > > log on two different fs/log size configurations that typically
> > > reproduced within 10 cycles.
> > > 
> > > Essentially, the problem is largely caused by using
> > > XLOG_CIL_BLOCKING_SPACE_LIMIT() instead of XLOG_CIL_SPACE_LIMIT()
> > > when determining how much used space we can allow the percpu
> > > counters to accumulate before aggregating them back into the global
> > > counter. Using the hard limit meant that we could accumulate almost
> > > the entire hard limit before aggregating even a single percpu
> > > value back into the global counter, so neither aggregation
> > > condition triggered until we'd effectively blown through the
> > > hard limit.
> > > 
> > > This then meant the extra reservations that need to be taken for
> > > space used beyond the hard limit didn't get stolen for the ctx
> > > ticket, and it then overruns.
> > > 
> > > It also means that we could overrun the hard limit substantially
> > > before throttling kicked in. With the percpu aggregation threshold
> > > brought back down to the (soft limit / num online cpus) we are
> > > guaranteed to always start aggregation back into the global counter
> > > before or at the point in time the soft limit should be hit, meaning
> > > that we start updating the global counter much sooner, and so it
> > > tracks actual space used much more closely once over the soft limit.
> > > 
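[Archive note: the effect described above can be made concrete with a
small simulation. The parameters below are illustrative, not the real
values of XLOG_CIL_SPACE_LIMIT() and XLOG_CIL_BLOCKING_SPACE_LIMIT(),
and the real code also folds on a global soft-limit check; but it shows
why a hard-limit based per-CPU fold threshold lets the CPUs collectively
blow through the hard limit before the global counter is touched, while
a soft-limit based threshold guarantees a fold at around the soft
limit.]

```c
enum { NCPUS = 4, TXLEN = 256, SOFT = 8192, HARD = 4 * SOFT };

/*
 * Round-robin commits of TXLEN bytes across NCPUS CPUs; return the
 * true total space consumed at the moment any CPU first folds its
 * per-CPU count into the global counter.
 */
static int space_before_first_fold(int threshold)
{
	int pcp[NCPUS] = { 0 };
	int total = 0;

	for (;;) {
		for (int cpu = 0; cpu < NCPUS; cpu++) {
			pcp[cpu] += TXLEN;
			total += TXLEN;
			if (pcp[cpu] > threshold)
				return total;
		}
	}
}
```

With these numbers, a HARD / NCPUS threshold lets 33024 bytes
accumulate before the first fold (already past the 32768-byte hard
limit), while a SOFT / NCPUS threshold folds after 8448 bytes, right
around the soft limit.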
> > > Darrick, can you rerun the branch with the patch below also included, and
> > > see if it reproduces on your setup? If it does, can you grab a trace
> > > of the trace_printk() calls I left in the patch?
> > 
> > Ok, I'll do that and report back.
> 
> ...everything passes now, except for generic/650 on the same machine
> that has a 128M external log:
> 
> [21310.267037] run fstests generic/650 at 2022-01-11 22:41:45
> [21311.121539] XFS (sda3): Mounting V5 Filesystem
> [21312.295609] XFS (sda3): Ending clean mount
> [21314.160622] smpboot: CPU 2 is now offline
> [21314.737842] smpboot: Booting Node 0 Processor 2 APIC 0x2
> [21314.740726] kvm-clock: cpu 2, msr 43f608081, secondary cpu clock
> [21314.787354] kvm-guest: stealtime: cpu 2, msr 43fd1b040
> [21315.917154] smpboot: CPU 1 is now offline
> [21317.993809] x86: Booting SMP configuration:
> [21317.996484] smpboot: Booting Node 0 Processor 1 APIC 0x1
> [21318.001940] kvm-clock: cpu 1, msr 43f608041, secondary cpu clock
> [21318.020207] kvm-guest: stealtime: cpu 1, msr 43fc9b040
> [21320.126617] smpboot: CPU 3 is now offline
> [21320.127744] XFS (sda3): ctx ticket reservation ran out. Need to up reservation
> [21320.153944] XFS (sda3): ticket reservation summary:
> [21320.158868] XFS (sda3):   unit res    = 2100 bytes
> [21320.163064] XFS (sda3):   current res = -40 bytes
> [21320.167323] XFS (sda3):   original count  = 1
> [21320.170436] XFS (sda3):   remaining count = 1
> [21320.171742] XFS (sda3): Log I/O Error (0x2) detected at xlog_write+0x5f3/0x670 [xfs] (fs/xfs/xfs_log.c:2512).  Shutting down filesystem.
> [21320.176445] XFS (sda3): Please unmount the filesystem and rectify the problem(s)
> [21320.179719] potentially unexpected fatal signal 6.
> [21320.182490] potentially unexpected fatal signal 6.
> [21320.182632] potentially unexpected fatal signal 6.
> [21320.183842] CPU: 0 PID: 3460987 Comm: fsstress Tainted: G        W         5.16.0-rc5-djwx #rc5 79050ab45c4cbd1b9fbe98125ec0eea3a2cdfa1d

Different issue. That's a CPU hotplug problem, not a general
accounting issue. Hotplug sucks, not sure yet if there's a way to
avoid the race that causes this.

Cheers,

Dave.