[GIT,PULL] xfs: new code for 5.15

Message ID 20210831211847.GC9959@magnolia (mailing list archive)
State New, archived

Pull-request

git://git.kernel.org/pub/scm/fs/xfs/xfs-linux.git tags/xfs-5.15-merge-6

Message

Darrick J. Wong Aug. 31, 2021, 9:18 p.m. UTC
Hi Linus,

Please pull this branch containing new code for 5.15.  There's a lot in
this cycle.

Starting with bug fixes: To avoid livelocks between the logging code and
the quota code, we've disabled the ability of quotaoff to turn off quota
accounting.  (Admins can still disable quota enforcement, but truly
turning off accounting requires a remount.)  We've tried to do this in a
careful enough way that there shouldn't be any user visible effects
aside from quotaoff no longer randomly hanging the system.

We've also fixed some bugs in runtime log behavior that could trip up
log recovery if (otherwise unrelated) transactions manage to start and
commit concurrently; some bugs in the GETFSMAP ioctl where we would
incorrectly restrict the range of records output if the two xfs devices
are of different sizes; a bug that resulted in fallocate funshare
failing unnecessarily; and broken behavior in the xfs inode cache when
DONTCACHE is in play.

As for new features: we now batch inode inactivations in percpu
background threads, which sharply decreases frontend thread wait time
when performing file deletions and should improve overall directory tree
deletion times.  This eliminates the problem where closing an unlinked
file (especially on a frozen fs) can stall for a long time, and should
also ease complaints about direct reclaim bogging down on unlinked file
cleanup.

Starting with this release, we've enabled pipelining of the XFS log.  On
workloads with high rates of metadata updates to different shards of the
filesystem, multiple threads can be used to format committed log updates
into log checkpoints.

Lastly, with this release, two new features have graduated to supported
status: inode btree counters (for faster mounts), and support for dates
beyond Y2038.  Expect these to be enabled by default in a future release
of xfsprogs.

The branch merges cleanly against upstream as of a few minutes ago.
Please let me know if anything else strange happens during the merge
process.

--D

The following changes since commit c500bee1c5b2f1d59b1081ac879d73268ab0ff17:

  Linux 5.14-rc4 (2021-08-01 17:04:17 -0700)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/fs/xfs/xfs-linux.git tags/xfs-5.15-merge-6

for you to fetch changes up to f38a032b165d812b0ba8378a5cd237c0888ff65f:

  xfs: fix I_DONTCACHE (2021-08-24 19:13:04 -0700)

----------------------------------------------------------------
New code for 5.15:
 - Fix a potential log livelock on busy filesystems, where there's so much
   work going on that a quotaoff can't finish before the log fills up, by
   removing the ability to disable quota accounting.
 - Introduce the ability to use per-CPU data structures in XFS so that
   we can do a better job of maintaining CPU locality for certain
   operations.
 - Defer inode inactivation work to per-CPU lists, which will help us
   batch that processing.  Deletions of large sparse files will *appear*
   to run faster, but all that means is that we've moved the work to the
   backend.
 - Drop the EXPERIMENTAL warnings from the y2038+ support and the inode
   btree counters, since it's been nearly a year and no complaints have
   come in.
 - Remove more of our bespoke kmem* variants in favor of using the
   standard Linux calls.
 - Prepare for the addition of log incompat features in upcoming cycles
   by actually adding code to support this.
 - Small cleanups of the xattr code in preparation for landing support
   for full logging of extended attribute updates in a future cycle.
 - Replace the various log shutdown state and flag code all over xfs
   with a single atomic bit flag.
 - Fix a serious log recovery bug where log item replay could be skipped
   based on a transaction's start lsn even though the commit lsn is the
   key data point, by enforcing that start lsns appear in the log in the
   same order as commit lsns.
 - Enable pipelining in the code that pushes log items to disk.
 - Drop ->writepage.
 - Fix some bugs in GETFSMAP where the last fsmap record reported for a
   device could extend beyond the end of the device, and a separate bug
   where query keys for one device could be applied to another.
 - Don't let GETFSMAP query functions edit their input parameters.
 - Small cleanups to the scrub code's handling of perag structures.
 - Small cleanups to the incore inode tree walk code.
 - Constify btree function parameters that aren't changed, so that there
   will never again be confusion about range query functions changing
   their input parameters.
 - Standardize the format and names of tracepoint data attributes.
 - Clean up all the mount state and feature flags to use wrapped bitset
   functions instead of inconsistently open-coded flag checks.
 - Fix some confusion between the xfs_buf hash table key variable and the
   block number.
 - Fix a mis-interaction with iomap where we reported shared delalloc
   cow fork extents to iomap, which would cause the iomap unshare
   operation to return IO errors unnecessarily.
 - Fix DONTCACHE behavior.

----------------------------------------------------------------
Allison Henderson (2):
      xfs: add attr state machine tracepoints
      xfs: Rename __xfs_attr_rmtval_remove

Christoph Hellwig (5):
      xfs: remove support for disabling quota accounting on a mounted file system
      xfs: remove xfs_dqrele_all_inodes
      xfs: remove the flags argument to xfs_qm_dquot_walk
      xfs: remove the active vs running quota differentiation
      xfs: remove support for untagged lookups in xfs_icwalk*

Darrick J. Wong (51):
      xfs: move xfs_inactive call to xfs_inode_mark_reclaimable
      xfs: detach dquots from inode if we don't need to inactivate it
      xfs: queue inactivation immediately when free space is tight
      xfs: queue inactivation immediately when quota is nearing enforcement
      xfs: queue inactivation immediately when free realtime extents are tight
      xfs: inactivate inodes any time we try to free speculative preallocations
      xfs: flush inode inactivation work when compiling usage statistics
      xfs: don't run speculative preallocation gc when fs is frozen
      xfs: use background worker pool when transactions can't get free space
      xfs: avoid buffer deadlocks when walking fs inodes
      xfs: throttle inode inactivation queuing on memory reclaim
      xfs: fix silly whitespace problems with kernel libxfs
      xfs: drop experimental warnings for bigtime and inobtcount
      xfs: grab active perag ref when reading AG headers
      xfs: dump log intent items that cannot be recovered due to corruption
      xfs: allow setting and clearing of log incompat feature flags
      xfs: clear log incompat feature bits when the log is idle
      xfs: refactor xfs_iget calls from log intent recovery
      xfs: make xfs_rtalloc_query_range input parameters const
      xfs: fix off-by-one error when the last rt extent is in use
      xfs: make fsmap backend function key parameters const
      xfs: remove unnecessary agno variable from struct xchk_ag
      xfs: add trace point for fs shutdown
      xfs: make the key parameters to all btree key comparison functions const
      xfs: make the key parameters to all btree query range functions const
      xfs: make the record pointer passed to query_range functions const
      xfs: mark the record passed into btree init_key functions as const
      xfs: make the keys and records passed to btree inorder functions const
      xfs: mark the record passed into xchk_btree functions as const
      xfs: make the pointer passed to btree set_root functions const
      xfs: make the start pointer passed to btree alloc_block functions const
      xfs: make the start pointer passed to btree update_lastrec functions const
      xfs: constify btree function parameters that are not modified
      xfs: fix incorrect unit conversion in scrub tracepoint
      xfs: standardize inode number formatting in ftrace output
      xfs: standardize AG number formatting in ftrace output
      xfs: standardize AG block number formatting in ftrace output
      xfs: standardize rmap owner number formatting in ftrace output
      xfs: standardize daddr formatting in ftrace output
      xfs: disambiguate units for ftrace fields tagged "blkno", "block", or "bno"
      xfs: disambiguate units for ftrace fields tagged "offset"
      xfs: disambiguate units for ftrace fields tagged "len"
      xfs: disambiguate units for ftrace fields tagged "count"
      xfs: rename i_disk_size fields in ftrace output
      xfs: resolve fork names in trace output
      xfs: standardize remaining xfs_buf length tracepoints
      xfs: standardize inode generation formatting in ftrace output
      xfs: decode scrub flags in ftrace output
      xfs: start documenting common units and tags used in tracepoints
      xfs: fix perag structure refcounting error when scrub fails
      xfs: only set IOMAP_F_SHARED when providing a srcmap to a write

Dave Chinner (44):
      xfs: introduce CPU hotplug infrastructure
      xfs: introduce all-mounts list for cpu hotplug notifications
      xfs: per-cpu deferred inode inactivation queues
      mm: Add kvrealloc()
      xfs: remove kmem_alloc_io()
      xfs: replace kmem_alloc_large() with kvmalloc()
      xfs: convert XLOG_FORCED_SHUTDOWN() to xlog_is_shutdown()
      xfs: XLOG_STATE_IOERROR must die
      xfs: move recovery needed state updates to xfs_log_mount_finish
      xfs: convert log flags to an operational state field
      xfs: make forced shutdown processing atomic
      xfs: rework xlog_state_do_callback()
      xfs: separate out log shutdown callback processing
      xfs: don't run shutdown callbacks on active iclogs
      xfs: log head and tail aren't reliable during shutdown
      xfs: move xlog_commit_record to xfs_log_cil.c
      xfs: pass a CIL context to xlog_write()
      xfs: factor out log write ordering from xlog_cil_push_work()
      xfs: attach iclog callbacks in xlog_cil_set_ctx_write_state()
      xfs: order CIL checkpoint start records
      xfs: AIL needs asynchronous CIL forcing
      xfs: CIL work is serialised, not pipelined
      xfs: move the CIL workqueue to the CIL
      xfs: drop ->writepage completely
      xfs: sb verifier doesn't handle uncached sb buffer
      xfs: rename xfs_has_attr()
      xfs: rework attr2 feature and mount options
      xfs: reflect sb features in xfs_mount
      xfs: replace xfs_sb_version checks with feature flag checks
      xfs: consolidate mount option features in m_features
      xfs: convert mount flags to features
      xfs: convert remaining mount flags to state flags
      xfs: replace XFS_FORCED_SHUTDOWN with xfs_is_shutdown
      xfs: convert xfs_fs_geometry to use mount feature checks
      xfs: open code sb verifier feature checks
      xfs: convert scrub to use mount-based feature checks
      xfs: convert xfs_sb_version_has checks to use mount features
      xfs: remove unused xfs_sb_version_has wrappers
      xfs: introduce xfs_sb_is_v5 helper
      xfs: kill xfs_sb_version_has_v3inode()
      xfs: introduce xfs_buf_daddr()
      xfs: convert bp->b_bn references to xfs_buf_daddr()
      xfs: rename buffer cache index variable b_bn
      xfs: fix I_DONTCACHE

Dwaipayan Ray (1):
      xfs: cleanup __FUNCTION__ usage

 fs/xfs/kmem.c                      |  64 ----
 fs/xfs/kmem.h                      |   2 -
 fs/xfs/libxfs/xfs_ag.c             |  25 +-
 fs/xfs/libxfs/xfs_alloc.c          |  56 +--
 fs/xfs/libxfs/xfs_alloc.h          |  12 +-
 fs/xfs/libxfs/xfs_alloc_btree.c    | 100 ++---
 fs/xfs/libxfs/xfs_alloc_btree.h    |   2 +-
 fs/xfs/libxfs/xfs_attr.c           |  56 ++-
 fs/xfs/libxfs/xfs_attr.h           |   1 -
 fs/xfs/libxfs/xfs_attr_leaf.c      |  57 +--
 fs/xfs/libxfs/xfs_attr_remote.c    |  21 +-
 fs/xfs/libxfs/xfs_attr_remote.h    |   2 +-
 fs/xfs/libxfs/xfs_bmap.c           |  38 +-
 fs/xfs/libxfs/xfs_bmap_btree.c     |  56 +--
 fs/xfs/libxfs/xfs_bmap_btree.h     |   9 +-
 fs/xfs/libxfs/xfs_btree.c          | 141 +++----
 fs/xfs/libxfs/xfs_btree.h          |  56 +--
 fs/xfs/libxfs/xfs_btree_staging.c  |  14 +-
 fs/xfs/libxfs/xfs_da_btree.c       |  18 +-
 fs/xfs/libxfs/xfs_da_format.h      |   2 +-
 fs/xfs/libxfs/xfs_dir2.c           |   6 +-
 fs/xfs/libxfs/xfs_dir2_block.c     |  14 +-
 fs/xfs/libxfs/xfs_dir2_data.c      |  20 +-
 fs/xfs/libxfs/xfs_dir2_leaf.c      |  14 +-
 fs/xfs/libxfs/xfs_dir2_node.c      |  20 +-
 fs/xfs/libxfs/xfs_dir2_priv.h      |   2 +-
 fs/xfs/libxfs/xfs_dir2_sf.c        |  12 +-
 fs/xfs/libxfs/xfs_dquot_buf.c      |   8 +-
 fs/xfs/libxfs/xfs_format.h         | 226 ++---------
 fs/xfs/libxfs/xfs_ialloc.c         |  69 ++--
 fs/xfs/libxfs/xfs_ialloc.h         |   3 +-
 fs/xfs/libxfs/xfs_ialloc_btree.c   |  88 ++---
 fs/xfs/libxfs/xfs_ialloc_btree.h   |   2 +-
 fs/xfs/libxfs/xfs_inode_buf.c      |  22 +-
 fs/xfs/libxfs/xfs_inode_buf.h      |  11 +-
 fs/xfs/libxfs/xfs_log_format.h     |   6 +-
 fs/xfs/libxfs/xfs_log_recover.h    |   2 +
 fs/xfs/libxfs/xfs_log_rlimit.c     |   2 +-
 fs/xfs/libxfs/xfs_quota_defs.h     |  30 +-
 fs/xfs/libxfs/xfs_refcount.c       |  12 +-
 fs/xfs/libxfs/xfs_refcount.h       |   2 +-
 fs/xfs/libxfs/xfs_refcount_btree.c |  54 +--
 fs/xfs/libxfs/xfs_rmap.c           |  34 +-
 fs/xfs/libxfs/xfs_rmap.h           |  11 +-
 fs/xfs/libxfs/xfs_rmap_btree.c     |  72 ++--
 fs/xfs/libxfs/xfs_rmap_btree.h     |   2 +-
 fs/xfs/libxfs/xfs_rtbitmap.c       |  14 +-
 fs/xfs/libxfs/xfs_sb.c             | 263 +++++++++----
 fs/xfs/libxfs/xfs_sb.h             |   4 +-
 fs/xfs/libxfs/xfs_symlink_remote.c |  14 +-
 fs/xfs/libxfs/xfs_trans_inode.c    |   2 +-
 fs/xfs/libxfs/xfs_trans_resv.c     |  48 +--
 fs/xfs/libxfs/xfs_trans_resv.h     |   2 -
 fs/xfs/libxfs/xfs_trans_space.h    |   6 +-
 fs/xfs/libxfs/xfs_types.c          |   2 +-
 fs/xfs/libxfs/xfs_types.h          |   5 +
 fs/xfs/scrub/agheader.c            |  47 ++-
 fs/xfs/scrub/agheader_repair.c     |  66 ++--
 fs/xfs/scrub/alloc.c               |   2 +-
 fs/xfs/scrub/attr.c                |  16 +-
 fs/xfs/scrub/attr.h                |   3 -
 fs/xfs/scrub/bitmap.c              |   4 +-
 fs/xfs/scrub/bmap.c                |  41 +-
 fs/xfs/scrub/btree.c               |   9 +-
 fs/xfs/scrub/btree.h               |   4 +-
 fs/xfs/scrub/common.c              |  77 ++--
 fs/xfs/scrub/common.h              |  18 +-
 fs/xfs/scrub/dabtree.c             |   4 +-
 fs/xfs/scrub/dir.c                 |  10 +-
 fs/xfs/scrub/fscounters.c          |   6 +-
 fs/xfs/scrub/ialloc.c              |   4 +-
 fs/xfs/scrub/inode.c               |  14 +-
 fs/xfs/scrub/quota.c               |   4 +-
 fs/xfs/scrub/refcount.c            |   4 +-
 fs/xfs/scrub/repair.c              |  32 +-
 fs/xfs/scrub/rmap.c                |   2 +-
 fs/xfs/scrub/rtbitmap.c            |   2 +-
 fs/xfs/scrub/scrub.c               |  23 +-
 fs/xfs/scrub/scrub.h               |   3 +-
 fs/xfs/scrub/trace.c               |   8 +-
 fs/xfs/scrub/trace.h               |  78 ++--
 fs/xfs/xfs_acl.c                   |   2 +-
 fs/xfs/xfs_aops.c                  |  25 +-
 fs/xfs/xfs_attr_inactive.c         |   6 +-
 fs/xfs/xfs_attr_list.c             |   2 +-
 fs/xfs/xfs_bmap_item.c             |  14 +-
 fs/xfs/xfs_bmap_util.c             |  20 +-
 fs/xfs/xfs_buf.c                   |  40 +-
 fs/xfs/xfs_buf.h                   |  25 +-
 fs/xfs/xfs_buf_item.c              |   6 +-
 fs/xfs/xfs_buf_item_recover.c      |  10 +-
 fs/xfs/xfs_dir2_readdir.c          |   4 +-
 fs/xfs/xfs_discard.c               |   2 +-
 fs/xfs/xfs_dquot.c                 |  13 +-
 fs/xfs/xfs_dquot.h                 |  10 +
 fs/xfs/xfs_dquot_item.c            | 134 -------
 fs/xfs/xfs_dquot_item.h            |  17 -
 fs/xfs/xfs_dquot_item_recover.c    |   4 +-
 fs/xfs/xfs_error.c                 |   4 +-
 fs/xfs/xfs_error.h                 |  12 +
 fs/xfs/xfs_export.c                |   4 +-
 fs/xfs/xfs_extfree_item.c          |   3 +
 fs/xfs/xfs_file.c                  |  18 +-
 fs/xfs/xfs_filestream.c            |   2 +-
 fs/xfs/xfs_filestream.h            |   2 +-
 fs/xfs/xfs_fsmap.c                 |  68 ++--
 fs/xfs/xfs_fsops.c                 |  63 ++--
 fs/xfs/xfs_health.c                |   2 +-
 fs/xfs/xfs_icache.c                | 754 ++++++++++++++++++++++++++++---------
 fs/xfs/xfs_icache.h                |  14 +-
 fs/xfs/xfs_icreate_item.c          |   4 +-
 fs/xfs/xfs_inode.c                 | 102 +++--
 fs/xfs/xfs_inode.h                 |  25 +-
 fs/xfs/xfs_inode_item.c            |   2 +-
 fs/xfs/xfs_inode_item_recover.c    |   2 +-
 fs/xfs/xfs_ioctl.c                 |  33 +-
 fs/xfs/xfs_ioctl32.c               |   4 +-
 fs/xfs/xfs_iomap.c                 |  24 +-
 fs/xfs/xfs_iops.c                  |  32 +-
 fs/xfs/xfs_itable.c                |  44 ++-
 fs/xfs/xfs_iwalk.c                 |  33 +-
 fs/xfs/xfs_log.c                   | 723 ++++++++++++++++++-----------------
 fs/xfs/xfs_log.h                   |   7 +-
 fs/xfs/xfs_log_cil.c               | 450 +++++++++++++++-------
 fs/xfs/xfs_log_priv.h              |  66 ++--
 fs/xfs/xfs_log_recover.c           | 161 ++++----
 fs/xfs/xfs_mount.c                 | 233 +++++++++---
 fs/xfs/xfs_mount.h                 | 248 ++++++++++--
 fs/xfs/xfs_pnfs.c                  |   2 +-
 fs/xfs/xfs_qm.c                    |  96 +++--
 fs/xfs/xfs_qm.h                    |   3 -
 fs/xfs/xfs_qm_bhv.c                |   2 +-
 fs/xfs/xfs_qm_syscalls.c           | 253 ++-----------
 fs/xfs/xfs_quota.h                 |   2 +
 fs/xfs/xfs_quotaops.c              |  30 +-
 fs/xfs/xfs_refcount_item.c         |   5 +-
 fs/xfs/xfs_reflink.c               |   4 +-
 fs/xfs/xfs_reflink.h               |   3 +-
 fs/xfs/xfs_rmap_item.c             |   5 +-
 fs/xfs/xfs_rtalloc.c               |   6 +-
 fs/xfs/xfs_rtalloc.h               |  13 +-
 fs/xfs/xfs_super.c                 | 538 +++++++++++++++-----------
 fs/xfs/xfs_symlink.c               |  13 +-
 fs/xfs/xfs_sysfs.c                 |   1 +
 fs/xfs/xfs_trace.c                 |   2 +
 fs/xfs/xfs_trace.h                 | 386 +++++++++++++------
 fs/xfs/xfs_trans.c                 |  33 +-
 fs/xfs/xfs_trans_ail.c             |  19 +-
 fs/xfs/xfs_trans_buf.c             |   8 +-
 fs/xfs/xfs_trans_dquot.c           |  51 +--
 include/linux/cpuhotplug.h         |   1 +
 include/linux/mm.h                 |   2 +
 mm/util.c                          |  15 +
 153 files changed, 4082 insertions(+), 3201 deletions(-)

Comments

Linus Torvalds Sept. 2, 2021, 3:47 p.m. UTC | #1
On Tue, Aug 31, 2021 at 2:18 PM Darrick J. Wong <djwong@kernel.org> wrote:
>
> As for new features: we now batch inode inactivations in percpu
> background threads, which sharply decreases frontend thread wait time
> when performing file deletions and should improve overall directory tree
> deletion times.

So no complaints on this one, but I do have a reaction: we have a lot
of these random CPU hotplug events, and XFS now added another one.

I don't see that as a problem, but just the _randomness_ of these
callbacks makes me go "hmm". And that "enum cpuhp_state" thing isn't
exactly a thing of beauty, and just makes me think there's something
nasty going on.

For the new xfs usage, I really get the feeling that it's not that XFS
actually cares about the CPU states, but that this is literally tied
to just having percpu state allocated and active, and that maybe it
would be sensible to have something more specific to that kind of use.

We have other things that are very similar in nature - like the page
allocator percpu caches etc, which for very similar reasons want cpu
dead/online notification.

I'm only throwing this out as a reaction to this - I'm not sure
another interface would be good or worthwhile, but that "enum
cpuhp_state" is ugly enough that I thought I'd rope in Thomas for CPU
hotplug, and the percpu memory allocation people for comments.

IOW, just _maybe_ we would want to have some kind of callback model
for "percpu_alloc()" and it being explicitly about allocations
becoming available or going away, rather than about CPU state.
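
To make that contrast concrete, the first two calls below are what a
subsystem does today; the third is a purely hypothetical sketch of the
suggestion (the name is made up and no such interface exists):

        /* Today: the percpu allocation and the hotplug notification
         * are wired up separately. */
        foo = alloc_percpu(struct foo_pcpu);
        cpuhp_setup_state(CPUHP_BP_PREPARE_DYN, "foo:dead",
                          NULL, foo_cpu_dead);

        /* Hypothetical shape of the suggestion: callbacks tied to the
         * allocation itself, invoked as per-CPU slots come and go.
         * (Made-up name, does not exist today.) */
        foo = alloc_percpu_cb(struct foo_pcpu, foo_cpu_online, foo_cpu_dead);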

Comments?

> Lastly, with this release, two new features have graduated to supported
> status: inode btree counters (for faster mounts), and support for dates
> beyond Y2038.

Oh, I had thought Y2038 was already a non-issue for xfs. Silly me.

              Linus
pr-tracker-bot@kernel.org Sept. 2, 2021, 5:37 p.m. UTC | #2
The pull request you sent on Tue, 31 Aug 2021 14:18:47 -0700:

> git://git.kernel.org/pub/scm/fs/xfs/xfs-linux.git tags/xfs-5.15-merge-6

has been merged into torvalds/linux.git:
https://git.kernel.org/torvalds/c/90c90cda05aecf0f7c45f9f35384b31bba38455f

Thank you!
Darrick J. Wong Sept. 2, 2021, 5:43 p.m. UTC | #3
On Thu, Sep 02, 2021 at 08:47:42AM -0700, Linus Torvalds wrote:
> On Tue, Aug 31, 2021 at 2:18 PM Darrick J. Wong <djwong@kernel.org> wrote:
> >
> > As for new features: we now batch inode inactivations in percpu
> > background threads, which sharply decreases frontend thread wait time
> > when performing file deletions and should improve overall directory tree
> > deletion times.
> 
> So no complaints on this one, but I do have a reaction: we have a lot
> of these random CPU hotplug events, and XFS now added another one.
> 
> I don't see that as a problem, but just the _randomness_ of these
> callbacks makes me go "hmm". And that "enum cpuhp_state" thing isn't
> exactly a thing of beauty, and just makes me think there's something
> nasty going on.
> 
> For the new xfs usage, I really get the feeling that it's not that XFS
> actually cares about the CPU states, but that this is literally tied
> to just having percpu state allocated and active, and that maybe it
> would be sensible to have something more specific to that kind of use.

Correct -- we don't really care about cpu state at all; all xfs needs is
to push batched work items on a per-cpu list to another cpu when a cpu
goes offline.  I didn't see anything that looked like it handled that
kind of thing, so ... cpuhp_state it was. :/
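
To make that concrete, the shape of what we want is roughly the sketch
below -- heavily simplified, with made-up names, and not the actual XFS
code: a per-cpu llist of queued work plus a CPU-dead callback that
splices the dead CPU's list onto a live CPU and kicks the worker:

#include <linux/cpuhotplug.h>
#include <linux/llist.h>
#include <linux/percpu.h>
#include <linux/workqueue.h>

struct inodegc_pcpu {
        struct llist_head       list;   /* work items queued on this CPU */
        struct work_struct      work;   /* worker that drains the list */
};

/* Allocation and INIT_WORK of the per-cpu struct are omitted here. */
static struct inodegc_pcpu __percpu *inodegc;

/* Runs on a surviving CPU after @dead_cpu has gone offline. */
static int inodegc_cpu_dead(unsigned int dead_cpu)
{
        struct inodegc_pcpu *dead = per_cpu_ptr(inodegc, dead_cpu);
        struct inodegc_pcpu *live;
        struct llist_node *first, *last;

        /* Steal everything the dead CPU had queued... */
        first = llist_del_all(&dead->list);
        if (!first)
                return 0;

        last = first;
        while (last->next)
                last = last->next;

        /* ...append it to this CPU's list and make sure the worker runs. */
        live = get_cpu_ptr(inodegc);
        llist_add_batch(first, last, &live->list);
        schedule_work(&live->work);
        put_cpu_ptr(inodegc);
        return 0;
}

/* Offline notification only: no online callback is needed. */
static int inodegc_hotplug_init(void)
{
        int ret = cpuhp_setup_state_nocalls(CPUHP_BP_PREPARE_DYN,
                                            "example:inodegc:dead",
                                            NULL, inodegc_cpu_dead);
        return ret < 0 ? ret : 0;
}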

> We have other things that are very similar in nature - like the page
> allocator percpu caches etc, which for very similar reasons want cpu
> dead/online notification.
> 
> I'm only throwing this out as a reaction to this - I'm not sure
> another interface would be good or worthwhile, but that "enum
> cpuhp_state" is ugly enough that I thought I'd rope in Thomas for CPU
> hotplug, and the percpu memory allocation people for comments.
> 
> IOW, just _maybe_ we would want to have some kind of callback model
> for "percpu_alloc()" and it being explicitly about allocations
> becoming available or going away, rather than about CPU state.
> 
> Comments?

Seems like a good fit for us, though I'll let Dave Chinner chime in
since he's the one with more per-cpu list patches coming up.

> > Lastly, with this release, two new features have graduated to supported
> > status: inode btree counters (for faster mounts), and support for dates
> > beyond Y2038.
> 
> Oh, I had thought Y2038 was already a non-issue for xfs. Silly me.

It's been a new feature in upstream for a year now.  We're merely taking
down the scary warnings that using this new code might result in a
subspace vortex opening in the skies or that all trains bound for
Moynihan end up on track 19 or wherever. ;)

--D

>               Linus
Thomas Gleixner Sept. 2, 2021, 7:13 p.m. UTC | #4
On Thu, Sep 02 2021 at 08:47, Linus Torvalds wrote:
> On Tue, Aug 31, 2021 at 2:18 PM Darrick J. Wong <djwong@kernel.org> wrote:
>>
>> As for new features: we now batch inode inactivations in percpu
>> background threads, which sharply decreases frontend thread wait time
>> when performing file deletions and should improve overall directory tree
>> deletion times.
>
> So no complaints on this one, but I do have a reaction: we have a lot
> of these random CPU hotplug events, and XFS now added another one.
>
> I don't see that as a problem, but just the _randomness_ of these
> callbacks makes me go "hmm". And that "enum cpuhp_state" thing isn't
> exactly a thing of beauty, and just makes me think there's something
> nasty going on.

It's not beautiful, but it's at least well defined in terms of ordering.

Though if the entity which needs the callback does not care about
ordering against other callbacks and just cares about the CPU state, in
this case DEAD, then the explicit entry can be avoided and a dynamic
entry can be requested:

     state = cpuhp_setup_state(CPUHP_BP_PREPARE_DYN, "xfs:foo", NULL, xfs_dead);

We have also a dynamic range for the online part (CPUHP_AP_ONLINE_DYN)
which runs on the incoming or outgoing CPU. That spares the explicit
entries in the enum.

I assume most of the prepare/dead states have no ordering requirement at
all, so those could be converted to the dynamic range. But for the
online ones which run on the plugged CPU we have quite some ordering
constraints and that's where the explicit states matter.

That surely could be consolidated a bit if we pack the mutually
exclusive ones (timers, interrupt controllers, perf), but the question
is whether such packing (ifdeffery or arch/platform specific includes)
would make it more beautiful. The only thing we'd spare would be some
bytes in the actual state table in the core code. Whether that's worth
it, I don't know.

> For the new xfs usage, I really get the feeling that it's not that XFS
> actually cares about the CPU states, but that this is literally tied
> to just having percpu state allocated and active, and that maybe it
> would be sensible to have something more specific to that kind of use.
>
> We have other things that are very similar in nature - like the page
> allocator percpu caches etc, which for very similar reasons want cpu
> dead/online notification.
>
> I'm only throwing this out as a reaction to this - I'm not sure
> another interface would be good or worthwhile, but that "enum
> cpuhp_state" is ugly enough that I thought I'd rope in Thomas for CPU
> hotplug, and the percpu memory allocation people for comments.

It's not only about memory. 

> IOW, just _maybe_ we would want to have some kind of callback model
> for "percpu_alloc()" and it being explicitly about allocations
> becoming available or going away, rather than about CPU state.

The per cpu storage in XFS does not go away. It contains a llist head
and the queued work items need to be moved from the dead CPU to an alive
CPU and exposed to a work queue for processing. Similar to what we do
with timers, hrtimers and other stuff.

If there are callbacks which are doing pretty much the same thing, then
I'm all for a generic infrastructure for these.

Thanks,

        tglx
Dave Chinner Sept. 2, 2021, 10:35 p.m. UTC | #5
On Thu, Sep 02, 2021 at 10:43:11AM -0700, Darrick J. Wong wrote:
> On Thu, Sep 02, 2021 at 08:47:42AM -0700, Linus Torvalds wrote:
> > On Tue, Aug 31, 2021 at 2:18 PM Darrick J. Wong <djwong@kernel.org> wrote:
> > >
> > > As for new features: we now batch inode inactivations in percpu
> > > background threads, which sharply decreases frontend thread wait time
> > > when performing file deletions and should improve overall directory tree
> > > deletion times.
> > 
> > So no complaints on this one, but I do have a reaction: we have a lot
> > of these random CPU hotplug events, and XFS now added another one.
> > 
> > I don't see that as a problem, but just the _randomness_ of these
> > callbacks makes me go "hmm". And that "enum cpuhp_state" thing isn't
> > exactly a thing of beauty, and just makes me think there's something
> > nasty going on.
> > 
> > For the new xfs usage, I really get the feeling that it's not that XFS
> > actually cares about the CPU states, but that this is literally tied
> > to just having percpu state allocated and active, and that maybe it
> > would be sensible to have something more specific to that kind of use.
> 
> Correct -- we don't really care about cpu state at all; all xfs needs is
> to push batched work items on a per-cpu list to another cpu when a cpu
> goes offline.  I didn't see anything that looked like it handled that
> kind of thing, so ... cpuhp_state it was. :/

Yeah, it appeared to me that cpuhp_state is the replacement for the
simple, old register_cpu_notifier() hotplug interface that we've
used in the past in XFS (e.g. for per-cpu superblock counters back
in 2006). So, like many other subsystems, I just hooked into it with
a subsystem notifier so we don't have to modify global headers in
future for new internal CPU dead notifiers because, as Darrick
mentions, I have more percpu based modifications coming up for
XFS...


> > We have other things that are very similar in nature - like the page
> > allocator percpu caches etc, which for very similar reasons want cpu
> > dead/online notification.
> > 
> > I'm only throwing this out as a reaction to this - I'm not sure
> > another interface would be good or worthwhile, but that "enum
> > cpuhp_state" is ugly enough that I thought I'd rope in Thomas for CPU
> > hotplug, and the percpu memory allocation people for comments.

Calling it ugly is being kind. :/

The part I dislike most about it is that we have to modify a header
file that triggers full kernel rebuilds. Managing patch stacks and
branches where one of them modifies such a header file means quick,
XFS subsystem only kernel rebuilds are a rare thing...

> > IOW, just _maybe_ we would want to have some kind of callback model
> > for "percpu_alloc()" and it being explicitly about allocations
> > becoming available or going away, rather than about CPU state.
> > 
> > Comments?
> 
> Seems like a good fit for us, though I'll let Dave Chinner chime in
> since he's the one with more per-cpu list patches coming up.

I think this might just be semantics - I understand the intent of
Linus's suggestion is to provide a hotplug callback with a per-cpu
allocation region. The thing that we rely on here is, however, that
per-cpu allocation regions are static for the life of the allocation
and always cover all possible CPUs. Hence the callbacks we want here
are really about CPU state changes and I'm not convinced it's the
best idea to conflate CPU state change callbacks with memory
allocation...

That said, I'm all for a better interface to the CPU hotplug
notifications. The current interface is ... esoteric and to
understand how to use it effectively requires becoming a CPU hotplug
expert. There's something to be said for the simplicity of the old
register_cpu_notifier() interface we used to have...
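
For anyone who never used it, that old interface boiled down to a single
notifier callback switching on an action code -- roughly like this, from
memory, with illustrative names (the API was removed in the hotplug
rework):

#include <linux/cpu.h>
#include <linux/notifier.h>

static int foo_cpu_notify(struct notifier_block *nfb, unsigned long action,
                          void *hcpu)
{
        switch (action) {
        case CPU_DEAD:
        case CPU_DEAD_FROZEN:
                /* (unsigned long)hcpu is the CPU number; drain its
                 * per-cpu state here. */
                break;
        default:
                break;
        }
        return NOTIFY_OK;
}

static struct notifier_block foo_cpu_notifier = {
        .notifier_call = foo_cpu_notify,
};

/* Registration was a single call: register_cpu_notifier(&foo_cpu_notifier); */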

Cheers,

Dave.
Christoph Hellwig Sept. 3, 2021, 4:29 a.m. UTC | #6
On Thu, Sep 02, 2021 at 09:13:24PM +0200, Thomas Gleixner wrote:
> > I'm only throwing this out as a reaction to this - I'm not sure
> > another interface would be good or worthwhile, but that "enum
> > cpuhp_state" is ugly enough that I thought I'd rope in Thomas for CPU
> > hotplug, and the percpu memory allocation people for comments.
> 
> It's not only about memory. 
> 
> > IOW, just _maybe_ we would want to have some kind of callback model
> > for "percpu_alloc()" and it being explicitly about allocations
> > becoming available or going away, rather than about CPU state.
> 
> The per cpu storage in XFS does not go away. It contains a llist head
> and the queued work items need to be moved from the dead CPU to an alive
> CPU and exposed to a work queue for processing. Similar to what we do
> with timers, hrtimers and other stuff.
> 
> If there are callbacks which are doing pretty much the same thing, then
> I'm all for a generic infrastructure for these.

In the block layer we've added a new per-cpu bio list, for which
the dead callback literally does nothing but free some memory.
For that case a simple callback would be neat, but I'm not sure how
common that is.
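
For illustration, what that looks like today is roughly the following
(made-up names, not the actual block layer code):

#include <linux/llist.h>
#include <linux/percpu.h>
#include <linux/slab.h>

struct cached_obj {
        struct llist_node       lnode;
        /* cached payload */
};

static struct llist_head __percpu *obj_cache;

/* The dead callback does nothing but free what the departed CPU cached. */
static int obj_cache_cpu_dead(unsigned int cpu)
{
        struct llist_node *node = llist_del_all(per_cpu_ptr(obj_cache, cpu));
        struct cached_obj *obj, *tmp;

        llist_for_each_entry_safe(obj, tmp, node, lnode)
                kfree(obj);
        return 0;
}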
Thomas Gleixner Sept. 3, 2021, 6:26 a.m. UTC | #7
Dave,

On Fri, Sep 03 2021 at 08:35, Dave Chinner wrote:
> On Thu, Sep 02, 2021 at 10:43:11AM -0700, Darrick J. Wong wrote:
> The part I dislike most about it is that we have to modify a header
> file that triggers full kernel rebuilds. Managing patch stacks and
> branches where one of them modifies such a header file means quick,
> XFS subsystem only kernel rebuilds are a rare thing...

If you don't care about ordering, you can avoid touching the global
header completely. The dynamic state ranges in PREPARE and ONLINE
provide exactly what you want. It's documented.

> That said, I'm all for a better interface to the CPU hotplug
> notifications. THe current interface is ... esoteric and to

What's so esoteric about:

       state = cpuhp_setup_state(CPUHP_BP_PREPARE_DYN, "xfs:prepare", func1, func2);
       state = cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "xfs:online", func3, func4);

Only if you care about callback ordering vs. other subsystems, then adding
the state in the global header is required. It's neither the end of the
world, nor is it rocket science and requires expert knowledge to do so.

> understand how to use it effectively requires becoming a CPU hotplug
> expert.

  https://www.kernel.org/doc/html/latest/core-api/cpu_hotplug.html

If there is something missing in that documentation which makes you
think you need to become a CPU hotplug expert, please let me know. I'm
happy to expand that document.

> There's something to be said for the simplicity of the old
> register_cpu_notifier() interface we used to have...

There is a lot to be said about it. The simplicity of it made people do
the most hilarious things to deal with:

  - Ordering issues including build order dependencies
  - Asymmetry between bringup and teardown
  - The inability to test state transitions
  - ....

Back then when we converted the notifier mess 35 of ~140 hotplug
notifiers (i.e. ~25%) contained bugs of all sorts. Quite some of them
were caused by the well understood simplicity of the hotplug notifier
mechanics. I'm surely not missing any of that.

Thanks,

        tglx
Dennis Zhou Sept. 3, 2021, 6:40 p.m. UTC | #8
Hello,

On Thu, Sep 02, 2021 at 08:47:42AM -0700, Linus Torvalds wrote:
> On Tue, Aug 31, 2021 at 2:18 PM Darrick J. Wong <djwong@kernel.org> wrote:
> >
> > As for new features: we now batch inode inactivations in percpu
> > background threads, which sharply decreases frontend thread wait time
> > when performing file deletions and should improve overall directory tree
> > deletion times.
> 
> So no complaints on this one, but I do have a reaction: we have a lot
> of these random CPU hotplug events, and XFS now added another one.
> 
> I don't see that as a problem, but just the _randomness_ of these
> callbacks makes me go "hmm". And that "enum cpuhp_state" thing isn't
> exactly a thing of beauty, and just makes me think there's something
> nasty going on.
> 
> For the new xfs usage, I really get the feeling that it's not that XFS
> actually cares about the CPU states, but that this is literally tied
> to just having percpu state allocated and active, and that maybe it
> would be sensible to have something more specific to that kind of use.
> 
> We have other things that are very similar in nature - like the page
> allocator percpu caches etc, which for very similar reasons want cpu
> dead/online notification.
> 
> I'm only throwing this out as a reaction to this - I'm not sure
> another interface would be good or worthwhile, but that "enum
> cpuhp_state" is ugly enough that I thought I'd rope in Thomas for CPU
> hotplug, and the percpu memory allocation people for comments.
> 
> IOW, just _maybe_ we would want to have some kind of callback model
> for "percpu_alloc()" and it being explicitly about allocations
> becoming available or going away, rather than about CPU state.
> 
> Comments?
> 

I think there are 2 pieces here from percpu's side:
A) Onlining and offlining state related to a percpu alloc.
B) Freeing backing memory of an allocation wrt hot plug.

An RFC was sent out for B) in [1] and you need A) for B).
I can see percpu having a callback model for basic allocations that are
independent, but for anything more complex, that subsystem would need to
register with hotplug anyway. It appears percpu_counter already has hot
plug support. percpu_refcount could be extended as well, but more
complex initialization like the runqueues and slab related allocations
would require work. In short, yes I think A) is doable/reasonable.

Freeing the backing memory for A) seems trickier. We would have to
figure out a clean way to handle onlining/offlining racing with new
percpu allocations (adding or removing pages for the corresponding cpu's
chunk). To support A), init and onlining/offlining can be separate
phases, but for B) init/freeing would have to be rolled into
onlining/offlining.

Without freeing, it's not incorrect for_each_online_cpu() to read a dead
cpu's percpu values, but with freeing it does.

I guess to summarize, A) seems like it might be a good idea with
init/destruction happening at allocation/freeing times. I'm a little
skeptical of B) in terms of complexity. If y'all think it's a good idea
I can look into it again.

[1] https://lore.kernel.org/lkml/20210601065147.53735-1-bharata@linux.ibm.com/

Thanks,
Dennis

> > Lastly, with this release, two new features have graduated to supported
> > status: inode btree counters (for faster mounts), and support for dates
> > beyond Y2038.
> 
> Oh, I had thought Y2038 was already a non-issue for xfs. Silly me.
> 
>               Linus
Dave Chinner Sept. 5, 2021, 12:21 a.m. UTC | #9
On Fri, Sep 03, 2021 at 08:26:58AM +0200, Thomas Gleixner wrote:
> Dave,
> 
> On Fri, Sep 03 2021 at 08:35, Dave Chinner wrote:
> > On Thu, Sep 02, 2021 at 10:43:11AM -0700, Darrick J. Wong wrote:
> > The part I dislike most about it is that we have to modify a header
> > file that triggers full kernel rebuilds. Managing patch stacks and
> > branches where one of them modifies such a header file means quick,
> > XFS subsystem only kernel rebuilds are a rare thing...
> 
> If you don't care about ordering, you can avoid touching the global
> header completely. The dynamic state ranges in PREPARE and ONLINE
> provide exactly what you want. It's documented.

Ordering? When and why would I care about ordering?
il_last_pushed_lsn
> 
> > That said, I'm all for a better interface to the CPU hotplug
> > notifications. THe current interface is ... esoteric and to
> 
> What's so esoteric about:
> 
>        state = cpuhp_setup_state(CPUHP_BP_PREPARE_DYN, "xfs:prepare", func1, func2);
>        state = cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "xfs:online", func3, func4);

I don't want -online- notifications.  I only want _offline_
notifications and according to the documentation,
CPUHP_AP_ONLINE_DYN gets called on both online and offline state
changes.

Don't you see the cognitive dissonance that contradictory "use
online for offline" API naming like this causes?  It easily scores
negative points on Rusty's API scale....
(http://sweng.the-davies.net/Home/rustys-api-design-manifesto)

Also, having to understand what the multiple callbacks are for, just
for different operations, is a bit of a WTF. What's the actual
difference between the "online" and "prepare down" callbacks?
For online notifications, the prepare down op is documented as the
online hotplug error handling function that undoes the online
callback.

But if we are registering an -offline notification-, their use isn't
actually documented. Is it the same, or is it inverted? I have to go
read the code...

That is then followed by this gem:

"The callback can be remove by invoking cpuhp_remove_state(). In
case of a dynamically allocated state (CPUHP_AP_ONLINE_DYN) use the
returned state. During the removal of a hotplug state the teardown
callback will be invoked."

What does "use the returned state" mean? What returned
state? Where did it come from? It's not defined anywhere. Then
there's "the teardown callback will be invoked" - that's the first
reference to a "teardown callback" in the documentation. I have to
assume it means the "prepare_down" callback, but....

... then I wonder: the prepare_down callback is per-cpu. Does this
mean that when we remove a hp notifier, the prepare_down callback is
called for every CPU? Or something else? It's not documented, I've
got to go read the code just to work out the basic, fundamental
behaviour of the API I'm trying to use....

> Only if you care about callback ordering vs. other subsystems, then adding
> the state in the global header is required. It's neither the end of the
> world, nor is it rocket science and requires expert knowledge to do so.
> 
> > understand how to use it effectively requires becoming a CPU hotplug
> > expert.
> 
>   https://www.kernel.org/doc/html/latest/core-api/cpu_hotplug.html
> 
> If there is something missing in that documentation which makes you
> think you need to become a CPU hotplug expert, please let me know. I'm
> happy to expand that document.

Deja vu. It's memory-ordering all over again.

The fundamental problem is documentation is written by experts in
the subject matter and, as such, is full of implicit, unspoken
knowledge the reader needs to know before the documentation makes
sense. It is written in a way that only experts in the subject
matter actually understand because only they have the underlying
knowledge to fill in the blanks. And, worst of all, said experts get
upset and obnoxiously defensive when someone dares to say that it's
not perfect.

You might not think that using CPUHP_AP_ONLINE_DYN for CPU offline
events is hard to understand because you know the intimate details
of the implementation (i.e. the offline events are the reverse order
state transitions of online events). But for someone who hasn't
touched the CPU hotplug infrastructure in several years, it's
totally baroque.

I still have little idea of what a "dynamically allocated state" is
in the CPU hotplug model vs an ordered one. It's not defined in the
documentation, nor is it explained how, why and when each should be
used. No examples are given as to when dynamic vs static order is
preferred or needed, and there's nothing in the documentation to
tell you how to just do offline notification.

Hence we end up with code like this:

void __init page_writeback_init(void)
{
        BUG_ON(wb_domain_init(&global_wb_domain, GFP_KERNEL));

        cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "mm/writeback:online",
                          page_writeback_cpu_online, NULL);
        cpuhp_setup_state(CPUHP_MM_WRITEBACK_DEAD, "mm/writeback:dead", NULL,
                          page_writeback_cpu_online);
}

Which mixes a dynamic notifier for CPU online, followed by a
specifically ordered offline notifier. Yet both call the same
"online" function, one as a notifier, the other as a "teardown"
callback. But in both cases, neither is used as a "teardown" for a
failed hotplug case.

The WTF level here is sky high. Taken at face value it makes no
sense at all because it uses the same function for online and
offline events. According to the documentation, neither notifier
handles hotplug failure, and there's absolutely no clear reason for
why one event is dynamic and the other is static.

This is what makes it a terrible API: from my perspective, it seems
almost impossible to use correctly even though I've read the
documentation, spent a bunch of time reading the code, and tried
hard to do the right thing. That's a -9 or -10 on the Rusty API
scale...

> > There's something to be said for the simplicity of the old
> > register_cpu_notifier() interface we used to have...
> 
> There is a lot to be said about it. The simplicity of it made people do
> the most hillarious things to deal with:
> 
>   - Ordering issues including build order dependencies
>   - Asymetry between bringup and teardown
>   - The inability to test state transitions
>   - ....
>
> Back then when we converted the notifier mess 35 of ~140 hotplug
> notifiers (i.e. ~25%) contained bugs of all sorts. Quite some of them
> were caused by the well understood simplicity of the hotplug notifier
> mechanics. I'm surely not missing any of that.

You're conflating implementation problems with "API is unusable".
The API was very easy to understand and use, and those
implementation difficulties (like ordering and symmetry) could have
easily been fixed just by having a notifier block per defined
transition, rather than multiplexing all state transitions into
a single notifier...

Indeed, that's the core difference between that old API and the
current API - the current API requires registering a notifier per
state transition, but that registers the notifier for both CPU up
and down transitions.

The problem with the new API is that the requirement for symmetry in
some subsystems has bled into the API, and so now all the subsystems
that *don't need/want symmetry* have to juggle some undocumented
esoteric combination of state definitions and callbacks to get the
behaviour they require. And that, AFAICT, means that those callbacks
can't handle failures in hotplug processing properly.

So rather than having a nice, simple "one callback per event" API,
we've got this convoluted thing that behaves according to a
combination of state definitions and callback definitions. Then the
API is duplicated into "_nocalls()" variants (not documented!)
because many subsystems do not want hotplug callbacks run on
setup/teardown of hotplug events.

The old hotplug notification *implementation* had problems, but the
*API* was not the cause of those bugs. In contrast, the current API
appears to make it impossible to implement notifiers for certain use
cases correctly, and that's directly where my statement that "you
need to be a cpuhp expert to use this" comes from....

Cheers,

Dave.
Thomas Gleixner Sept. 5, 2021, 11:28 p.m. UTC | #10
Dave,

On Sun, Sep 05 2021 at 10:21, Dave Chinner wrote:
> On Fri, Sep 03, 2021 at 08:26:58AM +0200, Thomas Gleixner wrote:
>> On Fri, Sep 03 2021 at 08:35, Dave Chinner wrote:
>> > On Thu, Sep 02, 2021 at 10:43:11AM -0700, Darrick J. Wong wrote:
>> > The part I dislike most about it is that we have to modify a header
>> > file that triggers full kernel rebuilds. Managing patch stacks and
>> > branches where one of them modifies such a header file means quick,
>> > XFS subsystem only kernel rebuilds are a rare thing...
>> 
>> If you don't care about ordering, you can avoid touching the global
>> header completely. The dynamic state ranges in PREPARE and ONLINE
>> provide exactly what you want. It's documented.
>
> Ordering? When and why would I care about ordering?

Quite some hotplug related functionality cares about ordering of the
callbacks vs. other subsystems or core functionality:

 - workqueue startup/shutdown has to be at a well defined place

 - the PERF core online callback needs to be invoked before the
   various PERF drivers callbacks. On teardown the ordering is obviously
   reverse: Notify drivers before code.

 - low level startup/teardown code has very strict ordering requirements

 - ....

If you don't have ordering requirements and you just want to be notified
then you don't have to think about that at all. Set up a dynamic state
and be done with it.

> il_last_pushed_lsn

I assume that I'm not supposed to figure out how that is related to the
topic we are discussing. :)

>> > That said, I'm all for a better interface to the CPU hotplug
>> > notifications. THe current interface is ... esoteric and to
>> 
>> What's so esoteric about:
>> 
>>        state = cpuhp_setup_state(CPUHP_BP_PREPARE_DYN, "xfs:prepare", func1, func2);
>>        state = cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "xfs:online", func3, func4);
>
> I don't want -online- notifications.  I only want _offline_
> notifications and according to the documentation,
> CPUHP_AP_ONLINE_DYN get called on both online and offline state
> changes.

So if you want only offline notification then you simply hand in NULL
for the bringup callback.

        state = cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "xfs:offline", NULL, func4);

> Don't you see the cognitive dissonance that contradictory "use
> online for offline" API naming like this causes.

It's so amazingly hard to understand that offline is the reverse operation
of online, right?

> It easily scores negative points on the Rusty's API scale....

I'm surely impressed by the APIs which Rusty constructed.

The CPU hotplug notifier register/unregister API was surely trivial and
intuitive, but the rest was a horror show if you had to do something
more complex than 'let me know that the CPU is dead'.

Rusty admitted himself that the original CPU hotplug design was a
steaming pile of s... including the 'so simple' API. See below.

> Also, having to understand what the multiple callbacks
> just for different operations is a bit of a WTF. What's the actual
> difference between the "online" and "prepare down" callbacks?
> For online notifications, the prepare down op is documented as the
> online hotplug error handling function that undoes the online
> callback.
>
> But if we are registering an -offline notification-, their use isn't
> actually documented. Is it the same, or is it inverted? I have to go
> read the code...

The documentation sucks. I told you that I'm happy to update it. See
patch below.

> Indeed, that's the core difference between that old API and the
> current API - the current API requires registering a notifier per
> state transition, but that registers the notifier for both CPU up
> and down transitions.

That registers a callback for startup and teardown and either one can be
NULL if not required.

> The problem with the new API is that the requirement for symmetry in
> some subsystems has bled into the API, and so now all the subsystems
> that *don't need/want symmetry* have to juggle some undocumented
> esoteric combination of state definitions and callbacks to get the
> behaviour they require.

What kind of esoteric things do you need? Can you explain the
requirements and why you think it's hard to solve?

Your handwaving here is more than counterproductive.

> And that, AFAICT, means that those callbacks can't handle failures in
> hotplug processing properly.

Why so?

Here is an example for online:

        [state 1]->startup()    -> success
        [state 2]->startup()    -> success
        [state 3]               -> skipped because startup is NULL 
        [state 4]->startup()    -> fail
        [state 3]->teardown()
        [state 2]               -> skipped because teardown is NULL
        [state 1]->teardown()

If state 2 does not provide a teardown callback then there is obviously
nothing to tear down. Same for state 3 on startup.

What does not work correctly in the error case?

It does not matter at all whether the transition from state 3 to state 0
happens due to a CPU offline operation or because the online operation
got aborted at state 4. It's exactly the same thing. It's that simple,
really.

If you think that you need a 'bringup failed' notification for the
subsystem which set up state 4 or a 'bringup failed' notification for
the subsystem which registered state 2 then you are really doing
something fundamentally wrong.

I know that the 'oh so simple and well designed' original notifier sent
these 'failed' notifications, but they were sent for the completely
wrong reasons and were part of the overall problem.

I'm obviously failing to understand how a 'simple' single callback
interface which forces the user to handle a gazillion obscure events
is so much superior to a straightforward linear state machine.

> So rather than having a nice, simple "one callback per event" API,
> we've got this convoluted thing that behaves according to a
> combination of state definition and callback defintions. Then the
> API is duplicated into "_nocall()" variants (not documented!)
> because many subsystems do not want hotplug callbacks run on
> setup/teardown of hotplug events.

About 70% of the state setups utilize the callback invocations, which
means we don't have to have hundreds of duplicated and probably
differently buggy copies of that code doing it manually all over the place.

That's a clear maintenance and code correctness win and that definitely
justifies an extra inline variant for those usage sites which don't
need/want those invocations for hopefully well thought out reasons.

> The old hotplug notification *implementation* had problems, but the
> *API* was not the cause of those bugs.

The state transitions, i.e. the @reason (event) argument to the
callbacks were definitely part of the API and that was where the main
problem came from.

At the end we had _seventeen_ different reasons (events) because people
added new ones over and over to duct tape the issues of the existing
pile. What the hell is simple about that?

Most of these reasons and their possible transition schemes were
undocumented and not understandable at all. That and the assbackwards
failure handling turned that thing into an unmaintainable and unfixable
nightmare.

The fact that 25% of all notifier callbacks were buggy speaks for
itself. Not to talk about the locking issues which got papered over by
the fact that the hotplug locking was hidden from lockdep, the absurd
assumptions about invocation context and the workarounds to handle
callback invocations correctly when registering or unregistering a
notifier.

The charm of this "simple" API ended right there at the register /
unregister function invocation level. Nice for the occasional user with
simple requirements and a nightmare for those people who had to debug
the fallout of this simplicity.

Keep it simple is surely a good engineering principle, but if the
simplicity is restricted to the least important part of the problem then
it turns into a nightmare.

> In contrast, the current API appears to make it impossible to
> implement notifiers for certain use cases correctly, and that's
> directly where my statement that "you need to be a cpuhp expert to use
> this" comes from....

Just for the record:

 - When we converted all notifiers over there was not a single instance
   which could not be handled in a straightforward way, and in the vast
   majority of cases the code got simpler, smaller and easier to understand.

 - After the initial conversion more than a hundred state setups were
   added without a single complaint about the supposedly complex API and
   no one has had to become a hotplug expert since then.

 - The number of obscure and hard to debug CPU hotplug bugs has gone
   close to zero since then. The obvious programming bugs in the core,
   the callbacks and API usage are just the usual programming errors
   which are not specific to CPU hotplug at all.

I'm sorry that this change which turned CPU hotplug into a reliable,
testable and instrumentable mechanism causes so much trouble for you. I
hope it's just the lack of coherent documentation which made you
unhappy.

If the updated documentation does not answer your questions, please let
me know and please provide a coherent explanation of the problem you are
trying to solve. Either I can give you an hint or I can identify further
issues in the documentation.

If it turns out that there are functional shortcomings then I'm of
course all ears as well.

If you need a convenience API to install multiple states at once to
regain the "simple API" feeling, please let me know - I surely have some
ideas.

Thanks,

        tglx
---
--- a/Documentation/core-api/cpu_hotplug.rst
+++ b/Documentation/core-api/cpu_hotplug.rst
@@ -156,95 +156,479 @@ hotplug states will be invoked, starting
 * Once all services are migrated, kernel calls an arch specific routine
   ``__cpu_disable()`` to perform arch specific cleanup.
 
-Using the hotplug API
----------------------
-It is possible to receive notifications once a CPU is offline or onlined. This
-might be important to certain drivers which need to perform some kind of setup
-or clean up functions based on the number of available CPUs: ::
-
-  #include <linux/cpuhotplug.h>
-
-  ret = cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "X/Y:online",
-                          Y_online, Y_prepare_down);
-
-*X* is the subsystem and *Y* the particular driver. The *Y_online* callback
-will be invoked during registration on all online CPUs. If an error
-occurs during the online callback the *Y_prepare_down* callback will be
-invoked on all CPUs on which the online callback was previously invoked.
-After registration completed, the *Y_online* callback will be invoked
-once a CPU is brought online and *Y_prepare_down* will be invoked when a
-CPU is shutdown. All resources which were previously allocated in
-*Y_online* should be released in *Y_prepare_down*.
-The return value *ret* is negative if an error occurred during the
-registration process. Otherwise a positive value is returned which
-contains the allocated hotplug for dynamically allocated states
-(*CPUHP_AP_ONLINE_DYN*). It will return zero for predefined states.
-
-The callback can be remove by invoking ``cpuhp_remove_state()``. In case of a
-dynamically allocated state (*CPUHP_AP_ONLINE_DYN*) use the returned state.
-During the removal of a hotplug state the teardown callback will be invoked.
-
-Multiple instances
-~~~~~~~~~~~~~~~~~~
-If a driver has multiple instances and each instance needs to perform the
-callback independently then it is likely that a ''multi-state'' should be used.
-First a multi-state state needs to be registered: ::
-
-  ret = cpuhp_setup_state_multi(CPUHP_AP_ONLINE_DYN, "X/Y:online,
-                                Y_online, Y_prepare_down);
-  Y_hp_online = ret;
-
-The ``cpuhp_setup_state_multi()`` behaves similar to ``cpuhp_setup_state()``
-except it prepares the callbacks for a multi state and does not invoke
-the callbacks. This is a one time setup.
-Once a new instance is allocated, you need to register this new instance: ::
-
-  ret = cpuhp_state_add_instance(Y_hp_online, &d->node);
-
-This function will add this instance to your previously allocated
-*Y_hp_online* state and invoke the previously registered callback
-(*Y_online*) on all online CPUs. The *node* element is a ``struct
-hlist_node`` member of your per-instance data structure.
-
-On removal of the instance: ::
-  cpuhp_state_remove_instance(Y_hp_online, &d->node)
-
-should be invoked which will invoke the teardown callback on all online
-CPUs.
-
-Manual setup
-~~~~~~~~~~~~
-Usually it is handy to invoke setup and teardown callbacks on registration or
-removal of a state because usually the operation needs to performed once a CPU
-goes online (offline) and during initial setup (shutdown) of the driver. However
-each registration and removal function is also available with a ``_nocalls``
-suffix which does not invoke the provided callbacks if the invocation of the
-callbacks is not desired. During the manual setup (or teardown) the functions
-``cpus_read_lock()`` and ``cpus_read_unlock()`` should be used to inhibit CPU
-hotplug operations.
-
-
-The ordering of the events
---------------------------
-The hotplug states are defined in ``include/linux/cpuhotplug.h``:
-
-* The states *CPUHP_OFFLINE* … *CPUHP_AP_OFFLINE* are invoked before the
-  CPU is up.
-* The states *CPUHP_AP_OFFLINE* … *CPUHP_AP_ONLINE* are invoked
-  just the after the CPU has been brought up. The interrupts are off and
-  the scheduler is not yet active on this CPU. Starting with *CPUHP_AP_OFFLINE*
-  the callbacks are invoked on the target CPU.
-* The states between *CPUHP_AP_ONLINE_DYN* and *CPUHP_AP_ONLINE_DYN_END* are
-  reserved for the dynamic allocation.
-* The states are invoked in the reverse order on CPU shutdown starting with
-  *CPUHP_ONLINE* and stopping at *CPUHP_OFFLINE*. Here the callbacks are
-  invoked on the CPU that will be shutdown until *CPUHP_AP_OFFLINE*.
-
-A dynamically allocated state via *CPUHP_AP_ONLINE_DYN* is often enough.
-However if an earlier invocation during the bring up or shutdown is required
-then an explicit state should be acquired. An explicit state might also be
-required if the hotplug event requires specific ordering in respect to
-another hotplug event.
+
+The CPU hotplug API
+===================
+
+CPU hotplug state machine
+-------------------------
+
+CPU hotplug uses a trivial state machine with a linear state space from
+CPUHP_OFFLINE to CPUHP_ONLINE. Each state has a startup and a teardown
+callback.
+
+When a CPU is onlined, the startup callbacks are invoked sequentially until
+the state CPUHP_ONLINE is reached. They can also be invoked when the
+callbacks of a state are set up or an instance is added to a multi-instance
+state.
+
+When a CPU is offlined, the teardown callbacks are invoked in the reverse
+order sequentially until the state CPUHP_OFFLINE is reached. They can also
+be invoked when the callbacks of a state are removed or an instance is
+removed from a multi-instance state.
+
+If a usage site requires only a callback in one direction of the hotplug
+operations (CPU online or CPU offline) then the other, not-required callback
+can be set to NULL when the state is set up.
+
+The state space is divided into three sections:
+
+* The PREPARE section
+
+  The PREPARE section covers the state space from CPUHP_OFFLINE to CPUHP_BRINGUP_CPU.
+
+  The startup callbacks in this section are invoked before the CPU is
+  started during a CPU online operation. The teardown callbacks are invoked
+  after the CPU has become dysfunctional during a CPU offline operation.
+
+  The callbacks are invoked on a control CPU as they can't obviously run on
+  the hotplugged CPU which is either not yet started or has become
+  dysfunctional already.
+
+  The startup callbacks are used to setup resources which are required to
+  bring a CPU successfully online. The teardown callbacks are used to free
+  resources or to move pending work to an online CPU after the hotplugged
+  CPU became dysfunctional.
+
+  The startup callbacks are allowed to fail. If a callback fails, the CPU
+  online operation is aborted and the CPU is brought down to the previous
+  state (usually CPUHP_OFFLINE) again.
+
+  The teardown callbacks in this section are not allowed to fail.
+
+* The STARTING section
+
+  The STARTING section covers the state space between CPUHP_BRINGUP_CPU + 1
+  and CPUHP_AP_ONLINE.
+
+  The startup callbacks in this section are invoked on the hotplugged CPU
+  with interrupts disabled during a CPU online operation in the early CPU
+  setup code. The teardown callbacks are invoked with interrupts disabled
+  on the hotplugged CPU during a CPU offline operation shortly before the
+  CPU is completely shut down.
+
+  The callbacks in this section are not allowed to fail.
+
+  The callbacks are used for low level hardware initialization/shutdown and
+  for core subsystems.
+
+* The ONLINE section
+
+  The ONLINE section covers the state space between CPUHP_AP_ONLINE + 1 and
+  CPUHP_ONLINE.
+
+  The startup callbacks in this section are invoked on the hotplugged CPU
+  during a CPU online operation. The teardown callbacks are invoked on the
+  hotplugged CPU during a CPU offline operation.
+
+  The callbacks are invoked in the context of the per CPU hotplug thread,
+  which is pinned on the hotplugged CPU. The callbacks are invoked with
+  interrupts and preemption enabled.
+
+  The callbacks are allowed to fail. When a callback fails the hotplug
+  operation is aborted and the CPU is brought back to the previous state.
+
+CPU online/offline operations
+-----------------------------
+
+A successful online operation looks like this: ::
+
+  [CPUHP_OFFLINE]
+  [CPUHP_OFFLINE + 1]->startup()       -> success
+  [CPUHP_OFFLINE + 2]->startup()       -> success
+  [CPUHP_OFFLINE + 3]                  -> skipped because startup == NULL
+  ...
+  [CPUHP_BRINGUP_CPU]->startup()       -> success
+  === End of PREPARE section
+  [CPUHP_BRINGUP_CPU + 1]->startup()   -> success
+  ...
+  [CPUHP_AP_ONLINE]->startup()         -> success
+  === End of STARTING section
+  [CPUHP_AP_ONLINE + 1]->startup()     -> success
+  ...
+  [CPUHP_ONLINE - 1]->startup()        -> success
+  [CPUHP_ONLINE]
+
+A successful offline operation looks like this: ::
+
+  [CPUHP_ONLINE]
+  [CPUHP_ONLINE - 1]->teardown()       -> success
+  ...
+  [CPUHP_AP_ONLINE + 1]->teardown()    -> success
+  === Start of STARTING section
+  [CPUHP_AP_ONLINE]->teardown()        -> success
+  ...
+  [CPUHP_BRINGUP_CPU + 1]->teardown()
+  ...
+  === Start of PREPARE section
+  [CPUHP_BRINGUP_CPU]->teardown()
+  [CPUHP_OFFLINE + 3]->teardown()
+  [CPUHP_OFFLINE + 2]                  -> skipped because teardown == NULL
+  [CPUHP_OFFLINE + 1]->teardown()
+  [CPUHP_OFFLINE]
+
+A failed online operation looks like this: ::
+
+  [CPUHP_OFFLINE]
+  [CPUHP_OFFLINE + 1]->startup()       -> success
+  [CPUHP_OFFLINE + 2]->startup()       -> success
+  [CPUHP_OFFLINE + 3]                  -> skipped because startup == NULL
+  ...
+  [CPUHP_BRINGUP_CPU]->startup()       -> success
+  === End of PREPARE section
+  [CPUHP_BRINGUP_CPU + 1]->startup()   -> success
+  ...
+  [CPUHP_AP_ONLINE]->startup()         -> success
+  === End of STARTING section
+  [CPUHP_AP_ONLINE + 1]->startup()     -> success
+  ---
+  [CPUHP_AP_ONLINE + N]->startup()     -> fail
+  [CPUHP_AP_ONLINE + (N - 1)]->teardown()
+  ...
+  [CPUHP_AP_ONLINE + 1]->teardown()
+  === Start of STARTING section
+  [CPUHP_AP_ONLINE]->teardown()
+  ...
+  [CPUHP_BRINGUP_CPU + 1]->teardown()
+  ...
+  === Start of PREPARE section
+  [CPUHP_BRINGUP_CPU]->teardown()
+  [CPUHP_OFFLINE + 3]->teardown()
+  [CPUHP_OFFLINE + 2]                  -> skipped because teardown == NULL
+  [CPUHP_OFFLINE + 1]->teardown()
+  [CPUHP_OFFLINE]
+
+A failed offline operation looks like this: ::
+
+  [CPUHP_ONLINE]
+  [CPUHP_ONLINE - 1]->teardown()       -> success
+  ...
+  [CPUHP_ONLINE - N]->teardown()       -> fail
+  [CPUHP_ONLINE - (N - 1)]->startup()
+  ...
+  [CPUHP_ONLINE - 1]->startup()
+  [CPUHP_ONLINE]
+
+Recursive failures cannot be handled sensibly. Look at the following
+example of a recursive fail due to a failed offline operation: ::
+
+  [CPUHP_ONLINE]
+  [CPUHP_ONLINE - 1]->teardown()       -> success
+  ...
+  [CPUHP_ONLINE - N]->teardown()       -> fail
+  [CPUHP_ONLINE - (N - 1)]->startup()  -> success
+  [CPUHP_ONLINE - (N - 2)]->startup()  -> fail
+
+The CPU hotplug state machine stops right here and does not try to go back
+down again because that would likely result in an endless loop: ::
+
+  [CPUHP_ONLINE - (N - 1)]->teardown() -> success
+  [CPUHP_ONLINE - N]->teardown()       -> fail
+  [CPUHP_ONLINE - (N - 1)]->startup()  -> success
+  [CPUHP_ONLINE - (N - 2)]->startup()  -> fail
+  [CPUHP_ONLINE - (N - 1)]->teardown() -> success
+  [CPUHP_ONLINE - N]->teardown()       -> fail
+
+Lather, rinse and repeat. In this case the CPU is left in state: ::
+
+  [CPUHP_ONLINE - (N - 1)]
+
+which at least lets the system make progress and gives the user a chance to
+debug or even resolve the situation.
+
+Allocating a state
+------------------
+
+There are two ways to allocate a CPU hotplug state:
+
+* Static allocation
+
+  Static allocation has to be used when the subsystem or driver has
+  ordering requirements versus other CPU hotplug states. E.g. the PERF core
+  startup callback has to be invoked before the PERF driver startup
+  callbacks during a CPU online operation. During a CPU offline operation
+  the driver teardown callbacks have to be invoked before the core teardown
+  callback. The statically allocated states are described by constants in
+  the cpuhp_state enum which can be found in include/linux/cpuhotplug.h.
+
+  Insert the state into the enum at the proper place so the ordering
+  requirements are fulfilled. The state constant has to be used for state
+  setup and removal.
+
+  Static allocation is also required when the state callbacks are not set
+  up at runtime and are part of the initializer of the CPU hotplug state
+  array in kernel/cpu.c.
+
+* Dynamic allocation
+
+  When there are no ordering requirements for the state callbacks then
+  dynamic allocation is the preferred method. The state number is allocated
+  by the setup function and returned to the caller on success.
+
+  Only the PREPARE and ONLINE sections provide a dynamic allocation
+  range. The STARTING section does not as most of the callbacks in that
+  section have explicit ordering requirements.
+
+Setup of a CPU hotplug state
+----------------------------
+
+The core code provides the following functions to setup a state:
+
+* cpuhp_setup_state(state, name, startup, teardown)
+* cpuhp_setup_state_nocalls(state, name, startup, teardown)
+* cpuhp_setup_state_cpuslocked(state, name, startup, teardown)
+* cpuhp_setup_state_nocalls_cpuslocked(state, name, startup, teardown)
+
+For cases where a driver or a subsystem has multiple instances and the same
+CPU hotplug state callbacks need to be invoked for each instance, the CPU
+hotplug core provides multi-instance support. The advantage over driver
+specific instance lists is that the instance related functions are fully
+serialized against CPU hotplug operations and provide the automatic
+invocations of the state callbacks on add and removal. To set up such a
+multi-instance state the following function is available:
+
+* cpuhp_setup_state_multi(state, name, startup, teardown)
+
+The @state argument is either a statically allocated state or one of the
+constants for dynamically allocated states - CPUHP_PREPARE_DYN,
+CPUHP_ONLINE_DYN - depending on the state section (PREPARE, ONLINE) for
+which a dynamic state should be allocated.
+
+The @name argument is used for sysfs output and for instrumentation. The
+naming convention is "subsys:mode" or "subsys/driver:mode",
+e.g. "perf:mode" or "perf/x86:mode". The common mode names:
+
+======== =======================================================
+prepare  For states in the PREPARE section
+
+dead     For states in the PREPARE section which do not provide
+         a startup callback
+
+starting For states in the STARTING section
+
+dying    For states in the STARTING section which do not provide
+         a startup callback
+
+online   For states in the ONLINE section
+
+offline  For states in the ONLINE section which do not provide
+         a startup callback
+======== =======================================================
+
+As the @name argument is only used for sysfs and instrumentation other mode
+descriptors can be used as well if they describe the nature of the state
+better than the common ones.
+
+Examples for @name arguments: "perf/online", "perf/x86:prepare",
+"RCU/tree:dying", "sched/waitempty"
+
+The @startup argument is a function pointer to the callback which should be
+invoked during a CPU online operation. If the usage site does not require a
+startup callback set the pointer to NULL.
+
+The @teardown argument is a function pointer to the callback which should
+be invoked during a CPU offline operation. If the usage site does not
+require a teardown callback set the pointer to NULL.
+
+The functions differ in how the installed callbacks are treated:
+
+  * cpuhp_setup_state_nocalls(), cpuhp_setup_state_nocalls_cpuslocked()
+    and cpuhp_setup_state_multi() only install the callbacks
+
+  * cpuhp_setup_state() and cpuhp_setup_state_cpuslocked() install the
+    callbacks and invoke the @startup callback (if not NULL) for all online
+    CPUs which have currently a state greater than the newly installed
+    state. Depending on the state section the callback is either invoked on
+    the current CPU (PREPARE section) or on each online CPU (ONLINE
+    section) in the context of the CPU's hotplug thread.
+
+    If a callback fails for CPU N then the teardown callback for CPU
+    0 .. N-1 is invoked to roll back the operation. The state setup fails,
+    the callbacks for the state are not installed and in case of dynamic
+    allocation the allocated state is freed.
+
+The state setup and the callback invocations are serialized against CPU
+hotplug operations. If the setup function has to be called from a CPU
+hotplug read locked region, then the _cpuslocked() variants have to be
+used. These functions cannot be used from within CPU hotplug callbacks.
+
+The function return values:
+  ======== ===================================================================
+  0        Statically allocated state was successfully set up
+
+  >0       Dynamically allocated state was successfully set up.
+
+           The returned number is the state number which was allocated. If
+           the state callbacks have to be removed later, e.g. module
+           removal, then this number has to be saved by the caller and used
+           as @state argument for the state remove function. For
+           multi-instance states the dynamically allocated state number is
+           also required as @state argument for the instance add/remove
+           operations.
+
+  <0	   Operation failed
+  ======== ===================================================================
+
+Removal of a CPU hotplug state
+------------------------------
+
+To remove a previously set up state, the following functions are provided:
+
+* cpuhp_remove_state(state)
+* cpuhp_remove_state_nocalls(state)
+* cpuhp_remove_state_nocalls_cpuslocked(state)
+* cpuhp_remove_multi_state(state)
+
+The @state argument is either a statically allocated state or the state
+number which was allocated in the dynamic range by cpuhp_setup_state*(). If
+the state is in the dynamic range, then the state number is freed and
+available for dynamic allocation again.
+
+The functions differ in how the installed callbacks are treated:
+
+  * cpuhp_remove_state_nocalls(), cpuhp_remove_state_nocalls_cpuslocked()
+    and cpuhp_remove_multi_state() only remove the callbacks.
+
+  * cpuhp_remove_state() removes the callbacks and invokes the teardown
+    callback (if not NULL) for all online CPUs which have currently a state
+    greater than the removed state. Depending on the state section the
+    callback is either invoked on the current CPU (PREPARE section) or on
+    each online CPU (ONLINE section) in the context of the CPU's hotplug
+    thread.
+
+    In order to complete the removal, the teardown callback should not fail.
+
+The state removal and the callback invocations are serialized against CPU
+hotplug operations. If the remove function has to be called from a CPU
+hotplug read locked region, then the _cpuslocked() variants have to be
+used. These functions cannot be used from within CPU hotplug callbacks.
+
+If a multi-instance state is removed then the caller has to remove all
+instances first.
+
+Multi-Instance state instance management
+----------------------------------------
+
+Once the multi-instance state is set up, instances can be added to the
+state:
+
+  * cpuhp_state_add_instance(state, node)
+  * cpuhp_state_add_instance_nocalls(state, node)
+
+The @state argument is either a statically allocated state or the state
+number which was allocated in the dynamic range by cpuhp_setup_state_multi().
+
+The @node argument is a pointer to a hlist_node which is embedded in the
+instance's data structure. The pointer is handed to the multi-instance
+state callbacks and can be used by the callback to retrieve the instance
+via container_of().
+
+The functions differ in how the installed callbacks are treated:
+
+  * cpuhp_state_add_instance_nocalls() only adds the instance to the
+    multi-instance state's node list.
+
+  * cpuhp_state_add_instance() adds the instance and invokes the startup
+    callback (if not NULL) associated with @state for all online CPUs which
+    have currently a state greater than @state. The callback is only
+    invoked for the to be added instance. Depending on the state section
+    the callback is either invoked on the current CPU (PREPARE section) or
+    on each online CPU (ONLINE section) in the context of the CPU's hotplug
+    thread.
+
+    If a callback fails for CPU N then the teardown callback for CPU
+    0 .. N-1 is invoked to roll back the operation, the function fails and
+    the instance is not added to the node list of the multi-instance state.
+
+To remove an instance from the state's node list these functions are
+available:
+
+  * cpuhp_state_remove_instance(state, node)
+  * cpuhp_state_remove_instance_nocalls(state, node)
+
+The arguments are the same as for the cpuhp_state_add_instance*()
+variants above.
+
+The functions differ in how the installed callbacks are treated:
+
+  * cpuhp_state_remove_instance_nocalls() only removes the instance from the
+    state's node list.
+
+  * cpuhp_state_remove_instance() removes the instance and invokes the
+    teardown callback (if not NULL) associated with @state for all online
+    CPUs which have currently a state greater than @state.  The callback is
+    only invoked for the to be removed instance.  Depending on the state
+    section the callback is either invoked on the current CPU (PREPARE
+    section) or on each online CPU (ONLINE section) in the context of the
+    CPU's hotplug thread.
+
+    In order to complete the removal, the teardown callback should not fail.
+
+The node list add/remove operations and the callback invocations are
+serialized against CPU hotplug operations. These functions cannot be used
+from within CPU hotplug callbacks and CPU hotplug read locked regions.
+
+Examples
+--------
+
+Setup and teardown a statically allocated state in the STARTING section for
+notifications on online and offline operations: ::
+
+   ret = cpuhp_setup_state(CPUHP_SUBSYS_STARTING, "subsys:starting", subsys_cpu_starting, subsys_cpu_dying);
+   if (ret < 0)
+        return ret;
+   ....
+   cpuhp_remove_state(CPUHP_SUBSYS_STARTING);
+
+Setup and teardown a dynamically allocated state in the ONLINE section
+for notifications on offline operations: ::
+
+   state = cpuhp_setup_state(CPUHP_ONLINE_DYN, "subsys:offline", NULL, subsys_cpu_offline);
+   if (state < 0)
+       return state;
+   ....
+   cpuhp_remove_state(state);
+
+Setup and teardown a dynamically allocated state in the ONLINE section
+for notifications on online operations without invoking the callbacks: ::
+
+   state = cpuhp_setup_state_nocalls(CPUHP_ONLINE_DYN, "subsys:online", subsys_cpu_online, NULL);
+   if (state < 0)
+       return state;
+   ....
+   cpuhp_remove_state_nocalls(state);
+
+Setup, use and teardown a dynamically allocated multi-instance state in the
+ONLINE section for notifications on online and offline operation: ::
+
+   state = cpuhp_setup_state_multi(CPUHP_ONLINE_DYN, "subsys:online", subsys_cpu_online, subsys_cpu_offline);
+   if (state < 0)
+       return state;
+   ....
+   ret = cpuhp_state_add_instance(state, &inst1->node);
+   if (ret)
+        return ret;
+   ....
+   ret = cpuhp_state_add_instance(state, &inst2->node);
+   if (ret)
+        return ret;
+   ....
+   cpuhp_state_remove_instance(state, &inst1->node);
+   ....
+   cpuhp_state_remove_instance(state, &inst2->node);
+   ....
+   cpuhp_remove_multi_state(state);
+
 
 Testing of hotplug states
 =========================
--- a/include/linux/cpuhotplug.h
+++ b/include/linux/cpuhotplug.h
@@ -22,8 +22,42 @@
  *              AP_ACTIVE			AP_ACTIVE
  */
 
+/*
+ * CPU hotplug states. The state machine invokes the installed state
+ * startup callbacks sequentially from CPUHP_OFFLINE + 1 to CPUHP_ONLINE
+ * during a CPU online operation. During a CPU offline operation the
+ * installed teardown callbacks are invoked in the reverse order from
+ * CPUHP_ONLINE - 1 down to CPUHP_OFFLINE.
+ *
+ * The state space has three sections: PREPARE, STARTING and ONLINE.
+ *
+ * PREPARE: The callbacks are invoked on a control CPU before the
+ * hotplugged CPU is started up or after the hotplugged CPU has died.
+ *
+ * STARTING: The callbacks are invoked on the hotplugged CPU from the low level
+ * hotplug startup/teardown code with interrupts disabled.
+ *
+ * ONLINE: The callbacks are invoked on the hotplugged CPU from the per CPU
+ * hotplug thread with interrupts and preemption enabled.
+ *
+ * Adding explicit states to this enum is only necessary when:
+ *
+ * 1) The state is within the STARTING section
+ *
+ * 2) The state has ordering constraints vs. other states in the
+ *    same section.
+ *
+ * If neither #1 nor #2 apply, please use the dynamic state space when
+ * setting up a state by using CPUHP_BP_PREPARE_DYN or CPUHP_AP_ONLINE_DYN
+ * for the @state argument of the setup function.
+ *
+ * See Documentation/core-api/cpu_hotplug.rst for further information and
+ * examples.
+ */
 enum cpuhp_state {
 	CPUHP_INVALID = -1,
+
+	/* PREPARE section invoked on a control CPU */
 	CPUHP_OFFLINE = 0,
 	CPUHP_CREATE_THREADS,
 	CPUHP_PERF_PREPARE,
@@ -93,6 +127,11 @@ enum cpuhp_state {
 	CPUHP_BP_PREPARE_DYN,
 	CPUHP_BP_PREPARE_DYN_END		= CPUHP_BP_PREPARE_DYN + 20,
 	CPUHP_BRINGUP_CPU,
+
+	/*
+	 * STARTING section invoked on the hotplugged CPU in low level
+	 * bringup and teardown code.
+	 */
 	CPUHP_AP_IDLE_DEAD,
 	CPUHP_AP_OFFLINE,
 	CPUHP_AP_SCHED_STARTING,
@@ -153,6 +192,8 @@ enum cpuhp_state {
 	CPUHP_AP_ARM_CACHE_B15_RAC_DYING,
 	CPUHP_AP_ONLINE,
 	CPUHP_TEARDOWN_CPU,
+
+	/* Online section invoked on the hotplugged CPU from the hotplug thread */
 	CPUHP_AP_ONLINE_IDLE,
 	CPUHP_AP_SCHED_WAIT_EMPTY,
 	CPUHP_AP_SMPBOOT_THREADS,
@@ -214,14 +255,15 @@ int __cpuhp_setup_state_cpuslocked(enum
 				   int (*teardown)(unsigned int cpu),
 				   bool multi_instance);
 /**
- * cpuhp_setup_state - Setup hotplug state callbacks with calling the callbacks
+ * cpuhp_setup_state - Setup hotplug state callbacks with calling the startup
+ *                     callback
  * @state:	The state for which the calls are installed
  * @name:	Name of the callback (will be used in debug output)
- * @startup:	startup callback function
- * @teardown:	teardown callback function
+ * @startup:	startup callback function or NULL if not required
+ * @teardown:	teardown callback function or NULL if not required
  *
  * Installs the callback functions and invokes the startup callback on
- * the present cpus which have already reached the @state.
+ * the online cpus which have already reached the @state.
  */
 static inline int cpuhp_setup_state(enum cpuhp_state state,
 				    const char *name,
@@ -231,6 +273,18 @@ static inline int cpuhp_setup_state(enum
 	return __cpuhp_setup_state(state, name, true, startup, teardown, false);
 }
 
+/**
+ * cpuhp_setup_state_cpuslocked - Setup hotplug state callbacks with calling
+ *				  startup callback from a cpus_read_lock()
+ *				  held region
+ * @state:	The state for which the calls are installed
+ * @name:	Name of the callback (will be used in debug output)
+ * @startup:	startup callback function or NULL if not required
+ * @teardown:	teardown callback function or NULL if not required
+ *
+ * Same as cpuhp_setup_state() except that it must be invoked from within a
+ * cpus_read_lock() held region.
+ */
 static inline int cpuhp_setup_state_cpuslocked(enum cpuhp_state state,
 					       const char *name,
 					       int (*startup)(unsigned int cpu),
@@ -242,14 +296,14 @@ static inline int cpuhp_setup_state_cpus
 
 /**
  * cpuhp_setup_state_nocalls - Setup hotplug state callbacks without calling the
- *			       callbacks
+ *			       startup callback
  * @state:	The state for which the calls are installed
  * @name:	Name of the callback.
- * @startup:	startup callback function
- * @teardown:	teardown callback function
+ * @startup:	startup callback function or NULL if not required
+ * @teardown:	teardown callback function or NULL if not required
  *
- * Same as @cpuhp_setup_state except that no calls are executed are invoked
- * during installation of this callback. NOP if SMP=n or HOTPLUG_CPU=n.
+ * Same as cpuhp_setup_state() except that the startup callback is not
+ * invoked during installation. NOP if SMP=n or HOTPLUG_CPU=n.
  */
 static inline int cpuhp_setup_state_nocalls(enum cpuhp_state state,
 					    const char *name,
@@ -260,6 +314,19 @@ static inline int cpuhp_setup_state_noca
 				   false);
 }
 
+/**
+ * cpuhp_setup_state_nocalls_cpuslocked - Setup hotplug state callbacks without
+ *					  invoking the @startup callback from
+ *					  a cpus_read_lock() held region
+ *
+ * @state:	The state for which the calls are installed
+ * @name:	Name of the callback.
+ * @startup:	startup callback function or NULL if not required
+ * @teardown:	teardown callback function or NULL if not required
+ *
+ * Same as cpuhp_setup_state_nocalls() except that it must be invoked from
+ * within a cpus_read_lock() held region.
+ */
 static inline int cpuhp_setup_state_nocalls_cpuslocked(enum cpuhp_state state,
 						     const char *name,
 						     int (*startup)(unsigned int cpu),
@@ -273,13 +340,13 @@ static inline int cpuhp_setup_state_noca
  * cpuhp_setup_state_multi - Add callbacks for multi state
  * @state:	The state for which the calls are installed
  * @name:	Name of the callback.
- * @startup:	startup callback function
- * @teardown:	teardown callback function
+ * @startup:	startup callback function or NULL if not required
+ * @teardown:	teardown callback function or NULL if not required
  *
  * Sets the internal multi_instance flag and prepares a state to work as a multi
  * instance callback. No callbacks are invoked at this point. The callbacks are
  * invoked once an instance for this state are registered via
- * @cpuhp_state_add_instance or @cpuhp_state_add_instance_nocalls.
+ * cpuhp_state_add_instance() or cpuhp_state_add_instance_nocalls()
  */
 static inline int cpuhp_setup_state_multi(enum cpuhp_state state,
 					  const char *name,
@@ -305,8 +372,8 @@ int __cpuhp_state_add_instance_cpuslocke
  * @node:	The node for this individual state.
  *
  * Installs the instance for the @state and invokes the startup callback on
- * the present cpus which have already reached the @state. The @state must have
- * been earlier marked as multi-instance by @cpuhp_setup_state_multi.
+ * the online cpus which have already reached the @state. The @state must have
+ * been earlier marked as multi-instance by cpuhp_setup_state_multi().
  */
 static inline int cpuhp_state_add_instance(enum cpuhp_state state,
 					   struct hlist_node *node)
@@ -321,7 +388,8 @@ static inline int cpuhp_state_add_instan
  * @node:	The node for this individual state.
  *
  * Installs the instance for the @state The @state must have been earlier
- * marked as multi-instance by @cpuhp_setup_state_multi.
+ * marked as multi-instance by cpuhp_setup_state_multi. NOP if SMP=n or
+ * HOTPLUG_CPU=n.
  */
 static inline int cpuhp_state_add_instance_nocalls(enum cpuhp_state state,
 						   struct hlist_node *node)
@@ -329,6 +397,17 @@ static inline int cpuhp_state_add_instan
 	return __cpuhp_state_add_instance(state, node, false);
 }
 
+/**
+ * cpuhp_state_add_instance_nocalls_cpuslocked - Add an instance for a state
+ *						 without invoking the startup
+ *						 callback from a cpus_read_lock()
+ *						 held region.
+ * @state:	The state for which the instance is installed
+ * @node:	The node for this individual state.
+ *
+ * Same as cpuhp_state_add_instance_nocalls() except that it must be
+ * invoked from within a cpus_read_lock() held region.
+ */
 static inline int
 cpuhp_state_add_instance_nocalls_cpuslocked(enum cpuhp_state state,
 					    struct hlist_node *node)
@@ -353,7 +432,7 @@ static inline void cpuhp_remove_state(en
 
 /**
  * cpuhp_remove_state_nocalls - Remove hotplug state callbacks without invoking
- *				teardown
+ *				the teardown callback
  * @state:	The state for which the calls are removed
  */
 static inline void cpuhp_remove_state_nocalls(enum cpuhp_state state)
@@ -361,6 +440,14 @@ static inline void cpuhp_remove_state_no
 	__cpuhp_remove_state(state, false);
 }
 
+/**
+ * cpuhp_remove_state_nocalls_cpuslocked - Remove hotplug state callbacks without invoking
+ *					   teardown from a cpus_read_lock() held region.
+ * @state:	The state for which the calls are removed
+ *
+ * Same as cpuhp_remove_state_nocalls() except that it must be invoked
+ * from within a cpus_read_lock() held region.
+ */
 static inline void cpuhp_remove_state_nocalls_cpuslocked(enum cpuhp_state state)
 {
 	__cpuhp_remove_state_cpuslocked(state, false);
@@ -389,7 +476,7 @@ int __cpuhp_state_remove_instance(enum c
  * @node:	The node for this individual state.
  *
  * Removes the instance and invokes the teardown callback on the present cpus
- * which have already reached the @state.
+ * which have already reached @state.
  */
 static inline int cpuhp_state_remove_instance(enum cpuhp_state state,
 					      struct hlist_node *node)
Randy Dunlap Sept. 6, 2021, 2:11 a.m. UTC | #11
On 9/5/21 4:28 PM, Thomas Gleixner wrote:
> Dave,
> 

[snip]

Hi,

Doc. comments below...


> I'm sorry that this change which turned CPU hotplug into a reliable,
> testable and instrumentable mechanism causes so much trouble for you. I
> hope it's just the lack of coherent documentation which made you
> unhappy.
> 
> If the updated documentation does not answer your questions, please let
> me know and please provide a coherent explanation of the problem you are
> trying to solve. Either I can give you an hint or I can identify further
> issues in the documentation.
> 
> If it turns out that there are functional shortcomings then I'm of
> course all ears as well.
> 
> If you need a conveniance API to install multiple states at once to
> regain the "simple API" feeling, please let me know - I surely have some
> ideas.
> 
> Thanks,
> 
>          tglx
> ---
> --- a/Documentation/core-api/cpu_hotplug.rst
> +++ b/Documentation/core-api/cpu_hotplug.rst
> @@ -156,95 +156,479 @@ hotplug states will be invoked, starting
>   * Once all services are migrated, kernel calls an arch specific routine
>     ``__cpu_disable()`` to perform arch specific cleanup.
>   

[snip]

> +
> +The CPU hotplug API
> +===================
> +
> +CPU hotplug state machine
> +-------------------------
> +
> +CPU hotplug uses a trivial state machine with a linear state space from
> +CPUHP_OFFLINE to CPUHP_ONLINE. Each state has a startup and a teardown
> +callback.
> +
> +When a CPU is onlined, the startup callbacks are invoked sequentially until
> +the state CPUHP_ONLINE is reached. They can also be invoked when the
> +callbacks of a state are set up or an instance is added to a multi-instance
> +state.
> +
> +When a CPU is offlined the teardown callbacks are invoked in the reverse
> +order sequenctially until the state CPUHP_OFFLINE is reached. They can also

          sequentially

> +be invoked when the callbacks of a state are removed or an instance is
> +removed from a multi-instance state.
> +
> +If a usage site requires only a callback in one direction of the hotplug
> +operations (CPU online or CPU offline) then the other not required callback

                                                          not-required

> +can be set to NULL when the state is set up.
> +
> +The state space is divided into three sections:
> +
> +* The PREPARE section
> +
> +  The PREPARE section covers the state space from CPUHP_OFFLINE to CPUHP_BRINGUP_CPU

                                                                       CPUHP_BRINGUP_CPU.

> +
> +  The startup callbacks in this section are invoked before the CPU is
> +  started during a CPU online operation. The teardown callbacks are invoked
> +  after the CPU has become dysfunctional during a CPU offline operation.
> +
> +  The callbacks are invoked on a control CPU as they can't obviously run on
> +  the hotplugged CPU which is either not yet started or has become
> +  dysfunctional already.
> +
> +  The startup callbacks are used to setup resources which are required to
> +  bring a CPU successfully online. The teardown callbacks are used to free
> +  resources or to move pending work to an online CPU after the hotplugged
> +  CPU became dysfunctional.
> +
> +  The startup callbacks are allowed to fail. If a callback fails, the CPU
> +  online operation is aborted and the CPU is brought down to the previous
> +  state (usually CPUHP_OFFLINE) again.
> +
> +  The teardown callbacks in this section are not allowed to fail.
> +
> +* The STARTING section
> +
> +  The STARTING section covers the state space between CPUHP_BRINGUP_CPU + 1
> +  and CPUHP_AP_ONLINE

      and CPUHP_AP_ONLINE.

> +
> +  The startup callbacks in this section are invoked on the hotplugged CPU
> +  with interrupts disabled during a CPU online operation in the early CPU
> +  setup code. The teardown callbacks are invoked with interrupts disabled
> +  on the hotplugged CPU during a CPU offline operation shortly before the
> +  CPU is completely shut down.
> +
> +  The callbacks in this section are not allowed to fail.
> +
> +  The callbacks are used for low level hardware initialization/shutdown and
> +  for core subsystems.
> +
> +* The ONLINE section
> +
> +  The ONLINE section covers the state space between CPUHP_AP_ONLINE + 1 and
> +  CPUHP_ONLINE.
> +
> +  The startup callbacks in this section are invoked on the hotplugged CPU
> +  during a CPU online operation. The teardown callbacks are invoked on the
> +  hotplugged CPU during a CPU offline operation.
> +
> +  The callbacks are invoked in the context of the per CPU hotplug thread,
> +  which is pinned on the hotplugged CPU. The callbacks are invoked with
> +  interrupts and preemption enabled.
> +
> +  The callbacks are allowed to fail. When a callback fails the hotplug
> +  operation is aborted and the CPU is brought back to the previous state.
> +
> +CPU online/offline operations
> +-----------------------------
> +
> +A successful online operation looks like this: ::
> +
> +  [CPUHP_OFFLINE]
> +  [CPUHP_OFFLINE + 1]->startup()       -> success
> +  [CPUHP_OFFLINE + 2]->startup()       -> success
> +  [CPUHP_OFFLINE + 3]                  -> skipped because startup == NULL
> +  ...
> +  [CPUHP_BRINGUP_CPU]->startup()       -> success
> +  === End of PREPARE section
> +  [CPUHP_BRINGUP_CPU + 1]->startup()   -> success
> +  ...
> +  [CPUHP_AP_ONLINE]->startup()         -> success
> +  === End of STARTUP section
> +  [CPUHP_AP_ONLINE + 1]->startup()     -> success
> +  ...
> +  [CPUHP_ONLINE - 1]->startup()        -> success
> +  [CPUHP_ONLINE]
> +
> +A successful offline operation looks like this: ::
> +
> +  [CPUHP_ONLINE]
> +  [CPUHP_ONLINE - 1]->teardown()       -> success
> +  ...
> +  [CPUHP_AP_ONLINE + 1]->teardown()    -> success
> +  === Start of STARTUP section
> +  [CPUHP_AP_ONLINE]->teardown()        -> success
> +  ...
> +  [CPUHP_BRINGUP_ONLINE - 1]->teardown()
> +  ...
> +  === Start of PREPARE section
> +  [CPUHP_BRINGUP_CPU]->teardown()
> +  [CPUHP_OFFLINE + 3]->teardown()
> +  [CPUHP_OFFLINE + 2]                  -> skipped because teardown == NULL
> +  [CPUHP_OFFLINE + 1]->teardown()
> +  [CPUHP_OFFLINE]
> +
> +A failed online operation looks like this: ::
> +
> +  [CPUHP_OFFLINE]
> +  [CPUHP_OFFLINE + 1]->startup()       -> success
> +  [CPUHP_OFFLINE + 2]->startup()       -> success
> +  [CPUHP_OFFLINE + 3]                  -> skipped because startup == NULL
> +  ...
> +  [CPUHP_BRINGUP_CPU]->startup()       -> success
> +  === End of PREPARE section
> +  [CPUHP_BRINGUP_CPU + 1]->startup()   -> success
> +  ...
> +  [CPUHP_AP_ONLINE]->startup()         -> success
> +  === End of STARTUP section
> +  [CPUHP_AP_ONLINE + 1]->startup()     -> success
> +  ---
> +  [CPUHP_AP_ONLINE + N]->startup()     -> fail
> +  [CPUHP_AP_ONLINE + (N - 1)]->teardown()
> +  ...
> +  [CPUHP_AP_ONLINE + 1]->teardown()
> +  === Start of STARTUP section
> +  [CPUHP_AP_ONLINE]->teardown()
> +  ...
> +  [CPUHP_BRINGUP_ONLINE - 1]->teardown()
> +  ...
> +  === Start of PREPARE section
> +  [CPUHP_BRINGUP_CPU]->teardown()
> +  [CPUHP_OFFLINE + 3]->teardown()
> +  [CPUHP_OFFLINE + 2]                  -> skipped because teardown == NULL
> +  [CPUHP_OFFLINE + 1]->teardown()
> +  [CPUHP_OFFLINE]
> +
> +A failed offline operation looks like this: ::
> +
> +  [CPUHP_ONLINE]
> +  [CPUHP_ONLINE - 1]->teardown()       -> success
> +  ...
> +  [CPUHP_ONLINE - N]->teardown()       -> fail
> +  [CPUHP_ONLINE - (N - 1)]->startup()
> +  ...
> +  [CPUHP_ONLINE - 1]->startup()
> +  [CPUHP_ONLINE]
> +
> +Recursive failures cannot be handled sensibly. Look at the following
> +example of a recursive fail due to a failed offline operation: ::
> +
> +  [CPUHP_ONLINE]
> +  [CPUHP_ONLINE - 1]->teardown()       -> success
> +  ...
> +  [CPUHP_ONLINE - N]->teardown()       -> fail
> +  [CPUHP_ONLINE - (N - 1)]->startup()  -> success
> +  [CPUHP_ONLINE - (N - 2)]->startup()  -> fail
> +
> +The CPU hotplug state machine stops right here and does not try to go back
> +down again because that would likely result in an endless loop: ::
> +
> +  [CPUHP_ONLINE - (N - 1)]->teardown() -> success
> +  [CPUHP_ONLINE - N]->teardown()       -> fail
> +  [CPUHP_ONLINE - (N - 1)]->startup()  -> success
> +  [CPUHP_ONLINE - (N - 2)]->startup()  -> fail
> +  [CPUHP_ONLINE - (N - 1)]->teardown() -> success
> +  [CPUHP_ONLINE - N]->teardown()       -> fail
> +
> +Lather, rinse and repeat. In this case the CPU left in state: ::

                                               CPU is left

> +
> +  [CPUHP_ONLINE - (N - 1)]
> +
> +which at least lets the system make progress and gives the user a chance to
> +debug or even resolve the situation.
> +
> +Allocating a state
> +------------------
> +
> +There are two ways to allocate a CPU hotplug state:
> +
> +* Static allocation
> +
> +  Static allocation has to be used when the subsystem or driver has
> +  ordering requirements versus other CPU hotplug states. E.g. the PERF core
> +  startup callback has to be invoked before the PERF driver startup
> +  callbacks during a CPU online operation. During a CPU offline operation
> +  the driver teardown callbacks have to be invoked before the core teardown
> +  callback. The statically allocated states are described by constants in
> +  the cpuhp_state enum which can be found in include/linux/cpuhotplug.h.
> +
> +  Insert the state into the enum at the proper place so the ordering
> +  requirements are fulfilled. The state constant has to be used for state
> +  setup and removal.
> +
> +  Static allocation is also required when the state callbacks are not set
> +  up at runtime and are part of the initializer of the CPU hotplug state
> +  array in kernel/cpu.c.
> +
> +* Dynamic allocation
> +
> +  When there are no ordering requirements for the state callbacks then
> +  dynamic allocation is the preferred method. The state number is allocated
> +  by the setup function and returned to the caller on success.
> +
> +  Only the PREPARE and ONLINE sections provide a dynamic allocation
> +  range. The STARTING section does not as most of the callbacks in that
> +  section have explicit ordering requirements.
> +
> +Setup of a CPU hotplug state
> +----------------------------
> +
> +The core code provides the following functions to setup a state:
> +
> +* cpuhp_setup_state(state, name, startup, teardown)
> +* cpuhp_setup_state_nocalls(state, name, startup, teardown)
> +* cpuhp_setup_state_cpuslocked(state, name, startup, teardown)
> +* cpuhp_setup_state_nocalls_cpuslocked(state, name, startup, teardown)
> +
> +For cases where a driver or a subsystem has multiple instances and the same
> +CPU hotplug state callbacks need to be invoked for each instance, the CPU
> +hotplug core provides multi-instance support. The advantage over driver
> +specific instance lists is that the instance related functions are fully
> +serialized against CPU hotplug operations and provide the automatic
> +invocations of the state callbacks on add and removal. To set up such a
> +multi-instance state the following function is available:
> +
> +* cpuhp_setup_state_multi(state, name, startup, teardown)
> +
> +The @state argument is either a statically allocated state or one of the
> +constants for dynamically allocated states - CPUHP_PREPARE_DYN,
> +CPUHP_ONLINE_DYN - depending on the state section (PREPARE, ONLINE) for
> +which a dynamic state should be allocated.
> +
> +The @name argument is used for sysfs output and for instrumentation. The
> +naming convention is "subsys:mode" or "subsys/driver:mode",
> +e.g. "perf:mode" or "perf/x86:mode". The common mode names:

                                                         names are:

> +
> +======== =======================================================
> +prepare  For states in the PREPAREsection

                               PREPARE section

> +
> +dead     For states in the PREPARE section which do not provide
> +         a startup callback
> +
> +starting For states in the STARTING section
> +
> +dying    For states in the STARTING section which do not provide
> +         a startup callback
> +
> +online   For states in the ONLINE section
> +
> +offline  For states in the ONLINE section which do not provide
> +         a startup callback
> +======== =======================================================
> +
> +As the @name argument is only used for sysfs and instrumentation other mode
> +descriptors can be used as well if they describe the nature of the state
> +better than the common ones.
> +
> +Examples for @name arguments: "perf/online", "perf/x86:prepare",
> +"RCU/tree:dying", "sched/waitempty"
> +
> +The @startup argument is a function pointer to the callback which should be
> +invoked during a CPU online operation. If the usage site does not require a
> +startup callback set the pointer to NULL.
> +
> +The @teardown argument is a function pointer to the callback which should
> +be invoked during a CPU offline operation. If the usage site does not
> +require a teardown callback set the pointer to NULL.
> +
> +The functions differ in the way how the installed callbacks are treated:
> +
> +  * cpuhp_setup_state_nocalls(), cpuhp_setup_state_nocalls_cpuslocked()
> +    and cpuhp_setup_state_multi() only install the callbacks
> +
> +  * cpuhp_setup_state() and cpuhp_setup_state_cpuslocked() install the
> +    callbacks and invoke the @startup callback (if not NULL) for all online
> +    CPUs which have currently a state greater than the newly installed
> +    state. Depending on the state section the callback is either invoked on
> +    the current CPU (PREPARE section) or on each online CPU (ONLINE
> +    section) in the context of the CPU's hotplug thread.
> +
> +    If a callback fails for CPU N then the teardown callback for CPU
> +    0 .. N-1 is invoked to rollback the operation. The state setup fails,

CPU 0? Does one of these fail since it's not an AP?

> +    the callbacks for the state are not installed and in case of dynamic
> +    allocation the allocated state is freed.
> +
> +The state setup and the callback invocations are serialized against CPU
> +hotplug operations. If the setup function has to be called from a CPU
> +hotplug read locked region, then the _cpuslocked() variants have to be
> +used. These functions cannot be used from within CPU hotplug callbacks.
> +
> +The function return values:
> +  ======== ===================================================================
> +  0        Statically allocated state was successfully set up
> +
> +  >0       Dynamically allocated state was successfully set up.
> +
> +           The returned number is the state number which was allocated. If
> +           the state callbacks have to be removed later, e.g. module
> +           removal, then this number has to be saved by the caller and used
> +           as @state argument for the state remove function. For
> +           multi-instance states the dynamically allocated state number is
> +           also required as @state argument for the instance add/remove
> +           operations.
> +
> +  <0	   Operation failed
> +  ======== ===================================================================
> +
> +Removal of a CPU hotplug state
> +------------------------------
> +
> +To remove a previously set up state, the following functions are provided:
> +
> +* cpuhp_remove_state(state)
> +* cpuhp_remove_state_nocalls(state)
> +* cpuhp_remove_state_nocalls_cpuslocked(state)
> +* cpuhp_remove_multi_state(state)
> +
> +The @state argument is either a statically allocated state or the state
> +number which was allocated in the dynamic range by cpuhp_setup_state*(). If
> +the state is in the dynamic range, then the state number is freed and
> +available for dynamic allocation again.
> +
> +The functions differ in the way how the installed callbacks are treated:
> +
> +  * cpuhp_remove_state_nocalls(), cpuhp_remove_state_nocalls_cpuslocked()
> +    and cpuhp_remove_multi_state() only remove the callbacks.
> +
> +  * cpuhp_remove_state() removes the callbacks and invokes the teardown
> +    callback (if not NULL) for all online CPUs which have currently a state
> +    greater than the removed state. Depending on the state section the
> +    callback is either invoked on the current CPU (PREPARE section) or on
> +    each online CPU (ONLINE section) in the context of the CPU's hotplug
> +    thread.
> +
> +    In order to complete the removal, the teardown callback should not fail.
> +
> +The state removal and the callback invocations are serialized against CPU
> +hotplug operations. If the remove function has to be called from a CPU
> +hotplug read locked region, then the _cpuslocked() variants have to be
> +used. These functions cannot be used from within CPU hotplug callbacks.
> +
> +If a multi-instance state is removed then the caller has to remove all
> +instances first.
> +
> +Multi-Instance state instance management
> +----------------------------------------
> +
> +Once the multi-instance state is set up, instances can be added to the
> +state:
> +
> +  * cpuhp_state_add_instance(state, node)
> +  * cpuhp_state_add_instance_nocalls(state, node)
> +
> +The @state argument is either a statically allocated state or the state
> +number which was allocated in the dynamic range by cpuhp_setup_state_multi().
> +
> +The @node argument is a pointer to a hlist_node which is embedded in the

               I would say:         to an hlist_node

> +instance's data structure. The pointer is handed to the multi-instance
> +state callbacks and can be used by the callback to retrieve the instance
> +via container_of().
> +
> +The functions differ in the way how the installed callbacks are treated:
> +
> +  * cpuhp_state_add_instance_nocalls() and only adds the instance to the
> +    multi-instance state's node list.
> +
> +  * cpuhp_state_add_instance() adds the instance and invokes the startup
> +    callback (if not NULL) associated with @state for all online CPUs which
> +    have currently a state greater than @state. The callback is only
> +    invoked for the to be added instance. Depending on the state section
> +    the callback is either invoked on the current CPU (PREPARE section) or
> +    on each online CPU (ONLINE section) in the context of the CPU's hotplug
> +    thread.
> +
> +    If a callback fails for CPU N then the teardown callback for CPU
> +    0 .. N-1 is invoked to rollback the operation, the function fails and

all except the Boot CPU?

> +    the instance is not added to the node list of the multi-instance state.
> +
> +To remove an instance from the state's node list these functions are
> +available:
> +
> +  * cpuhp_state_remove_instance(state, node)
> +  * cpuhp_state_remove_instance_nocalls(state, node)
> +
> +The arguments are the same as for the the cpuhp_state_add_instance*()
> +variants above.
> +
> +The functions differ in the way how the installed callbacks are treated:
> +
> +  * cpuhp_state_remove_instance_nocalls() only removes the instance from the
> +    state's node list.
> +
> +  * cpuhp_state_remove_instance() removes the instance and invokes the
> +    teardown callback (if not NULL) associated with @state for all online
> +    CPUs which have currently a state greater than @state.  The callback is
> +    only invoked for the to be removed instance.  Depending on the state
> +    section the callback is either invoked on the current CPU (PREPARE
> +    section) or on each online CPU (ONLINE section) in the context of the
> +    CPU's hotplug thread.
> +
> +    In order to complete the removal, the teardown callback should not fail.
> +
> +The node list add/remove operations and the callback invocations are
> +serialized against CPU hotplug operations. These functions cannot be used
> +from within CPU hotplug callbacks and CPU hotplug read locked regions.
> +
> +Examples
> +--------
> +
> +Setup and teardown a statically allocated state in the STARTING section for
> +notifications on online and offline operations: ::
> +
> +   ret = cpuhp_setup_state(CPUHP_SUBSYS_STARTING, "subsys:starting", subsys_cpu_starting, subsys_cpu_dying);
> +   if (ret < 0)
> +        return ret;
> +   ....
> +   cpuhp_remove_state(CPUHP_SUBSYS_STARTING);
> +
> +Setup and teardown a dynamically allocated state in the ONLINE section
> +for notifications on offline operations: ::
> +
> +   state = cpuhp_setup_state(CPUHP_ONLINE_DYN, "subsys:offline", NULL, subsys_cpu_offline);
> +   if (state < 0)
> +       return state;
> +   ....
> +   cpuhp_remove_state(state);
> +
> +Setup and teardown a dynamically allocated state in the ONLINE section
> +for notifications on online operations without invoking the callbacks: ::
> +
> +   state = cpuhp_setup_state_nocalls(CPUHP_ONLINE_DYN, "subsys:online", subsys_cpi_online, NULL);

                                                                                  _cpu_

> +   if (state < 0)
> +       return state;
> +   ....
> +   cpuhp_remove_state_nocalls(state);
> +
> +Setup, use and teardown a dynamically allocated multi-instance state in the
> +ONLINE section for notifications on online and offline operation: ::
> +
> +   state = cpuhp_setup_state_multi(CPUHP_ONLINE_DYN, "subsys:online", subsys_cpu_online, subsys_cpu_offline);
> +   if (state < 0)
> +       return state;
> +   ....
> +   ret = cpuhp_state_add_instance(state, &inst1->node);
> +   if (ret)
> +        return ret;
> +   ....
> +   ret = cpuhp_state_add_instance(state, &inst2->node);
> +   if (ret)
> +        return ret;
> +   ....
> +   cpuhp_remove_instance(state, &inst1->node);
> +   ....
> +   cpuhp_remove_instance(state, &inst2->node);
> +   ....
> +   remove_multi_state(state);
> +
>   
>   Testing of hotplug states
>   =========================
Thomas Gleixner Sept. 6, 2021, 9:42 a.m. UTC | #12
Randy,

On Sun, Sep 05 2021 at 19:11, Randy Dunlap wrote:
> On 9/5/21 4:28 PM, Thomas Gleixner wrote:
>> +  * cpuhp_setup_state() and cpuhp_setup_state_cpuslocked() install the
>> +    callbacks and invoke the @startup callback (if not NULL) for all online
>> +    CPUs which have currently a state greater than the newly installed
>> +    state. Depending on the state section the callback is either invoked on
>> +    the current CPU (PREPARE section) or on each online CPU (ONLINE
>> +    section) in the context of the CPU's hotplug thread.
>> +
>> +    If a callback fails for CPU N then the teardown callback for CPU
>> +    0 .. N-1 is invoked to rollback the operation. The state setup fails,
>
> CPU 0? Does one of these fail since it's not an AP?

Yes. CPU 0 is not special in any way.

The point is that the hotplug state callbacks are set up late in the
boot process or at runtime when a module is loaded or some
functionality is initialized on first use.

At that time the boot CPU (0) and usually the secondary CPUs are online
already. So the driver/subsystem has two ways to bring the per CPU
functionality into operation:

 1) Initialize all per CPU state manually, which often involves queuing
    work on each online CPU or invoking SMP function calls on the online
    CPUs, and install the callbacks only if all of that succeeds. If
    something goes wrong on one of the CPUs then the state has to be
    cleaned up on the CPUs which already had their state set up
    correctly.

    This of course has to be done with cpus_read_lock() held to
    serialize against a concurrent CPU hotplug operation.

 2) Let the hotplug core do that work. Setup a state with the
    corresponding callbacks. The core invokes the startup callback on
    all online CPUs (including 0) in the correct context:

    for_each_online_cpu(cpu) {
    	ret = invoke_callback_on/for_cpu(cpu, startup);
        if (ret)
        	goto err;
        ...

    Any of these callback invocations can fail, even the one on the boot
    CPU. In case of failure on CPU 0 there is nothing to clean up, but if
    the Nth CPU callback fails then the state has been established for
    CPU 0 to CPU N-1 already, e.g. memory allocation, hardware setup ...

    So instead of returning with a half set up functionality, the core
    does the rollback on CPU 0 to CPU N-1 by invoking the teardown
    callback before returning the error code.

    err:
    	for_each_online_cpu(cpu) {
            if (startup_done[cpu])
                invoke_callback_on/for_cpu(cpu, teardown);
        }

    That means the call site does not have to mop up the half-initialized
    state manually.

    All of that is properly serialized against CPU hotplug operations.
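
To make that concrete, here is a rough usage-site sketch of the second
variant (#2 above). The subsys names, the buffer size and the per-CPU
data are made up for illustration; only cpuhp_setup_state() and the
callback signatures are the real interface:

  #include <linux/cpuhotplug.h>
  #include <linux/init.h>
  #include <linux/percpu.h>
  #include <linux/slab.h>

  #define SUBSYS_BUF_SIZE 4096            /* made up for the example */

  static DEFINE_PER_CPU(void *, subsys_buf);
  static int subsys_hp_state;

  /* PREPARE callback: runs on a control CPU for each online/onlined CPU */
  static int subsys_cpu_prepare(unsigned int cpu)
  {
          void *buf = kzalloc(SUBSYS_BUF_SIZE, GFP_KERNEL);

          if (!buf)
                  return -ENOMEM; /* the core rolls back the CPUs set up so far */
          per_cpu(subsys_buf, cpu) = buf;
          return 0;
  }

  static int subsys_cpu_dead(unsigned int cpu)
  {
          kfree(per_cpu(subsys_buf, cpu));
          per_cpu(subsys_buf, cpu) = NULL;
          return 0;
  }

  static int __init subsys_init(void)
  {
          int ret;

          /*
           * The hotplug core invokes subsys_cpu_prepare() for all CPUs
           * which are already online and rolls back via subsys_cpu_dead()
           * if one of the invocations fails.
           */
          ret = cpuhp_setup_state(CPUHP_BP_PREPARE_DYN, "subsys:prepare",
                                  subsys_cpu_prepare, subsys_cpu_dead);
          if (ret < 0)
                  return ret;

          /* Save the dynamically allocated state for later removal */
          subsys_hp_state = ret;
          return 0;
  }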

>> +
>> +    If a callback fails for CPU N then the teardown callback for CPU
>> +    0 .. N-1 is invoked to rollback the operation, the function fails and
>
> all except the Boot CPU?

See above.

Thanks,

        tglx