Message ID: 20210831211847.GC9959@magnolia (mailing list archive)
State: New, archived
Series: [GIT,PULL] xfs: new code for 5.15
On Tue, Aug 31, 2021 at 2:18 PM Darrick J. Wong <djwong@kernel.org> wrote:
>
> As for new features: we now batch inode inactivations in percpu
> background threads, which sharply decreases frontend thread wait time
> when performing file deletions and should improve overall directory tree
> deletion times.

So no complaints on this one, but I do have a reaction: we have a lot
of these random CPU hotplug events, and XFS now added another one.

I don't see that as a problem, but just the _randomness_ of these
callbacks makes me go "hmm". And that "enum cpuhp_state" thing isn't
exactly a thing of beauty, and just makes me think there's something
nasty going on.

For the new xfs usage, I really get the feeling that it's not that XFS
actually cares about the CPU states, but that this is literally tied
to just having percpu state allocated and active, and that maybe it
would be sensible to have something more specific to that kind of use.

We have other things that are very similar in nature - like the page
allocator percpu caches etc, which for very similar reasons want cpu
dead/online notification.

I'm only throwing this out as a reaction to this - I'm not sure
another interface would be good or worthwhile, but that "enum
cpuhp_state" is ugly enough that I thought I'd rope in Thomas for CPU
hotplug, and the percpu memory allocation people for comments.

IOW, just _maybe_ we would want to have some kind of callback model
for "percpu_alloc()" and it being explicitly about allocations
becoming available or going away, rather than about CPU state.

Comments?

> Lastly, with this release, two new features have graduated to supported
> status: inode btree counters (for faster mounts), and support for dates
> beyond Y2038.

Oh, I had thought Y2038 was already a non-issue for xfs. Silly me.

             Linus
The pull request you sent on Tue, 31 Aug 2021 14:18:47 -0700:
> git://git.kernel.org/pub/scm/fs/xfs/xfs-linux.git tags/xfs-5.15-merge-6
has been merged into torvalds/linux.git:
https://git.kernel.org/torvalds/c/90c90cda05aecf0f7c45f9f35384b31bba38455f
Thank you!
On Thu, Sep 02, 2021 at 08:47:42AM -0700, Linus Torvalds wrote:
> On Tue, Aug 31, 2021 at 2:18 PM Darrick J. Wong <djwong@kernel.org> wrote:
> >
> > As for new features: we now batch inode inactivations in percpu
> > background threads, which sharply decreases frontend thread wait time
> > when performing file deletions and should improve overall directory tree
> > deletion times.
>
> So no complaints on this one, but I do have a reaction: we have a lot
> of these random CPU hotplug events, and XFS now added another one.
>
> I don't see that as a problem, but just the _randomness_ of these
> callbacks makes me go "hmm". And that "enum cpuhp_state" thing isn't
> exactly a thing of beauty, and just makes me think there's something
> nasty going on.
>
> For the new xfs usage, I really get the feeling that it's not that XFS
> actually cares about the CPU states, but that this is literally tied
> to just having percpu state allocated and active, and that maybe it
> would be sensible to have something more specific to that kind of use.

Correct -- we don't really care about cpu state at all; all xfs needs is
to push batched work items on a per-cpu list to another cpu when a cpu
goes offline.  I didn't see anything that looked like it handled that
kind of thing, so ... cpuhp_state it was. :/

> We have other things that are very similar in nature - like the page
> allocator percpu caches etc, which for very similar reasons want cpu
> dead/online notification.
>
> I'm only throwing this out as a reaction to this - I'm not sure
> another interface would be good or worthwhile, but that "enum
> cpuhp_state" is ugly enough that I thought I'd rope in Thomas for CPU
> hotplug, and the percpu memory allocation people for comments.
>
> IOW, just _maybe_ we would want to have some kind of callback model
> for "percpu_alloc()" and it being explicitly about allocations
> becoming available or going away, rather than about CPU state.
>
> Comments?

Seems like a good fit for us, though I'll let Dave Chinner chime in
since he's the one with more per-cpu list patches coming up.

> > Lastly, with this release, two new features have graduated to supported
> > status: inode btree counters (for faster mounts), and support for dates
> > beyond Y2038.
>
> Oh, I had thought Y2038 was already a non-issue for xfs. Silly me.

It's been a new feature in upstream for a year now.  We're merely taking
down the scary warnings that using this new code might result in a
subspace vortex opening in the skies or that all trains bound for
Moynihan end up on track 19 or wherever. ;)

--D

> Linus
On Thu, Sep 02 2021 at 08:47, Linus Torvalds wrote:
> On Tue, Aug 31, 2021 at 2:18 PM Darrick J. Wong <djwong@kernel.org> wrote:
>>
>> As for new features: we now batch inode inactivations in percpu
>> background threads, which sharply decreases frontend thread wait time
>> when performing file deletions and should improve overall directory tree
>> deletion times.
>
> So no complaints on this one, but I do have a reaction: we have a lot
> of these random CPU hotplug events, and XFS now added another one.
>
> I don't see that as a problem, but just the _randomness_ of these
> callbacks makes me go "hmm". And that "enum cpuhp_state" thing isn't
> exactly a thing of beauty, and just makes me think there's something
> nasty going on.

It's not beautiful, but it's at least well defined in terms of ordering.

Though if the entity which needs the callback does not care about
ordering against other callbacks and just cares about the CPU state, in
this case DEAD, then the explicit entry can be avoided and a dynamic
entry can be requested:

    state = cpuhp_setup_state(CPUHP_BP_PREPARE_DYN, "xfs:foo", NULL, xfs_dead);

We also have a dynamic range for the online part (CPUHP_AP_ONLINE_DYN)
which runs on the incoming or outgoing CPU. That spares the explicit
entries in the enum.

I assume most of the prepare/dead states have no ordering requirement at
all, so those could be converted to the dynamic range. But for the
online ones which run on the plugged CPU we have quite some ordering
constraints, and that's where the explicit states matter.

That surely could be consolidated a bit if we pack the mutually
exclusive ones (timers, interrupt controllers, perf), but the question
is whether such packing (ifdeffery or arch/platform specific includes)
would make it more beautiful. The only thing we'd spare would be some
bytes in the actual state table in the core code. Whether that's worth
it, I don't know.

> For the new xfs usage, I really get the feeling that it's not that XFS
> actually cares about the CPU states, but that this is literally tied
> to just having percpu state allocated and active, and that maybe it
> would be sensible to have something more specific to that kind of use.
>
> We have other things that are very similar in nature - like the page
> allocator percpu caches etc, which for very similar reasons want cpu
> dead/online notification.
>
> I'm only throwing this out as a reaction to this - I'm not sure
> another interface would be good or worthwhile, but that "enum
> cpuhp_state" is ugly enough that I thought I'd rope in Thomas for CPU
> hotplug, and the percpu memory allocation people for comments.

It's not only about memory.

> IOW, just _maybe_ we would want to have some kind of callback model
> for "percpu_alloc()" and it being explicitly about allocations
> becoming available or going away, rather than about CPU state.

The per cpu storage in XFS does not go away. It contains a llist head
and the queued work items need to be moved from the dead CPU to an alive
CPU and exposed to a work queue for processing. Similar to what we do
with timers, hrtimers and other stuff.

If there are callbacks which are doing pretty much the same thing, then
I'm all for a generic infrastructure for these.

Thanks,

        tglx
On Thu, Sep 02, 2021 at 10:43:11AM -0700, Darrick J. Wong wrote:
> On Thu, Sep 02, 2021 at 08:47:42AM -0700, Linus Torvalds wrote:
> > On Tue, Aug 31, 2021 at 2:18 PM Darrick J. Wong <djwong@kernel.org> wrote:
> > >
> > > As for new features: we now batch inode inactivations in percpu
> > > background threads, which sharply decreases frontend thread wait time
> > > when performing file deletions and should improve overall directory tree
> > > deletion times.
> >
> > So no complaints on this one, but I do have a reaction: we have a lot
> > of these random CPU hotplug events, and XFS now added another one.
> >
> > I don't see that as a problem, but just the _randomness_ of these
> > callbacks makes me go "hmm". And that "enum cpuhp_state" thing isn't
> > exactly a thing of beauty, and just makes me think there's something
> > nasty going on.
> >
> > For the new xfs usage, I really get the feeling that it's not that XFS
> > actually cares about the CPU states, but that this is literally tied
> > to just having percpu state allocated and active, and that maybe it
> > would be sensible to have something more specific to that kind of use.
>
> Correct -- we don't really care about cpu state at all; all xfs needs is
> to push batched work items on a per-cpu list to another cpu when a cpu
> goes offline.  I didn't see anything that looked like it handled that
> kind of thing, so ... cpuhp_state it was. :/

Yeah, it appeared to me that cpuhp_state is the replacement for the
simple, old register_cpu_notifier() hotplug interface that we've used in
the past in XFS (e.g. for per-cpu superblock counters back in 2006).

So, like many other subsystems, I just hooked into it with a subsystem
notifier so we don't have to modify global headers in future for new
internal CPU dead notifiers because, as Darrick mentions, I have more
percpu based modifications coming up for XFS...

> > We have other things that are very similar in nature - like the page
> > allocator percpu caches etc, which for very similar reasons want cpu
> > dead/online notification.
> >
> > I'm only throwing this out as a reaction to this - I'm not sure
> > another interface would be good or worthwhile, but that "enum
> > cpuhp_state" is ugly enough that I thought I'd rope in Thomas for CPU
> > hotplug, and the percpu memory allocation people for comments.

Calling it ugly is being kind. :/

The part I dislike most about it is that we have to modify a header file
that triggers full kernel rebuilds. Managing patch stacks and branches
where one of them modifies such a header file means quick, XFS subsystem
only kernel rebuilds are a rare thing...

> > IOW, just _maybe_ we would want to have some kind of callback model
> > for "percpu_alloc()" and it being explicitly about allocations
> > becoming available or going away, rather than about CPU state.
> >
> > Comments?
>
> Seems like a good fit for us, though I'll let Dave Chinner chime in
> since he's the one with more per-cpu list patches coming up.

I think this might just be semantics - I understand the intent of
Linus's suggestion is to provide a hotplug callback with a per-cpu
allocation region. The thing that we rely on here is, however, that
per-cpu allocation regions are static for the life of the allocation and
always cover all possible CPUs. Hence the callbacks we want here are
really about CPU state changes, and I'm not convinced it's the best idea
to conflate CPU state change callbacks with memory allocation...

That said, I'm all for a better interface to the CPU hotplug
notifications. The current interface is ... esoteric, and to understand
how to use it effectively requires becoming a CPU hotplug expert.
There's something to be said for the simplicity of the old
register_cpu_notifier() interface we used to have...

Cheers,

Dave.
On Thu, Sep 02, 2021 at 09:13:24PM +0200, Thomas Gleixner wrote:
> > I'm only throwing this out as a reaction to this - I'm not sure
> > another interface would be good or worthwhile, but that "enum
> > cpuhp_state" is ugly enough that I thought I'd rope in Thomas for CPU
> > hotplug, and the percpu memory allocation people for comments.
>
> It's not only about memory.
>
> > IOW, just _maybe_ we would want to have some kind of callback model
> > for "percpu_alloc()" and it being explicitly about allocations
> > becoming available or going away, rather than about CPU state.
>
> The per cpu storage in XFS does not go away. It contains a llist head
> and the queued work items need to be moved from the dead CPU to an alive
> CPU and exposed to a work queue for processing. Similar to what we do
> with timers, hrtimers and other stuff.
>
> If there are callbacks which are doing pretty much the same thing, then
> I'm all for a generic infrastructure for these.

In the block layer we've added a new per-cpu bio list, for which the
dead callback literally does nothing but free some memory. For that case
a simple callback would be neat, but I'm not sure how common that is.
Dave,

On Fri, Sep 03 2021 at 08:35, Dave Chinner wrote:
> On Thu, Sep 02, 2021 at 10:43:11AM -0700, Darrick J. Wong wrote:
> The part I dislike most about it is that we have to modify a header
> file that triggers full kernel rebuilds. Managing patch stacks and
> branches where one of them modifies such a header file means quick,
> XFS subsystem only kernel rebuilds are a rare thing...

If you don't care about ordering, you can avoid touching the global
header completely. The dynamic state ranges in PREPARE and ONLINE
provide exactly what you want. It's documented.

> That said, I'm all for a better interface to the CPU hotplug
> notifications. The current interface is ... esoteric, and to

What's so esoteric about:

    state = cpuhp_setup_state(CPUHP_BP_PREPARE_DYN, "xfs:prepare", func1, func2);
    state = cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "xfs:online", func3, func4);

Only if you care about callback ordering vs. other subsystems is adding
the state in the global header required. It's neither the end of the
world, nor is it rocket science, nor does it require expert knowledge to
do so.

> understand how to use it effectively requires becoming a CPU hotplug
> expert.

https://www.kernel.org/doc/html/latest/core-api/cpu_hotplug.html

If there is something missing in that documentation which makes you
think you need to become a CPU hotplug expert, please let me know. I'm
happy to expand that document.

> There's something to be said for the simplicity of the old
> register_cpu_notifier() interface we used to have...

There is a lot to be said about it. The simplicity of it made people do
the most hilarious things to deal with:

  - Ordering issues including build order dependencies
  - Asymmetry between bringup and teardown
  - The inability to test state transitions
  - ....

Back then when we converted the notifier mess, 35 of ~140 hotplug
notifiers (i.e. ~25%) contained bugs of all sorts. Quite some of them
were caused by the well understood simplicity of the hotplug notifier
mechanics.

I'm surely not missing any of that.

Thanks,

        tglx
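[Editorial sketch] For a subsystem that wants only an offline (DEAD) notification and no entry in the global enum, the dynamic-state API Thomas describes works out roughly as below. This is a sketch, not actual kernel code: the `foo_*` names are hypothetical. The point is that for a dynamic range, cpuhp_setup_state() returns the allocated state number, and that returned value is what must later be passed to cpuhp_remove_state():

```c
#include <linux/cpuhotplug.h>

/* Hypothetical module wanting only a "CPU dead" notification. */
static int foo_hp_state;

static int foo_cpu_dead(unsigned int cpu)
{
	/* drain/relocate this CPU's per-cpu data here */
	return 0;
}

static int __init foo_init(void)
{
	int ret;

	/* No bringup callback: pass NULL for startup, only teardown. */
	ret = cpuhp_setup_state(CPUHP_BP_PREPARE_DYN, "foo:dead",
				NULL, foo_cpu_dead);
	if (ret < 0)
		return ret;

	foo_hp_state = ret;	/* the dynamically allocated state */
	return 0;
}

static void __exit foo_exit(void)
{
	/* Removing the state invokes the teardown callback unless the
	 * _nocalls variant is used. */
	cpuhp_remove_state(foo_hp_state);
}
```

The "what returned state?" question raised later in the thread is exactly this: static states return 0 on success, while dynamic states return the state number that was handed out, and only that number identifies the state for removal.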
Hello,

On Thu, Sep 02, 2021 at 08:47:42AM -0700, Linus Torvalds wrote:
> On Tue, Aug 31, 2021 at 2:18 PM Darrick J. Wong <djwong@kernel.org> wrote:
> >
> > As for new features: we now batch inode inactivations in percpu
> > background threads, which sharply decreases frontend thread wait time
> > when performing file deletions and should improve overall directory tree
> > deletion times.
>
> So no complaints on this one, but I do have a reaction: we have a lot
> of these random CPU hotplug events, and XFS now added another one.
>
> I don't see that as a problem, but just the _randomness_ of these
> callbacks makes me go "hmm". And that "enum cpuhp_state" thing isn't
> exactly a thing of beauty, and just makes me think there's something
> nasty going on.
>
> For the new xfs usage, I really get the feeling that it's not that XFS
> actually cares about the CPU states, but that this is literally tied
> to just having percpu state allocated and active, and that maybe it
> would be sensible to have something more specific to that kind of use.
>
> We have other things that are very similar in nature - like the page
> allocator percpu caches etc, which for very similar reasons want cpu
> dead/online notification.
>
> I'm only throwing this out as a reaction to this - I'm not sure
> another interface would be good or worthwhile, but that "enum
> cpuhp_state" is ugly enough that I thought I'd rope in Thomas for CPU
> hotplug, and the percpu memory allocation people for comments.
>
> IOW, just _maybe_ we would want to have some kind of callback model
> for "percpu_alloc()" and it being explicitly about allocations
> becoming available or going away, rather than about CPU state.
>
> Comments?

I think there are 2 pieces here from percpu's side:

  A) Onlining and offlining state related to a percpu alloc.
  B) Freeing backing memory of an allocation wrt hot plug.

An RFC was sent out for B) in [1], and you need A) for B).

I can see percpu having a callback model for basic allocations that are
independent, but for anything more complex, that subsystem would need to
register with hotplug anyway. It appears percpu_counter already has hot
plug support. percpu_refcount could be extended as well, but more
complex initialization like the runqueues and slab related allocations
would require work. In short, yes, I think A) is doable/reasonable.

Freeing the backing memory for A) seems trickier. We would have to
figure out a clean way to handle onlining/offlining racing with new
percpu allocations (adding or removing pages for the corresponding cpu's
chunk). To support A), init and onlining/offlining can be separate
phases, but for B) init/freeing would have to be rolled into
onlining/offlining. Without freeing, it's not incorrect for
for_each_online_cpu() to read a dead cpu's percpu values, but with
freeing it is.

I guess to summarize: A) seems like it might be a good idea with
init/destruction happening at allocation/freeing times. I'm a little
skeptical of B) in terms of complexity. If y'all think it's a good idea
I can look into it again.

[1] https://lore.kernel.org/lkml/20210601065147.53735-1-bharata@linux.ibm.com/

Thanks,

Dennis

> > Lastly, with this release, two new features have graduated to supported
> > status: inode btree counters (for faster mounts), and support for dates
> > beyond Y2038.
>
> Oh, I had thought Y2038 was already a non-issue for xfs. Silly me.
>
> Linus
On Fri, Sep 03, 2021 at 08:26:58AM +0200, Thomas Gleixner wrote:
> Dave,
>
> On Fri, Sep 03 2021 at 08:35, Dave Chinner wrote:
> > On Thu, Sep 02, 2021 at 10:43:11AM -0700, Darrick J. Wong wrote:
> > The part I dislike most about it is that we have to modify a header
> > file that triggers full kernel rebuilds. Managing patch stacks and
> > branches where one of them modifies such a header file means quick,
> > XFS subsystem only kernel rebuilds are a rare thing...
>
> If you don't care about ordering, you can avoid touching the global
> header completely. The dynamic state ranges in PREPARE and ONLINE
> provide exactly what you want. It's documented.

Ordering? When and why would I care about ordering?

il_last_pushed_lsn

> > That said, I'm all for a better interface to the CPU hotplug
> > notifications. The current interface is ... esoteric, and to
>
> What's so esoteric about:
>
> state = cpuhp_setup_state(CPUHP_BP_PREPARE_DYN, "xfs:prepare", func1, func2);
> state = cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "xfs:online", func3, func4);

I don't want -online- notifications. I only want _offline_
notifications, and according to the documentation, CPUHP_AP_ONLINE_DYN
gets called on both online and offline state changes.

Don't you see the cognitive dissonance that contradictory "use online
for offline" API naming like this causes? It easily scores negative
points on Rusty's API scale....
(http://sweng.the-davies.net/Home/rustys-api-design-manifesto)

Also, having to understand multiple callbacks just for different
operations is a bit of a WTF. What's the actual difference between the
"online" and "prepare down" callbacks? For online notifications, the
prepare down op is documented as the online hotplug error handling
function that undoes the online callback.

But if we are registering an -offline notification-, their use isn't
actually documented. Is it the same, or is it inverted? I have to go
read the code...

That is then followed by this gem:

  "The callback can be remove by invoking cpuhp_remove_state(). In case
  of a dynamically allocated state (CPUHP_AP_ONLINE_DYN) use the
  returned state. During the removal of a hotplug state the teardown
  callback will be invoked."

What does "use the returned state" mean? What returned state? Where did
it come from? It's not defined anywhere.

Then there's "the teardown callback will be invoked" - that's the first
reference to a "teardown callback" in the documentation. I have to
assume it means the "prepare_down" callback, but....

... then I wonder: the prepare_down callback is per-cpu. Does this mean
that when we remove a hp notifier, the prepare_down callback is called
for every CPU? Or something else? It's not documented; I've got to go
read the code just to work out the basic, fundamental behaviour of the
API I'm trying to use....

> Only if you care about callback ordering vs. other subsystems is adding
> the state in the global header required. It's neither the end of the
> world, nor is it rocket science, nor does it require expert knowledge to
> do so.
>
> > understand how to use it effectively requires becoming a CPU hotplug
> > expert.
>
> https://www.kernel.org/doc/html/latest/core-api/cpu_hotplug.html
>
> If there is something missing in that documentation which makes you
> think you need to become a CPU hotplug expert, please let me know. I'm
> happy to expand that document.

Deja vu. It's memory-ordering all over again. The fundamental problem is
that documentation is written by experts in the subject matter and, as
such, is full of implicit, unspoken knowledge the reader needs to know
before the documentation makes sense. It is written in a way that only
experts in the subject matter actually understand, because only they
have the underlying knowledge to fill in the blanks. And, worst of all,
said experts get upset and obnoxiously defensive when someone dares to
say that it's not perfect.

You might not think that using CPUHP_AP_ONLINE_DYN for CPU offline
events is hard to understand because you know the intimate details of
the implementation (i.e. the offline events are the reverse order state
transitions of online events). But for someone who hasn't touched the
CPU hotplug infrastructure in several years, it's totally baroque.

I still have little idea of what a "dynamically allocated state" is in
the CPU hotplug model vs an ordered one. It's not defined in the
documentation, nor is it explained how, why and when each should be
used. No examples are given as to when dynamic vs static order is
preferred or needed, and there's nothing in the documentation to tell
you how to just do offline notification. Hence we end up with code like
this:

void __init page_writeback_init(void)
{
	BUG_ON(wb_domain_init(&global_wb_domain, GFP_KERNEL));

	cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "mm/writeback:online",
			  page_writeback_cpu_online, NULL);
	cpuhp_setup_state(CPUHP_MM_WRITEBACK_DEAD, "mm/writeback:dead",
			  NULL, page_writeback_cpu_online);
}

Which mixes a dynamic notifier for CPU online, followed by a
specifically ordered offline notifier. Yet both call the same "online"
function, one as a notifier, the other as a "teardown" callback. But in
both cases, neither is used as a "teardown" for a failed hotplug case.

The WTF level here is sky high. Taken at face value it makes no sense at
all because it uses the same function for online and offline events.
According to the documentation, neither notifier handles hotplug
failure, and there's absolutely no clear reason for why one event is
dynamic and the other is static.

This is what makes it a terrible API: from my perspective, it seems
almost impossible to use correctly even though I've read the
documentation, spent a bunch of time reading the code and tried hard to
do the right thing. That's a -9 or -10 on the Rusty API scale...

> > There's something to be said for the simplicity of the old
> > register_cpu_notifier() interface we used to have...
>
> There is a lot to be said about it. The simplicity of it made people do
> the most hilarious things to deal with:
>
>  - Ordering issues including build order dependencies
>  - Asymmetry between bringup and teardown
>  - The inability to test state transitions
>  - ....
>
> Back then when we converted the notifier mess, 35 of ~140 hotplug
> notifiers (i.e. ~25%) contained bugs of all sorts. Quite some of them
> were caused by the well understood simplicity of the hotplug notifier
> mechanics.
>
> I'm surely not missing any of that.

You're conflating implementation problems with "API is unusable". The
API was very easy to understand and use, and those implementation
difficulties (like ordering and symmetry) could have easily been fixed
just by having a notifier block per defined transition, rather than
multiplexing all state transitions into a single notifier...

Indeed, that's the core difference between that old API and the current
API - the current API requires registering a notifier per state
transition, but that registers the notifier for both CPU up and down
transitions.

The problem with the new API is that the requirement for symmetry in
some subsystems has bled into the API, and so now all the subsystems
that *don't need/want symmetry* have to juggle some undocumented
esoteric combination of state definitions and callbacks to get the
behaviour they require. And that, AFAICT, means that those callbacks
can't handle failures in hotplug processing properly.

So rather than having a nice, simple "one callback per event" API, we've
got this convoluted thing that behaves according to a combination of
state definitions and callback definitions. Then the API is duplicated
into "_nocall()" variants (not documented!) because many subsystems do
not want hotplug callbacks run on setup/teardown of hotplug events.

The old hotplug notification *implementation* had problems, but the
*API* was not the cause of those bugs. In contrast, the current API
appears to make it impossible to implement notifiers for certain use
cases correctly, and that's directly where my statement that "you need
to be a cpuhp expert to use this" comes from....

Cheers,

Dave.
Dave,

On Sun, Sep 05 2021 at 10:21, Dave Chinner wrote:
> On Fri, Sep 03, 2021 at 08:26:58AM +0200, Thomas Gleixner wrote:
>> On Fri, Sep 03 2021 at 08:35, Dave Chinner wrote:
>> > On Thu, Sep 02, 2021 at 10:43:11AM -0700, Darrick J. Wong wrote:
>> > The part I dislike most about it is that we have to modify a header
>> > file that triggers full kernel rebuilds. Managing patch stacks and
>> > branches where one of them modifies such a header file means quick,
>> > XFS subsystem only kernel rebuilds are a rare thing...
>>
>> If you don't care about ordering, you can avoid touching the global
>> header completely. The dynamic state ranges in PREPARE and ONLINE
>> provide exactly what you want. It's documented.
>
> Ordering? When and why would I care about ordering?

Quite some hotplug related functionality cares about ordering of the
callbacks vs. other subsystems or core functionality:

  - workqueue startup/shutdown has to be at a well defined place

  - the PERF core online callback needs to be invoked before the
    various PERF drivers callbacks. On teardown the ordering is
    obviously reverse: notify drivers before core.

  - low level startup/teardown code has very strict ordering
    requirements

  - ....

If you don't have ordering requirements and you just want to be
notified, then you don't have to think about that at all. Set up a
dynamic state and be done with it.

> il_last_pushed_lsn

I assume that I'm not supposed to figure out how that is related to the
topic we are discussing. :)

>> > That said, I'm all for a better interface to the CPU hotplug
>> > notifications. The current interface is ... esoteric, and to
>>
>> What's so esoteric about:
>>
>> state = cpuhp_setup_state(CPUHP_BP_PREPARE_DYN, "xfs:prepare", func1, func2);
>> state = cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "xfs:online", func3, func4);
>
> I don't want -online- notifications. I only want _offline_
> notifications, and according to the documentation, CPUHP_AP_ONLINE_DYN
> gets called on both online and offline state changes.

So if you want only offline notification, then you simply hand in NULL
for the bringup callback:

    state = cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "xfs:offline", NULL, func4);

> Don't you see the cognitive dissonance that contradictory "use
> online for offline" API naming like this causes?

It's so amazingly hard to understand that offline is the reverse
operation of online, right?

> It easily scores negative points on Rusty's API scale....

I'm surely impressed by the APIs which Rusty constructed. The CPU
hotplug notifier register/unregister API was surely trivial and
intuitive, but the rest was a horror show if you had to do something
more complex than 'let me know that the CPU is dead'. Rusty admitted
himself that the original CPU hotplug design was a steaming pile of
s... including the 'so simple' API. See below.

> Also, having to understand multiple callbacks just for different
> operations is a bit of a WTF. What's the actual difference between
> the "online" and "prepare down" callbacks? For online notifications,
> the prepare down op is documented as the online hotplug error
> handling function that undoes the online callback.
>
> But if we are registering an -offline notification-, their use isn't
> actually documented. Is it the same, or is it inverted? I have to go
> read the code...

The documentation sucks. I told you that I'm happy to update it. See
patch below.

> Indeed, that's the core difference between that old API and the
> current API - the current API requires registering a notifier per
> state transition, but that registers the notifier for both CPU up
> and down transitions.

That registers a callback for startup and teardown, and either one can
be NULL if not required.

> The problem with the new API is that the requirement for symmetry in
> some subsystems has bled into the API, and so now all the subsystems
> that *don't need/want symmetry* have to juggle some undocumented
> esoteric combination of state definitions and callbacks to get the
> behaviour they require.

What kind of esoteric things do you need? Can you explain the
requirements and why you think it's hard to solve? Your handwaving here
is more than counterproductive.

> And that, AFAICT, means that those callbacks can't handle failures in
> hotplug processing properly.

Why so? Here is an example for online:

    [state 1]->startup() -> success
    [state 2]->startup() -> success
    [state 3]            -> skipped because startup is NULL
    [state 4]->startup() -> fail
    [state 3]->teardown()
    [state 2]            -> skipped because teardown is NULL
    [state 1]->teardown()

If state 2 does not provide a teardown callback, then there is nothing
to tear down, obviously. Same for state 3 on startup.

What does not work correctly in the error case? It does not matter at
all whether the transition from state 3 to state 0 happens due to a CPU
offline operation or because the online operation got aborted at state
4. It's exactly the same thing. It's that simple, really.

If you think that you need a 'bringup failed' notification for the
subsystem which set up state 4, or a 'bringup failed' notification for
the subsystem which registered state 2, then you are really doing
something fundamentally wrong. I know that the 'oh so simple and well
designed' original notifier sent these 'failed' notifications, but they
were sent for the completely wrong reasons and were part of the overall
problem.
I'm obviously failing to understand how a 'simple' single callback
interface which forces the user to handle a gazillion of obscure
events is so much superior to a straightforward linear state machine.

> So rather than having a nice, simple "one callback per event" API,
> we've got this convoluted thing that behaves according to a
> combination of state definition and callback defintions. Then the
> API is duplicated into "_nocall()" variants (not documented!)
> because many subsystems do not want hotplug callbacks run on
> setup/teardown of hotplug events.

About 70% of the state setups are utilizing the callback invocations
which means we don't have to have hundreds of duplicated and probably
differently buggy copies of that code all over the place. That's a
clear maintenance and code correctness win and that definitely
justifies an extra inline variant for those usage sites which don't
need/want those invocations for hopefully well thought out reasons.

> The old hotplug notification *implementation* had problems, but the
> *API* was not the cause of those bugs.

The state transitions, i.e. the @reason (event) argument to the
callbacks, were definitely part of the API and that was where the
main problem came from.

At the end we had _seventeen_ different reasons (events) because
people added new ones over and over to duct tape the issues of the
existing pile. What the hell is simple about that? Most of these
reasons and their possible transition schemes were undocumented and
not understandable at all.

That and the assbackwards failure handling turned that thing into an
unmaintainable and unfixable nightmare. The fact that 25% of all
notifier callbacks were buggy speaks for itself.

Not to talk about the locking issues which got papered over by the
fact that the hotplug locking was hidden from lockdep, the absurd
assumptions about invocation context and the workarounds to handle
callback invocations correctly when registering or unregistering a
notifier.
The charm of this "simple" API ended right there at the register /
unregister function invocation level. Nice for the occasional user
with simple requirements and a nightmare for those people who had to
debug the fallout of this simplicity.

Keep it simple is surely a good engineering principle, but if the
simplicity is restricted to the least important part of the problem
then it turns into a nightmare.

> In contrast, the current API appears to make it impossible to
> implement notifiers for certain use cases correctly, and that's
> directly where my statement that "you need to be a cpuhp expert to use
> this" comes from....

Just for the record:

  - When we converted all notifiers over there was not a single
    instance which could not be handled straightforwardly, and in the
    vast majority of the cases the code got simpler, smaller and
    easier to understand.

  - After the initial conversion more than a hundred state setups
    were added without a single complaint about the so complex API
    and no one showed up as a hotplug expert since then.

  - The number of obscure and hard to debug CPU hotplug bugs has gone
    close to zero since then. The obvious programming bugs in the
    core, the callbacks and API usage are just the usual programming
    errors which are not specific to CPU hotplug at all.

I'm sorry that this change which turned CPU hotplug into a reliable,
testable and instrumentable mechanism causes so much trouble for you.

I hope it's just the lack of coherent documentation which made you
unhappy. If the updated documentation does not answer your questions,
please let me know and please provide a coherent explanation of the
problem you are trying to solve. Either I can give you a hint or I
can identify further issues in the documentation. If it turns out
that there are functional shortcomings then I'm of course all ears as
well.

If you need a convenience API to install multiple states at once to
regain the "simple API" feeling, please let me know - I surely have
some ideas.
Thanks, tglx --- --- a/Documentation/core-api/cpu_hotplug.rst +++ b/Documentation/core-api/cpu_hotplug.rst @@ -156,95 +156,479 @@ hotplug states will be invoked, starting * Once all services are migrated, kernel calls an arch specific routine ``__cpu_disable()`` to perform arch specific cleanup. -Using the hotplug API ---------------------- -It is possible to receive notifications once a CPU is offline or onlined. This -might be important to certain drivers which need to perform some kind of setup -or clean up functions based on the number of available CPUs: :: - - #include <linux/cpuhotplug.h> - - ret = cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "X/Y:online", - Y_online, Y_prepare_down); - -*X* is the subsystem and *Y* the particular driver. The *Y_online* callback -will be invoked during registration on all online CPUs. If an error -occurs during the online callback the *Y_prepare_down* callback will be -invoked on all CPUs on which the online callback was previously invoked. -After registration completed, the *Y_online* callback will be invoked -once a CPU is brought online and *Y_prepare_down* will be invoked when a -CPU is shutdown. All resources which were previously allocated in -*Y_online* should be released in *Y_prepare_down*. -The return value *ret* is negative if an error occurred during the -registration process. Otherwise a positive value is returned which -contains the allocated hotplug for dynamically allocated states -(*CPUHP_AP_ONLINE_DYN*). It will return zero for predefined states. - -The callback can be remove by invoking ``cpuhp_remove_state()``. In case of a -dynamically allocated state (*CPUHP_AP_ONLINE_DYN*) use the returned state. -During the removal of a hotplug state the teardown callback will be invoked. - -Multiple instances -~~~~~~~~~~~~~~~~~~ -If a driver has multiple instances and each instance needs to perform the -callback independently then it is likely that a ''multi-state'' should be used. 
-First a multi-state state needs to be registered: :: - - ret = cpuhp_setup_state_multi(CPUHP_AP_ONLINE_DYN, "X/Y:online, - Y_online, Y_prepare_down); - Y_hp_online = ret; - -The ``cpuhp_setup_state_multi()`` behaves similar to ``cpuhp_setup_state()`` -except it prepares the callbacks for a multi state and does not invoke -the callbacks. This is a one time setup. -Once a new instance is allocated, you need to register this new instance: :: - - ret = cpuhp_state_add_instance(Y_hp_online, &d->node); - -This function will add this instance to your previously allocated -*Y_hp_online* state and invoke the previously registered callback -(*Y_online*) on all online CPUs. The *node* element is a ``struct -hlist_node`` member of your per-instance data structure. - -On removal of the instance: :: - cpuhp_state_remove_instance(Y_hp_online, &d->node) - -should be invoked which will invoke the teardown callback on all online -CPUs. - -Manual setup -~~~~~~~~~~~~ -Usually it is handy to invoke setup and teardown callbacks on registration or -removal of a state because usually the operation needs to performed once a CPU -goes online (offline) and during initial setup (shutdown) of the driver. However -each registration and removal function is also available with a ``_nocalls`` -suffix which does not invoke the provided callbacks if the invocation of the -callbacks is not desired. During the manual setup (or teardown) the functions -``cpus_read_lock()`` and ``cpus_read_unlock()`` should be used to inhibit CPU -hotplug operations. - - -The ordering of the events --------------------------- -The hotplug states are defined in ``include/linux/cpuhotplug.h``: - -* The states *CPUHP_OFFLINE* … *CPUHP_AP_OFFLINE* are invoked before the - CPU is up. -* The states *CPUHP_AP_OFFLINE* … *CPUHP_AP_ONLINE* are invoked - just the after the CPU has been brought up. The interrupts are off and - the scheduler is not yet active on this CPU. 
Starting with *CPUHP_AP_OFFLINE* - the callbacks are invoked on the target CPU. -* The states between *CPUHP_AP_ONLINE_DYN* and *CPUHP_AP_ONLINE_DYN_END* are - reserved for the dynamic allocation. -* The states are invoked in the reverse order on CPU shutdown starting with - *CPUHP_ONLINE* and stopping at *CPUHP_OFFLINE*. Here the callbacks are - invoked on the CPU that will be shutdown until *CPUHP_AP_OFFLINE*. - -A dynamically allocated state via *CPUHP_AP_ONLINE_DYN* is often enough. -However if an earlier invocation during the bring up or shutdown is required -then an explicit state should be acquired. An explicit state might also be -required if the hotplug event requires specific ordering in respect to -another hotplug event. + +The CPU hotplug API +=================== + +CPU hotplug state machine +------------------------- + +CPU hotplug uses a trivial state machine with a linear state space from +CPUHP_OFFLINE to CPUHP_ONLINE. Each state has a startup and a teardown +callback. + +When a CPU is onlined, the startup callbacks are invoked sequentially until +the state CPUHP_ONLINE is reached. They can also be invoked when the +callbacks of a state are set up or an instance is added to a multi-instance +state. + +When a CPU is offlined the teardown callbacks are invoked in the reverse +order sequentially until the state CPUHP_OFFLINE is reached. They can also +be invoked when the callbacks of a state are removed or an instance is +removed from a multi-instance state. + +If a usage site requires only a callback in one direction of the hotplug +operations (CPU online or CPU offline) then the other not required callback +can be set to NULL when the state is set up. + +The state space is divided into three sections: + +* The PREPARE section + + The PREPARE section covers the state space from CPUHP_OFFLINE to CPUHP_BRINGUP_CPU + + The startup callbacks in this section are invoked before the CPU is + started during a CPU online operation. 
The teardown callbacks are invoked + after the CPU has become dysfunctional during a CPU offline operation. + + The callbacks are invoked on a control CPU as they can't obviously run on + the hotplugged CPU which is either not yet started or has become + dysfunctional already. + + The startup callbacks are used to setup resources which are required to + bring a CPU successfully online. The teardown callbacks are used to free + resources or to move pending work to an online CPU after the hotplugged + CPU became dysfunctional. + + The startup callbacks are allowed to fail. If a callback fails, the CPU + online operation is aborted and the CPU is brought down to the previous + state (usually CPUHP_OFFLINE) again. + + The teardown callbacks in this section are not allowed to fail. + +* The STARTING section + + The STARTING section covers the state space between CPUHP_BRINGUP_CPU + 1 + and CPUHP_AP_ONLINE + + The startup callbacks in this section are invoked on the hotplugged CPU + with interrupts disabled during a CPU online operation in the early CPU + setup code. The teardown callbacks are invoked with interrupts disabled + on the hotplugged CPU during a CPU offline operation shortly before the + CPU is completely shut down. + + The callbacks in this section are not allowed to fail. + + The callbacks are used for low level hardware initialization/shutdown and + for core subsystems. + +* The ONLINE section + + The ONLINE section covers the state space between CPUHP_AP_ONLINE + 1 and + CPUHP_ONLINE. + + The startup callbacks in this section are invoked on the hotplugged CPU + during a CPU online operation. The teardown callbacks are invoked on the + hotplugged CPU during a CPU offline operation. + + The callbacks are invoked in the context of the per CPU hotplug thread, + which is pinned on the hotplugged CPU. The callbacks are invoked with + interrupts and preemption enabled. + + The callbacks are allowed to fail. 
When a callback fails the hotplug + operation is aborted and the CPU is brought back to the previous state. + +CPU online/offline operations +----------------------------- + +A successful online operation looks like this: :: + + [CPUHP_OFFLINE] + [CPUHP_OFFLINE + 1]->startup() -> success + [CPUHP_OFFLINE + 2]->startup() -> success + [CPUHP_OFFLINE + 3] -> skipped because startup == NULL + ... + [CPUHP_BRINGUP_CPU]->startup() -> success + === End of PREPARE section + [CPUHP_BRINGUP_CPU + 1]->startup() -> success + ... + [CPUHP_AP_ONLINE]->startup() -> success + === End of STARTUP section + [CPUHP_AP_ONLINE + 1]->startup() -> success + ... + [CPUHP_ONLINE - 1]->startup() -> success + [CPUHP_ONLINE] + +A successful offline operation looks like this: :: + + [CPUHP_ONLINE] + [CPUHP_ONLINE - 1]->teardown() -> success + ... + [CPUHP_AP_ONLINE + 1]->teardown() -> success + === Start of STARTUP section + [CPUHP_AP_ONLINE]->teardown() -> success + ... + [CPUHP_BRINGUP_CPU + 1]->teardown() + ... + === Start of PREPARE section + [CPUHP_BRINGUP_CPU]->teardown() + [CPUHP_OFFLINE + 3]->teardown() + [CPUHP_OFFLINE + 2] -> skipped because teardown == NULL + [CPUHP_OFFLINE + 1]->teardown() + [CPUHP_OFFLINE] + +A failed online operation looks like this: :: + + [CPUHP_OFFLINE] + [CPUHP_OFFLINE + 1]->startup() -> success + [CPUHP_OFFLINE + 2]->startup() -> success + [CPUHP_OFFLINE + 3] -> skipped because startup == NULL + ... + [CPUHP_BRINGUP_CPU]->startup() -> success + === End of PREPARE section + [CPUHP_BRINGUP_CPU + 1]->startup() -> success + ... + [CPUHP_AP_ONLINE]->startup() -> success + === End of STARTUP section + [CPUHP_AP_ONLINE + 1]->startup() -> success + --- + [CPUHP_AP_ONLINE + N]->startup() -> fail + [CPUHP_AP_ONLINE + (N - 1)]->teardown() + ... + [CPUHP_AP_ONLINE + 1]->teardown() + === Start of STARTUP section + [CPUHP_AP_ONLINE]->teardown() + ... + [CPUHP_BRINGUP_CPU + 1]->teardown() + ... 
+ === Start of PREPARE section + [CPUHP_BRINGUP_CPU]->teardown() + [CPUHP_OFFLINE + 3]->teardown() + [CPUHP_OFFLINE + 2] -> skipped because teardown == NULL + [CPUHP_OFFLINE + 1]->teardown() + [CPUHP_OFFLINE] + +A failed offline operation looks like this: :: + + [CPUHP_ONLINE] + [CPUHP_ONLINE - 1]->teardown() -> success + ... + [CPUHP_ONLINE - N]->teardown() -> fail + [CPUHP_ONLINE - (N - 1)]->startup() + ... + [CPUHP_ONLINE - 1]->startup() + [CPUHP_ONLINE] + +Recursive failures cannot be handled sensibly. Look at the following +example of a recursive fail due to a failed offline operation: :: + + [CPUHP_ONLINE] + [CPUHP_ONLINE - 1]->teardown() -> success + ... + [CPUHP_ONLINE - N]->teardown() -> fail + [CPUHP_ONLINE - (N - 1)]->startup() -> success + [CPUHP_ONLINE - (N - 2)]->startup() -> fail + +The CPU hotplug state machine stops right here and does not try to go back +down again because that would likely result in an endless loop: :: + + [CPUHP_ONLINE - (N - 1)]->teardown() -> success + [CPUHP_ONLINE - N]->teardown() -> fail + [CPUHP_ONLINE - (N - 1)]->startup() -> success + [CPUHP_ONLINE - (N - 2)]->startup() -> fail + [CPUHP_ONLINE - (N - 1)]->teardown() -> success + [CPUHP_ONLINE - N]->teardown() -> fail + +Lather, rinse and repeat. In this case the CPU left in state: :: + + [CPUHP_ONLINE - (N - 1)] + +which at least lets the system make progress and gives the user a chance to +debug or even resolve the situation. + +Allocating a state +------------------ + +There are two ways to allocate a CPU hotplug state: + +* Static allocation + + Static allocation has to be used when the subsystem or driver has + ordering requirements versus other CPU hotplug states. E.g. the PERF core + startup callback has to be invoked before the PERF driver startup + callbacks during a CPU online operation. During a CPU offline operation + the driver teardown callbacks have to be invoked before the core teardown + callback. 
The statically allocated states are described by constants in + the cpuhp_state enum which can be found in include/linux/cpuhotplug.h. + + Insert the state into the enum at the proper place so the ordering + requirements are fulfilled. The state constant has to be used for state + setup and removal. + + Static allocation is also required when the state callbacks are not set + up at runtime and are part of the initializer of the CPU hotplug state + array in kernel/cpu.c. + +* Dynamic allocation + + When there are no ordering requirements for the state callbacks then + dynamic allocation is the preferred method. The state number is allocated + by the setup function and returned to the caller on success. + + Only the PREPARE and ONLINE sections provide a dynamic allocation + range. The STARTING section does not as most of the callbacks in that + section have explicit ordering requirements. + +Setup of a CPU hotplug state +---------------------------- + +The core code provides the following functions to setup a state: + +* cpuhp_setup_state(state, name, startup, teardown) +* cpuhp_setup_state_nocalls(state, name, startup, teardown) +* cpuhp_setup_state_cpuslocked(state, name, startup, teardown) +* cpuhp_setup_state_nocalls_cpuslocked(state, name, startup, teardown) + +For cases where a driver or a subsystem has multiple instances and the same +CPU hotplug state callbacks need to be invoked for each instance, the CPU +hotplug core provides multi-instance support. The advantage over driver +specific instance lists is that the instance related functions are fully +serialized against CPU hotplug operations and provide the automatic +invocations of the state callbacks on add and removal. 
To set up such a +multi-instance state the following function is available: + +* cpuhp_setup_state_multi(state, name, startup, teardown) + +The @state argument is either a statically allocated state or one of the +constants for dynamically allocated states - CPUHP_PREPARE_DYN, +CPUHP_ONLINE_DYN - depending on the state section (PREPARE, ONLINE) for +which a dynamic state should be allocated. + +The @name argument is used for sysfs output and for instrumentation. The +naming convention is "subsys:mode" or "subsys/driver:mode", +e.g. "perf:mode" or "perf/x86:mode". The common mode names: + +======== ======================================================= +prepare For states in the PREPARE section + +dead For states in the PREPARE section which do not provide + a startup callback + +starting For states in the STARTING section + +dying For states in the STARTING section which do not provide + a startup callback + +online For states in the ONLINE section + +offline For states in the ONLINE section which do not provide + a startup callback +======== ======================================================= + +As the @name argument is only used for sysfs and instrumentation other mode +descriptors can be used as well if they describe the nature of the state +better than the common ones. + +Examples for @name arguments: "perf/online", "perf/x86:prepare", +"RCU/tree:dying", "sched/waitempty" + +The @startup argument is a function pointer to the callback which should be +invoked during a CPU online operation. If the usage site does not require a +startup callback set the pointer to NULL. + +The @teardown argument is a function pointer to the callback which should +be invoked during a CPU offline operation. If the usage site does not +require a teardown callback set the pointer to NULL. 
+ +The functions differ in the way how the installed callbacks are treated: + + * cpuhp_setup_state_nocalls(), cpuhp_setup_state_nocalls_cpuslocked() + and cpuhp_setup_state_multi() only install the callbacks + + * cpuhp_setup_state() and cpuhp_setup_state_cpuslocked() install the + callbacks and invoke the @startup callback (if not NULL) for all online + CPUs which have currently a state greater than the newly installed + state. Depending on the state section the callback is either invoked on + the current CPU (PREPARE section) or on each online CPU (ONLINE + section) in the context of the CPU's hotplug thread. + + If a callback fails for CPU N then the teardown callback for CPU + 0 .. N-1 is invoked to rollback the operation. The state setup fails, + the callbacks for the state are not installed and in case of dynamic + allocation the allocated state is freed. + +The state setup and the callback invocations are serialized against CPU +hotplug operations. If the setup function has to be called from a CPU +hotplug read locked region, then the _cpuslocked() variants have to be +used. These functions cannot be used from within CPU hotplug callbacks. + +The function return values: + ======== =================================================================== + 0 Statically allocated state was successfully set up + + >0 Dynamically allocated state was successfully set up. + + The returned number is the state number which was allocated. If + the state callbacks have to be removed later, e.g. module + removal, then this number has to be saved by the caller and used + as @state argument for the state remove function. For + multi-instance states the dynamically allocated state number is + also required as @state argument for the instance add/remove + operations. 
+ + <0 Operation failed + ======== =================================================================== + +Removal of a CPU hotplug state +------------------------------ + +To remove a previously set up state, the following functions are provided: + +* cpuhp_remove_state(state) +* cpuhp_remove_state_nocalls(state) +* cpuhp_remove_state_nocalls_cpuslocked(state) +* cpuhp_remove_multi_state(state) + +The @state argument is either a statically allocated state or the state +number which was allocated in the dynamic range by cpuhp_setup_state*(). If +the state is in the dynamic range, then the state number is freed and +available for dynamic allocation again. + +The functions differ in the way how the installed callbacks are treated: + + * cpuhp_remove_state_nocalls(), cpuhp_remove_state_nocalls_cpuslocked() + and cpuhp_remove_multi_state() only remove the callbacks. + + * cpuhp_remove_state() removes the callbacks and invokes the teardown + callback (if not NULL) for all online CPUs which have currently a state + greater than the removed state. Depending on the state section the + callback is either invoked on the current CPU (PREPARE section) or on + each online CPU (ONLINE section) in the context of the CPU's hotplug + thread. + + In order to complete the removal, the teardown callback should not fail. + +The state removal and the callback invocations are serialized against CPU +hotplug operations. If the remove function has to be called from a CPU +hotplug read locked region, then the _cpuslocked() variants have to be +used. These functions cannot be used from within CPU hotplug callbacks. + +If a multi-instance state is removed then the caller has to remove all +instances first. 
+ +Multi-Instance state instance management +---------------------------------------- + +Once the multi-instance state is set up, instances can be added to the +state: + + * cpuhp_state_add_instance(state, node) + * cpuhp_state_add_instance_nocalls(state, node) + +The @state argument is either a statically allocated state or the state +number which was allocated in the dynamic range by cpuhp_setup_state_multi(). + +The @node argument is a pointer to a hlist_node which is embedded in the +instance's data structure. The pointer is handed to the multi-instance +state callbacks and can be used by the callback to retrieve the instance +via container_of(). + +The functions differ in the way how the installed callbacks are treated: + + * cpuhp_state_add_instance_nocalls() only adds the instance to the + multi-instance state's node list. + + * cpuhp_state_add_instance() adds the instance and invokes the startup + callback (if not NULL) associated with @state for all online CPUs which + have currently a state greater than @state. The callback is only + invoked for the to be added instance. Depending on the state section + the callback is either invoked on the current CPU (PREPARE section) or + on each online CPU (ONLINE section) in the context of the CPU's hotplug + thread. + + If a callback fails for CPU N then the teardown callback for CPU + 0 .. N-1 is invoked to rollback the operation, the function fails and + the instance is not added to the node list of the multi-instance state. + +To remove an instance from the state's node list these functions are +available: + + * cpuhp_state_remove_instance(state, node) + * cpuhp_state_remove_instance_nocalls(state, node) + +The arguments are the same as for the cpuhp_state_add_instance*() +variants above. + +The functions differ in the way how the installed callbacks are treated: + + * cpuhp_state_remove_instance_nocalls() only removes the instance from the + state's node list. 
+ + * cpuhp_state_remove_instance() removes the instance and invokes the + teardown callback (if not NULL) associated with @state for all online + CPUs which have currently a state greater than @state. The callback is + only invoked for the to be removed instance. Depending on the state + section the callback is either invoked on the current CPU (PREPARE + section) or on each online CPU (ONLINE section) in the context of the + CPU's hotplug thread. + + In order to complete the removal, the teardown callback should not fail. + +The node list add/remove operations and the callback invocations are +serialized against CPU hotplug operations. These functions cannot be used +from within CPU hotplug callbacks and CPU hotplug read locked regions. + +Examples +-------- + +Setup and teardown a statically allocated state in the STARTING section for +notifications on online and offline operations: :: + + ret = cpuhp_setup_state(CPUHP_SUBSYS_STARTING, "subsys:starting", subsys_cpu_starting, subsys_cpu_dying); + if (ret < 0) + return ret; + .... + cpuhp_remove_state(CPUHP_SUBSYS_STARTING); + +Setup and teardown a dynamically allocated state in the ONLINE section +for notifications on offline operations: :: + + state = cpuhp_setup_state(CPUHP_ONLINE_DYN, "subsys:offline", NULL, subsys_cpu_offline); + if (state < 0) + return state; + .... + cpuhp_remove_state(state); + +Setup and teardown a dynamically allocated state in the ONLINE section +for notifications on online operations without invoking the callbacks: :: + + state = cpuhp_setup_state_nocalls(CPUHP_ONLINE_DYN, "subsys:online", subsys_cpu_online, NULL); + if (state < 0) + return state; + .... + cpuhp_remove_state_nocalls(state); + +Setup, use and teardown a dynamically allocated multi-instance state in the +ONLINE section for notifications on online and offline operation: :: + + state = cpuhp_setup_state_multi(CPUHP_ONLINE_DYN, "subsys:online", subsys_cpu_online, subsys_cpu_offline); + if (state < 0) + return state; + .... 
+ ret = cpuhp_state_add_instance(state, &inst1->node); + if (ret) + return ret; + .... + ret = cpuhp_state_add_instance(state, &inst2->node); + if (ret) + return ret; + .... + cpuhp_state_remove_instance(state, &inst1->node); + .... + cpuhp_state_remove_instance(state, &inst2->node); + .... + cpuhp_remove_multi_state(state); + Testing of hotplug states ========================= --- a/include/linux/cpuhotplug.h +++ b/include/linux/cpuhotplug.h @@ -22,8 +22,42 @@ * AP_ACTIVE AP_ACTIVE */ +/* + * CPU hotplug states. The state machine invokes the installed state + * startup callbacks sequentially from CPUHP_OFFLINE + 1 to CPUHP_ONLINE + * during a CPU online operation. During a CPU offline operation the + * installed teardown callbacks are invoked in the reverse order from + * CPUHP_ONLINE - 1 down to CPUHP_OFFLINE. + * + * The state space has three sections: PREPARE, STARTING and ONLINE. + * + * PREPARE: The callbacks are invoked on a control CPU before the + * hotplugged CPU is started up or after the hotplugged CPU has died. + * + * STARTING: The callbacks are invoked on the hotplugged CPU from the low level + * hotplug startup/teardown code with interrupts disabled. + * + * ONLINE: The callbacks are invoked on the hotplugged CPU from the per CPU + * hotplug thread with interrupts and preemption enabled. + * + * Adding explicit states to this enum is only necessary when: + * + * 1) The state is within the STARTING section + * + * 2) The state has ordering constraints vs. other states in the + * same section. + * + * If neither #1 nor #2 apply, please use the dynamic state space when + * setting up a state by using CPUHP_PREPARE_DYN or CPUHP_ONLINE_DYN + * for the @state argument of the setup function. + * + * See Documentation/core-api/cpu_hotplug.rst for further information and + * examples. 
+ */ enum cpuhp_state { CPUHP_INVALID = -1, + + /* PREPARE section invoked on a control CPU */ CPUHP_OFFLINE = 0, CPUHP_CREATE_THREADS, CPUHP_PERF_PREPARE, @@ -93,6 +127,11 @@ enum cpuhp_state { CPUHP_BP_PREPARE_DYN, CPUHP_BP_PREPARE_DYN_END = CPUHP_BP_PREPARE_DYN + 20, CPUHP_BRINGUP_CPU, + + /* + * STARTING section invoked on the hotplugged CPU in low level + * bringup and teardown code. + */ CPUHP_AP_IDLE_DEAD, CPUHP_AP_OFFLINE, CPUHP_AP_SCHED_STARTING, @@ -153,6 +192,8 @@ enum cpuhp_state { CPUHP_AP_ARM_CACHE_B15_RAC_DYING, CPUHP_AP_ONLINE, CPUHP_TEARDOWN_CPU, + + /* Online section invoked on the hotplugged CPU from the hotplug thread */ CPUHP_AP_ONLINE_IDLE, CPUHP_AP_SCHED_WAIT_EMPTY, CPUHP_AP_SMPBOOT_THREADS, @@ -214,14 +255,15 @@ int __cpuhp_setup_state_cpuslocked(enum int (*teardown)(unsigned int cpu), bool multi_instance); /** - * cpuhp_setup_state - Setup hotplug state callbacks with calling the callbacks + * cpuhp_setup_state - Setup hotplug state callbacks with calling the startup + * callback * @state: The state for which the calls are installed * @name: Name of the callback (will be used in debug output) - * @startup: startup callback function - * @teardown: teardown callback function + * @startup: startup callback function or NULL if not required + * @teardown: teardown callback function or NULL if not required * * Installs the callback functions and invokes the startup callback on - * the present cpus which have already reached the @state. + * the online cpus which have already reached the @state. 
*/ static inline int cpuhp_setup_state(enum cpuhp_state state, const char *name, @@ -231,6 +273,18 @@ static inline int cpuhp_setup_state(enum return __cpuhp_setup_state(state, name, true, startup, teardown, false); } +/** + * cpuhp_setup_state_cpuslocked - Setup hotplug state callbacks with calling + * startup callback from a cpus_read_lock() + * held region + * @state: The state for which the calls are installed + * @name: Name of the callback (will be used in debug output) + * @startup: startup callback function or NULL if not required + * @teardown: teardown callback function or NULL if not required + * + * Same as cpuhp_setup_state() except that it must be invoked from within a + * cpus_read_lock() held region. + */ static inline int cpuhp_setup_state_cpuslocked(enum cpuhp_state state, const char *name, int (*startup)(unsigned int cpu), @@ -242,14 +296,14 @@ static inline int cpuhp_setup_state_cpus /** * cpuhp_setup_state_nocalls - Setup hotplug state callbacks without calling the - * callbacks + * startup callback * @state: The state for which the calls are installed * @name: Name of the callback. - * @startup: startup callback function - * @teardown: teardown callback function + * @startup: startup callback function or NULL if not required + * @teardown: teardown callback function or NULL if not required * - * Same as @cpuhp_setup_state except that no calls are executed are invoked - * during installation of this callback. NOP if SMP=n or HOTPLUG_CPU=n. + * Same as cpuhp_setup_state() except that the startup callback is not + * invoked during installation. NOP if SMP=n or HOTPLUG_CPU=n. 
*/ static inline int cpuhp_setup_state_nocalls(enum cpuhp_state state, const char *name, @@ -260,6 +314,19 @@ static inline int cpuhp_setup_state_noca false); } +/** + * cpuhp_setup_state_nocalls_cpuslocked - Setup hotplug state callbacks without + * invoking the @bringup callback from + * a cpus_read_lock() held region + * callbacks + * @state: The state for which the calls are installed + * @name: Name of the callback. + * @startup: startup callback function or NULL if not required + * @teardown: teardown callback function or NULL if not required + * + * Same as cpuhp_setup_state_nocalls() except that it must be invoked from + * within a cpus_read_lock() held region. + */ static inline int cpuhp_setup_state_nocalls_cpuslocked(enum cpuhp_state state, const char *name, int (*startup)(unsigned int cpu), @@ -273,13 +340,13 @@ static inline int cpuhp_setup_state_noca * cpuhp_setup_state_multi - Add callbacks for multi state * @state: The state for which the calls are installed * @name: Name of the callback. - * @startup: startup callback function - * @teardown: teardown callback function + * @startup: startup callback function or NULL if not required + * @teardown: teardown callback function or NULL if not required * * Sets the internal multi_instance flag and prepares a state to work as a multi * instance callback. No callbacks are invoked at this point. The callbacks are * invoked once an instance for this state are registered via - * @cpuhp_state_add_instance or @cpuhp_state_add_instance_nocalls. + * cpuhp_state_add_instance() or cpuhp_state_add_instance_nocalls() */ static inline int cpuhp_setup_state_multi(enum cpuhp_state state, const char *name, @@ -305,8 +372,8 @@ int __cpuhp_state_add_instance_cpuslocke * @node: The node for this individual state. * * Installs the instance for the @state and invokes the startup callback on - * the present cpus which have already reached the @state. 
The @state must have - * been earlier marked as multi-instance by @cpuhp_setup_state_multi. + * the online cpus which have already reached the @state. The @state must have + * been earlier marked as multi-instance by cpuhp_setup_state_multi(). */ static inline int cpuhp_state_add_instance(enum cpuhp_state state, struct hlist_node *node) @@ -321,7 +388,8 @@ static inline int cpuhp_state_add_instan * @node: The node for this individual state. * * Installs the instance for the @state The @state must have been earlier - * marked as multi-instance by @cpuhp_setup_state_multi. + * marked as multi-instance by cpuhp_setup_state_multi. NOP if SMP=n or + * HOTPLUG_CPU=n. */ static inline int cpuhp_state_add_instance_nocalls(enum cpuhp_state state, struct hlist_node *node) @@ -329,6 +397,17 @@ static inline int cpuhp_state_add_instan return __cpuhp_state_add_instance(state, node, false); } +/** + * cpuhp_state_add_instance_nocalls_cpuslocked - Add an instance for a state + * without invoking the startup + * callback from a cpus_read_lock() + * held region. + * @state: The state for which the instance is installed + * @node: The node for this individual state. + * + * Same as cpuhp_state_add_instance_nocalls() except that it must be + * invoked from within a cpus_read_lock() held region. + */ static inline int cpuhp_state_add_instance_nocalls_cpuslocked(enum cpuhp_state state, struct hlist_node *node) @@ -353,7 +432,7 @@ static inline void cpuhp_remove_state(en /** * cpuhp_remove_state_nocalls - Remove hotplug state callbacks without invoking - * teardown + * the teardown callback * @state: The state for which the calls are removed */ static inline void cpuhp_remove_state_nocalls(enum cpuhp_state state) @@ -361,6 +440,14 @@ static inline void cpuhp_remove_state_no __cpuhp_remove_state(state, false); } +/** + * cpuhp_remove_state_nocalls_cpuslocked - Remove hotplug state callbacks without invoking + * teardown from a cpus_read_lock() held region. 
+ * @state: The state for which the calls are removed + * + * Same as cpuhp_remove_state nocalls() except that it must be invoked + * from within a cpus_read_lock() held region. + */ static inline void cpuhp_remove_state_nocalls_cpuslocked(enum cpuhp_state state) { __cpuhp_remove_state_cpuslocked(state, false); @@ -389,7 +476,7 @@ int __cpuhp_state_remove_instance(enum c * @node: The node for this individual state. * * Removes the instance and invokes the teardown callback on the present cpus - * which have already reached the @state. + * which have already reached @state. */ static inline int cpuhp_state_remove_instance(enum cpuhp_state state, struct hlist_node *node)
On 9/5/21 4:28 PM, Thomas Gleixner wrote: > Dave, > [snip] Hi, Doc. comments below... > I'm sorry that this change which turned CPU hotplug into a reliable, > testable and instrumentable mechanism causes so much trouble for you. I > hope it's just the lack of coherent documentation which made you > unhappy. > > If the updated documentation does not answer your questions, please let > me know and please provide a coherent explanation of the problem you are > trying to solve. Either I can give you an hint or I can identify further > issues in the documentation. > > If it turns out that there are functional shortcomings then I'm of > course all ears as well. > > If you need a conveniance API to install multiple states at once to > regain the "simple API" feeling, please let me know - I surely have some > ideas. > > Thanks, > > tglx > --- > --- a/Documentation/core-api/cpu_hotplug.rst > +++ b/Documentation/core-api/cpu_hotplug.rst > @@ -156,95 +156,479 @@ hotplug states will be invoked, starting > * Once all services are migrated, kernel calls an arch specific routine > ``__cpu_disable()`` to perform arch specific cleanup. > [snip] > + > +The CPU hotplug API > +=================== > + > +CPU hotplug state machine > +------------------------- > + > +CPU hotplug uses a trivial state machine with a linear state space from > +CPUHP_OFFLINE to CPUHP_ONLINE. Each state has a startup and a teardown > +callback. > + > +When a CPU is onlined, the startup callbacks are invoked sequentially until > +the state CPUHP_ONLINE is reached. They can also be invoked when the > +callbacks of a state are set up or an instance is added to a multi-instance > +state. > + > +When a CPU is offlined the teardown callbacks are invoked in the reverse > +order sequenctially until the state CPUHP_OFFLINE is reached. They can also sequentially > +be invoked when the callbacks of a state are removed or an instance is > +removed from a multi-instance state. 
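The linear state machine described in that paragraph can be sketched as a small self-contained userspace C model. Everything below is illustrative, not kernel code: the four-entry state space, the callback names, and the `*_model()` helpers are all made up for the sketch. Each state optionally carries a startup and a teardown callback; an online walk invokes startups in ascending order, and an offline walk (or a rollback after a failed startup) invokes teardowns in reverse.

```c
#include <stddef.h>

#define NR_STATES 4

/* One entry per state; a NULL callback means the state is skipped in
 * that direction, as the documentation text describes. */
struct hp_state {
	int (*startup)(unsigned int cpu);
	int (*teardown)(unsigned int cpu);
};

static struct hp_state states[NR_STATES];
static int n_up, n_down;

static int ok_up(unsigned int cpu)   { (void)cpu; n_up++;   return 0; }
static int ok_down(unsigned int cpu) { (void)cpu; n_down++; return 0; }
static int fail_up(unsigned int cpu) { (void)cpu; return -1; }

/* Online: invoke startup callbacks sequentially from state 0 upwards;
 * stop at the first failure and report the state reached. */
int cpu_up_model(unsigned int cpu)
{
	int st;

	for (st = 0; st < NR_STATES; st++) {
		if (states[st].startup && states[st].startup(cpu))
			break;
	}
	return st;
}

/* Offline (or rollback): invoke teardown callbacks in reverse order. */
void cpu_down_model(unsigned int cpu, int from)
{
	int st;

	for (st = from - 1; st >= 0; st--) {
		if (states[st].teardown)
			states[st].teardown(cpu);
	}
}

/* A failed online operation: state 2's startup fails, so the walk
 * stops there and the states below it are torn down again. */
int demo_failed_online(void)
{
	int reached;

	states[0] = (struct hp_state){ ok_up, ok_down };
	states[1] = (struct hp_state){ NULL, NULL };	/* skipped */
	states[2] = (struct hp_state){ fail_up, NULL };
	states[3] = (struct hp_state){ ok_up, ok_down };

	n_up = n_down = 0;
	reached = cpu_up_model(0);
	cpu_down_model(0, reached);
	return reached;
}
```

Running the demo walks up through state 0 (state 1 is skipped, its startup is NULL), fails at state 2, and then tears state 0 back down — the same "bring the CPU back to the previous state" behavior the failed-online diagram later in the patch shows.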
> + > +If a usage site requires only a callback in one direction of the hotplug > +operations (CPU online or CPU offline) then the other not required callback not-required > +can be set to NULL when the state is set up. > + > +The state space is divided into three sections: > + > +* The PREPARE section > + > + The PREPARE section covers the state space from CPUHP_OFFLINE to CPUHP_BRINGUP_CPU CPUHP_BRINGUP_CPU. > + > + The startup callbacks in this section are invoked before the CPU is > + started during a CPU online operation. The teardown callbacks are invoked > + after the CPU has become dysfunctional during a CPU offline operation. > + > + The callbacks are invoked on a control CPU as they can't obviously run on > + the hotplugged CPU which is either not yet started or has become > + dysfunctional already. > + > + The startup callbacks are used to setup resources which are required to > + bring a CPU successfully online. The teardown callbacks are used to free > + resources or to move pending work to an online CPU after the hotplugged > + CPU became dysfunctional. > + > + The startup callbacks are allowed to fail. If a callback fails, the CPU > + online operation is aborted and the CPU is brought down to the previous > + state (usually CPUHP_OFFLINE) again. > + > + The teardown callbacks in this section are not allowed to fail. > + > +* The STARTING section > + > + The STARTING section covers the state space between CPUHP_BRINGUP_CPU + 1 > + and CPUHP_AP_ONLINE and CPUHP_AP_ONLINE. > + > + The startup callbacks in this section are invoked on the hotplugged CPU > + with interrupts disabled during a CPU online operation in the early CPU > + setup code. The teardown callbacks are invoked with interrupts disabled > + on the hotplugged CPU during a CPU offline operation shortly before the > + CPU is completely shut down. > + > + The callbacks in this section are not allowed to fail. 
> + > + The callbacks are used for low level hardware initialization/shutdown and > + for core subsystems. > + > +* The ONLINE section > + > + The ONLINE section covers the state space between CPUHP_AP_ONLINE + 1 and > + CPUHP_ONLINE. > + > + The startup callbacks in this section are invoked on the hotplugged CPU > + during a CPU online operation. The teardown callbacks are invoked on the > + hotplugged CPU during a CPU offline operation. > + > + The callbacks are invoked in the context of the per CPU hotplug thread, > + which is pinned on the hotplugged CPU. The callbacks are invoked with > + interrupts and preemption enabled. > + > + The callbacks are allowed to fail. When a callback fails the hotplug > + operation is aborted and the CPU is brought back to the previous state. > + > +CPU online/offline operations > +----------------------------- > + > +A successful online operation looks like this: :: > + > + [CPUHP_OFFLINE] > + [CPUHP_OFFLINE + 1]->startup() -> success > + [CPUHP_OFFLINE + 2]->startup() -> success > + [CPUHP_OFFLINE + 3] -> skipped because startup == NULL > + ... > + [CPUHP_BRINGUP_CPU]->startup() -> success > + === End of PREPARE section > + [CPUHP_BRINGUP_CPU + 1]->startup() -> success > + ... > + [CPUHP_AP_ONLINE]->startup() -> success > + === End of STARTUP section > + [CPUHP_AP_ONLINE + 1]->startup() -> success > + ... > + [CPUHP_ONLINE - 1]->startup() -> success > + [CPUHP_ONLINE] > + > +A successful offline operation looks like this: :: > + > + [CPUHP_ONLINE] > + [CPUHP_ONLINE - 1]->teardown() -> success > + ... > + [CPUHP_AP_ONLINE + 1]->teardown() -> success > + === Start of STARTUP section > + [CPUHP_AP_ONLINE]->teardown() -> success > + ... > + [CPUHP_BRINGUP_ONLINE - 1]->teardown() > + ... 
> + === Start of PREPARE section > + [CPUHP_BRINGUP_CPU]->teardown() > + [CPUHP_OFFLINE + 3]->teardown() > + [CPUHP_OFFLINE + 2] -> skipped because teardown == NULL > + [CPUHP_OFFLINE + 1]->teardown() > + [CPUHP_OFFLINE] > + > +A failed online operation looks like this: :: > + > + [CPUHP_OFFLINE] > + [CPUHP_OFFLINE + 1]->startup() -> success > + [CPUHP_OFFLINE + 2]->startup() -> success > + [CPUHP_OFFLINE + 3] -> skipped because startup == NULL > + ... > + [CPUHP_BRINGUP_CPU]->startup() -> success > + === End of PREPARE section > + [CPUHP_BRINGUP_CPU + 1]->startup() -> success > + ... > + [CPUHP_AP_ONLINE]->startup() -> success > + === End of STARTUP section > + [CPUHP_AP_ONLINE + 1]->startup() -> success > + --- > + [CPUHP_AP_ONLINE + N]->startup() -> fail > + [CPUHP_AP_ONLINE + (N - 1)]->teardown() > + ... > + [CPUHP_AP_ONLINE + 1]->teardown() > + === Start of STARTUP section > + [CPUHP_AP_ONLINE]->teardown() > + ... > + [CPUHP_BRINGUP_ONLINE - 1]->teardown() > + ... > + === Start of PREPARE section > + [CPUHP_BRINGUP_CPU]->teardown() > + [CPUHP_OFFLINE + 3]->teardown() > + [CPUHP_OFFLINE + 2] -> skipped because teardown == NULL > + [CPUHP_OFFLINE + 1]->teardown() > + [CPUHP_OFFLINE] > + > +A failed offline operation looks like this: :: > + > + [CPUHP_ONLINE] > + [CPUHP_ONLINE - 1]->teardown() -> success > + ... > + [CPUHP_ONLINE - N]->teardown() -> fail > + [CPUHP_ONLINE - (N - 1)]->startup() > + ... > + [CPUHP_ONLINE - 1]->startup() > + [CPUHP_ONLINE] > + > +Recursive failures cannot be handled sensibly. Look at the following > +example of a recursive fail due to a failed offline operation: :: > + > + [CPUHP_ONLINE] > + [CPUHP_ONLINE - 1]->teardown() -> success > + ... 
> + [CPUHP_ONLINE - N]->teardown() -> fail > + [CPUHP_ONLINE - (N - 1)]->startup() -> success > + [CPUHP_ONLINE - (N - 2)]->startup() -> fail > + > +The CPU hotplug state machine stops right here and does not try to go back > +down again because that would likely result in an endless loop: :: > + > + [CPUHP_ONLINE - (N - 1)]->teardown() -> success > + [CPUHP_ONLINE - N]->teardown() -> fail > + [CPUHP_ONLINE - (N - 1)]->startup() -> success > + [CPUHP_ONLINE - (N - 2)]->startup() -> fail > + [CPUHP_ONLINE - (N - 1)]->teardown() -> success > + [CPUHP_ONLINE - N]->teardown() -> fail > + > +Lather, rinse and repeat. In this case the CPU left in state: :: CPU is left > + > + [CPUHP_ONLINE - (N - 1)] > + > +which at least lets the system make progress and gives the user a chance to > +debug or even resolve the situation. > + > +Allocating a state > +------------------ > + > +There are two ways to allocate a CPU hotplug state: > + > +* Static allocation > + > + Static allocation has to be used when the subsystem or driver has > + ordering requirements versus other CPU hotplug states. E.g. the PERF core > + startup callback has to be invoked before the PERF driver startup > + callbacks during a CPU online operation. During a CPU offline operation > + the driver teardown callbacks have to be invoked before the core teardown > + callback. The statically allocated states are described by constants in > + the cpuhp_state enum which can be found in include/linux/cpuhotplug.h. > + > + Insert the state into the enum at the proper place so the ordering > + requirements are fulfilled. The state constant has to be used for state > + setup and removal. > + > + Static allocation is also required when the state callbacks are not set > + up at runtime and are part of the initializer of the CPU hotplug state > + array in kernel/cpu.c. > + > +* Dynamic allocation > + > + When there are no ordering requirements for the state callbacks then > + dynamic allocation is the preferred method. 
The state number is allocated > + by the setup function and returned to the caller on success. > + > + Only the PREPARE and ONLINE sections provide a dynamic allocation > + range. The STARTING section does not as most of the callbacks in that > + section have explicit ordering requirements. > + > +Setup of a CPU hotplug state > +---------------------------- > + > +The core code provides the following functions to setup a state: > + > +* cpuhp_setup_state(state, name, startup, teardown) > +* cpuhp_setup_state_nocalls(state, name, startup, teardown) > +* cpuhp_setup_state_cpuslocked(state, name, startup, teardown) > +* cpuhp_setup_state_nocalls_cpuslocked(state, name, startup, teardown) > + > +For cases where a driver or a subsystem has multiple instances and the same > +CPU hotplug state callbacks need to be invoked for each instance, the CPU > +hotplug core provides multi-instance support. The advantage over driver > +specific instance lists is that the instance related functions are fully > +serialized against CPU hotplug operations and provide the automatic > +invocations of the state callbacks on add and removal. To set up such a > +multi-instance state the following function is available: > + > +* cpuhp_setup_state_multi(state, name, startup, teardown) > + > +The @state argument is either a statically allocated state or one of the > +constants for dynamically allocated states - CPUHP_PREPARE_DYN, > +CPUHP_ONLINE_DYN - depending on the state section (PREPARE, ONLINE) for > +which a dynamic state should be allocated. > + > +The @name argument is used for sysfs output and for instrumentation. The > +naming convention is "subsys:mode" or "subsys/driver:mode", > +e.g. "perf:mode" or "perf/x86:mode". 
The common mode names: names are: > + > +======== ======================================================= > +prepare For states in the PREPAREsection PREPARE section > + > +dead For states in the PREPARE section which do not provide > + a startup callback > + > +starting For states in the STARTING section > + > +dying For states in the STARTING section which do not provide > + a startup callback > + > +online For states in the ONLINE section > + > +offline For states in the ONLINE section which do not provide > + a startup callback > +======== ======================================================= > + > +As the @name argument is only used for sysfs and instrumentation other mode > +descriptors can be used as well if they describe the nature of the state > +better than the common ones. > + > +Examples for @name arguments: "perf/online", "perf/x86:prepare", > +"RCU/tree:dying", "sched/waitempty" > + > +The @startup argument is a function pointer to the callback which should be > +invoked during a CPU online operation. If the usage site does not require a > +startup callback set the pointer to NULL. > + > +The @teardown argument is a function pointer to the callback which should > +be invoked during a CPU offline operation. If the usage site does not > +require a teardown callback set the pointer to NULL. > + > +The functions differ in the way how the installed callbacks are treated: > + > + * cpuhp_setup_state_nocalls(), cpuhp_setup_state_nocalls_cpuslocked() > + and cpuhp_setup_state_multi() only install the callbacks > + > + * cpuhp_setup_state() and cpuhp_setup_state_cpuslocked() install the > + callbacks and invoke the @startup callback (if not NULL) for all online > + CPUs which have currently a state greater than the newly installed > + state. Depending on the state section the callback is either invoked on > + the current CPU (PREPARE section) or on each online CPU (ONLINE > + section) in the context of the CPU's hotplug thread. 
> + > + If a callback fails for CPU N then the teardown callback for CPU > + 0 .. N-1 is invoked to rollback the operation. The state setup fails, CPU 0? Does one of these fail since it's not an AP? > + the callbacks for the state are not installed and in case of dynamic > + allocation the allocated state is freed. > + > +The state setup and the callback invocations are serialized against CPU > +hotplug operations. If the setup function has to be called from a CPU > +hotplug read locked region, then the _cpuslocked() variants have to be > +used. These functions cannot be used from within CPU hotplug callbacks. > + > +The function return values: > + ======== =================================================================== > + 0 Statically allocated state was successfully set up > + > + >0 Dynamically allocated state was successfully set up. > + > + The returned number is the state number which was allocated. If > + the state callbacks have to be removed later, e.g. module > + removal, then this number has to be saved by the caller and used > + as @state argument for the state remove function. For > + multi-instance states the dynamically allocated state number is > + also required as @state argument for the instance add/remove > + operations. > + > + <0 Operation failed > + ======== =================================================================== > + > +Removal of a CPU hotplug state > +------------------------------ > + > +To remove a previously set up state, the following functions are provided: > + > +* cpuhp_remove_state(state) > +* cpuhp_remove_state_nocalls(state) > +* cpuhp_remove_state_nocalls_cpuslocked(state) > +* cpuhp_remove_multi_state(state) > + > +The @state argument is either a statically allocated state or the state > +number which was allocated in the dynamic range by cpuhp_setup_state*(). If > +the state is in the dynamic range, then the state number is freed and > +available for dynamic allocation again. 
> + > +The functions differ in the way how the installed callbacks are treated: > + > + * cpuhp_remove_state_nocalls(), cpuhp_remove_state_nocalls_cpuslocked() > + and cpuhp_remove_multi_state() only remove the callbacks. > + > + * cpuhp_remove_state() removes the callbacks and invokes the teardown > + callback (if not NULL) for all online CPUs which have currently a state > + greater than the removed state. Depending on the state section the > + callback is either invoked on the current CPU (PREPARE section) or on > + each online CPU (ONLINE section) in the context of the CPU's hotplug > + thread. > + > + In order to complete the removal, the teardown callback should not fail. > + > +The state removal and the callback invocations are serialized against CPU > +hotplug operations. If the remove function has to be called from a CPU > +hotplug read locked region, then the _cpuslocked() variants have to be > +used. These functions cannot be used from within CPU hotplug callbacks. > + > +If a multi-instance state is removed then the caller has to remove all > +instances first. > + > +Multi-Instance state instance management > +---------------------------------------- > + > +Once the multi-instance state is set up, instances can be added to the > +state: > + > + * cpuhp_state_add_instance(state, node) > + * cpuhp_state_add_instance_nocalls(state, node) > + > +The @state argument is either a statically allocated state or the state > +number which was allocated in the dynamic range by cpuhp_setup_state_multi(). > + > +The @node argument is a pointer to a hlist_node which is embedded in the I would say: to an hlist_node > +instance's data structure. The pointer is handed to the multi-instance > +state callbacks and can be used by the callback to retrieve the instance > +via container_of(). 
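The embedded-node pattern described above can be shown in a few lines of self-contained C. Note the hedges: `struct hlist_node` and `container_of()` below are simplified userspace stand-ins for the kernel versions, and `struct subsys_instance` with its `subsys_cpu_online()` callback is a made-up example, not a real driver.

```c
#include <stddef.h>

/* Simplified stand-in for the kernel's struct hlist_node. */
struct hlist_node {
	struct hlist_node *next, **pprev;
};

/* Simplified stand-in for the kernel's container_of(): map a pointer
 * to an embedded member back to the enclosing structure. */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

struct subsys_instance {
	int id;				/* per-instance data */
	struct hlist_node node;		/* handed to the state callbacks */
};

/* A multi-instance callback receives the embedded node and recovers
 * the enclosing instance from it. */
int subsys_cpu_online(unsigned int cpu, struct hlist_node *node)
{
	struct subsys_instance *inst;

	inst = container_of(node, struct subsys_instance, node);
	(void)cpu;
	return inst->id;
}

int demo_container_of(void)
{
	struct subsys_instance inst = { .id = 42 };

	return subsys_cpu_online(0, &inst.node);
}
```

The instance would be registered with something like `cpuhp_state_add_instance(state, &inst->node)`; the core then passes exactly that node pointer to the callback, which is why `container_of()` recovers the right instance.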
> + > +The functions differ in the way how the installed callbacks are treated: > + > + * cpuhp_state_add_instance_nocalls() and only adds the instance to the > + multi-instance state's node list. > + > + * cpuhp_state_add_instance() adds the instance and invokes the startup > + callback (if not NULL) associated with @state for all online CPUs which > + have currently a state greater than @state. The callback is only > + invoked for the to be added instance. Depending on the state section > + the callback is either invoked on the current CPU (PREPARE section) or > + on each online CPU (ONLINE section) in the context of the CPU's hotplug > + thread. > + > + If a callback fails for CPU N then the teardown callback for CPU > + 0 .. N-1 is invoked to rollback the operation, the function fails and all except the Boot CPU? > + the instance is not added to the node list of the multi-instance state. > + > +To remove an instance from the state's node list these functions are > +available: > + > + * cpuhp_state_remove_instance(state, node) > + * cpuhp_state_remove_instance_nocalls(state, node) > + > +The arguments are the same as for the the cpuhp_state_add_instance*() > +variants above. > + > +The functions differ in the way how the installed callbacks are treated: > + > + * cpuhp_state_remove_instance_nocalls() only removes the instance from the > + state's node list. > + > + * cpuhp_state_remove_instance() removes the instance and invokes the > + teardown callback (if not NULL) associated with @state for all online > + CPUs which have currently a state greater than @state. The callback is > + only invoked for the to be removed instance. Depending on the state > + section the callback is either invoked on the current CPU (PREPARE > + section) or on each online CPU (ONLINE section) in the context of the > + CPU's hotplug thread. > + > + In order to complete the removal, the teardown callback should not fail. 
> + > +The node list add/remove operations and the callback invocations are > +serialized against CPU hotplug operations. These functions cannot be used > +from within CPU hotplug callbacks and CPU hotplug read locked regions. > + > +Examples > +-------- > + > +Setup and teardown a statically allocated state in the STARTING section for > +notifications on online and offline operations: :: > + > + ret = cpuhp_setup_state(CPUHP_SUBSYS_STARTING, "subsys:starting", subsys_cpu_starting, subsys_cpu_dying); > + if (ret < 0) > + return ret; > + .... > + cpuhp_remove_state(CPUHP_SUBSYS_STARTING); > + > +Setup and teardown a dynamically allocated state in the ONLINE section > +for notifications on offline operations: :: > + > + state = cpuhp_setup_state(CPUHP_ONLINE_DYN, "subsys:offline", NULL, subsys_cpu_offline); > + if (state < 0) > + return state; > + .... > + cpuhp_remove_state(state); > + > +Setup and teardown a dynamically allocated state in the ONLINE section > +for notifications on online operations without invoking the callbacks: :: > + > + state = cpuhp_setup_state_nocalls(CPUHP_ONLINE_DYN, "subsys:online", subsys_cpi_online, NULL); _cpu_ > + if (state < 0) > + return state; > + .... > + cpuhp_remove_state_nocalls(state); > + > +Setup, use and teardown a dynamically allocated multi-instance state in the > +ONLINE section for notifications on online and offline operation: :: > + > + state = cpuhp_setup_state_multi(CPUHP_ONLINE_DYN, "subsys:online", subsys_cpu_online, subsys_cpu_offline); > + if (state < 0) > + return state; > + .... > + ret = cpuhp_state_add_instance(state, &inst1->node); > + if (ret) > + return ret; > + .... > + ret = cpuhp_state_add_instance(state, &inst2->node); > + if (ret) > + return ret; > + .... > + cpuhp_remove_instance(state, &inst1->node); > + .... > + cpuhp_remove_instance(state, &inst2->node); > + .... > + remove_multi_state(state); > + > > Testing of hotplug states > =========================
Randy,

On Sun, Sep 05 2021 at 19:11, Randy Dunlap wrote:
> On 9/5/21 4:28 PM, Thomas Gleixner wrote:
>> + * cpuhp_setup_state() and cpuhp_setup_state_cpuslocked() install the
>> +   callbacks and invoke the @startup callback (if not NULL) for all online
>> +   CPUs which have currently a state greater than the newly installed
>> +   state. Depending on the state section the callback is either invoked on
>> +   the current CPU (PREPARE section) or on each online CPU (ONLINE
>> +   section) in the context of the CPU's hotplug thread.
>> +
>> +   If a callback fails for CPU N then the teardown callback for CPU
>> +   0 .. N-1 is invoked to rollback the operation. The state setup fails,
>
> CPU 0? Does one of these fail since it's not an AP?

Yes. CPU 0 is not special in any way. The point is that the hotplug
state callbacks are set up late in the boot process or during runtime
when a module is loaded or some functionality is initialized on first
use. At that time the boot CPU (0) and usually the secondary CPUs are
online already.

So the driver/subsystem has two ways to bring the per-CPU functionality
into operation:

 1) Initialize all per-CPU state manually, which often involves queuing
    work on each online CPU or invoking SMP function calls on the online
    CPUs, and if all succeeds install the callbacks. If something goes
    wrong on one of the CPUs then the state has to be cleaned up on the
    CPUs which had their state set up correctly already. This of course
    has to be done with cpus_read_lock() held to serialize against a
    concurrent CPU hotplug operation.

 2) Let the hotplug core do that work. Set up a state with the
    corresponding callbacks. The core invokes the startup callback on
    all online CPUs (including 0) in the correct context:

	for_each_online_cpu(cpu) {
		ret = invoke_callback_on/for_cpu(cpu, startup);
		if (ret)
			goto err;
	...

Any of these callback invocations can fail, even the one on the boot
CPU.
In case of failure on CPU0 there is nothing to clean up, but if the Nth
CPU callback fails then the state has been established for CPU 0 to
CPU N-1 already, e.g. memory allocation, hardware setup ...

So instead of returning with half-set-up functionality, the core does
the rollback on CPU 0 to CPU N-1 by invoking the teardown callback
before returning the error code:

	err:
	for_each_online_cpu(cpu) {
		if (startup_done[cpu])
			invoke_callback_on/for_cpu(cpu, teardown);
	}

That means the call site does not have to mop up the half-initialized
state manually. All of that is properly serialized against CPU hotplug
operations.

>> +
>> +   If a callback fails for CPU N then the teardown callback for CPU
>> +   0 .. N-1 is invoked to rollback the operation, the function fails and
>
> all except the Boot CPU?

See above.

Thanks,

        tglx
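The rollback loop tglx sketches can be turned into a compilable userspace model. The fixed CPU count, the callback names, and the injected error value below are all illustrative assumptions, not kernel API; the point is only the shape of the logic: startup runs on each "online" CPU in turn, and on the first failure teardown runs on exactly the CPUs whose startup already succeeded.

```c
#include <stdbool.h>

#define NR_CPUS_MODEL 4

static bool startup_done[NR_CPUS_MODEL];
static int ups, downs;

static int fail_on_2(unsigned int cpu)
{
	if (cpu == 2)
		return -5;	/* e.g. an allocation failure on this CPU */
	ups++;
	return 0;
}

static void count_teardown(unsigned int cpu)
{
	(void)cpu;
	downs++;
}

/* Invoke startup on every "online" CPU; on the first failure, invoke
 * teardown on exactly the CPUs whose startup already succeeded, so the
 * caller never sees half-set-up state. */
int setup_all_cpus(int (*startup)(unsigned int),
		   void (*teardown)(unsigned int))
{
	unsigned int cpu;
	int ret = 0;

	for (cpu = 0; cpu < NR_CPUS_MODEL; cpu++) {
		ret = startup(cpu);
		if (ret)
			goto err;
		startup_done[cpu] = true;
	}
	return 0;
err:
	for (cpu = 0; cpu < NR_CPUS_MODEL; cpu++) {
		if (startup_done[cpu]) {
			teardown(cpu);
			startup_done[cpu] = false;
		}
	}
	return ret;
}

int demo_rollback(void)
{
	ups = downs = 0;
	return setup_all_cpus(fail_on_2, count_teardown);
}
```

With the failure injected on CPU 2, CPUs 0 and 1 come up and exactly those two are torn down again before the error is returned, which is the contract the documentation patch describes for cpuhp_setup_state() and cpuhp_state_add_instance().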