
[RFC,v3,1/9] CPU hotplug: Provide APIs to prevent CPU offline from atomic context

Message ID 20121207173759.27305.84316.stgit@srivatsabhat.in.ibm.com (mailing list archive)
State RFC, archived

Commit Message

Srivatsa S. Bhat Dec. 7, 2012, 5:38 p.m. UTC
There are places where preempt_disable() is used to prevent any CPU from
going offline during the critical section. Let us call them "atomic
hotplug readers" ("atomic" because they run in atomic contexts).

Today, preempt_disable() works because the writer uses stop_machine().
But once stop_machine() is gone, the readers won't be able to prevent
CPUs from going offline using preempt_disable().

The intent of this patch is to provide synchronization APIs for such
atomic hotplug readers, to prevent (any) CPUs from going offline, without
depending on stop_machine() at the writer-side. The new APIs will look
something like this:  get/put_online_cpus_atomic()
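
For example, a hotplug reader that currently relies on preempt_disable()
would be converted roughly like this (an illustrative sketch, not code
from this patch; do_something() is just a placeholder):

	/* Today: preempt_disable() blocks CPU offline via stop_machine() */
	preempt_disable();
	for_each_online_cpu(cpu)
		do_something(cpu);	/* must not race with CPU offline */
	preempt_enable();

	/* With the new APIs: explicit atomic hotplug-reader protection */
	get_online_cpus_atomic();	/* returns with preemption disabled */
	for_each_online_cpu(cpu)
		do_something(cpu);
	put_online_cpus_atomic();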

Some important design requirements and considerations:
-----------------------------------------------------

1. Scalable synchronization at the reader-side, especially in the fast-path

   Any synchronization at the atomic hotplug reader side must be highly
   scalable - avoid global single-holder locks/counters etc. These paths
   currently use the extremely fast preempt_disable(); our replacement
   for preempt_disable() should not become ridiculously costly, and it
   should not serialize the readers among themselves needlessly.

   At a minimum, the new APIs must be extremely fast at the reader side
   at least in the fast-path, when no CPU offline writers are active.

2. preempt_disable() was recursive. The replacement should also be recursive.

3. No (new) lock-ordering restrictions

   preempt_disable() was super-flexible. It didn't impose any ordering
   restrictions or rules for nesting. Our replacement should also be equally
   flexible and usable.

4. No deadlock possibilities

   Per-cpu locking is not the way to go if we want to have relaxed rules
   for lock-ordering, because we can end up in circular-locking dependencies
   as explained in https://lkml.org/lkml/2012/12/6/290

   So, avoid per-cpu locking schemes (per-cpu locks/per-cpu atomic counters
   with spin-on-contention etc) as much as possible.
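
   To illustrate the above (a reconstruction of the scenario from the link;
   'random_lock' and the CPU numbers are only for illustration):

      Readers:

               CPU 0                               CPU 1
               ------                              ------

      1.    spin_lock(&random_lock);            read_lock(rwlock of CPU 1);

      2.    read_lock(rwlock of CPU 0);         spin_lock(&random_lock);

      Writer:

               CPU 2:
               ------

            for_each_online_cpu(cpu)
                    write_lock(rwlock of 'cpu');

   If the writer write-locks CPU 0's rwlock between steps 1 and 2, CPU 0
   spins on its own rwlock while holding random_lock, CPU 1 spins on
   random_lock while holding its rwlock for read, and the writer spins on
   CPU 1's rwlock: a three-way circular wait.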


Implementation of the design:
----------------------------

We use global rwlocks for synchronization, because then we won't get into
lock-ordering related problems (unlike per-cpu locks). However, global
rwlocks lead to unnecessary cache-line bouncing even when there are no
hotplug writers present, which can slow down the system needlessly.

Per-cpu counters can help solve the cache-line bouncing problem. So we
actually use the best of both: per-cpu counters (no-waiting) at the reader
side in the fast-path, and global rwlocks in the slowpath.

[ Fastpath = no writer is active; Slowpath = a writer is active ]

IOW, the hotplug readers just increment/decrement their per-cpu refcounts
when no writer is active. When a writer becomes active, he signals all
readers to switch to global rwlocks for the duration of the CPU offline
operation. The readers ACK the writer's signal when it is safe for them
(ie., when they are about to start a fresh, non-nested read-side critical
section) and start using (holding) the global rwlock for read in their
subsequent critical sections.

The hotplug writer waits for ACKs from every reader, and then acquires
the global rwlock for write and takes the CPU offline. Then the writer
signals all readers that the CPU offline is done, and that they can go back
to using their per-cpu refcounts again.
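
In simplified pseudo-code, the switch works roughly like this (barriers,
irq-safety and the writer's own reentrancy check are omitted; see the
patch below for the real thing):

	get_online_cpus_atomic():
		preempt_disable();
		if (no writer has signalled this CPU)
			increment this CPU's reader_refcount;	/* fastpath */
		else if (signal not yet ACKed && not nested)
			ACK the signal; read_lock(&hotplug_rwlock); /* slowpath */
		else
			stay on whichever path is already in use;

	put_online_cpus_atomic():
		undo whatever the matching get() did;
		preempt_enable();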

Note that the lock-safety (despite the per-cpu scheme) comes from the fact
that the readers can *choose* when to ACK the writer's signal (iow, when
to switch to the global rwlock). And the readers don't wait on anybody based
on the per-cpu counters. The only true synchronization that involves waiting
at the reader-side in this scheme, is the one arising from the global rwlock,
which is safe from the circular locking dependency problems mentioned above
(because it is global).

Reader-writer locks and per-cpu counters are recursive, so they can be
used in a nested fashion in the reader-path. And the "signal-and-ack-with-no-
reader-spinning" design of switching the synchronization scheme ensures that
you can safely nest and call these APIs in any way you want, just like
preempt_disable()/enable.
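
For example, the following (mirroring nested preempt_disable()) is
perfectly fine:

	get_online_cpus_atomic();
	...
		get_online_cpus_atomic();	/* nested reader */
		...
		put_online_cpus_atomic();
	...
	put_online_cpus_atomic();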

Together, these satisfy all the requirements mentioned above.

I'm indebted to Michael Wang and Xiao Guangrong for their numerous thoughtful
suggestions and ideas, which inspired and influenced many of the decisions in
this as well as previous designs. Thanks a lot Michael and Xiao!

Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
---

 include/linux/cpu.h |    4 +
 kernel/cpu.c        |  252 ++++++++++++++++++++++++++++++++++++++++++++++++++-
 2 files changed, 253 insertions(+), 3 deletions(-)



Comments

Tejun Heo Dec. 7, 2012, 5:57 p.m. UTC | #1
On Fri, Dec 07, 2012 at 11:08:13PM +0530, Srivatsa S. Bhat wrote:
> 4. No deadlock possibilities
> 
>    Per-cpu locking is not the way to go if we want to have relaxed rules
>    for lock-ordering, because we can end up in circular-locking dependencies
>    as explained in https://lkml.org/lkml/2012/12/6/290
> 
>    So, avoid per-cpu locking schemes (per-cpu locks/per-cpu atomic counters
>    with spin-on-contention etc) as much as possible.

I really can't say I like this approach.  percpu locking is very
tricky and difficult to get right, and we should try our
best to avoid implementing subsystem-specific ones as much as
possible.  Also, I think the right approach would be auditing each
get_online_cpus_atomic() callsite and figuring out the proper locking
order rather than implementing a construct this unusual, especially as
hunting down the incorrect cases shouldn't be difficult given proper
lockdep annotation.

lg_lock doesn't do local nesting and I'm not sure how big a deal that
is as I don't know how many should be converted.  But if nesting is an
absolute necessity, it would be much better to implement a generic
rwlock variant (say, lg_rwlock) rather than implementing an unusual
cpuhotplug-specific percpu synchronization construct.

Thanks.
Tejun Heo Dec. 7, 2012, 6:16 p.m. UTC | #2
Hello, again.

On Fri, Dec 07, 2012 at 09:57:24AM -0800, Tejun Heo wrote:
> possible.  Also, I think the right approach would be auditing each
> get_online_cpus_atomic() callsite and figuring out the proper locking
> order rather than implementing a construct this unusual, especially as
> hunting down the incorrect cases shouldn't be difficult given proper
> lockdep annotation.

On second look, it looks like you're implementing proper
percpu_rwlock semantics as readers aren't supposed to induce circular
dependency directly.  Can you please work with Oleg to implement
proper percpu-rwlock and use that for CPU hotplug rather than
implementing it inside CPU hotplug?

Thanks.
Srivatsa S. Bhat Dec. 7, 2012, 6:24 p.m. UTC | #3
On 12/07/2012 11:27 PM, Tejun Heo wrote:
> On Fri, Dec 07, 2012 at 11:08:13PM +0530, Srivatsa S. Bhat wrote:
>> 4. No deadlock possibilities
>>
>>    Per-cpu locking is not the way to go if we want to have relaxed rules
>>    for lock-ordering, because we can end up in circular-locking dependencies
>>    as explained in https://lkml.org/lkml/2012/12/6/290
>>
>>    So, avoid per-cpu locking schemes (per-cpu locks/per-cpu atomic counters
>>    with spin-on-contention etc) as much as possible.
> 
> I really can't say I like this approach.  percpu locking is very
> tricky and difficult to get right, and we should try our
> best to avoid implementing subsystem-specific ones as much as
> possible.  Also, I think the right approach would be auditing each
> get_online_cpus_atomic() callsite and figuring out the proper locking
> order rather than implementing a construct this unusual, especially as
> hunting down the incorrect cases shouldn't be difficult given proper
> lockdep annotation.
> 
> lg_lock doesn't do local nesting and I'm not sure how big a deal that
> is as I don't know how many should be converted.  But if nesting is an
> absolute necessity, it would be much better to implement a generic
> rwlock variant (say, lg_rwlock) rather than implementing an unusual
> cpuhotplug-specific percpu synchronization construct.
>

To be honest, at a certain point in time while designing this, I did
realize that this was getting kinda overly complicated ;-) ... but I
wanted to see how this would actually work out when finished and get
some feedback on the same, hence I posted it out. But this also proves
that we _can_ actually compete with the flexibility of preempt_disable()
and still be safe with respect to locking, if we really want to ;-)

Regards,
Srivatsa S. Bhat

Tejun Heo Dec. 7, 2012, 6:31 p.m. UTC | #4
Hello, Srivatsa.

On Fri, Dec 07, 2012 at 11:54:01PM +0530, Srivatsa S. Bhat wrote:
> > lg_lock doesn't do local nesting and I'm not sure how big a deal that
> > is as I don't know how many should be converted.  But if nesting is an
> > absolute necessity, it would be much better to implement a generic
> > rwlock variant (say, lg_rwlock) rather than implementing an unusual
> > cpuhotplug-specific percpu synchronization construct.
> 
> To be honest, at a certain point in time while designing this, I did
> realize that this was getting kinda overly complicated ;-) ... but I
> wanted to see how this would actually work out when finished and get
> some feedback on the same, hence I posted it out. But this also proves
> that we _can_ actually compete with the flexibility of preempt_disable()
> and still be safe with respect to locking, if we really want to ;-)

I got confused by the comparison to preempt_disable(), but you're right
that percpu rwlock shouldn't be able to introduce a locking dependency
which doesn't exist with a non-percpu rwlock.  ie. write locking should
be atomic w.r.t. all readers.  At the simplest, this can be
implemented by writer backing out all the way if try-locking any CPU
fails and retrying the whole thing.  That should be correct but has
the potential of starving the writer.
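
(Purely for illustration, a sketch of that back-out-and-retry writer,
assuming a hypothetical per-CPU rwlock 'my_rwlock'; the retry loop is
exactly where the writer starvation would come from:)

	static DEFINE_PER_CPU(rwlock_t, my_rwlock);

	static void percpu_write_lock_all(void)
	{
		unsigned int cpu, undo;
	retry:
		for_each_possible_cpu(cpu) {
			if (write_trylock(&per_cpu(my_rwlock, cpu)))
				continue;
			/* trylock failed: back out all locks taken so far */
			for_each_possible_cpu(undo) {
				if (undo == cpu)
					break;
				write_unlock(&per_cpu(my_rwlock, undo));
			}
			cpu_relax();
			goto retry;
		}
	}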

What we need here is a generic percpu-rwlock.  I don't know which
exact implementation strategy we should choose.  Maybe your switching
to global rwlock is the right solution.  But, at any rate, I think it
would be best to implement proper percpu-rwlock and then apply it to
CPU hotplug.  It's actually gonna be pretty fitting as
get_online_cpus() is being converted to percpu-rwsem.  IIUC, Oleg has
been working on this for a while now.  Oleg, what do you think?

Thanks.
Srivatsa S. Bhat Dec. 7, 2012, 6:33 p.m. UTC | #5
On 12/07/2012 11:46 PM, Tejun Heo wrote:
> Hello, again.
> 
> On Fri, Dec 07, 2012 at 09:57:24AM -0800, Tejun Heo wrote:
>> possible.  Also, I think the right approach would be auditing each
>> get_online_cpus_atomic() callsite and figuring out the proper locking
>> order rather than implementing a construct this unusual, especially as
>> hunting down the incorrect cases shouldn't be difficult given proper
>> lockdep annotation.
> 
> On second look, it looks like you're implementing proper
> percpu_rwlock semantics

Ah, nice! I didn't realize that I was actually doing what I intended
to avoid! ;-)

Looking at the implementation of lglocks, and going by Oleg's earlier
comment that we just need to replace spinlock_t with rwlock_t in them
to get percpu_rwlocks, I was horrified at the kinds of circular locking
dependencies that they would be prone to, unless used carefully.

So I devised this scheme to be safe, while still having relaxed rules.
But if *this* is what percpu_rwlocks should ideally look like,
then great! :-)

> as readers aren't supposed to induce circular
> dependency directly.

Yep, in this scheme, nobody will end up in circular dependency.

>  Can you please work with Oleg to implement
> proper percpu-rwlock and use that for CPU hotplug rather than
> implementing it inside CPU hotplug?
> 

Sure, I'd be more than happy to!
 
Regards,
Srivatsa S. Bhat

Srivatsa S. Bhat Dec. 7, 2012, 6:38 p.m. UTC | #6
On 12/08/2012 12:01 AM, Tejun Heo wrote:
> Hello, Srivatsa.
> 
> On Fri, Dec 07, 2012 at 11:54:01PM +0530, Srivatsa S. Bhat wrote:
>>> lg_lock doesn't do local nesting and I'm not sure how big a deal that
>>> is as I don't know how many should be converted.  But if nesting is an
>>> absolute necessity, it would be much better to implement a generic
>>> rwlock variant (say, lg_rwlock) rather than implementing an unusual
>>> cpuhotplug-specific percpu synchronization construct.
>>
>> To be honest, at a certain point in time while designing this, I did
>> realize that this was getting kinda overly complicated ;-) ... but I
>> wanted to see how this would actually work out when finished and get
>> some feedback on the same, hence I posted it out. But this also proves
>> that we _can_ actually compete with the flexibility of preempt_disable()
>> and still be safe with respect to locking, if we really want to ;-)
> 
> I got confused by the comparison to preempt_disable(), but you're right
> that percpu rwlock shouldn't be able to introduce a locking dependency
> which doesn't exist with a non-percpu rwlock.  ie. write locking should
> be atomic w.r.t. all readers.

Yep!

>  At the simplest, this can be
> implemented by writer backing out all the way if try-locking any CPU
> fails and retrying the whole thing.  That should be correct but has
> the potential of starving the writer.
> 

Exactly! This is what I mentioned yesterday in the link below, and said
that it's not good because of writer starvation ("too wasteful"):
https://lkml.org/lkml/2012/12/6/290

> What we need here is a generic percpu-rwlock.  I don't know which
> exact implementation strategy we should choose.  Maybe your switching
> to global rwlock is the right solution.  But, at any rate, I think it
> would be best to implement proper percpu-rwlock and then apply it to
> CPU hotplug.  It's actually gonna be pretty fitting as
> get_online_cpus() is being converted to percpu-rwsem.  IIUC, Oleg has
> been working on this for a while now.  Oleg, what do you think?
> 

Hmm, that sounds good.

Regards,
Srivatsa S. Bhat

Oleg Nesterov Dec. 9, 2012, 7:14 p.m. UTC | #7
On 12/07, Srivatsa S. Bhat wrote:
>
> Per-cpu counters can help solve the cache-line bouncing problem. So we
> actually use the best of both: per-cpu counters (no-waiting) at the reader
> side in the fast-path, and global rwlocks in the slowpath.
>
> [ Fastpath = no writer is active; Slowpath = a writer is active ]
>
> IOW, the hotplug readers just increment/decrement their per-cpu refcounts
> when no writer is active.

Plus LOCK and cli/sti. I do not pretend I really know how bad this is
performance-wise though. And at first glance this looks overcomplicated.

But yes, it is easy to blame somebody else's code ;) And I can't suggest
something better at least right now. If I understand correctly, we can not
use, say, synchronize_sched() in _cpu_down() path, you also want to improve
the latency. And I guess something like kick_all_cpus_sync() is "too heavy".

Also, after a quick reading this doesn't look correct, please see below.

> +void get_online_cpus_atomic(void)
> +{
> +	unsigned int cpu = smp_processor_id();
> +	unsigned long flags;
> +
> +	preempt_disable();
> +	local_irq_save(flags);
> +
> +	if (cpu_hotplug.active_writer == current)
> +		goto out;
> +
> +	smp_rmb(); /* Paired with smp_wmb() in drop_writer_signal() */
> +
> +	if (likely(!writer_active(cpu))) {

WINDOW. Suppose that reader_active() == F.

> +		mark_reader_fastpath();
> +		goto out;

Why can't take_cpu_down() do announce_cpu_offline_begin() + sync_all_readers()
in between?

Looks like we should increment the counter first, then check writer_active().
And sync_atomic_reader() needs rmb between 2 atomic_read's.


Or. Again, suppose that reader_active() == F. But is_new_writer() == T.

> +	if (is_new_writer(cpu)) {
> +		/*
> +		 * ACK the writer's signal only if this is a fresh read-side
> +		 * critical section, and not just an extension of a running
> +		 * (nested) read-side critical section.
> +		 */
> +		if (!reader_active(cpu)) {
> +			ack_writer_signal();

What if take_cpu_down() does announce_cpu_offline_end() right before
ack_writer_signal()? In this case get_online_cpus_atomic() returns
with writer_signal == -1. If nothing else this breaks the next
raise_writer_signal().

Oleg.

Srivatsa S. Bhat Dec. 9, 2012, 7:50 p.m. UTC | #8
On 12/10/2012 12:44 AM, Oleg Nesterov wrote:
> On 12/07, Srivatsa S. Bhat wrote:
>>
>> Per-cpu counters can help solve the cache-line bouncing problem. So we
>> actually use the best of both: per-cpu counters (no-waiting) at the reader
>> side in the fast-path, and global rwlocks in the slowpath.
>>
>> [ Fastpath = no writer is active; Slowpath = a writer is active ]
>>
>> IOW, the hotplug readers just increment/decrement their per-cpu refcounts
>> when no writer is active.
> 
> Plus LOCK and cli/sti. I do not pretend I really know how bad this is
> performance-wise though. And at first glance this looks overcomplicated.
>

Hehe, I agree ;-) But I couldn't think of any other way to get rid of the
deadlock possibilities, other than using global rwlocks. So I designed a
way in which we can switch between per-cpu counters and global rwlocks
dynamically. Probably there is a smarter way to achieve what we want, dunno...
 
> But yes, it is easy to blame somebody else's code ;) And I can't suggest
> something better at least right now. If I understand correctly, we can not
> use, say, synchronize_sched() in _cpu_down() path

We can't sleep in that code.. so that's a no-go.

>, you also want to improve
> the latency. And I guess something like kick_all_cpus_sync() is "too heavy".
> 

I hadn't considered that. Thinking of it, I don't think it would help us..
It won't get rid of the currently running preempt_disable() sections no?

> Also, after a quick reading this doesn't look correct, please see below.
> 
>> +void get_online_cpus_atomic(void)
>> +{
>> +	unsigned int cpu = smp_processor_id();
>> +	unsigned long flags;
>> +
>> +	preempt_disable();
>> +	local_irq_save(flags);
>> +
>> +	if (cpu_hotplug.active_writer == current)
>> +		goto out;
>> +
>> +	smp_rmb(); /* Paired with smp_wmb() in drop_writer_signal() */
>> +
>> +	if (likely(!writer_active(cpu))) {
> 
> WINDOW. Suppose that reader_active() == F.
> 
>> +		mark_reader_fastpath();
>> +		goto out;
> 
> Why can't take_cpu_down() do announce_cpu_offline_begin() + sync_all_readers()
> in between?
> 
> Looks like we should increment the counter first, then check writer_active().

You are right, I missed the above race-conditions.

> And sync_atomic_reader() needs rmb between 2 atomic_read's.
> 

OK.

> 
> Or. Again, suppose that reader_active() == F. But is_new_writer() == T.
> 
>> +	if (is_new_writer(cpu)) {
>> +		/*
>> +		 * ACK the writer's signal only if this is a fresh read-side
>> +		 * critical section, and not just an extension of a running
>> +		 * (nested) read-side critical section.
>> +		 */
>> +		if (!reader_active(cpu)) {
>> +			ack_writer_signal();
> 
> What if take_cpu_down() does announce_cpu_offline_end() right before
> ack_writer_signal()? In this case get_online_cpus_atomic() returns
> with writer_signal == -1. If nothing else this breaks the next
> raise_writer_signal().
> 

Oh, yes, this one is wrong too! We need to mark ourselves as an active
reader right at the beginning. And probably change the check to
"reader_nested()" or something like that.

Thanks a lot Oleg!
Regards,
Srivatsa S. Bhat

Oleg Nesterov Dec. 9, 2012, 8:22 p.m. UTC | #9
On 12/10, Srivatsa S. Bhat wrote:
>
> On 12/10/2012 12:44 AM, Oleg Nesterov wrote:
>
> > But yes, it is easy to blame somebody else's code ;) And I can't suggest
> > something better at least right now. If I understand correctly, we can not
> > use, say, synchronize_sched() in _cpu_down() path
>
> We can't sleep in that code.. so that's a no-go.

But we can?

Note that I meant _cpu_down(), not get_online_cpus_atomic() or take_cpu_down().

Oleg.

Oleg Nesterov Dec. 9, 2012, 8:57 p.m. UTC | #10
On 12/07, Srivatsa S. Bhat wrote:
>
> 4. No deadlock possibilities
>
>    Per-cpu locking is not the way to go if we want to have relaxed rules
>    for lock-ordering, because we can end up in circular-locking dependencies
>    as explained in https://lkml.org/lkml/2012/12/6/290

OK, but this assumes that, contrary to what Steven said, read-write-read
deadlock is not possible when it comes to rwlock_t. So far I think this
is true and we can't deadlock. Steven?

However, if this is true, then compared to preempt_disable/stop_machine,
livelock is possible. Probably this is fine, we have the same problem with
get_online_cpus(). But if we can accept this fact I feel we can simplify
this somehow... Can't prove, only feel ;)

Oleg.

Oleg Nesterov Dec. 9, 2012, 9:13 p.m. UTC | #11
Damn, sorry for noise. I missed this part...

On 12/10, Srivatsa S. Bhat wrote:
>
> On 12/10/2012 12:44 AM, Oleg Nesterov wrote:
> > the latency. And I guess something like kick_all_cpus_sync() is "too heavy".
>
> I hadn't considered that. Thinking of it, I don't think it would help us..
> It won't get rid of the currently running preempt_disable() sections no?

Sure. But (again, this is only my feeling so far) given that get_online_cpus_atomic()
does cli/sti, this can help to implement ensure-the-readers-must-see-the-pending-writer.
IOW this might help to implement sync-with-readers.

Oleg.

Srivatsa S. Bhat Dec. 10, 2012, 4:28 a.m. UTC | #12
On 12/10/2012 01:52 AM, Oleg Nesterov wrote:
> On 12/10, Srivatsa S. Bhat wrote:
>>
>> On 12/10/2012 12:44 AM, Oleg Nesterov wrote:
>>
>>> But yes, it is easy to blame somebody else's code ;) And I can't suggest
>>> something better at least right now. If I understand correctly, we can not
>>> use, say, synchronize_sched() in _cpu_down() path
>>
>> We can't sleep in that code.. so that's a no-go.
> 
> But we can?
> 
> Note that I meant _cpu_down(), not get_online_cpus_atomic() or take_cpu_down().
> 

Maybe I'm missing something, but how would it help if we did a
synchronize_sched() so early (in _cpu_down())? Another bunch of preempt_disable()
sections could start immediately after our call to synchronize_sched() no?
How would we deal with that?

What we need to ensure here is that all existing preempt_disable() sections
are over and that *we* (meaning, the cpu offline writer) get to proceed immediately
after that, making all the new readers wait for us. And that is possible only if
we do our 'wait-for-readers' thing very close to our actual cpu offline operation
(which is take_cpu_down). Moreover, the writer needs to remain stable
(non-preemptible) while doing the wait-for-readers. Else (if the writer himself
keeps getting scheduled in and out of the CPU) I can't see how he can possibly
do justice to the wait.

Also, synchronize_sched() only helps us do the 'wait-for-existing-readers' logic.
What would take care of the 'prevent-new-readers-from-going-ahead' logic?

To add to it all, synchronize_sched() waits for _all_ preempt_disable() sections
to complete, whereas only a handful of them might actually care about CPU hotplug.
Which is an unnecessary burden for the writer (ie., waiting for unrelated readers
to complete).

I bet you meant something else. Sorry if I misunderstood it.

Regards,
Srivatsa S. Bhat

Srivatsa S. Bhat Dec. 10, 2012, 5:01 a.m. UTC | #13
On 12/10/2012 02:43 AM, Oleg Nesterov wrote:
> Damn, sorry for noise. I missed this part...
> 
> On 12/10, Srivatsa S. Bhat wrote:
>>
>> On 12/10/2012 12:44 AM, Oleg Nesterov wrote:
>>> the latency. And I guess something like kick_all_cpus_sync() is "too heavy".
>>
>> I hadn't considered that. Thinking of it, I don't think it would help us..
>> It won't get rid of the currently running preempt_disable() sections no?
> 
> Sure. But (again, this is only my feeling so far) given that get_online_cpus_atomic()
> does cli/sti,

Ah, that one! Actually, the only reason I do that cli/sti is because, potentially
interrupt handlers can be hotplug readers too. So we need to protect the portion
of the code of get_online_cpus_atomic() which is not re-entrant.
(Which reminds me to try and reduce the length of cli/sti in that code, if possible).
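
(For instance -- a purely hypothetical example, not from the patchset --
a hardirq handler can itself be an atomic hotplug reader:)

	static irqreturn_t my_irq_handler(int irq, void *dev_id)
	{
		get_online_cpus_atomic();	/* reader in hardirq context */
		/* ... access CPU-hotplug-sensitive state ... */
		put_online_cpus_atomic();

		return IRQ_HANDLED;
	}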

> this can help to implement ensure-the-readers-must-see-the-pending-writer.
> IOW this might help to implement sync-with-readers.
> 

2 problems:

1. It won't help with cases like this:

   preempt_disable()
	...
	preempt_disable()
		...
			<------- Here
		...
	preempt_enable()
	...
   preempt_enable()

If the IPI hits at the point marked above, the IPI is useless, because, at
that point, since we are already in a nested read-side critical section, we can't
switch the synchronization protocol. We need to wait till we start a fresh
non-nested read-side critical section, in order to switch to global rwlock.
The reason is that preempt_enable() or put_online_cpus_atomic() can only undo
what its predecessor (preempt_disable()/get_online_cpus_atomic()) did.

2. Part of the reason we want to get rid of stop_machine() is to avoid the
latency it induces on _all_ CPUs just to take *one* CPU offline. If we use
kick_all_cpus_sync(), we get into that territory again: we unfairly interrupt
every CPU, _even when_ that CPU's existing preempt_disabled() sections might
not actually be hotplug readers! (ie., not bothered about CPU Hotplug).

So, I think whatever synchronization scheme we develop must not induce the very
same problems that stop_machine() had. Otherwise, we can end up going full-circle
and coming back to the same stop_machine() scenario that we intended to get rid of.

(That's part of the reason why I initially tried to provide that _light() variant
of the reader APIs in the previous version of the patchset, so that light readers
can remain as undisturbed from cpu hotplug as possible, thereby avoiding indirectly
inducing the "stop_machine effect", like I mentioned here:
https://lkml.org/lkml/2012/12/5/369)

Regards,
Srivatsa S. Bhat

Srivatsa S. Bhat Dec. 10, 2012, 5:19 a.m. UTC | #14
On 12/10/2012 02:27 AM, Oleg Nesterov wrote:
> On 12/07, Srivatsa S. Bhat wrote:
>>
>> 4. No deadlock possibilities
>>
>>    Per-cpu locking is not the way to go if we want to have relaxed rules
>>    for lock-ordering, because we can end up in circular-locking dependencies
>>    as explained in https://lkml.org/lkml/2012/12/6/290
> 
> OK, but this assumes that, contrary to what Steven said, read-write-read
> deadlock is not possible when it comes to rwlock_t.

What I meant is, with a single (global) rwlock, you can't deadlock like that.
But if you used per-cpu rwlocks and didn't implement them properly, then you
can end up in circular locking dependencies, as shown in the link above.

That is, if you take the same example and replace the per-cpu locks with a
global rwlock, you won't deadlock:


Readers:

         CPU 0                                CPU 1
         ------                               ------

1.    spin_lock(&random_lock);             read_lock(&my_rwlock);


2.    read_lock(&my_rwlock);               spin_lock(&random_lock);


Writer:

         CPU 2:
         ------

       write_lock(&my_rwlock);


Even if the writer does a write_lock() in between steps 1 and 2 at the
reader side, nothing will happen. The writer would spin because CPU 1 already
got the rwlock for read, and hence both CPU 0 and 1 go ahead. When they
finish, the writer will get the lock and proceed. So, no deadlocks here.

So, what I was pointing out here was: if somebody replaced this global
rwlock with a "straightforward" implementation of per-cpu rwlocks, he'd
immediately end up in a circular locking dependency deadlock between the 3
entities, as explained in the link above.

Let me know if my assumptions are incorrect!

> So far I think this
> is true and we can't deadlock. Steven?
> 
> However, if this is true, then compared to preempt_disable/stop_machine,
> livelock is possible. Probably this is fine, we have the same problem with
> get_online_cpus(). But if we can accept this fact I feel we can simplify
> this somehow... Can't prove, only feel ;)
> 

Not sure I follow..

Anyway, my point is that we _can't_ implement per-cpu rwlocks like lglocks
and expect them to work in this case. IOW, we can't do:

Reader-side:
   -> read_lock() your per-cpu rwlock and proceed.

Writer-side:
   -> for_each_online_cpu(cpu)
          write_lock(per-cpu rwlock of 'cpu');


Also, like Tejun said, one of the important measures for per-cpu rwlocks
should be that, if a user replaces global rwlocks with percpu rwlocks (for
performance reasons), he shouldn't suddenly end up in numerous deadlock
possibilities which never existed before. The replacement should continue to
remain safe, and perhaps improve the performance.

Regards,
Srivatsa S. Bhat

Oleg Nesterov Dec. 10, 2012, 5:24 p.m. UTC | #15
On 12/10, Srivatsa S. Bhat wrote:
>
> On 12/10/2012 01:52 AM, Oleg Nesterov wrote:
> > On 12/10, Srivatsa S. Bhat wrote:
> >>
> >> On 12/10/2012 12:44 AM, Oleg Nesterov wrote:
> >>
> >>> But yes, it is easy to blame somebody else's code ;) And I can't suggest
> >>> something better at least right now. If I understand correctly, we can not
> >>> use, say, synchronize_sched() in _cpu_down() path
> >>
> >> We can't sleep in that code.. so that's a no-go.
> >
> > But we can?
> >
> > Note that I meant _cpu_down(), not get_online_cpus_atomic() or take_cpu_down().
> >
>
> Maybe I'm missing something, but how would it help if we did a
> synchronize_sched() so early (in _cpu_down())? Another bunch of preempt_disable()
> sections could start immediately after our call to synchronize_sched() no?
> How would we deal with that?

Sorry for confusion. Of course synchronize_sched() alone is not enough.
But we can use it to synchronize with preempt-disabled section and avoid
the barriers/atomic in the fast-path.

For example,

	bool writer_pending;
	DEFINE_RWLOCK(writer_rwlock);
	DEFINE_PER_CPU(int, reader_ctr);

	void get_online_cpus_atomic(void)
	{
		preempt_disable();
	
		if (likely(!writer_pending) || __this_cpu_read(reader_ctr)) {
			__this_cpu_inc(reader_ctr);
			return;
		}

		read_lock(&writer_rwlock);
		__this_cpu_inc(reader_ctr);
		read_unlock(&writer_rwlock);
	}

	// lacks release semantics, but we don't care
	void put_online_cpus_atomic(void)
	{
		__this_cpu_dec(reader_ctr);
		preempt_enable();
	}

Now, _cpu_down() does

	writer_pending = true;
	synchronize_sched();

before stop_one_cpu(). When synchronize_sched() returns, we know that
every get_online_cpus_atomic() must see writer_pending == T. And, if
any CPU incremented its reader_ctr we must see it is not zero.

take_cpu_down() does

	write_lock(&writer_rwlock);

	for_each_online_cpu(cpu) {
		while (per_cpu(reader_ctr, cpu))
			cpu_relax();
	}

and takes the lock.

However. This can lead to the deadlock we already discussed. So
take_cpu_down() should do

 retry:
	write_lock(&writer_rwlock);

	for_each_online_cpu(cpu) {
		if (per_cpu(reader_ctr, cpu)) {
			write_unlock(&writer_rwlock);
			goto retry;
		}
	}

to take the lock. But this is livelockable. However, I do not think it
is possible to avoid the livelock.

Just in case, the code above is only for illustration, perhaps it is not
100% correct and perhaps we can do it better. cpu_hotplug.active_writer
is ignored for simplicity, get/put should check current == active_writer.

Oleg.

Oleg Nesterov Dec. 10, 2012, 5:28 p.m. UTC | #16
On 12/10, Srivatsa S. Bhat wrote:
>
> On 12/10/2012 02:43 AM, Oleg Nesterov wrote:
> > Damn, sorry for noise. I missed this part...
> >
> > On 12/10, Srivatsa S. Bhat wrote:
> >>
> >> On 12/10/2012 12:44 AM, Oleg Nesterov wrote:
> >>> the latency. And I guess something like kick_all_cpus_sync() is "too heavy".
> >>
> >> I hadn't considered that. Thinking of it, I don't think it would help us..
> >> It won't get rid of the currently running preempt_disable() sections no?
> >
> > Sure. But (again, this is only my feeling so far) given that get_online_cpus_atomic()
> > does cli/sti,
>
> Ah, that one! Actually, the only reason I do that cli/sti is because, potentially
> interrupt handlers can be hotplug readers too. So we need to protect the portion
> of the code of get_online_cpus_atomic() which is not re-entrant.

Yes, I understand.

> > this can help to implement ensure-the-readers-must-see-the-pending-writer.
> > IOW this might help to implement sync-with-readers.
> >
>
> 2 problems:
>
> 1. It won't help with cases like this:
>
>    preempt_disable()
> 	...
> 	preempt_disable()
> 		...
> 			<------- Here
> 		...
> 	preempt_enable()
> 	...
>    preempt_enable()

No, I meant that kick_all_cpus_sync() can be used to synchronize with
cli/sti in get_online_cpus_atomic(), just like synchronize_sched() does
in the code I posted a minute ago.

> 2. Part of the reason we want to get rid of stop_machine() is to avoid the
> latency it induces on _all_ CPUs just to take *one* CPU offline. If we use
> kick_all_cpus_sync(), we get into that territory again: we unfairly interrupt
> every CPU, _even when_ that CPU's existing preempt_disabled() sections might
> not actually be hotplug readers! (ie., not bothered about CPU Hotplug).

I agree, that is why I said it is "too heavy".

Oleg.

Oleg Nesterov Dec. 10, 2012, 6:15 p.m. UTC | #17
On 12/10, Srivatsa S. Bhat wrote:
>
> On 12/10/2012 02:27 AM, Oleg Nesterov wrote:
> > On 12/07, Srivatsa S. Bhat wrote:
> >>
> >> 4. No deadlock possibilities
> >>
> >>    Per-cpu locking is not the way to go if we want to have relaxed rules
>>    for lock-ordering, because we can end up in circular-locking dependencies
> >>    as explained in https://lkml.org/lkml/2012/12/6/290
> >
> > OK, but this assumes that, contrary to what Steven said, read-write-read
> > deadlock is not possible when it comes to rwlock_t.
>
> What I meant is, with a single (global) rwlock, you can't deadlock like that.

Ah. I greatly misunderstood Steven's email,

http://marc.info/?l=linux-pm&m=135482212307876

Somehow I didn't notice he described the deadlock with _two_ rwlocks; I
wrongly thought that his point was that read_lock() is not recursive (like
down_read).

> Let me know if my assumptions are incorrect!

No, sorry, I misunderstood Steven.


> > However, if this is true, then compared to preempt_disable/stop_machine,
> > livelock is possible. Probably this is fine, we have the same problem with
> > get_online_cpus(). But if we can accept this fact I feel we can simplify
> > this somehow... Can't prove, only feel ;)
>
> Not sure I follow..

I meant that write_lock_irqsave(&hotplug_rwlock) in take_cpu_down()
can spin "forever".

Suppose that reader_acked() == T on every CPU, so that
get_online_cpus_atomic() always takes read_lock(&hotplug_rwlock).

It is possible that this lock will never be released by readers,

	CPU_0				CPU_1

	get_online_cpus_atomic()
					get_online_cpus_atomic()
	put_online_cpus_atomic()

	get_online_cpus_atomic()
					put_online_cpus_atomic()

					get_online_cpus_atomic()
	put_online_cpus_atomic()

and so on.


> Reader-side:
>    -> read_lock() your per-cpu rwlock and proceed.
>
> Writer-side:
>    -> for_each_online_cpu(cpu)
>           write_lock(per-cpu rwlock of 'cpu');

Yes, yes, this is clear.

> Also, like Tejun said, one of the important measures for per-cpu rwlocks
> should be that, if a user replaces global rwlocks with percpu rwlocks (for
> performance reasons), he shouldn't suddenly end up in numerous deadlock
> possibilities which never existed before. The replacement should continue to
> remain safe, and perhaps improve the performance.

Sure, I agree.

Oleg.

Srivatsa S. Bhat Dec. 11, 2012, 1:04 p.m. UTC | #18
On 12/10/2012 11:45 PM, Oleg Nesterov wrote:
> On 12/10, Srivatsa S. Bhat wrote:
>>
>> On 12/10/2012 02:27 AM, Oleg Nesterov wrote:
>>> However, if this is true, then compared to preempt_disable/stop_machine,
>>> livelock is possible. Probably this is fine, we have the same problem with
>>> get_online_cpus(). But if we can accept this fact I feel we can simplify
>>> this somehow... Can't prove, only feel ;)
>>
>> Not sure I follow..
> 
> I meant that write_lock_irqsave(&hotplug_rwlock) in take_cpu_down()
> can spin "forever".
> 
> Suppose that reader_acked() == T on every CPU, so that
> get_online_cpus_atomic() always takes read_lock(&hotplug_rwlock).
> 
> It is possible that this lock will never be released by readers,
> 
> 	CPU_0				CPU_1
> 
> 	get_online_cpus_atomic()
> 					get_online_cpus_atomic()
> 	put_online_cpus_atomic()
> 
> 	get_online_cpus_atomic()
> 					put_online_cpus_atomic()
> 
> 					get_online_cpus_atomic()
> 	put_online_cpus_atomic()
> 
> and so on.
> 

Right, and we can't do anything about it :(
Regards,
Srivatsa S. Bhat

Srivatsa S. Bhat Dec. 11, 2012, 1:05 p.m. UTC | #19
On 12/10/2012 10:58 PM, Oleg Nesterov wrote:
> On 12/10, Srivatsa S. Bhat wrote:
>>
>> On 12/10/2012 02:43 AM, Oleg Nesterov wrote:
>>> Damn, sorry for noise. I missed this part...
>>>
>>> On 12/10, Srivatsa S. Bhat wrote:
>>>>
>>>> On 12/10/2012 12:44 AM, Oleg Nesterov wrote:
>>>>> the latency. And I guess something like kick_all_cpus_sync() is "too heavy".
>>>>
>>>> I hadn't considered that. Thinking of it, I don't think it would help us..
>>>> It won't get rid of the currently running preempt_disable() sections no?
>>>
>>> Sure. But (again, this is only my feeling so far) given that get_online_cpus_atomic()
>>> does cli/sti,
>>
>> Ah, that one! Actually, the only reason I do that cli/sti is because, potentially
>> interrupt handlers can be hotplug readers too. So we need to protect the portion
>> of the code of get_online_cpus_atomic() which is not re-entrant.
> 
> Yes, I understand.
> 
>>> this can help to implement ensure-the-readers-must-see-the-pending-writer.
>>> IOW this might help to implement sync-with-readers.
>>>
>>
>> 2 problems:
>>
>> 1. It won't help with cases like this:
>>
>>    preempt_disable()
>> 	...
>> 	preempt_disable()
>> 		...
>> 			<------- Here
>> 		...
>> 	preempt_enable()
>> 	...
>>    preempt_enable()
> 
> No, I meant that kick_all_cpus_sync() can be used to synchronize with
> cli/sti in get_online_cpus_atomic(), just like synchronize_sched() does
> in the code I posted a minute ago.
> 

Ah, OK.

>> 2. Part of the reason we want to get rid of stop_machine() is to avoid the
>> latency it induces on _all_ CPUs just to take *one* CPU offline. If we use
>> kick_all_cpus_sync(), we get into that territory again: we unfairly interrupt
>> every CPU, _even when_ that CPU's existing preempt_disabled() sections might
>> not actually be hotplug readers! (ie., not bothered about CPU Hotplug).
> 
> I agree, that is why I said it is "too heavy".
> 

Got it :)

Regards,
Srivatsa S. Bhat

Srivatsa S. Bhat Dec. 11, 2012, 1:13 p.m. UTC | #20
On 12/10/2012 10:54 PM, Oleg Nesterov wrote:
> On 12/10, Srivatsa S. Bhat wrote:
>>
>> On 12/10/2012 01:52 AM, Oleg Nesterov wrote:
>>> On 12/10, Srivatsa S. Bhat wrote:
>>>>
>>>> On 12/10/2012 12:44 AM, Oleg Nesterov wrote:
>>>>
>>>>> But yes, it is easy to blame somebody else's code ;) And I can't suggest
>>>>> something better at least right now. If I understand correctly, we can not
>>>>> use, say, synchronize_sched() in _cpu_down() path
>>>>
>>>> We can't sleep in that code.. so that's a no-go.
>>>
>>> But we can?
>>>
>>> Note that I meant _cpu_down(), not get_online_cpus_atomic() or take_cpu_down().
>>>
>>
>> Maybe I'm missing something, but how would it help if we did a
>> synchronize_sched() so early (in _cpu_down())? Another bunch of preempt_disable()
>> sections could start immediately after our call to synchronize_sched() no?
>> How would we deal with that?
> 
> Sorry for confusion. Of course synchronize_sched() alone is not enough.
> But we can use it to synchronize with preempt-disabled section and avoid
> the barriers/atomic in the fast-path.
> 
> For example,
> 
> 	bool writer_pending;
> 	DEFINE_RWLOCK(writer_rwlock);
> 	DEFINE_PER_CPU(int, reader_ctr);
> 
> 	void get_online_cpus_atomic(void)
> 	{
> 		preempt_disable();
> 	
> 		if (likely(!writer_pending) || __this_cpu_read(reader_ctr)) {
> 			__this_cpu_inc(reader_ctr);
> 			return;
> 		}
> 
> 		read_lock(&writer_rwlock);
> 		__this_cpu_inc(reader_ctr);
> 		read_unlock(&writer_rwlock);
> 	}
> 
> 	// lacks release semantics, but we don't care
> 	void put_online_cpus_atomic(void)
> 	{
> 		__this_cpu_dec(reader_ctr);
> 		preempt_enable();
> 	}
> 
> Now, _cpu_down() does
> 
> 	writer_pending = true;
> 	synchronize_sched();
> 
> before stop_one_cpu(). When synchronize_sched() returns, we know that
> every get_online_cpus_atomic() must see writer_pending == T. And, if
> any CPU incremented its reader_ctr we must see it is not zero.
> 
> take_cpu_down() does
> 
> 	write_lock(&writer_rwlock);
> 
> 	for_each_online_cpu(cpu) {
> 		while (per_cpu(reader_ctr, cpu))
> 			cpu_relax();
> 	}
> 
> and takes the lock.
> 
> However. This can lead to the deadlock we already discussed. So
> take_cpu_down() should do
> 
>  retry:
> 	write_lock(&writer_rwlock);
> 
> 	for_each_online_cpu(cpu) {
> 		if (per_cpu(reader_ctr, cpu)) {
> 			write_unlock(&writer_rwlock);
> 			goto retry;
> 		}
> 	}
> 
> to take the lock. But this is livelockable. However, I do not think it
> is possible to avoid the livelock.
> 
> Just in case, the code above is only for illustration, perhaps it is not
> 100% correct and perhaps we can do it better. cpu_hotplug.active_writer
> is ignored for simplicity, get/put should check current == active_writer.
> 

This approach (of using synchronize_sched()) also looks good. It is simple,
yet effective, but unfortunately inefficient at the writer side (because
he'll have to wait for a full synchronize_sched()).

So I would say that we should keep this approach as a fallback, if we don't
come up with any better synchronization scheme (in terms of efficiency) at
a comparable level of simplicity/complexity.

I have come up with a v4 that simplifies several aspects of the synchronization
and makes it look a lot simpler than v3. (Fewer race windows to take
care of, implicit ACKs, no atomic ops, etc.)

I'll post it soon. Let me know what you think...

Regards,
Srivatsa S. Bhat

Tejun Heo Dec. 11, 2012, 1:47 p.m. UTC | #21
Hello, Srivatsa.

On Tue, Dec 11, 2012 at 06:43:54PM +0530, Srivatsa S. Bhat wrote:
> This approach (of using synchronize_sched()) also looks good. It is simple,
> yet effective, but unfortunately inefficient at the writer side (because
> he'll have to wait for a full synchronize_sched()).

While synchronize_sched() is heavier on the writer side than the
originally posted version, it doesn't stall the whole machine and
wouldn't introduce latencies to others.  Shouldn't that be enough?

Thanks.
Srivatsa S. Bhat Dec. 11, 2012, 2:02 p.m. UTC | #22
On 12/11/2012 07:17 PM, Tejun Heo wrote:
> Hello, Srivatsa.
> 
> On Tue, Dec 11, 2012 at 06:43:54PM +0530, Srivatsa S. Bhat wrote:
>> This approach (of using synchronize_sched()) also looks good. It is simple,
>> yet effective, but unfortunately inefficient at the writer side (because
>> he'll have to wait for a full synchronize_sched()).
> 
> While synchronize_sched() is heavier on the writer side than the
> originally posted version, it doesn't stall the whole machine and
> wouldn't introduce latencies to others.  Shouldn't that be enough?
> 

Short answer: Yes. But we can do better, with almost comparable code
complexity. So I'm tempted to try that out.

Long answer:
Even in the synchronize_sched() approach, we still have to identify the
readers who need to be converted to use the new get/put_online_cpus_atomic()
APIs and convert them. Then, if we can come up with a scheme such that
the writer has to wait only for those readers to complete, then why not?

If such a scheme ends up becoming too complicated, then I agree, we
can use synchronize_sched() itself. (That's what I meant by saying that
we'll use this as a fallback).

But even in this scheme which uses synchronize_sched(), we are
already half-way through (we already use 2 types of sync schemes -
counters and rwlocks). Just a little more logic can get rid of the
unnecessary full-wait too.. So why not give it a shot?

Regards,
Srivatsa S. Bhat

--
To unsubscribe from this list: send the line "unsubscribe linux-pm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Tejun Heo Dec. 11, 2012, 2:07 p.m. UTC | #23
Hello,

On Tue, Dec 11, 2012 at 07:32:13PM +0530, Srivatsa S. Bhat wrote:
> On 12/11/2012 07:17 PM, Tejun Heo wrote:
> > Hello, Srivatsa.
> > 
> > On Tue, Dec 11, 2012 at 06:43:54PM +0530, Srivatsa S. Bhat wrote:
> >> This approach (of using synchronize_sched()) also looks good. It is simple,
> >> yet effective, but unfortunately inefficient at the writer side (because
> >> he'll have to wait for a full synchronize_sched()).
> > 
> > While synchronize_sched() is heavier on the writer side than the
> > originally posted version, it doesn't stall the whole machine and
> > wouldn't introduce latencies to others.  Shouldn't that be enough?
> > 
> 
> Short answer: Yes. But we can do better, with almost comparable code
> complexity. So I'm tempted to try that out.
> 
> Long answer:
> Even in the synchronize_sched() approach, we still have to identify the
> readers who need to be converted to use the new get/put_online_cpus_atomic()
> APIs and convert them. Then, if we can come up with a scheme such that
> the writer has to wait only for those readers to complete, then why not?
> 
> If such a scheme ends up becoming too complicated, then I agree, we
> can use synchronize_sched() itself. (That's what I meant by saying that
> we'll use this as a fallback).
> 
> But even in this scheme which uses synchronize_sched(), we are
> already half-way through (we already use 2 types of sync schemes -
> counters and rwlocks). Just a little more logic can get rid of the
> unnecessary full-wait too.. So why not give it a shot?

It's not really about the code complexity but about making the reader side
as light as possible.  Please keep in mind that the reader side is still
*way* hotter than the writer side.  Before, the writer side was
heavy to the extent that it caused noticeable disruptions on the whole
system, and I think that's what we're trying to hunt down here.  If we
can shave off memory barriers from the reader side by using
synchronize_sched() on the writer side, that is the *better* result, not
worse.

Thanks.
Srivatsa S. Bhat Dec. 11, 2012, 4:28 p.m. UTC | #24
On 12/11/2012 07:37 PM, Tejun Heo wrote:
> Hello,
> 
> On Tue, Dec 11, 2012 at 07:32:13PM +0530, Srivatsa S. Bhat wrote:
>> On 12/11/2012 07:17 PM, Tejun Heo wrote:
>>> Hello, Srivatsa.
>>>
>>> On Tue, Dec 11, 2012 at 06:43:54PM +0530, Srivatsa S. Bhat wrote:
>>>> This approach (of using synchronize_sched()) also looks good. It is simple,
>>>> yet effective, but unfortunately inefficient at the writer side (because
>>>> he'll have to wait for a full synchronize_sched()).
>>>
>>> While synchronize_sched() is heavier on the writer side than the
>>> originally posted version, it doesn't stall the whole machine and
>>> wouldn't introduce latencies to others.  Shouldn't that be enough?
>>>
>>
>> Short answer: Yes. But we can do better, with almost comparable code
>> complexity. So I'm tempted to try that out.
>>
>> Long answer:
>> Even in the synchronize_sched() approach, we still have to identify the
>> readers who need to be converted to use the new get/put_online_cpus_atomic()
>> APIs and convert them. Then, if we can come up with a scheme such that
>> the writer has to wait only for those readers to complete, then why not?
>>
>> If such a scheme ends up becoming too complicated, then I agree, we
>> can use synchronize_sched() itself. (That's what I meant by saying that
>> we'll use this as a fallback).
>>
>> But even in this scheme which uses synchronize_sched(), we are
>> already half-way through (we already use 2 types of sync schemes -
>> counters and rwlocks). Just a little more logic can get rid of the
>> unnecessary full-wait too.. So why not give it a shot?
> 
> It's not really about the code complexity but about making the reader side
> as light as possible.  Please keep in mind that the reader side is still
> *way* hotter than the writer side.  Before, the writer side was
> heavy to the extent that it caused noticeable disruptions on the whole
> system, and I think that's what we're trying to hunt down here.  If we
> can shave off memory barriers from the reader side by using
> synchronize_sched() on the writer side, that is the *better* result, not
> worse.
> 

That's a fair point. Hmmm...
Regards,
Srivatsa S. Bhat


Patch

diff --git a/include/linux/cpu.h b/include/linux/cpu.h
index ce7a074..cf24da1 100644
--- a/include/linux/cpu.h
+++ b/include/linux/cpu.h
@@ -175,6 +175,8 @@  extern struct bus_type cpu_subsys;
 
 extern void get_online_cpus(void);
 extern void put_online_cpus(void);
+extern void get_online_cpus_atomic(void);
+extern void put_online_cpus_atomic(void);
 #define hotcpu_notifier(fn, pri)	cpu_notifier(fn, pri)
 #define register_hotcpu_notifier(nb)	register_cpu_notifier(nb)
 #define unregister_hotcpu_notifier(nb)	unregister_cpu_notifier(nb)
@@ -198,6 +200,8 @@  static inline void cpu_hotplug_driver_unlock(void)
 
 #define get_online_cpus()	do { } while (0)
 #define put_online_cpus()	do { } while (0)
+#define get_online_cpus_atomic()	do { } while (0)
+#define put_online_cpus_atomic()	do { } while (0)
 #define hotcpu_notifier(fn, pri)	do { (void)(fn); } while (0)
 /* These aren't inline functions due to a GCC bug. */
 #define register_hotcpu_notifier(nb)	({ (void)(nb); 0; })
diff --git a/kernel/cpu.c b/kernel/cpu.c
index 42bd331..d2076fa 100644
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -19,6 +19,7 @@ 
 #include <linux/mutex.h>
 #include <linux/gfp.h>
 #include <linux/suspend.h>
+#include <linux/atomic.h>
 
 #include "smpboot.h"
 
@@ -133,6 +134,161 @@  static void cpu_hotplug_done(void)
 	mutex_unlock(&cpu_hotplug.lock);
 }
 
+/*
+ * Reader-writer lock to synchronize between atomic hotplug readers
+ * and the CPU offline hotplug writer.
+ */
+static DEFINE_RWLOCK(hotplug_rwlock);
+
+static DEFINE_PER_CPU(atomic_t, reader_refcount);
+static DEFINE_PER_CPU(atomic_t, writer_signal);
+
+static inline void ack_writer_signal(void)
+{
+	/* Writer raised it to 2. Reduce it to 1 */
+	atomic_dec(&__get_cpu_var(writer_signal));
+	smp_mb__after_atomic_dec();
+}
+
+
+#define reader_active(cpu)						\
+			(atomic_read(&per_cpu(reader_refcount, cpu)) > 0)
+
+#define writer_active(cpu)						\
+			(atomic_read(&per_cpu(writer_signal, cpu)) > 0)
+
+#define is_new_writer(cpu)						\
+			(atomic_read(&per_cpu(writer_signal, cpu)) == 2)
+
+#define reader_acked(cpu)						\
+			(atomic_read(&per_cpu(writer_signal, cpu)) == 1)
+
+
+static inline void mark_reader_fastpath(void)
+{
+	atomic_inc(&__get_cpu_var(reader_refcount));
+	smp_mb__after_atomic_inc();
+}
+
+static inline void unmark_reader_fastpath(void)
+{
+	atomic_dec(&__get_cpu_var(reader_refcount));
+	smp_mb__after_atomic_dec();
+}
+
+static inline void mark_reader_slowpath(void)
+{
+	unsigned int cpu = smp_processor_id();
+
+	mark_reader_fastpath();
+
+	read_lock(&hotplug_rwlock);
+
+	/*
+	 * Release the lock before returning, if the writer has finished.
+	 * This will help reduce the burden on put_online_cpus_atomic().
+	 */
+	if (!writer_active(cpu))
+		read_unlock(&hotplug_rwlock);
+}
+
+/*
+ * Invoked by hotplug reader, to prevent CPUs from going offline.
+ *
+ * If there are no CPU offline writers active, just increment the
+ * per-cpu counter 'reader_refcount' and proceed.
+ *
+ * If a CPU offline hotplug writer is active, we need to ACK his
+ * signal and switch to using the global rwlocks, when the time is
+ * right.
+ *
+ * It is not safe to switch the synchronization scheme when we are
+ * already in a read-side critical section (because their corresponding
+ * put_online_cpus_atomic() won't know of the switch and will continue
+ * to use the old scheme (per-cpu counter updates), which would be
+ * wrong).
+ *
+ * So don't ACK the writer's signal (and don't switch to rwlocks) unless
+ * this is a fresh non-nested read-side critical section on this CPU.
+ *
+ * Once you switch, keep using the rwlocks for synchronization, until
+ * the writer signals the end of CPU offline.
+ *
+ * You can call this recursively, without fear of locking problems.
+ *
+ * Returns with preemption disabled.
+ */
+void get_online_cpus_atomic(void)
+{
+	unsigned int cpu = smp_processor_id();
+	unsigned long flags;
+
+	preempt_disable();
+	local_irq_save(flags);
+
+	if (cpu_hotplug.active_writer == current)
+		goto out;
+
+	smp_rmb(); /* Paired with smp_wmb() in drop_writer_signal() */
+
+	if (likely(!writer_active(cpu))) {
+		mark_reader_fastpath();
+		goto out;
+	}
+
+	/* CPU offline writer is active... */
+
+	/* ACK his signal only once */
+	if (is_new_writer(cpu)) {
+		/*
+		 * ACK the writer's signal only if this is a fresh read-side
+		 * critical section, and not just an extension of a running
+		 * (nested) read-side critical section.
+		 */
+		if (!reader_active(cpu)) {
+			ack_writer_signal();
+			mark_reader_slowpath();
+		} else {
+			/*
+			 * Keep taking the fastpath until all existing nested
+			 * read-side critical sections complete on this CPU.
+			 */
+			mark_reader_fastpath();
+		}
+	} else {
+		/*
+		 * The writer is not new because we have already acked his
+		 * signal before. So just take the slowpath and proceed.
+		 */
+		mark_reader_slowpath();
+	}
+
+out:
+	local_irq_restore(flags);
+}
+EXPORT_SYMBOL_GPL(get_online_cpus_atomic);
+
+void put_online_cpus_atomic(void)
+{
+	unsigned int cpu = smp_processor_id();
+	unsigned long flags;
+
+	local_irq_save(flags);
+
+	if (cpu_hotplug.active_writer == current)
+		goto out;
+
+	if (unlikely(writer_active(cpu)) && reader_acked(cpu))
+		read_unlock(&hotplug_rwlock);
+
+	unmark_reader_fastpath();
+
+out:
+	local_irq_restore(flags);
+	preempt_enable();
+}
+EXPORT_SYMBOL_GPL(put_online_cpus_atomic);
+
 #else /* #if CONFIG_HOTPLUG_CPU */
 static void cpu_hotplug_begin(void) {}
 static void cpu_hotplug_done(void) {}
@@ -237,6 +393,66 @@  static inline void check_for_tasks(int cpu)
 	write_unlock_irq(&tasklist_lock);
 }
 
+/* Set the per-cpu variable writer_signal to 2 */
+static inline void raise_writer_signal(unsigned int cpu)
+{
+	atomic_add(2, &per_cpu(writer_signal, cpu));
+	smp_mb__after_atomic_inc();
+}
+
+/* Reset the per-cpu variable writer_signal to 0 */
+static inline void drop_writer_signal(unsigned int cpu)
+{
+	atomic_set(&per_cpu(writer_signal, cpu), 0);
+	smp_wmb(); /* Paired with smp_rmb() in get_online_cpus_atomic() */
+}
+
+static void announce_cpu_offline_begin(void)
+{
+	unsigned int cpu;
+
+	for_each_online_cpu(cpu)
+		raise_writer_signal(cpu);
+}
+
+static void announce_cpu_offline_end(unsigned int dead_cpu)
+{
+	unsigned int cpu;
+
+	drop_writer_signal(dead_cpu);
+
+	for_each_online_cpu(cpu)
+		drop_writer_signal(cpu);
+}
+
+/*
+ * Check if the reader saw and acked the writer's signal.
+ * If he has, return immediately.
+ *
+ * If he hasn't acked it, there are 2 possibilities:
+ *  a. He is not an active reader; so we can return safely.
+ *  b. He is an active reader, so we need to retry until
+ *     he acks the writer's signal.
+ */
+static void sync_atomic_reader(unsigned int cpu)
+{
+
+retry:
+	if (reader_acked(cpu))
+		return;
+
+	if (reader_active(cpu))
+		goto retry;
+}
+
+static void sync_all_readers(void)
+{
+	unsigned int cpu;
+
+	for_each_online_cpu(cpu)
+		sync_atomic_reader(cpu);
+}
+
 struct take_cpu_down_param {
 	unsigned long mod;
 	void *hcpu;
@@ -246,15 +462,45 @@  struct take_cpu_down_param {
 static int __ref take_cpu_down(void *_param)
 {
 	struct take_cpu_down_param *param = _param;
-	int err;
+	unsigned long flags;
+	unsigned int cpu = (long)(param->hcpu);
+	int err = 0;
+
+
+	/*
+	 * Inform all atomic readers that we are going to offline a CPU
+	 * and that they need to switch from per-cpu counters to the
+	 * global hotplug_rwlock.
+	 */
+	announce_cpu_offline_begin();
+
+	/* Wait for every reader to notice the announcement and ACK it */
+	sync_all_readers();
+
+	/*
+	 * Now all the readers have switched to using the global hotplug_rwlock.
+	 * So now is our chance, go bring down the CPU!
+	 */
+
+	write_lock_irqsave(&hotplug_rwlock, flags);
 
 	/* Ensure this CPU doesn't handle any more interrupts. */
 	err = __cpu_disable();
 	if (err < 0)
-		return err;
+		goto out;
 
 	cpu_notify(CPU_DYING | param->mod, param->hcpu);
-	return 0;
+
+out:
+	/*
+	 * Inform all atomic readers that we are done with the CPU offline
+	 * operation, so that they can switch back to their per-cpu counters.
+	 * (We don't need to wait for them to see it).
+	 */
+	announce_cpu_offline_end(cpu);
+
+	write_unlock_irqrestore(&hotplug_rwlock, flags);
+	return err;
 }
 
 /* Requires cpu_add_remove_lock to be held */