
[RFC,v2,01/10] CPU hotplug: Provide APIs for "light" atomic readers to prevent CPU offline

Message ID 20121205184258.3750.31879.stgit@srivatsabhat.in.ibm.com (mailing list archive)
State RFC, archived

Commit Message

Srivatsa S. Bhat Dec. 5, 2012, 6:43 p.m. UTC
There are places where preempt_disable() is used to prevent any CPU from
going offline during the critical section. Let us call them "atomic
hotplug readers" (atomic because they run in atomic contexts).

Often, these atomic hotplug readers have a simple need: they want the cpu
online mask that they work with (inside their critical section) to be
stable, i.e., it should be guaranteed that CPUs in that mask won't go
offline during the critical section. The important point here is that
they don't really need to synchronize with the actual CPU tear-down
sequence. All they need is synchronization with the updates to the
cpu_online_mask. (Hence the term "light", for light-weight.)

The intent of this patch is to provide synchronization APIs for such
"light" atomic hotplug readers. [ get/put_online_cpus_atomic_light() ]

Fundamental idea behind the design:
-----------------------------------

Simply put, in the hotplug writer path, have appropriate locking around the
update to the cpu_online_mask in the CPU tear-down sequence. And once the
update is done, release the lock and allow the "light" atomic hotplug readers
to go ahead. Meanwhile, the hotplug writer can safely continue the actual
CPU tear-down sequence (running CPU_DYING notifiers etc) since the "light"
atomic readers don't really care about those operations (and hence don't
need to synchronize with them).

Also, once the hotplug writer completes taking the CPU offline, it should
not start any new cpu_down() operations until all existing "light" atomic
hotplug readers have completed.

Some important design requirements and considerations:
-----------------------------------------------------

1. The "light" atomic hotplug readers should ideally *never* have to wait
   for the hotplug writer (cpu_down()) for too long (like the entire
   duration of CPU offline, for example). These atomic hotplug readers can
   be in very hot paths like interrupt handling/IPIs and hence, if they
   have to wait for an ongoing cpu_down() to complete, it would pretty much
   introduce the same performance/latency problems as stop_machine().

2. Any synchronization at the atomic hotplug readers' side must be highly
   scalable - avoid global single-holder locks/counters etc. These
   paths currently use the extremely fast preempt_disable(); our replacement
   for preempt_disable() should not become ridiculously costly, and also
   should not serialize the readers among themselves needlessly.

3. preempt_disable() was recursive. The replacement should also be recursive.

Implementation of the design:
-----------------------------

At the core, we use a reader-writer lock to synchronize the update to the
cpu_online_mask. That way, multiple "light" atomic hotplug readers can co-exist
and the writer can acquire the lock only when all the readers have completed.

Once acquired, the writer holds the "light" lock only during the duration of
the update to the cpu_online_mask. That way, the readers don't have to spin
for too long (ie., the write-hold-time for the "light" lock is tiny), which
keeps the readers in good shape.

Reader-writer locks are recursive (on the read side), so they can be used in
a nested fashion in the reader path. Together, these satisfy all three of the
requirements mentioned above.

Also, since we don't use per-cpu locks (because rwlocks themselves are quite
scalable for readers), we don't end up in any lock ordering problems that can
occur if we try to use per-cpu locks.

I'm indebted to Michael Wang and Xiao Guangrong for their numerous thoughtful
suggestions and ideas, which inspired and influenced many of the decisions in
this as well as previous designs. Thanks a lot Michael and Xiao!

Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
---

 include/linux/cpu.h |    4 ++
 kernel/cpu.c        |   87 ++++++++++++++++++++++++++++++++++++++++++++++++++-
 2 files changed, 90 insertions(+), 1 deletion(-)


--
To unsubscribe from this list: send the line "unsubscribe linux-pm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Srivatsa S. Bhat Dec. 5, 2012, 6:47 p.m. UTC | #1
Replaying what Tejun wrote:

(cc'ing Oleg)

Hello, Srivatsa.

On 12/06/2012 12:13 AM, Srivatsa S. Bhat wrote:
> Also, since we don't use per-cpu locks (because rwlocks themselves are quite
> scalable for readers), we don't end up in any lock ordering problems that can
> occur if we try to use per-cpu locks.
> 

Read-lock really isn't that scalable when you compare it to
preempt_disable/enable().  When used on hot paths, it's gonna generate
a lot of cacheline pingpongs.  This patch is essentially creating a
new big lock which has potential for being very hot.

preempt_disable/enable() + stop_machine() essentially works as percpu
rwlock with very heavy penalty on the writer side.  Because the reader
side doesn't even implement spinning while writer is in progress, the
writer side has to preempt the readers before entering critical
section and that's what the "stopping machine" is about.

Note that the resolution on the reader side is very low.  Any section
w/ preemption disabled is protected against stop_machine().  Also, the
stop_machine() itself is extremely heavy involving essentially locking
up the machine until all CPUs can reach the same condition via
scheduling the stop_machine tasks.  So, I *think* all you need to do
here is making cpu online locking finer grained (separated from
preemption) and lighten the writer side a bit.  I'm quite doubtful
that you would need to go hunting down all get_online_cpus().  They
aren't used that often anyway.

Anyways, so, separating out cpu hotplug locking from preemption is the
right thing to do but I think rwlock is likely to be too heavy on the
reader side.  I think percpu reader accounting + reader spinning while
writer in progress should be a good combination.  It's a bit heavier
than preempt_disable() - it'll have an extra conditional jump on the
hot path, but there won't be any cacheline bouncing.  The writer side
would need to synchronize against all CPUs but only against the ones
actually read locking cpu hotplug.  As long as reader side critical
sections don't go crazy, it should be okay.

So, we basically need percpu_rwlock.  We already have percpu_rwsem.
We used to have some different variants of writer-heavy locks.  Dunno
what happened to them.  Maybe we still have it somewhere.  Oleg has
been working on the area lately and should know more.  Oleg, it seems
CPU hotplug needs big-reader rwlock, ideas on how to proceed?

Thanks.

-- tejun 

Srivatsa S. Bhat Dec. 5, 2012, 6:51 p.m. UTC | #2
Replaying what Oleg wrote:

Hi,

Sorry I don't understand the context and I can't find this thread
anywhere, so I am not sure I understand...

> Replaying what Tejun wrote:
> So, we basically need percpu_rwlock.  We already have percpu_rwsem.

Yes, and with -mm patches it becomes reader-friendly. In particular
see http://marc.info/?l=linux-mm-commits&m=135240650828875

> Oleg, it seems
> CPU hotplug needs big-reader rwlock, ideas on how to proceed?
> 

I am going to convert get_online_cpus() to use percpu_down_read(),
this looks simple.

We already discussed this with Paul, see

	http://marc.info/?l=linux-kernel&m=135248463226031

and the whole thread.

In short, all we need is percpu_down_write_recursive_readers() and
afaics the only complication is lockdep, we need down_read_no_lockdep()
which (like __up_read) doesn't do rwsem_acquire_read().

Oleg.

Srivatsa S. Bhat Dec. 5, 2012, 6:53 p.m. UTC | #3
Replaying what Tejun wrote:

Hello, Oleg.

> Replaying what Oleg wrote:
> 
> Hi,
> 
> Sorry I don't understand the context and I can't find this thread
> anywhere, so I am not sure I understand...
> 

Weird, lkml cc is missing.  Srivatsa?

[Now fixed. This thread has lkml CC]

>> Replaying what Tejun wrote:
>> So, we basically need percpu_rwlock.  We already have percpu_rwsem.
> 
> Yes, and with -mm patches it becomes reader-friendly. In particular
> see http://marc.info/?l=linux-mm-commits&m=135240650828875
> 
>> Oleg, it seems
>> CPU hotplug needs big-reader rwlock, ideas on how to proceed?
>>
> 
> I am going to convert get_online_cpus() to use percpu_down_read(),
> this looks simple.
> 
> We already discussed this with Paul, see
> 
> 	http://marc.info/?l=linux-kernel&m=135248463226031
> 
> and the whole thread.
> 
> In short, all we need is percpu_down_write_recursive_readers() and
> afaics the only complication is lockdep, we need down_read_no_lockdep()
> which (like __up_read) doesn't do rwsem_acquire_read().
> 

So, it's a different thing.  There are two mechanism protecting
against cpu hotplug - get_online_cpus() and preempt_disable().  The
former can be used by ones which can sleep and need to protect against
the whole up/down process (DOWN_PREPARE and so on).  The latter
protects the last step and can be used when the caller can't sleep.
Replacing get_online_cpus() w/ percpu_rwsem is great but this thread
is about replacing preempt_disable with something finer grained and
less heavy on the writer side - IOW, percpu_rwlock as opposed to
percpu_rwsem, so, I think the end result would be that CPU hotplug
will be protected by percpu_rwsem for the whole part and by
percpu_rwlock for the last commit stage.

The problem seems that we don't have percpu_rwlock yet.  It shouldn't
be too difficult to implement, right?

Thanks.

-- tejun 

Srivatsa S. Bhat Dec. 5, 2012, 6:56 p.m. UTC | #4
Replaying what Oleg wrote:

(add lkml)

> Replaying what Tejun wrote:

> Replacing get_online_cpus() w/ percpu_rwsem is great but this thread
> is about replacing preempt_disable with something finer grained and
> less heavy on the writer side

If only I understood why preempt_disable() is bad ;-)

OK, I guess "less heavy on the writer side" is the hint, and in the
previous email you mentioned that "stop_machine() itself is extremely
heavy".

Looks like, you are going to remove stop_machine() from cpu_down ???
 
> The problem seems that we don't have percpu_rwlock yet.  It shouldn't
> be too difficult to implement, right?
> 

Oh, I am not sure... unless you simply copy-and-paste the lglock code
and replace spinlock_t with rwlock_t.

We probably want something more efficient, but I bet we can't avoid
the barriers on the read side.

And somehow we should avoid the livelocks. Say, we can't simply add
the per_cpu_reader_counter, _read_lock should spin if the writer is
active. But at the same time _read_lock should be recursive.

Tejun, could you please send me mbox with this thread offlist?

[That should now be unnecessary, since the discussion can continue
on-list on this thread].

Oleg.

Srivatsa S. Bhat Dec. 5, 2012, 6:59 p.m. UTC | #5
Replaying what Tejun wrote:

Hello, Oleg.

> Replaying what Oleg wrote:
>> Replacing get_online_cpus() w/ percpu_rwsem is great but this thread
>> is about replacing preempt_disable with something finer grained and
>> less heavy on the writer side
> 
> If only I understood why preempt_disable() is bad ;-)
> 
> OK, I guess "less heavy on the writer side" is the hint, and in the
> previous email you mentioned that "stop_machine() itself is extremely
> heavy".
> 
> Looks like, you are going to remove stop_machine() from cpu_down ???
>

Yeah, that's what Srivatsa is trying to do.  The problem seems to be
that cpu up/down is very frequent on certain mobile platforms for
power management and as currently implemented cpu hotplug is too heavy
and latency-inducing.
  
>> The problem seems that we don't have percpu_rwlock yet.  It shouldn't
>> be too difficult to implement, right?
>>
> 
> Oh, I am not sure... unless you simply copy-and-paste the lglock code
> and replace spinlock_t with rwlock_t.
> 

Ah... right, so that's where brlock ended up.  So, lglock is the new
thing and brlock is a wrapper around it.

> We probably want something more efficient, but I bet we can't avoid
> the barriers on the read side.
> 
> And somehow we should avoid the livelocks. Say, we can't simply add
> the per_cpu_reader_counter, _read_lock should spin if the writer is
> active. But at the same time _read_lock should be recursive.
> 

I think we should just go with lglock.  It does involve local atomic
ops but atomic ops themselves aren't that expensive and it's not like
we can avoid memory barriers.  Also, that's the non-sleeping
counterpart of percpu_rwsem.  If it's not good enough for some reason,
we should improve it rather than introducing something else.

Thanks.

-- tejun

Oleg Nesterov Dec. 5, 2012, 7:07 p.m. UTC | #6
I'll try to read this series later,

one minor and almost offtopic nit.

On 12/06, Srivatsa S. Bhat wrote:
>
>  static int __ref take_cpu_down(void *_param)
>  {
>  	struct take_cpu_down_param *param = _param;
> +	unsigned long flags;
>  	int err;
>
> +	/*
> +	 *  __cpu_disable() is the step where the CPU is removed from the
> +	 *  cpu_online_mask. Protect it with the light-lock held for write.
> +	 */
> +	write_lock_irqsave(&light_hotplug_rwlock, flags);
> +
>  	/* Ensure this CPU doesn't handle any more interrupts. */
>  	err = __cpu_disable();
> -	if (err < 0)
> +	if (err < 0) {
> +		write_unlock_irqrestore(&light_hotplug_rwlock, flags);
>  		return err;
> +	}
> +
> +	/*
> +	 * We have successfully removed the CPU from the cpu_online_mask.
> +	 * So release the light-lock, so that the light-weight atomic readers
> +	 * (who care only about the cpu_online_mask updates, and not really
> +	 * about the actual cpu-take-down operation) can continue.
> +	 *
> +	 * But don't enable interrupts yet, because we still have work left to
> +	 * do, to actually bring the CPU down.
> +	 */
> +	write_unlock(&light_hotplug_rwlock);
>
>  	cpu_notify(CPU_DYING | param->mod, param->hcpu);
> +
> +	local_irq_restore(flags);
>  	return 0;

This is subjective, but imho _irqsave and the fat comment look confusing.

Currently take_cpu_down() is always called with irqs disabled, so you
do not need to play with interrupts.

10/10 does s/__stop_machine/stop_cpus/ and that patch could simply add
local_irq_disable/enable into take_cpu_down().

But again this is minor and subjective, I won't insist.

Oleg.

Srivatsa S. Bhat Dec. 5, 2012, 7:16 p.m. UTC | #7
On 12/06/2012 12:37 AM, Oleg Nesterov wrote:
> I'll try to read this series later,
> 
> one minor and almost offtopic nit.
> 
> On 12/06, Srivatsa S. Bhat wrote:
>>
>>  static int __ref take_cpu_down(void *_param)
>>  {
>>  	struct take_cpu_down_param *param = _param;
>> +	unsigned long flags;
>>  	int err;
>>
>> +	/*
>> +	 *  __cpu_disable() is the step where the CPU is removed from the
>> +	 *  cpu_online_mask. Protect it with the light-lock held for write.
>> +	 */
>> +	write_lock_irqsave(&light_hotplug_rwlock, flags);
>> +
>>  	/* Ensure this CPU doesn't handle any more interrupts. */
>>  	err = __cpu_disable();
>> -	if (err < 0)
>> +	if (err < 0) {
>> +		write_unlock_irqrestore(&light_hotplug_rwlock, flags);
>>  		return err;
>> +	}
>> +
>> +	/*
>> +	 * We have successfully removed the CPU from the cpu_online_mask.
>> +	 * So release the light-lock, so that the light-weight atomic readers
>> +	 * (who care only about the cpu_online_mask updates, and not really
>> +	 * about the actual cpu-take-down operation) can continue.
>> +	 *
>> +	 * But don't enable interrupts yet, because we still have work left to
>> +	 * do, to actually bring the CPU down.
>> +	 */
>> +	write_unlock(&light_hotplug_rwlock);
>>
>>  	cpu_notify(CPU_DYING | param->mod, param->hcpu);
>> +
>> +	local_irq_restore(flags);
>>  	return 0;
> 
> This is subjective, but imho _irqsave and the fat comment look confusing.
> 
> Currently take_cpu_down() is always called with irqs disabled, so you
> do not need to play with interrupts.
> 
> 10/10 does s/__stop_machine/stop_cpus/ and that patch could simply add
> local_irq_disable/enable into take_cpu_down().
> 

Hmm, we could certainly do that, but somehow I felt it would be easier to
read if we tinker with and fix up the take_cpu_down() logic in one place, as
a whole, instead of breaking it up into pieces across different patches. And
that also makes the last patch look really cute: it just replaces
stop_machine() with stop_cpus(), as the changelog intended.

I'll see if doing like what you suggested improves the readability, and
if yes, I'll change it. Thank you!

Regards,
Srivatsa S. Bhat


Patch

diff --git a/include/linux/cpu.h b/include/linux/cpu.h
index ce7a074..dd0a3ee 100644
--- a/include/linux/cpu.h
+++ b/include/linux/cpu.h
@@ -175,6 +175,8 @@  extern struct bus_type cpu_subsys;
 
 extern void get_online_cpus(void);
 extern void put_online_cpus(void);
+extern void get_online_cpus_atomic_light(void);
+extern void put_online_cpus_atomic_light(void);
 #define hotcpu_notifier(fn, pri)	cpu_notifier(fn, pri)
 #define register_hotcpu_notifier(nb)	register_cpu_notifier(nb)
 #define unregister_hotcpu_notifier(nb)	unregister_cpu_notifier(nb)
@@ -198,6 +200,8 @@  static inline void cpu_hotplug_driver_unlock(void)
 
 #define get_online_cpus()	do { } while (0)
 #define put_online_cpus()	do { } while (0)
+#define get_online_cpus_atomic_light()	do { } while (0)
+#define put_online_cpus_atomic_light()	do { } while (0)
 #define hotcpu_notifier(fn, pri)	do { (void)(fn); } while (0)
 /* These aren't inline functions due to a GCC bug. */
 #define register_hotcpu_notifier(nb)	({ (void)(nb); 0; })
diff --git a/kernel/cpu.c b/kernel/cpu.c
index 42bd331..381593c 100644
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -49,6 +49,69 @@  static int cpu_hotplug_disabled;
 
 #ifdef CONFIG_HOTPLUG_CPU
 
+/*
+ * Reader-writer lock to synchronize between "light" atomic hotplug readers
+ * and the hotplug writer while updating cpu_online_mask.
+ * "Light" atomic hotplug readers are those who don't really need to
+ * synchronize with the actual CPU bring-up/take-down sequence, but only
+ * need to synchronize with the updates to the cpu_online_mask.
+ */
+static DEFINE_RWLOCK(light_hotplug_rwlock);
+
+/*
+ * Hotplug readers (those that want to prevent CPUs from coming online or
+ * going offline ) sometimes run from atomic contexts, and hence can't use
+ * get/put_online_cpus() because they can sleep. And often-times, all
+ * they really want is that the cpu_online_mask remain unchanged while
+ * they are executing in their critical section. They also don't really
+ * need to synchronize with the actual CPU tear-down sequence. Such atomic
+ * hotplug readers are called "light" readers (light for light-weight).
+ *
+ * These "light" atomic hotplug readers can use the APIs
+ * get/put_online_atomic_light() around their critical sections to
+ * ensure that the cpu_online_mask remains unaltered throughout that
+ * critical section.
+ *
+ * Caution!: While the readers are in their critical section, a CPU offline
+ * operation can actually happen under the covers; its just that the bit flip
+ * in the cpu_online_mask will be synchronized properly if you use these APIs.
+ * If you really want full synchronization with the entire CPU tear-down
+ * sequence, then you are not a "light" hotplug reader. So don't use these
+ * APIs!
+ *
+ * Eg:
+ *
+ * "Light" atomic hotplug read-side critical section:
+ * --------------------------------------------------
+ *
+ * get_online_cpus_atomic_light();
+ *
+ * for_each_online_cpu(cpu) {
+ *         ... Do something...
+ * }
+ *    ...
+ *
+ * if (cpu_online(other_cpu))
+ *         do_something();
+ *
+ * put_online_cpus_atomic_light();
+ *
+ * You can call this function recursively.
+ */
+void get_online_cpus_atomic_light(void)
+{
+	preempt_disable();
+	read_lock(&light_hotplug_rwlock);
+}
+EXPORT_SYMBOL_GPL(get_online_cpus_atomic_light);
+
+void put_online_cpus_atomic_light(void)
+{
+	read_unlock(&light_hotplug_rwlock);
+	preempt_enable();
+}
+EXPORT_SYMBOL_GPL(put_online_cpus_atomic_light);
+
 static struct {
 	struct task_struct *active_writer;
 	struct mutex lock; /* Synchronizes accesses to refcount, */
@@ -246,14 +309,36 @@  struct take_cpu_down_param {
 static int __ref take_cpu_down(void *_param)
 {
 	struct take_cpu_down_param *param = _param;
+	unsigned long flags;
 	int err;
 
+	/*
+	 *  __cpu_disable() is the step where the CPU is removed from the
+	 *  cpu_online_mask. Protect it with the light-lock held for write.
+	 */
+	write_lock_irqsave(&light_hotplug_rwlock, flags);
+
 	/* Ensure this CPU doesn't handle any more interrupts. */
 	err = __cpu_disable();
-	if (err < 0)
+	if (err < 0) {
+		write_unlock_irqrestore(&light_hotplug_rwlock, flags);
 		return err;
+	}
+
+	/*
+	 * We have successfully removed the CPU from the cpu_online_mask.
+	 * So release the light-lock, so that the light-weight atomic readers
+	 * (who care only about the cpu_online_mask updates, and not really
+	 * about the actual cpu-take-down operation) can continue.
+	 *
+	 * But don't enable interrupts yet, because we still have work left to
+	 * do, to actually bring the CPU down.
+	 */
+	write_unlock(&light_hotplug_rwlock);
 
 	cpu_notify(CPU_DYING | param->mod, param->hcpu);
+
+	local_irq_restore(flags);
 	return 0;
 }