
[RFC,v4,1/9] CPU hotplug: Provide APIs to prevent CPU offline from atomic context

Message ID 20121211140358.23621.97011.stgit@srivatsabhat.in.ibm.com (mailing list archive)
State RFC, archived

Commit Message

Srivatsa S. Bhat Dec. 11, 2012, 2:04 p.m. UTC
There are places where preempt_disable() is used to prevent any CPU from
going offline during the critical section. Let us call them "atomic
hotplug readers" ("atomic" because they run in atomic contexts).

Today, preempt_disable() works because the writer uses stop_machine().
But once stop_machine() is gone, the readers won't be able to prevent
CPUs from going offline using preempt_disable().

The intent of this patch is to provide synchronization APIs for such
atomic hotplug readers, to prevent (any) CPUs from going offline, without
depending on stop_machine() at the writer-side. The new APIs will look
something like this:  get/put_online_cpus_atomic()

Some important design requirements and considerations:
-----------------------------------------------------

1. Scalable synchronization at the reader-side, especially in the fast-path

   Any synchronization on the atomic hotplug readers' side must be highly
   scalable - avoid global single-holder locks/counters etc. These paths
   currently use the extremely fast preempt_disable(); our replacement for
   preempt_disable() should not become ridiculously costly, and it should
   not needlessly serialize the readers among themselves.

   At a minimum, the new APIs must be extremely fast on the reader side,
   at least in the fast-path, when no CPU offline writers are active.

2. preempt_disable() was recursive. The replacement should also be recursive.

3. No (new) lock-ordering restrictions

   preempt_disable() was super-flexible. It didn't impose any ordering
   restrictions or rules for nesting. Our replacement should be equally
   flexible and usable.

4. No deadlock possibilities

   Regular per-cpu locking is not the way to go if we want to have relaxed
   rules for lock-ordering, because we can end up in circular locking
   dependencies, as explained in https://lkml.org/lkml/2012/12/6/290

   So, avoid the usual per-cpu locking schemes (per-cpu locks/per-cpu atomic
   counters with spin-on-contention etc.) as much as possible.


Implementation of the design:
----------------------------

We use global rwlocks for synchronization, because then we won't get into
lock-ordering related problems (unlike per-cpu locks). However, global
rwlocks lead to unnecessary cache-line bouncing even when there are no
hotplug writers present, which can slow down the system needlessly.

Per-cpu counters can help solve the cache-line bouncing problem. So we
actually use the best of both: per-cpu counters (no-waiting) at the reader
side in the fast-path, and global rwlocks in the slowpath.

[ Fastpath = no writer is active; Slowpath = a writer is active ]

IOW, the hotplug readers just increment/decrement their per-cpu refcounts
when no writer is active. When a writer becomes active, it signals all
readers to switch to the global rwlock for the duration of the CPU offline
operation. The readers switch over when it is safe for them (i.e., when they
are about to start a fresh, non-nested read-side critical section) and
start using (holding) the global rwlock for read in their subsequent critical
sections.

The hotplug writer waits for every reader to switch, and then acquires
the global rwlock for write and takes the CPU offline. Then the writer
signals all readers that the CPU offline is done, and that they can go back
to using their per-cpu refcounts again.

Note that the lock-safety (despite the per-cpu scheme) comes from the fact
that the readers can *choose* _when_ to switch to rwlocks upon the writer's
signal. And the readers don't wait on anybody based on the per-cpu counters.
The only true synchronization that involves waiting at the reader-side in this
scheme, is the one arising from the global rwlock, which is safe from the
circular locking dependency problems mentioned above (because it is global).

Reader-writer locks and per-cpu counters are recursive, so they can be
used in a nested fashion in the reader path. Also, this design of switching
the synchronization scheme ensures that you can safely nest and call these
APIs in any way you want, just like preempt_disable()/preempt_enable().

Together, these satisfy all the requirements mentioned above.
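
As a rough illustration of the reader-side switching described above, here is
a simplified sketch (not the actual patch: the memory barriers, the writer-side
signalling and the active-writer check are omitted; the names
reader_percpu_refcnt, writer_signal and hotplug_rwlock are the ones used in
the patch and in the discussion below):

	static DEFINE_PER_CPU(unsigned long, reader_percpu_refcnt);
	static DEFINE_PER_CPU(bool, writer_signal);
	static DEFINE_RWLOCK(hotplug_rwlock);

	void get_online_cpus_atomic(void)
	{
		unsigned long flags;

		preempt_disable();
		local_irq_save(flags);

		if (this_cpu_read(reader_percpu_refcnt)) {
			/* Nested reader: keep using the per-cpu refcount. */
			this_cpu_inc(reader_percpu_refcnt);
		} else if (!this_cpu_read(writer_signal)) {
			/* Fast path: no writer active, take a per-cpu reference. */
			this_cpu_inc(reader_percpu_refcnt);
		} else {
			/* Slow path: a writer is active, use the global rwlock. */
			read_lock(&hotplug_rwlock);
		}

		local_irq_restore(flags);
	}

	void put_online_cpus_atomic(void)
	{
		/* Undo whatever our matching get_online_cpus_atomic() did. */
		if (this_cpu_read(reader_percpu_refcnt))
			this_cpu_dec(reader_percpu_refcnt);
		else
			read_unlock(&hotplug_rwlock);

		preempt_enable();
	}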

I'm indebted to Michael Wang and Xiao Guangrong for their numerous thoughtful
suggestions and ideas, which inspired and influenced many of the decisions in
this as well as previous designs. Thanks a lot Michael and Xiao!

Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
---

 include/linux/cpu.h |    4 +
 kernel/cpu.c        |  204 ++++++++++++++++++++++++++++++++++++++++++++++++++-
 2 files changed, 205 insertions(+), 3 deletions(-)



Comments

Oleg Nesterov Dec. 12, 2012, 5:17 p.m. UTC | #1
On 12/11, Srivatsa S. Bhat wrote:
>
> IOW, the hotplug readers just increment/decrement their per-cpu refcounts
> when no writer is active.

plus cli/sti ;) and increment/decrement are atomic.

At first glance looks correct to me, but I'll try to read it carefully
later.

A couple of minor nits,

> +static DEFINE_PER_CPU(bool, writer_signal);

Why it needs to be per-cpu? It can be global and __read_mostly to avoid
the false-sharing. OK, perhaps to put reader_percpu_refcnt/writer_signal
into a single cacheline...

> +void get_online_cpus_atomic(void)
> +{
> +	unsigned long flags;
> +
> +	preempt_disable();
> +
> +	if (cpu_hotplug.active_writer == current)
> +		return;
> +
> +	local_irq_save(flags);

Yes... this is still needed, we are going to increment reader_percpu_refcnt
unconditionally and this makes reader_nested_percpu() == T.

But,

> +void put_online_cpus_atomic(void)
> +{
> +	unsigned long flags;
> +
> +	if (cpu_hotplug.active_writer == current)
> +		goto out;
> +
> +	local_irq_save(flags);
> +
> +	/*
> +	 * We never allow heterogeneous nesting of readers. So it is trivial
> +	 * to find out the kind of reader we are, and undo the operation
> +	 * done by our corresponding get_online_cpus_atomic().
> +	 */
> +	if (__this_cpu_read(reader_percpu_refcnt))
> +		__this_cpu_dec(reader_percpu_refcnt);
> +	else
> +		read_unlock(&hotplug_rwlock);
> +
> +	local_irq_restore(flags);
> +out:
> +	preempt_enable();
> +}

Do we really need local_irq_save/restore in put_ ?

Oleg.

Oleg Nesterov Dec. 12, 2012, 5:24 p.m. UTC | #2
On 12/12, Oleg Nesterov wrote:
>
> On 12/11, Srivatsa S. Bhat wrote:
> >
> > IOW, the hotplug readers just increment/decrement their per-cpu refcounts
> > when no writer is active.
>
> plus cli/sti ;) and increment/decrement are atomic.
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

OOPS, sorry I was going to say "adds mb()".

And when I look at get_online_cpus_atomic() again it uses rmb(). This
doesn't look correct, we need the full barrier between this_cpu_inc()
and writer_active().

At the same time reader_nested_percpu() can be checked before mb().

Oleg.

Srivatsa S. Bhat Dec. 12, 2012, 5:53 p.m. UTC | #3
On 12/12/2012 10:47 PM, Oleg Nesterov wrote:
> On 12/11, Srivatsa S. Bhat wrote:
>>
>> IOW, the hotplug readers just increment/decrement their per-cpu refcounts
>> when no writer is active.
> 
> plus cli/sti ;)

Of course, forgot to mention it, again! :)

> and increment/decrement are atomic.
> 
> At first glance looks correct to me, but I'll try to read it carefully
> later.
> 
> A couple of minor nits,
> 
>> +static DEFINE_PER_CPU(bool, writer_signal);
> 
> Why it needs to be per-cpu? It can be global and __read_mostly to avoid
> the false-sharing. OK, perhaps to put reader_percpu_refcnt/writer_signal
> into a single cacheline...
> 

Even I realized this (that we could use a global) after posting out the
series.. But do you think that it would be better to retain the per-cpu
variant itself, due to the cache effects?

>> +void get_online_cpus_atomic(void)
>> +{
>> +	unsigned long flags;
>> +
>> +	preempt_disable();
>> +
>> +	if (cpu_hotplug.active_writer == current)
>> +		return;
>> +
>> +	local_irq_save(flags);
> 
> Yes... this is still needed, we are going to increment reader_percpu_refcnt
> unconditionally and this makes reader_nested_percpu() == T.
> 
> But,
> 
>> +void put_online_cpus_atomic(void)
>> +{
>> +	unsigned long flags;
>> +
>> +	if (cpu_hotplug.active_writer == current)
>> +		goto out;
>> +
>> +	local_irq_save(flags);
>> +
>> +	/*
>> +	 * We never allow heterogeneous nesting of readers. So it is trivial
>> +	 * to find out the kind of reader we are, and undo the operation
>> +	 * done by our corresponding get_online_cpus_atomic().
>> +	 */
>> +	if (__this_cpu_read(reader_percpu_refcnt))
>> +		__this_cpu_dec(reader_percpu_refcnt);
>> +	else
>> +		read_unlock(&hotplug_rwlock);
>> +
>> +	local_irq_restore(flags);
>> +out:
>> +	preempt_enable();
>> +}
> 
> Do we really need local_irq_save/restore in put_ ?
>

Hmm.. good point! I don't think we need it.
 

Regards,
Srivatsa S. Bhat


Oleg Nesterov Dec. 12, 2012, 6:02 p.m. UTC | #4
On 12/12, Srivatsa S. Bhat wrote:
>
> On 12/12/2012 10:47 PM, Oleg Nesterov wrote:
> >
> > Why it needs to be per-cpu? It can be global and __read_mostly to avoid
> > the false-sharing. OK, perhaps to put reader_percpu_refcnt/writer_signal
> > into a single cacheline...
>
> Even I realized this (that we could use a global) after posting out the
> series.. But do you think that it would be better to retain the per-cpu
> variant itself, due to the cache effects?

I don't really know, up to you. This was the question ;)

> > Do we really need local_irq_save/restore in put_ ?
> >
>
> Hmm.. good point! I don't think we need it.

And _perhaps_ get_ can avoid it too?

I didn't really try to think, probably this is not right, but can't
something like this work?

	#define XXXX	(1 << 16)
	#define MASK	(XXXX -1)

	void get_online_cpus_atomic(void)
	{
		preempt_disable();

		// only for writer
		__this_cpu_add(reader_percpu_refcnt, XXXX);

		if (__this_cpu_read(reader_percpu_refcnt) & MASK) {
			__this_cpu_inc(reader_percpu_refcnt);
		} else {
			smp_wmb();
			if (writer_active()) {
				...
			}
		}

		__this_cpu_dec(reader_percpu_refcnt, XXXX);
	}

	void put_online_cpus_atomic(void)
	{
		if (__this_cpu_read(reader_percpu_refcnt) & MASK)
			__this_cpu_dec(reader_percpu_refcnt);
		else
			read_unlock(&hotplug_rwlock);
		preempt_enable();
	}

Oleg.

Srivatsa S. Bhat Dec. 12, 2012, 6:11 p.m. UTC | #5
On 12/12/2012 10:54 PM, Oleg Nesterov wrote:
> On 12/12, Oleg Nesterov wrote:
>>
>> On 12/11, Srivatsa S. Bhat wrote:
>>>
>>> IOW, the hotplug readers just increment/decrement their per-cpu refcounts
>>> when no writer is active.
>>
>> plus cli/sti ;) and increment/decrement are atomic.
>                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> 
> OOPS, sorry I was going to say "adds mb()".
> 

Ok, got it :)

> And when I look at get_online_cpus_atomic() again it uses rmb(). This
> doesn't look correct, we need the full barrier between this_cpu_inc()
> and writer_active().
> 

Hmm..

> At the same time reader_nested_percpu() can be checked before mb().
> 

I thought that since the increment and the check (reader_nested_percpu)
act on the same memory location, they will naturally be run in the given
order, without any need for barriers. Am I wrong?

(I referred Documentation/memory-barriers.txt again to verify this, and
the second point under the "Guarantees" section looked like it said the
same thing : "Overlapping loads and stores within a particular CPU will
appear to be ordered within that CPU").

Regards,
Srivatsa S. Bhat

Oleg Nesterov Dec. 12, 2012, 6:23 p.m. UTC | #6
On 12/12, Srivatsa S. Bhat wrote:
>
> On 12/12/2012 10:54 PM, Oleg Nesterov wrote:
>
> > And when I look at get_online_cpus_atomic() again it uses rmb(). This
> > doesn't look correct, we need the full barrier between this_cpu_inc()
> > and writer_active().
>
> Hmm..
>
> > At the same time reader_nested_percpu() can be checked before mb().
>
> I thought that since the increment and the check (reader_nested_percpu)
> act on the same memory location, they will naturally be run in the given
> order, without any need for barriers. Am I wrong?

And this is what I meant, you do not need a barrier before
reader_nested_percpu().

But you need to ensure that WRITE(reader_percpu_refcnt) and READ(writer_signal)
can't be reordered, so you need mb() in between. rmb() can serialize LOADs and
STOREs.

Or I misunderstood?

Oleg.

Srivatsa S. Bhat Dec. 12, 2012, 6:30 p.m. UTC | #7
On 12/12/2012 11:32 PM, Oleg Nesterov wrote:
> On 12/12, Srivatsa S. Bhat wrote:
>>
>> On 12/12/2012 10:47 PM, Oleg Nesterov wrote:
>>>
>>> Why it needs to be per-cpu? It can be global and __read_mostly to avoid
>>> the false-sharing. OK, perhaps to put reader_percpu_refcnt/writer_signal
>>> into a single cacheline...
>>
>> Even I realized this (that we could use a global) after posting out the
>> series.. But do you think that it would be better to retain the per-cpu
>> variant itself, due to the cache effects?
> 
> I don't really know, up to you. This was the question ;)

OK :-)

> 
>>> Do we really need local_irq_save/restore in put_ ?
>>>
>>
>> Hmm.. good point! I don't think we need it.
> 
> And _perhaps_ get_ can avoid it too?
> 
> I didn't really try to think, probably this is not right, but can't
> something like this work?
> 
> 	#define XXXX	(1 << 16)
> 	#define MASK	(XXXX -1)
> 
> 	void get_online_cpus_atomic(void)
> 	{
> 		preempt_disable();
> 
> 		// only for writer
> 		__this_cpu_add(reader_percpu_refcnt, XXXX);
> 
> 		if (__this_cpu_read(reader_percpu_refcnt) & MASK) {
> 			__this_cpu_inc(reader_percpu_refcnt);
> 		} else {
> 			smp_wmb();
> 			if (writer_active()) {
> 				...
> 			}
> 		}
> 
> 		__this_cpu_dec(reader_percpu_refcnt, XXXX);
> 	}
> 

Sorry, may be I'm too blind to see, but I didn't understand the logic
of how the mask helps us avoid disabling interrupts.. Can you kindly
elaborate?

> 	void put_online_cpus_atomic(void)
> 	{
> 		if (__this_cpu_read(reader_percpu_refcnt) & MASK)
> 			__this_cpu_dec(reader_percpu_refcnt);
> 		else
> 			read_unlock(&hotplug_rwlock);
> 		preempt_enable();
> 	}
> 

Regards,
Srivatsa S. Bhat

Srivatsa S. Bhat Dec. 12, 2012, 6:42 p.m. UTC | #8
On 12/12/2012 11:53 PM, Oleg Nesterov wrote:
> On 12/12, Srivatsa S. Bhat wrote:
>>
>> On 12/12/2012 10:54 PM, Oleg Nesterov wrote:
>>
>>> And when I look at get_online_cpus_atomic() again it uses rmb(). This
>>> doesn't look correct, we need the full barrier between this_cpu_inc()
>>> and writer_active().
>>
>> Hmm..
>>
>>> At the same time reader_nested_percpu() can be checked before mb().
>>
>> I thought that since the increment and the check (reader_nested_percpu)
>> act on the same memory location, they will naturally be run in the given
>> order, without any need for barriers. Am I wrong?
> 
> And this is what I meant, you do not need a barrier before
> reader_nested_percpu().
> 

Ah, ok!

> But you need to ensure that WRITE(reader_percpu_refcnt) and READ(writer_signal)
> can't be reordered, so you need mb() in between. rmb() can serialize LOADs and
> STOREs.
> 

OK, got it. (I know you meant s/can/can't).

I'm trying to see if we can somehow exploit the fact that the writer can
potentially tolerate if a reader ignores his signal (to switch to rwlocks)
for a while... and use this to get rid of barriers in the reader path (without
using synchronize_sched() at the writer, of course). And perhaps also take advantage
of the fact that the read_lock() acts as a one-way barrier..

I don't know, maybe its not possible after all.. :-/
 
Regards,
Srivatsa S. Bhat

Oleg Nesterov Dec. 12, 2012, 6:48 p.m. UTC | #9
On 12/13, Srivatsa S. Bhat wrote:
>
> On 12/12/2012 11:32 PM, Oleg Nesterov wrote:
> > And _perhaps_ get_ can avoid it too?
> >
> > I didn't really try to think, probably this is not right, but can't
> > something like this work?
> >
> > 	#define XXXX	(1 << 16)
> > 	#define MASK	(XXXX -1)
> >
> > 	void get_online_cpus_atomic(void)
> > 	{
> > 		preempt_disable();
> >
> > 		// only for writer
> > 		__this_cpu_add(reader_percpu_refcnt, XXXX);
> >
> > 		if (__this_cpu_read(reader_percpu_refcnt) & MASK) {
> > 			__this_cpu_inc(reader_percpu_refcnt);
> > 		} else {
> > 			smp_wmb();
> > 			if (writer_active()) {
> > 				...
> > 			}
> > 		}
> >
> > 		__this_cpu_dec(reader_percpu_refcnt, XXXX);
> > 	}
> >
>
> Sorry, may be I'm too blind to see, but I didn't understand the logic
> of how the mask helps us avoid disabling interrupts..

Why do we need cli/sti at all? We should prevent the following race:

	- the writer already holds hotplug_rwlock, so get_ must not
	  succeed.

	- the new reader comes, it increments reader_percpu_refcnt,
	  but before it checks writer_active() ...

	- irq handler does get_online_cpus_atomic() and sees
	  reader_nested_percpu() == T, so it simply increments
	  reader_percpu_refcnt and succeeds.

OTOH, why do we need to increment reader_percpu_refcnt the counter
in advance? To ensure that either we see writer_active() or the
writer should see reader_percpu_refcnt != 0 (and that is why they
should write/read in reverse order).

The code above tries to avoid this race using the lower 16 bits
as a "nested-counter", and the upper bits to avoid the race with
the writer.

	// only for writer
	__this_cpu_add(reader_percpu_refcnt, XXXX);

If irq comes and does get_online_cpus_atomic(), it won't be confused
by __this_cpu_add(XXXX), it will check the lower bits and switch to
the "slow path".


But once again, so far I didn't really try to think. It is quite
possible I missed something.

Oleg.

Srivatsa S. Bhat Dec. 12, 2012, 7:12 p.m. UTC | #10
On 12/13/2012 12:18 AM, Oleg Nesterov wrote:
> On 12/13, Srivatsa S. Bhat wrote:
>>
>> On 12/12/2012 11:32 PM, Oleg Nesterov wrote:
>>> And _perhaps_ get_ can avoid it too?
>>>
>>> I didn't really try to think, probably this is not right, but can't
>>> something like this work?
>>>
>>> 	#define XXXX	(1 << 16)
>>> 	#define MASK	(XXXX -1)
>>>
>>> 	void get_online_cpus_atomic(void)
>>> 	{
>>> 		preempt_disable();
>>>
>>> 		// only for writer
>>> 		__this_cpu_add(reader_percpu_refcnt, XXXX);
>>>
>>> 		if (__this_cpu_read(reader_percpu_refcnt) & MASK) {
>>> 			__this_cpu_inc(reader_percpu_refcnt);
>>> 		} else {
>>> 			smp_wmb();
>>> 			if (writer_active()) {
>>> 				...
>>> 			}
>>> 		}
>>>
>>> 		__this_cpu_dec(reader_percpu_refcnt, XXXX);
>>> 	}
>>>
>>
>> Sorry, may be I'm too blind to see, but I didn't understand the logic
>> of how the mask helps us avoid disabling interrupts..
> 
> Why do we need cli/sti at all? We should prevent the following race:
> 
> 	- the writer already holds hotplug_rwlock, so get_ must not
> 	  succeed.
> 
> 	- the new reader comes, it increments reader_percpu_refcnt,
> 	  but before it checks writer_active() ...
> 
> 	- irq handler does get_online_cpus_atomic() and sees
> 	  reader_nested_percpu() == T, so it simply increments
> 	  reader_percpu_refcnt and succeeds.
> 
> OTOH, why do we need to increment reader_percpu_refcnt the counter
> in advance? To ensure that either we see writer_active() or the
> writer should see reader_percpu_refcnt != 0 (and that is why they
> should write/read in reverse order).
> 
> The code above tries to avoid this race using the lower 16 bits
> as a "nested-counter", and the upper bits to avoid the race with
> the writer.
> 
> 	// only for writer
> 	__this_cpu_add(reader_percpu_refcnt, XXXX);
> 
> If irq comes and does get_online_cpus_atomic(), it won't be confused
> by __this_cpu_add(XXXX), it will check the lower bits and switch to
> the "slow path".
> 

This is a very clever scheme indeed! :-) Thanks a lot for explaining
it in detail.

> 
> But once again, so far I didn't really try to think. It is quite
> possible I missed something.
> 

Even I don't spot anything wrong with it. But I'll give it some more
thought..

Regards,
Srivatsa S. Bhat

Oleg Nesterov Dec. 12, 2012, 7:36 p.m. UTC | #11
On 12/12, Oleg Nesterov wrote:
>
> On 12/12, Srivatsa S. Bhat wrote:
> >
> > On 12/12/2012 10:47 PM, Oleg Nesterov wrote:
> > >
> > > Why it needs to be per-cpu? It can be global and __read_mostly to avoid
> > > the false-sharing. OK, perhaps to put reader_percpu_refcnt/writer_signal
> > > into a single cacheline...
> >
> > Even I realized this (that we could use a global) after posting out the
> > series.. But do you think that it would be better to retain the per-cpu
> > variant itself, due to the cache effects?
>
> I don't really know, up to you. This was the question ;)

But perhaps there is another reason to make it per-cpu...

It seems we can avoid cpu_hotplug.active_writer == current check in
get/put.

take_cpu_down() can clear this_cpu(writer_signal) right after it takes
hotplug_rwlock for writing. It runs with irqs and preemption disabled,
nobody else will ever look at writer_signal on its CPU.

Oleg.

Srivatsa S. Bhat Dec. 12, 2012, 7:43 p.m. UTC | #12
On 12/13/2012 01:06 AM, Oleg Nesterov wrote:
> On 12/12, Oleg Nesterov wrote:
>>
>> On 12/12, Srivatsa S. Bhat wrote:
>>>
>>> On 12/12/2012 10:47 PM, Oleg Nesterov wrote:
>>>>
>>>> Why it needs to be per-cpu? It can be global and __read_mostly to avoid
>>>> the false-sharing. OK, perhaps to put reader_percpu_refcnt/writer_signal
>>>> into a single cacheline...
>>>
>>> Even I realized this (that we could use a global) after posting out the
>>> series.. But do you think that it would be better to retain the per-cpu
>>> variant itself, due to the cache effects?
>>
>> I don't really know, up to you. This was the question ;)
> 
> But perhaps there is another reason to make it per-cpu...
> 
> It seems we can avoid cpu_hotplug.active_writer == current check in
> get/put.
> 
> take_cpu_down() can clear this_cpu(writer_signal) right after it takes
> hotplug_rwlock for writing. It runs with irqs and preemption disabled,
> nobody else will ever look at writer_signal on its CPU.
> 

Hmm.. And then the get/put_ on that CPU will increment/decrement the per-cpu
refcount, but we don't care.. because we only need to ensure that they don't
deadlock by taking the rwlock for read.

This sounds great!

Regards,
Srivatsa S. Bhat

Oleg Nesterov Dec. 12, 2012, 9:10 p.m. UTC | #13
On 12/13, Srivatsa S. Bhat wrote:
>
> On 12/13/2012 01:06 AM, Oleg Nesterov wrote:
> >
> > But perhaps there is another reason to make it per-cpu...

Actually this is not the reason, please see below. But let me repeat,
it is not that I suggest to remove "per-cpu".

> > It seems we can avoid cpu_hotplug.active_writer == current check in
> > get/put.
> >
> > take_cpu_down() can clear this_cpu(writer_signal) right after it takes
> > hotplug_rwlock for writing. It runs with irqs and preemption disabled,
> > nobody else will ever look at writer_signal on its CPU.
> >
>
> Hmm.. And then the get/put_ on that CPU will increment/decrement the per-cpu
> refcount, but we don't care.. because we only need to ensure that they don't
> deadlock by taking the rwlock for read.

Yes, but...

Probably it would be more clean to simply do this_cpu_inc(reader_percpu_refcnt)
after write_lock(hotplug_rwlock). This will have the same effect for get/put,
and we still can make writer_signal global (if we want).

And note that this will also simplify the lockdep annotations which we (imho)
should add later.

Ignoring all complications get_online_cpus_atomic() does:

	if (this_cpu_read(reader_percpu_refcnt))
		this_cpu_inc(reader_percpu_refcnt);
	else if (!writer_signal)
		this_cpu_inc(reader_percpu_refcnt);	// same as above
	else
		read_lock(&hotplug_rwlock);

But for lockdep it should do:

	if (this_cpu_read(reader_percpu_refcnt))
		this_cpu_inc(reader_percpu_refcnt);
	else if (!writer_signal) {
		this_cpu_inc(reader_percpu_refcnt);
		// pretend we take hotplug_rwlock for lockdep
		rwlock_acquire_read(&hotplug_rwlock.dep_map, 0, 0);
	}
	else
		read_lock(&hotplug_rwlock);

And we need to ensure that rwlock_acquire_read() is not called under
write_lock(hotplug_rwlock).

If we use reader_percpu_refcnt to fool get/put, we should not worry.

Oleg.

Srivatsa S. Bhat Dec. 13, 2012, 3:26 p.m. UTC | #14
On 12/13/2012 12:42 AM, Srivatsa S. Bhat wrote:
> On 12/13/2012 12:18 AM, Oleg Nesterov wrote:
>> On 12/13, Srivatsa S. Bhat wrote:
>>>
>>> On 12/12/2012 11:32 PM, Oleg Nesterov wrote:
>>>> And _perhaps_ get_ can avoid it too?
>>>>
>>>> I didn't really try to think, probably this is not right, but can't
>>>> something like this work?
>>>>
>>>> 	#define XXXX	(1 << 16)
>>>> 	#define MASK	(XXXX -1)
>>>>
>>>> 	void get_online_cpus_atomic(void)
>>>> 	{
>>>> 		preempt_disable();
>>>>
>>>> 		// only for writer
>>>> 		__this_cpu_add(reader_percpu_refcnt, XXXX);
>>>>
>>>> 		if (__this_cpu_read(reader_percpu_refcnt) & MASK) {
>>>> 			__this_cpu_inc(reader_percpu_refcnt);
>>>> 		} else {
>>>> 			smp_wmb();
>>>> 			if (writer_active()) {
>>>> 				...
>>>> 			}
>>>> 		}
>>>>
>>>> 		__this_cpu_dec(reader_percpu_refcnt, XXXX);
>>>> 	}
>>>>
>>>
>>> Sorry, may be I'm too blind to see, but I didn't understand the logic
>>> of how the mask helps us avoid disabling interrupts..
>>
>> Why do we need cli/sti at all? We should prevent the following race:
>>
>> 	- the writer already holds hotplug_rwlock, so get_ must not
>> 	  succeed.
>>
>> 	- the new reader comes, it increments reader_percpu_refcnt,
>> 	  but before it checks writer_active() ...
>>
>> 	- irq handler does get_online_cpus_atomic() and sees
>> 	  reader_nested_percpu() == T, so it simply increments
>> 	  reader_percpu_refcnt and succeeds.
>>
>> OTOH, why do we need to increment reader_percpu_refcnt the counter
>> in advance? To ensure that either we see writer_active() or the
>> writer should see reader_percpu_refcnt != 0 (and that is why they
>> should write/read in reverse order).
>>
>> The code above tries to avoid this race using the lower 16 bits
>> as a "nested-counter", and the upper bits to avoid the race with
>> the writer.
>>
>> 	// only for writer
>> 	__this_cpu_add(reader_percpu_refcnt, XXXX);
>>
>> If irq comes and does get_online_cpus_atomic(), it won't be confused
>> by __this_cpu_add(XXXX), it will check the lower bits and switch to
>> the "slow path".
>>
> 
> This is a very clever scheme indeed! :-) Thanks a lot for explaining
> it in detail.
> 
>>
>> But once again, so far I didn't really try to think. It is quite
>> possible I missed something.
>>
> 
> Even I don't spot anything wrong with it. But I'll give it some more
> thought..

Since an interrupt handler can also run get_online_cpus_atomic(), we
cannot use the __this_cpu_* versions for modifying reader_percpu_refcnt,
right?

To maintain the integrity of the update itself, we will have to use the
this_cpu_* variant, which basically plays spoil-sport on this whole
scheme... :-(

But still, this scheme is better, because the reader doesn't have to spin
on the read_lock() with interrupts disabled. That way, interrupt handlers
that are not hotplug readers can continue to run on this CPU while taking
another CPU offline.

Regards,
Srivatsa S. Bhat

Oleg Nesterov Dec. 13, 2012, 4:17 p.m. UTC | #15
On 12/13, Srivatsa S. Bhat wrote:
>
> On 12/13/2012 12:42 AM, Srivatsa S. Bhat wrote:
> >
> > Even I don't spot anything wrong with it. But I'll give it some more
> > thought..
>
> Since an interrupt handler can also run get_online_cpus_atomic(), we
> cannot use the __this_cpu_* versions for modifying reader_percpu_refcnt,
> right?

Hmm. I thought that __this_cpu_* must be safe under preempt_disable().
IOW, I thought that, say, this_cpu_inc() is "equal" to preempt_disable +
__this_cpu_inc() correctness-wise.

And. I thought that this_cpu_inc() is safe wrt interrupt, like local_t.

But when I try to read the comments percpu.h, I am starting to think that
even this_cpu_inc() is not safe if irq handler can do the same?

Confused...

I am shy to ask... will, say, DEFINE_PER_CPU(local_t) and
local_inc(__this_cpu_ptr(...)) work??

> But still, this scheme is better, because the reader doesn't have to spin
> on the read_lock() with interrupts disabled.

Yes, but my main concern is that irq_disable/enable itself is not that cheap.

Oleg.

Srivatsa S. Bhat Dec. 13, 2012, 4:32 p.m. UTC | #16
On 12/13/2012 09:47 PM, Oleg Nesterov wrote:
> On 12/13, Srivatsa S. Bhat wrote:
>>
>> On 12/13/2012 12:42 AM, Srivatsa S. Bhat wrote:
>>>
>>> Even I don't spot anything wrong with it. But I'll give it some more
>>> thought..
>>
>> Since an interrupt handler can also run get_online_cpus_atomic(), we
>> cannot use the __this_cpu_* versions for modifying reader_percpu_refcnt,
>> right?
> 
> Hmm. I thought that __this_cpu_* must be safe under preempt_disable().
> IOW, I thought that, say, this_cpu_inc() is "equal" to preempt_disable +
> __this_cpu_inc() correctness-wise.
>
> And. I thought that this_cpu_inc() is safe wrt interrupt, like local_t.
> 
> But when I try to read the comments percpu.h, I am starting to think that
> even this_cpu_inc() is not safe if irq handler can do the same?
> 

The comment seems to say that its not safe wrt interrupts. But looking at
the code in include/linux/percpu.h, IIUC, that is true only about
this_cpu_read() because it only disables preemption.

However, this_cpu_inc() looks safe wrt interrupts because it wraps the
increment within raw_local_irqsave()/restore().

> Confused...
> 
> I am shy to ask... will, say, DEFINE_PER_CPU(local_t) and
> local_inc(__this_cpu_ptr(...)) work??
> 
>> But still, this scheme is better, because the reader doesn't have to spin
>> on the read_lock() with interrupts disabled.
> 
> Yes, but my main concern is that irq_disable/enable itself is not that cheap.
> 

Regards,
Srivatsa S. Bhat

Tejun Heo Dec. 13, 2012, 4:32 p.m. UTC | #17
Hello, Oleg.

On Thu, Dec 13, 2012 at 05:17:09PM +0100, Oleg Nesterov wrote:
> Hmm. I thought that __this_cpu_* must be safe under preempt_disable().
> IOW, I thought that, say, this_cpu_inc() is "equal" to preempt_disable +
> __this_cpu_inc() correctness-wise.

this_cpu_inc() equals local_irq_save() + __this_cpu_inc().

> And. I thought that this_cpu_inc() is safe wrt interrupt, like local_t.

Yes, it is safe.

> But when I try to read the comments percpu.h, I am starting to think that
> even this_cpu_inc() is not safe if irq handler can do the same?
> 
> Confused...

Yeah, the comment is confusing and the way these macros are defined
doesn't help.  There used to be three variants and it looks like we
didn't update the comment while removing the preempt safe ones.  Gotta
clean those up.  Anyways, yes, this_cpu_*() are safe against irqs.
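
Roughly, in terms of the generic (non-arch-optimized) implementations, the
difference is this (just a sketch; example_cnt is a made-up variable):

	static DEFINE_PER_CPU(unsigned long, example_cnt);

	static void example(void)
	{
		/* irq-safe: roughly local_irq_save() + __this_cpu_inc() + restore */
		this_cpu_inc(example_cnt);

		/*
		 * Not irq-safe on its own: a plain read-modify-write, so an
		 * increment done by an interrupting irq handler on this CPU
		 * can be lost.  Safe only if irq handlers never touch the
		 * same variable.
		 */
		preempt_disable();
		__this_cpu_inc(example_cnt);
		preempt_enable();
	}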

Thanks.
Oleg Nesterov Dec. 14, 2012, 6:03 p.m. UTC | #18
On 12/13, Srivatsa S. Bhat wrote:
>
> On 12/13/2012 09:47 PM, Oleg Nesterov wrote:
> > On 12/13, Srivatsa S. Bhat wrote:
> >>
> >> On 12/13/2012 12:42 AM, Srivatsa S. Bhat wrote:
> >>>
> >>> Even I don't spot anything wrong with it. But I'll give it some more
> >>> thought..
> >>
> >> Since an interrupt handler can also run get_online_cpus_atomic(), we
> >> cannot use the __this_cpu_* versions for modifying reader_percpu_refcnt,
> >> right?
> >
> > Hmm. I thought that __this_cpu_* must be safe under preempt_disable().
> > IOW, I thought that, say, this_cpu_inc() is "equal" to preempt_disable +
> > __this_cpu_inc() correctness-wise.
> >
> > And. I thought that this_cpu_inc() is safe wrt interrupt, like local_t.
> >
> > But when I try to read the comments percpu.h, I am starting to think that
> > even this_cpu_inc() is not safe if irq handler can do the same?
> >
>
> The comment seems to say that its not safe wrt interrupts. But looking at
> the code in include/linux/percpu.h, IIUC, that is true only about
> this_cpu_read() because it only disables preemption.
>
> However, this_cpu_inc() looks safe wrt interrupts because it wraps the
> increment within raw_local_irqsave()/restore().

You mean _this_cpu_generic_to_op() I guess. So yes, I think you are right,
this_cpu_* should be irq-safe, but __this_cpu_* is not.

Thanks.

At least on x86 there is no difference between this_ and __this_, both do
percpu_add_op() without local_irq_disable/enable. But it seems that most
of architectures use generic code.

Oleg.

Srivatsa S. Bhat Dec. 18, 2012, 3:53 p.m. UTC | #19
On 12/14/2012 11:33 PM, Oleg Nesterov wrote:
> On 12/13, Srivatsa S. Bhat wrote:
>>
>> On 12/13/2012 09:47 PM, Oleg Nesterov wrote:
>>> On 12/13, Srivatsa S. Bhat wrote:
>>>>
>>>> On 12/13/2012 12:42 AM, Srivatsa S. Bhat wrote:
>>>>>
>>>>> Even I don't spot anything wrong with it. But I'll give it some more
>>>>> thought..
>>>>
>>>> Since an interrupt handler can also run get_online_cpus_atomic(), we
>>>> cannot use the __this_cpu_* versions for modifying reader_percpu_refcnt,
>>>> right?
>>>
>>> Hmm. I thought that __this_cpu_* must be safe under preempt_disable().
>>> IOW, I thought that, say, this_cpu_inc() is "equal" to preempt_disable +
>>> __this_cpu_inc() correctness-wise.
>>>
>>> And. I thought that this_cpu_inc() is safe wrt interrupt, like local_t.
>>>
>>> But when I try to read the comments percpu.h, I am starting to think that
>>> even this_cpu_inc() is not safe if irq handler can do the same?
>>>
>>
>> The comment seems to say that its not safe wrt interrupts. But looking at
>> the code in include/linux/percpu.h, IIUC, that is true only about
>> this_cpu_read() because it only disables preemption.
>>
>> However, this_cpu_inc() looks safe wrt interrupts because it wraps the
>> increment within raw_local_irqsave()/restore().
> 
> You mean _this_cpu_generic_to_op() I guess. So yes, I think you are right,
> this_cpu_* should be irq-safe, but __this_cpu_* is not.
> 

Yes.

> Thanks.
> 
> At least on x86 there is no difference between this_ and __this_, both do
> percpu_add_op() without local_irq_disable/enable. But it seems that most
> of architectures use generic code.
> 

So now that we can't avoid disabling and enabling interrupts, I was
wondering if we could exploit this to avoid the smp_mb()..

Maybe this is a stupid question, but I'll shoot it anyway...
Does local_irq_disable()/enable provide any ordering guarantees by any chance?
I think the answer is no, but if it is yes, I guess we can do as shown
below to ensure that STORE(reader_percpu_refcnt) happens before
LOAD(writer_signal).

void get_online_cpus_atomic(void)
{
	unsigned long flags;

	preempt_disable();

	//only for writer
	local_irq_save(flags);
	__this_cpu_add(reader_percpu_refcnt, XXXX);
	local_irq_restore(flags);

	//no need of an explicit smp_mb()

	if (__this_cpu_read(reader_percpu_refcnt) & MASK) {
		this_cpu_inc(reader_percpu_refcnt);
	} else if (writer_active()) {
		...
	}

	this_cpu_dec(reader_percpu_refcnt, XXXX);

}

I tried thinking about other ways to avoid that smp_mb() in the reader,
but was unsuccessful. So if the above assumption is wrong, I guess we'll
just have to go with the version that uses synchronize_sched() at the
writer-side.

Regards,
Srivatsa S. Bhat

Oleg Nesterov Dec. 18, 2012, 7:43 p.m. UTC | #20
On 12/18, Srivatsa S. Bhat wrote:
>
> So now that we can't avoid disabling and enabling interrupts,

Still I think it would be better to not use local_irq_save/restore
directly. And,

> I was
> wondering if we could exploit this to avoid the smp_mb()..
>
> Maybe this is a stupid question, but I'll shoot it anyway...
> Does local_irq_disable()/enable provide any ordering guarantees by any chance?

Oh, I do not know.

But please look at the comment above prepare_to_wait(). It assumes
that even spin_unlock_irqrestore() is not the full barrier.

In any case. get_online_cpus_atomic() has to use irq_restore, not
irq_enable. And _restore does nothing "special" if irqs were already
disabled, so I think we can't rely on sti.

> I tried thinking about other ways to avoid that smp_mb() in the reader,

Just in case, I think there is no way to avoid mb() in _get (although
perhaps it can be "implicit").

The writer changes cpu_online_mask and drops the lock. We need to ensure
that the reader sees the change in cpu_online_mask after _get returns.

> but was unsuccessful. So if the above assumption is wrong, I guess we'll
> just have to go with the version that uses synchronize_sched() at the
> writer-side.

In this case we can also convert get_online_cpus() to use percpu_rwsem
and avoid mutex_lock(&cpu_hotplug.lock), but this is minor I guess.
I do not think get_online_cpus() is called too often.

Oleg.

Srivatsa S. Bhat Dec. 18, 2012, 8:06 p.m. UTC | #21
On 12/19/2012 01:13 AM, Oleg Nesterov wrote:
> On 12/18, Srivatsa S. Bhat wrote:
>>
>> So now that we can't avoid disabling and enabling interrupts,
> 
> Still I think it would be better to not use local_irq_save/restore
> directly.

Sure, we can use this_cpu_add() itself. I explicitly used
local_irq_save/restore here just to explain my question.

> And,
> 
>> I was
>> wondering if we could exploit this to avoid the smp_mb()..
>>
>> Maybe this is a stupid question, but I'll shoot it anyway...
>> Does local_irq_disable()/enable provide any ordering guarantees by any chance?
> 
> Oh, I do not know.
> 
> But please look at the comment above prepare_to_wait(). It assumes
> that even spin_unlock_irqrestore() is not the full barrier.
> 

Semi-permeable barrier.. Hmm.. 

> In any case. get_online_cpus_atomic() has to use irq_restore, not
> irq_enable. And _restore does nothing "special" if irqs were already
> disabled, so I think we can't rely on sti.
> 

Right, I forgot about the _restore part.

>> I tried thinking about other ways to avoid that smp_mb() in the reader,
> 
> Just in case, I think there is no way to avoid mb() in _get (although
> perhaps it can be "implicit").
> 

Actually, I was trying to avoid mb() in the _fastpath_, when there is no
active writer. I missed stating that clearly, sorry.

> The writer changes cpu_online_mask and drops the lock. We need to ensure
> that the reader sees the change in cpu_online_mask after _get returns.
> 

The write_unlock() will ensure the completion of the update to cpu_online_mask,
using the same semi-permeable logic that you pointed above. So readers will
see the update as soon as the writer releases the lock, right?

>> but was unsuccessful. So if the above assumption is wrong, I guess we'll
>> just have to go with the version that uses synchronize_sched() at the
>> writer-side.
> 
> In this case we can also convert get_online_cpus() to use percpu_rwsem
> and avoid mutex_lock(&cpu_hotplug.lock), but this is minor I guess.
> I do not think get_online_cpus() is called too often.
> 

Yes, we could do that as well. I remember you saying that you had some
patches for percpu_rwsem to help use it in cpu hotplug code (to make it
recursive, IIRC).

So, I guess we'll go with the synchronize_sched() approach for percpu rwlocks
then. Tejun, it is still worthwhile to expose this as a generic percpu rwlock
and then use it inside cpu hotplug code, right?


Regards,
Srivatsa S. Bhat

Oleg Nesterov Dec. 19, 2012, 4:39 p.m. UTC | #22
(sorry if you see this email twice)

On 12/19, Srivatsa S. Bhat wrote:
>
> On 12/19/2012 01:13 AM, Oleg Nesterov wrote:
>
> >> I tried thinking about other ways to avoid that smp_mb() in the reader,
> >
> > Just in case, I think there is no way to avoid mb() in _get (although
> > perhaps it can be "implicit").
> >
>
> Actually, I was trying to avoid mb() in the _fastpath_, when there is no
> active writer. I missed stating that clearly, sorry.

Yes, I meant the fastpath too,

> > The writer changes cpu_online_mask and drops the lock. We need to ensure
> > that the reader sees the change in cpu_online_mask after _get returns.
> >
>
> The write_unlock() will ensure the completion of the update to cpu_online_mask,
> using the same semi-permeable logic that you pointed above. So readers will
> see the update as soon as the writer releases the lock, right?

Why?

Once again, the writer (say) removes CPU_1 from cpu_online_mask, and
sets writer_signal[0] = 0, this "enables" fast path on CPU_0.

But, without a barrier, there is no guarantee that CPU_0 will see the
change in cpu_online_mask after get_online_cpus_atomic() checks
writer_signal[0] and returns.

Hmm. And this means that the last version lacks the additional rmb()
(at least) if writer_active() == F.

Oleg.

Srivatsa S. Bhat Dec. 19, 2012, 6:16 p.m. UTC | #23
On 12/19/2012 10:09 PM, Oleg Nesterov wrote:
> (sorry if you see this email twice)
> 
> On 12/19, Srivatsa S. Bhat wrote:
>>
>> On 12/19/2012 01:13 AM, Oleg Nesterov wrote:
>>
>>>> I tried thinking about other ways to avoid that smp_mb() in the reader,
>>>
>>> Just in case, I think there is no way to avoid mb() in _get (although
>>> perhaps it can be "implicit").
>>>
>>
>> Actually, I was trying to avoid mb() in the _fastpath_, when there is no
>> active writer. I missed stating that clearly, sorry.
> 
> Yes, I meant the fastpath too,
> 
>>> The writer changes cpu_online_mask and drops the lock. We need to ensure
>>> that the reader sees the change in cpu_online_mask after _get returns.
>>>
>>
>> The write_unlock() will ensure the completion of the update to cpu_online_mask,
>> using the same semi-permeable logic that you pointed above. So readers will
>> see the update as soon as the writer releases the lock, right?
> 
> Why?
> 
> Once again, the writer (say) removes CPU_1 from cpu_online_mask, and
> sets writer_signal[0] = 0, this "enables" fast path on CPU_0.
> 
> But, without a barrier, there is no guarantee that CPU_0 will see the
> change in cpu_online_mask after get_online_cpus_atomic() checks
> writer_signal[0] and returns.
> 

Hmm, because a reader's code can get reordered to something like:

read cpu_online_mask

get_online_cpus_atomic()

You are right, I missed that.

> Hmm. And this means that the last version lacks the additional rmb()
> (at least) if writer_active() == F.
> 

Yes, the fast path needs that smp_rmb(), whereas the slow-path is fine,
because it takes the read_lock().

BTW, there is a small problem with the synchronize_sched() approach:
We can't generalize the synchronization scheme as a generic percpu rwlock
because the writer's role is split into 2, the first part (the one having
synchronize_sched()) which should be run in process context, and the
second part (the rest of it), which can be run in atomic context.

That is, needing the writer to be able to sleep (due to synchronize_sched())
will make the locking scheme unusable (in other usecases) I think. And
the split (blocking and non-blocking part) itself seems hard to express
as a single writer API.

Hmmm.. so what do we do? Shall we say "We anyway have to do smp_rmb() in the
reader in the fast path. Instead let's do a full smp_mb() and get rid of
the synchronize_sched() in the writer, and thereby expose this synchronization
scheme as generic percpu rwlocks?" Or should we rather use synchronize_sched()
and keep this synchronization scheme restricted to CPU hotplug only?

Regards,
Srivatsa S. Bhat

Oleg Nesterov Dec. 19, 2012, 7:14 p.m. UTC | #24
On 12/19, Srivatsa S. Bhat wrote:
>
> BTW, there is a small problem with the synchronize_sched() approach:
> We can't generalize the synchronization scheme as a generic percpu rwlock
> because the writer's role is split into 2, the first part (the one having
> synchronize_sched()) which should be run in process context, and the
> second part (the rest of it), which can be run in atomic context.

Yes,

> That is, needing the writer to be able to sleep (due to synchronize_sched())
> will make the locking scheme unusable (in other usecases) I think. And
> the split (blocking and non-blocking part) itself seems hard to express
> as a single writer API.

I do not think this is the problem.

We need 2 helpers for writer, the 1st one does synchronize_sched() and the
2nd one takes rwlock. A generic percpu_write_lock() simply calls them both.

In fact I think that the 1st helper should not do synchronize_sched(),
and (say) cpu_hotplug_begin() should call it itself. Because if we also
change cpu_hotplug_begin() to use percpu_rw_sem we do not want to do
synchronize_sched() twice. But lets ignore this for now.

But,

> Hmmm.. so what do we do? Shall we say "We anyway have to do smp_rmb() in the
> reader in the fast path. Instead let's do a full smp_mb() and get rid of
> the synchronize_sched() in the writer, and thereby expose this synchronization
> scheme as generic percpu rwlocks?" Or should we rather use synchronize_sched()
> and keep this synchronization scheme restricted to CPU hotplug only?

Oh, I do not know ;)

To me, the main question is: can we use synchronize_sched() in cpu_down?
It is slow.

Oleg.

Srivatsa S. Bhat Dec. 19, 2012, 7:49 p.m. UTC | #25
On 12/20/2012 12:44 AM, Oleg Nesterov wrote:
> On 12/19, Srivatsa S. Bhat wrote:
>>
>> BTW, there is a small problem with the synchronize_sched() approach:
>> We can't generalize the synchronization scheme as a generic percpu rwlock
>> because the writer's role is split into 2, the first part (the one having
>> synchronize_sched()) which should be run in process context, and the
>> second part (the rest of it), which can be run in atomic context.
> 
> Yes,
> 
>> That is, needing the writer to be able to sleep (due to synchronize_sched())
>> will make the locking scheme unusable (in other usecases) I think. And
>> the split (blocking and non-blocking part) itself seems hard to express
>> as a single writer API.
> 
> I do not think this is the problem.
> 
> We need 2 helpers for writer, the 1st one does synchronize_sched() and the
> 2nd one takes rwlock. A generic percpu_write_lock() simply calls them both.
>

Ah, that's the problem no? Users of reader-writer locks expect to run in
atomic context (ie., they don't want to sleep). We can't expose an API that
can make the task go to sleep under the covers!
(And that too, definitely not with a name such as "percpu_write_lock_irqsave()"
... because we are going to be interrupt-safe as well, in the second part...).
 
> In fact I think that the 1st helper should not do synchronize_sched(),
> and (say) cpu_hotplug_begin() should call it itself. Because if we also
> change cpu_hotplug_begin() to use percpu_rw_sem we do not want to do
> synchronize_sched() twice. But lets ignore this for now.
> 

Yeah, let's ignore this for now, else I'll get all confused ;-)

> But,
> 
>> Hmmm.. so what do we do? Shall we say "We anyway have to do smp_rmb() in the
>> reader in the fast path. Instead let's do a full smp_mb() and get rid of
>> the synchronize_sched() in the writer, and thereby expose this synchronization
>> scheme as generic percpu rwlocks?" Or should we rather use synchronize_sched()
>> and keep this synchronization scheme restricted to CPU hotplug only?
> 
> Oh, I do not know ;)
> 
> To me, the main question is: can we use synchronize_sched() in cpu_down?
> It is slow.
>

Haha :-) So we don't want smp_mb() in the reader, *and* also don't want
synchronize_sched() in the writer! Sounds like saying we want to have the cake
and eat it too ;-) :P

But seriously speaking, I understand that its a bit of a hard choice to make.
On one side, we can avoid synchronize_sched() at the writer, but have to bear
the burden of smp_mb() at the reader even in the fastpath, when no writer is active.
On the other side, we can make the reader avoid smp_mb(), but the writer has to
suffer a full synchronize_sched(). But an important point is that, at the writer
side, we already wait for so many mutex locks and stuff, like in the CPU_DOWN_PREPARE
stage (as determined by the callbacks). So adding a synchronize_sched() to the mix
shouldn't make it noticeably bad, isn't it?

(I'm not arguing in favor of using synchronize_sched(). I'm just trying to point
out that using it might not be as bad as we think it is, in the CPU hotplug case.
And moreover, since I'm still not convinced about the writer API part if use
synchronize_sched(), I'd rather avoid synchronize_sched().)

Regards,
Srivatsa S. Bhat

Oleg Nesterov Dec. 20, 2012, 1:42 p.m. UTC | #26
On 12/20, Srivatsa S. Bhat wrote:
>
> On 12/20/2012 12:44 AM, Oleg Nesterov wrote:
> >
> > We need 2 helpers for writer, the 1st one does synchronize_sched() and the
> > 2nd one takes rwlock. A generic percpu_write_lock() simply calls them both.
> >
>
> Ah, that's the problem no? Users of reader-writer locks expect to run in
> atomic context (ie., they don't want to sleep).

Ah, I misunderstood.

Sure, percpu_write_lock() should be might_sleep(), and this is not
symmetric to percpu_read_lock().

> We can't expose an API that
> can make the task go to sleep under the covers!

Why? Just this should be documented. However I would not worry until we
find another user. Until then we do not even need to add percpu_write_lock
or try to generalize this code too much.

> > To me, the main question is: can we use synchronize_sched() in cpu_down?
> > It is slow.
> >
>
> Haha :-) So we don't want smp_mb() in the reader,

We need mb() + rmb(). Plus cli/sti unless this arch has optimized
this_cpu_add() like x86 (as you pointed out).

> *and* also don't want
> synchronize_sched() in the writer! Sounds like saying we want to have the cake
> and eat it too ;-) :P

Personally I'd vote for synchronize_sched() but I am not sure. And I do
not really understand the problem space.

> And moreover, since I'm still not convinced about the writer API part if use
> synchronize_sched(), I'd rather avoid synchronize_sched().)

Understand.

And yes, synchronize_sched() adds more problems. For example, where should
we call it? I do not think _cpu_down() should do this, in this case, say,
disable_nonboot_cpus() needs num_online_cpus() synchronize_sched's.

So probably cpu_down() should call it before cpu_maps_update_begin(), this
makes the locking even less obvious.

In short. What I am trying to say is, don't ask me I do not know ;)

Oleg.

Srivatsa S. Bhat Dec. 20, 2012, 2:06 p.m. UTC | #27
On 12/20/2012 07:12 PM, Oleg Nesterov wrote:
> On 12/20, Srivatsa S. Bhat wrote:
>>
>> On 12/20/2012 12:44 AM, Oleg Nesterov wrote:
>>>
>>> We need 2 helpers for writer, the 1st one does synchronize_sched() and the
>>> 2nd one takes rwlock. A generic percpu_write_lock() simply calls them both.
>>>
>>
>> Ah, that's the problem no? Users of reader-writer locks expect to run in
>> atomic context (ie., they don't want to sleep).
> 
> Ah, I misunderstood.
> 
> Sure, percpu_write_lock() should be might_sleep(), and this is not
> symmetric to percpu_read_lock().
> 
>> We can't expose an API that
>> can make the task go to sleep under the covers!
> 
> Why? Just this should be documented. However I would not worry until we
> find another user. Until then we do not even need to add percpu_write_lock
> or try to generalize this code too much.
> 

Hmm.. But considering the disable_nonboot_cpus() case you mentioned below, I'm
only getting farther away from using synchronize_sched() ;-) And that also makes
it easier to expose a generic percpu rwlock API, like Tejun was suggesting.
So I'll give it a shot.

>>> To me, the main question is: can we use synchronize_sched() in cpu_down?
>>> It is slow.
>>>
>>
>> Haha :-) So we don't want smp_mb() in the reader,
> 
> We need mb() + rmb(). Plus cli/sti unless this arch has optimized
> this_cpu_add() like x86 (as you pointed out).
> 
>> *and* also don't want
>> synchronize_sched() in the writer! Sounds like saying we want to have the cake
>> and eat it too ;-) :P
> 
> Personally I'd vote for synchronize_sched() but I am not sure. And I do
> not really understand the problem space.
> 
>> And moreover, since I'm still not convinced about the writer API part if use
>> synchronize_sched(), I'd rather avoid synchronize_sched().)
> 
> Understand.
> 
> And yes, synchronize_sched() adds more problems. For example, where should
> we call it? I do not think _cpu_down() should do this, in this case, say,
> disable_nonboot_cpus() needs num_online_cpus() synchronize_sched's.
> 

Ouch! I should have seen that coming!

> So probably cpu_down() should call it before cpu_maps_update_begin(), this
> makes the locking even less obvious.
> 

True.

> In short. What I am trying to say is, don't ask me I do not know ;)
>

OK then, I'll go with what I believe is a reasonably good way (not necessarily
the best way) to deal with this:

I'll avoid the use of synchronize_sched(), expose a decent-looking percpu
rwlock implementation, use it in CPU hotplug and get rid of stop_machine().
That would certainly be a good starting base, IMHO.
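
Just to make the intent concrete, something along these lines (the generic
names are tentative, and do_something() below is only a placeholder):

/*
 * Tentative API sketch:
 *
 *	percpu_read_lock()  / percpu_read_unlock()  - fast, recursive,
 *						      safe in atomic context
 *	percpu_write_lock() / percpu_write_unlock() - writer-side slow path
 *
 * get/put_online_cpus_atomic() would simply wrap the read-side, so a
 * typical atomic hotplug reader would look like:
 */

	unsigned int cpu;

	get_online_cpus_atomic();
	for_each_online_cpu(cpu)
		do_something(cpu);	/* CPUs can't go offline here */
	put_online_cpus_atomic();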

Regards,
Srivatsa S. Bhat

Srivatsa S. Bhat Dec. 22, 2012, 8:17 p.m. UTC | #28
On 12/20/2012 07:12 PM, Oleg Nesterov wrote:
> On 12/20, Srivatsa S. Bhat wrote:
>>
>> On 12/20/2012 12:44 AM, Oleg Nesterov wrote:
>>>
>>> We need 2 helpers for writer, the 1st one does synchronize_sched() and the
>>> 2nd one takes rwlock. A generic percpu_write_lock() simply calls them both.
>>>
>>
>> Ah, that's the problem no? Users of reader-writer locks expect to run in
>> atomic context (ie., they don't want to sleep).
> 
> Ah, I misunderstood.
> 
> Sure, percpu_write_lock() should be might_sleep(), and this is not
> symmetric to percpu_read_lock().
> 
>> We can't expose an API that
>> can make the task go to sleep under the covers!
> 
> Why? Just this should be documented. However I would not worry until we
> find another user. Until then we do not even need to add percpu_write_lock
> or try to generalize this code too much.
> 
>>> To me, the main question is: can we use synchronize_sched() in cpu_down?
>>> It is slow.
>>>
>>
>> Haha :-) So we don't want smp_mb() in the reader,
> 
> We need mb() + rmb(). Plus cli/sti unless this arch has optimized
> this_cpu_add() like x86 (as you pointed out).
> 

Hey, IIUC, we actually don't need mb() in the reader!! Just an rmb() will do.

This is the reader code I have so far:

#define reader_nested_percpu()						\
	     (__this_cpu_read(reader_percpu_refcnt) & READER_REFCNT_MASK)

#define writer_active()							\
				(__this_cpu_read(writer_signal))


#define READER_PRESENT		(1UL << 16)
#define READER_REFCNT_MASK	(READER_PRESENT - 1)

void get_online_cpus_atomic(void)
{
	preempt_disable();

	/*
	 * First and foremost, make your presence known to the writer.
	 */
	this_cpu_add(reader_percpu_refcnt, READER_PRESENT);

	/*
	 * If we are already using per-cpu refcounts, it is not safe to switch
	 * the synchronization scheme. So continue using the refcounts.
	 */
	if (reader_nested_percpu()) {
		this_cpu_inc(reader_percpu_refcnt);
	} else {
		smp_rmb();
		if (unlikely(writer_active())) {
			... //take hotplug_rwlock
		}
	}

	...

	/* Prevent reordering of any subsequent reads of cpu_online_mask. */
	smp_rmb();
}

The smp_rmb() before writer_active() ensures that LOAD(writer_signal) follows
LOAD(reader_percpu_refcnt) (at the 'if' condition). And in turn, that load is
automatically going to follow the STORE(reader_percpu_refcnt) (at this_cpu_add())
due to the data dependency. So it is something like a transitive relation.

So the result is that we mark ourselves as active in reader_percpu_refcnt before
we check writer_signal. This is exactly what we wanted to do, right?
And luckily, due to the dependency, we can achieve it without using the heavy
smp_mb(). And, we can't crib about the smp_rmb() because it is unavoidable anyway
(because we want to prevent reordering of the reads to cpu_online_mask, like you
pointed out earlier).

I hope I'm not missing anything...

Regards,
Srivatsa S. Bhat

Oleg Nesterov Dec. 23, 2012, 4:42 p.m. UTC | #29
On 12/23, Srivatsa S. Bhat wrote:
>
> On 12/20/2012 07:12 PM, Oleg Nesterov wrote:
> >
> > We need mb() + rmb(). Plus cli/sti unless this arch has optimized
> > this_cpu_add() like x86 (as you pointed out).
> >
>
> Hey, IIUC, we actually don't need mb() in the reader!! Just an rmb() will do.

Well. I don't think so. But when it comes to the barriers I am never sure
until Paul confirms my understanding ;)

> #define reader_nested_percpu()						\
> 	     (__this_cpu_read(reader_percpu_refcnt) & READER_REFCNT_MASK)
>
> #define writer_active()							\
> 				(__this_cpu_read(writer_signal))
>
>
> #define READER_PRESENT		(1UL << 16)
> #define READER_REFCNT_MASK	(READER_PRESENT - 1)
>
> void get_online_cpus_atomic(void)
> {
> 	preempt_disable();
>
> 	/*
> 	 * First and foremost, make your presence known to the writer.
> 	 */
> 	this_cpu_add(reader_percpu_refcnt, READER_PRESENT);
>
> 	/*
> 	 * If we are already using per-cpu refcounts, it is not safe to switch
> 	 * the synchronization scheme. So continue using the refcounts.
> 	 */
> 	if (reader_nested_percpu()) {
> 		this_cpu_inc(reader_percpu_refcnt);
> 	} else {
> 		smp_rmb();
> 		if (unlikely(writer_active())) {
> 			... //take hotplug_rwlock
> 		}
> 	}
>
> 	...
>
> 	/* Prevent reordering of any subsequent reads of cpu_online_mask. */
> 	smp_rmb();
> }
>
> The smp_rmb() before writer_active() ensures that LOAD(writer_signal) follows
> LOAD(reader_percpu_refcnt) (at the 'if' condition). And in turn, that load is
> automatically going to follow the STORE(reader_percpu_refcnt)

But why should this STORE be visible on another CPU before we LOAD(writer_signal)?

Let's discuss a simple and artificial example. Suppose we have

	int X, Y;

	int func(void)
	{
		X = 1;	// suppose that nobody else can change it
		mb();
		return Y;
	}

Now you are saying that we can change it and avoid the costly mb():

	int func(void)
	{
		X = 1;

		if (X != 1)
			BUG();
	
		rmb();
		return Y;
	}

I doubt it. rmb() can only guarantee that the preceding LOADs have
completed. Without mb() it is possible that this CPU won't have written X
to memory yet by the time it loads Y.
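
IOW, this is the classical store-buffering pattern. Purely as an
illustration (X standing for reader_percpu_refcnt, Y for writer_signal):

	/*
	 * Initially X = Y = 0.
	 *
	 *	CPU 0 (reader)			CPU 1 (writer)
	 *	--------------			--------------
	 *	X = 1;				Y = 1;
	 *	r0 = X;	// hits own store	smp_mb();
	 *	rmb();				r2 = X;
	 *	r1 = Y;
	 *
	 * The load "r0 = X" can be satisfied from CPU 0's own store buffer,
	 * so it does not force the store to become globally visible, and
	 * rmb() only orders the two loads. The outcome r1 == 0 && r2 == 0
	 * is therefore still allowed: the reader misses the writer and the
	 * writer misses the reader at the same time.
	 */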

Oleg.

Srivatsa S. Bhat Dec. 24, 2012, 3:50 p.m. UTC | #30
On 12/23/2012 10:12 PM, Oleg Nesterov wrote:
> On 12/23, Srivatsa S. Bhat wrote:
>>
>> On 12/20/2012 07:12 PM, Oleg Nesterov wrote:
>>>
>>> We need mb() + rmb(). Plus cli/sti unless this arch has optimized
>>> this_cpu_add() like x86 (as you pointed out).
>>>
>>
>> Hey, IIUC, we actually don't need mb() in the reader!! Just an rmb() will do.
> 
> Well. I don't think so. But when it comes to the barriers I am never sure
> until Paul confirms my understanding ;)
> 
>> #define reader_nested_percpu()						\
>> 	     (__this_cpu_read(reader_percpu_refcnt) & READER_REFCNT_MASK)
>>
>> #define writer_active()							\
>> 				(__this_cpu_read(writer_signal))
>>
>>
>> #define READER_PRESENT		(1UL << 16)
>> #define READER_REFCNT_MASK	(READER_PRESENT - 1)
>>
>> void get_online_cpus_atomic(void)
>> {
>> 	preempt_disable();
>>
>> 	/*
>> 	 * First and foremost, make your presence known to the writer.
>> 	 */
>> 	this_cpu_add(reader_percpu_refcnt, READER_PRESENT);
>>
>> 	/*
>> 	 * If we are already using per-cpu refcounts, it is not safe to switch
>> 	 * the synchronization scheme. So continue using the refcounts.
>> 	 */
>> 	if (reader_nested_percpu()) {
>> 		this_cpu_inc(reader_percpu_refcnt);
>> 	} else {
>> 		smp_rmb();
>> 		if (unlikely(writer_active())) {
>> 			... //take hotplug_rwlock
>> 		}
>> 	}
>>
>> 	...
>>
>> 	/* Prevent reordering of any subsequent reads of cpu_online_mask. */
>> 	smp_rmb();
>> }
>>
>> The smp_rmb() before writer_active() ensures that LOAD(writer_signal) follows
>> LOAD(reader_percpu_refcnt) (at the 'if' condition). And in turn, that load is
>> automatically going to follow the STORE(reader_percpu_refcnt)
> 
> But why should this STORE be visible on another CPU before we LOAD(writer_signal)?
> 
> Let's discuss a simple and artificial example. Suppose we have
> 
> 	int X, Y;
> 
> 	int func(void)
> 	{
> 		X = 1;	// suppose that nobody else can change it
> 		mb();
> 		return Y;
> 	}
> 
> Now you are saying that we can change it and avoid the costly mb():
> 
> 	int func(void)
> 	{
> 		X = 1;
> 
> 		if (X != 1)
> 			BUG();
> 	
> 		rmb();
> 		return Y;
> 	}
> 
> I doubt it. rmb() can only guarantee that the preceding LOADs have
> completed. Without mb() it is possible that this CPU won't have written X
> to memory yet by the time it loads Y.
> 

Oh, ok :-( Thanks for correcting me and for the detailed explanation!
For a moment, I really thought we had it solved at last! ;-(

Regards,
Srivatsa S. Bhat

Patch

diff --git a/include/linux/cpu.h b/include/linux/cpu.h
index ce7a074..cf24da1 100644
--- a/include/linux/cpu.h
+++ b/include/linux/cpu.h
@@ -175,6 +175,8 @@  extern struct bus_type cpu_subsys;
 
 extern void get_online_cpus(void);
 extern void put_online_cpus(void);
+extern void get_online_cpus_atomic(void);
+extern void put_online_cpus_atomic(void);
 #define hotcpu_notifier(fn, pri)	cpu_notifier(fn, pri)
 #define register_hotcpu_notifier(nb)	register_cpu_notifier(nb)
 #define unregister_hotcpu_notifier(nb)	unregister_cpu_notifier(nb)
@@ -198,6 +200,8 @@  static inline void cpu_hotplug_driver_unlock(void)
 
 #define get_online_cpus()	do { } while (0)
 #define put_online_cpus()	do { } while (0)
+#define get_online_cpus_atomic()	do { } while (0)
+#define put_online_cpus_atomic()	do { } while (0)
 #define hotcpu_notifier(fn, pri)	do { (void)(fn); } while (0)
 /* These aren't inline functions due to a GCC bug. */
 #define register_hotcpu_notifier(nb)	({ (void)(nb); 0; })
diff --git a/kernel/cpu.c b/kernel/cpu.c
index 42bd331..5a63296 100644
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -133,6 +133,119 @@  static void cpu_hotplug_done(void)
 	mutex_unlock(&cpu_hotplug.lock);
 }
 
+/*
+ * Reader-writer lock to synchronize between atomic hotplug readers
+ * and the CPU offline hotplug writer.
+ */
+static DEFINE_RWLOCK(hotplug_rwlock);
+
+static DEFINE_PER_CPU(int, reader_percpu_refcnt);
+static DEFINE_PER_CPU(bool, writer_signal);
+
+
+#define reader_uses_percpu_refcnt(cpu)					\
+			(ACCESS_ONCE(per_cpu(reader_percpu_refcnt, cpu)))
+
+#define reader_nested_percpu()						\
+				(__this_cpu_read(reader_percpu_refcnt) > 1)
+
+#define writer_active()							\
+				(__this_cpu_read(writer_signal))
+
+
+
+/*
+ * Invoked by hotplug reader, to prevent CPUs from going offline.
+ *
+ * If there are no CPU offline writers active, just increment the
+ * per-cpu counter 'reader_percpu_refcnt' and proceed.
+ *
+ * If a CPU offline hotplug writer is active, we'll need to switch from
+ * per-cpu refcounts to the global rwlock, when the time is right.
+ *
+ * It is not safe to switch the synchronization scheme when we are
+ * already in a read-side critical section which uses per-cpu refcounts.
+ * Also, we don't want to allow heterogeneous readers to nest inside
+ * each other, to avoid complications in put_online_cpus_atomic().
+ *
+ * Once you switch, keep using the rwlocks for synchronization, until
+ * the writer signals the end of CPU offline.
+ *
+ * You can call this recursively, without fear of locking problems.
+ *
+ * Returns with preemption disabled.
+ */
+void get_online_cpus_atomic(void)
+{
+	unsigned long flags;
+
+	preempt_disable();
+
+	if (cpu_hotplug.active_writer == current)
+		return;
+
+	local_irq_save(flags);
+
+	/*
+	 * Use the percpu refcounts by default. Switch over to rwlock (if
+	 * necessary) later on. This helps avoid several race conditions
+	 * as well.
+	 */
+	__this_cpu_inc(reader_percpu_refcnt);
+
+	smp_rmb(); /* Paired with smp_mb() in announce_cpu_offline_begin(). */
+
+	/*
+	 * We must not allow heterogeneous nesting of readers (ie., readers
+	 * using percpu refcounts to nest with readers using rwlocks).
+	 * So don't switch the synchronization scheme if we are currently
+	 * using percpu refcounts.
+	 */
+	if (!reader_nested_percpu() && unlikely(writer_active())) {
+
+		read_lock(&hotplug_rwlock);
+
+		/*
+		 * We might have raced with a writer going inactive before we
+		 * took the read-lock. So re-evaluate whether we still need to
+		 * use the rwlock or if we can switch back to percpu refcounts.
+		 * (This also helps avoid heterogeneous nesting of readers).
+		 */
+		if (writer_active())
+			__this_cpu_dec(reader_percpu_refcnt);
+		else
+			read_unlock(&hotplug_rwlock);
+	}
+
+	local_irq_restore(flags);
+}
+EXPORT_SYMBOL_GPL(get_online_cpus_atomic);
+
+void put_online_cpus_atomic(void)
+{
+	unsigned long flags;
+
+	if (cpu_hotplug.active_writer == current)
+		goto out;
+
+	local_irq_save(flags);
+
+	/*
+	 * We never allow heterogeneous nesting of readers. So it is trivial
+	 * to find out the kind of reader we are, and undo the operation
+	 * done by our corresponding get_online_cpus_atomic().
+	 */
+	if (__this_cpu_read(reader_percpu_refcnt))
+		__this_cpu_dec(reader_percpu_refcnt);
+	else
+		read_unlock(&hotplug_rwlock);
+
+	local_irq_restore(flags);
+out:
+	preempt_enable();
+}
+EXPORT_SYMBOL_GPL(put_online_cpus_atomic);
+
 #else /* #if CONFIG_HOTPLUG_CPU */
 static void cpu_hotplug_begin(void) {}
 static void cpu_hotplug_done(void) {}
@@ -237,6 +350,61 @@  static inline void check_for_tasks(int cpu)
 	write_unlock_irq(&tasklist_lock);
 }
 
+static inline void raise_writer_signal(unsigned int cpu)
+{
+	per_cpu(writer_signal, cpu) = true;
+}
+
+static inline void drop_writer_signal(unsigned int cpu)
+{
+	per_cpu(writer_signal, cpu) = false;
+}
+
+static void announce_cpu_offline_begin(void)
+{
+	unsigned int cpu;
+
+	for_each_online_cpu(cpu)
+		raise_writer_signal(cpu);
+
+	smp_mb();
+}
+
+static void announce_cpu_offline_end(unsigned int dead_cpu)
+{
+	unsigned int cpu;
+
+	drop_writer_signal(dead_cpu);
+
+	for_each_online_cpu(cpu)
+		drop_writer_signal(cpu);
+
+	smp_mb();
+}
+
+/*
+ * Wait for the reader to see the writer's signal and switch from percpu
+ * refcounts to global rwlock.
+ *
+ * If the reader is still using percpu refcounts, wait for him to switch.
+ * Else, we can safely go ahead, because either the reader has already
+ * switched over, or the next atomic hotplug reader who comes along on this
+ * CPU will notice the writer's signal and will switch over to the rwlock.
+ */
+static inline void sync_atomic_reader(unsigned int cpu)
+{
+	while (reader_uses_percpu_refcnt(cpu))
+		cpu_relax();
+}
+
+static void sync_all_readers(void)
+{
+	unsigned int cpu;
+
+	for_each_online_cpu(cpu)
+		sync_atomic_reader(cpu);
+}
+
 struct take_cpu_down_param {
 	unsigned long mod;
 	void *hcpu;
@@ -246,15 +414,45 @@  struct take_cpu_down_param {
 static int __ref take_cpu_down(void *_param)
 {
 	struct take_cpu_down_param *param = _param;
-	int err;
+	unsigned long flags;
+	unsigned int cpu = (long)(param->hcpu);
+	int err = 0;
+
+
+	/*
+	 * Inform all atomic readers that we are going to offline a CPU
+	 * and that they need to switch from per-cpu refcounts to the
+	 * global hotplug_rwlock.
+	 */
+	announce_cpu_offline_begin();
+
+	/* Wait for every reader to notice the announcement and switch over */
+	sync_all_readers();
+
+	/*
+	 * Now all the readers have switched to using the global hotplug_rwlock.
+	 * So now is our chance, go bring down the CPU!
+	 */
+
+	write_lock_irqsave(&hotplug_rwlock, flags);
 
 	/* Ensure this CPU doesn't handle any more interrupts. */
 	err = __cpu_disable();
 	if (err < 0)
-		return err;
+		goto out;
 
 	cpu_notify(CPU_DYING | param->mod, param->hcpu);
-	return 0;
+
+out:
+	/*
+	 * Inform all atomic readers that we are done with the CPU offline
+	 * operation, so that they can switch back to their per-cpu refcounts.
+	 * (We don't need to wait for them to see it).
+	 */
+	announce_cpu_offline_end(cpu);
+
+	write_unlock_irqrestore(&hotplug_rwlock, flags);
+	return err;
 }
 
 /* Requires cpu_add_remove_lock to be held */