
[RFC,v3,1/9] CPU hotplug: Provide APIs to prevent CPU offline from atomic context

Message ID 20121207173759.27305.84316.stgit@srivatsabhat.in.ibm.com (mailing list archive)
State RFC, archived

Commit Message

Srivatsa S. Bhat Dec. 7, 2012, 5:38 p.m. UTC
There are places where preempt_disable() is used to prevent any CPU from
going offline during the critical section. Let us call them "atomic
hotplug readers" ("atomic" because they run in atomic contexts).

Today, preempt_disable() works because the writer uses stop_machine().
But once stop_machine() is gone, the readers won't be able to prevent
CPUs from going offline using preempt_disable().

The intent of this patch is to provide synchronization APIs for such
atomic hotplug readers, to prevent (any) CPUs from going offline, without
depending on stop_machine() at the writer-side. The new APIs will look
something like this:  get/put_online_cpus_atomic()
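
For example, a hotplug reader that currently relies on preempt_disable()
would be converted roughly like this (an illustrative sketch, not code
from this patch; do_something() is just a placeholder):

	/* Today: preempt_disable() blocks CPU offline via stop_machine() */
	preempt_disable();
	for_each_online_cpu(cpu)
		do_something(cpu);	/* must not race with CPU offline */
	preempt_enable();

	/* With the new APIs: explicit atomic hotplug-reader protection */
	get_online_cpus_atomic();	/* returns with preemption disabled */
	for_each_online_cpu(cpu)
		do_something(cpu);
	put_online_cpus_atomic();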

Some important design requirements and considerations:
-----------------------------------------------------

1. Scalable synchronization at the reader-side, especially in the fast-path

   Any synchronization at the atomic hotplug reader side must be highly
   scalable - avoid global single-holder locks/counters etc. These paths
   currently use the extremely fast preempt_disable(); our replacement
   for preempt_disable() should not become ridiculously costly, and it
   should not serialize the readers among themselves needlessly.

   At a minimum, the new APIs must be extremely fast at the reader side
   at least in the fast-path, when no CPU offline writers are active.

2. preempt_disable() was recursive. The replacement should also be recursive.

3. No (new) lock-ordering restrictions

   preempt_disable() was super-flexible. It didn't impose any ordering
   restrictions or rules for nesting. Our replacement should also be equally
   flexible and usable.

4. No deadlock possibilities

   Per-cpu locking is not the way to go if we want to have relaxed rules
   for lock-ordering, because we can end up in circular-locking dependencies
   as explained in https://lkml.org/lkml/2012/12/6/290

   So, avoid per-cpu locking schemes (per-cpu locks/per-cpu atomic counters
   with spin-on-contention etc) as much as possible.
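
   To illustrate the above (a reconstruction of the scenario from the link;
   'random_lock' and the CPU numbers are only for illustration):

      Readers:

               CPU 0                               CPU 1
               ------                              ------

      1.    spin_lock(&random_lock);            read_lock(rwlock of CPU 1);

      2.    read_lock(rwlock of CPU 0);         spin_lock(&random_lock);

      Writer:

               CPU 2:
               ------

            for_each_online_cpu(cpu)
                    write_lock(rwlock of 'cpu');

   If the writer write-locks CPU 0's rwlock between steps 1 and 2, CPU 0
   spins on its own rwlock while holding random_lock, CPU 1 spins on
   random_lock while holding its rwlock for read, and the writer spins on
   CPU 1's rwlock: a three-way circular wait.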


Implementation of the design:
----------------------------

We use global rwlocks for synchronization, because then we won't get into
lock-ordering related problems (unlike per-cpu locks). However, global
rwlocks lead to unnecessary cache-line bouncing even when there are no
hotplug writers present, which can slow down the system needlessly.

Per-cpu counters can help solve the cache-line bouncing problem. So we
actually use the best of both: per-cpu counters (no-waiting) at the reader
side in the fast-path, and global rwlocks in the slowpath.

[ Fastpath = no writer is active; Slowpath = a writer is active ]

IOW, the hotplug readers just increment/decrement their per-cpu refcounts
when no writer is active. When a writer becomes active, he signals all
readers to switch to global rwlocks for the duration of the CPU offline
operation. The readers ACK the writer's signal when it is safe for them
(ie., when they are about to start a fresh, non-nested read-side critical
section) and start using (holding) the global rwlock for read in their
subsequent critical sections.

The hotplug writer waits for ACKs from every reader, and then acquires
the global rwlock for write and takes the CPU offline. Then the writer
signals all readers that the CPU offline is done, and that they can go back
to using their per-cpu refcounts again.
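
In simplified pseudo-code, the switch works roughly like this (barriers,
irq-safety and the writer's own reentrancy check are omitted; see the
patch below for the real thing):

	get_online_cpus_atomic():
		preempt_disable();
		if (no writer has signalled this CPU)
			increment this CPU's reader_refcount;	/* fastpath */
		else if (signal not yet ACKed && not nested)
			ACK the signal; read_lock(&hotplug_rwlock); /* slowpath */
		else
			stay on whichever path is already in use;

	put_online_cpus_atomic():
		undo whatever the matching get() did;
		preempt_enable();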

Note that the lock-safety (despite the per-cpu scheme) comes from the fact
that the readers can *choose* when to ACK the writer's signal (iow, when
to switch to the global rwlock). And the readers don't wait on anybody based
on the per-cpu counters. The only true synchronization that involves waiting
at the reader-side in this scheme, is the one arising from the global rwlock,
which is safe from the circular locking dependency problems mentioned above
(because it is global).

Reader-writer locks and per-cpu counters are recursive, so they can be
used in a nested fashion in the reader-path. And the "signal-and-ack-with-no-
reader-spinning" design of switching the synchronization scheme ensures that
you can safely nest and call these APIs in any way you want, just like
preempt_disable()/enable.
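
For example, the following (mirroring nested preempt_disable()) is
perfectly fine:

	get_online_cpus_atomic();
	...
		get_online_cpus_atomic();	/* nested reader */
		...
		put_online_cpus_atomic();
	...
	put_online_cpus_atomic();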

Together, these satisfy all the requirements mentioned above.

I'm indebted to Michael Wang and Xiao Guangrong for their numerous thoughtful
suggestions and ideas, which inspired and influenced many of the decisions in
this as well as previous designs. Thanks a lot Michael and Xiao!

Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
---

 include/linux/cpu.h |    4 +
 kernel/cpu.c        |  252 ++++++++++++++++++++++++++++++++++++++++++++++++++-
 2 files changed, 253 insertions(+), 3 deletions(-)



Comments

Tejun Heo Dec. 7, 2012, 5:57 p.m. UTC | #1
On Fri, Dec 07, 2012 at 11:08:13PM +0530, Srivatsa S. Bhat wrote:
> 4. No deadlock possibilities
> 
>    Per-cpu locking is not the way to go if we want to have relaxed rules
>    for lock-ordering, because we can end up in circular-locking dependencies
>    as explained in https://lkml.org/lkml/2012/12/6/290
> 
>    So, avoid per-cpu locking schemes (per-cpu locks/per-cpu atomic counters
>    with spin-on-contention etc) as much as possible.

I really can't say I like this approach.  percpu locking is very
tricky and difficult to get right, and we should try our
best to avoid implementing subsystem-specific ones as much as
possible.  Also, I think the right approach would be auditing each
get_online_cpus_atomic() callsite and figuring out the proper locking
order rather than implementing a construct this unusual, especially as
hunting down the incorrect cases shouldn't be difficult given proper
lockdep annotation.

lg_lock doesn't do local nesting and I'm not sure how big a deal that
is as I don't know how many should be converted.  But if nesting is an
absolute necessity, it would be much better to implement a generic
rwlock variant (say, lg_rwlock) rather than implementing an unusual
cpuhotplug-specific percpu synchronization construct.

Thanks.
Tejun Heo Dec. 7, 2012, 6:16 p.m. UTC | #2
Hello, again.

On Fri, Dec 07, 2012 at 09:57:24AM -0800, Tejun Heo wrote:
> possible.  Also, I think the right approach would be auditing each
> get_online_cpus_atomic() callsite and figuring out the proper locking
> order rather than implementing a construct this unusual, especially as
> hunting down the incorrect cases shouldn't be difficult given proper
> lockdep annotation.

On second look, it looks like you're implementing proper
percpu_rwlock semantics as readers aren't supposed to induce circular
dependency directly.  Can you please work with Oleg to implement
proper percpu-rwlock and use that for CPU hotplug rather than
implementing it inside CPU hotplug?

Thanks.
Srivatsa S. Bhat Dec. 7, 2012, 6:24 p.m. UTC | #3
On 12/07/2012 11:27 PM, Tejun Heo wrote:
> On Fri, Dec 07, 2012 at 11:08:13PM +0530, Srivatsa S. Bhat wrote:
>> 4. No deadlock possibilities
>>
>>    Per-cpu locking is not the way to go if we want to have relaxed rules
>>    for lock-ordering, because we can end up in circular-locking dependencies
>>    as explained in https://lkml.org/lkml/2012/12/6/290
>>
>>    So, avoid per-cpu locking schemes (per-cpu locks/per-cpu atomic counters
>>    with spin-on-contention etc) as much as possible.
> 
> I really can't say I like this approach.  percpu locking is very
> tricky and difficult to get right, and we should try our
> best to avoid implementing subsystem-specific ones as much as
> possible.  Also, I think the right approach would be auditing each
> get_online_cpus_atomic() callsite and figuring out the proper locking
> order rather than implementing a construct this unusual, especially as
> hunting down the incorrect cases shouldn't be difficult given proper
> lockdep annotation.
> 
> lg_lock doesn't do local nesting and I'm not sure how big a deal that
> is as I don't know how many should be converted.  But if nesting is an
> absolute necessity, it would be much better to implement a generic
> rwlock variant (say, lg_rwlock) rather than implementing an unusual
> cpuhotplug-specific percpu synchronization construct.
>

To be honest, at a certain point in time while designing this, I did
realize that this was getting kinda overly complicated ;-) ... but I
wanted to see how this would actually work out when finished and get
some feedback on the same, hence I posted it out. But this also proves
that we _can_ actually compete with the flexibility of preempt_disable()
and still be safe with respect to locking, if we really want to ;-)

Regards,
Srivatsa S. Bhat

Tejun Heo Dec. 7, 2012, 6:31 p.m. UTC | #4
Hello, Srivatsa.

On Fri, Dec 07, 2012 at 11:54:01PM +0530, Srivatsa S. Bhat wrote:
> > lg_lock doesn't do local nesting and I'm not sure how big a deal that
> > is as I don't know how many should be converted.  But if nesting is an
> > absolute necessity, it would be much better to implement a generic
> > rwlock variant (say, lg_rwlock) rather than implementing an unusual
> > cpuhotplug-specific percpu synchronization construct.
> 
> To be honest, at a certain point in time while designing this, I did
> realize that this was getting kinda overly complicated ;-) ... but I
> wanted to see how this would actually work out when finished and get
> some feedback on the same, hence I posted it out. But this also proves
> that we _can_ actually compete with the flexibility of preempt_disable()
> and still be safe with respect to locking, if we really want to ;-)

I got confused by the comparison to preempt_disable(), but you're right
that percpu rwlock shouldn't be able to introduce a locking dependency
which doesn't exist with a non-percpu rwlock.  ie. write locking should
be atomic w.r.t. all readers.  At the simplest, this can be
implemented by writer backing out all the way if try-locking any CPU
fails and retrying the whole thing.  That should be correct but has
the potential of starving the writer.
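
(Purely for illustration, a sketch of that back-out-and-retry writer,
assuming a hypothetical per-CPU rwlock 'my_rwlock'; the retry loop is
exactly where the writer starvation would come from:)

	static DEFINE_PER_CPU(rwlock_t, my_rwlock);

	static void percpu_write_lock_all(void)
	{
		unsigned int cpu, undo;
	retry:
		for_each_possible_cpu(cpu) {
			if (write_trylock(&per_cpu(my_rwlock, cpu)))
				continue;
			/* trylock failed: back out all locks taken so far */
			for_each_possible_cpu(undo) {
				if (undo == cpu)
					break;
				write_unlock(&per_cpu(my_rwlock, undo));
			}
			cpu_relax();
			goto retry;
		}
	}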

What we need here is a generic percpu-rwlock.  I don't know which
exact implementation strategy we should choose.  Maybe your switching
to global rwlock is the right solution.  But, at any rate, I think it
would be best to implement proper percpu-rwlock and then apply it to
CPU hotplug.  It's actually gonna be pretty fitting as
get_online_cpus() is being converted to percpu-rwsem.  IIUC, Oleg has
been working on this for a while now.  Oleg, what do you think?

Thanks.
Srivatsa S. Bhat Dec. 7, 2012, 6:33 p.m. UTC | #5
On 12/07/2012 11:46 PM, Tejun Heo wrote:
> Hello, again.
> 
> On Fri, Dec 07, 2012 at 09:57:24AM -0800, Tejun Heo wrote:
>> possible.  Also, I think the right approach would be auditing each
>> get_online_cpus_atomic() callsite and figuring out the proper locking
>> order rather than implementing a construct this unusual, especially as
>> hunting down the incorrect cases shouldn't be difficult given proper
>> lockdep annotation.
> 
> On second look, it looks like you're implementing proper
> percpu_rwlock semantics

Ah, nice! I didn't realize that I was actually doing what I intended
to avoid! ;-)

Looking at the implementation of lglocks, and going by Oleg's earlier
comment that we just need to replace spinlock_t with rwlock_t in them
to get percpu_rwlocks, I was horrified at the kinds of circular locking
dependencies that they would be prone to, unless used carefully.

So I devised this scheme to be safe, while still having relaxed rules.
But if *this* is what percpu_rwlocks should ideally look like,
then great! :-)

> as readers aren't supposed to induce circular
> dependency directly.

Yep, in this scheme, nobody will end up in circular dependency.

>  Can you please work with Oleg to implement
> proper percpu-rwlock and use that for CPU hotplug rather than
> implementing it inside CPU hotplug?
> 

Sure, I'd be more than happy to!
 
Regards,
Srivatsa S. Bhat

Srivatsa S. Bhat Dec. 7, 2012, 6:38 p.m. UTC | #6
On 12/08/2012 12:01 AM, Tejun Heo wrote:
> Hello, Srivatsa.
> 
> On Fri, Dec 07, 2012 at 11:54:01PM +0530, Srivatsa S. Bhat wrote:
>>> lg_lock doesn't do local nesting and I'm not sure how big a deal that
>>> is as I don't know how many should be converted.  But if nesting is an
>>> absolute necessity, it would be much better to implement a generic
>>> rwlock variant (say, lg_rwlock) rather than implementing an unusual
>>> cpuhotplug-specific percpu synchronization construct.
>>
>> To be honest, at a certain point in time while designing this, I did
>> realize that this was getting kinda overly complicated ;-) ... but I
>> wanted to see how this would actually work out when finished and get
>> some feedback on the same, hence I posted it out. But this also proves
>> that we _can_ actually compete with the flexibility of preempt_disable()
>> and still be safe with respect to locking, if we really want to ;-)
> 
> I got confused by the comparison to preempt_disable(), but you're right
> that percpu rwlock shouldn't be able to introduce a locking dependency
> which doesn't exist with a non-percpu rwlock.  ie. write locking should
> be atomic w.r.t. all readers.

Yep!

>  At the simplest, this can be
> implemented by writer backing out all the way if try-locking any CPU
> fails and retrying the whole thing.  That should be correct but has
> the potential of starving the writer.
> 

Exactly! This is what I mentioned yesterday in the link below, and said
that it's not good because of writer starvation ("too wasteful"):
https://lkml.org/lkml/2012/12/6/290

> What we need here is a generic percpu-rwlock.  I don't know which
> exact implementation strategy we should choose.  Maybe your switching
> to global rwlock is the right solution.  But, at any rate, I think it
> would be best to implement proper percpu-rwlock and then apply it to
> CPU hotplug.  It's actually gonna be pretty fitting as
> get_online_cpus() is being converted to percpu-rwsem.  IIUC, Oleg has
> been working on this for a while now.  Oleg, what do you think?
> 

Hmm, that sounds good.

Regards,
Srivatsa S. Bhat

Oleg Nesterov Dec. 9, 2012, 7:14 p.m. UTC | #7
On 12/07, Srivatsa S. Bhat wrote:
>
> Per-cpu counters can help solve the cache-line bouncing problem. So we
> actually use the best of both: per-cpu counters (no-waiting) at the reader
> side in the fast-path, and global rwlocks in the slowpath.
>
> [ Fastpath = no writer is active; Slowpath = a writer is active ]
>
> IOW, the hotplug readers just increment/decrement their per-cpu refcounts
> when no writer is active.

Plus LOCK and cli/sti. I do not pretend I really know how bad this is
performance-wise though. And at first glance this looks overcomplicated.

But yes, it is easy to blame somebody else's code ;) And I can't suggest
something better at least right now. If I understand correctly, we can not
use, say, synchronize_sched() in _cpu_down() path, you also want to improve
the latency. And I guess something like kick_all_cpus_sync() is "too heavy".

Also, after a quick reading this doesn't look correct, please see below.

> +void get_online_cpus_atomic(void)
> +{
> +	unsigned int cpu = smp_processor_id();
> +	unsigned long flags;
> +
> +	preempt_disable();
> +	local_irq_save(flags);
> +
> +	if (cpu_hotplug.active_writer == current)
> +		goto out;
> +
> +	smp_rmb(); /* Paired with smp_wmb() in drop_writer_signal() */
> +
> +	if (likely(!writer_active(cpu))) {

WINDOW. Suppose that reader_active() == F.

> +		mark_reader_fastpath();
> +		goto out;

Why can't take_cpu_down() do announce_cpu_offline_begin() + sync_all_readers()
in between?

Looks like we should increment the counter first, then check writer_active().
And sync_atomic_reader() needs rmb between 2 atomic_read's.


Or. Again, suppose that reader_active() == F. But is_new_writer() == T.

> +	if (is_new_writer(cpu)) {
> +		/*
> +		 * ACK the writer's signal only if this is a fresh read-side
> +		 * critical section, and not just an extension of a running
> +		 * (nested) read-side critical section.
> +		 */
> +		if (!reader_active(cpu)) {
> +			ack_writer_signal();

What if take_cpu_down() does announce_cpu_offline_end() right before
ack_writer_signal()? In this case get_online_cpus_atomic() returns
with writer_signal == -1. If nothing else this breaks the next
raise_writer_signal().

Oleg.

Srivatsa S. Bhat Dec. 9, 2012, 7:50 p.m. UTC | #8
On 12/10/2012 12:44 AM, Oleg Nesterov wrote:
> On 12/07, Srivatsa S. Bhat wrote:
>>
>> Per-cpu counters can help solve the cache-line bouncing problem. So we
>> actually use the best of both: per-cpu counters (no-waiting) at the reader
>> side in the fast-path, and global rwlocks in the slowpath.
>>
>> [ Fastpath = no writer is active; Slowpath = a writer is active ]
>>
>> IOW, the hotplug readers just increment/decrement their per-cpu refcounts
>> when no writer is active.
> 
> Plus LOCK and cli/sti. I do not pretend I really know how bad this is
> performance-wise though. And at first glance this looks overcomplicated.
>

Hehe, I agree ;-) But I couldn't think of any other way to get rid of the
deadlock possibilities, other than using global rwlocks. So I designed a
way in which we can switch between per-cpu counters and global rwlocks
dynamically. Probably there is a smarter way to achieve what we want, dunno...
 
> But yes, it is easy to blame somebody else's code ;) And I can't suggest
> something better at least right now. If I understand correctly, we can not
> use, say, synchronize_sched() in _cpu_down() path

We can't sleep in that code.. so that's a no-go.

>, you also want to improve
> the latency. And I guess something like kick_all_cpus_sync() is "too heavy".
> 

I hadn't considered that. Thinking of it, I don't think it would help us..
It won't get rid of the currently running preempt_disable() sections no?

> Also, after a quick reading this doesn't look correct, please see below.
> 
>> +void get_online_cpus_atomic(void)
>> +{
>> +	unsigned int cpu = smp_processor_id();
>> +	unsigned long flags;
>> +
>> +	preempt_disable();
>> +	local_irq_save(flags);
>> +
>> +	if (cpu_hotplug.active_writer == current)
>> +		goto out;
>> +
>> +	smp_rmb(); /* Paired with smp_wmb() in drop_writer_signal() */
>> +
>> +	if (likely(!writer_active(cpu))) {
> 
> WINDOW. Suppose that reader_active() == F.
> 
>> +		mark_reader_fastpath();
>> +		goto out;
> 
> Why can't take_cpu_down() do announce_cpu_offline_begin() + sync_all_readers()
> in between?
> 
> Looks like we should increment the counter first, then check writer_active().

You are right, I missed the above race-conditions.

> And sync_atomic_reader() needs rmb between 2 atomic_read's.
> 

OK.

> 
> Or. Again, suppose that reader_active() == F. But is_new_writer() == T.
> 
>> +	if (is_new_writer(cpu)) {
>> +		/*
>> +		 * ACK the writer's signal only if this is a fresh read-side
>> +		 * critical section, and not just an extension of a running
>> +		 * (nested) read-side critical section.
>> +		 */
>> +		if (!reader_active(cpu)) {
>> +			ack_writer_signal();
> 
> What if take_cpu_down() does announce_cpu_offline_end() right before
> ack_writer_signal()? In this case get_online_cpus_atomic() returns
> with writer_signal == -1. If nothing else this breaks the next
> raise_writer_signal().
> 

Oh, yes, this one is wrong too! We need to mark ourselves as an active
reader right at the beginning. And probably change the check to
"reader_nested()" or something like that.

Thanks a lot Oleg!
Regards,
Srivatsa S. Bhat

Oleg Nesterov Dec. 9, 2012, 8:22 p.m. UTC | #9
On 12/10, Srivatsa S. Bhat wrote:
>
> On 12/10/2012 12:44 AM, Oleg Nesterov wrote:
>
> > But yes, it is easy to blame somebody else's code ;) And I can't suggest
> > something better at least right now. If I understand correctly, we can not
> > use, say, synchronize_sched() in _cpu_down() path
>
> We can't sleep in that code.. so that's a no-go.

But we can?

Note that I meant _cpu_down(), not get_online_cpus_atomic() or take_cpu_down().

Oleg.

Oleg Nesterov Dec. 9, 2012, 8:57 p.m. UTC | #10
On 12/07, Srivatsa S. Bhat wrote:
>
> 4. No deadlock possibilities
>
>    Per-cpu locking is not the way to go if we want to have relaxed rules
>    for lock-ordering, because we can end up in circular-locking dependencies
>    as explained in https://lkml.org/lkml/2012/12/6/290

OK, but this assumes that, contrary to what Steven said, read-write-read
deadlock is not possible when it comes to rwlock_t. So far I think this
is true and we can't deadlock. Steven?

However, if this is true, then compared to preempt_disable/stop_machine,
livelock is possible. Probably this is fine, we have the same problem with
get_online_cpus(). But if we can accept this fact I feel we can simplify
this somehow... Can't prove, only feel ;)

Oleg.

Oleg Nesterov Dec. 9, 2012, 9:13 p.m. UTC | #11
Damn, sorry for noise. I missed this part...

On 12/10, Srivatsa S. Bhat wrote:
>
> On 12/10/2012 12:44 AM, Oleg Nesterov wrote:
> > the latency. And I guess something like kick_all_cpus_sync() is "too heavy".
>
> I hadn't considered that. Thinking of it, I don't think it would help us..
> It won't get rid of the currently running preempt_disable() sections no?

Sure. But (again, this is only my feeling so far) given that get_online_cpus_atomic()
does cli/sti, this can help to implement ensure-the-readers-must-see-the-pending-writer.
IOW this might help to implement sync-with-readers.

Oleg.

Srivatsa S. Bhat Dec. 10, 2012, 4:28 a.m. UTC | #12
On 12/10/2012 01:52 AM, Oleg Nesterov wrote:
> On 12/10, Srivatsa S. Bhat wrote:
>>
>> On 12/10/2012 12:44 AM, Oleg Nesterov wrote:
>>
>>> But yes, it is easy to blame somebody else's code ;) And I can't suggest
>>> something better at least right now. If I understand correctly, we can not
>>> use, say, synchronize_sched() in _cpu_down() path
>>
>> We can't sleep in that code.. so that's a no-go.
> 
> But we can?
> 
> Note that I meant _cpu_down(), not get_online_cpus_atomic() or take_cpu_down().
> 

Maybe I'm missing something, but how would it help if we did a
synchronize_sched() so early (in _cpu_down())? Another bunch of preempt_disable()
sections could start immediately after our call to synchronize_sched() no?
How would we deal with that?

What we need to ensure here is that all existing preempt_disable() sections
are over and that *we* (meaning, the cpu offline writer) get to proceed immediately
after that, making all the new readers wait for us. And that is possible only if
we do our 'wait-for-readers' thing very close to our actual cpu offline operation
(which is take_cpu_down). Moreover, the writer needs to remain stable
(non-preemptible) while doing the wait-for-readers. Else (if the writer himself
keeps getting scheduled in and out of the CPU) I can't see how he can possibly
do justice to the wait.

Also, synchronize_sched() only helps us do the 'wait-for-existing-readers' logic.
What would take care of the 'prevent-new-readers-from-going-ahead' logic?

To add to it all, synchronize_sched() waits for _all_ preempt_disable() sections
to complete, whereas only a handful of them might actually care about CPU hotplug.
Which is an unnecessary burden for the writer (ie., waiting for unrelated readers
to complete).

I bet you meant something else. Sorry if I misunderstood it.

Regards,
Srivatsa S. Bhat

Srivatsa S. Bhat Dec. 10, 2012, 5:01 a.m. UTC | #13
On 12/10/2012 02:43 AM, Oleg Nesterov wrote:
> Damn, sorry for noise. I missed this part...
> 
> On 12/10, Srivatsa S. Bhat wrote:
>>
>> On 12/10/2012 12:44 AM, Oleg Nesterov wrote:
>>> the latency. And I guess something like kick_all_cpus_sync() is "too heavy".
>>
>> I hadn't considered that. Thinking of it, I don't think it would help us..
>> It won't get rid of the currently running preempt_disable() sections no?
> 
> Sure. But (again, this is only my feeling so far) given that get_online_cpus_atomic()
> does cli/sti,

Ah, that one! Actually, the only reason I do that cli/sti is because, potentially
interrupt handlers can be hotplug readers too. So we need to protect the portion
of the code of get_online_cpus_atomic() which is not re-entrant.
(Which reminds me to try and reduce the length of cli/sti in that code, if possible).
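
(For instance -- a purely hypothetical example, not from the patchset --
a hardirq handler can itself be an atomic hotplug reader:)

	static irqreturn_t my_irq_handler(int irq, void *dev_id)
	{
		get_online_cpus_atomic();	/* reader in hardirq context */
		/* ... access CPU-hotplug-sensitive state ... */
		put_online_cpus_atomic();

		return IRQ_HANDLED;
	}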

> this can help to implement ensure-the-readers-must-see-the-pending-writer.
> IOW this might help to implement sync-with-readers.
> 

2 problems:

1. It won't help with cases like this:

   preempt_disable()
	...
	preempt_disable()
		...
			<------- Here
		...
	preempt_enable()
	...
   preempt_enable()

If the IPI hits at the point marked above, the IPI is useless, because, at
that point, since we are already in a nested read-side critical section, we can't
switch the synchronization protocol. We need to wait till we start a fresh
non-nested read-side critical section, in order to switch to global rwlock.
The reason is that preempt_enable() or put_online_cpus_atomic() can only undo
what its predecessor (preempt_disable()/get_online_cpus_atomic()) did.

2. Part of the reason we want to get rid of stop_machine() is to avoid the
latency it induces on _all_ CPUs just to take *one* CPU offline. If we use
kick_all_cpus_sync(), we get into that territory again: we unfairly interrupt
every CPU, _even when_ that CPU's existing preempt_disabled() sections might
not actually be hotplug readers! (ie., not bothered about CPU Hotplug).

So, I think whatever synchronization scheme we develop must not induce the very
same problems that stop_machine() had. Otherwise, we can end up going full-circle
and coming back to the same stop_machine() scenario that we intended to get rid of.

(That's part of the reason why I initially tried to provide that _light() variant
of the reader APIs in the previous version of the patchset, so that light readers
can remain as undisturbed from cpu hotplug as possible, thereby avoiding indirectly
inducing the "stop_machine effect", like I mentioned here:
https://lkml.org/lkml/2012/12/5/369)

Regards,
Srivatsa S. Bhat

Srivatsa S. Bhat Dec. 10, 2012, 5:19 a.m. UTC | #14
On 12/10/2012 02:27 AM, Oleg Nesterov wrote:
> On 12/07, Srivatsa S. Bhat wrote:
>>
>> 4. No deadlock possibilities
>>
>>    Per-cpu locking is not the way to go if we want to have relaxed rules
>>    for lock-ordering, because we can end up in circular-locking dependencies
>>    as explained in https://lkml.org/lkml/2012/12/6/290
> 
> OK, but this assumes that, contrary to what Steven said, read-write-read
> deadlock is not possible when it comes to rwlock_t.

What I meant is, with a single (global) rwlock, you can't deadlock like that.
But if you used per-cpu rwlocks and didn't implement them properly, then you
can end up in circular locking dependencies, as shown in the link above.

That is, if you take the same example and replace the per-cpu locks with a
global rwlock, you won't deadlock:


Readers:

         CPU 0                                CPU 1
         ------                               ------

1.    spin_lock(&random_lock);             read_lock(&my_rwlock);


2.    read_lock(&my_rwlock);               spin_lock(&random_lock);


Writer:

         CPU 2:
         ------

       write_lock(&my_rwlock);


Even if the writer does a write_lock() in between steps 1 and 2 at the
reader side, nothing will happen. The writer would spin because CPU 1 already
got the rwlock for read, and hence both CPU 0 and 1 go ahead. When they
finish, the writer will get the lock and proceed. So, no deadlocks here.

So, what I was pointing out here was: if somebody replaced this global
rwlock with a "straightforward" implementation of per-cpu rwlocks, he'd
immediately end up in a circular locking dependency deadlock between the 3
entities, as explained in the link above.

Let me know if my assumptions are incorrect!

> So far I think this
> is true and we can't deadlock. Steven?
> 
> However, if this is true, then compared to preempt_disable/stop_machine,
> livelock is possible. Probably this is fine, we have the same problem with
> get_online_cpus(). But if we can accept this fact I feel we can simplify
> this somehow... Can't prove, only feel ;)
> 

Not sure I follow..

Anyway, my point is that we _can't_ implement per-cpu rwlocks like lglocks
and expect them to work in this case. IOW, we can't do:

Reader-side:
   -> read_lock() your per-cpu rwlock and proceed.

Writer-side:
   -> for_each_online_cpu(cpu)
          write_lock(per-cpu rwlock of 'cpu');


Also, like Tejun said, one of the important measures for per-cpu rwlocks
should be that, if a user replaces global rwlocks with percpu rwlocks (for
performance reasons), he shouldn't suddenly end up in numerous deadlock
possibilities which never existed before. The replacement should continue to
remain safe, and perhaps improve the performance.

Regards,
Srivatsa S. Bhat

Oleg Nesterov Dec. 10, 2012, 5:24 p.m. UTC | #15
On 12/10, Srivatsa S. Bhat wrote:
>
> On 12/10/2012 01:52 AM, Oleg Nesterov wrote:
> > On 12/10, Srivatsa S. Bhat wrote:
> >>
> >> On 12/10/2012 12:44 AM, Oleg Nesterov wrote:
> >>
> >>> But yes, it is easy to blame somebody else's code ;) And I can't suggest
> >>> something better at least right now. If I understand correctly, we can not
> >>> use, say, synchronize_sched() in _cpu_down() path
> >>
> >> We can't sleep in that code.. so that's a no-go.
> >
> > But we can?
> >
> > Note that I meant _cpu_down(), not get_online_cpus_atomic() or take_cpu_down().
> >
>
> Maybe I'm missing something, but how would it help if we did a
> synchronize_sched() so early (in _cpu_down())? Another bunch of preempt_disable()
> sections could start immediately after our call to synchronize_sched() no?
> How would we deal with that?

Sorry for confusion. Of course synchronize_sched() alone is not enough.
But we can use it to synchronize with preempt-disabled section and avoid
the barriers/atomic in the fast-path.

For example,

	bool writer_pending;
	DEFINE_RWLOCK(writer_rwlock);
	DEFINE_PER_CPU(int, reader_ctr);

	void get_online_cpus_atomic(void)
	{
		preempt_disable();
	
		if (likely(!writer_pending) || __this_cpu_read(reader_ctr)) {
			__this_cpu_inc(reader_ctr);
			return;
		}

		read_lock(&writer_rwlock);
		__this_cpu_inc(reader_ctr);
		read_unlock(&writer_rwlock);
	}

	// lacks release semantics, but we don't care
	void put_online_cpus_atomic(void)
	{
		__this_cpu_dec(reader_ctr);
		preempt_enable();
	}

Now, _cpu_down() does

	writer_pending = true;
	synchronize_sched();

before stop_one_cpu(). When synchronize_sched() returns, we know that
every get_online_cpus_atomic() must see writer_pending == T. And, if
any CPU incremented its reader_ctr we must see it is not zero.

take_cpu_down() does

	write_lock(&writer_rwlock);

	for_each_online_cpu(cpu) {
		while (per_cpu(reader_ctr, cpu))
			cpu_relax();
	}

and takes the lock.

However. This can lead to the deadlock we already discussed. So
take_cpu_down() should do

 retry:
	write_lock(&writer_rwlock);

	for_each_online_cpu(cpu) {
		if (per_cpu(reader_ctr, cpu)) {
			write_unlock(&writer_rwlock);
			goto retry;
		}
	}

to take the lock. But this is livelockable. However, I do not think it
is possible to avoid the livelock.

Just in case, the code above is only for illustration, perhaps it is not
100% correct and perhaps we can do it better. cpu_hotplug.active_writer
is ignored for simplicity, get/put should check current == active_writer.

Oleg.

Oleg Nesterov Dec. 10, 2012, 5:28 p.m. UTC | #16
On 12/10, Srivatsa S. Bhat wrote:
>
> On 12/10/2012 02:43 AM, Oleg Nesterov wrote:
> > Damn, sorry for noise. I missed this part...
> >
> > On 12/10, Srivatsa S. Bhat wrote:
> >>
> >> On 12/10/2012 12:44 AM, Oleg Nesterov wrote:
> >>> the latency. And I guess something like kick_all_cpus_sync() is "too heavy".
> >>
> >> I hadn't considered that. Thinking of it, I don't think it would help us..
> >> It won't get rid of the currently running preempt_disable() sections no?
> >
> > Sure. But (again, this is only my feeling so far) given that get_online_cpus_atomic()
> > does cli/sti,
>
> Ah, that one! Actually, the only reason I do that cli/sti is because, potentially
> interrupt handlers can be hotplug readers too. So we need to protect the portion
> of the code of get_online_cpus_atomic() which is not re-entrant.

Yes, I understand.

> > this can help to implement ensure-the-readers-must-see-the-pending-writer.
> > IOW this might help to implement sync-with-readers.
> >
>
> 2 problems:
>
> 1. It won't help with cases like this:
>
>    preempt_disable()
> 	...
> 	preempt_disable()
> 		...
> 			<------- Here
> 		...
> 	preempt_enable()
> 	...
>    preempt_enable()

No, I meant that kick_all_cpus_sync() can be used to synchronize with
cli/sti in get_online_cpus_atomic(), just like synchronize_sched() does
in the code I posted a minute ago.

> 2. Part of the reason we want to get rid of stop_machine() is to avoid the
> latency it induces on _all_ CPUs just to take *one* CPU offline. If we use
> kick_all_cpus_sync(), we get into that territory again: we unfairly interrupt
> every CPU, _even when_ that CPU's existing preempt_disabled() sections might
> not actually be hotplug readers! (ie., not bothered about CPU Hotplug).

I agree, that is why I said it is "too heavy".

Oleg.

Oleg Nesterov Dec. 10, 2012, 6:15 p.m. UTC | #17
On 12/10, Srivatsa S. Bhat wrote:
>
> On 12/10/2012 02:27 AM, Oleg Nesterov wrote:
> > On 12/07, Srivatsa S. Bhat wrote:
> >>
> >> 4. No deadlock possibilities
> >>
> >>    Per-cpu locking is not the way to go if we want to have relaxed rules
>>    for lock-ordering, because we can end up in circular-locking dependencies
> >>    as explained in https://lkml.org/lkml/2012/12/6/290
> >
> > OK, but this assumes that, contrary to what Steven said, read-write-read
> > deadlock is not possible when it comes to rwlock_t.
>
> What I meant is, with a single (global) rwlock, you can't deadlock like that.

Ah. I greatly misunderstood Steven's email,

http://marc.info/?l=linux-pm&m=135482212307876

Somehow I didn't notice he described the deadlock with _two_ rwlocks; I
wrongly thought that his point was that read_lock() is not recursive (like
down_read).

> Let me know if my assumptions are incorrect!

No, sorry, I misunderstood Steven.


> > However, if this is true, then compared to preempt_disable/stop_machine,
> > livelock is possible. Probably this is fine, we have the same problem with
> > get_online_cpus(). But if we can accept this fact I feel we can simplify
> > this somehow... Can't prove, only feel ;)
>
> Not sure I follow..

I meant that write_lock_irqsave(&hotplug_rwlock) in take_cpu_down()
can spin "forever".

Suppose that reader_acked() == T on every CPU, so that
get_online_cpus_atomic() always takes read_lock(&hotplug_rwlock).

It is possible that this lock will never be released by readers,

	CPU_0				CPU_1

	get_online_cpus_atomic()
					get_online_cpus_atomic()
	put_online_cpus_atomic()

	get_online_cpus_atomic()
					put_online_cpus_atomic()

					get_online_cpus_atomic()
	put_online_cpus_atomic()

and so on.


> Reader-side:
>    -> read_lock() your per-cpu rwlock and proceed.
>
> Writer-side:
>    -> for_each_online_cpu(cpu)
>           write_lock(per-cpu rwlock of 'cpu');

Yes, yes, this is clear.

> Also, like Tejun said, one of the important measures for per-cpu rwlocks
> should be that, if a user replaces global rwlocks with percpu rwlocks (for
> performance reasons), he shouldn't suddenly end up in numerous deadlock
> possibilities which never existed before. The replacement should continue to
> remain safe, and perhaps improve the performance.

Sure, I agree.

Oleg.

Srivatsa S. Bhat Dec. 11, 2012, 1:04 p.m. UTC | #18
On 12/10/2012 11:45 PM, Oleg Nesterov wrote:
> On 12/10, Srivatsa S. Bhat wrote:
>>
>> On 12/10/2012 02:27 AM, Oleg Nesterov wrote:
>>> However, if this is true, then compared to preempt_disable/stop_machine,
>>> livelock is possible. Probably this is fine, we have the same problem with
>>> get_online_cpus(). But if we can accept this fact I feel we can simplify
>>> this somehow... Can't prove, only feel ;)
>>
>> Not sure I follow..
> 
> I meant that write_lock_irqsave(&hotplug_rwlock) in take_cpu_down()
> can spin "forever".
> 
> Suppose that reader_acked() == T on every CPU, so that
> get_online_cpus_atomic() always takes read_lock(&hotplug_rwlock).
> 
> It is possible that this lock will never be released by readers,
> 
> 	CPU_0				CPU_1
> 
> 	get_online_cpus_atomic()
> 					get_online_cpus_atomic()
> 	put_online_cpus_atomic()
> 
> 	get_online_cpus_atomic()
> 					put_online_cpus_atomic()
> 
> 					get_online_cpus_atomic()
> 	put_online_cpus_atomic()
> 
> and so on.
> 

Right, and we can't do anything about it :(
Regards,
Srivatsa S. Bhat

Srivatsa S. Bhat Dec. 11, 2012, 1:05 p.m. UTC | #19
On 12/10/2012 10:58 PM, Oleg Nesterov wrote:
> On 12/10, Srivatsa S. Bhat wrote:
>>
>> On 12/10/2012 02:43 AM, Oleg Nesterov wrote:
>>> Damn, sorry for noise. I missed this part...
>>>
>>> On 12/10, Srivatsa S. Bhat wrote:
>>>>
>>>> On 12/10/2012 12:44 AM, Oleg Nesterov wrote:
>>>>> the latency. And I guess something like kick_all_cpus_sync() is "too heavy".
>>>>
>>>> I hadn't considered that. Thinking of it, I don't think it would help us..
>>>> It won't get rid of the currently running preempt_disable() sections no?
>>>
>>> Sure. But (again, this is only my feeling so far) given that get_online_cpus_atomic()
>>> does cli/sti,
>>
>> Ah, that one! Actually, the only reason I do that cli/sti is because, potentially
>> interrupt handlers can be hotplug readers too. So we need to protect the portion
>> of the code of get_online_cpus_atomic() which is not re-entrant.
> 
> Yes, I understand.
> 
>>> this can help to implement ensure-the-readers-must-see-the-pending-writer.
>>> IOW this might help to implement sync-with-readers.
>>>
>>
>> 2 problems:
>>
>> 1. It won't help with cases like this:
>>
>>    preempt_disable()
>> 	...
>> 	preempt_disable()
>> 		...
>> 			<------- Here
>> 		...
>> 	preempt_enable()
>> 	...
>>    preempt_enable()
> 
> No, I meant that kick_all_cpus_sync() can be used to synchronize with
> cli/sti in get_online_cpus_atomic(), just like synchronize_sched() does
> in the code I posted a minute ago.
> 

Ah, OK.

>> 2. Part of the reason we want to get rid of stop_machine() is to avoid the
>> latency it induces on _all_ CPUs just to take *one* CPU offline. If we use
>> kick_all_cpus_sync(), we get into that territory again: we unfairly interrupt
>> every CPU, _even when_ that CPU's existing preempt_disabled() sections might
>> not actually be hotplug readers! (ie., not bothered about CPU Hotplug).
> 
> I agree, that is why I said it is "too heavy".
> 

Got it :)

Regards,
Srivatsa S. Bhat

Srivatsa S. Bhat Dec. 11, 2012, 1:13 p.m. UTC | #20
On 12/10/2012 10:54 PM, Oleg Nesterov wrote:
> On 12/10, Srivatsa S. Bhat wrote:
>>
>> On 12/10/2012 01:52 AM, Oleg Nesterov wrote:
>>> On 12/10, Srivatsa S. Bhat wrote:
>>>>
>>>> On 12/10/2012 12:44 AM, Oleg Nesterov wrote:
>>>>
>>>>> But yes, it is easy to blame somebody else's code ;) And I can't suggest
>>>>> something better at least right now. If I understand correctly, we can not
>>>>> use, say, synchronize_sched() in _cpu_down() path
>>>>
>>>> We can't sleep in that code.. so that's a no-go.
>>>
>>> But we can?
>>>
>>> Note that I meant _cpu_down(), not get_online_cpus_atomic() or take_cpu_down().
>>>
>>
>> Maybe I'm missing something, but how would it help if we did a
>> synchronize_sched() so early (in _cpu_down())? Another bunch of preempt_disable()
>> sections could start immediately after our call to synchronize_sched() no?
>> How would we deal with that?
> 
> Sorry for confusion. Of course synchronize_sched() alone is not enough.
> But we can use it to synchronize with preempt-disabled section and avoid
> the barriers/atomic in the fast-path.
> 
> For example,
> 
> 	bool writer_pending;
> 	DEFINE_RWLOCK(writer_rwlock);
> 	DEFINE_PER_CPU(int, reader_ctr);
> 
> 	void get_online_cpus_atomic(void)
> 	{
> 		preempt_disable();
> 	
> 		if (likely(!writer_pending) || __this_cpu_read(reader_ctr)) {
> 			__this_cpu_inc(reader_ctr);
> 			return;
> 		}
> 
> 		read_lock(&writer_rwlock);
> 		__this_cpu_inc(reader_ctr);
> 		read_unlock(&writer_rwlock);
> 	}
> 
> 	// lacks release semantics, but we don't care
> 	void put_online_cpus_atomic(void)
> 	{
> 		__this_cpu_dec(reader_ctr);
> 		preempt_enable();
> 	}
> 
> Now, _cpu_down() does
> 
> 	writer_pending = true;
> 	synchronize_sched();
> 
> before stop_one_cpu(). When synchronize_sched() returns, we know that
> every get_online_cpus_atomic() must see writer_pending == T. And, if
> any CPU incremented its reader_ctr we must see it is not zero.
> 
> take_cpu_down() does
> 
> 	write_lock(&writer_rwlock);
> 
> 	for_each_online_cpu(cpu) {
> 		while (per_cpu(reader_ctr, cpu))
> 			cpu_relax();
> 	}
> 
> and takes the lock.
> 
> However. This can lead to the deadlock we already discussed. So
> take_cpu_down() should do
> 
>  retry:
> 	write_lock(&writer_rwlock);
> 
> 	for_each_online_cpu(cpu) {
> 		if (per_cpu(reader_ctr, cpu)) {
> 			write_unlock(&writer_rwlock);
> 			goto retry;
> 		}
> 	}
> 
> to take the lock. But this is livelockable. However, I do not think it
> is possible to avoid the livelock.
> 
> Just in case, the code above is only for illustration, perhaps it is not
> 100% correct and perhaps we can do it better. cpu_hotplug.active_writer
> is ignored for simplicity, get/put should check current == active_writer.
> 

This approach (of using synchronize_sched()) also looks good. It is simple,
yet effective, but unfortunately inefficient at the writer side (because
he'll have to wait for a full synchronize_sched()).

So I would say that we should keep this approach as a fallback, if we don't
come up with any better synchronization scheme (in terms of efficiency) at
a comparable level of simplicity/complexity.

I have come up with a v4 that simplifies several aspects of the synchronization
and makes it look a lot simpler than v3. (Fewer race windows to take
care of, implicit ACKs, no atomic ops, etc.)

I'll post it soon. Let me know what you think...

Regards,
Srivatsa S. Bhat

Tejun Heo Dec. 11, 2012, 1:47 p.m. UTC | #21
Hello, Srivatsa.

On Tue, Dec 11, 2012 at 06:43:54PM +0530, Srivatsa S. Bhat wrote:
> This approach (of using synchronize_sched()) also looks good. It is simple,
> yet effective, but unfortunately inefficient at the writer side (because
> he'll have to wait for a full synchronize_sched()).

While synchronize_sched() is heavier on the writer side than the
originally posted version, it doesn't stall the whole machine and
wouldn't introduce latencies to others.  Shouldn't that be enough?

Thanks.
Srivatsa S. Bhat Dec. 11, 2012, 2:02 p.m. UTC | #22
On 12/11/2012 07:17 PM, Tejun Heo wrote:
> Hello, Srivatsa.
> 
> On Tue, Dec 11, 2012 at 06:43:54PM +0530, Srivatsa S. Bhat wrote:
>> This approach (of using synchronize_sched()) also looks good. It is simple,
>> yet effective, but unfortunately inefficient at the writer side (because
>> he'll have to wait for a full synchronize_sched()).
> 
> While synchronize_sched() is heavier on the writer side than the
> originally posted version, it doesn't stall the whole machine and
> wouldn't introduce latencies to others.  Shouldn't that be enough?
> 

Short answer: Yes. But we can do better, with almost comparable code
complexity. So I'm tempted to try that out.

Long answer:
Even in the synchronize_sched() approach, we still have to identify the
readers who need to be converted to use the new get/put_online_cpus_atomic()
APIs and convert them. Then, if we can come up with a scheme such that
the writer has to wait only for those readers to complete, then why not?

If such a scheme ends up becoming too complicated, then I agree, we
can use synchronize_sched() itself. (That's what I meant by saying that
we'll use this as a fallback).

But even in this scheme which uses synchronize_sched(), we are
already half-way through (we already use 2 types of sync schemes -
counters and rwlocks). Just a little more logic can get rid of the
unnecessary full-wait too.. So why not give it a shot?

Regards,
Srivatsa S. Bhat

--
To unsubscribe from this list: send the line "unsubscribe linux-pm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Tejun Heo Dec. 11, 2012, 2:07 p.m. UTC | #23
Hello,

On Tue, Dec 11, 2012 at 07:32:13PM +0530, Srivatsa S. Bhat wrote:
> On 12/11/2012 07:17 PM, Tejun Heo wrote:
> > Hello, Srivatsa.
> > 
> > On Tue, Dec 11, 2012 at 06:43:54PM +0530, Srivatsa S. Bhat wrote:
> >> This approach (of using synchronize_sched()) also looks good. It is simple,
> >> yet effective, but unfortunately inefficient at the writer side (because
> >> he'll have to wait for a full synchronize_sched()).
> > 
> > While synchronize_sched() is heavier on the writer side than the
> > originally posted version, it doesn't stall the whole machine and
> > wouldn't introduce latencies to others.  Shouldn't that be enough?
> > 
> 
> Short answer: Yes. But we can do better, with almost comparable code
> complexity. So I'm tempted to try that out.
> 
> Long answer:
> Even in the synchronize_sched() approach, we still have to identify the
> readers who need to be converted to use the new get/put_online_cpus_atomic()
> APIs and convert them. Then, if we can come up with a scheme such that
> the writer has to wait only for those readers to complete, then why not?
> 
> If such a scheme ends up becoming too complicated, then I agree, we
> can use synchronize_sched() itself. (That's what I meant by saying that
> we'll use this as a fallback).
> 
> But even in this scheme which uses synchronize_sched(), we are
> already half-way through (we already use 2 types of sync schemes -
> counters and rwlocks). Just a little more logic can get rid of the
> unnecessary full-wait too.. So why not give it a shot?

It's not really about the code complexity but about making the reader side
as light as possible.  Please keep in mind that the reader side is still
*way* hotter than the writer side.  Before, the writer side was
heavy to the extent that it caused noticeable disruptions on the whole
system, and I think that's what we're trying to hunt down here.  If we
can shave off memory barriers from the reader side by using
synchronize_sched() on the writer side, that is the *better* result, not
worse.

Thanks.
Srivatsa S. Bhat Dec. 11, 2012, 4:28 p.m. UTC | #24
On 12/11/2012 07:37 PM, Tejun Heo wrote:
> Hello,
> 
> On Tue, Dec 11, 2012 at 07:32:13PM +0530, Srivatsa S. Bhat wrote:
>> On 12/11/2012 07:17 PM, Tejun Heo wrote:
>>> Hello, Srivatsa.
>>>
>>> On Tue, Dec 11, 2012 at 06:43:54PM +0530, Srivatsa S. Bhat wrote:
>>>> This approach (of using synchronize_sched()) also looks good. It is simple,
>>>> yet effective, but unfortunately inefficient at the writer side (because
>>>> he'll have to wait for a full synchronize_sched()).
>>>
>>> While synchronize_sched() is heavier on the writer side than the
>>> originally posted version, it doesn't stall the whole machine and
>>> wouldn't introduce latencies to others.  Shouldn't that be enough?
>>>
>>
>> Short answer: Yes. But we can do better, with almost comparable code
>> complexity. So I'm tempted to try that out.
>>
>> Long answer:
>> Even in the synchronize_sched() approach, we still have to identify the
>> readers who need to be converted to use the new get/put_online_cpus_atomic()
>> APIs and convert them. Then, if we can come up with a scheme such that
>> the writer has to wait only for those readers to complete, then why not?
>>
>> If such a scheme ends up becoming too complicated, then I agree, we
>> can use synchronize_sched() itself. (That's what I meant by saying that
>> we'll use this as a fallback).
>>
>> But even in this scheme which uses synchronize_sched(), we are
>> already half-way through (we already use 2 types of sync schemes -
>> counters and rwlocks). Just a little more logic can get rid of the
>> unnecessary full-wait too.. So why not give it a shot?
> 
> It's not really about the code complexity but about making the reader side
> as light as possible.  Please keep in mind that the reader side is still
> *way* hotter than the writer side.  Before, the writer side was
> heavy to the extent that it caused noticeable disruptions on the whole
> system, and I think that's what we're trying to hunt down here.  If we
> can shave off memory barriers from the reader side by using
> synchronize_sched() on the writer side, that is the *better* result, not
> worse.
> 

That's a fair point. Hmmm...
Regards,
Srivatsa S. Bhat


Patch

diff --git a/include/linux/cpu.h b/include/linux/cpu.h
index ce7a074..cf24da1 100644
--- a/include/linux/cpu.h
+++ b/include/linux/cpu.h
@@ -175,6 +175,8 @@  extern struct bus_type cpu_subsys;
 
 extern void get_online_cpus(void);
 extern void put_online_cpus(void);
+extern void get_online_cpus_atomic(void);
+extern void put_online_cpus_atomic(void);
 #define hotcpu_notifier(fn, pri)	cpu_notifier(fn, pri)
 #define register_hotcpu_notifier(nb)	register_cpu_notifier(nb)
 #define unregister_hotcpu_notifier(nb)	unregister_cpu_notifier(nb)
@@ -198,6 +200,8 @@  static inline void cpu_hotplug_driver_unlock(void)
 
 #define get_online_cpus()	do { } while (0)
 #define put_online_cpus()	do { } while (0)
+#define get_online_cpus_atomic()	do { } while (0)
+#define put_online_cpus_atomic()	do { } while (0)
 #define hotcpu_notifier(fn, pri)	do { (void)(fn); } while (0)
 /* These aren't inline functions due to a GCC bug. */
 #define register_hotcpu_notifier(nb)	({ (void)(nb); 0; })
diff --git a/kernel/cpu.c b/kernel/cpu.c
index 42bd331..d2076fa 100644
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -19,6 +19,7 @@ 
 #include <linux/mutex.h>
 #include <linux/gfp.h>
 #include <linux/suspend.h>
+#include <linux/atomic.h>
 
 #include "smpboot.h"
 
@@ -133,6 +134,161 @@  static void cpu_hotplug_done(void)
 	mutex_unlock(&cpu_hotplug.lock);
 }
 
+/*
+ * Reader-writer lock to synchronize between atomic hotplug readers
+ * and the CPU offline hotplug writer.
+ */
+static DEFINE_RWLOCK(hotplug_rwlock);
+
+static DEFINE_PER_CPU(atomic_t, reader_refcount);
+static DEFINE_PER_CPU(atomic_t, writer_signal);
+
+static inline void ack_writer_signal(void)
+{
+	/* Writer raised it to 2. Reduce it to 1 */
+	atomic_dec(&__get_cpu_var(writer_signal));
+	smp_mb__after_atomic_dec();
+}
+
+
+#define reader_active(cpu)						\
+			(atomic_read(&per_cpu(reader_refcount, cpu)) > 0)
+
+#define writer_active(cpu)						\
+			(atomic_read(&per_cpu(writer_signal, cpu)) > 0)
+
+#define is_new_writer(cpu)						\
+			(atomic_read(&per_cpu(writer_signal, cpu)) == 2)
+
+#define reader_acked(cpu)						\
+			(atomic_read(&per_cpu(writer_signal, cpu)) == 1)
+
+
+static inline void mark_reader_fastpath(void)
+{
+	atomic_inc(&__get_cpu_var(reader_refcount));
+	smp_mb__after_atomic_inc();
+}
+
+static inline void unmark_reader_fastpath(void)
+{
+	atomic_dec(&__get_cpu_var(reader_refcount));
+	smp_mb__after_atomic_dec();
+}
+
+static inline void mark_reader_slowpath(void)
+{
+	unsigned int cpu = smp_processor_id();
+
+	mark_reader_fastpath();
+
+	read_lock(&hotplug_rwlock);
+
+	/*
+	 * Release the lock before returning, if the writer has finished.
+	 * This will help reduce the burden on put_online_cpus_atomic().
+	 */
+	if (!writer_active(cpu))
+		read_unlock(&hotplug_rwlock);
+}
+
+/*
+ * Invoked by hotplug reader, to prevent CPUs from going offline.
+ *
+ * If there are no CPU offline writers active, just increment the
+ * per-cpu counter 'reader_refcount' and proceed.
+ *
+ * If a CPU offline hotplug writer is active, we need to ACK his
+ * signal and switch to using the global rwlocks, when the time is
+ * right.
+ *
+ * It is not safe to switch the synchronization scheme when we are
+ * already in a read-side critical section (because their corresponding
+ * put_online_cpus_atomic() won't know of the switch and will continue
+ * to use the old scheme (per-cpu counter updates), which would be
+ * wrong).
+ *
+ * So don't ACK the writer's signal (and don't switch to rwlocks) unless
+ * this is a fresh non-nested read-side critical section on this CPU.
+ *
+ * Once you switch, keep using the rwlocks for synchronization, until
+ * the writer signals the end of CPU offline.
+ *
+ * You can call this recursively, without fear of locking problems.
+ *
+ * Returns with preemption disabled.
+ */
+void get_online_cpus_atomic(void)
+{
+	unsigned int cpu = smp_processor_id();
+	unsigned long flags;
+
+	preempt_disable();
+	local_irq_save(flags);
+
+	if (cpu_hotplug.active_writer == current)
+		goto out;
+
+	smp_rmb(); /* Paired with smp_wmb() in drop_writer_signal() */
+
+	if (likely(!writer_active(cpu))) {
+		mark_reader_fastpath();
+		goto out;
+	}
+
+	/* CPU offline writer is active... */
+
+	/* ACK his signal only once */
+	if (is_new_writer(cpu)) {
+		/*
+		 * ACK the writer's signal only if this is a fresh read-side
+		 * critical section, and not just an extension of a running
+		 * (nested) read-side critical section.
+		 */
+		if (!reader_active(cpu)) {
+			ack_writer_signal();
+			mark_reader_slowpath();
+		} else {
+			/*
+			 * Keep taking the fastpath until all existing nested
+			 * read-side critical sections complete on this CPU.
+			 */
+			mark_reader_fastpath();
+		}
+	} else {
+		/*
+		 * The writer is not new because we have already acked his
+		 * signal before. So just take the slowpath and proceed.
+		 */
+		mark_reader_slowpath();
+	}
+
+out:
+	local_irq_restore(flags);
+}
+EXPORT_SYMBOL_GPL(get_online_cpus_atomic);
+
+void put_online_cpus_atomic(void)
+{
+	unsigned int cpu = smp_processor_id();
+	unsigned long flags;
+
+	local_irq_save(flags);
+
+	if (cpu_hotplug.active_writer == current)
+		goto out;
+
+	if (unlikely(writer_active(cpu)) && reader_acked(cpu))
+		read_unlock(&hotplug_rwlock);
+
+	unmark_reader_fastpath();
+
+out:
+	local_irq_restore(flags);
+	preempt_enable();
+}
+EXPORT_SYMBOL_GPL(put_online_cpus_atomic);
+
 #else /* #if CONFIG_HOTPLUG_CPU */
 static void cpu_hotplug_begin(void) {}
 static void cpu_hotplug_done(void) {}
@@ -237,6 +393,66 @@  static inline void check_for_tasks(int cpu)
 	write_unlock_irq(&tasklist_lock);
 }
 
+/* Set the per-cpu variable writer_signal to 2 */
+static inline void raise_writer_signal(unsigned int cpu)
+{
+	atomic_add(2, &per_cpu(writer_signal, cpu));
+	smp_mb__after_atomic_inc();
+}
+
+/* Reset the per-cpu variable writer_signal to 0 */
+static inline void drop_writer_signal(unsigned int cpu)
+{
+	atomic_set(&per_cpu(writer_signal, cpu), 0);
+	smp_wmb(); /* Paired with smp_rmb() in get_online_cpus_atomic() */
+}
+
+static void announce_cpu_offline_begin(void)
+{
+	unsigned int cpu;
+
+	for_each_online_cpu(cpu)
+		raise_writer_signal(cpu);
+}
+
+static void announce_cpu_offline_end(unsigned int dead_cpu)
+{
+	unsigned int cpu;
+
+	drop_writer_signal(dead_cpu);
+
+	for_each_online_cpu(cpu)
+		drop_writer_signal(cpu);
+}
+
+/*
+ * Check if the reader saw and acked the writer's signal.
+ * If he has, return immediately.
+ *
+ * If he hasn't acked it, there are 2 possibilities:
+ *  a. He is not an active reader; so we can return safely.
+ *  b. He is an active reader, so we need to retry until
+ *     he acks the writer's signal.
+ */
+static void sync_atomic_reader(unsigned int cpu)
+{
+
+retry:
+	if (reader_acked(cpu))
+		return;
+
+	if (reader_active(cpu))
+		goto retry;
+}
+
+static void sync_all_readers(void)
+{
+	unsigned int cpu;
+
+	for_each_online_cpu(cpu)
+		sync_atomic_reader(cpu);
+}
+
 struct take_cpu_down_param {
 	unsigned long mod;
 	void *hcpu;
@@ -246,15 +462,45 @@  struct take_cpu_down_param {
 static int __ref take_cpu_down(void *_param)
 {
 	struct take_cpu_down_param *param = _param;
-	int err;
+	unsigned long flags;
+	unsigned int cpu = (long)(param->hcpu);
+	int err = 0;
+
+
+	/*
+	 * Inform all atomic readers that we are going to offline a CPU
+	 * and that they need to switch from per-cpu counters to the
+	 * global hotplug_rwlock.
+	 */
+	announce_cpu_offline_begin();
+
+	/* Wait for every reader to notice the announcement and ACK it */
+	sync_all_readers();
+
+	/*
+	 * Now all the readers have switched to using the global hotplug_rwlock.
+	 * So now is our chance, go bring down the CPU!
+	 */
+
+	write_lock_irqsave(&hotplug_rwlock, flags);
 
 	/* Ensure this CPU doesn't handle any more interrupts. */
 	err = __cpu_disable();
 	if (err < 0)
-		return err;
+		goto out;
 
 	cpu_notify(CPU_DYING | param->mod, param->hcpu);
-	return 0;
+
+out:
+	/*
+	 * Inform all atomic readers that we are done with the CPU offline
+	 * operation, so that they can switch back to their per-cpu counters.
+	 * (We don't need to wait for them to see it).
+	 */
+	announce_cpu_offline_end(cpu);
+
+	write_unlock_irqrestore(&hotplug_rwlock, flags);
+	return err;
 }
 
 /* Requires cpu_add_remove_lock to be held */