Message ID | 20121207173759.27305.84316.stgit@srivatsabhat.in.ibm.com (mailing list archive) |
---|---
State | RFC, archived |
On Fri, Dec 07, 2012 at 11:08:13PM +0530, Srivatsa S. Bhat wrote:
> 4. No deadlock possibilities
>
> Per-cpu locking is not the way to go if we want to have relaxed rules
> for lock-ordering. Because, we can end up in circular-locking dependencies
> as explained in https://lkml.org/lkml/2012/12/6/290
>
> So, avoid per-cpu locking schemes (per-cpu locks/per-cpu atomic counters
> with spin-on-contention etc) as much as possible.

I really can't say I like this approach. percpu locking is very tricky
and difficult to get right, and we should try our best to avoid
implementing subsystem-specific ones as much as possible. Also, I think
the right approach would be auditing each get_online_cpus_atomic()
callsite and figuring out the proper locking order, rather than
implementing a construct this unusual, especially as hunting down the
incorrect cases shouldn't be difficult given proper lockdep annotation.

lg_lock doesn't do local nesting and I'm not sure how big a deal that
is, as I don't know how many callsites would need to be converted. But
if nesting is an absolute necessity, it would be much better to
implement a generic rwlock variant (say, lg_rwlock) rather than an
unusual cpu-hotplug-specific percpu synchronization construct.

Thanks.
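For concreteness, the lg_rwlock Tejun floats here might look roughly
like the sketch below, modeled on the then-existing lglock API. All of
these symbols are hypothetical; nothing like this exists in the tree at
the time of this thread, and lock initialization is elided:

	/* Hypothetical lg_rwlock: one rwlock_t per CPU. */
	struct lgrwlock {
		rwlock_t __percpu *lock;
	};

	static void lg_read_lock(struct lgrwlock *lgrw)
	{
		preempt_disable();
		/* Readers only ever touch their own CPU's lock: cheap and scalable. */
		read_lock(this_cpu_ptr(lgrw->lock));
	}

	static void lg_read_unlock(struct lgrwlock *lgrw)
	{
		read_unlock(this_cpu_ptr(lgrw->lock));
		preempt_enable();
	}

	/* The writer must take every CPU's lock for write. */
	static void lg_write_lock(struct lgrwlock *lgrw)
	{
		int cpu;

		for_each_possible_cpu(cpu)
			write_lock(per_cpu_ptr(lgrw->lock, cpu));
	}

	static void lg_write_unlock(struct lgrwlock *lgrw)
	{
		int cpu;

		for_each_possible_cpu(cpu)
			write_unlock(per_cpu_ptr(lgrw->lock, cpu));
	}

As the rest of the thread works out, the naive writer loop at the end
is exactly where the circular-dependency trouble comes from.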
Hello, again.

On Fri, Dec 07, 2012 at 09:57:24AM -0800, Tejun Heo wrote:
> possible. Also, I think the right approach would be auditing each
> get_online_cpus_atomic() callsite and figuring out the proper locking
> order, rather than implementing a construct this unusual, especially
> as hunting down the incorrect cases shouldn't be difficult given
> proper lockdep annotation.

On second look, it looks like you're implementing proper percpu_rwlock
semantics, as readers aren't supposed to induce circular dependencies
directly. Can you please work with Oleg to implement a proper
percpu-rwlock and use that for CPU hotplug, rather than implementing it
inside CPU hotplug?

Thanks.
On 12/07/2012 11:27 PM, Tejun Heo wrote:
> On Fri, Dec 07, 2012 at 11:08:13PM +0530, Srivatsa S. Bhat wrote:
>> 4. No deadlock possibilities
>>
>> Per-cpu locking is not the way to go if we want to have relaxed rules
>> for lock-ordering. Because, we can end up in circular-locking dependencies
>> as explained in https://lkml.org/lkml/2012/12/6/290
>>
>> So, avoid per-cpu locking schemes (per-cpu locks/per-cpu atomic counters
>> with spin-on-contention etc) as much as possible.
>
> I really can't say I like this approach. percpu locking is very tricky
> and difficult to get right, and we should try our best to avoid
> implementing subsystem-specific ones as much as possible. Also, I think
> the right approach would be auditing each get_online_cpus_atomic()
> callsite and figuring out the proper locking order, rather than
> implementing a construct this unusual, especially as hunting down the
> incorrect cases shouldn't be difficult given proper lockdep annotation.
>
> lg_lock doesn't do local nesting and I'm not sure how big a deal that
> is, as I don't know how many callsites would need to be converted. But
> if nesting is an absolute necessity, it would be much better to
> implement a generic rwlock variant (say, lg_rwlock) rather than an
> unusual cpu-hotplug-specific percpu synchronization construct.
>

To be honest, at a certain point in time while designing this, I did
realize that this was getting kinda overly complicated ;-) ... but I
wanted to see how this would actually work out when finished, and get
some feedback on it, hence I posted it out. But this also proves that
we _can_ actually compete with the flexibility of preempt_disable()
and still be safe with respect to locking, if we really want to ;-)

Regards,
Srivatsa S. Bhat
Hello, Srivatsa.

On Fri, Dec 07, 2012 at 11:54:01PM +0530, Srivatsa S. Bhat wrote:
>> lg_lock doesn't do local nesting and I'm not sure how big a deal that
>> is, as I don't know how many callsites would need to be converted. But
>> if nesting is an absolute necessity, it would be much better to
>> implement a generic rwlock variant (say, lg_rwlock) rather than an
>> unusual cpu-hotplug-specific percpu synchronization construct.
>
> To be honest, at a certain point in time while designing this, I did
> realize that this was getting kinda overly complicated ;-) ... but I
> wanted to see how this would actually work out when finished, and get
> some feedback on it, hence I posted it out. But this also proves that
> we _can_ actually compete with the flexibility of preempt_disable()
> and still be safe with respect to locking, if we really want to ;-)

I got confused by the comparison to preempt_disable(), but you're right
that a percpu rwlock shouldn't be able to introduce a locking dependency
which doesn't exist with a non-percpu rwlock. I.e., write locking should
be atomic w.r.t. all readers. At the simplest, this can be implemented
by the writer backing out all the way if try-locking any CPU fails and
retrying the whole thing. That should be correct but has the potential
of starving the writer.

What we need here is a generic percpu-rwlock. I don't know which exact
implementation strategy we should choose. Maybe your switching to a
global rwlock is the right solution. But, at any rate, I think it would
be best to implement a proper percpu-rwlock and then apply it to CPU
hotplug. It's actually gonna be pretty fitting as get_online_cpus() is
being converted to percpu-rwsem. IIUC, Oleg has been working on this
for a while now. Oleg, what do you think?

Thanks.
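The back-out-and-retry writer Tejun describes could be sketched like
this, reusing the hypothetical per-CPU rwlock_t layout from the sketch
above (illustration only, not posted code):

	/*
	 * Try-lock each CPU's rwlock for write; if any attempt fails,
	 * back out all the way and retry the whole thing.  The writer
	 * then never blocks while holding a subset of the locks, so it
	 * cannot take part in a circular dependency -- at the cost of
	 * possible writer starvation.
	 */
	static void percpu_write_lock(rwlock_t __percpu *locks)
	{
		int cpu, undo;

	retry:
		for_each_possible_cpu(cpu) {
			if (!write_trylock(per_cpu_ptr(locks, cpu))) {
				/* Drop every lock taken so far, then retry. */
				for_each_possible_cpu(undo) {
					if (undo == cpu)
						break;
					write_unlock(per_cpu_ptr(locks, undo));
				}
				cpu_relax();
				goto retry;
			}
		}
	}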
On 12/07/2012 11:46 PM, Tejun Heo wrote:
> Hello, again.
>
> On Fri, Dec 07, 2012 at 09:57:24AM -0800, Tejun Heo wrote:
>> possible. Also, I think the right approach would be auditing each
>> get_online_cpus_atomic() callsite and figuring out the proper locking
>> order, rather than implementing a construct this unusual, especially
>> as hunting down the incorrect cases shouldn't be difficult given
>> proper lockdep annotation.
>
> On second look, it looks like you're implementing proper
> percpu_rwlock semantics

Ah, nice! I didn't realize that I was actually doing what I intended
to avoid! ;-) Looking at the implementation of lglocks, and going by
Oleg's earlier comment that we just need to replace spinlock_t with
rwlock_t in them to get percpu_rwlocks, I was horrified at the kinds
of circular locking dependencies that they would be prone to, unless
used carefully. So I devised this scheme to be safe, while still having
relaxed rules. But if *this* is what percpu_rwlocks should ideally look
like, then great! :-)

> as readers aren't supposed to induce circular
> dependencies directly.

Yep, in this scheme, nobody will end up in a circular dependency.

> Can you please work with Oleg to implement a
> proper percpu-rwlock and use that for CPU hotplug, rather than
> implementing it inside CPU hotplug?
>

Sure, I'd be more than happy to!

Regards,
Srivatsa S. Bhat
On 12/08/2012 12:01 AM, Tejun Heo wrote:
> Hello, Srivatsa.
>
> On Fri, Dec 07, 2012 at 11:54:01PM +0530, Srivatsa S. Bhat wrote:
>>> lg_lock doesn't do local nesting and I'm not sure how big a deal that
>>> is, as I don't know how many callsites would need to be converted. But
>>> if nesting is an absolute necessity, it would be much better to
>>> implement a generic rwlock variant (say, lg_rwlock) rather than an
>>> unusual cpu-hotplug-specific percpu synchronization construct.
>>
>> To be honest, at a certain point in time while designing this, I did
>> realize that this was getting kinda overly complicated ;-) ... but I
>> wanted to see how this would actually work out when finished, and get
>> some feedback on it, hence I posted it out. But this also proves that
>> we _can_ actually compete with the flexibility of preempt_disable()
>> and still be safe with respect to locking, if we really want to ;-)
>
> I got confused by the comparison to preempt_disable(), but you're right
> that a percpu rwlock shouldn't be able to introduce a locking dependency
> which doesn't exist with a non-percpu rwlock. I.e., write locking should
> be atomic w.r.t. all readers.

Yep!

> At the simplest, this can be
> implemented by the writer backing out all the way if try-locking any
> CPU fails and retrying the whole thing. That should be correct but has
> the potential of starving the writer.
>

Exactly! This is what I mentioned yesterday in the link below, and said
that it's not good because of writer starvation ("too wasteful"):
https://lkml.org/lkml/2012/12/6/290

> What we need here is a generic percpu-rwlock. I don't know which exact
> implementation strategy we should choose. Maybe your switching to a
> global rwlock is the right solution. But, at any rate, I think it would
> be best to implement a proper percpu-rwlock and then apply it to CPU
> hotplug. It's actually gonna be pretty fitting as get_online_cpus() is
> being converted to percpu-rwsem. IIUC, Oleg has been working on this
> for a while now. Oleg, what do you think?
>

Hmm, that sounds good.

Regards,
Srivatsa S. Bhat
On 12/07, Srivatsa S. Bhat wrote:
>
> Per-cpu counters can help solve the cache-line bouncing problem. So we
> actually use the best of both: per-cpu counters (no-waiting) at the reader
> side in the fast-path, and global rwlocks in the slowpath.
>
> [ Fastpath = no writer is active; Slowpath = a writer is active ]
>
> IOW, the hotplug readers just increment/decrement their per-cpu refcounts
> when no writer is active.

Plus LOCK and cli/sti. I do not pretend I really know how bad this is
performance-wise though. And at first glance this looks overcomplicated.

But yes, it is easy to blame somebody else's code ;) And I can't suggest
something better, at least right now. If I understand correctly, we can
not use, say, synchronize_sched() in the _cpu_down() path, you also want
to improve the latency. And I guess something like kick_all_cpus_sync()
is "too heavy".

Also. After a quick reading this doesn't look correct, please see below.

> +void get_online_cpus_atomic(void)
> +{
> +	unsigned int cpu = smp_processor_id();
> +	unsigned long flags;
> +
> +	preempt_disable();
> +	local_irq_save(flags);
> +
> +	if (cpu_hotplug.active_writer == current)
> +		goto out;
> +
> +	smp_rmb(); /* Paired with smp_wmb() in drop_writer_signal() */
> +
> +	if (likely(!writer_active(cpu))) {

WINDOW. Suppose that reader_active() == F.

> +		mark_reader_fastpath();
> +		goto out;

Why can't take_cpu_down() do announce_cpu_offline_begin() +
sync_all_readers() in between?

Looks like we should increment the counter first, then check
writer_active(). And sync_atomic_reader() needs an rmb between the 2
atomic_read's.

Or. Again, suppose that reader_active() == F. But is_new_writer() == T.

> +	if (is_new_writer(cpu)) {
> +		/*
> +		 * ACK the writer's signal only if this is a fresh read-side
> +		 * critical section, and not just an extension of a running
> +		 * (nested) read-side critical section.
> +		 */
> +		if (!reader_active(cpu)) {
> +			ack_writer_signal();

What if take_cpu_down() does announce_cpu_offline_end() right before
ack_writer_signal()? In this case get_online_cpus_atomic() returns
with writer_signal == -1. If nothing else this breaks the next
raise_writer_signal().

Oleg.
On 12/10/2012 12:44 AM, Oleg Nesterov wrote:
> On 12/07, Srivatsa S. Bhat wrote:
>>
>> Per-cpu counters can help solve the cache-line bouncing problem. So we
>> actually use the best of both: per-cpu counters (no-waiting) at the reader
>> side in the fast-path, and global rwlocks in the slowpath.
>>
>> [ Fastpath = no writer is active; Slowpath = a writer is active ]
>>
>> IOW, the hotplug readers just increment/decrement their per-cpu refcounts
>> when no writer is active.
>
> Plus LOCK and cli/sti. I do not pretend I really know how bad this is
> performance-wise though. And at first glance this looks overcomplicated.
>

Hehe, I agree ;-) But I couldn't think of any other way to get rid of
the deadlock possibilities, other than using global rwlocks. So I
designed a way in which we can switch between per-cpu counters and
global rwlocks dynamically. Probably there is a smarter way to achieve
what we want, dunno...

> But yes, it is easy to blame somebody else's code ;) And I can't suggest
> something better, at least right now. If I understand correctly, we can
> not use, say, synchronize_sched() in the _cpu_down() path

We can't sleep in that code.. so that's a no-go.

> , you also want to improve the latency. And I guess something like
> kick_all_cpus_sync() is "too heavy".
>

I hadn't considered that. Thinking of it, I don't think it would help
us.. It won't get rid of the currently running preempt_disable()
sections, no?

> Also. After a quick reading this doesn't look correct, please see below.
>
>> +void get_online_cpus_atomic(void)
>> +{
>> +	unsigned int cpu = smp_processor_id();
>> +	unsigned long flags;
>> +
>> +	preempt_disable();
>> +	local_irq_save(flags);
>> +
>> +	if (cpu_hotplug.active_writer == current)
>> +		goto out;
>> +
>> +	smp_rmb(); /* Paired with smp_wmb() in drop_writer_signal() */
>> +
>> +	if (likely(!writer_active(cpu))) {
>
> WINDOW. Suppose that reader_active() == F.
>
>> +		mark_reader_fastpath();
>> +		goto out;
>
> Why can't take_cpu_down() do announce_cpu_offline_begin() +
> sync_all_readers() in between?
>
> Looks like we should increment the counter first, then check
> writer_active().

You are right, I missed the above race conditions.

> And sync_atomic_reader() needs an rmb between the 2 atomic_read's.
>

OK.

> Or. Again, suppose that reader_active() == F. But is_new_writer() == T.
>
>> +	if (is_new_writer(cpu)) {
>> +		/*
>> +		 * ACK the writer's signal only if this is a fresh read-side
>> +		 * critical section, and not just an extension of a running
>> +		 * (nested) read-side critical section.
>> +		 */
>> +		if (!reader_active(cpu)) {
>> +			ack_writer_signal();
>
> What if take_cpu_down() does announce_cpu_offline_end() right before
> ack_writer_signal()? In this case get_online_cpus_atomic() returns
> with writer_signal == -1. If nothing else this breaks the next
> raise_writer_signal().
>

Oh, yes, this one is wrong too! We need to mark ourselves as an active
reader right in the beginning, and probably change the check to
"reader_nested()" or something like that.

Thanks a lot Oleg!

Regards,
Srivatsa S. Bhat
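The barrier Oleg asks for would sit between the two reads in
sync_atomic_reader(), roughly as below. This is a sketch against the
helpers in the posted patch, not code from the thread:

	static void sync_atomic_reader(unsigned int cpu)
	{
	retry:
		if (reader_acked(cpu))
			return;

		/*
		 * Order the reader_acked() read before the reader_active()
		 * read.  Without this the two atomic_read()s can be
		 * reordered, and we could see a stale "inactive" reader
		 * and stop waiting too early.
		 */
		smp_rmb();

		if (reader_active(cpu))
			goto retry;
	}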
On 12/10, Srivatsa S. Bhat wrote:
>
> On 12/10/2012 12:44 AM, Oleg Nesterov wrote:
>
>> But yes, it is easy to blame somebody else's code ;) And I can't suggest
>> something better, at least right now. If I understand correctly, we can
>> not use, say, synchronize_sched() in the _cpu_down() path
>
> We can't sleep in that code.. so that's a no-go.

But we can?

Note that I meant _cpu_down(), not get_online_cpus_atomic() or
take_cpu_down().

Oleg.
On 12/07, Srivatsa S. Bhat wrote:
>
> 4. No deadlock possibilities
>
> Per-cpu locking is not the way to go if we want to have relaxed rules
> for lock-ordering. Because, we can end up in circular-locking dependencies
> as explained in https://lkml.org/lkml/2012/12/6/290

OK, but this assumes that, contrary to what Steven said, a
read-write-read deadlock is not possible when it comes to rwlock_t.
So far I think this is true and we can't deadlock. Steven?

However. If this is true, then compared to preempt_disable/stop_machine,
livelock is possible. Probably this is fine, we have the same problem
with get_online_cpus(). But if we can accept this fact I feel we can
simplify this somehow... Can't prove, only feel ;)

Oleg.
Damn, sorry for noise. I missed this part...

On 12/10, Srivatsa S. Bhat wrote:
>
> On 12/10/2012 12:44 AM, Oleg Nesterov wrote:
>> the latency. And I guess something like kick_all_cpus_sync() is "too heavy".
>
> I hadn't considered that. Thinking of it, I don't think it would help us..
> It won't get rid of the currently running preempt_disable() sections, no?

Sure. But (again, this is only my feeling so far) given that
get_online_cpus_atomic() does cli/sti, this can help to implement
ensure-the-readers-must-see-the-pending-writer. IOW, this might help
to implement sync-with-readers.

Oleg.
On 12/10/2012 01:52 AM, Oleg Nesterov wrote:
> On 12/10, Srivatsa S. Bhat wrote:
>>
>> On 12/10/2012 12:44 AM, Oleg Nesterov wrote:
>>
>>> But yes, it is easy to blame somebody else's code ;) And I can't suggest
>>> something better, at least right now. If I understand correctly, we can
>>> not use, say, synchronize_sched() in the _cpu_down() path
>>
>> We can't sleep in that code.. so that's a no-go.
>
> But we can?
>
> Note that I meant _cpu_down(), not get_online_cpus_atomic() or
> take_cpu_down().
>

Maybe I'm missing something, but how would it help if we did a
synchronize_sched() so early (in _cpu_down())? Another bunch of
preempt_disable() sections could start immediately after our call to
synchronize_sched(), no? How would we deal with that?

What we need to ensure here is that all existing preempt_disable()
sections are over, and that *we* (meaning, the cpu offline writer) get
to proceed immediately after that, making all the new readers wait for
us. And that is possible only if we do our 'wait-for-readers' thing
very close to our actual cpu offline operation (which is
take_cpu_down). Moreover, the writer needs to remain stable
(non-preemptible) while doing the wait-for-readers. Else (if the writer
himself keeps getting scheduled in and out of the CPU) I can't see how
he can possibly do justice to the wait.

Also, synchronize_sched() only helps us do the
'wait-for-existing-readers' logic. What would take care of the
'prevent-new-readers-from-going-ahead' logic?

To add to it all, synchronize_sched() waits for _all_ preempt_disable()
sections to complete, whereas only a handful of them might actually
care about CPU hotplug. Which is an unnecessary burden for the writer
(ie., waiting for unrelated readers to complete).

I bet you meant something else. Sorry if I misunderstood it.

Regards,
Srivatsa S. Bhat
On 12/10/2012 02:43 AM, Oleg Nesterov wrote:
> Damn, sorry for noise. I missed this part...
>
> On 12/10, Srivatsa S. Bhat wrote:
>>
>> On 12/10/2012 12:44 AM, Oleg Nesterov wrote:
>>> the latency. And I guess something like kick_all_cpus_sync() is "too heavy".
>>
>> I hadn't considered that. Thinking of it, I don't think it would help us..
>> It won't get rid of the currently running preempt_disable() sections, no?
>
> Sure. But (again, this is only my feeling so far) given that
> get_online_cpus_atomic() does cli/sti,

Ah, that one! Actually, the only reason I do that cli/sti is because,
potentially, interrupt handlers can be hotplug readers too. So we need
to protect the portion of the code of get_online_cpus_atomic() which is
not re-entrant. (Which reminds me to try and reduce the length of
cli/sti in that code, if possible).

> this can help to implement ensure-the-readers-must-see-the-pending-writer.
> IOW, this might help to implement sync-with-readers.
>

2 problems:

1. It won't help with cases like this:

   preempt_disable()
	...
	preempt_disable()
	...
	<------- Here
	...
	preempt_enable()
	...
   preempt_enable()

If the IPI hits at the point marked above, the IPI is useless, because,
at that point, since we are already in a nested read-side critical
section, we can't switch the synchronization protocol. We need to wait
till we start a fresh non-nested read-side critical section, in order
to switch to the global rwlock. The reason is that preempt_enable() or
put_online_cpus_atomic() can only undo what its predecessor
(preempt_disable()/get_online_cpus_atomic()) did.

2. Part of the reason we want to get rid of stop_machine() is to avoid
the latency it induces on _all_ CPUs just to take *one* CPU offline. If
we use kick_all_cpus_sync(), we get into that territory again: we
unfairly interrupt every CPU, _even when_ that CPU's existing
preempt_disable() sections might not actually be hotplug readers! (ie.,
not bothered about CPU Hotplug).

So, I think whatever synchronization scheme we develop must not induce
the very same problems that stop_machine() had. Otherwise, we can end
up going a full circle and coming back to the same stop_machine()
scenario that we intended to get rid of.

(That's part of the reason why I initially tried to provide that
_light() variant of the reader APIs in the previous version of the
patchset, so that light readers can remain as undisturbed by cpu
hotplug as possible, thereby avoiding indirectly inducing the
"stop_machine effect", like I mentioned here:
https://lkml.org/lkml/2012/12/5/369)

Regards,
Srivatsa S. Bhat
On 12/10/2012 02:27 AM, Oleg Nesterov wrote:
> On 12/07, Srivatsa S. Bhat wrote:
>>
>> 4. No deadlock possibilities
>>
>> Per-cpu locking is not the way to go if we want to have relaxed rules
>> for lock-ordering. Because, we can end up in circular-locking dependencies
>> as explained in https://lkml.org/lkml/2012/12/6/290
>
> OK, but this assumes that, contrary to what Steven said, a
> read-write-read deadlock is not possible when it comes to rwlock_t.

What I meant is, with a single (global) rwlock, you can't deadlock like
that. But if you used per-cpu rwlocks and we don't implement them
properly, then we can end up in circular locking dependencies like the
ones shown above.

That is, if you take the same example and replace the lock with a
global rwlock, you won't deadlock:

Readers:

     CPU 0                               CPU 1
     ------                              ------
1.   spin_lock(&random_lock);            read_lock(&my_rwlock);
2.   read_lock(&my_rwlock);              spin_lock(&random_lock);

Writer:

     CPU 2:
     ------
     write_lock(&my_rwlock);

Even if the writer does a write_lock() in-between steps 1 and 2 at the
reader-side, nothing will happen. The writer would spin because CPU 1
already got the rwlock for read, and hence both CPU 0 and 1 go ahead.
When they finish, the writer will get the lock and proceed. So, no
deadlocks here.

So, what I was pointing out here was, if somebody replaced this global
rwlock with a "straight-forward" implementation of per-cpu rwlocks,
he'll immediately end up in a circular locking dependency deadlock
between the 3 entities, as explained in the link above.

Let me know if my assumptions are incorrect!

> So far I think this is true and we can't deadlock. Steven?
>
> However. If this is true, then compared to preempt_disable/stop_machine,
> livelock is possible. Probably this is fine, we have the same problem
> with get_online_cpus(). But if we can accept this fact I feel we can
> simplify this somehow... Can't prove, only feel ;)
>

Not sure I follow.. Anyway, my point is that we _can't_ implement
per-cpu rwlocks like lglocks and expect it to work in this case.
IOW, we can't do:

Reader-side:
   -> read_lock() your per-cpu rwlock and proceed.

Writer-side:
   -> for_each_online_cpu(cpu)
          write_lock(per-cpu rwlock of 'cpu');

Also, like Tejun said, one of the important measures for per-cpu
rwlocks should be that, if a user replaces global rwlocks with percpu
rwlocks (for performance reasons), he shouldn't suddenly end up in
numerous deadlock possibilities which never existed before. The
replacement should continue to remain safe, and perhaps improve the
performance.

Regards,
Srivatsa S. Bhat
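For contrast with the global-rwlock example above, the same three
participants with a naive per-CPU rwlock do deadlock. The sketch below
reconstructs the scenario behind the lkml.org link; the lock names and
functions are made up for illustration (initialization elided):

	DEFINE_SPINLOCK(random_lock);
	DEFINE_PER_CPU(rwlock_t, percpu_rwlock);

	void reader_on_cpu0(void)
	{
		spin_lock(&random_lock);		/* step 1: holds random_lock */
		/* step 3: spins -- the writer already write-holds CPU 0's lock */
		read_lock(&per_cpu(percpu_rwlock, 0));

		read_unlock(&per_cpu(percpu_rwlock, 0));
		spin_unlock(&random_lock);
	}

	void reader_on_cpu1(void)
	{
		read_lock(&per_cpu(percpu_rwlock, 1));	/* step 1: read-holds CPU 1's lock */
		/* step 3: spins -- CPU 0 holds random_lock */
		spin_lock(&random_lock);

		spin_unlock(&random_lock);
		read_unlock(&per_cpu(percpu_rwlock, 1));
	}

	void writer_on_cpu2(void)
	{
		int cpu;

		/* step 2: gets CPU 0's lock, then spins on CPU 1's read-held lock */
		for_each_online_cpu(cpu)
			write_lock(&per_cpu(percpu_rwlock, cpu));
	}

CPU 0 waits on the writer, the writer waits on CPU 1, and CPU 1 waits
on CPU 0: a cycle none of them can break, and one that the single
global rwlock cannot form.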
On 12/10, Srivatsa S. Bhat wrote:
>
> On 12/10/2012 01:52 AM, Oleg Nesterov wrote:
>> On 12/10, Srivatsa S. Bhat wrote:
>>>
>>> On 12/10/2012 12:44 AM, Oleg Nesterov wrote:
>>>
>>>> But yes, it is easy to blame somebody else's code ;) And I can't suggest
>>>> something better, at least right now. If I understand correctly, we can
>>>> not use, say, synchronize_sched() in the _cpu_down() path
>>>
>>> We can't sleep in that code.. so that's a no-go.
>>
>> But we can?
>>
>> Note that I meant _cpu_down(), not get_online_cpus_atomic() or
>> take_cpu_down().
>>
>
> Maybe I'm missing something, but how would it help if we did a
> synchronize_sched() so early (in _cpu_down())? Another bunch of
> preempt_disable() sections could start immediately after our call to
> synchronize_sched(), no? How would we deal with that?

Sorry for the confusion. Of course synchronize_sched() alone is not
enough. But we can use it to synchronize with preempt-disabled sections
and avoid the barriers/atomics in the fast-path.

For example,

	bool writer_pending;
	DEFINE_RWLOCK(writer_rwlock);
	DEFINE_PER_CPU(int, reader_ctr);

	void get_online_cpus_atomic(void)
	{
		preempt_disable();

		if (likely(!writer_pending) || __this_cpu_read(reader_ctr)) {
			__this_cpu_inc(reader_ctr);
			return;
		}

		read_lock(&writer_rwlock);
		__this_cpu_inc(reader_ctr);
		read_unlock(&writer_rwlock);
	}

	// lacks release semantics, but we don't care
	void put_online_cpus_atomic(void)
	{
		__this_cpu_dec(reader_ctr);
		preempt_enable();
	}

Now, _cpu_down() does

	writer_pending = true;
	synchronize_sched();

before stop_one_cpu(). When synchronize_sched() returns, we know that
every get_online_cpus_atomic() must see writer_pending == T. And, if
any CPU incremented its reader_ctr, we must see it is not zero.

take_cpu_down() does

	write_lock(&writer_rwlock);

	for_each_online_cpu(cpu) {
		while (per_cpu(reader_ctr, cpu))
			cpu_relax();
	}

and takes the lock.

However. This can lead to the deadlock we already discussed. So
take_cpu_down() should do

	retry:
		write_lock(&writer_rwlock);

		for_each_online_cpu(cpu) {
			if (per_cpu(reader_ctr, cpu)) {
				write_unlock(&writer_rwlock);
				goto retry;
			}
		}

to take the lock. But this is livelockable. However, I do not think it
is possible to avoid the livelock.

Just in case, the code above is only for illustration, perhaps it is
not 100% correct and perhaps we can do it better.
cpu_hotplug.active_writer is ignored for simplicity, get/put should
check current == active_writer.

Oleg.
On 12/10, Srivatsa S. Bhat wrote:
>
> On 12/10/2012 02:43 AM, Oleg Nesterov wrote:
>> Damn, sorry for noise. I missed this part...
>>
>> On 12/10, Srivatsa S. Bhat wrote:
>>>
>>> On 12/10/2012 12:44 AM, Oleg Nesterov wrote:
>>>> the latency. And I guess something like kick_all_cpus_sync() is "too heavy".
>>>
>>> I hadn't considered that. Thinking of it, I don't think it would help us..
>>> It won't get rid of the currently running preempt_disable() sections, no?
>>
>> Sure. But (again, this is only my feeling so far) given that
>> get_online_cpus_atomic() does cli/sti,
>
> Ah, that one! Actually, the only reason I do that cli/sti is because,
> potentially, interrupt handlers can be hotplug readers too. So we need
> to protect the portion of the code of get_online_cpus_atomic() which is
> not re-entrant.

Yes, I understand.

>> this can help to implement ensure-the-readers-must-see-the-pending-writer.
>> IOW, this might help to implement sync-with-readers.
>>
>
> 2 problems:
>
> 1. It won't help with cases like this:
>
>    preempt_disable()
>	...
>	preempt_disable()
>	...
>	<------- Here
>	...
>	preempt_enable()
>	...
>    preempt_enable()

No, I meant that kick_all_cpus_sync() can be used to synchronize with
cli/sti in get_online_cpus_atomic(), just like synchronize_sched() does
in the code I posted a minute ago.

> 2. Part of the reason we want to get rid of stop_machine() is to avoid
> the latency it induces on _all_ CPUs just to take *one* CPU offline. If
> we use kick_all_cpus_sync(), we get into that territory again: we
> unfairly interrupt every CPU, _even when_ that CPU's existing
> preempt_disable() sections might not actually be hotplug readers!
> (ie., not bothered about CPU Hotplug).

I agree, that is why I said it is "too heavy".

Oleg.
On 12/10, Srivatsa S. Bhat wrote:
>
> On 12/10/2012 02:27 AM, Oleg Nesterov wrote:
>> On 12/07, Srivatsa S. Bhat wrote:
>>>
>>> 4. No deadlock possibilities
>>>
>>> Per-cpu locking is not the way to go if we want to have relaxed rules
>>> for lock-ordering. Because, we can end up in circular-locking dependencies
>>> as explained in https://lkml.org/lkml/2012/12/6/290
>>
>> OK, but this assumes that, contrary to what Steven said, a
>> read-write-read deadlock is not possible when it comes to rwlock_t.
>
> What I meant is, with a single (global) rwlock, you can't deadlock
> like that.

Ah. I greatly misunderstood Steven's email,
http://marc.info/?l=linux-pm&m=135482212307876

Somehow I didn't notice he described the deadlock with _two_ rwlocks,
I wrongly thought that his point is that read_lock() is not recursive
(like down_read).

> Let me know if my assumptions are incorrect!

No, sorry, I misunderstood Steven.

>> However. If this is true, then compared to preempt_disable/stop_machine,
>> livelock is possible. Probably this is fine, we have the same problem
>> with get_online_cpus(). But if we can accept this fact I feel we can
>> simplify this somehow... Can't prove, only feel ;)
>
> Not sure I follow..

I meant that write_lock_irqsave(&hotplug_rwlock) in take_cpu_down()
can spin "forever".

Suppose that reader_acked() == T on every CPU, so that
get_online_cpus_atomic() always takes read_lock(&hotplug_rwlock).

It is possible that this lock will never be released by readers,

	CPU_0					CPU_1

	get_online_cpus_atomic()
						get_online_cpus_atomic()
	put_online_cpus_atomic()

	get_online_cpus_atomic()
						put_online_cpus_atomic()

						get_online_cpus_atomic()
	put_online_cpus_atomic()

and so on.

> Reader-side:
>    -> read_lock() your per-cpu rwlock and proceed.
>
> Writer-side:
>    -> for_each_online_cpu(cpu)
>           write_lock(per-cpu rwlock of 'cpu');

Yes, yes, this is clear.

> Also, like Tejun said, one of the important measures for per-cpu
> rwlocks should be that, if a user replaces global rwlocks with percpu
> rwlocks (for performance reasons), he shouldn't suddenly end up in
> numerous deadlock possibilities which never existed before. The
> replacement should continue to remain safe, and perhaps improve the
> performance.

Sure, I agree.

Oleg.
On 12/10/2012 11:45 PM, Oleg Nesterov wrote:
> On 12/10, Srivatsa S. Bhat wrote:
>>
>> On 12/10/2012 02:27 AM, Oleg Nesterov wrote:
>>> However. If this is true, then compared to preempt_disable/stop_machine,
>>> livelock is possible. Probably this is fine, we have the same problem
>>> with get_online_cpus(). But if we can accept this fact I feel we can
>>> simplify this somehow... Can't prove, only feel ;)
>>
>> Not sure I follow..
>
> I meant that write_lock_irqsave(&hotplug_rwlock) in take_cpu_down()
> can spin "forever".
>
> Suppose that reader_acked() == T on every CPU, so that
> get_online_cpus_atomic() always takes read_lock(&hotplug_rwlock).
>
> It is possible that this lock will never be released by readers,
>
>	CPU_0					CPU_1
>
>	get_online_cpus_atomic()
>						get_online_cpus_atomic()
>	put_online_cpus_atomic()
>
>	get_online_cpus_atomic()
>						put_online_cpus_atomic()
>
>						get_online_cpus_atomic()
>	put_online_cpus_atomic()
>
> and so on.
>

Right, and we can't do anything about it :(

Regards,
Srivatsa S. Bhat
On 12/10/2012 10:58 PM, Oleg Nesterov wrote:
> On 12/10, Srivatsa S. Bhat wrote:
>>
>> On 12/10/2012 02:43 AM, Oleg Nesterov wrote:
>>> Damn, sorry for noise. I missed this part...
>>>
>>> On 12/10, Srivatsa S. Bhat wrote:
>>>>
>>>> On 12/10/2012 12:44 AM, Oleg Nesterov wrote:
>>>>> the latency. And I guess something like kick_all_cpus_sync() is "too heavy".
>>>>
>>>> I hadn't considered that. Thinking of it, I don't think it would help us..
>>>> It won't get rid of the currently running preempt_disable() sections, no?
>>>
>>> Sure. But (again, this is only my feeling so far) given that
>>> get_online_cpus_atomic() does cli/sti,
>>
>> Ah, that one! Actually, the only reason I do that cli/sti is because,
>> potentially, interrupt handlers can be hotplug readers too. So we need
>> to protect the portion of the code of get_online_cpus_atomic() which is
>> not re-entrant.
>
> Yes, I understand.
>
>>> this can help to implement ensure-the-readers-must-see-the-pending-writer.
>>> IOW, this might help to implement sync-with-readers.
>>>
>>
>> 2 problems:
>>
>> 1. It won't help with cases like this:
>>
>>    preempt_disable()
>>	...
>>	preempt_disable()
>>	...
>>	<------- Here
>>	...
>>	preempt_enable()
>>	...
>>    preempt_enable()
>
> No, I meant that kick_all_cpus_sync() can be used to synchronize with
> cli/sti in get_online_cpus_atomic(), just like synchronize_sched() does
> in the code I posted a minute ago.
>

Ah, OK.

>> 2. Part of the reason we want to get rid of stop_machine() is to avoid
>> the latency it induces on _all_ CPUs just to take *one* CPU offline. If
>> we use kick_all_cpus_sync(), we get into that territory again: we
>> unfairly interrupt every CPU, _even when_ that CPU's existing
>> preempt_disable() sections might not actually be hotplug readers!
>> (ie., not bothered about CPU Hotplug).
>
> I agree, that is why I said it is "too heavy".
>

Got it :)

Regards,
Srivatsa S. Bhat
On 12/10/2012 10:54 PM, Oleg Nesterov wrote:
> On 12/10, Srivatsa S. Bhat wrote:
>>
>> Maybe I'm missing something, but how would it help if we did a
>> synchronize_sched() so early (in _cpu_down())? Another bunch of
>> preempt_disable() sections could start immediately after our call to
>> synchronize_sched(), no? How would we deal with that?
>
> Sorry for the confusion. Of course synchronize_sched() alone is not
> enough. But we can use it to synchronize with preempt-disabled sections
> and avoid the barriers/atomics in the fast-path.
>
> For example,
>
>	bool writer_pending;
>	DEFINE_RWLOCK(writer_rwlock);
>	DEFINE_PER_CPU(int, reader_ctr);
>
>	void get_online_cpus_atomic(void)
>	{
>		preempt_disable();
>
>		if (likely(!writer_pending) || __this_cpu_read(reader_ctr)) {
>			__this_cpu_inc(reader_ctr);
>			return;
>		}
>
>		read_lock(&writer_rwlock);
>		__this_cpu_inc(reader_ctr);
>		read_unlock(&writer_rwlock);
>	}
>
>	// lacks release semantics, but we don't care
>	void put_online_cpus_atomic(void)
>	{
>		__this_cpu_dec(reader_ctr);
>		preempt_enable();
>	}
>
> Now, _cpu_down() does
>
>	writer_pending = true;
>	synchronize_sched();
>
> before stop_one_cpu(). When synchronize_sched() returns, we know that
> every get_online_cpus_atomic() must see writer_pending == T. And, if
> any CPU incremented its reader_ctr, we must see it is not zero.
>
> take_cpu_down() does
>
>	write_lock(&writer_rwlock);
>
>	for_each_online_cpu(cpu) {
>		while (per_cpu(reader_ctr, cpu))
>			cpu_relax();
>	}
>
> and takes the lock.
>
> However. This can lead to the deadlock we already discussed. So
> take_cpu_down() should do
>
>	retry:
>		write_lock(&writer_rwlock);
>
>		for_each_online_cpu(cpu) {
>			if (per_cpu(reader_ctr, cpu)) {
>				write_unlock(&writer_rwlock);
>				goto retry;
>			}
>		}
>
> to take the lock. But this is livelockable. However, I do not think it
> is possible to avoid the livelock.
>
> Just in case, the code above is only for illustration, perhaps it is
> not 100% correct and perhaps we can do it better.
> cpu_hotplug.active_writer is ignored for simplicity, get/put should
> check current == active_writer.
>

This approach (of using synchronize_sched()) also looks good. It is
simple, yet effective, but unfortunately inefficient at the writer side
(because he'll have to wait for a full synchronize_sched()).

So I would say that we should keep this approach as a fallback, if we
don't come up with any better synchronization scheme (in terms of
efficiency) at a comparable level of simplicity/complexity.

I have come up with a v4 that simplifies several aspects of the
synchronization and makes it look a lot simpler than v3. (Fewer race
windows to take care of, implicit ACKs, no atomic ops etc..)

I'll post it soon. Let me know what you think...

Regards,
Srivatsa S. Bhat
Hello, Srivatsa.

On Tue, Dec 11, 2012 at 06:43:54PM +0530, Srivatsa S. Bhat wrote:
> This approach (of using synchronize_sched()) also looks good. It is
> simple, yet effective, but unfortunately inefficient at the writer side
> (because he'll have to wait for a full synchronize_sched()).

While synchronize_sched() is heavier on the writer side than the
originally posted version, it doesn't stall the whole machine and
wouldn't introduce latencies to others. Shouldn't that be enough?

Thanks.
On 12/11/2012 07:17 PM, Tejun Heo wrote:
> Hello, Srivatsa.
>
> On Tue, Dec 11, 2012 at 06:43:54PM +0530, Srivatsa S. Bhat wrote:
>> This approach (of using synchronize_sched()) also looks good. It is
>> simple, yet effective, but unfortunately inefficient at the writer side
>> (because he'll have to wait for a full synchronize_sched()).
>
> While synchronize_sched() is heavier on the writer side than the
> originally posted version, it doesn't stall the whole machine and
> wouldn't introduce latencies to others. Shouldn't that be enough?
>

Short answer: Yes. But we can do better, with almost comparable code
complexity. So I'm tempted to try that out.

Long answer:
Even in the synchronize_sched() approach, we still have to identify the
readers who need to be converted to use the new
get/put_online_cpus_atomic() APIs and convert them. Then, if we can
come up with a scheme such that the writer has to wait only for those
readers to complete, then why not?

If such a scheme ends up becoming too complicated, then I agree, we can
use synchronize_sched() itself. (That's what I meant by saying that
we'll use this as a fallback).

But even in this scheme which uses synchronize_sched(), we are already
half-way through (we already use 2 types of sync schemes - counters and
rwlocks). Just a little more logic can get rid of the unnecessary
full-wait too.. So why not give it a shot?

Regards,
Srivatsa S. Bhat
Hello,

On Tue, Dec 11, 2012 at 07:32:13PM +0530, Srivatsa S. Bhat wrote:
> On 12/11/2012 07:17 PM, Tejun Heo wrote:
>> Hello, Srivatsa.
>>
>> On Tue, Dec 11, 2012 at 06:43:54PM +0530, Srivatsa S. Bhat wrote:
>>> This approach (of using synchronize_sched()) also looks good. It is
>>> simple, yet effective, but unfortunately inefficient at the writer side
>>> (because he'll have to wait for a full synchronize_sched()).
>>
>> While synchronize_sched() is heavier on the writer side than the
>> originally posted version, it doesn't stall the whole machine and
>> wouldn't introduce latencies to others. Shouldn't that be enough?
>>
>
> Short answer: Yes. But we can do better, with almost comparable code
> complexity. So I'm tempted to try that out.
>
> Long answer:
> Even in the synchronize_sched() approach, we still have to identify the
> readers who need to be converted to use the new
> get/put_online_cpus_atomic() APIs and convert them. Then, if we can
> come up with a scheme such that the writer has to wait only for those
> readers to complete, then why not?
>
> If such a scheme ends up becoming too complicated, then I agree, we can
> use synchronize_sched() itself. (That's what I meant by saying that
> we'll use this as a fallback).
>
> But even in this scheme which uses synchronize_sched(), we are already
> half-way through (we already use 2 types of sync schemes - counters and
> rwlocks). Just a little more logic can get rid of the unnecessary
> full-wait too.. So why not give it a shot?

It's not really about the code complexity but about making the reader
side as light as possible. Please keep in mind that the reader side is
still *way* hotter than the writer side. Before, the writer side was
heavy to the extent that it caused noticeable disruptions on the whole
system, and I think that's what we're trying to hunt down here. If we
can shave off memory barriers from the reader side by using
synchronize_sched() on the writer side, that is the *better* result,
not a worse one.

Thanks.
On 12/11/2012 07:37 PM, Tejun Heo wrote:
> Hello,
>
> On Tue, Dec 11, 2012 at 07:32:13PM +0530, Srivatsa S. Bhat wrote:
>> Short answer: Yes. But we can do better, with almost comparable code
>> complexity. So I'm tempted to try that out.
>>
>> Long answer:
>> Even in the synchronize_sched() approach, we still have to identify the
>> readers who need to be converted to use the new
>> get/put_online_cpus_atomic() APIs and convert them. Then, if we can
>> come up with a scheme such that the writer has to wait only for those
>> readers to complete, then why not?
>>
>> If such a scheme ends up becoming too complicated, then I agree, we can
>> use synchronize_sched() itself. (That's what I meant by saying that
>> we'll use this as a fallback).
>>
>> But even in this scheme which uses synchronize_sched(), we are already
>> half-way through (we already use 2 types of sync schemes - counters and
>> rwlocks). Just a little more logic can get rid of the unnecessary
>> full-wait too.. So why not give it a shot?
>
> It's not really about the code complexity but about making the reader
> side as light as possible. Please keep in mind that the reader side is
> still *way* hotter than the writer side. Before, the writer side was
> heavy to the extent that it caused noticeable disruptions on the whole
> system, and I think that's what we're trying to hunt down here. If we
> can shave off memory barriers from the reader side by using
> synchronize_sched() on the writer side, that is the *better* result,
> not a worse one.
>

That's a fair point. Hmmm...

Regards,
Srivatsa S. Bhat
diff --git a/include/linux/cpu.h b/include/linux/cpu.h
index ce7a074..cf24da1 100644
--- a/include/linux/cpu.h
+++ b/include/linux/cpu.h
@@ -175,6 +175,8 @@ extern struct bus_type cpu_subsys;
 
 extern void get_online_cpus(void);
 extern void put_online_cpus(void);
+extern void get_online_cpus_atomic(void);
+extern void put_online_cpus_atomic(void);
 #define hotcpu_notifier(fn, pri)	cpu_notifier(fn, pri)
 #define register_hotcpu_notifier(nb)	register_cpu_notifier(nb)
 #define unregister_hotcpu_notifier(nb)	unregister_cpu_notifier(nb)
@@ -198,6 +200,8 @@ static inline void cpu_hotplug_driver_unlock(void)
 
 #define get_online_cpus()	do { } while (0)
 #define put_online_cpus()	do { } while (0)
+#define get_online_cpus_atomic()	do { } while (0)
+#define put_online_cpus_atomic()	do { } while (0)
 #define hotcpu_notifier(fn, pri)	do { (void)(fn); } while (0)
 /* These aren't inline functions due to a GCC bug. */
 #define register_hotcpu_notifier(nb)	({ (void)(nb); 0; })
diff --git a/kernel/cpu.c b/kernel/cpu.c
index 42bd331..d2076fa 100644
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -19,6 +19,7 @@
 #include <linux/mutex.h>
 #include <linux/gfp.h>
 #include <linux/suspend.h>
+#include <linux/atomic.h>
 
 #include "smpboot.h"
 
@@ -133,6 +134,161 @@ static void cpu_hotplug_done(void)
 	mutex_unlock(&cpu_hotplug.lock);
 }
 
+/*
+ * Reader-writer lock to synchronize between atomic hotplug readers
+ * and the CPU offline hotplug writer.
+ */
+static DEFINE_RWLOCK(hotplug_rwlock);
+
+static DEFINE_PER_CPU(atomic_t, reader_refcount);
+static DEFINE_PER_CPU(atomic_t, writer_signal);
+
+static inline void ack_writer_signal(void)
+{
+	/* Writer raised it to 2. Reduce it to 1 */
+	atomic_dec(&__get_cpu_var(writer_signal));
+	smp_mb__after_atomic_dec();
+}
+
+
+#define reader_active(cpu)						\
+		(atomic_read(&per_cpu(reader_refcount, cpu)) > 0)
+
+#define writer_active(cpu)						\
+		(atomic_read(&per_cpu(writer_signal, cpu)) > 0)
+
+#define is_new_writer(cpu)						\
+		(atomic_read(&per_cpu(writer_signal, cpu)) == 2)
+
+#define reader_acked(cpu)						\
+		(atomic_read(&per_cpu(writer_signal, cpu)) == 1)
+
+
+static inline void mark_reader_fastpath(void)
+{
+	atomic_inc(&__get_cpu_var(reader_refcount));
+	smp_mb__after_atomic_inc();
+}
+
+static inline void unmark_reader_fastpath(void)
+{
+	atomic_dec(&__get_cpu_var(reader_refcount));
+	smp_mb__after_atomic_dec();
+}
+
+static inline void mark_reader_slowpath(void)
+{
+	unsigned int cpu = smp_processor_id();
+
+	mark_reader_fastpath();
+
+	read_lock(&hotplug_rwlock);
+
+	/*
+	 * Release the lock before returning, if the writer has finished.
+	 * This will help reduce the burden on put_online_cpus_atomic().
+	 */
+	if (!writer_active(cpu))
+		read_unlock(&hotplug_rwlock);
+}
+
+/*
+ * Invoked by hotplug reader, to prevent CPUs from going offline.
+ *
+ * If there are no CPU offline writers active, just increment the
+ * per-cpu counter 'reader_refcount' and proceed.
+ *
+ * If a CPU offline hotplug writer is active, we need to ACK his
+ * signal and switch to using the global rwlocks, when the time is
+ * right.
+ *
+ * It is not safe to switch the synchronization scheme when we are
+ * already in a read-side critical section (because their corresponding
+ * put_online_cpus_atomic() won't know of the switch and will continue
+ * to use the old scheme (per-cpu counter updates), which would be
+ * wrong).
+ *
+ * So don't ACK the writer's signal (and don't switch to rwlocks) unless
+ * this is a fresh non-nested read-side critical section on this CPU.
+ *
+ * Once you switch, keep using the rwlocks for synchronization, until
+ * the writer signals the end of CPU offline.
+ *
+ * You can call this recursively, without fear of locking problems.
+ *
+ * Returns with preemption disabled.
+ */
+void get_online_cpus_atomic(void)
+{
+	unsigned int cpu = smp_processor_id();
+	unsigned long flags;
+
+	preempt_disable();
+	local_irq_save(flags);
+
+	if (cpu_hotplug.active_writer == current)
+		goto out;
+
+	smp_rmb(); /* Paired with smp_wmb() in drop_writer_signal() */
+
+	if (likely(!writer_active(cpu))) {
+		mark_reader_fastpath();
+		goto out;
+	}
+
+	/* CPU offline writer is active... */
+
+	/* ACK his signal only once */
+	if (is_new_writer(cpu)) {
+		/*
+		 * ACK the writer's signal only if this is a fresh read-side
+		 * critical section, and not just an extension of a running
+		 * (nested) read-side critical section.
+		 */
+		if (!reader_active(cpu)) {
+			ack_writer_signal();
+			mark_reader_slowpath();
+		} else {
+			/*
+			 * Keep taking the fastpath until all existing nested
+			 * read-side critical sections complete on this CPU.
+			 */
+			mark_reader_fastpath();
+		}
+	} else {
+		/*
+		 * The writer is not new because we have already acked his
+		 * signal before. So just take the slowpath and proceed.
+		 */
+		mark_reader_slowpath();
+	}
+
+out:
+	local_irq_restore(flags);
+}
+EXPORT_SYMBOL_GPL(get_online_cpus_atomic);
+
+void put_online_cpus_atomic(void)
+{
+	unsigned int cpu = smp_processor_id();
+	unsigned long flags;
+
+	local_irq_save(flags);
+
+	if (cpu_hotplug.active_writer == current)
+		goto out;
+
+	if (unlikely(writer_active(cpu)) && reader_acked(cpu))
+		read_unlock(&hotplug_rwlock);
+
+	unmark_reader_fastpath();
+
+out:
+	local_irq_restore(flags);
+	preempt_enable();
+}
+EXPORT_SYMBOL_GPL(put_online_cpus_atomic);
+
 #else /* #if CONFIG_HOTPLUG_CPU */
 static void cpu_hotplug_begin(void) {}
 static void cpu_hotplug_done(void) {}
@@ -237,6 +393,66 @@ static inline void check_for_tasks(int cpu)
 	write_unlock_irq(&tasklist_lock);
 }
 
+/* Set the per-cpu variable writer_signal to 2 */
+static inline void raise_writer_signal(unsigned int cpu)
+{
+	atomic_add(2, &per_cpu(writer_signal, cpu));
+	smp_mb__after_atomic_inc();
+}
+
+/* Reset the per-cpu variable writer_signal to 0 */
+static inline void drop_writer_signal(unsigned int cpu)
+{
+	atomic_set(&per_cpu(writer_signal, cpu), 0);
+	smp_wmb(); /* Paired with smp_rmb() in get_online_cpus_atomic() */
+}
+
+static void announce_cpu_offline_begin(void)
+{
+	unsigned int cpu;
+
+	for_each_online_cpu(cpu)
+		raise_writer_signal(cpu);
+}
+
+static void announce_cpu_offline_end(unsigned int dead_cpu)
+{
+	unsigned int cpu;
+
+	drop_writer_signal(dead_cpu);
+
+	for_each_online_cpu(cpu)
+		drop_writer_signal(cpu);
+}
+
+/*
+ * Check if the reader saw and acked the writer's signal.
+ * If he has, return immediately.
+ *
+ * If he hasn't acked it, there are 2 possibilities:
+ * a. He is not an active reader; so we can return safely.
+ * b. He is an active reader, so we need to retry until
+ *    he acks the writer's signal.
+ */
+static void sync_atomic_reader(unsigned int cpu)
+{
+
+retry:
+	if (reader_acked(cpu))
+		return;
+
+	if (reader_active(cpu))
+		goto retry;
+}
+
+static void sync_all_readers(void)
+{
+	unsigned int cpu;
+
+	for_each_online_cpu(cpu)
+		sync_atomic_reader(cpu);
+}
+
 struct take_cpu_down_param {
 	unsigned long mod;
 	void *hcpu;
@@ -246,15 +462,45 @@ struct take_cpu_down_param {
 static int __ref take_cpu_down(void *_param)
 {
 	struct take_cpu_down_param *param = _param;
-	int err;
+	unsigned long flags;
+	unsigned int cpu = (long)(param->hcpu);
+	int err = 0;
+
+
+	/*
+	 * Inform all atomic readers that we are going to offline a CPU
+	 * and that they need to switch from per-cpu counters to the
+	 * global hotplug_rwlock.
+	 */
+	announce_cpu_offline_begin();
+
+	/* Wait for every reader to notice the announcement and ACK it */
+	sync_all_readers();
+
+	/*
+	 * Now all the readers have switched to using the global
+	 * hotplug_rwlock. So now is our chance, go bring down the CPU!
+	 */
+	write_lock_irqsave(&hotplug_rwlock, flags);
 
 	/* Ensure this CPU doesn't handle any more interrupts. */
 	err = __cpu_disable();
 	if (err < 0)
-		return err;
+		goto out;
 
 	cpu_notify(CPU_DYING | param->mod, param->hcpu);
-	return 0;
+
+out:
+	/*
+	 * Inform all atomic readers that we are done with the CPU offline
+	 * operation, so that they can switch back to their per-cpu
+	 * counters. (We don't need to wait for them to see it).
+	 */
+	announce_cpu_offline_end(cpu);
+
+	write_unlock_irqrestore(&hotplug_rwlock, flags);
+	return err;
 }
 
 /* Requires cpu_add_remove_lock to be held */
There are places where preempt_disable() is used to prevent any CPU from
going offline during the critical section. Let us call them "atomic
hotplug readers" ("atomic" because they run in atomic contexts).

Today, preempt_disable() works because the writer uses stop_machine().
But once stop_machine() is gone, the readers won't be able to prevent
CPUs from going offline using preempt_disable().

The intent of this patch is to provide synchronization APIs for such
atomic hotplug readers, to prevent (any) CPUs from going offline,
without depending on stop_machine() at the writer-side. The new APIs
will look something like this: get/put_online_cpus_atomic()

Some important design requirements and considerations:
------------------------------------------------------

1. Scalable synchronization at the reader-side, especially in the
   fast-path

   Any synchronization at the atomic hotplug readers side must be
   highly scalable - avoid global single-holder locks/counters etc.
   Because, these paths currently use the extremely fast
   preempt_disable(); our replacement to preempt_disable() should not
   become ridiculously costly, and also should not serialize the
   readers among themselves needlessly.

   At a minimum, the new APIs must be extremely fast at the reader
   side, at least in the fast-path, when no CPU offline writers are
   active.

2. preempt_disable() was recursive. The replacement should also be
   recursive.

3. No (new) lock-ordering restrictions

   preempt_disable() was super-flexible. It didn't impose any ordering
   restrictions or rules for nesting. Our replacement should also be
   equally flexible and usable.

4. No deadlock possibilities

   Per-cpu locking is not the way to go if we want to have relaxed
   rules for lock-ordering. Because, we can end up in circular-locking
   dependencies as explained in https://lkml.org/lkml/2012/12/6/290

   So, avoid per-cpu locking schemes (per-cpu locks/per-cpu atomic
   counters with spin-on-contention etc) as much as possible.

Implementation of the design:
-----------------------------

We use global rwlocks for synchronization, because then we won't get
into lock-ordering related problems (unlike per-cpu locks). However,
global rwlocks lead to unnecessary cache-line bouncing even when there
are no hotplug writers present, which can slow down the system
needlessly.

Per-cpu counters can help solve the cache-line bouncing problem. So we
actually use the best of both: per-cpu counters (no-waiting) at the
reader side in the fast-path, and global rwlocks in the slowpath.

[ Fastpath = no writer is active; Slowpath = a writer is active ]

IOW, the hotplug readers just increment/decrement their per-cpu
refcounts when no writer is active. When a writer becomes active, he
signals all readers to switch to global rwlocks for the duration of the
CPU offline operation. The readers ACK the writer's signal when it is
safe for them (ie., when they are about to start a fresh, non-nested
read-side critical section) and start using (holding) the global rwlock
for read in their subsequent critical sections.

The hotplug writer waits for ACKs from every reader, and then acquires
the global rwlock for write and takes the CPU offline. Then the writer
signals all readers that the CPU offline is done, and that they can go
back to using their per-cpu refcounts again.

Note that the lock-safety (despite the per-cpu scheme) comes from the
fact that the readers can *choose* when to ACK the writer's signal
(iow, when to switch to the global rwlock). And the readers don't wait
on anybody based on the per-cpu counters.

The only true synchronization that involves waiting at the reader-side
in this scheme is the one arising from the global rwlock, which is safe
from the circular locking dependency problems mentioned above (because
it is global).

Reader-writer locks and per-cpu counters are recursive, so they can be
used in a nested fashion in the reader-path. And the
"signal-and-ack-with-no-reader-spinning" design of switching the
synchronization scheme ensures that you can safely nest and call these
APIs in any way you want, just like preempt_disable()/enable().

Together, these satisfy all the requirements mentioned above.

I'm indebted to Michael Wang and Xiao Guangrong for their numerous
thoughtful suggestions and ideas, which inspired and influenced many of
the decisions in this as well as previous designs. Thanks a lot Michael
and Xiao!

Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
---

 include/linux/cpu.h |    4 +
 kernel/cpu.c        |  252 ++++++++++++++++++++++++++++++++++++++++++++++++++-
 2 files changed, 253 insertions(+), 3 deletions(-)
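To make the intended conversion concrete: a reader that today relies on
preempt_disable() to keep its target CPU online would change roughly as
below. This is a hypothetical callsite sketched for illustration
(do_work() and the cpu variable are made up), not code from the patch:

	/*
	 * Before: preempt_disable() pins CPUs only because the hotplug
	 * writer uses stop_machine().
	 */
	preempt_disable();
	smp_call_function_single(cpu, do_work, NULL, 1);
	preempt_enable();

	/*
	 * After: synchronize with the CPU offline writer explicitly.
	 * get_online_cpus_atomic() returns with preemption disabled,
	 * so the section stays atomic, just like before.
	 */
	get_online_cpus_atomic();
	if (cpu_online(cpu))
		smp_call_function_single(cpu, do_work, NULL, 1);
	put_online_cpus_atomic();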