Message ID: 20121211140358.23621.97011.stgit@srivatsabhat.in.ibm.com (mailing list archive)
State: RFC, archived
On 12/11, Srivatsa S. Bhat wrote:
>
> IOW, the hotplug readers just increment/decrement their per-cpu refcounts
> when no writer is active.

plus cli/sti ;) and increment/decrement are atomic.

At first glance looks correct to me, but I'll try to read it carefully
later.

A couple of minor nits,

> +static DEFINE_PER_CPU(bool, writer_signal);

Why it needs to be per-cpu? It can be global and __read_mostly to avoid
the false-sharing. OK, perhaps to put reader_percpu_refcnt/writer_signal
into a single cacheline...

> +void get_online_cpus_atomic(void)
> +{
> +	unsigned long flags;
> +
> +	preempt_disable();
> +
> +	if (cpu_hotplug.active_writer == current)
> +		return;
> +
> +	local_irq_save(flags);

Yes... this is still needed, we are going to increment reader_percpu_refcnt
unconditionally and this makes reader_nested_percpu() == T.

But,

> +void put_online_cpus_atomic(void)
> +{
> +	unsigned long flags;
> +
> +	if (cpu_hotplug.active_writer == current)
> +		goto out;
> +
> +	local_irq_save(flags);
> +
> +	/*
> +	 * We never allow heterogeneous nesting of readers. So it is trivial
> +	 * to find out the kind of reader we are, and undo the operation
> +	 * done by our corresponding get_online_cpus_atomic().
> +	 */
> +	if (__this_cpu_read(reader_percpu_refcnt))
> +		__this_cpu_dec(reader_percpu_refcnt);
> +	else
> +		read_unlock(&hotplug_rwlock);
> +
> +	local_irq_restore(flags);
> +out:
> +	preempt_enable();
> +}

Do we really need local_irq_save/restore in put_ ?

Oleg.
On 12/12, Oleg Nesterov wrote:
>
> On 12/11, Srivatsa S. Bhat wrote:
> >
> > IOW, the hotplug readers just increment/decrement their per-cpu refcounts
> > when no writer is active.
>
> plus cli/sti ;) and increment/decrement are atomic.
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

OOPS, sorry I was going to say "adds mb()".

And when I look at get_online_cpus_atomic() again it uses rmb(). This
doesn't look correct, we need the full barrier between this_cpu_inc()
and writer_active().

At the same time reader_nested_percpu() can be checked before mb().

Oleg.
On 12/12/2012 10:47 PM, Oleg Nesterov wrote: > On 12/11, Srivatsa S. Bhat wrote: >> >> IOW, the hotplug readers just increment/decrement their per-cpu refcounts >> when no writer is active. > > plus cli/sti ;) Of course, forgot to mention it, again! :) > and increment/decrement are atomic. > > At first glance looks correct to me, but I'll try to read it carefully > later. > > A couple of minor nits, > >> +static DEFINE_PER_CPU(bool, writer_signal); > > Why it needs to be per-cpu? It can be global and __read_mostly to avoid > the false-sharing. OK, perhaps to put reader_percpu_refcnt/writer_signal > into a single cacheline... > Even I realized this (that we could use a global) after posting out the series.. But do you think that it would be better to retain the per-cpu variant itself, due to the cache effects? >> +void get_online_cpus_atomic(void) >> +{ >> + unsigned long flags; >> + >> + preempt_disable(); >> + >> + if (cpu_hotplug.active_writer == current) >> + return; >> + >> + local_irq_save(flags); > > Yes... this is still needed, we are going to increment reader_percpu_refcnt > unconditionally and this makes reader_nested_percpu() == T. > > But, > >> +void put_online_cpus_atomic(void) >> +{ >> + unsigned long flags; >> + >> + if (cpu_hotplug.active_writer == current) >> + goto out; >> + >> + local_irq_save(flags); >> + >> + /* >> + * We never allow heterogeneous nesting of readers. So it is trivial >> + * to find out the kind of reader we are, and undo the operation >> + * done by our corresponding get_online_cpus_atomic(). >> + */ >> + if (__this_cpu_read(reader_percpu_refcnt)) >> + __this_cpu_dec(reader_percpu_refcnt); >> + else >> + read_unlock(&hotplug_rwlock); >> + >> + local_irq_restore(flags); >> +out: >> + preempt_enable(); >> +} > > Do we really need local_irq_save/restore in put_ ? > Hmm.. good point! I don't think we need it. Regards, Srivatsa S. Bhat -- To unsubscribe from this list: send the line "unsubscribe linux-pm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
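For what it's worth, the "single cacheline" point above can be pictured as packing the two per-cpu variables into one per-cpu structure, so the reader fast path touches a single line. The structure and field names below are made up purely for illustration; the patch itself keeps two separate per-cpu variables:

#include <linux/percpu.h>

/* Illustrative only: keep the reader refcount and the writer's signal together. */
struct hotplug_reader_state {
	unsigned int	refcnt;		/* reader nesting count */
	bool		writer_signal;	/* set by the CPU-offline writer */
};

static DEFINE_PER_CPU(struct hotplug_reader_state, hotplug_reader_state);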
On 12/12, Srivatsa S. Bhat wrote: > > On 12/12/2012 10:47 PM, Oleg Nesterov wrote: > > > > Why it needs to be per-cpu? It can be global and __read_mostly to avoid > > the false-sharing. OK, perhaps to put reader_percpu_refcnt/writer_signal > > into a single cacheline... > > Even I realized this (that we could use a global) after posting out the > series.. But do you think that it would be better to retain the per-cpu > variant itself, due to the cache effects? I don't really know, up to you. This was the question ;) > > Do we really need local_irq_save/restore in put_ ? > > > > Hmm.. good point! I don't think we need it. And _perhaps_ get_ can avoid it too? I didn't really try to think, probably this is not right, but can't something like this work? #define XXXX (1 << 16) #define MASK (XXXX -1) void get_online_cpus_atomic(void) { preempt_disable(); // only for writer __this_cpu_add(reader_percpu_refcnt, XXXX); if (__this_cpu_read(reader_percpu_refcnt) & MASK) { __this_cpu_inc(reader_percpu_refcnt); } else { smp_wmb(); if (writer_active()) { ... } } __this_cpu_dec(reader_percpu_refcnt, XXXX); } void put_online_cpus_atomic(void) { if (__this_cpu_read(reader_percpu_refcnt) & MASK) __this_cpu_dec(reader_percpu_refcnt); else read_unlock(&hotplug_rwlock); preempt_enable(); } Oleg. -- To unsubscribe from this list: send the line "unsubscribe linux-pm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
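Here is the split-counter idea above written out a bit more fully, purely as a sketch and not a tested implementation: the magic constants are named, the trailing decrement uses __this_cpu_sub() (since __this_cpu_dec() takes no amount), the "no writer" branch is filled in so it matches put_, and the barrier/percpu-op details anticipate what the rest of the thread goes on to settle (a full smp_mb(), and the irq-safe this_cpu_*() forms). writer_active() and hotplug_rwlock are the ones from the patch at the bottom of the thread.

#define READER_PRESENT		(1UL << 16)		/* "XXXX" above */
#define READER_REFCNT_MASK	(READER_PRESENT - 1)	/* "MASK" above */

static DEFINE_PER_CPU(unsigned long, reader_percpu_refcnt);

void get_online_cpus_atomic(void)
{
	preempt_disable();

	/*
	 * Announce ourselves in the upper bits first. An interrupt that
	 * runs get_online_cpus_atomic() in this window finds the nesting
	 * bits still zero, so it falls through to the writer_active()
	 * check instead of wrongly taking the per-cpu fast path while a
	 * writer holds hotplug_rwlock.
	 */
	this_cpu_add(reader_percpu_refcnt, READER_PRESENT);

	if (__this_cpu_read(reader_percpu_refcnt) & READER_REFCNT_MASK) {
		/* Nested inside an existing per-cpu reader: just nest. */
		this_cpu_inc(reader_percpu_refcnt);
	} else {
		smp_mb();	/* order the STORE above against the LOAD below */

		if (likely(!writer_active()))
			this_cpu_inc(reader_percpu_refcnt);
		else
			read_lock(&hotplug_rwlock);	/* slow path */
	}

	/* Drop the "present" marker; any nesting count remains. */
	this_cpu_sub(reader_percpu_refcnt, READER_PRESENT);
}

void put_online_cpus_atomic(void)
{
	/* Heterogeneous nesting is never allowed, so this is unambiguous. */
	if (__this_cpu_read(reader_percpu_refcnt) & READER_REFCNT_MASK)
		this_cpu_dec(reader_percpu_refcnt);
	else
		read_unlock(&hotplug_rwlock);

	preempt_enable();
}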
On 12/12/2012 10:54 PM, Oleg Nesterov wrote: > On 12/12, Oleg Nesterov wrote: >> >> On 12/11, Srivatsa S. Bhat wrote: >>> >>> IOW, the hotplug readers just increment/decrement their per-cpu refcounts >>> when no writer is active. >> >> plus cli/sti ;) and increment/decrement are atomic. > ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ > > OOPS, sorry I was going to say "adds mb()". > Ok, got it :) > And when I look at get_online_cpus_atomic() again it uses rmb(). This > doesn't look correct, we need the full barrier between this_cpu_inc() > and writer_active(). > Hmm.. > At the same time reader_nested_percpu() can be checked before mb(). > I thought that since the increment and the check (reader_nested_percpu) act on the same memory location, they will naturally be run in the given order, without any need for barriers. Am I wrong? (I referred Documentation/memory-barriers.txt again to verify this, and the second point under the "Guarantees" section looked like it said the same thing : "Overlapping loads and stores within a particular CPU will appear to be ordered within that CPU"). Regards, Srivatsa S. Bhat -- To unsubscribe from this list: send the line "unsubscribe linux-pm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 12/12, Srivatsa S. Bhat wrote: > > On 12/12/2012 10:54 PM, Oleg Nesterov wrote: > > > And when I look at get_online_cpus_atomic() again it uses rmb(). This > > doesn't look correct, we need the full barrier between this_cpu_inc() > > and writer_active(). > > Hmm.. > > > At the same time reader_nested_percpu() can be checked before mb(). > > I thought that since the increment and the check (reader_nested_percpu) > act on the same memory location, they will naturally be run in the given > order, without any need for barriers. Am I wrong? And this is what I meant, you do not need a barrier before reader_nested_percpu(). But you need to ensure that WRITE(reader_percpu_refcnt) and READ(writer_signal) can't be reordered, so you need mb() in between. rmb() can serialize LOADs and STOREs. Or I misunderstood? Oleg. -- To unsubscribe from this list: send the line "unsubscribe linux-pm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 12/12/2012 11:32 PM, Oleg Nesterov wrote: > On 12/12, Srivatsa S. Bhat wrote: >> >> On 12/12/2012 10:47 PM, Oleg Nesterov wrote: >>> >>> Why it needs to be per-cpu? It can be global and __read_mostly to avoid >>> the false-sharing. OK, perhaps to put reader_percpu_refcnt/writer_signal >>> into a single cacheline... >> >> Even I realized this (that we could use a global) after posting out the >> series.. But do you think that it would be better to retain the per-cpu >> variant itself, due to the cache effects? > > I don't really know, up to you. This was the question ;) OK :-) > >>> Do we really need local_irq_save/restore in put_ ? >>> >> >> Hmm.. good point! I don't think we need it. > > And _perhaps_ get_ can avoid it too? > > I didn't really try to think, probably this is not right, but can't > something like this work? > > #define XXXX (1 << 16) > #define MASK (XXXX -1) > > void get_online_cpus_atomic(void) > { > preempt_disable(); > > // only for writer > __this_cpu_add(reader_percpu_refcnt, XXXX); > > if (__this_cpu_read(reader_percpu_refcnt) & MASK) { > __this_cpu_inc(reader_percpu_refcnt); > } else { > smp_wmb(); > if (writer_active()) { > ... > } > } > > __this_cpu_dec(reader_percpu_refcnt, XXXX); > } > Sorry, may be I'm too blind to see, but I didn't understand the logic of how the mask helps us avoid disabling interrupts.. Can you kindly elaborate? > void put_online_cpus_atomic(void) > { > if (__this_cpu_read(reader_percpu_refcnt) & MASK) > __this_cpu_dec(reader_percpu_refcnt); > else > read_unlock(&hotplug_rwlock); > preempt_enable(); > } > Regards, Srivatsa S. Bhat -- To unsubscribe from this list: send the line "unsubscribe linux-pm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 12/12/2012 11:53 PM, Oleg Nesterov wrote: > On 12/12, Srivatsa S. Bhat wrote: >> >> On 12/12/2012 10:54 PM, Oleg Nesterov wrote: >> >>> And when I look at get_online_cpus_atomic() again it uses rmb(). This >>> doesn't look correct, we need the full barrier between this_cpu_inc() >>> and writer_active(). >> >> Hmm.. >> >>> At the same time reader_nested_percpu() can be checked before mb(). >> >> I thought that since the increment and the check (reader_nested_percpu) >> act on the same memory location, they will naturally be run in the given >> order, without any need for barriers. Am I wrong? > > And this is what I meant, you do not need a barrier before > reader_nested_percpu(). > Ah, ok! > But you need to ensure that WRITE(reader_percpu_refcnt) and READ(writer_signal) > can't be reordered, so you need mb() in between. rmb() can serialize LOADs and > STOREs. > OK, got it. (I know you meant s/can/can't). I'm trying to see if we can somehow exploit the fact that the writer can potentially tolerate if a reader ignores his signal (to switch to rwlocks) for a while... and use this to get rid of barriers in the reader path (without using synchronize_sched() at the writer, of course). And perhaps also take advantage of the fact that the read_lock() acts as a one-way barrier.. I don't know, maybe its not possible after all.. :-/ Regards, Srivatsa S. Bhat -- To unsubscribe from this list: send the line "unsubscribe linux-pm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 12/13, Srivatsa S. Bhat wrote: > > On 12/12/2012 11:32 PM, Oleg Nesterov wrote: > > And _perhaps_ get_ can avoid it too? > > > > I didn't really try to think, probably this is not right, but can't > > something like this work? > > > > #define XXXX (1 << 16) > > #define MASK (XXXX -1) > > > > void get_online_cpus_atomic(void) > > { > > preempt_disable(); > > > > // only for writer > > __this_cpu_add(reader_percpu_refcnt, XXXX); > > > > if (__this_cpu_read(reader_percpu_refcnt) & MASK) { > > __this_cpu_inc(reader_percpu_refcnt); > > } else { > > smp_wmb(); > > if (writer_active()) { > > ... > > } > > } > > > > __this_cpu_dec(reader_percpu_refcnt, XXXX); > > } > > > > Sorry, may be I'm too blind to see, but I didn't understand the logic > of how the mask helps us avoid disabling interrupts.. Why do we need cli/sti at all? We should prevent the following race: - the writer already holds hotplug_rwlock, so get_ must not succeed. - the new reader comes, it increments reader_percpu_refcnt, but before it checks writer_active() ... - irq handler does get_online_cpus_atomic() and sees reader_nested_percpu() == T, so it simply increments reader_percpu_refcnt and succeeds. OTOH, why do we need to increment reader_percpu_refcnt the counter in advance? To ensure that either we see writer_active() or the writer should see reader_percpu_refcnt != 0 (and that is why they should write/read in reverse order). The code above tries to avoid this race using the lower 16 bits as a "nested-counter", and the upper bits to avoid the race with the writer. // only for writer __this_cpu_add(reader_percpu_refcnt, XXXX); If irq comes and does get_online_cpus_atomic(), it won't be confused by __this_cpu_add(XXXX), it will check the lower bits and switch to the "slow path". But once again, so far I didn't really try to think. It is quite possible I missed something. Oleg. -- To unsubscribe from this list: send the line "unsubscribe linux-pm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 12/13/2012 12:18 AM, Oleg Nesterov wrote: > On 12/13, Srivatsa S. Bhat wrote: >> >> On 12/12/2012 11:32 PM, Oleg Nesterov wrote: >>> And _perhaps_ get_ can avoid it too? >>> >>> I didn't really try to think, probably this is not right, but can't >>> something like this work? >>> >>> #define XXXX (1 << 16) >>> #define MASK (XXXX -1) >>> >>> void get_online_cpus_atomic(void) >>> { >>> preempt_disable(); >>> >>> // only for writer >>> __this_cpu_add(reader_percpu_refcnt, XXXX); >>> >>> if (__this_cpu_read(reader_percpu_refcnt) & MASK) { >>> __this_cpu_inc(reader_percpu_refcnt); >>> } else { >>> smp_wmb(); >>> if (writer_active()) { >>> ... >>> } >>> } >>> >>> __this_cpu_dec(reader_percpu_refcnt, XXXX); >>> } >>> >> >> Sorry, may be I'm too blind to see, but I didn't understand the logic >> of how the mask helps us avoid disabling interrupts.. > > Why do we need cli/sti at all? We should prevent the following race: > > - the writer already holds hotplug_rwlock, so get_ must not > succeed. > > - the new reader comes, it increments reader_percpu_refcnt, > but before it checks writer_active() ... > > - irq handler does get_online_cpus_atomic() and sees > reader_nested_percpu() == T, so it simply increments > reader_percpu_refcnt and succeeds. > > OTOH, why do we need to increment reader_percpu_refcnt the counter > in advance? To ensure that either we see writer_active() or the > writer should see reader_percpu_refcnt != 0 (and that is why they > should write/read in reverse order). > > The code above tries to avoid this race using the lower 16 bits > as a "nested-counter", and the upper bits to avoid the race with > the writer. > > // only for writer > __this_cpu_add(reader_percpu_refcnt, XXXX); > > If irq comes and does get_online_cpus_atomic(), it won't be confused > by __this_cpu_add(XXXX), it will check the lower bits and switch to > the "slow path". > This is a very clever scheme indeed! :-) Thanks a lot for explaining it in detail. > > But once again, so far I didn't really try to think. It is quite > possible I missed something. > Even I don't spot anything wrong with it. But I'll give it some more thought.. Regards, Srivatsa S. Bhat -- To unsubscribe from this list: send the line "unsubscribe linux-pm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 12/12, Oleg Nesterov wrote:
>
> On 12/12, Srivatsa S. Bhat wrote:
> >
> > On 12/12/2012 10:47 PM, Oleg Nesterov wrote:
> > >
> > > Why it needs to be per-cpu? It can be global and __read_mostly to avoid
> > > the false-sharing. OK, perhaps to put reader_percpu_refcnt/writer_signal
> > > into a single cacheline...
> >
> > Even I realized this (that we could use a global) after posting out the
> > series.. But do you think that it would be better to retain the per-cpu
> > variant itself, due to the cache effects?
>
> I don't really know, up to you. This was the question ;)

But perhaps there is another reason to make it per-cpu...

It seems we can avoid cpu_hotplug.active_writer == current check in
get/put.

take_cpu_down() can clear this_cpu(writer_signal) right after it takes
hotplug_rwlock for writing. It runs with irqs and preemption disabled,
nobody else will ever look at writer_signal on its CPU.

Oleg.
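Roughly, the suggestion above would land in take_cpu_down() like this (a sketch only; the elided parts are the ones in the patch at the bottom of the thread):

static int __ref take_cpu_down(void *_param)
{
	unsigned long flags;
	int err = 0;

	/* ... announce_cpu_offline_begin() + sync_all_readers(), as in the patch ... */

	write_lock_irqsave(&hotplug_rwlock, flags);

	/*
	 * Irqs and preemption are disabled here and nobody else ever looks
	 * at this CPU's writer_signal, so clearing it makes readers on this
	 * CPU quietly keep using their per-cpu refcount -- which is what
	 * lets the "cpu_hotplug.active_writer == current" checks in
	 * get/put_online_cpus_atomic() go away.
	 */
	__this_cpu_write(writer_signal, false);

	/* ... __cpu_disable(), CPU_DYING notification, unlock, as in the patch ... */

	return err;
}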
On 12/13/2012 01:06 AM, Oleg Nesterov wrote: > On 12/12, Oleg Nesterov wrote: >> >> On 12/12, Srivatsa S. Bhat wrote: >>> >>> On 12/12/2012 10:47 PM, Oleg Nesterov wrote: >>>> >>>> Why it needs to be per-cpu? It can be global and __read_mostly to avoid >>>> the false-sharing. OK, perhaps to put reader_percpu_refcnt/writer_signal >>>> into a single cacheline... >>> >>> Even I realized this (that we could use a global) after posting out the >>> series.. But do you think that it would be better to retain the per-cpu >>> variant itself, due to the cache effects? >> >> I don't really know, up to you. This was the question ;) > > But perhaps there is another reason to make it per-cpu... > > It seems we can avoid cpu_hotplug.active_writer == current check in > get/put. > > take_cpu_down() can clear this_cpu(writer_signal) right after it takes > hotplug_rwlock for writing. It runs with irqs and preemption disabled, > nobody else will ever look at writer_signal on its CPU. > Hmm.. And then the get/put_ on that CPU will increment/decrement the per-cpu refcount, but we don't care.. because we only need to ensure that they don't deadlock by taking the rwlock for read. This sounds great! Regards, Srivatsa S. Bhat -- To unsubscribe from this list: send the line "unsubscribe linux-pm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 12/13, Srivatsa S. Bhat wrote: > > On 12/13/2012 01:06 AM, Oleg Nesterov wrote: > > > > But perhaps there is another reason to make it per-cpu... Actually this is not the reason, please see below. But let me repeat, it is not that I suggest to remove "per-cpu". > > It seems we can avoid cpu_hotplug.active_writer == current check in > > get/put. > > > > take_cpu_down() can clear this_cpu(writer_signal) right after it takes > > hotplug_rwlock for writing. It runs with irqs and preemption disabled, > > nobody else will ever look at writer_signal on its CPU. > > > > Hmm.. And then the get/put_ on that CPU will increment/decrement the per-cpu > refcount, but we don't care.. because we only need to ensure that they don't > deadlock by taking the rwlock for read. Yes, but... Probably it would be more clean to simply do this_cpu_inc(reader_percpu_refcnt) after write_lock(hotplug_rwlock). This will have the same effect for get/put, and we still can make writer_signal global (if we want). And note that this will also simplify the lockdep annotations which we (imho) should add later. Ignoring all complications get_online_cpus_atomic() does: if (this_cpu_read(reader_percpu_refcnt)) this_cpu_inc(reader_percpu_refcnt); else if (!writer_signal) this_cpu_inc(reader_percpu_refcnt); // same as above else read_lock(&hotplug_rwlock); But for lockdep it should do: if (this_cpu_read(reader_percpu_refcnt)) this_cpu_inc(reader_percpu_refcnt); else if (!writer_signal) { this_cpu_inc(reader_percpu_refcnt); // pretend we take hotplug_rwlock for lockdep rwlock_acquire_read(&hotplug_rwlock.dep_map, 0, 0); } else read_lock(&hotplug_rwlock); And we need to ensure that rwlock_acquire_read() is not called under write_lock(hotplug_rwlock). If we use reader_percpu_refcnt to fool get/put, we should not worry. Oleg. -- To unsubscribe from this list: send the line "unsubscribe linux-pm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 12/13/2012 12:42 AM, Srivatsa S. Bhat wrote: > On 12/13/2012 12:18 AM, Oleg Nesterov wrote: >> On 12/13, Srivatsa S. Bhat wrote: >>> >>> On 12/12/2012 11:32 PM, Oleg Nesterov wrote: >>>> And _perhaps_ get_ can avoid it too? >>>> >>>> I didn't really try to think, probably this is not right, but can't >>>> something like this work? >>>> >>>> #define XXXX (1 << 16) >>>> #define MASK (XXXX -1) >>>> >>>> void get_online_cpus_atomic(void) >>>> { >>>> preempt_disable(); >>>> >>>> // only for writer >>>> __this_cpu_add(reader_percpu_refcnt, XXXX); >>>> >>>> if (__this_cpu_read(reader_percpu_refcnt) & MASK) { >>>> __this_cpu_inc(reader_percpu_refcnt); >>>> } else { >>>> smp_wmb(); >>>> if (writer_active()) { >>>> ... >>>> } >>>> } >>>> >>>> __this_cpu_dec(reader_percpu_refcnt, XXXX); >>>> } >>>> >>> >>> Sorry, may be I'm too blind to see, but I didn't understand the logic >>> of how the mask helps us avoid disabling interrupts.. >> >> Why do we need cli/sti at all? We should prevent the following race: >> >> - the writer already holds hotplug_rwlock, so get_ must not >> succeed. >> >> - the new reader comes, it increments reader_percpu_refcnt, >> but before it checks writer_active() ... >> >> - irq handler does get_online_cpus_atomic() and sees >> reader_nested_percpu() == T, so it simply increments >> reader_percpu_refcnt and succeeds. >> >> OTOH, why do we need to increment reader_percpu_refcnt the counter >> in advance? To ensure that either we see writer_active() or the >> writer should see reader_percpu_refcnt != 0 (and that is why they >> should write/read in reverse order). >> >> The code above tries to avoid this race using the lower 16 bits >> as a "nested-counter", and the upper bits to avoid the race with >> the writer. >> >> // only for writer >> __this_cpu_add(reader_percpu_refcnt, XXXX); >> >> If irq comes and does get_online_cpus_atomic(), it won't be confused >> by __this_cpu_add(XXXX), it will check the lower bits and switch to >> the "slow path". >> > > This is a very clever scheme indeed! :-) Thanks a lot for explaining > it in detail. > >> >> But once again, so far I didn't really try to think. It is quite >> possible I missed something. >> > > Even I don't spot anything wrong with it. But I'll give it some more > thought.. Since an interrupt handler can also run get_online_cpus_atomic(), we cannot use the __this_cpu_* versions for modifying reader_percpu_refcnt, right? To maintain the integrity of the update itself, we will have to use the this_cpu_* variant, which basically plays spoil-sport on this whole scheme... :-( But still, this scheme is better, because the reader doesn't have to spin on the read_lock() with interrupts disabled. That way, interrupt handlers that are not hotplug readers can continue to run on this CPU while taking another CPU offline. Regards, Srivatsa S. Bhat -- To unsubscribe from this list: send the line "unsubscribe linux-pm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 12/13, Srivatsa S. Bhat wrote: > > On 12/13/2012 12:42 AM, Srivatsa S. Bhat wrote: > > > > Even I don't spot anything wrong with it. But I'll give it some more > > thought.. > > Since an interrupt handler can also run get_online_cpus_atomic(), we > cannot use the __this_cpu_* versions for modifying reader_percpu_refcnt, > right? Hmm. I thought that __this_cpu_* must be safe under preempt_disable(). IOW, I thought that, say, this_cpu_inc() is "equal" to preempt_disable + __this_cpu_inc() correctness-wise. And. I thought that this_cpu_inc() is safe wrt interrupt, like local_t. But when I try to read the comments percpu.h, I am starting to think that even this_cpu_inc() is not safe if irq handler can do the same? Confused... I am shy to ask... will, say, DEFINE_PER_CPU(local_t) and local_inc(__this_cpu_ptr(...)) work?? > But still, this scheme is better, because the reader doesn't have to spin > on the read_lock() with interrupts disabled. Yes, but my main concern is that irq_disable/enable itself is not that cheap. Oleg. -- To unsubscribe from this list: send the line "unsubscribe linux-pm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 12/13/2012 09:47 PM, Oleg Nesterov wrote: > On 12/13, Srivatsa S. Bhat wrote: >> >> On 12/13/2012 12:42 AM, Srivatsa S. Bhat wrote: >>> >>> Even I don't spot anything wrong with it. But I'll give it some more >>> thought.. >> >> Since an interrupt handler can also run get_online_cpus_atomic(), we >> cannot use the __this_cpu_* versions for modifying reader_percpu_refcnt, >> right? > > Hmm. I thought that __this_cpu_* must be safe under preempt_disable(). > IOW, I thought that, say, this_cpu_inc() is "equal" to preempt_disable + > __this_cpu_inc() correctness-wise. > > And. I thought that this_cpu_inc() is safe wrt interrupt, like local_t. > > But when I try to read the comments percpu.h, I am starting to think that > even this_cpu_inc() is not safe if irq handler can do the same? > The comment seems to say that its not safe wrt interrupts. But looking at the code in include/linux/percpu.h, IIUC, that is true only about this_cpu_read() because it only disables preemption. However, this_cpu_inc() looks safe wrt interrupts because it wraps the increment within raw_local_irqsave()/restore(). > Confused... > > I am shy to ask... will, say, DEFINE_PER_CPU(local_t) and > local_inc(__this_cpu_ptr(...)) work?? > >> But still, this scheme is better, because the reader doesn't have to spin >> on the read_lock() with interrupts disabled. > > Yes, but my main concern is that irq_disable/enable itself is not that cheap. > Regards, Srivatsa S. Bhat -- To unsubscribe from this list: send the line "unsubscribe linux-pm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Hello, Oleg.

On Thu, Dec 13, 2012 at 05:17:09PM +0100, Oleg Nesterov wrote:
> Hmm. I thought that __this_cpu_* must be safe under preempt_disable().
> IOW, I thought that, say, this_cpu_inc() is "equal" to preempt_disable +
> __this_cpu_inc() correctness-wise.

this_cpu_inc() equals local_irq_save() + __this_cpu_inc().

> And. I thought that this_cpu_inc() is safe wrt interrupt, like local_t.

Yes, it is safe.

> But when I try to read the comments percpu.h, I am starting to think that
> even this_cpu_inc() is not safe if irq handler can do the same?
>
> Confused...

Yeah, the comment is confusing and the way these macros are defined
doesn't help. There used to be three variants and it looks like we
didn't update the comment while removing the preempt safe ones. Gotta
clean those up. Anyways, yes, this_cpu_*() are safe against irqs.

Thanks.
On 12/13, Srivatsa S. Bhat wrote: > > On 12/13/2012 09:47 PM, Oleg Nesterov wrote: > > On 12/13, Srivatsa S. Bhat wrote: > >> > >> On 12/13/2012 12:42 AM, Srivatsa S. Bhat wrote: > >>> > >>> Even I don't spot anything wrong with it. But I'll give it some more > >>> thought.. > >> > >> Since an interrupt handler can also run get_online_cpus_atomic(), we > >> cannot use the __this_cpu_* versions for modifying reader_percpu_refcnt, > >> right? > > > > Hmm. I thought that __this_cpu_* must be safe under preempt_disable(). > > IOW, I thought that, say, this_cpu_inc() is "equal" to preempt_disable + > > __this_cpu_inc() correctness-wise. > > > > And. I thought that this_cpu_inc() is safe wrt interrupt, like local_t. > > > > But when I try to read the comments percpu.h, I am starting to think that > > even this_cpu_inc() is not safe if irq handler can do the same? > > > > The comment seems to say that its not safe wrt interrupts. But looking at > the code in include/linux/percpu.h, IIUC, that is true only about > this_cpu_read() because it only disables preemption. > > However, this_cpu_inc() looks safe wrt interrupts because it wraps the > increment within raw_local_irqsave()/restore(). You mean _this_cpu_generic_to_op() I guess. So yes, I think you are right, this_cpu_* should be irq-safe, but __this_cpu_* is not. Thanks. At least on x86 there is no difference between this_ and __this_, both do percpu_add_op() without local_irq_disable/enable. But it seems that most of architectures use generic code. Oleg. -- To unsubscribe from this list: send the line "unsubscribe linux-pm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
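For reference, the generic fallback mentioned above is essentially the following (paraphrased from include/linux/percpu.h of that era; exact details vary by kernel version), which is why the this_cpu_*() forms end up irq-safe wherever the architecture does not provide a cheaper implementation:

#define _this_cpu_generic_to_op(pcp, val, op)				\
do {									\
	unsigned long flags;						\
	raw_local_irq_save(flags);					\
	*__this_cpu_ptr(&(pcp)) op val;					\
	raw_local_irq_restore(flags);					\
} while (0)

/*
 * The __this_cpu_*() forms, by contrast, do the read-modify-write with
 * no irq protection at all, so an interrupt can slip in between the load
 * and the store on architectures that use the generic code.
 */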
On 12/14/2012 11:33 PM, Oleg Nesterov wrote: > On 12/13, Srivatsa S. Bhat wrote: >> >> On 12/13/2012 09:47 PM, Oleg Nesterov wrote: >>> On 12/13, Srivatsa S. Bhat wrote: >>>> >>>> On 12/13/2012 12:42 AM, Srivatsa S. Bhat wrote: >>>>> >>>>> Even I don't spot anything wrong with it. But I'll give it some more >>>>> thought.. >>>> >>>> Since an interrupt handler can also run get_online_cpus_atomic(), we >>>> cannot use the __this_cpu_* versions for modifying reader_percpu_refcnt, >>>> right? >>> >>> Hmm. I thought that __this_cpu_* must be safe under preempt_disable(). >>> IOW, I thought that, say, this_cpu_inc() is "equal" to preempt_disable + >>> __this_cpu_inc() correctness-wise. >>> >>> And. I thought that this_cpu_inc() is safe wrt interrupt, like local_t. >>> >>> But when I try to read the comments percpu.h, I am starting to think that >>> even this_cpu_inc() is not safe if irq handler can do the same? >>> >> >> The comment seems to say that its not safe wrt interrupts. But looking at >> the code in include/linux/percpu.h, IIUC, that is true only about >> this_cpu_read() because it only disables preemption. >> >> However, this_cpu_inc() looks safe wrt interrupts because it wraps the >> increment within raw_local_irqsave()/restore(). > > You mean _this_cpu_generic_to_op() I guess. So yes, I think you are right, > this_cpu_* should be irq-safe, but __this_cpu_* is not. > Yes. > Thanks. > > At least on x86 there is no difference between this_ and __this_, both do > percpu_add_op() without local_irq_disable/enable. But it seems that most > of architectures use generic code. > So now that we can't avoid disabling and enabling interrupts, I was wondering if we could exploit this to avoid the smp_mb().. Maybe this is a stupid question, but I'll shoot it anyway... Does local_irq_disable()/enable provide any ordering guarantees by any chance? I think the answer is no, but if it is yes, I guess we can do as shown below to ensure that STORE(reader_percpu_refcnt) happens before LOAD(writer_signal). void get_online_cpus_atomic(void) { unsigned long flags; preempt_disable(); //only for writer local_irq_save(flags); __this_cpu_add(reader_percpu_refcnt, XXXX); local_irq_restore(flags); //no need of an explicit smp_mb() if (__this_cpu_read(reader_percpu_refcnt) & MASK) { this_cpu_inc(reader_percpu_refcnt); } else if (writer_active()) { ... } this_cpu_dec(reader_percpu_refcnt, XXXX); } I tried thinking about other ways to avoid that smp_mb() in the reader, but was unsuccessful. So if the above assumption is wrong, I guess we'll just have to go with the version that uses synchronize_sched() at the writer-side. Regards, Srivatsa S. Bhat -- To unsubscribe from this list: send the line "unsubscribe linux-pm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 12/18, Srivatsa S. Bhat wrote: > > So now that we can't avoid disabling and enabling interrupts, Still I think it would be better to not use local_irq_save/restore directly. And, > I was > wondering if we could exploit this to avoid the smp_mb().. > > Maybe this is a stupid question, but I'll shoot it anyway... > Does local_irq_disable()/enable provide any ordering guarantees by any chance? Oh, I do not know. But please look at the comment above prepare_to_wait(). It assumes that even spin_unlock_irqrestore() is not the full barrier. In any case. get_online_cpus_atomic() has to use irq_restore, not irq_enable. And _restore does nothing "special" if irqs were already disabled, so I think we can't rely on sti. > I tried thinking about other ways to avoid that smp_mb() in the reader, Just in case, I think there is no way to avoid mb() in _get (although perhaps it can be "implicit"). The writer changes cpu_online_mask and drops the lock. We need to ensure that the reader sees the change in cpu_online_mask after _get returns. > but was unsuccessful. So if the above assumption is wrong, I guess we'll > just have to go with the version that uses synchronize_sched() at the > writer-side. In this case we can also convert get_online_cpus() to use percpu_rwsem and avoid mutex_lock(&cpu_hotplug.lock), but this is minor I guess. I do not think get_online_cpus() is called too often. Oleg. -- To unsubscribe from this list: send the line "unsubscribe linux-pm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 12/19/2012 01:13 AM, Oleg Nesterov wrote: > On 12/18, Srivatsa S. Bhat wrote: >> >> So now that we can't avoid disabling and enabling interrupts, > > Still I think it would be better to not use local_irq_save/restore > directly. Sure, we can use this_cpu_add() itself. I explicitly used local_irq_save/restore here just to explain my question. > And, > >> I was >> wondering if we could exploit this to avoid the smp_mb().. >> >> Maybe this is a stupid question, but I'll shoot it anyway... >> Does local_irq_disable()/enable provide any ordering guarantees by any chance? > > Oh, I do not know. > > But please look at the comment above prepare_to_wait(). It assumes > that even spin_unlock_irqrestore() is not the full barrier. > Semi-permeable barrier.. Hmm.. > In any case. get_online_cpus_atomic() has to use irq_restore, not > irq_enable. And _restore does nothing "special" if irqs were already > disabled, so I think we can't rely on sti. > Right, I forgot about the _restore part. >> I tried thinking about other ways to avoid that smp_mb() in the reader, > > Just in case, I think there is no way to avoid mb() in _get (although > perhaps it can be "implicit"). > Actually, I was trying to avoid mb() in the _fastpath_, when there is no active writer. I missed stating that clearly, sorry. > The writer changes cpu_online_mask and drops the lock. We need to ensure > that the reader sees the change in cpu_online_mask after _get returns. > The write_unlock() will ensure the completion of the update to cpu_online_mask, using the same semi-permeable logic that you pointed above. So readers will see the update as soon as the writer releases the lock, right? >> but was unsuccessful. So if the above assumption is wrong, I guess we'll >> just have to go with the version that uses synchronize_sched() at the >> writer-side. > > In this case we can also convert get_online_cpus() to use percpu_rwsem > and avoid mutex_lock(&cpu_hotplug.lock), but this is minor I guess. > I do not think get_online_cpus() is called too often. > Yes, we could do that as well. I remember you saying that you had some patches for percpu_rwsem to help use it in cpu hotplug code (to make it recursive, IIRC). So, I guess we'll go with the synchronize_sched() approach for percpu rwlocks then. Tejun, it is still worthwhile to expose this as a generic percpu rwlock and then use it inside cpu hotplug code, right? Regards, Srivatsa S. Bhat -- To unsubscribe from this list: send the line "unsubscribe linux-pm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
(sorry if you see this email twice) On 12/19, Srivatsa S. Bhat wrote: > > On 12/19/2012 01:13 AM, Oleg Nesterov wrote: > > >> I tried thinking about other ways to avoid that smp_mb() in the reader, > > > > Just in case, I think there is no way to avoid mb() in _get (although > > perhaps it can be "implicit"). > > > > Actually, I was trying to avoid mb() in the _fastpath_, when there is no > active writer. I missed stating that clearly, sorry. Yes, I meant the fastpath too, > > The writer changes cpu_online_mask and drops the lock. We need to ensure > > that the reader sees the change in cpu_online_mask after _get returns. > > > > The write_unlock() will ensure the completion of the update to cpu_online_mask, > using the same semi-permeable logic that you pointed above. So readers will > see the update as soon as the writer releases the lock, right? Why? Once again, the writer (say) removes CPU_1 from cpu_online_mask, and sets writer_signal[0] = 0, this "enables" fast path on CPU_0. But, without a barrier, there is no guarantee that CPU_0 will see the change in cpu_online_mask after get_online_cpus_atomic() checks writer_signal[0] and returns. Hmm. And this means that the last version lacks the additional rmb() (at least) if writer_active() == F. Oleg. -- To unsubscribe from this list: send the line "unsubscribe linux-pm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 12/19/2012 10:09 PM, Oleg Nesterov wrote: > (sorry if you see this email twice) > > On 12/19, Srivatsa S. Bhat wrote: >> >> On 12/19/2012 01:13 AM, Oleg Nesterov wrote: >> >>>> I tried thinking about other ways to avoid that smp_mb() in the reader, >>> >>> Just in case, I think there is no way to avoid mb() in _get (although >>> perhaps it can be "implicit"). >>> >> >> Actually, I was trying to avoid mb() in the _fastpath_, when there is no >> active writer. I missed stating that clearly, sorry. > > Yes, I meant the fastpath too, > >>> The writer changes cpu_online_mask and drops the lock. We need to ensure >>> that the reader sees the change in cpu_online_mask after _get returns. >>> >> >> The write_unlock() will ensure the completion of the update to cpu_online_mask, >> using the same semi-permeable logic that you pointed above. So readers will >> see the update as soon as the writer releases the lock, right? > > Why? > > Once again, the writer (say) removes CPU_1 from cpu_online_mask, and > sets writer_signal[0] = 0, this "enables" fast path on CPU_0. > > But, without a barrier, there is no guarantee that CPU_0 will see the > change in cpu_online_mask after get_online_cpus_atomic() checks > writer_signal[0] and returns. > Hmm, because a reader's code can get reordered to something like: read cpu_online_mask get_online_cpus_atomic() You are right, I missed that. > Hmm. And this means that the last version lacks the additional rmb() > (at least) if writer_active() == F. > Yes, the fast path needs that smp_rmb(), whereas the slow-path is fine, because it takes the read_lock(). BTW, there is a small problem with the synchronize_sched() approach: We can't generalize the synchronization scheme as a generic percpu rwlock because the writer's role is split into 2, the first part (the one having synchronize_sched()) which should be run in process context, and the second part (the rest of it), which can be run in atomic context. That is, needing the writer to be able to sleep (due to synchronize_sched()) will make the locking scheme unusable (in other usecases) I think. And the split (blocking and non-blocking part) itself seems hard to express as a single writer API. Hmmm.. so what do we do? Shall we say "We anyway have to do smp_rmb() in the reader in the fast path. Instead let's do a full smp_mb() and get rid of the synchronize_sched() in the writer, and thereby expose this synchronization scheme as generic percpu rwlocks?" Or should we rather use synchronize_sched() and keep this synchronization scheme restricted to CPU hotplug only? Regards, Srivatsa S. Bhat -- To unsubscribe from this list: send the line "unsubscribe linux-pm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 12/19, Srivatsa S. Bhat wrote: > > BTW, there is a small problem with the synchronize_sched() approach: > We can't generalize the synchronization scheme as a generic percpu rwlock > because the writer's role is split into 2, the first part (the one having > synchronize_sched()) which should be run in process context, and the > second part (the rest of it), which can be run in atomic context. Yes, > That is, needing the writer to be able to sleep (due to synchronize_sched()) > will make the locking scheme unusable (in other usecases) I think. And > the split (blocking and non-blocking part) itself seems hard to express > as a single writer API. I do not think this is the problem. We need 2 helpers for writer, the 1st one does synchronize_sched() and the 2nd one takes rwlock. A generic percpu_write_lock() simply calls them both. In fact I think that the 1st helper should not do synchronize_sched(), and (say) cpu_hotplug_begin() should call it itself. Because if we also change cpu_hotplug_begin() to use percpu_rw_sem we do not want to do synchronize_sched() twice. But lets ignore this for now. But, > Hmmm.. so what do we do? Shall we say "We anyway have to do smp_rmb() in the > reader in the fast path. Instead let's do a full smp_mb() and get rid of > the synchronize_sched() in the writer, and thereby expose this synchronization > scheme as generic percpu rwlocks?" Or should we rather use synchronize_sched() > and keep this synchronization scheme restricted to CPU hotplug only? Oh, I do not know ;) To me, the main question is: can we use synchronize_sched() in cpu_down? It is slow. Oleg. -- To unsubscribe from this list: send the line "unsubscribe linux-pm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
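A sketch of the writer-side split being discussed, with made-up helper names; only the shape matters here, and announce_cpu_offline_begin()/sync_all_readers() are the helpers from the patch at the bottom of the thread:

/* Part 1: process context, may sleep. */
static void percpu_hotplug_writer_begin(void)
{
	announce_cpu_offline_begin();	/* set writer_signal everywhere */

	/*
	 * After a full sched grace period, any reader section that starts
	 * from now on is guaranteed to see writer_signal and take
	 * hotplug_rwlock instead of the per-cpu fast path...
	 */
	synchronize_sched();

	/* ...so only the readers that got in earlier remain to be waited for. */
	sync_all_readers();
}

/* Part 2: safe to call from atomic context. */
static void percpu_hotplug_writer_lock(unsigned long *flags)
{
	write_lock_irqsave(&hotplug_rwlock, *flags);
}

/*
 * A generic percpu_write_lock() would simply call both parts, and would
 * therefore have to be might_sleep() -- which is the asymmetry with the
 * atomic reader side being debated above.
 */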
On 12/20/2012 12:44 AM, Oleg Nesterov wrote: > On 12/19, Srivatsa S. Bhat wrote: >> >> BTW, there is a small problem with the synchronize_sched() approach: >> We can't generalize the synchronization scheme as a generic percpu rwlock >> because the writer's role is split into 2, the first part (the one having >> synchronize_sched()) which should be run in process context, and the >> second part (the rest of it), which can be run in atomic context. > > Yes, > >> That is, needing the writer to be able to sleep (due to synchronize_sched()) >> will make the locking scheme unusable (in other usecases) I think. And >> the split (blocking and non-blocking part) itself seems hard to express >> as a single writer API. > > I do not think this is the problem. > > We need 2 helpers for writer, the 1st one does synchronize_sched() and the > 2nd one takes rwlock. A generic percpu_write_lock() simply calls them both. > Ah, that's the problem no? Users of reader-writer locks expect to run in atomic context (ie., they don't want to sleep). We can't expose an API that can make the task go to sleep under the covers! (And that too, definitely not with a name such as "percpu_write_lock_irqsave()" ... because we are going to be interrupt-safe as well, in the second part...). > In fact I think that the 1st helper should not do synchronize_sched(), > and (say) cpu_hotplug_begin() should call it itself. Because if we also > change cpu_hotplug_begin() to use percpu_rw_sem we do not want to do > synchronize_sched() twice. But lets ignore this for now. > Yeah, let's ignore this for now, else I'll get all confused ;-) > But, > >> Hmmm.. so what do we do? Shall we say "We anyway have to do smp_rmb() in the >> reader in the fast path. Instead let's do a full smp_mb() and get rid of >> the synchronize_sched() in the writer, and thereby expose this synchronization >> scheme as generic percpu rwlocks?" Or should we rather use synchronize_sched() >> and keep this synchronization scheme restricted to CPU hotplug only? > > Oh, I do not know ;) > > To me, the main question is: can we use synchronize_sched() in cpu_down? > It is slow. > Haha :-) So we don't want smp_mb() in the reader, *and* also don't want synchronize_sched() in the writer! Sounds like saying we want to have the cake and eat it too ;-) :P But seriously speaking, I understand that its a bit of a hard choice to make. On one side, we can avoid synchronize_sched() at the writer, but have to bear the burden of smp_mb() at the reader even in the fastpath, when no writer is active. On the other side, we can make the reader avoid smp_mb(), but the writer has to suffer a full synchronize_sched(). But an important point is that, at the writer side, we already wait for so many mutex locks and stuff, like in the CPU_DOWN_PREPARE stage (as determined by the callbacks). So adding a synchronize_sched() to the mix shouldn't make it noticeably bad, isn't it? (I'm not arguing in favor of using synchronize_sched(). I'm just trying to point out that using it might not be as bad as we think it is, in the CPU hotplug case. And moreover, since I'm still not convinced about the writer API part if use synchronize_sched(), I'd rather avoid synchronize_sched().) Regards, Srivatsa S. Bhat -- To unsubscribe from this list: send the line "unsubscribe linux-pm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 12/20, Srivatsa S. Bhat wrote: > > On 12/20/2012 12:44 AM, Oleg Nesterov wrote: > > > > We need 2 helpers for writer, the 1st one does synchronize_sched() and the > > 2nd one takes rwlock. A generic percpu_write_lock() simply calls them both. > > > > Ah, that's the problem no? Users of reader-writer locks expect to run in > atomic context (ie., they don't want to sleep). Ah, I misunderstood. Sure, percpu_write_lock() should be might_sleep(), and this is not symmetric to percpu_read_lock(). > We can't expose an API that > can make the task go to sleep under the covers! Why? Just this should be documented. However I would not worry until we find another user. Until then we do not even need to add percpu_write_lock or try to generalize this code too much. > > To me, the main question is: can we use synchronize_sched() in cpu_down? > > It is slow. > > > > Haha :-) So we don't want smp_mb() in the reader, We need mb() + rmb(). Plust cli/sti unless this arch has optimized this_cpu_add() like x86 (as you pointed out). > *and* also don't want > synchronize_sched() in the writer! Sounds like saying we want to have the cake > and eat it too ;-) :P Personally I'd vote for synchronize_sched() but I am not sure. And I do not really understand the problem space. > And moreover, since I'm still not convinced about the writer API part if use > synchronize_sched(), I'd rather avoid synchronize_sched().) Understand. And yes, synchronize_sched() adds more problems. For example, where should we call it? I do not this _cpu_down() should do this, in this case, say, disable_nonboot_cpus() needs num_online_cpus() synchronize_sched's. So probably cpu_down() should call it before cpu_maps_update_begin(), this makes the locking even less obvious. In short. What I am trying to say is, don't ask me I do not know ;) Oleg. -- To unsubscribe from this list: send the line "unsubscribe linux-pm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 12/20/2012 07:12 PM, Oleg Nesterov wrote: > On 12/20, Srivatsa S. Bhat wrote: >> >> On 12/20/2012 12:44 AM, Oleg Nesterov wrote: >>> >>> We need 2 helpers for writer, the 1st one does synchronize_sched() and the >>> 2nd one takes rwlock. A generic percpu_write_lock() simply calls them both. >>> >> >> Ah, that's the problem no? Users of reader-writer locks expect to run in >> atomic context (ie., they don't want to sleep). > > Ah, I misunderstood. > > Sure, percpu_write_lock() should be might_sleep(), and this is not > symmetric to percpu_read_lock(). > >> We can't expose an API that >> can make the task go to sleep under the covers! > > Why? Just this should be documented. However I would not worry until we > find another user. Until then we do not even need to add percpu_write_lock > or try to generalize this code too much. > Hmm.. But considering the disable_nonboot_cpus() case you mentioned below, I'm only getting farther away from using synchronize_sched() ;-) And that also makes it easier to expose a generic percpu rwlock API, like Tejun was suggesting. So I'll give it a shot. >>> To me, the main question is: can we use synchronize_sched() in cpu_down? >>> It is slow. >>> >> >> Haha :-) So we don't want smp_mb() in the reader, > > We need mb() + rmb(). Plust cli/sti unless this arch has optimized > this_cpu_add() like x86 (as you pointed out). > >> *and* also don't want >> synchronize_sched() in the writer! Sounds like saying we want to have the cake >> and eat it too ;-) :P > > Personally I'd vote for synchronize_sched() but I am not sure. And I do > not really understand the problem space. > >> And moreover, since I'm still not convinced about the writer API part if use >> synchronize_sched(), I'd rather avoid synchronize_sched().) > > Understand. > > And yes, synchronize_sched() adds more problems. For example, where should > we call it? I do not this _cpu_down() should do this, in this case, say, > disable_nonboot_cpus() needs num_online_cpus() synchronize_sched's. > Ouch! I should have seen that coming! > So probably cpu_down() should call it before cpu_maps_update_begin(), this > makes the locking even less obvious. > True. > In short. What I am trying to say is, don't ask me I do not know ;) > OK then, I'll go with what I believe is a reasonably good way (not necessarily the best way) to deal with this: I'll avoid the use of synchronize_sched(), expose a decent-looking percpu rwlock implementation, use it in CPU hotplug and get rid of stop_machine(). That would certainly be a good starting base, IMHO. Regards, Srivatsa S. Bhat -- To unsubscribe from this list: send the line "unsubscribe linux-pm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 12/20/2012 07:12 PM, Oleg Nesterov wrote: > On 12/20, Srivatsa S. Bhat wrote: >> >> On 12/20/2012 12:44 AM, Oleg Nesterov wrote: >>> >>> We need 2 helpers for writer, the 1st one does synchronize_sched() and the >>> 2nd one takes rwlock. A generic percpu_write_lock() simply calls them both. >>> >> >> Ah, that's the problem no? Users of reader-writer locks expect to run in >> atomic context (ie., they don't want to sleep). > > Ah, I misunderstood. > > Sure, percpu_write_lock() should be might_sleep(), and this is not > symmetric to percpu_read_lock(). > >> We can't expose an API that >> can make the task go to sleep under the covers! > > Why? Just this should be documented. However I would not worry until we > find another user. Until then we do not even need to add percpu_write_lock > or try to generalize this code too much. > >>> To me, the main question is: can we use synchronize_sched() in cpu_down? >>> It is slow. >>> >> >> Haha :-) So we don't want smp_mb() in the reader, > > We need mb() + rmb(). Plust cli/sti unless this arch has optimized > this_cpu_add() like x86 (as you pointed out). > Hey, IIUC, we actually don't need mb() in the reader!! Just an rmb() will do. This is the reader code I have so far: #define reader_nested_percpu() \ (__this_cpu_read(reader_percpu_refcnt) & READER_REFCNT_MASK) #define writer_active() \ (__this_cpu_read(writer_signal)) #define READER_PRESENT (1UL << 16) #define READER_REFCNT_MASK (READER_PRESENT - 1) void get_online_cpus_atomic(void) { preempt_disable(); /* * First and foremost, make your presence known to the writer. */ this_cpu_add(reader_percpu_refcnt, READER_PRESENT); /* * If we are already using per-cpu refcounts, it is not safe to switch * the synchronization scheme. So continue using the refcounts. */ if (reader_nested_percpu()) { this_cpu_inc(reader_percpu_refcnt); } else { smp_rmb(); if (unlikely(writer_active())) { ... //take hotplug_rwlock } } ... /* Prevent reordering of any subsequent reads of cpu_online_mask. */ smp_rmb(); } The smp_rmb() before writer_active() ensures that LOAD(writer_signal) follows LOAD(reader_percpu_refcnt) (at the 'if' condition). And in turn, that load is automatically going to follow the STORE(reader_percpu_refcnt) (at this_cpu_add()) due to the data dependency. So it is something like a transitive relation. So, the result is that, we mark ourselves as active in reader_percpu_refcnt before we check writer_signal. This is exactly what we wanted to do right? And luckily, due to the dependency, we can achieve it without using the heavy smp_mb(). And, we can't crib about the smp_rmb() because it is unavoidable anyway (because we want to prevent reordering of the reads to cpu_online_mask, like you pointed out earlier). I hope I'm not missing anything... Regards, Srivatsa S. Bhat -- To unsubscribe from this list: send the line "unsubscribe linux-pm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 12/23, Srivatsa S. Bhat wrote: > > On 12/20/2012 07:12 PM, Oleg Nesterov wrote: > > > > We need mb() + rmb(). Plust cli/sti unless this arch has optimized > > this_cpu_add() like x86 (as you pointed out). > > > > Hey, IIUC, we actually don't need mb() in the reader!! Just an rmb() will do. Well. I don't think so. But when it comes to the barriers I am never sure until Paul confirms my understanding ;) > #define reader_nested_percpu() \ > (__this_cpu_read(reader_percpu_refcnt) & READER_REFCNT_MASK) > > #define writer_active() \ > (__this_cpu_read(writer_signal)) > > > #define READER_PRESENT (1UL << 16) > #define READER_REFCNT_MASK (READER_PRESENT - 1) > > void get_online_cpus_atomic(void) > { > preempt_disable(); > > /* > * First and foremost, make your presence known to the writer. > */ > this_cpu_add(reader_percpu_refcnt, READER_PRESENT); > > /* > * If we are already using per-cpu refcounts, it is not safe to switch > * the synchronization scheme. So continue using the refcounts. > */ > if (reader_nested_percpu()) { > this_cpu_inc(reader_percpu_refcnt); > } else { > smp_rmb(); > if (unlikely(writer_active())) { > ... //take hotplug_rwlock > } > } > > ... > > /* Prevent reordering of any subsequent reads of cpu_online_mask. */ > smp_rmb(); > } > > The smp_rmb() before writer_active() ensures that LOAD(writer_signal) follows > LOAD(reader_percpu_refcnt) (at the 'if' condition). And in turn, that load is > automatically going to follow the STORE(reader_percpu_refcnt) But why this STORE should be visible on another CPU before we LOAD(writer_signal)? Lets discuss the simple and artificial example. Suppose we have int X, Y; int func(void) { X = 1; // suppose that nobody else can change it mb(); return Y; } Now you are saying that we can change it and avoid the costly mb(): int func(void) { X = 1; if (X != 1) BUG(); rmb(); return Y; } I doubt. rmb() can only guarantee that the preceding LOAD's should be completed. Without mb() it is possible that this CPU won't write X to memory at all. Oleg. -- To unsubscribe from this list: send the line "unsubscribe linux-pm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
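Put differently, the reader/writer pair being discussed is the classic "store buffering" pattern, with the actual variable names filled in:

/*
 *   CPU 0 (reader)                        CPU 1 (writer)
 *   -----------------------------------   -------------------------------------
 *   this_cpu_add(reader_percpu_refcnt,    per_cpu(writer_signal, 0) = true;
 *                READER_PRESENT);         smp_mb();
 *   smp_rmb();   <-- orders LOADs only    r1 = per_cpu(reader_percpu_refcnt, 0);
 *   r0 = writer_signal;
 *
 * Without a full smp_mb() on CPU 0 between its STORE and its LOAD, the
 * outcome "r0 == false && r1 == 0" is permitted: CPU 0's increment can
 * still be sitting in its store buffer while it reads writer_signal, so
 * the reader misses the writer and the writer misses the reader.
 */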
On 12/23/2012 10:12 PM, Oleg Nesterov wrote: > On 12/23, Srivatsa S. Bhat wrote: >> >> On 12/20/2012 07:12 PM, Oleg Nesterov wrote: >>> >>> We need mb() + rmb(). Plust cli/sti unless this arch has optimized >>> this_cpu_add() like x86 (as you pointed out). >>> >> >> Hey, IIUC, we actually don't need mb() in the reader!! Just an rmb() will do. > > Well. I don't think so. But when it comes to the barriers I am never sure > until Paul confirms my understanding ;) > >> #define reader_nested_percpu() \ >> (__this_cpu_read(reader_percpu_refcnt) & READER_REFCNT_MASK) >> >> #define writer_active() \ >> (__this_cpu_read(writer_signal)) >> >> >> #define READER_PRESENT (1UL << 16) >> #define READER_REFCNT_MASK (READER_PRESENT - 1) >> >> void get_online_cpus_atomic(void) >> { >> preempt_disable(); >> >> /* >> * First and foremost, make your presence known to the writer. >> */ >> this_cpu_add(reader_percpu_refcnt, READER_PRESENT); >> >> /* >> * If we are already using per-cpu refcounts, it is not safe to switch >> * the synchronization scheme. So continue using the refcounts. >> */ >> if (reader_nested_percpu()) { >> this_cpu_inc(reader_percpu_refcnt); >> } else { >> smp_rmb(); >> if (unlikely(writer_active())) { >> ... //take hotplug_rwlock >> } >> } >> >> ... >> >> /* Prevent reordering of any subsequent reads of cpu_online_mask. */ >> smp_rmb(); >> } >> >> The smp_rmb() before writer_active() ensures that LOAD(writer_signal) follows >> LOAD(reader_percpu_refcnt) (at the 'if' condition). And in turn, that load is >> automatically going to follow the STORE(reader_percpu_refcnt) > > But why this STORE should be visible on another CPU before we LOAD(writer_signal)? > > Lets discuss the simple and artificial example. Suppose we have > > int X, Y; > > int func(void) > { > X = 1; // suppose that nobody else can change it > mb(); > return Y; > } > > Now you are saying that we can change it and avoid the costly mb(): > > int func(void) > { > X = 1; > > if (X != 1) > BUG(); > > rmb(); > return Y; > } > > I doubt. rmb() can only guarantee that the preceding LOAD's should be > completed. Without mb() it is possible that this CPU won't write X to > memory at all. > Oh, ok :-( Thanks for correcting me and for the detailed explanation! For a moment, I really thought we had it solved at last! ;-( Regards, Srivatsa S. Bhat -- To unsubscribe from this list: send the line "unsubscribe linux-pm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
diff --git a/include/linux/cpu.h b/include/linux/cpu.h
index ce7a074..cf24da1 100644
--- a/include/linux/cpu.h
+++ b/include/linux/cpu.h
@@ -175,6 +175,8 @@ extern struct bus_type cpu_subsys;
 
 extern void get_online_cpus(void);
 extern void put_online_cpus(void);
+extern void get_online_cpus_atomic(void);
+extern void put_online_cpus_atomic(void);
 #define hotcpu_notifier(fn, pri)	cpu_notifier(fn, pri)
 #define register_hotcpu_notifier(nb)	register_cpu_notifier(nb)
 #define unregister_hotcpu_notifier(nb)	unregister_cpu_notifier(nb)
@@ -198,6 +200,8 @@ static inline void cpu_hotplug_driver_unlock(void)
 
 #define get_online_cpus()	do { } while (0)
 #define put_online_cpus()	do { } while (0)
+#define get_online_cpus_atomic()	do { } while (0)
+#define put_online_cpus_atomic()	do { } while (0)
 #define hotcpu_notifier(fn, pri)	do { (void)(fn); } while (0)
 /* These aren't inline functions due to a GCC bug. */
 #define register_hotcpu_notifier(nb)	({ (void)(nb); 0; })
diff --git a/kernel/cpu.c b/kernel/cpu.c
index 42bd331..5a63296 100644
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -133,6 +133,119 @@ static void cpu_hotplug_done(void)
 	mutex_unlock(&cpu_hotplug.lock);
 }
 
+/*
+ * Reader-writer lock to synchronize between atomic hotplug readers
+ * and the CPU offline hotplug writer.
+ */
+static DEFINE_RWLOCK(hotplug_rwlock);
+
+static DEFINE_PER_CPU(int, reader_percpu_refcnt);
+static DEFINE_PER_CPU(bool, writer_signal);
+
+
+#define reader_uses_percpu_refcnt(cpu)					\
+	(ACCESS_ONCE(per_cpu(reader_percpu_refcnt, cpu)))
+
+#define reader_nested_percpu()						\
+	(__this_cpu_read(reader_percpu_refcnt) > 1)
+
+#define writer_active()							\
+	(__this_cpu_read(writer_signal))
+
+
+
+/*
+ * Invoked by hotplug reader, to prevent CPUs from going offline.
+ *
+ * If there are no CPU offline writers active, just increment the
+ * per-cpu counter 'reader_percpu_refcnt' and proceed.
+ *
+ * If a CPU offline hotplug writer is active, we'll need to switch from
+ * per-cpu refcounts to the global rwlock, when the time is right.
+ *
+ * It is not safe to switch the synchronization scheme when we are
+ * already in a read-side critical section which uses per-cpu refcounts.
+ * Also, we don't want to allow heterogeneous readers to nest inside
+ * each other, to avoid complications in put_online_cpus_atomic().
+ *
+ * Once you switch, keep using the rwlocks for synchronization, until
+ * the writer signals the end of CPU offline.
+ *
+ * You can call this recursively, without fear of locking problems.
+ *
+ * Returns with preemption disabled.
+ */
+void get_online_cpus_atomic(void)
+{
+	unsigned long flags;
+
+	preempt_disable();
+
+	if (cpu_hotplug.active_writer == current)
+		return;
+
+	local_irq_save(flags);
+
+	/*
+	 * Use the percpu refcounts by default. Switch over to rwlock (if
+	 * necessary) later on. This helps avoid several race conditions
+	 * as well.
+	 */
+	__this_cpu_inc(reader_percpu_refcnt);
+
+	smp_rmb(); /* Paired with smp_mb() in announce_cpu_offline_begin(). */
+
+	/*
+	 * We must not allow heterogeneous nesting of readers (ie., readers
+	 * using percpu refcounts to nest with readers using rwlocks).
+	 * So don't switch the synchronization scheme if we are currently
+	 * using perpcu refcounts.
+	 */
+	if (!reader_nested_percpu() && unlikely(writer_active())) {
+
+		read_lock(&hotplug_rwlock);
+
+		/*
+		 * We might have raced with a writer going inactive before we
+		 * took the read-lock. So re-evaluate whether we still need to
+		 * use the rwlock or if we can switch back to percpu refcounts.
+		 * (This also helps avoid heterogeneous nesting of readers).
+		 */
+		if (writer_active())
+			__this_cpu_dec(reader_percpu_refcnt);
+		else
+			read_unlock(&hotplug_rwlock);
+	}
+
+	local_irq_restore(flags);
+}
+EXPORT_SYMBOL_GPL(get_online_cpus_atomic);
+
+void put_online_cpus_atomic(void)
+{
+	unsigned long flags;
+
+	if (cpu_hotplug.active_writer == current)
+		goto out;
+
+	local_irq_save(flags);
+
+	/*
+	 * We never allow heterogeneous nesting of readers. So it is trivial
+	 * to find out the kind of reader we are, and undo the operation
+	 * done by our corresponding get_online_cpus_atomic().
+	 */
+	if (__this_cpu_read(reader_percpu_refcnt))
+		__this_cpu_dec(reader_percpu_refcnt);
+	else
+		read_unlock(&hotplug_rwlock);
+
+	local_irq_restore(flags);
+out:
+	preempt_enable();
+}
+EXPORT_SYMBOL_GPL(put_online_cpus_atomic);
+
 #else /* #if CONFIG_HOTPLUG_CPU */
 static void cpu_hotplug_begin(void) {}
 static void cpu_hotplug_done(void) {}
@@ -237,6 +350,61 @@ static inline void check_for_tasks(int cpu)
 	write_unlock_irq(&tasklist_lock);
 }
 
+static inline void raise_writer_signal(unsigned int cpu)
+{
+	per_cpu(writer_signal, cpu) = true;
+}
+
+static inline void drop_writer_signal(unsigned int cpu)
+{
+	per_cpu(writer_signal, cpu) = false;
+}
+
+static void announce_cpu_offline_begin(void)
+{
+	unsigned int cpu;
+
+	for_each_online_cpu(cpu)
+		raise_writer_signal(cpu);
+
+	smp_mb();
+}
+
+static void announce_cpu_offline_end(unsigned int dead_cpu)
+{
+	unsigned int cpu;
+
+	drop_writer_signal(dead_cpu);
+
+	for_each_online_cpu(cpu)
+		drop_writer_signal(cpu);
+
+	smp_mb();
+}
+
+/*
+ * Wait for the reader to see the writer's signal and switch from percpu
+ * refcounts to global rwlock.
+ *
+ * If the reader is still using percpu refcounts, wait for him to switch.
+ * Else, we can safely go ahead, because either the reader has already
+ * switched over, or the next atomic hotplug reader who comes along on this
+ * CPU will notice the writer's signal and will switch over to the rwlock.
+ */
+static inline void sync_atomic_reader(unsigned int cpu)
+{
+	while (reader_uses_percpu_refcnt(cpu))
+		cpu_relax();
+}
+
+static void sync_all_readers(void)
+{
+	unsigned int cpu;
+
+	for_each_online_cpu(cpu)
+		sync_atomic_reader(cpu);
+}
+
 struct take_cpu_down_param {
 	unsigned long mod;
 	void *hcpu;
@@ -246,15 +414,45 @@ struct take_cpu_down_param {
 static int __ref take_cpu_down(void *_param)
 {
 	struct take_cpu_down_param *param = _param;
-	int err;
+	unsigned long flags;
+	unsigned int cpu = (long)(param->hcpu);
+	int err = 0;
+
+
+	/*
+	 * Inform all atomic readers that we are going to offline a CPU
+	 * and that they need to switch from per-cpu refcounts to the
+	 * global hotplug_rwlock.
+	 */
+	announce_cpu_offline_begin();
+
+	/* Wait for every reader to notice the announcement and switch over */
+	sync_all_readers();
+
+	/*
+	 * Now all the readers have switched to using the global hotplug_rwlock.
+	 * So now is our chance, go bring down the CPU!
+	 */
+
+	write_lock_irqsave(&hotplug_rwlock, flags);
 
 	/* Ensure this CPU doesn't handle any more interrupts. */
 	err = __cpu_disable();
 	if (err < 0)
-		return err;
+		goto out;
 
 	cpu_notify(CPU_DYING | param->mod, param->hcpu);
-	return 0;
+
+out:
+	/*
+	 * Inform all atomic readers that we are done with the CPU offline
+	 * operation, so that they can switch back to their per-cpu refcounts.
+	 * (We don't need to wait for them to see it).
+	 */
+	announce_cpu_offline_end(cpu);
+
+	write_unlock_irqrestore(&hotplug_rwlock, flags);
+	return err;
 }
 
 /* Requires cpu_add_remove_lock to be held */
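For reference, a reader of the new APIs would look roughly like the sketch
below. This is a hypothetical caller, not part of the patch; do_something_on()
is a made-up per-CPU operation used purely for illustration:

	/*
	 * Hypothetical atomic-context reader: walk the online CPUs while
	 * guaranteeing that none of them can go offline mid-walk.
	 */
	static void poke_all_online_cpus(void)
	{
		unsigned int cpu;

		get_online_cpus_atomic();	/* returns with preemption disabled */

		for_each_online_cpu(cpu)
			do_something_on(cpu);	/* 'cpu' cannot go offline here */

		put_online_cpus_atomic();	/* re-enables preemption */
	}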
There are places where preempt_disable() is used to prevent any CPU from
going offline during the critical section. Let us call them "atomic hotplug
readers" ("atomic" because they run in atomic contexts).

Today, preempt_disable() works because the writer uses stop_machine(). But
once stop_machine() is gone, the readers won't be able to prevent CPUs from
going offline using preempt_disable().

The intent of this patch is to provide synchronization APIs for such atomic
hotplug readers, to prevent (any) CPUs from going offline, without depending
on stop_machine() at the writer-side. The new APIs will look something like
this: get/put_online_cpus_atomic()

Some important design requirements and considerations:
-----------------------------------------------------

1. Scalable synchronization at the reader-side, especially in the fast-path

   Any synchronization at the atomic hotplug readers' side must be highly
   scalable - avoid global single-holder locks/counters etc. Because, these
   paths currently use the extremely fast preempt_disable(); our replacement
   to preempt_disable() should not become ridiculously costly and also should
   not serialize the readers among themselves needlessly.

   At a minimum, the new APIs must be extremely fast at the reader side,
   at least in the fast-path, when no CPU offline writers are active.

2. preempt_disable() was recursive. The replacement should also be recursive.

3. No (new) lock-ordering restrictions

   preempt_disable() was super-flexible. It didn't impose any ordering
   restrictions or rules for nesting. Our replacement should also be equally
   flexible and usable.

4. No deadlock possibilities

   Regular per-cpu locking is not the way to go if we want to have relaxed
   rules for lock-ordering. Because, we can end up in circular-locking
   dependencies as explained in https://lkml.org/lkml/2012/12/6/290

   So, avoid the usual per-cpu locking schemes (per-cpu locks/per-cpu atomic
   counters with spin-on-contention etc) as much as possible.

Implementation of the design:
----------------------------

We use global rwlocks for synchronization, because then we won't get into
lock-ordering related problems (unlike per-cpu locks). However, global
rwlocks lead to unnecessary cache-line bouncing even when there are no
hotplug writers present, which can slow down the system needlessly.

Per-cpu counters can help solve the cache-line bouncing problem. So we
actually use the best of both: per-cpu counters (no-waiting) at the reader
side in the fast-path, and global rwlocks in the slowpath.

[ Fastpath = no writer is active; Slowpath = a writer is active ]

IOW, the hotplug readers just increment/decrement their per-cpu refcounts
when no writer is active. When a writer becomes active, he signals all
readers to switch to global rwlocks for the duration of the CPU offline
operation. The readers switch over when it is safe for them (ie., when they
are about to start a fresh, non-nested read-side critical section) and start
using (holding) the global rwlock for read in their subsequent critical
sections.

The hotplug writer waits for every reader to switch, and then acquires the
global rwlock for write and takes the CPU offline. Then the writer signals
all readers that the CPU offline is done, and that they can go back to using
their per-cpu refcounts again.

Note that the lock-safety (despite the per-cpu scheme) comes from the fact
that the readers can *choose* _when_ to switch to rwlocks upon the writer's
signal. And the readers don't wait on anybody based on the per-cpu counters.
The only true synchronization that involves waiting at the reader-side in
this scheme is the one arising from the global rwlock, which is safe from
the circular locking dependency problems mentioned above (because it is
global).

Reader-writer locks and per-cpu counters are recursive, so they can be used
in a nested fashion in the reader-path. Also, this design of switching the
synchronization scheme ensures that you can safely nest and call these APIs
in any way you want, just like preempt_disable()/enable().

Together, these satisfy all the requirements mentioned above.

I'm indebted to Michael Wang and Xiao Guangrong for their numerous
thoughtful suggestions and ideas, which inspired and influenced many of the
decisions in this as well as previous designs. Thanks a lot Michael and Xiao!

Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
---

 include/linux/cpu.h |    4 +
 kernel/cpu.c        |  204 ++++++++++++++++++++++++++++++++++++++++++++++++++-
 2 files changed, 205 insertions(+), 3 deletions(-)
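For readers skimming the changelog without the patch itself, the writer-side
sequence described above boils down to the following sketch. It is a condensed
restatement of take_cpu_down() from the patch, with error handling and the
actual offlining details omitted; the function name and signature here are
made up for illustration:

	/* Sketch of the writer-side handshake (not the real take_cpu_down()). */
	static void offline_one_cpu(unsigned int cpu)
	{
		unsigned long flags;

		announce_cpu_offline_begin();	/* set writer_signal on all CPUs + smp_mb() */
		sync_all_readers();		/* wait until no CPU holds a per-cpu reference */

		write_lock_irqsave(&hotplug_rwlock, flags);

		/* ... __cpu_disable(), CPU_DYING notifiers ... */

		announce_cpu_offline_end(cpu);	/* readers may go back to per-cpu refcounts */
		write_unlock_irqrestore(&hotplug_rwlock, flags);
	}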