| Message ID | 5130E8E2.50206@cn.fujitsu.com (mailing list archive) |
| --- | --- |
| State | Not Applicable, archived |
Hey, guys and Oleg (yes, I'm singling you out ;p because you're that
awesome.)

On Sat, Mar 02, 2013 at 01:44:02AM +0800, Lai Jiangshan wrote:
> Performance:
> We only focus on the performance of the read site. This read site's
> fast path is just preempt_disable() + __this_cpu_read/inc() +
> arch_spin_trylock(); it has only one heavy memory operation, so it is
> expected to be fast.
>
> We tested three locks:
> 1) a traditional rwlock WITHOUT remote contention or cache-bouncing
>    (opt-rwlock)
> 2) this lock (lgrwlock)
> 3) the V6 percpu-rwlock by "Srivatsa S. Bhat" (percpu-rwlock)
>    (https://lkml.org/lkml/2013/2/18/186)
>
>                nested=1 (no nesting)   nested=2   nested=4
> opt-rwlock            517181            1009200    2010027
> lgrwlock              452897             700026    1201415
> percpu-rwlock        1192955            1451343    1951757

At first glance the numbers look pretty good, and I really like the
fact that if this works out we don't have to introduce yet another
percpu synchronization construct and get to reuse lglock.

So, Oleg, can you please see whether you can find holes in this one?

Srivatsa, I know you spent a lot of time on percpu_rwlock, but as you
wrote before, Lai's work can be seen as a continuation of yours, and if
we get to extend what's already there instead of introducing something
completely new, there's no reason not to (and my apologies for not
noticing the possibility of extending lglock before). So, if this can
work, it would be awesome if you guys could work together. Lai might
not be very good at communicating in English yet, but he's really good
at spotting patterns in complex code and playing with them.

Thanks!
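(The benchmark harness itself is not shown in the thread. In the spirit
of the description above, the read-site loop might look roughly like
this sketch; the function name, loop structure, and use of
local_clock() are assumptions, not the actual test.)

/* Sketch of a read-site micro-benchmark; NOT the harness that produced
 * the numbers above.  Take the read lock `nested` times to exercise
 * the reentrant fast path, release it, and time the whole run. */
#include <linux/lglock.h>
#include <linux/sched.h>	/* local_clock() */

DEFINE_STATIC_LGRWLOCK(test_lgrw);

static u64 time_read_site(int nested, int loops)
{
	u64 t0 = local_clock();
	int i, n;

	for (i = 0; i < loops; i++) {
		for (n = 0; n < nested; n++)
			lg_rwlock_local_read_lock(&test_lgrw);
		for (n = 0; n < nested; n++)
			lg_rwlock_local_read_unlock(&test_lgrw);
	}

	return local_clock() - t0;	/* total ns for all iterations */
}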
Lai, I didn't read this discussion except the code posted by Michel.
I'll try to read this patch carefully later, but I'd like to ask
a couple of questions.

This version looks more complex than Michel's, why? Just curious, I
am trying to understand what I missed. See
http://marc.info/?l=linux-kernel&m=136196350213593

And I can't understand FALLBACK_BASE...

OK, suppose that CPU_0 does _write_unlock() and releases ->fallback_rwlock.

CPU_1 does _read_lock(), and ...

> +void lg_rwlock_local_read_lock(struct lgrwlock *lgrw)
> +{
> +	struct lglock *lg = &lgrw->lglock;
> +
> +	preempt_disable();
> +	rwlock_acquire_read(&lg->lock_dep_map, 0, 0, _RET_IP_);
> +	if (likely(!__this_cpu_read(*lgrw->reader_refcnt))) {
> +		if (!arch_spin_trylock(this_cpu_ptr(lg->lock))) {

_trylock() fails,

> +			read_lock(&lgrw->fallback_rwlock);
> +			__this_cpu_add(*lgrw->reader_refcnt, FALLBACK_BASE);

so we take ->fallback_rwlock, and ->reader_refcnt == FALLBACK_BASE.

CPU_0 does lg_global_unlock(lgrw->lglock) and finishes _write_unlock().

An interrupt handler on CPU_1 does _read_lock(), notices ->reader_refcnt != 0,
and simply does this_cpu_inc(), so reader_refcnt == FALLBACK_BASE + 1.

Then the irq does _read_unlock(), and

> +void lg_rwlock_local_read_unlock(struct lgrwlock *lgrw)
> +{
> +	switch (__this_cpu_dec_return(*lgrw->reader_refcnt)) {
> +	case 0:
> +		lg_local_unlock(&lgrw->lglock);
> +		return;
> +	case FALLBACK_BASE:
> +		__this_cpu_sub(*lgrw->reader_refcnt, FALLBACK_BASE);
> +		read_unlock(&lgrw->fallback_rwlock);

hits this case?

Doesn't look right, but most probably I missed something.

Oleg.
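(Tracing CPU_1's reader_refcnt through this interleaving makes the
problem concrete; this is a reconstruction of the scenario above, not
output from any tool:)

CPU_1 task   _read_lock(): arch_spin_trylock() fails (CPU_0 is writing)
             read_lock(&lgrw->fallback_rwlock)
             __this_cpu_add(FALLBACK_BASE)        refcnt == FALLBACK_BASE
CPU_0        _write_unlock() finishes, writer gone
CPU_1 irq    _read_lock(): refcnt != 0, so fast path:
             __this_cpu_inc()                     refcnt == FALLBACK_BASE + 1
CPU_1 irq    _read_unlock(): __this_cpu_dec_return() returns FALLBACK_BASE,
             so "case FALLBACK_BASE" runs:
             __this_cpu_sub(FALLBACK_BASE)        refcnt == 0
             read_unlock(&lgrw->fallback_rwlock)  <-- releases the lock the
                                                      interrupted task holds
CPU_1 task   resumes _read_lock(): __this_cpu_inc()   refcnt == 1
             its eventual _read_unlock() sees dec_return() == 0 and calls
             lg_local_unlock() on a per-CPU spinlock it never acquired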
On 03/01/2013 11:23 PM, Tejun Heo wrote:
> Hey, guys and Oleg (yes, I'm singling you out ;p because you're that
> awesome.)
>
> On Sat, Mar 02, 2013 at 01:44:02AM +0800, Lai Jiangshan wrote:
>> Performance:
>> We only focus on the performance of the read site. This read site's
>> fast path is just preempt_disable() + __this_cpu_read/inc() +
>> arch_spin_trylock(); it has only one heavy memory operation, so it is
>> expected to be fast.
>>
>> We tested three locks:
>> 1) a traditional rwlock WITHOUT remote contention or cache-bouncing
>>    (opt-rwlock)
>> 2) this lock (lgrwlock)
>> 3) the V6 percpu-rwlock by "Srivatsa S. Bhat" (percpu-rwlock)
>>    (https://lkml.org/lkml/2013/2/18/186)
>>
>>                nested=1 (no nesting)   nested=2   nested=4
>> opt-rwlock            517181            1009200    2010027
>> lgrwlock              452897             700026    1201415
>> percpu-rwlock        1192955            1451343    1951757
>
> At first glance the numbers look pretty good, and I really like the
> fact that if this works out we don't have to introduce yet another
> percpu synchronization construct and get to reuse lglock.
>
> So, Oleg, can you please see whether you can find holes in this one?
>
> Srivatsa, I know you spent a lot of time on percpu_rwlock, but as you
> wrote before, Lai's work can be seen as a continuation of yours, and
> if we get to extend what's already there instead of introducing
> something completely new, there's no reason not to

Yep, I agree!

> (and my apologies for not noticing the possibility of extending
> lglock before).

No problem at all! You gave so many invaluable suggestions to make this
whole thing work in the first place! Now, if we can reuse the existing
stuff and extend it to what we want, then it's just even better! :-)

> So, if this can work, it would be awesome if you guys could work
> together.

Absolutely!

> Lai might not be very good at communicating in English yet, but he's
> really good at spotting patterns in complex code and playing with
> them.

That sounds great! :-) I'll soon take a closer look at his code and the
comparisons he posted, and work towards taking this effort forward.

Thank you very much!

Regards,
Srivatsa S. Bhat
On Sat, Mar 2, 2013 at 2:28 AM, Oleg Nesterov <oleg@redhat.com> wrote:
> Lai, I didn't read this discussion except the code posted by Michel.
> I'll try to read this patch carefully later, but I'd like to ask
> a couple of questions.
>
> This version looks more complex than Michel's, why? Just curious, I
> am trying to understand what I missed. See
> http://marc.info/?l=linux-kernel&m=136196350213593

From what I can see, my version used local_refcnt to count how many
reentrant locks are represented by the fastpath lglock spinlock; Lai's
version uses it to count how many reentrant locks are represented by
either the fastpath lglock spinlock or the global rwlock, with
FALLBACK_BASE being a bit thrown in so we can remember which of these
locks was acquired.

My version would be slower if it needs to take the slow path in a
reentrant way, but I'm not sure it matters either :)

> An interrupt handler on CPU_1 does _read_lock(), notices
> ->reader_refcnt != 0, and simply does this_cpu_inc(), so
> reader_refcnt == FALLBACK_BASE + 1.
>
> Then the irq does _read_unlock(), and
>
>> +void lg_rwlock_local_read_unlock(struct lgrwlock *lgrw)
>> +{
>> +	switch (__this_cpu_dec_return(*lgrw->reader_refcnt)) {
>> +	case 0:
>> +		lg_local_unlock(&lgrw->lglock);
>> +		return;
>> +	case FALLBACK_BASE:
>> +		__this_cpu_sub(*lgrw->reader_refcnt, FALLBACK_BASE);
>> +		read_unlock(&lgrw->fallback_rwlock);
>
> hits this case?
>
> Doesn't look right, but most probably I missed something.

Good catch. I think this is easily fixed by setting reader_refcnt
directly to FALLBACK_BASE + 1, instead of setting it to FALLBACK_BASE
and then incrementing it to FALLBACK_BASE + 1.
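(In code, that fix might look like the sketch below; this is an
untested illustration against the posted patch, not Michel's actual
change. The slow path stores FALLBACK_BASE + 1 with a single
__this_cpu_write(), so an interrupt on this CPU can never observe the
intermediate FALLBACK_BASE value:)

void lg_rwlock_local_read_lock(struct lgrwlock *lgrw)
{
	struct lglock *lg = &lgrw->lglock;

	preempt_disable();
	rwlock_acquire_read(&lg->lock_dep_map, 0, 0, _RET_IP_);
	if (likely(!__this_cpu_read(*lgrw->reader_refcnt))) {
		if (!arch_spin_trylock(this_cpu_ptr(lg->lock))) {
			read_lock(&lgrw->fallback_rwlock);
			/* single store: an irq here sees either 0
			 * (before) or FALLBACK_BASE + 1 (after), never
			 * the bare FALLBACK_BASE marker */
			__this_cpu_write(*lgrw->reader_refcnt,
					 FALLBACK_BASE + 1);
			return;
		}
	}

	__this_cpu_inc(*lgrw->reader_refcnt);
}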
On 02/03/13 02:28, Oleg Nesterov wrote:
> Lai, I didn't read this discussion except the code posted by Michel.
> I'll try to read this patch carefully later, but I'd like to ask
> a couple of questions.
>
> This version looks more complex than Michel's, why? Just curious, I
> am trying to understand what I missed. See
> http://marc.info/?l=linux-kernel&m=136196350213593

Michel changed my old draft version a little; his version is good
enough for me. My new version tries to add slightly better nesting
support while adding only a single __this_cpu_op() to _read_[un]lock().

>
> And I can't understand FALLBACK_BASE...
>
> OK, suppose that CPU_0 does _write_unlock() and releases ->fallback_rwlock.
>
> CPU_1 does _read_lock(), and ...
>
>> +void lg_rwlock_local_read_lock(struct lgrwlock *lgrw)
>> +{
>> +	struct lglock *lg = &lgrw->lglock;
>> +
>> +	preempt_disable();
>> +	rwlock_acquire_read(&lg->lock_dep_map, 0, 0, _RET_IP_);
>> +	if (likely(!__this_cpu_read(*lgrw->reader_refcnt))) {
>> +		if (!arch_spin_trylock(this_cpu_ptr(lg->lock))) {
>
> _trylock() fails,
>
>> +			read_lock(&lgrw->fallback_rwlock);
>> +			__this_cpu_add(*lgrw->reader_refcnt, FALLBACK_BASE);
>
> so we take ->fallback_rwlock, and ->reader_refcnt == FALLBACK_BASE.
>
> CPU_0 does lg_global_unlock(lgrw->lglock) and finishes _write_unlock().
>
> An interrupt handler on CPU_1 does _read_lock(), notices ->reader_refcnt != 0,
> and simply does this_cpu_inc(), so reader_refcnt == FALLBACK_BASE + 1.
>
> Then the irq does _read_unlock(), and
>
>> +void lg_rwlock_local_read_unlock(struct lgrwlock *lgrw)
>> +{
>> +	switch (__this_cpu_dec_return(*lgrw->reader_refcnt)) {
>> +	case 0:
>> +		lg_local_unlock(&lgrw->lglock);
>> +		return;
>> +	case FALLBACK_BASE:
>> +		__this_cpu_sub(*lgrw->reader_refcnt, FALLBACK_BASE);
>> +		read_unlock(&lgrw->fallback_rwlock);
>
> hits this case?
>
> Doesn't look right, but most probably I missed something.

You are right; I just realized that I split up an update which needs to
be atomic. I hope this patch (V2) can get more reviews.

Much of my locking knowledge, from the basics on up, I learned from
Paul. Paul, would you also review it?

Thanks,
Lai
On 03/02, Lai Jiangshan wrote:
>
> On 02/03/13 02:28, Oleg Nesterov wrote:
> > Lai, I didn't read this discussion except the code posted by Michel.
> > I'll try to read this patch carefully later, but I'd like to ask
> > a couple of questions.
> >
> > This version looks more complex than Michel's, why? Just curious, I
> > am trying to understand what I missed. See
> > http://marc.info/?l=linux-kernel&m=136196350213593
>
> Michel changed my old draft version a little; his version is good
> enough for me.

Yes, I see. But imho Michel suggested a valuable cleanup; the code
becomes even simpler with the same performance.

Your v2 looks almost correct to me, but I still think it makes sense to
incorporate the simplification from Michel.

> My new version tries to add slightly better nesting support while
> adding only a single __this_cpu_op() to _read_[un]lock().

How? Afaics, with or without FALLBACK_BASE you need _read + _inc/dec in
_read_lock/unlock.

Oleg.
On 03/02, Michel Lespinasse wrote:
>
> My version would be slower if it needs to take the
> slow path in a reentrant way, but I'm not sure it matters either :)

I'd say this doesn't matter at all, simply because this can only happen
if we race with the active writer.

Oleg.
On 03/03/13 01:06, Oleg Nesterov wrote:
> On 03/02, Michel Lespinasse wrote:
>>
>> My version would be slower if it needs to take the
>> slow path in a reentrant way, but I'm not sure it matters either :)
>
> I'd say this doesn't matter at all, simply because this can only happen
> if we race with the active writer.
>

It can also happen when interrupted (still very rarely):

arch_spin_trylock()
	------->interrupted,
		__this_cpu_read() returns 0.
		arch_spin_trylock() fails
		slowpath, any nested lock will take the slowpath too.
		...
		..._read_unlock()
	<-------interrupt
__this_cpu_inc()
....

I saw that get_online_cpu_atomic() is called very frequently. The above
happens rarely on any one CPU, but how often does it happen across the
whole system if we have 4096 CPUs?

(Maybe I worry too much. I tend to remove FALLBACK_BASE now; we should
add it only after we have proved we need it, and that part is not
proved.)

Thanks,
Lai
On Tue, Mar 5, 2013 at 7:54 AM, Lai Jiangshan <laijs@cn.fujitsu.com> wrote:
> On 03/03/13 01:06, Oleg Nesterov wrote:
>> On 03/02, Michel Lespinasse wrote:
>>>
>>> My version would be slower if it needs to take the
>>> slow path in a reentrant way, but I'm not sure it matters either :)
>>
>> I'd say this doesn't matter at all, simply because this can only happen
>> if we race with the active writer.
>>
>
> It can also happen when interrupted (still very rarely):
>
> arch_spin_trylock()
> 	------->interrupted,
> 		__this_cpu_read() returns 0.
> 		arch_spin_trylock() fails
> 		slowpath, any nested lock will take the slowpath too.
> 		...
> 		..._read_unlock()
> 	<-------interrupt
> __this_cpu_inc()
> ....

Yes (and I think this is actually the most likely way for it to
happen). We do need this to work correctly, but I don't expect we need
it to be fast (could be wrong; this is only my intuition).
On 03/05, Lai Jiangshan wrote:
>
> On 03/03/13 01:06, Oleg Nesterov wrote:
> > On 03/02, Michel Lespinasse wrote:
> >>
> >> My version would be slower if it needs to take the
> >> slow path in a reentrant way, but I'm not sure it matters either :)
> >
> > I'd say this doesn't matter at all, simply because this can only happen
> > if we race with the active writer.
>
> It can also happen when interrupted (still very rarely):
>
> arch_spin_trylock()
> 	------->interrupted,
> 		__this_cpu_read() returns 0.
> 		arch_spin_trylock() fails
> 		slowpath, any nested lock will take the slowpath too.
> 		...
> 		..._read_unlock()
> 	<-------interrupt
> __this_cpu_inc()
> ....

Yes, sure. Or it can take the local lock after we have already taken
the global fallback_rwlock. But the same can happen with FALLBACK_BASE,
simply because we need to take a lock (local or global) first, and only
then increment the counter.

> (Maybe I worry too much. I tend to remove FALLBACK_BASE now; we should
> add it only after we have proved we need it, and that part is not
> proved.)

Agreed, great ;)

Oleg.
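(With FALLBACK_BASE removed, the read side along the lines Michel
described might look like the sketch below. This is an illustration
only, using the lgrwlock fields from the patch that follows and
omitting the lockdep annotations: the per-CPU counter now only counts
readers that own the per-CPU spinlock, and nested slow-path readers
simply take the recursive fallback rwlock again:)

void lg_rwlock_local_read_lock(struct lgrwlock *lgrw)
{
	preempt_disable();

	if (__this_cpu_read(*lgrw->reader_refcnt) ||
	    arch_spin_trylock(this_cpu_ptr(lgrw->lglock.lock))) {
		/* fast path: this CPU's spinlock is (or already was)
		 * ours; the counter tracks reentrancy */
		__this_cpu_inc(*lgrw->reader_refcnt);
	} else {
		/* slow path: a writer holds our per-CPU spinlock;
		 * rwlock_t's read side is recursive, so nested and
		 * irq readers can just take it again */
		read_lock(&lgrw->fallback_rwlock);
	}
}

void lg_rwlock_local_read_unlock(struct lgrwlock *lgrw)
{
	if (likely(__this_cpu_read(*lgrw->reader_refcnt))) {
		if (__this_cpu_dec_return(*lgrw->reader_refcnt) == 0)
			arch_spin_unlock(this_cpu_ptr(lgrw->lglock.lock));
	} else {
		read_unlock(&lgrw->fallback_rwlock);
	}

	preempt_enable();
}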
diff --git a/include/linux/lglock.h b/include/linux/lglock.h
index 0d24e93..afb22bd 100644
--- a/include/linux/lglock.h
+++ b/include/linux/lglock.h
@@ -67,4 +67,39 @@ void lg_local_unlock_cpu(struct lglock *lg, int cpu);
 void lg_global_lock(struct lglock *lg);
 void lg_global_unlock(struct lglock *lg);
 
+struct lgrwlock {
+	unsigned long __percpu *reader_refcnt;
+	struct lglock lglock;
+	rwlock_t fallback_rwlock;
+};
+
+#define __DEFINE_LGRWLOCK_PERCPU_DATA(name)				\
+	static DEFINE_PER_CPU(arch_spinlock_t, name ## _lock)		\
+	= __ARCH_SPIN_LOCK_UNLOCKED;					\
+	static DEFINE_PER_CPU(unsigned long, name ## _refcnt);
+
+#define __LGRWLOCK_INIT(name)						\
+	{								\
+		.reader_refcnt = &name ## _refcnt,			\
+		.lglock = { .lock = &name ## _lock },			\
+		.fallback_rwlock = __RW_LOCK_UNLOCKED(name.fallback_rwlock)\
+	}
+
+#define DEFINE_LGRWLOCK(name)						\
+	__DEFINE_LGRWLOCK_PERCPU_DATA(name)				\
+	struct lgrwlock name = __LGRWLOCK_INIT(name)
+
+#define DEFINE_STATIC_LGRWLOCK(name)					\
+	__DEFINE_LGRWLOCK_PERCPU_DATA(name)				\
+	static struct lgrwlock name = __LGRWLOCK_INIT(name)
+
+static inline void lg_rwlock_init(struct lgrwlock *lgrw, char *name)
+{
+	lg_lock_init(&lgrw->lglock, name);
+}
+
+void lg_rwlock_local_read_lock(struct lgrwlock *lgrw);
+void lg_rwlock_local_read_unlock(struct lgrwlock *lgrw);
+void lg_rwlock_global_write_lock(struct lgrwlock *lgrw);
+void lg_rwlock_global_write_unlock(struct lgrwlock *lgrw);
 #endif
diff --git a/kernel/lglock.c b/kernel/lglock.c
index 6535a66..6c60cf7 100644
--- a/kernel/lglock.c
+++ b/kernel/lglock.c
@@ -87,3 +87,55 @@ void lg_global_unlock(struct lglock *lg)
 	preempt_enable();
 }
 EXPORT_SYMBOL(lg_global_unlock);
+
+#define FALLBACK_BASE	(1UL << 30)
+
+void lg_rwlock_local_read_lock(struct lgrwlock *lgrw)
+{
+	struct lglock *lg = &lgrw->lglock;
+
+	preempt_disable();
+	rwlock_acquire_read(&lg->lock_dep_map, 0, 0, _RET_IP_);
+	if (likely(!__this_cpu_read(*lgrw->reader_refcnt))) {
+		if (!arch_spin_trylock(this_cpu_ptr(lg->lock))) {
+			read_lock(&lgrw->fallback_rwlock);
+			__this_cpu_add(*lgrw->reader_refcnt, FALLBACK_BASE);
+		}
+	}
+
+	__this_cpu_inc(*lgrw->reader_refcnt);
+}
+EXPORT_SYMBOL(lg_rwlock_local_read_lock);
+
+void lg_rwlock_local_read_unlock(struct lgrwlock *lgrw)
+{
+	switch (__this_cpu_dec_return(*lgrw->reader_refcnt)) {
+	case 0:
+		lg_local_unlock(&lgrw->lglock);
+		return;
+	case FALLBACK_BASE:
+		__this_cpu_sub(*lgrw->reader_refcnt, FALLBACK_BASE);
+		read_unlock(&lgrw->fallback_rwlock);
+		break;
+	default:
+		break;
+	}
+
+	rwlock_release(&lgrw->lglock.lock_dep_map, 1, _RET_IP_);
+	preempt_enable();
+}
+EXPORT_SYMBOL(lg_rwlock_local_read_unlock);
+
+void lg_rwlock_global_write_lock(struct lgrwlock *lgrw)
+{
+	lg_global_lock(&lgrw->lglock);
+	write_lock(&lgrw->fallback_rwlock);
+}
+EXPORT_SYMBOL(lg_rwlock_global_write_lock);
+
+void lg_rwlock_global_write_unlock(struct lgrwlock *lgrw)
+{
+	write_unlock(&lgrw->fallback_rwlock);
+	lg_global_unlock(&lgrw->lglock);
+}
+EXPORT_SYMBOL(lg_rwlock_global_write_unlock);
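(For completeness, a minimal usage sketch of the API added by this
patch; `my_lgrw`, reader_side(), and writer_side() are hypothetical
call sites for illustration only:)

#include <linux/lglock.h>

DEFINE_LGRWLOCK(my_lgrw);

void reader_side(void)
{
	/* read side: per-CPU fast path, reentrant, usable from
	 * interrupt context */
	lg_rwlock_local_read_lock(&my_lgrw);
	/* ... read the protected state ... */
	lg_rwlock_local_read_unlock(&my_lgrw);
}

void writer_side(void)
{
	/* write side: takes every per-CPU spinlock plus the global
	 * fallback rwlock; expected to be rare and expensive */
	lg_rwlock_global_write_lock(&my_lgrw);
	/* ... modify the protected state ... */
	lg_rwlock_global_write_unlock(&my_lgrw);
}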