
lglock: add read-preference local-global rwlock

Message ID 5130E8E2.50206@cn.fujitsu.com (mailing list archive)
State Not Applicable, archived

Commit Message

Lai Jiangshan March 1, 2013, 5:44 p.m. UTC
From c63f2be9a4cf7106a521dda169a0e14f8e4f7e3b Mon Sep 17 00:00:00 2001
From: Lai Jiangshan <laijs@cn.fujitsu.com>
Date: Mon, 25 Feb 2013 23:14:27 +0800
Subject: [PATCH] lglock: add read-preference local-global rwlock

The current lglock is not read-preference, so it cannot be used in cases that
need a read-preference rwlock, for example get_cpu_online_atomic().

Although we could use an rwlock for these read-preference cases, it leads to
unnecessary cache-line bouncing even when there are no writers present, which
can slow down the system needlessly. It gets worse as the number of CPUs grows;
it does not scale.

So we turn to lglock. lglock is a reader-writer lock built on percpu locks,
but it is not read-preference due to its underlying percpu locks.

But what if we convert the percpu locks of lglock to use percpu rwlocks:

         CPU 0                                CPU 1
         ------                               ------
1.    spin_lock(&random_lock);             read_lock(my_rwlock of CPU 1);
2.    read_lock(my_rwlock of CPU 0);       spin_lock(&random_lock);

Writer:
         CPU 2:
         ------
      for_each_online_cpu(cpu)
        write_lock(my_rwlock of 'cpu');


Consider what happens if the writer begins his operation in between steps 1
and 2 at the reader side. It becomes evident that we end up in a (previously
non-existent) deadlock due to a circular locking dependency between the 3
entities, like this:


(holds              Waiting for
 random_lock) CPU 0 -------------> CPU 2  (holds my_rwlock of CPU 0
                                               for write)
               ^                   |
               |                   |
        Waiting|                   | Waiting
          for  |                   |  for
               |                   V
                ------ CPU 1 <------

                (holds my_rwlock of
                 CPU 1 for read)


So obviously this "straightforward" way of implementing percpu rwlocks is
deadlock-prone; we cannot implement a read-preference local-global rwlock
like this.
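
For concreteness, the naive conversion being ruled out here would look roughly
like the sketch below (illustration only; the struct and function names are
hypothetical, and lockdep/error handling are omitted):

	/* Deadlock-prone sketch: each CPU gets its own rwlock. */
	struct naive_lgrwlock {
		rwlock_t __percpu *rwlock;
	};

	static void naive_read_lock(struct naive_lgrwlock *lg)
	{
		preempt_disable();
		/* spins if a writer currently holds this CPU's lock */
		read_lock(this_cpu_ptr(lg->rwlock));
	}

	static void naive_write_lock(struct naive_lgrwlock *lg)
	{
		int cpu;

		/* takes every CPU's lock for write, one after another */
		for_each_online_cpu(cpu)
			write_lock(per_cpu_ptr(lg->rwlock, cpu));
	}

As the diagram shows, a reader that takes some unrelated lock before its percpu
read_lock() can form a cycle with this writer loop and another reader that
acquires the two locks in the opposite order.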


The implementation in this patch reuses the current lglock as the frontend to
achieve local read locking, reuses the global fallback rwlock as the backend to
achieve read-preference, and uses a percpu reader counter to indicate 1) the
depth of the nested reader locks and 2) whether the outermost lock is the
percpu lock or the fallback rwlock.

The algorithm is simple. On the read side:
If it is a nested reader, just increase the counter.
If it is the outermost reader,
	1) try to lock this cpu's lock of the frontend lglock
	   (reader count += 1 on success);
	2) if the above step fails, read_lock(&fallback_rwlock)
	   (reader count += FALLBACK_BASE + 1).

On the write side:
Do lg_global_lock() on the frontend lglock,
and then write_lock(&fallback_rwlock).
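
For illustration, here is a minimal usage sketch of the API added by this patch
(the lock name and the critical sections are hypothetical):

	/* File scope: define the lock (readers may run in atomic/irq context). */
	DEFINE_LGRWLOCK(hotplug_lgrw);

	static void reader_example(void)
	{
		/* Read side: fast, local, nestable and irq-safe. */
		lg_rwlock_local_read_lock(&hotplug_lgrw);
		/* ... read-side critical section ... */
		lg_rwlock_local_read_unlock(&hotplug_lgrw);
	}

	static void writer_example(void)
	{
		/* Write side: heavy; takes every percpu lock plus the fallback rwlock. */
		lg_rwlock_global_write_lock(&hotplug_lgrw);
		/* ... write-side critical section ... */
		lg_rwlock_global_write_unlock(&hotplug_lgrw);
	}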


Proof:
1) reader-writer exclusion:
Once both steps on the write side have completed, there can be no readers, and
vice versa.

2) read-preference:
Before the write side finishes acquiring both locks, the read side can at least
win at read_lock(&fallback_rwlock), because rwlock is read-preference.
Read-preference also implies nestability.

3) read-side functions are irq-safe (reentrance-safe):
	If a read-side function is interrupted at any point and the read side is
reentered, the reentered read side will not be misled by the first one. If the
reader counter > 0, the frontend (this cpu's lock of the lglock) or the backend
(fallback rwlock) is currently held, so it is safe to act as a nested reader.
	If the reader counter == 0, the reentered reader considers itself the
outermost read side, and it always succeeds once the write side releases the
lock (even if the interrupted read side has already taken the cpu lock of the
lglock or the fallback_rwlock).
	Also, the reentered read side only calls arch_spin_trylock(), read_lock()
and __this_cpu_op(); arch_spin_trylock() and read_lock() are already
reentrance-safe. Although __this_cpu_op() is not reentrance-safe by itself, the
value of the counter is restored after the interrupt finishes, so the read-side
functions are still reentrance-safe.
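
To make the reentrance argument concrete, here is one possible interleaving on
a single CPU, with the per-CPU counter value after each step (an illustration
of the scheme above, not additional code):

	/*
	 * lg_rwlock_local_read_lock()          reader_refcnt == 0, outermost
	 *   arch_spin_trylock() succeeds
	 *   __this_cpu_inc()                   reader_refcnt == 1
	 *     <interrupt>
	 *     lg_rwlock_local_read_lock()      sees reader_refcnt != 0 -> nested
	 *       __this_cpu_inc()               reader_refcnt == 2
	 *     lg_rwlock_local_read_unlock()    reader_refcnt == 1, nothing to release
	 *     </interrupt>
	 * lg_rwlock_local_read_unlock()        reader_refcnt == 0, releases the percpu lock
	 */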


Performance:
We focus only on the performance of the read side. The read-side fast path is
just preempt_disable() + __this_cpu_read/inc() + arch_spin_trylock(); it has
only one heavy memory operation, so it is expected to be fast.

We tested three locks:
1) a traditional rwlock WITHOUT remote contention or cache-bouncing (opt-rwlock);
2) this lock (lgrwlock);
3) the V6 percpu-rwlock by "Srivatsa S. Bhat" (percpu-rwlock)
   (https://lkml.org/lkml/2013/2/18/186).

		nested=1(no nested)	nested=2	nested=4
opt-rwlock	 517181			1009200		2010027
lgrwlock	 452897			 700026		1201415
percpu-rwlock	1192955			1451343		1951757

Each value is the time (in nanoseconds) of 10000 iterations of the operations:
{
	read-lock
	[nested read-lock]...
	[nested read-unlock]...
	read-unlock
}
If the methodology of this test is wrong or flawed, please correct me;
the test code is here: https://gist.github.com/laijs/5066159
(i5 760, 64-bit)

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
---
 include/linux/lglock.h |   35 ++++++++++++++++++++++++++++++++
 kernel/lglock.c        |   52 ++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 87 insertions(+), 0 deletions(-)

Comments

Tejun Heo March 1, 2013, 5:53 p.m. UTC | #1
Hey, guys and Oleg (yes, I'm singling you out ;p because you're that
awesome.)

On Sat, Mar 02, 2013 at 01:44:02AM +0800, Lai Jiangshan wrote:
> Performance:
> We only focus on the performance of the read site. this read site's fast path
> is just preempt_disable() + __this_cpu_read/inc() + arch_spin_trylock(),
> It has only one heavy memory operation. it will be expected fast.
> 
> We test three locks.
> 1) traditional rwlock WITHOUT remote competition nor cache-bouncing.(opt-rwlock)
> 2) this lock(lgrwlock)
> 3) V6 percpu-rwlock by "Srivatsa S. Bhat". (percpu-rwlock)
>    (https://lkml.org/lkml/2013/2/18/186)
> 
> 		nested=1(no nested)	nested=2	nested=4
> opt-rwlock	 517181			1009200		2010027
> lgrwlock	 452897			 700026		1201415
> percpu-rwlock	1192955			1451343		1951757

On the first glance, the numbers look pretty good and I kinda really
like the fact that if this works out we don't have to introduce yet
another percpu synchronization construct and get to reuse lglock.

So, Oleg, can you please see whether you can find holes in this one?

Srivatsa, I know you spent a lot of time on percpu_rwlock but as you
wrote before Lai's work can be seen as continuation of yours, and if
we get to extend what's already there instead of introducing something
completely new, there's no reason not to (and my apologies for not
noticing the possibility of extending lglock before).  So, if this can
work, it would be awesome if you guys can work together.  Lai might
not be very good at communicating in english yet but he's really good
at spotting patterns in complex code and playing with them.

Thanks!
Oleg Nesterov March 1, 2013, 6:28 p.m. UTC | #2
Lai, I didn't read this discussion except the code posted by Michel.
I'll try to read this patch carefully later, but I'd like to ask
a couple of questions.

This version looks more complex than Michel's, why? Just curious, I
am trying to understand what I missed. See
http://marc.info/?l=linux-kernel&m=136196350213593

And I can't understand FALLBACK_BASE...

OK, suppose that CPU_0 does _write_unlock() and releases ->fallback_rwlock.

CPU_1 does _read_lock(), and ...

> +void lg_rwlock_local_read_lock(struct lgrwlock *lgrw)
> +{
> +	struct lglock *lg = &lgrw->lglock;
> +
> +	preempt_disable();
> +	rwlock_acquire_read(&lg->lock_dep_map, 0, 0, _RET_IP_);
> +	if (likely(!__this_cpu_read(*lgrw->reader_refcnt))) {
> +		if (!arch_spin_trylock(this_cpu_ptr(lg->lock))) {

_trylock() fails,

> +			read_lock(&lgrw->fallback_rwlock);
> +			__this_cpu_add(*lgrw->reader_refcnt, FALLBACK_BASE);

so we take ->fallback_rwlock and ->reader_refcnt == FALLBACK_BASE.

CPU_0 does lg_global_unlock(lgrw->lglock) and finishes _write_unlock().

Interrupt handler on CPU_1 does _read_lock() notices ->reader_refcnt != 0
and simply does this_cpu_inc(), so reader_refcnt == FALLBACK_BASE + 1.

Then irq does _read_unlock(), and

> +void lg_rwlock_local_read_unlock(struct lgrwlock *lgrw)
> +{
> +	switch (__this_cpu_dec_return(*lgrw->reader_refcnt)) {
> +	case 0:
> +		lg_local_unlock(&lgrw->lglock);
> +		return;
> +	case FALLBACK_BASE:
> +		__this_cpu_sub(*lgrw->reader_refcnt, FALLBACK_BASE);
> +		read_unlock(&lgrw->fallback_rwlock);

hits this case?

Doesn't look right, but most probably I missed something.

Oleg.

Srivatsa S. Bhat March 1, 2013, 8:06 p.m. UTC | #3
On 03/01/2013 11:23 PM, Tejun Heo wrote:
> Hey, guys and Oleg (yes, I'm singling you out ;p because you're that
> awesome.)
> 
> On Sat, Mar 02, 2013 at 01:44:02AM +0800, Lai Jiangshan wrote:
>> Performance:
>> We only focus on the performance of the read site. this read site's fast path
>> is just preempt_disable() + __this_cpu_read/inc() + arch_spin_trylock(),
>> It has only one heavy memory operation. it will be expected fast.
>>
>> We test three locks.
>> 1) traditional rwlock WITHOUT remote competition nor cache-bouncing.(opt-rwlock)
>> 2) this lock(lgrwlock)
>> 3) V6 percpu-rwlock by "Srivatsa S. Bhat". (percpu-rwlock)
>>    (https://lkml.org/lkml/2013/2/18/186)
>>
>> 		nested=1(no nested)	nested=2	nested=4
>> opt-rwlock	 517181			1009200		2010027
>> lgrwlock	 452897			 700026		1201415
>> percpu-rwlock	1192955			1451343		1951757
> 
> On the first glance, the numbers look pretty good and I kinda really
> like the fact that if this works out we don't have to introduce yet
> another percpu synchronization construct and get to reuse lglock.
> 
> So, Oleg, can you please see whether you can find holes in this one?
> 
> Srivatsa, I know you spent a lot of time on percpu_rwlock but as you
> wrote before Lai's work can be seen as continuation of yours, and if
> we get to extend what's already there instead of introducing something
> completely new, there's no reason not to

Yep, I agree!

> (and my apologies for not
> noticing the possibility of extending lglock before).

No problem at all! You gave so many invaluable suggestions to make this
whole thing work in the first place! Now, if we can reuse the existing
stuff and extend it to what we want, then its just even better! :-)

>  So, if this can
> work, it would be awesome if you guys can work together.

Absolutely!

>  Lai might
> not be very good at communicating in english yet but he's really good
> at spotting patterns in complex code and playing with them.
>

That sounds great! :-)

I'll soon take a closer look at his code and the comparisons he posted,
and work towards taking this effort forward.

Thank you very much!

Regards,
Srivatsa S. Bhat

Michel Lespinasse March 2, 2013, 12:13 p.m. UTC | #4
On Sat, Mar 2, 2013 at 2:28 AM, Oleg Nesterov <oleg@redhat.com> wrote:
> Lai, I didn't read this discussion except the code posted by Michel.
> I'll try to read this patch carefully later, but I'd like to ask
> a couple of questions.
>
> This version looks more complex than Michel's, why? Just curious, I
> am trying to understand what I missed. See
> http://marc.info/?l=linux-kernel&m=136196350213593

From what I can see, my version used local_refcnt to count how many
reentrant locks are represented by the fastpath lglock spinlock; Lai's
version uses it to count how many reentrant locks are represented by
either the fastpath lglock spinlock or the global rwlock, with
FALLBACK_BASE being a bit thrown in so we can remember which of these
locks was acquired. My version would be slower if it needs to take the
slow path in a reentrant way, but I'm not sure it matters either :)

> Interrupt handler on CPU_1 does _read_lock() notices ->reader_refcnt != 0
> and simply does this_cpu_inc(), so reader_refcnt == FALLBACK_BASE + 1.
>
> Then irq does _read_unlock(), and
>
>> +void lg_rwlock_local_read_unlock(struct lgrwlock *lgrw)
>> +{
>> +     switch (__this_cpu_dec_return(*lgrw->reader_refcnt)) {
>> +     case 0:
>> +             lg_local_unlock(&lgrw->lglock);
>> +             return;
>> +     case FALLBACK_BASE:
>> +             __this_cpu_sub(*lgrw->reader_refcnt, FALLBACK_BASE);
>> +             read_unlock(&lgrw->fallback_rwlock);
>
> hits this case?
>
> Doesn't look right, but most probably I missed something.

Good catch. I think this is easily fixed by setting reader_refcnt
directly to FALLBACK_BASE+1, instead of setting it to FALLBACK_BASE
and then incrementing it to FALLBACK_BASE+1.
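
Roughly, the fallback branch in lg_rwlock_local_read_lock() would then become
something like the following (a sketch of that suggestion, not an actual
follow-up patch):

	if (!arch_spin_trylock(this_cpu_ptr(lg->lock))) {
		read_lock(&lgrw->fallback_rwlock);
		/* go straight from 0 to FALLBACK_BASE + 1, no window at FALLBACK_BASE */
		__this_cpu_add(*lgrw->reader_refcnt, FALLBACK_BASE + 1);
		return;
	}

so an interrupting reader can no longer find the counter sitting at exactly
FALLBACK_BASE and wrongly release the fallback rwlock on its unlock path.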
Lai Jiangshan March 2, 2013, 1:42 p.m. UTC | #5
On 02/03/13 02:28, Oleg Nesterov wrote:
> Lai, I didn't read this discussion except the code posted by Michel.
> I'll try to read this patch carefully later, but I'd like to ask
> a couple of questions.
> 
> This version looks more complex than Michel's, why? Just curious, I
> am trying to understand what I missed. See
> http://marc.info/?l=linux-kernel&m=136196350213593

Michel changed my old draft version a little; his version is good enough for me.
My new version tries to add slightly better nestable support while only adding
a single __this_cpu_op() in _read_[un]lock().

> 
> And I can't understand FALLBACK_BASE...
> 
> OK, suppose that CPU_0 does _write_unlock() and releases ->fallback_rwlock.
> 
> CPU_1 does _read_lock(), and ...
> 
>> +void lg_rwlock_local_read_lock(struct lgrwlock *lgrw)
>> +{
>> +	struct lglock *lg = &lgrw->lglock;
>> +
>> +	preempt_disable();
>> +	rwlock_acquire_read(&lg->lock_dep_map, 0, 0, _RET_IP_);
>> +	if (likely(!__this_cpu_read(*lgrw->reader_refcnt))) {
>> +		if (!arch_spin_trylock(this_cpu_ptr(lg->lock))) {
> 
> _trylock() fails,
> 
>> +			read_lock(&lgrw->fallback_rwlock);
>> +			__this_cpu_add(*lgrw->reader_refcnt, FALLBACK_BASE);
> 
> so we take ->fallback_rwlock and ->reader_refcnt == FALLBACK_BASE.
> 
> CPU_0 does lg_global_unlock(lgrw->lglock) and finishes _write_unlock().
> 
> Interrupt handler on CPU_1 does _read_lock() notices ->reader_refcnt != 0
> and simply does this_cpu_inc(), so reader_refcnt == FALLBACK_BASE + 1.
> 
> Then irq does _read_unlock(), and
> 
>> +void lg_rwlock_local_read_unlock(struct lgrwlock *lgrw)
>> +{
>> +	switch (__this_cpu_dec_return(*lgrw->reader_refcnt)) {
>> +	case 0:
>> +		lg_local_unlock(&lgrw->lglock);
>> +		return;
>> +	case FALLBACK_BASE:
>> +		__this_cpu_sub(*lgrw->reader_refcnt, FALLBACK_BASE);
>> +		read_unlock(&lgrw->fallback_rwlock);
> 
> hits this case?
> 
> Doesn't look right, but most probably I missed something.

You are right; I just realized that I had split code which should be atomic.

I hope this patch(V2) can get more reviews.

Much of my locking knowledge, from the very beginning, was learned from Paul.
Paul, would you also review it?

Thanks,
Lai

> 
> Oleg.
> 

Oleg Nesterov March 2, 2013, 5:01 p.m. UTC | #6
On 03/02, Lai Jiangshan wrote:
>
> On 02/03/13 02:28, Oleg Nesterov wrote:
> > Lai, I didn't read this discussion except the code posted by Michel.
> > I'll try to read this patch carefully later, but I'd like to ask
> > a couple of questions.
> >
> > This version looks more complex than Michel's, why? Just curious, I
> > am trying to understand what I missed. See
> > http://marc.info/?l=linux-kernel&m=136196350213593
>
> Michel changed my old draft version a little, his version is good enough for me.

Yes, I see. But imho Michel suggested a valuable cleanup; the code
becomes even simpler with the same performance.

Your v2 looks almost correct to me, but I still think it makes sense
to incorporate the simplification from Michel.

> My new version tries to add a little better nestable support with only
> adding single __this_cpu_op() in _read_[un]lock().

How? Afaics, with or without FALLBACK_BASE you need _read + _inc/dec in
_read_lock/unlock.

Oleg.

Oleg Nesterov March 2, 2013, 5:06 p.m. UTC | #7
On 03/02, Michel Lespinasse wrote:
>
> My version would be slower if it needs to take the
> slow path in a reentrant way, but I'm not sure it matters either :)

I'd say, this doesn't matter at all, simply because this can only happen
if we race with the active writer.

Oleg.

Lai Jiangshan March 5, 2013, 3:54 p.m. UTC | #8
On 03/03/13 01:06, Oleg Nesterov wrote:
> On 03/02, Michel Lespinasse wrote:
>>
>> My version would be slower if it needs to take the
>> slow path in a reentrant way, but I'm not sure it matters either :)
> 
> I'd say, this doesn't matter at all, simply because this can only happen
> if we race with the active writer.
> 

It can also happen when interrupted. (still very rarely)

arch_spin_trylock()
	------->interrupted,
		__this_cpu_read() returns 0.
		arch_spin_trylock() fails
		slowpath, any nested will be slowpath too.
		...
		..._read_unlock()
	<-------interrupt
__this_cpu_inc()
....


I saw that get_online_cpu_atomic() is called very frequently.
The above scenario happens rarely on a single CPU, but how often does it
happen across the whole system if we have 4096 CPUs?
(I worry too much. I am inclined to remove FALLBACK_BASE now; we should
add it only after we have proven we need it, and this part is not proven.)

Thanks,
Lai
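
For reference, dropping FALLBACK_BASE as proposed above could look roughly like
the sketch below (an illustration only, not Michel's actual posted code nor a
V3: the counter tracks only fastpath holds, slow-path readers simply hold the
fallback rwlock across the whole critical section, and lockdep annotations are
omitted):

	void lg_rwlock_local_read_lock(struct lgrwlock *lgrw)
	{
		preempt_disable();
		if (__this_cpu_read(*lgrw->reader_refcnt) ||
		    arch_spin_trylock(this_cpu_ptr(lgrw->lglock.lock))) {
			/* fastpath hold, or nested under a fastpath hold */
			__this_cpu_inc(*lgrw->reader_refcnt);
			return;
		}
		/* slowpath: counter stays 0, we just hold the fallback rwlock */
		read_lock(&lgrw->fallback_rwlock);
	}

	void lg_rwlock_local_read_unlock(struct lgrwlock *lgrw)
	{
		if (__this_cpu_read(*lgrw->reader_refcnt)) {
			/* outermost fastpath unlock releases the percpu lock */
			if (!__this_cpu_dec_return(*lgrw->reader_refcnt))
				arch_spin_unlock(this_cpu_ptr(lgrw->lglock.lock));
		} else {
			read_unlock(&lgrw->fallback_rwlock);
		}
		preempt_enable();
	}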


Michel Lespinasse March 5, 2013, 4:32 p.m. UTC | #9
On Tue, Mar 5, 2013 at 7:54 AM, Lai Jiangshan <laijs@cn.fujitsu.com> wrote:
> On 03/03/13 01:06, Oleg Nesterov wrote:
>> On 03/02, Michel Lespinasse wrote:
>>>
>>> My version would be slower if it needs to take the
>>> slow path in a reentrant way, but I'm not sure it matters either :)
>>
>> I'd say, this doesn't matter at all, simply because this can only happen
>> if we race with the active writer.
>>
>
> It can also happen when interrupted. (still very rarely)
>
> arch_spin_trylock()
>         ------->interrupted,
>                 __this_cpu_read() returns 0.
>                 arch_spin_trylock() fails
>                 slowpath, any nested will be slowpath too.
>                 ...
>                 ..._read_unlock()
>         <-------interrupt
> __this_cpu_inc()
> ....

Yes (and I think this is actually the most likely way for it to happen).

We do need this to work correctly, but I don't expect we need it to be fast.
(could be wrong, this is only my intuition)
Oleg Nesterov March 5, 2013, 4:35 p.m. UTC | #10
On 03/05, Lai Jiangshan wrote:
>
> On 03/03/13 01:06, Oleg Nesterov wrote:
> > On 03/02, Michel Lespinasse wrote:
> >>
> >> My version would be slower if it needs to take the
> >> slow path in a reentrant way, but I'm not sure it matters either :)
> >
> > I'd say, this doesn't matter at all, simply because this can only happen
> > if we race with the active writer.
>
> It can also happen when interrupted. (still very rarely)
>
> arch_spin_trylock()
> 	------->interrupted,
> 		__this_cpu_read() returns 0.
> 		arch_spin_trylock() fails
> 		slowpath, any nested will be slowpath too.
> 		...
> 		..._read_unlock()
> 	<-------interrupt
> __this_cpu_inc()
> ....

Yes sure. Or it can take the local lock after we already take the global
fallback_lock.

But the same can happen with FALLBACK_BASE, just because we need to take
a lock (local or global) first, then increment the counter.

> (I worries to much. I tend to remove FALLBACK_BASE now, we should
> add it only after we proved we needed it, this part is not proved)

Agreed, great ;)

Oleg.


Patch

diff --git a/include/linux/lglock.h b/include/linux/lglock.h
index 0d24e93..afb22bd 100644
--- a/include/linux/lglock.h
+++ b/include/linux/lglock.h
@@ -67,4 +67,39 @@  void lg_local_unlock_cpu(struct lglock *lg, int cpu);
 void lg_global_lock(struct lglock *lg);
 void lg_global_unlock(struct lglock *lg);
 
+struct lgrwlock {
+	unsigned long __percpu *reader_refcnt;
+	struct lglock lglock;
+	rwlock_t fallback_rwlock;
+};
+
+#define __DEFINE_LGRWLOCK_PERCPU_DATA(name)				\
+	static DEFINE_PER_CPU(arch_spinlock_t, name ## _lock)		\
+	= __ARCH_SPIN_LOCK_UNLOCKED;					\
+	static DEFINE_PER_CPU(unsigned long, name ## _refcnt);
+
+#define __LGRWLOCK_INIT(name)						\
+	{								\
+		.reader_refcnt = &name ## _refcnt,			\
+		.lglock = { .lock = &name ## _lock },			\
+		.fallback_rwlock = __RW_LOCK_UNLOCKED(name.fallback_rwlock)\
+	}
+
+#define DEFINE_LGRWLOCK(name)						\
+	__DEFINE_LGRWLOCK_PERCPU_DATA(name)				\
+	struct lgrwlock name = __LGRWLOCK_INIT(name)
+
+#define DEFINE_STATIC_LGRWLOCK(name)					\
+	__DEFINE_LGRWLOCK_PERCPU_DATA(name)				\
+	static struct lgrwlock name = __LGRWLOCK_INIT(name)
+
+static inline void lg_rwlock_init(struct lgrwlock *lgrw, char *name)
+{
+	lg_lock_init(&lgrw->lglock, name);
+}
+
+void lg_rwlock_local_read_lock(struct lgrwlock *lgrw);
+void lg_rwlock_local_read_unlock(struct lgrwlock *lgrw);
+void lg_rwlock_global_write_lock(struct lgrwlock *lgrw);
+void lg_rwlock_global_write_unlock(struct lgrwlock *lgrw);
 #endif
diff --git a/kernel/lglock.c b/kernel/lglock.c
index 6535a66..6c60cf7 100644
--- a/kernel/lglock.c
+++ b/kernel/lglock.c
@@ -87,3 +87,55 @@  void lg_global_unlock(struct lglock *lg)
 	preempt_enable();
 }
 EXPORT_SYMBOL(lg_global_unlock);
+
+#define FALLBACK_BASE	(1UL << 30)
+
+void lg_rwlock_local_read_lock(struct lgrwlock *lgrw)
+{
+	struct lglock *lg = &lgrw->lglock;
+
+	preempt_disable();
+	rwlock_acquire_read(&lg->lock_dep_map, 0, 0, _RET_IP_);
+	if (likely(!__this_cpu_read(*lgrw->reader_refcnt))) {
+		if (!arch_spin_trylock(this_cpu_ptr(lg->lock))) {
+			read_lock(&lgrw->fallback_rwlock);
+			__this_cpu_add(*lgrw->reader_refcnt, FALLBACK_BASE);
+		}
+	}
+
+	__this_cpu_inc(*lgrw->reader_refcnt);
+}
+EXPORT_SYMBOL(lg_rwlock_local_read_lock);
+
+void lg_rwlock_local_read_unlock(struct lgrwlock *lgrw)
+{
+	switch (__this_cpu_dec_return(*lgrw->reader_refcnt)) {
+	case 0:
+		lg_local_unlock(&lgrw->lglock);
+		return;
+	case FALLBACK_BASE:
+		__this_cpu_sub(*lgrw->reader_refcnt, FALLBACK_BASE);
+		read_unlock(&lgrw->fallback_rwlock);
+		break;
+	default:
+		break;
+	}
+
+	rwlock_release(&lg->lock_dep_map, 1, _RET_IP_);
+	preempt_enable();
+}
+EXPORT_SYMBOL(lg_rwlock_local_read_unlock);
+
+void lg_rwlock_global_write_lock(struct lgrwlock *lgrw)
+{
+	lg_global_lock(&lgrw->lglock);
+	write_lock(&lgrw->fallback_rwlock);
+}
+EXPORT_SYMBOL(lg_rwlock_global_write_lock);
+
+void lg_rwlock_global_write_unlock(struct lgrwlock *lgrw)
+{
+	write_unlock(&lgrw->fallback_rwlock);
+	lg_global_unlock(&lgrw->lglock);
+}
+EXPORT_SYMBOL(lg_rwlock_global_write_unlock);