
[RFC,1/2] rwsem: introduce upgrade_read interface

Message ID 20241016043600.35139-2-lizhe.67@bytedance.com (mailing list archive)
State New
Series: rwsem: introduce upgrade_read interface

Commit Message

lizhe.67@bytedance.com Oct. 16, 2024, 4:35 a.m. UTC
From: Li Zhe <lizhe.67@bytedance.com>

Introduce a new rwsem interface, upgrade_read(). It can be called to
upgrade a held read lock to a write lock. The interface waits for all
other readers to exit before obtaining the write lock. In addition, it
has higher priority than any process waiting for the write lock and
than subsequent threads that want to obtain the read lock.
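
A minimal sketch of the intended calling pattern (the mmap_lock caller is
illustrative only; on failure the caller still holds the read lock):

        down_read(&mm->mmap_lock);
        /* ... work that only needs the read lock ... */
        if (upgrade_read(&mm->mmap_lock) == 0) {
                /* upgraded: we now hold the write lock */
                up_write(&mm->mmap_lock);
        } else {
                /* a writer or another upgrader got there first; keep reading */
                up_read(&mm->mmap_lock);
        }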

Signed-off-by: Li Zhe <lizhe.67@bytedance.com>
---
 include/linux/rwsem.h  |  1 +
 kernel/locking/rwsem.c | 87 ++++++++++++++++++++++++++++++++++++++++--
 2 files changed, 85 insertions(+), 3 deletions(-)

Comments

Christoph Hellwig Oct. 16, 2024, 4:56 a.m. UTC | #1
On Wed, Oct 16, 2024 at 12:35:59PM +0800, lizhe.67@bytedance.com wrote:
> From: Li Zhe <lizhe.67@bytedance.com>
> 
> Introduce a new rwsem interface upgrade_read(). We can call it

It's obviously a try_upgrade_read, right?

> +extern int upgrade_read(struct rw_semaphore *sem);

No need for the extern.  Also please make any new advanced functionality
like this an EXPORT_SYMBOL_GPL.  That is if you actually had a modular
user, the current one is built-in.  (although we could use this in
XFS if it lands).
lizhe.67@bytedance.com Oct. 16, 2024, 7:33 a.m. UTC | #2
On Tue, 15 Oct 2024 21:56:51 -0700, hch@infradead.org wrote:

>On Wed, Oct 16, 2024 at 12:35:59PM +0800, lizhe.67@bytedance.com wrote:
>> From: Li Zhe <lizhe.67@bytedance.com>
>>
>> Introduce a new rwsem interface upgrade_read(). We can call it
>
>It's obviously a try_upgrade_read, right?

Yes, try_upgrade_read is a better name; I will fix it in v2. Thanks!

>
>> +extern int upgrade_read(struct rw_semaphore *sem);
>
>No need for the extern.  Also please make any new advanced functionality

Sorry I don't understand why extern is not needed. If we remove it,
we will encounter the following compilation error:

mm/khugepaged.c:1145:6: error: implicit declaration of function 'upgrade_read'; did you mean 'i_gid_read'? [-Werror=implicit-function-declaration]
  if (upgrade_read(&mm->mmap_lock)) {
      ^~~~~~~~~~~~

>like this an EXPORT_SYMBOL_GPL.  That is if you actually had a modular
>user, the current one is built-in.  (although we could use this in
>XFS if it lands).

I see that all advanced functionality in kernel/locking/rwsem.c is tagged with
EXPORT_SYMBOL (which is what my patch uses), and I'm not sure whether I need to
change it to EXPORT_SYMBOL_GPL.
I hope to see this interface used by XFS.
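
For reference, changing the export in v2 would be a one-line switch at the
end of the new code in kernel/locking/rwsem.c:

        -EXPORT_SYMBOL(upgrade_read);
        +EXPORT_SYMBOL_GPL(upgrade_read);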
Christoph Hellwig Oct. 16, 2024, 7:36 a.m. UTC | #3
On Wed, Oct 16, 2024 at 03:33:40PM +0800, lizhe.67@bytedance.com wrote:
> >> +extern int upgrade_read(struct rw_semaphore *sem);
> >
> >No need for the extern.  Also please make any new advanced functionality
> 
> Sorry I don't understand why extern is not needed. If we remove it,
> we will encounter the following compilation error:

That sounds like you dropped the entire line above, and not just the
"extern ".
lizhe.67@bytedance.com Oct. 16, 2024, 8 a.m. UTC | #4
On Wed, 16 Oct 2024 00:36:19 -0700, hch@infradead.org wrote:

>> >> +extern int upgrade_read(struct rw_semaphore *sem);
>> >
>> >No need for the extern.  Also please make any new advanced functionality
>>
>> Sorry I don't understand why extern is not needed. If we remove it,
>> we will encounter the following compilation error:
>
>That sounds like you dropped the entire line above, and not just the
>"extern ".

OK, I know what you mean. It is indeed OK to remove "extern", but all function
declarations in rwsem.h have the "extern" prefix. I think it would be
better to keep it consistent.

Thanks!
Christoph Hellwig Oct. 16, 2024, 8:03 a.m. UTC | #5
On Wed, Oct 16, 2024 at 04:00:57PM +0800, lizhe.67@bytedance.com wrote:
> OK I know what you mean. It is indeed OK to remove "extern", but all function
> declarations in the rwsem.h have the "extern" prefix. I think it would be
> better to keep it consistent.

They are pointless and should never be added.
lizhe.67@bytedance.com Oct. 16, 2024, 8:13 a.m. UTC | #6
On Wed, 16 Oct 2024 01:03:33 -0700, hch@infradead.org wrote:

>> OK I know what you mean. It is indeed OK to remove "extern", but all function
>> declarations in the rwsem.h have the "extern" prefix. I think it would be
>> better to keep it consistent.
>
>They are pointless and should never be added.

OK I will remove it in v2. Thanks!
Matthew Wilcox (Oracle) Oct. 16, 2024, 11:49 a.m. UTC | #7
On Wed, Oct 16, 2024 at 12:35:59PM +0800, lizhe.67@bytedance.com wrote:
> +++ b/include/linux/rwsem.h
> @@ -249,6 +249,7 @@ DEFINE_GUARD_COND(rwsem_write, _try, down_write_trylock(_T))
>   * downgrade write lock to read lock
>   */
>  extern void downgrade_write(struct rw_semaphore *sem);
> +extern int upgrade_read(struct rw_semaphore *sem);

This needs __must_check, and I think this should be a bool, not an errno
return.
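
A declaration along those lines might look like this (a sketch of the
suggested v2 API, not the code as posted):

        bool __must_check upgrade_read(struct rw_semaphore *sem);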
Matthew Wilcox (Oracle) Oct. 16, 2024, 11:51 a.m. UTC | #8
On Tue, Oct 15, 2024 at 09:56:51PM -0700, Christoph Hellwig wrote:
> On Wed, Oct 16, 2024 at 12:35:59PM +0800, lizhe.67@bytedance.com wrote:
> > From: Li Zhe <lizhe.67@bytedance.com>
> > 
> > Introduce a new rwsem interface upgrade_read(). We can call it
> 
> It's obviously a try_upgrade_read, right?

Well, that's confusing.  "try" usually means "don't sleep", and this
sleeps.  Maybe it shouldn't sleep; ie we make this fail if there's any
other reader?  It'll succeed less often, but it'll be easier to
understand.
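
A non-sleeping variant along those lines would essentially be just the fast
path of the posted __upgrade_read(): fail unless the caller is the only
reader and no writer is present.  A sketch reusing the internal bit
definitions from kernel/locking/rwsem.c (the name rwsem_try_upgrade_read()
is made up here):

        static inline bool rwsem_try_upgrade_read(struct rw_semaphore *sem)
        {
                long tmp;

                preempt_disable();
                tmp = atomic_long_read(&sem->count);
                do {
                        /* give up if a writer or any other reader is present */
                        if ((tmp & RWSEM_WRITER_MASK) ||
                            (tmp & RWSEM_READER_MASK) != RWSEM_READER_BIAS) {
                                preempt_enable();
                                return false;
                        }
                } while (!atomic_long_try_cmpxchg(&sem->count, &tmp,
                                tmp - RWSEM_READER_BIAS + RWSEM_WRITER_LOCKED));

                rwsem_set_owner(sem);
                preempt_enable();
                return true;
        }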
Christoph Hellwig Oct. 16, 2024, 12:21 p.m. UTC | #9
On Wed, Oct 16, 2024 at 12:51:23PM +0100, Matthew Wilcox wrote:
> > It's obviously a try_upgrade_read, right?
> 
> Well, that's confusing.  "try" usually means "don't sleep", and this
> sleeps.  Maybe it shouldn't sleep; ie we make this fail if there's any
> other reader?  It'll succeed less often, but it'll be easier to
> understand.

To me try primarily implies that it can fail and the return value
needs to be checked.  But I guess it has different implications to
different people.
Waiman Long Oct. 16, 2024, 2:23 p.m. UTC | #10
On 10/16/24 12:35 AM, lizhe.67@bytedance.com wrote:
> From: Li Zhe <lizhe.67@bytedance.com>
>
> Introduce a new rwsem interface upgrade_read(). We can call it
> to upgrade the lock into write rwsem lock after we get read lock.
> This interface will wait for all readers to exit before obtaining
> the write lock. In addition, this interface has a higher priority
> than any process waiting for the write lock and subsequent threads
> that want to obtain the read lock.
>
> Signed-off-by: Li Zhe <lizhe.67@bytedance.com>
> ---
>   include/linux/rwsem.h  |  1 +
>   kernel/locking/rwsem.c | 87 ++++++++++++++++++++++++++++++++++++++++--
>   2 files changed, 85 insertions(+), 3 deletions(-)
>
> diff --git a/include/linux/rwsem.h b/include/linux/rwsem.h
> index c8b543d428b0..90183ab5ea79 100644
> --- a/include/linux/rwsem.h
> +++ b/include/linux/rwsem.h
> @@ -249,6 +249,7 @@ DEFINE_GUARD_COND(rwsem_write, _try, down_write_trylock(_T))
>    * downgrade write lock to read lock
>    */
>   extern void downgrade_write(struct rw_semaphore *sem);
> +extern int upgrade_read(struct rw_semaphore *sem);
>   
>   #ifdef CONFIG_DEBUG_LOCK_ALLOC
>   /*
> diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
> index 2bbb6eca5144..0583e1be3dbf 100644
> --- a/kernel/locking/rwsem.c
> +++ b/kernel/locking/rwsem.c
> @@ -37,6 +37,7 @@
>    * meanings when set.
>    *  - Bit 0: RWSEM_READER_OWNED - rwsem may be owned by readers (just a hint)
>    *  - Bit 1: RWSEM_NONSPINNABLE - Cannot spin on a reader-owned lock
> + *  - Bit 2: RWSEM_UPGRADING    - doing upgrade read process
>    *
>    * When the rwsem is reader-owned and a spinning writer has timed out,
>    * the nonspinnable bit will be set to disable optimistic spinning.
> @@ -62,7 +63,8 @@
>    */
>   #define RWSEM_READER_OWNED	(1UL << 0)
>   #define RWSEM_NONSPINNABLE	(1UL << 1)
> -#define RWSEM_OWNER_FLAGS_MASK	(RWSEM_READER_OWNED | RWSEM_NONSPINNABLE)
> +#define RWSEM_UPGRADING		(1UL << 2)
> +#define RWSEM_OWNER_FLAGS_MASK	(RWSEM_READER_OWNED | RWSEM_NONSPINNABLE | RWSEM_UPGRADING)
>   
>   #ifdef CONFIG_DEBUG_RWSEMS
>   # define DEBUG_RWSEMS_WARN_ON(c, sem)	do {			\
> @@ -93,7 +95,8 @@
>    * Bit  0    - writer locked bit
>    * Bit  1    - waiters present bit
>    * Bit  2    - lock handoff bit
> - * Bits 3-7  - reserved
> + * Bit  3    - upgrade read bit
> + * Bits 4-7  - reserved
>    * Bits 8-30 - 23-bit reader count
>    * Bit  31   - read fail bit
>    *
> @@ -117,6 +120,7 @@
>   #define RWSEM_WRITER_LOCKED	(1UL << 0)
>   #define RWSEM_FLAG_WAITERS	(1UL << 1)
>   #define RWSEM_FLAG_HANDOFF	(1UL << 2)
> +#define RWSEM_FLAG_UPGRADE_READ	(1UL << 3)
>   #define RWSEM_FLAG_READFAIL	(1UL << (BITS_PER_LONG - 1))
>   
>   #define RWSEM_READER_SHIFT	8
> @@ -143,6 +147,13 @@ static inline void rwsem_set_owner(struct rw_semaphore *sem)
>   	atomic_long_set(&sem->owner, (long)current);
>   }
>   
> +static inline void rwsem_set_owner_upgrade(struct rw_semaphore *sem)
> +{
> +	lockdep_assert_preemption_disabled();
> +	atomic_long_set(&sem->owner, (long)current | RWSEM_UPGRADING |
> +			RWSEM_READER_OWNED | RWSEM_NONSPINNABLE);
> +}

Because of possible racing between two competing upgraders, the read-lock 
owner setting has to be atomic to avoid one overwriting the other.


> +
>   static inline void rwsem_clear_owner(struct rw_semaphore *sem)
>   {
>   	lockdep_assert_preemption_disabled();
> @@ -201,7 +212,7 @@ static inline bool is_rwsem_reader_owned(struct rw_semaphore *sem)
>   	 */
>   	long count = atomic_long_read(&sem->count);
>   
> -	if (count & RWSEM_WRITER_MASK)
> +	if ((count & RWSEM_WRITER_MASK) && !(count & RWSEM_FLAG_UPGRADE_READ))
>   		return false;
>   	return rwsem_test_oflags(sem, RWSEM_READER_OWNED);
>   }
> @@ -1336,6 +1347,8 @@ static inline int __down_write_trylock(struct rw_semaphore *sem)
>   static inline void __up_read(struct rw_semaphore *sem)
>   {
>   	long tmp;
> +	unsigned long flags;
> +	struct task_struct *owner;
>   
>   	DEBUG_RWSEMS_WARN_ON(sem->magic != sem, sem);
>   	DEBUG_RWSEMS_WARN_ON(!is_rwsem_reader_owned(sem), sem);
> @@ -1349,6 +1362,9 @@ static inline void __up_read(struct rw_semaphore *sem)
>   		clear_nonspinnable(sem);
>   		rwsem_wake(sem);
>   	}
> +	owner = rwsem_owner_flags(sem, &flags);
> +	if (unlikely(!(tmp & RWSEM_READER_MASK) && (flags & RWSEM_UPGRADING)))
> +		wake_up_process(owner);
>   	preempt_enable();
>   }
>   
> @@ -1641,6 +1657,71 @@ void downgrade_write(struct rw_semaphore *sem)
>   }
>   EXPORT_SYMBOL(downgrade_write);
>   
> +static inline void rwsem_clear_upgrade_flag(struct rw_semaphore *sem)
> +{
> +	atomic_long_andnot(RWSEM_FLAG_UPGRADE_READ, &sem->count);
> +}
> +
> +/*
> + * upgrade read lock to write lock
> + */
> +static inline int __upgrade_read(struct rw_semaphore *sem)
> +{
> +	long tmp;
> +
> +	preempt_disable();
> +
> +	tmp = atomic_long_read(&sem->count);
> +	do {
> +		if (tmp & (RWSEM_WRITER_MASK | RWSEM_FLAG_UPGRADE_READ)) {
> +			preempt_enable();
> +			return -EBUSY;
> +		}
> +	} while (!atomic_long_try_cmpxchg(&sem->count, &tmp,
> +		tmp + RWSEM_FLAG_UPGRADE_READ + RWSEM_WRITER_LOCKED - RWSEM_READER_BIAS));
> +
> +	if ((tmp & RWSEM_READER_MASK) == RWSEM_READER_BIAS) {
> +		/* fast path */
> +		DEBUG_RWSEMS_WARN_ON(sem->magic != sem, sem);
> +		rwsem_clear_upgrade_flag(sem);
> +		rwsem_set_owner(sem);
> +		preempt_enable();
> +		return 0;
> +	}
> +	/* slow path */
> +	raw_spin_lock_irq(&sem->wait_lock);
> +	rwsem_set_owner_upgrade(sem);
> +
> +	set_current_state(TASK_UNINTERRUPTIBLE);
> +
> +	for (;;) {
> +		if (!(atomic_long_read(&sem->count) & RWSEM_READER_MASK))
> +			break;
> +		raw_spin_unlock_irq(&sem->wait_lock);
> +		schedule_preempt_disabled();
> +		set_current_state(TASK_UNINTERRUPTIBLE);
> +		raw_spin_lock_irq(&sem->wait_lock);
> +	}
> +
> +	rwsem_clear_upgrade_flag(sem);
> +	rwsem_set_owner(sem);
> +	__set_current_state(TASK_RUNNING);
> +	raw_spin_unlock_irq(&sem->wait_lock);
> +	preempt_enable();
> +	return 0;
> +}
> +
> +/*
> + * upgrade read lock to write lock
> + *
> + * Return: 0 on success, error code on failure
> + */
> +int upgrade_read(struct rw_semaphore *sem)
> +{
> +	return __upgrade_read(sem);
> +}
> +EXPORT_SYMBOL(upgrade_read);

This new interface should have an API similar to a trylock. True if 
successful, false otherwise. I like the read_try_upgrade() name.

Another alternative that I have been thinking about is a down_read() 
variant with intention to upgrade later. This will ensure that only one 
active reader is allowed to upgrade later. With this, upgrade_read() 
will always succeed, maybe with some sleeping, as long as the correct 
down_read() is used.
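
Illustrated as a possible calling convention (hypothetical names, sketch
only):

        down_read_intent(&sem);   /* at most one such reader per rwsem */
        /* ... read-side work ... */
        upgrade_read(&sem);       /* may sleep for other readers, cannot fail */
        /* ... write-side work ... */
        up_write(&sem);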

Cheers,
Longman
Matthew Wilcox (Oracle) Oct. 16, 2024, 6:05 p.m. UTC | #11
On Wed, Oct 16, 2024 at 10:23:14AM -0400, Waiman Long wrote:
> Another alternative that I have been thinking about is a down_read() variant
> with intention to upgrade later. This will ensure that only one active
> reader is allowed to upgrade later. With this, upgrade_read() will always
> succeed, maybe with some sleeping, as long as the correct down_read() is
> used.

How is that different from Kent's SIX locks other than you can take an
rwsem for write immediately (SIX locks have to be taken for Intent and
then Upgraded)?
Waiman Long Oct. 16, 2024, 6:39 p.m. UTC | #12
On 10/16/24 2:05 PM, Matthew Wilcox wrote:
> On Wed, Oct 16, 2024 at 10:23:14AM -0400, Waiman Long wrote:
>> Another alternative that I have been thinking about is a down_read() variant
>> with intention to upgrade later. This will ensure that only one active
>> reader is allowed to upgrade later. With this, upgrade_read() will always
>> succeed, maybe with some sleeping, as long as the correct down_read() is
>> used.
> How is that different from Kent's SIX locks other than you can take an
> rwsem for write immediately (SIX locks have to be taken for Intent and
> then Upgraded)?

Yes, it is modeled after Kent's SIX locks. The goal is to eventually 
eliminate the need for a separate SIX lock.

Cheers,
Longman
lizhe.67@bytedance.com Oct. 17, 2024, 6:23 a.m. UTC | #13
On Wed, 16 Oct 2024 12:49:44 +0100, willy@infradead.org wrote:

>On Wed, Oct 16, 2024 at 12:35:59PM +0800, lizhe.67@bytedance.com wrote:
>> +++ b/include/linux/rwsem.h
>> @@ -249,6 +249,7 @@ DEFINE_GUARD_COND(rwsem_write, _try, down_write_trylock(_T))
>>   * downgrade write lock to read lock
>>   */
>>  extern void downgrade_write(struct rw_semaphore *sem);
>> +extern int upgrade_read(struct rw_semaphore *sem);
>
>This needs __must_check, and I think this should be a bool, not an errno
>return.

I can't agree more. I will fix it in v2.

Thanks!
lizhe.67@bytedance.com Oct. 17, 2024, 6:46 a.m. UTC | #14
On Wed, 16 Oct 2024 10:23:14 -0400, llong@redhat.com wrote:

>> +static inline void rwsem_set_owner_upgrade(struct rw_semaphore *sem)
>> +{
>> +	lockdep_assert_preemption_disabled();
>> +	atomic_long_set(&sem->owner, (long)current | RWSEM_UPGRADING |
>> +			RWSEM_READER_OWNED | RWSEM_NONSPINNABLE);
>> +}
>
>Because of possible  racing between 2 competing upgraders, read lock 
>owner setting has to be atomic to avoid one overwriting the others.

I already handle the race between competing upgraders at the very beginning
of __upgrade_read(). Is additional handling still needed here? The relevant
code is as follows.

+static inline int __upgrade_read(struct rw_semaphore *sem)
+{
+	long tmp;
+
+	preempt_disable();
+
+	tmp = atomic_long_read(&sem->count);
+	do {
+		if (tmp & (RWSEM_WRITER_MASK | RWSEM_FLAG_UPGRADE_READ)) {
+			preempt_enable();
+			return -EBUSY;
+		}
+	} while (!atomic_long_try_cmpxchg(&sem->count, &tmp,
+		tmp + RWSEM_FLAG_UPGRADE_READ + RWSEM_WRITER_LOCKED - RWSEM_READER_BIAS));

>This new interface should have an API similar to a trylock. True if 
>successful, false otherwise. I like the read_try_upgrade() name.

I can't agree more. I will add a read_try_upgrade() API in v2.

>Another alternative that I have been thinking about is a down_read() 
>variant with intention to upgrade later. This will ensure that only one 
>active reader is allowed to upgrade later. With this, upgrade_read() 
>will always succeed, maybe with some sleeping, as long as the correct 
>down_read() is used.

I haven't figured out how to do this yet. If the current upgrade_read
approach is also acceptable, should I continue to complete this patchset
along those lines?

>Cheers,
>Longman

Thanks for your review!
Christoph Hellwig Oct. 17, 2024, 3:05 p.m. UTC | #15
On Wed, Oct 16, 2024 at 10:23:14AM -0400, Waiman Long wrote:
> Another alternative that I have been thinking about is a down_read() variant
> with intention to upgrade later. This will ensure that only one active
> reader is allowed to upgrade later. With this, upgrade_read() will always
> succeed, maybe with some sleeping, as long as the correct down_read() is
> used.

At least for the XFS use case, where direct I/O takes a shared lock
that needs to be replaced with an exclusive one for certain kinds of
I/O, this would be useless.  But then again, we've survived without this
operation for a long time, despite the initial port bringing one over
from IRIX.
Waiman Long Oct. 17, 2024, 5:36 p.m. UTC | #16
On 10/17/24 11:05 AM, Christoph Hellwig wrote:
> On Wed, Oct 16, 2024 at 10:23:14AM -0400, Waiman Long wrote:
>> Another alternative that I have been thinking about is a down_read() variant
>> with intention to upgrade later. This will ensure that only one active
>> reader is allowed to upgrade later. With this, upgrade_read() will always
>> succeed, maybe with some sleeping, as long as the correct down_read() is
>> used.
> At least for the XFS use case where direct I/O takes a share lock
> that needs to be replaced with an exclusive one for certain kinds of
> I/O would be useless.  But then again we've survived without this
> operation for a long time, despite the initial port bringing one over
> from IRIX.

That means XFS needs to upgrade to a write lock only in certain cases, 
not all of them. Right? In that case, a read_try_upgrade() that 
attempts to upgrade to a write lock will be useful.

Cheers,
Longman
Christoph Hellwig Oct. 18, 2024, 5:06 a.m. UTC | #17
On Thu, Oct 17, 2024 at 01:36:41PM -0400, Waiman Long wrote:
> > At least for the XFS use case where direct I/O takes a share lock
> > that needs to be replaced with an exclusive one for certain kinds of
> > I/O would be useless.  But then again we've survived without this
> > operation for a long time, despite the initial port bringing one over
> > from IRIX.
> 
> That means XFS only needs to upgrade to a write lock in certain cases only,
> not all of them.

Yes.  Basically when we detect that a direct I/O write needs to clear the
SUID bit or other metadata, or when it is an extending write.

> Right? In that case, read_try_upgrade() that attempts to
> upgrade to a write lock will be useful.

Yes.  But I also understand Peter's reasoning that it will be very hard
to actually have a useful implementation that does better than just
unlocking (which is what we do currently).
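
For context, the "just unlocking" fallback is the classic drop-and-relock
pattern, which requires revalidating state after the lock is reacquired
(generic sketch, not the actual XFS code):

        up_read(sem);
        down_write(sem);
        /* the lock was dropped; recheck whether the exclusive work is still needed */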

Patch

diff --git a/include/linux/rwsem.h b/include/linux/rwsem.h
index c8b543d428b0..90183ab5ea79 100644
--- a/include/linux/rwsem.h
+++ b/include/linux/rwsem.h
@@ -249,6 +249,7 @@  DEFINE_GUARD_COND(rwsem_write, _try, down_write_trylock(_T))
  * downgrade write lock to read lock
  */
 extern void downgrade_write(struct rw_semaphore *sem);
+extern int upgrade_read(struct rw_semaphore *sem);
 
 #ifdef CONFIG_DEBUG_LOCK_ALLOC
 /*
diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
index 2bbb6eca5144..0583e1be3dbf 100644
--- a/kernel/locking/rwsem.c
+++ b/kernel/locking/rwsem.c
@@ -37,6 +37,7 @@ 
  * meanings when set.
  *  - Bit 0: RWSEM_READER_OWNED - rwsem may be owned by readers (just a hint)
  *  - Bit 1: RWSEM_NONSPINNABLE - Cannot spin on a reader-owned lock
+ *  - Bit 2: RWSEM_UPGRADING    - doing upgrade read process
  *
  * When the rwsem is reader-owned and a spinning writer has timed out,
  * the nonspinnable bit will be set to disable optimistic spinning.
@@ -62,7 +63,8 @@ 
  */
 #define RWSEM_READER_OWNED	(1UL << 0)
 #define RWSEM_NONSPINNABLE	(1UL << 1)
-#define RWSEM_OWNER_FLAGS_MASK	(RWSEM_READER_OWNED | RWSEM_NONSPINNABLE)
+#define RWSEM_UPGRADING		(1UL << 2)
+#define RWSEM_OWNER_FLAGS_MASK	(RWSEM_READER_OWNED | RWSEM_NONSPINNABLE | RWSEM_UPGRADING)
 
 #ifdef CONFIG_DEBUG_RWSEMS
 # define DEBUG_RWSEMS_WARN_ON(c, sem)	do {			\
@@ -93,7 +95,8 @@ 
  * Bit  0    - writer locked bit
  * Bit  1    - waiters present bit
  * Bit  2    - lock handoff bit
- * Bits 3-7  - reserved
+ * Bit  3    - upgrade read bit
+ * Bits 4-7  - reserved
  * Bits 8-30 - 23-bit reader count
  * Bit  31   - read fail bit
  *
@@ -117,6 +120,7 @@ 
 #define RWSEM_WRITER_LOCKED	(1UL << 0)
 #define RWSEM_FLAG_WAITERS	(1UL << 1)
 #define RWSEM_FLAG_HANDOFF	(1UL << 2)
+#define RWSEM_FLAG_UPGRADE_READ	(1UL << 3)
 #define RWSEM_FLAG_READFAIL	(1UL << (BITS_PER_LONG - 1))
 
 #define RWSEM_READER_SHIFT	8
@@ -143,6 +147,13 @@  static inline void rwsem_set_owner(struct rw_semaphore *sem)
 	atomic_long_set(&sem->owner, (long)current);
 }
 
+static inline void rwsem_set_owner_upgrade(struct rw_semaphore *sem)
+{
+	lockdep_assert_preemption_disabled();
+	atomic_long_set(&sem->owner, (long)current | RWSEM_UPGRADING |
+			RWSEM_READER_OWNED | RWSEM_NONSPINNABLE);
+}
+
 static inline void rwsem_clear_owner(struct rw_semaphore *sem)
 {
 	lockdep_assert_preemption_disabled();
@@ -201,7 +212,7 @@  static inline bool is_rwsem_reader_owned(struct rw_semaphore *sem)
 	 */
 	long count = atomic_long_read(&sem->count);
 
-	if (count & RWSEM_WRITER_MASK)
+	if ((count & RWSEM_WRITER_MASK) && !(count & RWSEM_FLAG_UPGRADE_READ))
 		return false;
 	return rwsem_test_oflags(sem, RWSEM_READER_OWNED);
 }
@@ -1336,6 +1347,8 @@  static inline int __down_write_trylock(struct rw_semaphore *sem)
 static inline void __up_read(struct rw_semaphore *sem)
 {
 	long tmp;
+	unsigned long flags;
+	struct task_struct *owner;
 
 	DEBUG_RWSEMS_WARN_ON(sem->magic != sem, sem);
 	DEBUG_RWSEMS_WARN_ON(!is_rwsem_reader_owned(sem), sem);
@@ -1349,6 +1362,9 @@  static inline void __up_read(struct rw_semaphore *sem)
 		clear_nonspinnable(sem);
 		rwsem_wake(sem);
 	}
+	owner = rwsem_owner_flags(sem, &flags);
+	if (unlikely(!(tmp & RWSEM_READER_MASK) && (flags & RWSEM_UPGRADING)))
+		wake_up_process(owner);
 	preempt_enable();
 }
 
@@ -1641,6 +1657,71 @@  void downgrade_write(struct rw_semaphore *sem)
 }
 EXPORT_SYMBOL(downgrade_write);
 
+static inline void rwsem_clear_upgrade_flag(struct rw_semaphore *sem)
+{
+	atomic_long_andnot(RWSEM_FLAG_UPGRADE_READ, &sem->count);
+}
+
+/*
+ * upgrade read lock to write lock
+ */
+static inline int __upgrade_read(struct rw_semaphore *sem)
+{
+	long tmp;
+
+	preempt_disable();
+
+	tmp = atomic_long_read(&sem->count);
+	do {
+		if (tmp & (RWSEM_WRITER_MASK | RWSEM_FLAG_UPGRADE_READ)) {
+			preempt_enable();
+			return -EBUSY;
+		}
+	} while (!atomic_long_try_cmpxchg(&sem->count, &tmp,
+		tmp + RWSEM_FLAG_UPGRADE_READ + RWSEM_WRITER_LOCKED - RWSEM_READER_BIAS));
+
+	if ((tmp & RWSEM_READER_MASK) == RWSEM_READER_BIAS) {
+		/* fast path */
+		DEBUG_RWSEMS_WARN_ON(sem->magic != sem, sem);
+		rwsem_clear_upgrade_flag(sem);
+		rwsem_set_owner(sem);
+		preempt_enable();
+		return 0;
+	}
+	/* slow path */
+	raw_spin_lock_irq(&sem->wait_lock);
+	rwsem_set_owner_upgrade(sem);
+
+	set_current_state(TASK_UNINTERRUPTIBLE);
+
+	for (;;) {
+		if (!(atomic_long_read(&sem->count) & RWSEM_READER_MASK))
+			break;
+		raw_spin_unlock_irq(&sem->wait_lock);
+		schedule_preempt_disabled();
+		set_current_state(TASK_UNINTERRUPTIBLE);
+		raw_spin_lock_irq(&sem->wait_lock);
+	}
+
+	rwsem_clear_upgrade_flag(sem);
+	rwsem_set_owner(sem);
+	__set_current_state(TASK_RUNNING);
+	raw_spin_unlock_irq(&sem->wait_lock);
+	preempt_enable();
+	return 0;
+}
+
+/*
+ * upgrade read lock to write lock
+ *
+ * Return: 0 on success, error code on failure
+ */
+int upgrade_read(struct rw_semaphore *sem)
+{
+	return __upgrade_read(sem);
+}
+EXPORT_SYMBOL(upgrade_read);
+
 #ifdef CONFIG_DEBUG_LOCK_ALLOC
 
 void down_read_nested(struct rw_semaphore *sem, int subclass)