From patchwork Wed Sep 11 15:05:33 2019 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Waiman Long X-Patchwork-Id: 11141313 Return-Path: Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 0304614ED for ; Wed, 11 Sep 2019 15:06:15 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.kernel.org (Postfix) with ESMTP id D51F5207FC for ; Wed, 11 Sep 2019 15:06:14 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1728384AbfIKPGL (ORCPT ); Wed, 11 Sep 2019 11:06:11 -0400 Received: from mx1.redhat.com ([209.132.183.28]:42142 "EHLO mx1.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1727581AbfIKPGL (ORCPT ); Wed, 11 Sep 2019 11:06:11 -0400 Received: from smtp.corp.redhat.com (int-mx04.intmail.prod.int.phx2.redhat.com [10.5.11.14]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by mx1.redhat.com (Postfix) with ESMTPS id 2D53885360; Wed, 11 Sep 2019 15:06:10 +0000 (UTC) Received: from llong.com (ovpn-125-196.rdu2.redhat.com [10.10.125.196]) by smtp.corp.redhat.com (Postfix) with ESMTP id 7408B5D9E2; Wed, 11 Sep 2019 15:06:07 +0000 (UTC) From: Waiman Long To: Peter Zijlstra , Ingo Molnar , Will Deacon , Alexander Viro , Mike Kravetz Cc: linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, Davidlohr Bueso , Waiman Long Subject: [PATCH 1/5] locking/rwsem: Add down_write_timedlock() Date: Wed, 11 Sep 2019 16:05:33 +0100 Message-Id: <20190911150537.19527-2-longman@redhat.com> In-Reply-To: <20190911150537.19527-1-longman@redhat.com> References: <20190911150537.19527-1-longman@redhat.com> X-Scanned-By: MIMEDefang 2.79 on 10.5.11.14 X-Greylist: Sender IP whitelisted, not delayed by milter-greylist-4.5.16 (mx1.redhat.com [10.5.110.25]); Wed, 11 Sep 2019 15:06:10 +0000 (UTC) Sender: linux-fsdevel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-fsdevel@vger.kernel.org
There are cases where a task wants to acquire a rwsem but does not want to wait for an indefinite period of time. Instead, the task may want an alternative way of dealing with the inability to acquire the lock within a certain period of time. There are also cases where waiting indefinitely can potentially lead to deadlock. Emulating a timed lock with a trylock loop is inelegant: it increases cacheline contention and makes the actual wait time hard to control.

To address this, a new down_write_timedlock() variant is introduced that takes an additional ktime_t timeout argument (currently in ns) relative to now. With this new API, a task can wait for a given period of time and bail out if the lock cannot be acquired within that period. In practice, the actual wait time is likely to be somewhat longer than the requested timeout. Timeout checking is not done while optimistic spinning, so a timeout shorter than the scheduling period may be less accurate.

From the lockdep perspective, down_write_timedlock() is treated like down_write_trylock(). A similar down_read_timedlock() may be added later when the need arises.
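For illustration, a minimal usage sketch of the new API (not part of the patch; the semaphore example_sem, the helper name and the 5 ms budget are assumptions made for the example):

#include <linux/errno.h>
#include <linux/ktime.h>
#include <linux/rwsem.h>

static DECLARE_RWSEM(example_sem);

static int example_update_with_bounded_wait(void)
{
        /* Wait at most 5 ms (relative to now) for the write lock. */
        if (!down_write_timedlock(&example_sem, ms_to_ktime(5)))
                return -ETIMEDOUT;      /* bail out instead of blocking indefinitely */

        /* ... critical section protected by example_sem ... */

        up_write(&example_sem);
        return 0;
}

Patch 5 applies the same calling pattern to mapping->i_mmap_rwsem with a 10 ms budget and a fallback path when the lock cannot be acquired in time.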
Signed-off-by: Waiman Long --- include/linux/rwsem.h | 4 +- kernel/locking/lock_events_list.h | 1 + kernel/locking/rwsem.c | 85 +++++++++++++++++++++++++++++-- 3 files changed, 85 insertions(+), 5 deletions(-) diff --git a/include/linux/rwsem.h b/include/linux/rwsem.h index 00d6054687dd..b3c7c5afde46 100644 --- a/include/linux/rwsem.h +++ b/include/linux/rwsem.h @@ -15,6 +15,7 @@ #include #include #include +#include #include #ifdef CONFIG_RWSEM_SPIN_ON_OWNER #include @@ -139,9 +140,10 @@ extern void down_write(struct rw_semaphore *sem); extern int __must_check down_write_killable(struct rw_semaphore *sem); /* - * trylock for writing -- returns 1 if successful, 0 if contention + * trylock or timedlock for writing -- returns 1 if successful, 0 if failed */ extern int down_write_trylock(struct rw_semaphore *sem); +extern int down_write_timedlock(struct rw_semaphore *sem, ktime_t timeout); /* * release a read lock diff --git a/kernel/locking/lock_events_list.h b/kernel/locking/lock_events_list.h index 239039d0ce21..c2345e0472b0 100644 --- a/kernel/locking/lock_events_list.h +++ b/kernel/locking/lock_events_list.h @@ -69,3 +69,4 @@ LOCK_EVENT(rwsem_rlock_handoff) /* # of read lock handoffs */ LOCK_EVENT(rwsem_wlock) /* # of write locks acquired */ LOCK_EVENT(rwsem_wlock_fail) /* # of failed write lock acquisitions */ LOCK_EVENT(rwsem_wlock_handoff) /* # of write lock handoffs */ +LOCK_EVENT(rwsem_wlock_timeout) /* # of write lock timeouts */ diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c index eef04551eae7..c0285749c338 100644 --- a/kernel/locking/rwsem.c +++ b/kernel/locking/rwsem.c @@ -27,6 +27,7 @@ #include #include #include +#include #include "rwsem.h" #include "lock_events.h" @@ -988,6 +989,26 @@ rwsem_spin_on_owner(struct rw_semaphore *sem, unsigned long nonspinnable) #define OWNER_NULL 1 #endif +/* + * Set up the hrtimer to fire at a future time relative to now. + * Return: The hrtimer_sleeper pointer if success, or NULL if it + * has timed out. + */ +static inline struct hrtimer_sleeper * +rwsem_setup_hrtimer(struct hrtimer_sleeper *to, ktime_t timeout) +{ + ktime_t curtime = ns_to_ktime(sched_clock()); + + if (ktime_compare(curtime, timeout) >= 0) + return NULL; + + hrtimer_init_sleeper_on_stack(to, CLOCK_MONOTONIC, HRTIMER_MODE_REL); + hrtimer_set_expires_range_ns(&to->timer, timeout - curtime, + current->timer_slack_ns); + hrtimer_start_expires(&to->timer, HRTIMER_MODE_REL); + return to; +} + /* * Wait for the read lock to be granted */ @@ -1136,7 +1157,7 @@ static inline void rwsem_disable_reader_optspin(struct rw_semaphore *sem, * Wait until we successfully acquire the write lock */ static struct rw_semaphore * -rwsem_down_write_slowpath(struct rw_semaphore *sem, int state) +rwsem_down_write_slowpath(struct rw_semaphore *sem, int state, ktime_t timeout) { long count; bool disable_rspin; @@ -1144,6 +1165,13 @@ rwsem_down_write_slowpath(struct rw_semaphore *sem, int state) struct rwsem_waiter waiter; struct rw_semaphore *ret = sem; DEFINE_WAKE_Q(wake_q); + struct hrtimer_sleeper timer_sleeper, *to = NULL; + + /* + * The timeuot value is now the end time when the timer will expire. 
+ */ + if (timeout) + timeout = ktime_add_ns(timeout, sched_clock()); /* do optimistic spinning and steal lock if possible */ if (rwsem_can_spin_on_owner(sem, RWSEM_WR_NONSPINNABLE) && @@ -1235,6 +1263,15 @@ rwsem_down_write_slowpath(struct rw_semaphore *sem, int state) if (signal_pending_state(state, current)) goto out_nolock; + if (timeout) { + if (!to) + to = rwsem_setup_hrtimer(&timer_sleeper, + timeout); + if (!to || !to->task) { + lockevent_inc(rwsem_wlock_timeout); + goto out_nolock; + } + } schedule(); lockevent_inc(rwsem_sleep_writer); set_current_state(state); @@ -1273,6 +1310,11 @@ rwsem_down_write_slowpath(struct rw_semaphore *sem, int state) raw_spin_unlock_irq(&sem->wait_lock); lockevent_inc(rwsem_wlock); +out: + if (to) { + hrtimer_cancel(&to->timer); + destroy_hrtimer_on_stack(&to->timer); + } return ret; out_nolock: @@ -1291,7 +1333,8 @@ rwsem_down_write_slowpath(struct rw_semaphore *sem, int state) wake_up_q(&wake_q); lockevent_inc(rwsem_wlock_fail); - return ERR_PTR(-EINTR); + ret = ERR_PTR(timeout ? -ETIMEDOUT : -EINTR); + goto out; } /* @@ -1389,7 +1432,7 @@ static inline void __down_write(struct rw_semaphore *sem) if (unlikely(!atomic_long_try_cmpxchg_acquire(&sem->count, &tmp, RWSEM_WRITER_LOCKED))) - rwsem_down_write_slowpath(sem, TASK_UNINTERRUPTIBLE); + rwsem_down_write_slowpath(sem, TASK_UNINTERRUPTIBLE, 0); else rwsem_set_owner(sem); } @@ -1400,7 +1443,7 @@ static inline int __down_write_killable(struct rw_semaphore *sem) if (unlikely(!atomic_long_try_cmpxchg_acquire(&sem->count, &tmp, RWSEM_WRITER_LOCKED))) { - if (IS_ERR(rwsem_down_write_slowpath(sem, TASK_KILLABLE))) + if (IS_ERR(rwsem_down_write_slowpath(sem, TASK_KILLABLE, 0))) return -EINTR; } else { rwsem_set_owner(sem); @@ -1408,6 +1451,25 @@ static inline int __down_write_killable(struct rw_semaphore *sem) return 0; } +static inline int __down_write_timedlock(struct rw_semaphore *sem, + ktime_t timeout) +{ + long tmp = RWSEM_UNLOCKED_VALUE; + + if (unlikely(!atomic_long_try_cmpxchg_acquire(&sem->count, &tmp, + RWSEM_WRITER_LOCKED))) { + if (unlikely(timeout <= 0)) + return false; + + if (IS_ERR(rwsem_down_write_slowpath(sem, TASK_UNINTERRUPTIBLE, + timeout))) + return false; + } else { + rwsem_set_owner(sem); + } + return true; +} + static inline int __down_write_trylock(struct rw_semaphore *sem) { long tmp; @@ -1568,6 +1630,21 @@ int down_write_trylock(struct rw_semaphore *sem) } EXPORT_SYMBOL(down_write_trylock); +/* + * lock for writing with timeout (relative to now in ns) + */ +int down_write_timedlock(struct rw_semaphore *sem, ktime_t timeout) +{ + might_sleep(); + if (__down_write_timedlock(sem, timeout)) { + rwsem_acquire(&sem->dep_map, 0, 1, _RET_IP_); + return true; + } + + return false; +} +EXPORT_SYMBOL(down_write_timedlock); + /* * release a read lock */ From patchwork Wed Sep 11 15:05:34 2019 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Waiman Long X-Patchwork-Id: 11141317 Return-Path: Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 9116116B1 for ; Wed, 11 Sep 2019 15:06:17 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.kernel.org (Postfix) with ESMTP id 6BB4121A4C for ; Wed, 11 Sep 2019 15:06:17 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1728413AbfIKPGO (ORCPT ); Wed, 11 Sep 2019 11:06:14 -0400 Received: from mx1.redhat.com 
([209.132.183.28]:62270 "EHLO mx1.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1727581AbfIKPGO (ORCPT ); Wed, 11 Sep 2019 11:06:14 -0400 Received: from smtp.corp.redhat.com (int-mx04.intmail.prod.int.phx2.redhat.com [10.5.11.14]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by mx1.redhat.com (Postfix) with ESMTPS id 11103309C386; Wed, 11 Sep 2019 15:06:13 +0000 (UTC) Received: from llong.com (ovpn-125-196.rdu2.redhat.com [10.10.125.196]) by smtp.corp.redhat.com (Postfix) with ESMTP id 866675D9E2; Wed, 11 Sep 2019 15:06:10 +0000 (UTC) From: Waiman Long To: Peter Zijlstra , Ingo Molnar , Will Deacon , Alexander Viro , Mike Kravetz Cc: linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, Davidlohr Bueso , Waiman Long Subject: [PATCH 2/5] locking/rwsem: Enable timeout check when spinning on owner Date: Wed, 11 Sep 2019 16:05:34 +0100 Message-Id: <20190911150537.19527-3-longman@redhat.com> In-Reply-To: <20190911150537.19527-1-longman@redhat.com> References: <20190911150537.19527-1-longman@redhat.com> X-Scanned-By: MIMEDefang 2.79 on 10.5.11.14 X-Greylist: Sender IP whitelisted, not delayed by milter-greylist-4.5.16 (mx1.redhat.com [10.5.110.43]); Wed, 11 Sep 2019 15:06:13 +0000 (UTC) Sender: linux-fsdevel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-fsdevel@vger.kernel.org When a task is optimistically spinning on the owner, it may do it for a long time if there is no other running task available in the run queue. That can be long past the given timeout value. To prevent that from happening, the rwsem_optimistic_spin() is now modified to check for the timeout value, if specified, to see if it should abort early. Signed-off-by: Waiman Long --- kernel/locking/rwsem.c | 67 ++++++++++++++++++++++++++++-------------- 1 file changed, 45 insertions(+), 22 deletions(-) diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c index c0285749c338..49f052d68404 100644 --- a/kernel/locking/rwsem.c +++ b/kernel/locking/rwsem.c @@ -716,11 +716,13 @@ rwsem_owner_state(struct task_struct *owner, unsigned long flags, unsigned long } static noinline enum owner_state -rwsem_spin_on_owner(struct rw_semaphore *sem, unsigned long nonspinnable) +rwsem_spin_on_owner(struct rw_semaphore *sem, unsigned long nonspinnable, + ktime_t timeout) { struct task_struct *new, *owner; unsigned long flags, new_flags; enum owner_state state; + int loopcnt = 0; owner = rwsem_owner_flags(sem, &flags); state = rwsem_owner_state(owner, flags, nonspinnable); @@ -749,16 +751,22 @@ rwsem_spin_on_owner(struct rw_semaphore *sem, unsigned long nonspinnable) */ barrier(); - if (need_resched() || !owner_on_cpu(owner)) { - state = OWNER_NONSPINNABLE; - break; - } + if (need_resched() || !owner_on_cpu(owner)) + goto stop_optspin; + + if (timeout && !(++loopcnt & 0xf) && + (sched_clock() >= ktime_to_ns(timeout))) + goto stop_optspin; cpu_relax(); } rcu_read_unlock(); return state; + +stop_optspin: + rcu_read_unlock(); + return OWNER_NONSPINNABLE; } /* @@ -786,12 +794,13 @@ static inline u64 rwsem_rspin_threshold(struct rw_semaphore *sem) return sched_clock() + delta; } -static bool rwsem_optimistic_spin(struct rw_semaphore *sem, bool wlock) +static bool rwsem_optimistic_spin(struct rw_semaphore *sem, bool wlock, + ktime_t timeout) { bool taken = false; int prev_owner_state = OWNER_NULL; int loop = 0; - u64 rspin_threshold = 0; + u64 rspin_threshold = 0, curtime; unsigned long nonspinnable = wlock ? 
RWSEM_WR_NONSPINNABLE : RWSEM_RD_NONSPINNABLE; @@ -801,6 +810,8 @@ static bool rwsem_optimistic_spin(struct rw_semaphore *sem, bool wlock) if (!osq_lock(&sem->osq)) goto done; + curtime = timeout ? sched_clock() : 0; + /* * Optimistically spin on the owner field and attempt to acquire the * lock whenever the owner changes. Spinning will be stopped when: @@ -810,7 +821,7 @@ static bool rwsem_optimistic_spin(struct rw_semaphore *sem, bool wlock) for (;;) { enum owner_state owner_state; - owner_state = rwsem_spin_on_owner(sem, nonspinnable); + owner_state = rwsem_spin_on_owner(sem, nonspinnable, timeout); if (!(owner_state & OWNER_SPINNABLE)) break; @@ -823,6 +834,21 @@ static bool rwsem_optimistic_spin(struct rw_semaphore *sem, bool wlock) if (taken) break; + /* + * Check current time once every 16 iterations when + * 1) spinning on reader-owned rwsem; or + * 2) a timeout value is specified. + * + * This is to avoid calling sched_clock() too frequently + * so as to reduce the average latency between the times + * when the lock becomes free and when the spinner is + * ready to do a trylock. + */ + if ((wlock && (owner_state == OWNER_READER)) || timeout) { + if (!(++loop & 0xf)) + curtime = sched_clock(); + } + /* * Time-based reader-owned rwsem optimistic spinning */ @@ -838,23 +864,18 @@ static bool rwsem_optimistic_spin(struct rw_semaphore *sem, bool wlock) if (rwsem_test_oflags(sem, nonspinnable)) break; rspin_threshold = rwsem_rspin_threshold(sem); - loop = 0; } - /* - * Check time threshold once every 16 iterations to - * avoid calling sched_clock() too frequently so - * as to reduce the average latency between the times - * when the lock becomes free and when the spinner - * is ready to do a trylock. - */ - else if (!(++loop & 0xf) && (sched_clock() > rspin_threshold)) { + else if (curtime > rspin_threshold) { rwsem_set_nonspinnable(sem); lockevent_inc(rwsem_opt_nospin); break; } } + if (timeout && (ns_to_ktime(curtime) >= timeout)) + break; + /* * An RT task cannot do optimistic spinning if it cannot * be sure the lock holder is running or live-lock may @@ -968,7 +989,8 @@ static inline bool rwsem_can_spin_on_owner(struct rw_semaphore *sem, return false; } -static inline bool rwsem_optimistic_spin(struct rw_semaphore *sem, bool wlock) +static inline bool rwsem_optimistic_spin(struct rw_semaphore *sem, bool wlock, + ktime_t timeout) { return false; } @@ -982,7 +1004,8 @@ static inline bool rwsem_reader_phase_trylock(struct rw_semaphore *sem, } static inline int -rwsem_spin_on_owner(struct rw_semaphore *sem, unsigned long nonspinnable) +rwsem_spin_on_owner(struct rw_semaphore *sem, unsigned long nonspinnable, + ktime_t timeout) { return 0; } @@ -1036,7 +1059,7 @@ rwsem_down_read_slowpath(struct rw_semaphore *sem, int state) */ atomic_long_add(-RWSEM_READER_BIAS, &sem->count); adjustment = 0; - if (rwsem_optimistic_spin(sem, false)) { + if (rwsem_optimistic_spin(sem, false, 0)) { /* rwsem_optimistic_spin() implies ACQUIRE on success */ /* * Wake up other readers in the wait list if the front @@ -1175,7 +1198,7 @@ rwsem_down_write_slowpath(struct rw_semaphore *sem, int state, ktime_t timeout) /* do optimistic spinning and steal lock if possible */ if (rwsem_can_spin_on_owner(sem, RWSEM_WR_NONSPINNABLE) && - rwsem_optimistic_spin(sem, true)) { + rwsem_optimistic_spin(sem, true, timeout)) { /* rwsem_optimistic_spin() implies ACQUIRE on success */ return sem; } @@ -1255,7 +1278,7 @@ rwsem_down_write_slowpath(struct rw_semaphore *sem, int state, ktime_t timeout) * without sleeping. 
*/ if ((wstate == WRITER_HANDOFF) && - (rwsem_spin_on_owner(sem, 0) == OWNER_NULL)) + (rwsem_spin_on_owner(sem, 0, 0) == OWNER_NULL)) goto trylock_again; /* Block until there are no active lockers. */ From patchwork Wed Sep 11 15:05:35 2019 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Waiman Long X-Patchwork-Id: 11141321 Return-Path: Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id DCC4314ED for ; Wed, 11 Sep 2019 15:06:23 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.kernel.org (Postfix) with ESMTP id C527621928 for ; Wed, 11 Sep 2019 15:06:23 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1728484AbfIKPGU (ORCPT ); Wed, 11 Sep 2019 11:06:20 -0400 Received: from mx1.redhat.com ([209.132.183.28]:33310 "EHLO mx1.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1727581AbfIKPGU (ORCPT ); Wed, 11 Sep 2019 11:06:20 -0400 Received: from smtp.corp.redhat.com (int-mx04.intmail.prod.int.phx2.redhat.com [10.5.11.14]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by mx1.redhat.com (Postfix) with ESMTPS id 6E586A37188; Wed, 11 Sep 2019 15:06:19 +0000 (UTC) Received: from llong.com (ovpn-125-196.rdu2.redhat.com [10.10.125.196]) by smtp.corp.redhat.com (Postfix) with ESMTP id 60BA75D9E2; Wed, 11 Sep 2019 15:06:13 +0000 (UTC) From: Waiman Long To: Peter Zijlstra , Ingo Molnar , Will Deacon , Alexander Viro , Mike Kravetz Cc: linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, Davidlohr Bueso , Waiman Long Subject: [PATCH 3/5] locking/osq: Allow early break from OSQ Date: Wed, 11 Sep 2019 16:05:35 +0100 Message-Id: <20190911150537.19527-4-longman@redhat.com> In-Reply-To: <20190911150537.19527-1-longman@redhat.com> References: <20190911150537.19527-1-longman@redhat.com> X-Scanned-By: MIMEDefang 2.79 on 10.5.11.14 X-Greylist: Sender IP whitelisted, not delayed by milter-greylist-4.6.2 (mx1.redhat.com [10.5.110.68]); Wed, 11 Sep 2019 15:06:19 +0000 (UTC) Sender: linux-fsdevel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-fsdevel@vger.kernel.org The current osq_lock() function will spin until it gets the lock or when its time slice has been used up. There may be other reasons that a task may want to back out from the OSQ before getting the lock. This patch extends the osq_lock() function by adding two new arguments - a break function pointer and its argument. That break function will be called, if defined, in each iteration of the loop to see if it should break out early. The optimistic_spin_node structure in osq_lock.h isn't needed by callers, so it is moved into osq_lock.c. Signed-off-by: Waiman Long --- include/linux/osq_lock.h | 13 ++----------- kernel/locking/mutex.c | 2 +- kernel/locking/osq_lock.c | 12 +++++++++++- kernel/locking/rwsem.c | 2 +- 4 files changed, 15 insertions(+), 14 deletions(-) diff --git a/include/linux/osq_lock.h b/include/linux/osq_lock.h index 5581dbd3bd34..161eb6b26d6d 100644 --- a/include/linux/osq_lock.h +++ b/include/linux/osq_lock.h @@ -2,16 +2,6 @@ #ifndef __LINUX_OSQ_LOCK_H #define __LINUX_OSQ_LOCK_H -/* - * An MCS like lock especially tailored for optimistic spinning for sleeping - * lock implementations (mutex, rwsem, etc). 
- */ -struct optimistic_spin_node { - struct optimistic_spin_node *next, *prev; - int locked; /* 1 if lock acquired */ - int cpu; /* encoded CPU # + 1 value */ -}; - struct optimistic_spin_queue { /* * Stores an encoded value of the CPU # of the tail node in the queue. @@ -30,7 +20,8 @@ static inline void osq_lock_init(struct optimistic_spin_queue *lock) atomic_set(&lock->tail, OSQ_UNLOCKED_VAL); } -extern bool osq_lock(struct optimistic_spin_queue *lock); +extern bool osq_lock(struct optimistic_spin_queue *lock, + bool (*break_fn)(void *), void *break_arg); extern void osq_unlock(struct optimistic_spin_queue *lock); static inline bool osq_is_locked(struct optimistic_spin_queue *lock) diff --git a/kernel/locking/mutex.c b/kernel/locking/mutex.c index 468a9b8422e3..8a1df82fd71a 100644 --- a/kernel/locking/mutex.c +++ b/kernel/locking/mutex.c @@ -654,7 +654,7 @@ mutex_optimistic_spin(struct mutex *lock, struct ww_acquire_ctx *ww_ctx, * acquire the mutex all at once, the spinners need to take a * MCS (queued) lock first before spinning on the owner field. */ - if (!osq_lock(&lock->osq)) + if (!osq_lock(&lock->osq, NULL, NULL)) goto fail; } diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c index 6ef600aa0f47..40c94380a485 100644 --- a/kernel/locking/osq_lock.c +++ b/kernel/locking/osq_lock.c @@ -11,6 +11,12 @@ * called from interrupt context and we have preemption disabled while * spinning. */ +struct optimistic_spin_node { + struct optimistic_spin_node *next, *prev; + int locked; /* 1 if lock acquired */ + int cpu; /* encoded CPU # + 1 value */ +}; + static DEFINE_PER_CPU_SHARED_ALIGNED(struct optimistic_spin_node, osq_node); /* @@ -87,7 +93,8 @@ osq_wait_next(struct optimistic_spin_queue *lock, return next; } -bool osq_lock(struct optimistic_spin_queue *lock) +bool osq_lock(struct optimistic_spin_queue *lock, + bool (*break_fn)(void *), void *break_arg) { struct optimistic_spin_node *node = this_cpu_ptr(&osq_node); struct optimistic_spin_node *prev, *next; @@ -143,6 +150,9 @@ bool osq_lock(struct optimistic_spin_queue *lock) if (need_resched() || vcpu_is_preempted(node_cpu(node->prev))) goto unqueue; + if (unlikely(break_fn) && break_fn(break_arg)) + goto unqueue; + cpu_relax(); } return true; diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c index 49f052d68404..c15926ecb21e 100644 --- a/kernel/locking/rwsem.c +++ b/kernel/locking/rwsem.c @@ -807,7 +807,7 @@ static bool rwsem_optimistic_spin(struct rw_semaphore *sem, bool wlock, preempt_disable(); /* sem->wait_lock should not be held when doing optimistic spinning */ - if (!osq_lock(&sem->osq)) + if (!osq_lock(&sem->osq, NULL, NULL)) goto done; curtime = timeout ? 
sched_clock() : 0; From patchwork Wed Sep 11 15:05:36 2019 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Waiman Long X-Patchwork-Id: 11141323 Return-Path: Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 8F73F1599 for ; Wed, 11 Sep 2019 15:06:26 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.kernel.org (Postfix) with ESMTP id 76E54207FC for ; Wed, 11 Sep 2019 15:06:26 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1728510AbfIKPGX (ORCPT ); Wed, 11 Sep 2019 11:06:23 -0400 Received: from mx1.redhat.com ([209.132.183.28]:45394 "EHLO mx1.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1727581AbfIKPGW (ORCPT ); Wed, 11 Sep 2019 11:06:22 -0400 Received: from smtp.corp.redhat.com (int-mx04.intmail.prod.int.phx2.redhat.com [10.5.11.14]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by mx1.redhat.com (Postfix) with ESMTPS id 63CF5307D925; Wed, 11 Sep 2019 15:06:22 +0000 (UTC) Received: from llong.com (ovpn-125-196.rdu2.redhat.com [10.10.125.196]) by smtp.corp.redhat.com (Postfix) with ESMTP id D361D5D9E2; Wed, 11 Sep 2019 15:06:19 +0000 (UTC) From: Waiman Long To: Peter Zijlstra , Ingo Molnar , Will Deacon , Alexander Viro , Mike Kravetz Cc: linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, Davidlohr Bueso , Waiman Long Subject: [PATCH 4/5] locking/rwsem: Enable timeout check when staying in the OSQ Date: Wed, 11 Sep 2019 16:05:36 +0100 Message-Id: <20190911150537.19527-5-longman@redhat.com> In-Reply-To: <20190911150537.19527-1-longman@redhat.com> References: <20190911150537.19527-1-longman@redhat.com> X-Scanned-By: MIMEDefang 2.79 on 10.5.11.14 X-Greylist: Sender IP whitelisted, not delayed by milter-greylist-4.5.16 (mx1.redhat.com [10.5.110.48]); Wed, 11 Sep 2019 15:06:22 +0000 (UTC) Sender: linux-fsdevel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-fsdevel@vger.kernel.org Use the break function allowed by the new osq_lock() to enable early break from the OSQ when a timeout value is specified and expiration time has been reached. Signed-off-by: Waiman Long --- kernel/locking/rwsem.c | 35 +++++++++++++++++++++++++++++++---- 1 file changed, 31 insertions(+), 4 deletions(-) diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c index c15926ecb21e..78708097162a 100644 --- a/kernel/locking/rwsem.c +++ b/kernel/locking/rwsem.c @@ -794,23 +794,50 @@ static inline u64 rwsem_rspin_threshold(struct rw_semaphore *sem) return sched_clock() + delta; } +struct rwsem_break_arg { + u64 timeout; + int loopcnt; +}; + +static bool rwsem_osq_break(void *brk_arg) +{ + struct rwsem_break_arg *arg = brk_arg; + + arg->loopcnt++; + /* + * Check sched_clock() only once every 256 iterations. + */ + if (!(arg->loopcnt++ & 0xff) && (sched_clock() >= arg->timeout)) + return true; + return false; +} + static bool rwsem_optimistic_spin(struct rw_semaphore *sem, bool wlock, ktime_t timeout) { - bool taken = false; + bool taken = false, locked; int prev_owner_state = OWNER_NULL; int loop = 0; u64 rspin_threshold = 0, curtime; + struct rwsem_break_arg break_arg; unsigned long nonspinnable = wlock ? 
RWSEM_WR_NONSPINNABLE : RWSEM_RD_NONSPINNABLE; preempt_disable(); /* sem->wait_lock should not be held when doing optimistic spinning */ - if (!osq_lock(&sem->osq, NULL, NULL)) - goto done; + if (timeout) { + break_arg.timeout = ktime_to_ns(timeout); + break_arg.loopcnt = 0; + locked = osq_lock(&sem->osq, rwsem_osq_break, &break_arg); + curtime = sched_clock(); + } else { + locked = osq_lock(&sem->osq, NULL, NULL); + curtime = 0; + } - curtime = timeout ? sched_clock() : 0; + if (!locked) + goto done; /* * Optimistically spin on the owner field and attempt to acquire the From patchwork Wed Sep 11 15:05:37 2019 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Waiman Long X-Patchwork-Id: 11141329 Return-Path: Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 9851B1599 for ; Wed, 11 Sep 2019 15:06:33 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.kernel.org (Postfix) with ESMTP id 807CE2168B for ; Wed, 11 Sep 2019 15:06:33 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1728529AbfIKPGa (ORCPT ); Wed, 11 Sep 2019 11:06:30 -0400 Received: from mx1.redhat.com ([209.132.183.28]:41176 "EHLO mx1.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1727581AbfIKPG3 (ORCPT ); Wed, 11 Sep 2019 11:06:29 -0400 Received: from smtp.corp.redhat.com (int-mx04.intmail.prod.int.phx2.redhat.com [10.5.11.14]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by mx1.redhat.com (Postfix) with ESMTPS id 6C9BB300DA3A; Wed, 11 Sep 2019 15:06:29 +0000 (UTC) Received: from llong.com (ovpn-125-196.rdu2.redhat.com [10.10.125.196]) by smtp.corp.redhat.com (Postfix) with ESMTP id CCD7D5D9E2; Wed, 11 Sep 2019 15:06:22 +0000 (UTC) From: Waiman Long To: Peter Zijlstra , Ingo Molnar , Will Deacon , Alexander Viro , Mike Kravetz Cc: linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, Davidlohr Bueso , Waiman Long Subject: [PATCH 5/5] hugetlbfs: Limit wait time when trying to share huge PMD Date: Wed, 11 Sep 2019 16:05:37 +0100 Message-Id: <20190911150537.19527-6-longman@redhat.com> In-Reply-To: <20190911150537.19527-1-longman@redhat.com> References: <20190911150537.19527-1-longman@redhat.com> X-Scanned-By: MIMEDefang 2.79 on 10.5.11.14 X-Greylist: Sender IP whitelisted, not delayed by milter-greylist-4.5.16 (mx1.redhat.com [10.5.110.45]); Wed, 11 Sep 2019 15:06:29 +0000 (UTC) Sender: linux-fsdevel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-fsdevel@vger.kernel.org When allocating a large amount of static hugepages (~500-1500GB) on a system with large number of CPUs (4, 8 or even 16 sockets), performance degradation (random multi-second delays) was observed when thousands of processes are trying to fault in the data into the huge pages. The likelihood of the delay increases with the number of sockets and hence the CPUs a system has. This only happens in the initial setup phase and will be gone after all the necessary data are faulted in. These random delays, however, are deemed unacceptable. The cause of that delay is the long wait time in acquiring the mmap_sem when trying to share the huge PMDs. To remove the unacceptable delays, we have to limit the amount of wait time on the mmap_sem. 
So the new down_write_timedlock() function is used to acquire the write lock on the mmap_sem with a timeout value of 10ms which should not cause a perceivable delay. If timeout happens, the task will abandon its effort to share the PMD and allocate its own copy instead. When too many timeouts happens (threshold currently set at 256), the system may be too large for PMD sharing to be useful without undue delay. So the sharing will be disabled in this case. Signed-off-by: Waiman Long --- include/linux/fs.h | 7 +++++++ mm/hugetlb.c | 24 +++++++++++++++++++++--- 2 files changed, 28 insertions(+), 3 deletions(-) diff --git a/include/linux/fs.h b/include/linux/fs.h index 997a530ff4e9..e9d3ad465a6b 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -40,6 +40,7 @@ #include #include #include +#include #include #include @@ -519,6 +520,12 @@ static inline void i_mmap_lock_write(struct address_space *mapping) down_write(&mapping->i_mmap_rwsem); } +static inline bool i_mmap_timedlock_write(struct address_space *mapping, + ktime_t timeout) +{ + return down_write_timedlock(&mapping->i_mmap_rwsem, timeout); +} + static inline void i_mmap_unlock_write(struct address_space *mapping) { up_write(&mapping->i_mmap_rwsem); diff --git a/mm/hugetlb.c b/mm/hugetlb.c index 6d7296dd11b8..445af661ae29 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -4750,6 +4750,8 @@ void adjust_range_if_pmd_sharing_possible(struct vm_area_struct *vma, } } +#define PMD_SHARE_DISABLE_THRESHOLD (1 << 8) + /* * Search for a shareable pmd page for hugetlb. In any case calls pmd_alloc() * and returns the corresponding pte. While this is not necessary for the @@ -4770,11 +4772,24 @@ pte_t *huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud) pte_t *spte = NULL; pte_t *pte; spinlock_t *ptl; + static atomic_t timeout_cnt; - if (!vma_shareable(vma, addr)) - return (pte_t *)pmd_alloc(mm, pud, addr); + /* + * Don't share if it is not sharable or locking attempt timed out + * after 10ms. After 256 timeouts, PMD sharing will be permanently + * disabled as it is just too slow. + */ + if (!vma_shareable(vma, addr) || + (atomic_read(&timeout_cnt) >= PMD_SHARE_DISABLE_THRESHOLD)) + goto out_no_share; + + if (!i_mmap_timedlock_write(mapping, ms_to_ktime(10))) { + if (atomic_inc_return(&timeout_cnt) == + PMD_SHARE_DISABLE_THRESHOLD) + pr_info("Hugetlbfs PMD sharing disabled because of timeouts!\n"); + goto out_no_share; + } - i_mmap_lock_write(mapping); vma_interval_tree_foreach(svma, &mapping->i_mmap, idx, idx) { if (svma == vma) continue; @@ -4806,6 +4821,9 @@ pte_t *huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud) pte = (pte_t *)pmd_alloc(mm, pud, addr); i_mmap_unlock_write(mapping); return pte; + +out_no_share: + return (pte_t *)pmd_alloc(mm, pud, addr); } /*
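Referring back to patches 3 and 4, the break-function contract of the extended osq_lock() can be made concrete with a short sketch. It mirrors rwsem_osq_break() and the setup in rwsem_optimistic_spin(), but the names example_break_arg, example_deadline_break and example_osq_lock_deadline are invented for the illustration, and it assumes <linux/osq_lock.h> and <linux/sched/clock.h> for osq_lock() and sched_clock():

#include <linux/osq_lock.h>
#include <linux/sched/clock.h>

struct example_break_arg {
        u64 deadline;           /* absolute sched_clock() time in ns */
        int loopcnt;            /* throttles how often the clock is read */
};

static bool example_deadline_break(void *brk_arg)
{
        struct example_break_arg *arg = brk_arg;

        /* Sample sched_clock() only once every 256 iterations. */
        if (!(++arg->loopcnt & 0xff) && (sched_clock() >= arg->deadline))
                return true;    /* osq_lock() will unqueue and return false */
        return false;
}

/* Spin in the OSQ for at most budget_ns nanoseconds. */
static bool example_osq_lock_deadline(struct optimistic_spin_queue *osq,
                                      u64 budget_ns)
{
        struct example_break_arg arg = {
                .deadline = sched_clock() + budget_ns,
        };

        return osq_lock(osq, example_deadline_break, &arg);
}

This is the same shape patch 4 uses in rwsem_optimistic_spin(): it fills a rwsem_break_arg with the absolute expiration time and passes rwsem_osq_break to osq_lock(), so a waiter stops queueing for the OSQ once its time budget is spent.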