From patchwork Mon Sep 8 08:25:36 2014
X-Patchwork-Submitter: Hugh Dickins
X-Patchwork-Id: 4860371
Date: Mon, 8 Sep 2014 01:25:36 -0700 (PDT)
From: Hugh Dickins
To: Peter Zijlstra
Cc: Hugh Dickins, Chintan Pandya, akpm@linux-foundation.org,
    linux-mm@kvack.org, linux-arm-msm@vger.kernel.org,
    linux-kernel@vger.kernel.org, Thomas Gleixner, John Stultz,
    Ingo Molnar
Subject: Re: [PATCH v4 2/2] ksm: provide support to use deferrable timers for scanner thread
In-Reply-To: <20140903095815.GK4783@worktop.ger.corp.intel.com>
References: <1408536628-29379-1-git-send-email-cpandya@codeaurora.org>
 <1408536628-29379-2-git-send-email-cpandya@codeaurora.org>
 <20140903095815.GK4783@worktop.ger.corp.intel.com>

On Wed, 3 Sep 2014, Peter Zijlstra wrote:
> On Wed, Aug 27, 2014 at 11:02:20PM -0700, Hugh Dickins wrote:
> > On Wed, 20 Aug 2014, Chintan Pandya wrote:
> > >
> > > The KSM thread that scans pages is scheduled on a fixed timeout.
> > > That wakes the CPU from idle state and hence may affect power
> > > consumption.  Provide optional support to use a deferrable timer,
> > > which suits low-power use-cases.
> > >
> > > Typically, on our setup we observed 10% less power consumption with
> > > some use-cases in which the CPU goes to power collapse frequently.
> > > For example, when playing audio on an SoC which has a HW-based audio
> > > encoder/decoder, the CPU remains idle for long durations.  This idle
> > > state saves significant CPU power if KSM doesn't wake the CPUs up
> > > periodically.
> > >
> > > Note that deferrable timers won't be deferred if any CPU is active
> > > and not in idle state.
> > >
> > > By default, the deferrable timer is enabled.  To disable it:
> > > $ echo 0 > /sys/kernel/mm/ksm/deferrable_timer
> >
> > I have now experimented.  And, much as I wanted to eliminate the
> > tunable, and just have deferrable timers on, I have come right back
> > to your original position.
> >
> > I was impressed by how quiet ksmd goes when there's nothing much
> > happening on the machine; but equally disappointed in how slow it
> > then is to fulfil the outstanding merge work.  I agree with your
> > original assessment that not everybody will want the deferrable
> > timer, the way it is working at present.
> >
> > I expect that can be fixed, partly by doing more work on wakeup from
> > a deferred timer, according to how long it has been deferred; and
> > partly by not deferring on idle until two passes of the list have
> > been completed.  But that's easier said than done, and might turn
> > out to
>
> So why not have the timer cancel itself when there is no more work to
> do and start itself up again when there's work added?

Well, yes, but... how do we know when there is no more work to do?
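(For concreteness, the kind of sleep being discussed might look like the
rough sketch below -- this is not Chintan's actual patch; the names
ksm_sleep_deferrable, ksm_sleep_timer and ksm_iter_wait are illustrative,
assuming the 3.17-era timer API.  A deferrable timer still fires promptly
while any CPU is busy, but will not wake an idle CPU just to rearm the
scan:

/*
 * Sketch only: replace ksmd's fixed schedule_timeout_interruptible()
 * sleep with a deferrable timer, so an idle CPU is not woken merely
 * to restart scanning.
 */
#include <linux/jiffies.h>
#include <linux/timer.h>
#include <linux/wait.h>

static DECLARE_WAIT_QUEUE_HEAD(ksm_iter_wait);
static struct timer_list ksm_sleep_timer;
static bool ksm_sleep_expired;

static void ksm_sleep_timeout(unsigned long data)
{
	ksm_sleep_expired = true;
	wake_up_interruptible(&ksm_iter_wait);
}

static void ksm_sleep_deferrable(unsigned int msecs)
{
	ksm_sleep_expired = false;
	init_timer_deferrable(&ksm_sleep_timer);
	ksm_sleep_timer.function = ksm_sleep_timeout;
	ksm_sleep_timer.expires = jiffies + msecs_to_jiffies(msecs);
	add_timer(&ksm_sleep_timer);
	/* while all CPUs are idle, the expiry slides until one wakes */
	wait_event_interruptible(ksm_iter_wait, ksm_sleep_expired);
	del_timer_sync(&ksm_sleep_timer);
}

ksm_scan_thread() would then call ksm_sleep_deferrable() in place of its
schedule_timeout_interruptible(), with everything else unchanged.)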
Further down I said:

> > But fixing that might require ksm hooks in hot locations where nobody
> > else would want them: I'm rather hoping we can strike a good enough
> > balance with your deferrable timer, that nobody will need any better.

Thomas has given a reason why KSM might simply fail to do its job if we
rely on the deferrable timer.  So I've tried another approach, patch
below; but I do not expect you to jump for joy at the sight of it!

I've tried to minimize the offensive KSM hook in context_switch().
Why place it there, rather than do something near profile_tick() or
account_process_tick()?  Because KSM is aware of mms not tasks, and
context_switch() should have the next mm cachelines hot (if not, a
slight regrouping in mm_struct should do it); whereas I can find no
reference whatever to mm_struct in kernel/time, so hooking to KSM from
there would drag in another few cachelines every tick.

(Another approach would be to set up KSM hint faulting, along the lines
of NUMA hint faulting.  Not a path I'm keen to go down.)

I'm not thrilled with this patch; I think it's somewhat defective in
several ways.  But maybe in practice it will prove good enough, and if
so then I'd rather not waste effort on complicating it.  My own testing
is not realistic, nor representative of real KSM users; and I have no
idea what values of pages_to_scan and sleep_millisecs people really use
(those may make quite a difference to how well it works).

Chintan, even if the scheduler guys turn out to hate it, please would
you give the patch below a try, to see how well it works in your
environment, and whether it seems to go better or worse than your own
patch.  If it works well enough for you, maybe we can come up with ideas
to make it more palatable.  I do think your issue is an important one to
fix, one way or another.

Thanks,
Hugh

[PATCH] ksm: avoid periodic wakeup while mergeable mms are quiet

Description yet to be written!
Reported-by: Chintan Pandya <cpandya@codeaurora.org>
Not-Signed-off-by: Hugh Dickins
---
 include/linux/ksm.h   | 14 +++++++++++
 include/linux/sched.h |  1 
 kernel/sched/core.c   |  9 ++++++-
 mm/ksm.c              | 50 ++++++++++++++++++++++++++++------------
 4 files changed, 58 insertions(+), 16 deletions(-)

--- 3.17-rc4/include/linux/ksm.h	2014-03-30 20:40:15.000000000 -0700
+++ linux/include/linux/ksm.h	2014-09-07 11:54:41.528003316 -0700
@@ -12,6 +12,7 @@
 #include <linux/pagemap.h>
 #include <linux/rmap.h>
 #include <linux/sched.h>
+#include <linux/wait.h>
 
 struct stable_node;
 struct mem_cgroup;
@@ -21,6 +22,7 @@ int ksm_madvise(struct vm_area_struct *v
 		unsigned long end, int advice, unsigned long *vm_flags);
 int __ksm_enter(struct mm_struct *mm);
 void __ksm_exit(struct mm_struct *mm);
+wait_queue_head_t *__ksm_switch(struct mm_struct *mm);
 
 static inline int ksm_fork(struct mm_struct *mm, struct mm_struct *oldmm)
 {
@@ -35,6 +37,13 @@ static inline void ksm_exit(struct mm_st
 		__ksm_exit(mm);
 }
 
+static inline wait_queue_head_t *ksm_switch(struct mm_struct *mm)
+{
+	if (unlikely(test_bit(MMF_SWITCH_TO_KSM, &mm->flags)))
+		return __ksm_switch(mm);
+	return NULL;
+}
+
 /*
  * A KSM page is one of those write-protected "shared pages" or "merged pages"
  * which KSM maps into multiple mms, wherever identical anonymous page content
@@ -87,6 +96,11 @@ static inline void ksm_exit(struct mm_st
 {
 }
 
+static inline wait_queue_head_t *ksm_switch(struct mm_struct *mm)
+{
+	return NULL;
+}
+
 static inline int PageKsm(struct page *page)
 {
 	return 0;
--- 3.17-rc4/include/linux/sched.h	2014-08-16 16:00:53.909189060 -0700
+++ linux/include/linux/sched.h	2014-09-07 11:54:41.528003316 -0700
@@ -453,6 +453,7 @@ static inline int get_dumpable(struct mm
 
 #define MMF_HAS_UPROBES		19	/* has uprobes */
 #define MMF_RECALC_UPROBES	20	/* MMF_HAS_UPROBES can be wrong */
+#define MMF_SWITCH_TO_KSM	21	/* notify KSM of switch to this mm */
 
 #define MMF_INIT_MASK		(MMF_DUMPABLE_MASK | MMF_DUMP_FILTER_MASK)
 
--- 3.17-rc4/kernel/sched/core.c	2014-08-16 16:00:54.062189063 -0700
+++ linux/kernel/sched/core.c	2014-09-07 11:54:41.528003316 -0700
@@ -61,6 +61,7 @@
 #include
 #include
 #include
+#include <linux/ksm.h>
 #include
 #include
 #include
@@ -2304,6 +2305,7 @@ context_switch(struct rq *rq, struct tas
 	       struct task_struct *next)
 {
 	struct mm_struct *mm, *oldmm;
+	wait_queue_head_t *wake_ksm = NULL;
 
 	prepare_task_switch(rq, prev, next);
 
@@ -2320,8 +2322,10 @@ context_switch(struct rq *rq, struct tas
 		next->active_mm = oldmm;
 		atomic_inc(&oldmm->mm_count);
 		enter_lazy_tlb(oldmm, next);
-	} else
+	} else {
 		switch_mm(oldmm, mm, next);
+		wake_ksm = ksm_switch(mm);
+	}
 
 	if (!prev->mm) {
 		prev->active_mm = NULL;
@@ -2348,6 +2352,9 @@ context_switch(struct rq *rq, struct tas
 	 * frame will be invalid.
 	 */
 	finish_task_switch(this_rq(), prev);
+
+	if (wake_ksm)
+		wake_up_interruptible(wake_ksm);
 }
 
 /*
--- 3.17-rc4/mm/ksm.c	2014-08-16 16:00:54.132189065 -0700
+++ linux/mm/ksm.c	2014-09-07 11:54:41.528003316 -0700
@@ -205,6 +205,9 @@ static struct kmem_cache *rmap_item_cach
 static struct kmem_cache *stable_node_cache;
 static struct kmem_cache *mm_slot_cache;
 
+/* The number of mergeable mms which have recently run */
+static atomic_t active_mergeable_mms = ATOMIC_INIT(0);
+
 /* The number of nodes in the stable tree */
 static unsigned long ksm_pages_shared;
 
@@ -313,9 +316,13 @@ static inline struct mm_slot *alloc_mm_s
 	return kmem_cache_zalloc(mm_slot_cache, GFP_KERNEL);
 }
 
-static inline void free_mm_slot(struct mm_slot *mm_slot)
+static void free_mm_slot(struct mm_struct *mm, struct mm_slot *mm_slot)
 {
 	kmem_cache_free(mm_slot_cache, mm_slot);
+
+	clear_bit(MMF_VM_MERGEABLE, &mm->flags);
+	if (!test_and_clear_bit(MMF_SWITCH_TO_KSM, &mm->flags))
+		atomic_dec(&active_mergeable_mms);
 }
 
 static struct mm_slot *get_mm_slot(struct mm_struct *mm)
@@ -801,8 +808,7 @@ static int unmerge_and_remove_all_rmap_i
 			list_del(&mm_slot->mm_list);
 			spin_unlock(&ksm_mmlist_lock);
 
-			free_mm_slot(mm_slot);
-			clear_bit(MMF_VM_MERGEABLE, &mm->flags);
+			free_mm_slot(mm, mm_slot);
 			up_read(&mm->mmap_sem);
 			mmdrop(mm);
 		} else {
@@ -1668,12 +1674,20 @@ next_mm:
 		list_del(&slot->mm_list);
 		spin_unlock(&ksm_mmlist_lock);
 
-		free_mm_slot(slot);
-		clear_bit(MMF_VM_MERGEABLE, &mm->flags);
+		free_mm_slot(mm, slot);
 		up_read(&mm->mmap_sem);
 		mmdrop(mm);
 	} else {
 		spin_unlock(&ksm_mmlist_lock);
+		/*
+		 * After completing its scan, assume this mm to be inactive,
+		 * but set a flag for context_switch() to notify us as soon
+		 * as it is used again: see ksm_switch().  If the number of
+		 * active_mergeable_mms goes down to zero, ksmd will sleep
+		 * to save power, until awoken by mergeable context_switch().
+		 */
+		if (!test_and_set_bit(MMF_SWITCH_TO_KSM, &mm->flags))
+			atomic_dec(&active_mergeable_mms);
 		up_read(&mm->mmap_sem);
 	}
 
@@ -1707,7 +1721,7 @@ static void ksm_do_scan(unsigned int sca
 
 static int ksmd_should_run(void)
 {
-	return (ksm_run & KSM_RUN_MERGE) && !list_empty(&ksm_mm_head.mm_list);
+	return (ksm_run & KSM_RUN_MERGE) && atomic_read(&active_mergeable_mms);
 }
 
 static int ksm_scan_thread(void *nothing)
@@ -1785,15 +1799,11 @@ int ksm_madvise(struct vm_area_struct *v
 int __ksm_enter(struct mm_struct *mm)
 {
 	struct mm_slot *mm_slot;
-	int needs_wakeup;
 
 	mm_slot = alloc_mm_slot();
 	if (!mm_slot)
 		return -ENOMEM;
 
-	/* Check ksm_run too?  Would need tighter locking */
-	needs_wakeup = list_empty(&ksm_mm_head.mm_list);
-
 	spin_lock(&ksm_mmlist_lock);
 	insert_to_mm_slots_hash(mm, mm_slot);
 	/*
@@ -1812,10 +1822,9 @@ int __ksm_enter(struct mm_struct *mm)
 	list_add_tail(&mm_slot->mm_list, &ksm_scan.mm_slot->mm_list);
 	spin_unlock(&ksm_mmlist_lock);
 
-	set_bit(MMF_VM_MERGEABLE, &mm->flags);
 	atomic_inc(&mm->mm_count);
-
-	if (needs_wakeup)
+	set_bit(MMF_VM_MERGEABLE, &mm->flags);
+	if (atomic_inc_return(&active_mergeable_mms) == 1)
 		wake_up_interruptible(&ksm_thread_wait);
 
 	return 0;
@@ -1850,8 +1859,7 @@ void __ksm_exit(struct mm_struct *mm)
 	spin_unlock(&ksm_mmlist_lock);
 
 	if (easy_to_free) {
-		free_mm_slot(mm_slot);
-		clear_bit(MMF_VM_MERGEABLE, &mm->flags);
+		free_mm_slot(mm, mm_slot);
 		mmdrop(mm);
 	} else if (mm_slot) {
 		down_write(&mm->mmap_sem);
@@ -1859,6 +1867,18 @@ void __ksm_exit(struct mm_struct *mm)
 	}
 }
 
+wait_queue_head_t *__ksm_switch(struct mm_struct *mm)
+{
+	/*
+	 * Called by context_switch() to a hitherto inactive mergeable mm:
+	 * scheduler locks forbid immediate wakeup so leave that to caller.
+	 */
+	if (test_and_clear_bit(MMF_SWITCH_TO_KSM, &mm->flags) &&
+	    atomic_inc_return(&active_mergeable_mms) == 1)
+		return &ksm_thread_wait;
+	return NULL;
+}
+
 struct page *ksm_might_need_to_copy(struct page *page,
 			struct vm_area_struct *vma, unsigned long address)
 {
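P.S. For anyone wanting to exercise either patch: the easiest way to put
a mergeable mm in front of ksmd is madvise(MADV_MERGEABLE) from
userspace, the path that ends in __ksm_enter().  A minimal illustrative
test program (assuming CONFIG_KSM=y, and ksmd enabled beforehand with
"echo 1 > /sys/kernel/mm/ksm/run"):

/*
 * Map an anonymous region, fill it with identical pages, and mark it
 * mergeable so this process's mm joins KSM's scan list.  While it
 * sleeps, /sys/kernel/mm/ksm/pages_sharing should grow.
 */
#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	size_t len = 64 * 4096;		/* 64 pages */
	char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (buf == MAP_FAILED) {
		perror("mmap");
		return 1;
	}
	memset(buf, 0x5a, len);		/* identical content for ksmd to merge */
	if (madvise(buf, len, MADV_MERGEABLE)) {
		perror("madvise(MADV_MERGEABLE)");
		return 1;
	}
	pause();			/* leave time for ksmd to scan */
	return 0;
}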