From patchwork Fri Dec 14 06:27:43 2018
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: "Huang, Ying" <ying.huang@intel.com>
X-Patchwork-Id: 10730627
Return-Path: <owner-linux-mm@kvack.org>
Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org
 [172.30.200.125])
	by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id C88D114E2
	for <patchwork-linux-mm@patchwork.kernel.org>;
 Fri, 14 Dec 2018 06:28:20 +0000 (UTC)
Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1])
	by mail.wl.linuxfoundation.org (Postfix) with ESMTP id B663B2CC8C
	for <patchwork-linux-mm@patchwork.kernel.org>;
 Fri, 14 Dec 2018 06:28:20 +0000 (UTC)
Received: by mail.wl.linuxfoundation.org (Postfix, from userid 486)
	id AA6ED2CC99; Fri, 14 Dec 2018 06:28:20 +0000 (UTC)
X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on
	pdx-wl-mail.web.codeaurora.org
X-Spam-Level: 
X-Spam-Status: No, score=-2.9 required=2.0 tests=BAYES_00,MAILING_LIST_MULTI,
	RCVD_IN_DNSWL_NONE autolearn=ham version=3.3.1
Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17])
	by mail.wl.linuxfoundation.org (Postfix) with ESMTP id D8A7B2CC8C
	for <patchwork-linux-mm@patchwork.kernel.org>;
 Fri, 14 Dec 2018 06:28:19 +0000 (UTC)
Received: by kanga.kvack.org (Postfix)
	id 3F84C8E01B4; Fri, 14 Dec 2018 01:28:11 -0500 (EST)
Delivered-To: linux-mm-outgoing@kvack.org
Received: by kanga.kvack.org (Postfix, from userid 40)
	id 3A82B8E0014; Fri, 14 Dec 2018 01:28:11 -0500 (EST)
X-Original-To: int-list-linux-mm@kvack.org
X-Delivered-To: int-list-linux-mm@kvack.org
Received: by kanga.kvack.org (Postfix, from userid 63042)
	id 2BFB78E01B4; Fri, 14 Dec 2018 01:28:11 -0500 (EST)
X-Original-To: linux-mm@kvack.org
X-Delivered-To: linux-mm@kvack.org
Received: from mail-pl1-f197.google.com (mail-pl1-f197.google.com
 [209.85.214.197])
	by kanga.kvack.org (Postfix) with ESMTP id DBB6F8E0014
	for <linux-mm@kvack.org>; Fri, 14 Dec 2018 01:28:10 -0500 (EST)
Received: by mail-pl1-f197.google.com with SMTP id l9so2943600plt.7
        for <linux-mm@kvack.org>; Thu, 13 Dec 2018 22:28:10 -0800 (PST)
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=1e100.net; s=20161025;
        h=x-original-authentication-results:x-gm-message-state:from:to:cc
         :subject:date:message-id:in-reply-to:references;
        bh=knYDE2Y4kYMyTftPpiIn/YNLPn8n1fSOGQupP+BrBRs=;
        b=K2oGJ+qiveEqIKApw85EwNJYeYSI8Bpv54hJ7U0yODqRoYzV4huq55o+Zf7hyzxjQF
         /7S4j62Uhl9hTVWDS7b2SkeOZDfMiYxsqHKmS6Z1J0i/6tmL6g/8NnYM5BwsUy7vRzHK
         xpTNBA3Seqp4ZdElGCNplxufcdzItHYRRr66Zo5sPIcOJrli7KDUy7ssMDyv6D9WElry
         hgAoCx8x+dwAdEt5rv2pwz6WffzKLb+uIzTh9jar5not8Aj0JEvOKoeUa1WgMEFX+UZC
         p5PsbHw+OuVacWPhG/DixI6QfqazAz1aNBPtut+gAneBqOV+ybNXkIfdVhnIBzkyFSDC
         xD+A==
X-Original-Authentication-Results: mx.google.com;
       spf=pass (google.com: domain of ying.huang@intel.com designates
 134.134.136.100 as permitted sender) smtp.mailfrom=ying.huang@intel.com;
       dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=intel.com
X-Gm-Message-State: AA+aEWbeQD2rtZO7GqAujAZDEGs3ol5e9tZ9IT/+DksRv+dRoA4awAwD
	Phduov/7giRmba1he4HGvpbWk3b25EDGXyQOeQQdoToMYwAB3Ft3XmOC9Ej5892EQPl5ObrD6W5
	lMaK+xRC4z/xjNUs4OXih48A8uubYt6meV7Gf25po3oG+8HueSLmoV1vHuQ2/NPEiXw==
X-Received: by 2002:a17:902:29ab:: with SMTP id
 h40mr1757200plb.238.1544768890554;
        Thu, 13 Dec 2018 22:28:10 -0800 (PST)
X-Google-Smtp-Source: 
 AFSGD/UaEMP9Up2HnEEGI0vDLxVp5G1hcH8MXIpcoVfNPn7MI+qCJuqRH+OGp+R2jfMqL/VRqBWE
X-Received: by 2002:a17:902:29ab:: with SMTP id
 h40mr1757174plb.238.1544768889757;
        Thu, 13 Dec 2018 22:28:09 -0800 (PST)
ARC-Seal: i=1; a=rsa-sha256; t=1544768889; cv=none;
        d=google.com; s=arc-20160816;
        b=FnyFpV3zZdzMzlZXkDjRWVwsqAx0rWdXovRWKISD6g30gUO57sVeop+ceM7Q6a7xhA
         8gOWj5oAYnWzl80dZR4cp/73HAdSzuWWIeAelq+vJatwqQcRpN/OphGXDCm3YnuAGtDm
         EscubuKiw2nENZd7Ezg3FIp9mqct/G0zynREvNusFFGISHKeeD01CoHuoX14yDLAqLS8
         unDNOMYXucoNUxZuvwHt4WDuizt7ifNHJMPCyBS/hV0pWY6TggoOlag5cNW/AwDQszWZ
         uWqIQGutK/1UdzmaW6OxsrRyiOR4zD9Vo/ihNP9xdLsleu08n+AAZNlWc6/e//5gRbeX
         qTwQ==
ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com;
 s=arc-20160816;
        h=references:in-reply-to:message-id:date:subject:cc:to:from;
        bh=knYDE2Y4kYMyTftPpiIn/YNLPn8n1fSOGQupP+BrBRs=;
        b=GX1f1+rHQVf5k62Bu7A7iRHTzbhhN4KbiJD0sDhvDEO1K7/M3KtF7thd/KuZZerr9I
         ygfOpsoh5cyPqwo1dMtavWBgMlGk1bArz+OrRBk5ph9GQakx2tDNIaPbWy9KfwblzXIo
         4AOAjlF+dVWjEnRph1xWRHuLbi6MLn7p7FRmETR4dwwrclI5Pow6Hnitznj9BMHtI2M4
         k8UM3CJYNqh2hfUw6Tfh1W2UlAhr8WGpDvMlDL6t1yJSaWJn6wrFdssM4xnU/QD3PY8e
         rl1reqJ9jeIzEFutRUsTuKa7WStlU+koyQJRcTCz4ZX1Q68quTFXQLeemd3imHwX6m4d
         qDMg==
ARC-Authentication-Results: i=1; mx.google.com;
       spf=pass (google.com: domain of ying.huang@intel.com designates
 134.134.136.100 as permitted sender) smtp.mailfrom=ying.huang@intel.com;
       dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=intel.com
Received: from mga07.intel.com (mga07.intel.com. [134.134.136.100])
        by mx.google.com with ESMTPS id v19si3555849pfa.80.2018.12.13.22.28.09
        for <linux-mm@kvack.org>
        (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128);
        Thu, 13 Dec 2018 22:28:09 -0800 (PST)
Received-SPF: pass (google.com: domain of ying.huang@intel.com designates
 134.134.136.100 as permitted sender) client-ip=134.134.136.100;
Authentication-Results: mx.google.com;
       spf=pass (google.com: domain of ying.huang@intel.com designates
 134.134.136.100 as permitted sender) smtp.mailfrom=ying.huang@intel.com;
       dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=intel.com
X-Amp-Result: SKIPPED(no attachment in message)
X-Amp-File-Uploaded: False
Received: from fmsmga002.fm.intel.com ([10.253.24.26])
  by orsmga105.jf.intel.com with ESMTP/TLS/DHE-RSA-AES256-GCM-SHA384;
 13 Dec 2018 22:28:09 -0800
X-ExtLoop1: 1
X-IronPort-AV: E=Sophos;i="5.56,351,1539673200";
   d="scan'208";a="125840940"
Received: from yhuang-mobile.sh.intel.com ([10.239.197.226])
  by fmsmga002.fm.intel.com with ESMTP; 13 Dec 2018 22:28:06 -0800
From: Huang Ying <ying.huang@intel.com>
To: Andrew Morton <akpm@linux-foundation.org>
Cc: linux-mm@kvack.org,
	linux-kernel@vger.kernel.org,
	Huang Ying <ying.huang@intel.com>,
	"Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>,
	Andrea Arcangeli <aarcange@redhat.com>,
	Michal Hocko <mhocko@kernel.org>,
	Johannes Weiner <hannes@cmpxchg.org>,
	Shaohua Li <shli@kernel.org>,
	Hugh Dickins <hughd@google.com>,
	Minchan Kim <minchan@kernel.org>,
	Rik van Riel <riel@redhat.com>,
	Dave Hansen <dave.hansen@linux.intel.com>,
	Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>,
	Zi Yan <zi.yan@cs.rutgers.edu>,
	Daniel Jordan <daniel.m.jordan@oracle.com>
Subject: [PATCH -V9 10/21] swap: Swapin a THP in one piece
Date: Fri, 14 Dec 2018 14:27:43 +0800
Message-Id: <20181214062754.13723-11-ying.huang@intel.com>
X-Mailer: git-send-email 2.18.1
In-Reply-To: <20181214062754.13723-1-ying.huang@intel.com>
References: <20181214062754.13723-1-ying.huang@intel.com>
X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4
Sender: owner-linux-mm@kvack.org
Precedence: bulk
X-Loop: owner-majordomo@kvack.org
List-ID: <linux-mm.kvack.org>
X-Virus-Scanned: ClamAV using ClamSMTP

With this patch, when page fault handler find a PMD swap mapping, it
will swap in a THP in one piece.  This avoids the overhead of
splitting/collapsing before/after the THP swapping.  And improves the
swap performance greatly for reduced page fault count etc.

do_huge_pmd_swap_page() is added in the patch to implement this.  It
is similar to do_swap_page() for normal page swapin.

If failing to allocate a THP, the huge swap cluster and the PMD swap
mapping will be split to fallback to normal page swapin.

If the huge swap cluster has been split already, the PMD swap mapping
will be split to fallback to normal page swapin.

Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Shaohua Li <shli@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Zi Yan <zi.yan@cs.rutgers.edu>
Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
---
 include/linux/huge_mm.h |   9 +++
 mm/huge_memory.c        | 174 ++++++++++++++++++++++++++++++++++++++++
 mm/memory.c             |  16 ++--
 3 files changed, 193 insertions(+), 6 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 72f2617d336b..debe3760e894 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -371,4 +371,13 @@ static inline gfp_t alloc_hugepage_direct_gfpmask(struct vm_area_struct *vma)
 }
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 
+#ifdef CONFIG_THP_SWAP
+extern int do_huge_pmd_swap_page(struct vm_fault *vmf, pmd_t orig_pmd);
+#else /* CONFIG_THP_SWAP */
+static inline int do_huge_pmd_swap_page(struct vm_fault *vmf, pmd_t orig_pmd)
+{
+	return 0;
+}
+#endif /* CONFIG_THP_SWAP */
+
 #endif /* _LINUX_HUGE_MM_H */
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 1cec1eec340e..644cb5d6b056 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -33,6 +33,8 @@
 #include <linux/page_idle.h>
 #include <linux/shmem_fs.h>
 #include <linux/oom.h>
+#include <linux/delayacct.h>
+#include <linux/swap.h>
 
 #include <asm/tlb.h>
 #include <asm/pgalloc.h>
@@ -1652,6 +1654,178 @@ static void __split_huge_swap_pmd(struct vm_area_struct *vma,
 	pmd_populate(mm, pmd, pgtable);
 }
 
+#ifdef CONFIG_THP_SWAP
+static int split_huge_swap_pmd(struct vm_area_struct *vma, pmd_t *pmd,
+			       unsigned long address, pmd_t orig_pmd)
+{
+	struct mm_struct *mm = vma->vm_mm;
+	spinlock_t *ptl;
+	int ret = 0;
+
+	ptl = pmd_lock(mm, pmd);
+	if (pmd_same(*pmd, orig_pmd))
+		__split_huge_swap_pmd(vma, address, pmd);
+	else
+		ret = -ENOENT;
+	spin_unlock(ptl);
+
+	return ret;
+}
+
+int do_huge_pmd_swap_page(struct vm_fault *vmf, pmd_t orig_pmd)
+{
+	struct page *page;
+	struct mem_cgroup *memcg;
+	struct vm_area_struct *vma = vmf->vma;
+	unsigned long haddr = vmf->address & HPAGE_PMD_MASK;
+	swp_entry_t entry;
+	pmd_t pmd;
+	int i, locked, exclusive = 0, ret = 0;
+
+	entry = pmd_to_swp_entry(orig_pmd);
+	VM_BUG_ON(non_swap_entry(entry));
+	delayacct_set_flag(DELAYACCT_PF_SWAPIN);
+retry:
+	page = lookup_swap_cache(entry, NULL, vmf->address);
+	if (!page) {
+		page = read_swap_cache_async(entry, GFP_HIGHUSER_MOVABLE, vma,
+					     haddr, false);
+		if (!page) {
+			/*
+			 * Back out if somebody else faulted in this pmd
+			 * while we released the pmd lock.
+			 */
+			if (likely(pmd_same(*vmf->pmd, orig_pmd))) {
+				/*
+				 * Failed to allocate huge page, split huge swap
+				 * cluster, and fallback to swapin normal page
+				 */
+				ret = split_swap_cluster(entry, 0);
+				/* Somebody else swapin the swap entry, retry */
+				if (ret == -EEXIST) {
+					ret = 0;
+					goto retry;
+				/* swapoff occurs under us */
+				} else if (ret == -EINVAL)
+					ret = 0;
+				else
+					goto fallback;
+			}
+			delayacct_clear_flag(DELAYACCT_PF_SWAPIN);
+			goto out;
+		}
+
+		/* Had to read the page from swap area: Major fault */
+		ret = VM_FAULT_MAJOR;
+		count_vm_event(PGMAJFAULT);
+		count_memcg_event_mm(vma->vm_mm, PGMAJFAULT);
+	} else if (!PageTransCompound(page))
+		goto fallback;
+
+	locked = lock_page_or_retry(page, vma->vm_mm, vmf->flags);
+
+	delayacct_clear_flag(DELAYACCT_PF_SWAPIN);
+	if (!locked) {
+		ret |= VM_FAULT_RETRY;
+		goto out_release;
+	}
+
+	/*
+	 * Make sure try_to_free_swap or reuse_swap_page or swapoff did not
+	 * release the swapcache from under us.  The page pin, and pmd_same
+	 * test below, are not enough to exclude that.  Even if it is still
+	 * swapcache, we need to check that the page's swap has not changed.
+	 */
+	if (unlikely(!PageSwapCache(page) || page_private(page) != entry.val))
+		goto out_page;
+
+	if (mem_cgroup_try_charge_delay(page, vma->vm_mm, GFP_KERNEL,
+					&memcg, true)) {
+		ret = VM_FAULT_OOM;
+		goto out_page;
+	}
+
+	/*
+	 * Back out if somebody else already faulted in this pmd.
+	 */
+	vmf->ptl = pmd_lockptr(vma->vm_mm, vmf->pmd);
+	spin_lock(vmf->ptl);
+	if (unlikely(!pmd_same(*vmf->pmd, orig_pmd)))
+		goto out_nomap;
+
+	if (unlikely(!PageUptodate(page))) {
+		ret = VM_FAULT_SIGBUS;
+		goto out_nomap;
+	}
+
+	/*
+	 * The page isn't present yet, go ahead with the fault.
+	 *
+	 * Be careful about the sequence of operations here.
+	 * To get its accounting right, reuse_swap_page() must be called
+	 * while the page is counted on swap but not yet in mapcount i.e.
+	 * before page_add_anon_rmap() and swap_free(); try_to_free_swap()
+	 * must be called after the swap_free(), or it will never succeed.
+	 */
+
+	add_mm_counter(vma->vm_mm, MM_ANONPAGES, HPAGE_PMD_NR);
+	add_mm_counter(vma->vm_mm, MM_SWAPENTS, -HPAGE_PMD_NR);
+	pmd = mk_huge_pmd(page, vma->vm_page_prot);
+	if ((vmf->flags & FAULT_FLAG_WRITE) && reuse_swap_page(page, NULL)) {
+		pmd = maybe_pmd_mkwrite(pmd_mkdirty(pmd), vma);
+		vmf->flags &= ~FAULT_FLAG_WRITE;
+		ret |= VM_FAULT_WRITE;
+		exclusive = RMAP_EXCLUSIVE;
+	}
+	for (i = 0; i < HPAGE_PMD_NR; i++)
+		flush_icache_page(vma, page + i);
+	if (pmd_swp_soft_dirty(orig_pmd))
+		pmd = pmd_mksoft_dirty(pmd);
+	do_page_add_anon_rmap(page, vma, haddr,
+			      exclusive | RMAP_COMPOUND);
+	mem_cgroup_commit_charge(page, memcg, true, true);
+	activate_page(page);
+	set_pmd_at(vma->vm_mm, haddr, vmf->pmd, pmd);
+
+	swap_free(entry, HPAGE_PMD_NR);
+	if (mem_cgroup_swap_full(page) ||
+	    (vma->vm_flags & VM_LOCKED) || PageMlocked(page))
+		try_to_free_swap(page);
+	unlock_page(page);
+
+	if (vmf->flags & FAULT_FLAG_WRITE) {
+		spin_unlock(vmf->ptl);
+		ret |= do_huge_pmd_wp_page(vmf, pmd);
+		if (ret & VM_FAULT_ERROR)
+			ret &= VM_FAULT_ERROR;
+		goto out;
+	}
+
+	/* No need to invalidate - it was non-present before */
+	update_mmu_cache_pmd(vma, vmf->address, vmf->pmd);
+	spin_unlock(vmf->ptl);
+out:
+	return ret;
+out_nomap:
+	mem_cgroup_cancel_charge(page, memcg, true);
+	spin_unlock(vmf->ptl);
+out_page:
+	unlock_page(page);
+out_release:
+	put_page(page);
+	return ret;
+fallback:
+	delayacct_clear_flag(DELAYACCT_PF_SWAPIN);
+	if (!split_huge_swap_pmd(vmf->vma, vmf->pmd, vmf->address, orig_pmd))
+		ret = VM_FAULT_FALLBACK;
+	else
+		ret = 0;
+	if (page)
+		put_page(page);
+	return ret;
+}
+#endif
+
 /*
  * Return true if we do MADV_FREE successfully on entire pmd page.
  * Otherwise, return false.
diff --git a/mm/memory.c b/mm/memory.c
index 78f341b5672d..a480c562d7d9 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3870,13 +3870,17 @@ static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma,
 
 		barrier();
 		if (unlikely(is_swap_pmd(orig_pmd))) {
-			VM_BUG_ON(thp_migration_supported() &&
-					  !is_pmd_migration_entry(orig_pmd));
-			if (is_pmd_migration_entry(orig_pmd))
+			if (thp_migration_supported() &&
+			    is_pmd_migration_entry(orig_pmd)) {
 				pmd_migration_entry_wait(mm, vmf.pmd);
-			return 0;
-		}
-		if (pmd_trans_huge(orig_pmd) || pmd_devmap(orig_pmd)) {
+				return 0;
+			} else if (IS_ENABLED(CONFIG_THP_SWAP)) {
+				ret = do_huge_pmd_swap_page(&vmf, orig_pmd);
+				if (!(ret & VM_FAULT_FALLBACK))
+					return ret;
+			} else
+				VM_BUG_ON(1);
+		} else if (pmd_trans_huge(orig_pmd) || pmd_devmap(orig_pmd)) {
 			if (pmd_protnone(orig_pmd) && vma_is_accessible(vma))
 				return do_huge_pmd_numa_page(&vmf, orig_pmd);