From patchwork Tue Dec 8 17:29:01 2020
X-Patchwork-Submitter: Joao Martins
X-Patchwork-Id: 11959139
From: Joao Martins
To: linux-mm@kvack.org
Cc: Dan Williams, Ira Weiny, linux-nvdimm@lists.01.org, Matthew Wilcox,
    Jason Gunthorpe, Jane Chu, Muchun Song, Mike Kravetz, Andrew Morton,
    Joao Martins
Subject: [PATCH RFC 9/9] mm: Add follow_devmap_page() for devdax vmas
Date: Tue, 8 Dec 2020 17:29:01 +0000
Message-Id: <20201208172901.17384-11-joao.m.martins@oracle.com>
X-Mailer: git-send-email 2.11.0
In-Reply-To: <20201208172901.17384-1-joao.m.martins@oracle.com>
References: <20201208172901.17384-1-joao.m.martins@oracle.com>

Similar to follow_hugetlb_page(), add a follow_devmap_page() which,
rather than calling follow_page() once per 4K page within a PMD/PUD,
handles the whole PMD/PUD at once: lock the pmd/pud, grab all of its
pages, unlock. While doing so, the refcount is changed only once per
huge page when PGMAP_COMPOUND is passed in.

This lets us improve {pin,get}_user_pages{,_longterm}() considerably:

 $ gup_benchmark -f /dev/dax0.2 -m 16384 -r 10 -S [-U,-b,-L] -n 512 -w

 ()                                  [before]  ->  [after]
 (get_user_pages 2M pages)           ~150k us  ->  ~8.9k us
 (pin_user_pages 2M pages)           ~192k us  ->  ~9k us
 (pin_user_pages_longterm 2M pages)  ~200k us  ->  ~19k us

Signed-off-by: Joao Martins
---
I've special-cased this to device-dax vmas, given their page size
guarantees similar to hugetlbfs's, but I feel this is a bit wrong. I am
replicating follow_hugetlb_page(), and as an RFC this ought to seek
feedback on whether it should be generalized, assuming no fundamental
issues exist. In that case, should I change follow_page_mask() to take
either an array of pages, or a function pointer with opaque arguments
that would let the caller pick its own structure?
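To make that last question concrete, a rough sketch of the
function-pointer variant follows. Everything here is hypothetical (no
such callback type or follow_page_mask() variant exists today); it is
only meant to illustrate the shape of the interface:

	/* Hypothetical: caller-provided sink for each page the walker finds. */
	typedef int (*gup_record_fn)(struct page *page, unsigned long vaddr,
				     void *data);

	/*
	 * Hypothetical variant of follow_page_mask(): instead of the caller
	 * receiving one page at a time, each page found is handed to @record
	 * together with the opaque @data cookie, so a devmap-aware callback
	 * could batch the refcount update over a whole PMD/PUD.
	 */
	struct page *follow_page_mask_cb(struct vm_area_struct *vma,
					 unsigned long address,
					 unsigned int flags,
					 struct follow_page_context *ctx,
					 gup_record_fn record, void *data);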
---
 include/linux/huge_mm.h |   4 +
 include/linux/mm.h      |   2 +
 mm/gup.c                |  22 ++++-
 mm/huge_memory.c        | 202 ++++++++++++++++++++++++++++++++++++++++
 4 files changed, 227 insertions(+), 3 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 0365aa97f8e7..da87ecea19e6 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -293,6 +293,10 @@ struct page *follow_devmap_pmd(struct vm_area_struct *vma, unsigned long addr,
 		pmd_t *pmd, int flags, struct dev_pagemap **pgmap);
 struct page *follow_devmap_pud(struct vm_area_struct *vma, unsigned long addr,
 		pud_t *pud, int flags, struct dev_pagemap **pgmap);
+long follow_devmap_page(struct mm_struct *mm, struct vm_area_struct *vma,
+			struct page **pages, struct vm_area_struct **vmas,
+			unsigned long *position, unsigned long *nr_pages,
+			long i, unsigned int flags, int *locked);
 
 extern vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf, pmd_t orig_pmd);
 
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 8b0155441835..466c88679628 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1164,6 +1164,8 @@ static inline void get_page(struct page *page)
 	page_ref_inc(page);
 }
 
+__maybe_unused struct page *try_grab_compound_head(struct page *page, int refs,
+						   unsigned int flags);
 bool __must_check try_grab_page(struct page *page, unsigned int flags);
 
 static inline __must_check bool try_get_page(struct page *page)
diff --git a/mm/gup.c b/mm/gup.c
index 3a9a7229f418..50effb9cc349 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -78,7 +78,7 @@ static inline struct page *try_get_compound_head(struct page *page, int refs)
  * considered failure, and furthermore, a likely bug in the caller, so a warning
  * is also emitted.
  */
-static __maybe_unused struct page *try_grab_compound_head(struct page *page,
+__maybe_unused struct page *try_grab_compound_head(struct page *page,
 							  int refs,
 							  unsigned int flags)
 {
@@ -880,8 +880,8 @@ static int get_gate_page(struct mm_struct *mm, unsigned long address,
  * does not include FOLL_NOWAIT, the mmap_lock may be released. If it
  * is, *@locked will be set to 0 and -EBUSY returned.
  */
-static int faultin_page(struct vm_area_struct *vma,
-		unsigned long address, unsigned int *flags, int *locked)
+int faultin_page(struct vm_area_struct *vma,
+		 unsigned long address, unsigned int *flags, int *locked)
 {
 	unsigned int fault_flags = 0;
 	vm_fault_t ret;
@@ -1103,6 +1103,22 @@ static long __get_user_pages(struct mm_struct *mm,
 			}
 			continue;
 		}
+		if (vma_is_dax(vma)) {
+			i = follow_devmap_page(mm, vma, pages, vmas,
+					       &start, &nr_pages, i,
+					       gup_flags, locked);
+			if (locked && *locked == 0) {
+				/*
+				 * We've got a VM_FAULT_RETRY
+				 * and we've lost mmap_lock.
+				 * We must stop here.
+				 */
+				BUG_ON(gup_flags & FOLL_NOWAIT);
+				BUG_ON(ret != 0);
+				goto out;
+			}
+			continue;
+		}
 	}
retry:
 	/*
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index ec2bb93f7431..20bfbf211dc3 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1168,6 +1168,208 @@ struct page *follow_devmap_pud(struct vm_area_struct *vma, unsigned long addr,
 	return page;
 }
 
+long follow_devmap_page(struct mm_struct *mm, struct vm_area_struct *vma,
+			struct page **pages, struct vm_area_struct **vmas,
+			unsigned long *position, unsigned long *nr_pages,
+			long i, unsigned int flags, int *locked)
+{
+	unsigned long pfn_offset;
+	unsigned long vaddr = *position;
+	unsigned long remainder = *nr_pages;
+	unsigned long align = vma_kernel_pagesize(vma);
+	unsigned long align_nr_pages = align >> PAGE_SHIFT;
+	unsigned long mask = ~(align - 1);
+	unsigned long nr_pages_hpage = 0;
+	struct dev_pagemap *pgmap = NULL;
+	int err = -EFAULT;
+
+	if (align == PAGE_SIZE)
+		return i;
+
+	while (vaddr < vma->vm_end && remainder) {
+		pte_t *pte;
+		spinlock_t *ptl = NULL;
+		int absent;
+		struct page *page;
+
+		/*
+		 * If we have a pending SIGKILL, don't keep faulting pages and
+		 * potentially allocating memory.
+		 */
+		if (fatal_signal_pending(current)) {
+			remainder = 0;
+			break;
+		}
+
+		/*
+		 * Some archs (sparc64, sh*) have multiple pte_ts to
+		 * each hugepage. We have to make sure we get the
+		 * first, for the page indexing below to work.
+		 *
+		 * Note that page table lock is not held when pte is null.
+		 */
+		pte = huge_pte_offset(mm, vaddr & mask, align);
+		if (pte) {
+			if (align == PMD_SIZE)
+				ptl = pmd_lockptr(mm, (pmd_t *) pte);
+			else if (align == PUD_SIZE)
+				ptl = pud_lockptr(mm, (pud_t *) pte);
+			spin_lock(ptl);
+		}
+		absent = !pte || pte_none(ptep_get(pte));
+
+		if (absent && (flags & FOLL_DUMP)) {
+			if (pte)
+				spin_unlock(ptl);
+			remainder = 0;
+			break;
+		}
+
+		if (absent ||
+		    ((flags & FOLL_WRITE) &&
+		     !pte_write(ptep_get(pte)))) {
+			vm_fault_t ret;
+			unsigned int fault_flags = 0;
+
+			if (pte)
+				spin_unlock(ptl);
+			if (flags & FOLL_WRITE)
+				fault_flags |= FAULT_FLAG_WRITE;
+			if (locked)
+				fault_flags |= FAULT_FLAG_ALLOW_RETRY |
+					       FAULT_FLAG_KILLABLE;
+			if (flags & FOLL_NOWAIT)
+				fault_flags |= FAULT_FLAG_ALLOW_RETRY |
+					       FAULT_FLAG_RETRY_NOWAIT;
+			if (flags & FOLL_TRIED) {
+				/*
+				 * Note: FAULT_FLAG_ALLOW_RETRY and
+				 * FAULT_FLAG_TRIED can co-exist
+				 */
+				fault_flags |= FAULT_FLAG_TRIED;
+			}
+			ret = handle_mm_fault(vma, vaddr, fault_flags, NULL);
+			if (ret & VM_FAULT_ERROR) {
+				err = vm_fault_to_errno(ret, flags);
+				remainder = 0;
+				break;
+			}
+			if (ret & VM_FAULT_RETRY) {
+				if (locked &&
+				    !(fault_flags & FAULT_FLAG_RETRY_NOWAIT))
+					*locked = 0;
+				*nr_pages = 0;
+				/*
+				 * VM_FAULT_RETRY must not return an
+				 * error, it will return zero
+				 * instead.
+				 *
+				 * No need to update "position" as the
+				 * caller will not check it after
+				 * *nr_pages is set to 0.
+				 */
+				return i;
+			}
+			continue;
+		}
+
+		pfn_offset = (vaddr & ~mask) >> PAGE_SHIFT;
+		page = pte_page(ptep_get(pte));
+
+		pgmap = get_dev_pagemap(page_to_pfn(page), pgmap);
+		if (!pgmap) {
+			spin_unlock(ptl);
+			remainder = 0;
+			err = -EFAULT;
+			break;
+		}
+
+		/*
+		 * If subpage information not requested, update counters
+		 * and skip the same_page loop below.
+		 */
+		if (!pages && !vmas && !pfn_offset &&
+		    (vaddr + align < vma->vm_end) &&
+		    (remainder >= align_nr_pages)) {
+			vaddr += align;
+			remainder -= align_nr_pages;
+			i += align_nr_pages;
+			spin_unlock(ptl);
+			continue;
+		}
+
+		nr_pages_hpage = 0;
+
+same_page:
+		if (pages) {
+			pages[i] = mem_map_offset(page, pfn_offset);
+
+			/*
+			 * try_grab_page() should always succeed here, because:
+			 * a) we hold the ptl lock, and b) we've just checked
+			 * that the huge page is present in the page tables.
+			 */
+			if (!(pgmap->flags & PGMAP_COMPOUND) &&
+			    WARN_ON_ONCE(!try_grab_page(pages[i], flags))) {
+				spin_unlock(ptl);
+				remainder = 0;
+				err = -ENOMEM;
+				break;
+			}
+		}
+
+		if (vmas)
+			vmas[i] = vma;
+
+		vaddr += PAGE_SIZE;
+		++pfn_offset;
+		--remainder;
+		++i;
+		nr_pages_hpage++;
+		if (vaddr < vma->vm_end && remainder &&
+		    pfn_offset < align_nr_pages) {
+			/*
+			 * We use pfn_offset to avoid touching the pageframes
+			 * of this compound page.
+			 */
+			goto same_page;
+		} else {
+			/*
+			 * try_grab_compound_head() should always succeed here,
+			 * because: a) we hold the ptl lock, and b) we've just
+			 * checked that the huge page is present in the page
+			 * tables. If the huge page is present, then the tail
+			 * pages must also be present. The ptl prevents the
+			 * head page and tail pages from being rearranged in
+			 * any way. So this page must be available at this
+			 * point, unless the page refcount overflowed:
+			 */
+			if ((pgmap->flags & PGMAP_COMPOUND) &&
+			    WARN_ON_ONCE(!try_grab_compound_head(pages[i-1],
+								 nr_pages_hpage,
+								 flags))) {
+				put_dev_pagemap(pgmap);
+				spin_unlock(ptl);
+				remainder = 0;
+				err = -ENOMEM;
+				break;
+			}
+			put_dev_pagemap(pgmap);
+		}
+		spin_unlock(ptl);
+	}
+	*nr_pages = remainder;
+	/*
+	 * setting position is actually required only if remainder is
+	 * not zero but it's faster not to add a "if (remainder)"
+	 * branch.
+	 */
+	*position = vaddr;
+
+	return i ? i : err;
+}
+
 int copy_huge_pud(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 		  pud_t *dst_pud, pud_t *src_pud, unsigned long addr,
 		  struct vm_area_struct *vma)
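
As an aside, the refcount batching the commit message relies on boils
down to a single atomic update on the compound head instead of one per
base page. A minimal out-of-tree sketch of the idea follows; the
function name here is made up (the in-tree equivalent is
try_grab_compound_head() above), while compound_head() and
page_ref_add_unless() are existing kernel helpers:

	/*
	 * Illustration only: take @nr references on the compound page that
	 * @page belongs to with one atomic operation, failing if the
	 * refcount is already zero (i.e. the page is being freed).
	 */
	static struct page *grab_compound_head_once(struct page *page, int nr)
	{
		struct page *head = compound_head(page);

		/* One atomic add covers all @nr base pages. */
		if (!page_ref_add_unless(head, nr, 0))
			return NULL;
		return head;
	}

For a 2M PMD this is 1 atomic RMW instead of 512, which, together with
taking the PMD lock only once, is what the gup_benchmark numbers above
reflect.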