From patchwork Mon Jun 26 17:14:30 2023
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Ryan Roberts <ryan.roberts@arm.com>
X-Patchwork-Id: 13293239
Return-Path: <owner-linux-mm@kvack.org>
X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on
	aws-us-west-2-korg-lkml-1.web.codeaurora.org
Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17])
	by smtp.lore.kernel.org (Postfix) with ESMTP id 40B13EB64D7
	for <linux-mm@archiver.kernel.org>; Mon, 26 Jun 2023 17:15:16 +0000 (UTC)
Received: by kanga.kvack.org (Postfix)
	id DAE168E0006; Mon, 26 Jun 2023 13:15:15 -0400 (EDT)
Received: by kanga.kvack.org (Postfix, from userid 40)
	id D5E2D8E0002; Mon, 26 Jun 2023 13:15:15 -0400 (EDT)
X-Delivered-To: int-list-linux-mm@kvack.org
Received: by kanga.kvack.org (Postfix, from userid 63042)
	id BFF2D8E0006; Mon, 26 Jun 2023 13:15:15 -0400 (EDT)
X-Delivered-To: linux-mm@kvack.org
Received: from relay.hostedemail.com (smtprelay0013.hostedemail.com
 [216.40.44.13])
	by kanga.kvack.org (Postfix) with ESMTP id B28798E0002
	for <linux-mm@kvack.org>; Mon, 26 Jun 2023 13:15:15 -0400 (EDT)
Received: from smtpin15.hostedemail.com (a10.router.float.18 [10.200.18.1])
	by unirelay02.hostedemail.com (Postfix) with ESMTP id 5ECDD1207D3
	for <linux-mm@kvack.org>; Mon, 26 Jun 2023 17:15:15 +0000 (UTC)
X-FDA: 80945549790.15.816A66B
Received: from foss.arm.com (foss.arm.com [217.140.110.172])
	by imf16.hostedemail.com (Postfix) with ESMTP id 7202418000A
	for <linux-mm@kvack.org>; Mon, 26 Jun 2023 17:15:12 +0000 (UTC)
Authentication-Results: imf16.hostedemail.com;
	dkim=none;
	dmarc=pass (policy=none) header.from=arm.com;
	spf=pass (imf16.hostedemail.com: domain of ryan.roberts@arm.com designates
 217.140.110.172 as permitted sender) smtp.mailfrom=ryan.roberts@arm.com
ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed;
 d=hostedemail.com;
	s=arc-20220608; t=1687799712;
	h=from:from:sender:reply-to:subject:subject:date:date:
	 message-id:message-id:to:to:cc:cc:mime-version:mime-version:
	 content-type:content-transfer-encoding:content-transfer-encoding:
	 in-reply-to:in-reply-to:references:references;
	bh=Ayx1XUOP8Jqvje8vyoH06qGKySRPMl/T4IoLYSq146A=;
	b=3WIU7PPKklb5TVgT2HneXczEmYmQfrSWWJui62mXlatFdsKeSUrGO87DSn2BGUAyl4015M
	U0bls+DMuqIZGO1xMfEM+txG2CU7whuZlU636/g142IOMXAUALt2WOuqr48vRJpewf8WCk
	qQN3kApIXsYkabP/eyxPoMxgXhmJNmg=
ARC-Authentication-Results: i=1;
	imf16.hostedemail.com;
	dkim=none;
	dmarc=pass (policy=none) header.from=arm.com;
	spf=pass (imf16.hostedemail.com: domain of ryan.roberts@arm.com designates
 217.140.110.172 as permitted sender) smtp.mailfrom=ryan.roberts@arm.com
ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1687799712; a=rsa-sha256;
	cv=none;
	b=clucTa/HsRemtjqPfB7uqB43XoDhqR74t//xmfYLe8Kt1pCp5L9PkSjr6FXq7KCjn1J9zi
	RDyElnhul7eTnrUdSZ/eFMgFdXkldydRHa5DtEIThnJ9Wc0azwo3zsm/RFzALOHAWlgC05
	tWjM0qGCNDbcGY57fvCVwfRo1/oqTyI=
Received: from usa-sjc-imap-foss1.foss.arm.com (unknown [10.121.207.14])
	by usa-sjc-mx-foss1.foss.arm.com (Postfix) with ESMTP id D0EE615DB;
	Mon, 26 Jun 2023 10:15:55 -0700 (PDT)
Received: from e125769.cambridge.arm.com (e125769.cambridge.arm.com
 [10.1.196.26])
	by usa-sjc-imap-foss1.foss.arm.com (Postfix) with ESMTPSA id E66B63F663;
	Mon, 26 Jun 2023 10:15:08 -0700 (PDT)
From: Ryan Roberts <ryan.roberts@arm.com>
To: Andrew Morton <akpm@linux-foundation.org>,
	"Matthew Wilcox (Oracle)" <willy@infradead.org>,
	"Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>,
	Yin Fengwei <fengwei.yin@intel.com>,
	David Hildenbrand <david@redhat.com>,
	Yu Zhao <yuzhao@google.com>,
	Catalin Marinas <catalin.marinas@arm.com>,
	Will Deacon <will@kernel.org>,
	Geert Uytterhoeven <geert@linux-m68k.org>,
	Christian Borntraeger <borntraeger@linux.ibm.com>,
	Sven Schnelle <svens@linux.ibm.com>,
	Thomas Gleixner <tglx@linutronix.de>,
	Ingo Molnar <mingo@redhat.com>,
	Borislav Petkov <bp@alien8.de>,
	Dave Hansen <dave.hansen@linux.intel.com>,
	"H. Peter Anvin" <hpa@zytor.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>,
	linux-kernel@vger.kernel.org,
	linux-mm@kvack.org,
	linux-alpha@vger.kernel.org,
	linux-arm-kernel@lists.infradead.org,
	linux-ia64@vger.kernel.org,
	linux-m68k@lists.linux-m68k.org,
	linux-s390@vger.kernel.org
Subject: [PATCH v1 10/10] mm: Allocate large folios for anonymous memory
Date: Mon, 26 Jun 2023 18:14:30 +0100
Message-Id: <20230626171430.3167004-11-ryan.roberts@arm.com>
X-Mailer: git-send-email 2.25.1
In-Reply-To: <20230626171430.3167004-1-ryan.roberts@arm.com>
References: <20230626171430.3167004-1-ryan.roberts@arm.com>
MIME-Version: 1.0
X-Rspam-User: 
X-Rspamd-Server: rspam12
X-Rspamd-Queue-Id: 7202418000A
X-Stat-Signature: seuqhu5mq11k977mu6p3qt7je1hkngy3
X-HE-Tag: 1687799712-799889
X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4
Sender: owner-linux-mm@kvack.org
Precedence: bulk
X-Loop: owner-majordomo@kvack.org
List-ID: <linux-mm.kvack.org>

With all of the enabler patches in place, modify the anonymous memory
write allocation path so that it opportunistically attempts to allocate
a large folio of up to `max_anon_folio_order()` (this limit is
ultimately configured by the architecture). This reduces the number of
page faults, shortens lists such as the LRU, and generally
improves performance by batching what were per-page operations into
per-(large)-folio operations.

If CONFIG_LARGE_ANON_FOLIO is not enabled (the default) then
`max_anon_folio_order()` always returns 0, meaning we get the existing
allocation behaviour.

Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
 mm/memory.c | 159 +++++++++++++++++++++++++++++++++++++++++++++++-----
 1 file changed, 144 insertions(+), 15 deletions(-)
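
Note for reviewers (below the "---", so not part of the commit message):
the order selection done by calc_anon_folio_order_alloc() can be tried
out in userspace with the rough sketch below. It is only an illustration
under stated assumptions, not the kernel code. Every sketch_* name, the
4K page size, the 512-entry pte table and the bool array standing in for
pte_none() are hypothetical. The first_set caching optimisation is left
out for brevity, so each candidate order rescans its whole range.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed geometry: 4K pages, 512 ptes per pmd entry (hypothetical). */
#define SKETCH_PAGE_SHIFT	12
#define SKETCH_PTRS_PER_PTE	512
#define SKETCH_PMD_SIZE		((unsigned long)SKETCH_PTRS_PER_PTE << SKETCH_PAGE_SHIFT)
#define SKETCH_TABLE_BASE	0x200000UL	/* VA mapped by pte index 0 */

/* Hypothetical stand-in for the vma bounds. */
struct sketch_vma {
	unsigned long start;	/* inclusive */
	unsigned long end;	/* exclusive */
};

/* Hypothetical stand-in for the parts of struct vm_fault used here. */
struct sketch_fault {
	const struct sketch_vma *vma;
	unsigned long address;		/* faulting address */
	unsigned long table_base;	/* pmd-aligned VA of populated[0] */
	const bool *populated;		/* true where the pte is not none */
};

/* Index of the first populated pte, or nr if all nr entries are none. */
static int sketch_first_populated(const bool *pte, int nr)
{
	for (int i = 0; i < nr; i++) {
		if (pte[i])
			return i;
	}
	return nr;
}

/*
 * Largest order <= 'order' whose naturally aligned range stays inside
 * the vma, inside one pmd entry, and covers only none ptes. Order-1 is
 * never returned; the result falls back to 0 instead.
 */
static int sketch_calc_order(const struct sketch_fault *f, int order)
{
	assert((f->table_base & (SKETCH_PMD_SIZE - 1)) == 0);
	assert(f->address >= f->table_base &&
	       f->address < f->table_base + SKETCH_PMD_SIZE);

	if (order > 9)		/* log2(SKETCH_PTRS_PER_PTE) */
		order = 9;

	for (; order > 1; order--) {
		int nr = 1 << order;
		unsigned long size = (unsigned long)nr << SKETCH_PAGE_SHIFT;
		unsigned long addr = f->address & ~(size - 1);
		unsigned long first = (addr - f->table_base) >> SKETCH_PAGE_SHIFT;

		/* Check vma bounds. */
		if (addr < f->vma->start || addr + size > f->vma->end)
			continue;

		/* Accept this order only if every covered pte is none. */
		if (sketch_first_populated(f->populated + first, nr) == nr)
			return order;
	}

	return 0;
}

/* Helper for experiments: a table with at most one populated pte. */
static int sketch_order_one_populated(unsigned long vma_start,
				      unsigned long vma_end,
				      unsigned long address,
				      int order, int populated_idx)
{
	bool populated[SKETCH_PTRS_PER_PTE] = { false };
	struct sketch_vma vma = { vma_start, vma_end };
	struct sketch_fault f = { &vma, address, SKETCH_TABLE_BASE, populated };

	if (populated_idx >= 0 && populated_idx < SKETCH_PTRS_PER_PTE)
		populated[populated_idx] = true;

	return sketch_calc_order(&f, order);
}
```

For a fault at 0x205000 in a vma spanning 0x200000-0x400000 with a
requested order of 4, the sketch picks order-4 when the table is empty,
order-3 when only pte index 8 is populated, and order-0 when pte index 6
is populated (every aligned range of order 2 or more around the fault
then overlaps it).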

diff --git a/mm/memory.c b/mm/memory.c
index a8f7e2b28d7a..d23c44cc5092 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3161,6 +3161,90 @@ static inline int max_anon_folio_order(struct vm_area_struct *vma)
 		return CONFIG_LARGE_ANON_FOLIO_NOTHP_ORDER_MAX;
 }
 
+/*
+ * Returns index of first pte that is not none, or nr if all are none.
+ */
+static inline int check_ptes_none(pte_t *pte, int nr)
+{
+	int i;
+
+	for (i = 0; i < nr; i++) {
+		if (!pte_none(ptep_get(pte++)))
+			return i;
+	}
+
+	return nr;
+}
+
+static int calc_anon_folio_order_alloc(struct vm_fault *vmf, int order)
+{
+	/*
+	 * The aim here is to determine what size of folio we should allocate
+	 * for this fault. Factors include:
+	 * - Order must not be higher than `order` upon entry
+	 * - Folio must be naturally aligned within VA space
+	 * - Folio must not breach boundaries of vma
+	 * - Folio must be fully contained inside one pmd entry
+	 * - Folio must not overlap any non-none ptes
+	 *
+	 * Additionally, we do not allow order-1 since this breaks assumptions
+	 * elsewhere in the mm; THP pages must be at least order-2 (since they
+	 * store state up to the 3rd struct page subpage), and these pages must
+	 * be THP in order to correctly use pre-existing THP infrastructure such
+	 * as folio_split().
+	 *
+	 * As a consequence of relying on the THP infrastructure, if the system
+	 * does not support THP, we always fall back to order-0.
+	 *
+	 * Note that the caller may or may not choose to lock the pte. If
+	 * unlocked, the calculation should be considered an estimate that will
+	 * need to be validated under the lock.
+	 */
+
+	struct vm_area_struct *vma = vmf->vma;
+	int nr;
+	unsigned long addr;
+	pte_t *pte;
+	pte_t *first_set = NULL;
+	int ret;
+
+	if (has_transparent_hugepage()) {
+		order = min(order, PMD_SHIFT - PAGE_SHIFT);
+
+		for (; order > 1; order--) {
+			nr = 1 << order;
+			addr = ALIGN_DOWN(vmf->address, nr << PAGE_SHIFT);
+			pte = vmf->pte - ((vmf->address - addr) >> PAGE_SHIFT);
+
+			/* Check vma bounds. */
+			if (addr < vma->vm_start ||
+			    addr + (nr << PAGE_SHIFT) > vma->vm_end)
+				continue;
+
+			/* Ptes covered by order already known to be none. */
+			if (pte + nr <= first_set)
+				break;
+
+			/* Already found set pte in range covered by order. */
+			if (pte <= first_set)
+				continue;
+
+			/* Need to check if all the ptes are none. */
+			ret = check_ptes_none(pte, nr);
+			if (ret == nr)
+				break;
+
+			first_set = pte + ret;
+		}
+
+		if (order == 1)
+			order = 0;
+	} else
+		order = 0;
+
+	return order;
+}
+
 /*
  * Handle write page faults for pages that can be reused in the current vma
  *
@@ -4201,6 +4285,9 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
 	struct folio *folio;
 	vm_fault_t ret = 0;
 	pte_t entry;
+	unsigned long addr;
+	int order = uffd_wp ? 0 : max_anon_folio_order(vma);
+	int pgcount = BIT(order);
 
 	/* File mapping without ->vm_ops ? */
 	if (vma->vm_flags & VM_SHARED)
@@ -4242,24 +4329,44 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
 			pte_unmap_unlock(vmf->pte, vmf->ptl);
 			return handle_userfault(vmf, VM_UFFD_MISSING);
 		}
-		goto setpte;
+		if (uffd_wp)
+			entry = pte_mkuffd_wp(entry);
+		set_pte_at(vma->vm_mm, vmf->address, vmf->pte, entry);
+
+		/* No need to invalidate - it was non-present before */
+		update_mmu_cache(vma, vmf->address, vmf->pte);
+		goto unlock;
 	}
 
-	/* Allocate our own private page. */
+retry:
+	/*
+	 * Estimate the folio order to allocate. We are not under the ptl here
+	 * so this estimate needs to be re-checked later once we have the lock.
+	 */
+	vmf->pte = pte_offset_map(vmf->pmd, vmf->address);
+	order = calc_anon_folio_order_alloc(vmf, order);
+	pte_unmap(vmf->pte);
+
+	/* Allocate our own private folio. */
 	if (unlikely(anon_vma_prepare(vma)))
 		goto oom;
-	folio = vma_alloc_zeroed_movable_folio(vma, vmf->address, 0, 0);
+	folio = try_vma_alloc_movable_folio(vma, vmf->address, order, true);
 	if (!folio)
 		goto oom;
 
+	/* We may have been granted less than we asked for. */
+	order = folio_order(folio);
+	pgcount = BIT(order);
+	addr = ALIGN_DOWN(vmf->address, pgcount << PAGE_SHIFT);
+
 	if (mem_cgroup_charge(folio, vma->vm_mm, GFP_KERNEL))
 		goto oom_free_page;
 	folio_throttle_swaprate(folio, GFP_KERNEL);
 
 	/*
 	 * The memory barrier inside __folio_mark_uptodate makes sure that
-	 * preceding stores to the page contents become visible before
-	 * the set_pte_at() write.
+	 * preceding stores to the folio contents become visible before
+	 * the set_ptes() write.
 	 */
 	__folio_mark_uptodate(folio);
 
@@ -4268,11 +4375,31 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
 	if (vma->vm_flags & VM_WRITE)
 		entry = pte_mkwrite(pte_mkdirty(entry));
 
-	vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
-			&vmf->ptl);
-	if (vmf_pte_changed(vmf)) {
-		update_mmu_tlb(vma, vmf->address, vmf->pte);
-		goto release;
+	vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, addr, &vmf->ptl);
+
+	/*
+	 * Ensure our estimate above is still correct; we could have raced with
+	 * another thread to service a fault in the region.
+	 */
+	if (order == 0) {
+		if (vmf_pte_changed(vmf)) {
+			update_mmu_tlb(vma, vmf->address, vmf->pte);
+			goto release;
+		}
+	} else if (check_ptes_none(vmf->pte, pgcount) != pgcount) {
+		pte_t *pte = vmf->pte + ((vmf->address - addr) >> PAGE_SHIFT);
+
+		/* If faulting pte was allocated by another, exit early. */
+		if (!pte_none(ptep_get(pte))) {
+			update_mmu_tlb(vma, vmf->address, pte);
+			goto release;
+		}
+
+		/* Else try again, with a lower order. */
+		pte_unmap_unlock(vmf->pte, vmf->ptl);
+		folio_put(folio);
+		order--;
+		goto retry;
 	}
 
 	ret = check_stable_address_space(vma->vm_mm);
@@ -4286,16 +4413,18 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
 		return handle_userfault(vmf, VM_UFFD_MISSING);
 	}
 
-	inc_mm_counter(vma->vm_mm, MM_ANONPAGES);
-	folio_add_new_anon_rmap(folio, vma, vmf->address);
+	folio_ref_add(folio, pgcount - 1);
+
+	add_mm_counter(vma->vm_mm, MM_ANONPAGES, pgcount);
+	folio_add_new_anon_rmap_range(folio, &folio->page, pgcount, vma, addr);
 	folio_add_lru_vma(folio, vma);
-setpte:
+
 	if (uffd_wp)
 		entry = pte_mkuffd_wp(entry);
-	set_pte_at(vma->vm_mm, vmf->address, vmf->pte, entry);
+	set_ptes(vma->vm_mm, addr, vmf->pte, entry, pgcount);
 
 	/* No need to invalidate - it was non-present before */
-	update_mmu_cache(vma, vmf->address, vmf->pte);
+	update_mmu_cache_range(vma, addr, vmf->pte, pgcount);
 unlock:
 	pte_unmap_unlock(vmf->pte, vmf->ptl);
 	return ret;