From patchwork Fri Jun 28 06:01:58 2024
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Ge Yang
X-Patchwork-Id: 13715490
From: yangge1116@126.com
To: akpm@linux-foundation.org
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, stable@vger.kernel.org,
 21cnbao@gmail.com, peterx@redhat.com, yang@os.amperecomputing.com,
 baolin.wang@linux.alibaba.com, liuzixing@hygon.cn, yangge
Subject: [PATCH V2] mm/gup: Fix longterm pin on slow gup regression
Date: Fri, 28 Jun 2024 14:01:58 +0800
Message-Id: <1719554518-11006-1-git-send-email-yangge1116@126.com>
X-Mailer: git-send-email 2.7.4
From: yangge

If a large amount of CMA memory is configured in the system (for example,
when CMA memory accounts for 50% of the system memory), starting a SEV
virtual machine will fail. During SEV virtual machine startup,
pin_user_pages_fast(..., FOLL_LONGTERM, ...) is called to pin memory.
Normally, if a page is present and in a CMA area, pin_user_pages_fast()
first calls __get_user_pages_locked() to pin the page in the CMA area,
and then calls check_and_migrate_movable_pages() to migrate the page from
the CMA area to a non-CMA area. But currently the call to
__get_user_pages_locked() fails, because it calls try_grab_folio() to pin
the page in the gup slow path.

Commit 57edfcfd3419 ("mm/gup: accelerate thp gup even for "pages !=
NULL"") uses try_grab_folio() in the gup slow path, which is problematic
because try_grab_folio() checks whether the page can be longterm pinned.
This check may fail, causing __get_user_pages_locked() to fail. However,
these checks are not required in the gup slow path, so we can use
try_grab_page() instead of try_grab_folio(). In addition, in the current
code, try_grab_page() can only add 1 to a page's refcount. We extend this
function so that the page's refcount can be increased according to the
parameters passed in.

The following log reveals it:

[  464.325306] WARNING: CPU: 13 PID: 6734 at mm/gup.c:1313 __get_user_pages+0x423/0x520
[  464.325464] CPU: 13 PID: 6734 Comm: qemu-kvm Kdump: loaded Not tainted 6.6.33+ #6
[  464.325477] RIP: 0010:__get_user_pages+0x423/0x520
[  464.325515] Call Trace:
[  464.325520]  <TASK>
[  464.325523]  ? __get_user_pages+0x423/0x520
[  464.325528]  ? __warn+0x81/0x130
[  464.325536]  ? __get_user_pages+0x423/0x520
[  464.325541]  ? report_bug+0x171/0x1a0
[  464.325549]  ? handle_bug+0x3c/0x70
[  464.325554]  ? exc_invalid_op+0x17/0x70
[  464.325558]  ? asm_exc_invalid_op+0x1a/0x20
[  464.325567]  ? __get_user_pages+0x423/0x520
[  464.325575]  __gup_longterm_locked+0x212/0x7a0
[  464.325583]  internal_get_user_pages_fast+0xfb/0x190
[  464.325590]  pin_user_pages_fast+0x47/0x60
[  464.325598]  sev_pin_memory+0xca/0x170 [kvm_amd]
[  464.325616]  sev_mem_enc_register_region+0x81/0x130 [kvm_amd]

In another thread [1], hugepd has a similar problem, so the relevant
handling code is included here as well.

[1] https://lore.kernel.org/all/20240604234858.948986-2-yang@os.amperecomputing.com/

Fixes: 57edfcfd3419 ("mm/gup: accelerate thp gup even for "pages != NULL"")
Cc: <stable@vger.kernel.org>
Signed-off-by: yangge <yangge1116@126.com>
---
 mm/gup.c         | 55 +++++++++++++++++++++++++++++--------------------------
 mm/huge_memory.c |  2 +-
 mm/internal.h    |  2 +-
 3 files changed, 31 insertions(+), 28 deletions(-)

V2:
1. Use unlikely() instead of WARN_ON_ONCE()
2. Rework the code and commit log to include hugepd path handling from Yang

diff --git a/mm/gup.c b/mm/gup.c
index 6ff9f95..070cf58 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -222,7 +222,7 @@ static void gup_put_folio(struct folio *folio, int refs, unsigned int flags)
  * -ENOMEM	FOLL_GET or FOLL_PIN was set, but the page could not
  *		be grabbed.
  */
-int __must_check try_grab_page(struct page *page, unsigned int flags)
+int __must_check try_grab_page(struct page *page, int refs, unsigned int flags)
 {
 	struct folio *folio = page_folio(page);
 
@@ -233,7 +233,7 @@ int __must_check try_grab_page(struct page *page, unsigned int flags)
 		return -EREMOTEIO;
 
 	if (flags & FOLL_GET)
-		folio_ref_inc(folio);
+		folio_ref_add(folio, refs);
 	else if (flags & FOLL_PIN) {
 		/*
 		 * Don't take a pin on the zero page - it's not going anywhere
@@ -248,13 +248,13 @@ int __must_check try_grab_page(struct page *page, unsigned int flags)
 		 * so that the page really is pinned.
 		 */
 		if (folio_test_large(folio)) {
-			folio_ref_add(folio, 1);
-			atomic_add(1, &folio->_pincount);
+			folio_ref_add(folio, refs);
+			atomic_add(refs, &folio->_pincount);
 		} else {
-			folio_ref_add(folio, GUP_PIN_COUNTING_BIAS);
+			folio_ref_add(folio, refs * GUP_PIN_COUNTING_BIAS);
 		}
 
-		node_stat_mod_folio(folio, NR_FOLL_PIN_ACQUIRED, 1);
+		node_stat_mod_folio(folio, NR_FOLL_PIN_ACQUIRED, refs);
 	}
 
 	return 0;
@@ -535,7 +535,7 @@ static unsigned long hugepte_addr_end(unsigned long addr, unsigned long end,
  */
 static int gup_hugepte(struct vm_area_struct *vma, pte_t *ptep, unsigned long sz,
 		       unsigned long addr, unsigned long end, unsigned int flags,
-		       struct page **pages, int *nr)
+		       struct page **pages, int *nr, bool fast)
 {
 	unsigned long pte_end;
 	struct page *page;
@@ -558,9 +558,14 @@ static int gup_hugepte(struct vm_area_struct *vma, pte_t *ptep, unsigned long sz
 	page = pte_page(pte);
 	refs = record_subpages(page, sz, addr, end, pages + *nr);
 
-	folio = try_grab_folio(page, refs, flags);
-	if (!folio)
-		return 0;
+	folio = page_folio(page);
+	if (fast) {
+		folio = try_grab_folio(page, refs, flags);
+		if (!folio)
+			return 0;
+	} else if (try_grab_page(page, refs, flags)) {
+		return 0;
+	}
 
 	if (unlikely(pte_val(pte) != pte_val(ptep_get(ptep)))) {
 		gup_put_folio(folio, refs, flags);
@@ -588,7 +593,7 @@ static int gup_hugepte(struct vm_area_struct *vma, pte_t *ptep, unsigned long sz
 static int gup_hugepd(struct vm_area_struct *vma, hugepd_t hugepd,
 		      unsigned long addr, unsigned int pdshift,
 		      unsigned long end, unsigned int flags,
-		      struct page **pages, int *nr)
+		      struct page **pages, int *nr, bool fast)
 {
 	pte_t *ptep;
 	unsigned long sz = 1UL << hugepd_shift(hugepd);
@@ -598,7 +603,7 @@ static int gup_hugepd(struct vm_area_struct *vma, hugepd_t hugepd,
 	ptep = hugepte_offset(hugepd, addr, pdshift);
 	do {
 		next = hugepte_addr_end(addr, end, sz);
-		ret = gup_hugepte(vma, ptep, sz, addr, end, flags, pages, nr);
+		ret = gup_hugepte(vma, ptep, sz, addr, end, flags, pages, nr, fast);
 		if (ret != 1)
 			return ret;
 	} while (ptep++, addr = next, addr != end);
@@ -625,7 +630,7 @@ static struct page *follow_hugepd(struct vm_area_struct *vma, hugepd_t hugepd,
 	ptep = hugepte_offset(hugepd, addr, pdshift);
 	ptl = huge_pte_lock(h, vma->vm_mm, ptep);
 	ret = gup_hugepd(vma, hugepd, addr, pdshift, addr + PAGE_SIZE,
-			 flags, &page, &nr);
+			 flags, &page, &nr, false);
 	spin_unlock(ptl);
 
 	if (ret == 1) {
@@ -642,7 +647,7 @@ static struct page *follow_hugepd(struct vm_area_struct *vma, hugepd_t hugepd,
 static inline int gup_hugepd(struct vm_area_struct *vma, hugepd_t hugepd,
 			     unsigned long addr, unsigned int pdshift,
 			     unsigned long end, unsigned int flags,
-			     struct page **pages, int *nr)
+			     struct page **pages, int *nr, bool fast)
 {
 	return 0;
 }
@@ -729,7 +734,7 @@ static struct page *follow_huge_pud(struct vm_area_struct *vma,
 	    gup_must_unshare(vma, flags, page))
 		return ERR_PTR(-EMLINK);
 
-	ret = try_grab_page(page, flags);
+	ret = try_grab_page(page, 1, flags);
 	if (ret)
 		page = ERR_PTR(ret);
 	else
@@ -806,7 +811,7 @@ static struct page *follow_huge_pmd(struct vm_area_struct *vma,
 	VM_BUG_ON_PAGE((flags & FOLL_PIN) && PageAnon(page) &&
 			!PageAnonExclusive(page), page);
 
-	ret = try_grab_page(page, flags);
+	ret = try_grab_page(page, 1, flags);
 	if (ret)
 		return ERR_PTR(ret);
@@ -969,7 +974,7 @@ static struct page *follow_page_pte(struct vm_area_struct *vma,
 		       !PageAnonExclusive(page), page);
 
 	/* try_grab_page() does nothing unless FOLL_GET or FOLL_PIN is set. */
-	ret = try_grab_page(page, flags);
+	ret = try_grab_page(page, 1, flags);
 	if (unlikely(ret)) {
 		page = ERR_PTR(ret);
 		goto out;
 	}
@@ -1233,7 +1238,7 @@ static int get_gate_page(struct mm_struct *mm, unsigned long address,
 			goto unmap;
 		*page = pte_page(entry);
 	}
-	ret = try_grab_page(*page, gup_flags);
+	ret = try_grab_page(*page, 1, gup_flags);
 	if (unlikely(ret))
 		goto unmap;
 out:
@@ -1636,22 +1641,20 @@ static long __get_user_pages(struct mm_struct *mm,
 			 * pages.
 			 */
 			if (page_increm > 1) {
-				struct folio *folio;
 
 				/*
 				 * Since we already hold refcount on the
 				 * large folio, this should never fail.
 				 */
-				folio = try_grab_folio(page, page_increm - 1,
+				ret = try_grab_page(page, page_increm - 1,
 						foll_flags);
-				if (WARN_ON_ONCE(!folio)) {
+				if (unlikely(ret)) {
 					/*
 					 * Release the 1st page ref if the
 					 * folio is problematic, fail hard.
 					 */
 					gup_put_folio(page_folio(page), 1,
 						      foll_flags);
-					ret = -EFAULT;
 					goto out;
 				}
 			}
@@ -3276,7 +3279,7 @@ static int gup_fast_pmd_range(pud_t *pudp, pud_t pud, unsigned long addr,
 		 * pmd format and THP pmd format
 		 */
 		if (gup_hugepd(NULL, __hugepd(pmd_val(pmd)), addr,
-			       PMD_SHIFT, next, flags, pages, nr) != 1)
+			       PMD_SHIFT, next, flags, pages, nr, true) != 1)
 			return 0;
 	} else if (!gup_fast_pte_range(pmd, pmdp, addr, next, flags,
 				       pages, nr))
@@ -3306,7 +3309,7 @@ static int gup_fast_pud_range(p4d_t *p4dp, p4d_t p4d, unsigned long addr,
 			return 0;
 	} else if (unlikely(is_hugepd(__hugepd(pud_val(pud))))) {
 		if (gup_hugepd(NULL, __hugepd(pud_val(pud)), addr,
-			       PUD_SHIFT, next, flags, pages, nr) != 1)
+			       PUD_SHIFT, next, flags, pages, nr, true) != 1)
 			return 0;
 	} else if (!gup_fast_pmd_range(pudp, pud, addr, next, flags,
 				       pages, nr))
@@ -3333,7 +3336,7 @@ static int gup_fast_p4d_range(pgd_t *pgdp, pgd_t pgd, unsigned long addr,
 		BUILD_BUG_ON(p4d_leaf(p4d));
 		if (unlikely(is_hugepd(__hugepd(p4d_val(p4d))))) {
 			if (gup_hugepd(NULL, __hugepd(p4d_val(p4d)), addr,
-				       P4D_SHIFT, next, flags, pages, nr) != 1)
+				       P4D_SHIFT, next, flags, pages, nr, true) != 1)
 				return 0;
 		} else if (!gup_fast_pud_range(p4dp, p4d, addr, next, flags,
 					       pages, nr))
@@ -3362,7 +3365,7 @@ static void gup_fast_pgd_range(unsigned long addr, unsigned long end,
 			return;
 	} else if (unlikely(is_hugepd(__hugepd(pgd_val(pgd))))) {
 		if (gup_hugepd(NULL, __hugepd(pgd_val(pgd)), addr,
-			       PGDIR_SHIFT, next, flags, pages, nr) != 1)
+			       PGDIR_SHIFT, next, flags, pages, nr, true) != 1)
 			return;
 	} else if (!gup_fast_p4d_range(pgdp, pgd, addr, next, flags,
 				       pages, nr))
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 425374a..18604e4 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1332,7 +1332,7 @@ struct page *follow_devmap_pmd(struct vm_area_struct *vma, unsigned long addr,
 	if (!*pgmap)
 		return ERR_PTR(-EFAULT);
 	page = pfn_to_page(pfn);
-	ret = try_grab_page(page, flags);
+	ret = try_grab_page(page, 1, flags);
 	if (ret)
 		page = ERR_PTR(ret);
diff --git a/mm/internal.h b/mm/internal.h
index 2ea9a88..5305bbf 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -1227,7 +1227,7 @@ int migrate_device_coherent_page(struct page *page);
  * mm/gup.c
  */
 struct folio *try_grab_folio(struct page *page, int refs, unsigned int flags);
-int __must_check try_grab_page(struct page *page, unsigned int flags);
+int __must_check try_grab_page(struct page *page, int refs, unsigned int flags);
 
 /*
  * mm/huge_memory.c