From patchwork Wed Nov 10 10:54:14 2021
X-Patchwork-Submitter: Qi Zheng
X-Patchwork-Id: 12611839
From: Qi Zheng
To: akpm@linux-foundation.org, tglx@linutronix.de, kirill.shutemov@linux.intel.com,
    mika.penttila@nextfour.com, david@redhat.com, jgg@nvidia.com
Cc: linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org,
    songmuchun@bytedance.com, zhouchengming@bytedance.com, Qi Zheng
Subject: [PATCH v3 01/15] mm: do code cleanups to filemap_map_pmd()
Date: Wed, 10 Nov 2021 18:54:14 +0800
Message-Id: <20211110105428.32458-2-zhengqi.arch@bytedance.com>
In-Reply-To: <20211110105428.32458-1-zhengqi.arch@bytedance.com>
References: <20211110105428.32458-1-zhengqi.arch@bytedance.com>

Currently the same few lines are repeated twice in filemap_map_pmd().
Deduplicate them and fix some code style issues.

Signed-off-by: Qi Zheng
---
 mm/filemap.c | 19 +++++++++----------
 1 file changed, 9 insertions(+), 10 deletions(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index daa0e23a6ee6..07c654202870 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -3203,11 +3203,8 @@ static bool filemap_map_pmd(struct vm_fault *vmf, struct page *page)
 	struct mm_struct *mm = vmf->vma->vm_mm;
 
 	/* Huge page is mapped? No need to proceed. */
-	if (pmd_trans_huge(*vmf->pmd)) {
-		unlock_page(page);
-		put_page(page);
-		return true;
-	}
+	if (pmd_trans_huge(*vmf->pmd))
+		goto out;
 
 	if (pmd_none(*vmf->pmd) && PageTransHuge(page)) {
 		vm_fault_t ret = do_set_pmd(vmf, page);
@@ -3222,13 +3219,15 @@ static bool filemap_map_pmd(struct vm_fault *vmf, struct page *page)
 		pmd_install(mm, vmf->pmd, &vmf->prealloc_pte);
 
 	/* See comment in handle_pte_fault() */
-	if (pmd_devmap_trans_unstable(vmf->pmd)) {
-		unlock_page(page);
-		put_page(page);
-		return true;
-	}
+	if (pmd_devmap_trans_unstable(vmf->pmd))
+		goto out;
 
 	return false;
+
+out:
+	unlock_page(page);
+	put_page(page);
+	return true;
 }
 
 static struct page *next_uptodate_page(struct page *page,

From patchwork Wed Nov 10 10:54:15 2021
X-Patchwork-Submitter: Qi Zheng
X-Patchwork-Id: 12611835
From: Qi Zheng
To: akpm@linux-foundation.org, tglx@linutronix.de, kirill.shutemov@linux.intel.com,
    mika.penttila@nextfour.com, david@redhat.com, jgg@nvidia.com
Cc: linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org,
    songmuchun@bytedance.com, zhouchengming@bytedance.com, Qi Zheng
Subject: [PATCH v3 02/15] mm: introduce is_huge_pmd() helper
Date: Wed, 10 Nov 2021 18:54:15 +0800
Message-Id: <20211110105428.32458-3-zhengqi.arch@bytedance.com>
In-Reply-To: <20211110105428.32458-1-zhengqi.arch@bytedance.com>
References: <20211110105428.32458-1-zhengqi.arch@bytedance.com>

Currently the following judgment is repeated in several places in the
code:

	is_swap_pmd(*pmd) || pmd_trans_huge(*pmd) || pmd_devmap(*pmd)

It determines whether *pmd is a huge pmd, so introduce the is_huge_pmd()
helper to deduplicate these checks.
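As a condensed illustration of the change this makes at each call site
(it only restates what the diff below does, e.g. in pmd_trans_huge_lock();
nothing beyond the diff is introduced here):

	/* Before: the huge-pmd test is open-coded at every call site. */
	if (is_swap_pmd(*pmd) || pmd_trans_huge(*pmd) || pmd_devmap(*pmd))
		return __pmd_trans_huge_lock(pmd, vma);

	/* After: the same condition goes through the new helper. */
	if (is_huge_pmd(*pmd))
		return __pmd_trans_huge_lock(pmd, vma);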
Signed-off-by: Qi Zheng
Reported-by: kernel test robot
---
 include/linux/huge_mm.h | 10 +++++++---
 mm/huge_memory.c | 3 +--
 mm/memory.c | 5 ++---
 mm/mprotect.c | 2 +-
 mm/mremap.c | 3 +--
 5 files changed, 12 insertions(+), 11 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index f280f33ff223..b37a89180846 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -199,8 +199,7 @@ void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
 #define split_huge_pmd(__vma, __pmd, __address)				\
 	do {								\
 		pmd_t *____pmd = (__pmd);				\
-		if (is_swap_pmd(*____pmd) || pmd_trans_huge(*____pmd)	\
-					|| pmd_devmap(*____pmd))	\
+		if (is_huge_pmd(*____pmd))				\
 			__split_huge_pmd(__vma, __pmd, __address,	\
 						false, NULL);		\
 	}  while (0)
@@ -232,11 +231,16 @@ static inline int is_swap_pmd(pmd_t pmd)
 	return !pmd_none(pmd) && !pmd_present(pmd);
 }
 
+static inline int is_huge_pmd(pmd_t pmd)
+{
+	return is_swap_pmd(pmd) || pmd_trans_huge(pmd) || pmd_devmap(pmd);
+}
+
 /* mmap_lock must be held on entry */
 static inline spinlock_t *pmd_trans_huge_lock(pmd_t *pmd,
 		struct vm_area_struct *vma)
 {
-	if (is_swap_pmd(*pmd) || pmd_trans_huge(*pmd) || pmd_devmap(*pmd))
+	if (is_huge_pmd(*pmd))
 		return __pmd_trans_huge_lock(pmd, vma);
 	else
 		return NULL;
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index e5483347291c..e76ee2e1e423 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1832,8 +1832,7 @@ spinlock_t *__pmd_trans_huge_lock(pmd_t *pmd, struct vm_area_struct *vma)
 {
 	spinlock_t *ptl;
 	ptl = pmd_lock(vma->vm_mm, pmd);
-	if (likely(is_swap_pmd(*pmd) || pmd_trans_huge(*pmd) ||
-			pmd_devmap(*pmd)))
+	if (likely(is_huge_pmd(*pmd)))
 		return ptl;
 	spin_unlock(ptl);
 	return NULL;
diff --git a/mm/memory.c b/mm/memory.c
index 855486fff526..b00cd60fc368 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1146,8 +1146,7 @@ copy_pmd_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
 	src_pmd = pmd_offset(src_pud, addr);
 	do {
 		next = pmd_addr_end(addr, end);
-		if (is_swap_pmd(*src_pmd) || pmd_trans_huge(*src_pmd)
-			|| pmd_devmap(*src_pmd)) {
+		if (is_huge_pmd(*src_pmd)) {
 			int err;
 			VM_BUG_ON_VMA(next-addr != HPAGE_PMD_SIZE, src_vma);
 			err = copy_huge_pmd(dst_mm, src_mm, dst_pmd, src_pmd,
@@ -1441,7 +1440,7 @@ static inline unsigned long zap_pmd_range(struct mmu_gather *tlb,
 	pmd = pmd_offset(pud, addr);
 	do {
 		next = pmd_addr_end(addr, end);
-		if (is_swap_pmd(*pmd) || pmd_trans_huge(*pmd) || pmd_devmap(*pmd)) {
+		if (is_huge_pmd(*pmd)) {
 			if (next - addr != HPAGE_PMD_SIZE)
 				__split_huge_pmd(vma, pmd, addr, false, NULL);
 			else if (zap_huge_pmd(tlb, vma, pmd, addr))
diff --git a/mm/mprotect.c b/mm/mprotect.c
index e552f5e0ccbd..2d5064a4631c 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -257,7 +257,7 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
 			mmu_notifier_invalidate_range_start(&range);
 		}
 
-		if (is_swap_pmd(*pmd) || pmd_trans_huge(*pmd) || pmd_devmap(*pmd)) {
+		if (is_huge_pmd(*pmd)) {
 			if (next - addr != HPAGE_PMD_SIZE) {
 				__split_huge_pmd(vma, pmd, addr, false, NULL);
 			} else {
diff --git a/mm/mremap.c b/mm/mremap.c
index 002eec83e91e..c6e9da09dd0a 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -532,8 +532,7 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
 		new_pmd = alloc_new_pmd(vma->vm_mm, vma, new_addr);
 		if (!new_pmd)
 			break;
-		if (is_swap_pmd(*old_pmd) || pmd_trans_huge(*old_pmd) ||
-		    pmd_devmap(*old_pmd)) {
+		if (is_huge_pmd(*old_pmd)) {
 			if (extent == HPAGE_PMD_SIZE &&
 			    move_pgt_entry(HPAGE_PMD, vma, old_addr, new_addr,
 					   old_pmd, new_pmd, need_rmap_locks))

From patchwork Wed Nov 10 10:54:16 2021
X-Patchwork-Submitter: Qi Zheng
X-Patchwork-Id: 12611763
02:54:55 -0800 (PST) Received: from C02DW0BEMD6R.bytedance.net ([139.177.225.251]) by smtp.gmail.com with ESMTPSA id v38sm5865829pgl.38.2021.11.10.02.54.50 (version=TLS1_2 cipher=ECDHE-ECDSA-AES128-GCM-SHA256 bits=128/128); Wed, 10 Nov 2021 02:54:55 -0800 (PST) From: Qi Zheng To: akpm@linux-foundation.org, tglx@linutronix.de, kirill.shutemov@linux.intel.com, mika.penttila@nextfour.com, david@redhat.com, jgg@nvidia.com Cc: linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, songmuchun@bytedance.com, zhouchengming@bytedance.com, Qi Zheng Subject: [PATCH v3 03/15] mm: move pte_offset_map_lock() to pgtable.h Date: Wed, 10 Nov 2021 18:54:16 +0800 Message-Id: <20211110105428.32458-4-zhengqi.arch@bytedance.com> X-Mailer: git-send-email 2.24.3 (Apple Git-128) In-Reply-To: <20211110105428.32458-1-zhengqi.arch@bytedance.com> References: <20211110105428.32458-1-zhengqi.arch@bytedance.com> MIME-Version: 1.0 X-Rspamd-Server: rspam01 X-Rspamd-Queue-Id: DC754F00009F X-Stat-Signature: 8acj63bstgt5qoaqk7t1jsape7g5qfa8 Authentication-Results: imf16.hostedemail.com; dkim=pass header.d=bytedance-com.20210112.gappssmtp.com header.s=20210112 header.b=BM877EwX; spf=pass (imf16.hostedemail.com: domain of zhengqi.arch@bytedance.com designates 209.85.216.43 as permitted sender) smtp.mailfrom=zhengqi.arch@bytedance.com; dmarc=pass (policy=none) header.from=bytedance.com X-HE-Tag: 1636541686-596995 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: pte_offset_map() is in include/linux/pgtable.h, so move its friend pte_offset_map_lock() to pgtable.h together. pte_lockptr() is required for pte_offset_map_lock(), so also move {pte,pmd,pud}_lockptr() to pgtable.h. Signed-off-by: Qi Zheng --- include/linux/mm.h | 149 ------------------------------------------------ include/linux/pgtable.h | 149 ++++++++++++++++++++++++++++++++++++++++++++++++ 2 files changed, 149 insertions(+), 149 deletions(-) diff --git a/include/linux/mm.h b/include/linux/mm.h index a7e4a9e7d807..706da081b9f8 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -2284,70 +2284,6 @@ static inline pmd_t *pmd_alloc(struct mm_struct *mm, pud_t *pud, unsigned long a } #endif /* CONFIG_MMU */ -#if USE_SPLIT_PTE_PTLOCKS -#if ALLOC_SPLIT_PTLOCKS -void __init ptlock_cache_init(void); -extern bool ptlock_alloc(struct page *page); -extern void ptlock_free(struct page *page); - -static inline spinlock_t *ptlock_ptr(struct page *page) -{ - return page->ptl; -} -#else /* ALLOC_SPLIT_PTLOCKS */ -static inline void ptlock_cache_init(void) -{ -} - -static inline bool ptlock_alloc(struct page *page) -{ - return true; -} - -static inline void ptlock_free(struct page *page) -{ -} - -static inline spinlock_t *ptlock_ptr(struct page *page) -{ - return &page->ptl; -} -#endif /* ALLOC_SPLIT_PTLOCKS */ - -static inline spinlock_t *pte_lockptr(struct mm_struct *mm, pmd_t *pmd) -{ - return ptlock_ptr(pmd_page(*pmd)); -} - -static inline bool ptlock_init(struct page *page) -{ - /* - * prep_new_page() initialize page->private (and therefore page->ptl) - * with 0. Make sure nobody took it in use in between. - * - * It can happen if arch try to use slab for page table allocation: - * slab code uses page->slab_cache, which share storage with page->ptl. 
- */ - VM_BUG_ON_PAGE(*(unsigned long *)&page->ptl, page); - if (!ptlock_alloc(page)) - return false; - spin_lock_init(ptlock_ptr(page)); - return true; -} - -#else /* !USE_SPLIT_PTE_PTLOCKS */ -/* - * We use mm->page_table_lock to guard all pagetable pages of the mm. - */ -static inline spinlock_t *pte_lockptr(struct mm_struct *mm, pmd_t *pmd) -{ - return &mm->page_table_lock; -} -static inline void ptlock_cache_init(void) {} -static inline bool ptlock_init(struct page *page) { return true; } -static inline void ptlock_free(struct page *page) {} -#endif /* USE_SPLIT_PTE_PTLOCKS */ - static inline void pgtable_init(void) { ptlock_cache_init(); @@ -2370,20 +2306,6 @@ static inline void pgtable_pte_page_dtor(struct page *page) dec_lruvec_page_state(page, NR_PAGETABLE); } -#define pte_offset_map_lock(mm, pmd, address, ptlp) \ -({ \ - spinlock_t *__ptl = pte_lockptr(mm, pmd); \ - pte_t *__pte = pte_offset_map(pmd, address); \ - *(ptlp) = __ptl; \ - spin_lock(__ptl); \ - __pte; \ -}) - -#define pte_unmap_unlock(pte, ptl) do { \ - spin_unlock(ptl); \ - pte_unmap(pte); \ -} while (0) - #define pte_alloc(mm, pmd) (unlikely(pmd_none(*(pmd))) && __pte_alloc(mm, pmd)) #define pte_alloc_map(mm, pmd, address) \ @@ -2397,58 +2319,6 @@ static inline void pgtable_pte_page_dtor(struct page *page) ((unlikely(pmd_none(*(pmd))) && __pte_alloc_kernel(pmd))? \ NULL: pte_offset_kernel(pmd, address)) -#if USE_SPLIT_PMD_PTLOCKS - -static struct page *pmd_to_page(pmd_t *pmd) -{ - unsigned long mask = ~(PTRS_PER_PMD * sizeof(pmd_t) - 1); - return virt_to_page((void *)((unsigned long) pmd & mask)); -} - -static inline spinlock_t *pmd_lockptr(struct mm_struct *mm, pmd_t *pmd) -{ - return ptlock_ptr(pmd_to_page(pmd)); -} - -static inline bool pmd_ptlock_init(struct page *page) -{ -#ifdef CONFIG_TRANSPARENT_HUGEPAGE - page->pmd_huge_pte = NULL; -#endif - return ptlock_init(page); -} - -static inline void pmd_ptlock_free(struct page *page) -{ -#ifdef CONFIG_TRANSPARENT_HUGEPAGE - VM_BUG_ON_PAGE(page->pmd_huge_pte, page); -#endif - ptlock_free(page); -} - -#define pmd_huge_pte(mm, pmd) (pmd_to_page(pmd)->pmd_huge_pte) - -#else - -static inline spinlock_t *pmd_lockptr(struct mm_struct *mm, pmd_t *pmd) -{ - return &mm->page_table_lock; -} - -static inline bool pmd_ptlock_init(struct page *page) { return true; } -static inline void pmd_ptlock_free(struct page *page) {} - -#define pmd_huge_pte(mm, pmd) ((mm)->pmd_huge_pte) - -#endif - -static inline spinlock_t *pmd_lock(struct mm_struct *mm, pmd_t *pmd) -{ - spinlock_t *ptl = pmd_lockptr(mm, pmd); - spin_lock(ptl); - return ptl; -} - static inline bool pgtable_pmd_page_ctor(struct page *page) { if (!pmd_ptlock_init(page)) @@ -2465,25 +2335,6 @@ static inline void pgtable_pmd_page_dtor(struct page *page) dec_lruvec_page_state(page, NR_PAGETABLE); } -/* - * No scalability reason to split PUD locks yet, but follow the same pattern - * as the PMD locks to make it easier if we decide to. The VM should not be - * considered ready to switch to split PUD locks yet; there may be places - * which need to be converted from page_table_lock. 
- */ -static inline spinlock_t *pud_lockptr(struct mm_struct *mm, pud_t *pud) -{ - return &mm->page_table_lock; -} - -static inline spinlock_t *pud_lock(struct mm_struct *mm, pud_t *pud) -{ - spinlock_t *ptl = pud_lockptr(mm, pud); - - spin_lock(ptl); - return ptl; -} - extern void __init pagecache_init(void); extern void __init free_area_init_memoryless_node(int nid); extern void free_initmem(void); diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h index e24d2c992b11..c8f045705c1e 100644 --- a/include/linux/pgtable.h +++ b/include/linux/pgtable.h @@ -84,6 +84,141 @@ static inline unsigned long pud_index(unsigned long address) #define pgd_index(a) (((a) >> PGDIR_SHIFT) & (PTRS_PER_PGD - 1)) #endif +#if USE_SPLIT_PTE_PTLOCKS +#if ALLOC_SPLIT_PTLOCKS +void __init ptlock_cache_init(void); +extern bool ptlock_alloc(struct page *page); +extern void ptlock_free(struct page *page); + +static inline spinlock_t *ptlock_ptr(struct page *page) +{ + return page->ptl; +} +#else /* ALLOC_SPLIT_PTLOCKS */ +static inline void ptlock_cache_init(void) +{ +} + +static inline bool ptlock_alloc(struct page *page) +{ + return true; +} + +static inline void ptlock_free(struct page *page) +{ +} + +static inline spinlock_t *ptlock_ptr(struct page *page) +{ + return &page->ptl; +} +#endif /* ALLOC_SPLIT_PTLOCKS */ + +static inline spinlock_t *pte_lockptr(struct mm_struct *mm, pmd_t *pmd) +{ + return ptlock_ptr(pmd_page(*pmd)); +} + +static inline bool ptlock_init(struct page *page) +{ + /* + * prep_new_page() initialize page->private (and therefore page->ptl) + * with 0. Make sure nobody took it in use in between. + * + * It can happen if arch try to use slab for page table allocation: + * slab code uses page->slab_cache, which share storage with page->ptl. + */ + VM_BUG_ON_PAGE(*(unsigned long *)&page->ptl, page); + if (!ptlock_alloc(page)) + return false; + spin_lock_init(ptlock_ptr(page)); + return true; +} + +#else /* !USE_SPLIT_PTE_PTLOCKS */ +/* + * We use mm->page_table_lock to guard all pagetable pages of the mm. 
+ */ +static inline spinlock_t *pte_lockptr(struct mm_struct *mm, pmd_t *pmd) +{ + return &mm->page_table_lock; +} +static inline void ptlock_cache_init(void) {} +static inline bool ptlock_init(struct page *page) { return true; } +static inline void ptlock_free(struct page *page) {} +#endif /* USE_SPLIT_PTE_PTLOCKS */ + +#if USE_SPLIT_PMD_PTLOCKS + +static struct page *pmd_to_page(pmd_t *pmd) +{ + unsigned long mask = ~(PTRS_PER_PMD * sizeof(pmd_t) - 1); + return virt_to_page((void *)((unsigned long) pmd & mask)); +} + +static inline spinlock_t *pmd_lockptr(struct mm_struct *mm, pmd_t *pmd) +{ + return ptlock_ptr(pmd_to_page(pmd)); +} + +static inline bool pmd_ptlock_init(struct page *page) +{ +#ifdef CONFIG_TRANSPARENT_HUGEPAGE + page->pmd_huge_pte = NULL; +#endif + return ptlock_init(page); +} + +static inline void pmd_ptlock_free(struct page *page) +{ +#ifdef CONFIG_TRANSPARENT_HUGEPAGE + VM_BUG_ON_PAGE(page->pmd_huge_pte, page); +#endif + ptlock_free(page); +} + +#define pmd_huge_pte(mm, pmd) (pmd_to_page(pmd)->pmd_huge_pte) + +#else + +static inline spinlock_t *pmd_lockptr(struct mm_struct *mm, pmd_t *pmd) +{ + return &mm->page_table_lock; +} + +static inline bool pmd_ptlock_init(struct page *page) { return true; } +static inline void pmd_ptlock_free(struct page *page) {} + +#define pmd_huge_pte(mm, pmd) ((mm)->pmd_huge_pte) + +#endif + +static inline spinlock_t *pmd_lock(struct mm_struct *mm, pmd_t *pmd) +{ + spinlock_t *ptl = pmd_lockptr(mm, pmd); + spin_lock(ptl); + return ptl; +} + +/* + * No scalability reason to split PUD locks yet, but follow the same pattern + * as the PMD locks to make it easier if we decide to. The VM should not be + * considered ready to switch to split PUD locks yet; there may be places + * which need to be converted from page_table_lock. + */ +static inline spinlock_t *pud_lockptr(struct mm_struct *mm, pud_t *pud) +{ + return &mm->page_table_lock; +} + +static inline spinlock_t *pud_lock(struct mm_struct *mm, pud_t *pud) +{ + spinlock_t *ptl = pud_lockptr(mm, pud); + + spin_lock(ptl); + return ptl; +} + #ifndef pte_offset_kernel static inline pte_t *pte_offset_kernel(pmd_t *pmd, unsigned long address) { @@ -102,6 +237,20 @@ static inline pte_t *pte_offset_kernel(pmd_t *pmd, unsigned long address) #define pte_unmap(pte) ((void)(pte)) /* NOP */ #endif +#define pte_offset_map_lock(mm, pmd, address, ptlp) \ +({ \ + spinlock_t *__ptl = pte_lockptr(mm, pmd); \ + pte_t *__pte = pte_offset_map(pmd, address); \ + *(ptlp) = __ptl; \ + spin_lock(__ptl); \ + __pte; \ +}) + +#define pte_unmap_unlock(pte, ptl) do { \ + spin_unlock(ptl); \ + pte_unmap(pte); \ +} while (0) + /* Find an entry in the second-level page table.. 
 */
 #ifndef pmd_offset
 static inline pmd_t *pmd_offset(pud_t *pud, unsigned long address)

From patchwork Wed Nov 10 10:54:17 2021
X-Patchwork-Submitter: Qi Zheng
X-Patchwork-Id: 12611765
From: Qi Zheng
To: akpm@linux-foundation.org, tglx@linutronix.de, kirill.shutemov@linux.intel.com,
    mika.penttila@nextfour.com, david@redhat.com, jgg@nvidia.com
Cc: linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org,
    songmuchun@bytedance.com, zhouchengming@bytedance.com, Qi Zheng
Subject: [PATCH v3 04/15] mm: rework the parameter of lock_page_or_retry()
Date: Wed, 10 Nov 2021 18:54:17 +0800
Message-Id: <20211110105428.32458-5-zhengqi.arch@bytedance.com>
In-Reply-To: <20211110105428.32458-1-zhengqi.arch@bytedance.com>
References: <20211110105428.32458-1-zhengqi.arch@bytedance.com>

We need the vmf in lock_page_or_retry() in the subsequent patch, so pass
it in directly.

Signed-off-by: Qi Zheng
---
 include/linux/pagemap.h | 8 +++-----
 mm/filemap.c | 6 ++++--
 mm/memory.c | 4 ++--
 3 files changed, 9 insertions(+), 9 deletions(-)

diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 6a30916b76e5..94f9547b4411 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -709,8 +709,7 @@ static inline bool wake_page_match(struct wait_page_queue *wait_page,
 
 void __folio_lock(struct folio *folio);
 int __folio_lock_killable(struct folio *folio);
-bool __folio_lock_or_retry(struct folio *folio, struct mm_struct *mm,
-				unsigned int flags);
+bool __folio_lock_or_retry(struct folio *folio, struct vm_fault *vmf);
 void unlock_page(struct page *page);
 void folio_unlock(struct folio *folio);
 
@@ -772,14 +771,13 @@ static inline int lock_page_killable(struct page *page)
  * Return value and mmap_lock implications depend on flags; see
  * __folio_lock_or_retry().
  */
-static inline bool lock_page_or_retry(struct page *page, struct mm_struct *mm,
-				     unsigned int flags)
+static inline bool lock_page_or_retry(struct page *page, struct vm_fault *vmf)
 {
 	struct folio *folio;
 
 	might_sleep();
 	folio = page_folio(page);
-	return folio_trylock(folio) || __folio_lock_or_retry(folio, mm, flags);
+	return folio_trylock(folio) || __folio_lock_or_retry(folio, vmf);
 }
 
 /*
diff --git a/mm/filemap.c b/mm/filemap.c
index 07c654202870..ff8d19b7ce1d 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1695,9 +1695,11 @@ static int __folio_lock_async(struct folio *folio, struct wait_page_queue *wait)
  * If neither ALLOW_RETRY nor KILLABLE are set, will always return true
  * with the folio locked and the mmap_lock unperturbed.
  */
-bool __folio_lock_or_retry(struct folio *folio, struct mm_struct *mm,
-			 unsigned int flags)
+bool __folio_lock_or_retry(struct folio *folio, struct vm_fault *vmf)
 {
+	unsigned int flags = vmf->flags;
+	struct mm_struct *mm = vmf->vma->vm_mm;
+
 	if (fault_flag_allow_retry_first(flags)) {
 		/*
 		 * CAUTION! In this case, mmap_lock is not released
diff --git a/mm/memory.c b/mm/memory.c
index b00cd60fc368..bec6a5d5ee7c 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3443,7 +3443,7 @@ static vm_fault_t remove_device_exclusive_entry(struct vm_fault *vmf)
 	struct vm_area_struct *vma = vmf->vma;
 	struct mmu_notifier_range range;
 
-	if (!lock_page_or_retry(page, vma->vm_mm, vmf->flags))
+	if (!lock_page_or_retry(page, vmf))
 		return VM_FAULT_RETRY;
 	mmu_notifier_range_init_owner(&range, MMU_NOTIFY_EXCLUSIVE, 0, vma,
 				vma->vm_mm, vmf->address & PAGE_MASK,
@@ -3576,7 +3576,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 		goto out_release;
 	}
 
-	locked = lock_page_or_retry(page, vma->vm_mm, vmf->flags);
+	locked = lock_page_or_retry(page, vmf);
 
 	delayacct_clear_flag(current, DELAYACCT_PF_SWAPIN);
 	if (!locked) {

From patchwork Wed Nov 10 10:54:18 2021
X-Patchwork-Submitter: Qi Zheng
X-Patchwork-Id: 12611841
From: Qi Zheng
To: akpm@linux-foundation.org, tglx@linutronix.de, kirill.shutemov@linux.intel.com,
    mika.penttila@nextfour.com, david@redhat.com, jgg@nvidia.com
Cc: linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org,
    songmuchun@bytedance.com, zhouchengming@bytedance.com, Qi Zheng
Subject: [PATCH v3 05/15] mm: add pmd_installed_type return for __pte_alloc() and other friends
Date: Wed, 10 Nov 2021 18:54:18 +0800
Message-Id: <20211110105428.32458-6-zhengqi.arch@bytedance.com>
In-Reply-To: <20211110105428.32458-1-zhengqi.arch@bytedance.com>
References: <20211110105428.32458-1-zhengqi.arch@bytedance.com>

When we call __pte_alloc() or one of its friends, a huge pmd might be
created concurrently from a different thread. This is why
pmd_trans_unstable() currently has to be called after __pte_alloc() or
its friends return.

This patch adds a pmd_installed_type return value to __pte_alloc() and
its friends, so that callers can detect the huge pmd from the return
value instead of calling pmd_trans_unstable() again.

This patch has no functional change; it is just preparation for future
patches.
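To make the new contract concrete, here is a condensed, illustrative
sketch of the caller-side pattern this patch describes (it is based on
the do_anonymous_page() hunk in the diff below and is not a complete
function):

	int alloc_ret;

	/*
	 * pte_alloc() now reports what ended up installed at the pmd:
	 * a negative value on allocation failure, INSTALLED_PTE when a
	 * PTE page table is in place, or INSTALLED_HUGE_PMD when another
	 * thread raced in and installed a huge pmd.
	 */
	alloc_ret = pte_alloc(vma->vm_mm, vmf->pmd);
	if (alloc_ret < 0)
		return VM_FAULT_OOM;

	/* A huge pmd was installed concurrently; there is no PTE level. */
	if (unlikely(alloc_ret == INSTALLED_HUGE_PMD))
		return 0;

	/* Otherwise it is safe to map and operate on the PTE level. */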
Signed-off-by: Qi Zheng --- include/linux/mm.h | 20 +++++++++++++++++--- mm/debug_vm_pgtable.c | 2 +- mm/filemap.c | 11 +++++++---- mm/gup.c | 2 +- mm/internal.h | 3 ++- mm/memory.c | 39 ++++++++++++++++++++++++++------------- mm/migrate.c | 17 ++--------------- mm/mremap.c | 2 +- mm/userfaultfd.c | 24 +++++++++++++++--------- 9 files changed, 72 insertions(+), 48 deletions(-) diff --git a/include/linux/mm.h b/include/linux/mm.h index 706da081b9f8..52f36fde2f11 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -2306,13 +2306,27 @@ static inline void pgtable_pte_page_dtor(struct page *page) dec_lruvec_page_state(page, NR_PAGETABLE); } -#define pte_alloc(mm, pmd) (unlikely(pmd_none(*(pmd))) && __pte_alloc(mm, pmd)) +enum pmd_installed_type { + INSTALLED_PTE, + INSTALLED_HUGE_PMD, +}; + +static inline int pte_alloc(struct mm_struct *mm, pmd_t *pmd) +{ + if (unlikely(pmd_none(*(pmd)))) + return __pte_alloc(mm, pmd); + if (unlikely(is_huge_pmd(*pmd))) + return INSTALLED_HUGE_PMD; + + return INSTALLED_PTE; +} +#define pte_alloc pte_alloc #define pte_alloc_map(mm, pmd, address) \ - (pte_alloc(mm, pmd) ? NULL : pte_offset_map(pmd, address)) + (pte_alloc(mm, pmd) < 0 ? NULL : pte_offset_map(pmd, address)) #define pte_alloc_map_lock(mm, pmd, address, ptlp) \ - (pte_alloc(mm, pmd) ? \ + (pte_alloc(mm, pmd) < 0 ? \ NULL : pte_offset_map_lock(mm, pmd, address, ptlp)) #define pte_alloc_kernel(pmd, address) \ diff --git a/mm/debug_vm_pgtable.c b/mm/debug_vm_pgtable.c index 228e3954b90c..b8322c55e65d 100644 --- a/mm/debug_vm_pgtable.c +++ b/mm/debug_vm_pgtable.c @@ -1170,7 +1170,7 @@ static int __init init_args(struct pgtable_debug_args *args) args->start_pmdp = pmd_offset(args->pudp, 0UL); WARN_ON(!args->start_pmdp); - if (pte_alloc(args->mm, args->pmdp)) { + if (pte_alloc(args->mm, args->pmdp) < 0) { pr_err("Failed to allocate pte entries\n"); ret = -ENOMEM; goto error; diff --git a/mm/filemap.c b/mm/filemap.c index ff8d19b7ce1d..23363f8ddbbe 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -3217,12 +3217,15 @@ static bool filemap_map_pmd(struct vm_fault *vmf, struct page *page) } } - if (pmd_none(*vmf->pmd)) - pmd_install(mm, vmf->pmd, &vmf->prealloc_pte); + if (pmd_none(*vmf->pmd)) { + int ret = pmd_install(mm, vmf->pmd, &vmf->prealloc_pte); - /* See comment in handle_pte_fault() */ - if (pmd_devmap_trans_unstable(vmf->pmd)) + if (unlikely(ret == INSTALLED_HUGE_PMD)) + goto out; + } else if (pmd_devmap_trans_unstable(vmf->pmd)) { + /* See comment in handle_pte_fault() */ goto out; + } return false; diff --git a/mm/gup.c b/mm/gup.c index 2c51e9748a6a..2def775232a3 100644 --- a/mm/gup.c +++ b/mm/gup.c @@ -699,7 +699,7 @@ static struct page *follow_pmd_mask(struct vm_area_struct *vma, } else { spin_unlock(ptl); split_huge_pmd(vma, pmd, address); - ret = pte_alloc(mm, pmd) ? -ENOMEM : 0; + ret = pte_alloc(mm, pmd) < 0 ? -ENOMEM : 0; } return ret ? 
ERR_PTR(ret) : diff --git a/mm/internal.h b/mm/internal.h index 3b79a5c9427a..474d6e3443f8 100644 --- a/mm/internal.h +++ b/mm/internal.h @@ -67,7 +67,8 @@ bool __folio_end_writeback(struct folio *folio); void free_pgtables(struct mmu_gather *tlb, struct vm_area_struct *start_vma, unsigned long floor, unsigned long ceiling); -void pmd_install(struct mm_struct *mm, pmd_t *pmd, pgtable_t *pte); +enum pmd_installed_type pmd_install(struct mm_struct *mm, pmd_t *pmd, + pgtable_t *pte); static inline bool can_madv_lru_vma(struct vm_area_struct *vma) { diff --git a/mm/memory.c b/mm/memory.c index bec6a5d5ee7c..8a39c0e58324 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -437,8 +437,10 @@ void free_pgtables(struct mmu_gather *tlb, struct vm_area_struct *vma, } } -void pmd_install(struct mm_struct *mm, pmd_t *pmd, pgtable_t *pte) +enum pmd_installed_type pmd_install(struct mm_struct *mm, pmd_t *pmd, + pgtable_t *pte) { + int ret = INSTALLED_PTE; spinlock_t *ptl = pmd_lock(mm, pmd); if (likely(pmd_none(*pmd))) { /* Has another populated it ? */ @@ -459,20 +461,26 @@ void pmd_install(struct mm_struct *mm, pmd_t *pmd, pgtable_t *pte) smp_wmb(); /* Could be smp_wmb__xxx(before|after)_spin_lock */ pmd_populate(mm, pmd, *pte); *pte = NULL; + } else if (is_huge_pmd(*pmd)) { + /* See comment in handle_pte_fault() */ + ret = INSTALLED_HUGE_PMD; } spin_unlock(ptl); + + return ret; } int __pte_alloc(struct mm_struct *mm, pmd_t *pmd) { + enum pmd_installed_type ret; pgtable_t new = pte_alloc_one(mm); if (!new) return -ENOMEM; - pmd_install(mm, pmd, &new); + ret = pmd_install(mm, pmd, &new); if (new) pte_free(mm, new); - return 0; + return ret; } int __pte_alloc_kernel(pmd_t *pmd) @@ -1813,7 +1821,7 @@ static int insert_pages(struct vm_area_struct *vma, unsigned long addr, /* Allocate the PTE if necessary; takes PMD lock once only. */ ret = -ENOMEM; - if (pte_alloc(mm, pmd)) + if (pte_alloc(mm, pmd) < 0) goto out; while (pages_to_write_in_pmd) { @@ -3713,6 +3721,7 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf) struct page *page; vm_fault_t ret = 0; pte_t entry; + int alloc_ret; /* File mapping without ->vm_ops ? */ if (vma->vm_flags & VM_SHARED) @@ -3728,11 +3737,11 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf) * * Here we only have mmap_read_lock(mm). */ - if (pte_alloc(vma->vm_mm, vmf->pmd)) + alloc_ret = pte_alloc(vma->vm_mm, vmf->pmd); + if (alloc_ret < 0) return VM_FAULT_OOM; - /* See comment in handle_pte_fault() */ - if (unlikely(pmd_trans_unstable(vmf->pmd))) + if (unlikely(alloc_ret == INSTALLED_HUGE_PMD)) return 0; /* Use the zero-page for reads */ @@ -4023,6 +4032,8 @@ vm_fault_t finish_fault(struct vm_fault *vmf) } if (pmd_none(*vmf->pmd)) { + int alloc_ret; + if (PageTransCompound(page)) { ret = do_set_pmd(vmf, page); if (ret != VM_FAULT_FALLBACK) @@ -4030,14 +4041,16 @@ vm_fault_t finish_fault(struct vm_fault *vmf) } if (vmf->prealloc_pte) - pmd_install(vma->vm_mm, vmf->pmd, &vmf->prealloc_pte); - else if (unlikely(pte_alloc(vma->vm_mm, vmf->pmd))) - return VM_FAULT_OOM; - } + alloc_ret = pmd_install(vma->vm_mm, vmf->pmd, &vmf->prealloc_pte); + else + alloc_ret = pte_alloc(vma->vm_mm, vmf->pmd); - /* See comment in handle_pte_fault() */ - if (pmd_devmap_trans_unstable(vmf->pmd)) + if (unlikely(alloc_ret != INSTALLED_PTE)) + return alloc_ret < 0 ? 
VM_FAULT_OOM : 0; + } else if (pmd_devmap_trans_unstable(vmf->pmd)) { + /* See comment in handle_pte_fault() */ return 0; + } vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address, &vmf->ptl); diff --git a/mm/migrate.c b/mm/migrate.c index cf25b00f03c8..bdfdfd3b50be 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -2731,21 +2731,8 @@ static void migrate_vma_insert_page(struct migrate_vma *migrate, if (pmd_trans_huge(*pmdp) || pmd_devmap(*pmdp)) goto abort; - /* - * Use pte_alloc() instead of pte_alloc_map(). We can't run - * pte_offset_map() on pmds where a huge pmd might be created - * from a different thread. - * - * pte_alloc_map() is safe to use under mmap_write_lock(mm) or when - * parallel threads are excluded by other means. - * - * Here we only have mmap_read_lock(mm). - */ - if (pte_alloc(mm, pmdp)) - goto abort; - - /* See the comment in pte_alloc_one_map() */ - if (unlikely(pmd_trans_unstable(pmdp))) + /* See the comment in do_anonymous_page() */ + if (unlikely(pte_alloc(mm, pmdp) != INSTALLED_PTE)) goto abort; if (unlikely(anon_vma_prepare(vma))) diff --git a/mm/mremap.c b/mm/mremap.c index c6e9da09dd0a..fc5c56858883 100644 --- a/mm/mremap.c +++ b/mm/mremap.c @@ -551,7 +551,7 @@ unsigned long move_page_tables(struct vm_area_struct *vma, continue; } - if (pte_alloc(new_vma->vm_mm, new_pmd)) + if (pte_alloc(new_vma->vm_mm, new_pmd) < 0) break; move_ptes(vma, old_pmd, old_addr, old_addr + extent, new_vma, new_pmd, new_addr, need_rmap_locks); diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c index 0780c2a57ff1..2cea08e7f076 100644 --- a/mm/userfaultfd.c +++ b/mm/userfaultfd.c @@ -592,15 +592,21 @@ static __always_inline ssize_t __mcopy_atomic(struct mm_struct *dst_mm, err = -EEXIST; break; } - if (unlikely(pmd_none(dst_pmdval)) && - unlikely(__pte_alloc(dst_mm, dst_pmd))) { - err = -ENOMEM; - break; - } - /* If an huge pmd materialized from under us fail */ - if (unlikely(pmd_trans_huge(*dst_pmd))) { - err = -EFAULT; - break; + + if (unlikely(pmd_none(dst_pmdval))) { + int ret = __pte_alloc(dst_mm, dst_pmd); + + /* + * If there is not enough memory or an huge pmd + * materialized from under us + */ + if (unlikely(ret < 0)) { + err = -ENOMEM; + break; + } else if (unlikely(ret == INSTALLED_HUGE_PMD)) { + err = -EFAULT; + break; + } } BUG_ON(pmd_none(*dst_pmd)); From patchwork Wed Nov 10 10:54:19 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Qi Zheng X-Patchwork-Id: 12611769 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 8002CC433EF for ; Wed, 10 Nov 2021 10:55:29 +0000 (UTC) Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by mail.kernel.org (Postfix) with ESMTP id 1DC6F611F2 for ; Wed, 10 Nov 2021 10:55:29 +0000 (UTC) DMARC-Filter: OpenDMARC Filter v1.4.1 mail.kernel.org 1DC6F611F2 Authentication-Results: mail.kernel.org; dmarc=fail (p=none dis=none) header.from=bytedance.com Authentication-Results: mail.kernel.org; spf=pass smtp.mailfrom=kvack.org Received: by kanga.kvack.org (Postfix) id A1B8B6B0074; Wed, 10 Nov 2021 05:55:28 -0500 (EST) Received: by kanga.kvack.org (Postfix, from userid 40) id 9A5426B0075; Wed, 10 Nov 2021 05:55:28 -0500 (EST) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 7A87D6B007B; Wed, 10 Nov 2021 05:55:28 -0500 (EST) 
From: Qi Zheng
To: akpm@linux-foundation.org, tglx@linutronix.de, kirill.shutemov@linux.intel.com,
    mika.penttila@nextfour.com, david@redhat.com, jgg@nvidia.com
Cc: linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org,
    songmuchun@bytedance.com, zhouchengming@bytedance.com, Qi Zheng
Subject: [PATCH v3 06/15] mm: introduce refcount for user PTE page table page
Date: Wed, 10 Nov 2021 18:54:19 +0800
Message-Id: <20211110105428.32458-7-zhengqi.arch@bytedance.com>
In-Reply-To: <20211110105428.32458-1-zhengqi.arch@bytedance.com>
References: <20211110105428.32458-1-zhengqi.arch@bytedance.com>
1. Preface
==========

In order to pursue high performance, applications mostly use high-performance user-mode memory allocators such as jemalloc or tcmalloc. These allocators use madvise(MADV_DONTNEED or MADV_FREE) to release physical memory, for the following reasons.

First of all, we should take the write lock of mmap_lock as rarely as possible, since the mmap_lock semaphore has long been a contention point in the memory management subsystem. mmap()/munmap() take the write lock, while madvise(MADV_DONTNEED or MADV_FREE) only takes the read lock, so using madvise() instead of munmap() to release physical memory reduces contention on the mmap_lock.

Secondly, after using madvise() to release physical memory, there is no need to rebuild the vma and allocate page tables when the same virtual address is accessed again, which also saves some time.

The following is the largest amount of user page table memory that a single user process can allocate on a 32-bit and a 64-bit system:

+---------------------------+--------+---------+
|                           | 32-bit | 64-bit  |
+===========================+========+=========+
| user PTE page table pages | 3 MiB  | 512 GiB |
+---------------------------+--------+---------+
| user PMD page table pages | 3 KiB  | 1 GiB   |
+---------------------------+--------+---------+

(For 32-bit, a 3G user address space and 4K page size are assumed; for 64-bit, a 48-bit address width and 4K page size are assumed.)

After using madvise(), everything looks good, but as can be seen from the above table, a single process can create a large number of PTE page tables on a 64-bit system, since neither MADV_DONTNEED nor MADV_FREE releases page table memory. And before the process exits or calls munmap(), the kernel cannot reclaim these pages even if the PTE page tables no longer map anything.

Therefore, we decided to introduce a reference count to manage the PTE page table life cycle, so that free PTE page table memory in the system can be released dynamically.

2. The reference count of user PTE page table pages
===================================================

We introduce two members in the struct page of the user PTE page table page::

	union {
		pgtable_t pmd_huge_pte;    /* protected by page->ptl */
		pmd_t *pmd;                /* PTE page only */
	};
	union {
		struct mm_struct *pt_mm;   /* x86 pgds only */
		atomic_t pt_frag_refcount; /* powerpc */
		atomic_t pte_refcount;     /* PTE page only */
	};

The pmd member records the pmd entry that maps the user PTE page table page, and the pte_refcount member keeps track of how many references to the user PTE page table page exist.

The following hold a reference on the user PTE page table page:

 - any !pte_none() entry, such as a regular page table entry that maps a physical page, a swap entry, a migrate entry, etc.;
 - any visitor to the PTE page table entries, such as a page table walker.

Any ``!pte_none()`` entry and any visitor can be regarded as a user of its PTE page table page. When the ``pte_refcount`` drops to 0, no one is using the PTE page table page any more, and the free PTE page table page can be released back to the system at that point.
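To make the intended life cycle concrete, the following is a minimal sketch of how a visitor is expected to pin a PTE page table page under these rules; it only illustrates the semantics above using the helpers introduced by this series (pte_try_get()/pte_put()), and the function walk_one_pte() is made up for illustration, not actual kernel code::

	/*
	 * Illustrative only: access one PTE while holding a reference on
	 * its PTE page table page so the page cannot be freed concurrently.
	 */
	static int walk_one_pte(struct mm_struct *mm, pmd_t *pmd,
				unsigned long addr)
	{
		pte_t *pte;
		spinlock_t *ptl;

		if (pte_try_get(pmd) != TRYGET_SUCCESSED)
			return -EAGAIN;	/* no PTE page here, or it is being freed */

		pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
		/*
		 * ... inspect or modify *pte; installing a new !pte_none()
		 * entry would additionally take its own reference via
		 * pte_get() ...
		 */
		pte_unmap_unlock(pte, ptl);

		pte_put(mm, pmd, addr);	/* drop the visitor's reference */
		return 0;
	}

3.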
Helpers ========== +---------------------+-------------------------------------------------+ | pte_ref_init | Initialize the pte_refcount and pmd | +---------------------+-------------------------------------------------+ | pte_to_pmd | Get the corresponding pmd | +---------------------+-------------------------------------------------+ | pte_update_pmd | Update the corresponding pmd | +---------------------+-------------------------------------------------+ | pte_get | Increment a pte_refcount | +---------------------+-------------------------------------------------+ | pte_get_many | Add a value to a pte_refcount | +---------------------+-------------------------------------------------+ | pte_get_unless_zero | Increment a pte_refcount unless it is 0 | +---------------------+-------------------------------------------------+ | pte_try_get | Try to increment a pte_refcount | +---------------------+-------------------------------------------------+ | pte_tryget_map | Try to increment a pte_refcount before | | | pte_offset_map() | +---------------------+-------------------------------------------------+ | pte_tryget_map_lock | Try to increment a pte_refcount before | | | pte_offset_map_lock() | +---------------------+-------------------------------------------------+ | pte_put | Decrement a pte_refcount | +---------------------+-------------------------------------------------+ | pte_put_many | Sub a value to a pte_refcount | +---------------------+-------------------------------------------------+ | pte_put_vmf | Decrement a pte_refcount in the page fault path | +---------------------+-------------------------------------------------+ 4. About this commit ==================== This commit just introduces some dummy helpers, the actual logic will be implemented in future commits. Signed-off-by: Qi Zheng Reported-by: kernel test robot --- include/linux/mm_types.h | 6 +++- include/linux/pte_ref.h | 87 ++++++++++++++++++++++++++++++++++++++++++++++++ mm/Makefile | 4 +-- mm/pte_ref.c | 55 ++++++++++++++++++++++++++++++ 4 files changed, 149 insertions(+), 3 deletions(-) create mode 100644 include/linux/pte_ref.h create mode 100644 mm/pte_ref.c diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h index bb8c6f5f19bc..c599008d54fe 100644 --- a/include/linux/mm_types.h +++ b/include/linux/mm_types.h @@ -149,11 +149,15 @@ struct page { }; struct { /* Page table pages */ unsigned long _pt_pad_1; /* compound_head */ - pgtable_t pmd_huge_pte; /* protected by page->ptl */ + union { + pgtable_t pmd_huge_pte; /* protected by page->ptl */ + pmd_t *pmd; /* PTE page only */ + }; unsigned long _pt_pad_2; /* mapping */ union { struct mm_struct *pt_mm; /* x86 pgds only */ atomic_t pt_frag_refcount; /* powerpc */ + atomic_t pte_refcount; /* PTE page only */ }; #if ALLOC_SPLIT_PTLOCKS spinlock_t *ptl; diff --git a/include/linux/pte_ref.h b/include/linux/pte_ref.h new file mode 100644 index 000000000000..b6d8335bdc59 --- /dev/null +++ b/include/linux/pte_ref.h @@ -0,0 +1,87 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * Copyright (c) 2021, ByteDance. All rights reserved. 
+ * + * Author: Qi Zheng + */ +#ifndef _LINUX_PTE_REF_H +#define _LINUX_PTE_REF_H + +#include + +enum pte_tryget_type { + TRYGET_SUCCESSED, + TRYGET_FAILED_ZERO, + TRYGET_FAILED_NONE, + TRYGET_FAILED_HUGE_PMD, +}; + +bool pte_get_unless_zero(pmd_t *pmd); +enum pte_tryget_type pte_try_get(pmd_t *pmd); +void pte_put_vmf(struct vm_fault *vmf); + +static inline void pte_ref_init(pgtable_t pte, pmd_t *pmd, int count) +{ +} + +static inline pmd_t *pte_to_pmd(pte_t *pte) +{ + return NULL; +} + +static inline void pte_update_pmd(pmd_t old_pmd, pmd_t *new_pmd) +{ +} + +static inline void pte_get_many(pmd_t *pmd, unsigned int nr) +{ +} + +/* + * pte_get - Increment refcount for the PTE page table. + * @pmd: a pointer to the pmd entry corresponding to the PTE page table. + * + * Similar to the mechanism of page refcount, the user of PTE page table + * should hold a refcount to it before accessing. + */ +static inline void pte_get(pmd_t *pmd) +{ + pte_get_many(pmd, 1); +} + +static inline pte_t *pte_tryget_map(pmd_t *pmd, unsigned long address) +{ + if (pte_try_get(pmd)) + return NULL; + + return pte_offset_map(pmd, address); +} + +static inline pte_t *pte_tryget_map_lock(struct mm_struct *mm, pmd_t *pmd, + unsigned long address, spinlock_t **ptlp) +{ + if (pte_try_get(pmd)) + return NULL; + + return pte_offset_map_lock(mm, pmd, address, ptlp); +} + +static inline void pte_put_many(struct mm_struct *mm, pmd_t *pmd, + unsigned long addr, unsigned int nr) +{ +} + +/* + * pte_put - Decrement refcount for the PTE page table. + * @mm: the mm_struct of the target address space. + * @pmd: a pointer to the pmd entry corresponding to the PTE page table. + * @addr: the start address of the tlb range to be flushed. + * + * The PTE page table page will be freed when the last refcount is dropped. + */ +static inline void pte_put(struct mm_struct *mm, pmd_t *pmd, unsigned long addr) +{ + pte_put_many(mm, pmd, addr, 1); +} + +#endif diff --git a/mm/Makefile b/mm/Makefile index d6c0042e3aa0..ea679bf75a5f 100644 --- a/mm/Makefile +++ b/mm/Makefile @@ -38,8 +38,8 @@ mmu-y := nommu.o mmu-$(CONFIG_MMU) := highmem.o memory.o mincore.o \ mlock.o mmap.o mmu_gather.o mprotect.o mremap.o \ msync.o page_vma_mapped.o pagewalk.o \ - pgtable-generic.o rmap.o vmalloc.o - + pgtable-generic.o rmap.o vmalloc.o \ + pte_ref.o ifdef CONFIG_CROSS_MEMORY_ATTACH mmu-$(CONFIG_MMU) += process_vm_access.o diff --git a/mm/pte_ref.c b/mm/pte_ref.c new file mode 100644 index 000000000000..de109905bc8f --- /dev/null +++ b/mm/pte_ref.c @@ -0,0 +1,55 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * Copyright (c) 2021, ByteDance. All rights reserved. + * + * Author: Qi Zheng + */ + +#include +#include + +/* + * pte_get_unless_zero - Increment refcount for the PTE page table + * unless it is zero. + * @pmd: a pointer to the pmd entry corresponding to the PTE page table. + */ +bool pte_get_unless_zero(pmd_t *pmd) +{ + return true; +} + +/* + * pte_try_get - Try to increment refcount for the PTE page table. + * @pmd: a pointer to the pmd entry corresponding to the PTE page table. + * + * Return true if the increment succeeded. Otherwise return false. + * + * Before Operating the PTE page table, we need to hold a refcount + * to protect against the concurrent release of the PTE page table. 
+ * But this will fail in the following cases:
+ * - The content mapped in @pmd is not a PTE page
+ * - The refcount of the PTE page table is zero, so it is about to be freed
+ */
+enum pte_tryget_type pte_try_get(pmd_t *pmd)
+{
+	if (unlikely(pmd_none(*pmd)))
+		return TRYGET_FAILED_NONE;
+	if (unlikely(is_huge_pmd(*pmd)))
+		return TRYGET_FAILED_HUGE_PMD;
+
+	return TRYGET_SUCCESSED;
+}
+
+/*
+ * pte_put_vmf - Decrement refcount for the PTE page table.
+ * @vmf: fault information
+ *
+ * The mmap_lock may be unlocked in advance in some cases in
+ * handle_pte_fault(); the pmd entry will then no longer be stable.
+ * For example, the PTE page that the pmd entry points to may be
+ * replaced (e.g. by mremap), so we should ensure that pte_put()
+ * is performed within the critical section of the mmap_lock.
+ */
+void pte_put_vmf(struct vm_fault *vmf)
+{
+}

From patchwork Wed Nov 10 10:54:20 2021
From: Qi Zheng
To: akpm@linux-foundation.org, tglx@linutronix.de, kirill.shutemov@linux.intel.com, mika.penttila@nextfour.com, david@redhat.com, jgg@nvidia.com
Cc: linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, songmuchun@bytedance.com, zhouchengming@bytedance.com, Qi Zheng
Subject: [PATCH v3 07/15] mm/pte_ref: add support for user PTE page table page allocation
Date: Wed, 10 Nov 2021 18:54:20 +0800
Message-Id: <20211110105428.32458-8-zhengqi.arch@bytedance.com>
In-Reply-To: <20211110105428.32458-1-zhengqi.arch@bytedance.com>
References: <20211110105428.32458-1-zhengqi.arch@bytedance.com>
MIME-Version: 1.0

When the PTE page table page is allocated and installed into the pmd entry, it needs to take an initial reference count to prevent the page from being released by other threads, and the caller of pte_alloc() (or one of its friends) needs to drop this reference count afterwards.
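To illustrate the resulting calling convention, here is a minimal sketch (touch_pte_range() is made up for illustration; only pte_alloc(), the INSTALLED_* return values and pte_put() come from this series, and the hunks below show the real call sites)::

	/*
	 * Sketch: pte_alloc() now takes the initial reference on the PTE
	 * page table page; the caller drops it again with pte_put() when
	 * it is done with the PTE page.
	 */
	static int touch_pte_range(struct mm_struct *mm, pmd_t *pmd,
				   unsigned long addr)
	{
		int ret = pte_alloc(mm, pmd);

		if (ret < 0)
			return -ENOMEM;		/* allocation failed */
		if (ret == INSTALLED_HUGE_PMD)
			return 0;		/* no PTE page, nothing to put */

		/* ... map/lock PTEs under this pmd and operate on them ... */

		pte_put(mm, pmd, addr);		/* drop pte_alloc()'s reference */
		return 0;
	}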
Signed-off-by: Qi Zheng Reported-by: kernel test robot --- include/linux/mm.h | 7 +++++-- mm/debug_vm_pgtable.c | 1 + mm/filemap.c | 8 ++++++-- mm/gup.c | 10 +++++++--- mm/memory.c | 51 +++++++++++++++++++++++++++++++++++++++++---------- mm/migrate.c | 9 ++++++--- mm/mlock.c | 1 + mm/mremap.c | 1 + mm/userfaultfd.c | 16 +++++++++++++++- 9 files changed, 83 insertions(+), 21 deletions(-) diff --git a/include/linux/mm.h b/include/linux/mm.h index 52f36fde2f11..753a9435e0d0 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -26,6 +26,7 @@ #include #include #include +#include #include #include #include @@ -2313,9 +2314,11 @@ enum pmd_installed_type { static inline int pte_alloc(struct mm_struct *mm, pmd_t *pmd) { - if (unlikely(pmd_none(*(pmd)))) + enum pte_tryget_type ret = pte_try_get(pmd); + + if (ret == TRYGET_FAILED_NONE || ret == TRYGET_FAILED_ZERO) return __pte_alloc(mm, pmd); - if (unlikely(is_huge_pmd(*pmd))) + else if (ret == TRYGET_FAILED_HUGE_PMD) return INSTALLED_HUGE_PMD; return INSTALLED_PTE; diff --git a/mm/debug_vm_pgtable.c b/mm/debug_vm_pgtable.c index b8322c55e65d..52f006654664 100644 --- a/mm/debug_vm_pgtable.c +++ b/mm/debug_vm_pgtable.c @@ -1048,6 +1048,7 @@ static void __init destroy_args(struct pgtable_debug_args *args) /* Free page table entries */ if (args->start_ptep) { + pte_put(args->mm, args->start_pmdp, args->vaddr); pte_free(args->mm, args->start_ptep); mm_dec_nr_ptes(args->mm); } diff --git a/mm/filemap.c b/mm/filemap.c index 23363f8ddbbe..1e7e9e4fd759 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -3217,6 +3217,7 @@ static bool filemap_map_pmd(struct vm_fault *vmf, struct page *page) } } +retry: if (pmd_none(*vmf->pmd)) { int ret = pmd_install(mm, vmf->pmd, &vmf->prealloc_pte); @@ -3225,6 +3226,8 @@ static bool filemap_map_pmd(struct vm_fault *vmf, struct page *page) } else if (pmd_devmap_trans_unstable(vmf->pmd)) { /* See comment in handle_pte_fault() */ goto out; + } else if (pte_try_get(vmf->pmd) == TRYGET_FAILED_ZERO) { + goto retry; } return false; @@ -3301,7 +3304,7 @@ vm_fault_t filemap_map_pages(struct vm_fault *vmf, struct file *file = vma->vm_file; struct address_space *mapping = file->f_mapping; pgoff_t last_pgoff = start_pgoff; - unsigned long addr; + unsigned long addr, start; XA_STATE(xas, &mapping->i_pages, start_pgoff); struct page *head, *page; unsigned int mmap_miss = READ_ONCE(file->f_ra.mmap_miss); @@ -3317,7 +3320,7 @@ vm_fault_t filemap_map_pages(struct vm_fault *vmf, goto out; } - addr = vma->vm_start + ((start_pgoff - vma->vm_pgoff) << PAGE_SHIFT); + start = addr = vma->vm_start + ((start_pgoff - vma->vm_pgoff) << PAGE_SHIFT); vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, addr, &vmf->ptl); do { page = find_subpage(head, xas.xa_index); @@ -3348,6 +3351,7 @@ vm_fault_t filemap_map_pages(struct vm_fault *vmf, put_page(head); } while ((head = next_map_page(mapping, &xas, end_pgoff)) != NULL); pte_unmap_unlock(vmf->pte, vmf->ptl); + pte_put(vma->vm_mm, vmf->pmd, start); out: rcu_read_unlock(); WRITE_ONCE(file->f_ra.mmap_miss, mmap_miss); diff --git a/mm/gup.c b/mm/gup.c index 2def775232a3..e084111103f0 100644 --- a/mm/gup.c +++ b/mm/gup.c @@ -694,7 +694,7 @@ static struct page *follow_pmd_mask(struct vm_area_struct *vma, spin_unlock(ptl); ret = 0; split_huge_pmd(vma, pmd, address); - if (pmd_trans_unstable(pmd)) + if (pte_try_get(pmd) == TRYGET_FAILED_HUGE_PMD) ret = -EBUSY; } else { spin_unlock(ptl); @@ -702,8 +702,12 @@ static struct page *follow_pmd_mask(struct vm_area_struct *vma, ret = pte_alloc(mm, pmd) < 0 ? 
-ENOMEM : 0; } - return ret ? ERR_PTR(ret) : - follow_page_pte(vma, address, pmd, flags, &ctx->pgmap); + if (ret) + return ERR_PTR(ret); + + page = follow_page_pte(vma, address, pmd, flags, &ctx->pgmap); + pte_put(mm, pmd, address); + return page; } page = follow_trans_huge_pmd(vma, address, pmd, flags); spin_unlock(ptl); diff --git a/mm/memory.c b/mm/memory.c index 8a39c0e58324..0b9af38cfa11 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -441,10 +441,13 @@ enum pmd_installed_type pmd_install(struct mm_struct *mm, pmd_t *pmd, pgtable_t *pte) { int ret = INSTALLED_PTE; - spinlock_t *ptl = pmd_lock(mm, pmd); + spinlock_t *ptl; +retry: + ptl = pmd_lock(mm, pmd); if (likely(pmd_none(*pmd))) { /* Has another populated it ? */ mm_inc_nr_ptes(mm); + pte_ref_init(*pte, pmd, 1); /* * Ensure all pte setup (eg. pte page lock and page clearing) are * visible before the pte is made visible to other CPUs by being @@ -464,6 +467,9 @@ enum pmd_installed_type pmd_install(struct mm_struct *mm, pmd_t *pmd, } else if (is_huge_pmd(*pmd)) { /* See comment in handle_pte_fault() */ ret = INSTALLED_HUGE_PMD; + } else if (!pte_get_unless_zero(pmd)) { + spin_unlock(ptl); + goto retry; } spin_unlock(ptl); @@ -1028,6 +1034,7 @@ copy_pte_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma, int rss[NR_MM_COUNTERS]; swp_entry_t entry = (swp_entry_t){0}; struct page *prealloc = NULL; + unsigned long start = addr; again: progress = 0; @@ -1108,6 +1115,7 @@ copy_pte_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma, pte_unmap(orig_src_pte); add_mm_rss_vec(dst_mm, rss); pte_unmap_unlock(orig_dst_pte, dst_ptl); + pte_put(dst_mm, dst_pmd, start); cond_resched(); if (ret == -EIO) { @@ -1778,6 +1786,7 @@ static int insert_page(struct vm_area_struct *vma, unsigned long addr, goto out; retval = insert_page_into_pte_locked(mm, pte, addr, page, prot); pte_unmap_unlock(pte, ptl); + pte_put(mm, pte_to_pmd(pte), addr); out: return retval; } @@ -1810,6 +1819,7 @@ static int insert_pages(struct vm_area_struct *vma, unsigned long addr, unsigned long remaining_pages_total = *num; unsigned long pages_to_write_in_pmd; int ret; + unsigned long start = addr; more: ret = -EFAULT; pmd = walk_to_pmd(mm, addr); @@ -1836,7 +1846,7 @@ static int insert_pages(struct vm_area_struct *vma, unsigned long addr, pte_unmap_unlock(start_pte, pte_lock); ret = err; remaining_pages_total -= pte_idx; - goto out; + goto put; } addr += PAGE_SIZE; ++curr_page_idx; @@ -1845,9 +1855,13 @@ static int insert_pages(struct vm_area_struct *vma, unsigned long addr, pages_to_write_in_pmd -= batch_size; remaining_pages_total -= batch_size; } - if (remaining_pages_total) + if (remaining_pages_total) { + pte_put(mm, pmd, start); goto more; + } ret = 0; +put: + pte_put(mm, pmd, start); out: *num = remaining_pages_total; return ret; @@ -2075,6 +2089,7 @@ static vm_fault_t insert_pfn(struct vm_area_struct *vma, unsigned long addr, out_unlock: pte_unmap_unlock(pte, ptl); + pte_put(mm, pte_to_pmd(pte), addr); return VM_FAULT_NOPAGE; } @@ -2275,6 +2290,7 @@ static int remap_pte_range(struct mm_struct *mm, pmd_t *pmd, unsigned long addr, unsigned long end, unsigned long pfn, pgprot_t prot) { + unsigned long start = addr; pte_t *pte, *mapped_pte; spinlock_t *ptl; int err = 0; @@ -2294,6 +2310,7 @@ static int remap_pte_range(struct mm_struct *mm, pmd_t *pmd, } while (pte++, addr += PAGE_SIZE, addr != end); arch_leave_lazy_mmu_mode(); pte_unmap_unlock(mapped_pte, ptl); + pte_put(mm, pmd, start); return err; } @@ -2503,6 +2520,7 @@ static int 
apply_to_pte_range(struct mm_struct *mm, pmd_t *pmd, pte_fn_t fn, void *data, bool create, pgtbl_mod_mask *mask) { + unsigned long start = addr; pte_t *pte, *mapped_pte; int err = 0; spinlock_t *ptl; @@ -2536,8 +2554,11 @@ static int apply_to_pte_range(struct mm_struct *mm, pmd_t *pmd, arch_leave_lazy_mmu_mode(); - if (mm != &init_mm) + if (mm != &init_mm) { pte_unmap_unlock(mapped_pte, ptl); + if (create) + pte_put(mm, pmd, start); + } return err; } @@ -3761,7 +3782,8 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf) /* Deliver the page fault to userland, check inside PT lock */ if (userfaultfd_missing(vma)) { pte_unmap_unlock(vmf->pte, vmf->ptl); - return handle_userfault(vmf, VM_UFFD_MISSING); + ret = handle_userfault(vmf, VM_UFFD_MISSING); + goto put; } goto setpte; } @@ -3804,7 +3826,8 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf) if (userfaultfd_missing(vma)) { pte_unmap_unlock(vmf->pte, vmf->ptl); put_page(page); - return handle_userfault(vmf, VM_UFFD_MISSING); + ret = handle_userfault(vmf, VM_UFFD_MISSING); + goto put; } inc_mm_counter_fast(vma->vm_mm, MM_ANONPAGES); @@ -3817,14 +3840,17 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf) update_mmu_cache(vma, vmf->address, vmf->pte); unlock: pte_unmap_unlock(vmf->pte, vmf->ptl); - return ret; + goto put; release: put_page(page); goto unlock; oom_free_page: put_page(page); oom: - return VM_FAULT_OOM; + ret = VM_FAULT_OOM; +put: + pte_put(vma->vm_mm, vmf->pmd, vmf->address); + return ret; } /* @@ -4031,7 +4057,9 @@ vm_fault_t finish_fault(struct vm_fault *vmf) return ret; } - if (pmd_none(*vmf->pmd)) { +retry: + ret = pte_try_get(vmf->pmd); + if (ret == TRYGET_FAILED_NONE) { int alloc_ret; if (PageTransCompound(page)) { @@ -4047,9 +4075,11 @@ vm_fault_t finish_fault(struct vm_fault *vmf) if (unlikely(alloc_ret != INSTALLED_PTE)) return alloc_ret < 0 ? VM_FAULT_OOM : 0; - } else if (pmd_devmap_trans_unstable(vmf->pmd)) { + } else if (ret == TRYGET_FAILED_HUGE_PMD) { /* See comment in handle_pte_fault() */ return 0; + } else if (ret == TRYGET_FAILED_ZERO) { + goto retry; } vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, @@ -4063,6 +4093,7 @@ vm_fault_t finish_fault(struct vm_fault *vmf) update_mmu_tlb(vma, vmf->address, vmf->pte); pte_unmap_unlock(vmf->pte, vmf->ptl); + pte_put(vma->vm_mm, vmf->pmd, vmf->address); return ret; } diff --git a/mm/migrate.c b/mm/migrate.c index bdfdfd3b50be..26f16a4836d8 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -2736,9 +2736,9 @@ static void migrate_vma_insert_page(struct migrate_vma *migrate, goto abort; if (unlikely(anon_vma_prepare(vma))) - goto abort; + goto put; if (mem_cgroup_charge(page_folio(page), vma->vm_mm, GFP_KERNEL)) - goto abort; + goto put; /* * The memory barrier inside __SetPageUptodate makes sure that @@ -2764,7 +2764,7 @@ static void migrate_vma_insert_page(struct migrate_vma *migrate, * device memory. 
*/ pr_warn_once("Unsupported ZONE_DEVICE page type.\n"); - goto abort; + goto put; } } else { entry = mk_pte(page, vma->vm_page_prot); @@ -2811,11 +2811,14 @@ static void migrate_vma_insert_page(struct migrate_vma *migrate, } pte_unmap_unlock(ptep, ptl); + pte_put(mm, pmdp, addr); *src = MIGRATE_PFN_MIGRATE; return; unlock_abort: pte_unmap_unlock(ptep, ptl); +put: + pte_put(mm, pmdp, addr); abort: *src &= ~MIGRATE_PFN_MIGRATE; } diff --git a/mm/mlock.c b/mm/mlock.c index e263d62ae2d0..a4ef20ba9627 100644 --- a/mm/mlock.c +++ b/mm/mlock.c @@ -398,6 +398,7 @@ static unsigned long __munlock_pagevec_fill(struct pagevec *pvec, break; } pte_unmap_unlock(pte, ptl); + pte_put(vma->vm_mm, pte_to_pmd(pte), start); return start; } diff --git a/mm/mremap.c b/mm/mremap.c index fc5c56858883..f80c628db25d 100644 --- a/mm/mremap.c +++ b/mm/mremap.c @@ -555,6 +555,7 @@ unsigned long move_page_tables(struct vm_area_struct *vma, break; move_ptes(vma, old_pmd, old_addr, old_addr + extent, new_vma, new_pmd, new_addr, need_rmap_locks); + pte_put(new_vma->vm_mm, new_pmd, new_addr); } mmu_notifier_invalidate_range_end(&range); diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c index 2cea08e7f076..37df899a1b9d 100644 --- a/mm/userfaultfd.c +++ b/mm/userfaultfd.c @@ -574,6 +574,7 @@ static __always_inline ssize_t __mcopy_atomic(struct mm_struct *dst_mm, while (src_addr < src_start + len) { pmd_t dst_pmdval; + enum pte_tryget_type tryget_type; BUG_ON(dst_addr >= dst_start + len); @@ -583,6 +584,14 @@ static __always_inline ssize_t __mcopy_atomic(struct mm_struct *dst_mm, break; } +again: + /* + * After the management of the PTE page changes to the refcount + * mode, the PTE page may be released by another thread(rcu mode), + * so the rcu lock is held here to prevent the PTE page from + * being released. 
+ */ + rcu_read_lock(); dst_pmdval = pmd_read_atomic(dst_pmd); /* * If the dst_pmd is mapped as THP don't @@ -593,7 +602,9 @@ static __always_inline ssize_t __mcopy_atomic(struct mm_struct *dst_mm, break; } - if (unlikely(pmd_none(dst_pmdval))) { + tryget_type = pte_try_get(&dst_pmdval); + rcu_read_unlock(); + if (unlikely(tryget_type == TRYGET_FAILED_NONE)) { int ret = __pte_alloc(dst_mm, dst_pmd); /* @@ -607,6 +618,8 @@ static __always_inline ssize_t __mcopy_atomic(struct mm_struct *dst_mm, err = -EFAULT; break; } + } else if (unlikely(tryget_type == TRYGET_FAILED_ZERO)) { + goto again; } BUG_ON(pmd_none(*dst_pmd)); @@ -614,6 +627,7 @@ static __always_inline ssize_t __mcopy_atomic(struct mm_struct *dst_mm, err = mfill_atomic_pte(dst_mm, dst_pmd, dst_vma, dst_addr, src_addr, &page, mcopy_mode, wp_copy); + pte_put(dst_mm, dst_pmd, dst_addr); cond_resched(); if (unlikely(err == -ENOENT)) { From patchwork Wed Nov 10 10:54:21 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Qi Zheng X-Patchwork-Id: 12611767 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 43B81C433EF for ; Wed, 10 Nov 2021 10:55:27 +0000 (UTC) Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by mail.kernel.org (Postfix) with ESMTP id CD9A36112D for ; Wed, 10 Nov 2021 10:55:26 +0000 (UTC) DMARC-Filter: OpenDMARC Filter v1.4.1 mail.kernel.org CD9A36112D Authentication-Results: mail.kernel.org; dmarc=fail (p=none dis=none) header.from=bytedance.com Authentication-Results: mail.kernel.org; spf=pass smtp.mailfrom=kvack.org Received: by kanga.kvack.org (Postfix) id 698436B0073; Wed, 10 Nov 2021 05:55:26 -0500 (EST) Received: by kanga.kvack.org (Postfix, from userid 40) id 621366B0074; Wed, 10 Nov 2021 05:55:26 -0500 (EST) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 4C2836B0075; Wed, 10 Nov 2021 05:55:26 -0500 (EST) X-Delivered-To: linux-mm@kvack.org Received: from forelay.hostedemail.com (smtprelay0098.hostedemail.com [216.40.44.98]) by kanga.kvack.org (Postfix) with ESMTP id 3CCAF6B0073 for ; Wed, 10 Nov 2021 05:55:26 -0500 (EST) Received: from smtpin25.hostedemail.com (10.5.19.251.rfc1918.com [10.5.19.251]) by forelay01.hostedemail.com (Postfix) with ESMTP id E8113184B5020 for ; Wed, 10 Nov 2021 10:55:25 +0000 (UTC) X-FDA: 78792714210.25.115A9AC Received: from mail-pg1-f173.google.com (mail-pg1-f173.google.com [209.85.215.173]) by imf14.hostedemail.com (Postfix) with ESMTP id 7E85660019BC for ; Wed, 10 Nov 2021 10:55:26 +0000 (UTC) Received: by mail-pg1-f173.google.com with SMTP id s136so1922453pgs.4 for ; Wed, 10 Nov 2021 02:55:25 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=bytedance-com.20210112.gappssmtp.com; s=20210112; h=from:to:cc:subject:date:message-id:in-reply-to:references :mime-version:content-transfer-encoding; bh=7ATw7hUTYZVQ0DwLa45RdRSrYNXHKuwCC3S5w4hY8zw=; b=VOdIGEoosH+Ug2YShcw7yL35/Vu9bbytERCcJq8J6YIYWZcb4Wzk0VqLvWpPNAP0Pw /Apjox+z/JMW87E0IkMHXYJDJh7yAnWsR81m/2b1hqYlGy2+kFwlbVIdu/LoypPEYnZ1 /myJbmXy1PKLiG1qhBLKJe700SISvZbyHPo40ZNX7ITlvCjS5GAyV72xI1ifbj2zqgpF dMRRvHvUHBpTgtVm0K6F55r1zU4grKZy22pnRukYVH56+GyCn/qKF2pxWoPQdtRmqEjC y+w75tYDanH+yKEh9gDH8WY3av+3hyJHDpl+/CJ3aRvYD7LyhHOP0unWM1QIvqrCO+RW hHWw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; 
From: Qi Zheng
To: akpm@linux-foundation.org, tglx@linutronix.de, kirill.shutemov@linux.intel.com, mika.penttila@nextfour.com, david@redhat.com, jgg@nvidia.com
Cc: linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, songmuchun@bytedance.com, zhouchengming@bytedance.com, Qi Zheng
Subject: [PATCH v3 08/15] mm/pte_ref: initialize the refcount of the withdrawn PTE page table page
Date: Wed, 10 Nov 2021 18:54:21 +0800
Message-Id: <20211110105428.32458-9-zhengqi.arch@bytedance.com>
In-Reply-To: <20211110105428.32458-1-zhengqi.arch@bytedance.com>
References: <20211110105428.32458-1-zhengqi.arch@bytedance.com>
MIME-Version: 1.0

When we split a PMD-mapped THP into a PTE-mapped THP, we should initialize the refcount of the withdrawn PTE page table page to HPAGE_PMD_NR, which ensures that the PTE page table page can be released once it becomes free (i.e. its refcount drops to 0).
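The reasoning, roughly (a simplified sketch of the THP split path, not literal kernel code; in this series the pte_ref_init() call actually sits inside pgtable_trans_huge_withdraw(), as the hunk below shows)::

	/* Conceptual flow when splitting a PMD-mapped THP: */
	pgtable = pgtable_trans_huge_withdraw(mm, pmd);	/* deposited PTE page */
	pte_ref_init(pgtable, pmd, HPAGE_PMD_NR);	/* all HPAGE_PMD_NR entries
							 * will be !pte_none(), so
							 * start with one reference
							 * per mapped subpage */
	/* ... fill in HPAGE_PMD_NR PTE entries covering the subpages ... */
	pmd_populate(mm, pmd, pgtable);			/* install the PTE page table */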
Signed-off-by: Qi Zheng
---
 mm/pgtable-generic.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
index 4e640baf9794..523053e09dfa 100644
--- a/mm/pgtable-generic.c
+++ b/mm/pgtable-generic.c
@@ -186,6 +186,7 @@ pgtable_t pgtable_trans_huge_withdraw(struct mm_struct *mm, pmd_t *pmdp)
 					  struct page, lru);
 	if (pmd_huge_pte(mm, pmdp))
 		list_del(&pgtable->lru);
+	pte_ref_init(pgtable, pmdp, HPAGE_PMD_NR);
 	return pgtable;
 }
 #endif

From patchwork Wed Nov 10 10:54:22 2021
From: Qi Zheng
To: akpm@linux-foundation.org, tglx@linutronix.de, kirill.shutemov@linux.intel.com, mika.penttila@nextfour.com, david@redhat.com, jgg@nvidia.com
Cc: linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, songmuchun@bytedance.com, zhouchengming@bytedance.com, Qi Zheng
Subject: [PATCH v3 09/15] mm/pte_ref: add support for the map/unmap of user PTE page table page
Date: Wed, 10 Nov 2021 18:54:22 +0800
Message-Id: <20211110105428.32458-10-zhengqi.arch@bytedance.com>
In-Reply-To: <20211110105428.32458-1-zhengqi.arch@bytedance.com>
References: <20211110105428.32458-1-zhengqi.arch@bytedance.com>
MIME-Version: 1.0

Any !pte_none() entry holds a reference on the user PTE page table page, whether it is a regular page table entry that maps a physical page, a swap entry, a migrate entry, etc. So when a pte_none() entry becomes mapped, the refcount of the PTE page table page needs to be increased, and when a !pte_none() entry becomes none again, the refcount needs to be decreased. For the swap and migrate cases, which only change the content of the PTE entry, the refcount is left unchanged.
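A minimal sketch of the resulting pairing (map_unmap_example() is made up for illustration; set_pte_at() and ptep_get_and_clear() are the usual kernel primitives, while pte_get()/pte_put() are the helpers from this series, and the hunks below show the real call sites)::

	/*
	 * Sketch: how mapping and unmapping a single page is expected to
	 * interact with the PTE page table refcount.
	 */
	static void map_unmap_example(struct mm_struct *mm, pmd_t *pmd,
				      unsigned long addr, pte_t *pte, pte_t entry)
	{
		/* A pte_none() slot becomes !pte_none(): take a reference. */
		set_pte_at(mm, addr, pte, entry);
		pte_get(pmd);

		/* The slot goes back to pte_none(): drop that reference. */
		ptep_get_and_clear(mm, addr, pte);
		pte_put(mm, pmd, addr);

		/*
		 * Swap-out or migration only rewrites a !pte_none() entry
		 * into another !pte_none() entry (a swap or migration
		 * entry), so the refcount is left untouched there.
		 */
	}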
Signed-off-by: Qi Zheng --- kernel/events/uprobes.c | 2 ++ mm/filemap.c | 3 +++ mm/madvise.c | 5 +++++ mm/memory.c | 42 +++++++++++++++++++++++++++++++++++------- mm/migrate.c | 1 + mm/mremap.c | 7 +++++++ mm/rmap.c | 10 ++++++++++ mm/userfaultfd.c | 2 ++ 8 files changed, 65 insertions(+), 7 deletions(-) diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c index 6357c3580d07..96dd2959e1ac 100644 --- a/kernel/events/uprobes.c +++ b/kernel/events/uprobes.c @@ -200,6 +200,8 @@ static int __replace_page(struct vm_area_struct *vma, unsigned long addr, if (new_page) set_pte_at_notify(mm, addr, pvmw.pte, mk_pte(new_page, vma->vm_page_prot)); + else + pte_put(mm, pte_to_pmd(pvmw.pte), addr); page_remove_rmap(old_page, false); if (!page_mapped(old_page)) diff --git a/mm/filemap.c b/mm/filemap.c index 1e7e9e4fd759..aa47ee11a3d8 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -3309,6 +3309,7 @@ vm_fault_t filemap_map_pages(struct vm_fault *vmf, struct page *head, *page; unsigned int mmap_miss = READ_ONCE(file->f_ra.mmap_miss); vm_fault_t ret = 0; + unsigned int nr_get = 0; rcu_read_lock(); head = first_map_page(mapping, &xas, end_pgoff); @@ -3342,6 +3343,7 @@ vm_fault_t filemap_map_pages(struct vm_fault *vmf, ret = VM_FAULT_NOPAGE; do_set_pte(vmf, page, addr); + nr_get++; /* no need to invalidate: a not-present page won't be cached */ update_mmu_cache(vma, addr, vmf->pte); unlock_page(head); @@ -3351,6 +3353,7 @@ vm_fault_t filemap_map_pages(struct vm_fault *vmf, put_page(head); } while ((head = next_map_page(mapping, &xas, end_pgoff)) != NULL); pte_unmap_unlock(vmf->pte, vmf->ptl); + pte_get_many(vmf->pmd, nr_get); pte_put(vma->vm_mm, vmf->pmd, start); out: rcu_read_unlock(); diff --git a/mm/madvise.c b/mm/madvise.c index 0734db8d53a7..82fc40b6dcbf 100644 --- a/mm/madvise.c +++ b/mm/madvise.c @@ -580,6 +580,8 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr, struct page *page; int nr_swap = 0; unsigned long next; + unsigned int nr_put = 0; + unsigned long start = addr; next = pmd_addr_end(addr, end); if (pmd_trans_huge(*pmd)) @@ -612,6 +614,7 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr, nr_swap--; free_swap_and_cache(entry); pte_clear_not_present_full(mm, addr, pte, tlb->fullmm); + nr_put++; continue; } @@ -696,6 +699,8 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr, } arch_leave_lazy_mmu_mode(); pte_unmap_unlock(orig_pte, ptl); + if (nr_put) + pte_put_many(mm, pmd, start, nr_put); cond_resched(); next: return 0; diff --git a/mm/memory.c b/mm/memory.c index 0b9af38cfa11..ea4d651ac8c7 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -878,6 +878,7 @@ copy_nonpresent_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm, if (!userfaultfd_wp(dst_vma)) pte = pte_swp_clear_uffd_wp(pte); set_pte_at(dst_mm, addr, dst_pte, pte); + pte_get(pte_to_pmd(dst_pte)); return 0; } @@ -946,6 +947,7 @@ copy_present_page(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma /* Uffd-wp needs to be delivered to dest pte as well */ pte = pte_wrprotect(pte_mkuffd_wp(pte)); set_pte_at(dst_vma->vm_mm, addr, dst_pte, pte); + pte_get(pte_to_pmd(dst_pte)); return 0; } @@ -998,6 +1000,7 @@ copy_present_pte(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma, pte = pte_clear_uffd_wp(pte); set_pte_at(dst_vma->vm_mm, addr, dst_pte, pte); + pte_get(pte_to_pmd(dst_pte)); return 0; } @@ -1335,6 +1338,8 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb, pte_t *start_pte; pte_t *pte; swp_entry_t entry; + unsigned int nr_put = 0; + 
unsigned long start = addr; tlb_change_page_size(tlb, PAGE_SIZE); again: @@ -1359,6 +1364,7 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb, continue; ptent = ptep_get_and_clear_full(mm, addr, pte, tlb->fullmm); + nr_put++; tlb_remove_tlb_entry(tlb, pte, addr); if (unlikely(!page)) continue; @@ -1392,6 +1398,7 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb, if (unlikely(zap_skip_check_mapping(details, page))) continue; pte_clear_not_present_full(mm, addr, pte, tlb->fullmm); + nr_put++; rss[mm_counter(page)]--; if (is_device_private_entry(entry)) @@ -1416,6 +1423,7 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb, if (unlikely(!free_swap_and_cache(entry))) print_bad_pte(vma, addr, ptent, NULL); pte_clear_not_present_full(mm, addr, pte, tlb->fullmm); + nr_put++; } while (pte++, addr += PAGE_SIZE, addr != end); add_mm_rss_vec(mm, rss); @@ -1442,6 +1450,9 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb, goto again; } + if (nr_put) + pte_put_many(mm, pmd, start, nr_put); + return addr; } @@ -1759,6 +1770,7 @@ static int insert_page_into_pte_locked(struct mm_struct *mm, pte_t *pte, inc_mm_counter_fast(mm, mm_counter_file(page)); page_add_file_rmap(page, false); set_pte_at(mm, addr, pte, mk_pte(page, prot)); + pte_get(pte_to_pmd(pte)); return 0; } @@ -2085,6 +2097,7 @@ static vm_fault_t insert_pfn(struct vm_area_struct *vma, unsigned long addr, } set_pte_at(mm, addr, pte, entry); + pte_get(pte_to_pmd(pte)); update_mmu_cache(vma, addr, pte); /* XXX: why not for insert_page? */ out_unlock: @@ -2291,6 +2304,7 @@ static int remap_pte_range(struct mm_struct *mm, pmd_t *pmd, unsigned long pfn, pgprot_t prot) { unsigned long start = addr; + unsigned int nr_get = 0; pte_t *pte, *mapped_pte; spinlock_t *ptl; int err = 0; @@ -2306,10 +2320,12 @@ static int remap_pte_range(struct mm_struct *mm, pmd_t *pmd, break; } set_pte_at(mm, addr, pte, pte_mkspecial(pfn_pte(pfn, prot))); + nr_get++; pfn++; } while (pte++, addr += PAGE_SIZE, addr != end); arch_leave_lazy_mmu_mode(); pte_unmap_unlock(mapped_pte, ptl); + pte_get_many(pmd, nr_get); pte_put(mm, pmd, start); return err; } @@ -2524,6 +2540,7 @@ static int apply_to_pte_range(struct mm_struct *mm, pmd_t *pmd, pte_t *pte, *mapped_pte; int err = 0; spinlock_t *ptl; + unsigned int nr_put = 0, nr_get = 0; if (create) { mapped_pte = pte = (mm == &init_mm) ? @@ -2531,6 +2548,7 @@ static int apply_to_pte_range(struct mm_struct *mm, pmd_t *pmd, pte_alloc_map_lock(mm, pmd, addr, &ptl); if (!pte) return -ENOMEM; + nr_put++; } else { mapped_pte = pte = (mm == &init_mm) ? 
pte_offset_kernel(pmd, addr) : @@ -2543,11 +2561,17 @@ static int apply_to_pte_range(struct mm_struct *mm, pmd_t *pmd, if (fn) { do { - if (create || !pte_none(*pte)) { + if (create) { err = fn(pte++, addr, data); - if (err) - break; + if (mm != &init_mm && !pte_none(*(pte-1))) + nr_get++; + } else if (!pte_none(*pte)) { + err = fn(pte++, addr, data); + if (mm != &init_mm && pte_none(*(pte-1))) + nr_put++; } + if (err) + break; } while (addr += PAGE_SIZE, addr != end); } *mask |= PGTBL_PTE_MODIFIED; @@ -2556,8 +2580,9 @@ static int apply_to_pte_range(struct mm_struct *mm, pmd_t *pmd, if (mm != &init_mm) { pte_unmap_unlock(mapped_pte, ptl); - if (create) - pte_put(mm, pmd, start); + pte_get_many(pmd, nr_get); + if (nr_put) + pte_put_many(mm, pmd, start, nr_put); } return err; } @@ -3835,6 +3860,7 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf) lru_cache_add_inactive_or_unevictable(page, vma); setpte: set_pte_at(vma->vm_mm, vmf->address, vmf->pte, entry); + pte_get(vmf->pmd); /* No need to invalidate - it was non-present before */ update_mmu_cache(vma, vmf->address, vmf->pte); @@ -4086,10 +4112,12 @@ vm_fault_t finish_fault(struct vm_fault *vmf) vmf->address, &vmf->ptl); ret = 0; /* Re-check under ptl */ - if (likely(pte_none(*vmf->pte))) + if (likely(pte_none(*vmf->pte))) { do_set_pte(vmf, page, vmf->address); - else + pte_get(vmf->pmd); + } else { ret = VM_FAULT_NOPAGE; + } update_mmu_tlb(vma, vmf->address, vmf->pte); pte_unmap_unlock(vmf->pte, vmf->ptl); diff --git a/mm/migrate.c b/mm/migrate.c index 26f16a4836d8..c03ac25f42a9 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -2807,6 +2807,7 @@ static void migrate_vma_insert_page(struct migrate_vma *migrate, } else { /* No need to invalidate - it was non-present before */ set_pte_at(mm, addr, ptep, entry); + pte_get(pmdp); update_mmu_cache(vma, addr, ptep); } diff --git a/mm/mremap.c b/mm/mremap.c index f80c628db25d..088a7a75cb4b 100644 --- a/mm/mremap.c +++ b/mm/mremap.c @@ -141,6 +141,8 @@ static void move_ptes(struct vm_area_struct *vma, pmd_t *old_pmd, spinlock_t *old_ptl, *new_ptl; bool force_flush = false; unsigned long len = old_end - old_addr; + unsigned int nr_put = 0, nr_get = 0; + unsigned long old_start = old_addr; /* * When need_rmap_locks is true, we take the i_mmap_rwsem and anon_vma @@ -181,6 +183,7 @@ static void move_ptes(struct vm_area_struct *vma, pmd_t *old_pmd, continue; pte = ptep_get_and_clear(mm, old_addr, old_pte); + nr_put++; /* * If we are remapping a valid PTE, make sure * to flush TLB before we drop the PTL for the @@ -197,6 +200,7 @@ static void move_ptes(struct vm_area_struct *vma, pmd_t *old_pmd, pte = move_pte(pte, new_vma->vm_page_prot, old_addr, new_addr); pte = move_soft_dirty_pte(pte); set_pte_at(mm, new_addr, new_pte, pte); + nr_get++; } arch_leave_lazy_mmu_mode(); @@ -206,6 +210,9 @@ static void move_ptes(struct vm_area_struct *vma, pmd_t *old_pmd, spin_unlock(new_ptl); pte_unmap(new_pte - 1); pte_unmap_unlock(old_pte - 1, old_ptl); + pte_get_many(new_pmd, nr_get); + if (nr_put) + pte_put_many(mm, old_pmd, old_start, nr_put); if (need_rmap_locks) drop_rmap_locks(vma); } diff --git a/mm/rmap.c b/mm/rmap.c index 2908d637bcad..630ce8a036b5 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -1404,6 +1404,7 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, bool ret = true; struct mmu_notifier_range range; enum ttu_flags flags = (enum ttu_flags)(long)arg; + unsigned int nr_put = 0; /* * When racing against e.g. 
zap_pte_range() on another cpu, @@ -1551,6 +1552,7 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, /* We have to invalidate as we cleared the pte */ mmu_notifier_invalidate_range(mm, address, address + PAGE_SIZE); + nr_put++; } else if (PageAnon(page)) { swp_entry_t entry = { .val = page_private(subpage) }; pte_t swp_pte; @@ -1564,6 +1566,7 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, /* We have to invalidate as we cleared the pte */ mmu_notifier_invalidate_range(mm, address, address + PAGE_SIZE); + pte_put(mm, pvmw.pmd, address); page_vma_mapped_walk_done(&pvmw); break; } @@ -1575,6 +1578,7 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, mmu_notifier_invalidate_range(mm, address, address + PAGE_SIZE); dec_mm_counter(mm, MM_ANONPAGES); + nr_put++; goto discard; } @@ -1630,6 +1634,7 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, * See Documentation/vm/mmu_notifier.rst */ dec_mm_counter(mm, mm_counter_file(page)); + nr_put++; } discard: /* @@ -1641,6 +1646,10 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, */ page_remove_rmap(subpage, PageHuge(page)); put_page(page); + if (nr_put) { + pte_put_many(mm, pvmw.pmd, address, nr_put); + nr_put = 0; + } } mmu_notifier_invalidate_range_end(&range); @@ -1871,6 +1880,7 @@ static bool try_to_migrate_one(struct page *page, struct vm_area_struct *vma, /* We have to invalidate as we cleared the pte */ mmu_notifier_invalidate_range(mm, address, address + PAGE_SIZE); + pte_put(mm, pvmw.pmd, address); } else { swp_entry_t entry; pte_t swp_pte; diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c index 37df899a1b9d..b87c61b94065 100644 --- a/mm/userfaultfd.c +++ b/mm/userfaultfd.c @@ -110,6 +110,7 @@ int mfill_atomic_install_pte(struct mm_struct *dst_mm, pmd_t *dst_pmd, lru_cache_add_inactive_or_unevictable(page, dst_vma); set_pte_at(dst_mm, dst_addr, dst_pte, _dst_pte); + pte_get(dst_pmd); /* No need to invalidate - it was non-present before */ update_mmu_cache(dst_vma, dst_addr, dst_pte); @@ -204,6 +205,7 @@ static int mfill_zeropage_pte(struct mm_struct *dst_mm, if (!pte_none(*dst_pte)) goto out_unlock; set_pte_at(dst_mm, dst_addr, dst_pte, _dst_pte); + pte_get(dst_pmd); /* No need to invalidate - it was non-present before */ update_mmu_cache(dst_vma, dst_addr, dst_pte); ret = 0; From patchwork Wed Nov 10 10:54:23 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Qi Zheng X-Patchwork-Id: 12611771 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id B97ECC433EF for ; Wed, 10 Nov 2021 10:55:38 +0000 (UTC) Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by mail.kernel.org (Postfix) with ESMTP id 6FBE261246 for ; Wed, 10 Nov 2021 10:55:38 +0000 (UTC) DMARC-Filter: OpenDMARC Filter v1.4.1 mail.kernel.org 6FBE261246 Authentication-Results: mail.kernel.org; dmarc=fail (p=none dis=none) header.from=bytedance.com Authentication-Results: mail.kernel.org; spf=pass smtp.mailfrom=kvack.org Received: by kanga.kvack.org (Postfix) id 140D36B007B; Wed, 10 Nov 2021 05:55:38 -0500 (EST) Received: by kanga.kvack.org (Postfix, from userid 40) id 0C9E16B007D; Wed, 10 Nov 2021 05:55:38 -0500 (EST) X-Delivered-To: int-list-linux-mm@kvack.org Received: by 
From: Qi Zheng
To: akpm@linux-foundation.org, tglx@linutronix.de, kirill.shutemov@linux.intel.com, mika.penttila@nextfour.com, david@redhat.com, jgg@nvidia.com
Cc: linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, songmuchun@bytedance.com, zhouchengming@bytedance.com, Qi Zheng
Subject: [PATCH v3 10/15] mm/pte_ref: add support for page fault path
Date: Wed, 10 Nov 2021 18:54:23 +0800
Message-Id: <20211110105428.32458-11-zhengqi.arch@bytedance.com>
In-Reply-To: <20211110105428.32458-1-zhengqi.arch@bytedance.com>
References: <20211110105428.32458-1-zhengqi.arch@bytedance.com>
MIME-Version: 1.0
In the page fault path, we need to take a reference on the PTE page table page whenever the pmd entry is not none; this ensures that the PTE page table page will not be released by other threads. In addition, the mmap_lock may be dropped early in some cases in handle_pte_fault(), after which the pmd entry is no longer stable:

	thread A		thread B
	page fault		collapse_huge_page
	==========		==================

	mmap_read_unlock()
				mmap_write_lock()
				pgtable_trans_huge_deposit()
				set_pmd_at()
				/* pmd entry is changed! */
	pte_put()

So we should call pte_put() before dropping the mmap_lock.

Signed-off-by: Qi Zheng --- fs/userfaultfd.c | 1 + mm/filemap.c | 2 ++ mm/internal.h | 1 + mm/khugepaged.c | 8 +++++++- mm/memory.c | 33 ++++++++++++++++++++++++--------- 5 files changed, 35 insertions(+), 10 deletions(-) diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c index 22bf14ab2d16..ddbcefa7e0a6 100644 --- a/fs/userfaultfd.c +++ b/fs/userfaultfd.c @@ -509,6 +509,7 @@ vm_fault_t handle_userfault(struct vm_fault *vmf, unsigned long reason) must_wait = userfaultfd_huge_must_wait(ctx, vmf->vma, vmf->address, vmf->flags, reason); + pte_put_vmf(vmf); mmap_read_unlock(mm); if (likely(must_wait && !READ_ONCE(ctx->released))) { diff --git a/mm/filemap.c b/mm/filemap.c index aa47ee11a3d8..4fdc74dc6736 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -1708,6 +1708,7 @@ bool __folio_lock_or_retry(struct folio *folio, struct vm_fault *vmf) if (flags & FAULT_FLAG_RETRY_NOWAIT) return false; + pte_put_vmf(vmf); mmap_read_unlock(mm); if (flags & FAULT_FLAG_KILLABLE) folio_wait_locked_killable(folio); @@ -1720,6 +1721,7 @@ bool __folio_lock_or_retry(struct folio *folio, struct vm_fault *vmf) ret = __folio_lock_killable(folio); if (ret) { + pte_put_vmf(vmf); mmap_read_unlock(mm); return false; } diff --git a/mm/internal.h b/mm/internal.h index 474d6e3443f8..460418828a76 100644 --- a/mm/internal.h +++ b/mm/internal.h @@ -488,6 +488,7 @@ static inline struct file *maybe_unlock_mmap_for_io(struct vm_fault *vmf, if (fault_flag_allow_retry_first(flags) && !(flags & FAULT_FLAG_RETRY_NOWAIT)) { fpin = get_file(vmf->vma->vm_file); + pte_put_vmf(vmf); mmap_read_unlock(vmf->vma->vm_mm); } return fpin; diff --git a/mm/khugepaged.c b/mm/khugepaged.c index e99101162f1a..92b0494f4a00 100644 --- a/mm/khugepaged.c +++ b/mm/khugepaged.c @@ -1019,10 +1019,13 @@ static bool __collapse_huge_page_swapin(struct mm_struct *mm, .pmd = pmd, }; - vmf.pte = pte_offset_map(pmd, address); + vmf.pte = pte_tryget_map(pmd, address); + if (!vmf.pte) + continue; vmf.orig_pte = *vmf.pte; if (!is_swap_pte(vmf.orig_pte)) { pte_unmap(vmf.pte); + pte_put_vmf(&vmf); continue; } swapped_in++; @@ -1041,7 +1044,10 @@ static bool __collapse_huge_page_swapin(struct mm_struct *mm, trace_mm_collapse_huge_page_swapin(mm, swapped_in, referenced, 0); return false; } + } else { + pte_put_vmf(&vmf); } + if (ret & VM_FAULT_ERROR) { trace_mm_collapse_huge_page_swapin(mm, swapped_in, referenced, 0); return false; diff --git a/mm/memory.c b/mm/memory.c index ea4d651ac8c7..5cc4ce0af665 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -4571,8 +4571,10 @@ static vm_fault_t
wp_huge_pud(struct vm_fault *vmf, pud_t orig_pud) static vm_fault_t handle_pte_fault(struct vm_fault *vmf) { pte_t entry; + vm_fault_t ret; - if (unlikely(pmd_none(*vmf->pmd))) { +retry: + if (unlikely(pmd_none(READ_ONCE(*vmf->pmd)))) { /* * Leave __pte_alloc() until later: because vm_ops->fault may * want to allocate huge page, and if we expose page table @@ -4595,13 +4597,16 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf) */ if (pmd_devmap_trans_unstable(vmf->pmd)) return 0; + /* * A regular pmd is established and it can't morph into a huge * pmd from under us anymore at this point because we hold the * mmap_lock read mode and khugepaged takes it in write mode. * So now it's safe to run pte_offset_map(). */ - vmf->pte = pte_offset_map(vmf->pmd, vmf->address); + vmf->pte = pte_tryget_map(vmf->pmd, vmf->address); + if (!vmf->pte) + goto retry; vmf->orig_pte = *vmf->pte; /* @@ -4616,6 +4621,7 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf) if (pte_none(vmf->orig_pte)) { pte_unmap(vmf->pte); vmf->pte = NULL; + pte_put_vmf(vmf); } } @@ -4626,11 +4632,15 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf) return do_fault(vmf); } - if (!pte_present(vmf->orig_pte)) - return do_swap_page(vmf); + if (!pte_present(vmf->orig_pte)) { + ret = do_swap_page(vmf); + goto put; + } - if (pte_protnone(vmf->orig_pte) && vma_is_accessible(vmf->vma)) - return do_numa_page(vmf); + if (pte_protnone(vmf->orig_pte) && vma_is_accessible(vmf->vma)) { + ret = do_numa_page(vmf); + goto put; + } vmf->ptl = pte_lockptr(vmf->vma->vm_mm, vmf->pmd); spin_lock(vmf->ptl); @@ -4640,8 +4650,10 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf) goto unlock; } if (vmf->flags & FAULT_FLAG_WRITE) { - if (!pte_write(entry)) - return do_wp_page(vmf); + if (!pte_write(entry)) { + ret = do_wp_page(vmf); + goto put; + } entry = pte_mkdirty(entry); } entry = pte_mkyoung(entry); @@ -4663,7 +4675,10 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf) } unlock: pte_unmap_unlock(vmf->pte, vmf->ptl); - return 0; + ret = 0; +put: + pte_put_vmf(vmf); + return ret; } /* From patchwork Wed Nov 10 10:54:24 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Qi Zheng X-Patchwork-Id: 12611773 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id A4982C433EF for ; Wed, 10 Nov 2021 10:55:46 +0000 (UTC) Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by mail.kernel.org (Postfix) with ESMTP id 380FD6112D for ; Wed, 10 Nov 2021 10:55:46 +0000 (UTC) DMARC-Filter: OpenDMARC Filter v1.4.1 mail.kernel.org 380FD6112D Authentication-Results: mail.kernel.org; dmarc=fail (p=none dis=none) header.from=bytedance.com Authentication-Results: mail.kernel.org; spf=pass smtp.mailfrom=kvack.org Received: by kanga.kvack.org (Postfix) id CB9D76B006C; Wed, 10 Nov 2021 05:55:45 -0500 (EST) Received: by kanga.kvack.org (Postfix, from userid 40) id B9EAA6B0081; Wed, 10 Nov 2021 05:55:45 -0500 (EST) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 97C7B6B006C; Wed, 10 Nov 2021 05:55:45 -0500 (EST) X-Delivered-To: linux-mm@kvack.org Received: from forelay.hostedemail.com (smtprelay0020.hostedemail.com [216.40.44.20]) by kanga.kvack.org (Postfix) with ESMTP id 7B2E26B007E for ; Wed, 10 Nov 2021 05:55:45 -0500 (EST) 
From: Qi Zheng To: akpm@linux-foundation.org, tglx@linutronix.de, kirill.shutemov@linux.intel.com, mika.penttila@nextfour.com, david@redhat.com, jgg@nvidia.com Cc: linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, songmuchun@bytedance.com, zhouchengming@bytedance.com, Qi Zheng Subject: [PATCH v3 11/15] mm/pte_ref: take a refcount before accessing the PTE page table page Date: Wed, 10 Nov 2021 18:54:24 +0800 Message-Id: <20211110105428.32458-12-zhengqi.arch@bytedance.com> X-Mailer: git-send-email 2.24.3 (Apple Git-128) In-Reply-To: <20211110105428.32458-1-zhengqi.arch@bytedance.com> References: <20211110105428.32458-1-zhengqi.arch@bytedance.com> MIME-Version: 1.0
Now, a user page table is only released by calling free_pgtables(), either when the process exits or when unmap_region() is called (e.g. the munmap() path). So other threads only need to ensure mutual exclusion with these paths to guarantee that the page table is not released. For example::

	thread A			thread B
	page table walker		munmap
	=================		======

	mmap_read_lock()
	if (!pte_none() && pte_present() && !pmd_trans_unstable()) {
		pte_offset_map_lock()
		*walk page table*
		pte_unmap_unlock()
	}
	mmap_read_unlock()

					mmap_write_lock_killable()
					detach_vmas_to_be_unmapped()
					unmap_region()
					--> free_pgtables()

But after we introduce the reference count for the user PTE page table page, this existing balance is broken: the page can be released at any time once its pte_refcount drops to 0. Therefore, the following case may happen::

	thread A			thread B
	page table walker		madvise(MADV_DONTNEED)
	=================		======================

	mmap_read_lock()
	if (!pte_none() && pte_present() && !pmd_trans_unstable()) {

					mmap_read_lock()
					unmap_page_range()
					--> zap_pte_range()
					    *the pte_refcount is reduced to 0*
					    --> *free PTE page table page*

		/* broken!! */
		pte_offset_map_lock()

As we can see, threads A and B both hold the read lock of mmap_lock, so they can execute concurrently. When thread B releases the PTE page table page, the value in the corresponding pmd entry becomes unstable: it may be none or a huge pmd, or it may map a new PTE page table page again. This can cause system chaos and even a panic. So we need to try to take a reference on the PTE page table page before walking the page table; then the system becomes orderly again::

	thread A			thread B
	page table walker		madvise(MADV_DONTNEED)
	=================		======================

	mmap_read_lock()
	if (!pte_none() && pte_present() && !pmd_trans_unstable()) {
		pte_try_get()
		--> pte_get_unless_zero
		*if successfully, then:*

					mmap_read_lock()
					unmap_page_range()
					--> zap_pte_range()
					    *the pte_refcount is reduced to 1*

		pte_offset_map_lock()
		*walk page table*
		pte_unmap_unlock()
		pte_put()
		--> *the pte_refcount is reduced to 0*
		    --> *free PTE page table page*

There is also a lockless scenario (such as fast GUP); fortunately, we don't need to do any additional operations there to ensure that the system stays in order. (A short, illustrative sketch of the locked walk pattern above is included right below, before the fast GUP example.)
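[ Illustrative sketch, not part of the patch diff: the locked page table walkers converted by this patch end up following roughly the shape below, using the pte_tryget_map_lock()/pte_put() helpers introduced earlier in this series; the loop body is elided and error handling is simplified. ]

	pte_t *pte;
	spinlock_t *ptl;
	unsigned long start = addr;

	pte = pte_tryget_map_lock(mm, pmd, addr, &ptl);
	if (!pte)	/* the PTE page is already gone, or the pmd is none/huge */
		return 0;
	for (; addr != end; pte++, addr += PAGE_SIZE) {
		/* ... inspect or modify *pte under ptl ... */
	}
	pte_unmap_unlock(pte - 1, ptl);
	pte_put(mm, pmd, start);	/* drop the reference taken by the tryget */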
Take fast GUP as an example:: thread A thread B fast GUP madvise(MADV_DONTNEED) ======== ====================== get_user_pages_fast_only() --> local_irq_save(); *free PTE page table page* --> unhook page /* The CPU where thread A is * located closed the local * interrupt and cannot respond to * IPI, so it will block here */ TLB invalidate page gup_pgd_range(); local_irq_restore(); *free page* Signed-off-by: Qi Zheng --- fs/proc/task_mmu.c | 24 ++++++++++++++--- fs/userfaultfd.c | 8 ++++-- include/linux/rmap.h | 2 ++ mm/damon/vaddr.c | 12 +++++++-- mm/gup.c | 13 ++++++++-- mm/hmm.c | 5 +++- mm/khugepaged.c | 13 ++++++++-- mm/ksm.c | 6 ++++- mm/madvise.c | 16 +++++++++--- mm/memcontrol.c | 12 +++++++-- mm/memory-failure.c | 11 ++++++-- mm/memory.c | 73 +++++++++++++++++++++++++++++++++++++--------------- mm/mempolicy.c | 6 ++++- mm/migrate.c | 27 ++++++++++++------- mm/mincore.c | 7 ++++- mm/mprotect.c | 11 +++++--- mm/page_vma_mapped.c | 4 +++ mm/pagewalk.c | 15 ++++++++--- mm/swapfile.c | 3 +++ 19 files changed, 209 insertions(+), 59 deletions(-) diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c index ad667dbc96f5..82dd5fd540ce 100644 --- a/fs/proc/task_mmu.c +++ b/fs/proc/task_mmu.c @@ -581,6 +581,7 @@ static int smaps_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end, struct vm_area_struct *vma = walk->vma; pte_t *pte; spinlock_t *ptl; + unsigned long start = addr; ptl = pmd_trans_huge_lock(pmd, vma); if (ptl) { @@ -596,10 +597,13 @@ static int smaps_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end, * keeps khugepaged out of here and from collapsing things * in here. */ - pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl); + pte = pte_tryget_map_lock(vma->vm_mm, pmd, addr, &ptl); + if (!pte) + goto out; for (; addr != end; pte++, addr += PAGE_SIZE) smaps_pte_entry(pte, addr, walk); pte_unmap_unlock(pte - 1, ptl); + pte_put(vma->vm_mm, pmd, start); out: cond_resched(); return 0; @@ -1124,6 +1128,7 @@ static int clear_refs_pte_range(pmd_t *pmd, unsigned long addr, pte_t *pte, ptent; spinlock_t *ptl; struct page *page; + unsigned long start = addr; ptl = pmd_trans_huge_lock(pmd, vma); if (ptl) { @@ -1149,7 +1154,9 @@ static int clear_refs_pte_range(pmd_t *pmd, unsigned long addr, if (pmd_trans_unstable(pmd)) return 0; - pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl); + pte = pte_tryget_map_lock(vma->vm_mm, pmd, addr, &ptl); + if (!pte) + return 0; for (; addr != end; pte++, addr += PAGE_SIZE) { ptent = *pte; @@ -1171,6 +1178,7 @@ static int clear_refs_pte_range(pmd_t *pmd, unsigned long addr, ClearPageReferenced(page); } pte_unmap_unlock(pte - 1, ptl); + pte_put(vma->vm_mm, pmd, start); cond_resched(); return 0; } @@ -1410,6 +1418,7 @@ static int pagemap_pmd_range(pmd_t *pmdp, unsigned long addr, unsigned long end, spinlock_t *ptl; pte_t *pte, *orig_pte; int err = 0; + unsigned long start = addr; #ifdef CONFIG_TRANSPARENT_HUGEPAGE ptl = pmd_trans_huge_lock(pmdp, vma); @@ -1482,7 +1491,9 @@ static int pagemap_pmd_range(pmd_t *pmdp, unsigned long addr, unsigned long end, * We can assume that @vma always points to a valid one and @end never * goes beyond vma->vm_end. 
*/ - orig_pte = pte = pte_offset_map_lock(walk->mm, pmdp, addr, &ptl); + orig_pte = pte = pte_tryget_map_lock(walk->mm, pmdp, addr, &ptl); + if (!pte) + return 0; for (; addr < end; pte++, addr += PAGE_SIZE) { pagemap_entry_t pme; @@ -1492,6 +1503,7 @@ static int pagemap_pmd_range(pmd_t *pmdp, unsigned long addr, unsigned long end, break; } pte_unmap_unlock(orig_pte, ptl); + pte_put(walk->mm, pmdp, start); cond_resched(); @@ -1798,6 +1810,7 @@ static int gather_pte_stats(pmd_t *pmd, unsigned long addr, spinlock_t *ptl; pte_t *orig_pte; pte_t *pte; + unsigned long start = addr; #ifdef CONFIG_TRANSPARENT_HUGEPAGE ptl = pmd_trans_huge_lock(pmd, vma); @@ -1815,7 +1828,9 @@ static int gather_pte_stats(pmd_t *pmd, unsigned long addr, if (pmd_trans_unstable(pmd)) return 0; #endif - orig_pte = pte = pte_offset_map_lock(walk->mm, pmd, addr, &ptl); + orig_pte = pte = pte_tryget_map_lock(walk->mm, pmd, addr, &ptl); + if (!pte) + return 0; do { struct page *page = can_gather_numa_stats(*pte, vma, addr); if (!page) @@ -1824,6 +1839,7 @@ static int gather_pte_stats(pmd_t *pmd, unsigned long addr, } while (pte++, addr += PAGE_SIZE, addr != end); pte_unmap_unlock(orig_pte, ptl); + pte_put(walk->mm, pmd, start); cond_resched(); return 0; } diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c index ddbcefa7e0a6..d1e18e5f3a13 100644 --- a/fs/userfaultfd.c +++ b/fs/userfaultfd.c @@ -297,6 +297,8 @@ static inline bool userfaultfd_must_wait(struct userfaultfd_ctx *ctx, if (!pud_present(*pud)) goto out; pmd = pmd_offset(pud, address); + +retry: /* * READ_ONCE must function as a barrier with narrower scope * and it must be equivalent to: @@ -323,7 +325,9 @@ static inline bool userfaultfd_must_wait(struct userfaultfd_ctx *ctx, * the pmd is stable (as in !pmd_trans_unstable) so we can re-read it * and use the standard pte_offset_map() instead of parsing _pmd. */ - pte = pte_offset_map(pmd, address); + pte = pte_tryget_map(pmd, address); + if (!pte) + goto retry; /* * Lockless access: we're in a wait_event so it's ok if it * changes under us. 
@@ -333,7 +337,7 @@ static inline bool userfaultfd_must_wait(struct userfaultfd_ctx *ctx, if (!pte_write(*pte) && (reason & VM_UFFD_WP)) ret = true; pte_unmap(pte); - + pte_put(mm, pmd, address); out: return ret; } diff --git a/include/linux/rmap.h b/include/linux/rmap.h index 221c3c6438a7..5bd76fb8b93a 100644 --- a/include/linux/rmap.h +++ b/include/linux/rmap.h @@ -222,6 +222,8 @@ static inline void page_vma_mapped_walk_done(struct page_vma_mapped_walk *pvmw) pte_unmap(pvmw->pte); if (pvmw->ptl) spin_unlock(pvmw->ptl); + if (pvmw->pte && !PageHuge(pvmw->page)) + pte_put(pvmw->vma->vm_mm, pvmw->pmd, pvmw->address); } bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw); diff --git a/mm/damon/vaddr.c b/mm/damon/vaddr.c index 35fe49080ee9..8b816f92a563 100644 --- a/mm/damon/vaddr.c +++ b/mm/damon/vaddr.c @@ -373,6 +373,7 @@ static int damon_mkold_pmd_entry(pmd_t *pmd, unsigned long addr, pte_t *pte; spinlock_t *ptl; +retry: if (pmd_huge(*pmd)) { ptl = pmd_lock(walk->mm, pmd); if (pmd_huge(*pmd)) { @@ -385,12 +386,15 @@ static int damon_mkold_pmd_entry(pmd_t *pmd, unsigned long addr, if (pmd_none(*pmd) || unlikely(pmd_bad(*pmd))) return 0; - pte = pte_offset_map_lock(walk->mm, pmd, addr, &ptl); + pte = pte_tryget_map_lock(walk->mm, pmd, addr, &ptl); + if (!pte) + goto retry; if (!pte_present(*pte)) goto out; damon_ptep_mkold(pte, walk->mm, addr); out: pte_unmap_unlock(pte, ptl); + pte_put(walk->mm, pmd, addr); return 0; } @@ -446,6 +450,7 @@ static int damon_young_pmd_entry(pmd_t *pmd, unsigned long addr, struct page *page; struct damon_young_walk_private *priv = walk->private; +retry: #ifdef CONFIG_TRANSPARENT_HUGEPAGE if (pmd_huge(*pmd)) { ptl = pmd_lock(walk->mm, pmd); @@ -473,7 +478,9 @@ static int damon_young_pmd_entry(pmd_t *pmd, unsigned long addr, if (pmd_none(*pmd) || unlikely(pmd_bad(*pmd))) return -EINVAL; - pte = pte_offset_map_lock(walk->mm, pmd, addr, &ptl); + pte = pte_tryget_map_lock(walk->mm, pmd, addr, &ptl); + if (!pte) + goto retry; if (!pte_present(*pte)) goto out; page = damon_get_page(pte_pfn(*pte)); @@ -487,6 +494,7 @@ static int damon_young_pmd_entry(pmd_t *pmd, unsigned long addr, put_page(page); out: pte_unmap_unlock(pte, ptl); + pte_put(walk->mm, pmd, addr); return 0; } diff --git a/mm/gup.c b/mm/gup.c index e084111103f0..7b6d024ad5c7 100644 --- a/mm/gup.c +++ b/mm/gup.c @@ -488,7 +488,9 @@ static struct page *follow_page_pte(struct vm_area_struct *vma, if (unlikely(pmd_bad(*pmd))) return no_page_table(vma, flags); - ptep = pte_offset_map_lock(mm, pmd, address, &ptl); + ptep = pte_tryget_map_lock(mm, pmd, address, &ptl); + if (!ptep) + return no_page_table(vma, flags); pte = *ptep; if (!pte_present(pte)) { swp_entry_t entry; @@ -505,6 +507,7 @@ static struct page *follow_page_pte(struct vm_area_struct *vma, if (!is_migration_entry(entry)) goto no_page; pte_unmap_unlock(ptep, ptl); + pte_put(mm, pmd, address); migration_entry_wait(mm, pmd, address); goto retry; } @@ -512,6 +515,7 @@ static struct page *follow_page_pte(struct vm_area_struct *vma, goto no_page; if ((flags & FOLL_WRITE) && !can_follow_write_pte(pte, flags)) { pte_unmap_unlock(ptep, ptl); + pte_put(mm, pmd, address); return NULL; } @@ -600,9 +604,11 @@ static struct page *follow_page_pte(struct vm_area_struct *vma, } out: pte_unmap_unlock(ptep, ptl); + pte_put(mm, pmd, address); return page; no_page: pte_unmap_unlock(ptep, ptl); + pte_put(mm, pmd, address); if (!pte_none(pte)) return NULL; return no_page_table(vma, flags); @@ -885,7 +891,9 @@ static int get_gate_page(struct mm_struct *mm, 
unsigned long address, if (!pmd_present(*pmd)) return -EFAULT; VM_BUG_ON(pmd_trans_huge(*pmd)); - pte = pte_offset_map(pmd, address); + pte = pte_tryget_map(pmd, address); + if (!pte) + return -EFAULT; if (pte_none(*pte)) goto unmap; *vma = get_gate_vma(mm); @@ -905,6 +913,7 @@ static int get_gate_page(struct mm_struct *mm, unsigned long address, ret = 0; unmap: pte_unmap(pte); + pte_put(mm, pmd, address); return ret; } diff --git a/mm/hmm.c b/mm/hmm.c index 842e26599238..b8917a5ae442 100644 --- a/mm/hmm.c +++ b/mm/hmm.c @@ -383,7 +383,9 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp, return hmm_pfns_fill(start, end, range, HMM_PFN_ERROR); } - ptep = pte_offset_map(pmdp, addr); + ptep = pte_tryget_map(pmdp, addr); + if (!ptep) + goto again; for (; addr < end; addr += PAGE_SIZE, ptep++, hmm_pfns++) { int r; @@ -394,6 +396,7 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp, } } pte_unmap(ptep - 1); + pte_put(walk->mm, pmdp, start); return 0; } diff --git a/mm/khugepaged.c b/mm/khugepaged.c index 92b0494f4a00..5842c0774d70 100644 --- a/mm/khugepaged.c +++ b/mm/khugepaged.c @@ -1249,7 +1249,11 @@ static int khugepaged_scan_pmd(struct mm_struct *mm, } memset(khugepaged_node_load, 0, sizeof(khugepaged_node_load)); - pte = pte_offset_map_lock(mm, pmd, address, &ptl); + pte = pte_tryget_map_lock(mm, pmd, address, &ptl); + if (!pte) { + result = SCAN_PMD_NULL; + goto out; + } for (_address = address, _pte = pte; _pte < pte+HPAGE_PMD_NR; _pte++, _address += PAGE_SIZE) { pte_t pteval = *_pte; @@ -1370,6 +1374,7 @@ static int khugepaged_scan_pmd(struct mm_struct *mm, } out_unmap: pte_unmap_unlock(pte, ptl); + pte_put(mm, pmd, address); if (ret) { node = khugepaged_find_target_node(); /* collapse_huge_page will return with the mmap_lock released */ @@ -1472,7 +1477,9 @@ void collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr) if (!pmd) goto drop_hpage; - start_pte = pte_offset_map_lock(mm, pmd, haddr, &ptl); + start_pte = pte_tryget_map_lock(mm, pmd, haddr, &ptl); + if (!start_pte) + goto drop_hpage; /* step 1: check all mapped PTEs are to the right huge page */ for (i = 0, addr = haddr, pte = start_pte; @@ -1510,6 +1517,7 @@ void collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr) } pte_unmap_unlock(start_pte, ptl); + pte_put(mm, pmd, haddr); /* step 3: set proper refcount and mm_counters. 
*/ if (count) { @@ -1531,6 +1539,7 @@ void collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr) abort: pte_unmap_unlock(start_pte, ptl); + pte_put(mm, pmd, haddr); goto drop_hpage; } diff --git a/mm/ksm.c b/mm/ksm.c index 0662093237e4..94aeaed42c1f 100644 --- a/mm/ksm.c +++ b/mm/ksm.c @@ -1140,9 +1140,12 @@ static int replace_page(struct vm_area_struct *vma, struct page *page, addr + PAGE_SIZE); mmu_notifier_invalidate_range_start(&range); - ptep = pte_offset_map_lock(mm, pmd, addr, &ptl); + ptep = pte_tryget_map_lock(mm, pmd, addr, &ptl); + if (!ptep) + goto out_mn; if (!pte_same(*ptep, orig_pte)) { pte_unmap_unlock(ptep, ptl); + pte_put(mm, pmd, addr); goto out_mn; } @@ -1182,6 +1185,7 @@ static int replace_page(struct vm_area_struct *vma, struct page *page, put_page(page); pte_unmap_unlock(ptep, ptl); + pte_put(mm, pmd, addr); err = 0; out_mn: mmu_notifier_invalidate_range_end(&range); diff --git a/mm/madvise.c b/mm/madvise.c index 82fc40b6dcbf..5cf2832abb98 100644 --- a/mm/madvise.c +++ b/mm/madvise.c @@ -200,9 +200,12 @@ static int swapin_walk_pmd_entry(pmd_t *pmd, unsigned long start, struct page *page; spinlock_t *ptl; - orig_pte = pte_offset_map_lock(vma->vm_mm, pmd, start, &ptl); + orig_pte = pte_tryget_map_lock(vma->vm_mm, pmd, start, &ptl); + if (!orig_pte) + continue; pte = *(orig_pte + ((index - start) / PAGE_SIZE)); pte_unmap_unlock(orig_pte, ptl); + pte_put(vma->vm_mm, pmd, start); if (pte_present(pte) || pte_none(pte)) continue; @@ -317,6 +320,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd, pte_t *orig_pte, *pte, ptent; spinlock_t *ptl; struct page *page = NULL; + unsigned long start = addr; LIST_HEAD(page_list); if (fatal_signal_pending(current)) @@ -393,7 +397,9 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd, return 0; #endif tlb_change_page_size(tlb, PAGE_SIZE); - orig_pte = pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl); + orig_pte = pte = pte_tryget_map_lock(vma->vm_mm, pmd, addr, &ptl); + if (!pte) + return 0; flush_tlb_batched_pending(mm); arch_enter_lazy_mmu_mode(); for (; addr < end; pte++, addr += PAGE_SIZE) { @@ -471,6 +477,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd, arch_leave_lazy_mmu_mode(); pte_unmap_unlock(orig_pte, ptl); + pte_put(vma->vm_mm, pmd, start); if (pageout) reclaim_pages(&page_list); cond_resched(); @@ -592,7 +599,10 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr, return 0; tlb_change_page_size(tlb, PAGE_SIZE); - orig_pte = pte = pte_offset_map_lock(mm, pmd, addr, &ptl); + orig_pte = pte = pte_tryget_map_lock(mm, pmd, addr, &ptl); + if (!pte) + return 0; + nr_put++; flush_tlb_batched_pending(mm); arch_enter_lazy_mmu_mode(); for (; addr != end; pte++, addr += PAGE_SIZE) { diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 781605e92015..7283044d4f64 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -5773,6 +5773,7 @@ static int mem_cgroup_count_precharge_pte_range(pmd_t *pmd, struct vm_area_struct *vma = walk->vma; pte_t *pte; spinlock_t *ptl; + unsigned long start = addr; ptl = pmd_trans_huge_lock(pmd, vma); if (ptl) { @@ -5789,11 +5790,14 @@ static int mem_cgroup_count_precharge_pte_range(pmd_t *pmd, if (pmd_trans_unstable(pmd)) return 0; - pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl); + pte = pte_tryget_map_lock(vma->vm_mm, pmd, addr, &ptl); + if (!pte) + return 0; for (; addr != end; pte++, addr += PAGE_SIZE) if (get_mctgt_type(vma, addr, *pte, NULL)) mc.precharge++; /* increment precharge temporarily */ pte_unmap_unlock(pte - 1, ptl); + 
pte_put(vma->vm_mm, pmd, start); cond_resched(); return 0; @@ -5973,6 +5977,7 @@ static int mem_cgroup_move_charge_pte_range(pmd_t *pmd, enum mc_target_type target_type; union mc_target target; struct page *page; + unsigned long start = addr; ptl = pmd_trans_huge_lock(pmd, vma); if (ptl) { @@ -6008,7 +6013,9 @@ static int mem_cgroup_move_charge_pte_range(pmd_t *pmd, if (pmd_trans_unstable(pmd)) return 0; retry: - pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl); + pte = pte_tryget_map_lock(vma->vm_mm, pmd, addr, &ptl); + if (!pte) + return 0; for (; addr != end; addr += PAGE_SIZE) { pte_t ptent = *(pte++); bool device = false; @@ -6058,6 +6065,7 @@ static int mem_cgroup_move_charge_pte_range(pmd_t *pmd, } } pte_unmap_unlock(pte - 1, ptl); + pte_put(vma->vm_mm, pmd, start); cond_resched(); if (addr != end) { diff --git a/mm/memory-failure.c b/mm/memory-failure.c index f64ebb6226cb..6f281e827c32 100644 --- a/mm/memory-failure.c +++ b/mm/memory-failure.c @@ -331,10 +331,13 @@ static unsigned long dev_pagemap_mapping_shift(struct page *page, return 0; if (pmd_devmap(*pmd)) return PMD_SHIFT; - pte = pte_offset_map(pmd, address); + pte = pte_tryget_map(pmd, address); + if (!pte) + return 0; if (pte_present(*pte) && pte_devmap(*pte)) ret = PAGE_SHIFT; pte_unmap(pte); + pte_put(vma->vm_mm, pmd, address); return ret; } @@ -634,6 +637,7 @@ static int hwpoison_pte_range(pmd_t *pmdp, unsigned long addr, int ret = 0; pte_t *ptep, *mapped_pte; spinlock_t *ptl; + unsigned long start = addr; ptl = pmd_trans_huge_lock(pmdp, walk->vma); if (ptl) { @@ -645,8 +649,10 @@ static int hwpoison_pte_range(pmd_t *pmdp, unsigned long addr, if (pmd_trans_unstable(pmdp)) goto out; - mapped_pte = ptep = pte_offset_map_lock(walk->vma->vm_mm, pmdp, + mapped_pte = ptep = pte_tryget_map_lock(walk->vma->vm_mm, pmdp, addr, &ptl); + if (!ptep) + goto out; for (; addr != end; ptep++, addr += PAGE_SIZE) { ret = check_hwpoisoned_entry(*ptep, addr, PAGE_SHIFT, hwp->pfn, &hwp->tk); @@ -654,6 +660,7 @@ static int hwpoison_pte_range(pmd_t *pmdp, unsigned long addr, break; } pte_unmap_unlock(mapped_pte, ptl); + pte_put(walk->vma->vm_mm, pmdp, start); out: cond_resched(); return ret; diff --git a/mm/memory.c b/mm/memory.c index 5cc4ce0af665..e360ecd37a71 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -1165,7 +1165,8 @@ copy_pmd_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma, src_pmd = pmd_offset(src_pud, addr); do { next = pmd_addr_end(addr, end); - if (is_huge_pmd(*src_pmd)) { +retry: + if (is_huge_pmd(READ_ONCE(*src_pmd))) { int err; VM_BUG_ON_VMA(next-addr != HPAGE_PMD_SIZE, src_vma); err = copy_huge_pmd(dst_mm, src_mm, dst_pmd, src_pmd, @@ -1178,9 +1179,14 @@ copy_pmd_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma, } if (pmd_none_or_clear_bad(src_pmd)) continue; + if (pte_try_get(src_pmd)) + goto retry; if (copy_pte_range(dst_vma, src_vma, dst_pmd, src_pmd, - addr, next)) + addr, next)) { + pte_put(src_mm, src_pmd, addr); return -ENOMEM; + } + pte_put(src_mm, src_pmd, addr); } while (dst_pmd++, src_pmd++, addr = next, addr != end); return 0; } @@ -1494,7 +1500,10 @@ static inline unsigned long zap_pmd_range(struct mmu_gather *tlb, */ if (pmd_none_or_trans_huge_or_clear_bad(pmd)) goto next; + if (pte_try_get(pmd)) + goto next; next = zap_pte_range(tlb, vma, pmd, addr, next, details); + pte_put(tlb->mm, pmd, addr); next: cond_resched(); } while (pmd++, addr = next, addr != end); @@ -2606,18 +2615,26 @@ static int apply_to_pmd_range(struct mm_struct *mm, pud_t *pud, pmd = 
pmd_offset(pud, addr); } do { + pmd_t pmdval; + next = pmd_addr_end(addr, end); - if (pmd_none(*pmd) && !create) +retry: + pmdval = READ_ONCE(*pmd); + if (pmd_none(pmdval) && !create) continue; - if (WARN_ON_ONCE(pmd_leaf(*pmd))) + if (WARN_ON_ONCE(pmd_leaf(pmdval))) return -EINVAL; - if (!pmd_none(*pmd) && WARN_ON_ONCE(pmd_bad(*pmd))) { + if (!pmd_none(pmdval) && WARN_ON_ONCE(pmd_bad(pmdval))) { if (!create) continue; pmd_clear_bad(pmd); } + if (!create && pte_try_get(pmd)) + goto retry; err = apply_to_pte_range(mm, pmd, addr, next, fn, data, create, mask); + if (!create) + pte_put(mm, pmd, addr); if (err) break; } while (pmd++, addr = next, addr != end); @@ -4343,26 +4360,31 @@ static vm_fault_t do_fault(struct vm_fault *vmf) * If we find a migration pmd entry or a none pmd entry, which * should never happen, return SIGBUS */ - if (unlikely(!pmd_present(*vmf->pmd))) + if (unlikely(!pmd_present(*vmf->pmd))) { ret = VM_FAULT_SIGBUS; - else { - vmf->pte = pte_offset_map_lock(vmf->vma->vm_mm, + } else { + vmf->pte = pte_tryget_map_lock(vmf->vma->vm_mm, vmf->pmd, vmf->address, &vmf->ptl); - /* - * Make sure this is not a temporary clearing of pte - * by holding ptl and checking again. A R/M/W update - * of pte involves: take ptl, clearing the pte so that - * we don't have concurrent modification by hardware - * followed by an update. - */ - if (unlikely(pte_none(*vmf->pte))) + if (vmf->pte) { + /* + * Make sure this is not a temporary clearing of pte + * by holding ptl and checking again. A R/M/W update + * of pte involves: take ptl, clearing the pte so that + * we don't have concurrent modification by hardware + * followed by an update. + */ + if (unlikely(pte_none(*vmf->pte))) + ret = VM_FAULT_SIGBUS; + else + ret = VM_FAULT_NOPAGE; + + pte_unmap_unlock(vmf->pte, vmf->ptl); + pte_put(vma->vm_mm, vmf->pmd, vmf->address); + } else { ret = VM_FAULT_SIGBUS; - else - ret = VM_FAULT_NOPAGE; - - pte_unmap_unlock(vmf->pte, vmf->ptl); + } } } else if (!(vmf->flags & FAULT_FLAG_WRITE)) ret = do_read_fault(vmf); @@ -5016,13 +5038,22 @@ int follow_invalidate_pte(struct mm_struct *mm, unsigned long address, (address & PAGE_MASK) + PAGE_SIZE); mmu_notifier_invalidate_range_start(range); } - ptep = pte_offset_map_lock(mm, pmd, address, ptlp); + ptep = pte_tryget_map_lock(mm, pmd, address, ptlp); + if (!ptep) + goto out; if (!pte_present(*ptep)) goto unlock; + /* + * when we reach here, it means that the refcount of the pte is at least + * one and the contents of the PTE page table are stable until @ptlp is + * released, so we can put pte safely. 
+ */ + pte_put(mm, pmd, address); *ptepp = ptep; return 0; unlock: pte_unmap_unlock(ptep, *ptlp); + pte_put(mm, pmd, address); if (range) mmu_notifier_invalidate_range_end(range); out: diff --git a/mm/mempolicy.c b/mm/mempolicy.c index 774a3d3183a7..18e57ba515dc 100644 --- a/mm/mempolicy.c +++ b/mm/mempolicy.c @@ -508,6 +508,7 @@ static int queue_pages_pte_range(pmd_t *pmd, unsigned long addr, bool has_unmovable = false; pte_t *pte, *mapped_pte; spinlock_t *ptl; + unsigned long start = addr; ptl = pmd_trans_huge_lock(pmd, vma); if (ptl) { @@ -520,7 +521,9 @@ static int queue_pages_pte_range(pmd_t *pmd, unsigned long addr, if (pmd_trans_unstable(pmd)) return 0; - mapped_pte = pte = pte_offset_map_lock(walk->mm, pmd, addr, &ptl); + mapped_pte = pte = pte_tryget_map_lock(walk->mm, pmd, addr, &ptl); + if (!pte) + return 0; for (; addr != end; pte++, addr += PAGE_SIZE) { if (!pte_present(*pte)) continue; @@ -553,6 +556,7 @@ static int queue_pages_pte_range(pmd_t *pmd, unsigned long addr, break; } pte_unmap_unlock(mapped_pte, ptl); + pte_put(walk->mm, pmd, start); cond_resched(); if (has_unmovable) diff --git a/mm/migrate.c b/mm/migrate.c index c03ac25f42a9..5a234ddf36b1 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -323,8 +323,12 @@ void migration_entry_wait(struct mm_struct *mm, pmd_t *pmd, unsigned long address) { spinlock_t *ptl = pte_lockptr(mm, pmd); - pte_t *ptep = pte_offset_map(pmd, address); - __migration_entry_wait(mm, ptep, ptl); + pte_t *ptep = pte_tryget_map(pmd, address); + + if (ptep) { + __migration_entry_wait(mm, ptep, ptl); + pte_put(mm, pmd, address); + } } void migration_entry_wait_huge(struct vm_area_struct *vma, @@ -2249,21 +2253,23 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp, unsigned long addr = start, unmapped = 0; spinlock_t *ptl; pte_t *ptep; + pmd_t pmdval; again: - if (pmd_none(*pmdp)) + pmdval = READ_ONCE(*pmdp); + if (pmd_none(pmdval)) return migrate_vma_collect_hole(start, end, -1, walk); - if (pmd_trans_huge(*pmdp)) { + if (pmd_trans_huge(pmdval)) { struct page *page; ptl = pmd_lock(mm, pmdp); - if (unlikely(!pmd_trans_huge(*pmdp))) { + if (unlikely(!pmd_trans_huge(pmdval))) { spin_unlock(ptl); goto again; } - page = pmd_page(*pmdp); + page = pmd_page(pmdval); if (is_huge_zero_page(page)) { spin_unlock(ptl); split_huge_pmd(vma, pmdp, addr); @@ -2284,16 +2290,18 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp, if (ret) return migrate_vma_collect_skip(start, end, walk); - if (pmd_none(*pmdp)) + if (pmd_none(pmdval)) return migrate_vma_collect_hole(start, end, -1, walk); } } - if (unlikely(pmd_bad(*pmdp))) + if (unlikely(pmd_bad(pmdval))) return migrate_vma_collect_skip(start, end, walk); - ptep = pte_offset_map_lock(mm, pmdp, addr, &ptl); + ptep = pte_tryget_map_lock(mm, pmdp, addr, &ptl); + if (!ptep) + goto again; arch_enter_lazy_mmu_mode(); for (; addr < end; addr += PAGE_SIZE, ptep++) { @@ -2416,6 +2424,7 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp, } arch_leave_lazy_mmu_mode(); pte_unmap_unlock(ptep - 1, ptl); + pte_put(mm, pmdp, start); /* Only flush the TLB if we actually modified any entries */ if (unmapped) diff --git a/mm/mincore.c b/mm/mincore.c index 9122676b54d6..92e56cef2473 100644 --- a/mm/mincore.c +++ b/mm/mincore.c @@ -104,7 +104,9 @@ static int mincore_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end, pte_t *ptep; unsigned char *vec = walk->private; int nr = (end - addr) >> PAGE_SHIFT; + unsigned long start = addr; +retry: ptl = pmd_trans_huge_lock(pmd, vma); if (ptl) { memset(vec, 1, nr); @@ -117,7 +119,9 @@ static
int mincore_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end, goto out; } - ptep = pte_offset_map_lock(walk->mm, pmd, addr, &ptl); + ptep = pte_tryget_map_lock(walk->mm, pmd, addr, &ptl); + if (!ptep) + goto retry; for (; addr != end; ptep++, addr += PAGE_SIZE) { pte_t pte = *ptep; @@ -148,6 +152,7 @@ static int mincore_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end, vec++; } pte_unmap_unlock(ptep - 1, ptl); + pte_put(walk->mm, pmd, start); out: walk->private += nr; cond_resched(); diff --git a/mm/mprotect.c b/mm/mprotect.c index 2d5064a4631c..5c663270b816 100644 --- a/mm/mprotect.c +++ b/mm/mprotect.c @@ -234,9 +234,11 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma, pmd = pmd_offset(pud, addr); do { unsigned long this_pages; + pmd_t pmdval; next = pmd_addr_end(addr, end); - +retry: + pmdval = READ_ONCE(*pmd); /* * Automatic NUMA balancing walks the tables with mmap_lock * held for read. It's possible a parallel update to occur @@ -245,7 +247,7 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma, * Hence, it's necessary to atomically read the PMD value * for all the checks. */ - if (!is_swap_pmd(*pmd) && !pmd_devmap(*pmd) && + if (!is_swap_pmd(pmdval) && !pmd_devmap(pmdval) && pmd_none_or_clear_bad_unless_trans_huge(pmd)) goto next; @@ -257,7 +259,7 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma, mmu_notifier_invalidate_range_start(&range); } - if (is_huge_pmd(*pmd)) { + if (is_huge_pmd(pmdval)) { if (next - addr != HPAGE_PMD_SIZE) { __split_huge_pmd(vma, pmd, addr, false, NULL); } else { @@ -276,8 +278,11 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma, } /* fall through, the trans huge pmd just split */ } + if (pte_try_get(pmd)) + goto retry; this_pages = change_pte_range(vma, pmd, addr, next, newprot, cp_flags); + pte_put(vma->vm_mm, pmd, addr); pages += this_pages; next: cond_resched(); diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c index f7b331081791..4725a2f78f09 100644 --- a/mm/page_vma_mapped.c +++ b/mm/page_vma_mapped.c @@ -211,6 +211,7 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw) } pvmw->pmd = pmd_offset(pud, pvmw->address); +retry: /* * Make sure the pmd value isn't cached in a register by the * compiler and used as a stale value after we've observed a @@ -258,6 +259,8 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw) step_forward(pvmw, PMD_SIZE); continue; } + if (pte_try_get(pvmw->pmd)) + goto retry; if (!map_pte(pvmw)) goto next_pte; this_pte: @@ -275,6 +278,7 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw) pvmw->ptl = NULL; } pte_unmap(pvmw->pte); + pte_put(pvmw->vma->vm_mm, pvmw->pmd, pvmw->address); pvmw->pte = NULL; goto restart; } diff --git a/mm/pagewalk.c b/mm/pagewalk.c index 9b3db11a4d1d..72074a34beea 100644 --- a/mm/pagewalk.c +++ b/mm/pagewalk.c @@ -110,6 +110,7 @@ static int walk_pmd_range(pud_t *pud, unsigned long addr, unsigned long end, do { again: next = pmd_addr_end(addr, end); +retry: if (pmd_none(*pmd) || (!walk->vma && !walk->no_vma)) { if (ops->pte_hole) err = ops->pte_hole(addr, next, depth, walk); @@ -147,10 +148,18 @@ static int walk_pmd_range(pud_t *pud, unsigned long addr, unsigned long end, goto again; } - if (is_hugepd(__hugepd(pmd_val(*pmd)))) + if (is_hugepd(__hugepd(pmd_val(*pmd)))) { err = walk_hugepd_range((hugepd_t *)pmd, addr, next, walk, PMD_SHIFT); - else - err = walk_pte_range(pmd, addr, next, walk); + } else { + if (!walk->no_vma) { + if 
(pte_try_get(pmd)) + goto retry; + err = walk_pte_range(pmd, addr, next, walk); + pte_put(walk->mm, pmd, addr); + } else { + err = walk_pte_range(pmd, addr, next, walk); + } + } if (err) break; } while (pmd++, addr = next, addr != end); diff --git a/mm/swapfile.c b/mm/swapfile.c index e59e08ef46e1..175b35fec758 100644 --- a/mm/swapfile.c +++ b/mm/swapfile.c @@ -2023,8 +2023,11 @@ static inline int unuse_pmd_range(struct vm_area_struct *vma, pud_t *pud, next = pmd_addr_end(addr, end); if (pmd_none_or_trans_huge_or_clear_bad(pmd)) continue; + if (pte_try_get(pmd)) + continue; ret = unuse_pte_range(vma, pmd, addr, next, type, frontswap, fs_pages_to_unuse); + pte_put(vma->vm_mm, pmd, addr); if (ret) return ret; } while (pmd++, addr = next, addr != end); From patchwork Wed Nov 10 10:54:25 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Qi Zheng X-Patchwork-Id: 12611837 From: Qi Zheng To: akpm@linux-foundation.org, tglx@linutronix.de, kirill.shutemov@linux.intel.com, mika.penttila@nextfour.com, david@redhat.com, jgg@nvidia.com Cc: linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, songmuchun@bytedance.com, zhouchengming@bytedance.com, Qi Zheng Subject: [PATCH v3 12/15] mm/pte_ref: update the pmd entry in move_normal_pmd() Date: Wed, 10 Nov 2021 18:54:25 +0800 Message-Id: <20211110105428.32458-13-zhengqi.arch@bytedance.com> X-Mailer: git-send-email 2.24.3 (Apple Git-128) In-Reply-To: <20211110105428.32458-1-zhengqi.arch@bytedance.com> References: <20211110105428.32458-1-zhengqi.arch@bytedance.com> MIME-Version: 1.0 The ->pmd member records the pmd entry that maps the user PTE page table page. When that pmd entry changes, ->pmd needs to be updated synchronously.
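[ For reference, the helper this change relies on is defined later in this series (patch 13, include/linux/pte_ref.h); a minimal sketch of it is shown below purely to illustrate the bookkeeping, using the pgtable_t back-pointer field introduced by the pte_ref patches. ]

	/*
	 * Repoint the PTE page's back-pointer at the pmd slot that now maps
	 * it, so pte_to_pmd() keeps resolving to the current pmd entry.
	 */
	static inline void pte_update_pmd(pmd_t old_pmd, pmd_t *new_pmd)
	{
		pmd_pgtable(old_pmd)->pmd = new_pmd;
	}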
Signed-off-by: Qi Zheng --- mm/mremap.c | 1 + 1 file changed, 1 insertion(+) diff --git a/mm/mremap.c b/mm/mremap.c index 088a7a75cb4b..4661cdec79dc 100644 --- a/mm/mremap.c +++ b/mm/mremap.c @@ -278,6 +278,7 @@ static bool move_normal_pmd(struct vm_area_struct *vma, unsigned long old_addr, VM_BUG_ON(!pmd_none(*new_pmd)); pmd_populate(mm, new_pmd, pmd_pgtable(pmd)); + pte_update_pmd(pmd, new_pmd); flush_tlb_range(vma, old_addr, old_addr + PMD_SIZE); if (new_ptl != old_ptl) spin_unlock(new_ptl); From patchwork Wed Nov 10 10:54:26 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Qi Zheng X-Patchwork-Id: 12611775 From: Qi Zheng To: akpm@linux-foundation.org, tglx@linutronix.de, kirill.shutemov@linux.intel.com, mika.penttila@nextfour.com, david@redhat.com, jgg@nvidia.com Cc: linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, songmuchun@bytedance.com, zhouchengming@bytedance.com, Qi Zheng Subject: [PATCH v3 13/15] mm/pte_ref: free user PTE page table pages Date: Wed, 10 Nov 2021 18:54:26 +0800 Message-Id: <20211110105428.32458-14-zhengqi.arch@bytedance.com> X-Mailer: git-send-email 2.24.3 (Apple Git-128) In-Reply-To: <20211110105428.32458-1-zhengqi.arch@bytedance.com> References: <20211110105428.32458-1-zhengqi.arch@bytedance.com> MIME-Version: 1.0 This commit introduces the CONFIG_FREE_USER_PTE option and implements the actual logic behind the dummy pte_ref helpers. Signed-off-by: Qi Zheng Reported-by: kernel test robot --- include/linux/mm.h | 2 ++ include/linux/pgtable.h | 3 +- include/linux/pte_ref.h | 53 ++++++++++++++++++++++++---- mm/Kconfig | 4 +++ mm/debug_vm_pgtable.c | 2 ++ mm/memory.c | 15 ++++++++ mm/pte_ref.c | 91 +++++++++++++++++++++++++++++++++++++++++++++---- 7 files changed, 156 insertions(+), 14 deletions(-) diff --git a/include/linux/mm.h b/include/linux/mm.h index 753a9435e0d0..18fbf9e0996a 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -437,6 +437,7 @@ extern pgprot_t protection_map[16]; * @FAULT_FLAG_REMOTE: The fault is not for current task/mm. * @FAULT_FLAG_INSTRUCTION: The fault was during an instruction fetch. * @FAULT_FLAG_INTERRUPTIBLE: The fault can be interrupted by non-fatal signals. + * @FAULT_FLAG_PTE_GET: This means the refcount of the pte has been got.
* * About @FAULT_FLAG_ALLOW_RETRY and @FAULT_FLAG_TRIED: we can specify * whether we would allow page faults to retry by specifying these two @@ -468,6 +469,7 @@ enum fault_flag { FAULT_FLAG_REMOTE = 1 << 7, FAULT_FLAG_INSTRUCTION = 1 << 8, FAULT_FLAG_INTERRUPTIBLE = 1 << 9, + FAULT_FLAG_PTE_GET = 1 << 10, }; /* diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h index c8f045705c1e..6ac51d58f11a 100644 --- a/include/linux/pgtable.h +++ b/include/linux/pgtable.h @@ -480,7 +480,6 @@ static inline pte_t ptep_get_lockless(pte_t *ptep) } #endif /* CONFIG_GUP_GET_PTE_LOW_HIGH */ -#ifdef CONFIG_TRANSPARENT_HUGEPAGE #ifndef __HAVE_ARCH_PMDP_HUGE_GET_AND_CLEAR static inline pmd_t pmdp_huge_get_and_clear(struct mm_struct *mm, unsigned long address, @@ -491,6 +490,8 @@ static inline pmd_t pmdp_huge_get_and_clear(struct mm_struct *mm, return pmd; } #endif /* __HAVE_ARCH_PMDP_HUGE_GET_AND_CLEAR */ + +#ifdef CONFIG_TRANSPARENT_HUGEPAGE #ifndef __HAVE_ARCH_PUDP_HUGE_GET_AND_CLEAR static inline pud_t pudp_huge_get_and_clear(struct mm_struct *mm, unsigned long address, diff --git a/include/linux/pte_ref.h b/include/linux/pte_ref.h index b6d8335bdc59..8a26eaba83ef 100644 --- a/include/linux/pte_ref.h +++ b/include/linux/pte_ref.h @@ -8,6 +8,7 @@ #define _LINUX_PTE_REF_H #include +#include enum pte_tryget_type { TRYGET_SUCCESSED, @@ -16,12 +17,49 @@ enum pte_tryget_type { TRYGET_FAILED_HUGE_PMD, }; -bool pte_get_unless_zero(pmd_t *pmd); -enum pte_tryget_type pte_try_get(pmd_t *pmd); void pte_put_vmf(struct vm_fault *vmf); +enum pte_tryget_type pte_try_get(pmd_t *pmd); +bool pte_get_unless_zero(pmd_t *pmd); + +#ifdef CONFIG_FREE_USER_PTE +void free_user_pte_table(struct mm_struct *mm, pmd_t *pmdp, unsigned long addr); static inline void pte_ref_init(pgtable_t pte, pmd_t *pmd, int count) { + pte->pmd = pmd; + atomic_set(&pte->pte_refcount, count); +} + +static inline pmd_t *pte_to_pmd(pte_t *pte) +{ + return virt_to_page(pte)->pmd; +} + +static inline void pte_update_pmd(pmd_t old_pmd, pmd_t *new_pmd) +{ + pmd_pgtable(old_pmd)->pmd = new_pmd; +} + +static inline void pte_get_many(pmd_t *pmd, unsigned int nr) +{ + pgtable_t pte = pmd_pgtable(*pmd); + + VM_BUG_ON(!PageTable(pte)); + atomic_add(nr, &pte->pte_refcount); +} + +static inline void pte_put_many(struct mm_struct *mm, pmd_t *pmd, + unsigned long addr, unsigned int nr) +{ + pgtable_t pte = pmd_pgtable(*pmd); + + VM_BUG_ON(!PageTable(pte)); + if (atomic_sub_and_test(nr, &pte->pte_refcount)) + free_user_pte_table(mm, pmd, addr & PMD_MASK); +} +#else +static inline void pte_ref_init(pgtable_t pte, pmd_t *pmd, int count) +{ } static inline pmd_t *pte_to_pmd(pte_t *pte) @@ -37,6 +75,12 @@ static inline void pte_get_many(pmd_t *pmd, unsigned int nr) { } +static inline void pte_put_many(struct mm_struct *mm, pmd_t *pmd, + unsigned long addr, unsigned int nr) +{ +} +#endif /* CONFIG_FREE_USER_PTE */ + /* * pte_get - Increment refcount for the PTE page table. * @pmd: a pointer to the pmd entry corresponding to the PTE page table. @@ -66,11 +110,6 @@ static inline pte_t *pte_tryget_map_lock(struct mm_struct *mm, pmd_t *pmd, return pte_offset_map_lock(mm, pmd, address, ptlp); } -static inline void pte_put_many(struct mm_struct *mm, pmd_t *pmd, - unsigned long addr, unsigned int nr) -{ -} - /* * pte_put - Decrement refcount for the PTE page table. * @mm: the mm_struct of the target address space. 
diff --git a/mm/Kconfig b/mm/Kconfig index 5c5508fafcec..44549d287869 100644 --- a/mm/Kconfig +++ b/mm/Kconfig @@ -898,6 +898,10 @@ config IO_MAPPING config SECRETMEM def_bool ARCH_HAS_SET_DIRECT_MAP && !EMBEDDED +config FREE_USER_PTE + def_bool y + depends on X86_64 + source "mm/damon/Kconfig" endmenu diff --git a/mm/debug_vm_pgtable.c b/mm/debug_vm_pgtable.c index 52f006654664..757dd84ee254 100644 --- a/mm/debug_vm_pgtable.c +++ b/mm/debug_vm_pgtable.c @@ -1049,8 +1049,10 @@ static void __init destroy_args(struct pgtable_debug_args *args) /* Free page table entries */ if (args->start_ptep) { pte_put(args->mm, args->start_pmdp, args->vaddr); +#ifndef CONFIG_FREE_USER_PTE pte_free(args->mm, args->start_ptep); mm_dec_nr_ptes(args->mm); +#endif } if (args->start_pmdp) { diff --git a/mm/memory.c b/mm/memory.c index e360ecd37a71..4d1ede78d1b0 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -219,6 +219,17 @@ static void check_sync_rss_stat(struct task_struct *task) #endif /* SPLIT_RSS_COUNTING */ +#ifdef CONFIG_FREE_USER_PTE +static inline void free_pte_range(struct mmu_gather *tlb, pmd_t *pmd, + unsigned long addr) +{ + /* + * We should never reach here since the PTE page tables are + * dynamically freed. + */ + BUG(); +} +#else /* * Note: this doesn't free the actual pages themselves. That * has been handled earlier when unmapping all the memory regions. @@ -231,6 +242,7 @@ static void free_pte_range(struct mmu_gather *tlb, pmd_t *pmd, pte_free_tlb(tlb, token, addr); mm_dec_nr_ptes(tlb->mm); } +#endif /* CONFIG_FREE_USER_PTE */ static inline void free_pmd_range(struct mmu_gather *tlb, pud_t *pud, unsigned long addr, unsigned long end, @@ -4631,6 +4643,9 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf) goto retry; vmf->orig_pte = *vmf->pte; + if (IS_ENABLED(CONFIG_FREE_USER_PTE)) + vmf->flags |= FAULT_FLAG_PTE_GET; + /* * some architectures can have larger ptes than wordsize, * e.g.ppc44x-defconfig has CONFIG_PTE_64BIT=y and diff --git a/mm/pte_ref.c b/mm/pte_ref.c index de109905bc8f..728e61cea25e 100644 --- a/mm/pte_ref.c +++ b/mm/pte_ref.c @@ -7,7 +7,10 @@ #include #include +#include +#include +#ifdef CONFIG_FREE_USER_PTE /* * pte_get_unless_zero - Increment refcount for the PTE page table * unless it is zero. 
@@ -15,7 +18,10 @@ */ bool pte_get_unless_zero(pmd_t *pmd) { - return true; + pgtable_t pte = pmd_pgtable(*pmd); + + VM_BUG_ON(!PageTable(pte)); + return atomic_inc_not_zero(&pte->pte_refcount); } /* @@ -32,12 +38,20 @@ bool pte_get_unless_zero(pmd_t *pmd) */ enum pte_tryget_type pte_try_get(pmd_t *pmd) { - if (unlikely(pmd_none(*pmd))) - return TRYGET_FAILED_NONE; - if (unlikely(is_huge_pmd(*pmd))) - return TRYGET_FAILED_HUGE_PMD; + int retval = TRYGET_SUCCESSED; + pmd_t pmdval; - return TRYGET_SUCCESSED; + rcu_read_lock(); + pmdval = READ_ONCE(*pmd); + if (unlikely(pmd_none(pmdval))) + retval = TRYGET_FAILED_NONE; + else if (unlikely(is_huge_pmd(pmdval))) + retval = TRYGET_FAILED_HUGE_PMD; + else if (!pte_get_unless_zero(&pmdval)) + retval = TRYGET_FAILED_ZERO; + rcu_read_unlock(); + + return retval; } /* @@ -52,4 +66,69 @@ enum pte_tryget_type pte_try_get(pmd_t *pmd) */ void pte_put_vmf(struct vm_fault *vmf) { + if (!(vmf->flags & FAULT_FLAG_PTE_GET)) + return; + vmf->flags &= ~FAULT_FLAG_PTE_GET; + + pte_put(vmf->vma->vm_mm, vmf->pmd, vmf->address); +} +#else +bool pte_get_unless_zero(pmd_t *pmd) +{ + return true; +} + +enum pte_tryget_type pte_try_get(pmd_t *pmd) +{ + if (unlikely(pmd_none(*pmd))) + return TRYGET_FAILED_NONE; + + if (unlikely(is_huge_pmd(*pmd))) + return TRYGET_FAILED_HUGE_PMD; + + return TRYGET_SUCCESSED; +} + +void pte_put_vmf(struct vm_fault *vmf) +{ +} +#endif /* CONFIG_FREE_USER_PTE */ + +#ifdef CONFIG_DEBUG_VM +static void pte_free_debug(pmd_t pmd) +{ + pte_t *ptep = (pte_t *)pmd_page_vaddr(pmd); + int i = 0; + + for (i = 0; i < PTRS_PER_PTE; i++) + BUG_ON(!pte_none(*ptep++)); +} +#else +static inline void pte_free_debug(pmd_t pmd) +{ +} +#endif + +static void pte_free_rcu(struct rcu_head *rcu) +{ + struct page *page = container_of(rcu, struct page, rcu_head); + + pgtable_pte_page_dtor(page); + __free_page(page); +} + +void free_user_pte_table(struct mm_struct *mm, pmd_t *pmd, unsigned long addr) +{ + struct vm_area_struct vma = TLB_FLUSH_VMA(mm, 0); + spinlock_t *ptl; + pmd_t pmdval; + + ptl = pmd_lock(mm, pmd); + pmdval = pmdp_huge_get_and_clear(mm, addr, pmd); + flush_tlb_range(&vma, addr, addr + PMD_SIZE); + spin_unlock(ptl); + + pte_free_debug(pmdval); + mm_dec_nr_ptes(mm); + call_rcu(&pmd_pgtable(pmdval)->rcu_head, pte_free_rcu); } From patchwork Wed Nov 10 10:54:27 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Qi Zheng X-Patchwork-Id: 12611777 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 558AAC433EF for ; Wed, 10 Nov 2021 10:56:02 +0000 (UTC) Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by mail.kernel.org (Postfix) with ESMTP id EF46D611AD for ; Wed, 10 Nov 2021 10:56:01 +0000 (UTC) DMARC-Filter: OpenDMARC Filter v1.4.1 mail.kernel.org EF46D611AD Authentication-Results: mail.kernel.org; dmarc=fail (p=none dis=none) header.from=bytedance.com Authentication-Results: mail.kernel.org; spf=pass smtp.mailfrom=kvack.org Received: by kanga.kvack.org (Postfix) id 9804E6B0071; Wed, 10 Nov 2021 05:56:01 -0500 (EST) Received: by kanga.kvack.org (Postfix, from userid 40) id 906CB6B0080; Wed, 10 Nov 2021 05:56:01 -0500 (EST) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 7A8886B0081; Wed, 10 Nov 2021 05:56:01 -0500 (EST) X-Delivered-To: 
linux-mm@kvack.org
From: Qi Zheng
To: akpm@linux-foundation.org, tglx@linutronix.de, kirill.shutemov@linux.intel.com, mika.penttila@nextfour.com, david@redhat.com, jgg@nvidia.com
Cc: linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, songmuchun@bytedance.com, zhouchengming@bytedance.com, Qi Zheng
Subject: [PATCH v3 14/15] Documentation: add document for pte_ref
Date: Wed, 10 Nov 2021 18:54:27 +0800
Message-Id: <20211110105428.32458-15-zhengqi.arch@bytedance.com>
X-Mailer: git-send-email 2.24.3 (Apple Git-128)
In-Reply-To: <20211110105428.32458-1-zhengqi.arch@bytedance.com>
References: <20211110105428.32458-1-zhengqi.arch@bytedance.com>
MIME-Version: 1.0
This commit adds a document for pte_ref under `Documentation/vm/`.

Signed-off-by: Qi Zheng
---
 Documentation/vm/pte_ref.rst | 212 +++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 212 insertions(+)
 create mode 100644 Documentation/vm/pte_ref.rst

diff --git a/Documentation/vm/pte_ref.rst b/Documentation/vm/pte_ref.rst
new file mode 100644
index 000000000000..c5323a263464
--- /dev/null
+++ b/Documentation/vm/pte_ref.rst
@@ -0,0 +1,212 @@
+.. _pte_ref:
+
+============================================================================
+pte_ref: Tracking how many references to each user PTE page table page
+============================================================================
+
+.. contents:: :local:
+
+1. Preface
+==========
+
+In order to pursue high performance, applications mostly use some
+high-performance user-mode memory allocators, such as jemalloc or tcmalloc.
+These memory allocators use ``madvise(MADV_DONTNEED or MADV_FREE)`` to release
+physical memory for the following reasons::
+
+  First of all, we should hold as few write locks of mmap_lock as possible, since
+  the mmap_lock semaphore has long been a contention point in the memory
+  management subsystem. The mmap()/munmap() paths hold the write lock, and the
+  madvise(MADV_DONTNEED or MADV_FREE) path holds the read lock, so using madvise()
+  instead of munmap() to release physical memory reduces contention on
+  the mmap_lock.
+
+  Secondly, after using madvise() to release physical memory, there is no need to
+  build VMAs and allocate page tables again when accessing the same virtual
+  address again, which can also save some time.
+
+The following is the largest amount of user PTE page table memory that can be
+allocated by a single user process in a 32-bit and a 64-bit system.
+
++---------------------------+--------+---------+
+|                           | 32-bit | 64-bit  |
++===========================+========+=========+
+| user PTE page table pages | 3 MiB  | 512 GiB |
++---------------------------+--------+---------+
+| user PMD page table pages | 3 KiB  | 1 GiB   |
++---------------------------+--------+---------+
+
+(for 32-bit, take 3G user address space, 4K page size as an example; for 64-bit,
+take 48-bit address width, 4K page size as an example.)
+
+After using ``madvise()``, everything looks good, but as can be seen from the
+above table, a single process can create a large number of PTE page tables on a
+64-bit system, since neither ``MADV_DONTNEED`` nor ``MADV_FREE`` will
+release page table memory. And before the process exits or calls ``munmap()``,
+the kernel cannot reclaim these pages even if these PTE page tables do not map
+anything.
+
+Therefore, we decided to introduce a reference count to manage the PTE page
+table life cycle, so that some free PTE page table memory in the system can be
+dynamically released.
+
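(Editorial illustration, not part of this patch: the allocator behaviour described
above needs nothing beyond the standard mmap()/madvise() interface. The userspace
sketch below assumes a 4 KiB page size; it touches a large anonymous mapping and
then calls madvise(MADV_DONTNEED), after which the physical pages are returned to
the kernel but the PTE page tables built for the mapping stay allocated until
munmap() or process exit.)::

    #include <stddef.h>
    #include <sys/mman.h>

    #define SZ (1UL << 30)  /* 1 GiB of anonymous memory */

    int main(void)
    {
            unsigned char *p = mmap(NULL, SZ, PROT_READ | PROT_WRITE,
                                    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
            if (p == MAP_FAILED)
                    return 1;

            /* Touch one byte per 4 KiB page: this allocates physical pages
             * and the user PTE page tables that map them. */
            for (size_t off = 0; off < SZ; off += 4096)
                    p[off] = 1;

            /* The physical pages are given back to the kernel here, but the
             * PTE page tables that mapped them are not. */
            madvise(p, SZ, MADV_DONTNEED);

            /* Without CONFIG_FREE_USER_PTE, only munmap() (or process exit)
             * frees those page tables. */
            munmap(p, SZ);
            return 0;
    }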
+2. The reference count of user PTE page table pages
+===================================================
+
+We introduce two members into the ``struct page`` of the user PTE page table
+page::
+
+        union {
+                pgtable_t pmd_huge_pte;          /* protected by page->ptl */
+                pmd_t *pmd;                      /* PTE page only */
+        };
+        union {
+                struct mm_struct *pt_mm;         /* x86 pgds only */
+                atomic_t pt_frag_refcount;       /* powerpc */
+                atomic_t pte_refcount;           /* PTE page only */
+        };
+
+The ``pmd`` member records the pmd entry that maps the user PTE page table page,
+and the ``pte_refcount`` member keeps track of how many references to the user PTE
+page table page are held.
+
+The following users will hold a reference on the user PTE page table page::
+
+  Any !pte_none() entry, such as a regular page table entry that maps a physical
+  page, a swap entry, a migration entry, etc.
+
+  Any visitor to the PTE page table entries, such as a page table walker.
+
+Any ``!pte_none()`` entry and any visitor can be regarded as a user of its PTE
+page table page. When the ``pte_refcount`` drops to 0, it means that no one
+is using the PTE page table page, so this free PTE page table page can be
+released back to the system at this time.
+
+3. Competitive relationship
+===========================
+
+Currently, the user page table will only be released by calling ``free_pgtables()``
+when the process exits or ``unmap_region()`` is called (e.g. the ``munmap()`` path).
+So other threads only need to ensure mutual exclusion with these paths to ensure
+that the page table is not released. For example::
+
+        thread A                        thread B
+        page table walker               munmap
+        =================               ======
+
+        mmap_read_lock()
+        if (!pte_none() && pte_present() && !pmd_trans_unstable()) {
+                pte_offset_map_lock()
+                *walk page table*
+                pte_unmap_unlock()
+        }
+        mmap_read_unlock()
+
+                                        mmap_write_lock_killable()
+                                        detach_vmas_to_be_unmapped()
+                                        unmap_region()
+                                        --> free_pgtables()
+
+But after we introduce the reference count for the user PTE page table page,
+this existing balance will be broken. The page can be released at any time
+when its ``pte_refcount`` is reduced to 0. Therefore, the following case may
+happen::
+
+        thread A                thread B                        thread C
+        page table walker       madvise(MADV_DONTNEED)          page fault
+        =================       ======================          ==========
+
+        mmap_read_lock()
+        if (!pte_none() && pte_present() && !pmd_trans_unstable()) {
+
+                                mmap_read_lock()
+                                unmap_page_range()
+                                --> zap_pte_range()
+                                *the pte_refcount is reduced to 0*
+                                --> *free PTE page table page*
+
+                /* broken!! */                                  mmap_read_lock()
+                pte_offset_map_lock()
+
+As we can see, threads A, B and C all hold the read lock of mmap_lock, so they
+can execute concurrently. When thread B releases the PTE page table page,
+the value in the corresponding pmd entry becomes unstable; it may be
+none or a huge pmd, or it may map a new PTE page table page again. This will
+cause system chaos and even a panic.
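(Editor's sketch, not code from this series: a path that clears entries, such as
the zap path above, is both a visitor and a remover of ``!pte_none()`` entries,
so it accumulates one reference drop per entry it cleared and releases them all
at the end. The helper name zap_ptes_sketch() is made up; locking of the pmd,
TLB handling and flags are omitted.)::

    #include <linux/mm.h>
    #include <linux/pte_ref.h>

    static void zap_ptes_sketch(struct mm_struct *mm, pmd_t *pmd,
                                unsigned long addr, unsigned long end)
    {
            unsigned long start = addr;
            unsigned int nr_put = 0;
            pte_t *start_pte, *pte;
            spinlock_t *ptl;

            start_pte = pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
            do {
                    if (pte_none(*pte))
                            continue;
                    /* Clearing a !pte_none() entry removes one user of the
                     * PTE page, so remember to drop its reference below. */
                    ptep_get_and_clear(mm, addr, pte);
                    nr_put++;
            } while (pte++, addr += PAGE_SIZE, addr != end);
            pte_unmap_unlock(start_pte, ptl);

            /* Drop all of the references at once; this may free the page. */
            if (nr_put)
                    pte_put_many(mm, pmd, start, nr_put);
    }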
+
+So as described in the section "The reference count of user PTE page table
+pages", we need to try to take a reference to the PTE page table page before
+walking the page table, and then the system will become orderly again::
+
+        thread A                        thread B
+        page table walker               madvise(MADV_DONTNEED)
+        =================               ======================
+
+        mmap_read_lock()
+        if (!pte_none() && pte_present() && !pmd_trans_unstable()) {
+                pte_try_get()
+                --> pte_get_unless_zero
+                *if successful, then:*
+
+                                        mmap_read_lock()
+                                        unmap_page_range()
+                                        --> zap_pte_range()
+                                        *the pte_refcount is reduced to 1*
+
+                pte_offset_map_lock()
+                *walk page table*
+                pte_unmap_unlock()
+                pte_put()
+                --> *the pte_refcount is reduced to 0*
+                --> *free PTE page table page*
+
+There is also a lock-less scenario (such as fast GUP). Fortunately, we don't need
+to do any additional operations to ensure that the system is in order. Take fast
+GUP as an example::
+
+        thread A                        thread B
+        fast GUP                        madvise(MADV_DONTNEED)
+        ========                        ======================
+
+        get_user_pages_fast_only()
+        --> local_irq_save();
+                                        *free PTE page table page*
+                                        --> unhook page
+                                        /* The CPU running thread A has disabled
+                                         * local interrupts and cannot respond to
+                                         * the IPI, so it will block here */
+                                        TLB invalidate page
+        gup_pgd_range();
+        local_irq_restore();
+                                        *free page*
+
+4. Helpers
+==========
+
++---------------------+-------------------------------------------------+
+| pte_ref_init        | Initialize the pte_refcount and pmd             |
++---------------------+-------------------------------------------------+
+| pte_to_pmd          | Get the corresponding pmd                       |
++---------------------+-------------------------------------------------+
+| pte_update_pmd      | Update the corresponding pmd                    |
++---------------------+-------------------------------------------------+
+| pte_get             | Increment a pte_refcount                        |
++---------------------+-------------------------------------------------+
+| pte_get_many        | Add a value to a pte_refcount                   |
++---------------------+-------------------------------------------------+
+| pte_get_unless_zero | Increment a pte_refcount unless it is 0         |
++---------------------+-------------------------------------------------+
+| pte_try_get         | Try to increment a pte_refcount                 |
++---------------------+-------------------------------------------------+
+| pte_tryget_map      | Try to increment a pte_refcount before          |
+|                     | pte_offset_map()                                |
++---------------------+-------------------------------------------------+
+| pte_tryget_map_lock | Try to increment a pte_refcount before          |
+|                     | pte_offset_map_lock()                           |
++---------------------+-------------------------------------------------+
+| pte_put             | Decrement a pte_refcount                        |
++---------------------+-------------------------------------------------+
+| pte_put_many        | Sub a value to a pte_refcount                   |
++---------------------+-------------------------------------------------+
+| pte_put_vmf         | Decrement a pte_refcount in the page fault path |
++---------------------+-------------------------------------------------+
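(Editor's sketch tying the helpers above together: the walker pattern from
section 3, written as a hypothetical caller. It assumes pte_tryget_map_lock()
returns NULL when the reference cannot be taken, e.g. because the pmd entry is
none or huge, or the PTE page has already been freed; the function name
walk_one_pmd_sketch() is invented for illustration.)::

    #include <linux/mm.h>
    #include <linux/pte_ref.h>

    static void walk_one_pmd_sketch(struct mm_struct *mm, pmd_t *pmd,
                                    unsigned long addr, unsigned long end)
    {
            unsigned long start = addr;
            pte_t *start_pte, *pte;
            spinlock_t *ptl;

            /* Take the visitor reference and map the PTE page in one step. */
            start_pte = pte = pte_tryget_map_lock(mm, pmd, addr, &ptl);
            if (!pte)
                    return;

            do {
                    /* ... inspect *pte here; the PTE page table page cannot
                     * disappear while the reference taken above is held ... */
            } while (pte++, addr += PAGE_SIZE, addr != end);
            pte_unmap_unlock(start_pte, ptl);

            /* Drop the visitor reference; this may free the PTE page. */
            pte_put(mm, pmd, start);
    }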
From patchwork Wed Nov 10 10:54:28 2021
X-Patchwork-Submitter: Qi Zheng
X-Patchwork-Id: 12611779
From: Qi Zheng
To: akpm@linux-foundation.org, tglx@linutronix.de, kirill.shutemov@linux.intel.com, mika.penttila@nextfour.com, david@redhat.com, jgg@nvidia.com
Cc: linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, songmuchun@bytedance.com, zhouchengming@bytedance.com, Qi Zheng
Subject: [PATCH v3 15/15] mm/pte_ref: use mmu_gather to free PTE page table pages
Date: Wed, 10 Nov 2021 18:54:28 +0800
Message-Id: <20211110105428.32458-16-zhengqi.arch@bytedance.com>
X-Mailer: git-send-email 2.24.3 (Apple Git-128)
In-Reply-To: <20211110105428.32458-1-zhengqi.arch@bytedance.com>
References: <20211110105428.32458-1-zhengqi.arch@bytedance.com>
MIME-Version: 1.0

In unmap_region() and other paths, we can reuse @tlb to free the PTE page
tables, which reduces the number of TLB flushes.

Signed-off-by: Qi Zheng
---
 Documentation/vm/pte_ref.rst | 58 +++++++++++++++++++++++---------------------
 arch/x86/Kconfig             |  2 +-
 include/linux/pte_ref.h      | 34 ++++++++++++++++++++------
 mm/madvise.c                 |  4 +--
 mm/memory.c                  |  4 +--
 mm/mmu_gather.c              | 40 +++++++++++++-----------------
 mm/pte_ref.c                 | 13 +++++++---
 7 files changed, 90 insertions(+), 65 deletions(-)

diff --git a/Documentation/vm/pte_ref.rst b/Documentation/vm/pte_ref.rst
index c5323a263464..d304c0bfaae1 100644
--- a/Documentation/vm/pte_ref.rst
+++ b/Documentation/vm/pte_ref.rst
@@ -183,30 +183,34 @@ GUP as an example::
 4.
Helpers ========== -+---------------------+-------------------------------------------------+ -| pte_ref_init | Initialize the pte_refcount and pmd | -+---------------------+-------------------------------------------------+ -| pte_to_pmd | Get the corresponding pmd | -+---------------------+-------------------------------------------------+ -| pte_update_pmd | Update the corresponding pmd | -+---------------------+-------------------------------------------------+ -| pte_get | Increment a pte_refcount | -+---------------------+-------------------------------------------------+ -| pte_get_many | Add a value to a pte_refcount | -+---------------------+-------------------------------------------------+ -| pte_get_unless_zero | Increment a pte_refcount unless it is 0 | -+---------------------+-------------------------------------------------+ -| pte_try_get | Try to increment a pte_refcount | -+---------------------+-------------------------------------------------+ -| pte_tryget_map | Try to increment a pte_refcount before | -| | pte_offset_map() | -+---------------------+-------------------------------------------------+ -| pte_tryget_map_lock | Try to increment a pte_refcount before | -| | pte_offset_map_lock() | -+---------------------+-------------------------------------------------+ -| pte_put | Decrement a pte_refcount | -+---------------------+-------------------------------------------------+ -| pte_put_many | Sub a value to a pte_refcount | -+---------------------+-------------------------------------------------+ -| pte_put_vmf | Decrement a pte_refcount in the page fault path | -+---------------------+-------------------------------------------------+ ++---------------------+------------------------------------------------------+ +| pte_ref_init | Initialize the pte_refcount and pmd | ++---------------------+------------------------------------------------------+ +| pte_to_pmd | Get the corresponding pmd | ++---------------------+------------------------------------------------------+ +| pte_update_pmd | Update the corresponding pmd | ++---------------------+------------------------------------------------------+ +| pte_get | Increment a pte_refcount | ++---------------------+------------------------------------------------------+ +| pte_get_many | Add a value to a pte_refcount | ++---------------------+------------------------------------------------------+ +| pte_get_unless_zero | Increment a pte_refcount unless it is 0 | ++---------------------+------------------------------------------------------+ +| pte_try_get | Try to increment a pte_refcount | ++---------------------+------------------------------------------------------+ +| pte_tryget_map | Try to increment a pte_refcount before | +| | pte_offset_map() | ++---------------------+------------------------------------------------------+ +| pte_tryget_map_lock | Try to increment a pte_refcount before | +| | pte_offset_map_lock() | ++---------------------+------------------------------------------------------+ +| __pte_put | Decrement a pte_refcount | ++---------------------+------------------------------------------------------+ +| __pte_put_many | Sub a value to a pte_refcount | ++---------------------+------------------------------------------------------+ +| pte_put | Decrement a pte_refcount(without tlb parameter) | ++---------------------+------------------------------------------------------+ +| pte_put_many | Sub a value to a pte_refcount(without tlb parameter) | 
++---------------------+------------------------------------------------------+ +| pte_put_vmf | Decrement a pte_refcount in the page fault path | ++---------------------+------------------------------------------------------+ diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig index ca5bfe83ec61..69ea13437947 100644 --- a/arch/x86/Kconfig +++ b/arch/x86/Kconfig @@ -233,7 +233,7 @@ config X86 select HAVE_PCI select HAVE_PERF_REGS select HAVE_PERF_USER_STACK_DUMP - select MMU_GATHER_RCU_TABLE_FREE if PARAVIRT + select MMU_GATHER_RCU_TABLE_FREE if PARAVIRT || FREE_USER_PTE select HAVE_POSIX_CPU_TIMERS_TASK_WORK select HAVE_REGS_AND_STACK_ACCESS_API select HAVE_RELIABLE_STACKTRACE if X86_64 && (UNWINDER_FRAME_POINTER || UNWINDER_ORC) && STACK_VALIDATION diff --git a/include/linux/pte_ref.h b/include/linux/pte_ref.h index 8a26eaba83ef..dc3923bb38f6 100644 --- a/include/linux/pte_ref.h +++ b/include/linux/pte_ref.h @@ -22,7 +22,8 @@ enum pte_tryget_type pte_try_get(pmd_t *pmd); bool pte_get_unless_zero(pmd_t *pmd); #ifdef CONFIG_FREE_USER_PTE -void free_user_pte_table(struct mm_struct *mm, pmd_t *pmdp, unsigned long addr); +void free_user_pte_table(struct mmu_gather *tlb, struct mm_struct *mm, + pmd_t *pmd, unsigned long addr); static inline void pte_ref_init(pgtable_t pte, pmd_t *pmd, int count) { @@ -48,14 +49,21 @@ static inline void pte_get_many(pmd_t *pmd, unsigned int nr) atomic_add(nr, &pte->pte_refcount); } -static inline void pte_put_many(struct mm_struct *mm, pmd_t *pmd, - unsigned long addr, unsigned int nr) +static inline void __pte_put_many(struct mmu_gather *tlb, struct mm_struct *mm, + pmd_t *pmd, unsigned long addr, + unsigned int nr) { pgtable_t pte = pmd_pgtable(*pmd); VM_BUG_ON(!PageTable(pte)); if (atomic_sub_and_test(nr, &pte->pte_refcount)) - free_user_pte_table(mm, pmd, addr & PMD_MASK); + free_user_pte_table(tlb, mm, pmd, addr & PMD_MASK); +} + +static inline void __pte_put(struct mmu_gather *tlb, struct mm_struct *mm, + pmd_t *pmd, unsigned long addr) +{ + __pte_put_many(tlb, mm, pmd, addr, 1); } #else static inline void pte_ref_init(pgtable_t pte, pmd_t *pmd, int count) @@ -75,8 +83,14 @@ static inline void pte_get_many(pmd_t *pmd, unsigned int nr) { } -static inline void pte_put_many(struct mm_struct *mm, pmd_t *pmd, - unsigned long addr, unsigned int nr) +static inline void __pte_put_many(struct mmu_gather *tlb, struct mm_struct *mm, + pmd_t *pmd, unsigned long addr, + unsigned int nr) +{ +} + +static inline void __pte_put(struct mmu_gather *tlb, struct mm_struct *mm, + pmd_t *pmd, unsigned long addr) { } #endif /* CONFIG_FREE_USER_PTE */ @@ -110,6 +124,12 @@ static inline pte_t *pte_tryget_map_lock(struct mm_struct *mm, pmd_t *pmd, return pte_offset_map_lock(mm, pmd, address, ptlp); } +static inline void pte_put_many(struct mm_struct *mm, pmd_t *pmd, + unsigned long addr, unsigned int nr) +{ + __pte_put_many(NULL, mm, pmd, addr, nr); +} + /* * pte_put - Decrement refcount for the PTE page table. * @mm: the mm_struct of the target address space. 
@@ -120,7 +140,7 @@ static inline pte_t *pte_tryget_map_lock(struct mm_struct *mm, pmd_t *pmd, */ static inline void pte_put(struct mm_struct *mm, pmd_t *pmd, unsigned long addr) { - pte_put_many(mm, pmd, addr, 1); + __pte_put(NULL, mm, pmd, addr); } #endif diff --git a/mm/madvise.c b/mm/madvise.c index 5cf2832abb98..b51254305bb2 100644 --- a/mm/madvise.c +++ b/mm/madvise.c @@ -477,7 +477,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd, arch_leave_lazy_mmu_mode(); pte_unmap_unlock(orig_pte, ptl); - pte_put(vma->vm_mm, pmd, start); + __pte_put(tlb, vma->vm_mm, pmd, start); if (pageout) reclaim_pages(&page_list); cond_resched(); @@ -710,7 +710,7 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr, arch_leave_lazy_mmu_mode(); pte_unmap_unlock(orig_pte, ptl); if (nr_put) - pte_put_many(mm, pmd, start, nr_put); + __pte_put_many(tlb, mm, pmd, start, nr_put); cond_resched(); next: return 0; diff --git a/mm/memory.c b/mm/memory.c index 4d1ede78d1b0..1bdae3b0f877 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -1469,7 +1469,7 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb, } if (nr_put) - pte_put_many(mm, pmd, start, nr_put); + __pte_put_many(tlb, mm, pmd, start, nr_put); return addr; } @@ -1515,7 +1515,7 @@ static inline unsigned long zap_pmd_range(struct mmu_gather *tlb, if (pte_try_get(pmd)) goto next; next = zap_pte_range(tlb, vma, pmd, addr, next, details); - pte_put(tlb->mm, pmd, addr); + __pte_put(tlb, tlb->mm, pmd, addr); next: cond_resched(); } while (pmd++, addr = next, addr != end); diff --git a/mm/mmu_gather.c b/mm/mmu_gather.c index 1b9837419bf9..1bd9fa889421 100644 --- a/mm/mmu_gather.c +++ b/mm/mmu_gather.c @@ -134,42 +134,42 @@ static void __tlb_remove_table_free(struct mmu_table_batch *batch) * */ -static void tlb_remove_table_smp_sync(void *arg) +static void tlb_remove_table_rcu(struct rcu_head *head) { - /* Simply deliver the interrupt */ + __tlb_remove_table_free(container_of(head, struct mmu_table_batch, rcu)); } -static void tlb_remove_table_sync_one(void) +static void tlb_remove_table_free(struct mmu_table_batch *batch) { - /* - * This isn't an RCU grace period and hence the page-tables cannot be - * assumed to be actually RCU-freed. - * - * It is however sufficient for software page-table walkers that rely on - * IRQ disabling. 
- */ - smp_call_function(tlb_remove_table_smp_sync, NULL, 1); + call_rcu(&batch->rcu, tlb_remove_table_rcu); } -static void tlb_remove_table_rcu(struct rcu_head *head) +static void tlb_remove_table_one_rcu(struct rcu_head *head) { - __tlb_remove_table_free(container_of(head, struct mmu_table_batch, rcu)); + struct page *page = container_of(head, struct page, rcu_head); + + __tlb_remove_table(page); } -static void tlb_remove_table_free(struct mmu_table_batch *batch) +static void tlb_remove_table_one(void *table) { - call_rcu(&batch->rcu, tlb_remove_table_rcu); + pgtable_t page = (pgtable_t)table; + + call_rcu(&page->rcu_head, tlb_remove_table_one_rcu); } #else /* !CONFIG_MMU_GATHER_RCU_TABLE_FREE */ -static void tlb_remove_table_sync_one(void) { } - static void tlb_remove_table_free(struct mmu_table_batch *batch) { __tlb_remove_table_free(batch); } +static void tlb_remove_table_one(void *table) +{ + __tlb_remove_table(table); +} + #endif /* CONFIG_MMU_GATHER_RCU_TABLE_FREE */ /* @@ -187,12 +187,6 @@ static inline void tlb_table_invalidate(struct mmu_gather *tlb) } } -static void tlb_remove_table_one(void *table) -{ - tlb_remove_table_sync_one(); - __tlb_remove_table(table); -} - static void tlb_table_flush(struct mmu_gather *tlb) { struct mmu_table_batch **batch = &tlb->batch; diff --git a/mm/pte_ref.c b/mm/pte_ref.c index 728e61cea25e..f9650ad23c7c 100644 --- a/mm/pte_ref.c +++ b/mm/pte_ref.c @@ -8,6 +8,8 @@ #include #include #include +#include +#include #include #ifdef CONFIG_FREE_USER_PTE @@ -117,7 +119,8 @@ static void pte_free_rcu(struct rcu_head *rcu) __free_page(page); } -void free_user_pte_table(struct mm_struct *mm, pmd_t *pmd, unsigned long addr) +void free_user_pte_table(struct mmu_gather *tlb, struct mm_struct *mm, + pmd_t *pmd, unsigned long addr) { struct vm_area_struct vma = TLB_FLUSH_VMA(mm, 0); spinlock_t *ptl; @@ -125,10 +128,14 @@ void free_user_pte_table(struct mm_struct *mm, pmd_t *pmd, unsigned long addr) ptl = pmd_lock(mm, pmd); pmdval = pmdp_huge_get_and_clear(mm, addr, pmd); - flush_tlb_range(&vma, addr, addr + PMD_SIZE); + if (!tlb) + flush_tlb_range(&vma, addr, addr + PMD_SIZE); + else + pte_free_tlb(tlb, pmd_pgtable(pmdval), addr); spin_unlock(ptl); pte_free_debug(pmdval); mm_dec_nr_ptes(mm); - call_rcu(&pmd_pgtable(pmdval)->rcu_head, pte_free_rcu); + if (!tlb) + call_rcu(&pmd_pgtable(pmdval)->rcu_head, pte_free_rcu); }
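(Editor's sketch, not code from this patch: after this change a reference can be
dropped in two ways. Callers that already carry an mmu_gather pass it in, so a
PTE page that hits refcount zero is queued via pte_free_tlb() and freed with the
batched TLB flush; callers without one, such as the page fault path, use the
tlb-less wrappers, which fall back to an immediate flush_tlb_range() plus an RCU
callback. The two caller functions below are hypothetical.)::

    #include <linux/mm.h>
    #include <linux/pte_ref.h>

    /* A teardown-style path that already has an mmu_gather available. */
    static void drop_ref_batched(struct mmu_gather *tlb, struct mm_struct *mm,
                                 pmd_t *pmd, unsigned long addr)
    {
            /* If the refcount hits 0, the PTE page is handed to pte_free_tlb()
             * and rides along with the batched TLB flush. */
            __pte_put(tlb, mm, pmd, addr);
    }

    /* A path with no mmu_gather at hand. */
    static void drop_ref_standalone(struct mm_struct *mm, pmd_t *pmd,
                                    unsigned long addr)
    {
            /* Expands to __pte_put(NULL, ...): if the refcount hits 0, the
             * range is flushed immediately and the page is freed via RCU. */
            pte_put(mm, pmd, addr);
    }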