From patchwork Sun Mar 20 00:07:18 2022
X-Patchwork-Submitter: Nadav Amit
X-Patchwork-Id: 12786386
From: Nadav Amit
To: Andrew Morton
Cc: linux-mm@kvack.org, Nadav Amit, Andrea Arcangeli, Andy Lutomirski,
    Dave Hansen, Peter Zijlstra, Thomas Gleixner, Will Deacon, Yu Zhao,
    Nick Piggin, x86@kernel.org
Subject: [RESEND PATCH v5 2/3] mm/mprotect: do not flush when not required architecturally
Date: Sat, 19 Mar 2022 17:07:18 -0700
Message-Id: <20220320000719.1533862-3-namit@vmware.com>
In-Reply-To: <20220320000719.1533862-1-namit@vmware.com>
References: <20220320000719.1533862-1-namit@vmware.com>

From: Nadav Amit

Currently, using mprotect() or uffd to unprotect a memory region causes
a TLB flush. However, in such cases the PTE is often not modified
(i.e., it remains RO) and therefore no TLB flush is needed.

Add an arch-specific pte_needs_flush() which tells whether a TLB flush
is needed based on the old PTE and the new one. Implement an x86
pte_needs_flush().

Always flush the TLB when it is architecturally needed, even when
skipping the flush would only result in spurious page-faults.

Cc: Andrea Arcangeli
Cc: Andrew Morton
Cc: Andy Lutomirski
Cc: Dave Hansen
Cc: Peter Zijlstra
Cc: Thomas Gleixner
Cc: Will Deacon
Cc: Yu Zhao
Cc: Nick Piggin
Cc: x86@kernel.org
Signed-off-by: Nadav Amit
---
 arch/x86/include/asm/pgtable_types.h |   2 +
 arch/x86/include/asm/tlbflush.h      | 121 +++++++++++++++++++++++++++
 include/asm-generic/tlb.h            |  14 ++++
 mm/huge_memory.c                     |   9 +-
 mm/mprotect.c                        |   3 +-
 5 files changed, 144 insertions(+), 5 deletions(-)

diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h
index 40497a9020c6..8668bc661026 100644
--- a/arch/x86/include/asm/pgtable_types.h
+++ b/arch/x86/include/asm/pgtable_types.h
@@ -110,9 +110,11 @@
 #if defined(CONFIG_X86_64) || defined(CONFIG_X86_PAE)
 #define _PAGE_NX	(_AT(pteval_t, 1) << _PAGE_BIT_NX)
 #define _PAGE_DEVMAP	(_AT(u64, 1) << _PAGE_BIT_DEVMAP)
+#define _PAGE_SOFTW4	(_AT(pteval_t, 1) << _PAGE_BIT_SOFTW4)
 #else
 #define _PAGE_NX	(_AT(pteval_t, 0))
 #define _PAGE_DEVMAP	(_AT(pteval_t, 0))
+#define _PAGE_SOFTW4	(_AT(pteval_t, 0))
 #endif
 
 #define _PAGE_PROTNONE	(_AT(pteval_t, 1) << _PAGE_BIT_PROTNONE)
diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 98fa0a114074..923d34a7214d 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -259,6 +259,127 @@ static inline void arch_tlbbatch_add_mm(struct arch_tlbflush_unmap_batch *batch,
 
 extern void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch);
 
+static inline bool pte_flags_need_flush(unsigned long oldflags,
+					unsigned long newflags,
+					bool ignore_access)
+{
+	unsigned long diff = oldflags ^ newflags;
+	pteval_t enable_mask, ignore_mask;
+
+	/* Ignore software bits */
+	ignore_mask = _PAGE_SOFTW1 | _PAGE_SOFTW2 | _PAGE_SOFTW3 | _PAGE_SOFTW4;
+
+	/*
+	 * Flags that require a flush when cleared but not when they are set.
+	 * We only include flags that would not trigger spurious page-faults.
+	 * Non-present entries are not cached. Hardware would set the
+	 * dirty/access bit if needed without a fault.
+	 */
+	enable_mask = _PAGE_DIRTY | _PAGE_PRESENT;
+
+	if (ignore_access)
+		ignore_mask |= _PAGE_ACCESSED;
+	else
+		enable_mask |= _PAGE_ACCESSED;
+
+	/*
+	 * Flags that indicate feature enabling are set in the old flags but
+	 * not in the new flags; feature disabled; need flush.
+	 */
+	if (diff & oldflags & enable_mask)
+		return true;
+
+	/*
+	 * For other non-software flags (e.g., protection-keys, cachability,
+	 * page-size, NX, write), we require a flush. For some cases (e.g.,
+	 * protection keys) it is architecturally required, in others (e.g.,
+	 * NX) it is done to avoid spurious page-faults.
+	 */
+	if (diff & ~(ignore_mask | enable_mask))
+		return true;
+
+	return false;
+}
+
+/*
+ * pte_needs_flush() checks whether permissions were demoted and require a
+ * flush. It should only be used for userspace PTEs.
+ */
+static inline bool pte_needs_flush(pte_t oldpte, pte_t newpte)
+{
+	/*
+	 * We first check whether only the new or the old PTE is present.
+	 * These tests are more conservative than checking _PAGE_PRESENT as
+	 * done by pte_flags_need_flush(), since pte_present() also checks
+	 * _PAGE_PROTNONE. Apparently _PAGE_PROTNONE can be ignored, since
+	 * changes to _PAGE_PROTNONE are already followed by a flush. But for
+	 * now, be conservative.
+	 *
+	 * In the future the first pte_present(oldpte) test can just check the
+	 * _PAGE_PRESENT flag and the second one can be dropped.
+	 */
+
+	/* !PRESENT -> * ; no need for flush */
+	if (!pte_present(oldpte))
+		return false;
+
+	/* PRESENT -> !PRESENT ; needs flush */
+	if (!pte_present(newpte))
+		return true;
+
+	/* PFN changed ; needs flush */
+	if (pte_pfn(oldpte) != pte_pfn(newpte))
+		return true;
+
+	/*
+	 * Check PTE flags; ignore the access-bit; see comment in
+	 * ptep_clear_flush_young().
+	 */
+	return pte_flags_need_flush(pte_flags(oldpte), pte_flags(newpte),
+				    true);
+}
+#define pte_needs_flush pte_needs_flush
+
+/*
+ * huge_pmd_needs_flush() checks whether permissions were demoted and require a
+ * flush. It should only be used for userspace huge PMDs.
+ */
+static inline bool huge_pmd_needs_flush(pmd_t oldpmd, pmd_t newpmd)
+{
+	/*
+	 * We first check whether only the new or the old PMD is present.
+	 * These tests are more conservative than checking _PAGE_PRESENT as
+	 * later done by pte_flags_need_flush(), since pmd_present() also
+	 * checks _PAGE_PROTNONE and _PAGE_PSE. Apparently, both
+	 * _PAGE_PROTNONE and _PAGE_PSE can be ignored: changes to
+	 * _PAGE_PROTNONE are followed by a flush, and _PAGE_PSE might only be
+	 * set (without _PAGE_PRESENT) while the PMD's page-table lock is
+	 * taken. But be conservative for now.
+	 *
+	 * In the future the first pmd_present(oldpmd) test can just check the
+	 * _PAGE_PRESENT flag and the second one can be dropped.
+	 */
+
+	/* !PRESENT -> * ; no need for flush */
+	if (!pmd_present(oldpmd))
+		return false;
+
+	/* PRESENT -> !PRESENT ; needs flush */
+	if (!pmd_present(newpmd))
+		return true;
+
+	/* PFN changed ; needs flush */
+	if (pmd_pfn(oldpmd) != pmd_pfn(newpmd))
+		return true;
+
+	/*
+	 * Check PMD flags; do not ignore the access-bit; see
+	 * pmdp_clear_flush_young().
+	 */
+	return pte_flags_need_flush(pmd_flags(oldpmd), pmd_flags(newpmd),
+				    false);
+}
+#define huge_pmd_needs_flush huge_pmd_needs_flush
+
 #endif /* !MODULE */
 
 static inline void __native_tlb_flush_global(unsigned long cr4)
diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
index fd7feb5c7894..3a30e23fa35d 100644
--- a/include/asm-generic/tlb.h
+++ b/include/asm-generic/tlb.h
@@ -654,6 +654,20 @@ static inline void tlb_flush_p4d_range(struct mmu_gather *tlb,
 	} while (0)
 #endif
 
+#ifndef pte_needs_flush
+static inline bool pte_needs_flush(pte_t oldpte, pte_t newpte)
+{
+	return true;
+}
+#endif
+
+#ifndef huge_pmd_needs_flush
+static inline bool huge_pmd_needs_flush(pmd_t oldpmd, pmd_t newpmd)
+{
+	return true;
+}
+#endif
+
 #endif /* CONFIG_MMU */
 
 #endif /* _ASM_GENERIC__TLB_H */
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index d58a5b498011..51b0f3cb1ba0 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1698,7 +1698,7 @@ int change_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
 {
 	struct mm_struct *mm = vma->vm_mm;
 	spinlock_t *ptl;
-	pmd_t entry;
+	pmd_t oldpmd, entry;
 	bool preserve_write;
 	int ret;
 	bool prot_numa = cp_flags & MM_CP_PROT_NUMA;
@@ -1784,9 +1784,9 @@ int change_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
 	 * pmdp_invalidate() is required to make sure we don't miss
 	 * dirty/young flags set by hardware.
 	 */
-	entry = pmdp_invalidate(vma, addr, pmd);
+	oldpmd = pmdp_invalidate(vma, addr, pmd);
 
-	entry = pmd_modify(entry, newprot);
+	entry = pmd_modify(oldpmd, newprot);
 	if (preserve_write)
 		entry = pmd_mk_savedwrite(entry);
 	if (uffd_wp) {
@@ -1803,7 +1803,8 @@ int change_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
 	ret = HPAGE_PMD_NR;
 	set_pmd_at(mm, addr, pmd, entry);
 
-	tlb_flush_pmd_range(tlb, addr, HPAGE_PMD_SIZE);
+	if (huge_pmd_needs_flush(oldpmd, entry))
+		tlb_flush_pmd_range(tlb, addr, HPAGE_PMD_SIZE);
 
 	BUG_ON(vma_is_anonymous(vma) && !preserve_write && pmd_write(entry));
 unlock:
diff --git a/mm/mprotect.c b/mm/mprotect.c
index ba3fc6d5ed2a..dd963e5da118 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -152,7 +152,8 @@ static unsigned long change_pte_range(struct mmu_gather *tlb,
 				ptent = pte_mkwrite(ptent);
 			}
 			ptep_modify_prot_commit(vma, addr, pte, oldpte, ptent);
-			tlb_flush_pte_range(tlb, addr, PAGE_SIZE);
+			if (pte_needs_flush(oldpte, ptent))
+				tlb_flush_pte_range(tlb, addr, PAGE_SIZE);
 			pages++;
 		} else if (is_swap_pte(oldpte)) {
 			swp_entry_t entry = pte_to_swp_entry(oldpte);
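For intuition, the decision made by pte_flags_need_flush() can be
reproduced in a small stand-alone program. The sketch below is not part
of the patch: the PAGE_* bit positions are made-up stand-ins for the
real x86 _PAGE_* definitions, and it folds the access-bit into the
enable mask (the ignore_access=false case). It only illustrates the two
tests, not the kernel code itself:

/*
 * User-space sketch of the pte_flags_need_flush() decision above.
 * The PAGE_* values are illustrative, not the real pgtable_types.h bits.
 */
#include <stdbool.h>
#include <stdio.h>

#define PAGE_PRESENT	(1UL << 0)
#define PAGE_RW		(1UL << 1)
#define PAGE_ACCESSED	(1UL << 5)
#define PAGE_DIRTY	(1UL << 6)
#define PAGE_SOFTW1	(1UL << 9)

static bool flags_need_flush(unsigned long oldflags, unsigned long newflags)
{
	unsigned long diff = oldflags ^ newflags;
	/* Software bits never require a flush. */
	unsigned long ignore_mask = PAGE_SOFTW1;
	/* Bits that require a flush only when they are cleared. */
	unsigned long enable_mask = PAGE_PRESENT | PAGE_DIRTY | PAGE_ACCESSED;

	/* A bit in enable_mask was set in oldflags but cleared in newflags. */
	if (diff & oldflags & enable_mask)
		return true;

	/* Any other hardware-visible bit changed (e.g., RW, NX). */
	if (diff & ~(ignore_mask | enable_mask))
		return true;

	return false;
}

int main(void)
{
	unsigned long ro = PAGE_PRESENT | PAGE_ACCESSED;
	unsigned long rw = ro | PAGE_RW;

	/* Unchanged PTE (unprotect that keeps the PTE RO): no flush (0). */
	printf("RO -> RO        : %d\n", flags_need_flush(ro, ro));
	/* Access bit set: hardware sets it on its own anyway: no flush (0). */
	printf("RO -> RO+ACCESS : %d\n", flags_need_flush(PAGE_PRESENT, ro));
	/* Dirty bit cleared: must flush (1). */
	printf("RW+DIRTY -> RW  : %d\n", flags_need_flush(rw | PAGE_DIRTY, rw));
	/* Write permission removed: must flush (1). */
	printf("RW -> RO        : %d\n", flags_need_flush(rw, ro));
	return 0;
}

On this model the common unprotect case, where the PTE stays read-only
and oldflags == newflags, needs no flush, while any permission demotion
still does.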
From patchwork Sun Mar 20 00:07:19 2022
X-Patchwork-Submitter: Nadav Amit
X-Patchwork-Id: 12786387
From: Nadav Amit
To: Andrew Morton
Cc: linux-mm@kvack.org, Nadav Amit, Andrea Arcangeli, Andrew Cooper,
    Andy Lutomirski, Dave Hansen, Peter Xu, Peter Zijlstra,
    Thomas Gleixner, Will Deacon, Yu Zhao, Nick Piggin, x86@kernel.org
Subject: [RESEND PATCH v5 3/3] mm: avoid unnecessary flush on change_huge_pmd()
Date: Sat, 19 Mar 2022 17:07:19 -0700
Message-Id: <20220320000719.1533862-4-namit@vmware.com>
In-Reply-To: <20220320000719.1533862-1-namit@vmware.com>
References: <20220320000719.1533862-1-namit@vmware.com>
From: Nadav Amit

Calls to change_protection_range() on THP can trigger, at least on x86,
two TLB flushes for one page: one immediately, when pmdp_invalidate() is
called by change_huge_pmd(), and then another one later (which can be
batched) when change_protection_range() finishes.

The first TLB flush is only necessary to prevent the dirty bit (and,
of lesser importance, the access bit) from changing while the PTE is
modified. However, this flush is unnecessary, as x86 CPUs set the dirty
bit atomically, with an additional check that the PTE is (still)
present. One caveat is Intel's Knights Landing, which has a bug and does
not do so.

Leverage this behavior to eliminate the unnecessary TLB flush in
change_huge_pmd(). Introduce a new arch-specific pmdp_invalidate_ad()
that only guards the access and dirty bits against further changes.

Cc: Andrea Arcangeli
Cc: Andrew Cooper
Cc: Andrew Morton
Cc: Andy Lutomirski
Cc: Dave Hansen
Cc: Peter Xu
Cc: Peter Zijlstra
Cc: Thomas Gleixner
Cc: Will Deacon
Cc: Yu Zhao
Cc: Nick Piggin
Cc: x86@kernel.org
Signed-off-by: Nadav Amit
---
 arch/x86/include/asm/pgtable.h |  5 +++++
 arch/x86/mm/pgtable.c          | 10 ++++++++++
 include/linux/pgtable.h        | 20 ++++++++++++++++++++
 mm/huge_memory.c               |  4 ++--
 mm/pgtable-generic.c           |  8 ++++++++
 5 files changed, 45 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 62ab07e24aef..23ad34edcc4b 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -1173,6 +1173,11 @@ static inline pmd_t pmdp_establish(struct vm_area_struct *vma,
 	}
 }
 #endif
+
+#define __HAVE_ARCH_PMDP_INVALIDATE_AD
+extern pmd_t pmdp_invalidate_ad(struct vm_area_struct *vma,
+				unsigned long address, pmd_t *pmdp);
+
 /*
  * Page table pages are page-aligned. The lower half of the top
  * level is used for userspace and the top half for the kernel.
diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
index 3481b35cb4ec..f16059e9a85e 100644
--- a/arch/x86/mm/pgtable.c
+++ b/arch/x86/mm/pgtable.c
@@ -608,6 +608,16 @@ int pmdp_clear_flush_young(struct vm_area_struct *vma,
 
 	return young;
 }
+
+pmd_t pmdp_invalidate_ad(struct vm_area_struct *vma, unsigned long address,
+			 pmd_t *pmdp)
+{
+	/*
+	 * No flush is necessary. Once an invalid PTE is established, the
+	 * PTE's access and dirty bits cannot be updated.
+	 */
+	return pmdp_establish(vma, address, pmdp, pmd_mkinvalid(*pmdp));
+}
 #endif
 
 /**
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index f4f4077b97aa..5826e8e52619 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -570,6 +570,26 @@ extern pmd_t pmdp_invalidate(struct vm_area_struct *vma, unsigned long address,
 			     pmd_t *pmdp);
 #endif
 
+#ifndef __HAVE_ARCH_PMDP_INVALIDATE_AD
+
+/*
+ * pmdp_invalidate_ad() invalidates the PMD while changing a transparent
+ * hugepage mapping in the page tables. This function is similar to
+ * pmdp_invalidate(), but should only be used if the access and dirty bits
+ * would not be cleared by the software in the new PMD value. The function
+ * ensures that hardware updates of the access and dirty bits are not lost.
+ *
+ * Doing so allows certain architectures to avoid a TLB flush in most cases.
+ * Another TLB flush might still be necessary later if the PMD update itself
+ * requires one (e.g., if protection was made stricter);
+ * even then, the caller may be able to batch these TLB flushing operations,
+ * so fewer TLB flushes are needed.
+ */
+extern pmd_t pmdp_invalidate_ad(struct vm_area_struct *vma,
+				unsigned long address, pmd_t *pmdp);
+#endif
+
 #ifndef __HAVE_ARCH_PTE_SAME
 static inline int pte_same(pte_t pte_a, pte_t pte_b)
 {
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 51b0f3cb1ba0..691d80edcfd7 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1781,10 +1781,10 @@ int change_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
 	 * The race makes MADV_DONTNEED miss the huge pmd and don't clear it
 	 * which may break userspace.
 	 *
-	 * pmdp_invalidate() is required to make sure we don't miss
+	 * pmdp_invalidate_ad() is required to make sure we don't miss
 	 * dirty/young flags set by hardware.
 	 */
-	oldpmd = pmdp_invalidate(vma, addr, pmd);
+	oldpmd = pmdp_invalidate_ad(vma, addr, pmd);
 
 	entry = pmd_modify(oldpmd, newprot);
 	if (preserve_write)
diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
index 6523fda274e5..90ab721a12a8 100644
--- a/mm/pgtable-generic.c
+++ b/mm/pgtable-generic.c
@@ -201,6 +201,14 @@ pmd_t pmdp_invalidate(struct vm_area_struct *vma, unsigned long address,
 }
 #endif
 
+#ifndef __HAVE_ARCH_PMDP_INVALIDATE_AD
+pmd_t pmdp_invalidate_ad(struct vm_area_struct *vma, unsigned long address,
+			 pmd_t *pmdp)
+{
+	return pmdp_invalidate(vma, address, pmdp);
+}
+#endif
+
 #ifndef pmdp_collapse_flush
 pmd_t pmdp_collapse_flush(struct vm_area_struct *vma, unsigned long address,
 			  pmd_t *pmdp)
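The user-visible path the series optimizes can be exercised with a small
demonstration program (not part of the patch set). Whether the region is
actually backed by a huge PMD, and therefore goes through
change_huge_pmd(), depends on the system's transparent hugepage
configuration, so this is a demonstration rather than a test:

/*
 * Exercises the mprotect() path optimized by this series. The 2MB
 * region is aligned and advised to use THP, but whether a huge PMD is
 * installed depends on the THP settings of the running system.
 */
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

#define HPAGE_SIZE (2UL * 1024 * 1024)	/* x86-64 huge page size */

int main(void)
{
	/* Over-allocate so a 2MB-aligned sub-range is guaranteed to exist. */
	char *buf = mmap(NULL, 2 * HPAGE_SIZE, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	char *p;

	if (buf == MAP_FAILED) {
		perror("mmap");
		return 1;
	}
	p = (char *)(((uintptr_t)buf + HPAGE_SIZE - 1) & ~(HPAGE_SIZE - 1));

	madvise(p, HPAGE_SIZE, MADV_HUGEPAGE);	/* request a THP; best effort */
	memset(p, 1, HPAGE_SIZE);		/* populate (and dirty) it */

	/*
	 * Write-protect: a permission demotion, so a TLB flush is
	 * architecturally needed. With patch 3/3, pmdp_invalidate_ad()
	 * inside change_huge_pmd() no longer issues a second, immediate
	 * flush on top of the batched one.
	 */
	if (mprotect(p, HPAGE_SIZE, PROT_READ))
		perror("mprotect(PROT_READ)");

	/*
	 * Unprotect: if the page-table entry is left unchanged (e.g., kept
	 * read-only for copy-on-write), patch 2/3 skips the flush entirely.
	 */
	if (mprotect(p, HPAGE_SIZE, PROT_READ | PROT_WRITE))
		perror("mprotect(PROT_READ | PROT_WRITE)");

	munmap(buf, 2 * HPAGE_SIZE);
	return 0;
}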