From patchwork Fri Feb 21 00:52:59 2025
X-Patchwork-Submitter: Rik van Riel
X-Patchwork-Id: 13984684
From: Rik van Riel
To: x86@kernel.org
Cc: linux-kernel@vger.kernel.org, bp@alien8.de, peterz@infradead.org,
 dave.hansen@linux.intel.com, zhengqi.arch@bytedance.com,
 nadav.amit@gmail.com, thomas.lendacky@amd.com, kernel-team@meta.com,
 linux-mm@kvack.org, akpm@linux-foundation.org, jackmanb@google.com,
 jannh@google.com, mhklinux@outlook.com, andrew.cooper3@citrix.com,
 Manali.Shukla@amd.com
Subject: [PATCH v12 00/16] AMD broadcast TLB invalidation
Date: Thu, 20 Feb 2025 19:52:59 -0500
Message-ID: <20250221005345.2156760-1-riel@surriel.com>

Add support for broadcast TLB invalidation using AMD's INVLPGB
instruction.

This allows the kernel to invalidate TLB entries on remote CPUs
without needing to send IPIs, without having to wait for remote
CPUs to handle those interrupts, and with less interruption to
what was running on those CPUs.

Because x86 PCID space is limited, and there are some very large
systems out there, broadcast TLB invalidation is only used for
processes that are active on 3 or more CPUs, with the threshold
being gradually increased the more the PCID space gets exhausted.

Combined with the removal of unnecessary lru_add_drain calls
(see https://lkml.org/lkml/2024/12/19/1388) this results in a
nice performance boost for the will-it-scale tlb_flush2_threads
test on an AMD Milan system with 36 cores:

- vanilla kernel:           527k loops/second
- lru_add_drain removal:    731k loops/second
- only INVLPGB:             527k loops/second
- lru_add_drain + INVLPGB: 1157k loops/second

Profiling with only the INVLPGB changes showed that while TLB
invalidation went down from 40% of the total CPU time to only
around 4% of CPU time, the contention simply moved to the LRU
lock.

Fixing both at the same time about doubles the number of
iterations per second for this case.

Some numbers closer to real world performance can be found at
Phoronix, thanks to Michael:

https://www.phoronix.com/news/AMD-INVLPGB-Linux-Benefits

My current plan is to implement support for Intel's RAR
(Remote Action Request) TLB flushing in a follow-up series,
after this series has been merged into -tip. Making things
any larger would just be unwieldy for reviewers.

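As a quick check on the benchmark figures quoted above, the relative
speedups can be computed directly from the four throughput numbers.
This is only an illustrative userspace calculation; the names
RESULTS_KLOOPS and speedups are made up for this sketch and do not
appear anywhere in the series.

```python
# Throughput figures quoted in the cover letter, in thousands of
# loops/second (will-it-scale tlb_flush2_threads, AMD Milan, 36 cores).
RESULTS_KLOOPS = {
    "vanilla kernel": 527,
    "lru_add_drain removal": 731,
    "only INVLPGB": 527,
    "lru_add_drain + INVLPGB": 1157,
}


def speedups(results, baseline="vanilla kernel"):
    """Return each configuration's throughput relative to the baseline."""
    base = results[baseline]
    return {name: value / base for name, value in results.items()}


if __name__ == "__main__":
    for name, ratio in speedups(RESULTS_KLOOPS).items():
        print(f"{name:26s} {ratio:.2f}x")
```

With these inputs, INVLPGB alone comes out at 1.00x of the vanilla
kernel, the lru_add_drain removal alone at roughly 1.39x, and the
two combined at roughly 2.2x, which matches the "about doubles"
description.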
v12:
 - make sure "nopcid" command line option turns off invlpgb (Brendan)
 - add "noinvlpgb" kernel command line option
 - split out kernel TLB flushing differently (Dave & Yosry)
 - split up the patch that does invlpgb flushing for user processes (Dave)
 - clean up get_flush_tlb_info (Boris)
 - move invlpgb_count_max initialization to get_cpu_cap (Boris)
 - bunch more comments as requested
v11:
 - resolve conflict with CONFIG_PT_RECLAIM code
 - a few more cleanups (Peter, Brendan, Nadav)
v10:
 - simplify partial pages with min(nr, 1) in the invlpgb loop (Peter)
 - document x86 paravirt, AMD invlpgb, and ARM64 flush without IPI (Brendan)
 - remove IS_ENABLED(CONFIG_X86_BROADCAST_TLB_FLUSH) (Brendan)
 - various cleanups (Brendan)
v9:
 - print warning when start or end address was rounded (Peter)
 - in the reclaim code, tlbsync at context switch time (Peter)
 - fix !CONFIG_CPU_SUP_AMD compile error in arch_tlbbatch_add_pending (Jan)
v8:
 - round start & end to handle non-page-aligned callers (Steven & Jan)
 - fix up changelog & add tested-by tags (Manali)
v7:
 - a few small code cleanups (Nadav)
 - fix spurious VM_WARN_ON_ONCE in mm_global_asid
 - code simplifications & better barriers (Peter & Dave)
v6:
 - fix info->end check in flush_tlb_kernel_range (Michael)
 - disable broadcast TLB flushing on 32 bit x86
v5:
 - use byte assembly for compatibility with older toolchains (Borislav, Michael)
 - ensure a panic on an invalid number of extra pages (Dave, Tom)
 - add cant_migrate() assertion to tlbsync (Jann)
 - a bunch more cleanups (Nadav)
 - key TCE enabling off X86_FEATURE_TCE (Andrew)
 - fix a race between reclaim and ASID transition (Jann)
v4:
 - Use only bitmaps to track free global ASIDs (Nadav)
 - Improved AMD initialization (Borislav & Tom)
 - Various naming and documentation improvements (Peter, Nadav, Tom, Dave)
 - Fixes for subtle race conditions (Jann)
v3:
 - Remove paravirt tlb_remove_table call (thank you Qi Zheng)
 - More suggested cleanups and changelog fixes by Peter and Nadav
v2:
 - Apply suggestions by Peter and Borislav (thank you!)
 - Fix bug in arch_tlbbatch_flush, where we need to do both
   the TLBSYNC, and flush the CPUs that are in the cpumask.
 - Some updates to comments and changelogs based on questions.
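
To illustrate the v8 and v9 changelog items about non-page-aligned
callers, here is a small userspace model of rounding a flush range
to page boundaries and noting whether any rounding happened. It
assumes 4 KiB pages, that start rounds down and end rounds up, and
the names PAGE_SIZE and round_flush_range are hypothetical; the
actual kernel code in the series may differ.

```python
PAGE_SIZE = 4096  # assumption: 4 KiB base pages


def round_flush_range(start, end, page_size=PAGE_SIZE):
    """Round [start, end) outward to page boundaries.

    Returns (aligned_start, aligned_end, was_rounded). The flag models
    the point where a warning could be printed for unaligned callers.
    """
    mask = ~(page_size - 1)
    aligned_start = start & mask                  # round start down
    aligned_end = (end + page_size - 1) & mask    # round end up
    was_rounded = aligned_start != start or aligned_end != end
    return aligned_start, aligned_end, was_rounded


if __name__ == "__main__":
    start, end, rounded = round_flush_range(0x1234, 0x2001)
    print(hex(start), hex(end), rounded)
```

Rounding outward means the flushed range always covers every page
the caller touched, at the cost of possibly flushing one extra page
at either edge.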