From patchwork Fri Feb 17 04:12:26 2023
From: Yu Zhao
Date: Thu, 16 Feb 2023 21:12:26 -0700
Subject: [PATCH mm-unstable v1 1/5] mm/kvm: add mmu_notifier_test_clear_young()
To: Andrew Morton, Paolo Bonzini
Cc: Jonathan Corbet, Michael Larabel, kvmarm@lists.linux.dev,
    kvm@vger.kernel.org, linux-arm-kernel@lists.infradead.org,
    linux-kernel@vger.kernel.org, linux-mm@kvack.org,
    linuxppc-dev@lists.ozlabs.org, x86@kernel.org, linux-mm@google.com
Message-Id: <20230217041230.2417228-2-yuzhao@google.com>
In-Reply-To: <20230217041230.2417228-1-yuzhao@google.com>

mmu_notifier_test_clear_young() allows the caller to safely test and
clear the accessed bit in KVM PTEs without taking the MMU lock.

This patch adds the generic infrastructure; the subsequent patches add
the arch-specific implementations that hook into it. Those
implementations generally rely on two techniques: RCU and cmpxchg. The
former protects KVM page tables from being freed while the latter
clears the accessed bit atomically against both the hardware and other
software page table walkers.

mmu_notifier_test_clear_young() follows two design patterns: fallback
and batching. For any unsupported cases, it can optionally fall back to
mmu_notifier_ops->clear_young(). For a range of KVM PTEs, it can test
or test and clear their accessed bits according to a bitmap provided by
the caller.

mmu_notifier_test_clear_young() always returns 0 if fallback is not
allowed. If fallback happens, its return value is similar to that of
mmu_notifier_clear_young().

The bitmap parameter has the following specifications:
1. The number of bits should be at least (end-start)/PAGE_SIZE.
2. The offset of each bit is relative to the end, e.g., the offset
   corresponding to addr is (end-addr)/PAGE_SIZE-1. This is to better
   suit batching while forward looping.
3. For each KVM PTE with the accessed bit set (young), arch-specific
   implementations flip the corresponding bit in the bitmap; the
   accessed bit is cleared only if the old bitmap value is 1. A caller
   can therefore test, or test and clear, the accessed bit by setting
   the corresponding bitmap bit to 0 or 1, and the new value will be 1
   or 0 for a young KVM PTE.
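As an illustration of this convention (not part of the patch): a
hypothetical caller that tests and clears the accessed bit for every
page in [start, end) could use the bitmap as below, where
mark_page_young() stands in for whatever bookkeeping the caller does:

	static void scan_range(struct mm_struct *mm, unsigned long start,
			       unsigned long end)
	{
		unsigned long addr;
		unsigned long nr = (end - start) / PAGE_SIZE;
		unsigned long *bitmap = bitmap_alloc(nr, GFP_KERNEL);

		if (!bitmap)
			return;

		/* setting a bit to 1 requests "test and clear" for that PTE */
		bitmap_fill(bitmap, nr);

		/* fallback disallowed, so the return value is always 0 */
		mmu_notifier_test_clear_young(mm, start, end, false, bitmap);

		for (addr = start; addr < end; addr += PAGE_SIZE) {
			/* a young KVM PTE flipped its bit from 1 to 0 */
			if (!test_bit((end - addr) / PAGE_SIZE - 1, bitmap))
				mark_page_young(addr);
		}

		bitmap_free(bitmap);
	}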
Signed-off-by: Yu Zhao
---
 include/linux/kvm_host.h     | 29 ++++++++++++++++++
 include/linux/mmu_notifier.h | 40 +++++++++++++++++++++++++
 mm/mmu_notifier.c            | 26 ++++++++++++++++
 virt/kvm/kvm_main.c          | 58 ++++++++++++++++++++++++++++++++++++
 4 files changed, 153 insertions(+)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 4f26b244f6d0..df46fc815c8b 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -2281,4 +2281,33 @@ static inline void kvm_account_pgtable_pages(void *virt, int nr)
 /* Max number of entries allowed for each kvm dirty ring */
 #define  KVM_DIRTY_RING_MAX_ENTRIES  65536
 
+/*
+ * Architectures that implement kvm_arch_test_clear_young() should override
+ * kvm_arch_has_test_clear_young().
+ *
+ * kvm_arch_has_test_clear_young() is allowed to return false positive. It can
+ * return true if kvm_arch_test_clear_young() is supported but disabled due to
+ * some runtime constraint. In this case, kvm_arch_test_clear_young() should
+ * return false.
+ *
+ * The last parameter to kvm_arch_test_clear_young() is a bitmap with the
+ * following specifications:
+ * 1. The offset of each bit is relative to the second to the last parameter
+ *    lsb_gfn. E.g., the offset corresponding to gfn is lsb_gfn-gfn. This is
+ *    to better suit batching while forward looping.
+ * 2. For each KVM PTE with the accessed bit set, the implementation should
+ *    flip the corresponding bit in the bitmap. It should only clear the
+ *    accessed bit if the old value is 1. This allows the caller to test or
+ *    test and clear the accessed bit.
+ */
+#ifndef kvm_arch_has_test_clear_young
+static inline bool kvm_arch_has_test_clear_young(void)
+{
+	return false;
+}
+#endif
+
+bool kvm_arch_test_clear_young(struct kvm *kvm, struct kvm_gfn_range *range,
+			       gfn_t lsb_gfn, unsigned long *bitmap);
+
 #endif
diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
index 64a3e051c3c4..432b51cd6843 100644
--- a/include/linux/mmu_notifier.h
+++ b/include/linux/mmu_notifier.h
@@ -122,6 +122,11 @@ struct mmu_notifier_ops {
 			     struct mm_struct *mm,
 			     unsigned long address);
 
+	/* see the comments on mmu_notifier_test_clear_young() */
+	bool (*test_clear_young)(struct mmu_notifier *mn, struct mm_struct *mm,
+				 unsigned long start, unsigned long end,
+				 unsigned long *bitmap);
+
 	/*
 	 * change_pte is called in cases that pte mapping to page is changed:
 	 * for example, when ksm remaps pte to point to a new shared page.
@@ -390,6 +395,9 @@ extern int __mmu_notifier_clear_flush_young(struct mm_struct *mm,
 extern int __mmu_notifier_clear_young(struct mm_struct *mm,
 				      unsigned long start,
 				      unsigned long end);
+extern int __mmu_notifier_test_clear_young(struct mm_struct *mm,
+					   unsigned long start, unsigned long end,
+					   bool fallback, unsigned long *bitmap);
 extern int __mmu_notifier_test_young(struct mm_struct *mm,
 				     unsigned long address);
 extern void __mmu_notifier_change_pte(struct mm_struct *mm,
@@ -432,6 +440,31 @@ static inline int mmu_notifier_clear_young(struct mm_struct *mm,
 	return 0;
 }
 
+/*
+ * This function always returns 0 if fallback is not allowed. If fallback
+ * happens, its return value is similar to that of mmu_notifier_clear_young().
+ *
+ * The bitmap has the following specifications:
+ * 1. The number of bits should be at least (end-start)/PAGE_SIZE.
+ * 2. The offset of each bit is relative to the end. E.g., the offset
+ *    corresponding to addr is (end-addr)/PAGE_SIZE-1. This is to better suit
+ *    batching while forward looping.
+ * 3. For each KVM PTE with the accessed bit set (young), this function flips
+ *    the corresponding bit in the bitmap. It only clears the accessed bit if
+ *    the old value is 1. A caller can test or test and clear the accessed bit
+ *    by setting the corresponding bit in the bitmap to 0 or 1, and the new
+ *    value will be 1 or 0 for a young KVM PTE.
+ */
+static inline int mmu_notifier_test_clear_young(struct mm_struct *mm,
+						unsigned long start, unsigned long end,
+						bool fallback, unsigned long *bitmap)
+{
+	if (mm_has_notifiers(mm))
+		return __mmu_notifier_test_clear_young(mm, start, end, fallback, bitmap);
+
+	return 0;
+}
+
 static inline int mmu_notifier_test_young(struct mm_struct *mm,
 					  unsigned long address)
 {
@@ -684,6 +717,13 @@ static inline int mmu_notifier_clear_flush_young(struct mm_struct *mm,
 	return 0;
 }
 
+static inline int mmu_notifier_test_clear_young(struct mm_struct *mm,
+						unsigned long start, unsigned long end,
+						bool fallback, unsigned long *bitmap)
+{
+	return 0;
+}
+
 static inline int mmu_notifier_test_young(struct mm_struct *mm,
 					  unsigned long address)
 {
diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
index 50c0dde1354f..dd39b9b4d6d3 100644
--- a/mm/mmu_notifier.c
+++ b/mm/mmu_notifier.c
@@ -402,6 +402,32 @@ int __mmu_notifier_clear_young(struct mm_struct *mm,
 	return young;
 }
 
+/* see the comments on mmu_notifier_test_clear_young() */
+int __mmu_notifier_test_clear_young(struct mm_struct *mm,
+				    unsigned long start, unsigned long end,
+				    bool fallback, unsigned long *bitmap)
+{
+	int key;
+	struct mmu_notifier *mn;
+	int young = 0;
+
+	key = srcu_read_lock(&srcu);
+
+	hlist_for_each_entry_srcu(mn, &mm->notifier_subscriptions->list,
+				  hlist, srcu_read_lock_held(&srcu)) {
+		if (mn->ops->test_clear_young &&
+		    mn->ops->test_clear_young(mn, mm, start, end, bitmap))
+			continue;
+
+		if (fallback && mn->ops->clear_young)
+			young |= mn->ops->clear_young(mn, mm, start, end);
+	}
+
+	srcu_read_unlock(&srcu, key);
+
+	return young;
+}
+
 int __mmu_notifier_test_young(struct mm_struct *mm,
 			      unsigned long address)
 {
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 9c60384b5ae0..1b465df4a93d 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -875,6 +875,63 @@ static int kvm_mmu_notifier_clear_young(struct mmu_notifier *mn,
 	return kvm_handle_hva_range_no_flush(mn, start, end, kvm_age_gfn);
 }
 
+static bool kvm_test_clear_young(struct kvm *kvm, unsigned long start,
+				 unsigned long end, unsigned long *bitmap)
+{
+	int i;
+	int key;
+	bool success = true;
+
+	trace_kvm_age_hva(start, end);
+
+	key = srcu_read_lock(&kvm->srcu);
+
+	for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
+		struct interval_tree_node *node;
+		struct kvm_memslots *slots = __kvm_memslots(kvm, i);
+
+		kvm_for_each_memslot_in_hva_range(node, slots, start, end - 1) {
+			gfn_t lsb_gfn;
+			unsigned long hva_start, hva_end;
+			struct kvm_gfn_range range = {
+				.slot = container_of(node, struct kvm_memory_slot,
+						     hva_node[slots->node_idx]),
+			};
+
+			hva_start = max(start, range.slot->userspace_addr);
+			hva_end = min(end - 1, range.slot->userspace_addr +
+					       range.slot->npages * PAGE_SIZE - 1);
+
+			range.start = hva_to_gfn_memslot(hva_start, range.slot);
+			range.end = hva_to_gfn_memslot(hva_end, range.slot) + 1;
+
+			if (WARN_ON_ONCE(range.end <= range.start))
+				continue;
+
+			/* see the comments on the generic kvm_arch_has_test_clear_young() */
+			lsb_gfn = hva_to_gfn_memslot(end - 1, range.slot);
+
+			success = kvm_arch_test_clear_young(kvm, &range, lsb_gfn, bitmap);
+			if (!success)
+				break;
+		}
+	}
+
+	srcu_read_unlock(&kvm->srcu, key);
+
+	return success;
+}
+
+static bool kvm_mmu_notifier_test_clear_young(struct mmu_notifier *mn, struct mm_struct *mm,
+					      unsigned long start, unsigned long end,
+					      unsigned long *bitmap)
+{
+	if (kvm_arch_has_test_clear_young())
+		return kvm_test_clear_young(mmu_notifier_to_kvm(mn), start, end, bitmap);
+
+	return false;
+}
+
 static int kvm_mmu_notifier_test_young(struct mmu_notifier *mn,
 				       struct mm_struct *mm,
 				       unsigned long address)
@@ -903,6 +960,7 @@ static const struct mmu_notifier_ops kvm_mmu_notifier_ops = {
 	.clear_flush_young	= kvm_mmu_notifier_clear_flush_young,
 	.clear_young		= kvm_mmu_notifier_clear_young,
 	.test_young		= kvm_mmu_notifier_test_young,
+	.test_clear_young	= kvm_mmu_notifier_test_clear_young,
 	.change_pte		= kvm_mmu_notifier_change_pte,
 	.release		= kvm_mmu_notifier_release,
 };

From patchwork Fri Feb 17 04:12:27 2023
From: Yu Zhao
Date: Thu, 16 Feb 2023 21:12:27 -0700
Subject: [PATCH mm-unstable v1 2/5] kvm/x86: add kvm_arch_test_clear_young()
To: Andrew Morton, Paolo Bonzini
Cc: Jonathan Corbet, Michael Larabel, kvmarm@lists.linux.dev,
    kvm@vger.kernel.org, linux-arm-kernel@lists.infradead.org,
    linux-kernel@vger.kernel.org, linux-mm@kvack.org,
    linuxppc-dev@lists.ozlabs.org, x86@kernel.org, linux-mm@google.com
Message-Id: <20230217041230.2417228-3-yuzhao@google.com>
In-Reply-To: <20230217041230.2417228-1-yuzhao@google.com>

This patch adds kvm_arch_test_clear_young() for the vast majority of
VMs that are not nested and run on hardware that sets the accessed bit
in TDP MMU page tables. It relies on two techniques, RCU and cmpxchg,
to safely test and clear the accessed bit without taking the MMU lock.
The former protects KVM page tables from being freed while the latter
clears the accessed bit atomically against both the hardware and other
software page table walkers.
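The essence of the lock-free clearing, as a standalone sketch (the
patch open-codes this inside the TDP MMU walk below;
clear_accessed_bit() is a hypothetical helper, not an API added here):

	static void clear_accessed_bit(u64 *sptep, u64 old_spte, u64 accessed_mask)
	{
		u64 new_spte = old_spte & ~accessed_mask;

		/*
		 * If *sptep changed after it was read, e.g., the hardware set
		 * the dirty bit or another walker zapped the SPTE, the cmpxchg
		 * fails and the stale new_spte is discarded instead of
		 * clobbering the concurrent update.
		 */
		cmpxchg64(sptep, old_spte, new_spte);
	}

A plain store would not be safe here, since it could overwrite a
concurrent hardware or software update with a stale value.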
Signed-off-by: Yu Zhao
---
 arch/x86/include/asm/kvm_host.h | 27 ++++++++++++++++++++++
 arch/x86/kvm/mmu/spte.h         | 12 ----------
 arch/x86/kvm/mmu/tdp_mmu.c      | 41 +++++++++++++++++++++++++++++
 3 files changed, 68 insertions(+), 12 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 6aaae18f1854..d2995c9e8f07 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1367,6 +1367,12 @@ struct kvm_arch {
 	 * the MMU lock in read mode + the tdp_mmu_pages_lock or
 	 * the MMU lock in write mode
 	 *
+	 * kvm_arch_test_clear_young() is a special case. It relies on two
+	 * techniques, RCU and cmpxchg, to safely test and clear the accessed
+	 * bit without taking the MMU lock. The former protects KVM page tables
+	 * from being freed while the latter clears the accessed bit atomically
+	 * against both the hardware and other software page table walkers.
+	 *
 	 * Roots will remain in the list until their tdp_mmu_root_count
 	 * drops to zero, at which point the thread that decremented the
 	 * count to zero should removed the root from the list and clean
@@ -2171,4 +2177,25 @@ int memslot_rmap_alloc(struct kvm_memory_slot *slot, unsigned long npages);
 			 KVM_X86_QUIRK_FIX_HYPERCALL_INSN |	\
 			 KVM_X86_QUIRK_MWAIT_NEVER_UD_FAULTS)
 
+extern u64 __read_mostly shadow_accessed_mask;
+
+/*
+ * Returns true if A/D bits are supported in hardware and are enabled by KVM.
+ * When enabled, KVM uses A/D bits for all non-nested MMUs. Because L1 can
+ * disable A/D bits in EPTP12, SP and SPTE variants are needed to handle the
+ * scenario where KVM is using A/D bits for L1, but not L2.
+ */
+static inline bool kvm_ad_enabled(void)
+{
+	return shadow_accessed_mask;
+}
+
+/* see the comments on the generic kvm_arch_has_test_clear_young() */
+#define kvm_arch_has_test_clear_young kvm_arch_has_test_clear_young
+static inline bool kvm_arch_has_test_clear_young(void)
+{
+	return IS_ENABLED(CONFIG_KVM) && IS_ENABLED(CONFIG_X86_64) &&
+	       (!IS_REACHABLE(CONFIG_KVM) || (kvm_ad_enabled() && tdp_enabled));
+}
+
 #endif /* _ASM_X86_KVM_HOST_H */
diff --git a/arch/x86/kvm/mmu/spte.h b/arch/x86/kvm/mmu/spte.h
index 6f54dc9409c9..0dc7fed1f3fd 100644
--- a/arch/x86/kvm/mmu/spte.h
+++ b/arch/x86/kvm/mmu/spte.h
@@ -153,7 +153,6 @@ extern u64 __read_mostly shadow_mmu_writable_mask;
 extern u64 __read_mostly shadow_nx_mask;
 extern u64 __read_mostly shadow_x_mask; /* mutual exclusive with nx_mask */
 extern u64 __read_mostly shadow_user_mask;
-extern u64 __read_mostly shadow_accessed_mask;
 extern u64 __read_mostly shadow_dirty_mask;
 extern u64 __read_mostly shadow_mmio_value;
 extern u64 __read_mostly shadow_mmio_mask;
@@ -247,17 +246,6 @@ static inline bool is_shadow_present_pte(u64 pte)
 	return !!(pte & SPTE_MMU_PRESENT_MASK);
 }
 
-/*
- * Returns true if A/D bits are supported in hardware and are enabled by KVM.
- * When enabled, KVM uses A/D bits for all non-nested MMUs. Because L1 can
- * disable A/D bits in EPTP12, SP and SPTE variants are needed to handle the
- * scenario where KVM is using A/D bits for L1, but not L2.
- */
-static inline bool kvm_ad_enabled(void)
-{
-	return !!shadow_accessed_mask;
-}
-
 static inline bool sp_ad_disabled(struct kvm_mmu_page *sp)
 {
 	return sp->role.ad_disabled;
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index d6df38d371a0..9028e09f1aab 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -1309,6 +1309,47 @@ bool kvm_tdp_mmu_age_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
 	return kvm_tdp_mmu_handle_gfn(kvm, range, age_gfn_range);
 }
 
+bool kvm_arch_test_clear_young(struct kvm *kvm, struct kvm_gfn_range *range,
+			       gfn_t lsb_gfn, unsigned long *bitmap)
+{
+	struct kvm_mmu_page *root;
+
+	if (WARN_ON_ONCE(!kvm_arch_has_test_clear_young()))
+		return false;
+
+	if (kvm_memslots_have_rmaps(kvm))
+		return false;
+
+	/* see the comments on kvm_arch->tdp_mmu_roots */
+	rcu_read_lock();
+
+	list_for_each_entry_rcu(root, &kvm->arch.tdp_mmu_roots, link) {
+		struct tdp_iter iter;
+
+		if (kvm_mmu_page_as_id(root) != range->slot->as_id)
+			continue;
+
+		tdp_root_for_each_leaf_pte(iter, root, range->start, range->end) {
+			u64 *sptep = rcu_dereference(iter.sptep);
+			u64 new_spte = iter.old_spte & ~shadow_accessed_mask;
+
+			VM_WARN_ON_ONCE(!page_count(virt_to_page(sptep)));
+			VM_WARN_ON_ONCE(iter.gfn < range->start || iter.gfn >= range->end);
+
+			if (new_spte == iter.old_spte)
+				continue;
+
+			/* see the comments on the generic kvm_arch_has_test_clear_young() */
+			if (__test_and_change_bit(lsb_gfn - iter.gfn, bitmap))
+				cmpxchg64(sptep, iter.old_spte, new_spte);
+		}
+	}
+
+	rcu_read_unlock();
+
+	return true;
+}
+
 static bool test_age_gfn(struct kvm *kvm, struct tdp_iter *iter,
 			 struct kvm_gfn_range *range)
 {

From patchwork Fri Feb 17 04:12:28 2023
mail-yb1-xb4a.google.com with SMTP id h14-20020a258a8e000000b00827819f87e5so4336511ybl.0 for ; Thu, 16 Feb 2023 20:12:41 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20210112; h=cc:to:from:subject:references:mime-version:message-id:in-reply-to :date:from:to:cc:subject:date:message-id:reply-to; bh=6v2BS1UfItgh1K5tfej0Iq92+WtqqMzCx/cGSgmoiSg=; b=gziB1BrvY9BHZS7WfBeSnKEbTJsWIJ/fFr2/PiFmwtTyfe68OsIjhQ61dZpnb87SkA JE0OnpGDz5FxpkIlNn0nA2tfqnhJ3b3cHbwE9Flj2oR041XOL5Rmeeme4csAtb6QB8A7 1dwd6zfVUE8cwmDds621GWHJkAxZcDvbF53xftCwvOocB2lzYTkdoHlckg6k5SWObsu/ JnAmp96+xPUNaXs4fnLZH6LMMVGk9cwNJZuiX9VYpp8GDr7rFYU2r6amNuFL9lTtFwOC 0QGI/m4BCLP0eKmz8qQupuOfcUmAy+lDPbfLsyqst4coAeQie4wxcpaUJpvs3jf5QJc4 t1MA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=cc:to:from:subject:references:mime-version:message-id:in-reply-to :date:x-gm-message-state:from:to:cc:subject:date:message-id:reply-to; bh=6v2BS1UfItgh1K5tfej0Iq92+WtqqMzCx/cGSgmoiSg=; b=DyUu9eDJvQTpen9ytGlB6LnDdMVGMZPDokGevEzfvDD7jqPPXLhPaWjxRz4qVhCZI2 d5db3H/YSnkvC9fT/YSrRBuVvomx//VdThaRBP5tlwExUqCkGvurKwXcRjdyEWwsMVXM jU1VhKfNKzbr0nGXuCJ5BfqcGDXx2LGEAHYSav5nvM8TkRBRfCWYOCQCEQPHbEPmpJWQ szM+xQ3o2tiArsTNFqCBvbd2V6mDiyGr5fpk0QwnXJIJ3ChUqbLpBF5J8DG0rpV0rWb1 AXUO4hAvxvZAr6jMw+CvnZe548Oa2LM+SBDRtPdH406hyufB7CiXXxAauaHQyDMCrdCS nkDw== X-Gm-Message-State: AO0yUKWUaxBUR2TOYodbyIzqYp3YDcgfosVbLHZBpDL+LiONN2KPRGyT fG3oexzmoTBOYGBFzOUHQB6IFq6YxI4= X-Google-Smtp-Source: AK7set/Pxuk29uHWJwH5Tc+hpFqe2YNk4NRZV8/+4ud1blKjBerJFMDRpndxfsiGGJtM3eObeOVv77vpkZw= X-Received: from yuzhao.bld.corp.google.com ([2620:15c:183:200:6fb3:61e:d31f:1ad3]) (user=yuzhao job=sendgmr) by 2002:a25:9c83:0:b0:93c:785a:ba76 with SMTP id y3-20020a259c83000000b0093c785aba76mr1106910ybo.617.1676607160685; Thu, 16 Feb 2023 20:12:40 -0800 (PST) Date: Thu, 16 Feb 2023 21:12:28 -0700 In-Reply-To: <20230217041230.2417228-1-yuzhao@google.com> Message-Id: <20230217041230.2417228-4-yuzhao@google.com> Mime-Version: 1.0 References: <20230217041230.2417228-1-yuzhao@google.com> X-Mailer: git-send-email 2.39.2.637.g21b0678d19-goog Subject: [PATCH mm-unstable v1 3/5] kvm/arm64: add kvm_arch_test_clear_young() From: Yu Zhao To: Andrew Morton , Paolo Bonzini Cc: Jonathan Corbet , Michael Larabel , kvmarm@lists.linux.dev, kvm@vger.kernel.org, linux-arm-kernel@lists.infradead.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, linuxppc-dev@lists.ozlabs.org, x86@kernel.org, linux-mm@google.com, Yu Zhao X-CRM114-Version: 20100106-BlameMichelson ( TRE 0.8.0 (BSD) ) MR-646709E3 X-CRM114-CacheID: sfid-20230216_201242_084251_D24B56E0 X-CRM114-Status: GOOD ( 25.99 ) X-BeenThere: linux-arm-kernel@lists.infradead.org X-Mailman-Version: 2.1.34 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Sender: "linux-arm-kernel" Errors-To: linux-arm-kernel-bounces+linux-arm-kernel=archiver.kernel.org@lists.infradead.org This patch adds kvm_arch_test_clear_young() for the vast majority of VMs that are not pKVM and run on hardware that sets the accessed bit in KVM page tables. It relies on two techniques, RCU and cmpxchg, to safely test and clear the accessed bit without taking the MMU lock. The former protects KVM page tables from being freed while the latter clears the accessed bit atomically against both the hardware and other software page table walkers. 
Signed-off-by: Yu Zhao
---
 arch/arm64/include/asm/kvm_host.h       |  7 +++
 arch/arm64/include/asm/kvm_pgtable.h    |  8 +++
 arch/arm64/include/asm/stage2_pgtable.h | 43 ++++++++++++++
 arch/arm64/kvm/arm.c                    |  1 +
 arch/arm64/kvm/hyp/pgtable.c            | 51 ++--------------
 arch/arm64/kvm/mmu.c                    | 77 ++++++++++++++++++++++++-
 6 files changed, 141 insertions(+), 46 deletions(-)

diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
index 35a159d131b5..572bcd321586 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -1031,4 +1031,11 @@ static inline void kvm_hyp_reserve(void) { }
 void kvm_arm_vcpu_power_off(struct kvm_vcpu *vcpu);
 bool kvm_arm_vcpu_stopped(struct kvm_vcpu *vcpu);
 
+/* see the comments on the generic kvm_arch_has_test_clear_young() */
+#define kvm_arch_has_test_clear_young kvm_arch_has_test_clear_young
+static inline bool kvm_arch_has_test_clear_young(void)
+{
+	return IS_ENABLED(CONFIG_KVM) && cpu_has_hw_af() && !is_protected_kvm_enabled();
+}
+
 #endif /* __ARM64_KVM_HOST_H__ */
diff --git a/arch/arm64/include/asm/kvm_pgtable.h b/arch/arm64/include/asm/kvm_pgtable.h
index 63f81b27a4e3..8c9a04388c88 100644
--- a/arch/arm64/include/asm/kvm_pgtable.h
+++ b/arch/arm64/include/asm/kvm_pgtable.h
@@ -105,6 +105,7 @@ static inline bool kvm_level_supports_block_mapping(u32 level)
  * @put_page:			Decrement the refcount on a page. When the
  *				refcount reaches 0 the page is automatically
  *				freed.
+ * @put_page_rcu:		RCU variant of put_page().
  * @page_count:			Return the refcount of a page.
  * @phys_to_virt:		Convert a physical address into a virtual
  *				address mapped in the current context.
@@ -122,6 +123,7 @@ struct kvm_pgtable_mm_ops {
 	void		(*free_removed_table)(void *addr, u32 level);
 	void		(*get_page)(void *addr);
 	void		(*put_page)(void *addr);
+	void		(*put_page_rcu)(void *addr);
 	int		(*page_count)(void *addr);
 	void*		(*phys_to_virt)(phys_addr_t phys);
 	phys_addr_t	(*virt_to_phys)(void *addr);
@@ -188,6 +190,12 @@ typedef bool (*kvm_pgtable_force_pte_cb_t)(u64 addr, u64 end,
  *					children.
  * @KVM_PGTABLE_WALK_SHARED:		Indicates the page-tables may be shared
  *					with other software walkers.
+ *
+ * kvm_arch_test_clear_young() is a special case. It relies on two
+ * techniques, RCU and cmpxchg, to safely test and clear the accessed
+ * bit without taking the MMU lock. The former protects KVM page tables
+ * from being freed while the latter clears the accessed bit atomically
+ * against both the hardware and other software page table walkers.
 */
 enum kvm_pgtable_walk_flags {
 	KVM_PGTABLE_WALK_LEAF			= BIT(0),
diff --git a/arch/arm64/include/asm/stage2_pgtable.h b/arch/arm64/include/asm/stage2_pgtable.h
index c8dca8ae359c..350437661d4b 100644
--- a/arch/arm64/include/asm/stage2_pgtable.h
+++ b/arch/arm64/include/asm/stage2_pgtable.h
@@ -30,4 +30,47 @@
  */
 #define kvm_mmu_cache_min_pages(kvm)	(kvm_stage2_levels(kvm) - 1)
 
+#define KVM_PTE_TYPE			BIT(1)
+#define KVM_PTE_TYPE_BLOCK		0
+#define KVM_PTE_TYPE_PAGE		1
+#define KVM_PTE_TYPE_TABLE		1
+
+#define KVM_PTE_LEAF_ATTR_LO		GENMASK(11, 2)
+
+#define KVM_PTE_LEAF_ATTR_LO_S1_ATTRIDX	GENMASK(4, 2)
+#define KVM_PTE_LEAF_ATTR_LO_S1_AP	GENMASK(7, 6)
+#define KVM_PTE_LEAF_ATTR_LO_S1_AP_RO	3
+#define KVM_PTE_LEAF_ATTR_LO_S1_AP_RW	1
+#define KVM_PTE_LEAF_ATTR_LO_S1_SH	GENMASK(9, 8)
+#define KVM_PTE_LEAF_ATTR_LO_S1_SH_IS	3
+#define KVM_PTE_LEAF_ATTR_LO_S1_AF	BIT(10)
+
+#define KVM_PTE_LEAF_ATTR_LO_S2_MEMATTR	GENMASK(5, 2)
+#define KVM_PTE_LEAF_ATTR_LO_S2_S2AP_R	BIT(6)
+#define KVM_PTE_LEAF_ATTR_LO_S2_S2AP_W	BIT(7)
+#define KVM_PTE_LEAF_ATTR_LO_S2_SH	GENMASK(9, 8)
+#define KVM_PTE_LEAF_ATTR_LO_S2_SH_IS	3
+#define KVM_PTE_LEAF_ATTR_LO_S2_AF	BIT(10)
+
+#define KVM_PTE_LEAF_ATTR_HI		GENMASK(63, 51)
+
+#define KVM_PTE_LEAF_ATTR_HI_SW		GENMASK(58, 55)
+
+#define KVM_PTE_LEAF_ATTR_HI_S1_XN	BIT(54)
+
+#define KVM_PTE_LEAF_ATTR_HI_S2_XN	BIT(54)
+
+#define KVM_PTE_LEAF_ATTR_S2_PERMS	(KVM_PTE_LEAF_ATTR_LO_S2_S2AP_R | \
+					 KVM_PTE_LEAF_ATTR_LO_S2_S2AP_W | \
+					 KVM_PTE_LEAF_ATTR_HI_S2_XN)
+
+#define KVM_INVALID_PTE_OWNER_MASK	GENMASK(9, 2)
+#define KVM_MAX_OWNER_ID		1
+
+/*
+ * Used to indicate a pte for which a 'break-before-make' sequence is in
+ * progress.
+ */
+#define KVM_INVALID_PTE_LOCKED		BIT(10)
+
 #endif	/* __ARM64_S2_PGTABLE_H_ */
diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
index 9c5573bc4614..6770bc47f5c9 100644
--- a/arch/arm64/kvm/arm.c
+++ b/arch/arm64/kvm/arm.c
@@ -191,6 +191,7 @@ vm_fault_t kvm_arch_vcpu_fault(struct kvm_vcpu *vcpu, struct vm_fault *vmf)
  */
 void kvm_arch_destroy_vm(struct kvm *kvm)
 {
+	kvm_free_stage2_pgd(&kvm->arch.mmu);
 	bitmap_free(kvm->arch.pmu_filter);
 	free_cpumask_var(kvm->arch.supported_cpus);
 
diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
index b11cf2c618a6..8d65ee4767f1 100644
--- a/arch/arm64/kvm/hyp/pgtable.c
+++ b/arch/arm64/kvm/hyp/pgtable.c
@@ -12,49 +12,6 @@
 #include
 
-#define KVM_PTE_TYPE			BIT(1)
-#define KVM_PTE_TYPE_BLOCK		0
-#define KVM_PTE_TYPE_PAGE		1
-#define KVM_PTE_TYPE_TABLE		1
-
-#define KVM_PTE_LEAF_ATTR_LO		GENMASK(11, 2)
-
-#define KVM_PTE_LEAF_ATTR_LO_S1_ATTRIDX	GENMASK(4, 2)
-#define KVM_PTE_LEAF_ATTR_LO_S1_AP	GENMASK(7, 6)
-#define KVM_PTE_LEAF_ATTR_LO_S1_AP_RO	3
-#define KVM_PTE_LEAF_ATTR_LO_S1_AP_RW	1
-#define KVM_PTE_LEAF_ATTR_LO_S1_SH	GENMASK(9, 8)
-#define KVM_PTE_LEAF_ATTR_LO_S1_SH_IS	3
-#define KVM_PTE_LEAF_ATTR_LO_S1_AF	BIT(10)
-
-#define KVM_PTE_LEAF_ATTR_LO_S2_MEMATTR	GENMASK(5, 2)
-#define KVM_PTE_LEAF_ATTR_LO_S2_S2AP_R	BIT(6)
-#define KVM_PTE_LEAF_ATTR_LO_S2_S2AP_W	BIT(7)
-#define KVM_PTE_LEAF_ATTR_LO_S2_SH	GENMASK(9, 8)
-#define KVM_PTE_LEAF_ATTR_LO_S2_SH_IS	3
-#define KVM_PTE_LEAF_ATTR_LO_S2_AF	BIT(10)
-
-#define KVM_PTE_LEAF_ATTR_HI		GENMASK(63, 51)
-
-#define KVM_PTE_LEAF_ATTR_HI_SW		GENMASK(58, 55)
-
-#define KVM_PTE_LEAF_ATTR_HI_S1_XN	BIT(54)
-
-#define KVM_PTE_LEAF_ATTR_HI_S2_XN	BIT(54)
-
-#define KVM_PTE_LEAF_ATTR_S2_PERMS	(KVM_PTE_LEAF_ATTR_LO_S2_S2AP_R | \
-					 KVM_PTE_LEAF_ATTR_LO_S2_S2AP_W | \
-					 KVM_PTE_LEAF_ATTR_HI_S2_XN)
-
-#define KVM_INVALID_PTE_OWNER_MASK	GENMASK(9, 2)
-#define KVM_MAX_OWNER_ID		1
-
-/*
- * Used to indicate a pte for which a 'break-before-make' sequence is in
- * progress.
- */
-#define KVM_INVALID_PTE_LOCKED		BIT(10)
-
 struct kvm_pgtable_walk_data {
 	struct kvm_pgtable_walker	*walker;
 
@@ -994,8 +951,12 @@ static int stage2_unmap_walker(const struct kvm_pgtable_visit_ctx *ctx,
 		mm_ops->dcache_clean_inval_poc(kvm_pte_follow(ctx->old, mm_ops),
 					       kvm_granule_size(ctx->level));
 
-	if (childp)
-		mm_ops->put_page(childp);
+	if (childp) {
+		if (mm_ops->put_page_rcu)
+			mm_ops->put_page_rcu(childp);
+		else
+			mm_ops->put_page(childp);
+	}
 
 	return 0;
 }
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index a3ee3b605c9b..761fffc788f5 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -171,6 +171,21 @@ static int kvm_host_page_count(void *addr)
 	return page_count(virt_to_page(addr));
 }
 
+static void kvm_s2_rcu_put_page(struct rcu_head *head)
+{
+	put_page(container_of(head, struct page, rcu_head));
+}
+
+static void kvm_s2_put_page_rcu(void *addr)
+{
+	struct page *page = virt_to_page(addr);
+
+	if (kvm_host_page_count(addr) == 1)
+		kvm_account_pgtable_pages(addr, -1);
+
+	call_rcu(&page->rcu_head, kvm_s2_rcu_put_page);
+}
+
 static phys_addr_t kvm_host_pa(void *addr)
 {
 	return __pa(addr);
@@ -684,6 +699,7 @@ static struct kvm_pgtable_mm_ops kvm_s2_mm_ops = {
 	.free_removed_table	= stage2_free_removed_table,
 	.get_page		= kvm_host_get_page,
 	.put_page		= kvm_s2_put_page,
+	.put_page_rcu		= kvm_s2_put_page_rcu,
 	.page_count		= kvm_host_page_count,
 	.phys_to_virt		= kvm_host_va,
 	.virt_to_phys		= kvm_host_pa,
@@ -1624,6 +1640,66 @@ bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
 	return pte_valid(pte) && pte_young(pte);
 }
 
+struct test_clear_young_arg {
+	struct kvm_gfn_range *range;
+	gfn_t lsb_gfn;
+	unsigned long *bitmap;
+};
+
+static int stage2_test_clear_young(const struct kvm_pgtable_visit_ctx *ctx,
+				   enum kvm_pgtable_walk_flags flags)
+{
+	struct test_clear_young_arg *arg = ctx->arg;
+	gfn_t gfn = ctx->addr / PAGE_SIZE;
+	kvm_pte_t new = ctx->old & ~KVM_PTE_LEAF_ATTR_LO_S2_AF;
+
+	VM_WARN_ON_ONCE(!page_count(virt_to_page(ctx->ptep)));
+	VM_WARN_ON_ONCE(gfn < arg->range->start || gfn >= arg->range->end);
+
+	if (!kvm_pte_valid(new))
+		return 0;
+
+	if (new == ctx->old)
+		return 0;
+
+	/* see the comments on the generic kvm_arch_has_test_clear_young() */
+	if (__test_and_change_bit(arg->lsb_gfn - gfn, arg->bitmap))
+		cmpxchg64(ctx->ptep, ctx->old, new);
+
+	return 0;
+}
+
+bool kvm_arch_test_clear_young(struct kvm *kvm, struct kvm_gfn_range *range,
+			       gfn_t lsb_gfn, unsigned long *bitmap)
+{
+	u64 start = range->start * PAGE_SIZE;
+	u64 end = range->end * PAGE_SIZE;
+	struct test_clear_young_arg arg = {
+		.range	= range,
+		.lsb_gfn = lsb_gfn,
+		.bitmap	= bitmap,
+	};
+	struct kvm_pgtable_walker walker = {
+		.cb	= stage2_test_clear_young,
+		.arg	= &arg,
+		.flags	= KVM_PGTABLE_WALK_LEAF,
+	};
+
+	BUILD_BUG_ON(is_hyp_code());
+
+	if (WARN_ON_ONCE(!kvm_arch_has_test_clear_young()))
+		return false;
+
+	/* see the comments on kvm_pgtable_walk_flags */
+	rcu_read_lock();
+
+	kvm_pgtable_walk(kvm->arch.mmu.pgt, start, end - start, &walker);
+
+	rcu_read_unlock();
+
+	return true;
+}
+
 bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
 {
 	if (!kvm->arch.mmu.pgt)
@@ -1848,7 +1924,6 @@ void kvm_arch_memslots_updated(struct kvm *kvm, u64 gen)
 
 void kvm_arch_flush_shadow_all(struct kvm *kvm)
 {
-	kvm_free_stage2_pgd(&kvm->arch.mmu);
 }
 
 void kvm_arch_flush_shadow_memslot(struct kvm *kvm,

From patchwork Fri Feb 17 04:12:29 2023
From: Yu Zhao
Date: Thu, 16 Feb 2023 21:12:29 -0700
Subject: [PATCH mm-unstable v1 4/5] kvm/powerpc: add kvm_arch_test_clear_young()
To: Andrew Morton, Paolo Bonzini
Cc: Jonathan Corbet, Michael Larabel, kvmarm@lists.linux.dev,
    kvm@vger.kernel.org, linux-arm-kernel@lists.infradead.org,
    linux-kernel@vger.kernel.org, linux-mm@kvack.org,
    linuxppc-dev@lists.ozlabs.org, x86@kernel.org, linux-mm@google.com
Message-Id: <20230217041230.2417228-5-yuzhao@google.com>
In-Reply-To: <20230217041230.2417228-1-yuzhao@google.com>

This patch adds kvm_arch_test_clear_young() for the vast majority of
VMs that are not nested and run on hardware with the Radix MMU enabled.
It relies on two techniques, RCU and cmpxchg, to safely test and clear
the accessed bit without taking the MMU lock. The former protects KVM
page tables from being freed while the latter clears the accessed bit
atomically against both the hardware and other software page table
walkers.
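Since the guest MMU can be switched between HPT and radix at runtime,
the lock-free walker also depends on a publish/consume ordering, which
the comments in kvmhv_test_clear_young() below diagram. As a rough
standalone sketch (publish_radix() and radix_usable() are illustrative
names, not functions added by this patch):

	/* writer side: see kvmppc_switch_mmu_to_radix() */
	static void publish_radix(struct kvm *kvm)
	{
		kvmppc_init_vm_radix(kvm);	/* build the radix tables first */
		smp_wmb();			/* order table init before the flag */
		kvm->arch.radix = 1;
	}

	/* reader side: the walker, running under rcu_read_lock() */
	static bool radix_usable(struct kvm *kvm)
	{
		if (!kvm_is_radix(kvm))
			return false;
		smp_rmb();			/* order flag read before table use */
		return true;			/* kvm->arch.pgtable is safe to walk */
	}

In the other direction, kvmppc_switch_mmu_to_hpt() calls
synchronize_rcu() before kvmppc_free_radix(), and the PTE/PMD caches
are marked SLAB_TYPESAFE_BY_RCU, so a racing walker may observe a
recycled page table but never a freed one.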
Signed-off-by: Yu Zhao
---
 arch/powerpc/include/asm/kvm_host.h    | 18 ++++++
 arch/powerpc/include/asm/kvm_ppc.h     | 14 +----
 arch/powerpc/kvm/book3s.c              |  7 +++
 arch/powerpc/kvm/book3s.h              |  2 +
 arch/powerpc/kvm/book3s_64_mmu_radix.c | 78 +++++++++++++++++++++++++-
 arch/powerpc/kvm/book3s_hv.c           | 10 ++--
 6 files changed, 110 insertions(+), 19 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
index caea15dcb91d..996850029ce0 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -886,4 +886,22 @@ static inline void kvm_arch_exit(void) {}
 static inline void kvm_arch_vcpu_blocking(struct kvm_vcpu *vcpu) {}
 static inline void kvm_arch_vcpu_unblocking(struct kvm_vcpu *vcpu) {}
 
+static inline int kvmppc_radix_possible(void)
+{
+	return cpu_has_feature(CPU_FTR_ARCH_300) && radix_enabled();
+}
+
+static inline bool kvmhv_on_pseries(void)
+{
+	return IS_ENABLED(CONFIG_PPC_PSERIES) && !cpu_has_feature(CPU_FTR_HVMODE);
+}
+
+/* see the comments on the generic kvm_arch_has_test_clear_young() */
+#define kvm_arch_has_test_clear_young kvm_arch_has_test_clear_young
+static inline bool kvm_arch_has_test_clear_young(void)
+{
+	return IS_ENABLED(CONFIG_KVM) && IS_ENABLED(CONFIG_KVM_BOOK3S_HV_POSSIBLE) &&
+	       kvmppc_radix_possible() && !kvmhv_on_pseries();
+}
+
 #endif /* __POWERPC_KVM_HOST_H__ */
diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h
index eae9619b6190..0bb772fc12b1 100644
--- a/arch/powerpc/include/asm/kvm_ppc.h
+++ b/arch/powerpc/include/asm/kvm_ppc.h
@@ -277,6 +277,8 @@ struct kvmppc_ops {
 	bool (*unmap_gfn_range)(struct kvm *kvm, struct kvm_gfn_range *range);
 	bool (*age_gfn)(struct kvm *kvm, struct kvm_gfn_range *range);
 	bool (*test_age_gfn)(struct kvm *kvm, struct kvm_gfn_range *range);
+	bool (*test_clear_young)(struct kvm *kvm, struct kvm_gfn_range *range,
+				 gfn_t lsb_gfn, unsigned long *bitmap);
 	bool (*set_spte_gfn)(struct kvm *kvm, struct kvm_gfn_range *range);
 	void (*free_memslot)(struct kvm_memory_slot *slot);
 	int (*init_vm)(struct kvm *kvm);
@@ -580,18 +582,6 @@ static inline bool kvm_hv_mode_active(void)		{ return false; }
 
 #endif
 
-#ifdef CONFIG_PPC_PSERIES
-static inline bool kvmhv_on_pseries(void)
-{
-	return !cpu_has_feature(CPU_FTR_HVMODE);
-}
-#else
-static inline bool kvmhv_on_pseries(void)
-{
-	return false;
-}
-#endif
-
 #ifdef CONFIG_KVM_XICS
 static inline int kvmppc_xics_enabled(struct kvm_vcpu *vcpu)
 {
diff --git a/arch/powerpc/kvm/book3s.c b/arch/powerpc/kvm/book3s.c
index 6d525285dbe8..f4cf330e3e81 100644
--- a/arch/powerpc/kvm/book3s.c
+++ b/arch/powerpc/kvm/book3s.c
@@ -877,6 +877,13 @@ bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
 	return kvm->arch.kvm_ops->test_age_gfn(kvm, range);
 }
 
+bool kvm_arch_test_clear_young(struct kvm *kvm, struct kvm_gfn_range *range,
+			       gfn_t lsb_gfn, unsigned long *bitmap)
+{
+	return kvm->arch.kvm_ops->test_clear_young &&
+	       kvm->arch.kvm_ops->test_clear_young(kvm, range, lsb_gfn, bitmap);
+}
+
 bool kvm_set_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
 {
 	return kvm->arch.kvm_ops->set_spte_gfn(kvm, range);
diff --git a/arch/powerpc/kvm/book3s.h b/arch/powerpc/kvm/book3s.h
index 58391b4b32ed..fe9cac423817 100644
--- a/arch/powerpc/kvm/book3s.h
+++ b/arch/powerpc/kvm/book3s.h
@@ -12,6 +12,8 @@ extern void kvmppc_core_flush_memslot_hv(struct kvm *kvm,
 extern bool kvm_unmap_gfn_range_hv(struct kvm *kvm, struct kvm_gfn_range *range);
 extern bool kvm_age_gfn_hv(struct kvm *kvm, struct kvm_gfn_range *range);
 extern bool kvm_test_age_gfn_hv(struct kvm *kvm, struct kvm_gfn_range *range);
+extern bool kvmhv_test_clear_young(struct kvm *kvm, struct kvm_gfn_range *range,
+				   gfn_t lsb_gfn, unsigned long *bitmap);
 extern bool kvm_set_spte_gfn_hv(struct kvm *kvm, struct kvm_gfn_range *range);
 
 extern int kvmppc_mmu_init_pr(struct kvm_vcpu *vcpu);
diff --git a/arch/powerpc/kvm/book3s_64_mmu_radix.c b/arch/powerpc/kvm/book3s_64_mmu_radix.c
index 9d3743ca16d5..8476646c554c 100644
--- a/arch/powerpc/kvm/book3s_64_mmu_radix.c
+++ b/arch/powerpc/kvm/book3s_64_mmu_radix.c
@@ -1083,6 +1083,78 @@ bool kvm_test_age_radix(struct kvm *kvm, struct kvm_memory_slot *memslot,
 	return ref;
 }
 
+bool kvmhv_test_clear_young(struct kvm *kvm, struct kvm_gfn_range *range,
+			    gfn_t lsb_gfn, unsigned long *bitmap)
+{
+	bool success;
+	gfn_t gfn = range->start;
+
+	if (WARN_ON_ONCE(!kvm_arch_has_test_clear_young()))
+		return false;
+
+	/*
+	 * This function relies on two techniques, RCU and cmpxchg, to safely
+	 * test and clear the accessed bit without taking the MMU lock. The
+	 * former protects KVM page tables from being freed while the latter
+	 * clears the accessed bit atomically against both the hardware and
+	 * other software page table walkers.
+	 */
+	rcu_read_lock();
+
+	success = kvm_is_radix(kvm);
+	if (!success)
+		goto unlock;
+
+	/*
+	 * case 1:  this function          kvmppc_switch_mmu_to_hpt()
+	 *
+	 *          rcu_read_lock()
+	 *          test kvm_is_radix()    kvm->arch.radix = 0
+	 *          use kvm->arch.pgtable
+	 *          rcu_read_unlock()
+	 *                                 synchronize_rcu()
+	 *                                 kvmppc_free_radix()
+	 *
+	 *
+	 * case 2:  this function          kvmppc_switch_mmu_to_radix()
+	 *
+	 *                                 kvmppc_init_vm_radix()
+	 *                                 smp_wmb()
+	 *          test kvm_is_radix()    kvm->arch.radix = 1
+	 *          smp_rmb()
+	 *          use kvm->arch.pgtable
+	 */
+	smp_rmb();
+
+	while (gfn < range->end) {
+		pte_t *ptep;
+		pte_t old, new;
+		unsigned int shift;
+
+		ptep = find_kvm_secondary_pte_unlocked(kvm, gfn * PAGE_SIZE, &shift);
+		if (!ptep)
+			goto next;
+
+		VM_WARN_ON_ONCE(!page_count(virt_to_page(ptep)));
+
+		old = READ_ONCE(*ptep);
+		if (!pte_present(old) || !pte_young(old))
+			goto next;
+
+		new = pte_mkold(old);
+
+		/* see the comments on the generic kvm_arch_has_test_clear_young() */
+		if (__test_and_change_bit(lsb_gfn - gfn, bitmap))
+			pte_xchg(ptep, old, new);
+next:
+		gfn += shift ? BIT(shift - PAGE_SHIFT) : 1;
+	}
+unlock:
+	rcu_read_unlock();
+
+	return success;
+}
+
 /* Returns the number of PAGE_SIZE pages that are dirty */
 static int kvm_radix_test_clear_dirty(struct kvm *kvm,
 				      struct kvm_memory_slot *memslot, int pagenum)
@@ -1464,13 +1536,15 @@ int kvmppc_radix_init(void)
 {
 	unsigned long size = sizeof(void *) << RADIX_PTE_INDEX_SIZE;
 
-	kvm_pte_cache = kmem_cache_create("kvm-pte", size, size, 0, pte_ctor);
+	kvm_pte_cache = kmem_cache_create("kvm-pte", size, size,
+					  SLAB_TYPESAFE_BY_RCU, pte_ctor);
 	if (!kvm_pte_cache)
 		return -ENOMEM;
 
 	size = sizeof(void *) << RADIX_PMD_INDEX_SIZE;
 
-	kvm_pmd_cache = kmem_cache_create("kvm-pmd", size, size, 0, pmd_ctor);
+	kvm_pmd_cache = kmem_cache_create("kvm-pmd", size, size,
+					  SLAB_TYPESAFE_BY_RCU, pmd_ctor);
 	if (!kvm_pmd_cache) {
 		kmem_cache_destroy(kvm_pte_cache);
 		return -ENOMEM;
diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 6ba68dd6190b..17b415661282 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -5242,6 +5242,8 @@ int kvmppc_switch_mmu_to_hpt(struct kvm *kvm)
 	spin_lock(&kvm->mmu_lock);
 	kvm->arch.radix = 0;
 	spin_unlock(&kvm->mmu_lock);
+	/* see the comments in kvmhv_test_clear_young() */
+	synchronize_rcu();
 	kvmppc_free_radix(kvm);
 
 	lpcr = LPCR_VPM1;
@@ -5266,6 +5268,8 @@ int kvmppc_switch_mmu_to_radix(struct kvm *kvm)
 	if (err)
 		return err;
 	kvmppc_rmap_reset(kvm);
+	/* see the comments in kvmhv_test_clear_young() */
+	smp_wmb();
 	/* Mutual exclusion with kvm_unmap_gfn_range etc. */
 	spin_lock(&kvm->mmu_lock);
 	kvm->arch.radix = 1;
@@ -6165,6 +6169,7 @@ static struct kvmppc_ops kvm_ops_hv = {
 	.unmap_gfn_range = kvm_unmap_gfn_range_hv,
 	.age_gfn = kvm_age_gfn_hv,
 	.test_age_gfn = kvm_test_age_gfn_hv,
+	.test_clear_young = kvmhv_test_clear_young,
 	.set_spte_gfn = kvm_set_spte_gfn_hv,
 	.free_memslot = kvmppc_core_free_memslot_hv,
 	.init_vm = kvmppc_core_init_vm_hv,
@@ -6225,11 +6230,6 @@ static int kvm_init_subcore_bitmap(void)
 	return 0;
 }
 
-static int kvmppc_radix_possible(void)
-{
-	return cpu_has_feature(CPU_FTR_ARCH_300) && radix_enabled();
-}
-
 static int kvmppc_book3s_init_hv(void)
 {
 	int r;

From patchwork Fri Feb 17 04:12:30 2023
From: Yu Zhao
Date: Thu, 16 Feb 2023 21:12:30 -0700
Subject: [PATCH mm-unstable v1 5/5] mm: multi-gen LRU: use mmu_notifier_test_clear_young()
To: Andrew Morton, Paolo Bonzini
Cc: Jonathan Corbet, Michael Larabel, kvmarm@lists.linux.dev,
    kvm@vger.kernel.org, linux-arm-kernel@lists.infradead.org,
    linux-kernel@vger.kernel.org, linux-mm@kvack.org,
    linuxppc-dev@lists.ozlabs.org, x86@kernel.org, linux-mm@google.com
Message-Id: <20230217041230.2417228-6-yuzhao@google.com>
In-Reply-To: <20230217041230.2417228-1-yuzhao@google.com>

An existing selftest can quickly demonstrate the effectiveness of this
patch.
On a generic workstation equipped with 128 CPUs and 256GB DRAM:

  $ sudo max_guest_memory_test -c 64 -m 250 -s 250

  MGLRU      run2
  ---------------
  Before    ~600s
  After      ~50s
  Off       ~250s

  kswapd (MGLRU before)
    100.00%  balance_pgdat
      100.00%  shrink_node
        100.00%  shrink_one
          99.97%  try_to_shrink_lruvec
            99.06%  evict_folios
              97.41%  shrink_folio_list
                31.33%  folio_referenced
                  31.06%  rmap_walk_file
                    30.89%  folio_referenced_one
                      20.83%  __mmu_notifier_clear_flush_young
                        20.54%  kvm_mmu_notifier_clear_flush_young
  =>                      19.34%  _raw_write_lock

  kswapd (MGLRU after)
    100.00%  balance_pgdat
      100.00%  shrink_node
        100.00%  shrink_one
          99.97%  try_to_shrink_lruvec
            99.51%  evict_folios
              71.70%  shrink_folio_list
                7.08%  folio_referenced
                  6.78%  rmap_walk_file
                    6.72%  folio_referenced_one
                      5.60%  lru_gen_look_around
  =>                    1.53%  __mmu_notifier_test_clear_young

  kswapd (MGLRU off)
    100.00%  balance_pgdat
      100.00%  shrink_node
        99.92%  shrink_lruvec
          69.95%  shrink_folio_list
            19.35%  folio_referenced
              18.37%  rmap_walk_file
                17.88%  folio_referenced_one
                  13.20%  __mmu_notifier_clear_flush_young
                    11.64%  kvm_mmu_notifier_clear_flush_young
  =>                  9.93%  _raw_write_lock
          26.23%  shrink_active_list
            25.50%  folio_referenced
              25.35%  rmap_walk_file
                25.28%  folio_referenced_one
                  23.87%  __mmu_notifier_clear_flush_young
                    23.69%  kvm_mmu_notifier_clear_flush_young
  =>                  18.98%  _raw_write_lock

Signed-off-by: Yu Zhao
---
 include/linux/mmzone.h |   6 +-
 mm/rmap.c              |   8 +--
 mm/vmscan.c            | 127 ++++++++++++++++++++++++++++++++++++-----
 3 files changed, 121 insertions(+), 20 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 9fb1b03b83b2..ce34b7ea8e4c 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -379,6 +379,7 @@ enum {
 	LRU_GEN_CORE,
 	LRU_GEN_MM_WALK,
 	LRU_GEN_NONLEAF_YOUNG,
+	LRU_GEN_SPTE_WALK,
 	NR_LRU_GEN_CAPS
 };
@@ -485,7 +486,7 @@ struct lru_gen_mm_walk {
 };
 
 void lru_gen_init_lruvec(struct lruvec *lruvec);
-void lru_gen_look_around(struct page_vma_mapped_walk *pvmw);
+bool lru_gen_look_around(struct page_vma_mapped_walk *pvmw);
 
 #ifdef CONFIG_MEMCG
@@ -573,8 +574,9 @@ static inline void lru_gen_init_lruvec(struct lruvec *lruvec)
 {
 }
 
-static inline void lru_gen_look_around(struct page_vma_mapped_walk *pvmw)
+static inline bool lru_gen_look_around(struct page_vma_mapped_walk *pvmw)
 {
+	return false;
 }
 
 #ifdef CONFIG_MEMCG
diff --git a/mm/rmap.c b/mm/rmap.c
index 15ae24585fc4..eb0089f8f112 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -823,12 +823,10 @@ static bool folio_referenced_one(struct folio *folio,
 			return false; /* To break the loop */
 		}
 
-		if (pvmw.pte) {
-			if (lru_gen_enabled() && pte_young(*pvmw.pte)) {
-				lru_gen_look_around(&pvmw);
+		if (lru_gen_enabled() && pvmw.pte) {
+			if (lru_gen_look_around(&pvmw))
 				referenced++;
-			}
-
+		} else if (pvmw.pte) {
 			if (ptep_clear_flush_young_notify(vma, address,
 						pvmw.pte))
 				referenced++;
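With the mm/rmap.c hunk applied, MGLRU folios are aged exclusively through lru_gen_look_around(), which now reports the referenced state itself; only non-MGLRU kernels still take the per-PTE ptep_clear_flush_young_notify() path. A condensed, compilable restatement of that branch structure, with stub types and stub predicates (all hypothetical) standing in for the kernel ones:

```c
#include <stdbool.h>
#include <stddef.h>

struct demo_pvmw { void *pte; };	/* stand-in for page_vma_mapped_walk */

static bool demo_lru_gen_enabled(void) { return true; }	/* stub */
static bool demo_look_around(struct demo_pvmw *w) { return w->pte != NULL; }	/* stub */
static bool demo_clear_flush_young_notify(struct demo_pvmw *w) { return w->pte != NULL; }	/* stub */

/* Post-patch shape of the folio_referenced_one() PTE branch. */
static int demo_referenced_one(struct demo_pvmw *pvmw)
{
	int referenced = 0;

	if (demo_lru_gen_enabled() && pvmw->pte) {
		/* MGLRU: look-around tests and clears young PTEs in bulk,
		 * including KVM PTEs via mmu_notifier_test_clear_young(),
		 * and reports whether this folio was referenced. */
		if (demo_look_around(pvmw))
			referenced++;
	} else if (pvmw->pte) {
		/* Classic LRU: one PTE at a time, with a TLB flush and a
		 * secondary-MMU notification. */
		if (demo_clear_flush_young_notify(pvmw))
			referenced++;
	}

	return referenced;
}
```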
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 9c1c5e8b24b8..d6d69f0baabf 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -57,6 +57,8 @@
 #include <linux/ctype.h>
 #include <linux/debugfs.h>
 #include <linux/khugepaged.h>
+#include <linux/kvm_host.h>
+#include <linux/mmu_notifier.h>
 #include <linux/rculist_nulls.h>
 #include <linux/random.h>
 
@@ -3923,6 +3925,55 @@ static struct folio *get_pfn_folio(unsigned long pfn, struct mem_cgroup *memcg,
 	return folio;
 }
 
+static bool test_spte_young(struct mm_struct *mm, unsigned long addr, unsigned long end,
+			    unsigned long *bitmap, unsigned long *last)
+{
+	if (!kvm_arch_has_test_clear_young() || !get_cap(LRU_GEN_SPTE_WALK))
+		return false;
+
+	if (*last > addr)
+		goto done;
+
+	*last = end - addr > MIN_LRU_BATCH * PAGE_SIZE ?
+		addr + MIN_LRU_BATCH * PAGE_SIZE - 1 : end - 1;
+	bitmap_zero(bitmap, MIN_LRU_BATCH);
+
+	mmu_notifier_test_clear_young(mm, addr, *last + 1, false, bitmap);
+done:
+	return test_bit((*last - addr) / PAGE_SIZE, bitmap);
+}
+
+static void clear_spte_young(struct mm_struct *mm, unsigned long addr,
+			     unsigned long *bitmap, unsigned long *last)
+{
+	int i;
+	unsigned long start, end = *last + 1;
+
+	if (addr + PAGE_SIZE != end)
+		return;
+
+	i = find_last_bit(bitmap, MIN_LRU_BATCH);
+	if (i == MIN_LRU_BATCH)
+		return;
+
+	start = end - (i + 1) * PAGE_SIZE;
+
+	i = find_first_bit(bitmap, MIN_LRU_BATCH);
+
+	end -= i * PAGE_SIZE;
+
+	mmu_notifier_test_clear_young(mm, start, end, false, bitmap);
+}
+
+static void skip_spte_young(struct mm_struct *mm, unsigned long addr,
+			    unsigned long *bitmap, unsigned long *last)
+{
+	if (*last > addr)
+		__clear_bit((*last - addr) / PAGE_SIZE, bitmap);
+
+	clear_spte_young(mm, addr, bitmap, last);
+}
+
 static bool suitable_to_scan(int total, int young)
 {
 	int n = clamp_t(int, cache_line_size() / sizeof(pte_t), 2, 8);
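Taken together, the three helpers above batch the notifier traffic: test_spte_young() zeroes a MIN_LRU_BATCH-wide bitmap and makes one test-only mmu_notifier_test_clear_young() call per chunk ending at *last; skip_spte_young() drops the bit of any page the walker rejects so its accessed bit survives; and clear_spte_young(), once the walker reaches the chunk's last page, issues the second, clearing call trimmed to the span of bits still set. The bit index is relative to the chunk's end, as the test_bit((*last - addr) / PAGE_SIZE, bitmap) lookup shows. A standalone sketch of that indexing; the page size and batch width here are demo assumptions, not values taken from the patch:

```c
#include <stdio.h>

#define DEMO_PAGE_SIZE	4096UL	/* assumed 4 KiB pages */
#define DEMO_BATCH	64UL	/* stand-in for MIN_LRU_BATCH */

/* Same end-relative bit index the helpers above use. */
static unsigned long demo_bit(unsigned long last, unsigned long addr)
{
	return (last - addr) / DEMO_PAGE_SIZE;
}

int main(void)
{
	unsigned long addr = 0x100000UL;	/* first page of a chunk */
	unsigned long last = addr + DEMO_BATCH * DEMO_PAGE_SIZE - 1;

	/* The chunk's first page maps to the highest bit... */
	printf("bit(first page) = %lu\n", demo_bit(last, addr));	/* 63 */

	/* ...and its last page to bit 0, so a forward walk drains
	 * the bitmap from the top bit downward. */
	printf("bit(last page)  = %lu\n",
	       demo_bit(last, last - DEMO_PAGE_SIZE + 1));		/* 0 */

	return 0;
}
```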
@@ -3938,6 +3989,8 @@ static bool walk_pte_range(pmd_t *pmd, unsigned long start, unsigned long end,
 	pte_t *pte;
 	spinlock_t *ptl;
 	unsigned long addr;
+	unsigned long bitmap[BITS_TO_LONGS(MIN_LRU_BATCH)];
+	unsigned long last = 0;
 	int total = 0;
 	int young = 0;
 	struct lru_gen_mm_walk *walk = args->private;
@@ -3956,6 +4009,7 @@ static bool walk_pte_range(pmd_t *pmd, unsigned long start, unsigned long end,
 	pte = pte_offset_map(pmd, start & PMD_MASK);
 restart:
 	for (i = pte_index(start), addr = start; addr != end; i++, addr += PAGE_SIZE) {
+		bool success;
 		unsigned long pfn;
 		struct folio *folio;
 
@@ -3963,20 +4017,27 @@ static bool walk_pte_range(pmd_t *pmd, unsigned long start, unsigned long end,
 		walk->mm_stats[MM_LEAF_TOTAL]++;
 
 		pfn = get_pte_pfn(pte[i], args->vma, addr);
-		if (pfn == -1)
+		if (pfn == -1) {
+			skip_spte_young(args->vma->vm_mm, addr, bitmap, &last);
 			continue;
+		}
 
-		if (!pte_young(pte[i])) {
+		success = test_spte_young(args->vma->vm_mm, addr, end, bitmap, &last);
+		if (!success && !pte_young(pte[i])) {
+			skip_spte_young(args->vma->vm_mm, addr, bitmap, &last);
 			walk->mm_stats[MM_LEAF_OLD]++;
 			continue;
 		}
 
 		folio = get_pfn_folio(pfn, memcg, pgdat, walk->can_swap);
-		if (!folio)
+		if (!folio) {
+			skip_spte_young(args->vma->vm_mm, addr, bitmap, &last);
 			continue;
+		}
 
-		if (!ptep_test_and_clear_young(args->vma, addr, pte + i))
-			VM_WARN_ON_ONCE(true);
+		clear_spte_young(args->vma->vm_mm, addr, bitmap, &last);
+		if (pte_young(pte[i]))
+			ptep_test_and_clear_young(args->vma, addr, pte + i);
 
 		young++;
 		walk->mm_stats[MM_LEAF_YOUNG]++;
@@ -4581,6 +4642,24 @@ static void lru_gen_age_node(struct pglist_data *pgdat, struct scan_control *sc)
  *                          rmap/PT walk feedback
  ******************************************************************************/
 
+static bool should_look_around(struct vm_area_struct *vma, unsigned long addr,
+			       pte_t *pte, int *young)
+{
+	unsigned long old = true;
+
+	*young = mmu_notifier_test_clear_young(vma->vm_mm, addr, addr + PAGE_SIZE, true, &old);
+	if (!old)
+		*young = true;
+
+	if (pte_young(*pte)) {
+		ptep_test_and_clear_young(vma, addr, pte);
+		*young = true;
+		return true;
+	}
+
+	return !old && get_cap(LRU_GEN_SPTE_WALK);
+}
+
 /*
  * This function exploits spatial locality when shrink_folio_list() walks the
  * rmap. It scans the adjacent PTEs of a young PTE and promotes hot pages. If
@@ -4588,12 +4667,14 @@ static void lru_gen_age_node(struct pglist_data *pgdat, struct scan_control *sc)
  * the scan was done cacheline efficiently, it adds the PMD entry pointing to
  * the PTE table to the Bloom filter. This forms a feedback loop between the
  * eviction and the aging.
  */
-void lru_gen_look_around(struct page_vma_mapped_walk *pvmw)
+bool lru_gen_look_around(struct page_vma_mapped_walk *pvmw)
 {
 	int i;
 	unsigned long start;
 	unsigned long end;
 	struct lru_gen_mm_walk *walk;
+	unsigned long bitmap[BITS_TO_LONGS(MIN_LRU_BATCH)];
+	unsigned long last = 0;
 	int young = 0;
 	pte_t *pte = pvmw->pte;
 	unsigned long addr = pvmw->address;
@@ -4607,8 +4688,11 @@ void lru_gen_look_around(struct page_vma_mapped_walk *pvmw)
 	lockdep_assert_held(pvmw->ptl);
 	VM_WARN_ON_ONCE_FOLIO(folio_test_lru(folio), folio);
 
+	if (!should_look_around(pvmw->vma, addr, pte, &young))
+		return young;
+
 	if (spin_is_contended(pvmw->ptl))
-		return;
+		return young;
 
 	/* avoid taking the LRU lock under the PTL when possible */
 	walk = current->reclaim_state ? current->reclaim_state->mm_walk : NULL;
@@ -4616,6 +4700,9 @@ void lru_gen_look_around(struct page_vma_mapped_walk *pvmw)
 	start = max(addr & PMD_MASK, pvmw->vma->vm_start);
 	end = min(addr | ~PMD_MASK, pvmw->vma->vm_end - 1) + 1;
 
+	if (end - start == PAGE_SIZE)
+		return young;
+
 	if (end - start > MIN_LRU_BATCH * PAGE_SIZE) {
 		if (addr - start < MIN_LRU_BATCH * PAGE_SIZE / 2)
 			end = start + MIN_LRU_BATCH * PAGE_SIZE;
@@ -4629,28 +4716,37 @@ void lru_gen_look_around(struct page_vma_mapped_walk *pvmw)
 
 	/* folio_update_gen() requires stable folio_memcg() */
 	if (!mem_cgroup_trylock_pages(memcg))
-		return;
+		return young;
 
 	arch_enter_lazy_mmu_mode();
 
 	pte -= (addr - start) / PAGE_SIZE;
 
 	for (i = 0, addr = start; addr != end; i++, addr += PAGE_SIZE) {
+		bool success;
 		unsigned long pfn;
 
 		pfn = get_pte_pfn(pte[i], pvmw->vma, addr);
-		if (pfn == -1)
+		if (pfn == -1) {
+			skip_spte_young(pvmw->vma->vm_mm, addr, bitmap, &last);
 			continue;
+		}
 
-		if (!pte_young(pte[i]))
+		success = test_spte_young(pvmw->vma->vm_mm, addr, end, bitmap, &last);
+		if (!success && !pte_young(pte[i])) {
+			skip_spte_young(pvmw->vma->vm_mm, addr, bitmap, &last);
 			continue;
+		}
 
 		folio = get_pfn_folio(pfn, memcg, pgdat, !walk || walk->can_swap);
-		if (!folio)
+		if (!folio) {
+			skip_spte_young(pvmw->vma->vm_mm, addr, bitmap, &last);
 			continue;
+		}
 
-		if (!ptep_test_and_clear_young(pvmw->vma, addr, pte + i))
-			VM_WARN_ON_ONCE(true);
+		clear_spte_young(pvmw->vma->vm_mm, addr, bitmap, &last);
+		if (pte_young(pte[i]))
+			ptep_test_and_clear_young(pvmw->vma, addr, pte + i);
 
 		young++;
 
@@ -4680,6 +4776,8 @@ void lru_gen_look_around(struct page_vma_mapped_walk *pvmw)
 	/* feedback from rmap walkers to page table walkers */
 	if (suitable_to_scan(i, young))
 		update_bloom_filter(lruvec, max_seq, pvmw->pmd);
+
+	return young;
 }
 
 /******************************************************************************
@@ -5699,6 +5797,9 @@ static ssize_t show_enabled(struct kobject *kobj, struct kobj_attribute *attr, c
 	if (arch_has_hw_nonleaf_pmd_young() && get_cap(LRU_GEN_NONLEAF_YOUNG))
 		caps |= BIT(LRU_GEN_NONLEAF_YOUNG);
 
+	if (kvm_arch_has_test_clear_young() && get_cap(LRU_GEN_SPTE_WALK))
+		caps |= BIT(LRU_GEN_SPTE_WALK);
+
 	return sysfs_emit(buf, "0x%04x\n", caps);
 }
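The show_enabled() hunk makes the new capability visible to userspace as the fourth caps bit. A small sketch of the encoding, restating the enum order from the mmzone.h hunk above; BIT() is re-defined here only so the demo compiles outside the kernel:

```c
#include <stdio.h>

/* Same order as the LRU_GEN_* caps enum after this patch. */
enum {
	LRU_GEN_CORE,
	LRU_GEN_MM_WALK,
	LRU_GEN_NONLEAF_YOUNG,
	LRU_GEN_SPTE_WALK,
	NR_LRU_GEN_CAPS
};

#define BIT(nr)	(1UL << (nr))

int main(void)
{
	unsigned long caps = BIT(LRU_GEN_CORE) | BIT(LRU_GEN_MM_WALK) |
			     BIT(LRU_GEN_NONLEAF_YOUNG) | BIT(LRU_GEN_SPTE_WALK);

	/* Matches the "0x%04x" format used by show_enabled(): with every
	 * capability set, including the new SPTE walk, this prints 0x000f. */
	printf("0x%04lx\n", caps);
	return 0;
}
```

Per the show_enabled() hunk, a kernel would report this value only when kvm_arch_has_test_clear_young() is true and all four capabilities are enabled.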