From patchwork Sat Apr 23 03:47:41 2022
X-Patchwork-Submitter: Sean Christopherson
X-Patchwork-Id: 12824400
Reply-To: Sean Christopherson
Date: Sat, 23 Apr 2022 03:47:41 +0000
In-Reply-To: <20220423034752.1161007-1-seanjc@google.com>
Message-Id: <20220423034752.1161007-2-seanjc@google.com>
References: <20220423034752.1161007-1-seanjc@google.com>
Subject: [PATCH 01/12] KVM: x86/mmu: Don't treat fully writable SPTEs as volatile (modulo A/D)
From: Sean Christopherson
To: Paolo Bonzini
Cc: Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
 Joerg Roedel, kvm@vger.kernel.org, linux-kernel@vger.kernel.org,
 Ben Gardon, David Matlack, Venkatesh Srinivas, Chao Peng
X-Mailing-List: kvm@vger.kernel.org

Don't treat SPTEs that
are truly writable, i.e. writable in hardware, as being volatile (unless they're volatile for other reasons, e.g. A/D bits). KVM _sets_ the WRITABLE bit out of mmu_lock, but never _clears_ the bit out of mmu_lock, so if the WRITABLE bit is set, it cannot magically get cleared just because the SPTE is MMU-writable. Rename the wrapper of MMU-writable to be more literal, the previous name of spte_can_locklessly_be_made_writable() is wrong and misleading. Fixes: c7ba5b48cc8d ("KVM: MMU: fast path of handling guest page fault") Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson --- arch/x86/kvm/mmu/mmu.c | 17 +++++++++-------- arch/x86/kvm/mmu/spte.h | 2 +- 2 files changed, 10 insertions(+), 9 deletions(-) diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c index 904f0faff218..612316768e8e 100644 --- a/arch/x86/kvm/mmu/mmu.c +++ b/arch/x86/kvm/mmu/mmu.c @@ -481,13 +481,15 @@ static bool spte_has_volatile_bits(u64 spte) * also, it can help us to get a stable is_writable_pte() * to ensure tlb flush is not missed. */ - if (spte_can_locklessly_be_made_writable(spte) || - is_access_track_spte(spte)) + if (!is_writable_pte(spte) && is_mmu_writable_spte(spte)) + return true; + + if (is_access_track_spte(spte)) return true; if (spte_ad_enabled(spte)) { - if ((spte & shadow_accessed_mask) == 0 || - (is_writable_pte(spte) && (spte & shadow_dirty_mask) == 0)) + if (!(spte & shadow_accessed_mask) || + (is_writable_pte(spte) && !(spte & shadow_dirty_mask))) return true; } @@ -554,7 +556,7 @@ static bool mmu_spte_update(u64 *sptep, u64 new_spte) * we always atomically update it, see the comments in * spte_has_volatile_bits(). */ - if (spte_can_locklessly_be_made_writable(old_spte) && + if (is_mmu_writable_spte(old_spte) && !is_writable_pte(new_spte)) flush = true; @@ -1192,7 +1194,7 @@ static bool spte_write_protect(u64 *sptep, bool pt_protect) u64 spte = *sptep; if (!is_writable_pte(spte) && - !(pt_protect && spte_can_locklessly_be_made_writable(spte))) + !(pt_protect && is_mmu_writable_spte(spte))) return false; rmap_printk("spte %p %llx\n", sptep, *sptep); @@ -3171,8 +3173,7 @@ static int fast_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault) * be removed in the fast path only if the SPTE was * write-protected for dirty-logging or access tracking. 
 */
-	if (fault->write &&
-	    spte_can_locklessly_be_made_writable(spte)) {
+	if (fault->write && is_mmu_writable_spte(spte)) {
 		new_spte |= PT_WRITABLE_MASK;
 
 		/*

diff --git a/arch/x86/kvm/mmu/spte.h b/arch/x86/kvm/mmu/spte.h
index ad8ce3c5d083..570699682f6d 100644
--- a/arch/x86/kvm/mmu/spte.h
+++ b/arch/x86/kvm/mmu/spte.h
@@ -398,7 +398,7 @@ static inline void check_spte_writable_invariants(u64 spte)
 		  "kvm: Writable SPTE is not MMU-writable: %llx", spte);
 }
 
-static inline bool spte_can_locklessly_be_made_writable(u64 spte)
+static inline bool is_mmu_writable_spte(u64 spte)
 {
 	return spte & shadow_mmu_writable_mask;
 }
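To summarize the rule the patch encodes, here is a small self-contained
sketch; the mask values are illustrative placeholders rather than KVM's real
bit assignments, and the helpers simply mirror the names used in the diff
above.

#include <stdbool.h>
#include <stdint.h>

/* Placeholder bit positions; the real masks live in arch/x86/kvm/mmu/spte.h. */
#define PT_WRITABLE_MASK	(1ull << 1)
#define MMU_WRITABLE_MASK	(1ull << 54)	/* illustrative only */

static bool is_writable_pte(uint64_t spte)
{
	return spte & PT_WRITABLE_MASK;
}

static bool is_mmu_writable_spte(uint64_t spte)
{
	return spte & MMU_WRITABLE_MASK;
}

/*
 * Only an SPTE that is MMU-writable but not yet writable in hardware can
 * have its Writable bit set outside of mmu_lock (by the fast page fault
 * path); the bit is never cleared outside of mmu_lock, so a fully writable
 * SPTE is not volatile on account of writability.
 */
static bool writable_state_is_volatile(uint64_t spte)
{
	return !is_writable_pte(spte) && is_mmu_writable_spte(spte);
}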
From patchwork Sat Apr 23 03:47:42 2022
X-Patchwork-Submitter: Sean Christopherson
X-Patchwork-Id: 12824401
Reply-To: Sean Christopherson
Date: Sat, 23 Apr 2022 03:47:42 +0000
In-Reply-To: <20220423034752.1161007-1-seanjc@google.com>
Message-Id: <20220423034752.1161007-3-seanjc@google.com>
References: <20220423034752.1161007-1-seanjc@google.com>
Subject: [PATCH 02/12] KVM: x86/mmu: Move shadow-present check out of spte_has_volatile_bits()
From: Sean Christopherson
To: Paolo Bonzini
Cc: Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
 Joerg Roedel, kvm@vger.kernel.org, linux-kernel@vger.kernel.org,
 Ben Gardon, David Matlack, Venkatesh Srinivas, Chao Peng
X-Mailing-List: kvm@vger.kernel.org

Move the is_shadow_present_pte() check out of spte_has_volatile_bits() and
into its callers. Well, caller, since only one of its two callers doesn't
already do the shadow-present check.

Opportunistically move the helper to spte.c/h so that it can be used by the
TDP MMU, which is also the primary motivation for the shadow-present change.
Unlike the legacy MMU, the TDP MMU uses a single path to clear leaf and
non-leaf SPTEs, and to avoid unnecessary atomic updates, the TDP MMU will
need to check is_last_spte() prior to calling spte_has_volatile_bits(), and
calling is_last_spte() without first calling is_shadow_present_pte() is at
best odd, and at worst a violation of KVM's loosely defined SPTE rules.

Note, mmu_spte_clear_track_bits() could likely skip the write entirely for
SPTEs that are not shadow-present. Leave that cleanup for a future patch to
avoid introducing a functional change, and because the shadow-present check
can likely be moved further up the stack, e.g. drop_large_spte() appears to
be the only path that doesn't already explicitly check for a shadow-present
SPTE.

No functional change intended.

Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson
---
 arch/x86/kvm/mmu/mmu.c  | 29 ++---------------------------
 arch/x86/kvm/mmu/spte.c | 28 ++++++++++++++++++++++++++++
 arch/x86/kvm/mmu/spte.h |  2 ++
 3 files changed, 32 insertions(+), 27 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 612316768e8e..65b723201738 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -470,32 +470,6 @@ static u64 __get_spte_lockless(u64 *sptep)
 }
 #endif
 
-static bool spte_has_volatile_bits(u64 spte)
-{
-	if (!is_shadow_present_pte(spte))
-		return false;
-
-	/*
-	 * Always atomically update spte if it can be updated
-	 * out of mmu-lock, it can ensure dirty bit is not lost,
-	 * also, it can help us to get a stable is_writable_pte()
-	 * to ensure tlb flush is not missed.
-	 */
-	if (!is_writable_pte(spte) && is_mmu_writable_spte(spte))
-		return true;
-
-	if (is_access_track_spte(spte))
-		return true;
-
-	if (spte_ad_enabled(spte)) {
-		if (!(spte & shadow_accessed_mask) ||
-		    (is_writable_pte(spte) && !(spte & shadow_dirty_mask)))
-			return true;
-	}
-
-	return false;
-}
-
 /* Rules for using mmu_spte_set:
  * Set the sptep from nonpresent to present.
* Note: the sptep being assigned *must* be either not present @@ -590,7 +564,8 @@ static int mmu_spte_clear_track_bits(struct kvm *kvm, u64 *sptep) u64 old_spte = *sptep; int level = sptep_to_sp(sptep)->role.level; - if (!spte_has_volatile_bits(old_spte)) + if (!is_shadow_present_pte(old_spte) || + !spte_has_volatile_bits(old_spte)) __update_clear_spte_fast(sptep, 0ull); else old_spte = __update_clear_spte_slow(sptep, 0ull); diff --git a/arch/x86/kvm/mmu/spte.c b/arch/x86/kvm/mmu/spte.c index 3d611f07eee8..800b857b3a53 100644 --- a/arch/x86/kvm/mmu/spte.c +++ b/arch/x86/kvm/mmu/spte.c @@ -90,6 +90,34 @@ static bool kvm_is_mmio_pfn(kvm_pfn_t pfn) E820_TYPE_RAM); } +/* + * Returns true if the SPTE has bits that may be set without holding mmu_lock. + * The caller is responsible for checking if the SPTE is shadow-present, and + * for determining whether or not the caller cares about non-leaf SPTEs. + */ +bool spte_has_volatile_bits(u64 spte) +{ + /* + * Always atomically update spte if it can be updated + * out of mmu-lock, it can ensure dirty bit is not lost, + * also, it can help us to get a stable is_writable_pte() + * to ensure tlb flush is not missed. + */ + if (!is_writable_pte(spte) && is_mmu_writable_spte(spte)) + return true; + + if (is_access_track_spte(spte)) + return true; + + if (spte_ad_enabled(spte)) { + if (!(spte & shadow_accessed_mask) || + (is_writable_pte(spte) && !(spte & shadow_dirty_mask))) + return true; + } + + return false; +} + bool make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp, const struct kvm_memory_slot *slot, unsigned int pte_access, gfn_t gfn, kvm_pfn_t pfn, diff --git a/arch/x86/kvm/mmu/spte.h b/arch/x86/kvm/mmu/spte.h index 570699682f6d..098d7d144627 100644 --- a/arch/x86/kvm/mmu/spte.h +++ b/arch/x86/kvm/mmu/spte.h @@ -412,6 +412,8 @@ static inline u64 get_mmio_spte_generation(u64 spte) return gen; } +bool spte_has_volatile_bits(u64 spte); + bool make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp, const struct kvm_memory_slot *slot, unsigned int pte_access, gfn_t gfn, kvm_pfn_t pfn, From patchwork Sat Apr 23 03:47:43 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Sean Christopherson X-Patchwork-Id: 12824402 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id ACFC9C433FE for ; Sat, 23 Apr 2022 03:48:28 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S232784AbiDWDvT (ORCPT ); Fri, 22 Apr 2022 23:51:19 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:55244 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S232756AbiDWDvM (ORCPT ); Fri, 22 Apr 2022 23:51:12 -0400 Received: from mail-pf1-x449.google.com (mail-pf1-x449.google.com [IPv6:2607:f8b0:4864:20::449]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 6A42A1C2415 for ; Fri, 22 Apr 2022 20:48:16 -0700 (PDT) Received: by mail-pf1-x449.google.com with SMTP id m8-20020a62a208000000b0050593296139so6575343pff.1 for ; Fri, 22 Apr 2022 20:48:16 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20210112; h=reply-to:date:in-reply-to:message-id:mime-version:references :subject:from:to:cc; bh=moq5t2aceAdbzIzukfRmgSKet+BsMKCYgSlsSAirzs0=; b=FbDhbCaF8hOQ7EYUVhNkt6Tmwu1NFcBdYH6RBxl2Sh3YMMQBgs0eDoRhoGrFjg3HBX 
Reply-To: Sean Christopherson
Date: Sat, 23 Apr 2022 03:47:43 +0000
In-Reply-To: <20220423034752.1161007-1-seanjc@google.com>
Message-Id: <20220423034752.1161007-4-seanjc@google.com>
References: <20220423034752.1161007-1-seanjc@google.com>
Subject: [PATCH 03/12] KVM: x86/mmu: Use atomic XCHG to write TDP MMU SPTEs with volatile bits
From: Sean Christopherson
To: Paolo Bonzini
Cc: Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
 Joerg Roedel, kvm@vger.kernel.org, linux-kernel@vger.kernel.org,
 Ben Gardon, David Matlack, Venkatesh Srinivas, Chao Peng
X-Mailing-List: kvm@vger.kernel.org

Use an atomic XCHG to write TDP MMU SPTEs that have volatile bits, even if
mmu_lock is held for write, as volatile SPTEs can be written by other
tasks/vCPUs outside of mmu_lock. If a vCPU uses the to-be-modified SPTE to
write a page, the CPU can cache the translation as WRITABLE in the TLB
despite it being seen by KVM as !WRITABLE, and/or KVM can clobber the
Accessed/Dirty bits and not properly tag the backing page.

Exempt non-leaf SPTEs from atomic updates as KVM itself doesn't modify
non-leaf SPTEs without holding mmu_lock, they do not have Dirty bits, and
KVM doesn't consume the Accessed bit of non-leaf SPTEs.

Dropping the Dirty and/or Writable bits is most problematic for dirty
logging, as doing so can result in a missed TLB flush and eventually a
missed dirty page. In the unlikely event that the only dirty page(s) is a
clobbered SPTE, clear_dirty_gfn_range() will see the SPTE as not dirty
(based on the Dirty or Writable bit depending on the method) and so not
update the SPTE and ultimately not flush. If the SPTE is cached in the TLB
as writable before it is clobbered, the guest can continue writing the
associated page without ever taking a write-protect fault.
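As a toy illustration of the difference (plain C11 atomics, not the kernel's
helpers; the function names are made up for the example), an exchange reports
exactly what was overwritten, whereas a plain store silently discards anything
hardware set in the meantime:

#include <stdatomic.h>
#include <stdint.h>

/* Hardware may set Accessed/Dirty bits in a live SPTE at any time. */
static uint64_t write_spte_nonatomic(_Atomic uint64_t *sptep,
				     uint64_t old_spte, uint64_t new_spte)
{
	/* Any bits set by hardware after old_spte was read are lost here. */
	atomic_store(sptep, new_spte);
	return old_spte;
}

static uint64_t write_spte_atomic(_Atomic uint64_t *sptep, uint64_t new_spte)
{
	/* The return value reflects A/D bits set concurrently by hardware. */
	return atomic_exchange(sptep, new_spte);
}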
For most (all?) file-backed memory, dropping the Dirty bit is a non-issue.
The primary MMU write-protects its PTEs on writeback, i.e. KVM's dirty bit
is effectively ignored because the primary MMU will mark that page dirty
when the write-protection is lifted, e.g. when KVM faults the page back in
for write.

The Accessed bit is a complete non-issue. Aside from being unused for
non-leaf SPTEs, KVM doesn't do a TLB flush when aging SPTEs, i.e. the
Accessed bit may be dropped anyways.

Lastly, the Writable bit is also problematic as an extension of the Dirty
bit, as KVM (correctly) treats the Dirty bit as volatile iff the SPTE is
!DIRTY && WRITABLE. If KVM fixes an MMU-writable, but !WRITABLE, SPTE out of
mmu_lock, then it can allow the CPU to set the Dirty bit despite the SPTE
being !WRITABLE when it is checked by KVM. But that all depends on the Dirty
bit being problematic in the first place.

Fixes: 2f2fad0897cb ("kvm: x86/mmu: Add functions to handle changed TDP SPTEs")
Cc: stable@vger.kernel.org
Cc: Ben Gardon
Cc: David Matlack
Cc: Venkatesh Srinivas
Signed-off-by: Sean Christopherson
---
 arch/x86/kvm/mmu/tdp_iter.h | 34 ++++++++++++++-
 arch/x86/kvm/mmu/tdp_mmu.c  | 82 ++++++++++++++++++++++++-------------
 2 files changed, 85 insertions(+), 31 deletions(-)

diff --git a/arch/x86/kvm/mmu/tdp_iter.h b/arch/x86/kvm/mmu/tdp_iter.h
index b1eaf6ec0e0b..f0af385c56e0 100644
--- a/arch/x86/kvm/mmu/tdp_iter.h
+++ b/arch/x86/kvm/mmu/tdp_iter.h
@@ -6,6 +6,7 @@
 #include
 
 #include "mmu.h"
+#include "spte.h"
 
 /*
  * TDP MMU SPTEs are RCU protected to allow paging structures (non-leaf SPTEs)
@@ -17,9 +18,38 @@ static inline u64 kvm_tdp_mmu_read_spte(tdp_ptep_t sptep)
 {
 	return READ_ONCE(*rcu_dereference(sptep));
 }
-static inline void kvm_tdp_mmu_write_spte(tdp_ptep_t sptep, u64 val)
+
+static inline u64 kvm_tdp_mmu_write_spte_atomic(tdp_ptep_t sptep, u64 new_spte)
 {
-	WRITE_ONCE(*rcu_dereference(sptep), val);
+	return xchg(rcu_dereference(sptep), new_spte);
+}
+
+static inline void __kvm_tdp_mmu_write_spte(tdp_ptep_t sptep, u64 new_spte)
+{
+	WRITE_ONCE(*rcu_dereference(sptep), new_spte);
+}
+
+static inline u64 kvm_tdp_mmu_write_spte(tdp_ptep_t sptep, u64 old_spte,
+					 u64 new_spte, int level)
+{
+	/*
+	 * Atomically write the SPTE if it is a shadow-present, leaf SPTE with
+	 * volatile bits, i.e. has bits that can be set outside of mmu_lock.
+	 * The Writable bit can be set by KVM's fast page fault handler, and
+	 * Accessed and Dirty bits can be set by the CPU.
+	 *
+	 * Note, non-leaf SPTEs do have Accessed bits and those bits are
+	 * technically volatile, but KVM doesn't consume the Accessed bit of
+	 * non-leaf SPTEs, i.e. KVM doesn't care if it clobbers the bit. This
+	 * logic needs to be reassessed if KVM were to use non-leaf Accessed
+	 * bits, e.g. to skip stepping down into child SPTEs when aging SPTEs.
+	 */
+	if (is_shadow_present_pte(old_spte) && is_last_spte(old_spte, level) &&
+	    spte_has_volatile_bits(old_spte))
+		return kvm_tdp_mmu_write_spte_atomic(sptep, new_spte);
+
+	__kvm_tdp_mmu_write_spte(sptep, new_spte);
+	return old_spte;
 }
 
 /*
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 566548a3efa7..e9033cce8aeb 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -426,9 +426,9 @@ static void handle_removed_pt(struct kvm *kvm, tdp_ptep_t pt, bool shared)
 	tdp_mmu_unlink_sp(kvm, sp, shared);
 
 	for (i = 0; i < PT64_ENT_PER_PAGE; i++) {
-		u64 *sptep = rcu_dereference(pt) + i;
+		tdp_ptep_t sptep = pt + i;
 		gfn_t gfn = base_gfn + i * KVM_PAGES_PER_HPAGE(level);
-		u64 old_child_spte;
+		u64 old_spte;
 
 		if (shared) {
 			/*
@@ -440,8 +440,8 @@ static void handle_removed_pt(struct kvm *kvm, tdp_ptep_t pt, bool shared)
 			 * value to the removed SPTE value.
*/ for (;;) { - old_child_spte = xchg(sptep, REMOVED_SPTE); - if (!is_removed_spte(old_child_spte)) + old_spte = kvm_tdp_mmu_write_spte_atomic(sptep, REMOVED_SPTE); + if (!is_removed_spte(old_spte)) break; cpu_relax(); } @@ -455,23 +455,43 @@ static void handle_removed_pt(struct kvm *kvm, tdp_ptep_t pt, bool shared) * are guarded by the memslots generation, not by being * unreachable. */ - old_child_spte = READ_ONCE(*sptep); - if (!is_shadow_present_pte(old_child_spte)) + old_spte = kvm_tdp_mmu_read_spte(sptep); + if (!is_shadow_present_pte(old_spte)) continue; /* - * Marking the SPTE as a removed SPTE is not - * strictly necessary here as the MMU lock will - * stop other threads from concurrently modifying - * this SPTE. Using the removed SPTE value keeps - * the two branches consistent and simplifies - * the function. + * Use the common helper instead of a raw WRITE_ONCE as + * the SPTE needs to be updated atomically if it can be + * modified by a different vCPU outside of mmu_lock. + * Even though the parent SPTE is !PRESENT, the TLB + * hasn't yet been flushed, and both Intel and AMD + * document that A/D assists can use upper-level PxE + * entries that are cached in the TLB, i.e. the CPU can + * still access the page and mark it dirty. + * + * No retry is needed in the atomic update path as the + * sole concern is dropping a Dirty bit, i.e. no other + * task can zap/remove the SPTE as mmu_lock is held for + * write. Marking the SPTE as a removed SPTE is not + * strictly necessary for the same reason, but using + * the remove SPTE value keeps the shared/exclusive + * paths consistent and allows the handle_changed_spte() + * call below to hardcode the new value to REMOVED_SPTE. + * + * Note, even though dropping a Dirty bit is the only + * scenario where a non-atomic update could result in a + * functional bug, simply checking the Dirty bit isn't + * sufficient as a fast page fault could read the upper + * level SPTE before it is zapped, and then make this + * target SPTE writable, resume the guest, and set the + * Dirty bit between reading the SPTE above and writing + * it here. */ - WRITE_ONCE(*sptep, REMOVED_SPTE); + old_spte = kvm_tdp_mmu_write_spte(sptep, old_spte, + REMOVED_SPTE, level); } handle_changed_spte(kvm, kvm_mmu_page_as_id(sp), gfn, - old_child_spte, REMOVED_SPTE, level, - shared); + old_spte, REMOVED_SPTE, level, shared); } call_rcu(&sp->rcu_head, tdp_mmu_free_sp_rcu_callback); @@ -667,14 +687,13 @@ static inline int tdp_mmu_zap_spte_atomic(struct kvm *kvm, KVM_PAGES_PER_HPAGE(iter->level)); /* - * No other thread can overwrite the removed SPTE as they - * must either wait on the MMU lock or use - * tdp_mmu_set_spte_atomic which will not overwrite the - * special removed SPTE value. No bookkeeping is needed - * here since the SPTE is going from non-present - * to non-present. + * No other thread can overwrite the removed SPTE as they must either + * wait on the MMU lock or use tdp_mmu_set_spte_atomic() which will not + * overwrite the special removed SPTE value. No bookkeeping is needed + * here since the SPTE is going from non-present to non-present. Use + * the raw write helper to avoid an unnecessary check on volatile bits. */ - kvm_tdp_mmu_write_spte(iter->sptep, 0); + __kvm_tdp_mmu_write_spte(iter->sptep, 0); return 0; } @@ -699,10 +718,13 @@ static inline int tdp_mmu_zap_spte_atomic(struct kvm *kvm, * unless performing certain dirty logging operations. * Leaving record_dirty_log unset in that case prevents page * writes from being double counted. 
+ * + * Returns the old SPTE value, which _may_ be different than @old_spte if the + * SPTE had voldatile bits. */ -static void __tdp_mmu_set_spte(struct kvm *kvm, int as_id, tdp_ptep_t sptep, - u64 old_spte, u64 new_spte, gfn_t gfn, int level, - bool record_acc_track, bool record_dirty_log) +static u64 __tdp_mmu_set_spte(struct kvm *kvm, int as_id, tdp_ptep_t sptep, + u64 old_spte, u64 new_spte, gfn_t gfn, int level, + bool record_acc_track, bool record_dirty_log) { lockdep_assert_held_write(&kvm->mmu_lock); @@ -715,7 +737,7 @@ static void __tdp_mmu_set_spte(struct kvm *kvm, int as_id, tdp_ptep_t sptep, */ WARN_ON(is_removed_spte(old_spte) || is_removed_spte(new_spte)); - kvm_tdp_mmu_write_spte(sptep, new_spte); + old_spte = kvm_tdp_mmu_write_spte(sptep, old_spte, new_spte, level); __handle_changed_spte(kvm, as_id, gfn, old_spte, new_spte, level, false); @@ -724,6 +746,7 @@ static void __tdp_mmu_set_spte(struct kvm *kvm, int as_id, tdp_ptep_t sptep, if (record_dirty_log) handle_changed_spte_dirty_log(kvm, as_id, gfn, old_spte, new_spte, level); + return old_spte; } static inline void _tdp_mmu_set_spte(struct kvm *kvm, struct tdp_iter *iter, @@ -732,9 +755,10 @@ static inline void _tdp_mmu_set_spte(struct kvm *kvm, struct tdp_iter *iter, { WARN_ON_ONCE(iter->yielded); - __tdp_mmu_set_spte(kvm, iter->as_id, iter->sptep, iter->old_spte, - new_spte, iter->gfn, iter->level, - record_acc_track, record_dirty_log); + iter->old_spte = __tdp_mmu_set_spte(kvm, iter->as_id, iter->sptep, + iter->old_spte, new_spte, + iter->gfn, iter->level, + record_acc_track, record_dirty_log); } static inline void tdp_mmu_set_spte(struct kvm *kvm, struct tdp_iter *iter, From patchwork Sat Apr 23 03:47:44 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Sean Christopherson X-Patchwork-Id: 12824412 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id CA49AC433F5 for ; Sat, 23 Apr 2022 03:48:43 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S232902AbiDWDvh (ORCPT ); Fri, 22 Apr 2022 23:51:37 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:55314 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S232754AbiDWDvO (ORCPT ); Fri, 22 Apr 2022 23:51:14 -0400 Received: from mail-pf1-x44a.google.com (mail-pf1-x44a.google.com [IPv6:2607:f8b0:4864:20::44a]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 2998F1C37AE for ; Fri, 22 Apr 2022 20:48:18 -0700 (PDT) Received: by mail-pf1-x44a.google.com with SMTP id a16-20020a62d410000000b00505734b752aso6564118pfh.4 for ; Fri, 22 Apr 2022 20:48:18 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20210112; h=reply-to:date:in-reply-to:message-id:mime-version:references :subject:from:to:cc; bh=s1FdvFSZ5B4eOljesil+5nZQsUnLyQTrJ9BbjzQR3zY=; b=dG4roe8tmNBs3teFZlh8AQCmDvgQZvVdbxLP3lsVzH1JbCuuVevt2rKfOqBKh9FwzH 0L++ZsjttRWY8LjRlxb/2k8UzTLWR8IaheYNrtUBIGyfM0MpPto/9X/66EV5ytOqfI4N ZhSg5a5+4WtQpkXd08lpYFN8748PHSodN8+GXDDzsXBsEH9mKY/hQC2LN8Tq6/47EDOf kO6N0oP2vaHXSS4JMTFRKcjp/1nPA6FuLe8nJGUOnUNJAZ6LsdaGbx4RZtsYWjzUjRqb yF6Y0tPVEgcgFOIsUfHPmbU8nlWxYadO/5uCTOZjEQrZF65AvNMKw2uwprWDATQ5Gl/s Vz3g== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; 
Reply-To: Sean Christopherson
Date: Sat, 23 Apr 2022 03:47:44 +0000
In-Reply-To: <20220423034752.1161007-1-seanjc@google.com>
Message-Id: <20220423034752.1161007-5-seanjc@google.com>
References: <20220423034752.1161007-1-seanjc@google.com>
Subject: [PATCH 04/12] KVM: x86/mmu: Don't attempt fast page fault just because EPT is in use
From: Sean Christopherson
To: Paolo Bonzini
Cc: Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
 Joerg Roedel, kvm@vger.kernel.org, linux-kernel@vger.kernel.org,
 Ben Gardon, David Matlack, Venkatesh Srinivas, Chao Peng
X-Mailing-List: kvm@vger.kernel.org

Check for A/D bits being disabled instead of the access tracking mask being
non-zero when deciding whether or not to attempt to fix a page fault via the
fast path. Originally, the access tracking mask was non-zero if and only if
A/D bits were disabled by _KVM_ (including not being supported by hardware),
but that hasn't been true since nVMX was fixed to honor EPTP12's A/D
enabling, i.e. since KVM allowed L1 to cause KVM to not use A/D bits while
running L2 despite KVM using them while running L1.

In other words, don't attempt the fast path just because EPT is enabled.

Note, attempting the fast path for all !PRESENT faults can "fix" a very,
_VERY_ tiny percentage of faults out of mmu_lock by detecting that the fault
is spurious, i.e. has been fixed by a different vCPU, but again the odds of
that happening are vanishingly small. E.g. booting an 8-vCPU VM gets less
than 10 successes out of 30k+ faults, and that's likely one of the more
favorable scenarios. Disabling dirty logging can likely lead to a rash of
collisions between vCPUs for some workloads that operate on a common set of
pages, but penalizing _all_ !PRESENT faults for that one case is unlikely to
be a net positive, not to mention that that problem is best solved by not
zapping in the first place.

The number of spurious faults does scale with the number of vCPUs, e.g. a
255-vCPU VM using TDP "jumps" to ~60 spurious faults detected in the fast
path (again out of 30k), but that's all of 0.2% of faults. Using legacy
shadow paging does get more spurious faults, and a few more detected out of
mmu_lock, but the percentage goes _down_ to 0.08% (and that's ignoring
faults that are reflected into the guest), i.e. the extra detections are
purely due to the sheer number of faults observed.
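To make the before-and-after comparison concrete, here is a condensed,
self-contained restatement of the predicate change from the diff below; the
struct and mask globals are simplified stand-ins for the real KVM
definitions, not actual kernel code.

#include <stdbool.h>
#include <stdint.h>

/* Simplified stand-ins; illustration only. */
struct kvm_page_fault { bool write, present; };
static uint64_t shadow_acc_track_mask;	/* non-zero iff access tracking is in use */
static uint64_t shadow_accessed_mask;	/* non-zero iff KVM itself uses A/D bits */

static bool kvm_ad_enabled(void)
{
	return shadow_accessed_mask != 0;
}

/* Before: attempt the fast path whenever access tracking *might* be in use. */
static bool page_fault_can_be_fast_old(const struct kvm_page_fault *fault)
{
	return shadow_acc_track_mask != 0 || (fault->write && fault->present);
}

/* After: only when KVM has A/D bits disabled, or for present write faults. */
static bool page_fault_can_be_fast_new(const struct kvm_page_fault *fault)
{
	return !kvm_ad_enabled() || (fault->write && fault->present);
}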
On the other hand, getting a "negative" in the fast path takes in the neighborhood of 150-250 cycles. So while it is tempting to keep/extend the current behavior, such a change needs to come with hard numbers showing that it's actually a win in the grand scheme, or any scheme for that matter. Fixes: 995f00a61958 ("x86: kvm: mmu: use ept a/d in vmcs02 iff used in vmcs12") Signed-off-by: Sean Christopherson --- arch/x86/kvm/mmu/mmu.c | 45 ++++++++++++++++++++++++-------------- arch/x86/kvm/mmu/spte.h | 11 ++++++++++ arch/x86/kvm/mmu/tdp_mmu.c | 2 +- 3 files changed, 41 insertions(+), 17 deletions(-) diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c index 65b723201738..dfd1cfa9c08c 100644 --- a/arch/x86/kvm/mmu/mmu.c +++ b/arch/x86/kvm/mmu/mmu.c @@ -3013,19 +3013,20 @@ static bool page_fault_can_be_fast(struct kvm_page_fault *fault) /* * #PF can be fast if: - * 1. The shadow page table entry is not present, which could mean that - * the fault is potentially caused by access tracking (if enabled). - * 2. The shadow page table entry is present and the fault - * is caused by write-protect, that means we just need change the W - * bit of the spte which can be done out of mmu-lock. * - * However, if access tracking is disabled we know that a non-present - * page must be a genuine page fault where we have to create a new SPTE. - * So, if access tracking is disabled, we return true only for write - * accesses to a present page. + * 1. The shadow page table entry is not present and A/D bits are + * disabled _by KVM_, which could mean that the fault is potentially + * caused by access tracking (if enabled). If A/D bits are enabled + * by KVM, but disabled by L1 for L2, KVM is forced to disable A/D + * bits for L2 and employ access tracking, but the fast page fault + * mechanism only supports direct MMUs. + * 2. The shadow page table entry is present, the access is a write, + * and no reserved bits are set (MMIO SPTEs cannot be "fixed"), i.e. + * the fault was caused by a write-protection violation. If the + * SPTE is MMU-writable (determined later), the fault can be fixed + * by setting the Writable bit, which can be done out of mmu_lock. */ - - return shadow_acc_track_mask != 0 || (fault->write && fault->present); + return !kvm_ad_enabled() || (fault->write && fault->present); } /* @@ -3140,13 +3141,25 @@ static int fast_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault) new_spte = spte; - if (is_access_track_spte(spte)) + /* + * KVM only supports fixing page faults outside of MMU lock for + * direct MMUs, nested MMUs are always indirect, and KVM always + * uses A/D bits for non-nested MMUs. Thus, if A/D bits are + * enabled, the SPTE can't be an access-tracked SPTE. + */ + if (unlikely(!kvm_ad_enabled()) && is_access_track_spte(spte)) new_spte = restore_acc_track_spte(new_spte); /* - * Currently, to simplify the code, write-protection can - * be removed in the fast path only if the SPTE was - * write-protected for dirty-logging or access tracking. + * To keep things simple, only SPTEs that are MMU-writable can + * be made fully writable outside of mmu_lock, e.g. only SPTEs + * that were write-protected for dirty-logging or access + * tracking are handled here. Don't bother checking if the + * SPTE is writable to prioritize running with A/D bits enabled. + * The is_access_allowed() check above handles the common case + * of the fault being spurious, and the SPTE is known to be + * shadow-present, i.e. 
except for access tracking restoration + * making the new SPTE writable, the check is wasteful. */ if (fault->write && is_mmu_writable_spte(spte)) { new_spte |= PT_WRITABLE_MASK; @@ -4751,7 +4764,7 @@ kvm_calc_tdp_mmu_root_page_role(struct kvm_vcpu *vcpu, role.efer_nx = true; role.smm = cpu_role.base.smm; role.guest_mode = cpu_role.base.guest_mode; - role.ad_disabled = (shadow_accessed_mask == 0); + role.ad_disabled = !kvm_ad_enabled(); role.level = kvm_mmu_get_tdp_level(vcpu); role.direct = true; role.has_4_byte_gpte = false; diff --git a/arch/x86/kvm/mmu/spte.h b/arch/x86/kvm/mmu/spte.h index 098d7d144627..43ec7a8641b3 100644 --- a/arch/x86/kvm/mmu/spte.h +++ b/arch/x86/kvm/mmu/spte.h @@ -220,6 +220,17 @@ static inline bool is_shadow_present_pte(u64 pte) return !!(pte & SPTE_MMU_PRESENT_MASK); } +/* + * Returns true if A/D bits are supported in hardware and are enabled by KVM. + * When enabled, KVM uses A/D bits for all non-nested MMUs. Because L1 can + * disable A/D bits in EPTP12, SP and SPTE variants are needed to handle the + * scenario where KVM is using A/D bits for L1, but not L2. + */ +static inline bool kvm_ad_enabled(void) +{ + return !!shadow_accessed_mask; +} + static inline bool sp_ad_disabled(struct kvm_mmu_page *sp) { return sp->role.ad_disabled; diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c index e9033cce8aeb..a2eda3e55697 100644 --- a/arch/x86/kvm/mmu/tdp_mmu.c +++ b/arch/x86/kvm/mmu/tdp_mmu.c @@ -1135,7 +1135,7 @@ static int tdp_mmu_link_sp(struct kvm *kvm, struct tdp_iter *iter, struct kvm_mmu_page *sp, bool account_nx, bool shared) { - u64 spte = make_nonleaf_spte(sp->spt, !shadow_accessed_mask); + u64 spte = make_nonleaf_spte(sp->spt, !kvm_ad_enabled()); int ret = 0; if (shared) { From patchwork Sat Apr 23 03:47:45 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Sean Christopherson X-Patchwork-Id: 12824409 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 47936C433EF for ; Sat, 23 Apr 2022 03:48:31 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S232831AbiDWDvW (ORCPT ); Fri, 22 Apr 2022 23:51:22 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:55354 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S232769AbiDWDvP (ORCPT ); Fri, 22 Apr 2022 23:51:15 -0400 Received: from mail-pf1-x449.google.com (mail-pf1-x449.google.com [IPv6:2607:f8b0:4864:20::449]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id EDE4E1C24AD for ; Fri, 22 Apr 2022 20:48:19 -0700 (PDT) Received: by mail-pf1-x449.google.com with SMTP id a16-20020a62d410000000b00505734b752aso6564153pfh.4 for ; Fri, 22 Apr 2022 20:48:19 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20210112; h=reply-to:date:in-reply-to:message-id:mime-version:references :subject:from:to:cc; bh=1xXlNuitmBIDnvcFCjqiRv5q853x221AI5N7xfAKCvw=; b=Ieha1ltUZLArH56kihEmLOyaNSxIK2twG7xkdpiCELNo5UCcfb1vggkZz9RL3fnwfX 0PrKYmdnN319T2R6MCRyHOi/2cE2OfEmEIYTitnPdLe2o4Jca2F215W/GNgyRjtwd426 d9nAiQtpboT76sQ/h5AHKqMAqBJao/zS+ZOlO/rSyuCh6X90Sjw28e/ce/KYoEGm0x0p y63JdEvo60hj73YVRQ1eTAapaPh9O+eCPkMUh5qo6bqmoxPd6o9KfUKCBlVu/PiKITli JMwbL98H13gphrSULEzFoSbv75jKpC2eCtQ/SnT0NgzN9Du2rLIOoz0kY5zHajyPk0pr yJsg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; 
Reply-To: Sean Christopherson
Date: Sat, 23 Apr 2022 03:47:45 +0000
In-Reply-To: <20220423034752.1161007-1-seanjc@google.com>
Message-Id: <20220423034752.1161007-6-seanjc@google.com>
References: <20220423034752.1161007-1-seanjc@google.com>
Subject: [PATCH 05/12] KVM: x86/mmu: Drop exec/NX check from "page fault can be fast"
From: Sean Christopherson
To: Paolo Bonzini
Cc: Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
 Joerg Roedel, kvm@vger.kernel.org, linux-kernel@vger.kernel.org,
 Ben Gardon, David Matlack, Venkatesh Srinivas, Chao Peng
X-Mailing-List: kvm@vger.kernel.org

Tweak the "page fault can be fast" logic to explicitly check for !PRESENT
faults in the access tracking case, and drop the exec/NX check that becomes
redundant as a result. No sane hardware will generate an access that is both
an instruction fetch and a write, i.e. it's a waste of cycles. If hardware
goes off the rails, or KVM runs under a misguided hypervisor, spuriously
running through the fast path is benign (KVM has unknowingly been doing
exactly that for years).

Signed-off-by: Sean Christopherson
---
 arch/x86/kvm/mmu/mmu.c | 19 ++++++++++++-------
 1 file changed, 12 insertions(+), 7 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index dfd1cfa9c08c..f1618d8289ce 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -3001,16 +3001,14 @@ static bool handle_abnormal_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fa
 static bool page_fault_can_be_fast(struct kvm_page_fault *fault)
 {
 	/*
-	 * Do not fix the mmio spte with invalid generation number which
-	 * need to be updated by slow page fault path.
+	 * Page faults with reserved bits set, i.e. faults on MMIO SPTEs, only
+	 * reach the common page fault handler if the SPTE has an invalid MMIO
+	 * generation number. Refreshing the MMIO generation needs to go down
+	 * the slow path. Note, EPT Misconfigs do NOT set the PRESENT flag!
 	 */
 	if (fault->rsvd)
 		return false;
 
-	/* See if the page fault is due to an NX violation */
-	if (unlikely(fault->exec && fault->present))
-		return false;
-
 	/*
 	 * #PF can be fast if:
 	 *
@@ -3026,7 +3024,14 @@ static bool page_fault_can_be_fast(struct kvm_page_fault *fault)
 	 * SPTE is MMU-writable (determined later), the fault can be fixed
 	 * by setting the Writable bit, which can be done out of mmu_lock.
*/ - return !kvm_ad_enabled() || (fault->write && fault->present); + if (!fault->present) + return !kvm_ad_enabled(); + + /* + * Note, instruction fetches and writes are mutually exclusive, ignore + * the "exec" flag. + */ + return fault->write; } /* From patchwork Sat Apr 23 03:47:46 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Sean Christopherson X-Patchwork-Id: 12824410 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 4C5D6C433F5 for ; Sat, 23 Apr 2022 03:48:35 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S232731AbiDWDv1 (ORCPT ); Fri, 22 Apr 2022 23:51:27 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:55582 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S232670AbiDWDvT (ORCPT ); Fri, 22 Apr 2022 23:51:19 -0400 Received: from mail-pl1-x649.google.com (mail-pl1-x649.google.com [IPv6:2607:f8b0:4864:20::649]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id C4ED91C3E6E for ; Fri, 22 Apr 2022 20:48:21 -0700 (PDT) Received: by mail-pl1-x649.google.com with SMTP id z5-20020a170902ccc500b0015716eaec65so5799502ple.14 for ; Fri, 22 Apr 2022 20:48:21 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20210112; h=reply-to:date:in-reply-to:message-id:mime-version:references :subject:from:to:cc; bh=QaIK5hbUNuIoBL/vPKNJJnLGWUg9x9GQfI0ts38CvUw=; b=FtNsmllT//XrLcX5ne2kL6yQKkeHsq3D2zI2Xzn8HpzZA6vzUqDZtNKOsa0PYyP+eL eSe6NitmPB4/hX1Mxolr+zv/gnK/yMOGcxLurUDIv71cWfwbnq1iA9RM9IreHWlO58uM dT4PKcEbwuI6NyNaJIQFCNIBFZ40Hc/MrA2ywsy6q7zFb1jhgb4e20lM0kXbmotCACTx 3MI09l6cGQkKCuQlB/DzvHLdb0/vGIL4gnFQrrqE5ltI3gqw4NxJnpnMuFNE0SYM7o3M 7Xfj5Jd8Ct7hGwx5fJJFyy77vN52n3Z/pmXnEunqjRBUz/qbFX8M36SZLEEtJXLkrnFB vdVQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=x-gm-message-state:reply-to:date:in-reply-to:message-id :mime-version:references:subject:from:to:cc; bh=QaIK5hbUNuIoBL/vPKNJJnLGWUg9x9GQfI0ts38CvUw=; b=tcQeu72uz0lFn0fzklS8BNFeYSA4oh7d9q9eqZzPNeKwTuQpHQYOZzzr1RkNSw7r1t eN+UsjmpQLtO2SFDh4jbtJvlw+ta273bSbEn005LP0X15rdU7Bj4FgnkNpKTItY4HTVq bR7an3Mgg1KdbgwaKcbgYUMhxGaznUy23uVCkpxp1QX7EMcmhMX5heeJBnM9zm84V6Vi 8RSwy1CQUfMSoBj1XUsBekAnqZoOteloFNwbRos59EwYoT/7sKjbtKxI0vAlIlkHgbrA IcQ94cK6iokK367L4FqxzNlB/oKrmrgwERx/V0astP2CQ2l/xswrvz1SteQiTYqK5KLv AKVA== X-Gm-Message-State: AOAM533ZorVOiBDpgResKAlH7HRX6FBdD2pbuRBC2UeCzsRIxp7wO7gG Dhdmzu90UoG8Q4lI9Ogur5pkMG3Iw7I= X-Google-Smtp-Source: ABdhPJzewDqgk9TdQxLpF903Hqz7WXWv0VBiYFXxfvQYdLhtDL8p/28Ud5xlsAjN3E7FvZ7kT2AvsFpyJbI= X-Received: from seanjc.c.googlers.com ([fda3:e722:ac3:cc00:7f:e700:c0a8:3e5]) (user=seanjc job=sendgmr) by 2002:a05:6a00:1256:b0:4fb:1374:2f65 with SMTP id u22-20020a056a00125600b004fb13742f65mr8305139pfi.72.1650685701202; Fri, 22 Apr 2022 20:48:21 -0700 (PDT) Reply-To: Sean Christopherson Date: Sat, 23 Apr 2022 03:47:46 +0000 In-Reply-To: <20220423034752.1161007-1-seanjc@google.com> Message-Id: <20220423034752.1161007-7-seanjc@google.com> Mime-Version: 1.0 References: <20220423034752.1161007-1-seanjc@google.com> X-Mailer: git-send-email 2.36.0.rc2.479.g8af0fa9b8e-goog Subject: [PATCH 06/12] KVM: x86/mmu: Add RET_PF_CONTINUE to eliminate bool+int* "returns" From: Sean Christopherson To: Paolo Bonzini Cc: Sean Christopherson , Vitaly 
Kuznetsov , Wanpeng Li , Jim Mattson , Joerg Roedel , kvm@vger.kernel.org, linux-kernel@vger.kernel.org, Ben Gardon , David Matlack , Venkatesh Srinivas , Chao Peng Precedence: bulk List-ID: X-Mailing-List: kvm@vger.kernel.org Add RET_PF_CONTINUE and use it in handle_abnormal_pfn() and kvm_faultin_pfn() to signal that the page fault handler should continue doing its thing. Aside from being gross and inefficient, using a boolean return to signal continue vs. stop makes it extremely difficult to add more helpers and/or move existing code to a helper. E.g. hypothetically, if nested MMUs were to gain a separate page fault handler in the future, everything up to the "is self-modifying PTE" check can be shared by all shadow MMUs, but communicating up the stack whether to continue on or stop becomes a nightmare. More concretely, proposed support for private guest memory ran into a similar issue, where it'll be forced to forego a helper in order to yield sane code: https://lore.kernel.org/all/YkJbxiL%2FAz7olWlq@google.com. No functional change intended. Cc: David Matlack Cc: Chao Peng Signed-off-by: Sean Christopherson --- arch/x86/kvm/mmu/mmu.c | 51 ++++++++++++++------------------- arch/x86/kvm/mmu/mmu_internal.h | 9 +++++- arch/x86/kvm/mmu/mmutrace.h | 1 + arch/x86/kvm/mmu/paging_tmpl.h | 6 ++-- 4 files changed, 35 insertions(+), 32 deletions(-) diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c index f1618d8289ce..f1e8d71e6f7c 100644 --- a/arch/x86/kvm/mmu/mmu.c +++ b/arch/x86/kvm/mmu/mmu.c @@ -2970,14 +2970,12 @@ static int kvm_handle_bad_page(struct kvm_vcpu *vcpu, gfn_t gfn, kvm_pfn_t pfn) return -EFAULT; } -static bool handle_abnormal_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault, - unsigned int access, int *ret_val) +static int handle_abnormal_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault, + unsigned int access) { /* The pfn is invalid, report the error! */ - if (unlikely(is_error_pfn(fault->pfn))) { - *ret_val = kvm_handle_bad_page(vcpu, fault->gfn, fault->pfn); - return true; - } + if (unlikely(is_error_pfn(fault->pfn))) + return kvm_handle_bad_page(vcpu, fault->gfn, fault->pfn); if (unlikely(!fault->slot)) { gva_t gva = fault->is_tdp ? 0 : fault->addr; @@ -2989,13 +2987,11 @@ static bool handle_abnormal_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fa * touching the shadow page tables as attempting to install an * MMIO SPTE will just be an expensive nop. */ - if (unlikely(!enable_mmio_caching)) { - *ret_val = RET_PF_EMULATE; - return true; - } + if (unlikely(!enable_mmio_caching)) + return RET_PF_EMULATE; } - return false; + return RET_PF_CONTINUE; } static bool page_fault_can_be_fast(struct kvm_page_fault *fault) @@ -3903,7 +3899,7 @@ static bool kvm_arch_setup_async_pf(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, kvm_vcpu_gfn_to_hva(vcpu, gfn), &arch); } -static bool kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault, int *r) +static int kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault) { struct kvm_memory_slot *slot = fault->slot; bool async; @@ -3914,7 +3910,7 @@ static bool kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault, * be zapped before KVM inserts a new MMIO SPTE for the gfn. */ if (slot && (slot->flags & KVM_MEMSLOT_INVALID)) - goto out_retry; + return RET_PF_RETRY; if (!kvm_is_visible_memslot(slot)) { /* Don't expose private memslots to L2. 
*/ @@ -3922,7 +3918,7 @@ static bool kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault, fault->slot = NULL; fault->pfn = KVM_PFN_NOSLOT; fault->map_writable = false; - return false; + return RET_PF_CONTINUE; } /* * If the APIC access page exists but is disabled, go directly @@ -3931,10 +3927,8 @@ static bool kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault, * when the AVIC is re-enabled. */ if (slot && slot->id == APIC_ACCESS_PAGE_PRIVATE_MEMSLOT && - !kvm_apicv_activated(vcpu->kvm)) { - *r = RET_PF_EMULATE; - return true; - } + !kvm_apicv_activated(vcpu->kvm)) + return RET_PF_EMULATE; } async = false; @@ -3942,26 +3936,23 @@ static bool kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault, fault->write, &fault->map_writable, &fault->hva); if (!async) - return false; /* *pfn has correct page already */ + return RET_PF_CONTINUE; /* *pfn has correct page already */ if (!fault->prefetch && kvm_can_do_async_pf(vcpu)) { trace_kvm_try_async_get_page(fault->addr, fault->gfn); if (kvm_find_async_pf_gfn(vcpu, fault->gfn)) { trace_kvm_async_pf_doublefault(fault->addr, fault->gfn); kvm_make_request(KVM_REQ_APF_HALT, vcpu); - goto out_retry; - } else if (kvm_arch_setup_async_pf(vcpu, fault->addr, fault->gfn)) - goto out_retry; + return RET_PF_RETRY; + } else if (kvm_arch_setup_async_pf(vcpu, fault->addr, fault->gfn)) { + return RET_PF_RETRY; + } } fault->pfn = __gfn_to_pfn_memslot(slot, fault->gfn, false, NULL, fault->write, &fault->map_writable, &fault->hva); - return false; - -out_retry: - *r = RET_PF_RETRY; - return true; + return RET_PF_CONTINUE; } /* @@ -4016,10 +4007,12 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault mmu_seq = vcpu->kvm->mmu_notifier_seq; smp_rmb(); - if (kvm_faultin_pfn(vcpu, fault, &r)) + r = kvm_faultin_pfn(vcpu, fault); + if (r != RET_PF_CONTINUE) return r; - if (handle_abnormal_pfn(vcpu, fault, ACC_ALL, &r)) + r = handle_abnormal_pfn(vcpu, fault, ACC_ALL); + if (r != RET_PF_CONTINUE) return r; r = RET_PF_RETRY; diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h index 1bff453f7cbe..c0e502b17ef7 100644 --- a/arch/x86/kvm/mmu/mmu_internal.h +++ b/arch/x86/kvm/mmu/mmu_internal.h @@ -143,6 +143,7 @@ unsigned int pte_list_count(struct kvm_rmap_head *rmap_head); /* * Return values of handle_mmio_page_fault, mmu.page_fault, and fast_page_fault(). * + * RET_PF_CONTINUE: So far, so good, keep handling the page fault. * RET_PF_RETRY: let CPU fault again on the address. * RET_PF_EMULATE: mmio page fault, emulate the instruction directly. * RET_PF_INVALID: the spte is invalid, let the real page fault path update it. @@ -151,9 +152,15 @@ unsigned int pte_list_count(struct kvm_rmap_head *rmap_head); * * Any names added to this enum should be exported to userspace for use in * tracepoints via TRACE_DEFINE_ENUM() in mmutrace.h + * + * Note, all values must be greater than or equal to zero so as not to encroach + * on -errno return values. Somewhat arbitrarily use '0' for CONTINUE, which + * will allow for efficient machine code when checking for CONTINUE, e.g. + * "TEST %rax, %rax, JNZ", as all "stop!" values are non-zero. 
*/ enum { - RET_PF_RETRY = 0, + RET_PF_CONTINUE = 0, + RET_PF_RETRY, RET_PF_EMULATE, RET_PF_INVALID, RET_PF_FIXED, diff --git a/arch/x86/kvm/mmu/mmutrace.h b/arch/x86/kvm/mmu/mmutrace.h index 12247b96af01..ae86820cef69 100644 --- a/arch/x86/kvm/mmu/mmutrace.h +++ b/arch/x86/kvm/mmu/mmutrace.h @@ -54,6 +54,7 @@ { PFERR_RSVD_MASK, "RSVD" }, \ { PFERR_FETCH_MASK, "F" } +TRACE_DEFINE_ENUM(RET_PF_CONTINUE); TRACE_DEFINE_ENUM(RET_PF_RETRY); TRACE_DEFINE_ENUM(RET_PF_EMULATE); TRACE_DEFINE_ENUM(RET_PF_INVALID); diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h index b025decf610d..7f8f1c8dbed2 100644 --- a/arch/x86/kvm/mmu/paging_tmpl.h +++ b/arch/x86/kvm/mmu/paging_tmpl.h @@ -838,10 +838,12 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault mmu_seq = vcpu->kvm->mmu_notifier_seq; smp_rmb(); - if (kvm_faultin_pfn(vcpu, fault, &r)) + r = kvm_faultin_pfn(vcpu, fault); + if (r != RET_PF_CONTINUE) return r; - if (handle_abnormal_pfn(vcpu, fault, walker.pte_access, &r)) + r = handle_abnormal_pfn(vcpu, fault, walker.pte_access); + if (r != RET_PF_CONTINUE) return r; /* From patchwork Sat Apr 23 03:47:47 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Sean Christopherson X-Patchwork-Id: 12824411 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 42C87C433F5 for ; Sat, 23 Apr 2022 03:48:41 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S232868AbiDWDvc (ORCPT ); Fri, 22 Apr 2022 23:51:32 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:55764 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S232789AbiDWDvV (ORCPT ); Fri, 22 Apr 2022 23:51:21 -0400 Received: from mail-pf1-x44a.google.com (mail-pf1-x44a.google.com [IPv6:2607:f8b0:4864:20::44a]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 69A4B1C45B8 for ; Fri, 22 Apr 2022 20:48:23 -0700 (PDT) Received: by mail-pf1-x44a.google.com with SMTP id d6-20020aa78686000000b0050adc2b200cso4681017pfo.21 for ; Fri, 22 Apr 2022 20:48:23 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20210112; h=reply-to:date:in-reply-to:message-id:mime-version:references :subject:from:to:cc; bh=RxFR1mf00mKdP3w8SqK2ejHHgn+MhCaTObKpuTFI/vM=; b=UcSm2gQUKYPN/4Eo6y24/YtvtXFOCcr1DyaMtZvO97RFpSFLFjwK2SCcIu1BmrBI5z ttapLMuLIA4u6/i1bT34VblsLeJtOEamEXThcL6LAnNy/Eumh9fBzBuxzdyrzJeR9DQ7 YKEBMqH5ut9pN5xxZOH7/JAk1n6fwx5M41Z6f5Dke1d3lV7XanMzUS5aSIgCqUH+5+Od /1DW3jY7kV6VmmMEQuOciL7EPN23wt+px9kuAbykSX3P9CB5O5HVaiCSqX6sHiIKYlQk SB7b9gzhScOiDXAu5MlCrFgw2+mK3YUmDbkRfrrQbsv4QDfx1/9fz9sP62r/EQjXDdSi OMoA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=x-gm-message-state:reply-to:date:in-reply-to:message-id :mime-version:references:subject:from:to:cc; bh=RxFR1mf00mKdP3w8SqK2ejHHgn+MhCaTObKpuTFI/vM=; b=XmOIdRa9IAvdjquxI3iqvBWZcVjXsACXDCA7K1KkleQC0mGuAQ74EB7M15+GO0Opp7 CiWfqjKrLfaCRDdSq0wBfFM3dB0FErHxMuhepBJLewsrssjRD0cQccsKuE6JSskq5sy3 HyTia6OG6X0tfZeu44JJPV2K7mUrfz+o8eglZh4MHwmMD3LG2gPRplERBtWFpf+vL7Mt O2kXaG+WvYfwO+Y+4Je1vpIHC05Gjwk6K8jOHRC5Qv/emx4K6gGxq7ooMj14Bg4S+G8l 1Jr+Tx72K/8bztH5VHgatH36S3V08cmtxvsr2Amo5XNUu1uYWIRGLI/6H3WqqvI6hWZz TIWg== X-Gm-Message-State: AOAM531NBFT5IR2T91haLzSO5DZ3eKM+6vBbeBO7raWOfr5LzADVCMOC 
QgxusRDo+cSXoPomhHkuSm4ofGJEfjk= X-Google-Smtp-Source: ABdhPJyRCjyyImwCwDiw01HZ+aBCOjwSSC0m5TMVo9TJ7pzHs3IohmxUhdEjZO2jVVo5dc4WI5qw7nl2aOI= X-Received: from seanjc.c.googlers.com ([fda3:e722:ac3:cc00:7f:e700:c0a8:3e5]) (user=seanjc job=sendgmr) by 2002:a62:b60f:0:b0:508:2a61:2c8b with SMTP id j15-20020a62b60f000000b005082a612c8bmr8280695pff.2.1650685702931; Fri, 22 Apr 2022 20:48:22 -0700 (PDT) Reply-To: Sean Christopherson Date: Sat, 23 Apr 2022 03:47:47 +0000 In-Reply-To: <20220423034752.1161007-1-seanjc@google.com> Message-Id: <20220423034752.1161007-8-seanjc@google.com> Mime-Version: 1.0 References: <20220423034752.1161007-1-seanjc@google.com> X-Mailer: git-send-email 2.36.0.rc2.479.g8af0fa9b8e-goog Subject: [PATCH 07/12] KVM: x86/mmu: Make all page fault handlers internal to the MMU From: Sean Christopherson To: Paolo Bonzini Cc: Sean Christopherson , Vitaly Kuznetsov , Wanpeng Li , Jim Mattson , Joerg Roedel , kvm@vger.kernel.org, linux-kernel@vger.kernel.org, Ben Gardon , David Matlack , Venkatesh Srinivas , Chao Peng Precedence: bulk List-ID: X-Mailing-List: kvm@vger.kernel.org Move kvm_arch_async_page_ready() to mmu.c where it belongs, and move all of the page fault handling collateral that was in mmu.h purely for the async #PF handler into mmu_internal.h, where it belongs. This will allow kvm_mmu_do_page_fault() to act on the RET_PF_* return without having to expose those enums outside of the MMU. No functional change intended. Signed-off-by: Sean Christopherson --- arch/x86/kvm/mmu.h | 87 ------------------------------- arch/x86/kvm/mmu/mmu.c | 19 +++++++ arch/x86/kvm/mmu/mmu_internal.h | 90 ++++++++++++++++++++++++++++++++- arch/x86/kvm/x86.c | 19 ------- 4 files changed, 108 insertions(+), 107 deletions(-) diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h index 671cfeccf04e..461052bef896 100644 --- a/arch/x86/kvm/mmu.h +++ b/arch/x86/kvm/mmu.h @@ -117,93 +117,6 @@ static inline void kvm_mmu_load_pgd(struct kvm_vcpu *vcpu) vcpu->arch.mmu->root_role.level); } -struct kvm_page_fault { - /* arguments to kvm_mmu_do_page_fault. */ - const gpa_t addr; - const u32 error_code; - const bool prefetch; - - /* Derived from error_code. */ - const bool exec; - const bool write; - const bool present; - const bool rsvd; - const bool user; - - /* Derived from mmu and global state. */ - const bool is_tdp; - const bool nx_huge_page_workaround_enabled; - - /* - * Whether a >4KB mapping can be created or is forbidden due to NX - * hugepages. - */ - bool huge_page_disallowed; - - /* - * Maximum page size that can be created for this fault; input to - * FNAME(fetch), __direct_map and kvm_tdp_mmu_map. - */ - u8 max_level; - - /* - * Page size that can be created based on the max_level and the - * page size used by the host mapping. - */ - u8 req_level; - - /* - * Page size that will be created based on the req_level and - * huge_page_disallowed. - */ - u8 goal_level; - - /* Shifted addr, or result of guest page table walk if addr is a gva. */ - gfn_t gfn; - - /* The memslot containing gfn. May be NULL. */ - struct kvm_memory_slot *slot; - - /* Outputs of kvm_faultin_pfn. 
*/ - kvm_pfn_t pfn; - hva_t hva; - bool map_writable; -}; - -int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault); - -extern int nx_huge_pages; -static inline bool is_nx_huge_page_enabled(void) -{ - return READ_ONCE(nx_huge_pages); -} - -static inline int kvm_mmu_do_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, - u32 err, bool prefetch) -{ - struct kvm_page_fault fault = { - .addr = cr2_or_gpa, - .error_code = err, - .exec = err & PFERR_FETCH_MASK, - .write = err & PFERR_WRITE_MASK, - .present = err & PFERR_PRESENT_MASK, - .rsvd = err & PFERR_RSVD_MASK, - .user = err & PFERR_USER_MASK, - .prefetch = prefetch, - .is_tdp = likely(vcpu->arch.mmu->page_fault == kvm_tdp_page_fault), - .nx_huge_page_workaround_enabled = is_nx_huge_page_enabled(), - - .max_level = KVM_MAX_HUGEPAGE_LEVEL, - .req_level = PG_LEVEL_4K, - .goal_level = PG_LEVEL_4K, - }; -#ifdef CONFIG_RETPOLINE - if (fault.is_tdp) - return kvm_tdp_page_fault(vcpu, &fault); -#endif - return vcpu->arch.mmu->page_fault(vcpu, &fault); -} - /* * Check if a given access (described through the I/D, W/R and U/S bits of a * page fault error code pfec) causes a permission fault with the given PTE diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c index f1e8d71e6f7c..8b8b62d2a903 100644 --- a/arch/x86/kvm/mmu/mmu.c +++ b/arch/x86/kvm/mmu/mmu.c @@ -3899,6 +3899,25 @@ static bool kvm_arch_setup_async_pf(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, kvm_vcpu_gfn_to_hva(vcpu, gfn), &arch); } +void kvm_arch_async_page_ready(struct kvm_vcpu *vcpu, struct kvm_async_pf *work) +{ + int r; + + if ((vcpu->arch.mmu->root_role.direct != work->arch.direct_map) || + work->wakeup_all) + return; + + r = kvm_mmu_reload(vcpu); + if (unlikely(r)) + return; + + if (!vcpu->arch.mmu->root_role.direct && + work->arch.cr3 != vcpu->arch.mmu->get_guest_pgd(vcpu)) + return; + + kvm_mmu_do_page_fault(vcpu, work->cr2_or_gpa, 0, true); +} + static int kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault) { struct kvm_memory_slot *slot = fault->slot; diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h index c0e502b17ef7..c0c85cbfa159 100644 --- a/arch/x86/kvm/mmu/mmu_internal.h +++ b/arch/x86/kvm/mmu/mmu_internal.h @@ -140,8 +140,70 @@ void kvm_flush_remote_tlbs_with_address(struct kvm *kvm, u64 start_gfn, u64 pages); unsigned int pte_list_count(struct kvm_rmap_head *rmap_head); +extern int nx_huge_pages; +static inline bool is_nx_huge_page_enabled(void) +{ + return READ_ONCE(nx_huge_pages); +} + +struct kvm_page_fault { + /* arguments to kvm_mmu_do_page_fault. */ + const gpa_t addr; + const u32 error_code; + const bool prefetch; + + /* Derived from error_code. */ + const bool exec; + const bool write; + const bool present; + const bool rsvd; + const bool user; + + /* Derived from mmu and global state. */ + const bool is_tdp; + const bool nx_huge_page_workaround_enabled; + + /* + * Whether a >4KB mapping can be created or is forbidden due to NX + * hugepages. + */ + bool huge_page_disallowed; + + /* + * Maximum page size that can be created for this fault; input to + * FNAME(fetch), __direct_map and kvm_tdp_mmu_map. + */ + u8 max_level; + + /* + * Page size that can be created based on the max_level and the + * page size used by the host mapping. + */ + u8 req_level; + + /* + * Page size that will be created based on the req_level and + * huge_page_disallowed. + */ + u8 goal_level; + + /* Shifted addr, or result of guest page table walk if addr is a gva. 
*/ + gfn_t gfn; + + /* The memslot containing gfn. May be NULL. */ + struct kvm_memory_slot *slot; + + /* Outputs of kvm_faultin_pfn. */ + kvm_pfn_t pfn; + hva_t hva; + bool map_writable; +}; + +int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault); + /* - * Return values of handle_mmio_page_fault, mmu.page_fault, and fast_page_fault(). + * Return values of handle_mmio_page_fault(), mmu.page_fault(), fast_page_fault(), + * and of course kvm_mmu_do_page_fault(). * * RET_PF_CONTINUE: So far, so good, keep handling the page fault. * RET_PF_RETRY: let CPU fault again on the address. @@ -167,6 +229,32 @@ enum { RET_PF_SPURIOUS, }; +static inline int kvm_mmu_do_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, + u32 err, bool prefetch) +{ + struct kvm_page_fault fault = { + .addr = cr2_or_gpa, + .error_code = err, + .exec = err & PFERR_FETCH_MASK, + .write = err & PFERR_WRITE_MASK, + .present = err & PFERR_PRESENT_MASK, + .rsvd = err & PFERR_RSVD_MASK, + .user = err & PFERR_USER_MASK, + .prefetch = prefetch, + .is_tdp = likely(vcpu->arch.mmu->page_fault == kvm_tdp_page_fault), + .nx_huge_page_workaround_enabled = is_nx_huge_page_enabled(), + + .max_level = KVM_MAX_HUGEPAGE_LEVEL, + .req_level = PG_LEVEL_4K, + .goal_level = PG_LEVEL_4K, + }; +#ifdef CONFIG_RETPOLINE + if (fault.is_tdp) + return kvm_tdp_page_fault(vcpu, &fault); +#endif + return vcpu->arch.mmu->page_fault(vcpu, &fault); +} + int kvm_mmu_max_mapping_level(struct kvm *kvm, const struct kvm_memory_slot *slot, gfn_t gfn, kvm_pfn_t pfn, int max_level); diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 951d0a78ccda..7663c35a5c70 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -12356,25 +12356,6 @@ void kvm_set_rflags(struct kvm_vcpu *vcpu, unsigned long rflags) } EXPORT_SYMBOL_GPL(kvm_set_rflags); -void kvm_arch_async_page_ready(struct kvm_vcpu *vcpu, struct kvm_async_pf *work) -{ - int r; - - if ((vcpu->arch.mmu->root_role.direct != work->arch.direct_map) || - work->wakeup_all) - return; - - r = kvm_mmu_reload(vcpu); - if (unlikely(r)) - return; - - if (!vcpu->arch.mmu->root_role.direct && - work->arch.cr3 != vcpu->arch.mmu->get_guest_pgd(vcpu)) - return; - - kvm_mmu_do_page_fault(vcpu, work->cr2_or_gpa, 0, true); -} - static inline u32 kvm_async_pf_hash_fn(gfn_t gfn) { BUILD_BUG_ON(!is_power_of_2(ASYNC_PF_PER_VCPU)); From patchwork Sat Apr 23 03:47:48 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Sean Christopherson X-Patchwork-Id: 12824413 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 36306C43217 for ; Sat, 23 Apr 2022 03:48:45 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S232894AbiDWDvg (ORCPT ); Fri, 22 Apr 2022 23:51:36 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:55582 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S232159AbiDWDvV (ORCPT ); Fri, 22 Apr 2022 23:51:21 -0400 Received: from mail-pl1-x649.google.com (mail-pl1-x649.google.com [IPv6:2607:f8b0:4864:20::649]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 1F8CD1C58EA for ; Fri, 22 Apr 2022 20:48:25 -0700 (PDT) Received: by mail-pl1-x649.google.com with SMTP id ij17-20020a170902ab5100b00158f6f83068so5804411plb.19 for ; Fri, 22 Apr 2022 20:48:25 -0700 (PDT) 
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20210112; h=reply-to:date:in-reply-to:message-id:mime-version:references :subject:from:to:cc; bh=nIl4luVOJWFkIclprxcKdIn+QCEjsjA9FteQ7s7ZGic=; b=Jks/1SdRI93xyP5uwMJ6C+K0aagltFg0tPscYlPq2RulXCzDR9bwmd/2YFjWammzja RnmNJkbEImsLiYjDKXLFudnHmutWlf9LMoek0VRl7qtsD7Dt5gjHzM1GsTD+er3aHqe8 MyT0brKKquYdYgfR8hs28aTTvGl/5qXrE9GpOB0UIQPx+HdOQFo5cmvsblUiIPmNMR7t yGpq5BqTYAM1ST6Nx38oijcHR1KOwDrkNel658nL/5Dp5prtpqXGwrmxWGW78ByFcMxs +cDcAPTc7KEyWzhNoETs7kOxhnjOXZE+G8XKpf1xP2kxi+DaHPPwXT03r+cn6ouNMshM on1A== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=x-gm-message-state:reply-to:date:in-reply-to:message-id :mime-version:references:subject:from:to:cc; bh=nIl4luVOJWFkIclprxcKdIn+QCEjsjA9FteQ7s7ZGic=; b=W9tbdhzhnEyn5lP0dQLUOYJm1RU4dIrpm2dfgQi/RzCJDCDpvdi2jxQAhuIoaDgczA S8ODlCjklveC+gNEwegquxBfA8CHv+AvSDW2AA/MxxE4yD/VCk6lbJVVT5v8NReq5xMg 8NHiW+zKkPM1Cwzu/b4ONNeqsj1we/OgcIYLeZhlkPI/YXiPCoxU/Kwk83XKOGTVQyUR 8G4TupHA2cyuHrMnFD03Me5dKRprpD3s76MwHLNVrDuodXJXOWt74qOkEpqSK5cxllg4 QQt496NTtmthupmsRypx6sWu0cmVCRats1N3yKFy16yVpnaXPngeT82VqEexj55tHQmc O0gg== X-Gm-Message-State: AOAM531jaPgKAl7Th5JW99Y76SrCO/u9O5G6gAXb/pDdtI55gAqBMqaD hn5QcJCUKPuJkxiKHZPHMAX15w/ocrc= X-Google-Smtp-Source: ABdhPJz28rMsOFWsoRfGTKdC9uqo+Oi0ikmXtSwLqUR7BvCyXYi01bop8+pSXl7ackmkhJjqc0UalBUIsOE= X-Received: from seanjc.c.googlers.com ([fda3:e722:ac3:cc00:7f:e700:c0a8:3e5]) (user=seanjc job=sendgmr) by 2002:a05:6a00:14c4:b0:50a:9524:67bf with SMTP id w4-20020a056a0014c400b0050a952467bfmr8129386pfu.55.1650685704656; Fri, 22 Apr 2022 20:48:24 -0700 (PDT) Reply-To: Sean Christopherson Date: Sat, 23 Apr 2022 03:47:48 +0000 In-Reply-To: <20220423034752.1161007-1-seanjc@google.com> Message-Id: <20220423034752.1161007-9-seanjc@google.com> Mime-Version: 1.0 References: <20220423034752.1161007-1-seanjc@google.com> X-Mailer: git-send-email 2.36.0.rc2.479.g8af0fa9b8e-goog Subject: [PATCH 08/12] KVM: x86/mmu: Use IS_ENABLED() to avoid RETPOLINE for TDP page faults From: Sean Christopherson To: Paolo Bonzini Cc: Sean Christopherson , Vitaly Kuznetsov , Wanpeng Li , Jim Mattson , Joerg Roedel , kvm@vger.kernel.org, linux-kernel@vger.kernel.org, Ben Gardon , David Matlack , Venkatesh Srinivas , Chao Peng Precedence: bulk List-ID: X-Mailing-List: kvm@vger.kernel.org Use IS_ENABLED() instead of an #ifdef to activate the anti-RETPOLINE fast path for TDP page faults. The generated code is identical, and the #ifdef makes it dangerously difficult to extend the logic (guess who forgot to add an "else" inside the #ifdef and ran through the page fault handler twice). No functional or binary change intended.
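
To illustrate the hazard with a minimal, self-contained userspace sketch (stand-in names only: CONFIG_RETPOLINE_DEMO and the two handlers below are made up, and the real code dispatches to kvm_tdp_page_fault() vs. vcpu->arch.mmu->page_fault()): with the #ifdef form, a missing "else" silently runs both handlers when the option is enabled, whereas the IS_ENABLED()-style form keeps the fallback in ordinary control flow, so the dead branch is still compiled out but can't be fallen through.

#include <stdio.h>

#ifdef CONFIG_RETPOLINE_DEMO
#define RETPOLINE_DEMO_ENABLED 1
#else
#define RETPOLINE_DEMO_ENABLED 0
#endif

static int tdp_handler(void)      { puts("  tdp handler");      return 0; }
static int indirect_handler(void) { puts("  indirect handler"); return 0; }

/* The pattern being removed: nothing forces an "else" after the #ifdef. */
static int dispatch_ifdef(int is_tdp)
{
	int r = -1;

#ifdef CONFIG_RETPOLINE_DEMO
	if (is_tdp)
		r = tdp_handler();
#endif
	r = indirect_handler();	/* oops: also runs when r was already set */
	return r;
}

/*
 * The pattern this patch adopts: a single expression, so the fallback can't
 * be missed, and the compiler still discards the dead branch because the
 * flag is a compile-time constant.
 */
static int dispatch_is_enabled(int is_tdp)
{
	if (RETPOLINE_DEMO_ENABLED && is_tdp)
		return tdp_handler();

	return indirect_handler();
}

int main(void)
{
	puts("#ifdef form (build with -DCONFIG_RETPOLINE_DEMO to see both handlers run):");
	dispatch_ifdef(1);
	puts("IS_ENABLED()-style form:");
	dispatch_is_enabled(1);
	return 0;
}
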
Signed-off-by: Sean Christopherson --- arch/x86/kvm/mmu/mmu_internal.h | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h index c0c85cbfa159..9caa747ee033 100644 --- a/arch/x86/kvm/mmu/mmu_internal.h +++ b/arch/x86/kvm/mmu/mmu_internal.h @@ -248,10 +248,10 @@ static inline int kvm_mmu_do_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, .req_level = PG_LEVEL_4K, .goal_level = PG_LEVEL_4K, }; -#ifdef CONFIG_RETPOLINE - if (fault.is_tdp) + + if (IS_ENABLED(CONFIG_RETPOLINE) && fault.is_tdp) return kvm_tdp_page_fault(vcpu, &fault); -#endif + return vcpu->arch.mmu->page_fault(vcpu, &fault); } From patchwork Sat Apr 23 03:47:49 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Sean Christopherson X-Patchwork-Id: 12824416 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 6DA6EC433EF for ; Sat, 23 Apr 2022 03:48:54 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S232910AbiDWDvs (ORCPT ); Fri, 22 Apr 2022 23:51:48 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:55846 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S232821AbiDWDvW (ORCPT ); Fri, 22 Apr 2022 23:51:22 -0400 Received: from mail-pf1-x449.google.com (mail-pf1-x449.google.com [IPv6:2607:f8b0:4864:20::449]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id C94CE1C65EE for ; Fri, 22 Apr 2022 20:48:26 -0700 (PDT) Received: by mail-pf1-x449.google.com with SMTP id j8-20020aa78d08000000b0050ade744b37so4508868pfe.16 for ; Fri, 22 Apr 2022 20:48:26 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20210112; h=reply-to:date:in-reply-to:message-id:mime-version:references :subject:from:to:cc; bh=YY8H3iTygcq3QVv8LQgrDAaxhN4Yk1aATZLxq6rfZiU=; b=MsteZce6/w12u24Ta6z0Wiz+JiCEC6QQcGbGjaMpfop07BYflnQdSzMl3aU5y1eIp+ ScPzgjJmFkmxpJQJlNKccfNZgqWiSxk3DbkOeNNAHa0/LJpEH/TECQZiCe3TKU6R/asC 59kXNskB9aFnAtcJ8W68Cg/OYIiRkOWL6dWhFO8tOksfy9hhjlNQ6lLWpTCX8hkJhVeO svt4uiS6FlxbmN9Z95Iq7HP/MuvMtaf5csjKhkP+4RRYi++5F/E6X0hn2AM+wNRVUWGS LlEJsDipAlEiOC9yRinRHr33MzFJ7UNtr+cTIDnf0wlvIuGKqUaTFbwjRJE9SsbUBmBL 9Ngg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=x-gm-message-state:reply-to:date:in-reply-to:message-id :mime-version:references:subject:from:to:cc; bh=YY8H3iTygcq3QVv8LQgrDAaxhN4Yk1aATZLxq6rfZiU=; b=HVgfu9L/3403rWmic6fblZY3uAaIUnOYBwNgbDvCEZDr/dx67V9Wwg86IVAYA5uxw9 B0721Amq4i8A7lqenSms8+nn9KENVKS8bJ38IXCLoyjIxOzXoHkLYqvbREQDW5yt4LbI AET57AqIfeF7JatjqVPwWhzKYPtvAfzUtTZyvFRdUZPIpAqjRtq8mdlNy2WAgw1zYOVX X7fE//Yw3SqnZDKhFs/VpP2E5w7Zp2XYbnW0nlYcmiCguL9vAM3vNalv7zuMUeStwA4q sigyYagkctiDaYFRlwrW+y5+wRvuKKzO6YcT602Hetm+lVBnhesFt710lMUjBFiXD6HJ +nxw== X-Gm-Message-State: AOAM531SYF503Nh8CZw53b0mEMGYX3b9l4fKlf77CUhbj4K5XRRJiRtq TrLf5R02Av7QQ9i6Wnttd/WOZVWImRQ= X-Google-Smtp-Source: ABdhPJzSTsbeEjfeieThDJe75FkXMw8fvWL8YHFt1inu2y4xMtSc+OenEYhKMblhZ2rOLi/mB6JMbzTrtQw= X-Received: from seanjc.c.googlers.com ([fda3:e722:ac3:cc00:7f:e700:c0a8:3e5]) (user=seanjc job=sendgmr) by 2002:a17:902:f68e:b0:15c:4367:d1a0 with SMTP id l14-20020a170902f68e00b0015c4367d1a0mr5362196plg.164.1650685706328; Fri, 22 Apr 2022 20:48:26 -0700 (PDT) Reply-To: Sean Christopherson Date: Sat, 23 Apr 
2022 03:47:49 +0000 In-Reply-To: <20220423034752.1161007-1-seanjc@google.com> Message-Id: <20220423034752.1161007-10-seanjc@google.com> Mime-Version: 1.0 References: <20220423034752.1161007-1-seanjc@google.com> X-Mailer: git-send-email 2.36.0.rc2.479.g8af0fa9b8e-goog Subject: [PATCH 09/12] KVM: x86/mmu: Expand and clean up page fault stats From: Sean Christopherson To: Paolo Bonzini Cc: Sean Christopherson , Vitaly Kuznetsov , Wanpeng Li , Jim Mattson , Joerg Roedel , kvm@vger.kernel.org, linux-kernel@vger.kernel.org, Ben Gardon , David Matlack , Venkatesh Srinivas , Chao Peng Precedence: bulk List-ID: X-Mailing-List: kvm@vger.kernel.org Expand and clean up the page fault stats. The current stats are at best incomplete, and at worst misleading. Differentiate between faults that are actually fixed vs those that result in an MMIO SPTE being created, track faults that are spurious, faults that trigger emulation, faults that that are fixed in the fast path, and last but not least, track the number of faults that are taken. Note, the number of faults that require emulation for write-protected shadow pages can roughly be calculated by subtracting the number of MMIO SPTEs created from the overall number of faults that trigger emulation. Signed-off-by: Sean Christopherson --- arch/x86/include/asm/kvm_host.h | 5 +++++ arch/x86/kvm/mmu/mmu.c | 7 +++++-- arch/x86/kvm/mmu/mmu_internal.h | 28 ++++++++++++++++++++++++++-- arch/x86/kvm/mmu/paging_tmpl.h | 1 - arch/x86/kvm/mmu/tdp_mmu.c | 8 +------- arch/x86/kvm/x86.c | 5 +++++ 6 files changed, 42 insertions(+), 12 deletions(-) diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index f164c6c1514a..c5fb4115176d 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -1269,7 +1269,12 @@ struct kvm_vm_stat { struct kvm_vcpu_stat { struct kvm_vcpu_stat_generic generic; + u64 pf_taken; u64 pf_fixed; + u64 pf_emulate; + u64 pf_spurious; + u64 pf_fast; + u64 pf_mmio_spte_created; u64 pf_guest; u64 tlb_flush; u64 invlpg; diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c index 8b8b62d2a903..744c06bd7017 100644 --- a/arch/x86/kvm/mmu/mmu.c +++ b/arch/x86/kvm/mmu/mmu.c @@ -2660,6 +2660,7 @@ static int mmu_set_spte(struct kvm_vcpu *vcpu, struct kvm_memory_slot *slot, *sptep, write_fault, gfn); if (unlikely(is_noslot_pfn(pfn))) { + vcpu->stat.pf_mmio_spte_created++; mark_mmio_spte(vcpu, sptep, gfn, pte_access); return RET_PF_EMULATE; } @@ -2943,7 +2944,6 @@ static int __direct_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault) return ret; direct_pte_prefetch(vcpu, it.sptep); - ++vcpu->stat.pf_fixed; return ret; } @@ -3206,6 +3206,9 @@ static int fast_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault) trace_fast_page_fault(vcpu, fault, sptep, spte, ret); walk_shadow_page_lockless_end(vcpu); + if (ret != RET_PF_INVALID) + vcpu->stat.pf_fast++; + return ret; } @@ -5311,7 +5314,7 @@ static void kvm_mmu_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa, write_unlock(&vcpu->kvm->mmu_lock); } -int kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, u64 error_code, +int noinline kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, u64 error_code, void *insn, int insn_len) { int r, emulation_type = EMULTYPE_PF; diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h index 9caa747ee033..bd2a26897b97 100644 --- a/arch/x86/kvm/mmu/mmu_internal.h +++ b/arch/x86/kvm/mmu/mmu_internal.h @@ -248,11 +248,35 @@ static inline int kvm_mmu_do_page_fault(struct kvm_vcpu 
*vcpu, gpa_t cr2_or_gpa, .req_level = PG_LEVEL_4K, .goal_level = PG_LEVEL_4K, }; + int r; + + /* + * Async #PF "faults", a.k.a. prefetch faults, are not faults from the + * guest perspective and have already been counted at the time of the + * original fault. + */ + if (!prefetch) + vcpu->stat.pf_taken++; if (IS_ENABLED(CONFIG_RETPOLINE) && fault.is_tdp) - return kvm_tdp_page_fault(vcpu, &fault); + r = kvm_tdp_page_fault(vcpu, &fault); + else + r = vcpu->arch.mmu->page_fault(vcpu, &fault); - return vcpu->arch.mmu->page_fault(vcpu, &fault); + /* + * Similar to above, prefetch faults aren't truly spurious, and the + * async #PF path doesn't do emulation. Do count faults that are fixed + * by the async #PF handler though, otherwise they'll never be counted. + */ + if (r == RET_PF_FIXED) + vcpu->stat.pf_fixed++; + else if (prefetch) + ; + else if (r == RET_PF_EMULATE) + vcpu->stat.pf_emulate++; + else if (r == RET_PF_SPURIOUS) + vcpu->stat.pf_spurious++; + return r; } int kvm_mmu_max_mapping_level(struct kvm *kvm, diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h index 7f8f1c8dbed2..db80f7ccaa4e 100644 --- a/arch/x86/kvm/mmu/paging_tmpl.h +++ b/arch/x86/kvm/mmu/paging_tmpl.h @@ -723,7 +723,6 @@ static int FNAME(fetch)(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault, return ret; FNAME(pte_prefetch)(vcpu, gw, it.sptep); - ++vcpu->stat.pf_fixed; return ret; out_gpte_changed: diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c index a2eda3e55697..8089beb312d1 100644 --- a/arch/x86/kvm/mmu/tdp_mmu.c +++ b/arch/x86/kvm/mmu/tdp_mmu.c @@ -1099,6 +1099,7 @@ static int tdp_mmu_map_handle_target_level(struct kvm_vcpu *vcpu, /* If a MMIO SPTE is installed, the MMIO will need to be emulated. */ if (unlikely(is_mmio_spte(new_spte))) { + vcpu->stat.pf_mmio_spte_created++; trace_mark_mmio_spte(rcu_dereference(iter->sptep), iter->gfn, new_spte); ret = RET_PF_EMULATE; @@ -1107,13 +1108,6 @@ static int tdp_mmu_map_handle_target_level(struct kvm_vcpu *vcpu, rcu_dereference(iter->sptep)); } - /* - * Increase pf_fixed in both RET_PF_EMULATE and RET_PF_FIXED to be - * consistent with legacy MMU behavior. 
- */ - if (ret != RET_PF_SPURIOUS) - vcpu->stat.pf_fixed++; - return ret; } diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 7663c35a5c70..a6441b281fb3 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -266,7 +266,12 @@ const struct kvm_stats_header kvm_vm_stats_header = { const struct _kvm_stats_desc kvm_vcpu_stats_desc[] = { KVM_GENERIC_VCPU_STATS(), + STATS_DESC_COUNTER(VCPU, pf_taken), STATS_DESC_COUNTER(VCPU, pf_fixed), + STATS_DESC_COUNTER(VCPU, pf_emulate), + STATS_DESC_COUNTER(VCPU, pf_spurious), + STATS_DESC_COUNTER(VCPU, pf_fast), + STATS_DESC_COUNTER(VCPU, pf_mmio_spte_created), STATS_DESC_COUNTER(VCPU, pf_guest), STATS_DESC_COUNTER(VCPU, tlb_flush), STATS_DESC_COUNTER(VCPU, invlpg), From patchwork Sat Apr 23 03:47:50 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Sean Christopherson X-Patchwork-Id: 12824414 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 0B06FC433EF for ; Sat, 23 Apr 2022 03:48:45 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S232924AbiDWDvi (ORCPT ); Fri, 22 Apr 2022 23:51:38 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:56788 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S232789AbiDWDvf (ORCPT ); Fri, 22 Apr 2022 23:51:35 -0400 Received: from mail-pl1-x64a.google.com (mail-pl1-x64a.google.com [IPv6:2607:f8b0:4864:20::64a]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id D39F61C78E9 for ; Fri, 22 Apr 2022 20:48:28 -0700 (PDT) Received: by mail-pl1-x64a.google.com with SMTP id j21-20020a170902c3d500b0015cecdddb3dso67522plj.21 for ; Fri, 22 Apr 2022 20:48:28 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20210112; h=reply-to:date:in-reply-to:message-id:mime-version:references :subject:from:to:cc; bh=ElX6s8O+jQm/pf5/HoWc8mJl1k3Q5fJv6/6onQIYo7A=; b=r150U+eD4HCCeu3Wtly98bpxxf14GzpxUSlmcXASoApafzO1Kasae0DXfkL++A/kXu L9a+4bIXQlnFUca/vtg/auzoVn7lJxgJj9n23bu4hiZkXxt+ZiAPm6zioFicXHZ/WA53 oQFyKfFdV9PIS7vuXoQrLd4Un4HhuQUDuUChb/4KN8tRjUwfm51/F7P+MpcyXy5uRbyg MZ2VYwkwbaG1LOf8kh3r4corliZxbp/QLt8zoV77WTeZc8m0QZ/m8TvYO2QfcckNGNRA o8CaMgNHpIA+WR3Ch8/X2njK7BQIXRbhtoQSICo4h6VAo6iAu1gpl5DZd8+KSitZhcYw lzPQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=x-gm-message-state:reply-to:date:in-reply-to:message-id :mime-version:references:subject:from:to:cc; bh=ElX6s8O+jQm/pf5/HoWc8mJl1k3Q5fJv6/6onQIYo7A=; b=Bl+yVpKierCqSqeB1EtAd7EjSqVYvzTSNrlfMO3WiE2MJLP+TNTjglTQMcWrtWICXs oTiuUZ+qVOXKRhySelPZNzXV+lz7lbOpPX2r8JIg8Bvaz9YodMec5fRVAH0gJaQRonki 6Yyk9ocN6238LeVcPZNu3CaRbaebNSQF76hxGK7y3T4IFK37mOWXHTcYbHfoZYtjHjmw DwIE82Zlug1XKqT6cwDwnHSxU/nqL5FeeZnZedkhxwHIPDID9hFzXtcGlzMLv/I1lsvh MdxAWtwJWfJ0qk2qZg3uiC4168sbpDwb3whBCMo6JLpWal1QGrqAnaLNaURsKeR/AOLM QefA== X-Gm-Message-State: AOAM530anJqhLBUmwq5qsLVbldmOBdnzJtBc6fauEqJO8ATxQjyV+INQ f3XO3KcePCIDuc3fXwtv8pSHiFe8K78= X-Google-Smtp-Source: ABdhPJxrk2xLYR8PaLzAgUGhmeIhZwmbUeg/2Vk3Hssukf/3zfMDOSsmF/hWuknXOnynUDpT0bJixw9tkXQ= X-Received: from seanjc.c.googlers.com ([fda3:e722:ac3:cc00:7f:e700:c0a8:3e5]) (user=seanjc job=sendgmr) by 2002:a17:90b:2384:b0:1cb:5223:9dc4 with SMTP id mr4-20020a17090b238400b001cb52239dc4mr716293pjb.1.1650685708066; Fri, 22 Apr 2022 20:48:28 -0700 (PDT) Reply-To: Sean 
Christopherson Date: Sat, 23 Apr 2022 03:47:50 +0000 In-Reply-To: <20220423034752.1161007-1-seanjc@google.com> Message-Id: <20220423034752.1161007-11-seanjc@google.com> Mime-Version: 1.0 References: <20220423034752.1161007-1-seanjc@google.com> X-Mailer: git-send-email 2.36.0.rc2.479.g8af0fa9b8e-goog Subject: [PATCH 10/12] DO NOT MERGE: KVM: x86/mmu: Always send !PRESENT faults down the fast path From: Sean Christopherson To: Paolo Bonzini Cc: Sean Christopherson , Vitaly Kuznetsov , Wanpeng Li , Jim Mattson , Joerg Roedel , kvm@vger.kernel.org, linux-kernel@vger.kernel.org, Ben Gardon , David Matlack , Venkatesh Srinivas , Chao Peng Precedence: bulk List-ID: X-Mailing-List: kvm@vger.kernel.org Posted for posterity, and to show that it's possible to funnel indirect page faults down the fast path. Not-signed-off-by: Sean Christopherson --- arch/x86/kvm/mmu/mmu.c | 44 +++++++++++++++++++++--------------------- 1 file changed, 22 insertions(+), 22 deletions(-) diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c index 744c06bd7017..7ba88907d032 100644 --- a/arch/x86/kvm/mmu/mmu.c +++ b/arch/x86/kvm/mmu/mmu.c @@ -3006,26 +3006,25 @@ static bool page_fault_can_be_fast(struct kvm_page_fault *fault) return false; /* - * #PF can be fast if: - * - * 1. The shadow page table entry is not present and A/D bits are - * disabled _by KVM_, which could mean that the fault is potentially - * caused by access tracking (if enabled). If A/D bits are enabled - * by KVM, but disabled by L1 for L2, KVM is forced to disable A/D - * bits for L2 and employ access tracking, but the fast page fault - * mechanism only supports direct MMUs. - * 2. The shadow page table entry is present, the access is a write, - * and no reserved bits are set (MMIO SPTEs cannot be "fixed"), i.e. - * the fault was caused by a write-protection violation. If the - * SPTE is MMU-writable (determined later), the fault can be fixed - * by setting the Writable bit, which can be done out of mmu_lock. + * Unconditionally send !PRESENT page faults (except for emulated MMIO) + * through the fast path. There are two scenarios where the fast path + * can resolve the fault. The first is if the fault is spurious, i.e. + * a different vCPU has faulted in the page, which applies to all MMUs. + * The second scenario is if KVM marked the SPTE !PRESENT for access + * tracking (due to lack of EPT A/D bits), in which case KVM can fix + * the fault after logging the access. */ if (!fault->present) - return !kvm_ad_enabled(); + return true; /* - * Note, instruction fetches and writes are mutually exclusive, ignore - * the "exec" flag. + * Skip the fast path if the fault is due to a protection violation and + * the access isn't a write. Write-protection violations can be fixed + * by KVM, e.g. if memory is write-protected for dirty logging, but all + * other protection violations are in the domain of a third party, i.e. + * either the primary MMU or the guest's page tables, and thus are + * extremely unlikely to be resolved by KVM. Note, instruction fetches + * and writes are mutually exclusive, ignore the "exec" flag. */ return fault->write; } @@ -3041,12 +3040,13 @@ fast_pf_fix_direct_spte(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault, /* * Theoretically we could also set dirty bit (and flush TLB) here in * order to eliminate unnecessary PML logging. See comments in - * set_spte. But fast_page_fault is very unlikely to happen with PML - * enabled, so we do not do this. 
This might result in the same GPA - * to be logged in PML buffer again when the write really happens, and - * eventually to be called by mark_page_dirty twice. But it's also no - * harm. This also avoids the TLB flush needed after setting dirty bit - * so non-PML cases won't be impacted. + * set_spte. But a write-protection violation that can be fixed outside + * of mmu_lock is very unlikely to happen with PML enabled, so we don't + * do this. This might result in the same GPA to be logged in the PML + * buffer again when the write really happens, and eventually to be + * sent to mark_page_dirty() twice, but that's a minor performance blip + * and not a function issue. This also avoids the TLB flush needed + * after setting dirty bit so non-PML cases won't be impacted. * * Compare with set_spte where instead shadow_dirty_mask is set. */ From patchwork Sat Apr 23 03:47:51 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Sean Christopherson X-Patchwork-Id: 12824417 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 64614C433F5 for ; Sat, 23 Apr 2022 03:48:58 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S232984AbiDWDvt (ORCPT ); Fri, 22 Apr 2022 23:51:49 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:56832 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S232884AbiDWDvf (ORCPT ); Fri, 22 Apr 2022 23:51:35 -0400 Received: from mail-pl1-x64a.google.com (mail-pl1-x64a.google.com [IPv6:2607:f8b0:4864:20::64a]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 7C7DD1C894F for ; Fri, 22 Apr 2022 20:48:30 -0700 (PDT) Received: by mail-pl1-x64a.google.com with SMTP id bj12-20020a170902850c00b0015adf30aaccso3953835plb.15 for ; Fri, 22 Apr 2022 20:48:30 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20210112; h=reply-to:date:in-reply-to:message-id:mime-version:references :subject:from:to:cc; bh=KV4Fo5a3dO6FFB3wbKkacVHrutnN8NJ0VpRE3b0g+kk=; b=ThnwOxcxDHHiupsaV5coB0vY0mPhYkihbiu/nKoipSLxRlJ9HOECjsKGcDZ/w/dCHV 1NjwYL9dPQSTzddEpXngi6QF6VduMTNkE3mLZRzUme8A8QeKB/cmshdzB3FvKRhhk56q mWGXCGPmcjmxijCDOfuXjJTmCmgQVXs6C4XKVGpISvLBUVxgG71ZEZlSCgFsxG+qgj3m pyNLMVNriQc8aS9M2NMrbgfxsP+oPIfQXkVhQaIplsik43HRLOk2LFwrbo5iCEYL+tYg o9+IQvS8I/u/OwJuQYW32vFl3WT1II00K2rNZ9caop4bO79G8/aOWd3t8/mVt2938tgL T8zg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=x-gm-message-state:reply-to:date:in-reply-to:message-id :mime-version:references:subject:from:to:cc; bh=KV4Fo5a3dO6FFB3wbKkacVHrutnN8NJ0VpRE3b0g+kk=; b=t0ORlcL4/Nk+1+WsQwURq77sEUjdNHgJ87u546YA4b+uqZobhdOlpqHxG0AWDfKDwA K0S/BO7VBC5OMvzBsEhOBpjdHMxVrLMQ0xddEU1+OT3onqL6wBotdSwBOv7wzRRpm4K7 A/2t+urhyINr66PKOsjDmXNGFRq+z2TE85AvETwAxwJayt/9zHIkZZhWzai2eXVj11SM bSM4x8gT8O7GdLQSpcp8iCBRb/FW8UdCG8WJnbacBcDcqKeD4S+ow1kcrbqkwuJjOwfm y+lNhUusuIEGrUrQ31wxl2hR7XggqXF3+7OOhaA8J6TD6j8c+TVRqFnSyT2Q8LwYt3zq tl6Q== X-Gm-Message-State: AOAM531LnmH0W4/j+YLC7HodPDTa2JEiAYk+/a9vaX5bEe0jT8Q0hgTC rjQcDRIR90OnUA06flIs5lwre3Y/Eyo= X-Google-Smtp-Source: ABdhPJyKn4ZJL97YgXahf8sfZNpuwamE7Zun/i+Q2+UR6ydkW1QVf8xSlffiFad4YurJYsK65Kx50GZqWoU= X-Received: from seanjc.c.googlers.com ([fda3:e722:ac3:cc00:7f:e700:c0a8:3e5]) (user=seanjc job=sendgmr) by 2002:a62:a50a:0:b0:506:cef:44f5 with 
SMTP id v10-20020a62a50a000000b005060cef44f5mr8215010pfm.22.1650685710013; Fri, 22 Apr 2022 20:48:30 -0700 (PDT) Reply-To: Sean Christopherson Date: Sat, 23 Apr 2022 03:47:51 +0000 In-Reply-To: <20220423034752.1161007-1-seanjc@google.com> Message-Id: <20220423034752.1161007-12-seanjc@google.com> Mime-Version: 1.0 References: <20220423034752.1161007-1-seanjc@google.com> X-Mailer: git-send-email 2.36.0.rc2.479.g8af0fa9b8e-goog Subject: [PATCH 11/12] DO NOT MERGE: KVM: x86/mmu: Use fast_page_fault() to detect spurious shadow MMU faults From: Sean Christopherson To: Paolo Bonzini Cc: Sean Christopherson , Vitaly Kuznetsov , Wanpeng Li , Jim Mattson , Joerg Roedel , kvm@vger.kernel.org, linux-kernel@vger.kernel.org, Ben Gardon , David Matlack , Venkatesh Srinivas , Chao Peng Precedence: bulk List-ID: X-Mailing-List: kvm@vger.kernel.org Sounds good in theory, but in practice it's really, really rare to detect a spurious fault outside of mmu_lock. The window is teeny tiny, so more likely than not, spurious faults won't be detected until the slow path, not too mention spurious faults on !PRESENT pages are rare in and of themselves. Not-signed-off-by: Sean Christopherson --- arch/x86/kvm/mmu/mmu.c | 28 ++++++++++++++++++---------- arch/x86/kvm/mmu/paging_tmpl.h | 8 ++++++++ 2 files changed, 26 insertions(+), 10 deletions(-) diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c index 7ba88907d032..850d58793307 100644 --- a/arch/x86/kvm/mmu/mmu.c +++ b/arch/x86/kvm/mmu/mmu.c @@ -2994,7 +2994,7 @@ static int handle_abnormal_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fau return RET_PF_CONTINUE; } -static bool page_fault_can_be_fast(struct kvm_page_fault *fault) +static bool page_fault_can_be_fast(struct kvm_page_fault *fault, bool direct_mmu) { /* * Page faults with reserved bits set, i.e. faults on MMIO SPTEs, only @@ -3025,8 +3025,12 @@ static bool page_fault_can_be_fast(struct kvm_page_fault *fault) * either the primary MMU or the guest's page tables, and thus are * extremely unlikely to be resolved by KVM. Note, instruction fetches * and writes are mutually exclusive, ignore the "exec" flag. + * + * KVM doesn't support resolving write-protection violations outside of + * mmu_lock for indirect MMUs as the gfn is not stable for indirect + * shadow pages. See Documentation/virt/kvm/locking.rst for details. */ - return fault->write; + return fault->write && direct_mmu; } /* @@ -3097,7 +3101,8 @@ static u64 *fast_pf_get_last_sptep(struct kvm_vcpu *vcpu, gpa_t gpa, u64 *spte) /* * Returns one of RET_PF_INVALID, RET_PF_FIXED or RET_PF_SPURIOUS. */ -static int fast_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault) +static int fast_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault, + bool direct_mmu) { struct kvm_mmu_page *sp; int ret = RET_PF_INVALID; @@ -3105,7 +3110,7 @@ static int fast_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault) u64 *sptep = NULL; uint retry_count = 0; - if (!page_fault_can_be_fast(fault)) + if (!page_fault_can_be_fast(fault, direct_mmu)) return ret; walk_shadow_page_lockless_begin(vcpu); @@ -3140,6 +3145,14 @@ static int fast_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault) break; } + /* + * KVM doesn't support fixing SPTEs outside of mmu_lock for + * indirect MMUs as the gfn isn't stable for indirect shadow + * pages. See Documentation/virt/kvm/locking.rst for details. 
+ */ + if (!direct_mmu) + break; + new_spte = spte; /* @@ -3185,11 +3198,6 @@ static int fast_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault) !is_access_allowed(fault, new_spte)) break; - /* - * Currently, fast page fault only works for direct mapping - * since the gfn is not stable for indirect shadow page. See - * Documentation/virt/kvm/locking.rst to get more detail. - */ if (fast_pf_fix_direct_spte(vcpu, fault, sptep, spte, new_spte)) { ret = RET_PF_FIXED; break; @@ -4018,7 +4026,7 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault if (page_fault_handle_page_track(vcpu, fault)) return RET_PF_EMULATE; - r = fast_page_fault(vcpu, fault); + r = fast_page_fault(vcpu, fault, true); if (r != RET_PF_INVALID) return r; diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h index db80f7ccaa4e..d33b01a2714e 100644 --- a/arch/x86/kvm/mmu/paging_tmpl.h +++ b/arch/x86/kvm/mmu/paging_tmpl.h @@ -812,6 +812,14 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault return RET_PF_RETRY; } + /* See if the fault has already been resolved by a different vCPU. */ + r = fast_page_fault(vcpu, fault, false); + if (r == RET_PF_SPURIOUS) + return r; + + /* Indirect page faults should never be fixed in the fast path. */ + WARN_ON_ONCE(r != RET_PF_INVALID); + fault->gfn = walker.gfn; fault->slot = kvm_vcpu_gfn_to_memslot(vcpu, fault->gfn); From patchwork Sat Apr 23 03:47:52 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Sean Christopherson X-Patchwork-Id: 12824415 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 36DA4C433F5 for ; Sat, 23 Apr 2022 03:48:50 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S232755AbiDWDvj (ORCPT ); Fri, 22 Apr 2022 23:51:39 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:56878 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S232896AbiDWDvg (ORCPT ); Fri, 22 Apr 2022 23:51:36 -0400 Received: from mail-pf1-x449.google.com (mail-pf1-x449.google.com [IPv6:2607:f8b0:4864:20::449]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 40AB01C9CCF for ; Fri, 22 Apr 2022 20:48:32 -0700 (PDT) Received: by mail-pf1-x449.google.com with SMTP id d6-20020aa78686000000b0050adc2b200cso4681164pfo.21 for ; Fri, 22 Apr 2022 20:48:32 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20210112; h=reply-to:date:in-reply-to:message-id:mime-version:references :subject:from:to:cc; bh=+2AgNH8Qoc+XZ2aqZVkP0WtczvqEDl4P0Mq70NzABEw=; b=kGItDFS0r10ErwaqYlTjCN0WPBwM2VJ3bMsQ/gqn6/xWsMCkcd1qtw87r9LQHFUTQZ 4iSiiuMTS1dFRSRO1CiXuvUyTcW1tePqdKeO+qb0TzC2cgrIeJbnBBZ16ocxeKpjbkZe yH3b0/oz3+9Pruzq0EjTpXpgIPhqltEH5ncCgsom1E5rvwdy0FSM4TQcsk4KsjrMfYqv AXaJRzXdN4rmtxUN4Quq68rxz9IHL7jHstt3uFRUb+ySSfMyjhq/MidNxW4i7z7l7j92 TfkXfMSTB6FBkdBlmUcEjCkY3GZr4Uqm7v+PrASnNwWwlsj4scy4PL0nu/7SLw7VibBV T+8g== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=x-gm-message-state:reply-to:date:in-reply-to:message-id :mime-version:references:subject:from:to:cc; bh=+2AgNH8Qoc+XZ2aqZVkP0WtczvqEDl4P0Mq70NzABEw=; b=y/7AyxgYsv6mu0p+zhoKvr/UNV2m7JQBUtx3h5+6PMlBkKFdKgQEfq3/KuL1vaQPMW 
CnEfTvaU7GUVQnD0tn+xX4pTGOd3ZjJRSOCfZazo/hgZQ86BEKuiwJRHL/kV3CBVrkwq wS7BJeaFRbPs6XAxOi49xnbZzZCwJxZ4/npAMnsKDtVtCEFFXoYtUstSholC+PEOgEGs vSI0GbwnOrTkxNSZ/kiS/wk+nsaj/kiLpkuDT/5ttOXqvhjfrGsIIH98Hooa+62eC0R7 8s9TQTh2rM3FVKgXZc8XGnFUtPeWRAr6fgV9/fSM12MCecErvcA0fFPK3qVolBodx4D+ N9Qw== X-Gm-Message-State: AOAM531kn1o1AhmgUUVqylsdCgZUxAUkFB1rexyJzF/FUOB2QcuCRobn 9nhiByF6eIebA67Ac2g+o0aJuqcv+lA= X-Google-Smtp-Source: ABdhPJxOEEyTteX2y8CRdvIlhV9uOqkCk9U5/ipTsEkE9jxr9rlMcFGOy5YSFLFADMhE9OS8atlo920yY0Y= X-Received: from seanjc.c.googlers.com ([fda3:e722:ac3:cc00:7f:e700:c0a8:3e5]) (user=seanjc job=sendgmr) by 2002:a05:6a00:114e:b0:4c8:55f7:faad with SMTP id b14-20020a056a00114e00b004c855f7faadmr8312356pfm.86.1650685711773; Fri, 22 Apr 2022 20:48:31 -0700 (PDT) Reply-To: Sean Christopherson Date: Sat, 23 Apr 2022 03:47:52 +0000 In-Reply-To: <20220423034752.1161007-1-seanjc@google.com> Message-Id: <20220423034752.1161007-13-seanjc@google.com> Mime-Version: 1.0 References: <20220423034752.1161007-1-seanjc@google.com> X-Mailer: git-send-email 2.36.0.rc2.479.g8af0fa9b8e-goog Subject: [PATCH 12/12] DO NOT MERGE: KVM: selftests: Attempt to detect lost dirty bits From: Sean Christopherson To: Paolo Bonzini Cc: Sean Christopherson , Vitaly Kuznetsov , Wanpeng Li , Jim Mattson , Joerg Roedel , kvm@vger.kernel.org, linux-kernel@vger.kernel.org, Ben Gardon , David Matlack , Venkatesh Srinivas , Chao Peng Precedence: bulk List-ID: X-Mailing-List: kvm@vger.kernel.org A failed attempt to detect improper dropping of Writable and/or Dirty bits. Doesn't work because the primary MMU write-protects its PTEs when file writeback occurs, i.e. KVM's dirty bits are meaningless as far as file-backed guest memory is concnered. Not-signed-off-by: Sean Christopherson --- tools/testing/selftests/kvm/.gitignore | 1 + tools/testing/selftests/kvm/Makefile | 4 + .../selftests/kvm/volatile_spte_test.c | 208 ++++++++++++++++++ 3 files changed, 213 insertions(+) create mode 100644 tools/testing/selftests/kvm/volatile_spte_test.c diff --git a/tools/testing/selftests/kvm/.gitignore b/tools/testing/selftests/kvm/.gitignore index 56140068b763..3307444d9fda 100644 --- a/tools/testing/selftests/kvm/.gitignore +++ b/tools/testing/selftests/kvm/.gitignore @@ -70,3 +70,4 @@ /steal_time /kvm_binary_stats_test /system_counter_offset_test +/volatile_spte_test diff --git a/tools/testing/selftests/kvm/Makefile b/tools/testing/selftests/kvm/Makefile index af582d168621..bc0907de6638 100644 --- a/tools/testing/selftests/kvm/Makefile +++ b/tools/testing/selftests/kvm/Makefile @@ -103,6 +103,7 @@ TEST_GEN_PROGS_x86_64 += set_memory_region_test TEST_GEN_PROGS_x86_64 += steal_time TEST_GEN_PROGS_x86_64 += kvm_binary_stats_test TEST_GEN_PROGS_x86_64 += system_counter_offset_test +TEST_GEN_PROGS_x86_64 += volatile_spte_test TEST_GEN_PROGS_aarch64 += aarch64/arch_timer TEST_GEN_PROGS_aarch64 += aarch64/debug-exceptions @@ -122,6 +123,7 @@ TEST_GEN_PROGS_aarch64 += rseq_test TEST_GEN_PROGS_aarch64 += set_memory_region_test TEST_GEN_PROGS_aarch64 += steal_time TEST_GEN_PROGS_aarch64 += kvm_binary_stats_test +TEST_GEN_PROGS_aarch64 += volatile_spte_test TEST_GEN_PROGS_s390x = s390x/memop TEST_GEN_PROGS_s390x += s390x/resets @@ -134,6 +136,7 @@ TEST_GEN_PROGS_s390x += kvm_page_table_test TEST_GEN_PROGS_s390x += rseq_test TEST_GEN_PROGS_s390x += set_memory_region_test TEST_GEN_PROGS_s390x += kvm_binary_stats_test +TEST_GEN_PROGS_s390x += volatile_spte_test TEST_GEN_PROGS_riscv += demand_paging_test TEST_GEN_PROGS_riscv += dirty_log_test @@ -141,6 +144,7 @@ 
TEST_GEN_PROGS_riscv += kvm_create_max_vcpus TEST_GEN_PROGS_riscv += kvm_page_table_test TEST_GEN_PROGS_riscv += set_memory_region_test TEST_GEN_PROGS_riscv += kvm_binary_stats_test +TEST_GEN_PROGS_riscv += volatile_spte_test TEST_GEN_PROGS += $(TEST_GEN_PROGS_$(UNAME_M)) LIBKVM += $(LIBKVM_$(UNAME_M)) diff --git a/tools/testing/selftests/kvm/volatile_spte_test.c b/tools/testing/selftests/kvm/volatile_spte_test.c new file mode 100644 index 000000000000..a4277216eb3d --- /dev/null +++ b/tools/testing/selftests/kvm/volatile_spte_test.c @@ -0,0 +1,208 @@ +// SPDX-License-Identifier: GPL-2.0-only +#define _GNU_SOURCE /* for program_invocation_short_name */ +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include "kvm_util.h" +#include "processor.h" +#include "test_util.h" + +#define VCPU_ID 0 + +#define PAGE_SIZE 4096 + +#define NR_ITERATIONS 1000 + +#define MEM_FILE_NAME "volatile_spte_test_mem" +#define MEM_FILE_MEMSLOT 1 +#define MEM_FILE_DATA_PATTERN 0xa5a5a5a5a5a5a5a5ul + +static const uint64_t gpa = (4ull * (1 << 30)); + +static uint64_t *hva; + +static pthread_t mprotect_thread; +static atomic_t rendezvous; +static bool done; + +static void guest_code(void) +{ + uint64_t *gva = (uint64_t *)gpa; + + while (!READ_ONCE(done)) { + WRITE_ONCE(*gva, 0); + GUEST_SYNC(0); + + WRITE_ONCE(*gva, MEM_FILE_DATA_PATTERN); + GUEST_SYNC(1); + } +} + +static void *mprotect_worker(void *ign) +{ + int i, r; + + i = 0; + while (!READ_ONCE(done)) { + for ( ; atomic_read(&rendezvous) != 1; i++) + cpu_relax(); + + usleep((i % 10) + 1); + + r = mprotect(hva, PAGE_SIZE, PROT_NONE); + TEST_ASSERT(!r, "Failed to mprotect file (hva = %lx), errno = %d (%s)", + (unsigned long)hva, errno, strerror(errno)); + + atomic_inc(&rendezvous); + } + return NULL; +} + +int main(int argc, char *argv[]) +{ + uint64_t bitmap = -1ull, val; + int i, r, fd, nr_writes; + struct kvm_regs regs; + struct ucall ucall; + struct kvm_vm *vm; + + vm = vm_create_default(VCPU_ID, 0, guest_code); + vcpu_regs_get(vm, VCPU_ID, ®s); + ucall_init(vm, NULL); + + pthread_create(&mprotect_thread, NULL, mprotect_worker, 0); + + fd = open(MEM_FILE_NAME, O_RDWR | O_CREAT, 0644); + TEST_ASSERT(fd >= 0, "Failed to open '%s', errno = %d (%s)", + MEM_FILE_NAME, errno, strerror(errno)); + + r = ftruncate(fd, PAGE_SIZE); + TEST_ASSERT(fd >= 0, "Failed to ftruncate '%s', errno = %d (%s)", + MEM_FILE_NAME, errno, strerror(errno)); + + hva = mmap(NULL, PAGE_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0); + TEST_ASSERT(hva != MAP_FAILED, "Failed to map file, errno = %d (%s)", + errno, strerror(errno)); + + vm_set_user_memory_region(vm, MEM_FILE_MEMSLOT, KVM_MEM_LOG_DIRTY_PAGES, + gpa, PAGE_SIZE, hva); + virt_pg_map(vm, gpa, gpa); + + for (i = 0, nr_writes = 0; i < NR_ITERATIONS; i++) { + fdatasync(fd); + + vcpu_run(vm, VCPU_ID); + ASSERT_EQ(*hva, 0); + ASSERT_EQ(get_ucall(vm, VCPU_ID, &ucall), UCALL_SYNC); + ASSERT_EQ(ucall.args[1], 0); + + /* + * The origin hope/intent was to detect dropped Dirty bits by + * checking for missed file writeback. Sadly, the kernel is + * too smart and write-protects the primary MMU's PTEs, which + * zaps KVM's SPTEs and ultimately causes the folio/page to get + * marked marked dirty by the primary MMU when KVM re-faults on + * the page. + * + * Triggering swap _might_ be a way to detect failure, as swap + * is treated differently than "normal" files. 
+ * + * RIP: 0010:kvm_unmap_gfn_range+0xf1/0x100 [kvm] + * Call Trace: + * + * kvm_mmu_notifier_invalidate_range_start+0x11c/0x2c0 [kvm] + * __mmu_notifier_invalidate_range_start+0x7e/0x190 + * page_mkclean_one+0x226/0x250 + * rmap_walk_file+0x213/0x430 + * folio_mkclean+0x95/0xb0 + * folio_clear_dirty_for_io+0x5d/0x1c0 + * mpage_submit_page+0x1f/0x70 + * mpage_process_page_bufs+0xf8/0x110 + * mpage_prepare_extent_to_map+0x1e3/0x420 + * ext4_writepages+0x277/0xca0 + * do_writepages+0xd1/0x190 + * filemap_fdatawrite_wbc+0x62/0x90 + * file_write_and_wait_range+0xa3/0xe0 + * ext4_sync_file+0xdb/0x340 + * do_fsync+0x38/0x70 + * __x64_sys_fdatasync+0x13/0x20 + * do_syscall_64+0x31/0x50 + * entry_SYSCALL_64_after_hwframe+0x44/0xae + * + * + * RIP: 0010:__folio_mark_dirty+0x266/0x310 + * Call Trace: + * + * mark_buffer_dirty+0xe7/0x140 + * __block_commit_write.isra.0+0x59/0xc0 + * block_page_mkwrite+0x15a/0x170 + * ext4_page_mkwrite+0x485/0x620 + * do_page_mkwrite+0x54/0x150 + * __handle_mm_fault+0xe2a/0x1600 + * handle_mm_fault+0xbd/0x280 + * do_user_addr_fault+0x192/0x600 + * exc_page_fault+0x6c/0x140 + * asm_exc_page_fault+0x1e/0x30 + * + */ + /* fdatasync(fd); */ + + /* + * Clear the dirty log to coerce KVM into write-protecting the + * SPTE (or into clearing dirty bits when using PML). + */ + kvm_vm_clear_dirty_log(vm, MEM_FILE_MEMSLOT, &bitmap, 0, 1); + + atomic_inc(&rendezvous); + + usleep(i % 10); + + r = _vcpu_run(vm, VCPU_ID); + + while (atomic_read(&rendezvous) != 2) + cpu_relax(); + + atomic_set(&rendezvous, 0); + + fdatasync(fd); + mprotect(hva, PAGE_SIZE, PROT_READ | PROT_WRITE); + + val = READ_ONCE(*hva); + if (r) { + TEST_ASSERT(!val, "Memory should be zero, write faulted\n"); + vcpu_regs_set(vm, VCPU_ID, ®s); + continue; + } + nr_writes++; + TEST_ASSERT(val == MEM_FILE_DATA_PATTERN, + "Memory doesn't match data pattern, want 0x%lx, got 0x%lx", + MEM_FILE_DATA_PATTERN, val); + ASSERT_EQ(get_ucall(vm, VCPU_ID, &ucall), UCALL_SYNC); + ASSERT_EQ(ucall.args[1], 1); + } + + printf("%d of %d iterations wrote memory\n", nr_writes, NR_ITERATIONS); + + atomic_inc(&rendezvous); + WRITE_ONCE(done, true); + + pthread_join(mprotect_thread, NULL); + + kvm_vm_free(vm); + + return 0; +} +
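
For anyone who wants to experiment with the (not-for-merge) selftest, it should build along with the rest of the KVM selftests given the Makefile hunks above, e.g. roughly "make -C tools/testing/selftests/kvm" from a kernel tree with headers installed, and then run as ./volatile_spte_test on an x86-64 host with /dev/kvm available; the exact invocation may vary by tree and architecture.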