From patchwork Fri Sep 6 20:45:14 2024
X-Patchwork-Submitter: Vipin Sharma
X-Patchwork-Id: 13794733
Date: Fri, 6 Sep 2024 13:45:14 -0700
Message-ID: <20240906204515.3276696-2-vipinsh@google.com>
In-Reply-To: <20240906204515.3276696-1-vipinsh@google.com>
References: <20240906204515.3276696-1-vipinsh@google.com>
Subject: [PATCH v3 1/2] KVM: x86/mmu: Track TDP MMU NX huge pages separately
From: Vipin Sharma
To: seanjc@google.com, pbonzini@redhat.com, dmatlack@google.com
Cc: kvm@vger.kernel.org, linux-kernel@vger.kernel.org, Vipin Sharma

Create a separate list for storing TDP MMU NX huge pages and a counter
for it. Use this list in the NX huge page recovery worker along with the
existing NX huge pages list. Use the old NX huge pages list for storing
only non-TDP MMU pages and provide a separate counter for it.

The separate list will allow optimizing TDP MMU NX huge page recovery in
future patches by using the MMU read lock.

Suggested-by: Sean Christopherson
Suggested-by: David Matlack
Signed-off-by: Vipin Sharma
---
 arch/x86/include/asm/kvm_host.h | 13 ++++++++++-
 arch/x86/kvm/mmu/mmu.c          | 39 +++++++++++++++++++++++----------
 arch/x86/kvm/mmu/mmu_internal.h |  8 +++++--
 arch/x86/kvm/mmu/tdp_mmu.c      | 19 ++++++++++++----
 arch/x86/kvm/mmu/tdp_mmu.h      |  1 +
 5 files changed, 61 insertions(+), 19 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 950a03e0181e..0f21f9a69285 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1318,8 +1318,12 @@ struct kvm_arch {
         * guarantee an NX huge page will be created in its stead, e.g. if the
         * guest attempts to execute from the region then KVM obviously can't
         * create an NX huge page (without hanging the guest).
+        *
+        * This list only contains shadow and legacy MMU pages. TDP MMU pages
+        * are stored separately in tdp_mmu_possible_nx_huge_pages.
         */
        struct list_head possible_nx_huge_pages;
+       u64 nr_possible_nx_huge_pages;
 #ifdef CONFIG_KVM_EXTERNAL_WRITE_TRACKING
        struct kvm_page_track_notifier_head track_notifier_head;
 #endif
@@ -1474,7 +1478,7 @@ struct kvm_arch {
         * is held in read mode:
         *  - tdp_mmu_roots (above)
         *  - the link field of kvm_mmu_page structs used by the TDP MMU
-        *  - possible_nx_huge_pages;
+        *  - tdp_mmu_possible_nx_huge_pages
         *  - the possible_nx_huge_page_link field of kvm_mmu_page structs used
         *    by the TDP MMU
         * Because the lock is only taken within the MMU lock, strictly
@@ -1483,6 +1487,13 @@
         * the code to do so.
         */
        spinlock_t tdp_mmu_pages_lock;
+
+       /*
+        * Similar to the possible_nx_huge_pages list, but this one stores
+        * only TDP MMU pages.
+        */
+       struct list_head tdp_mmu_possible_nx_huge_pages;
+       u64 tdp_mmu_nr_possible_nx_huge_pages;
 #endif /* CONFIG_X86_64 */
 
        /*
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 901be9e420a4..455caaaa04f5 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -857,7 +857,8 @@ static void account_shadowed(struct kvm *kvm, struct kvm_mmu_page *sp)
                kvm_flush_remote_tlbs_gfn(kvm, gfn, PG_LEVEL_4K);
 }
 
-void track_possible_nx_huge_page(struct kvm *kvm, struct kvm_mmu_page *sp)
+void track_possible_nx_huge_page(struct kvm *kvm, struct kvm_mmu_page *sp,
+                                struct list_head *pages, u64 *nr_pages)
 {
        /*
         * If it's possible to replace the shadow page with an NX huge page,
@@ -870,9 +871,9 @@ void track_possible_nx_huge_page(struct kvm *kvm, struct kvm_mmu_page *sp)
        if (!list_empty(&sp->possible_nx_huge_page_link))
                return;
 
+       ++(*nr_pages);
        ++kvm->stat.nx_lpage_splits;
-       list_add_tail(&sp->possible_nx_huge_page_link,
-                     &kvm->arch.possible_nx_huge_pages);
+       list_add_tail(&sp->possible_nx_huge_page_link, pages);
 }
 
 static void account_nx_huge_page(struct kvm *kvm, struct kvm_mmu_page *sp,
@@ -881,7 +882,10 @@ static void account_nx_huge_page(struct kvm *kvm, struct kvm_mmu_page *sp,
        sp->nx_huge_page_disallowed = true;
 
        if (nx_huge_page_possible)
-               track_possible_nx_huge_page(kvm, sp);
+               track_possible_nx_huge_page(kvm,
+                                           sp,
+                                           &kvm->arch.possible_nx_huge_pages,
+                                           &kvm->arch.nr_possible_nx_huge_pages);
 }
 
 static void unaccount_shadowed(struct kvm *kvm, struct kvm_mmu_page *sp)
@@ -900,11 +904,13 @@ static void unaccount_shadowed(struct kvm *kvm, struct kvm_mmu_page *sp)
        kvm_mmu_gfn_allow_lpage(slot, gfn);
 }
 
-void untrack_possible_nx_huge_page(struct kvm *kvm, struct kvm_mmu_page *sp)
+void untrack_possible_nx_huge_page(struct kvm *kvm, struct kvm_mmu_page *sp,
+                                  u64 *nr_pages)
 {
        if (list_empty(&sp->possible_nx_huge_page_link))
                return;
 
+       --(*nr_pages);
        --kvm->stat.nx_lpage_splits;
        list_del_init(&sp->possible_nx_huge_page_link);
 }
@@ -913,7 +919,7 @@ static void unaccount_nx_huge_page(struct kvm *kvm, struct kvm_mmu_page *sp)
 {
        sp->nx_huge_page_disallowed = false;
 
-       untrack_possible_nx_huge_page(kvm, sp);
+       untrack_possible_nx_huge_page(kvm, sp, &kvm->arch.nr_possible_nx_huge_pages);
 }
 
 static struct kvm_memory_slot *gfn_to_memslot_dirty_bitmap(struct kvm_vcpu *vcpu,
@@ -7311,9 +7317,9 @@ static int set_nx_huge_pages_recovery_param(const char *val, const struct kernel
        return err;
 }
 
-static void kvm_recover_nx_huge_pages(struct kvm *kvm)
+void kvm_recover_nx_huge_pages(struct kvm *kvm, struct list_head *pages,
+                              unsigned long nr_pages)
 {
-       unsigned long nx_lpage_splits = kvm->stat.nx_lpage_splits;
        struct kvm_memory_slot *slot;
        int rcu_idx;
        struct kvm_mmu_page *sp;
@@ -7333,9 +7339,9 @@ static void kvm_recover_nx_huge_pages(struct kvm *kvm)
        rcu_read_lock();
 
        ratio = READ_ONCE(nx_huge_pages_recovery_ratio);
-       to_zap = ratio ? DIV_ROUND_UP(nx_lpage_splits, ratio) : 0;
+       to_zap = ratio ? DIV_ROUND_UP(nr_pages, ratio) : 0;
        for ( ; to_zap; --to_zap) {
-               if (list_empty(&kvm->arch.possible_nx_huge_pages))
+               if (list_empty(pages))
                        break;
 
                /*
@@ -7345,7 +7351,7 @@ static void kvm_recover_nx_huge_pages(struct kvm *kvm)
                 * the total number of shadow pages. And because the TDP MMU
                 * doesn't use active_mmu_pages.
                 */
-               sp = list_first_entry(&kvm->arch.possible_nx_huge_pages,
+               sp = list_first_entry(pages,
                                      struct kvm_mmu_page,
                                      possible_nx_huge_page_link);
                WARN_ON_ONCE(!sp->nx_huge_page_disallowed);
@@ -7417,6 +7423,12 @@ static long get_nx_huge_page_recovery_timeout(u64 start_time)
                       : MAX_SCHEDULE_TIMEOUT;
 }
 
+static void kvm_mmu_recover_nx_huge_pages(struct kvm *kvm)
+{
+       kvm_recover_nx_huge_pages(kvm, &kvm->arch.possible_nx_huge_pages,
+                                 kvm->arch.nr_possible_nx_huge_pages);
+}
+
 static int kvm_nx_huge_page_recovery_worker(struct kvm *kvm, uintptr_t data)
 {
        u64 start_time;
@@ -7438,7 +7450,10 @@ static int kvm_nx_huge_page_recovery_worker(struct kvm *kvm, uintptr_t data)
                if (kthread_should_stop())
                        return 0;
 
-               kvm_recover_nx_huge_pages(kvm);
+               kvm_mmu_recover_nx_huge_pages(kvm);
+               if (tdp_mmu_enabled)
+                       kvm_tdp_mmu_recover_nx_huge_pages(kvm);
+
        }
 }
 
diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
index 1721d97743e9..2d2e1231996a 100644
--- a/arch/x86/kvm/mmu/mmu_internal.h
+++ b/arch/x86/kvm/mmu/mmu_internal.h
@@ -351,7 +351,11 @@ void disallowed_hugepage_adjust(struct kvm_page_fault *fault, u64 spte, int cur_
 
 void *mmu_memory_cache_alloc(struct kvm_mmu_memory_cache *mc);
 
-void track_possible_nx_huge_page(struct kvm *kvm, struct kvm_mmu_page *sp);
-void untrack_possible_nx_huge_page(struct kvm *kvm, struct kvm_mmu_page *sp);
+void track_possible_nx_huge_page(struct kvm *kvm, struct kvm_mmu_page *sp,
+                                struct list_head *pages, u64 *nr_pages);
+void untrack_possible_nx_huge_page(struct kvm *kvm, struct kvm_mmu_page *sp,
+                                  u64 *nr_pages);
+void kvm_recover_nx_huge_pages(struct kvm *kvm, struct list_head *pages,
+                              unsigned long nr_pages);
 
 #endif /* __KVM_X86_MMU_INTERNAL_H */
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index c7dc49ee7388..9a6c26d20210 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -15,6 +15,7 @@ void kvm_mmu_init_tdp_mmu(struct kvm *kvm)
 {
        INIT_LIST_HEAD(&kvm->arch.tdp_mmu_roots);
+       INIT_LIST_HEAD(&kvm->arch.tdp_mmu_possible_nx_huge_pages);
        spin_lock_init(&kvm->arch.tdp_mmu_pages_lock);
 }
 
@@ -73,6 +74,13 @@ static void tdp_mmu_free_sp_rcu_callback(struct rcu_head *head)
        tdp_mmu_free_sp(sp);
 }
 
+void kvm_tdp_mmu_recover_nx_huge_pages(struct kvm *kvm)
+{
+       kvm_recover_nx_huge_pages(kvm,
+                                 &kvm->arch.tdp_mmu_possible_nx_huge_pages,
+                                 kvm->arch.tdp_mmu_nr_possible_nx_huge_pages);
+}
+
 void kvm_tdp_mmu_put_root(struct kvm *kvm, struct kvm_mmu_page *root)
 {
        if (!refcount_dec_and_test(&root->tdp_mmu_root_count))
@@ -318,7 +326,7 @@ static void tdp_mmu_unlink_sp(struct kvm *kvm, struct kvm_mmu_page *sp)
        spin_lock(&kvm->arch.tdp_mmu_pages_lock);
        sp->nx_huge_page_disallowed = false;
-       untrack_possible_nx_huge_page(kvm, sp);
+       untrack_possible_nx_huge_page(kvm, sp, &kvm->arch.tdp_mmu_nr_possible_nx_huge_pages);
        spin_unlock(&kvm->arch.tdp_mmu_pages_lock);
 }
 
@@ -1162,10 +1170,13 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
                }
 
                if (fault->huge_page_disallowed &&
-                   fault->req_level >= iter.level) {
+                   fault->req_level >= iter.level &&
+                   sp->nx_huge_page_disallowed) {
                        spin_lock(&kvm->arch.tdp_mmu_pages_lock);
-                       if (sp->nx_huge_page_disallowed)
-                               track_possible_nx_huge_page(kvm, sp);
+                       track_possible_nx_huge_page(kvm,
+                                                   sp,
+                                                   &kvm->arch.tdp_mmu_possible_nx_huge_pages,
+                                                   &kvm->arch.tdp_mmu_nr_possible_nx_huge_pages);
                        spin_unlock(&kvm->arch.tdp_mmu_pages_lock);
                }
        }
diff --git a/arch/x86/kvm/mmu/tdp_mmu.h b/arch/x86/kvm/mmu/tdp_mmu.h
index 1b74e058a81c..510baf3eb3f1 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.h
+++ b/arch/x86/kvm/mmu/tdp_mmu.h
@@ -66,6 +66,7 @@ int kvm_tdp_mmu_get_walk(struct kvm_vcpu *vcpu, u64 addr, u64 *sptes,
                         int *root_level);
 u64 *kvm_tdp_mmu_fast_pf_get_last_sptep(struct kvm_vcpu *vcpu, gfn_t gfn,
                                        u64 *spte);
+void kvm_tdp_mmu_recover_nx_huge_pages(struct kvm *kvm);
 
 #ifdef CONFIG_X86_64
 static inline bool is_tdp_mmu_page(struct kvm_mmu_page *sp) { return sp->tdp_mmu_page; }

From patchwork Fri Sep 6 20:45:15 2024
X-Patchwork-Submitter: Vipin Sharma
X-Patchwork-Id: 13794734
Date: Fri, 6 Sep 2024 13:45:15 -0700
Message-ID: <20240906204515.3276696-3-vipinsh@google.com>
In-Reply-To: <20240906204515.3276696-1-vipinsh@google.com>
References: <20240906204515.3276696-1-vipinsh@google.com>
Subject: [PATCH v3 2/2] KVM: x86/mmu: Recover TDP MMU NX huge pages using MMU read lock
From: Vipin Sharma
To: seanjc@google.com, pbonzini@redhat.com, dmatlack@google.com
Cc: kvm@vger.kernel.org, linux-kernel@vger.kernel.org, Vipin Sharma

Use the MMU read lock to recover TDP MMU NX huge pages. Iterate over the
huge pages list under tdp_mmu_pages_lock protection and unaccount the
page before dropping the lock.

Convert kvm_tdp_mmu_zap_sp() into kvm_tdp_mmu_zap_possible_nx_huge_page()
as there are no other users of it. Skip the zap if any of the following
conditions is true:

- It is a root page.
- The parent is pointing to:
  - A different page table.
  - A huge page.
  - Not present.

Warn if zapping the SPTE fails while the current SPTE is still pointing
to the same page table; this should never happen.

There is always a race between dirty logging, vCPU faults, and NX huge
page recovery for backing a gfn by an NX huge page or an executable
small page. Unaccounting sooner during the list traversal increases the
window of that race. Functionally, it is okay, because accounting
doesn't protect against the iTLB multi-hit bug; it is there purely to
prevent KVM from bouncing a gfn between two page sizes. The only
downside is that a vCPU will end up doing more work in tearing down all
the child SPTEs. This should be a very rare race.

Zapping under the MMU read lock unblocks vCPUs which are waiting for the
MMU read lock. This optimization is done to solve a guest jitter issue
on a Windows VM which was observing an increase in network latency. The
test workload sets up two Windows VMs and uses the latte.exe[1] binary
to run a network latency benchmark. Running NX huge page recovery under
the MMU write lock was causing latency to increase by up to 30 ms
because vCPUs were waiting for the MMU lock. Running the benchmark with
MMU read lock NX huge page recovery removed the jitter completely, and
the MMU lock wait time of vCPUs was also reduced.
Command used for testing:
Server: latte.exe -udp -a 192.168.100.1:9000 -i 10000000
Client: latte.exe -c -udp -a 192.168.100.1:9000 -i 10000000 -hist -hl 1000 -hc 30

Output from the latency tool on the client:

Before
------
Protocol         UDP
SendMethod       Blocking
ReceiveMethod    Blocking
SO_SNDBUF        Default
SO_RCVBUF        Default
MsgSize(byte)    4
Iterations       10000000
Latency(usec)    69.98
CPU(%)           2.8
CtxSwitch/sec    32783    (2.29/iteration)
SysCall/sec      99948    (6.99/iteration)
Interrupt/sec    55164    (3.86/iteration)

Interval(usec)   Frequency
0                9999967
1000             14
2000             0
3000             5
4000             1
5000             0
6000             0
7000             0
8000             0
9000             0
10000            0
11000            0
12000            2
13000            2
14000            4
15000            2
16000            2
17000            0
18000            1

After
-----
Protocol         UDP
SendMethod       Blocking
ReceiveMethod    Blocking
SO_SNDBUF        Default
SO_RCVBUF        Default
MsgSize(byte)    4
Iterations       10000000
Latency(usec)    67.66
CPU(%)           1.6
CtxSwitch/sec    32869    (2.22/iteration)
SysCall/sec      69366    (4.69/iteration)
Interrupt/sec    50693    (3.43/iteration)

Interval(usec)   Frequency
0                9999972
1000             27
2000             1

[1] https://github.com/microsoft/latte

Suggested-by: Sean Christopherson
Signed-off-by: Vipin Sharma
---
 arch/x86/kvm/mmu/mmu.c          | 85 ++++++++++++++++++++++-----------
 arch/x86/kvm/mmu/mmu_internal.h |  4 +-
 arch/x86/kvm/mmu/tdp_mmu.c      | 56 ++++++++++++++++++----
 arch/x86/kvm/mmu/tdp_mmu.h      |  5 +-
 4 files changed, 110 insertions(+), 40 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 455caaaa04f5..fc597f66aa11 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -7317,8 +7317,8 @@ static int set_nx_huge_pages_recovery_param(const char *val, const struct kernel
        return err;
 }
 
-void kvm_recover_nx_huge_pages(struct kvm *kvm, struct list_head *pages,
-                              unsigned long nr_pages)
+void kvm_recover_nx_huge_pages(struct kvm *kvm, bool shared,
+                              struct list_head *pages, unsigned long nr_pages)
 {
        struct kvm_memory_slot *slot;
        int rcu_idx;
@@ -7329,7 +7329,10 @@ void kvm_recover_nx_huge_pages(struct kvm *kvm, struct list_head *pages,
        ulong to_zap;
 
        rcu_idx = srcu_read_lock(&kvm->srcu);
-       write_lock(&kvm->mmu_lock);
+       if (shared)
+               read_lock(&kvm->mmu_lock);
+       else
+               write_lock(&kvm->mmu_lock);
 
        /*
         * Zapping TDP MMU shadow pages, including the remote TLB flush, must
@@ -7341,8 +7344,13 @@ void kvm_recover_nx_huge_pages(struct kvm *kvm, struct list_head *pages,
        ratio = READ_ONCE(nx_huge_pages_recovery_ratio);
        to_zap = ratio ? DIV_ROUND_UP(nr_pages, ratio) : 0;
        for ( ; to_zap; --to_zap) {
-               if (list_empty(pages))
+               if (tdp_mmu_enabled)
+                       kvm_tdp_mmu_pages_lock(kvm);
+               if (list_empty(pages)) {
+                       if (tdp_mmu_enabled)
+                               kvm_tdp_mmu_pages_unlock(kvm);
                        break;
+               }
 
                /*
                 * We use a separate list instead of just using active_mmu_pages
@@ -7358,24 +7366,41 @@ void kvm_recover_nx_huge_pages(struct kvm *kvm, struct list_head *pages,
                WARN_ON_ONCE(!sp->role.direct);
 
                /*
-                * Unaccount and do not attempt to recover any NX Huge Pages
-                * that are being dirty tracked, as they would just be faulted
-                * back in as 4KiB pages. The NX Huge Pages in this slot will be
-                * recovered, along with all the other huge pages in the slot,
-                * when dirty logging is disabled.
+                * Unaccount the shadow page before zapping its SPTE so as to
+                * avoid bouncing tdp_mmu_pages_lock more than is necessary.
+                * Clearing nx_huge_page_disallowed before zapping is safe, as
+                * the flag doesn't protect against iTLB multi-hit, it's there
+                * purely to prevent bouncing the gfn between an NX huge page
+                * and an X small page. A vCPU could get stuck tearing down
+                * the shadow page, e.g. if it happens to fault on the region
+                * before the SPTE is zapped and replaces the shadow page with
+                * an NX huge page and get stuck tearing down the child SPTEs,
+                * but that is a rare race, i.e. shouldn't impact performance.
+                */
+               unaccount_nx_huge_page(kvm, sp);
+               if (tdp_mmu_enabled)
+                       kvm_tdp_mmu_pages_unlock(kvm);
+
+               /*
+                * Do not attempt to recover any NX Huge Pages that are being
+                * dirty tracked, as they would just be faulted back in as 4KiB
+                * pages. The NX Huge Pages in this slot will be recovered,
+                * along with all the other huge pages in the slot, when dirty
+                * logging is disabled.
                 *
                 * Since gfn_to_memslot() is relatively expensive, it helps to
                 * skip it if it the test cannot possibly return true.  On the
                 * other hand, if any memslot has logging enabled, chances are
-                * good that all of them do, in which case unaccount_nx_huge_page()
-                * is much cheaper than zapping the page.
+                * good that all of them do, in which case
+                * unaccount_nx_huge_page() is much cheaper than zapping the
+                * page.
                 *
-                * If a memslot update is in progress, reading an incorrect value
-                * of kvm->nr_memslots_dirty_logging is not a problem: if it is
-                * becoming zero, gfn_to_memslot() will be done unnecessarily; if
-                * it is becoming nonzero, the page will be zapped unnecessarily.
-                * Either way, this only affects efficiency in racy situations,
-                * and not correctness.
+                * If a memslot update is in progress, reading an incorrect
+                * value of kvm->nr_memslots_dirty_logging is not a problem: if
+                * it is becoming zero, gfn_to_memslot() will be done
+                * unnecessarily; if it is becoming nonzero, the page will be
+                * zapped unnecessarily.  Either way, this only affects
+                * efficiency in racy situations, and not correctness.
                 */
                slot = NULL;
                if (atomic_read(&kvm->nr_memslots_dirty_logging)) {
@@ -7385,20 +7410,21 @@ void kvm_recover_nx_huge_pages(struct kvm *kvm, struct list_head *pages,
                        slot = __gfn_to_memslot(slots, sp->gfn);
                        WARN_ON_ONCE(!slot);
                }
-
-               if (slot && kvm_slot_dirty_track_enabled(slot))
-                       unaccount_nx_huge_page(kvm, sp);
-               else if (is_tdp_mmu_page(sp))
-                       flush |= kvm_tdp_mmu_zap_sp(kvm, sp);
-               else
-                       kvm_mmu_prepare_zap_page(kvm, sp, &invalid_list);
+               if (!slot || !kvm_slot_dirty_track_enabled(slot)) {
+                       if (shared)
+                               flush |= kvm_tdp_mmu_zap_possible_nx_huge_page(kvm, sp);
+                       else
+                               kvm_mmu_prepare_zap_page(kvm, sp, &invalid_list);
+               }
                WARN_ON_ONCE(sp->nx_huge_page_disallowed);
 
                if (need_resched() || rwlock_needbreak(&kvm->mmu_lock)) {
                        kvm_mmu_remote_flush_or_zap(kvm, &invalid_list, flush);
                        rcu_read_unlock();
-
-                       cond_resched_rwlock_write(&kvm->mmu_lock);
+                       if (shared)
+                               cond_resched_rwlock_read(&kvm->mmu_lock);
+                       else
+                               cond_resched_rwlock_write(&kvm->mmu_lock);
 
                        flush = false;
                        rcu_read_lock();
@@ -7408,7 +7434,10 @@ void kvm_recover_nx_huge_pages(struct kvm *kvm, struct list_head *pages,
 
        rcu_read_unlock();
 
-       write_unlock(&kvm->mmu_lock);
+       if (shared)
+               read_unlock(&kvm->mmu_lock);
+       else
+               write_unlock(&kvm->mmu_lock);
        srcu_read_unlock(&kvm->srcu, rcu_idx);
 }
 
@@ -7425,7 +7454,7 @@ static long get_nx_huge_page_recovery_timeout(u64 start_time)
 
 static void kvm_mmu_recover_nx_huge_pages(struct kvm *kvm)
 {
-       kvm_recover_nx_huge_pages(kvm, &kvm->arch.possible_nx_huge_pages,
+       kvm_recover_nx_huge_pages(kvm, false, &kvm->arch.possible_nx_huge_pages,
                                  kvm->arch.nr_possible_nx_huge_pages);
 }
 
diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
index 2d2e1231996a..e6b757c59ccc 100644
--- a/arch/x86/kvm/mmu/mmu_internal.h
+++ b/arch/x86/kvm/mmu/mmu_internal.h
@@ -355,7 +355,7 @@ void track_possible_nx_huge_page(struct kvm *kvm, struct kvm_mmu_page *sp,
                                 struct list_head *pages, u64 *nr_pages);
 void untrack_possible_nx_huge_page(struct kvm *kvm, struct kvm_mmu_page *sp,
                                   u64 *nr_pages);
-void kvm_recover_nx_huge_pages(struct kvm *kvm, struct list_head *pages,
-                              unsigned long nr_pages);
+void kvm_recover_nx_huge_pages(struct kvm *kvm, bool shared,
+                              struct list_head *pages, unsigned long nr_pages);
 
 #endif /* __KVM_X86_MMU_INTERNAL_H */
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 9a6c26d20210..8a6ffc150c99 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -74,9 +74,19 @@ static void tdp_mmu_free_sp_rcu_callback(struct rcu_head *head)
        tdp_mmu_free_sp(sp);
 }
 
+void kvm_tdp_mmu_pages_lock(struct kvm *kvm)
+{
+       spin_lock(&kvm->arch.tdp_mmu_pages_lock);
+}
+
+void kvm_tdp_mmu_pages_unlock(struct kvm *kvm)
+{
+       spin_unlock(&kvm->arch.tdp_mmu_pages_lock);
+}
+
 void kvm_tdp_mmu_recover_nx_huge_pages(struct kvm *kvm)
 {
-       kvm_recover_nx_huge_pages(kvm,
+       kvm_recover_nx_huge_pages(kvm, true,
                                  &kvm->arch.tdp_mmu_possible_nx_huge_pages,
                                  kvm->arch.tdp_mmu_nr_possible_nx_huge_pages);
 }
@@ -825,23 +835,51 @@ static void tdp_mmu_zap_root(struct kvm *kvm, struct kvm_mmu_page *root,
        rcu_read_unlock();
 }
 
-bool kvm_tdp_mmu_zap_sp(struct kvm *kvm, struct kvm_mmu_page *sp)
+bool kvm_tdp_mmu_zap_possible_nx_huge_page(struct kvm *kvm,
+                                          struct kvm_mmu_page *sp)
 {
-       u64 old_spte;
+       struct tdp_iter iter = {
+               .old_spte = sp->ptep ? kvm_tdp_mmu_read_spte(sp->ptep) : 0,
+               .sptep = sp->ptep,
+               .level = sp->role.level + 1,
+               .gfn = sp->gfn,
+               .as_id = kvm_mmu_page_as_id(sp),
+       };
+
+       lockdep_assert_held_read(&kvm->mmu_lock);
+       if (WARN_ON_ONCE(!is_tdp_mmu_page(sp)))
+               return false;
 
        /*
-        * This helper intentionally doesn't allow zapping a root shadow page,
-        * which doesn't have a parent page table and thus no associated entry.
+        * Root shadow pages don't have a parent page table and thus no
+        * associated entry, but they can never be possible NX huge pages.
         */
        if (WARN_ON_ONCE(!sp->ptep))
                return false;
 
-       old_spte = kvm_tdp_mmu_read_spte(sp->ptep);
-       if (WARN_ON_ONCE(!is_shadow_present_pte(old_spte)))
+       /*
+        * Since mmu_lock is held in read mode, it's possible another task has
+        * already modified the SPTE. Zap the SPTE if and only if the SPTE
+        * points at the SP's page table, as checking shadow-present isn't
+        * sufficient, e.g. the SPTE could be replaced by a leaf SPTE, or even
+        * another SP. Note, spte_to_child_pt() also checks that the SPTE is
+        * shadow-present, i.e. guards against zapping a frozen SPTE.
+        */
+       if ((tdp_ptep_t)sp->spt != spte_to_child_pt(iter.old_spte, iter.level))
                return false;
 
-       tdp_mmu_set_spte(kvm, kvm_mmu_page_as_id(sp), sp->ptep, old_spte,
-                        SHADOW_NONPRESENT_VALUE, sp->gfn, sp->role.level + 1);
+       /*
+        * If a different task modified the SPTE, then it should be impossible
+        * for the SPTE to still be used for the to-be-zapped SP.  Non-leaf
+        * SPTEs don't have Dirty bits, KVM always sets the Accessed bit when
+        * creating non-leaf SPTEs, and all other bits are immutable for non-
+        * leaf SPTEs, i.e. the only legal operations for non-leaf SPTEs are
+        * zapping and replacement.
+        */
+       if (tdp_mmu_set_spte_atomic(kvm, &iter, SHADOW_NONPRESENT_VALUE)) {
+               WARN_ON_ONCE((tdp_ptep_t)sp->spt == spte_to_child_pt(iter.old_spte, iter.level));
+               return false;
+       }
 
        return true;
 }
 
diff --git a/arch/x86/kvm/mmu/tdp_mmu.h b/arch/x86/kvm/mmu/tdp_mmu.h
index 510baf3eb3f1..ed4bdceb9aec 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.h
+++ b/arch/x86/kvm/mmu/tdp_mmu.h
@@ -20,7 +20,8 @@ __must_check static inline bool kvm_tdp_mmu_get_root(struct kvm_mmu_page *root)
 void kvm_tdp_mmu_put_root(struct kvm *kvm, struct kvm_mmu_page *root);
 
 bool kvm_tdp_mmu_zap_leafs(struct kvm *kvm, gfn_t start, gfn_t end, bool flush);
-bool kvm_tdp_mmu_zap_sp(struct kvm *kvm, struct kvm_mmu_page *sp);
+bool kvm_tdp_mmu_zap_possible_nx_huge_page(struct kvm *kvm,
+                                          struct kvm_mmu_page *sp);
 void kvm_tdp_mmu_zap_all(struct kvm *kvm);
 void kvm_tdp_mmu_invalidate_all_roots(struct kvm *kvm);
 void kvm_tdp_mmu_zap_invalidated_roots(struct kvm *kvm);
@@ -66,6 +67,8 @@ int kvm_tdp_mmu_get_walk(struct kvm_vcpu *vcpu, u64 addr, u64 *sptes,
                         int *root_level);
 u64 *kvm_tdp_mmu_fast_pf_get_last_sptep(struct kvm_vcpu *vcpu, gfn_t gfn,
                                        u64 *spte);
+void kvm_tdp_mmu_pages_lock(struct kvm *kvm);
+void kvm_tdp_mmu_pages_unlock(struct kvm *kvm);
 void kvm_tdp_mmu_recover_nx_huge_pages(struct kvm *kvm);
 
 #ifdef CONFIG_X86_64
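
For readers who want the net effect of the series without walking every
hunk, the recovery flow after both patches is condensed below. This is a
sketch assembled from the hunks above, not the literal kernel code; the
function names match the patches, but bodies are abbreviated and the
comments summarize the locking rules the series introduces.

    /* Shadow/legacy MMU list: reclaimed under mmu_lock held for write. */
    static void kvm_mmu_recover_nx_huge_pages(struct kvm *kvm)
    {
            kvm_recover_nx_huge_pages(kvm, false /* shared */,
                                      &kvm->arch.possible_nx_huge_pages,
                                      kvm->arch.nr_possible_nx_huge_pages);
    }

    /*
     * TDP MMU list: reclaimed under mmu_lock held for read. The list itself
     * is protected by tdp_mmu_pages_lock, and the parent SPTE is zapped with
     * tdp_mmu_set_spte_atomic() to cope with concurrent vCPU faults.
     */
    void kvm_tdp_mmu_recover_nx_huge_pages(struct kvm *kvm)
    {
            kvm_recover_nx_huge_pages(kvm, true /* shared */,
                                      &kvm->arch.tdp_mmu_possible_nx_huge_pages,
                                      kvm->arch.tdp_mmu_nr_possible_nx_huge_pages);
    }

    /* Both paths are driven from the same NX huge page recovery worker. */
    static int kvm_nx_huge_page_recovery_worker(struct kvm *kvm, uintptr_t data)
    {
            ...
            kvm_mmu_recover_nx_huge_pages(kvm);
            if (tdp_mmu_enabled)
                    kvm_tdp_mmu_recover_nx_huge_pages(kvm);
            ...
    }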