Message ID | 9d5595dfe1b5ab77bcb5650bc4d940dd977b0a32.1667110240.git.isaku.yamahata@intel.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | KVM TDX basic feature support | expand |
On 2022/10/30 14:22, isaku.yamahata@intel.com wrote: > From: Isaku Yamahata<isaku.yamahata@intel.com> > > Allocate protected page table for private page table, and add hooks to > operate on protected page table. This patch adds allocation/free of > protected page tables and hooks. When calling hooks to update SPTE entry, > freeze the entry, call hooks and unfree unfreeze > the entry to allow concurrent > updates on page tables. Which is the advantage of TDP MMU. As > kvm_gfn_shared_mask() returns false always, those hooks aren't called yet > with this patch. > > When the faulting GPA is private, the KVM fault is called private. When > resolving private KVM, private KVM fault? > allocate protected page table and call hooks to > operate on protected page table. On the change of the private PTE entry, > invoke kvm_x86_ops hook in __handle_changed_spte() to propagate the change > to protected page table. The following depicts the relationship. > > private KVM page fault | > | | > V | > private GPA | CPU protected EPTP > | | | > V | V > private PT root | protected PT root > | | | > V | V > private PT --hook to propagate-->protected PT > | | | > \--------------------+------\ | > | | | > | V V > | private guest page > | > | > non-encrypted memory | encrypted memory > | > PT: page table > > The existing KVM TDP MMU code uses atomic update of SPTE. On populating > the EPT entry, atomically set the entry. However, it requires TLB > shootdown to zap SPTE. To address it, the entry is frozen with the special > SPTE value that clears the present bit. After the TLB shootdown, the entry > is set to the eventual value (unfreeze). > > For protected page table, hooks are called to update protected page table > in addition to direct access to the private SPTE. For the zapping case, it > works to freeze the SPTE. It can call hooks in addition to TLB shootdown. > For populating the private SPTE entry, there can be a race condition > without further protection > > vcpu 1: populating 2M private SPTE > vcpu 2: populating 4K private SPTE > vcpu 2: TDX SEAMCALL to update 4K protected SPTE => error > vcpu 1: TDX SEAMCALL to update 2M protected SPTE > > To avoid the race, the frozen SPTE is utilized. Instead of atomic update > of the private entry, freeze the entry, call the hook that update protected > SPTE, set the entry to the final value. > > Support 4K page only at this stage. 2M page support can be done in future > patches. > > Co-developed-by: Kai Huang<kai.huang@intel.com> > Signed-off-by: Kai Huang<kai.huang@intel.com> > Signed-off-by: Isaku Yamahata<isaku.yamahata@intel.com> > --- > arch/x86/include/asm/kvm-x86-ops.h | 5 + > arch/x86/include/asm/kvm_host.h | 11 ++ > arch/x86/kvm/mmu/mmu.c | 15 +- > arch/x86/kvm/mmu/mmu_internal.h | 32 ++++ > arch/x86/kvm/mmu/tdp_iter.h | 2 +- > arch/x86/kvm/mmu/tdp_mmu.c | 244 +++++++++++++++++++++++++---- > arch/x86/kvm/mmu/tdp_mmu.h | 2 +- > virt/kvm/kvm_main.c | 1 + > 8 files changed, 280 insertions(+), 32 deletions(-) > > diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-x86-ops.h > index f28c9fd72ac4..1b01dc2098b0 100644 > --- a/arch/x86/include/asm/kvm-x86-ops.h > +++ b/arch/x86/include/asm/kvm-x86-ops.h > @@ -94,6 +94,11 @@ KVM_X86_OP_OPTIONAL_RET0(set_tss_addr) > KVM_X86_OP_OPTIONAL_RET0(set_identity_map_addr) > KVM_X86_OP_OPTIONAL_RET0(get_mt_mask) > KVM_X86_OP(load_mmu_pgd) > +KVM_X86_OP_OPTIONAL(link_private_spt) > +KVM_X86_OP_OPTIONAL(free_private_spt) > +KVM_X86_OP_OPTIONAL(set_private_spte) > +KVM_X86_OP_OPTIONAL(remove_private_spte) > +KVM_X86_OP_OPTIONAL(zap_private_spte) > KVM_X86_OP(has_wbinvd_exit) > KVM_X86_OP(get_l2_tsc_offset) > KVM_X86_OP(get_l2_tsc_multiplier) > > @@ -509,9 +524,81 @@ static void handle_removed_pt(struct kvm *kvm, tdp_ptep_t pt, bool shared) > WARN_ON_ONCE(ret); > } > > + if (is_private_sp(sp) && > + WARN_ON(static_call(kvm_x86_free_private_spt)(kvm, sp->gfn, sp->role.level, > + kvm_mmu_private_spt(sp)))) { > + /* > + * Failed to unlink Secure EPT page and there is nothing to do > + * further. Intentionally leak the page to prevent the kernel > + * from accessing the encrypted page. > + */ > + kvm_mmu_init_private_spt(sp, NULL); Do you think is it better to add some statistics for the intentinal leakage? > + } > + > call_rcu(&sp->rcu_head, tdp_mmu_free_sp_rcu_callback); > } > > pu, until a matching vcpu_put()
On 10/30/2022 2:22 PM, isaku.yamahata@intel.com wrote: > From: Isaku Yamahata <isaku.yamahata@intel.com> > > Allocate protected page table for private page table, and add hooks to > operate on protected page table. This patch adds allocation/free of > protected page tables and hooks. When calling hooks to update SPTE entry, > freeze the entry, call hooks and unfree the entry to allow concurrent > updates on page tables. Which is the advantage of TDP MMU. As > kvm_gfn_shared_mask() returns false always, those hooks aren't called yet > with this patch. > > When the faulting GPA is private, the KVM fault is called private. When > resolving private KVM, allocate protected page table and call hooks to > operate on protected page table. On the change of the private PTE entry, > invoke kvm_x86_ops hook in __handle_changed_spte() to propagate the change > to protected page table. The following depicts the relationship. > > private KVM page fault | > | | > V | > private GPA | CPU protected EPTP > | | | > V | V > private PT root | protected PT root > | | | > V | V > private PT --hook to propagate-->protected PT > | | | > \--------------------+------\ | > | | | > | V V > | private guest page > | > | > non-encrypted memory | encrypted memory > | > PT: page table > > The existing KVM TDP MMU code uses atomic update of SPTE. On populating > the EPT entry, atomically set the entry. However, it requires TLB > shootdown to zap SPTE. To address it, the entry is frozen with the special > SPTE value that clears the present bit. After the TLB shootdown, the entry > is set to the eventual value (unfreeze). > > For protected page table, hooks are called to update protected page table > in addition to direct access to the private SPTE. For the zapping case, it > works to freeze the SPTE. It can call hooks in addition to TLB shootdown. > For populating the private SPTE entry, there can be a race condition > without further protection > > vcpu 1: populating 2M private SPTE > vcpu 2: populating 4K private SPTE > vcpu 2: TDX SEAMCALL to update 4K protected SPTE => error > vcpu 1: TDX SEAMCALL to update 2M protected SPTE > > To avoid the race, the frozen SPTE is utilized. Instead of atomic update > of the private entry, freeze the entry, call the hook that update protected > SPTE, set the entry to the final value. > > Support 4K page only at this stage. 2M page support can be done in future > patches. > > Co-developed-by: Kai Huang <kai.huang@intel.com> > Signed-off-by: Kai Huang <kai.huang@intel.com> > Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com> > --- > arch/x86/include/asm/kvm-x86-ops.h | 5 + > arch/x86/include/asm/kvm_host.h | 11 ++ > arch/x86/kvm/mmu/mmu.c | 15 +- > arch/x86/kvm/mmu/mmu_internal.h | 32 ++++ > arch/x86/kvm/mmu/tdp_iter.h | 2 +- > arch/x86/kvm/mmu/tdp_mmu.c | 244 +++++++++++++++++++++++++---- > arch/x86/kvm/mmu/tdp_mmu.h | 2 +- > virt/kvm/kvm_main.c | 1 + > 8 files changed, 280 insertions(+), 32 deletions(-) > > diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-x86-ops.h > index f28c9fd72ac4..1b01dc2098b0 100644 > --- a/arch/x86/include/asm/kvm-x86-ops.h > +++ b/arch/x86/include/asm/kvm-x86-ops.h > @@ -94,6 +94,11 @@ KVM_X86_OP_OPTIONAL_RET0(set_tss_addr) > KVM_X86_OP_OPTIONAL_RET0(set_identity_map_addr) > KVM_X86_OP_OPTIONAL_RET0(get_mt_mask) > KVM_X86_OP(load_mmu_pgd) > +KVM_X86_OP_OPTIONAL(link_private_spt) > +KVM_X86_OP_OPTIONAL(free_private_spt) > +KVM_X86_OP_OPTIONAL(set_private_spte) > +KVM_X86_OP_OPTIONAL(remove_private_spte) > +KVM_X86_OP_OPTIONAL(zap_private_spte) > KVM_X86_OP(has_wbinvd_exit) > KVM_X86_OP(get_l2_tsc_offset) > KVM_X86_OP(get_l2_tsc_multiplier) > diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h > index 381df2c8136d..5f9634c130d0 100644 > --- a/arch/x86/include/asm/kvm_host.h > +++ b/arch/x86/include/asm/kvm_host.h > @@ -467,6 +467,7 @@ struct kvm_mmu { > struct kvm_mmu_page *sp); > void (*invlpg)(struct kvm_vcpu *vcpu, gva_t gva, hpa_t root_hpa); > struct kvm_mmu_root_info root; > + hpa_t private_root_hpa; > union kvm_cpu_role cpu_role; > union kvm_mmu_page_role root_role; > > @@ -1613,6 +1614,16 @@ struct kvm_x86_ops { > void (*load_mmu_pgd)(struct kvm_vcpu *vcpu, hpa_t root_hpa, > int root_level); > > + int (*link_private_spt)(struct kvm *kvm, gfn_t gfn, enum pg_level level, > + void *private_spt); > + int (*free_private_spt)(struct kvm *kvm, gfn_t gfn, enum pg_level level, > + void *private_spt); > + int (*set_private_spte)(struct kvm *kvm, gfn_t gfn, enum pg_level level, > + kvm_pfn_t pfn); > + int (*remove_private_spte)(struct kvm *kvm, gfn_t gfn, enum pg_level level, > + kvm_pfn_t pfn); > + int (*zap_private_spte)(struct kvm *kvm, gfn_t gfn, enum pg_level level); > + > bool (*has_wbinvd_exit)(void); > > u64 (*get_l2_tsc_offset)(struct kvm_vcpu *vcpu); > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c > index 0237e143299c..02e7b5cf3231 100644 > --- a/arch/x86/kvm/mmu/mmu.c > +++ b/arch/x86/kvm/mmu/mmu.c > @@ -3646,7 +3646,12 @@ static int mmu_alloc_direct_roots(struct kvm_vcpu *vcpu) > goto out_unlock; > > if (is_tdp_mmu_enabled(vcpu->kvm)) { > - root = kvm_tdp_mmu_get_vcpu_root_hpa(vcpu); > + if (kvm_gfn_shared_mask(vcpu->kvm) && > + !VALID_PAGE(mmu->private_root_hpa)) { > + root = kvm_tdp_mmu_get_vcpu_root_hpa(vcpu, true); > + mmu->private_root_hpa = root; > + } > + root = kvm_tdp_mmu_get_vcpu_root_hpa(vcpu, false); > mmu->root.hpa = root; > } else if (shadow_root_level >= PT64_ROOT_4LEVEL) { > root = mmu_alloc_root(vcpu, 0, 0, shadow_root_level); > @@ -4357,7 +4362,7 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault > unsigned long mmu_seq; > int r; > > - fault->gfn = fault->addr >> PAGE_SHIFT; > + fault->gfn = gpa_to_gfn(fault->addr) & ~kvm_gfn_shared_mask(vcpu->kvm); > fault->slot = kvm_vcpu_gfn_to_memslot(vcpu, fault->gfn); > > if (page_fault_handle_page_track(vcpu, fault)) > @@ -5893,6 +5898,7 @@ static int __kvm_mmu_create(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu) > > mmu->root.hpa = INVALID_PAGE; > mmu->root.pgd = 0; > + mmu->private_root_hpa = INVALID_PAGE; > for (i = 0; i < KVM_MMU_NUM_PREV_ROOTS; i++) > mmu->prev_roots[i] = KVM_MMU_ROOT_INFO_INVALID; > > @@ -6116,7 +6122,7 @@ static void kvm_mmu_zap_memslot(struct kvm *kvm, struct kvm_memory_slot *slot) > }; > > /* > - * this handles both private gfn and shared gfn. > + * This handles both private gfn and shared gfn. > * All private page should be zapped on memslot deletion. > */ > flush = kvm_tdp_mmu_unmap_gfn_range(kvm, &range, flush, true); > @@ -6919,6 +6925,9 @@ int kvm_mmu_vendor_module_init(void) > void kvm_mmu_destroy(struct kvm_vcpu *vcpu) > { > kvm_mmu_unload(vcpu); > + if (is_tdp_mmu_enabled(vcpu->kvm)) > + mmu_free_root_page(vcpu->kvm, &vcpu->arch.mmu->private_root_hpa, > + NULL); > free_mmu_pages(&vcpu->arch.root_mmu); > free_mmu_pages(&vcpu->arch.guest_mmu); > mmu_free_memory_caches(vcpu); > diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h > index 4c013124534b..508e8402c07a 100644 > --- a/arch/x86/kvm/mmu/mmu_internal.h > +++ b/arch/x86/kvm/mmu/mmu_internal.h > @@ -6,6 +6,8 @@ > #include <linux/kvm_host.h> > #include <asm/kvm_host.h> > > +#include "mmu.h" > + > #undef MMU_DEBUG > > #ifdef MMU_DEBUG > @@ -209,11 +211,29 @@ static inline void kvm_mmu_alloc_private_spt(struct kvm_vcpu *vcpu, > } > } > > +static inline int kvm_alloc_private_spt_for_split(struct kvm_mmu_page *sp, gfp_t gfp) > +{ > + gfp &= ~__GFP_ZERO; > + sp->private_spt = (void *)__get_free_page(gfp); > + if (!sp->private_spt) > + return -ENOMEM; > + return 0; > +} > + > static inline void kvm_mmu_free_private_spt(struct kvm_mmu_page *sp) > { > if (sp->private_spt) > free_page((unsigned long)sp->private_spt); > } > + > +static inline gfn_t kvm_gfn_for_root(struct kvm *kvm, struct kvm_mmu_page *root, > + gfn_t gfn) > +{ > + if (is_private_sp(root)) > + return kvm_gfn_private(kvm, gfn); > + else > + return kvm_gfn_shared(kvm, gfn); > +} > #else > static inline void *kvm_mmu_private_spt(struct kvm_mmu_page *sp) > { > @@ -230,9 +250,20 @@ static inline void kvm_mmu_alloc_private_spt(struct kvm_vcpu *vcpu, > { > } > > +static inline int kvm_alloc_private_spt_for_split(struct kvm_mmu_page *sp, gfp_t gfp) > +{ > + return -ENOMEM; > +} > + > static inline void kvm_mmu_free_private_spt(struct kvm_mmu_page *sp) > { > } > + > +static inline gfn_t kvm_gfn_for_root(struct kvm *kvm, struct kvm_mmu_page *root, > + gfn_t gfn) > +{ > + return gfn; > +} > #endif > > static inline bool kvm_mmu_page_ad_need_write_protect(struct kvm_mmu_page *sp) > @@ -367,6 +398,7 @@ static inline int kvm_mmu_do_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, > .is_tdp = likely(vcpu->arch.mmu->page_fault == kvm_tdp_page_fault), > .nx_huge_page_workaround_enabled = > is_nx_huge_page_enabled(vcpu->kvm), > + .is_private = kvm_is_private_gpa(vcpu->kvm, cr2_or_gpa), > > .max_level = vcpu->kvm->arch.tdp_max_page_level, > .req_level = PG_LEVEL_4K, > diff --git a/arch/x86/kvm/mmu/tdp_iter.h b/arch/x86/kvm/mmu/tdp_iter.h > index 9e56a5b1024c..eab62baf8549 100644 > --- a/arch/x86/kvm/mmu/tdp_iter.h > +++ b/arch/x86/kvm/mmu/tdp_iter.h > @@ -71,7 +71,7 @@ struct tdp_iter { > tdp_ptep_t pt_path[PT64_ROOT_MAX_LEVEL]; > /* A pointer to the current SPTE */ > tdp_ptep_t sptep; > - /* The lowest GFN mapped by the current SPTE */ > + /* The lowest GFN (shared bits included) mapped by the current SPTE */ > gfn_t gfn; > /* The level of the root page given to the iterator */ > int root_level; > diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c > index bdb50c26849f..0e053b96444a 100644 > --- a/arch/x86/kvm/mmu/tdp_mmu.c > +++ b/arch/x86/kvm/mmu/tdp_mmu.c > @@ -285,6 +285,9 @@ static struct kvm_mmu_page *tdp_mmu_alloc_sp(struct kvm_vcpu *vcpu, > sp->spt = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_shadow_page_cache); > sp->role = role; > > + if (kvm_mmu_page_role_is_private(role)) > + kvm_mmu_alloc_private_spt(vcpu, NULL, sp); > + > return sp; > } > > @@ -305,7 +308,8 @@ static void tdp_mmu_init_sp(struct kvm_mmu_page *sp, tdp_ptep_t sptep, > trace_kvm_mmu_get_page(sp, true); > } > > -hpa_t kvm_tdp_mmu_get_vcpu_root_hpa(struct kvm_vcpu *vcpu) > +static struct kvm_mmu_page *kvm_tdp_mmu_get_vcpu_root(struct kvm_vcpu *vcpu, > + bool private) > { > union kvm_mmu_page_role role = vcpu->arch.mmu->root_role; > struct kvm *kvm = vcpu->kvm; > @@ -317,6 +321,8 @@ hpa_t kvm_tdp_mmu_get_vcpu_root_hpa(struct kvm_vcpu *vcpu) > * Check for an existing root before allocating a new one. Note, the > * role check prevents consuming an invalid root. > */ > + if (private) > + kvm_mmu_page_role_set_private(&role); > for_each_tdp_mmu_root(kvm, root, kvm_mmu_role_as_id(role)) { > if (root->role.word == role.word && > kvm_tdp_mmu_get_root(root)) > @@ -333,11 +339,17 @@ hpa_t kvm_tdp_mmu_get_vcpu_root_hpa(struct kvm_vcpu *vcpu) > spin_unlock(&kvm->arch.tdp_mmu_pages_lock); > > out: > - return __pa(root->spt); > + return root; > +} > + > +hpa_t kvm_tdp_mmu_get_vcpu_root_hpa(struct kvm_vcpu *vcpu, bool private) > +{ > + return __pa(kvm_tdp_mmu_get_vcpu_root(vcpu, private)->spt); > } > > static int __must_check handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn, > - u64 old_spte, u64 new_spte, int level, > + u64 old_spte, u64 new_spte, > + union kvm_mmu_page_role role, > bool shared); > > static void handle_changed_spte_acc_track(u64 old_spte, u64 new_spte, int level) > @@ -364,6 +376,8 @@ static void handle_changed_spte_dirty_log(struct kvm *kvm, int as_id, gfn_t gfn, > > if ((!is_writable_pte(old_spte) || pfn_changed) && > is_writable_pte(new_spte)) { > + /* For memory slot operations, use GFN without aliasing */ > + gfn = gfn & ~kvm_gfn_shared_mask(kvm); > slot = __gfn_to_memslot(__kvm_memslots(kvm, as_id), gfn); > mark_page_dirty_in_slot(kvm, slot, gfn); > } > @@ -500,7 +514,8 @@ static void handle_removed_pt(struct kvm *kvm, tdp_ptep_t pt, bool shared) > REMOVED_SPTE, level); > } > ret = handle_changed_spte(kvm, kvm_mmu_page_as_id(sp), gfn, > - old_spte, REMOVED_SPTE, level, shared); > + old_spte, REMOVED_SPTE, sp->role, > + shared); > /* > * We are removing page tables. Because in TDX case we don't > * zap private page tables except tearing down VM. It means > @@ -509,9 +524,81 @@ static void handle_removed_pt(struct kvm *kvm, tdp_ptep_t pt, bool shared) > WARN_ON_ONCE(ret); > } > > + if (is_private_sp(sp) && > + WARN_ON(static_call(kvm_x86_free_private_spt)(kvm, sp->gfn, sp->role.level, > + kvm_mmu_private_spt(sp)))) { > + /* > + * Failed to unlink Secure EPT page and there is nothing to do > + * further. Intentionally leak the page to prevent the kernel > + * from accessing the encrypted page. > + */ > + kvm_mmu_init_private_spt(sp, NULL); > + } > + > call_rcu(&sp->rcu_head, tdp_mmu_free_sp_rcu_callback); > } > > +static void *get_private_spt(gfn_t gfn, u64 new_spte, int level) > +{ > + if (is_shadow_present_pte(new_spte) && !is_last_spte(new_spte, level)) { > + struct kvm_mmu_page *sp = to_shadow_page(pfn_to_hpa(spte_to_pfn(new_spte))); > + void *private_spt = kvm_mmu_private_spt(sp); > + > + WARN_ON_ONCE(!private_spt); > + WARN_ON_ONCE(sp->role.level + 1 != level); > + WARN_ON_ONCE(sp->gfn != gfn); > + return private_spt; > + } > + > + return NULL; > +} > + > +static int __must_check handle_changed_private_spte(struct kvm *kvm, gfn_t gfn, > + u64 old_spte, u64 new_spte, > + int level) > +{ > + bool was_present = is_shadow_present_pte(old_spte); > + bool is_present = is_shadow_present_pte(new_spte); > + bool was_leaf = was_present && is_last_spte(old_spte, level); > + bool is_leaf = is_present && is_last_spte(new_spte, level); > + kvm_pfn_t old_pfn = spte_to_pfn(old_spte); > + kvm_pfn_t new_pfn = spte_to_pfn(new_spte); > + int ret; int ret = 0; > + > + lockdep_assert_held(&kvm->mmu_lock); > + if (is_present) { > + /* TDP MMU doesn't change present -> present */ > + KVM_BUG_ON(was_present, kvm); > + > + /* > + * Use different call to either set up middle level > + * private page table, or leaf. > + */ > + if (is_leaf) > + ret = static_call(kvm_x86_set_private_spte)(kvm, gfn, level, new_pfn); > + else { > + void *private_spt = get_private_spt(gfn, new_spte, level); > + > + KVM_BUG_ON(!private_spt, kvm); > + ret = static_call(kvm_x86_link_private_spt)(kvm, gfn, level, private_spt); > + } > + } else if (was_leaf) { > + /* non-present -> non-present doesn't make sense. */ > + KVM_BUG_ON(!was_present, kvm); > + /* > + * Zap private leaf SPTE. Zapping private table is done > + * below in handle_removed_tdp_mmu_page(). > + */ > + lockdep_assert_held_write(&kvm->mmu_lock); > + ret = static_call(kvm_x86_zap_private_spte)(kvm, gfn, level); > + if (!ret) { > + ret = static_call(kvm_x86_remove_private_spte)(kvm, gfn, level, old_pfn); > + WARN_ON_ONCE(ret); > + } > + } Otherwise, "ret" may be returned without initialization. Then it will trigger the WARN_ON_ONCE(ret) after handle_changed_spte() in patch 48. > + return ret; > +} > + > /** > * __handle_changed_spte - handle bookkeeping associated with an SPTE change > * @kvm: kvm instance > @@ -519,7 +606,7 @@ static void handle_removed_pt(struct kvm *kvm, tdp_ptep_t pt, bool shared) > * @gfn: the base GFN that was mapped by the SPTE > * @old_spte: The value of the SPTE before the change > * @new_spte: The value of the SPTE after the change > - * @level: the level of the PT the SPTE is part of in the paging structure > + * @role: the role of the PT the SPTE is part of in the paging structure > * @shared: This operation may not be running under the exclusive use of > * the MMU lock and the operation must synchronize with other > * threads that might be modifying SPTEs. > @@ -528,14 +615,18 @@ static void handle_removed_pt(struct kvm *kvm, tdp_ptep_t pt, bool shared) > * This function must be called for all TDP SPTE modifications. > */ > static int __must_check __handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn, > - u64 old_spte, u64 new_spte, int level, > - bool shared) > + u64 old_spte, u64 new_spte, > + union kvm_mmu_page_role role, bool shared) > { > + bool is_private = kvm_mmu_page_role_is_private(role); > + int level = role.level; > bool was_present = is_shadow_present_pte(old_spte); > bool is_present = is_shadow_present_pte(new_spte); > bool was_leaf = was_present && is_last_spte(old_spte, level); > bool is_leaf = is_present && is_last_spte(new_spte, level); > - bool pfn_changed = spte_to_pfn(old_spte) != spte_to_pfn(new_spte); > + kvm_pfn_t old_pfn = spte_to_pfn(old_spte); > + kvm_pfn_t new_pfn = spte_to_pfn(new_spte); > + bool pfn_changed = old_pfn != new_pfn; > > WARN_ON(level > PT64_ROOT_MAX_LEVEL); > WARN_ON(level < PG_LEVEL_4K); > @@ -602,7 +693,7 @@ static int __must_check __handle_changed_spte(struct kvm *kvm, int as_id, gfn_t > > if (was_leaf && is_dirty_spte(old_spte) && > (!is_present || !is_dirty_spte(new_spte) || pfn_changed)) > - kvm_set_pfn_dirty(spte_to_pfn(old_spte)); > + kvm_set_pfn_dirty(old_pfn); > > /* > * Recursively handle child PTs if the change removed a subtree from > @@ -611,26 +702,42 @@ static int __must_check __handle_changed_spte(struct kvm *kvm, int as_id, gfn_t > * pages are kernel allocations and should never be migrated. > */ > if (was_present && !was_leaf && > - (is_leaf || !is_present || WARN_ON_ONCE(pfn_changed))) > + (is_leaf || !is_present || WARN_ON_ONCE(pfn_changed))) { > + KVM_BUG_ON(is_private != is_private_sptep(spte_to_child_pt(old_spte, level)), > + kvm); > handle_removed_pt(kvm, spte_to_child_pt(old_spte, level), shared); > + } > > + /* > + * Special handling for the private mapping. We are either > + * setting up new mapping at middle level page table, or leaf, > + * or tearing down existing mapping. > + * > + * This is after handling lower page table by above > + * handle_remove_tdp_mmu_page(). Secure-EPT requires to remove > + * Secure-EPT tables after removing children. > + */ > + if (is_private && > + /* Ignore change of software only bits. e.g. host_writable */ > + (was_leaf != is_leaf || was_present != is_present || pfn_changed)) > + return handle_changed_private_spte(kvm, gfn, old_spte, new_spte, role.level); > return 0; > } > > static int __must_check handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn, > - u64 old_spte, u64 new_spte, int level, > + u64 old_spte, u64 new_spte, > + union kvm_mmu_page_role role, > bool shared) > { > int ret; > > - ret = __handle_changed_spte(kvm, as_id, gfn, old_spte, new_spte, level, > - shared); > + ret = __handle_changed_spte(kvm, as_id, gfn, old_spte, new_spte, role, shared); > if (ret) > return ret; > > - handle_changed_spte_acc_track(old_spte, new_spte, level); > + handle_changed_spte_acc_track(old_spte, new_spte, role.level); > handle_changed_spte_dirty_log(kvm, as_id, gfn, old_spte, > - new_spte, level); > + new_spte, role.level); > return 0; > } > > @@ -656,6 +763,24 @@ static inline int __must_check tdp_mmu_set_spte_atomic(struct kvm *kvm, > struct tdp_iter *iter, > u64 new_spte) > { > + /* > + * For conventional page table, the update flow is > + * - update STPE with atomic operation > + * - handle changed SPTE. __handle_changed_spte() > + * NOTE: __handle_changed_spte() (and functions) must be safe against > + * concurrent update. It is an exception to zap SPTE. See > + * tdp_mmu_zap_spte_atomic(). > + * > + * For private page table, callbacks are needed to propagate SPTE > + * change into the protected page table. In order to atomically update > + * both the SPTE and the protected page tables with callbacks, utilize > + * freezing SPTE. > + * - Freeze the SPTE. Set entry to REMOVED_SPTE. > + * - Trigger callbacks for protected page tables. __handle_changed_spte() > + * - Unfreeze the SPTE. Set the entry to new_spte. > + */ > + bool freeze_spte = is_private_sptep(iter->sptep) && !is_removed_spte(new_spte); > + u64 tmp_spte = freeze_spte ? REMOVED_SPTE : new_spte; > u64 *sptep = rcu_dereference(iter->sptep); > int ret; > > @@ -673,14 +798,24 @@ static inline int __must_check tdp_mmu_set_spte_atomic(struct kvm *kvm, > * Note, fast_pf_fix_direct_spte() can also modify TDP MMU SPTEs and > * does not hold the mmu_lock. > */ > - if (!try_cmpxchg64(sptep, &iter->old_spte, new_spte)) > + if (!try_cmpxchg64(sptep, &iter->old_spte, tmp_spte)) > return -EBUSY; > > ret = __handle_changed_spte(kvm, iter->as_id, iter->gfn, iter->old_spte, > - new_spte, iter->level, true); > + new_spte, sptep_to_sp(sptep)->role, true); > if (!ret) > handle_changed_spte_acc_track(iter->old_spte, new_spte, iter->level); > > + if (ret) { > + /* > + * !freeze_spte means this fault isn't private. No call to > + * operation on Secure EPT. > + */ > + WARN_ON_ONCE(!freeze_spte); > + __kvm_tdp_mmu_write_spte(sptep, iter->old_spte); > + } else if (freeze_spte) > + __kvm_tdp_mmu_write_spte(sptep, new_spte); > + > return ret; > } > > @@ -750,6 +885,7 @@ static u64 __tdp_mmu_set_spte(struct kvm *kvm, int as_id, tdp_ptep_t sptep, > u64 old_spte, u64 new_spte, gfn_t gfn, int level, > bool record_acc_track, bool record_dirty_log) > { > + union kvm_mmu_page_role role; > int ret; > > lockdep_assert_held_write(&kvm->mmu_lock); > @@ -765,7 +901,9 @@ static u64 __tdp_mmu_set_spte(struct kvm *kvm, int as_id, tdp_ptep_t sptep, > > old_spte = kvm_tdp_mmu_write_spte(sptep, old_spte, new_spte, level); > > - ret = __handle_changed_spte(kvm, as_id, gfn, old_spte, new_spte, level, false); > + role = sptep_to_sp(sptep)->role; > + role.level = level; > + ret = __handle_changed_spte(kvm, as_id, gfn, old_spte, new_spte, role, false); > /* Because write spin lock is held, no race. It should success. */ > WARN_ON_ONCE(ret); > > @@ -819,8 +957,11 @@ static inline void tdp_mmu_set_spte_no_dirty_log(struct kvm *kvm, > continue; \ > else > > -#define tdp_mmu_for_each_pte(_iter, _mmu, _start, _end) \ > - for_each_tdp_pte(_iter, to_shadow_page(_mmu->root.hpa), _start, _end) > +#define tdp_mmu_for_each_pte(_iter, _mmu, _private, _start, _end) \ > + for_each_tdp_pte(_iter, \ > + to_shadow_page((_private) ? _mmu->private_root_hpa : \ > + _mmu->root.hpa), \ > + _start, _end) > > /* > * Yield if the MMU lock is contended or this thread needs to return control > @@ -983,6 +1124,14 @@ static bool tdp_mmu_zap_leafs(struct kvm *kvm, struct kvm_mmu_page *root, > if (!zap_private && is_private_sp(root)) > return false; > > + /* > + * start and end doesn't have GFN shared bit. This function zaps > + * a region including alias. Adjust shared bit of [start, end) if the > + * root is shared. > + */ > + start = kvm_gfn_for_root(kvm, root, start); > + end = kvm_gfn_for_root(kvm, root, end); > + > rcu_read_lock(); > > for_each_tdp_pte_min_level(iter, root, PG_LEVEL_4K, start, end) { > @@ -1111,10 +1260,19 @@ static int tdp_mmu_map_handle_target_level(struct kvm_vcpu *vcpu, > WARN_ON(sp->role.level != fault->goal_level); > if (unlikely(!fault->slot)) > new_spte = make_mmio_spte(vcpu, iter->gfn, ACC_ALL); > - else > - wrprot = make_spte(vcpu, sp, fault->slot, ACC_ALL, iter->gfn, > - fault->pfn, iter->old_spte, fault->prefetch, true, > - fault->map_writable, &new_spte); > + else { > + unsigned long pte_access = ACC_ALL; > + gfn_t gfn_unalias = iter->gfn & ~kvm_gfn_shared_mask(vcpu->kvm); > + > + /* TDX shared GPAs are no executable, enforce this for the SDV. */ > + if (kvm_gfn_shared_mask(vcpu->kvm) && !fault->is_private) > + pte_access &= ~ACC_EXEC_MASK; > + > + wrprot = make_spte(vcpu, sp, fault->slot, pte_access, > + gfn_unalias, fault->pfn, iter->old_spte, > + fault->prefetch, true, fault->map_writable, > + &new_spte); > + } > > if (new_spte == iter->old_spte) > ret = RET_PF_SPURIOUS; > @@ -1214,6 +1372,8 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault) > { > struct kvm_mmu *mmu = vcpu->arch.mmu; > struct tdp_iter iter; > + gfn_t raw_gfn; > + bool is_private = fault->is_private; > int ret; > > kvm_mmu_hugepage_adjust(vcpu, fault); > @@ -1222,7 +1382,17 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault) > > rcu_read_lock(); > > - tdp_mmu_for_each_pte(iter, mmu, fault->gfn, fault->gfn + 1) { > + raw_gfn = gpa_to_gfn(fault->addr); > + > + if (is_error_noslot_pfn(fault->pfn) || > + !kvm_pfn_to_refcounted_page(fault->pfn)) { > + if (is_private) { > + rcu_read_unlock(); > + return -EFAULT; > + } > + } > + > + tdp_mmu_for_each_pte(iter, mmu, is_private, raw_gfn, raw_gfn + 1) { > if (fault->nx_huge_page_workaround_enabled) > disallowed_hugepage_adjust(fault, iter.old_spte, iter.level); > > @@ -1238,6 +1408,12 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault) > is_large_pte(iter.old_spte)) { > if (tdp_mmu_zap_spte_atomic(vcpu->kvm, &iter)) > break; > + /* > + * TODO: large page support. > + * Doesn't support large page for TDX now > + */ > + KVM_BUG_ON(is_private_sptep(iter.sptep), vcpu->kvm); > + > > /* > * The iter must explicitly re-read the spte here > @@ -1480,6 +1656,12 @@ static struct kvm_mmu_page *__tdp_mmu_alloc_sp_for_split(gfp_t gfp, union kvm_mm > > sp->role = role; > sp->spt = (void *)__get_free_page(gfp); > + if (kvm_mmu_page_role_is_private(role)) { > + if (kvm_alloc_private_spt_for_split(sp, gfp)) { > + free_page((unsigned long)sp->spt); > + sp->spt = NULL; > + } > + } > if (!sp->spt) { > kmem_cache_free(mmu_page_header_cache, sp); > return NULL; > @@ -1495,6 +1677,11 @@ static struct kvm_mmu_page *tdp_mmu_alloc_sp_for_split(struct kvm *kvm, > union kvm_mmu_page_role role = tdp_iter_child_role(iter); > struct kvm_mmu_page *sp; > > + KVM_BUG_ON(kvm_mmu_page_role_is_private(role) != > + is_private_sptep(iter->sptep), kvm); > + /* TODO: Large page isn't supported for private SPTE yet. */ > + KVM_BUG_ON(kvm_mmu_page_role_is_private(role), kvm); > + > /* > * Since we are allocating while under the MMU lock we have to be > * careful about GFP flags. Use GFP_NOWAIT to avoid blocking on direct > @@ -1929,7 +2116,7 @@ int kvm_tdp_mmu_get_walk(struct kvm_vcpu *vcpu, u64 addr, u64 *sptes, > if (WARN_ON_ONCE(kvm_gfn_shared_mask(vcpu->kvm))) > return leaf; > > - tdp_mmu_for_each_pte(iter, mmu, gfn, gfn + 1) { > + tdp_mmu_for_each_pte(iter, mmu, false, gfn, gfn + 1) { > leaf = iter.level; > sptes[leaf] = iter.old_spte; > } > @@ -1956,7 +2143,10 @@ u64 *kvm_tdp_mmu_fast_pf_get_last_sptep(struct kvm_vcpu *vcpu, u64 addr, > gfn_t gfn = addr >> PAGE_SHIFT; > tdp_ptep_t sptep = NULL; > > - tdp_mmu_for_each_pte(iter, mmu, gfn, gfn + 1) { > + /* fast page fault for private GPA isn't supported. */ > + WARN_ON_ONCE(kvm_is_private_gpa(vcpu->kvm, addr)); > + > + tdp_mmu_for_each_pte(iter, mmu, false, gfn, gfn + 1) { > *spte = iter.old_spte; > sptep = iter.sptep; > } > diff --git a/arch/x86/kvm/mmu/tdp_mmu.h b/arch/x86/kvm/mmu/tdp_mmu.h > index c98c7df449a8..695175c921a5 100644 > --- a/arch/x86/kvm/mmu/tdp_mmu.h > +++ b/arch/x86/kvm/mmu/tdp_mmu.h > @@ -5,7 +5,7 @@ > > #include <linux/kvm_host.h> > > -hpa_t kvm_tdp_mmu_get_vcpu_root_hpa(struct kvm_vcpu *vcpu); > +hpa_t kvm_tdp_mmu_get_vcpu_root_hpa(struct kvm_vcpu *vcpu, bool private); > > __must_check static inline bool kvm_tdp_mmu_get_root(struct kvm_mmu_page *root) > { > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c > index dda2f2ec4faa..8c996f40b544 100644 > --- a/virt/kvm/kvm_main.c > +++ b/virt/kvm/kvm_main.c > @@ -211,6 +211,7 @@ struct page *kvm_pfn_to_refcounted_page(kvm_pfn_t pfn) > > return NULL; > } > +EXPORT_SYMBOL_GPL(kvm_pfn_to_refcounted_page); > > /* > * Switches to specified vcpu, until a matching vcpu_put()
> > +static inline int kvm_alloc_private_spt_for_split(struct kvm_mmu_page *sp, gfp_t gfp) > +{ > + gfp &= ~__GFP_ZERO; > + sp->private_spt = (void *)__get_free_page(gfp); > + if (!sp->private_spt) > + return -ENOMEM; > + return 0; > +} > + > [...] > @@ -1238,6 +1408,12 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault) > is_large_pte(iter.old_spte)) { > if (tdp_mmu_zap_spte_atomic(vcpu->kvm, &iter)) > break; > + /* > + * TODO: large page support. > + * Doesn't support large page for TDX now > + */ > + KVM_BUG_ON(is_private_sptep(iter.sptep), vcpu->kvm); > + > So large page is not supported for private page, ... > /* > * The iter must explicitly re-read the spte here > @@ -1480,6 +1656,12 @@ static struct kvm_mmu_page *__tdp_mmu_alloc_sp_for_split(gfp_t gfp, union kvm_mm > > sp->role = role; > sp->spt = (void *)__get_free_page(gfp); > + if (kvm_mmu_page_role_is_private(role)) { > + if (kvm_alloc_private_spt_for_split(sp, gfp)) { > + free_page((unsigned long)sp->spt); > + sp->spt = NULL; > + } > + } ... then I don't think eager splitting could happen for private mapping? If so, should we just KVM_BUG_ON() if role is private?
On Wed, Nov 16, 2022 at 09:40:38AM +0800, Chenyi Qiang <chenyi.qiang@intel.com> wrote: > > +static int __must_check handle_changed_private_spte(struct kvm *kvm, gfn_t gfn, > > + u64 old_spte, u64 new_spte, > > + int level) > > +{ > > + bool was_present = is_shadow_present_pte(old_spte); > > + bool is_present = is_shadow_present_pte(new_spte); > > + bool was_leaf = was_present && is_last_spte(old_spte, level); > > + bool is_leaf = is_present && is_last_spte(new_spte, level); > > + kvm_pfn_t old_pfn = spte_to_pfn(old_spte); > > + kvm_pfn_t new_pfn = spte_to_pfn(new_spte); > > + int ret; > > int ret = 0; > > > + > > + lockdep_assert_held(&kvm->mmu_lock); > > + if (is_present) { > > + /* TDP MMU doesn't change present -> present */ > > + KVM_BUG_ON(was_present, kvm); > > + > > + /* > > + * Use different call to either set up middle level > > + * private page table, or leaf. > > + */ > > + if (is_leaf) > > + ret = static_call(kvm_x86_set_private_spte)(kvm, gfn, level, new_pfn); > > + else { > > + void *private_spt = get_private_spt(gfn, new_spte, level); > > + > > + KVM_BUG_ON(!private_spt, kvm); > > + ret = static_call(kvm_x86_link_private_spt)(kvm, gfn, level, private_spt); > > + } > > + } else if (was_leaf) { > > + /* non-present -> non-present doesn't make sense. */ > > + KVM_BUG_ON(!was_present, kvm); > > + /* > > + * Zap private leaf SPTE. Zapping private table is done > > + * below in handle_removed_tdp_mmu_page(). > > + */ > > + lockdep_assert_held_write(&kvm->mmu_lock); > > + ret = static_call(kvm_x86_zap_private_spte)(kvm, gfn, level); > > + if (!ret) { > > + ret = static_call(kvm_x86_remove_private_spte)(kvm, gfn, level, old_pfn); > > + WARN_ON_ONCE(ret); > > + } > > + } > > Otherwise, "ret" may be returned without initialization. Then it will > trigger the WARN_ON_ONCE(ret) after handle_changed_spte() in patch 48. Thanks for catching it. Compiler didn't complain it. Maybe W=1 is needed.
On Wed, Nov 16, 2022 at 11:58:46AM +0000, "Huang, Kai" <kai.huang@intel.com> wrote: > > > > > +static inline int kvm_alloc_private_spt_for_split(struct kvm_mmu_page *sp, gfp_t gfp) > > +{ > > + gfp &= ~__GFP_ZERO; > > + sp->private_spt = (void *)__get_free_page(gfp); > > + if (!sp->private_spt) > > + return -ENOMEM; > > + return 0; > > +} > > + > > > [...] > > > @@ -1238,6 +1408,12 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault) > > is_large_pte(iter.old_spte)) { > > if (tdp_mmu_zap_spte_atomic(vcpu->kvm, &iter)) > > break; > > + /* > > + * TODO: large page support. > > + * Doesn't support large page for TDX now > > + */ > > + KVM_BUG_ON(is_private_sptep(iter.sptep), vcpu->kvm); > > + > > > > So large page is not supported for private page, ... > > > /* > > * The iter must explicitly re-read the spte here > > @@ -1480,6 +1656,12 @@ static struct kvm_mmu_page *__tdp_mmu_alloc_sp_for_split(gfp_t gfp, union kvm_mm > > > > sp->role = role; > > sp->spt = (void *)__get_free_page(gfp); > > + if (kvm_mmu_page_role_is_private(role)) { > > + if (kvm_alloc_private_spt_for_split(sp, gfp)) { > > + free_page((unsigned long)sp->spt); > > + sp->spt = NULL; > > + } > > + } > > ... then I don't think eager splitting could happen for private mapping? > > If so, should we just KVM_BUG_ON() if role is private? Right, will remove this part from this patch series and move it to large page support series.
On Tue, Nov 08, 2022 at 09:41:44PM +0800, Binbin Wu <binbin.wu@linux.intel.com> wrote: > > diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-x86-ops.h > > index f28c9fd72ac4..1b01dc2098b0 100644 > > --- a/arch/x86/include/asm/kvm-x86-ops.h > > +++ b/arch/x86/include/asm/kvm-x86-ops.h > > @@ -94,6 +94,11 @@ KVM_X86_OP_OPTIONAL_RET0(set_tss_addr) > > KVM_X86_OP_OPTIONAL_RET0(set_identity_map_addr) > > KVM_X86_OP_OPTIONAL_RET0(get_mt_mask) > > KVM_X86_OP(load_mmu_pgd) > > +KVM_X86_OP_OPTIONAL(link_private_spt) > > +KVM_X86_OP_OPTIONAL(free_private_spt) > > +KVM_X86_OP_OPTIONAL(set_private_spte) > > +KVM_X86_OP_OPTIONAL(remove_private_spte) > > +KVM_X86_OP_OPTIONAL(zap_private_spte) > > KVM_X86_OP(has_wbinvd_exit) > > KVM_X86_OP(get_l2_tsc_offset) > > KVM_X86_OP(get_l2_tsc_multiplier) > > > > @@ -509,9 +524,81 @@ static void handle_removed_pt(struct kvm *kvm, tdp_ptep_t pt, bool shared) > > WARN_ON_ONCE(ret); > > } > > + if (is_private_sp(sp) && > > + WARN_ON(static_call(kvm_x86_free_private_spt)(kvm, sp->gfn, sp->role.level, > > + kvm_mmu_private_spt(sp)))) { > > + /* > > + * Failed to unlink Secure EPT page and there is nothing to do > > + * further. Intentionally leak the page to prevent the kernel > > + * from accessing the encrypted page. > > + */ > > + kvm_mmu_init_private_spt(sp, NULL); > > Do you think is it better to add some statistics for the intentinal leakage? No because this error case happens only when race condition bug(TDX_OPERAND_BUSY) or TDX module was shutdown due to bug or attack. They are (should be) rare. The race condition bug should be fixed. It's covered by WARN_ON(). In the case of TDX module shutdown, the current TDX KVM implementation checks unexpected error and mark it bugged, KVM_BUG_ON(), in vmx/tdx.c. So adding statistic counter doesn't help. It's another story to harden the code against TDX module shutdown.
diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-x86-ops.h index f28c9fd72ac4..1b01dc2098b0 100644 --- a/arch/x86/include/asm/kvm-x86-ops.h +++ b/arch/x86/include/asm/kvm-x86-ops.h @@ -94,6 +94,11 @@ KVM_X86_OP_OPTIONAL_RET0(set_tss_addr) KVM_X86_OP_OPTIONAL_RET0(set_identity_map_addr) KVM_X86_OP_OPTIONAL_RET0(get_mt_mask) KVM_X86_OP(load_mmu_pgd) +KVM_X86_OP_OPTIONAL(link_private_spt) +KVM_X86_OP_OPTIONAL(free_private_spt) +KVM_X86_OP_OPTIONAL(set_private_spte) +KVM_X86_OP_OPTIONAL(remove_private_spte) +KVM_X86_OP_OPTIONAL(zap_private_spte) KVM_X86_OP(has_wbinvd_exit) KVM_X86_OP(get_l2_tsc_offset) KVM_X86_OP(get_l2_tsc_multiplier) diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index 381df2c8136d..5f9634c130d0 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -467,6 +467,7 @@ struct kvm_mmu { struct kvm_mmu_page *sp); void (*invlpg)(struct kvm_vcpu *vcpu, gva_t gva, hpa_t root_hpa); struct kvm_mmu_root_info root; + hpa_t private_root_hpa; union kvm_cpu_role cpu_role; union kvm_mmu_page_role root_role; @@ -1613,6 +1614,16 @@ struct kvm_x86_ops { void (*load_mmu_pgd)(struct kvm_vcpu *vcpu, hpa_t root_hpa, int root_level); + int (*link_private_spt)(struct kvm *kvm, gfn_t gfn, enum pg_level level, + void *private_spt); + int (*free_private_spt)(struct kvm *kvm, gfn_t gfn, enum pg_level level, + void *private_spt); + int (*set_private_spte)(struct kvm *kvm, gfn_t gfn, enum pg_level level, + kvm_pfn_t pfn); + int (*remove_private_spte)(struct kvm *kvm, gfn_t gfn, enum pg_level level, + kvm_pfn_t pfn); + int (*zap_private_spte)(struct kvm *kvm, gfn_t gfn, enum pg_level level); + bool (*has_wbinvd_exit)(void); u64 (*get_l2_tsc_offset)(struct kvm_vcpu *vcpu); diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c index 0237e143299c..02e7b5cf3231 100644 --- a/arch/x86/kvm/mmu/mmu.c +++ b/arch/x86/kvm/mmu/mmu.c @@ -3646,7 +3646,12 @@ static int mmu_alloc_direct_roots(struct kvm_vcpu *vcpu) goto out_unlock; if (is_tdp_mmu_enabled(vcpu->kvm)) { - root = kvm_tdp_mmu_get_vcpu_root_hpa(vcpu); + if (kvm_gfn_shared_mask(vcpu->kvm) && + !VALID_PAGE(mmu->private_root_hpa)) { + root = kvm_tdp_mmu_get_vcpu_root_hpa(vcpu, true); + mmu->private_root_hpa = root; + } + root = kvm_tdp_mmu_get_vcpu_root_hpa(vcpu, false); mmu->root.hpa = root; } else if (shadow_root_level >= PT64_ROOT_4LEVEL) { root = mmu_alloc_root(vcpu, 0, 0, shadow_root_level); @@ -4357,7 +4362,7 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault unsigned long mmu_seq; int r; - fault->gfn = fault->addr >> PAGE_SHIFT; + fault->gfn = gpa_to_gfn(fault->addr) & ~kvm_gfn_shared_mask(vcpu->kvm); fault->slot = kvm_vcpu_gfn_to_memslot(vcpu, fault->gfn); if (page_fault_handle_page_track(vcpu, fault)) @@ -5893,6 +5898,7 @@ static int __kvm_mmu_create(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu) mmu->root.hpa = INVALID_PAGE; mmu->root.pgd = 0; + mmu->private_root_hpa = INVALID_PAGE; for (i = 0; i < KVM_MMU_NUM_PREV_ROOTS; i++) mmu->prev_roots[i] = KVM_MMU_ROOT_INFO_INVALID; @@ -6116,7 +6122,7 @@ static void kvm_mmu_zap_memslot(struct kvm *kvm, struct kvm_memory_slot *slot) }; /* - * this handles both private gfn and shared gfn. + * This handles both private gfn and shared gfn. * All private page should be zapped on memslot deletion. */ flush = kvm_tdp_mmu_unmap_gfn_range(kvm, &range, flush, true); @@ -6919,6 +6925,9 @@ int kvm_mmu_vendor_module_init(void) void kvm_mmu_destroy(struct kvm_vcpu *vcpu) { kvm_mmu_unload(vcpu); + if (is_tdp_mmu_enabled(vcpu->kvm)) + mmu_free_root_page(vcpu->kvm, &vcpu->arch.mmu->private_root_hpa, + NULL); free_mmu_pages(&vcpu->arch.root_mmu); free_mmu_pages(&vcpu->arch.guest_mmu); mmu_free_memory_caches(vcpu); diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h index 4c013124534b..508e8402c07a 100644 --- a/arch/x86/kvm/mmu/mmu_internal.h +++ b/arch/x86/kvm/mmu/mmu_internal.h @@ -6,6 +6,8 @@ #include <linux/kvm_host.h> #include <asm/kvm_host.h> +#include "mmu.h" + #undef MMU_DEBUG #ifdef MMU_DEBUG @@ -209,11 +211,29 @@ static inline void kvm_mmu_alloc_private_spt(struct kvm_vcpu *vcpu, } } +static inline int kvm_alloc_private_spt_for_split(struct kvm_mmu_page *sp, gfp_t gfp) +{ + gfp &= ~__GFP_ZERO; + sp->private_spt = (void *)__get_free_page(gfp); + if (!sp->private_spt) + return -ENOMEM; + return 0; +} + static inline void kvm_mmu_free_private_spt(struct kvm_mmu_page *sp) { if (sp->private_spt) free_page((unsigned long)sp->private_spt); } + +static inline gfn_t kvm_gfn_for_root(struct kvm *kvm, struct kvm_mmu_page *root, + gfn_t gfn) +{ + if (is_private_sp(root)) + return kvm_gfn_private(kvm, gfn); + else + return kvm_gfn_shared(kvm, gfn); +} #else static inline void *kvm_mmu_private_spt(struct kvm_mmu_page *sp) { @@ -230,9 +250,20 @@ static inline void kvm_mmu_alloc_private_spt(struct kvm_vcpu *vcpu, { } +static inline int kvm_alloc_private_spt_for_split(struct kvm_mmu_page *sp, gfp_t gfp) +{ + return -ENOMEM; +} + static inline void kvm_mmu_free_private_spt(struct kvm_mmu_page *sp) { } + +static inline gfn_t kvm_gfn_for_root(struct kvm *kvm, struct kvm_mmu_page *root, + gfn_t gfn) +{ + return gfn; +} #endif static inline bool kvm_mmu_page_ad_need_write_protect(struct kvm_mmu_page *sp) @@ -367,6 +398,7 @@ static inline int kvm_mmu_do_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, .is_tdp = likely(vcpu->arch.mmu->page_fault == kvm_tdp_page_fault), .nx_huge_page_workaround_enabled = is_nx_huge_page_enabled(vcpu->kvm), + .is_private = kvm_is_private_gpa(vcpu->kvm, cr2_or_gpa), .max_level = vcpu->kvm->arch.tdp_max_page_level, .req_level = PG_LEVEL_4K, diff --git a/arch/x86/kvm/mmu/tdp_iter.h b/arch/x86/kvm/mmu/tdp_iter.h index 9e56a5b1024c..eab62baf8549 100644 --- a/arch/x86/kvm/mmu/tdp_iter.h +++ b/arch/x86/kvm/mmu/tdp_iter.h @@ -71,7 +71,7 @@ struct tdp_iter { tdp_ptep_t pt_path[PT64_ROOT_MAX_LEVEL]; /* A pointer to the current SPTE */ tdp_ptep_t sptep; - /* The lowest GFN mapped by the current SPTE */ + /* The lowest GFN (shared bits included) mapped by the current SPTE */ gfn_t gfn; /* The level of the root page given to the iterator */ int root_level; diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c index bdb50c26849f..0e053b96444a 100644 --- a/arch/x86/kvm/mmu/tdp_mmu.c +++ b/arch/x86/kvm/mmu/tdp_mmu.c @@ -285,6 +285,9 @@ static struct kvm_mmu_page *tdp_mmu_alloc_sp(struct kvm_vcpu *vcpu, sp->spt = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_shadow_page_cache); sp->role = role; + if (kvm_mmu_page_role_is_private(role)) + kvm_mmu_alloc_private_spt(vcpu, NULL, sp); + return sp; } @@ -305,7 +308,8 @@ static void tdp_mmu_init_sp(struct kvm_mmu_page *sp, tdp_ptep_t sptep, trace_kvm_mmu_get_page(sp, true); } -hpa_t kvm_tdp_mmu_get_vcpu_root_hpa(struct kvm_vcpu *vcpu) +static struct kvm_mmu_page *kvm_tdp_mmu_get_vcpu_root(struct kvm_vcpu *vcpu, + bool private) { union kvm_mmu_page_role role = vcpu->arch.mmu->root_role; struct kvm *kvm = vcpu->kvm; @@ -317,6 +321,8 @@ hpa_t kvm_tdp_mmu_get_vcpu_root_hpa(struct kvm_vcpu *vcpu) * Check for an existing root before allocating a new one. Note, the * role check prevents consuming an invalid root. */ + if (private) + kvm_mmu_page_role_set_private(&role); for_each_tdp_mmu_root(kvm, root, kvm_mmu_role_as_id(role)) { if (root->role.word == role.word && kvm_tdp_mmu_get_root(root)) @@ -333,11 +339,17 @@ hpa_t kvm_tdp_mmu_get_vcpu_root_hpa(struct kvm_vcpu *vcpu) spin_unlock(&kvm->arch.tdp_mmu_pages_lock); out: - return __pa(root->spt); + return root; +} + +hpa_t kvm_tdp_mmu_get_vcpu_root_hpa(struct kvm_vcpu *vcpu, bool private) +{ + return __pa(kvm_tdp_mmu_get_vcpu_root(vcpu, private)->spt); } static int __must_check handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn, - u64 old_spte, u64 new_spte, int level, + u64 old_spte, u64 new_spte, + union kvm_mmu_page_role role, bool shared); static void handle_changed_spte_acc_track(u64 old_spte, u64 new_spte, int level) @@ -364,6 +376,8 @@ static void handle_changed_spte_dirty_log(struct kvm *kvm, int as_id, gfn_t gfn, if ((!is_writable_pte(old_spte) || pfn_changed) && is_writable_pte(new_spte)) { + /* For memory slot operations, use GFN without aliasing */ + gfn = gfn & ~kvm_gfn_shared_mask(kvm); slot = __gfn_to_memslot(__kvm_memslots(kvm, as_id), gfn); mark_page_dirty_in_slot(kvm, slot, gfn); } @@ -500,7 +514,8 @@ static void handle_removed_pt(struct kvm *kvm, tdp_ptep_t pt, bool shared) REMOVED_SPTE, level); } ret = handle_changed_spte(kvm, kvm_mmu_page_as_id(sp), gfn, - old_spte, REMOVED_SPTE, level, shared); + old_spte, REMOVED_SPTE, sp->role, + shared); /* * We are removing page tables. Because in TDX case we don't * zap private page tables except tearing down VM. It means @@ -509,9 +524,81 @@ static void handle_removed_pt(struct kvm *kvm, tdp_ptep_t pt, bool shared) WARN_ON_ONCE(ret); } + if (is_private_sp(sp) && + WARN_ON(static_call(kvm_x86_free_private_spt)(kvm, sp->gfn, sp->role.level, + kvm_mmu_private_spt(sp)))) { + /* + * Failed to unlink Secure EPT page and there is nothing to do + * further. Intentionally leak the page to prevent the kernel + * from accessing the encrypted page. + */ + kvm_mmu_init_private_spt(sp, NULL); + } + call_rcu(&sp->rcu_head, tdp_mmu_free_sp_rcu_callback); } +static void *get_private_spt(gfn_t gfn, u64 new_spte, int level) +{ + if (is_shadow_present_pte(new_spte) && !is_last_spte(new_spte, level)) { + struct kvm_mmu_page *sp = to_shadow_page(pfn_to_hpa(spte_to_pfn(new_spte))); + void *private_spt = kvm_mmu_private_spt(sp); + + WARN_ON_ONCE(!private_spt); + WARN_ON_ONCE(sp->role.level + 1 != level); + WARN_ON_ONCE(sp->gfn != gfn); + return private_spt; + } + + return NULL; +} + +static int __must_check handle_changed_private_spte(struct kvm *kvm, gfn_t gfn, + u64 old_spte, u64 new_spte, + int level) +{ + bool was_present = is_shadow_present_pte(old_spte); + bool is_present = is_shadow_present_pte(new_spte); + bool was_leaf = was_present && is_last_spte(old_spte, level); + bool is_leaf = is_present && is_last_spte(new_spte, level); + kvm_pfn_t old_pfn = spte_to_pfn(old_spte); + kvm_pfn_t new_pfn = spte_to_pfn(new_spte); + int ret; + + lockdep_assert_held(&kvm->mmu_lock); + if (is_present) { + /* TDP MMU doesn't change present -> present */ + KVM_BUG_ON(was_present, kvm); + + /* + * Use different call to either set up middle level + * private page table, or leaf. + */ + if (is_leaf) + ret = static_call(kvm_x86_set_private_spte)(kvm, gfn, level, new_pfn); + else { + void *private_spt = get_private_spt(gfn, new_spte, level); + + KVM_BUG_ON(!private_spt, kvm); + ret = static_call(kvm_x86_link_private_spt)(kvm, gfn, level, private_spt); + } + } else if (was_leaf) { + /* non-present -> non-present doesn't make sense. */ + KVM_BUG_ON(!was_present, kvm); + /* + * Zap private leaf SPTE. Zapping private table is done + * below in handle_removed_tdp_mmu_page(). + */ + lockdep_assert_held_write(&kvm->mmu_lock); + ret = static_call(kvm_x86_zap_private_spte)(kvm, gfn, level); + if (!ret) { + ret = static_call(kvm_x86_remove_private_spte)(kvm, gfn, level, old_pfn); + WARN_ON_ONCE(ret); + } + } + return ret; +} + /** * __handle_changed_spte - handle bookkeeping associated with an SPTE change * @kvm: kvm instance @@ -519,7 +606,7 @@ static void handle_removed_pt(struct kvm *kvm, tdp_ptep_t pt, bool shared) * @gfn: the base GFN that was mapped by the SPTE * @old_spte: The value of the SPTE before the change * @new_spte: The value of the SPTE after the change - * @level: the level of the PT the SPTE is part of in the paging structure + * @role: the role of the PT the SPTE is part of in the paging structure * @shared: This operation may not be running under the exclusive use of * the MMU lock and the operation must synchronize with other * threads that might be modifying SPTEs. @@ -528,14 +615,18 @@ static void handle_removed_pt(struct kvm *kvm, tdp_ptep_t pt, bool shared) * This function must be called for all TDP SPTE modifications. */ static int __must_check __handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn, - u64 old_spte, u64 new_spte, int level, - bool shared) + u64 old_spte, u64 new_spte, + union kvm_mmu_page_role role, bool shared) { + bool is_private = kvm_mmu_page_role_is_private(role); + int level = role.level; bool was_present = is_shadow_present_pte(old_spte); bool is_present = is_shadow_present_pte(new_spte); bool was_leaf = was_present && is_last_spte(old_spte, level); bool is_leaf = is_present && is_last_spte(new_spte, level); - bool pfn_changed = spte_to_pfn(old_spte) != spte_to_pfn(new_spte); + kvm_pfn_t old_pfn = spte_to_pfn(old_spte); + kvm_pfn_t new_pfn = spte_to_pfn(new_spte); + bool pfn_changed = old_pfn != new_pfn; WARN_ON(level > PT64_ROOT_MAX_LEVEL); WARN_ON(level < PG_LEVEL_4K); @@ -602,7 +693,7 @@ static int __must_check __handle_changed_spte(struct kvm *kvm, int as_id, gfn_t if (was_leaf && is_dirty_spte(old_spte) && (!is_present || !is_dirty_spte(new_spte) || pfn_changed)) - kvm_set_pfn_dirty(spte_to_pfn(old_spte)); + kvm_set_pfn_dirty(old_pfn); /* * Recursively handle child PTs if the change removed a subtree from @@ -611,26 +702,42 @@ static int __must_check __handle_changed_spte(struct kvm *kvm, int as_id, gfn_t * pages are kernel allocations and should never be migrated. */ if (was_present && !was_leaf && - (is_leaf || !is_present || WARN_ON_ONCE(pfn_changed))) + (is_leaf || !is_present || WARN_ON_ONCE(pfn_changed))) { + KVM_BUG_ON(is_private != is_private_sptep(spte_to_child_pt(old_spte, level)), + kvm); handle_removed_pt(kvm, spte_to_child_pt(old_spte, level), shared); + } + /* + * Special handling for the private mapping. We are either + * setting up new mapping at middle level page table, or leaf, + * or tearing down existing mapping. + * + * This is after handling lower page table by above + * handle_remove_tdp_mmu_page(). Secure-EPT requires to remove + * Secure-EPT tables after removing children. + */ + if (is_private && + /* Ignore change of software only bits. e.g. host_writable */ + (was_leaf != is_leaf || was_present != is_present || pfn_changed)) + return handle_changed_private_spte(kvm, gfn, old_spte, new_spte, role.level); return 0; } static int __must_check handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn, - u64 old_spte, u64 new_spte, int level, + u64 old_spte, u64 new_spte, + union kvm_mmu_page_role role, bool shared) { int ret; - ret = __handle_changed_spte(kvm, as_id, gfn, old_spte, new_spte, level, - shared); + ret = __handle_changed_spte(kvm, as_id, gfn, old_spte, new_spte, role, shared); if (ret) return ret; - handle_changed_spte_acc_track(old_spte, new_spte, level); + handle_changed_spte_acc_track(old_spte, new_spte, role.level); handle_changed_spte_dirty_log(kvm, as_id, gfn, old_spte, - new_spte, level); + new_spte, role.level); return 0; } @@ -656,6 +763,24 @@ static inline int __must_check tdp_mmu_set_spte_atomic(struct kvm *kvm, struct tdp_iter *iter, u64 new_spte) { + /* + * For conventional page table, the update flow is + * - update STPE with atomic operation + * - handle changed SPTE. __handle_changed_spte() + * NOTE: __handle_changed_spte() (and functions) must be safe against + * concurrent update. It is an exception to zap SPTE. See + * tdp_mmu_zap_spte_atomic(). + * + * For private page table, callbacks are needed to propagate SPTE + * change into the protected page table. In order to atomically update + * both the SPTE and the protected page tables with callbacks, utilize + * freezing SPTE. + * - Freeze the SPTE. Set entry to REMOVED_SPTE. + * - Trigger callbacks for protected page tables. __handle_changed_spte() + * - Unfreeze the SPTE. Set the entry to new_spte. + */ + bool freeze_spte = is_private_sptep(iter->sptep) && !is_removed_spte(new_spte); + u64 tmp_spte = freeze_spte ? REMOVED_SPTE : new_spte; u64 *sptep = rcu_dereference(iter->sptep); int ret; @@ -673,14 +798,24 @@ static inline int __must_check tdp_mmu_set_spte_atomic(struct kvm *kvm, * Note, fast_pf_fix_direct_spte() can also modify TDP MMU SPTEs and * does not hold the mmu_lock. */ - if (!try_cmpxchg64(sptep, &iter->old_spte, new_spte)) + if (!try_cmpxchg64(sptep, &iter->old_spte, tmp_spte)) return -EBUSY; ret = __handle_changed_spte(kvm, iter->as_id, iter->gfn, iter->old_spte, - new_spte, iter->level, true); + new_spte, sptep_to_sp(sptep)->role, true); if (!ret) handle_changed_spte_acc_track(iter->old_spte, new_spte, iter->level); + if (ret) { + /* + * !freeze_spte means this fault isn't private. No call to + * operation on Secure EPT. + */ + WARN_ON_ONCE(!freeze_spte); + __kvm_tdp_mmu_write_spte(sptep, iter->old_spte); + } else if (freeze_spte) + __kvm_tdp_mmu_write_spte(sptep, new_spte); + return ret; } @@ -750,6 +885,7 @@ static u64 __tdp_mmu_set_spte(struct kvm *kvm, int as_id, tdp_ptep_t sptep, u64 old_spte, u64 new_spte, gfn_t gfn, int level, bool record_acc_track, bool record_dirty_log) { + union kvm_mmu_page_role role; int ret; lockdep_assert_held_write(&kvm->mmu_lock); @@ -765,7 +901,9 @@ static u64 __tdp_mmu_set_spte(struct kvm *kvm, int as_id, tdp_ptep_t sptep, old_spte = kvm_tdp_mmu_write_spte(sptep, old_spte, new_spte, level); - ret = __handle_changed_spte(kvm, as_id, gfn, old_spte, new_spte, level, false); + role = sptep_to_sp(sptep)->role; + role.level = level; + ret = __handle_changed_spte(kvm, as_id, gfn, old_spte, new_spte, role, false); /* Because write spin lock is held, no race. It should success. */ WARN_ON_ONCE(ret); @@ -819,8 +957,11 @@ static inline void tdp_mmu_set_spte_no_dirty_log(struct kvm *kvm, continue; \ else -#define tdp_mmu_for_each_pte(_iter, _mmu, _start, _end) \ - for_each_tdp_pte(_iter, to_shadow_page(_mmu->root.hpa), _start, _end) +#define tdp_mmu_for_each_pte(_iter, _mmu, _private, _start, _end) \ + for_each_tdp_pte(_iter, \ + to_shadow_page((_private) ? _mmu->private_root_hpa : \ + _mmu->root.hpa), \ + _start, _end) /* * Yield if the MMU lock is contended or this thread needs to return control @@ -983,6 +1124,14 @@ static bool tdp_mmu_zap_leafs(struct kvm *kvm, struct kvm_mmu_page *root, if (!zap_private && is_private_sp(root)) return false; + /* + * start and end doesn't have GFN shared bit. This function zaps + * a region including alias. Adjust shared bit of [start, end) if the + * root is shared. + */ + start = kvm_gfn_for_root(kvm, root, start); + end = kvm_gfn_for_root(kvm, root, end); + rcu_read_lock(); for_each_tdp_pte_min_level(iter, root, PG_LEVEL_4K, start, end) { @@ -1111,10 +1260,19 @@ static int tdp_mmu_map_handle_target_level(struct kvm_vcpu *vcpu, WARN_ON(sp->role.level != fault->goal_level); if (unlikely(!fault->slot)) new_spte = make_mmio_spte(vcpu, iter->gfn, ACC_ALL); - else - wrprot = make_spte(vcpu, sp, fault->slot, ACC_ALL, iter->gfn, - fault->pfn, iter->old_spte, fault->prefetch, true, - fault->map_writable, &new_spte); + else { + unsigned long pte_access = ACC_ALL; + gfn_t gfn_unalias = iter->gfn & ~kvm_gfn_shared_mask(vcpu->kvm); + + /* TDX shared GPAs are no executable, enforce this for the SDV. */ + if (kvm_gfn_shared_mask(vcpu->kvm) && !fault->is_private) + pte_access &= ~ACC_EXEC_MASK; + + wrprot = make_spte(vcpu, sp, fault->slot, pte_access, + gfn_unalias, fault->pfn, iter->old_spte, + fault->prefetch, true, fault->map_writable, + &new_spte); + } if (new_spte == iter->old_spte) ret = RET_PF_SPURIOUS; @@ -1214,6 +1372,8 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault) { struct kvm_mmu *mmu = vcpu->arch.mmu; struct tdp_iter iter; + gfn_t raw_gfn; + bool is_private = fault->is_private; int ret; kvm_mmu_hugepage_adjust(vcpu, fault); @@ -1222,7 +1382,17 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault) rcu_read_lock(); - tdp_mmu_for_each_pte(iter, mmu, fault->gfn, fault->gfn + 1) { + raw_gfn = gpa_to_gfn(fault->addr); + + if (is_error_noslot_pfn(fault->pfn) || + !kvm_pfn_to_refcounted_page(fault->pfn)) { + if (is_private) { + rcu_read_unlock(); + return -EFAULT; + } + } + + tdp_mmu_for_each_pte(iter, mmu, is_private, raw_gfn, raw_gfn + 1) { if (fault->nx_huge_page_workaround_enabled) disallowed_hugepage_adjust(fault, iter.old_spte, iter.level); @@ -1238,6 +1408,12 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault) is_large_pte(iter.old_spte)) { if (tdp_mmu_zap_spte_atomic(vcpu->kvm, &iter)) break; + /* + * TODO: large page support. + * Doesn't support large page for TDX now + */ + KVM_BUG_ON(is_private_sptep(iter.sptep), vcpu->kvm); + /* * The iter must explicitly re-read the spte here @@ -1480,6 +1656,12 @@ static struct kvm_mmu_page *__tdp_mmu_alloc_sp_for_split(gfp_t gfp, union kvm_mm sp->role = role; sp->spt = (void *)__get_free_page(gfp); + if (kvm_mmu_page_role_is_private(role)) { + if (kvm_alloc_private_spt_for_split(sp, gfp)) { + free_page((unsigned long)sp->spt); + sp->spt = NULL; + } + } if (!sp->spt) { kmem_cache_free(mmu_page_header_cache, sp); return NULL; @@ -1495,6 +1677,11 @@ static struct kvm_mmu_page *tdp_mmu_alloc_sp_for_split(struct kvm *kvm, union kvm_mmu_page_role role = tdp_iter_child_role(iter); struct kvm_mmu_page *sp; + KVM_BUG_ON(kvm_mmu_page_role_is_private(role) != + is_private_sptep(iter->sptep), kvm); + /* TODO: Large page isn't supported for private SPTE yet. */ + KVM_BUG_ON(kvm_mmu_page_role_is_private(role), kvm); + /* * Since we are allocating while under the MMU lock we have to be * careful about GFP flags. Use GFP_NOWAIT to avoid blocking on direct @@ -1929,7 +2116,7 @@ int kvm_tdp_mmu_get_walk(struct kvm_vcpu *vcpu, u64 addr, u64 *sptes, if (WARN_ON_ONCE(kvm_gfn_shared_mask(vcpu->kvm))) return leaf; - tdp_mmu_for_each_pte(iter, mmu, gfn, gfn + 1) { + tdp_mmu_for_each_pte(iter, mmu, false, gfn, gfn + 1) { leaf = iter.level; sptes[leaf] = iter.old_spte; } @@ -1956,7 +2143,10 @@ u64 *kvm_tdp_mmu_fast_pf_get_last_sptep(struct kvm_vcpu *vcpu, u64 addr, gfn_t gfn = addr >> PAGE_SHIFT; tdp_ptep_t sptep = NULL; - tdp_mmu_for_each_pte(iter, mmu, gfn, gfn + 1) { + /* fast page fault for private GPA isn't supported. */ + WARN_ON_ONCE(kvm_is_private_gpa(vcpu->kvm, addr)); + + tdp_mmu_for_each_pte(iter, mmu, false, gfn, gfn + 1) { *spte = iter.old_spte; sptep = iter.sptep; } diff --git a/arch/x86/kvm/mmu/tdp_mmu.h b/arch/x86/kvm/mmu/tdp_mmu.h index c98c7df449a8..695175c921a5 100644 --- a/arch/x86/kvm/mmu/tdp_mmu.h +++ b/arch/x86/kvm/mmu/tdp_mmu.h @@ -5,7 +5,7 @@ #include <linux/kvm_host.h> -hpa_t kvm_tdp_mmu_get_vcpu_root_hpa(struct kvm_vcpu *vcpu); +hpa_t kvm_tdp_mmu_get_vcpu_root_hpa(struct kvm_vcpu *vcpu, bool private); __must_check static inline bool kvm_tdp_mmu_get_root(struct kvm_mmu_page *root) { diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c index dda2f2ec4faa..8c996f40b544 100644 --- a/virt/kvm/kvm_main.c +++ b/virt/kvm/kvm_main.c @@ -211,6 +211,7 @@ struct page *kvm_pfn_to_refcounted_page(kvm_pfn_t pfn) return NULL; } +EXPORT_SYMBOL_GPL(kvm_pfn_to_refcounted_page); /* * Switches to specified vcpu, until a matching vcpu_put()