Message ID: 20210112181041.356734-23-bgardon@google.com (mailing list archive)
State: New, archived
Series: Allow parallel page faults with TDP MMU
On Tue, Jan 12, 2021, Ben Gardon wrote:
> When the TDP MMU is allowed to handle page faults in parallel there is
> the possibility of a race where an SPTE is cleared and then immediately
> replaced with a present SPTE pointing to a different PFN, before the
> TLBs can be flushed. This race would violate architectural specs. Ensure
> that the TLBs are flushed properly before other threads are allowed to
> install any present value for the SPTE.
>
> Reviewed-by: Peter Feiner <pfeiner@google.com>
>
> Signed-off-by: Ben Gardon <bgardon@google.com>
> ---
>  arch/x86/kvm/mmu/spte.h    | 16 +++++++++-
>  arch/x86/kvm/mmu/tdp_mmu.c | 62 ++++++++++++++++++++++++++++++++------
>  2 files changed, 68 insertions(+), 10 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu/spte.h b/arch/x86/kvm/mmu/spte.h
> index 2b3a30bd38b0..ecd9bfbccef4 100644
> --- a/arch/x86/kvm/mmu/spte.h
> +++ b/arch/x86/kvm/mmu/spte.h
> @@ -130,6 +130,20 @@ extern u64 __read_mostly shadow_nonpresent_or_rsvd_mask;
>  					  PT64_EPT_EXECUTABLE_MASK)
>  #define SHADOW_ACC_TRACK_SAVED_BITS_SHIFT PT64_SECOND_AVAIL_BITS_SHIFT
>
> +/*
> + * If a thread running without exclusive control of the MMU lock must perform a
> + * multi-part operation on an SPTE, it can set the SPTE to FROZEN_SPTE as a
> + * non-present intermediate value. This will guarantee that other threads will
> + * not modify the spte.
> + *
> + * This constant works because it is considered non-present on both AMD and
> + * Intel CPUs and does not create an L1TF vulnerability because the pfn section
> + * is zeroed out.
> + *
> + * Only used by the TDP MMU.
> + */
> +#define FROZEN_SPTE (1ull << 59)

I dislike FROZEN, for similar reasons that I disliked "disconnected".  The
SPTE isn't frozen in the sense that it's temporarily immutable; rather, it's
been removed but hasn't been flushed and so can't yet be reused.  Given that
FROZEN_SPTEs are treated as not-present SPTEs, there's zero chance that this
can be extended in the future to be a generic temporary freeze mechanism.
Maybe REMOVED_SPTE to match earlier feedback?

> +
>  /*
>   * In some cases, we need to preserve the GFN of a non-present or reserved
>   * SPTE when we usurp the upper five bits of the physical address space to
> @@ -187,7 +201,7 @@ static inline bool is_access_track_spte(u64 spte)
>
>  static inline int is_shadow_present_pte(u64 pte)

Waaaay off topic, I'm going to send a patch to have this, and any other pte
helpers that return an int, return a bool.  While futzing around with ideas
I managed to turn this into a nop by doing

	return pte & SPTE_PRESENT;

which is guaranteed to be 0 if SPTE_PRESENT is a bit > 31.  I'm sure others
will point out that I'm a heathen for not doing !!(pte & SPTE_PRESENT), but
still...

>  {
> -	return (pte != 0) && !is_mmio_spte(pte);
> +	return (pte != 0) && !is_mmio_spte(pte) && (pte != FROZEN_SPTE);

For all other checks, I'd strongly prefer to add a helper, e.g.
is_removed_spte() or whatever (see the sketch below).  That way changing the
implementation won't be as painful, and we can add assertions and whatnot if
we break things.  Especially since FROZEN_SPTE is a single bit, which makes
it look like a flag even though it's used as a full 64-bit constant.

For this, I worry that is_shadow_present_pte() is getting bloated.  It's
also a bit unfortunate that it's bloated for the old MMU, without any
benefit.  That being said, most of that bloat is from the existing MMIO
checks.

Looking elsewhere, TDX's SEPT also has a similar concept that may or may not
need to hook is_shadow_present_pte().
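[To make the helper suggestion concrete, here is a minimal sketch; this is an
editorial illustration, not from the mail.  It keeps the series' current
FROZEN_SPTE name, with is_removed_spte() as the reviewer's proposed helper
name, not existing code.]

	/*
	 * Sketch of the proposed helper.  FROZEN_SPTE is used as a full
	 * 64-bit constant rather than a flag, so this is an exact compare,
	 * not a bit test.
	 */
	static inline bool is_removed_spte(u64 spte)
	{
		return spte == FROZEN_SPTE;
	}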
Rather than bundle MMIO SPTEs into the access-tracking flags and have a bunch
of special cases for not-present SPTEs, what if we add an explicit flag to
mark SPTEs as present (or not-present)?  Defining SPTE_PRESENT instead of
SPTE_NOT_PRESENT might require a few more changes, but it would be the most
optimal for is_shadow_present_pte().

I'm thinking something like this (completely untested):

diff --git a/arch/x86/kvm/mmu/spte.c b/arch/x86/kvm/mmu/spte.c
index c51ad544f25b..86f6c84569c4 100644
--- a/arch/x86/kvm/mmu/spte.c
+++ b/arch/x86/kvm/mmu/spte.c
@@ -38,7 +38,7 @@ static u64 generation_mmio_spte_mask(u64 gen)
 	u64 mask;
 
 	WARN_ON(gen & ~MMIO_SPTE_GEN_MASK);
-	BUILD_BUG_ON((MMIO_SPTE_GEN_HIGH_MASK | MMIO_SPTE_GEN_LOW_MASK) & SPTE_SPECIAL_MASK);
+	BUILD_BUG_ON((MMIO_SPTE_GEN_HIGH_MASK | MMIO_SPTE_GEN_LOW_MASK) & SPTE_MMIO);
 
 	mask = (gen << MMIO_SPTE_GEN_LOW_SHIFT) & MMIO_SPTE_GEN_LOW_MASK;
 	mask |= (gen << MMIO_SPTE_GEN_HIGH_SHIFT) & MMIO_SPTE_GEN_HIGH_MASK;
@@ -86,7 +86,7 @@ int make_spte(struct kvm_vcpu *vcpu, unsigned int pte_access, int level,
 		     bool can_unsync, bool host_writable, bool ad_disabled,
 		     u64 *new_spte)
 {
-	u64 spte = 0;
+	u64 spte = SPTE_PRESENT;
 	int ret = 0;
 
 	if (ad_disabled)
@@ -247,7 +247,7 @@ void kvm_mmu_set_mmio_spte_mask(u64 mmio_value, u64 access_mask)
 	BUG_ON((u64)(unsigned)access_mask != access_mask);
 	WARN_ON(mmio_value & (shadow_nonpresent_or_rsvd_mask << SHADOW_NONPRESENT_OR_RSVD_MASK_LEN));
 	WARN_ON(mmio_value & shadow_nonpresent_or_rsvd_lower_gfn_mask);
-	shadow_mmio_value = mmio_value | SPTE_MMIO_MASK;
+	shadow_mmio_value = mmio_value | SPTE_MMIO;
 	shadow_mmio_access_mask = access_mask;
 }
 EXPORT_SYMBOL_GPL(kvm_mmu_set_mmio_spte_mask);
diff --git a/arch/x86/kvm/mmu/spte.h b/arch/x86/kvm/mmu/spte.h
index ecd9bfbccef4..465e43d34034 100644
--- a/arch/x86/kvm/mmu/spte.h
+++ b/arch/x86/kvm/mmu/spte.h
@@ -5,18 +5,15 @@
 
 #include "mmu_internal.h"
 
+/* Software available bits for present SPTEs. */
 #define PT_FIRST_AVAIL_BITS_SHIFT	10
 #define PT64_SECOND_AVAIL_BITS_SHIFT	54
 
-/*
- * The mask used to denote special SPTEs, which can be either MMIO SPTEs or
- * Access Tracking SPTEs.
- */
+/* The mask used to denote Access Tracking SPTEs.  Note, val=3 is available. */
 #define SPTE_SPECIAL_MASK	(3ULL << 52)
 #define SPTE_AD_ENABLED_MASK	(0ULL << 52)
 #define SPTE_AD_DISABLED_MASK	(1ULL << 52)
 #define SPTE_AD_WRPROT_ONLY_MASK (2ULL << 52)
-#define SPTE_MMIO_MASK		(3ULL << 52)
 
 #ifdef CONFIG_DYNAMIC_PHYSICAL_MASK
 #define PT64_BASE_ADDR_MASK (physical_mask & ~(u64)(PAGE_SIZE-1))
@@ -55,12 +52,16 @@
 #define SPTE_HOST_WRITEABLE	(1ULL << PT_FIRST_AVAIL_BITS_SHIFT)
 #define SPTE_MMU_WRITEABLE	(1ULL << (PT_FIRST_AVAIL_BITS_SHIFT + 1))
 
+#define SPTE_REMOVED		BIT_ULL(60)
+#define SPTE_MMIO		BIT_ULL(61)
+#define SPTE_PRESENT		BIT_ULL(62)
+
 /*
  * Due to limited space in PTEs, the MMIO generation is a 18 bit subset of
  * the memslots generation and is derived as follows:
  *
  * Bits 0-8 of the MMIO generation are propagated to spte bits 3-11
- * Bits 9-17 of the MMIO generation are propagated to spte bits 54-62
+ * Bits 9-17 of the MMIO generation are propagated to spte bits 52-60
  *
  * The KVM_MEMSLOT_GEN_UPDATE_IN_PROGRESS flag is intentionally not included in
  * the MMIO generation number, as doing so would require stealing a bit from
@@ -73,8 +74,8 @@
 #define MMIO_SPTE_GEN_LOW_START		3
 #define MMIO_SPTE_GEN_LOW_END		11
 
-#define MMIO_SPTE_GEN_HIGH_START	PT64_SECOND_AVAIL_BITS_SHIFT
-#define MMIO_SPTE_GEN_HIGH_END		62
+#define MMIO_SPTE_GEN_HIGH_START	52
+#define MMIO_SPTE_GEN_HIGH_END		60
 
 #define MMIO_SPTE_GEN_LOW_MASK		GENMASK_ULL(MMIO_SPTE_GEN_LOW_END, \
 						    MMIO_SPTE_GEN_LOW_START)
@@ -162,7 +163,7 @@ extern u8 __read_mostly shadow_phys_bits;
 
 static inline bool is_mmio_spte(u64 spte)
 {
-	return (spte & SPTE_SPECIAL_MASK) == SPTE_MMIO_MASK;
+	return spte & SPTE_MMIO;
 }
 
 static inline bool sp_ad_disabled(struct kvm_mmu_page *sp)
@@ -199,9 +200,9 @@ static inline bool is_access_track_spte(u64 spte)
 	return !spte_ad_enabled(spte) && (spte & shadow_acc_track_mask) == 0;
 }
 
-static inline int is_shadow_present_pte(u64 pte)
+static inline bool is_shadow_present_pte(u64 pte)
 {
-	return (pte != 0) && !is_mmio_spte(pte) && (pte != FROZEN_SPTE);
+	return pte & SPTE_PRESENT;
 }
 
 static inline int is_large_pte(u64 pte)

> }
>
>  static inline int is_large_pte(u64 pte)
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index 7b12a87a4124..5c9d053000ad 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -429,15 +429,19 @@ static void __handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
>  	 */
>  	if (!was_present && !is_present) {
>  		/*
> -		 * If this change does not involve a MMIO SPTE, it is
> -		 * unexpected. Log the change, though it should not impact the
> -		 * guest since both the former and current SPTEs are nonpresent.
> +		 * If this change does not involve a MMIO SPTE or FROZEN_SPTE,

For comments and error messages, I think we should avoid using the exact
constant name, and instead call them "removed SPTE", similar to MMIO SPTE.
That will help reduce thrash and/or stale comments if the name changes.

> +		 * it is unexpected. Log the change, though it should not
> +		 * impact the guest since both the former and current SPTEs
> +		 * are nonpresent.
>  		 */
> -		if (WARN_ON(!is_mmio_spte(old_spte) && !is_mmio_spte(new_spte)))
> +		if (WARN_ON(!is_mmio_spte(old_spte) &&
> +			    !is_mmio_spte(new_spte) &&
> +			    new_spte != FROZEN_SPTE))
>  			pr_err("Unexpected SPTE change! Nonpresent SPTEs\n"
>  			       "should not be replaced with another,\n"
>  			       "different nonpresent SPTE, unless one or both\n"
> -			       "are MMIO SPTEs.\n"
> +			       "are MMIO SPTEs, or the new SPTE is\n"
> +			       "FROZEN_SPTE.\n"
>  			       "as_id: %d gfn: %llx old_spte: %llx new_spte: %llx level: %d",
>  			       as_id, gfn, old_spte, new_spte, level);
>  		return;
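[Putting the naming and helper suggestions together, the check and message
might read something like the following; an untested editorial sketch of the
__handle_changed_spte() fragment above, where is_removed_spte() is the helper
sketched earlier, not existing code.]

		if (WARN_ON(!is_mmio_spte(old_spte) &&
			    !is_mmio_spte(new_spte) &&
			    !is_removed_spte(new_spte)))
			pr_err("Unexpected SPTE change! Nonpresent SPTEs\n"
			       "should not be replaced with another,\n"
			       "different nonpresent SPTE, unless one or both\n"
			       "are MMIO SPTEs, or the new SPTE is\n"
			       "a removed SPTE.\n"
			       "as_id: %d gfn: %llx old_spte: %llx new_spte: %llx level: %d",
			       as_id, gfn, old_spte, new_spte, level);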
> @@ -488,6 +492,13 @@ static inline bool tdp_mmu_set_spte_atomic(struct kvm *kvm,
>  
>  	kvm_mmu_lock_assert_held_shared(kvm);
>  
> +	/*
> +	 * Do not change FROZEN_SPTEs. Only the thread that froze the SPTE
> +	 * may modify it.
> +	 */
> +	if (iter->old_spte == FROZEN_SPTE)
> +		return false;
> +
>  	if (cmpxchg64(iter->sptep, iter->old_spte, new_spte) !=
>  	    iter->old_spte)
>  		return false;
> @@ -497,6 +508,34 @@ static inline bool tdp_mmu_set_spte_atomic(struct kvm *kvm,
>  	return true;
>  }
>  
> +static inline bool tdp_mmu_zap_spte_atomic(struct kvm *kvm,
> +					   struct tdp_iter *iter)
> +{
> +	/*
> +	 * Freeze the SPTE by setting it to a special,
> +	 * non-present value. This will stop other threads from
> +	 * immediately installing a present entry in its place
> +	 * before the TLBs are flushed.
> +	 */
> +	if (!tdp_mmu_set_spte_atomic(kvm, iter, FROZEN_SPTE))
> +		return false;
> +
> +	kvm_flush_remote_tlbs_with_address(kvm, iter->gfn,
> +					   KVM_PAGES_PER_HPAGE(iter->level));
> +
> +	/*
> +	 * No other thread can overwrite the frozen SPTE as they
> +	 * must either wait on the MMU lock or use
> +	 * tdp_mmu_set_spte_atomic which will not overwrite the
> +	 * special frozen SPTE value. No bookkeeping is needed
> +	 * here since the SPTE is going from non-present
> +	 * to non-present.
> +	 */
> +	WRITE_ONCE(*iter->sptep, 0);
> +
> +	return true;
> +}
> +
>  /*
>   * __tdp_mmu_set_spte - Set a TDP MMU SPTE and handle the associated bookkeeping
> @@ -524,6 +563,14 @@ static inline void __tdp_mmu_set_spte(struct kvm *kvm, struct tdp_iter *iter,
>  
>  	kvm_mmu_lock_assert_held_exclusive(kvm);
>  
> +	/*
> +	 * No thread should be using this function to set SPTEs to FROZEN_SPTE.
> +	 * If operating under the MMU lock in read mode, tdp_mmu_set_spte_atomic
> +	 * should be used. If operating under the MMU lock in write mode, the
> +	 * use of FROZEN_SPTE should not be necessary.
> +	 */
> +	WARN_ON(iter->old_spte == FROZEN_SPTE);
> +
>  	WRITE_ONCE(*iter->sptep, new_spte);
>  
>  	__handle_changed_spte(kvm, as_id, iter->gfn, iter->old_spte, new_spte,
> @@ -801,12 +848,9 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
>  		 */
>  		if (is_shadow_present_pte(iter.old_spte) &&
>  		    is_large_pte(iter.old_spte)) {
> -			if (!tdp_mmu_set_spte_atomic(vcpu->kvm, &iter, 0))
> +			if (!tdp_mmu_zap_spte_atomic(vcpu->kvm, &iter))
>  				break;
>  
> -			kvm_flush_remote_tlbs_with_address(vcpu->kvm, iter.gfn,
> -					KVM_PAGES_PER_HPAGE(iter.level));
> -
>  			/*
>  			 * The iter must explicitly re-read the spte here
>  			 * because the new value informs the !present
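[To recap the interleaving this patch prevents, distilled from the commit
message above; an editorial illustration, not part of the mail.]

	/*
	 * Without the frozen SPTE, under the shared (read-mode) MMU lock:
	 *
	 *   zapping thread:   cmpxchg64(sptep, old_spte, 0)
	 *   faulting thread:  sees a zero SPTE and installs a present SPTE
	 *                     pointing to a different PFN
	 *   guest:            can still hit a stale TLB entry for the old
	 *                     PFN, violating the architectural spec
	 *   zapping thread:   flushes TLBs, but too late
	 *
	 * With the patch, the zapping thread writes FROZEN_SPTE instead of
	 * 0, and tdp_mmu_set_spte_atomic() refuses to replace a frozen
	 * SPTE, so the faulting thread backs off (retrying the fault)
	 * until the TLB flush completes and the SPTE is finally zeroed.
	 */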