Message ID: 20210630214802.1902448-5-dmatlack@google.com (mailing list archive)
State:      New, archived
Series:     KVM: x86/mmu: Fast page fault support for the TDP MMU
Hi David,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on kvm/queue]
[also build test WARNING on linus/master next-20210630]
[cannot apply to vhost/linux-next v5.13]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url:    https://github.com/0day-ci/linux/commits/David-Matlack/KVM-x86-mmu-Fast-page-fault-support-for-the-TDP-MMU/20210701-055009
base:   https://git.kernel.org/pub/scm/virt/kvm/kvm.git queue
config: x86_64-allyesconfig (attached as .config)
compiler: gcc-9 (Debian 9.3.0-22) 9.3.0
reproduce (this is a W=1 build):
        # https://github.com/0day-ci/linux/commit/7709823e6135aaf4aeac8235973f37e679064356
        git remote add linux-review https://github.com/0day-ci/linux
        git fetch --no-tags linux-review David-Matlack/KVM-x86-mmu-Fast-page-fault-support-for-the-TDP-MMU/20210701-055009
        git checkout 7709823e6135aaf4aeac8235973f37e679064356
        # save the attached .config to linux build tree
        make W=1 ARCH=x86_64

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>

All warnings (new ones prefixed by >>):

>> arch/x86/kvm/mmu/mmu.c:3119:6: warning: no previous prototype for 'get_last_sptep_lockless' [-Wmissing-prototypes]
    3119 | u64 *get_last_sptep_lockless(struct kvm_vcpu *vcpu, gpa_t gpa, u64 *spte)
         |      ^~~~~~~~~~~~~~~~~~~~~~~

vim +/get_last_sptep_lockless +3119 arch/x86/kvm/mmu/mmu.c

  3107
  3108	/*
  3109	 * Returns the last level spte pointer of the shadow page walk for the given
  3110	 * gpa, and sets *spte to the spte value. This spte may be non-preset.
  3111	 *
  3112	 * If no walk could be performed, returns NULL and *spte does not contain valid
  3113	 * data.
  3114	 *
  3115	 * Constraints:
  3116	 *  - Must be called between walk_shadow_page_lockless_{begin,end}.
  3117	 *  - The returned sptep must not be used after walk_shadow_page_lockless_end.
  3118	 */
> 3119	u64 *get_last_sptep_lockless(struct kvm_vcpu *vcpu, gpa_t gpa, u64 *spte)
  3120	{
  3121		struct kvm_shadow_walk_iterator iterator;
  3122		u64 old_spte;
  3123		u64 *sptep = NULL;
  3124
  3125		if (is_tdp_mmu(vcpu->arch.mmu))
  3126			return kvm_tdp_mmu_get_last_sptep_lockless(vcpu, gpa, spte);
  3127
  3128		for_each_shadow_entry_lockless(vcpu, gpa, iterator, old_spte) {
  3129			sptep = iterator.sptep;
  3130			*spte = old_spte;
  3131
  3132			if (!is_shadow_present_pte(old_spte))
  3133				break;
  3134		}
  3135
  3136		return sptep;
  3137	}
  3138

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org
Hi David,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on kvm/queue]
[also build test ERROR on linus/master next-20210630]
[cannot apply to vhost/linux-next v5.13]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url:    https://github.com/0day-ci/linux/commits/David-Matlack/KVM-x86-mmu-Fast-page-fault-support-for-the-TDP-MMU/20210701-055009
base:   https://git.kernel.org/pub/scm/virt/kvm/kvm.git queue
config: i386-buildonly-randconfig-r001-20210630 (attached as .config)
compiler: gcc-9 (Debian 9.3.0-22) 9.3.0
reproduce (this is a W=1 build):
        # https://github.com/0day-ci/linux/commit/7709823e6135aaf4aeac8235973f37e679064356
        git remote add linux-review https://github.com/0day-ci/linux
        git fetch --no-tags linux-review David-Matlack/KVM-x86-mmu-Fast-page-fault-support-for-the-TDP-MMU/20210701-055009
        git checkout 7709823e6135aaf4aeac8235973f37e679064356
        # save the attached .config to linux build tree
        make W=1 ARCH=i386

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>

All errors (new ones prefixed by >>):

>> arch/x86/kvm/mmu/mmu.c:3119:6: error: no previous prototype for 'get_last_sptep_lockless' [-Werror=missing-prototypes]
    3119 | u64 *get_last_sptep_lockless(struct kvm_vcpu *vcpu, gpa_t gpa, u64 *spte)
         |      ^~~~~~~~~~~~~~~~~~~~~~~
   cc1: all warnings being treated as errors

vim +/get_last_sptep_lockless +3119 arch/x86/kvm/mmu/mmu.c

  3107
  3108	/*
  3109	 * Returns the last level spte pointer of the shadow page walk for the given
  3110	 * gpa, and sets *spte to the spte value. This spte may be non-preset.
  3111	 *
  3112	 * If no walk could be performed, returns NULL and *spte does not contain valid
  3113	 * data.
  3114	 *
  3115	 * Constraints:
  3116	 *  - Must be called between walk_shadow_page_lockless_{begin,end}.
  3117	 *  - The returned sptep must not be used after walk_shadow_page_lockless_end.
  3118	 */
> 3119	u64 *get_last_sptep_lockless(struct kvm_vcpu *vcpu, gpa_t gpa, u64 *spte)
  3120	{
  3121		struct kvm_shadow_walk_iterator iterator;
  3122		u64 old_spte;
  3123		u64 *sptep = NULL;
  3124
  3125		if (is_tdp_mmu(vcpu->arch.mmu))
  3126			return kvm_tdp_mmu_get_last_sptep_lockless(vcpu, gpa, spte);
  3127
  3128		for_each_shadow_entry_lockless(vcpu, gpa, iterator, old_spte) {
  3129			sptep = iterator.sptep;
  3130			*spte = old_spte;
  3131
  3132			if (!is_shadow_present_pte(old_spte))
  3133				break;
  3134		}
  3135
  3136		return sptep;
  3137	}
  3138

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org
On Thu, Jul 01, 2021 at 12:27:58PM +0800, kernel test robot wrote:
>
> >> arch/x86/kvm/mmu/mmu.c:3119:6: error: no previous prototype for 'get_last_sptep_lockless' [-Werror=missing-prototypes]
>     3119 | u64 *get_last_sptep_lockless(struct kvm_vcpu *vcpu, gpa_t gpa, u64 *spte)
>          |      ^~~~~~~~~~~~~~~~~~~~~~~

get_last_sptep_lockless should be static. I will include a fix in the next
version of this series.
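For context, the fix being described would presumably be a one-line linkage
change in arch/x86/kvm/mmu/mmu.c, along these lines (a sketch of the idea,
not the actual hunk from the next version of the series):

	-u64 *get_last_sptep_lockless(struct kvm_vcpu *vcpu, gpa_t gpa, u64 *spte)
	+static u64 *get_last_sptep_lockless(struct kvm_vcpu *vcpu, gpa_t gpa, u64 *spte)

Giving the helper internal linkage means the compiler no longer expects a
prior prototype in a header, which is what -Wmissing-prototypes (promoted to
an error by -Werror in the i386 randconfig build) is flagging.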
On Wed, Jun 30, 2021 at 09:48:00PM +0000, David Matlack wrote:
> Make fast_page_fault interoperate with the TDP MMU by leveraging
> walk_shadow_page_lockless_{begin,end} to acquire the RCU read lock and
> introducing a new helper function kvm_tdp_mmu_get_last_sptep_lockless to
> grab the lowest level sptep.
>
> Suggested-by: Ben Gardon <bgardon@google.com>
> Signed-off-by: David Matlack <dmatlack@google.com>
> ---
>  arch/x86/kvm/mmu/mmu.c     | 55 +++++++++++++++++++++++++++-----------
>  arch/x86/kvm/mmu/tdp_mmu.c | 36 +++++++++++++++++++++++++
>  arch/x86/kvm/mmu/tdp_mmu.h |  2 ++
>  3 files changed, 78 insertions(+), 15 deletions(-)
>
> ...
>
> +/*
> + * Must be called between kvm_tdp_mmu_walk_shadow_page_lockless_{begin,end}.
> + *
> + * The returned sptep must not be used after
> + * kvm_tdp_mmu_walk_shadow_page_lockless_end.
> + */

The function names in the comment are spelled wrong and should be:

	/*
	 * Must be called between kvm_tdp_mmu_walk_lockless_{begin,end}.
	 *
	 * The returned sptep must not be used after kvm_tdp_mmu_walk_lockless_end.
	 */
On Wed, Jun 30, 2021 at 2:48 PM David Matlack <dmatlack@google.com> wrote:
>
> Make fast_page_fault interoperate with the TDP MMU by leveraging
> walk_shadow_page_lockless_{begin,end} to acquire the RCU read lock and
> introducing a new helper function kvm_tdp_mmu_get_last_sptep_lockless to
> grab the lowest level sptep.
>
> Suggested-by: Ben Gardon <bgardon@google.com>
> Signed-off-by: David Matlack <dmatlack@google.com>
> ---
>  arch/x86/kvm/mmu/mmu.c     | 55 +++++++++++++++++++++++++++-----------
>  arch/x86/kvm/mmu/tdp_mmu.c | 36 +++++++++++++++++++++++++
>  arch/x86/kvm/mmu/tdp_mmu.h |  2 ++
>  3 files changed, 78 insertions(+), 15 deletions(-)
>
> [...]
>
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index c6fa8d00bf9f..2c9e0ed71fa0 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -527,6 +527,10 @@ static inline bool tdp_mmu_set_spte_atomic_no_dirty_log(struct kvm *kvm,
>  	if (is_removed_spte(iter->old_spte))
>  		return false;
>
> +	/*
> +	 * TDP MMU sptes can also be concurrently cmpxchg'd in
> +	 * fast_pf_fix_direct_spte as part of fast_page_fault.
> +	 */
>  	if (cmpxchg64(rcu_dereference(iter->sptep), iter->old_spte,
>  		      new_spte) != iter->old_spte)
>  		return false;

I'm a little nervous about not going through the handle_changed_spte
flow for the TDP MMU, but as things are now, I think it's safe.

> @@ -1546,3 +1550,35 @@ int kvm_tdp_mmu_get_walk_lockless(struct kvm_vcpu *vcpu, u64 addr, u64 *sptes,
>
>  	return leaf;
>  }
> +
> +/*
> + * Must be called between kvm_tdp_mmu_walk_shadow_page_lockless_{begin,end}.
> + *
> + * The returned sptep must not be used after
> + * kvm_tdp_mmu_walk_shadow_page_lockless_end.
> + */
> +u64 *kvm_tdp_mmu_get_last_sptep_lockless(struct kvm_vcpu *vcpu, u64 addr,
> +					 u64 *spte)
> +{
> +	struct tdp_iter iter;
> +	struct kvm_mmu *mmu = vcpu->arch.mmu;
> +	gfn_t gfn = addr >> PAGE_SHIFT;
> +	tdp_ptep_t sptep = NULL;
> +
> +	tdp_mmu_for_each_pte(iter, mmu, gfn, gfn + 1) {
> +		*spte = iter.old_spte;
> +		sptep = iter.sptep;
> +	}
> +
> +	if (sptep)
> +		/*
> +		 * Perform the rcu dereference here since we are passing the
> +		 * sptep up to the generic MMU code which does not know the
> +		 * synchronization details of the TDP MMU. This is safe as long
> +		 * as the caller obeys the contract that the sptep is not used
> +		 * after kvm_tdp_mmu_walk_shadow_page_lockless_end.
> +		 */

There's a little more to this contract:
1. The caller should only modify the SPTE using an atomic cmpxchg with
   the returned spte value.
2. The caller should not modify the mapped PFN or present <-> not
   present state of the SPTE.
3. There are other bits the caller can't modify too. (lpage, mt, etc.)

If the comments on this function don't document all the constraints on
how the returned sptep can be used, it might be safer to specify that
this is only meant to be used as part of the fast page fault handler.

> +		return rcu_dereference(sptep);
> +
> +	return NULL;
> +}
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.h b/arch/x86/kvm/mmu/tdp_mmu.h
> index e9dde5f9c0ef..508a23bdf7da 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.h
> +++ b/arch/x86/kvm/mmu/tdp_mmu.h
> @@ -81,6 +81,8 @@ void kvm_tdp_mmu_walk_lockless_begin(void);
>  void kvm_tdp_mmu_walk_lockless_end(void);
>  int kvm_tdp_mmu_get_walk_lockless(struct kvm_vcpu *vcpu, u64 addr, u64 *sptes,
>  				  int *root_level);
> +u64 *kvm_tdp_mmu_get_last_sptep_lockless(struct kvm_vcpu *vcpu, u64 addr,
> +					 u64 *spte);
>
>  #ifdef CONFIG_X86_64
>  bool kvm_mmu_init_tdp_mmu(struct kvm *kvm);
> --
> 2.32.0.93.g670b81a890-goog
>
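To make the first constraint above concrete, a caller honoring this contract
would only ever update the returned sptep atomically against the value it
observed during the lockless walk, roughly like the following (a simplified,
hypothetical sketch, not the actual fast_pf_fix_direct_spte implementation;
the helper name is made up for illustration):

	/*
	 * Hypothetical helper: install new_spte only if the SPTE still holds
	 * the value read during the lockless walk. new_spte is assumed to
	 * differ from old_spte only in bits the fast page fault handler is
	 * allowed to change (e.g. the writable bit), never the PFN, the
	 * present bit, or other protected bits.
	 */
	static bool fast_pf_cmpxchg_spte(u64 *sptep, u64 old_spte, u64 new_spte)
	{
		/* cmpxchg64() returns the prior value; equality means we won the race. */
		return cmpxchg64(sptep, old_spte, new_spte) == old_spte;
	}

If the cmpxchg loses a race with a concurrent update, the fast page fault
path can simply retry or give up and fall back to the slow path under
mmu_lock.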
On Mon, Jul 12, 2021 at 10:49:55AM -0700, Ben Gardon wrote:
> On Wed, Jun 30, 2021 at 2:48 PM David Matlack <dmatlack@google.com> wrote:
> >
> > Make fast_page_fault interoperate with the TDP MMU by leveraging
> > walk_shadow_page_lockless_{begin,end} to acquire the RCU read lock and
> > introducing a new helper function kvm_tdp_mmu_get_last_sptep_lockless to
> > grab the lowest level sptep.
> >
> > Suggested-by: Ben Gardon <bgardon@google.com>
> > Signed-off-by: David Matlack <dmatlack@google.com>
> > ---
> >  arch/x86/kvm/mmu/mmu.c     | 55 +++++++++++++++++++++++++++-----------
> >  arch/x86/kvm/mmu/tdp_mmu.c | 36 +++++++++++++++++++++++++
> >  arch/x86/kvm/mmu/tdp_mmu.h |  2 ++
> >  3 files changed, 78 insertions(+), 15 deletions(-)
> >
> [...]
>
> > +	if (sptep)
> > +		/*
> > +		 * Perform the rcu dereference here since we are passing the
> > +		 * sptep up to the generic MMU code which does not know the
> > +		 * synchronization details of the TDP MMU. This is safe as long
> > +		 * as the caller obeys the contract that the sptep is not used
> > +		 * after kvm_tdp_mmu_walk_shadow_page_lockless_end.
> > +		 */
>
> There's a little more to this contract:
> 1. The caller should only modify the SPTE using an atomic cmpxchg with
>    the returned spte value.
> 2. The caller should not modify the mapped PFN or present <-> not
>    present state of the SPTE.
> 3. There are other bits the caller can't modify too. (lpage, mt, etc.)
>
> If the comments on this function don't document all the constraints on
> how the returned sptep can be used, it might be safer to specify that
> this is only meant to be used as part of the fast page fault handler.

I think documenting that this is only meant to be used as part of the fast
page fault handler is a simpler and less brittle approach.

I can also change the function names so there is no ambiguity that it is
meant for fast page fault handling. For example:
kvm_tdp_mmu_fast_pf_get_last_sptep().

> [...]
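A sketch of what that proposed rename might look like in tdp_mmu.h (the new
name is the one suggested above; this hunk is illustrative, not a quote from
a posted patch):

	-u64 *kvm_tdp_mmu_get_last_sptep_lockless(struct kvm_vcpu *vcpu, u64 addr,
	-					  u64 *spte);
	+u64 *kvm_tdp_mmu_fast_pf_get_last_sptep(struct kvm_vcpu *vcpu, u64 addr,
	+					 u64 *spte);

Encoding "fast_pf" in the name makes the intended, and only supported,
caller obvious at every call site.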
On Mon, Jul 12, 2021, Ben Gardon wrote:
> > diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> > index c6fa8d00bf9f..2c9e0ed71fa0 100644
> > --- a/arch/x86/kvm/mmu/tdp_mmu.c
> > +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> > @@ -527,6 +527,10 @@ static inline bool tdp_mmu_set_spte_atomic_no_dirty_log(struct kvm *kvm,
> >  	if (is_removed_spte(iter->old_spte))
> >  		return false;
> >
> > +	/*
> > +	 * TDP MMU sptes can also be concurrently cmpxchg'd in
> > +	 * fast_pf_fix_direct_spte as part of fast_page_fault.
> > +	 */

The cmpxchg64 part isn't what's interesting, it's just the means to the end.
Maybe reword slightly to focus on modifying SPTEs without holding mmu_lock, e.g.

	/*
	 * Note, fast_pf_fix_direct_spte() can also modify TDP MMU SPTEs outside
	 * of mmu_lock.
	 */

> >  	if (cmpxchg64(rcu_dereference(iter->sptep), iter->old_spte,
> >  		      new_spte) != iter->old_spte)
> >  		return false;
>
> I'm a little nervous about not going through the handle_changed_spte
> flow for the TDP MMU, but as things are now, I think it's safe.

Ya, it would be nice to flow through the TDP MMU proper as we could also
"restore" __rcu.  That said, the fast #PF fix flow is unique and specific
enough that I don't think it's worth going out of our way to force the issue.

> > @@ -1546,3 +1550,35 @@ int kvm_tdp_mmu_get_walk_lockless(struct kvm_vcpu *vcpu, u64 addr, u64 *sptes,
> >
> >  	return leaf;
> >  }
> > +
> > +/*
> > + * Must be called between kvm_tdp_mmu_walk_shadow_page_lockless_{begin,end}.
> > + *
> > + * The returned sptep must not be used after
> > + * kvm_tdp_mmu_walk_shadow_page_lockless_end.
> > + */
> > +u64 *kvm_tdp_mmu_get_last_sptep_lockless(struct kvm_vcpu *vcpu, u64 addr,
> > +					 u64 *spte)
> > +{
> > +	struct tdp_iter iter;
> > +	struct kvm_mmu *mmu = vcpu->arch.mmu;
> > +	gfn_t gfn = addr >> PAGE_SHIFT;
> > +	tdp_ptep_t sptep = NULL;
> > +
> > +	tdp_mmu_for_each_pte(iter, mmu, gfn, gfn + 1) {
> > +		*spte = iter.old_spte;
> > +		sptep = iter.sptep;
> > +	}
> > +
> > +	if (sptep)

This check is unnecessary, even when using rcu_dereference.

> > +		/*
> > +		 * Perform the rcu dereference here since we are passing the
> > +		 * sptep up to the generic MMU code which does not know the
> > +		 * synchronization details of the TDP MMU. This is safe as long
> > +		 * as the caller obeys the contract that the sptep is not used
> > +		 * after kvm_tdp_mmu_walk_shadow_page_lockless_end.
> > +		 */
>
> There's a little more to this contract:
> 1. The caller should only modify the SPTE using an atomic cmpxchg with
>    the returned spte value.
> 2. The caller should not modify the mapped PFN or present <-> not
>    present state of the SPTE.
> 3. There are other bits the caller can't modify too. (lpage, mt, etc.)
>
> If the comments on this function don't document all the constraints on
> how the returned sptep can be used, it might be safer to specify that
> this is only meant to be used as part of the fast page fault handler.

Or maybe a less specific, but more scary comment?

>
> > +	return rcu_dereference(sptep);

I still vote to use "(__force u64 *)" instead of rcu_dereference() to make it
clear we're cheating in order to share code with the legacy MMU.

	/*
	 * Squash the __rcu annotation, the legacy MMU doesn't rely on RCU to
	 * protect its page tables and so the common MMU code doesn't preserve
	 * the annotation.
	 *
	 * It goes without saying, but the caller must honor all TDP MMU
	 * contracts for accessing/modifying SPTEs outside of mmu_lock.
	 */
	return (__force u64 *)sptep;

> > +	return NULL;
> > +}
> > diff --git a/arch/x86/kvm/mmu/tdp_mmu.h b/arch/x86/kvm/mmu/tdp_mmu.h
> > index e9dde5f9c0ef..508a23bdf7da 100644
> > --- a/arch/x86/kvm/mmu/tdp_mmu.h
> > +++ b/arch/x86/kvm/mmu/tdp_mmu.h
> > @@ -81,6 +81,8 @@ void kvm_tdp_mmu_walk_lockless_begin(void);
> >  void kvm_tdp_mmu_walk_lockless_end(void);
> >  int kvm_tdp_mmu_get_walk_lockless(struct kvm_vcpu *vcpu, u64 addr, u64 *sptes,
> >  				  int *root_level);
> > +u64 *kvm_tdp_mmu_get_last_sptep_lockless(struct kvm_vcpu *vcpu, u64 addr,
> > +					 u64 *spte);
> >
> >  #ifdef CONFIG_X86_64
> >  bool kvm_mmu_init_tdp_mmu(struct kvm *kvm);
> > --
> > 2.32.0.93.g670b81a890-goog
> >
On Mon, Jul 12, 2021 at 09:03:11PM +0000, Sean Christopherson wrote:
> On Mon, Jul 12, 2021, Ben Gardon wrote:
> > > +	/*
> > > +	 * TDP MMU sptes can also be concurrently cmpxchg'd in
> > > +	 * fast_pf_fix_direct_spte as part of fast_page_fault.
> > > +	 */
>
> The cmpxchg64 part isn't what's interesting, it's just the means to the end.
> Maybe reword slightly to focus on modifying SPTEs without holding mmu_lock, e.g.
>
> 	/*
> 	 * Note, fast_pf_fix_direct_spte() can also modify TDP MMU SPTEs outside
> 	 * of mmu_lock.
> 	 */

Good point about cmpxchg. I'll use your comment in v3.

> [...]
>
> > > +	if (sptep)
>
> This check is unnecessary, even when using rcu_dereference.

Ack. Will fix.

> [...]
>
> > > +	return rcu_dereference(sptep);
>
> I still vote to use "(__force u64 *)" instead of rcu_dereference() to make it
> clear we're cheating in order to share code with the legacy MMU.

Some downsides I see of using __force:

 - The implementation of rcu_dereference() is non-trivial. I'm not sure
   how much of it we have to re-implement here. For example, should we
   use READ_ONCE() in addition to the type cast?

 - rcu_dereference() checks if the rcu read lock is held and also calls
   rcu_check_sparse, which seem like useful debugging checks we'd miss
   out on.

I think a big comment should be sufficient to draw the reader's eyes and
explain [the extent to which :)] we are cheating.

>
> 	/*
> 	 * Squash the __rcu annotation, the legacy MMU doesn't rely on RCU to
> 	 * protect its page tables and so the common MMU code doesn't preserve
> 	 * the annotation.
> 	 *
> 	 * It goes without saying, but the caller must honor all TDP MMU
> 	 * contracts for accessing/modifying SPTEs outside of mmu_lock.
> 	 */
> 	return (__force u64 *)sptep;
>
> [...]
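For readers following the thread, the two return styles being weighed above
are, side by side (simplified sketches of just the return path, with the
surrounding function omitted; the comments summarizing the tradeoff are
editorial, not from the thread):

	/* Sean's suggestion: explicitly squash the __rcu annotation. */
	return (__force u64 *)sptep;

	/*
	 * David's preference: rcu_dereference() keeps the lockdep check that
	 * the RCU read lock is held, the sparse annotation check, and the
	 * READ_ONCE() that an open-coded cast would have to re-implement.
	 */
	return rcu_dereference(sptep);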
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 88c71a8a55f1..1d410278a4cc 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -3105,15 +3105,45 @@ static bool is_access_allowed(u32 fault_err_code, u64 spte)
 	return spte & PT_PRESENT_MASK;
 }
 
+/*
+ * Returns the last level spte pointer of the shadow page walk for the given
+ * gpa, and sets *spte to the spte value. This spte may be non-preset.
+ *
+ * If no walk could be performed, returns NULL and *spte does not contain valid
+ * data.
+ *
+ * Constraints:
+ *  - Must be called between walk_shadow_page_lockless_{begin,end}.
+ *  - The returned sptep must not be used after walk_shadow_page_lockless_end.
+ */
+u64 *get_last_sptep_lockless(struct kvm_vcpu *vcpu, gpa_t gpa, u64 *spte)
+{
+	struct kvm_shadow_walk_iterator iterator;
+	u64 old_spte;
+	u64 *sptep = NULL;
+
+	if (is_tdp_mmu(vcpu->arch.mmu))
+		return kvm_tdp_mmu_get_last_sptep_lockless(vcpu, gpa, spte);
+
+	for_each_shadow_entry_lockless(vcpu, gpa, iterator, old_spte) {
+		sptep = iterator.sptep;
+		*spte = old_spte;
+
+		if (!is_shadow_present_pte(old_spte))
+			break;
+	}
+
+	return sptep;
+}
+
 /*
  * Returns one of RET_PF_INVALID, RET_PF_FIXED or RET_PF_SPURIOUS.
  */
 static int fast_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code)
 {
-	struct kvm_shadow_walk_iterator iterator;
-	struct kvm_mmu_page *sp;
 	int ret = RET_PF_INVALID;
 	u64 spte = 0ull;
+	u64 *sptep = NULL;
 	uint retry_count = 0;
 
 	if (!page_fault_can_be_fast(error_code))
@@ -3122,16 +3152,14 @@ static int fast_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code)
 	walk_shadow_page_lockless_begin(vcpu);
 
 	do {
+		struct kvm_mmu_page *sp;
 		u64 new_spte;
 
-		for_each_shadow_entry_lockless(vcpu, gpa, iterator, spte)
-			if (!is_shadow_present_pte(spte))
-				break;
-
+		sptep = get_last_sptep_lockless(vcpu, gpa, &spte);
 		if (!is_shadow_present_pte(spte))
 			break;
 
-		sp = sptep_to_sp(iterator.sptep);
+		sp = sptep_to_sp(sptep);
 		if (!is_last_spte(spte, sp->role.level))
 			break;
 
@@ -3189,8 +3217,7 @@ static int fast_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code)
 		 * since the gfn is not stable for indirect shadow page. See
 		 * Documentation/virt/kvm/locking.rst to get more detail.
 		 */
-		if (fast_pf_fix_direct_spte(vcpu, sp, iterator.sptep, spte,
-					    new_spte)) {
+		if (fast_pf_fix_direct_spte(vcpu, sp, sptep, spte, new_spte)) {
 			ret = RET_PF_FIXED;
 			break;
 		}
@@ -3203,7 +3230,7 @@ static int fast_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code)
 
 	} while (true);
 
-	trace_fast_page_fault(vcpu, gpa, error_code, iterator.sptep, spte, ret);
+	trace_fast_page_fault(vcpu, gpa, error_code, sptep, spte, ret);
 	walk_shadow_page_lockless_end(vcpu);
 
 	return ret;
@@ -3838,11 +3865,9 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
 	if (page_fault_handle_page_track(vcpu, error_code, gfn))
 		return RET_PF_EMULATE;
 
-	if (!is_tdp_mmu_fault) {
-		r = fast_page_fault(vcpu, gpa, error_code);
-		if (r != RET_PF_INVALID)
-			return r;
-	}
+	r = fast_page_fault(vcpu, gpa, error_code);
+	if (r != RET_PF_INVALID)
+		return r;
 
 	r = mmu_topup_memory_caches(vcpu, false);
 	if (r)
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index c6fa8d00bf9f..2c9e0ed71fa0 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -527,6 +527,10 @@ static inline bool tdp_mmu_set_spte_atomic_no_dirty_log(struct kvm *kvm,
 	if (is_removed_spte(iter->old_spte))
 		return false;
 
+	/*
+	 * TDP MMU sptes can also be concurrently cmpxchg'd in
+	 * fast_pf_fix_direct_spte as part of fast_page_fault.
+	 */
 	if (cmpxchg64(rcu_dereference(iter->sptep), iter->old_spte,
 		      new_spte) != iter->old_spte)
 		return false;
@@ -1546,3 +1550,35 @@ int kvm_tdp_mmu_get_walk_lockless(struct kvm_vcpu *vcpu, u64 addr, u64 *sptes,
 
 	return leaf;
 }
+
+/*
+ * Must be called between kvm_tdp_mmu_walk_shadow_page_lockless_{begin,end}.
+ *
+ * The returned sptep must not be used after
+ * kvm_tdp_mmu_walk_shadow_page_lockless_end.
+ */
+u64 *kvm_tdp_mmu_get_last_sptep_lockless(struct kvm_vcpu *vcpu, u64 addr,
+					 u64 *spte)
+{
+	struct tdp_iter iter;
+	struct kvm_mmu *mmu = vcpu->arch.mmu;
+	gfn_t gfn = addr >> PAGE_SHIFT;
+	tdp_ptep_t sptep = NULL;
+
+	tdp_mmu_for_each_pte(iter, mmu, gfn, gfn + 1) {
+		*spte = iter.old_spte;
+		sptep = iter.sptep;
+	}
+
+	if (sptep)
+		/*
+		 * Perform the rcu dereference here since we are passing the
+		 * sptep up to the generic MMU code which does not know the
+		 * synchronization details of the TDP MMU. This is safe as long
+		 * as the caller obeys the contract that the sptep is not used
+		 * after kvm_tdp_mmu_walk_shadow_page_lockless_end.
+		 */
+		return rcu_dereference(sptep);
+
+	return NULL;
+}
diff --git a/arch/x86/kvm/mmu/tdp_mmu.h b/arch/x86/kvm/mmu/tdp_mmu.h
index e9dde5f9c0ef..508a23bdf7da 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.h
+++ b/arch/x86/kvm/mmu/tdp_mmu.h
@@ -81,6 +81,8 @@ void kvm_tdp_mmu_walk_lockless_begin(void);
 void kvm_tdp_mmu_walk_lockless_end(void);
 int kvm_tdp_mmu_get_walk_lockless(struct kvm_vcpu *vcpu, u64 addr, u64 *sptes,
 				  int *root_level);
+u64 *kvm_tdp_mmu_get_last_sptep_lockless(struct kvm_vcpu *vcpu, u64 addr,
+					 u64 *spte);
 
 #ifdef CONFIG_X86_64
 bool kvm_mmu_init_tdp_mmu(struct kvm *kvm);
Make fast_page_fault interoperate with the TDP MMU by leveraging
walk_shadow_page_lockless_{begin,end} to acquire the RCU read lock and
introducing a new helper function kvm_tdp_mmu_get_last_sptep_lockless to
grab the lowest level sptep.

Suggested-by: Ben Gardon <bgardon@google.com>
Signed-off-by: David Matlack <dmatlack@google.com>
---
 arch/x86/kvm/mmu/mmu.c     | 55 +++++++++++++++++++++++++++-----------
 arch/x86/kvm/mmu/tdp_mmu.c | 36 +++++++++++++++++++++++++
 arch/x86/kvm/mmu/tdp_mmu.h |  2 ++
 3 files changed, 78 insertions(+), 15 deletions(-)