From patchwork Fri Dec 10 16:36:20 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: David Woodhouse X-Patchwork-Id: 12670147 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 7A36BC433EF for ; Fri, 10 Dec 2021 16:36:49 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S243671AbhLJQkY (ORCPT ); Fri, 10 Dec 2021 11:40:24 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:58300 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S243659AbhLJQkU (ORCPT ); Fri, 10 Dec 2021 11:40:20 -0500 Received: from casper.infradead.org (casper.infradead.org [IPv6:2001:8b0:10b:1236::1]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 21692C061746 for ; Fri, 10 Dec 2021 08:36:45 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed; d=infradead.org; s=casper.20170209; h=Sender:Content-Transfer-Encoding: MIME-Version:References:In-Reply-To:Message-Id:Date:Subject:Cc:To:From: Reply-To:Content-Type:Content-ID:Content-Description; bh=dq6kCfTobkJAf9urX0UZuHspNC1NMoB5EfP9wWTBNzw=; b=kQ247fqZ0qEzHQWiwCe7gUNMt9 1dV51vfwVzhk54FRiZtGesRAqlkpfr13lQzqz7xB9c8g3zq82M6uMFnMhFuthf83KqzKYGAMRvBiE OtSJK5KqhYLfi4rOaPpVZ8PUZuUpk0EBzHWWerCFe9j/Ax8cSn6ePorS1dDogK92H77lz3aFNN0QJ JToEHa2DcEet1YLNGNqov5XhkC3PVKYbuh66qkCEa7iaEbi+6bVmLuPT/AeIXtV6b6/hYvUDVsewd WKDDMZLLldUpEVhLOtB2Cn2BM/Cfd1EToWv/8NhUyBLDgQz6T1BWqKLg6TssQR6dewZI/6TCKWyYN Af4G/stw==; Received: from i7.infradead.org ([2001:8b0:10b:1:21e:67ff:fecb:7a92]) by casper.infradead.org with esmtpsa (Exim 4.94.2 #2 (Red Hat Linux)) id 1mvisk-00AUQu-Nb; Fri, 10 Dec 2021 16:36:27 +0000 Received: from dwoodhou by i7.infradead.org with local (Exim 4.94.2 #2 (Red Hat Linux)) id 1mvisl-0000lX-0J; Fri, 10 Dec 2021 16:36:27 +0000 From: David Woodhouse To: Paolo Bonzini , kvm Cc: Boris Ostrovsky , Joao Martins , "jmattson @ google . com" , "wanpengli @ tencent . com" , "seanjc @ google . com" , "vkuznets @ redhat . com" , "mtosatti @ redhat . com" , "joro @ 8bytes . org" , karahmed@amazon.com, butt3rflyh4ck Subject: [PATCH v6 1/6] KVM: Warn if mark_page_dirty() is called without an active vCPU Date: Fri, 10 Dec 2021 16:36:20 +0000 Message-Id: <20211210163625.2886-2-dwmw2@infradead.org> X-Mailer: git-send-email 2.31.1 In-Reply-To: <20211210163625.2886-1-dwmw2@infradead.org> References: <20211210163625.2886-1-dwmw2@infradead.org> MIME-Version: 1.0 Sender: David Woodhouse X-SRS-Rewrite: SMTP reverse-path rewritten from by casper.infradead.org. See http://www.infradead.org/rpr.html Precedence: bulk List-ID: X-Mailing-List: kvm@vger.kernel.org From: David Woodhouse The various kvm_write_guest() and mark_page_dirty() functions must only ever be called in the context of an active vCPU, because if dirty ring tracking is enabled it may simply oops when kvm_get_running_vcpu() returns NULL for the vcpu and then kvm_dirty_ring_get() dereferences it. This oops was reported by "butt3rflyh4ck" in https://lore.kernel.org/kvm/CAFcO6XOmoS7EacN_n6v4Txk7xL7iqRa2gABg3F7E3Naf5uG94g@mail.gmail.com/ That actual bug will be fixed under separate cover but this warning should help to prevent new ones from being added. Signed-off-by: David Woodhouse --- include/linux/kvm_dirty_ring.h | 6 ------ virt/kvm/dirty_ring.c | 9 --------- virt/kvm/kvm_main.c | 7 ++++++- 3 files changed, 6 insertions(+), 16 deletions(-) diff --git a/include/linux/kvm_dirty_ring.h b/include/linux/kvm_dirty_ring.h index 4da8d4a4140b..906f899813dc 100644 --- a/include/linux/kvm_dirty_ring.h +++ b/include/linux/kvm_dirty_ring.h @@ -43,11 +43,6 @@ static inline int kvm_dirty_ring_alloc(struct kvm_dirty_ring *ring, return 0; } -static inline struct kvm_dirty_ring *kvm_dirty_ring_get(struct kvm *kvm) -{ - return NULL; -} - static inline int kvm_dirty_ring_reset(struct kvm *kvm, struct kvm_dirty_ring *ring) { @@ -78,7 +73,6 @@ static inline bool kvm_dirty_ring_soft_full(struct kvm_dirty_ring *ring) u32 kvm_dirty_ring_get_rsvd_entries(void); int kvm_dirty_ring_alloc(struct kvm_dirty_ring *ring, int index, u32 size); -struct kvm_dirty_ring *kvm_dirty_ring_get(struct kvm *kvm); /* * called with kvm->slots_lock held, returns the number of diff --git a/virt/kvm/dirty_ring.c b/virt/kvm/dirty_ring.c index 88f4683198ea..8e9874760fb3 100644 --- a/virt/kvm/dirty_ring.c +++ b/virt/kvm/dirty_ring.c @@ -36,15 +36,6 @@ static bool kvm_dirty_ring_full(struct kvm_dirty_ring *ring) return kvm_dirty_ring_used(ring) >= ring->size; } -struct kvm_dirty_ring *kvm_dirty_ring_get(struct kvm *kvm) -{ - struct kvm_vcpu *vcpu = kvm_get_running_vcpu(); - - WARN_ON_ONCE(vcpu->kvm != kvm); - - return &vcpu->dirty_ring; -} - static void kvm_reset_dirty_gfn(struct kvm *kvm, u32 slot, u64 offset, u64 mask) { struct kvm_memory_slot *memslot; diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c index b0f7e6eb00ff..af5b4427b139 100644 --- a/virt/kvm/kvm_main.c +++ b/virt/kvm/kvm_main.c @@ -3155,12 +3155,17 @@ void mark_page_dirty_in_slot(struct kvm *kvm, const struct kvm_memory_slot *memslot, gfn_t gfn) { + struct kvm_vcpu *vcpu = kvm_get_running_vcpu(); + + if (WARN_ON_ONCE(!vcpu) || WARN_ON_ONCE(vcpu->kvm != kvm)) + return; + if (memslot && kvm_slot_dirty_track_enabled(memslot)) { unsigned long rel_gfn = gfn - memslot->base_gfn; u32 slot = (memslot->as_id << 16) | memslot->id; if (kvm->dirty_ring_size) - kvm_dirty_ring_push(kvm_dirty_ring_get(kvm), + kvm_dirty_ring_push(&vcpu->dirty_ring, slot, rel_gfn); else set_bit_le(rel_gfn, memslot->dirty_bitmap); From patchwork Fri Dec 10 16:36:21 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 8bit X-Patchwork-Submitter: David Woodhouse X-Patchwork-Id: 12670143 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 8D9C9C433F5 for ; Fri, 10 Dec 2021 16:36:46 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S243683AbhLJQkU (ORCPT ); Fri, 10 Dec 2021 11:40:20 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:58288 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S243661AbhLJQkT (ORCPT ); Fri, 10 Dec 2021 11:40:19 -0500 Received: from desiato.infradead.org (desiato.infradead.org [IPv6:2001:8b0:10b:1:d65d:64ff:fe57:4e05]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id BBB53C061746 for ; Fri, 10 Dec 2021 08:36:43 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed; d=infradead.org; s=desiato.20200630; h=Sender:Content-Transfer-Encoding: Content-Type:MIME-Version:References:In-Reply-To:Message-Id:Date:Subject:Cc: To:From:Reply-To:Content-ID:Content-Description; bh=8GoQF05q19SkunrVrZ9lHSZN8gL8MskKyFUEbZqhoyo=; b=BvmOVBou1Yh9AZLnh92bU3LrFO CVtRat4h7ScCcm7TFtod8TNztfTEGSezIMt4A0CG3RQnwQimibohef4BtnGis38Iqdlu0+npI+18K J3n42Iyr9NXtOqELY61gElWVX6G+PLXFD0qkR0s0KSmgjnsb0Sly2SDhNQzAHL3qC1iTlP3B8vgah rJKW6WpWCdiw5UJkA8PBa3hwIZ4W7GRq6Cr0U+gvx5PkhiBoOwSDVibjVvTB/f8iH+N4Ivh64gOkt uluS5nLqEo2F5vui+ITdlc6819SulsYFCJhqv7Wi4YOdiKBZOgSlp+xm++y0ToIEgbprnNVhzjowx WZ04mIsQ==; Received: from i7.infradead.org ([2001:8b0:10b:1:21e:67ff:fecb:7a92]) by desiato.infradead.org with esmtpsa (Exim 4.94.2 #2 (Red Hat Linux)) id 1mvisl-000abt-By; Fri, 10 Dec 2021 16:36:27 +0000 Received: from dwoodhou by i7.infradead.org with local (Exim 4.94.2 #2 (Red Hat Linux)) id 1mvisl-0000la-1k; Fri, 10 Dec 2021 16:36:27 +0000 From: David Woodhouse To: Paolo Bonzini , kvm Cc: Boris Ostrovsky , Joao Martins , "jmattson @ google . com" , "wanpengli @ tencent . com" , "seanjc @ google . com" , "vkuznets @ redhat . com" , "mtosatti @ redhat . com" , "joro @ 8bytes . org" , karahmed@amazon.com, butt3rflyh4ck Subject: [PATCH v6 2/6] KVM: Reinstate gfn_to_pfn_cache with invalidation support Date: Fri, 10 Dec 2021 16:36:21 +0000 Message-Id: <20211210163625.2886-3-dwmw2@infradead.org> X-Mailer: git-send-email 2.31.1 In-Reply-To: <20211210163625.2886-1-dwmw2@infradead.org> References: <20211210163625.2886-1-dwmw2@infradead.org> MIME-Version: 1.0 Sender: David Woodhouse X-SRS-Rewrite: SMTP reverse-path rewritten from by desiato.infradead.org. See http://www.infradead.org/rpr.html Precedence: bulk List-ID: X-Mailing-List: kvm@vger.kernel.org From: David Woodhouse This can be used in two modes. There is an atomic mode where the cached mapping is accessed while holding the rwlock, and a mode where the physical address is used by a vCPU in guest mode. For the latter case, an invalidation will wake the vCPU with the new KVM_REQ_GPC_INVALIDATE, and the architecture will need to refresh any caches it still needs to access before entering guest mode again. Only one vCPU can be targeted by the wake requests; it's simple enough to make it wake all vCPUs or even a mask but I don't see a use case for that additional complexity right now. Invalidation happens from the invalidate_range_start MMU notifier, which needs to be able to sleep in order to wake the vCPU and wait for it. This means that revalidation potentially needs to "wait" for the MMU operation to complete and the invalidate_range_end notifier to be invoked. Like the vCPU when it takes a page fault in that period, we just spin — fixing that in a future patch by implementing an actual *wait* may be another part of shaving this particularly hirsute yak. As noted in the comments in the function itself, the only case where the invalidate_range_start notifier is expected to be called *without* being able to sleep is when the OOM reaper is killing the process. In that case, we expect the vCPU threads already to have exited, and thus there will be nothing to wake, and no reason to wait. So we clear the KVM_REQUEST_WAIT bit and send the request anyway, then complain loudly if there actually *was* anything to wake up. Signed-off-by: David Woodhouse --- arch/x86/kvm/Kconfig | 1 + include/linux/kvm_host.h | 103 ++++++++++++ include/linux/kvm_types.h | 18 ++ virt/kvm/Kconfig | 3 + virt/kvm/Makefile.kvm | 1 + virt/kvm/dirty_ring.c | 2 +- virt/kvm/kvm_main.c | 12 +- virt/kvm/kvm_mm.h | 44 +++++ virt/kvm/mmu_lock.h | 23 --- virt/kvm/pfncache.c | 337 ++++++++++++++++++++++++++++++++++++++ 10 files changed, 517 insertions(+), 27 deletions(-) create mode 100644 virt/kvm/kvm_mm.h delete mode 100644 virt/kvm/mmu_lock.h create mode 100644 virt/kvm/pfncache.c diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig index 03b2ce34e7f4..ebc8ce9ec917 100644 --- a/arch/x86/kvm/Kconfig +++ b/arch/x86/kvm/Kconfig @@ -26,6 +26,7 @@ config KVM select PREEMPT_NOTIFIERS select MMU_NOTIFIER select HAVE_KVM_IRQCHIP + select HAVE_KVM_PFNCACHE select HAVE_KVM_IRQFD select HAVE_KVM_DIRTY_RING select IRQ_BYPASS_MANAGER diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h index f8ed799e8674..b56d3f4fb2fd 100644 --- a/include/linux/kvm_host.h +++ b/include/linux/kvm_host.h @@ -155,6 +155,7 @@ static inline bool is_error_page(struct page *page) #define KVM_REQ_UNBLOCK 2 #define KVM_REQ_UNHALT 3 #define KVM_REQ_VM_DEAD (4 | KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP) +#define KVM_REQ_GPC_INVALIDATE (5 | KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP) #define KVM_REQUEST_ARCH_BASE 8 #define KVM_ARCH_REQ_FLAGS(nr, flags) ({ \ @@ -593,6 +594,10 @@ struct kvm { unsigned long mn_active_invalidate_count; struct rcuwait mn_memslots_update_rcuwait; + /* For management / invalidation of gfn_to_pfn_caches */ + spinlock_t gpc_lock; + struct list_head gpc_list; + /* * created_vcpus is protected by kvm->lock, and is incremented * at the beginning of KVM_CREATE_VCPU. online_vcpus is only @@ -1099,6 +1104,104 @@ int kvm_vcpu_write_guest(struct kvm_vcpu *vcpu, gpa_t gpa, const void *data, unsigned long len); void kvm_vcpu_mark_page_dirty(struct kvm_vcpu *vcpu, gfn_t gfn); +/** + * kvm_gfn_to_pfn_cache_init - prepare a cached kernel mapping and HPA for a + * given guest physical address. + * + * @kvm: pointer to kvm instance. + * @gpc: struct gfn_to_pfn_cache object. + * @vcpu: vCPU to be used for marking pages dirty and to be woken on + * invalidation. + * @guest_uses_pa: indicates that the resulting host physical PFN is used while + * @vcpu is IN_GUEST_MODE so invalidations should wake it. + * @kernel_map: requests a kernel virtual mapping (kmap / memremap). + * @gpa: guest physical address to map. + * @len: sanity check; the range being access must fit a single page. + * @dirty: mark the cache dirty immediately. + * + * @return: 0 for success. + * -EINVAL for a mapping which would cross a page boundary. + * -EFAULT for an untranslatable guest physical address. + * + * This primes a gfn_to_pfn_cache and links it into the @kvm's list for + * invalidations to be processed. Invalidation callbacks to @vcpu using + * %KVM_REQ_GPC_INVALIDATE will occur only for MMU notifiers, not for KVM + * memslot changes. Callers are required to use kvm_gfn_to_pfn_cache_check() + * to ensure that the cache is valid before accessing the target page. + */ +int kvm_gfn_to_pfn_cache_init(struct kvm *kvm, struct gfn_to_pfn_cache *gpc, + struct kvm_vcpu *vcpu, bool guest_uses_pa, + bool kernel_map, gpa_t gpa, unsigned long len, + bool dirty); + +/** + * kvm_gfn_to_pfn_cache_check - check validity of a gfn_to_pfn_cache. + * + * @kvm: pointer to kvm instance. + * @gpc: struct gfn_to_pfn_cache object. + * @gpa: current guest physical address to map. + * @len: sanity check; the range being access must fit a single page. + * @dirty: mark the cache dirty immediately. + * + * @return: %true if the cache is still valid and the address matches. + * %false if the cache is not valid. + * + * Callers outside IN_GUEST_MODE context should hold a read lock on @gpc->lock + * while calling this function, and then continue to hold the lock until the + * access is complete. + * + * Callers in IN_GUEST_MODE may do so without locking, although they should + * still hold a read lock on kvm->scru for the memslot checks. + */ +bool kvm_gfn_to_pfn_cache_check(struct kvm *kvm, struct gfn_to_pfn_cache *gpc, + gpa_t gpa, unsigned long len); + +/** + * kvm_gfn_to_pfn_cache_refresh - update a previously initialized cache. + * + * @kvm: pointer to kvm instance. + * @gpc: struct gfn_to_pfn_cache object. + * @gpa: updated guest physical address to map. + * @len: sanity check; the range being access must fit a single page. + * @dirty: mark the cache dirty immediately. + * + * @return: 0 for success. + * -EINVAL for a mapping which would cross a page boundary. + * -EFAULT for an untranslatable guest physical address. + * + * This will attempt to refresh a gfn_to_pfn_cache. Note that a successful + * returm from this function does not mean the page can be immediately + * accessed because it may have raced with an invalidation. Callers must + * still lock and check the cache status, as this function does not return + * with the lock still held to permit access. + */ +int kvm_gfn_to_pfn_cache_refresh(struct kvm *kvm, struct gfn_to_pfn_cache *gpc, + gpa_t gpa, unsigned long len, bool dirty); + +/** + * kvm_gfn_to_pfn_cache_unmap - temporarily unmap a gfn_to_pfn_cache. + * + * @kvm: pointer to kvm instance. + * @gpc: struct gfn_to_pfn_cache object. + * + * This unmaps the referenced page and marks it dirty, if appropriate. The + * cache is left in the invalid state but at least the mapping from GPA to + * userspace HVA will remain cached and can be reused on a subsequent + * refresh. + */ +void kvm_gfn_to_pfn_cache_unmap(struct kvm *kvm, struct gfn_to_pfn_cache *gpc); + +/** + * kvm_gfn_to_pfn_cache_destroy - destroy and unlink a gfn_to_pfn_cache. + * + * @kvm: pointer to kvm instance. + * @gpc: struct gfn_to_pfn_cache object. + * + * This removes a cache from the @kvm's list to be processed on MMU notifier + * invocation. + */ +void kvm_gfn_to_pfn_cache_destroy(struct kvm *kvm, struct gfn_to_pfn_cache *gpc); + void kvm_sigset_activate(struct kvm_vcpu *vcpu); void kvm_sigset_deactivate(struct kvm_vcpu *vcpu); diff --git a/include/linux/kvm_types.h b/include/linux/kvm_types.h index 888ef12862c9..dceac12c1ce5 100644 --- a/include/linux/kvm_types.h +++ b/include/linux/kvm_types.h @@ -19,6 +19,7 @@ struct kvm_memslots; enum kvm_mr_change; #include +#include #include @@ -53,6 +54,23 @@ struct gfn_to_hva_cache { struct kvm_memory_slot *memslot; }; +struct gfn_to_pfn_cache { + u64 generation; + gpa_t gpa; + unsigned long uhva; + struct kvm_memory_slot *memslot; + struct kvm_vcpu *vcpu; + struct list_head list; + rwlock_t lock; + void *khva; + kvm_pfn_t pfn; + bool active; + bool valid; + bool dirty; + bool kernel_map; + bool guest_uses_pa; +}; + #ifdef KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE /* * Memory caches are used to preallocate memory ahead of various MMU flows, diff --git a/virt/kvm/Kconfig b/virt/kvm/Kconfig index 97cf5413ac25..f4834c20e4a6 100644 --- a/virt/kvm/Kconfig +++ b/virt/kvm/Kconfig @@ -4,6 +4,9 @@ config HAVE_KVM bool +config HAVE_KVM_PFNCACHE + bool + config HAVE_KVM_IRQCHIP bool diff --git a/virt/kvm/Makefile.kvm b/virt/kvm/Makefile.kvm index ffdcad3cc97a..2c27d5d0c367 100644 --- a/virt/kvm/Makefile.kvm +++ b/virt/kvm/Makefile.kvm @@ -11,3 +11,4 @@ kvm-$(CONFIG_KVM_MMIO) += $(KVM)/coalesced_mmio.o kvm-$(CONFIG_KVM_ASYNC_PF) += $(KVM)/async_pf.o kvm-$(CONFIG_HAVE_KVM_IRQ_ROUTING) += $(KVM)/irqchip.o kvm-$(CONFIG_HAVE_KVM_DIRTY_RING) += $(KVM)/dirty_ring.o +kvm-$(CONFIG_HAVE_KVM_PFNCACHE) += $(KVM)/pfncache.o diff --git a/virt/kvm/dirty_ring.c b/virt/kvm/dirty_ring.c index 8e9874760fb3..222ecc81d7df 100644 --- a/virt/kvm/dirty_ring.c +++ b/virt/kvm/dirty_ring.c @@ -9,7 +9,7 @@ #include #include #include -#include "mmu_lock.h" +#include "kvm_mm.h" int __weak kvm_cpu_dirty_log_size(void) { diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c index af5b4427b139..6e8e9d36f382 100644 --- a/virt/kvm/kvm_main.c +++ b/virt/kvm/kvm_main.c @@ -59,7 +59,7 @@ #include "coalesced_mmio.h" #include "async_pf.h" -#include "mmu_lock.h" +#include "kvm_mm.h" #include "vfio.h" #define CREATE_TRACE_POINTS @@ -711,6 +711,9 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn, kvm->mn_active_invalidate_count++; spin_unlock(&kvm->mn_invalidate_lock); + gfn_to_pfn_cache_invalidate_start(kvm, range->start, range->end, + hva_range.may_block); + __kvm_handle_hva_range(kvm, &hva_range); return 0; @@ -1071,6 +1074,9 @@ static struct kvm *kvm_create_vm(unsigned long type) rcuwait_init(&kvm->mn_memslots_update_rcuwait); xa_init(&kvm->vcpu_array); + INIT_LIST_HEAD(&kvm->gpc_list); + spin_lock_init(&kvm->gpc_lock); + INIT_LIST_HEAD(&kvm->devices); BUILD_BUG_ON(KVM_MEM_SLOTS_NUM > SHRT_MAX); @@ -2539,8 +2545,8 @@ static int hva_to_pfn_remapped(struct vm_area_struct *vma, * 2): @write_fault = false && @writable, @writable will tell the caller * whether the mapping is writable. */ -static kvm_pfn_t hva_to_pfn(unsigned long addr, bool atomic, bool *async, - bool write_fault, bool *writable) +kvm_pfn_t hva_to_pfn(unsigned long addr, bool atomic, bool *async, + bool write_fault, bool *writable) { struct vm_area_struct *vma; kvm_pfn_t pfn = 0; diff --git a/virt/kvm/kvm_mm.h b/virt/kvm/kvm_mm.h new file mode 100644 index 000000000000..34ca40823260 --- /dev/null +++ b/virt/kvm/kvm_mm.h @@ -0,0 +1,44 @@ +// SPDX-License-Identifier: GPL-2.0-only + +#ifndef __KVM_MM_H__ +#define __KVM_MM_H__ 1 + +/* + * Architectures can choose whether to use an rwlock or spinlock + * for the mmu_lock. These macros, for use in common code + * only, avoids using #ifdefs in places that must deal with + * multiple architectures. + */ + +#ifdef KVM_HAVE_MMU_RWLOCK +#define KVM_MMU_LOCK_INIT(kvm) rwlock_init(&(kvm)->mmu_lock) +#define KVM_MMU_LOCK(kvm) write_lock(&(kvm)->mmu_lock) +#define KVM_MMU_UNLOCK(kvm) write_unlock(&(kvm)->mmu_lock) +#define KVM_MMU_READ_LOCK(kvm) read_lock(&(kvm)->mmu_lock) +#define KVM_MMU_READ_UNLOCK(kvm) read_unlock(&(kvm)->mmu_lock) +#else +#define KVM_MMU_LOCK_INIT(kvm) spin_lock_init(&(kvm)->mmu_lock) +#define KVM_MMU_LOCK(kvm) spin_lock(&(kvm)->mmu_lock) +#define KVM_MMU_UNLOCK(kvm) spin_unlock(&(kvm)->mmu_lock) +#define KVM_MMU_READ_LOCK(kvm) spin_lock(&(kvm)->mmu_lock) +#define KVM_MMU_READ_UNLOCK(kvm) spin_unlock(&(kvm)->mmu_lock) +#endif /* KVM_HAVE_MMU_RWLOCK */ + +kvm_pfn_t hva_to_pfn(unsigned long addr, bool atomic, bool *async, + bool write_fault, bool *writable); + +#ifdef CONFIG_HAVE_KVM_PFNCACHE +void gfn_to_pfn_cache_invalidate_start(struct kvm *kvm, + unsigned long start, + unsigned long end, + bool may_block); +#else +static inline void gfn_to_pfn_cache_invalidate_start(struct kvm *kvm, + unsigned long start, + unsigned long end, + bool may_block) +{ +} +#endif /* HAVE_KVM_PFNCACHE */ + +#endif /* __KVM_MM_H__ */ diff --git a/virt/kvm/mmu_lock.h b/virt/kvm/mmu_lock.h deleted file mode 100644 index 9e1308f9734c..000000000000 --- a/virt/kvm/mmu_lock.h +++ /dev/null @@ -1,23 +0,0 @@ -// SPDX-License-Identifier: GPL-2.0-only - -#ifndef KVM_MMU_LOCK_H -#define KVM_MMU_LOCK_H 1 - -/* - * Architectures can choose whether to use an rwlock or spinlock - * for the mmu_lock. These macros, for use in common code - * only, avoids using #ifdefs in places that must deal with - * multiple architectures. - */ - -#ifdef KVM_HAVE_MMU_RWLOCK -#define KVM_MMU_LOCK_INIT(kvm) rwlock_init(&(kvm)->mmu_lock) -#define KVM_MMU_LOCK(kvm) write_lock(&(kvm)->mmu_lock) -#define KVM_MMU_UNLOCK(kvm) write_unlock(&(kvm)->mmu_lock) -#else -#define KVM_MMU_LOCK_INIT(kvm) spin_lock_init(&(kvm)->mmu_lock) -#define KVM_MMU_LOCK(kvm) spin_lock(&(kvm)->mmu_lock) -#define KVM_MMU_UNLOCK(kvm) spin_unlock(&(kvm)->mmu_lock) -#endif /* KVM_HAVE_MMU_RWLOCK */ - -#endif diff --git a/virt/kvm/pfncache.c b/virt/kvm/pfncache.c new file mode 100644 index 000000000000..ce878f4be4da --- /dev/null +++ b/virt/kvm/pfncache.c @@ -0,0 +1,337 @@ +// SPDX-License-Identifier: GPL-2.0-only +/* + * Kernel-based Virtual Machine driver for Linux + * + * This module enables kernel and guest-mode vCPU access to guest physical + * memory with suitable invalidation mechanisms. + * + * Copyright © 2021 Amazon.com, Inc. or its affiliates. + * + * Authors: + * David Woodhouse + */ + +#include +#include +#include +#include +#include + +#include "kvm_mm.h" + +/* + * MMU notifier 'invalidate_range_start' hook. + */ +void gfn_to_pfn_cache_invalidate_start(struct kvm *kvm, unsigned long start, + unsigned long end, bool may_block) +{ + DECLARE_BITMAP(vcpu_bitmap, KVM_MAX_VCPUS); + struct gfn_to_pfn_cache *gpc; + bool wake_vcpus = false; + + spin_lock(&kvm->gpc_lock); + list_for_each_entry(gpc, &kvm->gpc_list, list) { + write_lock_irq(&gpc->lock); + + /* Only a single page so no need to care about length */ + if (gpc->valid && !is_error_noslot_pfn(gpc->pfn) && + gpc->uhva >= start && gpc->uhva < end) { + gpc->valid = false; + + /* + * If a guest vCPU could be using the physical address, + * it needs to be woken. + */ + if (gpc->guest_uses_pa) { + if (!wake_vcpus) { + wake_vcpus = true; + bitmap_zero(vcpu_bitmap, KVM_MAX_VCPUS); + } + __set_bit(gpc->vcpu->vcpu_idx, vcpu_bitmap); + } + + /* + * We cannot call mark_page_dirty() from here because + * this physical CPU might not have an active vCPU + * with which to do the KVM dirty tracking. + * + * Neither is there any point in telling the kernel MM + * that the underlying page is dirty. A vCPU in guest + * mode might still be writing to it up to the point + * where we wake them a few lines further down anyway. + * + * So all the dirty marking happens on the unmap. + */ + } + write_unlock_irq(&gpc->lock); + } + spin_unlock(&kvm->gpc_lock); + + if (wake_vcpus) { + unsigned int req = KVM_REQ_GPC_INVALIDATE; + bool called; + + /* + * If the OOM reaper is active, then all vCPUs should have + * been stopped already, so perform the request without + * KVM_REQUEST_WAIT and be sad if any needed to be woken. + */ + if (!may_block) + req &= ~KVM_REQUEST_WAIT; + + called = kvm_make_vcpus_request_mask(kvm, req, vcpu_bitmap); + + WARN_ON_ONCE(called && !may_block); + } +} + +bool kvm_gfn_to_pfn_cache_check(struct kvm *kvm, struct gfn_to_pfn_cache *gpc, + gpa_t gpa, unsigned long len) +{ + struct kvm_memslots *slots = kvm_memslots(kvm); + + if ((gpa & ~PAGE_MASK) + len > PAGE_SIZE) + return false; + + if (gpc->gpa != gpa || gpc->generation != slots->generation || + kvm_is_error_hva(gpc->uhva)) + return false; + + if (!gpc->valid) + return false; + + return true; +} +EXPORT_SYMBOL_GPL(kvm_gfn_to_pfn_cache_check); + +static void __release_gpc(struct kvm *kvm, kvm_pfn_t pfn, void *khva, + gpa_t gpa, bool dirty) +{ + /* Unmap the old page if it was mapped before, and release it */ + if (!is_error_noslot_pfn(pfn)) { + if (khva) { + if (pfn_valid(pfn)) + kunmap(pfn_to_page(pfn)); +#ifdef CONFIG_HAS_IOMEM + else + memunmap(khva); +#endif + } + + kvm_release_pfn(pfn, dirty); + if (dirty) + mark_page_dirty(kvm, gpa); + } +} + +static kvm_pfn_t hva_to_pfn_retry(struct kvm *kvm, unsigned long uhva) +{ + unsigned long mmu_seq; + kvm_pfn_t new_pfn; + int retry; + + do { + mmu_seq = kvm->mmu_notifier_seq; + smp_rmb(); + + /* We always request a writeable mapping */ + new_pfn = hva_to_pfn(uhva, false, NULL, true, NULL); + if (is_error_noslot_pfn(new_pfn)) + break; + + KVM_MMU_READ_LOCK(kvm); + retry = mmu_notifier_retry_hva(kvm, mmu_seq, uhva); + KVM_MMU_READ_UNLOCK(kvm); + if (!retry) + break; + + cond_resched(); + } while (1); + + return new_pfn; +} + +int kvm_gfn_to_pfn_cache_refresh(struct kvm *kvm, struct gfn_to_pfn_cache *gpc, + gpa_t gpa, unsigned long len, bool dirty) +{ + struct kvm_memslots *slots = kvm_memslots(kvm); + unsigned long page_offset = gpa & ~PAGE_MASK; + kvm_pfn_t old_pfn, new_pfn; + unsigned long old_uhva; + gpa_t old_gpa; + void *old_khva; + bool old_valid, old_dirty; + int ret = 0; + + /* + * If must fit within a single page. The 'len' argument is + * only to enforce that. + */ + if (page_offset + len > PAGE_SIZE) + return -EINVAL; + + write_lock_irq(&gpc->lock); + + old_gpa = gpc->gpa; + old_pfn = gpc->pfn; + old_khva = gpc->khva - offset_in_page(gpc->khva); + old_uhva = gpc->uhva; + old_valid = gpc->valid; + old_dirty = gpc->dirty; + + /* If the userspace HVA is invalid, refresh that first */ + if (gpc->gpa != gpa || gpc->generation != slots->generation || + kvm_is_error_hva(gpc->uhva)) { + gfn_t gfn = gpa_to_gfn(gpa); + + gpc->dirty = false; + gpc->gpa = gpa; + gpc->generation = slots->generation; + gpc->memslot = __gfn_to_memslot(slots, gfn); + gpc->uhva = gfn_to_hva_memslot(gpc->memslot, gfn); + + if (kvm_is_error_hva(gpc->uhva)) { + ret = -EFAULT; + goto out; + } + + gpc->uhva += page_offset; + } + + /* + * If the userspace HVA changed or the PFN was already invalid, + * drop the lock and do the HVA to PFN lookup again. + */ + if (!old_valid || old_uhva != gpc->uhva) { + unsigned long uhva = gpc->uhva; + void *new_khva = NULL; + + /* Placeholders for "hva is valid but not yet mapped" */ + gpc->pfn = KVM_PFN_ERR_FAULT; + gpc->khva = NULL; + gpc->valid = true; + + write_unlock_irq(&gpc->lock); + + new_pfn = hva_to_pfn_retry(kvm, uhva); + if (is_error_noslot_pfn(new_pfn)) { + ret = -EFAULT; + goto map_done; + } + + if (gpc->kernel_map) { + if (new_pfn == old_pfn) { + new_khva = old_khva; + old_pfn = KVM_PFN_ERR_FAULT; + old_khva = NULL; + } else if (pfn_valid(new_pfn)) { + new_khva = kmap(pfn_to_page(new_pfn)); +#ifdef CONFIG_HAS_IOMEM + } else { + new_khva = memremap(pfn_to_hpa(new_pfn), PAGE_SIZE, MEMREMAP_WB); +#endif + } + if (new_khva) + new_khva += page_offset; + else + ret = -EFAULT; + } + + map_done: + write_lock_irq(&gpc->lock); + if (ret) { + gpc->valid = false; + gpc->pfn = KVM_PFN_ERR_FAULT; + gpc->khva = NULL; + } else { + /* At this point, gpc->valid may already have been cleared */ + gpc->pfn = new_pfn; + gpc->khva = new_khva; + } + } else { + /* If the HVA→PFN mapping was already valid, don't unmap it. */ + old_pfn = KVM_PFN_ERR_FAULT; + old_khva = NULL; + } + + out: + if (ret) + gpc->dirty = false; + else + gpc->dirty = dirty; + + write_unlock_irq(&gpc->lock); + + __release_gpc(kvm, old_pfn, old_khva, old_gpa, old_dirty); + + return ret; +} +EXPORT_SYMBOL_GPL(kvm_gfn_to_pfn_cache_refresh); + +void kvm_gfn_to_pfn_cache_unmap(struct kvm *kvm, struct gfn_to_pfn_cache *gpc) +{ + void *old_khva; + kvm_pfn_t old_pfn; + bool old_dirty; + gpa_t old_gpa; + + write_lock_irq(&gpc->lock); + + gpc->valid = false; + + old_khva = gpc->khva - offset_in_page(gpc->khva); + old_dirty = gpc->dirty; + old_gpa = gpc->gpa; + old_pfn = gpc->pfn; + + /* + * We can leave the GPA → uHVA map cache intact but the PFN + * lookup will need to be redone even for the same page. + */ + gpc->khva = NULL; + gpc->pfn = KVM_PFN_ERR_FAULT; + + write_unlock_irq(&gpc->lock); + + __release_gpc(kvm, old_pfn, old_khva, old_gpa, old_dirty); +} +EXPORT_SYMBOL_GPL(kvm_gfn_to_pfn_cache_unmap); + + +int kvm_gfn_to_pfn_cache_init(struct kvm *kvm, struct gfn_to_pfn_cache *gpc, + struct kvm_vcpu *vcpu, bool guest_uses_pa, + bool kernel_map, gpa_t gpa, unsigned long len, + bool dirty) +{ + if (!gpc->active) { + rwlock_init(&gpc->lock); + + gpc->khva = NULL; + gpc->pfn = KVM_PFN_ERR_FAULT; + gpc->uhva = KVM_HVA_ERR_BAD; + gpc->vcpu = vcpu; + gpc->kernel_map = kernel_map; + gpc->guest_uses_pa = guest_uses_pa; + gpc->valid = false; + gpc->active = true; + + spin_lock(&kvm->gpc_lock); + list_add(&gpc->list, &kvm->gpc_list); + spin_unlock(&kvm->gpc_lock); + } + return kvm_gfn_to_pfn_cache_refresh(kvm, gpc, gpa, len, dirty); +} +EXPORT_SYMBOL_GPL(kvm_gfn_to_pfn_cache_init); + +void kvm_gfn_to_pfn_cache_destroy(struct kvm *kvm, struct gfn_to_pfn_cache *gpc) +{ + if (gpc->active) { + spin_lock(&kvm->gpc_lock); + list_del(&gpc->list); + spin_unlock(&kvm->gpc_lock); + + kvm_gfn_to_pfn_cache_unmap(kvm, gpc); + gpc->active = false; + } +} +EXPORT_SYMBOL_GPL(kvm_gfn_to_pfn_cache_destroy); From patchwork Fri Dec 10 16:36:22 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 8bit X-Patchwork-Submitter: David Woodhouse X-Patchwork-Id: 12670139 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 47045C4332F for ; Fri, 10 Dec 2021 16:36:47 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S243688AbhLJQkV (ORCPT ); Fri, 10 Dec 2021 11:40:21 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:58294 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S243669AbhLJQkT (ORCPT ); Fri, 10 Dec 2021 11:40:19 -0500 Received: from casper.infradead.org (casper.infradead.org [IPv6:2001:8b0:10b:1236::1]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 4EAFCC0617A2 for ; Fri, 10 Dec 2021 08:36:44 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed; d=infradead.org; s=casper.20170209; h=Sender:Content-Transfer-Encoding: Content-Type:MIME-Version:References:In-Reply-To:Message-Id:Date:Subject:Cc: To:From:Reply-To:Content-ID:Content-Description; bh=InWCUgUYJLGeO/hhoUYObYArENpRsQYdt3sDsLY0824=; b=nYLhA6ydbmq3nTNkLzyH9rUl1g bjhu7owgAWGyE+t+3tzomJSgBrxy8n1uMc0f/v/wUtPdCBOgCI6p1BDftTqymWKMiQBM14MYI23t+ onP7EoV5MovHd/UieTlHjv4GygRpPfU2gy9X/DimYsrBczmhLIn94wIuLvvEZ2x/Tvf9E9vWFAAGb XkXVRHFxzQzMdagXUaVFV4hA7DlhfIyPCGzfnGHLtLC/Ih6ZnFqP/mKFao3E/tcOe1JjTdJ9EbcdB QI2UQEagiklLuMPuHsxCxN9M7kNu+KW/bX8yY2pz1hHZZQ6y5q88lHiA0gJm4r6OA0pmdhU732aZF Q5Jde1Yw==; Received: from i7.infradead.org ([2001:8b0:10b:1:21e:67ff:fecb:7a92]) by casper.infradead.org with esmtpsa (Exim 4.94.2 #2 (Red Hat Linux)) id 1mvisk-00AUQw-Qn; Fri, 10 Dec 2021 16:36:27 +0000 Received: from dwoodhou by i7.infradead.org with local (Exim 4.94.2 #2 (Red Hat Linux)) id 1mvisl-0000lf-3g; Fri, 10 Dec 2021 16:36:27 +0000 From: David Woodhouse To: Paolo Bonzini , kvm Cc: Boris Ostrovsky , Joao Martins , "jmattson @ google . com" , "wanpengli @ tencent . com" , "seanjc @ google . com" , "vkuznets @ redhat . com" , "mtosatti @ redhat . com" , "joro @ 8bytes . org" , karahmed@amazon.com, butt3rflyh4ck Subject: [PATCH v6 3/6] KVM: x86/xen: Maintain valid mapping of Xen shared_info page Date: Fri, 10 Dec 2021 16:36:22 +0000 Message-Id: <20211210163625.2886-4-dwmw2@infradead.org> X-Mailer: git-send-email 2.31.1 In-Reply-To: <20211210163625.2886-1-dwmw2@infradead.org> References: <20211210163625.2886-1-dwmw2@infradead.org> MIME-Version: 1.0 Sender: David Woodhouse X-SRS-Rewrite: SMTP reverse-path rewritten from by casper.infradead.org. See http://www.infradead.org/rpr.html Precedence: bulk List-ID: X-Mailing-List: kvm@vger.kernel.org From: David Woodhouse Use the newly reinstated gfn_to_pfn_cache to maintain a kernel mapping of the Xen shared_info page so that it can be accessed in atomic context. Note that we do not participate in dirty tracking for the shared info page and we do not explicitly mark it dirty every single tim we deliver an event channel interrupts. We wouldn't want to do that even if we *did* have a valid vCPU context with which to do so. Signed-off-by: David Woodhouse --- Documentation/virt/kvm/api.rst | 12 ++++++++++++ arch/x86/include/asm/kvm_host.h | 2 +- arch/x86/kvm/xen.c | 25 ++++++++++++++----------- 3 files changed, 27 insertions(+), 12 deletions(-) diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst index aeeb071c7688..455664c39d42 100644 --- a/Documentation/virt/kvm/api.rst +++ b/Documentation/virt/kvm/api.rst @@ -371,6 +371,9 @@ The bits in the dirty bitmap are cleared before the ioctl returns, unless KVM_CAP_MANUAL_DIRTY_LOG_PROTECT2 is enabled. For more information, see the description of the capability. +Note that the Xen shared info page, if configured, shall always be assumed +to be dirty. KVM will not explicitly mark it such. + 4.9 KVM_SET_MEMORY_ALIAS ------------------------ @@ -5134,6 +5137,15 @@ KVM_XEN_ATTR_TYPE_SHARED_INFO not aware of the Xen CPU id which is used as the index into the vcpu_info[] array, so cannot know the correct default location. + Note that the shared info page may be constantly written to by KVM; + it contains the event channel bitmap used to deliver interrupts to + a Xen guest, amongst other things. It is exempt from dirty tracking + mechanisms — KVM will not explicitly mark the page as dirty each + time an event channel interrupt is delivered to the guest! Thus, + userspace should always assume that the designated GFN is dirty if + any vCPU has been running or any event channel interrupts can be + routed to the guest. + KVM_XEN_ATTR_TYPE_UPCALL_VECTOR Sets the exception vector used to deliver Xen event channel upcalls. diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index d5fede05eb5f..a49d503e7296 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -1019,7 +1019,7 @@ struct msr_bitmap_range { struct kvm_xen { bool long_mode; u8 upcall_vector; - gfn_t shinfo_gfn; + struct gfn_to_pfn_cache shinfo_cache; }; enum kvm_irqchip_mode { diff --git a/arch/x86/kvm/xen.c b/arch/x86/kvm/xen.c index dff2bdf9507a..da4bf2c6407f 100644 --- a/arch/x86/kvm/xen.c +++ b/arch/x86/kvm/xen.c @@ -23,16 +23,21 @@ DEFINE_STATIC_KEY_DEFERRED_FALSE(kvm_xen_enabled, HZ); static int kvm_xen_shared_info_init(struct kvm *kvm, gfn_t gfn) { + struct gfn_to_pfn_cache *gpc = &kvm->arch.xen.shinfo_cache; gpa_t gpa = gfn_to_gpa(gfn); int wc_ofs, sec_hi_ofs; int ret = 0; int idx = srcu_read_lock(&kvm->srcu); - if (kvm_is_error_hva(gfn_to_hva(kvm, gfn))) { - ret = -EFAULT; + if (gfn == GPA_INVALID) { + kvm_gfn_to_pfn_cache_destroy(kvm, gpc); goto out; } - kvm->arch.xen.shinfo_gfn = gfn; + + ret = kvm_gfn_to_pfn_cache_init(kvm, gpc, NULL, false, true, gpa, + PAGE_SIZE, false); + if (ret) + goto out; /* Paranoia checks on the 32-bit struct layout */ BUILD_BUG_ON(offsetof(struct compat_shared_info, wc) != 0x900); @@ -260,15 +265,9 @@ int kvm_xen_hvm_set_attr(struct kvm *kvm, struct kvm_xen_hvm_attr *data) break; case KVM_XEN_ATTR_TYPE_SHARED_INFO: - if (data->u.shared_info.gfn == GPA_INVALID) { - kvm->arch.xen.shinfo_gfn = GPA_INVALID; - r = 0; - break; - } r = kvm_xen_shared_info_init(kvm, data->u.shared_info.gfn); break; - case KVM_XEN_ATTR_TYPE_UPCALL_VECTOR: if (data->u.vector && data->u.vector < 0x10) r = -EINVAL; @@ -299,7 +298,10 @@ int kvm_xen_hvm_get_attr(struct kvm *kvm, struct kvm_xen_hvm_attr *data) break; case KVM_XEN_ATTR_TYPE_SHARED_INFO: - data->u.shared_info.gfn = kvm->arch.xen.shinfo_gfn; + if (kvm->arch.xen.shinfo_cache.active) + data->u.shared_info.gfn = gpa_to_gfn(kvm->arch.xen.shinfo_cache.gpa); + else + data->u.shared_info.gfn = GPA_INVALID; r = 0; break; @@ -661,11 +663,12 @@ int kvm_xen_hvm_config(struct kvm *kvm, struct kvm_xen_hvm_config *xhc) void kvm_xen_init_vm(struct kvm *kvm) { - kvm->arch.xen.shinfo_gfn = GPA_INVALID; } void kvm_xen_destroy_vm(struct kvm *kvm) { + kvm_gfn_to_pfn_cache_destroy(kvm, &kvm->arch.xen.shinfo_cache); + if (kvm->arch.xen_hvm_config.msr) static_branch_slow_dec_deferred(&kvm_xen_enabled); } From patchwork Fri Dec 10 16:36:23 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: David Woodhouse X-Patchwork-Id: 12670149 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id E6E81C433FE for ; Fri, 10 Dec 2021 16:36:45 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S243107AbhLJQkU (ORCPT ); Fri, 10 Dec 2021 11:40:20 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:58290 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S243665AbhLJQkT (ORCPT ); Fri, 10 Dec 2021 11:40:19 -0500 Received: from desiato.infradead.org (desiato.infradead.org [IPv6:2001:8b0:10b:1:d65d:64ff:fe57:4e05]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id BDBBEC0617A1 for ; Fri, 10 Dec 2021 08:36:43 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed; d=infradead.org; s=desiato.20200630; h=Sender:Content-Transfer-Encoding: MIME-Version:References:In-Reply-To:Message-Id:Date:Subject:Cc:To:From: Reply-To:Content-Type:Content-ID:Content-Description; bh=bLkOPRbUbwgM3KnEqL1eymENntdxkQnksQhOYjFM2MI=; b=Ef85/PhIRxNyxZHkYy8qWEzehk kH/DxxDuJUaQolDVtXYMXlItcR9tjTjpsRHB55CcJPtM0bO3XmpWDG7tLHd1ycEPFECSgMwx9WIEa L2P47nBVfmQ9FbhImsCivLyIMlev79qYmYe0j0EejD1dN5KlKK5s0Vp4D5OaozALMNbkFRqowtHs3 HnbCG+GZ8vEzgmcJ45qO+tF+G5d1rw3pJurmj5+Vz3krR6/o4dLhw1KgWzYB74LJUjSiDxn6jRR8E c/hShMrMAjvuY6re6wcTHr9y5Z8gtKBzQphyfzqTZkVmIfhWff1hFoY+dZfB1O0++kzZnx7jdUcrx by3Z3gBQ==; Received: from i7.infradead.org ([2001:8b0:10b:1:21e:67ff:fecb:7a92]) by desiato.infradead.org with esmtpsa (Exim 4.94.2 #2 (Red Hat Linux)) id 1mvisl-000abu-FM; Fri, 10 Dec 2021 16:36:27 +0000 Received: from dwoodhou by i7.infradead.org with local (Exim 4.94.2 #2 (Red Hat Linux)) id 1mvisl-0000lj-51; Fri, 10 Dec 2021 16:36:27 +0000 From: David Woodhouse To: Paolo Bonzini , kvm Cc: Boris Ostrovsky , Joao Martins , "jmattson @ google . com" , "wanpengli @ tencent . com" , "seanjc @ google . com" , "vkuznets @ redhat . com" , "mtosatti @ redhat . com" , "joro @ 8bytes . org" , karahmed@amazon.com, butt3rflyh4ck Subject: [PATCH v6 4/6] KVM: x86/xen: Add KVM_IRQ_ROUTING_XEN_EVTCHN and event channel delivery Date: Fri, 10 Dec 2021 16:36:23 +0000 Message-Id: <20211210163625.2886-5-dwmw2@infradead.org> X-Mailer: git-send-email 2.31.1 In-Reply-To: <20211210163625.2886-1-dwmw2@infradead.org> References: <20211210163625.2886-1-dwmw2@infradead.org> MIME-Version: 1.0 Sender: David Woodhouse X-SRS-Rewrite: SMTP reverse-path rewritten from by desiato.infradead.org. See http://www.infradead.org/rpr.html Precedence: bulk List-ID: X-Mailing-List: kvm@vger.kernel.org From: David Woodhouse This adds basic support for delivering 2 level event channels to a guest. Initially, it only supports delivery via the IRQ routing table, triggered by an eventfd. In order to do so, it has a kvm_xen_set_evtchn_fast() function which will use the pre-mapped shared_info page if it already exists and is still valid, while the slow path through the irqfd_inject workqueue will remap the shared_info page if necessary. It sets the bits in the shared_info page but not the vcpu_info; that is deferred to __kvm_xen_has_interrupt() which raises the vector to the appropriate vCPU. Add a 'verbose' mode to xen_shinfo_test while adding test cases for this. Signed-off-by: David Woodhouse --- Documentation/virt/kvm/api.rst | 21 ++ arch/x86/include/asm/kvm_host.h | 1 + arch/x86/kvm/irq_comm.c | 12 + arch/x86/kvm/x86.c | 3 +- arch/x86/kvm/xen.c | 262 +++++++++++++++++- arch/x86/kvm/xen.h | 9 + include/linux/kvm_host.h | 7 + include/uapi/linux/kvm.h | 11 + .../selftests/kvm/x86_64/xen_shinfo_test.c | 184 +++++++++++- 9 files changed, 503 insertions(+), 7 deletions(-) diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst index 455664c39d42..ec4d693851a2 100644 --- a/Documentation/virt/kvm/api.rst +++ b/Documentation/virt/kvm/api.rst @@ -1799,6 +1799,7 @@ No flags are specified so far, the corresponding field must be set to zero. struct kvm_irq_routing_msi msi; struct kvm_irq_routing_s390_adapter adapter; struct kvm_irq_routing_hv_sint hv_sint; + struct kvm_irq_routing_xen_evtchn xen_evtchn; __u32 pad[8]; } u; }; @@ -1808,6 +1809,7 @@ No flags are specified so far, the corresponding field must be set to zero. #define KVM_IRQ_ROUTING_MSI 2 #define KVM_IRQ_ROUTING_S390_ADAPTER 3 #define KVM_IRQ_ROUTING_HV_SINT 4 + #define KVM_IRQ_ROUTING_XEN_EVTCHN 5 flags: @@ -1859,6 +1861,20 @@ address_hi must be zero. __u32 sint; }; + struct kvm_irq_routing_xen_evtchn { + __u32 port; + __u32 vcpu; + __u32 priority; + }; + + +When KVM_CAP_XEN_HVM includes the KVM_XEN_HVM_CONFIG_EVTCHN_2LEVEL bit +in its indication of supported features, routing to Xen event channels +is supported. Although the priority field is present, only the value +KVM_XEN_HVM_CONFIG_EVTCHN_2LEVEL is supported, which means delivery by +2 level event channels. FIFO event channel support may be added in +the future. + 4.55 KVM_SET_TSC_KHZ -------------------- @@ -7413,6 +7429,7 @@ PVHVM guests. Valid flags are:: #define KVM_XEN_HVM_CONFIG_INTERCEPT_HCALL (1 << 1) #define KVM_XEN_HVM_CONFIG_SHARED_INFO (1 << 2) #define KVM_XEN_HVM_CONFIG_RUNSTATE (1 << 2) + #define KVM_XEN_HVM_CONFIG_EVTCHN_2LEVEL (1 << 3) The KVM_XEN_HVM_CONFIG_HYPERCALL_MSR flag indicates that the KVM_XEN_HVM_CONFIG ioctl is available, for the guest to set its hypercall page. @@ -7432,6 +7449,10 @@ The KVM_XEN_HVM_CONFIG_RUNSTATE flag indicates that the runstate-related features KVM_XEN_VCPU_ATTR_TYPE_RUNSTATE_ADDR/_CURRENT/_DATA/_ADJUST are supported by the KVM_XEN_VCPU_SET_ATTR/KVM_XEN_VCPU_GET_ATTR ioctls. +The KVM_XEN_HVM_CONFIG_EVTCHN_2LEVEL flag indicates that IRQ routing entries +of the type KVM_IRQ_ROUTING_XEN_EVTCHN are supported, with the priority +field set to indicate 2 level event channel delivery. + 8.31 KVM_CAP_PPC_MULTITCE ------------------------- diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index a49d503e7296..392d13c36083 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -609,6 +609,7 @@ struct kvm_vcpu_xen { u64 last_steal; u64 runstate_entry_time; u64 runstate_times[4]; + unsigned long evtchn_pending_sel; }; struct kvm_vcpu_arch { diff --git a/arch/x86/kvm/irq_comm.c b/arch/x86/kvm/irq_comm.c index 39ad02d6dc63..6e0dab04320e 100644 --- a/arch/x86/kvm/irq_comm.c +++ b/arch/x86/kvm/irq_comm.c @@ -24,6 +24,7 @@ #include "hyperv.h" #include "x86.h" +#include "xen.h" static int kvm_set_pic_irq(struct kvm_kernel_irq_routing_entry *e, struct kvm *kvm, int irq_source_id, int level, @@ -175,6 +176,13 @@ int kvm_arch_set_irq_inatomic(struct kvm_kernel_irq_routing_entry *e, return r; break; +#ifdef CONFIG_KVM_XEN + case KVM_IRQ_ROUTING_XEN_EVTCHN: + if (!level) + return -1; + + return kvm_xen_set_evtchn_fast(e, kvm); +#endif default: break; } @@ -310,6 +318,10 @@ int kvm_set_routing_entry(struct kvm *kvm, e->hv_sint.vcpu = ue->u.hv_sint.vcpu; e->hv_sint.sint = ue->u.hv_sint.sint; break; +#ifdef CONFIG_KVM_XEN + case KVM_IRQ_ROUTING_XEN_EVTCHN: + return kvm_xen_setup_evtchn(kvm, e, ue); +#endif default: return -EINVAL; } diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 26cb3a4cd0e9..1f254d6d503f 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -4183,7 +4183,8 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext) case KVM_CAP_XEN_HVM: r = KVM_XEN_HVM_CONFIG_HYPERCALL_MSR | KVM_XEN_HVM_CONFIG_INTERCEPT_HCALL | - KVM_XEN_HVM_CONFIG_SHARED_INFO; + KVM_XEN_HVM_CONFIG_SHARED_INFO | + KVM_XEN_HVM_CONFIG_EVTCHN_2LEVEL; if (sched_info_on()) r |= KVM_XEN_HVM_CONFIG_RUNSTATE; break; diff --git a/arch/x86/kvm/xen.c b/arch/x86/kvm/xen.c index da4bf2c6407f..ceddabd1f5c6 100644 --- a/arch/x86/kvm/xen.c +++ b/arch/x86/kvm/xen.c @@ -16,6 +16,7 @@ #include #include #include +#include #include "trace.h" @@ -195,6 +196,8 @@ void kvm_xen_update_runstate_guest(struct kvm_vcpu *v, int state) int __kvm_xen_has_interrupt(struct kvm_vcpu *v) { + unsigned long evtchn_pending_sel = READ_ONCE(v->arch.xen.evtchn_pending_sel); + bool atomic = in_atomic() || !task_is_running(current); int err; u8 rc = 0; @@ -204,6 +207,9 @@ int __kvm_xen_has_interrupt(struct kvm_vcpu *v) */ struct gfn_to_hva_cache *ghc = &v->arch.xen.vcpu_info_cache; struct kvm_memslots *slots = kvm_memslots(v->kvm); + bool ghc_valid = slots->generation == ghc->generation && + !kvm_is_error_hva(ghc->hva) && ghc->memslot; + unsigned int offset = offsetof(struct vcpu_info, evtchn_upcall_pending); /* No need for compat handling here */ @@ -219,8 +225,7 @@ int __kvm_xen_has_interrupt(struct kvm_vcpu *v) * cache in kvm_read_guest_offset_cached(), but just uses * __get_user() instead. And falls back to the slow path. */ - if (likely(slots->generation == ghc->generation && - !kvm_is_error_hva(ghc->hva) && ghc->memslot)) { + if (!evtchn_pending_sel && ghc_valid) { /* Fast path */ pagefault_disable(); err = __get_user(rc, (u8 __user *)ghc->hva + offset); @@ -239,11 +244,82 @@ int __kvm_xen_has_interrupt(struct kvm_vcpu *v) * and we'll end up getting called again from a context where we *can* * fault in the page and wait for it. */ - if (in_atomic() || !task_is_running(current)) + if (atomic) return 1; - kvm_read_guest_offset_cached(v->kvm, ghc, &rc, offset, - sizeof(rc)); + if (!ghc_valid) { + err = kvm_gfn_to_hva_cache_init(v->kvm, ghc, ghc->gpa, ghc->len); + if (err || !ghc->memslot) { + /* + * If this failed, userspace has screwed up the + * vcpu_info mapping. No interrupts for you. + */ + return 0; + } + } + + /* + * Now we have a valid (protected by srcu) userspace HVA in + * ghc->hva which points to the struct vcpu_info. If there + * are any bits in the in-kernel evtchn_pending_sel then + * we need to write those to the guest vcpu_info and set + * its evtchn_upcall_pending flag. If there aren't any bits + * to add, we only want to *check* evtchn_upcall_pending. + */ + if (evtchn_pending_sel) { + bool long_mode = v->kvm->arch.xen.long_mode; + + if (!user_access_begin((void __user *)ghc->hva, sizeof(struct vcpu_info))) + return 0; + + if (IS_ENABLED(CONFIG_64BIT) && long_mode) { + struct vcpu_info __user *vi = (void __user *)ghc->hva; + + /* Attempt to set the evtchn_pending_sel bits in the + * guest, and if that succeeds then clear the same + * bits in the in-kernel version. */ + asm volatile("1:\t" LOCK_PREFIX "orq %0, %1\n" + "\tnotq %0\n" + "\t" LOCK_PREFIX "andq %0, %2\n" + "2:\n" + "\t.section .fixup,\"ax\"\n" + "3:\tjmp\t2b\n" + "\t.previous\n" + _ASM_EXTABLE_UA(1b, 3b) + : "=r" (evtchn_pending_sel), + "+m" (vi->evtchn_pending_sel), + "+m" (v->arch.xen.evtchn_pending_sel) + : "0" (evtchn_pending_sel)); + } else { + struct compat_vcpu_info __user *vi = (void __user *)ghc->hva; + u32 evtchn_pending_sel32 = evtchn_pending_sel; + + /* Attempt to set the evtchn_pending_sel bits in the + * guest, and if that succeeds then clear the same + * bits in the in-kernel version. */ + asm volatile("1:\t" LOCK_PREFIX "orl %0, %1\n" + "\tnotl %0\n" + "\t" LOCK_PREFIX "andl %0, %2\n" + "2:\n" + "\t.section .fixup,\"ax\"\n" + "3:\tjmp\t2b\n" + "\t.previous\n" + _ASM_EXTABLE_UA(1b, 3b) + : "=r" (evtchn_pending_sel32), + "+m" (vi->evtchn_pending_sel), + "+m" (v->arch.xen.evtchn_pending_sel) + : "0" (evtchn_pending_sel32)); + } + rc = 1; + unsafe_put_user(rc, (u8 __user *)ghc->hva + offset, err); + + err: + user_access_end(); + + mark_page_dirty_in_slot(v->kvm, ghc->memslot, ghc->gpa >> PAGE_SHIFT); + } else { + __get_user(rc, (u8 __user *)ghc->hva + offset); + } return rc; } @@ -740,3 +816,179 @@ int kvm_xen_hypercall(struct kvm_vcpu *vcpu) return 0; } + +static inline int max_evtchn_port(struct kvm *kvm) +{ + if (IS_ENABLED(CONFIG_64BIT) && kvm->arch.xen.long_mode) + return EVTCHN_2L_NR_CHANNELS; + else + return COMPAT_EVTCHN_2L_NR_CHANNELS; +} + +/* + * This follows the kvm_set_irq() API, so it returns: + * < 0 Interrupt was ignored (masked or not delivered for other reasons) + * = 0 Interrupt was coalesced (previous irq is still pending) + * > 0 Number of CPUs interrupt was delivered to + */ +int kvm_xen_set_evtchn_fast(struct kvm_kernel_irq_routing_entry *e, + struct kvm *kvm) +{ + struct gfn_to_pfn_cache *gpc = &kvm->arch.xen.shinfo_cache; + struct kvm_vcpu *vcpu; + unsigned long *pending_bits, *mask_bits; + unsigned long flags; + int port_word_bit; + bool kick_vcpu = false; + int idx; + int rc; + + vcpu = kvm_get_vcpu_by_id(kvm, e->xen_evtchn.vcpu); + if (!vcpu) + return -1; + + if (!vcpu->arch.xen.vcpu_info_set) + return -1; + + if (e->xen_evtchn.port >= max_evtchn_port(kvm)) + return -1; + + rc = -EWOULDBLOCK; + read_lock_irqsave(&gpc->lock, flags); + + idx = srcu_read_lock(&kvm->srcu); + if (!kvm_gfn_to_pfn_cache_check(kvm, gpc, gpc->gpa, PAGE_SIZE)) + goto out_rcu; + + if (IS_ENABLED(CONFIG_64BIT) && kvm->arch.xen.long_mode) { + struct shared_info *shinfo = gpc->khva; + pending_bits = (unsigned long *)&shinfo->evtchn_pending; + mask_bits = (unsigned long *)&shinfo->evtchn_mask; + port_word_bit = e->xen_evtchn.port / 64; + } else { + struct compat_shared_info *shinfo = gpc->khva; + pending_bits = (unsigned long *)&shinfo->evtchn_pending; + mask_bits = (unsigned long *)&shinfo->evtchn_mask; + port_word_bit = e->xen_evtchn.port / 32; + } + + /* + * If this port wasn't already set, and if it isn't masked, then + * we try to set the corresponding bit in the in-kernel shadow of + * evtchn_pending_sel for the target vCPU. And if *that* wasn't + * already set, then we kick the vCPU in question to write to the + * *real* evtchn_pending_sel in its own guest vcpu_info struct. + */ + if (test_and_set_bit(e->xen_evtchn.port, pending_bits)) { + rc = 0; /* It was already raised */ + } else if (test_bit(e->xen_evtchn.port, mask_bits)) { + rc = -1; /* Masked */ + } else { + rc = 1; /* Delivered. But was the vCPU waking already? */ + if (!test_and_set_bit(port_word_bit, &vcpu->arch.xen.evtchn_pending_sel)) + kick_vcpu = true; + } + + out_rcu: + srcu_read_unlock(&kvm->srcu, idx); + read_unlock_irqrestore(&gpc->lock, flags); + + if (kick_vcpu) { + kvm_make_request(KVM_REQ_EVENT, vcpu); + kvm_vcpu_kick(vcpu); + } + + return rc; +} + +/* This is the version called from kvm_set_irq() as the .set function */ +static int evtchn_set_fn(struct kvm_kernel_irq_routing_entry *e, struct kvm *kvm, + int irq_source_id, int level, bool line_status) +{ + bool mm_borrowed = false; + int rc; + + if (!level) + return -1; + + rc = kvm_xen_set_evtchn_fast(e, kvm); + if (rc != -EWOULDBLOCK) + return rc; + + if (current->mm != kvm->mm) { + /* + * If not on a thread which already belongs to this KVM, + * we'd better be in the irqfd workqueue. + */ + if (WARN_ON_ONCE(current->mm)) + return -EINVAL; + + kthread_use_mm(kvm->mm); + mm_borrowed = true; + } + + /* + * For the irqfd workqueue, using the main kvm->lock mutex is + * fine since this function is invoked from kvm_set_irq() with + * no other lock held, no srcu. In future if it will be called + * directly from a vCPU thread (e.g. on hypercall for an IPI) + * then it may need to switch to using a leaf-node mutex for + * serializing the shared_info mapping. + */ + mutex_lock(&kvm->lock); + + /* + * It is theoretically possible for the page to be unmapped + * and the MMU notifier to invalidate the shared_info before + * we even get to use it. In that case, this looks like an + * infinite loop. It was tempting to do it via the userspace + * HVA instead... but that just *hides* the fact that it's + * an infinite loop, because if a fault occurs and it waits + * for the page to come back, it can *still* immediately + * fault and have to wait again, repeatedly. + * + * Conversely, the page could also have been reinstated by + * another thread before we even obtain the mutex above, so + * check again *first* before remapping it. + */ + do { + struct gfn_to_pfn_cache *gpc = &kvm->arch.xen.shinfo_cache; + int idx; + + rc = kvm_xen_set_evtchn_fast(e, kvm); + if (rc != -EWOULDBLOCK) + break; + + idx = srcu_read_lock(&kvm->srcu); + rc = kvm_gfn_to_pfn_cache_refresh(kvm, gpc, gpc->gpa, + PAGE_SIZE, false); + srcu_read_unlock(&kvm->srcu, idx); + } while(!rc); + + mutex_unlock(&kvm->lock); + + if (mm_borrowed) + kthread_unuse_mm(kvm->mm); + + return rc; +} + +int kvm_xen_setup_evtchn(struct kvm *kvm, + struct kvm_kernel_irq_routing_entry *e, + const struct kvm_irq_routing_entry *ue) + +{ + if (ue->u.xen_evtchn.port >= max_evtchn_port(kvm)) + return -EINVAL; + + /* We only support 2 level event channels for now */ + if (ue->u.xen_evtchn.priority != KVM_IRQ_ROUTING_XEN_EVTCHN_PRIO_2LEVEL) + return -EINVAL; + + e->xen_evtchn.port = ue->u.xen_evtchn.port; + e->xen_evtchn.vcpu = ue->u.xen_evtchn.vcpu; + e->xen_evtchn.priority = ue->u.xen_evtchn.priority; + e->set = evtchn_set_fn; + + return 0; +} diff --git a/arch/x86/kvm/xen.h b/arch/x86/kvm/xen.h index cc0cf5f37450..adbcc9ed59db 100644 --- a/arch/x86/kvm/xen.h +++ b/arch/x86/kvm/xen.h @@ -24,6 +24,12 @@ int kvm_xen_hvm_config(struct kvm *kvm, struct kvm_xen_hvm_config *xhc); void kvm_xen_init_vm(struct kvm *kvm); void kvm_xen_destroy_vm(struct kvm *kvm); +int kvm_xen_set_evtchn_fast(struct kvm_kernel_irq_routing_entry *e, + struct kvm *kvm); +int kvm_xen_setup_evtchn(struct kvm *kvm, + struct kvm_kernel_irq_routing_entry *e, + const struct kvm_irq_routing_entry *ue); + static inline bool kvm_xen_msr_enabled(struct kvm *kvm) { return static_branch_unlikely(&kvm_xen_enabled.key) && @@ -134,6 +140,9 @@ struct compat_shared_info { struct compat_arch_shared_info arch; }; +#define COMPAT_EVTCHN_2L_NR_CHANNELS (8 * \ + sizeof_field(struct compat_shared_info, \ + evtchn_pending)) struct compat_vcpu_runstate_info { int state; uint64_t state_entry_time; diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h index b56d3f4fb2fd..0853e789031b 100644 --- a/include/linux/kvm_host.h +++ b/include/linux/kvm_host.h @@ -497,6 +497,12 @@ struct kvm_hv_sint { u32 sint; }; +struct kvm_xen_evtchn { + u32 port; + u32 vcpu; + u32 priority; +}; + struct kvm_kernel_irq_routing_entry { u32 gsi; u32 type; @@ -517,6 +523,7 @@ struct kvm_kernel_irq_routing_entry { } msi; struct kvm_s390_adapter_int adapter; struct kvm_hv_sint hv_sint; + struct kvm_xen_evtchn xen_evtchn; }; struct hlist_node link; }; diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h index 1daa45268de2..12421e76adcb 100644 --- a/include/uapi/linux/kvm.h +++ b/include/uapi/linux/kvm.h @@ -1162,11 +1162,20 @@ struct kvm_irq_routing_hv_sint { __u32 sint; }; +struct kvm_irq_routing_xen_evtchn { + __u32 port; + __u32 vcpu; + __u32 priority; +}; + +#define KVM_IRQ_ROUTING_XEN_EVTCHN_PRIO_2LEVEL ((__u32)(-1)) + /* gsi routing entry types */ #define KVM_IRQ_ROUTING_IRQCHIP 1 #define KVM_IRQ_ROUTING_MSI 2 #define KVM_IRQ_ROUTING_S390_ADAPTER 3 #define KVM_IRQ_ROUTING_HV_SINT 4 +#define KVM_IRQ_ROUTING_XEN_EVTCHN 5 struct kvm_irq_routing_entry { __u32 gsi; @@ -1178,6 +1187,7 @@ struct kvm_irq_routing_entry { struct kvm_irq_routing_msi msi; struct kvm_irq_routing_s390_adapter adapter; struct kvm_irq_routing_hv_sint hv_sint; + struct kvm_irq_routing_xen_evtchn xen_evtchn; __u32 pad[8]; } u; }; @@ -1208,6 +1218,7 @@ struct kvm_x86_mce { #define KVM_XEN_HVM_CONFIG_INTERCEPT_HCALL (1 << 1) #define KVM_XEN_HVM_CONFIG_SHARED_INFO (1 << 2) #define KVM_XEN_HVM_CONFIG_RUNSTATE (1 << 3) +#define KVM_XEN_HVM_CONFIG_EVTCHN_2LEVEL (1 << 4) struct kvm_xen_hvm_config { __u32 flags; diff --git a/tools/testing/selftests/kvm/x86_64/xen_shinfo_test.c b/tools/testing/selftests/kvm/x86_64/xen_shinfo_test.c index a0699f00b3d6..478e0ae8b93e 100644 --- a/tools/testing/selftests/kvm/x86_64/xen_shinfo_test.c +++ b/tools/testing/selftests/kvm/x86_64/xen_shinfo_test.c @@ -14,6 +14,9 @@ #include #include #include +#include + +#include #define VCPU_ID 5 @@ -22,10 +25,15 @@ #define SHINFO_REGION_SLOT 10 #define PAGE_SIZE 4096 +#define DUMMY_REGION_GPA (SHINFO_REGION_GPA + (2 * PAGE_SIZE)) +#define DUMMY_REGION_SLOT 11 + +#define SHINFO_ADDR (SHINFO_REGION_GPA) #define PVTIME_ADDR (SHINFO_REGION_GPA + PAGE_SIZE) #define RUNSTATE_ADDR (SHINFO_REGION_GPA + PAGE_SIZE + 0x20) #define VCPU_INFO_ADDR (SHINFO_REGION_GPA + 0x40) +#define SHINFO_VADDR (SHINFO_REGION_GVA) #define RUNSTATE_VADDR (SHINFO_REGION_GVA + PAGE_SIZE + 0x20) #define VCPU_INFO_VADDR (SHINFO_REGION_GVA + 0x40) @@ -73,15 +81,37 @@ struct vcpu_info { struct pvclock_vcpu_time_info time; }; /* 64 bytes (x86) */ +struct shared_info { + struct vcpu_info vcpu_info[32]; + unsigned long evtchn_pending[64]; + unsigned long evtchn_mask[64]; + struct pvclock_wall_clock wc; + uint32_t wc_sec_hi; + /* arch_shared_info here */ +}; + #define RUNSTATE_running 0 #define RUNSTATE_runnable 1 #define RUNSTATE_blocked 2 #define RUNSTATE_offline 3 +static const char *runstate_names[] = { + "running", + "runnable", + "blocked", + "offline" +}; + +struct { + struct kvm_irq_routing info; + struct kvm_irq_routing_entry entries[2]; +} irq_routes; + static void evtchn_handler(struct ex_regs *regs) { struct vcpu_info *vi = (void *)VCPU_INFO_VADDR; vi->evtchn_upcall_pending = 0; + vi->evtchn_pending_sel = 0; GUEST_SYNC(0x20); } @@ -127,7 +157,25 @@ static void guest_code(void) GUEST_SYNC(6); GUEST_ASSERT(rs->time[RUNSTATE_runnable] >= MIN_STEAL_TIME); - GUEST_DONE(); + /* Attempt to deliver a *masked* interrupt */ + GUEST_SYNC(7); + + /* Wait until we see the bit set */ + struct shared_info *si = (void *)SHINFO_VADDR; + while (!si->evtchn_pending[0]) + __asm__ __volatile__ ("rep nop" : : : "memory"); + + /* Now deliver an *unmasked* interrupt */ + GUEST_SYNC(8); + + while (!si->evtchn_pending[1]) + __asm__ __volatile__ ("rep nop" : : : "memory"); + + /* Change memslots and deliver an interrupt */ + GUEST_SYNC(9); + + for (;;) + __asm__ __volatile__ ("rep nop" : : : "memory"); } static int cmp_timespec(struct timespec *a, struct timespec *b) @@ -144,9 +192,18 @@ static int cmp_timespec(struct timespec *a, struct timespec *b) return 0; } +static void handle_alrm(int sig) +{ + TEST_FAIL("IRQ delivery timed out"); +} + int main(int argc, char *argv[]) { struct timespec min_ts, max_ts, vm_ts; + bool verbose; + + verbose = argc > 1 && (!strncmp(argv[1], "-v", 3) || + !strncmp(argv[1], "--verbose", 10)); int xen_caps = kvm_check_cap(KVM_CAP_XEN_HVM); if (!(xen_caps & KVM_XEN_HVM_CONFIG_SHARED_INFO) ) { @@ -155,6 +212,7 @@ int main(int argc, char *argv[]) } bool do_runstate_tests = !!(xen_caps & KVM_XEN_HVM_CONFIG_RUNSTATE); + bool do_eventfd_tests = !!(xen_caps & KVM_XEN_HVM_CONFIG_EVTCHN_2LEVEL); clock_gettime(CLOCK_REALTIME, &min_ts); @@ -166,6 +224,11 @@ int main(int argc, char *argv[]) SHINFO_REGION_GPA, SHINFO_REGION_SLOT, 2, 0); virt_map(vm, SHINFO_REGION_GVA, SHINFO_REGION_GPA, 2); + struct shared_info *shinfo = addr_gpa2hva(vm, SHINFO_VADDR); + + int zero_fd = open("/dev/zero", O_RDONLY); + TEST_ASSERT(zero_fd != -1, "Failed to open /dev/zero"); + struct kvm_xen_hvm_config hvmc = { .flags = KVM_XEN_HVM_CONFIG_INTERCEPT_HCALL, .msr = XEN_HYPERCALL_MSR, @@ -184,6 +247,16 @@ int main(int argc, char *argv[]) }; vm_ioctl(vm, KVM_XEN_HVM_SET_ATTR, &ha); + /* + * Test what happens when the HVA of the shinfo page is remapped after + * the kernel has a reference to it. But make sure we copy the clock + * info over since that's only set at setup time, and we test it later. + */ + struct pvclock_wall_clock wc_copy = shinfo->wc; + void *m = mmap(shinfo, PAGE_SIZE, PROT_READ|PROT_WRITE, MAP_FIXED|MAP_PRIVATE, zero_fd, 0); + TEST_ASSERT(m == shinfo, "Failed to map /dev/zero over shared info"); + shinfo->wc = wc_copy; + struct kvm_xen_vcpu_attr vi = { .type = KVM_XEN_VCPU_ATTR_TYPE_VCPU_INFO, .u.gpa = VCPU_INFO_ADDR, @@ -214,6 +287,49 @@ int main(int argc, char *argv[]) vcpu_ioctl(vm, VCPU_ID, KVM_XEN_VCPU_SET_ATTR, &st); } + int irq_fd[2] = { -1, -1 }; + + if (do_eventfd_tests) { + irq_fd[0] = eventfd(0, 0); + irq_fd[1] = eventfd(0, 0); + + /* Unexpected, but not a KVM failure */ + if (irq_fd[0] == -1 || irq_fd[1] == -1) + do_eventfd_tests = false; + } + + if (do_eventfd_tests) { + irq_routes.info.nr = 2; + + irq_routes.entries[0].gsi = 32; + irq_routes.entries[0].type = KVM_IRQ_ROUTING_XEN_EVTCHN; + irq_routes.entries[0].u.xen_evtchn.port = 15; + irq_routes.entries[0].u.xen_evtchn.vcpu = VCPU_ID; + irq_routes.entries[0].u.xen_evtchn.priority = KVM_IRQ_ROUTING_XEN_EVTCHN_PRIO_2LEVEL; + + irq_routes.entries[1].gsi = 33; + irq_routes.entries[1].type = KVM_IRQ_ROUTING_XEN_EVTCHN; + irq_routes.entries[1].u.xen_evtchn.port = 66; + irq_routes.entries[1].u.xen_evtchn.vcpu = VCPU_ID; + irq_routes.entries[1].u.xen_evtchn.priority = KVM_IRQ_ROUTING_XEN_EVTCHN_PRIO_2LEVEL; + + vm_ioctl(vm, KVM_SET_GSI_ROUTING, &irq_routes); + + struct kvm_irqfd ifd = { }; + + ifd.fd = irq_fd[0]; + ifd.gsi = 32; + vm_ioctl(vm, KVM_IRQFD, &ifd); + + ifd.fd = irq_fd[1]; + ifd.gsi = 33; + vm_ioctl(vm, KVM_IRQFD, &ifd); + + struct sigaction sa = { }; + sa.sa_handler = handle_alrm; + sigaction(SIGALRM, &sa, NULL); + } + struct vcpu_info *vinfo = addr_gpa2hva(vm, VCPU_INFO_VADDR); vinfo->evtchn_upcall_pending = 0; @@ -248,6 +364,8 @@ int main(int argc, char *argv[]) switch (uc.args[1]) { case 0: + if (verbose) + printf("Delivering evtchn upcall\n"); evtchn_irq_expected = true; vinfo->evtchn_upcall_pending = 1; break; @@ -256,11 +374,16 @@ int main(int argc, char *argv[]) TEST_ASSERT(!evtchn_irq_expected, "Event channel IRQ not seen"); if (!do_runstate_tests) goto done; + if (verbose) + printf("Testing runstate %s\n", runstate_names[uc.args[1]]); rst.type = KVM_XEN_VCPU_ATTR_TYPE_RUNSTATE_CURRENT; rst.u.runstate.state = uc.args[1]; vcpu_ioctl(vm, VCPU_ID, KVM_XEN_VCPU_SET_ATTR, &rst); break; + case 4: + if (verbose) + printf("Testing RUNSTATE_ADJUST\n"); rst.type = KVM_XEN_VCPU_ATTR_TYPE_RUNSTATE_ADJUST; memset(&rst.u, 0, sizeof(rst.u)); rst.u.runstate.state = (uint64_t)-1; @@ -274,6 +397,8 @@ int main(int argc, char *argv[]) break; case 5: + if (verbose) + printf("Testing RUNSTATE_DATA\n"); rst.type = KVM_XEN_VCPU_ATTR_TYPE_RUNSTATE_DATA; memset(&rst.u, 0, sizeof(rst.u)); rst.u.runstate.state = RUNSTATE_running; @@ -282,16 +407,54 @@ int main(int argc, char *argv[]) rst.u.runstate.time_offline = 0x5a; vcpu_ioctl(vm, VCPU_ID, KVM_XEN_VCPU_SET_ATTR, &rst); break; + case 6: + if (verbose) + printf("Testing steal time\n"); /* Yield until scheduler delay exceeds target */ rundelay = get_run_delay() + MIN_STEAL_TIME; do { sched_yield(); } while (get_run_delay() < rundelay); break; + + case 7: + if (!do_eventfd_tests) + goto done; + if (verbose) + printf("Testing masked event channel\n"); + shinfo->evtchn_mask[0] = 0x8000; + eventfd_write(irq_fd[0], 1UL); + alarm(1); + break; + + case 8: + if (verbose) + printf("Testing unmasked event channel\n"); + /* Unmask that, but deliver the other one */ + shinfo->evtchn_pending[0] = 0; + shinfo->evtchn_mask[0] = 0; + eventfd_write(irq_fd[1], 1UL); + evtchn_irq_expected = true; + alarm(1); + break; + + case 9: + if (verbose) + printf("Testing event channel after memslot change\n"); + vm_userspace_mem_region_add(vm, VM_MEM_SRC_ANONYMOUS, + DUMMY_REGION_GPA, DUMMY_REGION_SLOT, 1, 0); + eventfd_write(irq_fd[0], 1UL); + evtchn_irq_expected = true; + alarm(1); + break; + case 0x20: TEST_ASSERT(evtchn_irq_expected, "Unexpected event channel IRQ"); evtchn_irq_expected = false; + if (shinfo->evtchn_pending[1] && + shinfo->evtchn_pending[0]) + goto done; break; } break; @@ -318,6 +481,16 @@ int main(int argc, char *argv[]) ti = addr_gpa2hva(vm, SHINFO_REGION_GPA + 0x40 + 0x20); ti2 = addr_gpa2hva(vm, PVTIME_ADDR); + if (verbose) { + printf("Wall clock (v %d) %d.%09d\n", wc->version, wc->sec, wc->nsec); + printf("Time info 1: v %u tsc %" PRIu64 " time %" PRIu64 " mul %u shift %u flags %x\n", + ti->version, ti->tsc_timestamp, ti->system_time, ti->tsc_to_system_mul, + ti->tsc_shift, ti->flags); + printf("Time info 2: v %u tsc %" PRIu64 " time %" PRIu64 " mul %u shift %u flags %x\n", + ti2->version, ti2->tsc_timestamp, ti2->system_time, ti2->tsc_to_system_mul, + ti2->tsc_shift, ti2->flags); + } + vm_ts.tv_sec = wc->sec; vm_ts.tv_nsec = wc->nsec; TEST_ASSERT(wc->version && !(wc->version & 1), @@ -341,6 +514,15 @@ int main(int argc, char *argv[]) }; vcpu_ioctl(vm, VCPU_ID, KVM_XEN_VCPU_GET_ATTR, &rst); + if (verbose) { + printf("Runstate: %s(%d), entry %" PRIu64 " ns\n", + rs->state <= RUNSTATE_offline ? runstate_names[rs->state] : "unknown", + rs->state, rs->state_entry_time); + for (int i = RUNSTATE_running; i <= RUNSTATE_offline; i++) { + printf("State %s: %" PRIu64 " ns\n", + runstate_names[i], rs->time[i]); + } + } TEST_ASSERT(rs->state == rst.u.runstate.state, "Runstate mismatch"); TEST_ASSERT(rs->state_entry_time == rst.u.runstate.state_entry_time, "State entry time mismatch"); From patchwork Fri Dec 10 16:36:24 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: David Woodhouse X-Patchwork-Id: 12670145 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id CDD4DC43217 for ; Fri, 10 Dec 2021 16:36:48 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S243690AbhLJQkX (ORCPT ); Fri, 10 Dec 2021 11:40:23 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:58296 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S243671AbhLJQkT (ORCPT ); Fri, 10 Dec 2021 11:40:19 -0500 Received: from casper.infradead.org (casper.infradead.org [IPv6:2001:8b0:10b:1236::1]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 51B85C061353 for ; Fri, 10 Dec 2021 08:36:44 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed; d=infradead.org; s=casper.20170209; h=Sender:Content-Transfer-Encoding: MIME-Version:References:In-Reply-To:Message-Id:Date:Subject:Cc:To:From: Reply-To:Content-Type:Content-ID:Content-Description; bh=V5ElOkfVRDzjvPfScyHzr/shE5gWQxuaU0C3XbyTCXQ=; b=asCSgKlE4T9QOetZZow1J2xP9L P2a3jgmqLE3EYgcUVIP7Jwyv1FgcmmfbFspd0CVZ2r/ooQadzV//kyqCOVNvKBYqy+FSI0hgewhKd VksrnK+yCz75p/dLm45FPdJqV8DAwQahmjuVyA7z0zlGoRijdSIKhT2Piq0B5+pnA5FvZugr/4mZ3 27y3obtRyhWOJ5RH6s8DzXBcYygkUuZQ5s2schCMt0io7nRcNHbIjUMjuTlqYzv+2HKuwhvGvR2i9 j5mBn6A25rgDqEyU4xk+OrtoOJCdqQ7xrqK2emCeQRPmGtY45135vF6397e0X6YjczEeVR6fOOCx7 4m2NW23w==; Received: from i7.infradead.org ([2001:8b0:10b:1:21e:67ff:fecb:7a92]) by casper.infradead.org with esmtpsa (Exim 4.94.2 #2 (Red Hat Linux)) id 1mvisk-00AUQx-W7; Fri, 10 Dec 2021 16:36:27 +0000 Received: from dwoodhou by i7.infradead.org with local (Exim 4.94.2 #2 (Red Hat Linux)) id 1mvisl-0000ln-7m; Fri, 10 Dec 2021 16:36:27 +0000 From: David Woodhouse To: Paolo Bonzini , kvm Cc: Boris Ostrovsky , Joao Martins , "jmattson @ google . com" , "wanpengli @ tencent . com" , "seanjc @ google . com" , "vkuznets @ redhat . com" , "mtosatti @ redhat . com" , "joro @ 8bytes . org" , karahmed@amazon.com, butt3rflyh4ck Subject: [PATCH v6 5/6] KVM: x86: Fix wall clock writes in Xen shared_info not to mark page dirty Date: Fri, 10 Dec 2021 16:36:24 +0000 Message-Id: <20211210163625.2886-6-dwmw2@infradead.org> X-Mailer: git-send-email 2.31.1 In-Reply-To: <20211210163625.2886-1-dwmw2@infradead.org> References: <20211210163625.2886-1-dwmw2@infradead.org> MIME-Version: 1.0 Sender: David Woodhouse X-SRS-Rewrite: SMTP reverse-path rewritten from by casper.infradead.org. See http://www.infradead.org/rpr.html Precedence: bulk List-ID: X-Mailing-List: kvm@vger.kernel.org From: David Woodhouse When dirty ring logging is enabled, any dirty logging without an active vCPU context will cause a kernel oops. But we've already declared that the shared_info page doesn't get dirty tracking anyway, since it would be kind of insane to mark it dirty every time we deliver an event channel interrupt. Userspace is supposed to just assume it's always dirty any time a vCPU can run or event channels are routed. So stop using the generic kvm_write_wall_clock() and just write directly through the gfn_to_pfn_cache that we already have set up. We can make kvm_write_wall_clock() static in x86.c again now, but let's not remove the 'sec_hi_ofs' argument even though it's not used yet. At some point we *will* want to use that for KVM guests too. Fixes: 629b5348841a ("KVM: x86/xen: update wallclock region") Reported-by: butt3rflyh4ck Signed-off-by: David Woodhouse --- arch/x86/kvm/x86.c | 2 +- arch/x86/kvm/x86.h | 1 - arch/x86/kvm/xen.c | 62 +++++++++++++++++++++++++++++++++++----------- 3 files changed, 49 insertions(+), 16 deletions(-) diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 1f254d6d503f..5e7a4982fb90 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -2130,7 +2130,7 @@ static s64 get_kvmclock_base_ns(void) } #endif -void kvm_write_wall_clock(struct kvm *kvm, gpa_t wall_clock, int sec_hi_ofs) +static void kvm_write_wall_clock(struct kvm *kvm, gpa_t wall_clock, int sec_hi_ofs) { int version; int r; diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h index 4abcd8d9836d..da7031e80f23 100644 --- a/arch/x86/kvm/x86.h +++ b/arch/x86/kvm/x86.h @@ -301,7 +301,6 @@ static inline bool kvm_vcpu_latch_init(struct kvm_vcpu *vcpu) return is_smm(vcpu) || static_call(kvm_x86_apic_init_signal_blocked)(vcpu); } -void kvm_write_wall_clock(struct kvm *kvm, gpa_t wall_clock, int sec_hi_ofs); void kvm_inject_realmode_interrupt(struct kvm_vcpu *vcpu, int irq, int inc_eip); u64 get_kvmclock_ns(struct kvm *kvm); diff --git a/arch/x86/kvm/xen.c b/arch/x86/kvm/xen.c index ceddabd1f5c6..0e3f7d6e9fd7 100644 --- a/arch/x86/kvm/xen.c +++ b/arch/x86/kvm/xen.c @@ -25,8 +25,11 @@ DEFINE_STATIC_KEY_DEFERRED_FALSE(kvm_xen_enabled, HZ); static int kvm_xen_shared_info_init(struct kvm *kvm, gfn_t gfn) { struct gfn_to_pfn_cache *gpc = &kvm->arch.xen.shinfo_cache; + struct pvclock_wall_clock *wc; gpa_t gpa = gfn_to_gpa(gfn); - int wc_ofs, sec_hi_ofs; + u32 *wc_sec_hi; + u32 wc_version; + u64 wall_nsec; int ret = 0; int idx = srcu_read_lock(&kvm->srcu); @@ -35,32 +38,63 @@ static int kvm_xen_shared_info_init(struct kvm *kvm, gfn_t gfn) goto out; } - ret = kvm_gfn_to_pfn_cache_init(kvm, gpc, NULL, false, true, gpa, - PAGE_SIZE, false); - if (ret) - goto out; + do { + ret = kvm_gfn_to_pfn_cache_init(kvm, gpc, NULL, false, true, + gpa, PAGE_SIZE, false); + if (ret) + goto out; + + /* + * This code mirrors kvm_write_wall_clock() except that it writes + * directly through the pfn cache and doesn't mark the page dirty. + */ + wall_nsec = ktime_get_real_ns() - get_kvmclock_ns(kvm); + + /* It could be invalid again already, so we need to check */ + read_lock_irq(&gpc->lock); + + if (gpc->valid) + break; + + read_unlock_irq(&gpc->lock); + } while (1); /* Paranoia checks on the 32-bit struct layout */ BUILD_BUG_ON(offsetof(struct compat_shared_info, wc) != 0x900); BUILD_BUG_ON(offsetof(struct compat_shared_info, arch.wc_sec_hi) != 0x924); BUILD_BUG_ON(offsetof(struct pvclock_vcpu_time_info, version) != 0); - /* 32-bit location by default */ - wc_ofs = offsetof(struct compat_shared_info, wc); - sec_hi_ofs = offsetof(struct compat_shared_info, arch.wc_sec_hi); - #ifdef CONFIG_X86_64 /* Paranoia checks on the 64-bit struct layout */ BUILD_BUG_ON(offsetof(struct shared_info, wc) != 0xc00); BUILD_BUG_ON(offsetof(struct shared_info, wc_sec_hi) != 0xc0c); - if (kvm->arch.xen.long_mode) { - wc_ofs = offsetof(struct shared_info, wc); - sec_hi_ofs = offsetof(struct shared_info, wc_sec_hi); - } + if (IS_ENABLED(CONFIG_64BIT) && kvm->arch.xen.long_mode) { + struct shared_info *shinfo = gpc->khva; + + wc_sec_hi = &shinfo->wc_sec_hi; + wc = &shinfo->wc; + } else #endif + { + struct compat_shared_info *shinfo = gpc->khva; + + wc_sec_hi = &shinfo->arch.wc_sec_hi; + wc = &shinfo->wc; + } + + /* Increment and ensure an odd value */ + wc_version = wc->version = (wc->version + 1) | 1; + smp_wmb(); + + wc->nsec = do_div(wall_nsec, 1000000000); + wc->sec = (u32)wall_nsec; + *wc_sec_hi = wall_nsec >> 32; + smp_wmb(); + + wc->version = wc_version + 1; + read_unlock_irq(&gpc->lock); - kvm_write_wall_clock(kvm, gpa + wc_ofs, sec_hi_ofs - wc_ofs); kvm_make_all_cpus_request(kvm, KVM_REQ_MASTERCLOCK_UPDATE); out: From patchwork Fri Dec 10 16:36:25 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: David Woodhouse X-Patchwork-Id: 12670137 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 8AA48C433EF for ; Fri, 10 Dec 2021 16:36:45 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S243674AbhLJQkT (ORCPT ); Fri, 10 Dec 2021 11:40:19 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:58286 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S243659AbhLJQkS (ORCPT ); Fri, 10 Dec 2021 11:40:18 -0500 Received: from desiato.infradead.org (desiato.infradead.org [IPv6:2001:8b0:10b:1:d65d:64ff:fe57:4e05]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id BEEC2C0617A2 for ; Fri, 10 Dec 2021 08:36:43 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed; d=infradead.org; s=desiato.20200630; h=Sender:Content-Transfer-Encoding: MIME-Version:References:In-Reply-To:Message-Id:Date:Subject:Cc:To:From: Reply-To:Content-Type:Content-ID:Content-Description; bh=Dy+EnFe1FApmZyKcptLtP5bmR0mNvgSWFdTb0xnTfpg=; b=jU22HAqWuJEvTxk9WlpqJzPSqC KeVfh4M0s59lHVYE5uYOy1DezU9tTK6rq56AKp404/2mdOH8czNrUbTJOoF187WPS0vODyGDhLYF7 lc2CLNn2Q/FVlhvunetCdqjYye0vJEyJ7QBdSrSt/D7zDRTix1XgJzmlN4uhz4tuzmFSMj+R2zkKj M1gNm8aC8O3DGPv9ri7ALDw/5JwnPz46MK4CPoan3FLjXUytQvzrColxBwnmJqvkMx3YmIsgA6ozh hEiATMIPa92gifFTN1WuLWskqdUbzmMr6shUB5m8k44X+SBVBkLf5G3SR1ESsCOnh3sH34X7d1efS QHU8jp/A==; Received: from i7.infradead.org ([2001:8b0:10b:1:21e:67ff:fecb:7a92]) by desiato.infradead.org with esmtpsa (Exim 4.94.2 #2 (Red Hat Linux)) id 1mvisl-000abx-LD; Fri, 10 Dec 2021 16:36:27 +0000 Received: from dwoodhou by i7.infradead.org with local (Exim 4.94.2 #2 (Red Hat Linux)) id 1mvisl-0000lr-AN; Fri, 10 Dec 2021 16:36:27 +0000 From: David Woodhouse To: Paolo Bonzini , kvm Cc: Boris Ostrovsky , Joao Martins , "jmattson @ google . com" , "wanpengli @ tencent . com" , "seanjc @ google . com" , "vkuznets @ redhat . com" , "mtosatti @ redhat . com" , "joro @ 8bytes . org" , karahmed@amazon.com, butt3rflyh4ck Subject: [PATCH v6 6/6] KVM: x86: First attempt at converting nested virtual APIC page to gpc Date: Fri, 10 Dec 2021 16:36:25 +0000 Message-Id: <20211210163625.2886-7-dwmw2@infradead.org> X-Mailer: git-send-email 2.31.1 In-Reply-To: <20211210163625.2886-1-dwmw2@infradead.org> References: <20211210163625.2886-1-dwmw2@infradead.org> MIME-Version: 1.0 Sender: David Woodhouse X-SRS-Rewrite: SMTP reverse-path rewritten from by desiato.infradead.org. See http://www.infradead.org/rpr.html Precedence: bulk List-ID: X-Mailing-List: kvm@vger.kernel.org From: David Woodhouse This is what evolved during the discussion at https://lore.kernel.org/kvm/960E233F-EC0B-4FB5-BA2E-C8D2CCB38B12@infradead.org/T/#m11d75fcfe2da357ec1dabba0d0e3abb91fd13665 As discussed, an alternative approach might be to augment kvm_arch_memslots_updated() to raise KVM_REQ_GET_NESTED_STATE_PAGES to each vCPU (and make that req only do anything on a given vCPU if that vCPU is actually in L2 guest mode). That would mean the reload gets actively triggered even on memslot changes rather than only on MMU notifiers as is the case now. It could *potentially* mean we can drop the new 'check_guest_maps' function. The 'check_guest_maps' function could be a lot simpler than it is, though. It only really needs to get kvm->memslots->generation, then check each gpc->generation against that, and each gpc->valid. Also I suspect we *shouldn't* destroy the virtual_apic_cache in nested_vmx_vmexit(). We can just leave it there for next time the vCPU enters guest mode. If it happens to get invalidated in the meantime, that's fine and we'll refresh it on the way back in. We probably *would* want to actively do something on memslot changes in that case though, to ensure that even if the vCPU isn't in guest mode any more, we *release* the cached page. Signed-off-by: David Woodhouse --- arch/x86/include/asm/kvm_host.h | 1 + arch/x86/kvm/vmx/nested.c | 50 ++++++++++++++++++++++++++++----- arch/x86/kvm/vmx/vmx.c | 12 +++++--- arch/x86/kvm/vmx/vmx.h | 2 +- arch/x86/kvm/x86.c | 10 +++++++ 5 files changed, 63 insertions(+), 12 deletions(-) diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index 392d13c36083..8216ae8d1b38 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -1517,6 +1517,7 @@ struct kvm_x86_nested_ops { int (*enable_evmcs)(struct kvm_vcpu *vcpu, uint16_t *vmcs_version); uint16_t (*get_evmcs_version)(struct kvm_vcpu *vcpu); + void (*check_guest_maps)(struct kvm_vcpu *vcpu); }; struct kvm_x86_init_ops { diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c index 2f6f465e575f..09fab0f3c472 100644 --- a/arch/x86/kvm/vmx/nested.c +++ b/arch/x86/kvm/vmx/nested.c @@ -315,7 +315,7 @@ static void free_nested(struct kvm_vcpu *vcpu) kvm_release_page_clean(vmx->nested.apic_access_page); vmx->nested.apic_access_page = NULL; } - kvm_vcpu_unmap(vcpu, &vmx->nested.virtual_apic_map, true); + kvm_gfn_to_pfn_cache_destroy(vcpu->kvm, &vmx->nested.virtual_apic_cache); kvm_vcpu_unmap(vcpu, &vmx->nested.pi_desc_map, true); vmx->nested.pi_desc = NULL; @@ -3199,10 +3199,12 @@ static bool nested_get_vmcs12_pages(struct kvm_vcpu *vcpu) } if (nested_cpu_has(vmcs12, CPU_BASED_TPR_SHADOW)) { - map = &vmx->nested.virtual_apic_map; + struct gfn_to_pfn_cache *gpc = &vmx->nested.virtual_apic_cache; - if (!kvm_vcpu_map(vcpu, gpa_to_gfn(vmcs12->virtual_apic_page_addr), map)) { - vmcs_write64(VIRTUAL_APIC_PAGE_ADDR, pfn_to_hpa(map->pfn)); + if (!kvm_gfn_to_pfn_cache_init(vcpu->kvm, gpc, vcpu, true, true, + vmcs12->virtual_apic_page_addr, + PAGE_SIZE, true)) { + vmcs_write64(VIRTUAL_APIC_PAGE_ADDR, pfn_to_hpa(gpc->pfn)); } else if (nested_cpu_has(vmcs12, CPU_BASED_CR8_LOAD_EXITING) && nested_cpu_has(vmcs12, CPU_BASED_CR8_STORE_EXITING) && !nested_cpu_has2(vmcs12, SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES)) { @@ -3227,6 +3229,9 @@ static bool nested_get_vmcs12_pages(struct kvm_vcpu *vcpu) if (nested_cpu_has_posted_intr(vmcs12)) { map = &vmx->nested.pi_desc_map; + if (kvm_vcpu_mapped(map)) + kvm_vcpu_unmap(vcpu, map, true); + if (!kvm_vcpu_map(vcpu, gpa_to_gfn(vmcs12->posted_intr_desc_addr), map)) { vmx->nested.pi_desc = (struct pi_desc *)(((void *)map->hva) + @@ -3271,6 +3276,29 @@ static bool vmx_get_nested_state_pages(struct kvm_vcpu *vcpu) return true; } +static void nested_vmx_check_guest_maps(struct kvm_vcpu *vcpu) +{ + struct vmcs12 *vmcs12 = get_vmcs12(vcpu); + struct vcpu_vmx *vmx = to_vmx(vcpu); + struct gfn_to_pfn_cache *gpc; + + int valid; + + if (nested_cpu_has_posted_intr(vmcs12)) { + gpc = &vmx->nested.virtual_apic_cache; + + read_lock(&gpc->lock); + valid = kvm_gfn_to_pfn_cache_check(vcpu->kvm, gpc, + vmcs12->virtual_apic_page_addr, + PAGE_SIZE); + read_unlock(&gpc->lock); + if (!valid) { + kvm_make_request(KVM_REQ_GET_NESTED_STATE_PAGES, vcpu); + return; + } + } +} + static int nested_vmx_write_pml_buffer(struct kvm_vcpu *vcpu, gpa_t gpa) { struct vmcs12 *vmcs12; @@ -3768,9 +3796,15 @@ static int vmx_complete_nested_posted_interrupt(struct kvm_vcpu *vcpu) max_irr = find_last_bit((unsigned long *)vmx->nested.pi_desc->pir, 256); if (max_irr != 256) { - vapic_page = vmx->nested.virtual_apic_map.hva; - if (!vapic_page) + struct gfn_to_pfn_cache *gpc = &vmx->nested.virtual_apic_cache; + + read_lock(&gpc->lock); + if (!kvm_gfn_to_pfn_cache_check(vcpu->kvm, gpc, gpc->gpa, PAGE_SIZE)) { + read_unlock(&gpc->lock); goto mmio_needed; + } + + vapic_page = gpc->khva; __kvm_apic_update_irr(vmx->nested.pi_desc->pir, vapic_page, &max_irr); @@ -3780,6 +3814,7 @@ static int vmx_complete_nested_posted_interrupt(struct kvm_vcpu *vcpu) status |= (u8)max_irr; vmcs_write16(GUEST_INTR_STATUS, status); } + read_unlock(&gpc->lock); } nested_mark_vmcs12_pages_dirty(vcpu); @@ -4599,7 +4634,7 @@ void nested_vmx_vmexit(struct kvm_vcpu *vcpu, u32 vm_exit_reason, kvm_release_page_clean(vmx->nested.apic_access_page); vmx->nested.apic_access_page = NULL; } - kvm_vcpu_unmap(vcpu, &vmx->nested.virtual_apic_map, true); + kvm_gfn_to_pfn_cache_unmap(vcpu->kvm, &vmx->nested.virtual_apic_cache); kvm_vcpu_unmap(vcpu, &vmx->nested.pi_desc_map, true); vmx->nested.pi_desc = NULL; @@ -6776,4 +6811,5 @@ struct kvm_x86_nested_ops vmx_nested_ops = { .write_log_dirty = nested_vmx_write_pml_buffer, .enable_evmcs = nested_enable_evmcs, .get_evmcs_version = nested_get_evmcs_version, + .check_guest_maps = nested_vmx_check_guest_maps, }; diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c index 63615d242bdf..d7c36ca544e3 100644 --- a/arch/x86/kvm/vmx/vmx.c +++ b/arch/x86/kvm/vmx/vmx.c @@ -3852,19 +3852,23 @@ void pt_update_intercept_for_msr(struct kvm_vcpu *vcpu) static bool vmx_guest_apic_has_interrupt(struct kvm_vcpu *vcpu) { struct vcpu_vmx *vmx = to_vmx(vcpu); - void *vapic_page; + struct gfn_to_pfn_cache *gpc = &vmx->nested.virtual_apic_cache; u32 vppr; int rvi; if (WARN_ON_ONCE(!is_guest_mode(vcpu)) || !nested_cpu_has_vid(get_vmcs12(vcpu)) || - WARN_ON_ONCE(!vmx->nested.virtual_apic_map.gfn)) + WARN_ON_ONCE(gpc->gpa == GPA_INVALID)) return false; rvi = vmx_get_rvi(); - vapic_page = vmx->nested.virtual_apic_map.hva; - vppr = *((u32 *)(vapic_page + APIC_PROCPRI)); + read_lock(&gpc->lock); + if (!kvm_gfn_to_pfn_cache_check(vcpu->kvm, gpc, gpc->gpa, PAGE_SIZE)) + vppr = *((u32 *)(gpc->khva + APIC_PROCPRI)); + else + vppr = 0xff; + read_unlock(&gpc->lock); return ((rvi & 0xf0) > (vppr & 0xf0)); } diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h index 6c2c1aff1c3d..400e7bed11fc 100644 --- a/arch/x86/kvm/vmx/vmx.h +++ b/arch/x86/kvm/vmx/vmx.h @@ -204,7 +204,7 @@ struct nested_vmx { * pointers, so we must keep them pinned while L2 runs. */ struct page *apic_access_page; - struct kvm_host_map virtual_apic_map; + struct gfn_to_pfn_cache virtual_apic_cache; struct kvm_host_map pi_desc_map; struct kvm_host_map msr_bitmap_map; diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 5e7a4982fb90..e9d89bc3f7dd 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -9826,6 +9826,8 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu) if (kvm_check_request(KVM_REQ_UPDATE_CPU_DIRTY_LOGGING, vcpu)) static_call(kvm_x86_update_cpu_dirty_logging)(vcpu); + if (kvm_check_request(KVM_REQ_GPC_INVALIDATE, vcpu)) + ; /* Nothing to do. It just wanted to wake us */ } if (kvm_check_request(KVM_REQ_EVENT, vcpu) || req_int_win || @@ -9872,6 +9874,14 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu) local_irq_disable(); vcpu->mode = IN_GUEST_MODE; + /* + * If the guest requires direct access to mapped L1 pages, check + * the caches are valid. Will raise KVM_REQ_GET_NESTED_STATE_PAGES + * to go and revalidate them, if necessary. + */ + if (is_guest_mode(vcpu) && kvm_x86_ops.nested_ops->check_guest_maps) + kvm_x86_ops.nested_ops->check_guest_maps(vcpu); + srcu_read_unlock(&vcpu->kvm->srcu, vcpu->srcu_idx); /*