From patchwork Thu Sep 29 07:03:38 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Chenyi Qiang X-Patchwork-Id: 12993566 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id C1ABFC04A95 for ; Thu, 29 Sep 2022 06:56:54 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S234910AbiI2G4w (ORCPT ); Thu, 29 Sep 2022 02:56:52 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:45900 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S234978AbiI2G4u (ORCPT ); Thu, 29 Sep 2022 02:56:50 -0400 Received: from mga09.intel.com (mga09.intel.com [134.134.136.24]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id A631612B4BB for ; Wed, 28 Sep 2022 23:56:47 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1664434607; x=1695970607; h=from:to:cc:subject:date:message-id:in-reply-to: references; bh=/6/VqwylSrczNCuoRPX4zu1X/zz8KeLytqpXmkoMnUA=; b=i1kbpqpRQlXFNgilolC7fnQC4lmjqqZuTypciQ6IcgHFp7Oa+3rl6RxS n4LI8E6FD9xoyRaprJ2pA8QZMeW75Xdf9bUXHu16qTAGezTe+iKjbbFfq 76aa0ES9fNfjbhFUZS9JfCEOB+0DW7eS+BrhGdPx9bXnqP4YqNFrTGRtR bdP/A4hQq/DWWRzCaeeaEwbH7v70Mj5lviiGgsvEiHm3tC0mzpscP74Ud PgWLexWYLe8BEev7/2xD/kx1zY2OJRIiHD5vCVx7EdN5N72E8SmY6ntz0 vMl9RA3a8+opicgAbjyp/xnhdZxEFVPkIBTYsydmYv7+IK9JY6NdSlVgj w==; X-IronPort-AV: E=McAfee;i="6500,9779,10484"; a="302724723" X-IronPort-AV: E=Sophos;i="5.93,354,1654585200"; d="scan'208";a="302724723" Received: from fmsmga003.fm.intel.com ([10.253.24.29]) by orsmga102.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 28 Sep 2022 23:56:47 -0700 X-IronPort-AV: E=McAfee;i="6500,9779,10484"; a="711268481" X-IronPort-AV: E=Sophos;i="5.93,354,1654585200"; d="scan'208";a="711268481" Received: from chenyi-pc.sh.intel.com ([10.239.159.53]) by fmsmga003-auth.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 28 Sep 2022 23:56:44 -0700 From: Chenyi Qiang To: Paolo Bonzini , Marcelo Tosatti , Richard Henderson , Eduardo Habkost , Peter Xu , Xiaoyao Li Cc: Chenyi Qiang , qemu-devel@nongnu.org, kvm@vger.kernel.org Subject: [PATCH v8 1/4] i386: kvm: extend kvm_{get, put}_vcpu_events to support pending triple fault Date: Thu, 29 Sep 2022 15:03:38 +0800 Message-Id: <20220929070341.4846-2-chenyi.qiang@intel.com> X-Mailer: git-send-email 2.17.1 In-Reply-To: <20220929070341.4846-1-chenyi.qiang@intel.com> References: <20220929070341.4846-1-chenyi.qiang@intel.com> Precedence: bulk List-ID: X-Mailing-List: kvm@vger.kernel.org For the direct triple faults, i.e. hardware detected and KVM morphed to VM-Exit, KVM will never lose them. But for triple faults sythesized by KVM, e.g. the RSM path, if KVM exits to userspace before the request is serviced, userspace could migrate the VM and lose the triple fault. A new flag KVM_VCPUEVENT_VALID_TRIPLE_FAULT is defined to signal that the event.triple_fault_pending field contains a valid state if the KVM_CAP_X86_TRIPLE_FAULT_EVENT capability is enabled. Acked-by: Peter Xu Signed-off-by: Chenyi Qiang --- target/i386/cpu.c | 1 + target/i386/cpu.h | 1 + target/i386/kvm/kvm.c | 20 ++++++++++++++++++++ target/i386/machine.c | 20 ++++++++++++++++++++ 4 files changed, 42 insertions(+) diff --git a/target/i386/cpu.c b/target/i386/cpu.c index 1db1278a59..6e107466b3 100644 --- a/target/i386/cpu.c +++ b/target/i386/cpu.c @@ -6017,6 +6017,7 @@ static void x86_cpu_reset(DeviceState *dev) env->exception_has_payload = false; env->exception_payload = 0; env->nmi_injected = false; + env->triple_fault_pending = false; #if !defined(CONFIG_USER_ONLY) /* We hard-wire the BSP to the first CPU. */ apic_designate_bsp(cpu->apic_state, s->cpu_index == 0); diff --git a/target/i386/cpu.h b/target/i386/cpu.h index 82004b65b9..d4124973ce 100644 --- a/target/i386/cpu.h +++ b/target/i386/cpu.h @@ -1739,6 +1739,7 @@ typedef struct CPUArchState { uint8_t has_error_code; uint8_t exception_has_payload; uint64_t exception_payload; + uint8_t triple_fault_pending; uint32_t ins_len; uint32_t sipi_vector; bool tsc_valid; diff --git a/target/i386/kvm/kvm.c b/target/i386/kvm/kvm.c index a1fd1f5379..3838827134 100644 --- a/target/i386/kvm/kvm.c +++ b/target/i386/kvm/kvm.c @@ -132,6 +132,7 @@ static int has_xcrs; static int has_pit_state2; static int has_sregs2; static int has_exception_payload; +static int has_triple_fault_event; static bool has_msr_mcg_ext_ctl; @@ -2483,6 +2484,16 @@ int kvm_arch_init(MachineState *ms, KVMState *s) } } + has_triple_fault_event = kvm_check_extension(s, KVM_CAP_X86_TRIPLE_FAULT_EVENT); + if (has_triple_fault_event) { + ret = kvm_vm_enable_cap(s, KVM_CAP_X86_TRIPLE_FAULT_EVENT, 0, true); + if (ret < 0) { + error_report("kvm: Failed to enable triple fault event cap: %s", + strerror(-ret)); + return ret; + } + } + ret = kvm_get_supported_msrs(s); if (ret < 0) { return ret; @@ -4299,6 +4310,11 @@ static int kvm_put_vcpu_events(X86CPU *cpu, int level) } } + if (has_triple_fault_event) { + events.flags |= KVM_VCPUEVENT_VALID_TRIPLE_FAULT; + events.triple_fault.pending = env->triple_fault_pending; + } + return kvm_vcpu_ioctl(CPU(cpu), KVM_SET_VCPU_EVENTS, &events); } @@ -4368,6 +4384,10 @@ static int kvm_get_vcpu_events(X86CPU *cpu) } } + if (events.flags & KVM_VCPUEVENT_VALID_TRIPLE_FAULT) { + env->triple_fault_pending = events.triple_fault.pending; + } + env->sipi_vector = events.sipi_vector; return 0; diff --git a/target/i386/machine.c b/target/i386/machine.c index cecd476e98..310b125235 100644 --- a/target/i386/machine.c +++ b/target/i386/machine.c @@ -1562,6 +1562,25 @@ static const VMStateDescription vmstate_arch_lbr = { } }; +static bool triple_fault_needed(void *opaque) +{ + X86CPU *cpu = opaque; + CPUX86State *env = &cpu->env; + + return env->triple_fault_pending; +} + +static const VMStateDescription vmstate_triple_fault = { + .name = "cpu/triple_fault", + .version_id = 1, + .minimum_version_id = 1, + .needed = triple_fault_needed, + .fields = (VMStateField[]) { + VMSTATE_UINT8(env.triple_fault_pending, X86CPU), + VMSTATE_END_OF_LIST() + } +}; + const VMStateDescription vmstate_x86_cpu = { .name = "cpu", .version_id = 12, @@ -1706,6 +1725,7 @@ const VMStateDescription vmstate_x86_cpu = { &vmstate_amx_xtile, #endif &vmstate_arch_lbr, + &vmstate_triple_fault, NULL } }; From patchwork Thu Sep 29 07:03:39 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Chenyi Qiang X-Patchwork-Id: 12993567 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id AA5DDC07E9D for ; Thu, 29 Sep 2022 06:56:56 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S234982AbiI2G4z (ORCPT ); Thu, 29 Sep 2022 02:56:55 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:45944 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S234970AbiI2G4u (ORCPT ); Thu, 29 Sep 2022 02:56:50 -0400 Received: from mga09.intel.com (mga09.intel.com [134.134.136.24]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id C952B12EDB8 for ; Wed, 28 Sep 2022 23:56:49 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1664434609; x=1695970609; h=from:to:cc:subject:date:message-id:in-reply-to: references; bh=uQGTvFkRu2suvO/BBfMCi7VX8taOodq9QWl6FH4+5vk=; b=hr5bytRnSZSv/a/3Rq67DcQNLpZ5BL/Pqln3rOtXOtAR9KrsdAjG7uZr 5ZapnKGIuwvz0F1PpZ7mIT/MlyFdGpFuAnYj1g006VvdLkxXZoSaoyqoP XqUBzVl9oUONZ62vuVK4+UvoZAqQWLEmbHP4ruibsB0zYS6wKfcghud3j buycaccWK31ScBOnx9uOY0GpB/8odI0wKMzYXM1Gmeec5x7j+zFM4PSMo cx6jfAkFo1cEF3gaKih+lgMyU7b5BalyeSXpm/rImnxjOtAN6FmnmWwtq 55bPExsc49u1COURKOZka65cH4EoW4O9GZMEqK5SBaOxHTMja70znZDpH g==; X-IronPort-AV: E=McAfee;i="6500,9779,10484"; a="302724732" X-IronPort-AV: E=Sophos;i="5.93,354,1654585200"; d="scan'208";a="302724732" Received: from fmsmga003.fm.intel.com ([10.253.24.29]) by orsmga102.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 28 Sep 2022 23:56:49 -0700 X-IronPort-AV: E=McAfee;i="6500,9779,10484"; a="711268512" X-IronPort-AV: E=Sophos;i="5.93,354,1654585200"; d="scan'208";a="711268512" Received: from chenyi-pc.sh.intel.com ([10.239.159.53]) by fmsmga003-auth.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 28 Sep 2022 23:56:47 -0700 From: Chenyi Qiang To: Paolo Bonzini , Marcelo Tosatti , Richard Henderson , Eduardo Habkost , Peter Xu , Xiaoyao Li Cc: qemu-devel@nongnu.org, kvm@vger.kernel.org Subject: [PATCH v8 2/4] kvm: allow target-specific accelerator properties Date: Thu, 29 Sep 2022 15:03:39 +0800 Message-Id: <20220929070341.4846-3-chenyi.qiang@intel.com> X-Mailer: git-send-email 2.17.1 In-Reply-To: <20220929070341.4846-1-chenyi.qiang@intel.com> References: <20220929070341.4846-1-chenyi.qiang@intel.com> Precedence: bulk List-ID: X-Mailing-List: kvm@vger.kernel.org From: Paolo Bonzini Several hypervisor capabilities in KVM are target-specific. When exposed to QEMU users as accelerator properties (i.e. -accel kvm,prop=value), they should not be available for all targets. Add a hook for targets to add their own properties to -accel kvm, for now no such property is defined. Signed-off-by: Paolo Bonzini --- accel/kvm/kvm-all.c | 2 ++ include/sysemu/kvm.h | 2 ++ target/arm/kvm.c | 4 ++++ target/i386/kvm/kvm.c | 4 ++++ target/mips/kvm.c | 4 ++++ target/ppc/kvm.c | 4 ++++ target/riscv/kvm.c | 4 ++++ target/s390x/kvm/kvm.c | 4 ++++ 8 files changed, 28 insertions(+) diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c index 5acab1767f..f90c5cb285 100644 --- a/accel/kvm/kvm-all.c +++ b/accel/kvm/kvm-all.c @@ -3737,6 +3737,8 @@ static void kvm_accel_class_init(ObjectClass *oc, void *data) NULL, NULL); object_class_property_set_description(oc, "dirty-ring-size", "Size of KVM dirty page ring buffer (default: 0, i.e. use bitmap)"); + + kvm_arch_accel_class_init(oc); } static const TypeInfo kvm_accel_type = { diff --git a/include/sysemu/kvm.h b/include/sysemu/kvm.h index efd6dee818..50868ebf60 100644 --- a/include/sysemu/kvm.h +++ b/include/sysemu/kvm.h @@ -353,6 +353,8 @@ bool kvm_device_supported(int vmfd, uint64_t type); extern const KVMCapabilityInfo kvm_arch_required_capabilities[]; +void kvm_arch_accel_class_init(ObjectClass *oc); + void kvm_arch_pre_run(CPUState *cpu, struct kvm_run *run); MemTxAttrs kvm_arch_post_run(CPUState *cpu, struct kvm_run *run); diff --git a/target/arm/kvm.c b/target/arm/kvm.c index e5c1bd50d2..d21603cf28 100644 --- a/target/arm/kvm.c +++ b/target/arm/kvm.c @@ -1056,3 +1056,7 @@ bool kvm_arch_cpu_check_are_resettable(void) { return true; } + +void kvm_arch_accel_class_init(ObjectClass *oc) +{ +} diff --git a/target/i386/kvm/kvm.c b/target/i386/kvm/kvm.c index 3838827134..eab09833f9 100644 --- a/target/i386/kvm/kvm.c +++ b/target/i386/kvm/kvm.c @@ -5472,3 +5472,7 @@ void kvm_request_xsave_components(X86CPU *cpu, uint64_t mask) mask &= ~BIT_ULL(bit); } } + +void kvm_arch_accel_class_init(ObjectClass *oc) +{ +} diff --git a/target/mips/kvm.c b/target/mips/kvm.c index caf70decd2..bcb8e06b2c 100644 --- a/target/mips/kvm.c +++ b/target/mips/kvm.c @@ -1294,3 +1294,7 @@ bool kvm_arch_cpu_check_are_resettable(void) { return true; } + +void kvm_arch_accel_class_init(ObjectClass *oc) +{ +} diff --git a/target/ppc/kvm.c b/target/ppc/kvm.c index 466d0d2f4c..7c25348b7b 100644 --- a/target/ppc/kvm.c +++ b/target/ppc/kvm.c @@ -2966,3 +2966,7 @@ bool kvm_arch_cpu_check_are_resettable(void) { return true; } + +void kvm_arch_accel_class_init(ObjectClass *oc) +{ +} diff --git a/target/riscv/kvm.c b/target/riscv/kvm.c index 70b4cff06f..30f21453d6 100644 --- a/target/riscv/kvm.c +++ b/target/riscv/kvm.c @@ -532,3 +532,7 @@ bool kvm_arch_cpu_check_are_resettable(void) { return true; } + +void kvm_arch_accel_class_init(ObjectClass *oc) +{ +} diff --git a/target/s390x/kvm/kvm.c b/target/s390x/kvm/kvm.c index 6a8dbadf7e..508c24cfec 100644 --- a/target/s390x/kvm/kvm.c +++ b/target/s390x/kvm/kvm.c @@ -2581,3 +2581,7 @@ int kvm_s390_get_zpci_op(void) { return cap_zpci_op; } + +void kvm_arch_accel_class_init(ObjectClass *oc) +{ +} From patchwork Thu Sep 29 07:03:40 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Chenyi Qiang X-Patchwork-Id: 12993568 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 48813C04A95 for ; Thu, 29 Sep 2022 06:57:02 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S234983AbiI2G47 (ORCPT ); Thu, 29 Sep 2022 02:56:59 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:45944 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S234976AbiI2G4x (ORCPT ); Thu, 29 Sep 2022 02:56:53 -0400 Received: from mga09.intel.com (mga09.intel.com [134.134.136.24]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 4324412B4B8 for ; Wed, 28 Sep 2022 23:56:52 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1664434612; x=1695970612; h=from:to:cc:subject:date:message-id:in-reply-to: references; bh=1HHLTf+mZEpAZdLpZzQsDowRVBDqPVQnup/F4dRvES4=; b=Mz1OxtGBg//+umzROYB+LZgx2qYeTvMvb8dWMWoSguUpW9XyWtxsdACe JOYRpsPDbeYGpzRpIwVJEy3avS7N5DLJhLHGCwhLStP3uj1GYiUCU5Hjk kc4Tw3rZTvfoT5bh6Zzb5a00zCetLqbjOlo8AjZuuOyjjsKKdLoYkwYy5 tH1z/gZRCHw47/Ly5Hi0Bqd9cZSeizdEgAniIK6Mhf9QoNlQzOvBkFLmR XKfy3o4D4M4D+7wtcwZlpQKrkahmINTEsSsjgYZSbyi3RsZMi81Epj9QF RwocebsoHLkG1gMk4iUesmG+iBA2O0tE1/UxkMlYNTFohr7V4qiD1L8zS g==; X-IronPort-AV: E=McAfee;i="6500,9779,10484"; a="302724736" X-IronPort-AV: E=Sophos;i="5.93,354,1654585200"; d="scan'208";a="302724736" Received: from fmsmga003.fm.intel.com ([10.253.24.29]) by orsmga102.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 28 Sep 2022 23:56:51 -0700 X-IronPort-AV: E=McAfee;i="6500,9779,10484"; a="711268548" X-IronPort-AV: E=Sophos;i="5.93,354,1654585200"; d="scan'208";a="711268548" Received: from chenyi-pc.sh.intel.com ([10.239.159.53]) by fmsmga003-auth.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 28 Sep 2022 23:56:49 -0700 From: Chenyi Qiang To: Paolo Bonzini , Marcelo Tosatti , Richard Henderson , Eduardo Habkost , Peter Xu , Xiaoyao Li Cc: Chenyi Qiang , qemu-devel@nongnu.org, kvm@vger.kernel.org Subject: [PATCH v8 3/4] kvm: expose struct KVMState Date: Thu, 29 Sep 2022 15:03:40 +0800 Message-Id: <20220929070341.4846-4-chenyi.qiang@intel.com> X-Mailer: git-send-email 2.17.1 In-Reply-To: <20220929070341.4846-1-chenyi.qiang@intel.com> References: <20220929070341.4846-1-chenyi.qiang@intel.com> Precedence: bulk List-ID: X-Mailing-List: kvm@vger.kernel.org Expose struct KVMState out of kvm-all.c so that the field of struct KVMState can be accessed when defining target-specific accelerator properties. Signed-off-by: Chenyi Qiang --- accel/kvm/kvm-all.c | 74 --------------------------------------- include/sysemu/kvm_int.h | 75 ++++++++++++++++++++++++++++++++++++++++ 2 files changed, 75 insertions(+), 74 deletions(-) diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c index f90c5cb285..3624ed8447 100644 --- a/accel/kvm/kvm-all.c +++ b/accel/kvm/kvm-all.c @@ -77,86 +77,12 @@ do { } while (0) #endif -#define KVM_MSI_HASHTAB_SIZE 256 - struct KVMParkedVcpu { unsigned long vcpu_id; int kvm_fd; QLIST_ENTRY(KVMParkedVcpu) node; }; -enum KVMDirtyRingReaperState { - KVM_DIRTY_RING_REAPER_NONE = 0, - /* The reaper is sleeping */ - KVM_DIRTY_RING_REAPER_WAIT, - /* The reaper is reaping for dirty pages */ - KVM_DIRTY_RING_REAPER_REAPING, -}; - -/* - * KVM reaper instance, responsible for collecting the KVM dirty bits - * via the dirty ring. - */ -struct KVMDirtyRingReaper { - /* The reaper thread */ - QemuThread reaper_thr; - volatile uint64_t reaper_iteration; /* iteration number of reaper thr */ - volatile enum KVMDirtyRingReaperState reaper_state; /* reap thr state */ -}; - -struct KVMState -{ - AccelState parent_obj; - - int nr_slots; - int fd; - int vmfd; - int coalesced_mmio; - int coalesced_pio; - struct kvm_coalesced_mmio_ring *coalesced_mmio_ring; - bool coalesced_flush_in_progress; - int vcpu_events; - int robust_singlestep; - int debugregs; -#ifdef KVM_CAP_SET_GUEST_DEBUG - QTAILQ_HEAD(, kvm_sw_breakpoint) kvm_sw_breakpoints; -#endif - int max_nested_state_len; - int many_ioeventfds; - int intx_set_mask; - int kvm_shadow_mem; - bool kernel_irqchip_allowed; - bool kernel_irqchip_required; - OnOffAuto kernel_irqchip_split; - bool sync_mmu; - uint64_t manual_dirty_log_protect; - /* The man page (and posix) say ioctl numbers are signed int, but - * they're not. Linux, glibc and *BSD all treat ioctl numbers as - * unsigned, and treating them as signed here can break things */ - unsigned irq_set_ioctl; - unsigned int sigmask_len; - GHashTable *gsimap; -#ifdef KVM_CAP_IRQ_ROUTING - struct kvm_irq_routing *irq_routes; - int nr_allocated_irq_routes; - unsigned long *used_gsi_bitmap; - unsigned int gsi_count; - QTAILQ_HEAD(, KVMMSIRoute) msi_hashtab[KVM_MSI_HASHTAB_SIZE]; -#endif - KVMMemoryListener memory_listener; - QLIST_HEAD(, KVMParkedVcpu) kvm_parked_vcpus; - - /* For "info mtree -f" to tell if an MR is registered in KVM */ - int nr_as; - struct KVMAs { - KVMMemoryListener *ml; - AddressSpace *as; - } *as; - uint64_t kvm_dirty_ring_bytes; /* Size of the per-vcpu dirty ring */ - uint32_t kvm_dirty_ring_size; /* Number of dirty GFNs per ring */ - struct KVMDirtyRingReaper reaper; -}; - KVMState *kvm_state; bool kvm_kernel_irqchip; bool kvm_split_irqchip; diff --git a/include/sysemu/kvm_int.h b/include/sysemu/kvm_int.h index 1f5487d9b7..07394744ad 100644 --- a/include/sysemu/kvm_int.h +++ b/include/sysemu/kvm_int.h @@ -36,6 +36,81 @@ typedef struct KVMMemoryListener { int as_id; } KVMMemoryListener; +#define KVM_MSI_HASHTAB_SIZE 256 + +enum KVMDirtyRingReaperState { + KVM_DIRTY_RING_REAPER_NONE = 0, + /* The reaper is sleeping */ + KVM_DIRTY_RING_REAPER_WAIT, + /* The reaper is reaping for dirty pages */ + KVM_DIRTY_RING_REAPER_REAPING, +}; + +/* + * KVM reaper instance, responsible for collecting the KVM dirty bits + * via the dirty ring. + */ +struct KVMDirtyRingReaper { + /* The reaper thread */ + QemuThread reaper_thr; + volatile uint64_t reaper_iteration; /* iteration number of reaper thr */ + volatile enum KVMDirtyRingReaperState reaper_state; /* reap thr state */ +}; +struct KVMState +{ + AccelState parent_obj; + + int nr_slots; + int fd; + int vmfd; + int coalesced_mmio; + int coalesced_pio; + struct kvm_coalesced_mmio_ring *coalesced_mmio_ring; + bool coalesced_flush_in_progress; + int vcpu_events; + int robust_singlestep; + int debugregs; +#ifdef KVM_CAP_SET_GUEST_DEBUG + QTAILQ_HEAD(, kvm_sw_breakpoint) kvm_sw_breakpoints; +#endif + int max_nested_state_len; + int many_ioeventfds; + int intx_set_mask; + int kvm_shadow_mem; + bool kernel_irqchip_allowed; + bool kernel_irqchip_required; + OnOffAuto kernel_irqchip_split; + bool sync_mmu; + uint64_t manual_dirty_log_protect; + /* The man page (and posix) say ioctl numbers are signed int, but + * they're not. Linux, glibc and *BSD all treat ioctl numbers as + * unsigned, and treating them as signed here can break things */ + unsigned irq_set_ioctl; + unsigned int sigmask_len; + GHashTable *gsimap; +#ifdef KVM_CAP_IRQ_ROUTING + struct kvm_irq_routing *irq_routes; + int nr_allocated_irq_routes; + unsigned long *used_gsi_bitmap; + unsigned int gsi_count; + QTAILQ_HEAD(, KVMMSIRoute) msi_hashtab[KVM_MSI_HASHTAB_SIZE]; +#endif + KVMMemoryListener memory_listener; + QLIST_HEAD(, KVMParkedVcpu) kvm_parked_vcpus; + + /* For "info mtree -f" to tell if an MR is registered in KVM */ + int nr_as; + struct KVMAs { + KVMMemoryListener *ml; + AddressSpace *as; + } *as; + uint64_t kvm_dirty_ring_bytes; /* Size of the per-vcpu dirty ring */ + uint32_t kvm_dirty_ring_size; /* Number of dirty GFNs per ring */ + struct KVMDirtyRingReaper reaper; + NotifyVmexitOption notify_vmexit; + uint32_t notify_window; +}; + void kvm_memory_listener_register(KVMState *s, KVMMemoryListener *kml, AddressSpace *as, int as_id, const char *name); From patchwork Thu Sep 29 07:03:41 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Chenyi Qiang X-Patchwork-Id: 12993569 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id AB031C04A95 for ; Thu, 29 Sep 2022 06:57:04 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S234970AbiI2G5C (ORCPT ); Thu, 29 Sep 2022 02:57:02 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:46194 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S234985AbiI2G44 (ORCPT ); Thu, 29 Sep 2022 02:56:56 -0400 Received: from mga09.intel.com (mga09.intel.com [134.134.136.24]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id DCE2112F74E for ; Wed, 28 Sep 2022 23:56:54 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1664434614; x=1695970614; h=from:to:cc:subject:date:message-id:in-reply-to: references; bh=SdArTuMBopFpNPIdNPr3Yg4/CQF85c8UIvdrsLpDT7Y=; b=RsfF/GA4E3iGu4cuQ8K/hZOAk5dtZ0BCaxK3UKlEH0E8Vec7x2/Dg5QH tGvhYWXprBxzP9683b28cEZpZmaeH/a4P/nhmmHdwNZDaDI0L/qpEmVKU +69wSohK2dc7I0UBDL4x80zYCoaxSbM92fyj1sAPHlZcw7TQWyFCm9QdG IrKm06WNghZ0WhQXn+A8fH+swiz1ekUvNvYfr5PQpdEee84DRnkqteyDr Edjw+QZ6z2kxMtc8qCZKZrRETClHp90KscnwXxOJXtIkxeQo9sWVuJSng W//ey2iCn1KO3kWdbPy0N25X1CHnMeRX0zWH3dp7wKHzmC5yyK7uI89eT g==; X-IronPort-AV: E=McAfee;i="6500,9779,10484"; a="302724741" X-IronPort-AV: E=Sophos;i="5.93,354,1654585200"; d="scan'208";a="302724741" Received: from fmsmga003.fm.intel.com ([10.253.24.29]) by orsmga102.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 28 Sep 2022 23:56:54 -0700 X-IronPort-AV: E=McAfee;i="6500,9779,10484"; a="711268559" X-IronPort-AV: E=Sophos;i="5.93,354,1654585200"; d="scan'208";a="711268559" Received: from chenyi-pc.sh.intel.com ([10.239.159.53]) by fmsmga003-auth.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 28 Sep 2022 23:56:51 -0700 From: Chenyi Qiang To: Paolo Bonzini , Marcelo Tosatti , Richard Henderson , Eduardo Habkost , Peter Xu , Xiaoyao Li Cc: Chenyi Qiang , qemu-devel@nongnu.org, kvm@vger.kernel.org Subject: [PATCH v8 4/4] i386: add notify VM exit support Date: Thu, 29 Sep 2022 15:03:41 +0800 Message-Id: <20220929070341.4846-5-chenyi.qiang@intel.com> X-Mailer: git-send-email 2.17.1 In-Reply-To: <20220929070341.4846-1-chenyi.qiang@intel.com> References: <20220929070341.4846-1-chenyi.qiang@intel.com> Precedence: bulk List-ID: X-Mailing-List: kvm@vger.kernel.org There are cases that malicious virtual machine can cause CPU stuck (due to event windows don't open up), e.g., infinite loop in microcode when nested #AC (CVE-2015-5307). No event window means no event (NMI, SMI and IRQ) can be delivered. It leads the CPU to be unavailable to host or other VMs. Notify VM exit is introduced to mitigate such kind of attacks, which will generate a VM exit if no event window occurs in VM non-root mode for a specified amount of time (notify window). A new KVM capability KVM_CAP_X86_NOTIFY_VMEXIT is exposed to user space so that the user can query the capability and set the expected notify window when creating VMs. The format of the argument when enabling this capability is as follows: Bit 63:32 - notify window specified in qemu command Bit 31:0 - some flags (e.g. KVM_X86_NOTIFY_VMEXIT_ENABLED is set to enable the feature.) Users can configure the feature by a new (x86 only) accel property: qemu -accel kvm,notify-vmexit=run|internal-error|disable,notify-window=n The default option of notify-vmexit is run, which will enable the capability and do nothing if the exit happens. The internal-error option raises a KVM internal error if it happens. The disable option does not enable the capability. The default value of notify-window is 0. It is valid only when notify-vmexit is not disabled. The valid range of notify-window is non-negative. It is even safe to set it to zero since there's an internal hardware threshold to be added to ensure no false positive. Because a notify VM exit may happen with VM_CONTEXT_INVALID set in exit qualification (no cases are anticipated that would set this bit), which means VM context is corrupted. It would be reflected in the flags of KVM_EXIT_NOTIFY exit. If KVM_NOTIFY_CONTEXT_INVALID bit is set, raise a KVM internal error unconditionally. Acked-by: Peter Xu Signed-off-by: Chenyi Qiang --- accel/kvm/kvm-all.c | 2 + qapi/run-state.json | 17 ++++++++ qemu-options.hx | 11 +++++ target/i386/kvm/kvm.c | 97 +++++++++++++++++++++++++++++++++++++++++++ 4 files changed, 127 insertions(+) diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c index 3624ed8447..41ba9de3b8 100644 --- a/accel/kvm/kvm-all.c +++ b/accel/kvm/kvm-all.c @@ -3636,6 +3636,8 @@ static void kvm_accel_instance_init(Object *obj) s->kernel_irqchip_split = ON_OFF_AUTO_AUTO; /* KVM dirty ring is by default off */ s->kvm_dirty_ring_size = 0; + s->notify_vmexit = NOTIFY_VMEXIT_OPTION_RUN; + s->notify_window = 0; } static void kvm_accel_class_init(ObjectClass *oc, void *data) diff --git a/qapi/run-state.json b/qapi/run-state.json index 9273ea6516..49989d30e6 100644 --- a/qapi/run-state.json +++ b/qapi/run-state.json @@ -643,3 +643,20 @@ { 'struct': 'MemoryFailureFlags', 'data': { 'action-required': 'bool', 'recursive': 'bool'} } + +## +# @NotifyVmexitOption: +# +# An enumeration of the options specified when enabling notify VM exit +# +# @run: enable the feature, do nothing and continue if the notify VM exit happens. +# +# @internal-error: enable the feature, raise a internal error if the notify +# VM exit happens. +# +# @disable: disable the feature. +# +# Since: 7.2 +## +{ 'enum': 'NotifyVmexitOption', + 'data': [ 'run', 'internal-error', 'disable' ] } \ No newline at end of file diff --git a/qemu-options.hx b/qemu-options.hx index 913c71e38f..8f85004a7d 100644 --- a/qemu-options.hx +++ b/qemu-options.hx @@ -191,6 +191,7 @@ DEF("accel", HAS_ARG, QEMU_OPTION_accel, " split-wx=on|off (enable TCG split w^x mapping)\n" " tb-size=n (TCG translation block cache size)\n" " dirty-ring-size=n (KVM dirty ring GFN count, default 0)\n" + " notify-vmexit=run|internal-error|disable,notify-window=n (enable notify VM exit and set notify window, x86 only)\n" " thread=single|multi (enable multi-threaded TCG)\n", QEMU_ARCH_ALL) SRST ``-accel name[,prop=value[,...]]`` @@ -242,6 +243,16 @@ SRST is disabled (dirty-ring-size=0). When enabled, KVM will instead record dirty pages in a bitmap. + ``notify-vmexit=run|internal-error|disable,notify-window=n`` + Enables or disables notify VM exit support on x86 host and specify + the corresponding notify window to trigger the VM exit if enabled. + ``run`` option enables the feature. It does nothing and continue + if the exit happens. ``internal-error`` option enables the feature. + It raises a internal error. ``disable`` option doesn't enable the feature. + This feature can mitigate the CPU stuck issue due to event windows don't + open up for a specified of time (i.e. notify-window). + Default: notify-vmexit=run,notify-window=0. + ERST DEF("smp", HAS_ARG, QEMU_OPTION_smp, diff --git a/target/i386/kvm/kvm.c b/target/i386/kvm/kvm.c index eab09833f9..55f521e4ec 100644 --- a/target/i386/kvm/kvm.c +++ b/target/i386/kvm/kvm.c @@ -15,6 +15,7 @@ #include "qemu/osdep.h" #include "qapi/qapi-events-run-state.h" #include "qapi/error.h" +#include "qapi/visitor.h" #include #include #include @@ -2599,6 +2600,21 @@ int kvm_arch_init(MachineState *ms, KVMState *s) } } + if (s->notify_vmexit != NOTIFY_VMEXIT_OPTION_DISABLE && + kvm_check_extension(s, KVM_CAP_X86_NOTIFY_VMEXIT)) { + uint64_t notify_window_flags = + ((uint64_t)s->notify_window << 32) | + KVM_X86_NOTIFY_VMEXIT_ENABLED | + KVM_X86_NOTIFY_VMEXIT_USER; + ret = kvm_vm_enable_cap(s, KVM_CAP_X86_NOTIFY_VMEXIT, 0, + notify_window_flags); + if (ret < 0) { + error_report("kvm: Failed to enable notify vmexit cap: %s", + strerror(-ret)); + return ret; + } + } + return 0; } @@ -5141,6 +5157,8 @@ int kvm_arch_handle_exit(CPUState *cs, struct kvm_run *run) X86CPU *cpu = X86_CPU(cs); uint64_t code; int ret; + bool ctx_invalid; + char str[256]; switch (run->exit_reason) { case KVM_EXIT_HLT: @@ -5196,6 +5214,21 @@ int kvm_arch_handle_exit(CPUState *cs, struct kvm_run *run) /* already handled in kvm_arch_post_run */ ret = 0; break; + case KVM_EXIT_NOTIFY: + KVMState *s = KVM_STATE(current_accel()); + ctx_invalid = !!(run->notify.flags & KVM_NOTIFY_CONTEXT_INVALID); + sprintf(str, "Encounter a notify exit with %svalid context in" + " guest. There can be possible misbehaves in guest." + " Please have a look.", ctx_invalid ? "in" : ""); + if (ctx_invalid || + s->notify_vmexit == NOTIFY_VMEXIT_OPTION_INTERNAL_ERROR) { + warn_report("KVM internal error: %s", str); + ret = -1; + } else { + warn_report_once("KVM: %s", str); + ret = 0; + } + break; default: fprintf(stderr, "KVM: unknown exit reason %d\n", run->exit_reason); ret = -1; @@ -5473,6 +5506,70 @@ void kvm_request_xsave_components(X86CPU *cpu, uint64_t mask) } } +static int kvm_arch_get_notify_vmexit(Object *obj, Error **errp) +{ + KVMState *s = KVM_STATE(obj); + return s->notify_vmexit; +} + +static void kvm_arch_set_notify_vmexit(Object *obj, int value, Error **errp) +{ + KVMState *s = KVM_STATE(obj); + + if (s->fd != -1) { + error_setg(errp, "Cannot set properties after the accelerator has been initialized"); + return; + } + + s->notify_vmexit = value; +} + +static void kvm_arch_get_notify_window(Object *obj, Visitor *v, + const char *name, void *opaque, + Error **errp) +{ + KVMState *s = KVM_STATE(obj); + uint32_t value = s->notify_window; + + visit_type_uint32(v, name, &value, errp); +} + +static void kvm_arch_set_notify_window(Object *obj, Visitor *v, + const char *name, void *opaque, + Error **errp) +{ + KVMState *s = KVM_STATE(obj); + Error *error = NULL; + uint32_t value; + + if (s->fd != -1) { + error_setg(errp, "Cannot set properties after the accelerator has been initialized"); + return; + } + + visit_type_uint32(v, name, &value, &error); + if (error) { + error_propagate(errp, error); + return; + } + + s->notify_window = value; +} + void kvm_arch_accel_class_init(ObjectClass *oc) { + object_class_property_add_enum(oc, "notify-vmexit", "NotifyVMexitOption", + &NotifyVmexitOption_lookup, + kvm_arch_get_notify_vmexit, + kvm_arch_set_notify_vmexit); + object_class_property_set_description(oc, "notify-vmexit", + "Enable notify VM exit"); + + object_class_property_add(oc, "notify-window", "uint32", + kvm_arch_get_notify_window, + kvm_arch_set_notify_window, + NULL, NULL); + object_class_property_set_description(oc, "notify-window", + "Clock cycles without an event window " + "after which a notification VM exit occurs"); }