From patchwork Thu Sep 29 07:20:11 2022
X-Patchwork-Submitter: Chenyi Qiang
X-Patchwork-Id: 12993588
From: Chenyi Qiang
To: Paolo Bonzini , Marcelo Tosatti , Richard Henderson , Eduardo Habkost , Peter Xu , Xiaoyao Li
Cc: Chenyi Qiang , qemu-devel@nongnu.org, kvm@vger.kernel.org
Subject: [RESEND PATCH v8 1/4] i386: kvm: extend kvm_{get, put}_vcpu_events to support pending triple fault
Date: Thu, 29 Sep 2022 15:20:11 +0800
Message-Id: <20220929072014.20705-2-chenyi.qiang@intel.com>
In-Reply-To: <20220929072014.20705-1-chenyi.qiang@intel.com>
References: <20220929072014.20705-1-chenyi.qiang@intel.com>

KVM never loses direct triple faults, i.e. those detected by hardware and
morphed by KVM into a VM exit. Triple faults synthesized by KVM, e.g. on
the RSM path, are different: if KVM exits to userspace before the request
is serviced, userspace could migrate the VM and lose the triple fault.

KVM defines a new flag, KVM_VCPUEVENT_VALID_TRIPLE_FAULT, to signal that
the triple_fault.pending field of struct kvm_vcpu_events contains valid
state when the KVM_CAP_X86_TRIPLE_FAULT_EVENT capability is enabled.

Acked-by: Peter Xu
Signed-off-by: Chenyi Qiang
---
 target/i386/cpu.c     |  1 +
 target/i386/cpu.h     |  1 +
 target/i386/kvm/kvm.c | 20 ++++++++++++++++++++
 target/i386/machine.c | 20 ++++++++++++++++++++
 4 files changed, 42 insertions(+)

diff --git a/target/i386/cpu.c b/target/i386/cpu.c
index 1db1278a59..6e107466b3 100644
--- a/target/i386/cpu.c
+++ b/target/i386/cpu.c
@@ -6017,6 +6017,7 @@ static void x86_cpu_reset(DeviceState *dev)
     env->exception_has_payload = false;
     env->exception_payload = 0;
     env->nmi_injected = false;
+    env->triple_fault_pending = false;
 #if !defined(CONFIG_USER_ONLY)
     /* We hard-wire the BSP to the first CPU. */
     apic_designate_bsp(cpu->apic_state, s->cpu_index == 0);
diff --git a/target/i386/cpu.h b/target/i386/cpu.h
index 82004b65b9..d4124973ce 100644
--- a/target/i386/cpu.h
+++ b/target/i386/cpu.h
@@ -1739,6 +1739,7 @@ typedef struct CPUArchState {
     uint8_t has_error_code;
     uint8_t exception_has_payload;
     uint64_t exception_payload;
+    uint8_t triple_fault_pending;
     uint32_t ins_len;
     uint32_t sipi_vector;
     bool tsc_valid;
diff --git a/target/i386/kvm/kvm.c b/target/i386/kvm/kvm.c
index a1fd1f5379..3838827134 100644
--- a/target/i386/kvm/kvm.c
+++ b/target/i386/kvm/kvm.c
@@ -132,6 +132,7 @@ static int has_xcrs;
 static int has_pit_state2;
 static int has_sregs2;
 static int has_exception_payload;
+static int has_triple_fault_event;
 
 static bool has_msr_mcg_ext_ctl;
 
@@ -2483,6 +2484,16 @@ int kvm_arch_init(MachineState *ms, KVMState *s)
         }
     }
 
+    has_triple_fault_event = kvm_check_extension(s, KVM_CAP_X86_TRIPLE_FAULT_EVENT);
+    if (has_triple_fault_event) {
+        ret = kvm_vm_enable_cap(s, KVM_CAP_X86_TRIPLE_FAULT_EVENT, 0, true);
+        if (ret < 0) {
+            error_report("kvm: Failed to enable triple fault event cap: %s",
+                         strerror(-ret));
+            return ret;
+        }
+    }
+
     ret = kvm_get_supported_msrs(s);
     if (ret < 0) {
         return ret;
@@ -4299,6 +4310,11 @@ static int kvm_put_vcpu_events(X86CPU *cpu, int level)
         }
     }
 
+    if (has_triple_fault_event) {
+        events.flags |= KVM_VCPUEVENT_VALID_TRIPLE_FAULT;
+        events.triple_fault.pending = env->triple_fault_pending;
+    }
+
     return kvm_vcpu_ioctl(CPU(cpu), KVM_SET_VCPU_EVENTS, &events);
 }
 
@@ -4368,6 +4384,10 @@ static int kvm_get_vcpu_events(X86CPU *cpu)
         }
     }
 
+    if (events.flags & KVM_VCPUEVENT_VALID_TRIPLE_FAULT) {
+        env->triple_fault_pending = events.triple_fault.pending;
+    }
+
     env->sipi_vector = events.sipi_vector;
 
     return 0;
diff --git a/target/i386/machine.c b/target/i386/machine.c
index cecd476e98..310b125235 100644
--- a/target/i386/machine.c
+++ b/target/i386/machine.c
@@ -1562,6 +1562,25 @@ static const VMStateDescription vmstate_arch_lbr = {
     }
 };
 
+static bool triple_fault_needed(void *opaque)
+{
+    X86CPU *cpu = opaque;
+    CPUX86State *env = &cpu->env;
+
+    return env->triple_fault_pending;
+}
+
+static const VMStateDescription vmstate_triple_fault = {
+    .name = "cpu/triple_fault",
+    .version_id = 1,
+    .minimum_version_id = 1,
+    .needed = triple_fault_needed,
+    .fields = (VMStateField[]) {
+        VMSTATE_UINT8(env.triple_fault_pending, X86CPU),
+        VMSTATE_END_OF_LIST()
+    }
+};
+
 const VMStateDescription vmstate_x86_cpu = {
     .name = "cpu",
     .version_id = 12,
@@ -1706,6 +1725,7 @@ const VMStateDescription vmstate_x86_cpu = {
         &vmstate_amx_xtile,
 #endif
         &vmstate_arch_lbr,
+        &vmstate_triple_fault,
         NULL
     }
 };

From patchwork Thu Sep 29 07:20:12 2022
X-Patchwork-Submitter: Chenyi Qiang
X-Patchwork-Id: 12993589
From: Chenyi Qiang
To: Paolo Bonzini , Marcelo Tosatti , Richard Henderson , Eduardo Habkost , Peter Xu , Xiaoyao Li
Cc: qemu-devel@nongnu.org, kvm@vger.kernel.org
Subject: [RESEND PATCH v8 2/4] kvm: allow target-specific accelerator properties
Date: Thu, 29 Sep 2022 15:20:12 +0800
Message-Id: <20220929072014.20705-3-chenyi.qiang@intel.com>
In-Reply-To: <20220929072014.20705-1-chenyi.qiang@intel.com>
References: <20220929072014.20705-1-chenyi.qiang@intel.com>

From: Paolo Bonzini

Several hypervisor capabilities in KVM are target-specific. When they are
exposed to QEMU users as accelerator properties (i.e. -accel kvm,prop=value),
they should be available only on the targets that support them.

Add a hook for targets to add their own properties to -accel kvm; for now,
no such property is defined.
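The shape of this hook can be illustrated with a minimal, self-contained C sketch. It does not use QOM: SketchAccelClass, sketch_add_prop() and the other sketch_* names are hypothetical stand-ins for ObjectClass and object_class_property_add*(). In QEMU each target links exactly one kvm_arch_accel_class_init(); the sketch passes the per-target hook as a function pointer only so that two targets can be compared in one program, and the x86 property names are borrowed from patch 4/4 purely for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a QOM class: just a list of property names. */
#define SKETCH_MAX_PROPS 8

typedef struct SketchAccelClass {
    const char *props[SKETCH_MAX_PROPS];
    size_t nr_props;
} SketchAccelClass;

typedef void (*SketchArchHook)(SketchAccelClass *oc);

static void sketch_add_prop(SketchAccelClass *oc, const char *name)
{
    assert(oc->nr_props < SKETCH_MAX_PROPS);
    oc->props[oc->nr_props++] = name;
}

static int sketch_has_prop(const SketchAccelClass *oc, const char *name)
{
    for (size_t i = 0; i < oc->nr_props; i++) {
        if (strcmp(oc->props[i], name) == 0) {
            return 1;
        }
    }
    return 0;
}

/* A target with no KVM-specific properties: the hook is empty. */
static void sketch_arm_accel_class_init(SketchAccelClass *oc)
{
    (void)oc;
}

/* A target that registers its own properties from the hook. */
static void sketch_x86_accel_class_init(SketchAccelClass *oc)
{
    sketch_add_prop(oc, "notify-vmexit");
    sketch_add_prop(oc, "notify-window");
}

/* Generic class init shared by all targets; the last step calls the hook. */
static void sketch_accel_class_init(SketchAccelClass *oc, SketchArchHook arch_hook)
{
    sketch_add_prop(oc, "dirty-ring-size");
    arch_hook(oc);
}
```

With the ARM-style hook only the generic dirty-ring-size property is registered, while the x86-style hook adds two more; that per-target separation is what the new kvm_arch_accel_class_init() provides.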
Signed-off-by: Paolo Bonzini
---
 accel/kvm/kvm-all.c    | 2 ++
 include/sysemu/kvm.h   | 2 ++
 target/arm/kvm.c       | 4 ++++
 target/i386/kvm/kvm.c  | 4 ++++
 target/mips/kvm.c      | 4 ++++
 target/ppc/kvm.c       | 4 ++++
 target/riscv/kvm.c     | 4 ++++
 target/s390x/kvm/kvm.c | 4 ++++
 8 files changed, 28 insertions(+)

diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
index 5acab1767f..f90c5cb285 100644
--- a/accel/kvm/kvm-all.c
+++ b/accel/kvm/kvm-all.c
@@ -3737,6 +3737,8 @@ static void kvm_accel_class_init(ObjectClass *oc, void *data)
                               NULL, NULL);
     object_class_property_set_description(oc, "dirty-ring-size",
         "Size of KVM dirty page ring buffer (default: 0, i.e. use bitmap)");
+
+    kvm_arch_accel_class_init(oc);
 }
 
 static const TypeInfo kvm_accel_type = {
diff --git a/include/sysemu/kvm.h b/include/sysemu/kvm.h
index efd6dee818..50868ebf60 100644
--- a/include/sysemu/kvm.h
+++ b/include/sysemu/kvm.h
@@ -353,6 +353,8 @@ bool kvm_device_supported(int vmfd, uint64_t type);
 
 extern const KVMCapabilityInfo kvm_arch_required_capabilities[];
 
+void kvm_arch_accel_class_init(ObjectClass *oc);
+
 void kvm_arch_pre_run(CPUState *cpu, struct kvm_run *run);
 MemTxAttrs kvm_arch_post_run(CPUState *cpu, struct kvm_run *run);
 
diff --git a/target/arm/kvm.c b/target/arm/kvm.c
index e5c1bd50d2..d21603cf28 100644
--- a/target/arm/kvm.c
+++ b/target/arm/kvm.c
@@ -1056,3 +1056,7 @@ bool kvm_arch_cpu_check_are_resettable(void)
 {
     return true;
 }
+
+void kvm_arch_accel_class_init(ObjectClass *oc)
+{
+}
diff --git a/target/i386/kvm/kvm.c b/target/i386/kvm/kvm.c
index 3838827134..eab09833f9 100644
--- a/target/i386/kvm/kvm.c
+++ b/target/i386/kvm/kvm.c
@@ -5472,3 +5472,7 @@ void kvm_request_xsave_components(X86CPU *cpu, uint64_t mask)
         mask &= ~BIT_ULL(bit);
     }
 }
+
+void kvm_arch_accel_class_init(ObjectClass *oc)
+{
+}
diff --git a/target/mips/kvm.c b/target/mips/kvm.c
index caf70decd2..bcb8e06b2c 100644
--- a/target/mips/kvm.c
+++ b/target/mips/kvm.c
@@ -1294,3 +1294,7 @@ bool kvm_arch_cpu_check_are_resettable(void)
 {
     return true;
 }
+
+void kvm_arch_accel_class_init(ObjectClass *oc)
+{
+}
diff --git a/target/ppc/kvm.c b/target/ppc/kvm.c
index 466d0d2f4c..7c25348b7b 100644
--- a/target/ppc/kvm.c
+++ b/target/ppc/kvm.c
@@ -2966,3 +2966,7 @@ bool kvm_arch_cpu_check_are_resettable(void)
 {
     return true;
 }
+
+void kvm_arch_accel_class_init(ObjectClass *oc)
+{
+}
diff --git a/target/riscv/kvm.c b/target/riscv/kvm.c
index 70b4cff06f..30f21453d6 100644
--- a/target/riscv/kvm.c
+++ b/target/riscv/kvm.c
@@ -532,3 +532,7 @@ bool kvm_arch_cpu_check_are_resettable(void)
 {
     return true;
 }
+
+void kvm_arch_accel_class_init(ObjectClass *oc)
+{
+}
diff --git a/target/s390x/kvm/kvm.c b/target/s390x/kvm/kvm.c
index 6a8dbadf7e..508c24cfec 100644
--- a/target/s390x/kvm/kvm.c
+++ b/target/s390x/kvm/kvm.c
@@ -2581,3 +2581,7 @@ int kvm_s390_get_zpci_op(void)
 {
     return cap_zpci_op;
 }
+
+void kvm_arch_accel_class_init(ObjectClass *oc)
+{
+}

From patchwork Thu Sep 29 07:20:13 2022
X-Patchwork-Submitter: Chenyi Qiang
X-Patchwork-Id: 12993590
From: Chenyi Qiang
To: Paolo Bonzini , Marcelo Tosatti , Richard Henderson , Eduardo Habkost , Peter Xu , Xiaoyao Li
Cc: Chenyi Qiang , qemu-devel@nongnu.org, kvm@vger.kernel.org
Subject: [RESEND PATCH v8 3/4] kvm: expose struct KVMState
Date: Thu, 29 Sep 2022 15:20:13 +0800
Message-Id: <20220929072014.20705-4-chenyi.qiang@intel.com>
In-Reply-To: <20220929072014.20705-1-chenyi.qiang@intel.com>
References: <20220929072014.20705-1-chenyi.qiang@intel.com>

Move the definition of struct KVMState out of kvm-all.c so that its fields
can be accessed when defining target-specific accelerator properties.
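As a rough illustration of why the definition has to move into a header: when only a forward declaration such as `typedef struct KVMState KVMState;` is visible, code outside kvm-all.c can pass a KVMState pointer around, but any `s->field` access fails to compile because the type is incomplete. The sketch below assumes the full definition is visible and shows a target-specific setter in the style of the ones added later in this series. SketchKVMState and sketch_set_notify_window() are hypothetical, trimmed stand-ins, not QEMU code.

```c
#include <stdint.h>

/*
 * Hypothetical, trimmed stand-in for struct KVMState. Its full definition
 * is visible here, as it is to target code once it lives in kvm_int.h.
 */
typedef struct SketchKVMState {
    int fd;                  /* -1 until the accelerator is initialized */
    uint32_t notify_window;
} SketchKVMState;

/*
 * Target-specific setter: it dereferences the struct directly, which only
 * compiles because the complete type is visible. Like the setters in
 * patch 4/4, it refuses changes once the accelerator is initialized.
 * Returns 0 on success, -1 if it is too late to change the value.
 */
static int sketch_set_notify_window(SketchKVMState *s, uint32_t value)
{
    if (s->fd != -1) {
        return -1;
    }
    s->notify_window = value;
    return 0;
}
```

The fd check mirrors the guard in patch 4/4: a property can only change while s->fd is still -1, that is, before the accelerator has been initialized.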
Signed-off-by: Chenyi Qiang
---
 accel/kvm/kvm-all.c      | 74 ---------------------------------------
 include/sysemu/kvm_int.h | 75 ++++++++++++++++++++++++++++++++++++++++
 2 files changed, 75 insertions(+), 74 deletions(-)

diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
index f90c5cb285..3624ed8447 100644
--- a/accel/kvm/kvm-all.c
+++ b/accel/kvm/kvm-all.c
@@ -77,86 +77,12 @@
     do { } while (0)
 #endif
 
-#define KVM_MSI_HASHTAB_SIZE    256
-
 struct KVMParkedVcpu {
     unsigned long vcpu_id;
     int kvm_fd;
     QLIST_ENTRY(KVMParkedVcpu) node;
 };
 
-enum KVMDirtyRingReaperState {
-    KVM_DIRTY_RING_REAPER_NONE = 0,
-    /* The reaper is sleeping */
-    KVM_DIRTY_RING_REAPER_WAIT,
-    /* The reaper is reaping for dirty pages */
-    KVM_DIRTY_RING_REAPER_REAPING,
-};
-
-/*
- * KVM reaper instance, responsible for collecting the KVM dirty bits
- * via the dirty ring.
- */
-struct KVMDirtyRingReaper {
-    /* The reaper thread */
-    QemuThread reaper_thr;
-    volatile uint64_t reaper_iteration; /* iteration number of reaper thr */
-    volatile enum KVMDirtyRingReaperState reaper_state; /* reap thr state */
-};
-
-struct KVMState
-{
-    AccelState parent_obj;
-
-    int nr_slots;
-    int fd;
-    int vmfd;
-    int coalesced_mmio;
-    int coalesced_pio;
-    struct kvm_coalesced_mmio_ring *coalesced_mmio_ring;
-    bool coalesced_flush_in_progress;
-    int vcpu_events;
-    int robust_singlestep;
-    int debugregs;
-#ifdef KVM_CAP_SET_GUEST_DEBUG
-    QTAILQ_HEAD(, kvm_sw_breakpoint) kvm_sw_breakpoints;
-#endif
-    int max_nested_state_len;
-    int many_ioeventfds;
-    int intx_set_mask;
-    int kvm_shadow_mem;
-    bool kernel_irqchip_allowed;
-    bool kernel_irqchip_required;
-    OnOffAuto kernel_irqchip_split;
-    bool sync_mmu;
-    uint64_t manual_dirty_log_protect;
-    /* The man page (and posix) say ioctl numbers are signed int, but
-     * they're not. Linux, glibc and *BSD all treat ioctl numbers as
-     * unsigned, and treating them as signed here can break things */
-    unsigned irq_set_ioctl;
-    unsigned int sigmask_len;
-    GHashTable *gsimap;
-#ifdef KVM_CAP_IRQ_ROUTING
-    struct kvm_irq_routing *irq_routes;
-    int nr_allocated_irq_routes;
-    unsigned long *used_gsi_bitmap;
-    unsigned int gsi_count;
-    QTAILQ_HEAD(, KVMMSIRoute) msi_hashtab[KVM_MSI_HASHTAB_SIZE];
-#endif
-    KVMMemoryListener memory_listener;
-    QLIST_HEAD(, KVMParkedVcpu) kvm_parked_vcpus;
-
-    /* For "info mtree -f" to tell if an MR is registered in KVM */
-    int nr_as;
-    struct KVMAs {
-        KVMMemoryListener *ml;
-        AddressSpace *as;
-    } *as;
-    uint64_t kvm_dirty_ring_bytes;  /* Size of the per-vcpu dirty ring */
-    uint32_t kvm_dirty_ring_size;   /* Number of dirty GFNs per ring */
-    struct KVMDirtyRingReaper reaper;
-};
-
 KVMState *kvm_state;
 bool kvm_kernel_irqchip;
 bool kvm_split_irqchip;
diff --git a/include/sysemu/kvm_int.h b/include/sysemu/kvm_int.h
index 1f5487d9b7..07394744ad 100644
--- a/include/sysemu/kvm_int.h
+++ b/include/sysemu/kvm_int.h
@@ -36,6 +36,81 @@ typedef struct KVMMemoryListener {
     int as_id;
 } KVMMemoryListener;
 
+#define KVM_MSI_HASHTAB_SIZE    256
+
+enum KVMDirtyRingReaperState {
+    KVM_DIRTY_RING_REAPER_NONE = 0,
+    /* The reaper is sleeping */
+    KVM_DIRTY_RING_REAPER_WAIT,
+    /* The reaper is reaping for dirty pages */
+    KVM_DIRTY_RING_REAPER_REAPING,
+};
+
+/*
+ * KVM reaper instance, responsible for collecting the KVM dirty bits
+ * via the dirty ring.
+ */
+struct KVMDirtyRingReaper {
+    /* The reaper thread */
+    QemuThread reaper_thr;
+    volatile uint64_t reaper_iteration; /* iteration number of reaper thr */
+    volatile enum KVMDirtyRingReaperState reaper_state; /* reap thr state */
+};
+struct KVMState
+{
+    AccelState parent_obj;
+
+    int nr_slots;
+    int fd;
+    int vmfd;
+    int coalesced_mmio;
+    int coalesced_pio;
+    struct kvm_coalesced_mmio_ring *coalesced_mmio_ring;
+    bool coalesced_flush_in_progress;
+    int vcpu_events;
+    int robust_singlestep;
+    int debugregs;
+#ifdef KVM_CAP_SET_GUEST_DEBUG
+    QTAILQ_HEAD(, kvm_sw_breakpoint) kvm_sw_breakpoints;
+#endif
+    int max_nested_state_len;
+    int many_ioeventfds;
+    int intx_set_mask;
+    int kvm_shadow_mem;
+    bool kernel_irqchip_allowed;
+    bool kernel_irqchip_required;
+    OnOffAuto kernel_irqchip_split;
+    bool sync_mmu;
+    uint64_t manual_dirty_log_protect;
+    /* The man page (and posix) say ioctl numbers are signed int, but
+     * they're not. Linux, glibc and *BSD all treat ioctl numbers as
+     * unsigned, and treating them as signed here can break things */
+    unsigned irq_set_ioctl;
+    unsigned int sigmask_len;
+    GHashTable *gsimap;
+#ifdef KVM_CAP_IRQ_ROUTING
+    struct kvm_irq_routing *irq_routes;
+    int nr_allocated_irq_routes;
+    unsigned long *used_gsi_bitmap;
+    unsigned int gsi_count;
+    QTAILQ_HEAD(, KVMMSIRoute) msi_hashtab[KVM_MSI_HASHTAB_SIZE];
+#endif
+    KVMMemoryListener memory_listener;
+    QLIST_HEAD(, KVMParkedVcpu) kvm_parked_vcpus;
+
+    /* For "info mtree -f" to tell if an MR is registered in KVM */
+    int nr_as;
+    struct KVMAs {
+        KVMMemoryListener *ml;
+        AddressSpace *as;
+    } *as;
+    uint64_t kvm_dirty_ring_bytes;  /* Size of the per-vcpu dirty ring */
+    uint32_t kvm_dirty_ring_size;   /* Number of dirty GFNs per ring */
+    struct KVMDirtyRingReaper reaper;
+    NotifyVmexitOption notify_vmexit;
+    uint32_t notify_window;
+};
+
 void kvm_memory_listener_register(KVMState *s, KVMMemoryListener *kml,
                                   AddressSpace *as, int as_id, const char *name);
 

From patchwork Thu Sep 29 07:20:14 2022
X-Patchwork-Submitter: Chenyi Qiang
X-Patchwork-Id: 12993591
From: Chenyi Qiang
To: Paolo Bonzini , Marcelo Tosatti , Richard Henderson , Eduardo Habkost , Peter Xu , Xiaoyao Li
Cc: Chenyi Qiang , qemu-devel@nongnu.org, kvm@vger.kernel.org
Subject: [RESEND PATCH v8 4/4] i386: add notify VM exit support
Date: Thu, 29 Sep 2022 15:20:14 +0800
Message-Id: <20220929072014.20705-5-chenyi.qiang@intel.com>
In-Reply-To: <20220929072014.20705-1-chenyi.qiang@intel.com>
References: <20220929072014.20705-1-chenyi.qiang@intel.com>

There are cases in which a malicious virtual machine can get a CPU stuck
because event windows never open up, e.g. an infinite loop in microcode on
nested #AC (CVE-2015-5307). No event window means that no event (NMI, SMI
or IRQ) can be delivered, which leaves the CPU unavailable to the host and
to other VMs.

Notify VM exit is introduced to mitigate this kind of attack: it generates
a VM exit if no event window occurs in VM non-root mode for a specified
amount of time (the notify window).

A new KVM capability, KVM_CAP_X86_NOTIFY_VMEXIT, is exposed to user space
so that the user can query the capability and set the expected notify
window when creating VMs. The format of the argument when enabling this
capability is as follows:

  Bits 63:32 - notify window specified on the QEMU command line
  Bits 31:0  - flags (e.g. KVM_X86_NOTIFY_VMEXIT_ENABLED is set to enable
               the feature)

Users can configure the feature with a new (x86-only) accel property:

    qemu -accel kvm,notify-vmexit=run|internal-error|disable,notify-window=n

The default option of notify-vmexit is run, which enables the capability
and does nothing if the exit happens. The internal-error option raises a
KVM internal error if the exit happens. The disable option does not enable
the capability.

The default value of notify-window is 0.
It is used only when notify-vmexit is not set to disable, and it is an
unsigned 32-bit count of clock cycles. Setting it to zero is still safe,
because the hardware adds an internal threshold to the notify window to
avoid false positives.

A notify VM exit may happen with VM_CONTEXT_INVALID set in the exit
qualification (no cases are anticipated that would set this bit), which
means the VM context is corrupted. KVM reflects this in the flags of the
KVM_EXIT_NOTIFY exit: if the KVM_NOTIFY_CONTEXT_INVALID bit is set, QEMU
raises a KVM internal error unconditionally.

Acked-by: Peter Xu
Signed-off-by: Chenyi Qiang
---
 accel/kvm/kvm-all.c   |  2 +
 qapi/run-state.json   | 17 ++++++++
 qemu-options.hx       | 11 +++++
 target/i386/kvm/kvm.c | 98 +++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 128 insertions(+)

diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
index 3624ed8447..41ba9de3b8 100644
--- a/accel/kvm/kvm-all.c
+++ b/accel/kvm/kvm-all.c
@@ -3636,6 +3636,8 @@ static void kvm_accel_instance_init(Object *obj)
     s->kernel_irqchip_split = ON_OFF_AUTO_AUTO;
     /* KVM dirty ring is by default off */
     s->kvm_dirty_ring_size = 0;
+    s->notify_vmexit = NOTIFY_VMEXIT_OPTION_RUN;
+    s->notify_window = 0;
 }
 
 static void kvm_accel_class_init(ObjectClass *oc, void *data)
diff --git a/qapi/run-state.json b/qapi/run-state.json
index 9273ea6516..49989d30e6 100644
--- a/qapi/run-state.json
+++ b/qapi/run-state.json
@@ -643,3 +643,20 @@
 { 'struct': 'MemoryFailureFlags',
   'data': { 'action-required': 'bool',
             'recursive': 'bool'} }
+
+##
+# @NotifyVmexitOption:
+#
+# An enumeration of the options specified when enabling notify VM exit
+#
+# @run: enable the feature, do nothing and continue if the notify VM exit happens.
+#
+# @internal-error: enable the feature, raise an internal error if the notify
+#                  VM exit happens.
+#
+# @disable: disable the feature.
+#
+# Since: 7.2
+##
+{ 'enum': 'NotifyVmexitOption',
+  'data': [ 'run', 'internal-error', 'disable' ] }
\ No newline at end of file
diff --git a/qemu-options.hx b/qemu-options.hx
index 913c71e38f..8f85004a7d 100644
--- a/qemu-options.hx
+++ b/qemu-options.hx
@@ -191,6 +191,7 @@ DEF("accel", HAS_ARG, QEMU_OPTION_accel,
     "                split-wx=on|off (enable TCG split w^x mapping)\n"
     "                tb-size=n (TCG translation block cache size)\n"
     "                dirty-ring-size=n (KVM dirty ring GFN count, default 0)\n"
+    "                notify-vmexit=run|internal-error|disable,notify-window=n (enable notify VM exit and set notify window, x86 only)\n"
     "                thread=single|multi (enable multi-threaded TCG)\n", QEMU_ARCH_ALL)
 SRST
 ``-accel name[,prop=value[,...]]``
@@ -242,6 +243,16 @@ SRST
         is disabled (dirty-ring-size=0). When enabled, KVM will instead
         record dirty pages in a bitmap.
 
+    ``notify-vmexit=run|internal-error|disable,notify-window=n``
+        Enables or disables notify VM exit support on x86 hosts and specifies
+        the notify window used to trigger the VM exit when it is enabled.
+        The ``run`` option enables the feature and does nothing (the guest
+        continues) when the exit happens. ``internal-error`` enables the feature
+        and raises an internal error when the exit happens. ``disable`` does not
+        enable the feature. This feature can mitigate CPU stalls caused by event
+        windows not opening up for a specified amount of time (i.e. notify-window).
+        Default: notify-vmexit=run,notify-window=0.
+
 ERST
 
 DEF("smp", HAS_ARG, QEMU_OPTION_smp,
diff --git a/target/i386/kvm/kvm.c b/target/i386/kvm/kvm.c
index eab09833f9..9a4378b304 100644
--- a/target/i386/kvm/kvm.c
+++ b/target/i386/kvm/kvm.c
@@ -15,6 +15,7 @@
 #include "qemu/osdep.h"
 #include "qapi/qapi-events-run-state.h"
 #include "qapi/error.h"
+#include "qapi/visitor.h"
 #include
 #include
 #include
@@ -2599,6 +2600,21 @@ int kvm_arch_init(MachineState *ms, KVMState *s)
         }
     }
 
+    if (s->notify_vmexit != NOTIFY_VMEXIT_OPTION_DISABLE &&
+        kvm_check_extension(s, KVM_CAP_X86_NOTIFY_VMEXIT)) {
+        uint64_t notify_window_flags =
+            ((uint64_t)s->notify_window << 32) |
+            KVM_X86_NOTIFY_VMEXIT_ENABLED |
+            KVM_X86_NOTIFY_VMEXIT_USER;
+        ret = kvm_vm_enable_cap(s, KVM_CAP_X86_NOTIFY_VMEXIT, 0,
+                                notify_window_flags);
+        if (ret < 0) {
+            error_report("kvm: Failed to enable notify vmexit cap: %s",
+                         strerror(-ret));
+            return ret;
+        }
+    }
+
     return 0;
 }
 
@@ -5141,6 +5157,9 @@ int kvm_arch_handle_exit(CPUState *cs, struct kvm_run *run)
     X86CPU *cpu = X86_CPU(cs);
     uint64_t code;
     int ret;
+    bool ctx_invalid;
+    char str[256];
+    KVMState *state;
 
     switch (run->exit_reason) {
     case KVM_EXIT_HLT:
@@ -5196,6 +5215,21 @@ int kvm_arch_handle_exit(CPUState *cs, struct kvm_run *run)
         /* already handled in kvm_arch_post_run */
         ret = 0;
         break;
+    case KVM_EXIT_NOTIFY:
+        ctx_invalid = !!(run->notify.flags & KVM_NOTIFY_CONTEXT_INVALID);
+        state = KVM_STATE(current_accel());
+        sprintf(str, "Encounter a notify exit with %svalid context in"
+                     " guest. There can be possible misbehaves in guest."
+                     " Please have a look.", ctx_invalid ? "in" : "");
+        if (ctx_invalid ||
+            state->notify_vmexit == NOTIFY_VMEXIT_OPTION_INTERNAL_ERROR) {
+            warn_report("KVM internal error: %s", str);
+            ret = -1;
+        } else {
+            warn_report_once("KVM: %s", str);
+            ret = 0;
+        }
+        break;
     default:
         fprintf(stderr, "KVM: unknown exit reason %d\n", run->exit_reason);
         ret = -1;
@@ -5473,6 +5507,70 @@ void kvm_request_xsave_components(X86CPU *cpu, uint64_t mask)
     }
 }
 
+static int kvm_arch_get_notify_vmexit(Object *obj, Error **errp)
+{
+    KVMState *s = KVM_STATE(obj);
+    return s->notify_vmexit;
+}
+
+static void kvm_arch_set_notify_vmexit(Object *obj, int value, Error **errp)
+{
+    KVMState *s = KVM_STATE(obj);
+
+    if (s->fd != -1) {
+        error_setg(errp, "Cannot set properties after the accelerator has been initialized");
+        return;
+    }
+
+    s->notify_vmexit = value;
+}
+
+static void kvm_arch_get_notify_window(Object *obj, Visitor *v,
+                                       const char *name, void *opaque,
+                                       Error **errp)
+{
+    KVMState *s = KVM_STATE(obj);
+    uint32_t value = s->notify_window;
+
+    visit_type_uint32(v, name, &value, errp);
+}
+
+static void kvm_arch_set_notify_window(Object *obj, Visitor *v,
+                                       const char *name, void *opaque,
+                                       Error **errp)
+{
+    KVMState *s = KVM_STATE(obj);
+    Error *error = NULL;
+    uint32_t value;
+
+    if (s->fd != -1) {
+        error_setg(errp, "Cannot set properties after the accelerator has been initialized");
+        return;
+    }
+
+    visit_type_uint32(v, name, &value, &error);
+    if (error) {
+        error_propagate(errp, error);
+        return;
+    }
+
+    s->notify_window = value;
+}
+
 void kvm_arch_accel_class_init(ObjectClass *oc)
 {
+    object_class_property_add_enum(oc, "notify-vmexit", "NotifyVMexitOption",
+                                   &NotifyVmexitOption_lookup,
+                                   kvm_arch_get_notify_vmexit,
+                                   kvm_arch_set_notify_vmexit);
+    object_class_property_set_description(oc, "notify-vmexit",
+                                          "Enable notify VM exit");
+
+    object_class_property_add(oc, "notify-window", "uint32",
+                              kvm_arch_get_notify_window,
+                              kvm_arch_set_notify_window,
+                              NULL, NULL);
+    object_class_property_set_description(oc, "notify-window",
+                                          "Clock cycles without an event window "
+                                          "after which a notification VM exit occurs");
 }
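Two pieces of logic in this patch are easy to get wrong: the layout of the KVM_CAP_X86_NOTIFY_VMEXIT argument (window in bits 63:32, flags in bits 31:0) and the decision taken on KVM_EXIT_NOTIFY. The C sketch below restates both outside QEMU. The SKETCH_* flag values are an assumption: they are meant to mirror KVM_X86_NOTIFY_VMEXIT_ENABLED and KVM_X86_NOTIFY_VMEXIT_USER in <linux/kvm.h>, so check them against the kernel headers you build with. The enum and function names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed to mirror the KVM_X86_NOTIFY_VMEXIT_* flags in <linux/kvm.h>. */
#define SKETCH_NOTIFY_VMEXIT_ENABLED (1ULL << 0)
#define SKETCH_NOTIFY_VMEXIT_USER    (1ULL << 1)

/* Hypothetical mirror of the QAPI NotifyVmexitOption enum. */
typedef enum {
    SKETCH_NOTIFY_RUN,
    SKETCH_NOTIFY_INTERNAL_ERROR,
    SKETCH_NOTIFY_DISABLE,
} SketchNotifyOption;

/* Build the enable-cap argument: window in bits 63:32, flags in bits 31:0. */
static uint64_t sketch_notify_cap_arg(uint32_t window)
{
    return ((uint64_t)window << 32) |
           SKETCH_NOTIFY_VMEXIT_ENABLED |
           SKETCH_NOTIFY_VMEXIT_USER;
}

/* Recover the notify window from an enable-cap argument. */
static uint32_t sketch_notify_cap_window(uint64_t arg)
{
    return (uint32_t)(arg >> 32);
}

/*
 * Decision taken on KVM_EXIT_NOTIFY, following the patch: an invalid VM
 * context always stops the vCPU; otherwise only internal-error stops it.
 * Returns 0 to keep running, -1 to report an internal error.
 */
static int sketch_handle_notify_exit(bool ctx_invalid, SketchNotifyOption opt)
{
    if (ctx_invalid || opt == SKETCH_NOTIFY_INTERNAL_ERROR) {
        return -1;
    }
    return 0;
}
```

With notify-window=128 the argument becomes (128 << 32) | 0x3. In the patch the disable option never reaches the exit handler, because the capability is not enabled at all in that case, which is why the sketch only treats internal-error specially.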