From patchwork Fri Nov 4 15:14:52 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Emanuele Giuseppe Esposito X-Patchwork-Id: 13032156 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from lists.gnu.org (lists.gnu.org [209.51.188.17]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.lore.kernel.org (Postfix) with ESMTPS id D7957C43217 for ; Fri, 4 Nov 2022 15:15:54 +0000 (UTC) Received: from localhost ([::1] helo=lists1p.gnu.org) by lists.gnu.org with esmtp (Exim 4.90_1) (envelope-from ) id 1oqyPT-0004kn-2C; Fri, 04 Nov 2022 11:15:07 -0400 Received: from eggs.gnu.org ([2001:470:142:3::10]) by lists.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.90_1) (envelope-from ) id 1oqyPR-0004kY-9Y for qemu-devel@nongnu.org; Fri, 04 Nov 2022 11:15:05 -0400 Received: from us-smtp-delivery-124.mimecast.com ([170.10.133.124]) by eggs.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.90_1) (envelope-from ) id 1oqyPP-0000Y3-IP for qemu-devel@nongnu.org; Fri, 04 Nov 2022 11:15:05 -0400 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1667574902; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=FqajUAJmYrE2XQXwS2++lD6bdpKZOtRow+ECsg6ZdMo=; b=Jjt+V1z2V/77eyydv8Tc1n2mFqAjMU0pKMZbHmIyixwzX775/T7XzFZQuAFMLWNcKlNGTr i6AeC4wUqRmzVwJs3xDA7BqGxJMbPXNYSuCsishF41OdmfosG/LCD9kKrlSxo0L0YrWOGM SyJTjcWNOe+K+GDWs761wZ3LYsDRWVY= Received: from mimecast-mx02.redhat.com (mimecast-mx02.redhat.com [66.187.233.88]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id us-mta-10-iCifCcRINnyAhF1zNoGO6g-1; Fri, 04 Nov 2022 11:14:56 -0400 X-MC-Unique: iCifCcRINnyAhF1zNoGO6g-1 Received: from smtp.corp.redhat.com (int-mx08.intmail.prod.int.rdu2.redhat.com [10.11.54.8]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by mimecast-mx02.redhat.com (Postfix) with ESMTPS id 61DBA185A78F; Fri, 4 Nov 2022 15:14:56 +0000 (UTC) Received: from virtlab701.virt.lab.eng.bos.redhat.com (virtlab701.virt.lab.eng.bos.redhat.com [10.19.152.228]) by smtp.corp.redhat.com (Postfix) with ESMTP id 223B8C1908B; Fri, 4 Nov 2022 15:14:56 +0000 (UTC) From: Emanuele Giuseppe Esposito To: qemu-devel@nongnu.org Cc: Paolo Bonzini , Eduardo Habkost , Marcel Apfelbaum , =?utf-8?q?Philippe_Mathieu-D?= =?utf-8?q?aud=C3=A9?= , Yanan Wang , kvm@vger.kernel.org, Emanuele Giuseppe Esposito , David Hildenbrand Subject: [RFC PATCH 1/3] KVM: keep track of running ioctls Date: Fri, 4 Nov 2022 11:14:52 -0400 Message-Id: <20221104151454.136551-2-eesposit@redhat.com> In-Reply-To: <20221104151454.136551-1-eesposit@redhat.com> References: <20221104151454.136551-1-eesposit@redhat.com> MIME-Version: 1.0 X-Scanned-By: MIMEDefang 3.1 on 10.11.54.8 Received-SPF: pass client-ip=170.10.133.124; envelope-from=eesposit@redhat.com; helo=us-smtp-delivery-124.mimecast.com X-Spam_score_int: -30 X-Spam_score: -3.1 X-Spam_bar: --- X-Spam_report: (-3.1 / 5.0 requ) BAYES_00=-1.9, DKIMWL_WL_HIGH=-1.045, DKIM_SIGNED=0.1, DKIM_VALID=-0.1, DKIM_VALID_AU=-0.1, DKIM_VALID_EF=-0.1, RCVD_IN_DNSWL_NONE=-0.0001, RCVD_IN_MSPIKE_H2=-0.001, SPF_HELO_NONE=0.001, SPF_PASS=-0.001 autolearn=ham autolearn_force=no X-Spam_action: no action X-BeenThere: qemu-devel@nongnu.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Sender: "Qemu-devel" Errors-To: qemu-devel-bounces+qemu-devel=archiver.kernel.org@nongnu.org kvm_in_ioctl_lock uses a QemuLockCnt to keep track of the running ioctls. It will be used by the memory listener to make sure no new ioctl runs when it has to perform memslots modifications. Signed-off-by: David Hildenbrand Signed-off-by: Emanuele Giuseppe Esposito --- accel/kvm/kvm-all.c | 19 +++++++++++++++++++ 1 file changed, 19 insertions(+) diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c index f99b0becd8..099d7bda56 100644 --- a/accel/kvm/kvm-all.c +++ b/accel/kvm/kvm-all.c @@ -105,6 +105,8 @@ static int kvm_sstep_flags; static bool kvm_immediate_exit; static hwaddr kvm_max_slot_size = ~0; +QemuLockCnt kvm_in_ioctl_lock; + static const KVMCapabilityInfo kvm_required_capabilites[] = { KVM_CAP_INFO(USER_MEMORY), KVM_CAP_INFO(DESTROY_MEMORY_REGION_WORKS), @@ -2310,6 +2312,7 @@ static int kvm_init(MachineState *ms) assert(TARGET_PAGE_SIZE <= qemu_real_host_page_size()); s->sigmask_len = 8; + qemu_lockcnt_init(&kvm_in_ioctl_lock); #ifdef KVM_CAP_SET_GUEST_DEBUG QTAILQ_INIT(&s->kvm_sw_breakpoints); @@ -2808,6 +2811,18 @@ static void kvm_eat_signals(CPUState *cpu) } while (sigismember(&chkset, SIG_IPI)); } +static void kvm_set_in_ioctl(bool in_ioctl) +{ + if (likely(qemu_mutex_iothread_locked())) { + return; + } + if (in_ioctl) { + qemu_lockcnt_inc(&kvm_in_ioctl_lock); + } else { + qemu_lockcnt_dec(&kvm_in_ioctl_lock); + } +} + int kvm_cpu_exec(CPUState *cpu) { struct kvm_run *run = cpu->kvm_run; @@ -3014,7 +3029,9 @@ int kvm_vm_ioctl(KVMState *s, int type, ...) va_end(ap); trace_kvm_vm_ioctl(type, arg); + kvm_set_in_ioctl(true); ret = ioctl(s->vmfd, type, arg); + kvm_set_in_ioctl(false); if (ret == -1) { ret = -errno; } @@ -3050,7 +3067,9 @@ int kvm_device_ioctl(int fd, int type, ...) va_end(ap); trace_kvm_device_ioctl(fd, type, arg); + kvm_set_in_ioctl(true); ret = ioctl(fd, type, arg); + kvm_set_in_ioctl(false); if (ret == -1) { ret = -errno; } From patchwork Fri Nov 4 15:14:53 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Emanuele Giuseppe Esposito X-Patchwork-Id: 13032157 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from lists.gnu.org (lists.gnu.org [209.51.188.17]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.lore.kernel.org (Postfix) with ESMTPS id 002FBC433FE for ; Fri, 4 Nov 2022 15:15:54 +0000 (UTC) Received: from localhost ([::1] helo=lists1p.gnu.org) by lists.gnu.org with esmtp (Exim 4.90_1) (envelope-from ) id 1oqyPr-0004oR-Sy; Fri, 04 Nov 2022 11:15:31 -0400 Received: from eggs.gnu.org ([2001:470:142:3::10]) by lists.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.90_1) (envelope-from ) id 1oqyPc-0004lo-O9 for qemu-devel@nongnu.org; Fri, 04 Nov 2022 11:15:24 -0400 Received: from us-smtp-delivery-124.mimecast.com ([170.10.133.124]) by eggs.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.90_1) (envelope-from ) id 1oqyPP-0000Wl-4H for qemu-devel@nongnu.org; Fri, 04 Nov 2022 11:15:16 -0400 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1667574901; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=tE42PNnFF/m0fEHtmSiadsyNXCXcPI2ur/gt+05bV+k=; b=IhG/58gQ6rUZ6ywdRgC7vXzJUHPjOM7D0Sjv/i3Y02XjRW2GiIjnIxk+zN73aG8jEGO5Pt ojq4YZo2quTVsXB83cBGA87SFvZbXzvPMWCHRVUEiPLcca3JodRkRlm6bkgDCCcLTbc1xL RTnjTeJWQNSyQ7q/GkJgDkBEGFiWwfw= Received: from mimecast-mx02.redhat.com (mx3-rdu2.redhat.com [66.187.233.73]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id us-mta-596-bP32ffFeP-WFP6Kkw4XF0g-1; Fri, 04 Nov 2022 11:14:57 -0400 X-MC-Unique: bP32ffFeP-WFP6Kkw4XF0g-1 Received: from smtp.corp.redhat.com (int-mx08.intmail.prod.int.rdu2.redhat.com [10.11.54.8]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by mimecast-mx02.redhat.com (Postfix) with ESMTPS id DDC132999B25; Fri, 4 Nov 2022 15:14:56 +0000 (UTC) Received: from virtlab701.virt.lab.eng.bos.redhat.com (virtlab701.virt.lab.eng.bos.redhat.com [10.19.152.228]) by smtp.corp.redhat.com (Postfix) with ESMTP id 6ABD2C2C8D9; Fri, 4 Nov 2022 15:14:56 +0000 (UTC) From: Emanuele Giuseppe Esposito To: qemu-devel@nongnu.org Cc: Paolo Bonzini , Eduardo Habkost , Marcel Apfelbaum , =?utf-8?q?Philippe_Mathieu-D?= =?utf-8?q?aud=C3=A9?= , Yanan Wang , kvm@vger.kernel.org, Emanuele Giuseppe Esposito , David Hildenbrand Subject: [RFC PATCH 2/3] KVM: keep track of running vcpu ioctls Date: Fri, 4 Nov 2022 11:14:53 -0400 Message-Id: <20221104151454.136551-3-eesposit@redhat.com> In-Reply-To: <20221104151454.136551-1-eesposit@redhat.com> References: <20221104151454.136551-1-eesposit@redhat.com> MIME-Version: 1.0 X-Scanned-By: MIMEDefang 3.1 on 10.11.54.8 Received-SPF: pass client-ip=170.10.133.124; envelope-from=eesposit@redhat.com; helo=us-smtp-delivery-124.mimecast.com X-Spam_score_int: -30 X-Spam_score: -3.1 X-Spam_bar: --- X-Spam_report: (-3.1 / 5.0 requ) BAYES_00=-1.9, DKIMWL_WL_HIGH=-1.045, DKIM_SIGNED=0.1, DKIM_VALID=-0.1, DKIM_VALID_AU=-0.1, DKIM_VALID_EF=-0.1, RCVD_IN_DNSWL_NONE=-0.0001, RCVD_IN_MSPIKE_H2=-0.001, SPF_HELO_NONE=0.001, SPF_PASS=-0.001 autolearn=ham autolearn_force=no X-Spam_action: no action X-BeenThere: qemu-devel@nongnu.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Sender: "Qemu-devel" Errors-To: qemu-devel-bounces+qemu-devel=archiver.kernel.org@nongnu.org Just as in kvm-all.c, introduce per-vcpu QemuLockCnt to keep track when vcpus are executing an ioctl. Keep a per-vcpu lock to minimize cache line bouncing. The kvm listener will then use the lock to prevent new ioctls from running when modifying memslots. Signed-off-by: David Hildenbrand Signed-off-by: Emanuele Giuseppe Esposito --- accel/kvm/kvm-all.c | 14 ++++++++++++++ hw/core/cpu-common.c | 2 ++ include/hw/core/cpu.h | 3 +++ 3 files changed, 19 insertions(+) diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c index 099d7bda56..48cdf1fecd 100644 --- a/accel/kvm/kvm-all.c +++ b/accel/kvm/kvm-all.c @@ -2811,6 +2811,18 @@ static void kvm_eat_signals(CPUState *cpu) } while (sigismember(&chkset, SIG_IPI)); } +static void kvm_cpu_set_in_ioctl(CPUState *cpu, bool in_ioctl) +{ + if (unlikely(qemu_mutex_iothread_locked())) { + return; + } + if (in_ioctl) { + qemu_lockcnt_inc(&cpu->in_ioctl_lock); + } else { + qemu_lockcnt_dec(&cpu->in_ioctl_lock); + } +} + static void kvm_set_in_ioctl(bool in_ioctl) { if (likely(qemu_mutex_iothread_locked())) { @@ -3049,7 +3061,9 @@ int kvm_vcpu_ioctl(CPUState *cpu, int type, ...) va_end(ap); trace_kvm_vcpu_ioctl(cpu->cpu_index, type, arg); + kvm_cpu_set_in_ioctl(cpu, true); ret = ioctl(cpu->kvm_fd, type, arg); + kvm_cpu_set_in_ioctl(cpu, false); if (ret == -1) { ret = -errno; } diff --git a/hw/core/cpu-common.c b/hw/core/cpu-common.c index f9fdd46b9d..8d6a4b1b65 100644 --- a/hw/core/cpu-common.c +++ b/hw/core/cpu-common.c @@ -237,6 +237,7 @@ static void cpu_common_initfn(Object *obj) cpu->nr_threads = 1; qemu_mutex_init(&cpu->work_mutex); + qemu_lockcnt_init(&cpu->in_ioctl_lock); QSIMPLEQ_INIT(&cpu->work_list); QTAILQ_INIT(&cpu->breakpoints); QTAILQ_INIT(&cpu->watchpoints); @@ -248,6 +249,7 @@ static void cpu_common_finalize(Object *obj) { CPUState *cpu = CPU(obj); + qemu_lockcnt_destroy(&cpu->in_ioctl_lock); qemu_mutex_destroy(&cpu->work_mutex); } diff --git a/include/hw/core/cpu.h b/include/hw/core/cpu.h index f9b58773f7..07852b2a3c 100644 --- a/include/hw/core/cpu.h +++ b/include/hw/core/cpu.h @@ -397,6 +397,9 @@ struct CPUState { uint32_t kvm_fetch_index; uint64_t dirty_pages; + /* kvm only for now: CPU is in kvm_vcpu_ioctl() (esp. KVM_RUN) */ + QemuLockCnt in_ioctl_lock; + /* Used for events with 'vcpu' and *without* the 'disabled' properties */ DECLARE_BITMAP(trace_dstate_delayed, CPU_TRACE_DSTATE_MAX_EVENTS); DECLARE_BITMAP(trace_dstate, CPU_TRACE_DSTATE_MAX_EVENTS); From patchwork Fri Nov 4 15:14:54 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Emanuele Giuseppe Esposito X-Patchwork-Id: 13032158 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from lists.gnu.org (lists.gnu.org [209.51.188.17]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.lore.kernel.org (Postfix) with ESMTPS id 1F184C4332F for ; Fri, 4 Nov 2022 15:15:56 +0000 (UTC) Received: from localhost ([::1] helo=lists1p.gnu.org) by lists.gnu.org with esmtp (Exim 4.90_1) (envelope-from ) id 1oqyPU-0004lA-SI; Fri, 04 Nov 2022 11:15:08 -0400 Received: from eggs.gnu.org ([2001:470:142:3::10]) by lists.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.90_1) (envelope-from ) id 1oqyPS-0004kg-Er for qemu-devel@nongnu.org; Fri, 04 Nov 2022 11:15:06 -0400 Received: from us-smtp-delivery-124.mimecast.com ([170.10.133.124]) by eggs.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.90_1) (envelope-from ) id 1oqyPP-0000ZY-Aw for qemu-devel@nongnu.org; Fri, 04 Nov 2022 11:15:06 -0400 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1667574901; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=nMGhB+cIUztS62Zo93dN1SSLUlu0sn9Iy9UJXvUpLpc=; b=UHrukGAGET6KGOFaCOZhdsHcxQpqJnkuv/rOACWK0GdyIcklRJeBhP/ouXC8JW7BwdfjGH Wyml7q+qDYSQwxesq8LoTpZfEqF//3xGPWpaDqXDnqtvSFe3vvEGqYMqEr9pJQuVVUgWH4 LIjdwOw3xAOmWq9S+V6Ha+8mdRMlyiw= Received: from mimecast-mx02.redhat.com (mx3-rdu2.redhat.com [66.187.233.73]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id us-mta-158-swPDqkY4O0me1kburFFHTw-1; Fri, 04 Nov 2022 11:14:57 -0400 X-MC-Unique: swPDqkY4O0me1kburFFHTw-1 Received: from smtp.corp.redhat.com (int-mx08.intmail.prod.int.rdu2.redhat.com [10.11.54.8]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by mimecast-mx02.redhat.com (Postfix) with ESMTPS id 362202999B2C; Fri, 4 Nov 2022 15:14:57 +0000 (UTC) Received: from virtlab701.virt.lab.eng.bos.redhat.com (virtlab701.virt.lab.eng.bos.redhat.com [10.19.152.228]) by smtp.corp.redhat.com (Postfix) with ESMTP id E7A92C2C8C5; Fri, 4 Nov 2022 15:14:56 +0000 (UTC) From: Emanuele Giuseppe Esposito To: qemu-devel@nongnu.org Cc: Paolo Bonzini , Eduardo Habkost , Marcel Apfelbaum , =?utf-8?q?Philippe_Mathieu-D?= =?utf-8?q?aud=C3=A9?= , Yanan Wang , kvm@vger.kernel.org, David Hildenbrand , Emanuele Giuseppe Esposito Subject: [RFC PATCH 3/3] kvm: Atomic memslot updates Date: Fri, 4 Nov 2022 11:14:54 -0400 Message-Id: <20221104151454.136551-4-eesposit@redhat.com> In-Reply-To: <20221104151454.136551-1-eesposit@redhat.com> References: <20221104151454.136551-1-eesposit@redhat.com> MIME-Version: 1.0 X-Scanned-By: MIMEDefang 3.1 on 10.11.54.8 Received-SPF: pass client-ip=170.10.133.124; envelope-from=eesposit@redhat.com; helo=us-smtp-delivery-124.mimecast.com X-Spam_score_int: -30 X-Spam_score: -3.1 X-Spam_bar: --- X-Spam_report: (-3.1 / 5.0 requ) BAYES_00=-1.9, DKIMWL_WL_HIGH=-1.045, DKIM_SIGNED=0.1, DKIM_VALID=-0.1, DKIM_VALID_AU=-0.1, DKIM_VALID_EF=-0.1, RCVD_IN_DNSWL_NONE=-0.0001, RCVD_IN_MSPIKE_H2=-0.001, SPF_HELO_NONE=0.001, SPF_PASS=-0.001 autolearn=ham autolearn_force=no X-Spam_action: no action X-BeenThere: qemu-devel@nongnu.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Sender: "Qemu-devel" Errors-To: qemu-devel-bounces+qemu-devel=archiver.kernel.org@nongnu.org From: David Hildenbrand If we update an existing memslot (e.g., resize, split), we temporarily remove the memslot to re-add it immediately afterwards. These updates are not atomic, especially not for KVM VCPU threads, such that we can get spurious faults. Let's inhibit most KVM ioctls while performing relevant updates, such that we can perform the update just as if it would happen atomically without additional kernel support. We capture the add/del changes and apply them in the notifier commit stage instead. There, we can check for overlaps and perform the ioctl inhibiting only if really required (-> overlap). To keep things simple we don't perform additional checks that wouldn't actually result in an overlap -- such as !RAM memory regions in some cases (see kvm_set_phys_mem()). To minimize cache-line bouncing, use a separate indicator (in_ioctl_lock) per CPU. Also, make sure to hold the kvm_slots_lock while performing both actions (removing+re-adding). We have to wait until all IOCTLs were exited and block new ones from getting executed. Kick all CPUs, so they will exit the KVM_RUN ioctl. This approach cannot result in a deadlock as long as the inhibitor does not hold any locks that might hinder an IOCTL from getting finished and exited - something fairly unusual. The inhibitor will always hold the BQL. AFAIKs, one possible candidate would be userfaultfd. If a page cannot be placed (e.g., during postcopy), because we're waiting for a lock, or if the userfaultfd thread cannot process a fault, because it is waiting for a lock, there could be a deadlock. However, the BQL is not applicable here, because any other guest memory access while holding the BQL would already result in a deadlock. Nothing else in the kernel should block forever and wait for userspace intervention. Note: pause_all_vcpus()/resume_all_vcpus() or start_exclusive()/end_exclusive() cannot be used, as they either drop the BQL or require to be called without the BQL - something inhibitors cannot handle. We need a low-level locking mechanism that is deadlock-free even when not releasing the BQL. Signed-off-by: David Hildenbrand Signed-off-by: Emanuele Giuseppe Esposito Tested-by: Emanuele Giuseppe Esposito --- accel/kvm/kvm-all.c | 142 ++++++++++++++++++++++++++++++++++++--- include/sysemu/kvm_int.h | 8 +++ 2 files changed, 139 insertions(+), 11 deletions(-) diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c index 48cdf1fecd..05b7aa5403 100644 --- a/accel/kvm/kvm-all.c +++ b/accel/kvm/kvm-all.c @@ -46,6 +46,7 @@ #include "sysemu/hw_accel.h" #include "kvm-cpus.h" #include "sysemu/dirtylimit.h" +#include "qemu/range.h" #include "hw/boards.h" #include "monitor/stats.h" @@ -1294,6 +1295,7 @@ void kvm_set_max_memslot_size(hwaddr max_slot_size) kvm_max_slot_size = max_slot_size; } +/* Called with KVMMemoryListener.slots_lock held */ static void kvm_set_phys_mem(KVMMemoryListener *kml, MemoryRegionSection *section, bool add) { @@ -1328,14 +1330,12 @@ static void kvm_set_phys_mem(KVMMemoryListener *kml, ram = memory_region_get_ram_ptr(mr) + mr_offset; ram_start_offset = memory_region_get_ram_addr(mr) + mr_offset; - kvm_slots_lock(); - if (!add) { do { slot_size = MIN(kvm_max_slot_size, size); mem = kvm_lookup_matching_slot(kml, start_addr, slot_size); if (!mem) { - goto out; + return; } if (mem->flags & KVM_MEM_LOG_DIRTY_PAGES) { /* @@ -1373,7 +1373,7 @@ static void kvm_set_phys_mem(KVMMemoryListener *kml, start_addr += slot_size; size -= slot_size; } while (size); - goto out; + return; } /* register the new slot */ @@ -1398,9 +1398,6 @@ static void kvm_set_phys_mem(KVMMemoryListener *kml, ram += slot_size; size -= slot_size; } while (size); - -out: - kvm_slots_unlock(); } static void *kvm_dirty_ring_reaper_thread(void *data) @@ -1457,18 +1454,137 @@ static void kvm_region_add(MemoryListener *listener, MemoryRegionSection *section) { KVMMemoryListener *kml = container_of(listener, KVMMemoryListener, listener); + KVMMemoryUpdate *update; + + update = g_new0(KVMMemoryUpdate, 1); + update->section = memory_region_section_new_copy(section); - memory_region_ref(section->mr); - kvm_set_phys_mem(kml, section, true); + QSIMPLEQ_INSERT_TAIL(&kml->transaction_add, update, next); } static void kvm_region_del(MemoryListener *listener, MemoryRegionSection *section) { KVMMemoryListener *kml = container_of(listener, KVMMemoryListener, listener); + KVMMemoryUpdate *update; + + update = g_new0(KVMMemoryUpdate, 1); + update->section = memory_region_section_new_copy(section); + + QSIMPLEQ_INSERT_TAIL(&kml->transaction_del, update, next); +} + +static void kvm_ioctl_inhibit_begin(void) +{ + CPUState *cpu; + + /* + * We allow to inhibit only when holding the BQL, so we can identify + * when an inhibitor wants to issue an ioctl easily. + */ + g_assert(qemu_mutex_iothread_locked()); + + CPU_FOREACH(cpu) { + qemu_lockcnt_lock(&cpu->in_ioctl_lock); + } + qemu_lockcnt_lock(&kvm_in_ioctl_lock); - kvm_set_phys_mem(kml, section, false); - memory_region_unref(section->mr); + /* Inhibiting happens rarely, we can keep things simple and spin here. */ + while (true) { + bool any_cpu_in_ioctl = false; + + CPU_FOREACH(cpu) { + if (qemu_lockcnt_count(&cpu->in_ioctl_lock)) { + any_cpu_in_ioctl = true; + qemu_cpu_kick(cpu); + } + } + if (!any_cpu_in_ioctl && + !qemu_lockcnt_count(&kvm_in_ioctl_lock)) { + break; + } + g_usleep(100); + } +} + +static void kvm_ioctl_inhibit_end(void) +{ + CPUState *cpu; + + qemu_lockcnt_unlock(&kvm_in_ioctl_lock); + CPU_FOREACH(cpu) { + qemu_lockcnt_unlock(&cpu->in_ioctl_lock); + } +} + +static void kvm_region_commit(MemoryListener *listener) +{ + KVMMemoryListener *kml = container_of(listener, KVMMemoryListener, + listener); + KVMMemoryUpdate *u1, *u2; + bool need_inhibit = false; + + if (QSIMPLEQ_EMPTY(&kml->transaction_add) && + QSIMPLEQ_EMPTY(&kml->transaction_del)) { + return; + } + + /* + * We have to be careful when regions to add overlap with ranges to remove. + * We have to simulate atomic KVM memslot updates by making sure no ioctl() + * is currently active. + * + * The lists are order by addresses, so it's easy to find overlaps. + */ + u1 = QSIMPLEQ_FIRST(&kml->transaction_del); + u2 = QSIMPLEQ_FIRST(&kml->transaction_add); + while (u1 && u2) { + Range r1, r2; + + range_init_nofail(&r1, u1->section->offset_within_address_space, + int128_get64(u1->section->size)); + range_init_nofail(&r2, u2->section->offset_within_address_space, + int128_get64(u2->section->size)); + + if (range_overlaps_range(&r1, &r2)) { + need_inhibit = true; + break; + } + if (range_lob(&r1) < range_lob(&r2)) { + u1 = QSIMPLEQ_NEXT(u1, next); + } else { + u2 = QSIMPLEQ_NEXT(u2, next); + } + } + + + kvm_slots_lock(); + if (need_inhibit) { + kvm_ioctl_inhibit_begin(); + } + + /* Remove all memslots before adding the new ones. */ + QSIMPLEQ_FOREACH_SAFE(u1, &kml->transaction_del, next, u2) { + kvm_set_phys_mem(kml, u1->section, false); + memory_region_unref(u1->section->mr); + + QSIMPLEQ_REMOVE(&kml->transaction_del, u1, KVMMemoryUpdate, next); + memory_region_section_free_copy(u1->section); + g_free(u1); + } + QSIMPLEQ_FOREACH_SAFE(u1, &kml->transaction_add, next, u2) { + memory_region_ref(u1->section->mr); + kvm_set_phys_mem(kml, u1->section, true); + + QSIMPLEQ_REMOVE(&kml->transaction_add, u1, KVMMemoryUpdate, next); + memory_region_section_free_copy(u1->section); + g_free(u1); + } + + if (need_inhibit) { + kvm_ioctl_inhibit_end(); + } + kvm_slots_unlock(); } static void kvm_log_sync(MemoryListener *listener, @@ -1612,8 +1728,12 @@ void kvm_memory_listener_register(KVMState *s, KVMMemoryListener *kml, kml->slots[i].slot = i; } + QSIMPLEQ_INIT(&kml->transaction_add); + QSIMPLEQ_INIT(&kml->transaction_del); + kml->listener.region_add = kvm_region_add; kml->listener.region_del = kvm_region_del; + kml->listener.commit = kvm_region_commit; kml->listener.log_start = kvm_log_start; kml->listener.log_stop = kvm_log_stop; kml->listener.priority = 10; diff --git a/include/sysemu/kvm_int.h b/include/sysemu/kvm_int.h index 3b4adcdc10..5ea2d7924b 100644 --- a/include/sysemu/kvm_int.h +++ b/include/sysemu/kvm_int.h @@ -12,6 +12,7 @@ #include "exec/memory.h" #include "qapi/qapi-types-common.h" #include "qemu/accel.h" +#include "qemu/queue.h" #include "sysemu/kvm.h" typedef struct KVMSlot @@ -31,10 +32,17 @@ typedef struct KVMSlot ram_addr_t ram_start_offset; } KVMSlot; +typedef struct KVMMemoryUpdate { + QSIMPLEQ_ENTRY(KVMMemoryUpdate) next; + MemoryRegionSection *section; +} KVMMemoryUpdate; + typedef struct KVMMemoryListener { MemoryListener listener; KVMSlot *slots; int as_id; + QSIMPLEQ_HEAD(, KVMMemoryUpdate) transaction_add; + QSIMPLEQ_HEAD(, KVMMemoryUpdate) transaction_del; } KVMMemoryListener; #define KVM_MSI_HASHTAB_SIZE 256