[PULL,55/63] kvm: handle KVM_EXIT_MEMORY_FAULT


Message ID

20240423150951.41600-56-pbonzini@redhat.com (mailing list archive)

State

New, archived

Headers

From: Paolo Bonzini <pbonzini@redhat.com>
To: qemu-devel@nongnu.org
Cc: Chao Peng <chao.p.peng@linux.intel.com>, Xiaoyao Li <xiaoyao.li@intel.com>
Subject: [PULL 55/63] kvm: handle KVM_EXIT_MEMORY_FAULT
Date: Tue, 23 Apr 2024 17:09:43 +0200
Message-ID: <20240423150951.41600-56-pbonzini@redhat.com>
In-Reply-To: <20240423150951.41600-1-pbonzini@redhat.com>
References: <20240423150951.41600-1-pbonzini@redhat.com>

Commit Message

Paolo Bonzini April 23, 2024, 3:09 p.m. UTC

From: Chao Peng <chao.p.peng@linux.intel.com>

Upon a KVM_EXIT_MEMORY_FAULT exit, userspace needs to convert the
memory in the RAMBlock to the desired attribute, switching it between
private and shared.

Currently only KVM_MEMORY_EXIT_FLAG_PRIVATE in flags is valid when
KVM_EXIT_MEMORY_FAULT happens.

Note, KVM_EXIT_MEMORY_FAULT makes sense only when the RAMBlock has a
guest_memfd memory backend.

Note, KVM_RUN returns -EFAULT together with KVM_EXIT_MEMORY_FAULT, so
special handling is added to keep this case from being reported as a
failed run.

When a page is converted from shared to private, the original shared
memory can be discarded via ram_block_discard_range(). Note, shared
memory can be discarded only when it is not backed by hugetlb, because
hugetlb is supposed to be pre-allocated and does not need discarding.

Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
Co-developed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>

Message-ID: <20240320083945.991426-13-michael.roth@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 include/sysemu/kvm.h   |  2 +
 accel/kvm/kvm-all.c    | 98 +++++++++++++++++++++++++++++++++++++-----
 accel/kvm/trace-events |  2 +
 3 files changed, 92 insertions(+), 10 deletions(-)
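
The two checks the commit message describes (tolerating -EFAULT for this exit reason, and rejecting unknown fault flags) can be sketched in isolation. This is a minimal model under stated assumptions, not the QEMU code: run_result_is_fatal() and decode_fault_flags() are hypothetical names, the SKETCH_* constants are local copies assumed to match the Linux KVM UAPI header, and the retry handling that the real run loop performs before this point is left out.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Local stand-ins, assumed to mirror the values in <linux/kvm.h>. */
#define SKETCH_KVM_EXIT_MEMORY_FAULT        39
#define SKETCH_KVM_MEMORY_EXIT_FLAG_PRIVATE (1ULL << 3)

/*
 * Hypothetical helper: decide whether a KVM_RUN return value is a real
 * failure. A memory-fault exit arrives together with -EFAULT, so that one
 * combination is treated as a normal exit that still has to be handled.
 */
static bool run_result_is_fatal(int run_ret, uint32_t exit_reason)
{
    if (run_ret >= 0) {
        return false;
    }
    return !(run_ret == -EFAULT &&
             exit_reason == SKETCH_KVM_EXIT_MEMORY_FAULT);
}

/*
 * Hypothetical helper: validate the fault flags and derive the conversion
 * direction. Returns 0 and sets *to_private on success, or -1 if any bit
 * other than the private flag is set.
 */
static int decode_fault_flags(uint64_t flags, bool *to_private)
{
    if (flags & ~SKETCH_KVM_MEMORY_EXIT_FLAG_PRIVATE) {
        return -1;
    }
    *to_private = (flags & SKETCH_KVM_MEMORY_EXIT_FLAG_PRIVATE) != 0;
    return 0;
}
```

A caller would first consult run_result_is_fatal(), then pass the decoded direction on to the conversion routine; a cleared private bit means the guest is asking for a private-to-shared conversion.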

Comments

Peter Maydell April 26, 2024, 1:40 p.m. UTC | #1

On Tue, 23 Apr 2024 at 16:16, Paolo Bonzini <pbonzini@redhat.com> wrote:
>
> From: Chao Peng <chao.p.peng@linux.intel.com>
>
> Upon a KVM_EXIT_MEMORY_FAULT exit, userspace needs to convert the
> memory in the RAMBlock to the desired attribute, switching it between
> private and shared.
>
> Currently only KVM_MEMORY_EXIT_FLAG_PRIVATE in flags is valid when
> KVM_EXIT_MEMORY_FAULT happens.
>
> Note, KVM_EXIT_MEMORY_FAULT makes sense only when the RAMBlock has a
> guest_memfd memory backend.
>
> Note, KVM_RUN returns -EFAULT together with KVM_EXIT_MEMORY_FAULT, so
> special handling is added to keep this case from being reported as a
> failed run.
>
> When a page is converted from shared to private, the original shared
> memory can be discarded via ram_block_discard_range(). Note, shared
> memory can be discarded only when it is not backed by hugetlb, because
> hugetlb is supposed to be pre-allocated and does not need discarding.
>
> Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
> Co-developed-by: Xiaoyao Li <xiaoyao.li@intel.com>
> Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
>
> Message-ID: <20240320083945.991426-13-michael.roth@amd.com>
> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

Hi; Coverity points out an issue with this code (CID 1544114):



> +int kvm_convert_memory(hwaddr start, hwaddr size, bool to_private)
> +{
> +    MemoryRegionSection section;
> +    ram_addr_t offset;

offset here is not initialized...

> +    MemoryRegion *mr;
> +    RAMBlock *rb;
> +    void *addr;
> +    int ret = -1;
> +
> +    trace_kvm_convert_memory(start, size, to_private ? "shared_to_private" : "private_to_shared");
> +
> +    if (!QEMU_PTR_IS_ALIGNED(start, qemu_real_host_page_size()) ||
> +        !QEMU_PTR_IS_ALIGNED(size, qemu_real_host_page_size())) {
> +        return -1;
> +    }
> +
> +    if (!size) {
> +        return -1;
> +    }
> +
> +    section = memory_region_find(get_system_memory(), start, size);
> +    mr = section.mr;
> +    if (!mr) {
> +        return -1;
> +    }
> +
> +    if (!memory_region_has_guest_memfd(mr)) {
> +        error_report("Converting non guest_memfd backed memory region "
> +                     "(0x%"HWADDR_PRIx" ,+ 0x%"HWADDR_PRIx") to %s",
> +                     start, size, to_private ? "private" : "shared");
> +        goto out_unref;
> +    }
> +
> +    if (to_private) {
> +        ret = kvm_set_memory_attributes_private(start, size);
> +    } else {
> +        ret = kvm_set_memory_attributes_shared(start, size);
> +    }
> +    if (ret) {
> +        goto out_unref;
> +    }
> +
> +    addr = memory_region_get_ram_ptr(mr) + section.offset_within_region;
> +    rb = qemu_ram_block_from_host(addr, false, &offset);

...and this call to qemu_ram_block_from_host() will only initialize
offset if it does not fail (i.e. doesn't return NULL)...

> +
> +    if (to_private) {
> +        if (rb->page_size != qemu_real_host_page_size()) {

...but here we assume rb is not NULL...

> +            /*
> +             * shared memory is backed by hugetlb, which is supposed to be
> +             * pre-allocated and doesn't need to be discarded
> +             */
> +            goto out_unref;
> +        }
> +        ret = ram_block_discard_range(rb, offset, size);
> +    } else {
> +        ret = ram_block_discard_guest_memfd_range(rb, offset, size);

...and here we use offset assuming it has been initialized.

I think this code should either handle the case where
qemu_ram_block_from_host() fails, or, if it is impossible
for it to fail in this situation, add an assert() and a
comment about why we know it can't fail.

> +    }
> +
> +out_unref:
> +    memory_region_unref(mr);
> +    return ret;
> +}

thanks
-- PMM
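
The first option raised in the review, handling a failed lookup, can be illustrated with a toy model. Everything here is hypothetical: ToyBlock, toy_block_from_host() and toy_discard() are stand-ins written for this sketch, and only the contract matches what the review describes (the lookup returns NULL on failure and writes the offset only on success).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Toy stand-in for a RAM block: a host buffer and its length. */
typedef struct ToyBlock {
    uint8_t *host;
    size_t len;
} ToyBlock;

/*
 * Toy lookup: returns NULL when ptr lies outside the block and leaves
 * *offset untouched in that case, like the lookup discussed above.
 */
static ToyBlock *toy_block_from_host(ToyBlock *blk, const void *ptr,
                                     size_t *offset)
{
    uintptr_t base = (uintptr_t)blk->host;
    uintptr_t p = (uintptr_t)ptr;

    if (p < base || p - base >= blk->len) {
        return NULL;
    }
    *offset = (size_t)(p - base);
    return blk;
}

/* Caller that checks the lookup before using the block or the offset. */
static int toy_discard(ToyBlock *blk, void *ptr, size_t size)
{
    size_t offset = 0;              /* defined even if the lookup fails */
    ToyBlock *rb = toy_block_from_host(blk, ptr, &offset);

    if (!rb) {
        return -1;                  /* lookup failed: nothing to discard */
    }
    if (size > rb->len - offset) {
        return -1;                  /* range runs past the end of the block */
    }
    memset(rb->host + offset, 0, size);   /* model "discard" as zeroing */
    return 0;
}
```

Initializing offset and bailing out on NULL removes both uses that the static analyzer flags; whether an error return or an assertion is the better reaction depends on whether the failure can happen at all in the real call path.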

Paolo Bonzini April 30, 2024, 7:06 p.m. UTC | #2

On Fri, Apr 26, 2024 at 3:40 PM Peter Maydell <peter.maydell@linaro.org> wrote:
> > +    addr = memory_region_get_ram_ptr(mr) + section.offset_within_region;
> > +    rb = qemu_ram_block_from_host(addr, false, &offset);
>
> ...and this call to qemu_ram_block_from_host() will only initialize
> offset if it does not fail (i.e. doesn't return NULL)...
>
> I think this code should either handle the case where
> qemu_ram_block_from_host() fails, or, if it is impossible
> for it to fail in this situation, add an assert() and a
> comment about why we know it can't fail.

The assertion is in memory_region_get_ram_ptr(), but Coverity
understandably cannot see it.

Similar to other code in hw/virtio/virtio-balloon.c, this code is
using memory_region_get_ram_ptr() as a roundabout way to go from
MemoryRegion (in this case MemoryRegionSection) to RAMBlock.  The
right fix is to introduce memory_region_get_ram_block() and use it.

Paolo
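
The direction suggested in this reply, going from the region straight to its block instead of detouring through a host pointer, might look roughly like the toy model below. This is only a sketch: the Toy* types and both functions are hypothetical, no such accessor is shown in this patch, and the model assumes the region maps its block starting at offset zero so that the offset within the region is also the offset within the block.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-ins for the real structures; only the fields used here. */
typedef struct ToyRamBlock {
    size_t used_length;
} ToyRamBlock;

typedef struct ToyRegion {
    ToyRamBlock *ram_block;     /* NULL if the region is not RAM backed */
} ToyRegion;

typedef struct ToySection {
    ToyRegion *mr;
    size_t offset_within_region;
} ToySection;

/*
 * Hypothetical accessor: the "region is RAM backed" assertion lives here,
 * in one visible place, and the block is returned directly.
 */
static ToyRamBlock *toy_region_get_ram_block(ToyRegion *mr)
{
    assert(mr && mr->ram_block);
    return mr->ram_block;
}

/* Resolve a section to (block, offset) with no host-pointer lookup. */
static ToyRamBlock *toy_section_to_block(const ToySection *s, size_t *offset)
{
    ToyRamBlock *rb = toy_region_get_ram_block(s->mr);

    /* Assumption of this sketch: the region starts at block offset 0. */
    *offset = s->offset_within_region;
    return rb;
}
```

With this shape there is no lookup that can fail after the assertion, so the offset is always written before it is read.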

diff --git a/include/sysemu/kvm.h b/include/sysemu/kvm.h
index 217f3fe17ba..47f9e8be1b3 100644
--- a/include/sysemu/kvm.h
+++ b/include/sysemu/kvm.h
@@ -542,4 +542,6 @@  int kvm_create_guest_memfd(uint64_t size, uint64_t flags, Error **errp);
 int kvm_set_memory_attributes_private(hwaddr start, uint64_t size);
 int kvm_set_memory_attributes_shared(hwaddr start, uint64_t size);
 
+int kvm_convert_memory(hwaddr start, hwaddr size, bool to_private);
+
 #endif
diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
index f49b2b95b54..9eef2c64003 100644
--- a/accel/kvm/kvm-all.c
+++ b/accel/kvm/kvm-all.c
@@ -2900,6 +2900,69 @@  static void kvm_eat_signals(CPUState *cpu)
     } while (sigismember(&chkset, SIG_IPI));
 }
 
+int kvm_convert_memory(hwaddr start, hwaddr size, bool to_private)
+{
+    MemoryRegionSection section;
+    ram_addr_t offset;
+    MemoryRegion *mr;
+    RAMBlock *rb;
+    void *addr;
+    int ret = -1;
+
+    trace_kvm_convert_memory(start, size, to_private ? "shared_to_private" : "private_to_shared");
+
+    if (!QEMU_PTR_IS_ALIGNED(start, qemu_real_host_page_size()) ||
+        !QEMU_PTR_IS_ALIGNED(size, qemu_real_host_page_size())) {
+        return -1;
+    }
+
+    if (!size) {
+        return -1;
+    }
+
+    section = memory_region_find(get_system_memory(), start, size);
+    mr = section.mr;
+    if (!mr) {
+        return -1;
+    }
+
+    if (!memory_region_has_guest_memfd(mr)) {
+        error_report("Converting non guest_memfd backed memory region "
+                     "(0x%"HWADDR_PRIx" ,+ 0x%"HWADDR_PRIx") to %s",
+                     start, size, to_private ? "private" : "shared");
+        goto out_unref;
+    }
+
+    if (to_private) {
+        ret = kvm_set_memory_attributes_private(start, size);
+    } else {
+        ret = kvm_set_memory_attributes_shared(start, size);
+    }
+    if (ret) {
+        goto out_unref;
+    }
+
+    addr = memory_region_get_ram_ptr(mr) + section.offset_within_region;
+    rb = qemu_ram_block_from_host(addr, false, &offset);
+
+    if (to_private) {
+        if (rb->page_size != qemu_real_host_page_size()) {
+            /*
+             * shared memory is backed by hugetlb, which is supposed to be
+             * pre-allocated and doesn't need to be discarded
+             */
+            goto out_unref;
+        }
+        ret = ram_block_discard_range(rb, offset, size);
+    } else {
+        ret = ram_block_discard_guest_memfd_range(rb, offset, size);
+    }
+
+out_unref:
+    memory_region_unref(mr);
+    return ret;
+}
+
 int kvm_cpu_exec(CPUState *cpu)
 {
     struct kvm_run *run = cpu->kvm_run;
@@ -2967,18 +3030,20 @@  int kvm_cpu_exec(CPUState *cpu)
                 ret = EXCP_INTERRUPT;
                 break;
             }
-            fprintf(stderr, "error: kvm run failed %s\n",
-                    strerror(-run_ret));
+            if (!(run_ret == -EFAULT && run->exit_reason == KVM_EXIT_MEMORY_FAULT)) {
+                fprintf(stderr, "error: kvm run failed %s\n",
+                        strerror(-run_ret));
 #ifdef TARGET_PPC
-            if (run_ret == -EBUSY) {
-                fprintf(stderr,
-                        "This is probably because your SMT is enabled.\n"
-                        "VCPU can only run on primary threads with all "
-                        "secondary threads offline.\n");
-            }
+                if (run_ret == -EBUSY) {
+                    fprintf(stderr,
+                            "This is probably because your SMT is enabled.\n"
+                            "VCPU can only run on primary threads with all "
+                            "secondary threads offline.\n");
+                }
 #endif
-            ret = -1;
-            break;
+                ret = -1;
+                break;
+            }
         }
 
         trace_kvm_run_exit(cpu->cpu_index, run->exit_reason);
@@ -3061,6 +3126,19 @@  int kvm_cpu_exec(CPUState *cpu)
                 break;
             }
             break;
+        case KVM_EXIT_MEMORY_FAULT:
+            trace_kvm_memory_fault(run->memory_fault.gpa,
+                                   run->memory_fault.size,
+                                   run->memory_fault.flags);
+            if (run->memory_fault.flags & ~KVM_MEMORY_EXIT_FLAG_PRIVATE) {
+                error_report("KVM_EXIT_MEMORY_FAULT: Unknown flag 0x%" PRIx64,
+                             (uint64_t)run->memory_fault.flags);
+                ret = -1;
+                break;
+            }
+            ret = kvm_convert_memory(run->memory_fault.gpa, run->memory_fault.size,
+                                     run->memory_fault.flags & KVM_MEMORY_EXIT_FLAG_PRIVATE);
+            break;
         default:
             ret = kvm_arch_handle_exit(cpu, run);
             break;
diff --git a/accel/kvm/trace-events b/accel/kvm/trace-events
index e8c52cb9e7a..681ccb667d6 100644
--- a/accel/kvm/trace-events
+++ b/accel/kvm/trace-events
@@ -31,3 +31,5 @@  kvm_cpu_exec(void) ""
 kvm_interrupt_exit_request(void) ""
 kvm_io_window_exit(void) ""
 kvm_run_exit_system_event(int cpu_index, uint32_t event_type) "cpu_index %d, system_even_type %"PRIu32
+kvm_convert_memory(uint64_t start, uint64_t size, const char *msg) "start 0x%" PRIx64 " size 0x%" PRIx64 " %s"
+kvm_memory_fault(uint64_t start, uint64_t size, uint64_t flags) "start 0x%" PRIx64 " size 0x%" PRIx64 " flags 0x%" PRIx64
