Message ID: 20241022213503.1189954-3-william.roche@oracle.com (mailing list archive)
State: New
Series: [v1,1/4] accel/kvm: SIGBUS handler should also deal with si_addr_lsb
On 22.10.24 23:35, William Roche wrote:
> From: William Roche <william.roche@oracle.com>
>
> Add the page size information to the hwpoison_page_list elements.
> As the kernel doesn't always report the actual poisoned page size,
> we adjust this size from the backend real page size.
> We take into account the recorded page size to adjust the size
> and location of the memory hole.
>
> Signed-off-by: William Roche <william.roche@oracle.com>
> ---
>  accel/kvm/kvm-all.c       | 14 ++++++++++----
>  include/exec/cpu-common.h |  1 +
>  include/sysemu/kvm.h      |  3 ++-
>  include/sysemu/kvm_int.h  |  3 ++-
>  system/physmem.c          | 20 ++++++++++++++++++++
>  target/arm/kvm.c          |  8 ++++++--
>  target/i386/kvm/kvm.c     |  8 ++++++--
>  7 files changed, 47 insertions(+), 10 deletions(-)
>
> diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
> index 2adc4d9c24..40117eefa7 100644
> --- a/accel/kvm/kvm-all.c
> +++ b/accel/kvm/kvm-all.c
> @@ -1266,6 +1266,7 @@ int kvm_vm_check_extension(KVMState *s, unsigned int extension)
>   */
>  typedef struct HWPoisonPage {
>      ram_addr_t ram_addr;
> +    size_t page_size;
>      QLIST_ENTRY(HWPoisonPage) list;
>  } HWPoisonPage;
>
> @@ -1278,15 +1279,18 @@ static void kvm_unpoison_all(void *param)
>
>      QLIST_FOREACH_SAFE(page, &hwpoison_page_list, list, next_page) {
>          QLIST_REMOVE(page, list);
> -        qemu_ram_remap(page->ram_addr, TARGET_PAGE_SIZE);
> +        qemu_ram_remap(page->ram_addr, page->page_size);

Can't we just use the page size from the RAMBlock in qemu_ram_remap()? There we look up the RAMBlock, and all pages in a RAMBlock have the same size.

I'll note that qemu_ram_remap() is rather stupid and optimized only for private memory (not shmem etc.).

mmap(MAP_FIXED|MAP_SHARED, fd) will give you the same poisoned page from the pagecache; you'd have to punch a hole instead.

It might be better to use ram_block_discard_range() in the long run. Memory preallocation + page pinning is tricky, but we could simply bail out in these cases (preallocation failing, ram discard being disabled).

qemu_ram_remap() might be problematic with page pinning (vfio) as is in any way :(
On 10/23/24 09:28, David Hildenbrand wrote:
> On 22.10.24 23:35, William Roche wrote:
>> From: William Roche <william.roche@oracle.com>
>>
>> Add the page size information to the hwpoison_page_list elements.
>> As the kernel doesn't always report the actual poisoned page size,
>> we adjust this size from the backend real page size.
>> We take into account the recorded page size to adjust the size
>> and location of the memory hole.
>>
>> Signed-off-by: William Roche <william.roche@oracle.com>
>> ---
>>  accel/kvm/kvm-all.c       | 14 ++++++++++----
>>  include/exec/cpu-common.h |  1 +
>>  include/sysemu/kvm.h      |  3 ++-
>>  include/sysemu/kvm_int.h  |  3 ++-
>>  system/physmem.c          | 20 ++++++++++++++++++++
>>  target/arm/kvm.c          |  8 ++++++--
>>  target/i386/kvm/kvm.c     |  8 ++++++--
>>  7 files changed, 47 insertions(+), 10 deletions(-)
>>
>> diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
>> index 2adc4d9c24..40117eefa7 100644
>> --- a/accel/kvm/kvm-all.c
>> +++ b/accel/kvm/kvm-all.c
>> @@ -1266,6 +1266,7 @@ int kvm_vm_check_extension(KVMState *s, unsigned int extension)
>>   */
>>  typedef struct HWPoisonPage {
>>      ram_addr_t ram_addr;
>> +    size_t page_size;
>>      QLIST_ENTRY(HWPoisonPage) list;
>>  } HWPoisonPage;
>>
>> @@ -1278,15 +1279,18 @@ static void kvm_unpoison_all(void *param)
>>
>>      QLIST_FOREACH_SAFE(page, &hwpoison_page_list, list, next_page) {
>>          QLIST_REMOVE(page, list);
>> -        qemu_ram_remap(page->ram_addr, TARGET_PAGE_SIZE);
>> +        qemu_ram_remap(page->ram_addr, page->page_size);
>
> Can't we just use the page size from the RAMBlock in qemu_ram_remap?
> There we lookup the RAMBlock, and all pages in a RAMBlock have the same
> size.

Yes, we could use the page size from the RAMBlock in qemu_ram_remap(), which is called when the VM is resetting. But knowing the size of the poisoned chunk of memory at the moment the poison is created is useful to trace what is going on, before maybe seeing other pages reported as poisoned.

That's the goal of the 4th patch: to give the information as soon as we get it. It also helps to filter the newly reported errors and only create an entry in the hwpoison_page_list for new large pages.

We could delay retrieving the page size until we are resetting and present the information post mortem, but I do think that having the information earlier is better in this case.

> I'll note that qemu_ram_remap() is rather stupid and optimized only for
> private memory (not shmem etc).
>
> mmap(MAP_FIXED|MAP_SHARED, fd) will give you the same poisoned page from
> the pagecache; you'd have to punch a hole instead.
>
> It might be better to use ram_block_discard_range() in the long run.
> Memory preallocation + page pinning is tricky, but we could simply bail
> out in these cases (preallocation failing, ram discard being disabled).

I see that ram_block_discard_range() adds more checks before discarding the RAM region, and can also call madvise() in addition to the fallocate punch hole for standard-sized memory pages. As the range is supposed to be recreated, I'm not convinced that these madvise() calls are necessary.

But we can also notice that this function reports the following warning for all non-shared file backends:

"ram_block_discard_range: Discarding RAM in private file mappings is possibly dangerous, because it will modify the underlying file and will affect other users of the file"

This means that hugetlbfs configurations would see this new, cryptic warning message on reboot if they are impacted by a memory poisoning. So I would prefer to keep the fallocate call in qemu_ram_remap(). Or would you prefer to enhance the ram_block_discard_range() code to avoid the message in a reset situation (when called from qemu_ram_remap())?

> qemu_ram_remap() might be problematic with page pinning (vfio) as is in
> any way :(

I agree. If qemu_ram_remap() fails, QEMU exits, through either abort() or exit(1).

Are you saying that memory pinning could be detected by ram_block_discard_range(), or that the mmap() call for the impacted region could fail, for pinned memory? That would be an additional reason to call ram_block_discard_range() from qemu_ram_remap(). Is that what you are suggesting?
diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
index 2adc4d9c24..40117eefa7 100644
--- a/accel/kvm/kvm-all.c
+++ b/accel/kvm/kvm-all.c
@@ -1266,6 +1266,7 @@ int kvm_vm_check_extension(KVMState *s, unsigned int extension)
  */
 typedef struct HWPoisonPage {
     ram_addr_t ram_addr;
+    size_t page_size;
     QLIST_ENTRY(HWPoisonPage) list;
 } HWPoisonPage;
 
@@ -1278,15 +1279,18 @@ static void kvm_unpoison_all(void *param)
 
     QLIST_FOREACH_SAFE(page, &hwpoison_page_list, list, next_page) {
         QLIST_REMOVE(page, list);
-        qemu_ram_remap(page->ram_addr, TARGET_PAGE_SIZE);
+        qemu_ram_remap(page->ram_addr, page->page_size);
         g_free(page);
     }
 }
 
-void kvm_hwpoison_page_add(ram_addr_t ram_addr)
+void kvm_hwpoison_page_add(ram_addr_t ram_addr, size_t sz)
 {
     HWPoisonPage *page;
 
+    if (sz > TARGET_PAGE_SIZE)
+        ram_addr = ROUND_DOWN(ram_addr, sz);
+
     QLIST_FOREACH(page, &hwpoison_page_list, list) {
         if (page->ram_addr == ram_addr) {
             return;
@@ -1294,6 +1298,7 @@ void kvm_hwpoison_page_add(ram_addr_t ram_addr)
     }
     page = g_new(HWPoisonPage, 1);
     page->ram_addr = ram_addr;
+    page->page_size = sz;
     QLIST_INSERT_HEAD(&hwpoison_page_list, page, list);
 }
 
@@ -3140,7 +3145,8 @@ int kvm_cpu_exec(CPUState *cpu)
     if (unlikely(have_sigbus_pending)) {
         bql_lock();
         kvm_arch_on_sigbus_vcpu(cpu, pending_sigbus_code,
-                                pending_sigbus_addr);
+                                pending_sigbus_addr,
+                                pending_sigbus_addr_lsb);
         have_sigbus_pending = false;
         bql_unlock();
     }
@@ -3678,7 +3684,7 @@ int kvm_on_sigbus(int code, void *addr, short addr_lsb)
      * we can only get action optional here.
      */
     assert(code != BUS_MCEERR_AR);
-    kvm_arch_on_sigbus_vcpu(first_cpu, code, addr);
+    kvm_arch_on_sigbus_vcpu(first_cpu, code, addr, addr_lsb);
     return 0;
 #else
     return 1;
diff --git a/include/exec/cpu-common.h b/include/exec/cpu-common.h
index 638dc806a5..b971b13306 100644
--- a/include/exec/cpu-common.h
+++ b/include/exec/cpu-common.h
@@ -108,6 +108,7 @@ bool qemu_ram_is_named_file(RAMBlock *rb);
 int qemu_ram_get_fd(RAMBlock *rb);
 size_t qemu_ram_pagesize(RAMBlock *block);
+size_t qemu_ram_pagesize_from_host(void *addr);
 size_t qemu_ram_pagesize_largest(void);
 
 /**
diff --git a/include/sysemu/kvm.h b/include/sysemu/kvm.h
index 1bde598404..4106a7ec07 100644
--- a/include/sysemu/kvm.h
+++ b/include/sysemu/kvm.h
@@ -383,7 +383,8 @@ bool kvm_vcpu_id_is_valid(int vcpu_id);
 unsigned long kvm_arch_vcpu_id(CPUState *cpu);
 
 #ifdef KVM_HAVE_MCE_INJECTION
-void kvm_arch_on_sigbus_vcpu(CPUState *cpu, int code, void *addr);
+void kvm_arch_on_sigbus_vcpu(CPUState *cpu, int code, void *addr,
+                             short addr_lsb);
 #endif
 
 void kvm_arch_init_irq_routing(KVMState *s);
diff --git a/include/sysemu/kvm_int.h b/include/sysemu/kvm_int.h
index a1e72763da..d2160be0ae 100644
--- a/include/sysemu/kvm_int.h
+++ b/include/sysemu/kvm_int.h
@@ -178,10 +178,11 @@ void kvm_set_max_memslot_size(hwaddr max_slot_size);
  *
  * Parameters:
  * @ram_addr: the address in the RAM for the poisoned page
+ * @sz: size of the poisoned page as reported by the kernel
  *
  * Add a poisoned page to the list
  *
  * Return: None.
  */
-void kvm_hwpoison_page_add(ram_addr_t ram_addr);
+void kvm_hwpoison_page_add(ram_addr_t ram_addr, size_t sz);
 #endif
diff --git a/system/physmem.c b/system/physmem.c
index dc1db3a384..3757428336 100644
--- a/system/physmem.c
+++ b/system/physmem.c
@@ -1665,6 +1665,26 @@ size_t qemu_ram_pagesize(RAMBlock *rb)
     return rb->page_size;
 }
 
+/* Returns backend real page size used for the given address */
+size_t qemu_ram_pagesize_from_host(void *addr)
+{
+    RAMBlock *rb;
+    ram_addr_t offset;
+
+    /*
+     * XXX kernel provided size is not reliable...
+     * As kvm_send_hwpoison_signal() uses a hard-coded PAGE_SHIFT
+     * signal value on hwpoison signal.
+     * So we must identify the actual size to consider from the
+     * mapping block pagesize.
+     */
+    rb = qemu_ram_block_from_host(addr, false, &offset);
+    if (!rb) {
+        return TARGET_PAGE_SIZE;
+    }
+    return qemu_ram_pagesize(rb);
+}
+
 /* Returns the largest size of page in use */
 size_t qemu_ram_pagesize_largest(void)
 {
diff --git a/target/arm/kvm.c b/target/arm/kvm.c
index f1f1b5b375..11579e170b 100644
--- a/target/arm/kvm.c
+++ b/target/arm/kvm.c
@@ -2348,10 +2348,11 @@ int kvm_arch_get_registers(CPUState *cs, Error **errp)
     return ret;
 }
 
-void kvm_arch_on_sigbus_vcpu(CPUState *c, int code, void *addr)
+void kvm_arch_on_sigbus_vcpu(CPUState *c, int code, void *addr, short addr_lsb)
 {
     ram_addr_t ram_addr;
     hwaddr paddr;
+    size_t sz = (addr_lsb > 0) ? (1 << addr_lsb) : TARGET_PAGE_SIZE;
 
     assert(code == BUS_MCEERR_AR || code == BUS_MCEERR_AO);
 
@@ -2359,7 +2360,10 @@ void kvm_arch_on_sigbus_vcpu(CPUState *c, int code, void *addr)
     ram_addr = qemu_ram_addr_from_host(addr);
     if (ram_addr != RAM_ADDR_INVALID &&
         kvm_physical_memory_addr_from_host(c->kvm_state, addr, &paddr)) {
-        kvm_hwpoison_page_add(ram_addr);
+        if (sz == TARGET_PAGE_SIZE) {
+            sz = qemu_ram_pagesize_from_host(addr);
+        }
+        kvm_hwpoison_page_add(ram_addr, sz);
         /*
          * If this is a BUS_MCEERR_AR, we know we have been called
          * synchronously from the vCPU thread, so we can easily
diff --git a/target/i386/kvm/kvm.c b/target/i386/kvm/kvm.c
index fd9f198892..71e674bca0 100644
--- a/target/i386/kvm/kvm.c
+++ b/target/i386/kvm/kvm.c
@@ -735,12 +735,13 @@ static void hardware_memory_error(void *host_addr)
     exit(1);
 }
 
-void kvm_arch_on_sigbus_vcpu(CPUState *c, int code, void *addr)
+void kvm_arch_on_sigbus_vcpu(CPUState *c, int code, void *addr, short addr_lsb)
 {
     X86CPU *cpu = X86_CPU(c);
     CPUX86State *env = &cpu->env;
     ram_addr_t ram_addr;
     hwaddr paddr;
+    size_t sz = (addr_lsb > 0) ? (1 << addr_lsb) : TARGET_PAGE_SIZE;
 
     /* If we get an action required MCE, it has been injected by KVM
      * while the VM was running. An action optional MCE instead should
@@ -753,7 +754,10 @@ void kvm_arch_on_sigbus_vcpu(CPUState *c, int code, void *addr)
     ram_addr = qemu_ram_addr_from_host(addr);
     if (ram_addr != RAM_ADDR_INVALID &&
         kvm_physical_memory_addr_from_host(c->kvm_state, addr, &paddr)) {
-        kvm_hwpoison_page_add(ram_addr);
+        if (sz == TARGET_PAGE_SIZE) {
+            sz = qemu_ram_pagesize_from_host(addr);
+        }
+        kvm_hwpoison_page_add(ram_addr, sz);
         kvm_mce_inject(cpu, paddr, code);
 
         /*