diff mbox series

[v1,3/4] system/physmem: Largepage punch hole before reset of memory pages

Message ID 20241022213503.1189954-4-william.roche@oracle.com (mailing list archive)
State New
Series [v1,1/4] accel/kvm: SIGBUS handler should also deal with si_addr_lsb

Commit Message

William Roche Oct. 22, 2024, 9:35 p.m. UTC
From: William Roche <william.roche@oracle.com>

When the VM reboots, a memory reset is performed, calling
qemu_ram_remap() on all hwpoisoned pages.
While we take the recorded page sizes into account to repair the
memory locations, a large page also needs a hole punched in the
backend file to regenerate usable memory, cleaning the HW
poisoned section. This is mandatory for the hugetlbfs case, for example.

Signed-off-by: William Roche <william.roche@oracle.com>
---
 system/physmem.c | 8 ++++++++
 1 file changed, 8 insertions(+)

Comments

David Hildenbrand Oct. 23, 2024, 7:30 a.m. UTC | #1
On 22.10.24 23:35, William Roche wrote:
> From: William Roche <william.roche@oracle.com>
> 
> When the VM reboots, a memory reset is performed, calling
> qemu_ram_remap() on all hwpoisoned pages.
> While we take the recorded page sizes into account to repair the
> memory locations, a large page also needs a hole punched in the
> backend file to regenerate usable memory, cleaning the HW
> poisoned section. This is mandatory for the hugetlbfs case, for example.
> 
> Signed-off-by: William Roche <william.roche@oracle.com>
> ---
>   system/physmem.c | 8 ++++++++
>   1 file changed, 8 insertions(+)
> 
> diff --git a/system/physmem.c b/system/physmem.c
> index 3757428336..3f6024a92d 100644
> --- a/system/physmem.c
> +++ b/system/physmem.c
> @@ -2211,6 +2211,14 @@ void qemu_ram_remap(ram_addr_t addr, ram_addr_t length)
>                   prot = PROT_READ;
>                   prot |= block->flags & RAM_READONLY ? 0 : PROT_WRITE;
>                   if (block->fd >= 0) {
> +                    if (length > TARGET_PAGE_SIZE && fallocate(block->fd,
> +                        FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE,
> +                        offset + block->fd_offset, length) != 0) {
> +                        error_report("Could not recreate the file hole for "
> +                                     "addr: " RAM_ADDR_FMT "@" RAM_ADDR_FMT "",
> +                                     length, addr);
> +                        exit(1);
> +                    }
>                       area = mmap(vaddr, length, prot, flags, block->fd,
>                                   offset + block->fd_offset);
>                   } else {

Ah! Just what I commented to patch #3; we should be using 
ram_discard_range(). It might be better to avoid the mmap() completely 
if ram_discard_range() worked.

And as raised, there is the problem with memory preallocation (where we 
should fail if it doesn't work) and ram discards being disabled because 
something relies on long-term page pinning ...
William Roche Oct. 25, 2024, 11:27 p.m. UTC | #2
On 10/23/24 09:30, David Hildenbrand wrote:

> On 22.10.24 23:35, William Roche wrote:
>> From: William Roche <william.roche@oracle.com>
>>
>> When the VM reboots, a memory reset is performed, calling
>> qemu_ram_remap() on all hwpoisoned pages.
>> While we take the recorded page sizes into account to repair the
>> memory locations, a large page also needs a hole punched in the
>> backend file to regenerate usable memory, cleaning the HW
>> poisoned section. This is mandatory for the hugetlbfs case, for example.
>>
>> Signed-off-by: William Roche <william.roche@oracle.com>
>> ---
>>   system/physmem.c | 8 ++++++++
>>   1 file changed, 8 insertions(+)
>>
>> diff --git a/system/physmem.c b/system/physmem.c
>> index 3757428336..3f6024a92d 100644
>> --- a/system/physmem.c
>> +++ b/system/physmem.c
>> @@ -2211,6 +2211,14 @@ void qemu_ram_remap(ram_addr_t addr, ram_addr_t length)
>>                   prot = PROT_READ;
>>                   prot |= block->flags & RAM_READONLY ? 0 : PROT_WRITE;
>>                   if (block->fd >= 0) {
>> +                    if (length > TARGET_PAGE_SIZE && fallocate(block->fd,
>> +                        FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE,
>> +                        offset + block->fd_offset, length) != 0) {
>> +                        error_report("Could not recreate the file hole for "
>> +                                     "addr: " RAM_ADDR_FMT "@" RAM_ADDR_FMT "",
>> +                                     length, addr);
>> +                        exit(1);
>> +                    }
>>                       area = mmap(vaddr, length, prot, flags, block->fd,
>>                                   offset + block->fd_offset);
>>                   } else {
>
> Ah! Just what I commented to patch #3; we should be using 
> ram_discard_range(). It might be better to avoid the mmap() completely 
> if ram_discard_range() worked.


I think you are referring to ram_block_discard_range() here, as 
ram_discard_range() seems to relate to VM migration rather than a VM reset.

Remapping the page is needed to get rid of the poison. So if we want to 
avoid the mmap(), we would have to shrink the memory address space -- which 
can be a real problem if we imagine a VM with 1G large pages, for 
example. qemu_ram_remap() is used to regenerate the lost memory, and the 
mmap() call looks mandatory in the reset phase.


>
> And as raised, there is the problem with memory preallocation (where 
> we should fail if it doesn't work) and ram discards being disabled 
> because something relies on long-term page pinning ...


Yes. Do you suggest that we add a call to qemu_prealloc_mem() for the 
remapped area when backend->prealloc is true?

Or, as this piece of code only runs on POSIX machines (ifndef 
_WIN32), maybe we could simply add a MAP_POPULATE flag to the mmap() call 
done in qemu_ram_remap() in the case where the backend requires 
'prealloc'? Can you confirm whether this flag can be used on all systems 
running this code?

Unfortunately, I don't know how to get the MEMORY_BACKEND corresponding 
to a given memory block. I'm not sure that MEMORY_BACKEND(block->mr) is 
a valid way to retrieve the backend object and its 'prealloc' property 
here. Could you please point me in the right direction?

I can send a new version using ram_block_discard_range(), as you 
suggested, to replace the direct call to fallocate(), if you think it 
would be better.
Please let me know what other enhancements you'd like to see in this 
code change.

Thanks in advance,
William.
Patch

diff --git a/system/physmem.c b/system/physmem.c
index 3757428336..3f6024a92d 100644
--- a/system/physmem.c
+++ b/system/physmem.c
@@ -2211,6 +2211,14 @@ void qemu_ram_remap(ram_addr_t addr, ram_addr_t length)
                 prot = PROT_READ;
                 prot |= block->flags & RAM_READONLY ? 0 : PROT_WRITE;
                 if (block->fd >= 0) {
+                    if (length > TARGET_PAGE_SIZE && fallocate(block->fd,
+                        FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE,
+                        offset + block->fd_offset, length) != 0) {
+                        error_report("Could not recreate the file hole for "
+                                     "addr: " RAM_ADDR_FMT "@" RAM_ADDR_FMT "",
+                                     length, addr);
+                        exit(1);
+                    }
                     area = mmap(vaddr, length, prot, flags, block->fd,
                                 offset + block->fd_offset);
                 } else {