
pflash: Only read non-zero parts of backend image

Message ID 20190505070059.4664-1-zhengxiang9@huawei.com (mailing list archive)
State New, archived
Series pflash: Only read non-zero parts of backend image

Commit Message

Xiang Zheng May 5, 2019, 7 a.m. UTC
Currently we fill the memory space with two 64MB NOR images when
using persistent UEFI variables on virt board. Actually we only use
a very small (non-zero) part of the memory, while the significantly
larger remaining (zero) part of the memory is wasted.

So this patch checks the block status and only writes the non-zero part
into memory. This requires pflash devices to use sparse files for
backends.

Suggested-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Xiang Zheng <zhengxiang9@huawei.com>
---
 hw/block/block.c | 40 +++++++++++++++++++++++++++++++++++++++-
 1 file changed, 39 insertions(+), 1 deletion(-)

Comments

Peter Maydell May 5, 2019, 3:37 p.m. UTC | #1
On Sun, 5 May 2019 at 08:02, Xiang Zheng <zhengxiang9@huawei.com> wrote:
>
> Currently we fill the memory space with two 64MB NOR images when
> using persistent UEFI variables on virt board. Actually we only use
> a very small(non-zero) part of the memory while the rest significant
> large(zero) part of memory is wasted.
>
> So this patch checks the block status and only writes the non-zero part
> into memory. This requires pflash devices to use sparse files for
> backends.

Do you mean "pflash devices will no longer work if the file
that is backing them is not sparse", or just "if the file that
is backing them is not sparse then you won't get the benefit
of using less memory" ?

thanks
-- PMM
Xiang Zheng May 6, 2019, 2:51 a.m. UTC | #2
On 2019/5/5 23:37, Peter Maydell wrote:
> On Sun, 5 May 2019 at 08:02, Xiang Zheng <zhengxiang9@huawei.com> wrote:
>>
>> Currently we fill the memory space with two 64MB NOR images when
>> using persistent UEFI variables on virt board. Actually we only use
>> a very small(non-zero) part of the memory while the rest significant
>> large(zero) part of memory is wasted.
>>
>> So this patch checks the block status and only writes the non-zero part
>> into memory. This requires pflash devices to use sparse files for
>> backends.
> 
> Do you mean "pflash devices will no longer work if the file
> that is backing them is not sparse", or just "if the file that
> is backing them is not sparse then you won't get the benefit
> of using less memory" ?
> 

I mean the latter; if the file is not sparse, nothing changes.
I will improve this commit message in the next version.
Markus Armbruster May 7, 2019, 6:01 p.m. UTC | #3
The subject is slightly misleading.  Holes read as zero.  So do
non-holes full of zeroes.  The patch avoids reading the former, but
still reads the latter.

Xiang Zheng <zhengxiang9@huawei.com> writes:

> Currently we fill the memory space with two 64MB NOR images when
> using persistent UEFI variables on virt board. Actually we only use
> a very small(non-zero) part of the memory while the rest significant
> large(zero) part of memory is wasted.

Neglects to mention that the "virt board" is ARM.

> So this patch checks the block status and only writes the non-zero part
> into memory. This requires pflash devices to use sparse files for
> backends.

I started to draft an improved commit message, but then I realized this
patch can't work.

The pflash_cfi01 device allocates its device memory like this:

    memory_region_init_rom_device(
        &pfl->mem, OBJECT(dev),
        &pflash_cfi01_ops,
        pfl,
        pfl->name, total_len, &local_err);

pflash_cfi02 is similar.

memory_region_init_rom_device() calls
memory_region_init_rom_device_nomigrate() calls qemu_ram_alloc() calls
qemu_ram_alloc_internal() calls g_malloc0().  Thus, all the device
memory gets written to even with this patch.

I'm afraid you neglected to test.

I still believe this approach can be made to work.  Need a replacement
for memory_region_init_rom_device() that uses mmap() with MAP_ANONYMOUS.
Laszlo Ersek May 7, 2019, 7:03 p.m. UTC | #4
Hi Markus,

On 05/07/19 20:01, Markus Armbruster wrote:
> The subject is slightly misleading.  Holes read as zero.  So do
> non-holes full of zeroes.  The patch avoids reading the former, but
> still reads the latter.
> 
> Xiang Zheng <zhengxiang9@huawei.com> writes:
> 
>> Currently we fill the memory space with two 64MB NOR images when
>> using persistent UEFI variables on virt board. Actually we only use
>> a very small(non-zero) part of the memory while the rest significant
>> large(zero) part of memory is wasted.
> 
> Neglects to mention that the "virt board" is ARM.
> 
>> So this patch checks the block status and only writes the non-zero part
>> into memory. This requires pflash devices to use sparse files for
>> backends.
> 
> I started to draft an improved commit message, but then I realized this
> patch can't work.
> 
> The pflash_cfi01 device allocates its device memory like this:
> 
>     memory_region_init_rom_device(
>         &pfl->mem, OBJECT(dev),
>         &pflash_cfi01_ops,
>         pfl,
>         pfl->name, total_len, &local_err);
> 
> pflash_cfi02 is similar.
> 
> memory_region_init_rom_device() calls
> memory_region_init_rom_device_nomigrate() calls qemu_ram_alloc() calls
> qemu_ram_alloc_internal() calls g_malloc0().  Thus, all the device
> memory gets written to even with this patch.

As far as I can see, qemu_ram_alloc_internal() calls g_malloc0() only to
allocate the new RAMBlock object called "new_block". The actual
guest RAM allocation occurs inside ram_block_add(), which is also called
by qemu_ram_alloc_internal().

One frame outwards the stack, qemu_ram_alloc() passes NULL to
qemu_ram_alloc_internal(), for the 4th ("host") parameter. Therefore, in
qemu_ram_alloc_internal(), we set "new_block->host" to NULL as well.

Then in ram_block_add(), we take the (!new_block->host) branch, and call
phys_mem_alloc().

Unfortunately, "phys_mem_alloc" is a function pointer, set with
phys_mem_set_alloc(). The phys_mem_set_alloc() function is called from
"target/s390x/kvm.c" (setting the function pointer to
legacy_s390_alloc()), so it doesn't apply in this case. Therefore we end
up calling the default qemu_anon_ram_alloc() function, through the
funcptr. (I think anyway.)

And qemu_anon_ram_alloc() boils down to mmap() + MAP_ANONYMOUS, in
qemu_ram_mmap(). (Even on PPC64 hosts, because qemu_anon_ram_alloc()
passes (-1) for "fd".)

I may have missed something, of course -- I obviously didn't test it,
just speculated from the source.

Thanks
Laszlo

> 
> I'm afraid you neglected to test.
> 
> I still believe this approach can be made to work.  Need a replacement
> for memory_region_init_rom_device() that uses mmap() with MAP_ANONYMOUS.
>
Markus Armbruster May 8, 2019, 1:20 p.m. UTC | #5
Laszlo Ersek <lersek@redhat.com> writes:

> Hi Markus,
>
> On 05/07/19 20:01, Markus Armbruster wrote:
>> The subject is slightly misleading.  Holes read as zero.  So do
>> non-holes full of zeroes.  The patch avoids reading the former, but
>> still reads the latter.
>> 
>> Xiang Zheng <zhengxiang9@huawei.com> writes:
>> 
>>> Currently we fill the memory space with two 64MB NOR images when
>>> using persistent UEFI variables on virt board. Actually we only use
>>> a very small(non-zero) part of the memory while the rest significant
>>> large(zero) part of memory is wasted.
>> 
>> Neglects to mention that the "virt board" is ARM.
>> 
>>> So this patch checks the block status and only writes the non-zero part
>>> into memory. This requires pflash devices to use sparse files for
>>> backends.
>> 
>> I started to draft an improved commit message, but then I realized this
>> patch can't work.
>> 
>> The pflash_cfi01 device allocates its device memory like this:
>> 
>>     memory_region_init_rom_device(
>>         &pfl->mem, OBJECT(dev),
>>         &pflash_cfi01_ops,
>>         pfl,
>>         pfl->name, total_len, &local_err);
>> 
>> pflash_cfi02 is similar.
>> 
>> memory_region_init_rom_device() calls
>> memory_region_init_rom_device_nomigrate() calls qemu_ram_alloc() calls
>> qemu_ram_alloc_internal() calls g_malloc0().  Thus, all the device
>> memory gets written to even with this patch.
>
> As far as I can see, qemu_ram_alloc_internal() calls g_malloc0() only to
> allocate the the new RAMBlock object called "new_block". The actual
> guest RAM allocation occurs inside ram_block_add(), which is also called
> by qemu_ram_alloc_internal().

You're right.  I should've read more attentively.

> One frame outwards the stack, qemu_ram_alloc() passes NULL to
> qemu_ram_alloc_internal(), for the 4th ("host") parameter. Therefore, in
> qemu_ram_alloc_internal(), we set "new_block->host" to NULL as well.
>
> Then in ram_block_add(), we take the (!new_block->host) branch, and call
> phys_mem_alloc().
>
> Unfortunately, "phys_mem_alloc" is a function pointer, set with
> phys_mem_set_alloc(). The phys_mem_set_alloc() function is called from
> "target/s390x/kvm.c" (setting the function pointer to
> legacy_s390_alloc()), so it doesn't apply in this case. Therefore we end
> up calling the default qemu_anon_ram_alloc() function, through the
> funcptr. (I think anyway.)
>
> And qemu_anon_ram_alloc() boils down to mmap() + MAP_ANONYMOUS, in
> qemu_ram_mmap(). (Even on PPC64 hosts, because qemu_anon_ram_alloc()
> passes (-1) for "fd".)
>
> I may have missed something, of course -- I obviously didn't test it,
> just speculated from the source.

Thanks for your sleuthing!

>> I'm afraid you neglected to test.

Accusation actually unsupported.  I apologize, and replace it by a
question: have you observed the improvement you're trying to achieve,
and if yes, how?

[...]
Xiang Zheng May 9, 2019, 7:14 a.m. UTC | #6
On 2019/5/8 21:20, Markus Armbruster wrote:
> Laszlo Ersek <lersek@redhat.com> writes:
> 
>> Hi Markus,
>>
>> On 05/07/19 20:01, Markus Armbruster wrote:
>>> The subject is slightly misleading.  Holes read as zero.  So do
>>> non-holes full of zeroes.  The patch avoids reading the former, but
>>> still reads the latter.
>>>
>>> Xiang Zheng <zhengxiang9@huawei.com> writes:
>>>
>>>> Currently we fill the memory space with two 64MB NOR images when
>>>> using persistent UEFI variables on virt board. Actually we only use
>>>> a very small(non-zero) part of the memory while the rest significant
>>>> large(zero) part of memory is wasted.
>>>
>>> Neglects to mention that the "virt board" is ARM.
>>>
>>>> So this patch checks the block status and only writes the non-zero part
>>>> into memory. This requires pflash devices to use sparse files for
>>>> backends.
>>>
>>> I started to draft an improved commit message, but then I realized this
>>> patch can't work.
>>>
>>> The pflash_cfi01 device allocates its device memory like this:
>>>
>>>     memory_region_init_rom_device(
>>>         &pfl->mem, OBJECT(dev),
>>>         &pflash_cfi01_ops,
>>>         pfl,
>>>         pfl->name, total_len, &local_err);
>>>
>>> pflash_cfi02 is similar.
>>>
>>> memory_region_init_rom_device() calls
>>> memory_region_init_rom_device_nomigrate() calls qemu_ram_alloc() calls
>>> qemu_ram_alloc_internal() calls g_malloc0().  Thus, all the device
>>> memory gets written to even with this patch.
>>
>> As far as I can see, qemu_ram_alloc_internal() calls g_malloc0() only to
>> allocate the the new RAMBlock object called "new_block". The actual
>> guest RAM allocation occurs inside ram_block_add(), which is also called
>> by qemu_ram_alloc_internal().
> 
> You're right.  I should've read more attentively.
> 
>> One frame outwards the stack, qemu_ram_alloc() passes NULL to
>> qemu_ram_alloc_internal(), for the 4th ("host") parameter. Therefore, in
>> qemu_ram_alloc_internal(), we set "new_block->host" to NULL as well.
>>
>> Then in ram_block_add(), we take the (!new_block->host) branch, and call
>> phys_mem_alloc().
>>
>> Unfortunately, "phys_mem_alloc" is a function pointer, set with
>> phys_mem_set_alloc(). The phys_mem_set_alloc() function is called from
>> "target/s390x/kvm.c" (setting the function pointer to
>> legacy_s390_alloc()), so it doesn't apply in this case. Therefore we end
>> up calling the default qemu_anon_ram_alloc() function, through the
>> funcptr. (I think anyway.)
>>
>> And qemu_anon_ram_alloc() boils down to mmap() + MAP_ANONYMOUS, in
>> qemu_ram_mmap(). (Even on PPC64 hosts, because qemu_anon_ram_alloc()
>> passes (-1) for "fd".)
>>
>> I may have missed something, of course -- I obviously didn't test it,
>> just speculated from the source.
> 
> Thanks for your sleuthing!
> 
>>> I'm afraid you neglected to test.
> 
> Accusation actually unsupported.  I apologize, and replace it by a
> question: have you observed the improvement you're trying to achieve,
> and if yes, how?
> 

Yes, we need to create sparse files as the backing images for the pflash
devices. Create the sparse files like this:

   dd of="QEMU_EFI-pflash.raw" if="/dev/zero" bs=1M seek=64 count=0
   dd of="QEMU_EFI-pflash.raw" if="QEMU_EFI.fd" conv=notrunc

   dd of="empty_VARS.fd" if="/dev/zero" bs=1M seek=64 count=0

Start a VM with the command line below:

    -drive file=/usr/share/edk2/aarch64/QEMU_EFI-pflash.raw,if=pflash,format=raw,unit=0,readonly=on\
    -drive file=/usr/share/edk2/aarch64/empty_VARS.fd,if=pflash,format=raw,unit=1 \

Then observe the memory usage of the qemu process (THP is on).

1) Without this patch:
# cat /proc/`pidof qemu-system-aarch64`/smaps | grep AnonHugePages: | grep -v ' 0 kB'
AnonHugePages:    706560 kB
AnonHugePages:      2048 kB
AnonHugePages:     65536 kB    // pflash memory device
AnonHugePages:     65536 kB    // pflash memory device
AnonHugePages:      2048 kB

# ps aux | grep qemu-system-aarch64
RSS: 879684

2) After applying this patch:
# cat /proc/`pidof qemu-system-aarch64`/smaps | grep AnonHugePages: | grep -v ' 0 kB'
AnonHugePages:    700416 kB
AnonHugePages:      2048 kB
AnonHugePages:      2048 kB    // pflash memory device
AnonHugePages:      2048 kB    // pflash memory device
AnonHugePages:      2048 kB

# ps aux | grep qemu-system-aarch64
RSS: 744380

Obviously, at least 100MiB of memory is saved for each guest.
Markus Armbruster May 9, 2019, 11:59 a.m. UTC | #7
Xiang Zheng <zhengxiang9@huawei.com> writes:

> On 2019/5/8 21:20, Markus Armbruster wrote:
>> Laszlo Ersek <lersek@redhat.com> writes:
>> 
>>> Hi Markus,
>>>
>>> On 05/07/19 20:01, Markus Armbruster wrote:
>>>> The subject is slightly misleading.  Holes read as zero.  So do
>>>> non-holes full of zeroes.  The patch avoids reading the former, but
>>>> still reads the latter.
>>>>
>>>> Xiang Zheng <zhengxiang9@huawei.com> writes:
>>>>
>>>>> Currently we fill the memory space with two 64MB NOR images when
>>>>> using persistent UEFI variables on virt board. Actually we only use
>>>>> a very small(non-zero) part of the memory while the rest significant
>>>>> large(zero) part of memory is wasted.
>>>>
>>>> Neglects to mention that the "virt board" is ARM.
>>>>
>>>>> So this patch checks the block status and only writes the non-zero part
>>>>> into memory. This requires pflash devices to use sparse files for
>>>>> backends.
>>>>
>>>> I started to draft an improved commit message, but then I realized this
>>>> patch can't work.
>>>>
>>>> The pflash_cfi01 device allocates its device memory like this:
>>>>
>>>>     memory_region_init_rom_device(
>>>>         &pfl->mem, OBJECT(dev),
>>>>         &pflash_cfi01_ops,
>>>>         pfl,
>>>>         pfl->name, total_len, &local_err);
>>>>
>>>> pflash_cfi02 is similar.
>>>>
>>>> memory_region_init_rom_device() calls
>>>> memory_region_init_rom_device_nomigrate() calls qemu_ram_alloc() calls
>>>> qemu_ram_alloc_internal() calls g_malloc0().  Thus, all the device
>>>> memory gets written to even with this patch.
>>>
>>> As far as I can see, qemu_ram_alloc_internal() calls g_malloc0() only to
>>> allocate the the new RAMBlock object called "new_block". The actual
>>> guest RAM allocation occurs inside ram_block_add(), which is also called
>>> by qemu_ram_alloc_internal().
>> 
>> You're right.  I should've read more attentively.
>> 
>>> One frame outwards the stack, qemu_ram_alloc() passes NULL to
>>> qemu_ram_alloc_internal(), for the 4th ("host") parameter. Therefore, in
>>> qemu_ram_alloc_internal(), we set "new_block->host" to NULL as well.
>>>
>>> Then in ram_block_add(), we take the (!new_block->host) branch, and call
>>> phys_mem_alloc().
>>>
>>> Unfortunately, "phys_mem_alloc" is a function pointer, set with
>>> phys_mem_set_alloc(). The phys_mem_set_alloc() function is called from
>>> "target/s390x/kvm.c" (setting the function pointer to
>>> legacy_s390_alloc()), so it doesn't apply in this case. Therefore we end
>>> up calling the default qemu_anon_ram_alloc() function, through the
>>> funcptr. (I think anyway.)
>>>
>>> And qemu_anon_ram_alloc() boils down to mmap() + MAP_ANONYMOUS, in
>>> qemu_ram_mmap(). (Even on PPC64 hosts, because qemu_anon_ram_alloc()
>>> passes (-1) for "fd".)
>>>
>>> I may have missed something, of course -- I obviously didn't test it,
>>> just speculated from the source.
>> 
>> Thanks for your sleuthing!
>> 
>>>> I'm afraid you neglected to test.
>> 
>> Accusation actually unsupported.  I apologize, and replace it by a
>> question: have you observed the improvement you're trying to achieve,
>> and if yes, how?
>> 
>
> Yes, we need to create sparse files as the backing images for pflash device.
> To create sparse files like:
>
>    dd of="QEMU_EFI-pflash.raw" if="/dev/zero" bs=1M seek=64 count=0
>    dd of="QEMU_EFI-pflash.raw" if="QEMU_EFI.fd" conv=notrunc

This creates a copy of firmware binary QEMU_EFI.fd padded with a hole to
64MiB.

>    dd of="empty_VARS.fd" if="/dev/zero" bs=1M seek=64 count=0

This creates the varstore as a 64MiB hole.  As far as I know (very
little), you should use the varstore template that comes with the
firmware binary.

I use

    cp --sparse=always bld/pc-bios/edk2-arm-vars.fd .
    cp --sparse=always bld/pc-bios/edk2-aarch64-code.fd .

These guys are already zero-padded, and I use cp to sparsify.

> Start a VM with below commandline:
>
>     -drive file=/usr/share/edk2/aarch64/QEMU_EFI-pflash.raw,if=pflash,format=raw,unit=0,readonly=on\
>     -drive file=/usr/share/edk2/aarch64/empty_VARS.fd,if=pflash,format=raw,unit=1 \
>
> Then observe the memory usage of the qemu process (THP is on).
>
> 1) Without this patch:
> # cat /proc/`pidof qemu-system-aarch64`/smaps | grep AnonHugePages: | grep -v ' 0 kB'
> AnonHugePages:    706560 kB
> AnonHugePages:      2048 kB
> AnonHugePages:     65536 kB    // pflash memory device
> AnonHugePages:     65536 kB    // pflash memory device
> AnonHugePages:      2048 kB
>
> # ps aux | grep qemu-system-aarch64
> RSS: 879684
>
> 2) After applying this patch:
> # cat /proc/`pidof qemu-system-aarch64`/smaps | grep AnonHugePages: | grep -v ' 0 kB'
> AnonHugePages:    700416 kB
> AnonHugePages:      2048 kB
> AnonHugePages:      2048 kB    // pflash memory device
> AnonHugePages:      2048 kB    // pflash memory device
> AnonHugePages:      2048 kB
>
> # ps aux | grep qemu-system-aarch64
> RSS: 744380

Okay, this demonstrates the patch succeeds at mapping parts of the
pflash memory as holes.

Do the guests in these QEMU processes run?

> Obviously, there are at least 100MiB memory saved for each guest.

For a definition of "memory".

Next question: what impact on system performance do you observe?

Let me explain.

Virtual memory holes get filled in by demand paging on access.  In other
words, they remain holes only as long as nothing accesses the memory.

Without your patch, we allocate pages at image read time and fill them
with zeroes. If we don't access them again, the kernel will eventually
page them out (assuming you're running with swap).  So the steady state
is "we waste some swap space", not "we waste some physical RAM".

Your patch lets us map pflash memory pages containing only zeros as
holes.

For pages that never get accessed, your patch avoids page allocation,
filling with zeroes, writing to swap (all one-time costs), and saves
some swap space (not commonly an issue).

For pflash memory that gets accessed, your patch merely delays page
allocation from image read time to first access.

I wonder how these savings and delays affect actual system performance.
Without an observable change in system performance, all we'd accomplish
is changing a bunch of numbers in /proc/$pid/.

What improvement(s) can you observe?

I guess the best case for your patch is many guests with relatively
small RAM sizes.
Xiang Zheng May 10, 2019, 1:12 p.m. UTC | #8
On 2019/5/9 19:59, Markus Armbruster wrote:
> Xiang Zheng <zhengxiang9@huawei.com> writes:
> 
>> On 2019/5/8 21:20, Markus Armbruster wrote:
>>> Laszlo Ersek <lersek@redhat.com> writes:
>>>
>>>> Hi Markus,
>>>>
>>>> On 05/07/19 20:01, Markus Armbruster wrote:
>>>>> The subject is slightly misleading.  Holes read as zero.  So do
>>>>> non-holes full of zeroes.  The patch avoids reading the former, but
>>>>> still reads the latter.
>>>>>
>>>>> Xiang Zheng <zhengxiang9@huawei.com> writes:
>>>>>
>>>>>> Currently we fill the memory space with two 64MB NOR images when
>>>>>> using persistent UEFI variables on virt board. Actually we only use
>>>>>> a very small(non-zero) part of the memory while the rest significant
>>>>>> large(zero) part of memory is wasted.
>>>>>
>>>>> Neglects to mention that the "virt board" is ARM.
>>>>>
>>>>>> So this patch checks the block status and only writes the non-zero part
>>>>>> into memory. This requires pflash devices to use sparse files for
>>>>>> backends.
>>>>>
>>>>> I started to draft an improved commit message, but then I realized this
>>>>> patch can't work.
>>>>>
>>>>> The pflash_cfi01 device allocates its device memory like this:
>>>>>
>>>>>     memory_region_init_rom_device(
>>>>>         &pfl->mem, OBJECT(dev),
>>>>>         &pflash_cfi01_ops,
>>>>>         pfl,
>>>>>         pfl->name, total_len, &local_err);
>>>>>
>>>>> pflash_cfi02 is similar.
>>>>>
>>>>> memory_region_init_rom_device() calls
>>>>> memory_region_init_rom_device_nomigrate() calls qemu_ram_alloc() calls
>>>>> qemu_ram_alloc_internal() calls g_malloc0().  Thus, all the device
>>>>> memory gets written to even with this patch.
>>>>
>>>> As far as I can see, qemu_ram_alloc_internal() calls g_malloc0() only to
>>>> allocate the the new RAMBlock object called "new_block". The actual
>>>> guest RAM allocation occurs inside ram_block_add(), which is also called
>>>> by qemu_ram_alloc_internal().
>>>
>>> You're right.  I should've read more attentively.
>>>
>>>> One frame outwards the stack, qemu_ram_alloc() passes NULL to
>>>> qemu_ram_alloc_internal(), for the 4th ("host") parameter. Therefore, in
>>>> qemu_ram_alloc_internal(), we set "new_block->host" to NULL as well.
>>>>
>>>> Then in ram_block_add(), we take the (!new_block->host) branch, and call
>>>> phys_mem_alloc().
>>>>
>>>> Unfortunately, "phys_mem_alloc" is a function pointer, set with
>>>> phys_mem_set_alloc(). The phys_mem_set_alloc() function is called from
>>>> "target/s390x/kvm.c" (setting the function pointer to
>>>> legacy_s390_alloc()), so it doesn't apply in this case. Therefore we end
>>>> up calling the default qemu_anon_ram_alloc() function, through the
>>>> funcptr. (I think anyway.)
>>>>
>>>> And qemu_anon_ram_alloc() boils down to mmap() + MAP_ANONYMOUS, in
>>>> qemu_ram_mmap(). (Even on PPC64 hosts, because qemu_anon_ram_alloc()
>>>> passes (-1) for "fd".)
>>>>
>>>> I may have missed something, of course -- I obviously didn't test it,
>>>> just speculated from the source.
>>>
>>> Thanks for your sleuthing!
>>>
>>>>> I'm afraid you neglected to test.
>>>
>>> Accusation actually unsupported.  I apologize, and replace it by a
>>> question: have you observed the improvement you're trying to achieve,
>>> and if yes, how?
>>>
>>
>> Yes, we need to create sparse files as the backing images for pflash device.
>> To create sparse files like:
>>
>>    dd of="QEMU_EFI-pflash.raw" if="/dev/zero" bs=1M seek=64 count=0
>>    dd of="QEMU_EFI-pflash.raw" if="QEMU_EFI.fd" conv=notrunc
> 
> This creates a copy of firmware binary QEMU_EFI.fd padded with a hole to
> 64MiB.
> 
>>    dd of="empty_VARS.fd" if="/dev/zero" bs=1M seek=64 count=0
> 
> This creates the varstore as a 64MiB hole.  As far as I know (very
> little), you should use the varstore template that comes with the
> firmware binary.
> 
> I use
> 
>     cp --sparse=always bld/pc-bios/edk2-arm-vars.fd .
>     cp --sparse=always bld/pc-bios/edk2-aarch64-code.fd .
> 
> These guys are already zero-padded, and I use cp to sparsify.
> 
>> Start a VM with below commandline:
>>
>>     -drive file=/usr/share/edk2/aarch64/QEMU_EFI-pflash.raw,if=pflash,format=raw,unit=0,readonly=on\
>>     -drive file=/usr/share/edk2/aarch64/empty_VARS.fd,if=pflash,format=raw,unit=1 \
>>
>> Then observe the memory usage of the qemu process (THP is on).
>>
>> 1) Without this patch:
>> # cat /proc/`pidof qemu-system-aarch64`/smaps | grep AnonHugePages: | grep -v ' 0 kB'
>> AnonHugePages:    706560 kB
>> AnonHugePages:      2048 kB
>> AnonHugePages:     65536 kB    // pflash memory device
>> AnonHugePages:     65536 kB    // pflash memory device
>> AnonHugePages:      2048 kB
>>
>> # ps aux | grep qemu-system-aarch64
>> RSS: 879684
>>
>> 2) After applying this patch:
>> # cat /proc/`pidof qemu-system-aarch64`/smaps | grep AnonHugePages: | grep -v ' 0 kB'
>> AnonHugePages:    700416 kB
>> AnonHugePages:      2048 kB
>> AnonHugePages:      2048 kB    // pflash memory device
>> AnonHugePages:      2048 kB    // pflash memory device
>> AnonHugePages:      2048 kB
>>
>> # ps aux | grep qemu-system-aarch64
>> RSS: 744380
> 
> Okay, this demonstrates the patch succeeds at mapping parts of the
> pflash memory as holes.
> 
> Do the guests in these QEMU processes run?

Yes.

> 
>> Obviously, there are at least 100MiB memory saved for each guest.
> 
> For a definition of "memory".
> 
> Next question: what impact on system performance do you observe?
> 
> Let me explain.
> 
> Virtual memory holes get filled in by demand paging on access.  In other
> words, they remain holes only as long as nothing accesses the memory.
> 
> Without your patch, we allocate pages at image read time and fill them
> with zeroes. If we don't access them again, the kernel will eventually
> page them out (assuming you're running with swap).  So the steady state
> is "we waste some swap space", not "we waste some physical RAM".
> 

Not everybody wants to run with swap because it may cause low performance.

> Your patch lets us map pflash memory pages containing only zeros as
> holes.
> 
> For pages that never get accessed, your patch avoids page allocation,
> filling with zeroes, writing to swap (all one-time costs), and saves
> some swap space (not commonly an issue).
> 
> For pflash memory that gets accessed, your patch merely delays page
> allocation from image read time to first access.
> 
> I wonder how these savings and delays affect actual system performance.
> Without an observable change in system performance, all we'd accomplish
> is changing a bunch of numers in /proc/$pid/.
> 
> What improvement(s) can you observe?

We only use the pflash device for UEFI, and we hardly care about its
performance. I think the performance bottleneck is the MMIO emulation,
even though this patch delays page allocation until the first access.

> 
> I guess the best case for your patch is many guests with relatively
> small RAM sizes.
> 
> .
>
Markus Armbruster May 10, 2019, 3:16 p.m. UTC | #9
Xiang Zheng <zhengxiang9@huawei.com> writes:

> On 2019/5/9 19:59, Markus Armbruster wrote:
>> Xiang Zheng <zhengxiang9@huawei.com> writes:
>> 
>>> On 2019/5/8 21:20, Markus Armbruster wrote:
>>>> Laszlo Ersek <lersek@redhat.com> writes:
>>>>
>>>>> Hi Markus,
>>>>>
>>>>> On 05/07/19 20:01, Markus Armbruster wrote:
>>>>>> The subject is slightly misleading.  Holes read as zero.  So do
>>>>>> non-holes full of zeroes.  The patch avoids reading the former, but
>>>>>> still reads the latter.
>>>>>>
>>>>>> Xiang Zheng <zhengxiang9@huawei.com> writes:
>>>>>>
>>>>>>> Currently we fill the memory space with two 64MB NOR images when
>>>>>>> using persistent UEFI variables on virt board. Actually we only use
>>>>>>> a very small(non-zero) part of the memory while the rest significant
>>>>>>> large(zero) part of memory is wasted.
>>>>>>
>>>>>> Neglects to mention that the "virt board" is ARM.
>>>>>>
>>>>>>> So this patch checks the block status and only writes the non-zero part
>>>>>>> into memory. This requires pflash devices to use sparse files for
>>>>>>> backends.
>>>>>>
>>>>>> I started to draft an improved commit message, but then I realized this
>>>>>> patch can't work.
>>>>>>
>>>>>> The pflash_cfi01 device allocates its device memory like this:
>>>>>>
>>>>>>     memory_region_init_rom_device(
>>>>>>         &pfl->mem, OBJECT(dev),
>>>>>>         &pflash_cfi01_ops,
>>>>>>         pfl,
>>>>>>         pfl->name, total_len, &local_err);
>>>>>>
>>>>>> pflash_cfi02 is similar.
>>>>>>
>>>>>> memory_region_init_rom_device() calls
>>>>>> memory_region_init_rom_device_nomigrate() calls qemu_ram_alloc() calls
>>>>>> qemu_ram_alloc_internal() calls g_malloc0().  Thus, all the device
>>>>>> memory gets written to even with this patch.
>>>>>
>>>>> As far as I can see, qemu_ram_alloc_internal() calls g_malloc0() only to
>>>>> allocate the the new RAMBlock object called "new_block". The actual
>>>>> guest RAM allocation occurs inside ram_block_add(), which is also called
>>>>> by qemu_ram_alloc_internal().
>>>>
>>>> You're right.  I should've read more attentively.
>>>>
>>>>> One frame outwards the stack, qemu_ram_alloc() passes NULL to
>>>>> qemu_ram_alloc_internal(), for the 4th ("host") parameter. Therefore, in
>>>>> qemu_ram_alloc_internal(), we set "new_block->host" to NULL as well.
>>>>>
>>>>> Then in ram_block_add(), we take the (!new_block->host) branch, and call
>>>>> phys_mem_alloc().
>>>>>
>>>>> Unfortunately, "phys_mem_alloc" is a function pointer, set with
>>>>> phys_mem_set_alloc(). The phys_mem_set_alloc() function is called from
>>>>> "target/s390x/kvm.c" (setting the function pointer to
>>>>> legacy_s390_alloc()), so it doesn't apply in this case. Therefore we end
>>>>> up calling the default qemu_anon_ram_alloc() function, through the
>>>>> funcptr. (I think anyway.)
>>>>>
>>>>> And qemu_anon_ram_alloc() boils down to mmap() + MAP_ANONYMOUS, in
>>>>> qemu_ram_mmap(). (Even on PPC64 hosts, because qemu_anon_ram_alloc()
>>>>> passes (-1) for "fd".)
>>>>>
>>>>> I may have missed something, of course -- I obviously didn't test it,
>>>>> just speculated from the source.
>>>>
>>>> Thanks for your sleuthing!
>>>>
>>>>>> I'm afraid you neglected to test.
>>>>
>>>> Accusation actually unsupported.  I apologize, and replace it by a
>>>> question: have you observed the improvement you're trying to achieve,
>>>> and if yes, how?
>>>>
>>>
>>> Yes, we need to create sparse files as the backing images for pflash device.
>>> To create sparse files like:
>>>
>>>    dd of="QEMU_EFI-pflash.raw" if="/dev/zero" bs=1M seek=64 count=0
>>>    dd of="QEMU_EFI-pflash.raw" if="QEMU_EFI.fd" conv=notrunc
>> 
>> This creates a copy of firmware binary QEMU_EFI.fd padded with a hole to
>> 64MiB.
>> 
>>>    dd of="empty_VARS.fd" if="/dev/zero" bs=1M seek=64 count=0
>> 
>> This creates the varstore as a 64MiB hole.  As far as I know (very
>> little), you should use the varstore template that comes with the
>> firmware binary.
>> 
>> I use
>> 
>>     cp --sparse=always bld/pc-bios/edk2-arm-vars.fd .
>>     cp --sparse=always bld/pc-bios/edk2-aarch64-code.fd .
>> 
>> These guys are already zero-padded, and I use cp to sparsify.
>> 
>>> Start a VM with below commandline:
>>>
>>>     -drive file=/usr/share/edk2/aarch64/QEMU_EFI-pflash.raw,if=pflash,format=raw,unit=0,readonly=on\
>>>     -drive file=/usr/share/edk2/aarch64/empty_VARS.fd,if=pflash,format=raw,unit=1 \
>>>
>>> Then observe the memory usage of the qemu process (THP is on).
>>>
>>> 1) Without this patch:
>>> # cat /proc/`pidof qemu-system-aarch64`/smaps | grep AnonHugePages: | grep -v ' 0 kB'
>>> AnonHugePages:    706560 kB
>>> AnonHugePages:      2048 kB
>>> AnonHugePages:     65536 kB    // pflash memory device
>>> AnonHugePages:     65536 kB    // pflash memory device
>>> AnonHugePages:      2048 kB
>>>
>>> # ps aux | grep qemu-system-aarch64
>>> RSS: 879684
>>>
>>> 2) After applying this patch:
>>> # cat /proc/`pidof qemu-system-aarch64`/smaps | grep AnonHugePages: | grep -v ' 0 kB'
>>> AnonHugePages:    700416 kB
>>> AnonHugePages:      2048 kB
>>> AnonHugePages:      2048 kB    // pflash memory device
>>> AnonHugePages:      2048 kB    // pflash memory device
>>> AnonHugePages:      2048 kB
>>>
>>> # ps aux | grep qemu-system-aarch64
>>> RSS: 744380
>> 
>> Okay, this demonstrates the patch succeeds at mapping parts of the
>> pflash memory as holes.
>> 
>> Do the guests in these QEMU processes run?
>
> Yes.

Good to know, thanks.

>>> Obviously, there are at least 100MiB memory saved for each guest.
>> 
>> For a definition of "memory".
>> 
>> Next question: what impact on system performance do you observe?
>> 
>> Let me explain.
>> 
>> Virtual memory holes get filled in by demand paging on access.  In other
>> words, they remain holes only as long as nothing accesses the memory.
>> 
>> Without your patch, we allocate pages at image read time and fill them
>> with zeroes. If we don't access them again, the kernel will eventually
>> page them out (assuming you're running with swap).  So the steady state
>> is "we waste some swap space", not "we waste some physical RAM".
>> 
>
> Not everybody wants to run with swap because it may cause low performance.

Someone running without swap because he heard someone say someone said
swap may be slow is probably throwing away performance.

But I assume you mean people running without swap because they measured
their workload and found it more performant without swap.  Legitimate.

>> Your patch lets us map pflash memory pages containing only zeros as
>> holes.
>> 
>> For pages that never get accessed, your patch avoids page allocation,
>> filling with zeroes, writing to swap (all one-time costs), and saves
>> some swap space (not commonly an issue).
>> 
>> For pflash memory that gets accessed, your patch merely delays page
>> allocation from image read time to first access.
>> 
>> I wonder how these savings and delays affect actual system performance.
>> Without an observable change in system performance, all we'd accomplish
>> is changing a bunch of numbers in /proc/$pid/.
>> 
>> What improvement(s) can you observe?
>
> We only use pflash device for UEFI, and we hardly care about the performance.
> I think the bottleneck of the performance is the MMIO emulation, even this
> patch would delay page allocation at the first access.

I wasn't inquiring about the performance of the pflash device.  I was
inquiring about *system* performance.  But let me rephrase my question.

Doing work to save resources is only worthwhile if something valuable
gets better in a measurable way.  I'm asking you

(1) to explain what exactly you value, and 

(2) to provide measurements that show improvement.

>> I guess the best case for your patch is many guests with relatively
>> small RAM sizes.
Xiang Zheng May 11, 2019, 8:36 a.m. UTC | #10
On 2019/5/10 23:16, Markus Armbruster wrote:
> Xiang Zheng <zhengxiang9@huawei.com> writes:
> 
>> On 2019/5/9 19:59, Markus Armbruster wrote:
>>> Xiang Zheng <zhengxiang9@huawei.com> writes:
>>>
>>>> On 2019/5/8 21:20, Markus Armbruster wrote:
>>>>> Laszlo Ersek <lersek@redhat.com> writes:
>>>>>
>>>>>> Hi Markus,
>>>>>>
>>>>>> On 05/07/19 20:01, Markus Armbruster wrote:
>>>>>>> The subject is slightly misleading.  Holes read as zero.  So do
>>>>>>> non-holes full of zeroes.  The patch avoids reading the former, but
>>>>>>> still reads the latter.
>>>>>>>
>>>>>>> Xiang Zheng <zhengxiang9@huawei.com> writes:
>>>>>>>
>>>>>>>> Currently we fill the memory space with two 64MB NOR images when
>>>>>>>> using persistent UEFI variables on virt board. Actually we only use
>>>>>>>> a very small(non-zero) part of the memory while the rest significant
>>>>>>>> large(zero) part of memory is wasted.
>>>>>>>
>>>>>>> Neglects to mention that the "virt board" is ARM.
>>>>>>>
>>>>>>>> So this patch checks the block status and only writes the non-zero part
>>>>>>>> into memory. This requires pflash devices to use sparse files for
>>>>>>>> backends.
>>>>>>>
>>>>>>> I started to draft an improved commit message, but then I realized this
>>>>>>> patch can't work.
>>>>>>>
>>>>>>> The pflash_cfi01 device allocates its device memory like this:
>>>>>>>
>>>>>>>     memory_region_init_rom_device(
>>>>>>>         &pfl->mem, OBJECT(dev),
>>>>>>>         &pflash_cfi01_ops,
>>>>>>>         pfl,
>>>>>>>         pfl->name, total_len, &local_err);
>>>>>>>
>>>>>>> pflash_cfi02 is similar.
>>>>>>>
>>>>>>> memory_region_init_rom_device() calls
>>>>>>> memory_region_init_rom_device_nomigrate() calls qemu_ram_alloc() calls
>>>>>>> qemu_ram_alloc_internal() calls g_malloc0().  Thus, all the device
>>>>>>> memory gets written to even with this patch.
>>>>>>
>>>>>> As far as I can see, qemu_ram_alloc_internal() calls g_malloc0() only to
>>>>>> allocate the new RAMBlock object called "new_block". The actual
>>>>>> guest RAM allocation occurs inside ram_block_add(), which is also called
>>>>>> by qemu_ram_alloc_internal().
>>>>>
>>>>> You're right.  I should've read more attentively.
>>>>>
>>>>>> One frame outwards the stack, qemu_ram_alloc() passes NULL to
>>>>>> qemu_ram_alloc_internal(), for the 4th ("host") parameter. Therefore, in
>>>>>> qemu_ram_alloc_internal(), we set "new_block->host" to NULL as well.
>>>>>>
>>>>>> Then in ram_block_add(), we take the (!new_block->host) branch, and call
>>>>>> phys_mem_alloc().
>>>>>>
>>>>>> Unfortunately, "phys_mem_alloc" is a function pointer, set with
>>>>>> phys_mem_set_alloc(). The phys_mem_set_alloc() function is called from
>>>>>> "target/s390x/kvm.c" (setting the function pointer to
>>>>>> legacy_s390_alloc()), so it doesn't apply in this case. Therefore we end
>>>>>> up calling the default qemu_anon_ram_alloc() function, through the
>>>>>> funcptr. (I think anyway.)
>>>>>>
>>>>>> And qemu_anon_ram_alloc() boils down to mmap() + MAP_ANONYMOUS, in
>>>>>> qemu_ram_mmap(). (Even on PPC64 hosts, because qemu_anon_ram_alloc()
>>>>>> passes (-1) for "fd".)
>>>>>>
>>>>>> I may have missed something, of course -- I obviously didn't test it,
>>>>>> just speculated from the source.
>>>>>
>>>>> Thanks for your sleuthing!
>>>>>
>>>>>>> I'm afraid you neglected to test.
>>>>>
>>>>> Accusation actually unsupported.  I apologize, and replace it by a
>>>>> question: have you observed the improvement you're trying to achieve,
>>>>> and if yes, how?
>>>>>
>>>>
>>>> Yes, we need to create sparse files as the backing images for pflash device.
>>>> To create sparse files like:
>>>>
>>>>    dd of="QEMU_EFI-pflash.raw" if="/dev/zero" bs=1M seek=64 count=0
>>>>    dd of="QEMU_EFI-pflash.raw" if="QEMU_EFI.fd" conv=notrunc
>>>
>>> This creates a copy of firmware binary QEMU_EFI.fd padded with a hole to
>>> 64MiB.
>>>
>>>>    dd of="empty_VARS.fd" if="/dev/zero" bs=1M seek=64 count=0
>>>
>>> This creates the varstore as a 64MiB hole.  As far as I know (very
>>> little), you should use the varstore template that comes with the
>>> firmware binary.
>>>
>>> I use
>>>
>>>     cp --sparse=always bld/pc-bios/edk2-arm-vars.fd .
>>>     cp --sparse=always bld/pc-bios/edk2-aarch64-code.fd .
>>>
>>> These guys are already zero-padded, and I use cp to sparsify.
>>>
>>>> Start a VM with below commandline:
>>>>
>>>>     -drive file=/usr/share/edk2/aarch64/QEMU_EFI-pflash.raw,if=pflash,format=raw,unit=0,readonly=on\
>>>>     -drive file=/usr/share/edk2/aarch64/empty_VARS.fd,if=pflash,format=raw,unit=1 \
>>>>
>>>> Then observe the memory usage of the qemu process (THP is on).
>>>>
>>>> 1) Without this patch:
>>>> # cat /proc/`pidof qemu-system-aarch64`/smaps | grep AnonHugePages: | grep -v ' 0 kB'
>>>> AnonHugePages:    706560 kB
>>>> AnonHugePages:      2048 kB
>>>> AnonHugePages:     65536 kB    // pflash memory device
>>>> AnonHugePages:     65536 kB    // pflash memory device
>>>> AnonHugePages:      2048 kB
>>>>
>>>> # ps aux | grep qemu-system-aarch64
>>>> RSS: 879684
>>>>
>>>> 2) After applying this patch:
>>>> # cat /proc/`pidof qemu-system-aarch64`/smaps | grep AnonHugePages: | grep -v ' 0 kB'
>>>> AnonHugePages:    700416 kB
>>>> AnonHugePages:      2048 kB
>>>> AnonHugePages:      2048 kB    // pflash memory device
>>>> AnonHugePages:      2048 kB    // pflash memory device
>>>> AnonHugePages:      2048 kB
>>>>
>>>> # ps aux | grep qemu-system-aarch64
>>>> RSS: 744380
>>>
>>> Okay, this demonstrates the patch succeeds at mapping parts of the
>>> pflash memory as holes.
>>>
>>> Do the guests in these QEMU processes run?
>>
>> Yes.
> 
> Good to know, thanks.
> 
>>>> Obviously, there are at least 100MiB memory saved for each guest.
>>>
>>> For a definition of "memory".
>>>
>>> Next question: what impact on system performance do you observe?
>>>
>>> Let me explain.
>>>
>>> Virtual memory holes get filled in by demand paging on access.  In other
>>> words, they remain holes only as long as nothing accesses the memory.
>>>
>>> Without your patch, we allocate pages at image read time and fill them
>>> with zeroes. If we don't access them again, the kernel will eventually
>>> page them out (assuming you're running with swap).  So the steady state
>>> is "we waste some swap space", not "we waste some physical RAM".
>>>
>>
>> Not everybody wants to run with swap because it may cause low performance.
> 
> Someone running without swap because he heard someone say someone said
> swap may be slow is probably throwing away performance.
> 
> But I assume you mean people running without swap because they measured
> their workload and found it more performant without swap.  Legitimate.

Yes, and I have suffered from high I/O waits with swap before. :)

> 
>>> Your patch lets us map pflash memory pages containing only zeros as
>>> holes.
>>>
>>> For pages that never get accessed, your patch avoids page allocation,
>>> filling with zeroes, writing to swap (all one-time costs), and saves
>>> some swap space (not commonly an issue).
>>>
>>> For pflash memory that gets accessed, your patch merely delays page
>>> allocation from image read time to first access.
>>>
>>> I wonder how these savings and delays affect actual system performance.
>>> Without an observable change in system performance, all we'd accomplish
>>> is changing a bunch of numbers in /proc/$pid/.
>>>
>>> What improvement(s) can you observe?
>>
>> We only use pflash device for UEFI, and we hardly care about the performance.
>> I think the bottleneck of the performance is the MMIO emulation, even this
>> patch would delay page allocation at the first access.
> 
> I wasn't inquiring about the performance of the pflash device.  I was
> inquiring about *system* performance.  But let me rephrase my question.
> 
> Doing work to save resources is only worthwhile if something valuable
> gets better in a measurable way.  I'm asking you
> 
> (1) to explain what exactly you value, and 
> 
> (2) to provide measurements that show improvement.
> 

What we value is exactly the cost of memory resources, and that is the
only thing this patch aims to reduce.

I am confused about why you think it will impact system performance.
Did I neglect something?
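To illustrate what that cost looks like: on Linux, an anonymous mapping consumes no resident memory until a page is first written, which is what the patch exploits for the all-zero pflash regions. A rough, Linux-only sketch (plain Python, not QEMU code, purely illustrative):

```python
import mmap

def rss_pages():
    # Resident-set size of this process in pages (Linux /proc, field 2)
    with open("/proc/self/statm") as f:
        return int(f.read().split()[1])

SIZE = 64 * 1024 * 1024            # same size as one pflash image
buf = mmap.mmap(-1, SIZE)          # anonymous mapping: all holes at first

before = rss_pages()
for off in range(0, SIZE, mmap.PAGESIZE):
    buf[off] = 0xFF                # first write demand-faults each page in
after = rss_pages()

print(before, after)               # 'after' grows by roughly SIZE/PAGESIZE
buf.close()
```

As long as nothing writes the pages, the 64 MiB region costs essentially no RSS; only the writes make it resident.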

>>> I guess the best case for your patch is many guests with relatively
>>> small RAM sizes.
> 
> .
>
Markus Armbruster May 13, 2019, 11:59 a.m. UTC | #11
Xiang Zheng <zhengxiang9@huawei.com> writes:

> On 2019/5/10 23:16, Markus Armbruster wrote:
>> Xiang Zheng <zhengxiang9@huawei.com> writes:
>> 
>>> On 2019/5/9 19:59, Markus Armbruster wrote:
>>>> Xiang Zheng <zhengxiang9@huawei.com> writes:
>>>>
>>>>> On 2019/5/8 21:20, Markus Armbruster wrote:
>>>>>> Laszlo Ersek <lersek@redhat.com> writes:
>>>>>>
>>>>>>> Hi Markus,
>>>>>>>
>>>>>>> On 05/07/19 20:01, Markus Armbruster wrote:
>>>>>>>> The subject is slightly misleading.  Holes read as zero.  So do
>>>>>>>> non-holes full of zeroes.  The patch avoids reading the former, but
>>>>>>>> still reads the latter.
>>>>>>>>
>>>>>>>> Xiang Zheng <zhengxiang9@huawei.com> writes:
>>>>>>>>
>>>>>>>>> Currently we fill the memory space with two 64MB NOR images when
>>>>>>>>> using persistent UEFI variables on virt board. Actually we only use
>>>>>>>>> a very small(non-zero) part of the memory while the rest significant
>>>>>>>>> large(zero) part of memory is wasted.
>>>>>>>>
>>>>>>>> Neglects to mention that the "virt board" is ARM.
>>>>>>>>
>>>>>>>>> So this patch checks the block status and only writes the non-zero part
>>>>>>>>> into memory. This requires pflash devices to use sparse files for
>>>>>>>>> backends.
>>>>>>>>
>>>>>>>> I started to draft an improved commit message, but then I realized this
>>>>>>>> patch can't work.
>>>>>>>>
>>>>>>>> The pflash_cfi01 device allocates its device memory like this:
>>>>>>>>
>>>>>>>>     memory_region_init_rom_device(
>>>>>>>>         &pfl->mem, OBJECT(dev),
>>>>>>>>         &pflash_cfi01_ops,
>>>>>>>>         pfl,
>>>>>>>>         pfl->name, total_len, &local_err);
>>>>>>>>
>>>>>>>> pflash_cfi02 is similar.
>>>>>>>>
>>>>>>>> memory_region_init_rom_device() calls
>>>>>>>> memory_region_init_rom_device_nomigrate() calls qemu_ram_alloc() calls
>>>>>>>> qemu_ram_alloc_internal() calls g_malloc0().  Thus, all the device
>>>>>>>> memory gets written to even with this patch.
>>>>>>>
>>>>>>> As far as I can see, qemu_ram_alloc_internal() calls g_malloc0() only to
>>>>>>> allocate the new RAMBlock object called "new_block". The actual
>>>>>>> guest RAM allocation occurs inside ram_block_add(), which is also called
>>>>>>> by qemu_ram_alloc_internal().
>>>>>>
>>>>>> You're right.  I should've read more attentively.
>>>>>>
>>>>>>> One frame outwards the stack, qemu_ram_alloc() passes NULL to
>>>>>>> qemu_ram_alloc_internal(), for the 4th ("host") parameter. Therefore, in
>>>>>>> qemu_ram_alloc_internal(), we set "new_block->host" to NULL as well.
>>>>>>>
>>>>>>> Then in ram_block_add(), we take the (!new_block->host) branch, and call
>>>>>>> phys_mem_alloc().
>>>>>>>
>>>>>>> Unfortunately, "phys_mem_alloc" is a function pointer, set with
>>>>>>> phys_mem_set_alloc(). The phys_mem_set_alloc() function is called from
>>>>>>> "target/s390x/kvm.c" (setting the function pointer to
>>>>>>> legacy_s390_alloc()), so it doesn't apply in this case. Therefore we end
>>>>>>> up calling the default qemu_anon_ram_alloc() function, through the
>>>>>>> funcptr. (I think anyway.)
>>>>>>>
>>>>>>> And qemu_anon_ram_alloc() boils down to mmap() + MAP_ANONYMOUS, in
>>>>>>> qemu_ram_mmap(). (Even on PPC64 hosts, because qemu_anon_ram_alloc()
>>>>>>> passes (-1) for "fd".)
>>>>>>>
>>>>>>> I may have missed something, of course -- I obviously didn't test it,
>>>>>>> just speculated from the source.
>>>>>>
>>>>>> Thanks for your sleuthing!
>>>>>>
>>>>>>>> I'm afraid you neglected to test.
>>>>>>
>>>>>> Accusation actually unsupported.  I apologize, and replace it by a
>>>>>> question: have you observed the improvement you're trying to achieve,
>>>>>> and if yes, how?
>>>>>>
>>>>>
>>>>> Yes, we need to create sparse files as the backing images for pflash device.
>>>>> To create sparse files like:
>>>>>
>>>>>    dd of="QEMU_EFI-pflash.raw" if="/dev/zero" bs=1M seek=64 count=0
>>>>>    dd of="QEMU_EFI-pflash.raw" if="QEMU_EFI.fd" conv=notrunc
>>>>
>>>> This creates a copy of firmware binary QEMU_EFI.fd padded with a hole to
>>>> 64MiB.
>>>>
>>>>>    dd of="empty_VARS.fd" if="/dev/zero" bs=1M seek=64 count=0
>>>>
>>>> This creates the varstore as a 64MiB hole.  As far as I know (very
>>>> little), you should use the varstore template that comes with the
>>>> firmware binary.
>>>>
>>>> I use
>>>>
>>>>     cp --sparse=always bld/pc-bios/edk2-arm-vars.fd .
>>>>     cp --sparse=always bld/pc-bios/edk2-aarch64-code.fd .
>>>>
>>>> These guys are already zero-padded, and I use cp to sparsify.
>>>>
>>>>> Start a VM with below commandline:
>>>>>
>>>>>     -drive file=/usr/share/edk2/aarch64/QEMU_EFI-pflash.raw,if=pflash,format=raw,unit=0,readonly=on\
>>>>>     -drive file=/usr/share/edk2/aarch64/empty_VARS.fd,if=pflash,format=raw,unit=1 \
>>>>>
>>>>> Then observe the memory usage of the qemu process (THP is on).
>>>>>
>>>>> 1) Without this patch:
>>>>> # cat /proc/`pidof qemu-system-aarch64`/smaps | grep AnonHugePages: | grep -v ' 0 kB'
>>>>> AnonHugePages:    706560 kB
>>>>> AnonHugePages:      2048 kB
>>>>> AnonHugePages:     65536 kB    // pflash memory device
>>>>> AnonHugePages:     65536 kB    // pflash memory device
>>>>> AnonHugePages:      2048 kB
>>>>>
>>>>> # ps aux | grep qemu-system-aarch64
>>>>> RSS: 879684
>>>>>
>>>>> 2) After applying this patch:
>>>>> # cat /proc/`pidof qemu-system-aarch64`/smaps | grep AnonHugePages: | grep -v ' 0 kB'
>>>>> AnonHugePages:    700416 kB
>>>>> AnonHugePages:      2048 kB
>>>>> AnonHugePages:      2048 kB    // pflash memory device
>>>>> AnonHugePages:      2048 kB    // pflash memory device
>>>>> AnonHugePages:      2048 kB
>>>>>
>>>>> # ps aux | grep qemu-system-aarch64
>>>>> RSS: 744380
>>>>
>>>> Okay, this demonstrates the patch succeeds at mapping parts of the
>>>> pflash memory as holes.
>>>>
>>>> Do the guests in these QEMU processes run?
>>>
>>> Yes.
>> 
>> Good to know, thanks.
>> 
>>>>> Obviously, there are at least 100MiB memory saved for each guest.
>>>>
>>>> For a definition of "memory".
>>>>
>>>> Next question: what impact on system performance do you observe?
>>>>
>>>> Let me explain.
>>>>
>>>> Virtual memory holes get filled in by demand paging on access.  In other
>>>> words, they remain holes only as long as nothing accesses the memory.
>>>>
>>>> Without your patch, we allocate pages at image read time and fill them
>>>> with zeroes. If we don't access them again, the kernel will eventually
>>>> page them out (assuming you're running with swap).  So the steady state
>>>> is "we waste some swap space", not "we waste some physical RAM".
>>>>
>>>
>>> Not everybody wants to run with swap because it may cause low performance.
>> 
>> Someone running without swap because he heard someone say someone said
>> swap may be slow is probably throwing away performance.
>> 
>> But I assume you mean people running without swap because they measured
>> their workload and found it more performant without swap.  Legitimate.
>
> Yes, and I have suffered from high I/O waits with swap before. :)
>
>> 
>>>> Your patch lets us map pflash memory pages containing only zeros as
>>>> holes.
>>>>
>>>> For pages that never get accessed, your patch avoids page allocation,
>>>> filling with zeroes, writing to swap (all one-time costs), and saves
>>>> some swap space (not commonly an issue).
>>>>
>>>> For pflash memory that gets accessed, your patch merely delays page
>>>> allocation from image read time to first access.
>>>>
>>>> I wonder how these savings and delays affect actual system performance.
>>>> Without an observable change in system performance, all we'd accomplish
>>>> is changing a bunch of numbers in /proc/$pid/.
>>>>
>>>> What improvement(s) can you observe?
>>>
>>> We only use pflash device for UEFI, and we hardly care about the performance.
>>> I think the bottleneck of the performance is the MMIO emulation, even this
>>> patch would delay page allocation at the first access.
>> 
>> I wasn't inquiring about the performance of the pflash device.  I was
>> inquiring about *system* performance.  But let me rephrase my question.
>> 
>> Doing work to save resources is only worthwhile if something valuable
>> gets better in a measurable way.  I'm asking you
>> 
>> (1) to explain what exactly you value, and 
>> 
>> (2) to provide measurements that show improvement.
>> 
>
> What we value is exactly the cost of memory resources, and that is the
> only thing this patch aims to reduce.

Then measure this cost!

> I am confused about why you think it will impact system performance.
> Did I neglect something?

If the patch does not impact how the system as a whole performs, then
it's useless.

Since you find it useful, it must have some valuable[*] observable
effect for you.  Tell us about it!

I keep asking not to torment you, but to guide you towards building a
compelling justification for your patch.  However, I can only show you
the path; the walking you'll have to do yourself.

>>>> I guess the best case for your patch is many guests with relatively
>>>> small RAM sizes.
>> 
>> .
>> 


[*] Changing a bunch of numbers in /proc is not valuable.
Kevin Wolf May 13, 2019, 1:15 p.m. UTC | #12
Am 13.05.2019 um 13:59 hat Markus Armbruster geschrieben:
> Xiang Zheng <zhengxiang9@huawei.com> writes:
> 
> > On 2019/5/10 23:16, Markus Armbruster wrote:
> >> Xiang Zheng <zhengxiang9@huawei.com> writes:
> >> 
> >>> On 2019/5/9 19:59, Markus Armbruster wrote:
> >>>> Xiang Zheng <zhengxiang9@huawei.com> writes:
> >>>>
> >>>>> On 2019/5/8 21:20, Markus Armbruster wrote:
> >>>>>> Laszlo Ersek <lersek@redhat.com> writes:
> >>>>>>
> >>>>>>> Hi Markus,
> >>>>>>>
> >>>>>>> On 05/07/19 20:01, Markus Armbruster wrote:
> >>>>>>>> The subject is slightly misleading.  Holes read as zero.  So do
> >>>>>>>> non-holes full of zeroes.  The patch avoids reading the former, but
> >>>>>>>> still reads the latter.
> >>>>>>>>
> >>>>>>>> Xiang Zheng <zhengxiang9@huawei.com> writes:
> >>>>>>>>
> >>>>>>>>> Currently we fill the memory space with two 64MB NOR images when
> >>>>>>>>> using persistent UEFI variables on virt board. Actually we only use
> >>>>>>>>> a very small(non-zero) part of the memory while the rest significant
> >>>>>>>>> large(zero) part of memory is wasted.
> >>>>>>>>
> >>>>>>>> Neglects to mention that the "virt board" is ARM.
> >>>>>>>>
> >>>>>>>>> So this patch checks the block status and only writes the non-zero part
> >>>>>>>>> into memory. This requires pflash devices to use sparse files for
> >>>>>>>>> backends.
> >>>>>>>>
> >>>>>>>> I started to draft an improved commit message, but then I realized this
> >>>>>>>> patch can't work.
> >>>>>>>>
> >>>>>>>> The pflash_cfi01 device allocates its device memory like this:
> >>>>>>>>
> >>>>>>>>     memory_region_init_rom_device(
> >>>>>>>>         &pfl->mem, OBJECT(dev),
> >>>>>>>>         &pflash_cfi01_ops,
> >>>>>>>>         pfl,
> >>>>>>>>         pfl->name, total_len, &local_err);
> >>>>>>>>
> >>>>>>>> pflash_cfi02 is similar.
> >>>>>>>>
> >>>>>>>> memory_region_init_rom_device() calls
> >>>>>>>> memory_region_init_rom_device_nomigrate() calls qemu_ram_alloc() calls
> >>>>>>>> qemu_ram_alloc_internal() calls g_malloc0().  Thus, all the device
> >>>>>>>> memory gets written to even with this patch.
> >>>>>>>
> >>>>>>> As far as I can see, qemu_ram_alloc_internal() calls g_malloc0() only to
> >>>>>>> allocate the new RAMBlock object called "new_block". The actual
> >>>>>>> guest RAM allocation occurs inside ram_block_add(), which is also called
> >>>>>>> by qemu_ram_alloc_internal().
> >>>>>>
> >>>>>> You're right.  I should've read more attentively.
> >>>>>>
> >>>>>>> One frame outwards the stack, qemu_ram_alloc() passes NULL to
> >>>>>>> qemu_ram_alloc_internal(), for the 4th ("host") parameter. Therefore, in
> >>>>>>> qemu_ram_alloc_internal(), we set "new_block->host" to NULL as well.
> >>>>>>>
> >>>>>>> Then in ram_block_add(), we take the (!new_block->host) branch, and call
> >>>>>>> phys_mem_alloc().
> >>>>>>>
> >>>>>>> Unfortunately, "phys_mem_alloc" is a function pointer, set with
> >>>>>>> phys_mem_set_alloc(). The phys_mem_set_alloc() function is called from
> >>>>>>> "target/s390x/kvm.c" (setting the function pointer to
> >>>>>>> legacy_s390_alloc()), so it doesn't apply in this case. Therefore we end
> >>>>>>> up calling the default qemu_anon_ram_alloc() function, through the
> >>>>>>> funcptr. (I think anyway.)
> >>>>>>>
> >>>>>>> And qemu_anon_ram_alloc() boils down to mmap() + MAP_ANONYMOUS, in
> >>>>>>> qemu_ram_mmap(). (Even on PPC64 hosts, because qemu_anon_ram_alloc()
> >>>>>>> passes (-1) for "fd".)
> >>>>>>>
> >>>>>>> I may have missed something, of course -- I obviously didn't test it,
> >>>>>>> just speculated from the source.
> >>>>>>
> >>>>>> Thanks for your sleuthing!
> >>>>>>
> >>>>>>>> I'm afraid you neglected to test.
> >>>>>>
> >>>>>> Accusation actually unsupported.  I apologize, and replace it by a
> >>>>>> question: have you observed the improvement you're trying to achieve,
> >>>>>> and if yes, how?
> >>>>>>
> >>>>>
> >>>>> Yes, we need to create sparse files as the backing images for pflash device.
> >>>>> To create sparse files like:
> >>>>>
> >>>>>    dd of="QEMU_EFI-pflash.raw" if="/dev/zero" bs=1M seek=64 count=0
> >>>>>    dd of="QEMU_EFI-pflash.raw" if="QEMU_EFI.fd" conv=notrunc
> >>>>
> >>>> This creates a copy of firmware binary QEMU_EFI.fd padded with a hole to
> >>>> 64MiB.
> >>>>
> >>>>>    dd of="empty_VARS.fd" if="/dev/zero" bs=1M seek=64 count=0
> >>>>
> >>>> This creates the varstore as a 64MiB hole.  As far as I know (very
> >>>> little), you should use the varstore template that comes with the
> >>>> firmware binary.
> >>>>
> >>>> I use
> >>>>
> >>>>     cp --sparse=always bld/pc-bios/edk2-arm-vars.fd .
> >>>>     cp --sparse=always bld/pc-bios/edk2-aarch64-code.fd .
> >>>>
> >>>> These guys are already zero-padded, and I use cp to sparsify.
> >>>>
> >>>>> Start a VM with below commandline:
> >>>>>
> >>>>>     -drive file=/usr/share/edk2/aarch64/QEMU_EFI-pflash.raw,if=pflash,format=raw,unit=0,readonly=on\
> >>>>>     -drive file=/usr/share/edk2/aarch64/empty_VARS.fd,if=pflash,format=raw,unit=1 \
> >>>>>
> >>>>> Then observe the memory usage of the qemu process (THP is on).
> >>>>>
> >>>>> 1) Without this patch:
> >>>>> # cat /proc/`pidof qemu-system-aarch64`/smaps | grep AnonHugePages: | grep -v ' 0 kB'
> >>>>> AnonHugePages:    706560 kB
> >>>>> AnonHugePages:      2048 kB
> >>>>> AnonHugePages:     65536 kB    // pflash memory device
> >>>>> AnonHugePages:     65536 kB    // pflash memory device
> >>>>> AnonHugePages:      2048 kB
> >>>>>
> >>>>> # ps aux | grep qemu-system-aarch64
> >>>>> RSS: 879684
> >>>>>
> >>>>> 2) After applying this patch:
> >>>>> # cat /proc/`pidof qemu-system-aarch64`/smaps | grep AnonHugePages: | grep -v ' 0 kB'
> >>>>> AnonHugePages:    700416 kB
> >>>>> AnonHugePages:      2048 kB
> >>>>> AnonHugePages:      2048 kB    // pflash memory device
> >>>>> AnonHugePages:      2048 kB    // pflash memory device
> >>>>> AnonHugePages:      2048 kB
> >>>>>
> >>>>> # ps aux | grep qemu-system-aarch64
> >>>>> RSS: 744380
> >>>>
> >>>> Okay, this demonstrates the patch succeeds at mapping parts of the
> >>>> pflash memory as holes.
> >>>>
> >>>> Do the guests in these QEMU processes run?
> >>>
> >>> Yes.
> >> 
> >> Good to know, thanks.
> >> 
> >>>>> Obviously, there are at least 100MiB memory saved for each guest.
> >>>>
> >>>> For a definition of "memory".
> >>>>
> >>>> Next question: what impact on system performance do you observe?
> >>>>
> >>>> Let me explain.
> >>>>
> >>>> Virtual memory holes get filled in by demand paging on access.  In other
> >>>> words, they remain holes only as long as nothing accesses the memory.
> >>>>
> >>>> Without your patch, we allocate pages at image read time and fill them
> >>>> with zeroes. If we don't access them again, the kernel will eventually
> >>>> page them out (assuming you're running with swap).  So the steady state
> >>>> is "we waste some swap space", not "we waste some physical RAM".
> >>>>
> >>>
> >>> Not everybody wants to run with swap because it may cause low performance.
> >> 
> >> Someone running without swap because he heard someone say someone said
> >> swap may be slow is probably throwing away performance.
> >> 
> >> But I assume you mean people running without swap because they measured
> >> their workload and found it more performant without swap.  Legitimate.
> >
> > Yes, and I have suffered from high I/O waits with swap before. :)
> >
> >> 
> >>>> Your patch lets us map pflash memory pages containing only zeros as
> >>>> holes.
> >>>>
> >>>> For pages that never get accessed, your patch avoids page allocation,
> >>>> filling with zeroes, writing to swap (all one-time costs), and saves
> >>>> some swap space (not commonly an issue).
> >>>>
> >>>> For pflash memory that gets accessed, your patch merely delays page
> >>>> allocation from image read time to first access.
> >>>>
> >>>> I wonder how these savings and delays affect actual system performance.
> >>>> Without an observable change in system performance, all we'd accomplish
> >>>> is changing a bunch of numbers in /proc/$pid/.
> >>>>
> >>>> What improvement(s) can you observe?
> >>>
> >>> We only use pflash device for UEFI, and we hardly care about the performance.
> >>> I think the bottleneck of the performance is the MMIO emulation, even this
> >>> patch would delay page allocation at the first access.
> >> 
> >> I wasn't inquiring about the performance of the pflash device.  I was
> >> inquiring about *system* performance.  But let me rephrase my question.
> >> 
> >> Doing work to save resources is only worthwhile if something valuable
> >> gets better in a measurable way.  I'm asking you
> >> 
> >> (1) to explain what exactly you value, and 
> >> 
> >> (2) to provide measurements that show improvement.
> >> 
> >
> > What we value is exactly the cost of memory resources, and that is the
> > only thing this patch aims to reduce.
> 
> Then measure this cost!
> 
> > I am confused about why you think it will impact system performance.
> > Did I neglect something?
> 
> If the patch does not impact how the system as a whole performs, then
> it's useless.
> 
> Since you find it useful, it must have some valuable[*] observable
> effect for you.  Tell us about it!
> 
> I keep asking not to torment you, but to guide you towards building a
> compelling justification for your patch.  However, I can only show you
> the path; the walking you'll have to do yourself.

Is this discussion really a good use of our time?

The patch is simple, and a few obvious improvements it brings were
already mentioned (even by yourself): avoiding OOM when running without
swap, and, with swap enabled, keeping swap space free for more useful
content and avoiding needless swap I/O.
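For what it's worth, the precondition that the backing file is actually sparse is cheap to check. A hypothetical sketch (throwaway temp file, not a real varstore) comparing the apparent size with the blocks the filesystem really allocated:

```python
import os
import tempfile

SIZE = 64 * 1024 * 1024            # 64 MiB, like a pflash image

fd, path = tempfile.mkstemp(suffix=".fd")
os.close(fd)

# Equivalent of: dd of=file if=/dev/zero bs=1M seek=64 count=0
with open(path, "wb") as f:
    f.truncate(SIZE)               # extends the file with a hole

st = os.stat(path)
apparent = st.st_size              # what QEMU sees: 64 MiB
allocated = st.st_blocks * 512     # what the filesystem actually stores

print(apparent, allocated)
os.remove(path)
```

If `allocated` is close to `apparent`, the file was not created sparse (or the holes were filled by a non-sparse copy) and the patch cannot save anything.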

You may consider these improvements negligible, but even small
improvements can add up. If you really want to measure them, it should be
clear how to do it. I don't see value in actually setting up such
environments just to get some numbers that show what we already know.

So, what's the downside of the patch? The worst case is that the memory
usage numbers merely look better, but most people don't have a use case
where the improvement matters. There might be some maintenance cost
associated with the code, but it's small and I suspect this discussion
has already cost us more time than maintaining the code will ever cost
us.

So why not just take it?

Kevin
Markus Armbruster May 13, 2019, 1:36 p.m. UTC | #13
Kevin Wolf <kwolf@redhat.com> writes:

> Am 13.05.2019 um 13:59 hat Markus Armbruster geschrieben:
>> Xiang Zheng <zhengxiang9@huawei.com> writes:
>> 
>> > On 2019/5/10 23:16, Markus Armbruster wrote:
>> >> Xiang Zheng <zhengxiang9@huawei.com> writes:
>> >> 
>> >>> On 2019/5/9 19:59, Markus Armbruster wrote:
>> >>>> Xiang Zheng <zhengxiang9@huawei.com> writes:
>> >>>>
>> >>>>> On 2019/5/8 21:20, Markus Armbruster wrote:
>> >>>>>> Laszlo Ersek <lersek@redhat.com> writes:
>> >>>>>>
>> >>>>>>> Hi Markus,
>> >>>>>>>
>> >>>>>>> On 05/07/19 20:01, Markus Armbruster wrote:
>> >>>>>>>> The subject is slightly misleading.  Holes read as zero.  So do
>> >>>>>>>> non-holes full of zeroes.  The patch avoids reading the former, but
>> >>>>>>>> still reads the latter.
>> >>>>>>>>
>> >>>>>>>> Xiang Zheng <zhengxiang9@huawei.com> writes:
>> >>>>>>>>
>> >>>>>>>>> Currently we fill the memory space with two 64MB NOR images when
>> >>>>>>>>> using persistent UEFI variables on virt board. Actually we only use
>> >>>>>>>>> a very small(non-zero) part of the memory while the rest significant
>> >>>>>>>>> large(zero) part of memory is wasted.
>> >>>>>>>>
>> >>>>>>>> Neglects to mention that the "virt board" is ARM.
>> >>>>>>>>
>> >>>>>>>>> So this patch checks the block status and only writes the non-zero part
>> >>>>>>>>> into memory. This requires pflash devices to use sparse files for
>> >>>>>>>>> backends.
>> >>>>>>>>
>> >>>>>>>> I started to draft an improved commit message, but then I realized this
>> >>>>>>>> patch can't work.
>> >>>>>>>>
>> >>>>>>>> The pflash_cfi01 device allocates its device memory like this:
>> >>>>>>>>
>> >>>>>>>>     memory_region_init_rom_device(
>> >>>>>>>>         &pfl->mem, OBJECT(dev),
>> >>>>>>>>         &pflash_cfi01_ops,
>> >>>>>>>>         pfl,
>> >>>>>>>>         pfl->name, total_len, &local_err);
>> >>>>>>>>
>> >>>>>>>> pflash_cfi02 is similar.
>> >>>>>>>>
>> >>>>>>>> memory_region_init_rom_device() calls
>> >>>>>>>> memory_region_init_rom_device_nomigrate() calls qemu_ram_alloc() calls
>> >>>>>>>> qemu_ram_alloc_internal() calls g_malloc0().  Thus, all the device
>> >>>>>>>> memory gets written to even with this patch.
>> >>>>>>>
>> >>>>>>> As far as I can see, qemu_ram_alloc_internal() calls g_malloc0() only to
>> >>>>>>> allocate the the new RAMBlock object called "new_block". The actual
>> >>>>>>> guest RAM allocation occurs inside ram_block_add(), which is also called
>> >>>>>>> by qemu_ram_alloc_internal().
>> >>>>>>
>> >>>>>> You're right.  I should've read more attentively.
>> >>>>>>
>> >>>>>>> One frame outwards the stack, qemu_ram_alloc() passes NULL to
>> >>>>>>> qemu_ram_alloc_internal(), for the 4th ("host") parameter. Therefore, in
>> >>>>>>> qemu_ram_alloc_internal(), we set "new_block->host" to NULL as well.
>> >>>>>>>
>> >>>>>>> Then in ram_block_add(), we take the (!new_block->host) branch, and call
>> >>>>>>> phys_mem_alloc().
>> >>>>>>>
>> >>>>>>> Unfortunately, "phys_mem_alloc" is a function pointer, set with
>> >>>>>>> phys_mem_set_alloc(). The phys_mem_set_alloc() function is called from
>> >>>>>>> "target/s390x/kvm.c" (setting the function pointer to
>> >>>>>>> legacy_s390_alloc()), so it doesn't apply in this case. Therefore we end
>> >>>>>>> up calling the default qemu_anon_ram_alloc() function, through the
>> >>>>>>> funcptr. (I think anyway.)
>> >>>>>>>
>> >>>>>>> And qemu_anon_ram_alloc() boils down to mmap() + MAP_ANONYMOUS, in
>> >>>>>>> qemu_ram_mmap(). (Even on PPC64 hosts, because qemu_anon_ram_alloc()
>> >>>>>>> passes (-1) for "fd".)
>> >>>>>>>
>> >>>>>>> I may have missed something, of course -- I obviously didn't test it,
>> >>>>>>> just speculated from the source.
>> >>>>>>
>> >>>>>> Thanks for your sleuthing!
>> >>>>>>
>> >>>>>>>> I'm afraid you neglected to test.
>> >>>>>>
>> >>>>>> Accusation actually unsupported.  I apologize, and replace it by a
>> >>>>>> question: have you observed the improvement you're trying to achieve,
>> >>>>>> and if yes, how?
>> >>>>>>
>> >>>>>
>> >>>>> Yes, we need to create sparse files as the backing images for pflash device.
>> >>>>> To create sparse files like:
>> >>>>>
>> >>>>>    dd of="QEMU_EFI-pflash.raw" if="/dev/zero" bs=1M seek=64 count=0
>> >>>>>    dd of="QEMU_EFI-pflash.raw" if="QEMU_EFI.fd" conv=notrunc
>> >>>>
>> >>>> This creates a copy of firmware binary QEMU_EFI.fd padded with a hole to
>> >>>> 64MiB.
>> >>>>
>> >>>>>    dd of="empty_VARS.fd" if="/dev/zero" bs=1M seek=64 count=0
>> >>>>
>> >>>> This creates the varstore as a 64MiB hole.  As far as I know (very
>> >>>> little), you should use the varstore template that comes with the
>> >>>> firmware binary.
>> >>>>
>> >>>> I use
>> >>>>
>> >>>>     cp --sparse=always bld/pc-bios/edk2-arm-vars.fd .
>> >>>>     cp --sparse=always bld/pc-bios/edk2-aarch64-code.fd .
>> >>>>
>> >>>> These guys are already zero-padded, and I use cp to sparsify.
>> >>>>
>> >>>>> Start a VM with below commandline:
>> >>>>>
>> >>>>>     -drive file=/usr/share/edk2/aarch64/QEMU_EFI-pflash.raw,if=pflash,format=raw,unit=0,readonly=on\
>> >>>>>     -drive file=/usr/share/edk2/aarch64/empty_VARS.fd,if=pflash,format=raw,unit=1 \
>> >>>>>
>> >>>>> Then observe the memory usage of the qemu process (THP is on).
>> >>>>>
>> >>>>> 1) Without this patch:
>> >>>>> # cat /proc/`pidof qemu-system-aarch64`/smaps | grep AnonHugePages: | grep -v ' 0 kB'
>> >>>>> AnonHugePages:    706560 kB
>> >>>>> AnonHugePages:      2048 kB
>> >>>>> AnonHugePages:     65536 kB    // pflash memory device
>> >>>>> AnonHugePages:     65536 kB    // pflash memory device
>> >>>>> AnonHugePages:      2048 kB
>> >>>>>
>> >>>>> # ps aux | grep qemu-system-aarch64
>> >>>>> RSS: 879684
>> >>>>>
>> >>>>> 2) After applying this patch:
>> >>>>> # cat /proc/`pidof qemu-system-aarch64`/smaps | grep AnonHugePages: | grep -v ' 0 kB'
>> >>>>> AnonHugePages:    700416 kB
>> >>>>> AnonHugePages:      2048 kB
>> >>>>> AnonHugePages:      2048 kB    // pflash memory device
>> >>>>> AnonHugePages:      2048 kB    // pflash memory device
>> >>>>> AnonHugePages:      2048 kB
>> >>>>>
>> >>>>> # ps aux | grep qemu-system-aarch64
>> >>>>> RSS: 744380
>> >>>>
>> >>>> Okay, this demonstrates the patch succeeds at mapping parts of the
>> >>>> pflash memory as holes.
>> >>>>
>> >>>> Do the guests in these QEMU processes run?
>> >>>
>> >>> Yes.
>> >> 
>> >> Good to know, thanks.
>> >> 
>> >>>>> Obviously, there are at least 100MiB memory saved for each guest.
>> >>>>
>> >>>> For a definition of "memory".
>> >>>>
>> >>>> Next question: what impact on system performance do you observe?
>> >>>>
>> >>>> Let me explain.
>> >>>>
>> >>>> Virtual memory holes get filled in by demand paging on access.  In other
>> >>>> words, they remain holes only as long as nothing accesses the memory.
>> >>>>
>> >>>> Without your patch, we allocate pages at image read time and fill them
>> >>>> with zeroes. If we don't access them again, the kernel will eventually
>> >>>> page them out (assuming you're running with swap).  So the steady state
>> >>>> is "we waste some swap space", not "we waste some physical RAM".
>> >>>>
>> >>>
>> >>> Not everybody wants to run with swap because it may cause low performance.
>> >> 
>> >> Someone running without swap because he heard someone say someone said
>> >> swap may be slow is probably throwing away performance.
>> >> 
>> >> But I assume you mean people running without swap because they measured
>> >> their workload and found it more performant without swap.  Legitimate.
>> >
>> > Yes, and I had ever suffered from the high IO waits with swap.:)
>> >
>> >> 
>> >>>> Your patch lets us map pflash memory pages containing only zeros as
>> >>>> holes.
>> >>>>
>> >>>> For pages that never get accessed, your patch avoids page allocation,
>> >>>> filling with zeroes, writing to swap (all one-time costs), and saves
>> >>>> some swap space (not commonly an issue).
>> >>>>
>> >>>> For pflash memory that gets accessed, your patch merely delays page
>> >>>> allocation from image read time to first access.
>> >>>>
>> >>>> I wonder how these savings and delays affect actual system performance.
>> >>>> Without an observable change in system performance, all we'd accomplish
>> >>>> is changing a bunch of numers in /proc/$pid/.
>> >>>>
>> >>>> What improvement(s) can you observe?
>> >>>
>> >>> We only use pflash device for UEFI, and we hardly care about the performance.
>> >>> I think the bottleneck of the performance is the MMIO emulation, even this
>> >>> patch would delay page allocation at the first access.
>> >> 
>> >> I wasn't inquiring about the performance of the pflash device.  I was
>> >> inquiring about *system* performance.  But let me rephrase my question.
>> >> 
>> >> Doing work to save resources is only worthwhile if something valuable
>> >> gets better in a measurable way.  I'm asking you
>> >> 
>> >> (1) to explain what exactly you value, and 
>> >> 
>> >> (2) to provide measurements that show improvement.
>> >> 
>> >
>> > What we exactly value is the cost of memory resources and it is the only
>> > thing that this patch aims to resolve.
>> 
>> Then measure this cost!
>> 
>> > I am confused that why you think it will impact the system performance? Did I
>> > neglect something?
>> 
>> If the patch does not impact how the system as a whole performs, then
>> it's useless.
>> 
>> Since you find it useful, it must have some valuable[*] observable
>> effect for you.  Tell us about it!
>> 
>> I keep asking not to torment you, but to guide you towards building a
>> compelling justification for your patch.  However, I can only show you
>> the path; the walking you'll have to do yourself.
>
> Is this discussion really a good use of our time?
>
> The patch is simple, and a few obvious improvements it brings were
> mentioned (even by yourself), such as avoiding OOM without swap; and
> with swap enabled, saving swap space for more useful content and
> saving unnecessary I/O related to accessing swap needlessly.
>
> You may consider these improvements negligible, but even small
> improvements can add up. If you really want to measure them, it should be
> clear how to do it. I don't see value in actually setting up such
> environments just to get some numbers that show what we already know.
>
> So, what's the downside of the patch? The worst case is, the memory
> usage numbers only look better, but most people don't have a use case
> where the improvement matters. There might be some maintenance cost
> associated with the code, but it's small and I suspect this discussion
> has already cost us more time than maintaining the code will ever cost
> us.
>
> So why not just take it?

As is, the patch's commit message fails to meet the standards I set as a
maintainer, because (1) it's too vague on what the patch does, and what
its limitations are (relies on well-behaved guests), and (2) it fails to
make the case for the patch.

Fortunately, I'm not the maintainer here, Philippe is.  My standards do
not matter.
diff mbox series

Patch

diff --git a/hw/block/block.c b/hw/block/block.c
index bf56c76..3cb9d4c 100644
--- a/hw/block/block.c
+++ b/hw/block/block.c
@@ -15,6 +15,44 @@ 
 #include "qapi/qapi-types-block.h"
 
 /*
+ * Read the non-zero parts of @blk into @buf.
+ * Reading all of @blk is expensive if its zero parts are large
+ * enough. Therefore, check the block status and read only the
+ * non-zero parts into @buf.
+ *
+ * Return 0 on success, non-zero on error.
+ */
+static int blk_pread_nonzeroes(BlockBackend *blk, void *buf)
+{
+    int ret;
+    int64_t target_size, bytes, offset = 0;
+    BlockDriverState *bs = blk_bs(blk);
+
+    target_size = bdrv_getlength(bs);
+    if (target_size < 0) {
+        return target_size;
+    }
+
+    for (;;) {
+        bytes = MIN(target_size - offset, BDRV_REQUEST_MAX_BYTES);
+        if (bytes <= 0) {
+            return 0;
+        }
+        ret = bdrv_block_status(bs, offset, bytes, &bytes, NULL, NULL);
+        if (ret < 0) {
+            return ret;
+        }
+        if (!(ret & BDRV_BLOCK_ZERO)) {
+            ret = bdrv_pread(bs->file, offset, (uint8_t *) buf + offset, bytes);
+            if (ret < 0) {
+                return ret;
+            }
+        }
+        offset += bytes;
+    }
+}
+
+/*
  * Read the entire contents of @blk into @buf.
  * @blk's contents must be @size bytes, and @size must be at most
  * BDRV_REQUEST_MAX_BYTES.
@@ -53,7 +91,7 @@  bool blk_check_size_and_read_all(BlockBackend *blk, void *buf, hwaddr size,
      * block device and read only on demand.
      */
     assert(size <= BDRV_REQUEST_MAX_BYTES);
-    ret = blk_pread(blk, 0, buf, size);
+    ret = blk_pread_nonzeroes(blk, buf);
     if (ret < 0) {
         error_setg_errno(errp, -ret, "can't read block backend");
         return false;
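For readers following the technique rather than the QEMU plumbing, the loop above can be sketched as a standalone program. This is a hedged illustration, not QEMU code: it only handles raw files on a filesystem that reports holes via POSIX lseek(SEEK_DATA)/lseek(SEEK_HOLE), whereas the patch queries bdrv_block_status(), which also sees format-level allocation (e.g. qcow2) and written-but-zero clusters. The helper name pread_nonholes() is invented for this sketch.

```c
/*
 * Standalone sketch (not QEMU code) of the "read only the non-zero
 * parts" idea for a raw sparse file: walk the allocated extents with
 * lseek(SEEK_DATA)/lseek(SEEK_HOLE) and copy only those into a
 * zero-filled buffer, leaving the holes untouched.
 */
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Fill buf[0..size) from fd, reading only the non-hole extents. */
static int pread_nonholes(int fd, uint8_t *buf, off_t size)
{
    off_t data = 0;

    memset(buf, 0, size);                 /* holes read back as zero */
    while ((data = lseek(fd, data, SEEK_DATA)) >= 0 && data < size) {
        /* Find where this data extent ends. */
        off_t hole = lseek(fd, data, SEEK_HOLE);
        if (hole < 0 || hole > size) {
            hole = size;
        }
        /* Short reads are ignored here for brevity. */
        if (pread(fd, buf + data, hole - data, data) < 0) {
            return -1;
        }
        data = hole;
    }
    return 0;
}
```

On filesystems that cannot report holes, lseek(SEEK_DATA) simply returns the requested offset (the whole file counts as data), so the sketch degrades to a plain full read rather than failing — the same graceful fallback the patch gets when the backing image is not sparse.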