Message ID: 20190505070059.4664-1-zhengxiang9@huawei.com (mailing list archive)
State: New, archived
Series: pflash: Only read non-zero parts of backend image
On Sun, 5 May 2019 at 08:02, Xiang Zheng <zhengxiang9@huawei.com> wrote:
>
> Currently we fill the memory space with two 64MB NOR images when
> using persistent UEFI variables on virt board. Actually we only use
> a very small(non-zero) part of the memory while the rest significant
> large(zero) part of memory is wasted.
>
> So this patch checks the block status and only writes the non-zero part
> into memory. This requires pflash devices to use sparse files for
> backends.

Do you mean "pflash devices will no longer work if the file that is
backing them is not sparse", or just "if the file that is backing them
is not sparse then you won't get the benefit of using less memory"?

thanks
-- PMM
On 2019/5/5 23:37, Peter Maydell wrote:
> On Sun, 5 May 2019 at 08:02, Xiang Zheng <zhengxiang9@huawei.com> wrote:
>>
>> Currently we fill the memory space with two 64MB NOR images when
>> using persistent UEFI variables on virt board. Actually we only use
>> a very small(non-zero) part of the memory while the rest significant
>> large(zero) part of memory is wasted.
>>
>> So this patch checks the block status and only writes the non-zero part
>> into memory. This requires pflash devices to use sparse files for
>> backends.
>
> Do you mean "pflash devices will no longer work if the file
> that is backing them is not sparse", or just "if the file that
> is backing them is not sparse then you won't get the benefit
> of using less memory" ?
>

I mean the latter: if the file is not sparse, nothing would change.
I will improve this commit message in the next version.
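The mechanism under discussion — skip the ranges the backing file reports as holes and leave the destination zeroed — can be sketched outside QEMU with plain lseek(SEEK_DATA)/lseek(SEEK_HOLE) against a raw sparse file. This is only an illustration under assumptions: the helper name load_sparse() is mine, it is Linux-specific, and the actual patch queries QEMU's block layer for block status rather than asking the filesystem directly.

```c
#define _GNU_SOURCE             /* SEEK_DATA / SEEK_HOLE on Linux */
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/*
 * Hypothetical sketch: copy only the allocated (data) extents of a
 * sparse backing file into a buffer that is already all-zero, skipping
 * the holes.  The real patch works through QEMU's block layer, not
 * raw lseek() on a file descriptor.
 */
static ssize_t load_sparse(int fd, unsigned char *buf, size_t size)
{
    off_t pos = 0;
    ssize_t loaded = 0;

    for (;;) {
        off_t data = lseek(fd, pos, SEEK_DATA);    /* next data run */
        if (data < 0 || (size_t)data >= size) {
            break;                                 /* no more data in range */
        }
        off_t hole = lseek(fd, data, SEEK_HOLE);   /* end of this run */
        if (hole < 0 || (size_t)hole > size) {
            hole = size;
        }
        ssize_t n = pread(fd, buf + data, hole - data, data);
        if (n < 0) {
            return -1;
        }
        loaded += n;
        pos = hole;
    }
    return loaded;
}
```

Note that, as pointed out below in the thread, this skips only holes; non-hole extents that happen to be full of zeroes are still read.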
The subject is slightly misleading. Holes read as zero. So do
non-holes full of zeroes. The patch avoids reading the former, but
still reads the latter.

Xiang Zheng <zhengxiang9@huawei.com> writes:

> Currently we fill the memory space with two 64MB NOR images when
> using persistent UEFI variables on virt board. Actually we only use
> a very small(non-zero) part of the memory while the rest significant
> large(zero) part of memory is wasted.

Neglects to mention that the "virt board" is ARM.

> So this patch checks the block status and only writes the non-zero part
> into memory. This requires pflash devices to use sparse files for
> backends.

I started to draft an improved commit message, but then I realized this
patch can't work.

The pflash_cfi01 device allocates its device memory like this:

    memory_region_init_rom_device(
        &pfl->mem, OBJECT(dev),
        &pflash_cfi01_ops,
        pfl,
        pfl->name, total_len, &local_err);

pflash_cfi02 is similar.

memory_region_init_rom_device() calls
memory_region_init_rom_device_nomigrate() calls qemu_ram_alloc() calls
qemu_ram_alloc_internal() calls g_malloc0(). Thus, all the device
memory gets written to even with this patch.

I'm afraid you neglected to test.

I still believe this approach can be made to work. Need a replacement
for memory_region_init_rom_device() that uses mmap() with MAP_ANONYMOUS.
Hi Markus,

On 05/07/19 20:01, Markus Armbruster wrote:
> The subject is slightly misleading. Holes read as zero. So do
> non-holes full of zeroes. The patch avoids reading the former, but
> still reads the latter.
>
> Xiang Zheng <zhengxiang9@huawei.com> writes:
>
>> Currently we fill the memory space with two 64MB NOR images when
>> using persistent UEFI variables on virt board. Actually we only use
>> a very small(non-zero) part of the memory while the rest significant
>> large(zero) part of memory is wasted.
>
> Neglects to mention that the "virt board" is ARM.
>
>> So this patch checks the block status and only writes the non-zero part
>> into memory. This requires pflash devices to use sparse files for
>> backends.
>
> I started to draft an improved commit message, but then I realized this
> patch can't work.
>
> The pflash_cfi01 device allocates its device memory like this:
>
>     memory_region_init_rom_device(
>         &pfl->mem, OBJECT(dev),
>         &pflash_cfi01_ops,
>         pfl,
>         pfl->name, total_len, &local_err);
>
> pflash_cfi02 is similar.
>
> memory_region_init_rom_device() calls
> memory_region_init_rom_device_nomigrate() calls qemu_ram_alloc() calls
> qemu_ram_alloc_internal() calls g_malloc0(). Thus, all the device
> memory gets written to even with this patch.

As far as I can see, qemu_ram_alloc_internal() calls g_malloc0() only to
allocate the new RAMBlock object called "new_block". The actual guest
RAM allocation occurs inside ram_block_add(), which is also called by
qemu_ram_alloc_internal().

One frame outwards the stack, qemu_ram_alloc() passes NULL to
qemu_ram_alloc_internal(), for the 4th ("host") parameter. Therefore, in
qemu_ram_alloc_internal(), we set "new_block->host" to NULL as well.

Then in ram_block_add(), we take the (!new_block->host) branch, and call
phys_mem_alloc().

Unfortunately, "phys_mem_alloc" is a function pointer, set with
phys_mem_set_alloc(). The phys_mem_set_alloc() function is called from
"target/s390x/kvm.c" (setting the function pointer to
legacy_s390_alloc()), so it doesn't apply in this case. Therefore we end
up calling the default qemu_anon_ram_alloc() function, through the
funcptr. (I think anyway.)

And qemu_anon_ram_alloc() boils down to mmap() + MAP_ANONYMOUS, in
qemu_ram_mmap(). (Even on PPC64 hosts, because qemu_anon_ram_alloc()
passes (-1) for "fd".)

I may have missed something, of course -- I obviously didn't test it,
just speculated from the source.

Thanks
Laszlo

> I'm afraid you neglected to test.
>
> I still believe this approach can be made to work. Need a replacement
> for memory_region_init_rom_device() that uses mmap() with MAP_ANONYMOUS.
Laszlo Ersek <lersek@redhat.com> writes: > Hi Markus, > > On 05/07/19 20:01, Markus Armbruster wrote: >> The subject is slightly misleading. Holes read as zero. So do >> non-holes full of zeroes. The patch avoids reading the former, but >> still reads the latter. >> >> Xiang Zheng <zhengxiang9@huawei.com> writes: >> >>> Currently we fill the memory space with two 64MB NOR images when >>> using persistent UEFI variables on virt board. Actually we only use >>> a very small(non-zero) part of the memory while the rest significant >>> large(zero) part of memory is wasted. >> >> Neglects to mention that the "virt board" is ARM. >> >>> So this patch checks the block status and only writes the non-zero part >>> into memory. This requires pflash devices to use sparse files for >>> backends. >> >> I started to draft an improved commit message, but then I realized this >> patch can't work. >> >> The pflash_cfi01 device allocates its device memory like this: >> >> memory_region_init_rom_device( >> &pfl->mem, OBJECT(dev), >> &pflash_cfi01_ops, >> pfl, >> pfl->name, total_len, &local_err); >> >> pflash_cfi02 is similar. >> >> memory_region_init_rom_device() calls >> memory_region_init_rom_device_nomigrate() calls qemu_ram_alloc() calls >> qemu_ram_alloc_internal() calls g_malloc0(). Thus, all the device >> memory gets written to even with this patch. > > As far as I can see, qemu_ram_alloc_internal() calls g_malloc0() only to > allocate the the new RAMBlock object called "new_block". The actual > guest RAM allocation occurs inside ram_block_add(), which is also called > by qemu_ram_alloc_internal(). You're right. I should've read more attentively. > One frame outwards the stack, qemu_ram_alloc() passes NULL to > qemu_ram_alloc_internal(), for the 4th ("host") parameter. Therefore, in > qemu_ram_alloc_internal(), we set "new_block->host" to NULL as well. > > Then in ram_block_add(), we take the (!new_block->host) branch, and call > phys_mem_alloc(). 
> > Unfortunately, "phys_mem_alloc" is a function pointer, set with > phys_mem_set_alloc(). The phys_mem_set_alloc() function is called from > "target/s390x/kvm.c" (setting the function pointer to > legacy_s390_alloc()), so it doesn't apply in this case. Therefore we end > up calling the default qemu_anon_ram_alloc() function, through the > funcptr. (I think anyway.) > > And qemu_anon_ram_alloc() boils down to mmap() + MAP_ANONYMOUS, in > qemu_ram_mmap(). (Even on PPC64 hosts, because qemu_anon_ram_alloc() > passes (-1) for "fd".) > > I may have missed something, of course -- I obviously didn't test it, > just speculated from the source. Thanks for your sleuthing! >> I'm afraid you neglected to test. Accusation actually unsupported. I apologize, and replace it by a question: have you observed the improvement you're trying to achieve, and if yes, how? [...]
On 2019/5/8 21:20, Markus Armbruster wrote: > Laszlo Ersek <lersek@redhat.com> writes: > >> Hi Markus, >> >> On 05/07/19 20:01, Markus Armbruster wrote: >>> The subject is slightly misleading. Holes read as zero. So do >>> non-holes full of zeroes. The patch avoids reading the former, but >>> still reads the latter. >>> >>> Xiang Zheng <zhengxiang9@huawei.com> writes: >>> >>>> Currently we fill the memory space with two 64MB NOR images when >>>> using persistent UEFI variables on virt board. Actually we only use >>>> a very small(non-zero) part of the memory while the rest significant >>>> large(zero) part of memory is wasted. >>> >>> Neglects to mention that the "virt board" is ARM. >>> >>>> So this patch checks the block status and only writes the non-zero part >>>> into memory. This requires pflash devices to use sparse files for >>>> backends. >>> >>> I started to draft an improved commit message, but then I realized this >>> patch can't work. >>> >>> The pflash_cfi01 device allocates its device memory like this: >>> >>> memory_region_init_rom_device( >>> &pfl->mem, OBJECT(dev), >>> &pflash_cfi01_ops, >>> pfl, >>> pfl->name, total_len, &local_err); >>> >>> pflash_cfi02 is similar. >>> >>> memory_region_init_rom_device() calls >>> memory_region_init_rom_device_nomigrate() calls qemu_ram_alloc() calls >>> qemu_ram_alloc_internal() calls g_malloc0(). Thus, all the device >>> memory gets written to even with this patch. >> >> As far as I can see, qemu_ram_alloc_internal() calls g_malloc0() only to >> allocate the the new RAMBlock object called "new_block". The actual >> guest RAM allocation occurs inside ram_block_add(), which is also called >> by qemu_ram_alloc_internal(). > > You're right. I should've read more attentively. > >> One frame outwards the stack, qemu_ram_alloc() passes NULL to >> qemu_ram_alloc_internal(), for the 4th ("host") parameter. Therefore, in >> qemu_ram_alloc_internal(), we set "new_block->host" to NULL as well. 
>>
>> Then in ram_block_add(), we take the (!new_block->host) branch, and call
>> phys_mem_alloc().
>>
>> Unfortunately, "phys_mem_alloc" is a function pointer, set with
>> phys_mem_set_alloc(). The phys_mem_set_alloc() function is called from
>> "target/s390x/kvm.c" (setting the function pointer to
>> legacy_s390_alloc()), so it doesn't apply in this case. Therefore we end
>> up calling the default qemu_anon_ram_alloc() function, through the
>> funcptr. (I think anyway.)
>>
>> And qemu_anon_ram_alloc() boils down to mmap() + MAP_ANONYMOUS, in
>> qemu_ram_mmap(). (Even on PPC64 hosts, because qemu_anon_ram_alloc()
>> passes (-1) for "fd".)
>>
>> I may have missed something, of course -- I obviously didn't test it,
>> just speculated from the source.
>
> Thanks for your sleuthing!
>
>>> I'm afraid you neglected to test.
>
> Accusation actually unsupported. I apologize, and replace it by a
> question: have you observed the improvement you're trying to achieve,
> and if yes, how?
>

Yes, we need to create sparse files as the backing images for pflash device.
To create sparse files like:

    dd of="QEMU_EFI-pflash.raw" if="/dev/zero" bs=1M seek=64 count=0
    dd of="QEMU_EFI-pflash.raw" if="QEMU_EFI.fd" conv=notrunc
    dd of="empty_VARS.fd" if="/dev/zero" bs=1M seek=64 count=0

Start a VM with below commandline:

    -drive file=/usr/share/edk2/aarch64/QEMU_EFI-pflash.raw,if=pflash,format=raw,unit=0,readonly=on \
    -drive file=/usr/share/edk2/aarch64/empty_VARS.fd,if=pflash,format=raw,unit=1 \

Then observe the memory usage of the qemu process (THP is on).
1) Without this patch:

    # cat /proc/`pidof qemu-system-aarch64`/smaps | grep AnonHugePages: | grep -v ' 0 kB'
    AnonHugePages:    706560 kB
    AnonHugePages:      2048 kB
    AnonHugePages:     65536 kB   // pflash memory device
    AnonHugePages:     65536 kB   // pflash memory device
    AnonHugePages:      2048 kB

    # ps aux | grep qemu-system-aarch64
    RSS: 879684

2) After applying this patch:

    # cat /proc/`pidof qemu-system-aarch64`/smaps | grep AnonHugePages: | grep -v ' 0 kB'
    AnonHugePages:    700416 kB
    AnonHugePages:      2048 kB
    AnonHugePages:      2048 kB   // pflash memory device
    AnonHugePages:      2048 kB   // pflash memory device
    AnonHugePages:      2048 kB

    # ps aux | grep qemu-system-aarch64
    RSS: 744380

Obviously, there are at least 100MiB memory saved for each guest.
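The savings above depend on the backing images actually being sparse. One quick way to check a candidate image, sketched here as a standalone helper (the name is_sparse() and the approach are mine, not part of the patch), is to compare the apparent size with the blocks stat() reports as allocated (st_blocks counts 512-byte units):

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Return 1 if the file looks sparse (fewer blocks allocated than its
 * apparent size would require), 0 if fully allocated, -1 on error.
 * Illustrative helper for checking pflash backing images.
 */
static int is_sparse(const char *path)
{
    struct stat st;

    if (stat(path, &st) != 0) {
        return -1;
    }
    /* st_blocks is measured in 512-byte units regardless of blocksize. */
    return (off_t)st.st_blocks * 512 < st.st_size ? 1 : 0;
}
```

The same check from a shell is `du -h` versus `ls -lh` on the image.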
Xiang Zheng <zhengxiang9@huawei.com> writes: > On 2019/5/8 21:20, Markus Armbruster wrote: >> Laszlo Ersek <lersek@redhat.com> writes: >> >>> Hi Markus, >>> >>> On 05/07/19 20:01, Markus Armbruster wrote: >>>> The subject is slightly misleading. Holes read as zero. So do >>>> non-holes full of zeroes. The patch avoids reading the former, but >>>> still reads the latter. >>>> >>>> Xiang Zheng <zhengxiang9@huawei.com> writes: >>>> >>>>> Currently we fill the memory space with two 64MB NOR images when >>>>> using persistent UEFI variables on virt board. Actually we only use >>>>> a very small(non-zero) part of the memory while the rest significant >>>>> large(zero) part of memory is wasted. >>>> >>>> Neglects to mention that the "virt board" is ARM. >>>> >>>>> So this patch checks the block status and only writes the non-zero part >>>>> into memory. This requires pflash devices to use sparse files for >>>>> backends. >>>> >>>> I started to draft an improved commit message, but then I realized this >>>> patch can't work. >>>> >>>> The pflash_cfi01 device allocates its device memory like this: >>>> >>>> memory_region_init_rom_device( >>>> &pfl->mem, OBJECT(dev), >>>> &pflash_cfi01_ops, >>>> pfl, >>>> pfl->name, total_len, &local_err); >>>> >>>> pflash_cfi02 is similar. >>>> >>>> memory_region_init_rom_device() calls >>>> memory_region_init_rom_device_nomigrate() calls qemu_ram_alloc() calls >>>> qemu_ram_alloc_internal() calls g_malloc0(). Thus, all the device >>>> memory gets written to even with this patch. >>> >>> As far as I can see, qemu_ram_alloc_internal() calls g_malloc0() only to >>> allocate the the new RAMBlock object called "new_block". The actual >>> guest RAM allocation occurs inside ram_block_add(), which is also called >>> by qemu_ram_alloc_internal(). >> >> You're right. I should've read more attentively. >> >>> One frame outwards the stack, qemu_ram_alloc() passes NULL to >>> qemu_ram_alloc_internal(), for the 4th ("host") parameter. 
>>> Therefore, in
>>> qemu_ram_alloc_internal(), we set "new_block->host" to NULL as well.
>>>
>>> Then in ram_block_add(), we take the (!new_block->host) branch, and call
>>> phys_mem_alloc().
>>>
>>> Unfortunately, "phys_mem_alloc" is a function pointer, set with
>>> phys_mem_set_alloc(). The phys_mem_set_alloc() function is called from
>>> "target/s390x/kvm.c" (setting the function pointer to
>>> legacy_s390_alloc()), so it doesn't apply in this case. Therefore we end
>>> up calling the default qemu_anon_ram_alloc() function, through the
>>> funcptr. (I think anyway.)
>>>
>>> And qemu_anon_ram_alloc() boils down to mmap() + MAP_ANONYMOUS, in
>>> qemu_ram_mmap(). (Even on PPC64 hosts, because qemu_anon_ram_alloc()
>>> passes (-1) for "fd".)
>>>
>>> I may have missed something, of course -- I obviously didn't test it,
>>> just speculated from the source.
>>
>> Thanks for your sleuthing!
>>
>>>> I'm afraid you neglected to test.
>>
>> Accusation actually unsupported. I apologize, and replace it by a
>> question: have you observed the improvement you're trying to achieve,
>> and if yes, how?
>>
>
> Yes, we need to create sparse files as the backing images for pflash device.
> To create sparse files like:
>
> dd of="QEMU_EFI-pflash.raw" if="/dev/zero" bs=1M seek=64 count=0
> dd of="QEMU_EFI-pflash.raw" if="QEMU_EFI.fd" conv=notrunc

This creates a copy of firmware binary QEMU_EFI.fd padded with a hole to
64MiB.

> dd of="empty_VARS.fd" if="/dev/zero" bs=1M seek=64 count=0

This creates the varstore as a 64MiB hole. As far as I know (very
little), you should use the varstore template that comes with the
firmware binary.

I use

    cp --sparse=always bld/pc-bios/edk2-arm-vars.fd .
    cp --sparse=always bld/pc-bios/edk2-aarch64-code.fd .

These guys are already zero-padded, and I use cp to sparsify.
> Start a VM with below commandline:
>
> -drive file=/usr/share/edk2/aarch64/QEMU_EFI-pflash.raw,if=pflash,format=raw,unit=0,readonly=on \
> -drive file=/usr/share/edk2/aarch64/empty_VARS.fd,if=pflash,format=raw,unit=1 \
>
> Then observe the memory usage of the qemu process (THP is on).
>
> 1) Without this patch:
> # cat /proc/`pidof qemu-system-aarch64`/smaps | grep AnonHugePages: | grep -v ' 0 kB'
> AnonHugePages:    706560 kB
> AnonHugePages:      2048 kB
> AnonHugePages:     65536 kB   // pflash memory device
> AnonHugePages:     65536 kB   // pflash memory device
> AnonHugePages:      2048 kB
>
> # ps aux | grep qemu-system-aarch64
> RSS: 879684
>
> 2) After applying this patch:
> # cat /proc/`pidof qemu-system-aarch64`/smaps | grep AnonHugePages: | grep -v ' 0 kB'
> AnonHugePages:    700416 kB
> AnonHugePages:      2048 kB
> AnonHugePages:      2048 kB   // pflash memory device
> AnonHugePages:      2048 kB   // pflash memory device
> AnonHugePages:      2048 kB
>
> # ps aux | grep qemu-system-aarch64
> RSS: 744380

Okay, this demonstrates the patch succeeds at mapping parts of the
pflash memory as holes.

Do the guests in these QEMU processes run?

> Obviously, there are at least 100MiB memory saved for each guest.

For a definition of "memory".

Next question: what impact on system performance do you observe?

Let me explain.

Virtual memory holes get filled in by demand paging on access. In other
words, they remain holes only as long as nothing accesses the memory.

Without your patch, we allocate pages at image read time and fill them
with zeroes. If we don't access them again, the kernel will eventually
page them out (assuming you're running with swap). So the steady state
is "we waste some swap space", not "we waste some physical RAM".

Your patch lets us map pflash memory pages containing only zeros as
holes.

For pages that never get accessed, your patch avoids page allocation,
filling with zeroes, writing to swap (all one-time costs), and saves
some swap space (not commonly an issue).

For pflash memory that gets accessed, your patch merely delays page
allocation from image read time to first access.

I wonder how these savings and delays affect actual system performance.
Without an observable change in system performance, all we'd accomplish
is changing a bunch of numbers in /proc/$pid/.

What improvement(s) can you observe?

I guess the best case for your patch is many guests with relatively
small RAM sizes.
On 2019/5/9 19:59, Markus Armbruster wrote: > Xiang Zheng <zhengxiang9@huawei.com> writes: > >> On 2019/5/8 21:20, Markus Armbruster wrote: >>> Laszlo Ersek <lersek@redhat.com> writes: >>> >>>> Hi Markus, >>>> >>>> On 05/07/19 20:01, Markus Armbruster wrote: >>>>> The subject is slightly misleading. Holes read as zero. So do >>>>> non-holes full of zeroes. The patch avoids reading the former, but >>>>> still reads the latter. >>>>> >>>>> Xiang Zheng <zhengxiang9@huawei.com> writes: >>>>> >>>>>> Currently we fill the memory space with two 64MB NOR images when >>>>>> using persistent UEFI variables on virt board. Actually we only use >>>>>> a very small(non-zero) part of the memory while the rest significant >>>>>> large(zero) part of memory is wasted. >>>>> >>>>> Neglects to mention that the "virt board" is ARM. >>>>> >>>>>> So this patch checks the block status and only writes the non-zero part >>>>>> into memory. This requires pflash devices to use sparse files for >>>>>> backends. >>>>> >>>>> I started to draft an improved commit message, but then I realized this >>>>> patch can't work. >>>>> >>>>> The pflash_cfi01 device allocates its device memory like this: >>>>> >>>>> memory_region_init_rom_device( >>>>> &pfl->mem, OBJECT(dev), >>>>> &pflash_cfi01_ops, >>>>> pfl, >>>>> pfl->name, total_len, &local_err); >>>>> >>>>> pflash_cfi02 is similar. >>>>> >>>>> memory_region_init_rom_device() calls >>>>> memory_region_init_rom_device_nomigrate() calls qemu_ram_alloc() calls >>>>> qemu_ram_alloc_internal() calls g_malloc0(). Thus, all the device >>>>> memory gets written to even with this patch. >>>> >>>> As far as I can see, qemu_ram_alloc_internal() calls g_malloc0() only to >>>> allocate the the new RAMBlock object called "new_block". The actual >>>> guest RAM allocation occurs inside ram_block_add(), which is also called >>>> by qemu_ram_alloc_internal(). >>> >>> You're right. I should've read more attentively. 
>>> >>>> One frame outwards the stack, qemu_ram_alloc() passes NULL to >>>> qemu_ram_alloc_internal(), for the 4th ("host") parameter. Therefore, in >>>> qemu_ram_alloc_internal(), we set "new_block->host" to NULL as well. >>>> >>>> Then in ram_block_add(), we take the (!new_block->host) branch, and call >>>> phys_mem_alloc(). >>>> >>>> Unfortunately, "phys_mem_alloc" is a function pointer, set with >>>> phys_mem_set_alloc(). The phys_mem_set_alloc() function is called from >>>> "target/s390x/kvm.c" (setting the function pointer to >>>> legacy_s390_alloc()), so it doesn't apply in this case. Therefore we end >>>> up calling the default qemu_anon_ram_alloc() function, through the >>>> funcptr. (I think anyway.) >>>> >>>> And qemu_anon_ram_alloc() boils down to mmap() + MAP_ANONYMOUS, in >>>> qemu_ram_mmap(). (Even on PPC64 hosts, because qemu_anon_ram_alloc() >>>> passes (-1) for "fd".) >>>> >>>> I may have missed something, of course -- I obviously didn't test it, >>>> just speculated from the source. >>> >>> Thanks for your sleuthing! >>> >>>>> I'm afraid you neglected to test. >>> >>> Accusation actually unsupported. I apologize, and replace it by a >>> question: have you observed the improvement you're trying to achieve, >>> and if yes, how? >>> >> >> Yes, we need to create sparse files as the backing images for pflash device. >> To create sparse files like: >> >> dd of="QEMU_EFI-pflash.raw" if="/dev/zero" bs=1M seek=64 count=0 >> dd of="QEMU_EFI-pflash.raw" if="QEMU_EFI.fd" conv=notrunc > > This creates a copy of firmware binary QEMU_EFI.fd padded with a hole to > 64MiB. > >> dd of="empty_VARS.fd" if="/dev/zero" bs=1M seek=64 count=0 > > This creates the varstore as a 64MiB hole. As far as I know (very > little), you should use the varstore template that comes with the > firmware binary. > > I use > > cp --sparse=always bld/pc-bios/edk2-arm-vars.fd . > cp --sparse=always bld/pc-bios/edk2-aarch64-code.fd . 
> > These guys are already zero-padded, and I use cp to sparsify. > >> Start a VM with below commandline: >> >> -drive file=/usr/share/edk2/aarch64/QEMU_EFI-pflash.raw,if=pflash,format=raw,unit=0,readonly=on\ >> -drive file=/usr/share/edk2/aarch64/empty_VARS.fd,if=pflash,format=raw,unit=1 \ >> >> Then observe the memory usage of the qemu process (THP is on). >> >> 1) Without this patch: >> # cat /proc/`pidof qemu-system-aarch64`/smaps | grep AnonHugePages: | grep -v ' 0 kB' >> AnonHugePages: 706560 kB >> AnonHugePages: 2048 kB >> AnonHugePages: 65536 kB // pflash memory device >> AnonHugePages: 65536 kB // pflash memory device >> AnonHugePages: 2048 kB >> >> # ps aux | grep qemu-system-aarch64 >> RSS: 879684 >> >> 2) After applying this patch: >> # cat /proc/`pidof qemu-system-aarch64`/smaps | grep AnonHugePages: | grep -v ' 0 kB' >> AnonHugePages: 700416 kB >> AnonHugePages: 2048 kB >> AnonHugePages: 2048 kB // pflash memory device >> AnonHugePages: 2048 kB // pflash memory device >> AnonHugePages: 2048 kB >> >> # ps aux | grep qemu-system-aarch64 >> RSS: 744380 > > Okay, this demonstrates the patch succeeds at mapping parts of the > pflash memory as holes. > > Do the guests in these QEMU processes run? Yes. > >> Obviously, there are at least 100MiB memory saved for each guest. > > For a definition of "memory". > > Next question: what impact on system performance do you observe? > > Let me explain. > > Virtual memory holes get filled in by demand paging on access. In other > words, they remain holes only as long as nothing accesses the memory. > > Without your patch, we allocate pages at image read time and fill them > with zeroes. If we don't access them again, the kernel will eventually > page them out (assuming you're running with swap). So the steady state > is "we waste some swap space", not "we waste some physical RAM". > Not everybody wants to run with swap because it may cause low performance. 
> Your patch lets us map pflash memory pages containing only zeros as
> holes.
>
> For pages that never get accessed, your patch avoids page allocation,
> filling with zeroes, writing to swap (all one-time costs), and saves
> some swap space (not commonly an issue).
>
> For pflash memory that gets accessed, your patch merely delays page
> allocation from image read time to first access.
>
> I wonder how these savings and delays affect actual system performance.
> Without an observable change in system performance, all we'd accomplish
> is changing a bunch of numers in /proc/$pid/.
>
> What improvement(s) can you observe?

We only use the pflash device for UEFI, and we hardly care about its
performance. I think the performance bottleneck is the MMIO emulation;
even with this patch, page allocation is only delayed until the first
access.

> I guess the best case for your patch is many guests with relatively
> small RAM sizes.
Xiang Zheng <zhengxiang9@huawei.com> writes: > On 2019/5/9 19:59, Markus Armbruster wrote: >> Xiang Zheng <zhengxiang9@huawei.com> writes: >> >>> On 2019/5/8 21:20, Markus Armbruster wrote: >>>> Laszlo Ersek <lersek@redhat.com> writes: >>>> >>>>> Hi Markus, >>>>> >>>>> On 05/07/19 20:01, Markus Armbruster wrote: >>>>>> The subject is slightly misleading. Holes read as zero. So do >>>>>> non-holes full of zeroes. The patch avoids reading the former, but >>>>>> still reads the latter. >>>>>> >>>>>> Xiang Zheng <zhengxiang9@huawei.com> writes: >>>>>> >>>>>>> Currently we fill the memory space with two 64MB NOR images when >>>>>>> using persistent UEFI variables on virt board. Actually we only use >>>>>>> a very small(non-zero) part of the memory while the rest significant >>>>>>> large(zero) part of memory is wasted. >>>>>> >>>>>> Neglects to mention that the "virt board" is ARM. >>>>>> >>>>>>> So this patch checks the block status and only writes the non-zero part >>>>>>> into memory. This requires pflash devices to use sparse files for >>>>>>> backends. >>>>>> >>>>>> I started to draft an improved commit message, but then I realized this >>>>>> patch can't work. >>>>>> >>>>>> The pflash_cfi01 device allocates its device memory like this: >>>>>> >>>>>> memory_region_init_rom_device( >>>>>> &pfl->mem, OBJECT(dev), >>>>>> &pflash_cfi01_ops, >>>>>> pfl, >>>>>> pfl->name, total_len, &local_err); >>>>>> >>>>>> pflash_cfi02 is similar. >>>>>> >>>>>> memory_region_init_rom_device() calls >>>>>> memory_region_init_rom_device_nomigrate() calls qemu_ram_alloc() calls >>>>>> qemu_ram_alloc_internal() calls g_malloc0(). Thus, all the device >>>>>> memory gets written to even with this patch. >>>>> >>>>> As far as I can see, qemu_ram_alloc_internal() calls g_malloc0() only to >>>>> allocate the the new RAMBlock object called "new_block". The actual >>>>> guest RAM allocation occurs inside ram_block_add(), which is also called >>>>> by qemu_ram_alloc_internal(). 
>>>> >>>> You're right. I should've read more attentively. >>>> >>>>> One frame outwards the stack, qemu_ram_alloc() passes NULL to >>>>> qemu_ram_alloc_internal(), for the 4th ("host") parameter. Therefore, in >>>>> qemu_ram_alloc_internal(), we set "new_block->host" to NULL as well. >>>>> >>>>> Then in ram_block_add(), we take the (!new_block->host) branch, and call >>>>> phys_mem_alloc(). >>>>> >>>>> Unfortunately, "phys_mem_alloc" is a function pointer, set with >>>>> phys_mem_set_alloc(). The phys_mem_set_alloc() function is called from >>>>> "target/s390x/kvm.c" (setting the function pointer to >>>>> legacy_s390_alloc()), so it doesn't apply in this case. Therefore we end >>>>> up calling the default qemu_anon_ram_alloc() function, through the >>>>> funcptr. (I think anyway.) >>>>> >>>>> And qemu_anon_ram_alloc() boils down to mmap() + MAP_ANONYMOUS, in >>>>> qemu_ram_mmap(). (Even on PPC64 hosts, because qemu_anon_ram_alloc() >>>>> passes (-1) for "fd".) >>>>> >>>>> I may have missed something, of course -- I obviously didn't test it, >>>>> just speculated from the source. >>>> >>>> Thanks for your sleuthing! >>>> >>>>>> I'm afraid you neglected to test. >>>> >>>> Accusation actually unsupported. I apologize, and replace it by a >>>> question: have you observed the improvement you're trying to achieve, >>>> and if yes, how? >>>> >>> >>> Yes, we need to create sparse files as the backing images for pflash device. >>> To create sparse files like: >>> >>> dd of="QEMU_EFI-pflash.raw" if="/dev/zero" bs=1M seek=64 count=0 >>> dd of="QEMU_EFI-pflash.raw" if="QEMU_EFI.fd" conv=notrunc >> >> This creates a copy of firmware binary QEMU_EFI.fd padded with a hole to >> 64MiB. >> >>> dd of="empty_VARS.fd" if="/dev/zero" bs=1M seek=64 count=0 >> >> This creates the varstore as a 64MiB hole. As far as I know (very >> little), you should use the varstore template that comes with the >> firmware binary. >> >> I use >> >> cp --sparse=always bld/pc-bios/edk2-arm-vars.fd . 
>> cp --sparse=always bld/pc-bios/edk2-aarch64-code.fd . >> >> These guys are already zero-padded, and I use cp to sparsify. >> >>> Start a VM with below commandline: >>> >>> -drive file=/usr/share/edk2/aarch64/QEMU_EFI-pflash.raw,if=pflash,format=raw,unit=0,readonly=on\ >>> -drive file=/usr/share/edk2/aarch64/empty_VARS.fd,if=pflash,format=raw,unit=1 \ >>> >>> Then observe the memory usage of the qemu process (THP is on). >>> >>> 1) Without this patch: >>> # cat /proc/`pidof qemu-system-aarch64`/smaps | grep AnonHugePages: | grep -v ' 0 kB' >>> AnonHugePages: 706560 kB >>> AnonHugePages: 2048 kB >>> AnonHugePages: 65536 kB // pflash memory device >>> AnonHugePages: 65536 kB // pflash memory device >>> AnonHugePages: 2048 kB >>> >>> # ps aux | grep qemu-system-aarch64 >>> RSS: 879684 >>> >>> 2) After applying this patch: >>> # cat /proc/`pidof qemu-system-aarch64`/smaps | grep AnonHugePages: | grep -v ' 0 kB' >>> AnonHugePages: 700416 kB >>> AnonHugePages: 2048 kB >>> AnonHugePages: 2048 kB // pflash memory device >>> AnonHugePages: 2048 kB // pflash memory device >>> AnonHugePages: 2048 kB >>> >>> # ps aux | grep qemu-system-aarch64 >>> RSS: 744380 >> >> Okay, this demonstrates the patch succeeds at mapping parts of the >> pflash memory as holes. >> >> Do the guests in these QEMU processes run? > > Yes. Good to know, thanks. >>> Obviously, there are at least 100MiB memory saved for each guest. >> >> For a definition of "memory". >> >> Next question: what impact on system performance do you observe? >> >> Let me explain. >> >> Virtual memory holes get filled in by demand paging on access. In other >> words, they remain holes only as long as nothing accesses the memory. >> >> Without your patch, we allocate pages at image read time and fill them >> with zeroes. If we don't access them again, the kernel will eventually >> page them out (assuming you're running with swap). So the steady state >> is "we waste some swap space", not "we waste some physical RAM". 
>> > > Not everybody wants to run with swap because it may cause low performance. Someone running without swap because he heard someone say someone said swap may be slow is probably throwing away performance. But I assume you mean people running without swap because they measured their workload and found it more performant without swap. Legitimate. >> Your patch lets us map pflash memory pages containing only zeros as >> holes. >> >> For pages that never get accessed, your patch avoids page allocation, >> filling with zeroes, writing to swap (all one-time costs), and saves >> some swap space (not commonly an issue). >> >> For pflash memory that gets accessed, your patch merely delays page >> allocation from image read time to first access. >> >> I wonder how these savings and delays affect actual system performance. >> Without an observable change in system performance, all we'd accomplish >> is changing a bunch of numers in /proc/$pid/. >> >> What improvement(s) can you observe? > > We only use pflash device for UEFI, and we hardly care about the performance. > I think the bottleneck of the performance is the MMIO emulation, even this > patch would delay page allocation at the first access. I wasn't inquiring about the performance of the pflash device. I was inquiring about *system* performance. But let me rephrase my question. Doing work to save resources is only worthwhile if something valuable gets better in a measurable way. I'm asking you (1) to explain what exactly you value, and (2) to provide measurements that show improvement. >> I guess the best case for your patch is many guests with relatively >> small RAM sizes.
On 2019/5/10 23:16, Markus Armbruster wrote: > Xiang Zheng <zhengxiang9@huawei.com> writes: > >> On 2019/5/9 19:59, Markus Armbruster wrote: >>> Xiang Zheng <zhengxiang9@huawei.com> writes: >>> >>>> On 2019/5/8 21:20, Markus Armbruster wrote: >>>>> Laszlo Ersek <lersek@redhat.com> writes: >>>>> >>>>>> Hi Markus, >>>>>> >>>>>> On 05/07/19 20:01, Markus Armbruster wrote: >>>>>>> The subject is slightly misleading. Holes read as zero. So do >>>>>>> non-holes full of zeroes. The patch avoids reading the former, but >>>>>>> still reads the latter. >>>>>>> >>>>>>> Xiang Zheng <zhengxiang9@huawei.com> writes: >>>>>>> >>>>>>>> Currently we fill the memory space with two 64MB NOR images when >>>>>>>> using persistent UEFI variables on virt board. Actually we only use >>>>>>>> a very small(non-zero) part of the memory while the rest significant >>>>>>>> large(zero) part of memory is wasted. >>>>>>> >>>>>>> Neglects to mention that the "virt board" is ARM. >>>>>>> >>>>>>>> So this patch checks the block status and only writes the non-zero part >>>>>>>> into memory. This requires pflash devices to use sparse files for >>>>>>>> backends. >>>>>>> >>>>>>> I started to draft an improved commit message, but then I realized this >>>>>>> patch can't work. >>>>>>> >>>>>>> The pflash_cfi01 device allocates its device memory like this: >>>>>>> >>>>>>> memory_region_init_rom_device( >>>>>>> &pfl->mem, OBJECT(dev), >>>>>>> &pflash_cfi01_ops, >>>>>>> pfl, >>>>>>> pfl->name, total_len, &local_err); >>>>>>> >>>>>>> pflash_cfi02 is similar. >>>>>>> >>>>>>> memory_region_init_rom_device() calls >>>>>>> memory_region_init_rom_device_nomigrate() calls qemu_ram_alloc() calls >>>>>>> qemu_ram_alloc_internal() calls g_malloc0(). Thus, all the device >>>>>>> memory gets written to even with this patch. >>>>>> >>>>>> As far as I can see, qemu_ram_alloc_internal() calls g_malloc0() only to >>>>>> allocate the the new RAMBlock object called "new_block". 
The actual >>>>>> guest RAM allocation occurs inside ram_block_add(), which is also called >>>>>> by qemu_ram_alloc_internal(). >>>>> >>>>> You're right. I should've read more attentively. >>>>> >>>>>> One frame outwards the stack, qemu_ram_alloc() passes NULL to >>>>>> qemu_ram_alloc_internal(), for the 4th ("host") parameter. Therefore, in >>>>>> qemu_ram_alloc_internal(), we set "new_block->host" to NULL as well. >>>>>> >>>>>> Then in ram_block_add(), we take the (!new_block->host) branch, and call >>>>>> phys_mem_alloc(). >>>>>> >>>>>> Unfortunately, "phys_mem_alloc" is a function pointer, set with >>>>>> phys_mem_set_alloc(). The phys_mem_set_alloc() function is called from >>>>>> "target/s390x/kvm.c" (setting the function pointer to >>>>>> legacy_s390_alloc()), so it doesn't apply in this case. Therefore we end >>>>>> up calling the default qemu_anon_ram_alloc() function, through the >>>>>> funcptr. (I think anyway.) >>>>>> >>>>>> And qemu_anon_ram_alloc() boils down to mmap() + MAP_ANONYMOUS, in >>>>>> qemu_ram_mmap(). (Even on PPC64 hosts, because qemu_anon_ram_alloc() >>>>>> passes (-1) for "fd".) >>>>>> >>>>>> I may have missed something, of course -- I obviously didn't test it, >>>>>> just speculated from the source. >>>>> >>>>> Thanks for your sleuthing! >>>>> >>>>>>> I'm afraid you neglected to test. >>>>> >>>>> Accusation actually unsupported. I apologize, and replace it by a >>>>> question: have you observed the improvement you're trying to achieve, >>>>> and if yes, how? >>>>> >>>> >>>> Yes, we need to create sparse files as the backing images for pflash device. >>>> To create sparse files like: >>>> >>>> dd of="QEMU_EFI-pflash.raw" if="/dev/zero" bs=1M seek=64 count=0 >>>> dd of="QEMU_EFI-pflash.raw" if="QEMU_EFI.fd" conv=notrunc >>> >>> This creates a copy of firmware binary QEMU_EFI.fd padded with a hole to >>> 64MiB. >>> >>>> dd of="empty_VARS.fd" if="/dev/zero" bs=1M seek=64 count=0 >>> >>> This creates the varstore as a 64MiB hole. 
As far as I know (very >>> little), you should use the varstore template that comes with the >>> firmware binary. >>> >>> I use >>> >>> cp --sparse=always bld/pc-bios/edk2-arm-vars.fd . >>> cp --sparse=always bld/pc-bios/edk2-aarch64-code.fd . >>> >>> These guys are already zero-padded, and I use cp to sparsify. >>> >>>> Start a VM with below commandline: >>>> >>>> -drive file=/usr/share/edk2/aarch64/QEMU_EFI-pflash.raw,if=pflash,format=raw,unit=0,readonly=on\ >>>> -drive file=/usr/share/edk2/aarch64/empty_VARS.fd,if=pflash,format=raw,unit=1 \ >>>> >>>> Then observe the memory usage of the qemu process (THP is on). >>>> >>>> 1) Without this patch: >>>> # cat /proc/`pidof qemu-system-aarch64`/smaps | grep AnonHugePages: | grep -v ' 0 kB' >>>> AnonHugePages: 706560 kB >>>> AnonHugePages: 2048 kB >>>> AnonHugePages: 65536 kB // pflash memory device >>>> AnonHugePages: 65536 kB // pflash memory device >>>> AnonHugePages: 2048 kB >>>> >>>> # ps aux | grep qemu-system-aarch64 >>>> RSS: 879684 >>>> >>>> 2) After applying this patch: >>>> # cat /proc/`pidof qemu-system-aarch64`/smaps | grep AnonHugePages: | grep -v ' 0 kB' >>>> AnonHugePages: 700416 kB >>>> AnonHugePages: 2048 kB >>>> AnonHugePages: 2048 kB // pflash memory device >>>> AnonHugePages: 2048 kB // pflash memory device >>>> AnonHugePages: 2048 kB >>>> >>>> # ps aux | grep qemu-system-aarch64 >>>> RSS: 744380 >>> >>> Okay, this demonstrates the patch succeeds at mapping parts of the >>> pflash memory as holes. >>> >>> Do the guests in these QEMU processes run? >> >> Yes. > > Good to know, thanks. > >>>> Obviously, there are at least 100MiB memory saved for each guest. >>> >>> For a definition of "memory". >>> >>> Next question: what impact on system performance do you observe? >>> >>> Let me explain. >>> >>> Virtual memory holes get filled in by demand paging on access. In other >>> words, they remain holes only as long as nothing accesses the memory. 
>>> >>> Without your patch, we allocate pages at image read time and fill them >>> with zeroes. If we don't access them again, the kernel will eventually >>> page them out (assuming you're running with swap). So the steady state >>> is "we waste some swap space", not "we waste some physical RAM". >>> >> >> Not everybody wants to run with swap because it may cause low performance. > > Someone running without swap because he heard someone say someone said > swap may be slow is probably throwing away performance. > > But I assume you mean people running without swap because they measured > their workload and found it more performant without swap. Legitimate. Yes, and I had ever suffered from the high IO waits with swap.:) > >>> Your patch lets us map pflash memory pages containing only zeros as >>> holes. >>> >>> For pages that never get accessed, your patch avoids page allocation, >>> filling with zeroes, writing to swap (all one-time costs), and saves >>> some swap space (not commonly an issue). >>> >>> For pflash memory that gets accessed, your patch merely delays page >>> allocation from image read time to first access. >>> >>> I wonder how these savings and delays affect actual system performance. >>> Without an observable change in system performance, all we'd accomplish >>> is changing a bunch of numers in /proc/$pid/. >>> >>> What improvement(s) can you observe? >> >> We only use pflash device for UEFI, and we hardly care about the performance. >> I think the bottleneck of the performance is the MMIO emulation, even this >> patch would delay page allocation at the first access. > > I wasn't inquiring about the performance of the pflash device. I was > inquiring about *system* performance. But let me rephrase my question. > > Doing work to save resources is only worthwhile if something valuable > gets better in a measurable way. I'm asking you > > (1) to explain what exactly you value, and > > (2) to provide measurements that show improvement. 
> What we value is precisely the cost of memory resources, and that is the only thing this patch aims to reduce. I am confused about why you think it would impact system performance. Did I neglect something? >>> I guess the best case for your patch is many guests with relatively >>> small RAM sizes. > > . >
Xiang Zheng <zhengxiang9@huawei.com> writes: > On 2019/5/10 23:16, Markus Armbruster wrote: >> Xiang Zheng <zhengxiang9@huawei.com> writes: >> >>> On 2019/5/9 19:59, Markus Armbruster wrote: >>>> Xiang Zheng <zhengxiang9@huawei.com> writes: >>>> >>>>> On 2019/5/8 21:20, Markus Armbruster wrote: >>>>>> Laszlo Ersek <lersek@redhat.com> writes: >>>>>> >>>>>>> Hi Markus, >>>>>>> >>>>>>> On 05/07/19 20:01, Markus Armbruster wrote: >>>>>>>> The subject is slightly misleading. Holes read as zero. So do >>>>>>>> non-holes full of zeroes. The patch avoids reading the former, but >>>>>>>> still reads the latter. >>>>>>>> >>>>>>>> Xiang Zheng <zhengxiang9@huawei.com> writes: >>>>>>>> >>>>>>>>> Currently we fill the memory space with two 64MB NOR images when >>>>>>>>> using persistent UEFI variables on virt board. Actually we only use >>>>>>>>> a very small(non-zero) part of the memory while the rest significant >>>>>>>>> large(zero) part of memory is wasted. >>>>>>>> >>>>>>>> Neglects to mention that the "virt board" is ARM. >>>>>>>> >>>>>>>>> So this patch checks the block status and only writes the non-zero part >>>>>>>>> into memory. This requires pflash devices to use sparse files for >>>>>>>>> backends. >>>>>>>> >>>>>>>> I started to draft an improved commit message, but then I realized this >>>>>>>> patch can't work. >>>>>>>> >>>>>>>> The pflash_cfi01 device allocates its device memory like this: >>>>>>>> >>>>>>>> memory_region_init_rom_device( >>>>>>>> &pfl->mem, OBJECT(dev), >>>>>>>> &pflash_cfi01_ops, >>>>>>>> pfl, >>>>>>>> pfl->name, total_len, &local_err); >>>>>>>> >>>>>>>> pflash_cfi02 is similar. >>>>>>>> >>>>>>>> memory_region_init_rom_device() calls >>>>>>>> memory_region_init_rom_device_nomigrate() calls qemu_ram_alloc() calls >>>>>>>> qemu_ram_alloc_internal() calls g_malloc0(). Thus, all the device >>>>>>>> memory gets written to even with this patch. 
>>>>>>> >>>>>>> As far as I can see, qemu_ram_alloc_internal() calls g_malloc0() only to >>>>>>> allocate the the new RAMBlock object called "new_block". The actual >>>>>>> guest RAM allocation occurs inside ram_block_add(), which is also called >>>>>>> by qemu_ram_alloc_internal(). >>>>>> >>>>>> You're right. I should've read more attentively. >>>>>> >>>>>>> One frame outwards the stack, qemu_ram_alloc() passes NULL to >>>>>>> qemu_ram_alloc_internal(), for the 4th ("host") parameter. Therefore, in >>>>>>> qemu_ram_alloc_internal(), we set "new_block->host" to NULL as well. >>>>>>> >>>>>>> Then in ram_block_add(), we take the (!new_block->host) branch, and call >>>>>>> phys_mem_alloc(). >>>>>>> >>>>>>> Unfortunately, "phys_mem_alloc" is a function pointer, set with >>>>>>> phys_mem_set_alloc(). The phys_mem_set_alloc() function is called from >>>>>>> "target/s390x/kvm.c" (setting the function pointer to >>>>>>> legacy_s390_alloc()), so it doesn't apply in this case. Therefore we end >>>>>>> up calling the default qemu_anon_ram_alloc() function, through the >>>>>>> funcptr. (I think anyway.) >>>>>>> >>>>>>> And qemu_anon_ram_alloc() boils down to mmap() + MAP_ANONYMOUS, in >>>>>>> qemu_ram_mmap(). (Even on PPC64 hosts, because qemu_anon_ram_alloc() >>>>>>> passes (-1) for "fd".) >>>>>>> >>>>>>> I may have missed something, of course -- I obviously didn't test it, >>>>>>> just speculated from the source. >>>>>> >>>>>> Thanks for your sleuthing! >>>>>> >>>>>>>> I'm afraid you neglected to test. >>>>>> >>>>>> Accusation actually unsupported. I apologize, and replace it by a >>>>>> question: have you observed the improvement you're trying to achieve, >>>>>> and if yes, how? >>>>>> >>>>> >>>>> Yes, we need to create sparse files as the backing images for pflash device. 
>>>>> To create sparse files like: >>>>> >>>>> dd of="QEMU_EFI-pflash.raw" if="/dev/zero" bs=1M seek=64 count=0 >>>>> dd of="QEMU_EFI-pflash.raw" if="QEMU_EFI.fd" conv=notrunc >>>> >>>> This creates a copy of firmware binary QEMU_EFI.fd padded with a hole to >>>> 64MiB. >>>> >>>>> dd of="empty_VARS.fd" if="/dev/zero" bs=1M seek=64 count=0 >>>> >>>> This creates the varstore as a 64MiB hole. As far as I know (very >>>> little), you should use the varstore template that comes with the >>>> firmware binary. >>>> >>>> I use >>>> >>>> cp --sparse=always bld/pc-bios/edk2-arm-vars.fd . >>>> cp --sparse=always bld/pc-bios/edk2-aarch64-code.fd . >>>> >>>> These guys are already zero-padded, and I use cp to sparsify. >>>> >>>>> Start a VM with below commandline: >>>>> >>>>> -drive file=/usr/share/edk2/aarch64/QEMU_EFI-pflash.raw,if=pflash,format=raw,unit=0,readonly=on\ >>>>> -drive file=/usr/share/edk2/aarch64/empty_VARS.fd,if=pflash,format=raw,unit=1 \ >>>>> >>>>> Then observe the memory usage of the qemu process (THP is on). >>>>> >>>>> 1) Without this patch: >>>>> # cat /proc/`pidof qemu-system-aarch64`/smaps | grep AnonHugePages: | grep -v ' 0 kB' >>>>> AnonHugePages: 706560 kB >>>>> AnonHugePages: 2048 kB >>>>> AnonHugePages: 65536 kB // pflash memory device >>>>> AnonHugePages: 65536 kB // pflash memory device >>>>> AnonHugePages: 2048 kB >>>>> >>>>> # ps aux | grep qemu-system-aarch64 >>>>> RSS: 879684 >>>>> >>>>> 2) After applying this patch: >>>>> # cat /proc/`pidof qemu-system-aarch64`/smaps | grep AnonHugePages: | grep -v ' 0 kB' >>>>> AnonHugePages: 700416 kB >>>>> AnonHugePages: 2048 kB >>>>> AnonHugePages: 2048 kB // pflash memory device >>>>> AnonHugePages: 2048 kB // pflash memory device >>>>> AnonHugePages: 2048 kB >>>>> >>>>> # ps aux | grep qemu-system-aarch64 >>>>> RSS: 744380 >>>> >>>> Okay, this demonstrates the patch succeeds at mapping parts of the >>>> pflash memory as holes. >>>> >>>> Do the guests in these QEMU processes run? >>> >>> Yes. 
>> >> Good to know, thanks. >> >>>>> Obviously, there are at least 100MiB memory saved for each guest. >>>> >>>> For a definition of "memory". >>>> >>>> Next question: what impact on system performance do you observe? >>>> >>>> Let me explain. >>>> >>>> Virtual memory holes get filled in by demand paging on access. In other >>>> words, they remain holes only as long as nothing accesses the memory. >>>> >>>> Without your patch, we allocate pages at image read time and fill them >>>> with zeroes. If we don't access them again, the kernel will eventually >>>> page them out (assuming you're running with swap). So the steady state >>>> is "we waste some swap space", not "we waste some physical RAM". >>>> >>> >>> Not everybody wants to run with swap because it may cause low performance. >> >> Someone running without swap because he heard someone say someone said >> swap may be slow is probably throwing away performance. >> >> But I assume you mean people running without swap because they measured >> their workload and found it more performant without swap. Legitimate. > > Yes, and I had ever suffered from the high IO waits with swap.:) > >> >>>> Your patch lets us map pflash memory pages containing only zeros as >>>> holes. >>>> >>>> For pages that never get accessed, your patch avoids page allocation, >>>> filling with zeroes, writing to swap (all one-time costs), and saves >>>> some swap space (not commonly an issue). >>>> >>>> For pflash memory that gets accessed, your patch merely delays page >>>> allocation from image read time to first access. >>>> >>>> I wonder how these savings and delays affect actual system performance. >>>> Without an observable change in system performance, all we'd accomplish >>>> is changing a bunch of numers in /proc/$pid/. >>>> >>>> What improvement(s) can you observe? >>> >>> We only use pflash device for UEFI, and we hardly care about the performance. 
>>> I think the bottleneck of the performance is the MMIO emulation, even this >>> patch would delay page allocation at the first access. >> >> I wasn't inquiring about the performance of the pflash device. I was >> inquiring about *system* performance. But let me rephrase my question. >> >> Doing work to save resources is only worthwhile if something valuable >> gets better in a measurable way. I'm asking you >> >> (1) to explain what exactly you value, and >> >> (2) to provide measurements that show improvement. >> > > What we exactly value is the cost of memory resources and it is the only > thing that this patch aims to resolve. Then measure this cost! > I am confused that why you think it will impact the system performance? Did I > neglect something? If the patch does not impact how the system as a whole performs, then it's useless. Since you find it useful, it must have some valuable[*] observable effect for you. Tell us about it! I keep asking not to torment you, but to guide you towards building a compelling justification for your patch. However, I can only show you the path; the walking you'll have to do yourself. >>>> I guess the best case for your patch is many guests with relatively >>>> small RAM sizes. >> >> . >> [*] Changing a bunch of numbers in /proc is not valuable.
Am 13.05.2019 um 13:59 hat Markus Armbruster geschrieben: > Xiang Zheng <zhengxiang9@huawei.com> writes: > > > On 2019/5/10 23:16, Markus Armbruster wrote: > >> Xiang Zheng <zhengxiang9@huawei.com> writes: > >> > >>> On 2019/5/9 19:59, Markus Armbruster wrote: > >>>> Xiang Zheng <zhengxiang9@huawei.com> writes: > >>>> > >>>>> On 2019/5/8 21:20, Markus Armbruster wrote: > >>>>>> Laszlo Ersek <lersek@redhat.com> writes: > >>>>>> > >>>>>>> Hi Markus, > >>>>>>> > >>>>>>> On 05/07/19 20:01, Markus Armbruster wrote: > >>>>>>>> The subject is slightly misleading. Holes read as zero. So do > >>>>>>>> non-holes full of zeroes. The patch avoids reading the former, but > >>>>>>>> still reads the latter. > >>>>>>>> > >>>>>>>> Xiang Zheng <zhengxiang9@huawei.com> writes: > >>>>>>>> > >>>>>>>>> Currently we fill the memory space with two 64MB NOR images when > >>>>>>>>> using persistent UEFI variables on virt board. Actually we only use > >>>>>>>>> a very small(non-zero) part of the memory while the rest significant > >>>>>>>>> large(zero) part of memory is wasted. > >>>>>>>> > >>>>>>>> Neglects to mention that the "virt board" is ARM. > >>>>>>>> > >>>>>>>>> So this patch checks the block status and only writes the non-zero part > >>>>>>>>> into memory. This requires pflash devices to use sparse files for > >>>>>>>>> backends. > >>>>>>>> > >>>>>>>> I started to draft an improved commit message, but then I realized this > >>>>>>>> patch can't work. > >>>>>>>> > >>>>>>>> The pflash_cfi01 device allocates its device memory like this: > >>>>>>>> > >>>>>>>> memory_region_init_rom_device( > >>>>>>>> &pfl->mem, OBJECT(dev), > >>>>>>>> &pflash_cfi01_ops, > >>>>>>>> pfl, > >>>>>>>> pfl->name, total_len, &local_err); > >>>>>>>> > >>>>>>>> pflash_cfi02 is similar. > >>>>>>>> > >>>>>>>> memory_region_init_rom_device() calls > >>>>>>>> memory_region_init_rom_device_nomigrate() calls qemu_ram_alloc() calls > >>>>>>>> qemu_ram_alloc_internal() calls g_malloc0(). 
Thus, all the device > >>>>>>>> memory gets written to even with this patch. > >>>>>>> > >>>>>>> As far as I can see, qemu_ram_alloc_internal() calls g_malloc0() only to > >>>>>>> allocate the the new RAMBlock object called "new_block". The actual > >>>>>>> guest RAM allocation occurs inside ram_block_add(), which is also called > >>>>>>> by qemu_ram_alloc_internal(). > >>>>>> > >>>>>> You're right. I should've read more attentively. > >>>>>> > >>>>>>> One frame outwards the stack, qemu_ram_alloc() passes NULL to > >>>>>>> qemu_ram_alloc_internal(), for the 4th ("host") parameter. Therefore, in > >>>>>>> qemu_ram_alloc_internal(), we set "new_block->host" to NULL as well. > >>>>>>> > >>>>>>> Then in ram_block_add(), we take the (!new_block->host) branch, and call > >>>>>>> phys_mem_alloc(). > >>>>>>> > >>>>>>> Unfortunately, "phys_mem_alloc" is a function pointer, set with > >>>>>>> phys_mem_set_alloc(). The phys_mem_set_alloc() function is called from > >>>>>>> "target/s390x/kvm.c" (setting the function pointer to > >>>>>>> legacy_s390_alloc()), so it doesn't apply in this case. Therefore we end > >>>>>>> up calling the default qemu_anon_ram_alloc() function, through the > >>>>>>> funcptr. (I think anyway.) > >>>>>>> > >>>>>>> And qemu_anon_ram_alloc() boils down to mmap() + MAP_ANONYMOUS, in > >>>>>>> qemu_ram_mmap(). (Even on PPC64 hosts, because qemu_anon_ram_alloc() > >>>>>>> passes (-1) for "fd".) > >>>>>>> > >>>>>>> I may have missed something, of course -- I obviously didn't test it, > >>>>>>> just speculated from the source. > >>>>>> > >>>>>> Thanks for your sleuthing! > >>>>>> > >>>>>>>> I'm afraid you neglected to test. > >>>>>> > >>>>>> Accusation actually unsupported. I apologize, and replace it by a > >>>>>> question: have you observed the improvement you're trying to achieve, > >>>>>> and if yes, how? > >>>>>> > >>>>> > >>>>> Yes, we need to create sparse files as the backing images for pflash device. 
> >>>>> To create sparse files like: > >>>>> > >>>>> dd of="QEMU_EFI-pflash.raw" if="/dev/zero" bs=1M seek=64 count=0 > >>>>> dd of="QEMU_EFI-pflash.raw" if="QEMU_EFI.fd" conv=notrunc > >>>> > >>>> This creates a copy of firmware binary QEMU_EFI.fd padded with a hole to > >>>> 64MiB. > >>>> > >>>>> dd of="empty_VARS.fd" if="/dev/zero" bs=1M seek=64 count=0 > >>>> > >>>> This creates the varstore as a 64MiB hole. As far as I know (very > >>>> little), you should use the varstore template that comes with the > >>>> firmware binary. > >>>> > >>>> I use > >>>> > >>>> cp --sparse=always bld/pc-bios/edk2-arm-vars.fd . > >>>> cp --sparse=always bld/pc-bios/edk2-aarch64-code.fd . > >>>> > >>>> These guys are already zero-padded, and I use cp to sparsify. > >>>> > >>>>> Start a VM with below commandline: > >>>>> > >>>>> -drive file=/usr/share/edk2/aarch64/QEMU_EFI-pflash.raw,if=pflash,format=raw,unit=0,readonly=on\ > >>>>> -drive file=/usr/share/edk2/aarch64/empty_VARS.fd,if=pflash,format=raw,unit=1 \ > >>>>> > >>>>> Then observe the memory usage of the qemu process (THP is on). 
> >>>>> > >>>>> 1) Without this patch: > >>>>> # cat /proc/`pidof qemu-system-aarch64`/smaps | grep AnonHugePages: | grep -v ' 0 kB' > >>>>> AnonHugePages: 706560 kB > >>>>> AnonHugePages: 2048 kB > >>>>> AnonHugePages: 65536 kB // pflash memory device > >>>>> AnonHugePages: 65536 kB // pflash memory device > >>>>> AnonHugePages: 2048 kB > >>>>> > >>>>> # ps aux | grep qemu-system-aarch64 > >>>>> RSS: 879684 > >>>>> > >>>>> 2) After applying this patch: > >>>>> # cat /proc/`pidof qemu-system-aarch64`/smaps | grep AnonHugePages: | grep -v ' 0 kB' > >>>>> AnonHugePages: 700416 kB > >>>>> AnonHugePages: 2048 kB > >>>>> AnonHugePages: 2048 kB // pflash memory device > >>>>> AnonHugePages: 2048 kB // pflash memory device > >>>>> AnonHugePages: 2048 kB > >>>>> > >>>>> # ps aux | grep qemu-system-aarch64 > >>>>> RSS: 744380 > >>>> > >>>> Okay, this demonstrates the patch succeeds at mapping parts of the > >>>> pflash memory as holes. > >>>> > >>>> Do the guests in these QEMU processes run? > >>> > >>> Yes. > >> > >> Good to know, thanks. > >> > >>>>> Obviously, there are at least 100MiB memory saved for each guest. > >>>> > >>>> For a definition of "memory". > >>>> > >>>> Next question: what impact on system performance do you observe? > >>>> > >>>> Let me explain. > >>>> > >>>> Virtual memory holes get filled in by demand paging on access. In other > >>>> words, they remain holes only as long as nothing accesses the memory. > >>>> > >>>> Without your patch, we allocate pages at image read time and fill them > >>>> with zeroes. If we don't access them again, the kernel will eventually > >>>> page them out (assuming you're running with swap). So the steady state > >>>> is "we waste some swap space", not "we waste some physical RAM". > >>>> > >>> > >>> Not everybody wants to run with swap because it may cause low performance. > >> > >> Someone running without swap because he heard someone say someone said > >> swap may be slow is probably throwing away performance. 
> >> > >> But I assume you mean people running without swap because they measured > >> their workload and found it more performant without swap. Legitimate. > > > > Yes, and I had ever suffered from the high IO waits with swap.:) > > > >> > >>>> Your patch lets us map pflash memory pages containing only zeros as > >>>> holes. > >>>> > >>>> For pages that never get accessed, your patch avoids page allocation, > >>>> filling with zeroes, writing to swap (all one-time costs), and saves > >>>> some swap space (not commonly an issue). > >>>> > >>>> For pflash memory that gets accessed, your patch merely delays page > >>>> allocation from image read time to first access. > >>>> > >>>> I wonder how these savings and delays affect actual system performance. > >>>> Without an observable change in system performance, all we'd accomplish > >>>> is changing a bunch of numers in /proc/$pid/. > >>>> > >>>> What improvement(s) can you observe? > >>> > >>> We only use pflash device for UEFI, and we hardly care about the performance. > >>> I think the bottleneck of the performance is the MMIO emulation, even this > >>> patch would delay page allocation at the first access. > >> > >> I wasn't inquiring about the performance of the pflash device. I was > >> inquiring about *system* performance. But let me rephrase my question. > >> > >> Doing work to save resources is only worthwhile if something valuable > >> gets better in a measurable way. I'm asking you > >> > >> (1) to explain what exactly you value, and > >> > >> (2) to provide measurements that show improvement. > >> > > > > What we exactly value is the cost of memory resources and it is the only > > thing that this patch aims to resolve. > > Then measure this cost! > > > I am confused that why you think it will impact the system performance? Did I > > neglect something? > > If the patch does not impact how the system as a whole performs, then > it's useless. 
> > Since you find it useful, it must have some valuable[*] observable > effect for you. Tell us about it! > > I keep asking not to torment you, but to guide you towards building a > compelling justification for your patch. However, I can only show you > the path; the walking you'll have to do yourself. Is this discussion really a good use of our time? The patch is simple, and a few obvious improvements it brings were mentioned (even by yourself), such as avoiding OOM without swap; and with swap enabled, saving swap space for more useful content and avoiding unnecessary I/O from swap accesses. You may consider these improvements negligible, but even small improvements can add up. If you really want to measure them, it should be clear how to do it. I don't see value in actually setting up such environments just to get some numbers that show what we already know. So, what's the downside of the patch? The worst case is that the memory usage numbers only look better, but most people don't have a use case where the improvement matters. There might be some maintenance cost associated with the code, but it's small and I suspect this discussion has already cost us more time than maintaining the code will ever cost us. So why not just take it? Kevin
Kevin Wolf <kwolf@redhat.com> writes: > Am 13.05.2019 um 13:59 hat Markus Armbruster geschrieben: >> Xiang Zheng <zhengxiang9@huawei.com> writes: >> >> > On 2019/5/10 23:16, Markus Armbruster wrote: >> >> Xiang Zheng <zhengxiang9@huawei.com> writes: >> >> >> >>> On 2019/5/9 19:59, Markus Armbruster wrote: >> >>>> Xiang Zheng <zhengxiang9@huawei.com> writes: >> >>>> >> >>>>> On 2019/5/8 21:20, Markus Armbruster wrote: >> >>>>>> Laszlo Ersek <lersek@redhat.com> writes: >> >>>>>> >> >>>>>>> Hi Markus, >> >>>>>>> >> >>>>>>> On 05/07/19 20:01, Markus Armbruster wrote: >> >>>>>>>> The subject is slightly misleading. Holes read as zero. So do >> >>>>>>>> non-holes full of zeroes. The patch avoids reading the former, but >> >>>>>>>> still reads the latter. >> >>>>>>>> >> >>>>>>>> Xiang Zheng <zhengxiang9@huawei.com> writes: >> >>>>>>>> >> >>>>>>>>> Currently we fill the memory space with two 64MB NOR images when >> >>>>>>>>> using persistent UEFI variables on virt board. Actually we only use >> >>>>>>>>> a very small(non-zero) part of the memory while the rest significant >> >>>>>>>>> large(zero) part of memory is wasted. >> >>>>>>>> >> >>>>>>>> Neglects to mention that the "virt board" is ARM. >> >>>>>>>> >> >>>>>>>>> So this patch checks the block status and only writes the non-zero part >> >>>>>>>>> into memory. This requires pflash devices to use sparse files for >> >>>>>>>>> backends. >> >>>>>>>> >> >>>>>>>> I started to draft an improved commit message, but then I realized this >> >>>>>>>> patch can't work. >> >>>>>>>> >> >>>>>>>> The pflash_cfi01 device allocates its device memory like this: >> >>>>>>>> >> >>>>>>>> memory_region_init_rom_device( >> >>>>>>>> &pfl->mem, OBJECT(dev), >> >>>>>>>> &pflash_cfi01_ops, >> >>>>>>>> pfl, >> >>>>>>>> pfl->name, total_len, &local_err); >> >>>>>>>> >> >>>>>>>> pflash_cfi02 is similar. 
>> >>>>>>>> >> >>>>>>>> memory_region_init_rom_device() calls >> >>>>>>>> memory_region_init_rom_device_nomigrate() calls qemu_ram_alloc() calls >> >>>>>>>> qemu_ram_alloc_internal() calls g_malloc0(). Thus, all the device >> >>>>>>>> memory gets written to even with this patch. >> >>>>>>> >> >>>>>>> As far as I can see, qemu_ram_alloc_internal() calls g_malloc0() only to >> >>>>>>> allocate the the new RAMBlock object called "new_block". The actual >> >>>>>>> guest RAM allocation occurs inside ram_block_add(), which is also called >> >>>>>>> by qemu_ram_alloc_internal(). >> >>>>>> >> >>>>>> You're right. I should've read more attentively. >> >>>>>> >> >>>>>>> One frame outwards the stack, qemu_ram_alloc() passes NULL to >> >>>>>>> qemu_ram_alloc_internal(), for the 4th ("host") parameter. Therefore, in >> >>>>>>> qemu_ram_alloc_internal(), we set "new_block->host" to NULL as well. >> >>>>>>> >> >>>>>>> Then in ram_block_add(), we take the (!new_block->host) branch, and call >> >>>>>>> phys_mem_alloc(). >> >>>>>>> >> >>>>>>> Unfortunately, "phys_mem_alloc" is a function pointer, set with >> >>>>>>> phys_mem_set_alloc(). The phys_mem_set_alloc() function is called from >> >>>>>>> "target/s390x/kvm.c" (setting the function pointer to >> >>>>>>> legacy_s390_alloc()), so it doesn't apply in this case. Therefore we end >> >>>>>>> up calling the default qemu_anon_ram_alloc() function, through the >> >>>>>>> funcptr. (I think anyway.) >> >>>>>>> >> >>>>>>> And qemu_anon_ram_alloc() boils down to mmap() + MAP_ANONYMOUS, in >> >>>>>>> qemu_ram_mmap(). (Even on PPC64 hosts, because qemu_anon_ram_alloc() >> >>>>>>> passes (-1) for "fd".) >> >>>>>>> >> >>>>>>> I may have missed something, of course -- I obviously didn't test it, >> >>>>>>> just speculated from the source. >> >>>>>> >> >>>>>> Thanks for your sleuthing! >> >>>>>> >> >>>>>>>> I'm afraid you neglected to test. >> >>>>>> >> >>>>>> Accusation actually unsupported. 
I apologize, and replace it by a >> >>>>>> question: have you observed the improvement you're trying to achieve, >> >>>>>> and if yes, how? >> >>>>>> >> >>>>> >> >>>>> Yes, we need to create sparse files as the backing images for pflash device. >> >>>>> To create sparse files like: >> >>>>> >> >>>>> dd of="QEMU_EFI-pflash.raw" if="/dev/zero" bs=1M seek=64 count=0 >> >>>>> dd of="QEMU_EFI-pflash.raw" if="QEMU_EFI.fd" conv=notrunc >> >>>> >> >>>> This creates a copy of firmware binary QEMU_EFI.fd padded with a hole to >> >>>> 64MiB. >> >>>> >> >>>>> dd of="empty_VARS.fd" if="/dev/zero" bs=1M seek=64 count=0 >> >>>> >> >>>> This creates the varstore as a 64MiB hole. As far as I know (very >> >>>> little), you should use the varstore template that comes with the >> >>>> firmware binary. >> >>>> >> >>>> I use >> >>>> >> >>>> cp --sparse=always bld/pc-bios/edk2-arm-vars.fd . >> >>>> cp --sparse=always bld/pc-bios/edk2-aarch64-code.fd . >> >>>> >> >>>> These guys are already zero-padded, and I use cp to sparsify. >> >>>> >> >>>>> Start a VM with below commandline: >> >>>>> >> >>>>> -drive file=/usr/share/edk2/aarch64/QEMU_EFI-pflash.raw,if=pflash,format=raw,unit=0,readonly=on\ >> >>>>> -drive file=/usr/share/edk2/aarch64/empty_VARS.fd,if=pflash,format=raw,unit=1 \ >> >>>>> >> >>>>> Then observe the memory usage of the qemu process (THP is on). 
>> >>>>>
>> >>>>> 1) Without this patch:
>> >>>>> # cat /proc/`pidof qemu-system-aarch64`/smaps | grep AnonHugePages: | grep -v ' 0 kB'
>> >>>>> AnonHugePages:    706560 kB
>> >>>>> AnonHugePages:      2048 kB
>> >>>>> AnonHugePages:     65536 kB   // pflash memory device
>> >>>>> AnonHugePages:     65536 kB   // pflash memory device
>> >>>>> AnonHugePages:      2048 kB
>> >>>>>
>> >>>>> # ps aux | grep qemu-system-aarch64
>> >>>>> RSS: 879684
>> >>>>>
>> >>>>> 2) After applying this patch:
>> >>>>> # cat /proc/`pidof qemu-system-aarch64`/smaps | grep AnonHugePages: | grep -v ' 0 kB'
>> >>>>> AnonHugePages:    700416 kB
>> >>>>> AnonHugePages:      2048 kB
>> >>>>> AnonHugePages:      2048 kB   // pflash memory device
>> >>>>> AnonHugePages:      2048 kB   // pflash memory device
>> >>>>> AnonHugePages:      2048 kB
>> >>>>>
>> >>>>> # ps aux | grep qemu-system-aarch64
>> >>>>> RSS: 744380
>> >>>>
>> >>>> Okay, this demonstrates the patch succeeds at mapping parts of the
>> >>>> pflash memory as holes.
>> >>>>
>> >>>> Do the guests in these QEMU processes run?
>> >>>
>> >>> Yes.
>> >>
>> >> Good to know, thanks.
>> >>
>> >>>>> Obviously, at least 100 MiB of memory is saved for each guest.
>> >>>>
>> >>>> For a definition of "memory".
>> >>>>
>> >>>> Next question: what impact on system performance do you observe?
>> >>>>
>> >>>> Let me explain.
>> >>>>
>> >>>> Virtual memory holes get filled in by demand paging on access.  In other
>> >>>> words, they remain holes only as long as nothing accesses the memory.
>> >>>>
>> >>>> Without your patch, we allocate pages at image read time and fill them
>> >>>> with zeroes.  If we don't access them again, the kernel will eventually
>> >>>> page them out (assuming you're running with swap).  So the steady state
>> >>>> is "we waste some swap space", not "we waste some physical RAM".
>> >>>>
>> >>>
>> >>> Not everybody wants to run with swap because it may cause low performance.
>> >>
>> >> Someone running without swap because he heard someone say someone said
>> >> swap may be slow is probably throwing away performance.
>> >>
>> >> But I assume you mean people running without swap because they measured
>> >> their workload and found it more performant without swap.  Legitimate.
>> >
>> > Yes, and I have suffered from high IO waits with swap. :)
>> >
>> >>
>> >>>> Your patch lets us map pflash memory pages containing only zeros as
>> >>>> holes.
>> >>>>
>> >>>> For pages that never get accessed, your patch avoids page allocation,
>> >>>> filling with zeroes, writing to swap (all one-time costs), and saves
>> >>>> some swap space (not commonly an issue).
>> >>>>
>> >>>> For pflash memory that gets accessed, your patch merely delays page
>> >>>> allocation from image read time to first access.
>> >>>>
>> >>>> I wonder how these savings and delays affect actual system performance.
>> >>>> Without an observable change in system performance, all we'd accomplish
>> >>>> is changing a bunch of numbers in /proc/$pid/.
>> >>>>
>> >>>> What improvement(s) can you observe?
>> >>>
>> >>> We only use the pflash device for UEFI, and we hardly care about its performance.
>> >>> I think the bottleneck of the performance is the MMIO emulation, even though this
>> >>> patch delays page allocation to the first access.
>> >>
>> >> I wasn't inquiring about the performance of the pflash device.  I was
>> >> inquiring about *system* performance.  But let me rephrase my question.
>> >>
>> >> Doing work to save resources is only worthwhile if something valuable
>> >> gets better in a measurable way.  I'm asking you
>> >>
>> >> (1) to explain what exactly you value, and
>> >>
>> >> (2) to provide measurements that show improvement.
>> >>
>> >
>> > What we exactly value is the cost of memory resources, and it is the only
>> > thing that this patch aims to reduce.
>>
>> Then measure this cost!
>>
>> > I am confused about why you think it will impact the system performance.  Did I
>> > neglect something?
>>
>> If the patch does not impact how the system as a whole performs, then
>> it's useless.
>>
>> Since you find it useful, it must have some valuable[*] observable
>> effect for you.  Tell us about it!
>>
>> I keep asking not to torment you, but to guide you towards building a
>> compelling justification for your patch.  However, I can only show you
>> the path; the walking you'll have to do yourself.
>
> Is this discussion really a good use of our time?
>
> The patch is simple, and a few obvious improvements it brings were
> mentioned (even by yourself), such as avoiding OOM without swap; and
> with swap enabled, saving swap space for more useful content and
> saving unnecessary I/O related to accessing swap needlessly.
>
> You may consider these improvements negligible, but even small
> improvements can add up.  If you really want to measure them, it should
> be clear how to do it.  I don't see value in actually setting up such
> environments just to get some numbers that show what we already know.
>
> So, what's the downside of the patch?  The worst case is, the memory
> usage numbers only look better, but most people don't have a use case
> where the improvement matters.  There might be some maintenance cost
> associated with the code, but it's small, and I suspect this discussion
> has already cost us more time than maintaining the code will ever cost
> us.
>
> So why not just take it?

As is, the patch's commit message fails to meet the standards I set as a
maintainer, because (1) it's too vague on what the patch does and what
its limitations are (relies on well-behaved guests), and (2) it fails to
make the case for the patch.

Fortunately, I'm not the maintainer here, Philippe is.  My standards do
not matter.
diff --git a/hw/block/block.c b/hw/block/block.c
index bf56c76..3cb9d4c 100644
--- a/hw/block/block.c
+++ b/hw/block/block.c
@@ -15,6 +15,44 @@
 #include "qapi/qapi-types-block.h"
 
 /*
+ * Read the non-zero parts of @blk into @buf.
+ * Reading all of @blk is expensive if the zero part of @blk
+ * is large enough.  Therefore check the block status and only read
+ * the non-zero parts into @buf.
+ *
+ * Return 0 on success, negative errno on failure.
+ */
+static int blk_pread_nonzeroes(BlockBackend *blk, void *buf)
+{
+    int ret;
+    int64_t target_size, bytes, offset = 0;
+    BlockDriverState *bs = blk_bs(blk);
+
+    target_size = bdrv_getlength(bs);
+    if (target_size < 0) {
+        return target_size;
+    }
+
+    for (;;) {
+        bytes = MIN(target_size - offset, BDRV_REQUEST_MAX_BYTES);
+        if (bytes <= 0) {
+            return 0;
+        }
+        ret = bdrv_block_status(bs, offset, bytes, &bytes, NULL, NULL);
+        if (ret < 0) {
+            return ret;
+        }
+        if (!(ret & BDRV_BLOCK_ZERO)) {
+            ret = bdrv_pread(bs->file, offset, (uint8_t *) buf + offset, bytes);
+            if (ret < 0) {
+                return ret;
+            }
+        }
+        offset += bytes;
+    }
+}
+
+/*
  * Read the entire contents of @blk into @buf.
  * @blk's contents must be @size bytes, and @size must be at most
  * BDRV_REQUEST_MAX_BYTES.
@@ -53,7 +91,7 @@ bool blk_check_size_and_read_all(BlockBackend *blk, void *buf, hwaddr size,
      * block device and read only on demand.
      */
     assert(size <= BDRV_REQUEST_MAX_BYTES);
-    ret = blk_pread(blk, 0, buf, size);
+    ret = blk_pread_nonzeroes(blk, buf);
     if (ret < 0) {
         error_setg_errno(errp, -ret, "can't read block backend");
         return false;
Currently we fill the memory space with two 64 MiB NOR images when
using persistent UEFI variables on the ARM virt board.  Actually we
only use a very small (non-zero) part of that memory, while the rest,
a significantly larger (zero) part, is wasted.

So this patch checks the block status and reads only the parts the
block layer reports as non-zero into memory.  To benefit, the pflash
backing files need to be sparse; non-sparse backends keep working
unchanged, just without the memory savings.

Suggested-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Xiang Zheng <zhengxiang9@huawei.com>
---
 hw/block/block.c | 40 +++++++++++++++++++++++++++++++++++++++-
 1 file changed, 39 insertions(+), 1 deletion(-)