diff mbox series

[RFC] hw/arm/virt: use variable size of flash device to save memory

Message ID 20190325125142.11628-1-zhengxiang9@huawei.com (mailing list archive)
State New, archived
Headers show
Series [RFC] hw/arm/virt: use variable size of flash device to save memory | expand

Commit Message

Xiang Zheng March 25, 2019, 12:51 p.m. UTC
Currently we fill the VIRT_FLASH space with two 64MB NOR images when
using persistent UEFI variables on QEMU. Actually we only use a very
small part of the memory while the rest significant large part of
memory is wasted.

This patch creates and maps a variable size of flash device instead of
a mandatory 64MB one to save memory.

Signed-off-by: Xiang Zheng <zhengxiang9@huawei.com>
---

This patch might be insufficient since it also needs to modify the flash size
in ACPI and DTB.

BTW, I don't understand why it requires the two NOR images to be exactly 64MB
in size when using -pflash.

---
 hw/arm/virt.c | 11 ++++++++---
 1 file changed, 8 insertions(+), 3 deletions(-)

Comments

Peter Maydell March 25, 2019, 1:11 p.m. UTC | #1
On Mon, 25 Mar 2019 at 12:53, Xiang Zheng <zhengxiang9@huawei.com> wrote:
>
> Currently we fill the VIRT_FLASH space with two 64MB NOR images when
> using persistent UEFI variables on QEMU. Actually we only use a very
> small part of the memory while the rest significant large part of
> memory is wasted.
>
> This patch creates and maps a variable size of flash device instead of
> a mandatory 64MB one to save memory.
>
> Signed-off-by: Xiang Zheng <zhengxiang9@huawei.com>
> ---
>
> This patch might be insufficient since it also needs to modify the flash size
> in ACPI and DTB.
>
> BTW, I don't understand why it requires the two NOR images to be exactly 64MB
> in size when using -pflash.

I don't think we should do this. The board should in general
create the same hardware visible to the guest, not change
it based on subtle things like the size of the image files.

The reason why the flash images must be 64MB in size
when using -pflash is that they are the backing store
for a writable device. Suppose you have 1MB of data in your
backing image that you pass to QEMU and then the guest writes
to the last block of the flash device. The new data
written by the guest to the end of the device has to be
stored somewhere, so the file has to be large enough
to cover the whole of the flash area.

thanks
-- PMM
Xiang Zheng March 25, 2019, 2:03 p.m. UTC | #2
Hi Peter,

Thanks for your reply!

On 2019/3/25 21:11, Peter Maydell wrote:
> On Mon, 25 Mar 2019 at 12:53, Xiang Zheng <zhengxiang9@huawei.com> wrote:
>>
>> Currently we fill the VIRT_FLASH space with two 64MB NOR images when
>> using persistent UEFI variables on QEMU. Actually we only use a very
>> small part of the memory while the rest significant large part of
>> memory is wasted.
>>
>> This patch creates and maps a variable size of flash device instead of
>> a mandatory 64MB one to save memory.
>>
>> Signed-off-by: Xiang Zheng <zhengxiang9@huawei.com>
>> ---
>>
>> This patch might be insufficient since it also needs to modify the flash size
>> in ACPI and DTB.
>>
>> BTW, I don't understand why it requires the two NOR images to be exactly 64MB
>> in size when using -pflash.
> 
> I don't think we should do this. The board should in general
> create the same hardware visible to the guest, not change
> it based on subtle things like the size of the image files.
> 
> The reason why the flash images must be 64MB in size
> when using -pflash is that they are the backing store
> for a writable device. Suppose you have 1MB of data in your
> backing image that you pass to QEMU and then the guest writes
> to the last block of the flash device. The new data
> written by the guest to the end of the device has to be
> stored somewhere, so the file has to be large enough
> to cover the whole of the flash area.
> 

Is there any way to support config or limit the size that both
guest and QEMU are visible?

The original QEMU_EFI.fd has only 2M, but we need to stuff it
to 64M with 62M unused data. It will consume a large amount of
memory when running multiple VM simultaneously.
Laszlo Ersek March 25, 2019, 2:07 p.m. UTC | #3
On 03/25/19 14:11, Peter Maydell wrote:
> On Mon, 25 Mar 2019 at 12:53, Xiang Zheng <zhengxiang9@huawei.com> wrote:
>>
>> Currently we fill the VIRT_FLASH space with two 64MB NOR images when
>> using persistent UEFI variables on QEMU. Actually we only use a very
>> small part of the memory while the rest significant large part of
>> memory is wasted.
>>
>> This patch creates and maps a variable size of flash device instead of
>> a mandatory 64MB one to save memory.
>>
>> Signed-off-by: Xiang Zheng <zhengxiang9@huawei.com>
>> ---
>>
>> This patch might be insufficient since it also needs to modify the flash size
>> in ACPI and DTB.
>>
>> BTW, I don't understand why it requires the two NOR images to be exactly 64MB
>> in size when using -pflash.
> 
> I don't think we should do this. The board should in general
> create the same hardware visible to the guest, not change
> it based on subtle things like the size of the image files.

I consider this question/behavior board-specific.

For the "virt" board, I agree with you (Peter), simply because that's
how the "virt" board has always worked.

(See also the tangentially related series

  [Qemu-devel] [PATCH v8 0/2]
  pflash: Require backend size to match device, improve errors

at

  http://mid.mail-archive.com/20190319163551.32499-1-armbru@redhat.com

. This series still lets each board decide the question for itself, but
it improves error detection and reporting.)

Thanks,
Laszlo

> The reason why the flash images must be 64MB in size
> when using -pflash is that they are the backing store
> for a writable device. Suppose you have 1MB of data in your
> backing image that you pass to QEMU and then the guest writes
> to the last block of the flash device. The new data
> written by the guest to the end of the device has to be
> stored somewhere, so the file has to be large enough
> to cover the whole of the flash area.
> 
> thanks
> -- PMM
>
Markus Armbruster March 26, 2019, 6:17 a.m. UTC | #4
Zheng Xiang <zhengxiang9@huawei.com> writes:

> Hi Peter,
>
> Thanks for your reply!
>
> On 2019/3/25 21:11, Peter Maydell wrote:
>> On Mon, 25 Mar 2019 at 12:53, Xiang Zheng <zhengxiang9@huawei.com> wrote:
>>>
>>> Currently we fill the VIRT_FLASH space with two 64MB NOR images when
>>> using persistent UEFI variables on QEMU. Actually we only use a very
>>> small part of the memory while the rest significant large part of
>>> memory is wasted.
>>>
>>> This patch creates and maps a variable size of flash device instead of
>>> a mandatory 64MB one to save memory.
>>>
>>> Signed-off-by: Xiang Zheng <zhengxiang9@huawei.com>
>>> ---
>>>
>>> This patch might be insufficient since it also needs to modify the flash size
>>> in ACPI and DTB.
>>>
>>> BTW, I don't understand why it requires the two NOR images to be exactly 64MB
>>> in size when using -pflash.
>> 
>> I don't think we should do this. The board should in general
>> create the same hardware visible to the guest, not change
>> it based on subtle things like the size of the image files.

Concur.

>> The reason why the flash images must be 64MB in size
>> when using -pflash is that they are the backing store
>> for a writable device. Suppose you have 1MB of data in your
>> backing image that you pass to QEMU and then the guest writes
>> to the last block of the flash device. The new data
>> written by the guest to the end of the device has to be
>> stored somewhere, so the file has to be large enough
>> to cover the whole of the flash area.
>> 
>
> Is there any way to support config or limit the size that both
> guest and QEMU are visible?
>
> The original QEMU_EFI.fd has only 2M, but we need to stuff it
> to 64M with 62M unused data. It will consume a large amount of
> memory when running multiple VM simultaneously.

Here's a number of ideas.

The first one is of course making the flash memory size configurable, to
eliminate the "unused" part.  Our PC machines use the backing image
sizes as configuration.  I consider that a bad idea that should not be
allowed to spread to other machines.  Peter seems to agree.

The accepted way to create minor variations of a machine type is machine
properties.  Whether using them to vary flash chip size would be
acceptable is for the board maintainer to decide.

Now let's think about mapping these flash images more efficiently.

We could avoid backing their "unused" part with pages.  Unless the
"unused" part is read-only, this leaves you at your guests' mercy: they
can make the host allocate pages by writing to them.

We could share the pflash memory among VMs running the same firmware.
If writes are permitted, we'd have to unshare on write (COW).  Again,
you're at your guests' mercy unless read-only: they can make the host
unshare pages by writing to them.

I figure the "share" idea would be easier to implement[*].

Both ideas require either trusted guests or read-only flash.  EFI
generally wants some read/write flash for its var store.  Can we make
everything else read-only?

We can improve the device models to let us set up a suitable part of the
pflash memory read-only.  This is how real hardware works.  Our PC
machines currently approximate this with *two* flash chips, one
read-only, one read/write, which I consider a mistake that should not be
allowed to spread to other machines.

Prior discussions

    Message-ID: <87imxfssq1.fsf@dusky.pond.sub.org>
    https://lists.nongnu.org/archive/html/qemu-devel/2019-02/msg05056.html

and

    Message-ID: <87y378n5iy.fsf@dusky.pond.sub.org>
    https://lists.nongnu.org/archive/html/qemu-devel/2019-01/msg06606.html



[*] If you run KSM (kernel same-page merging), the kernel should set up
the sharing for you automatically.  But not everybody wants to run KSM.
Laszlo Ersek March 26, 2019, 11:03 a.m. UTC | #5
On 03/26/19 07:17, Markus Armbruster wrote:
> Zheng Xiang <zhengxiang9@huawei.com> writes:
> 
>> Hi Peter,
>>
>> Thanks for your reply!
>>
>> On 2019/3/25 21:11, Peter Maydell wrote:
>>> On Mon, 25 Mar 2019 at 12:53, Xiang Zheng <zhengxiang9@huawei.com> wrote:
>>>>
>>>> Currently we fill the VIRT_FLASH space with two 64MB NOR images when
>>>> using persistent UEFI variables on QEMU. Actually we only use a very
>>>> small part of the memory while the rest significant large part of
>>>> memory is wasted.
>>>>
>>>> This patch creates and maps a variable size of flash device instead of
>>>> a mandatory 64MB one to save memory.
>>>>
>>>> Signed-off-by: Xiang Zheng <zhengxiang9@huawei.com>
>>>> ---
>>>>
>>>> This patch might be insufficient since it also needs to modify the flash size
>>>> in ACPI and DTB.
>>>>
>>>> BTW, I don't understand why it requires the two NOR images to be exactly 64MB
>>>> in size when using -pflash.
>>>
>>> I don't think we should do this. The board should in general
>>> create the same hardware visible to the guest, not change
>>> it based on subtle things like the size of the image files.
> 
> Concur.
> 
>>> The reason why the flash images must be 64MB in size
>>> when using -pflash is that they are the backing store
>>> for a writable device. Suppose you have 1MB of data in your
>>> backing image that you pass to QEMU and then the guest writes
>>> to the last block of the flash device. The new data
>>> written by the guest to the end of the device has to be
>>> stored somewhere, so the file has to be large enough
>>> to cover the whole of the flash area.
>>>
>>
>> Is there any way to support config or limit the size that both
>> guest and QEMU are visible?
>>
>> The original QEMU_EFI.fd has only 2M, but we need to stuff it
>> to 64M with 62M unused data. It will consume a large amount of
>> memory when running multiple VM simultaneously.
> 
> Here's a number of ideas.
> 
> The first one is of course making the flash memory size configurable, to
> eliminate the "unused" part.  Our PC machines use the backing image
> sizes as configuration.  I consider that a bad idea that should not be
> allowed to spread to other machines.  Peter seems to agree.

For the full picture, we have to realize that flash size is a trade-off
for virtual machines too, not just for physical machines. UEFI firmware
tends to grow like an OS -- whether that's good or bad belongs on
another page :) --, and so you tend to need more and more space for it,
even after Link Time Optimization, and image compression, over time.

So you are left with a choice.

* You can let board logic size the flash dynamically at startup (and
then call it a "bad idea" :) ).

* Alternatively, you can hardcode the pflash size in the board setup,
perhaps depdent on machine type version, and/or dependent on some
properties. Even assuming that this diversity doesn't break migration at
once, you'll create a dependency between firmware releases (sizes) and
machine types. 'For the next release of ArmVirtQemu/edk2, you'll need
"virt-4.1" or later'.

In addition, in most cases, the firmware, when it runs from flash,
cannot dynamically adapt itself to random flash sizes and/or base
addresses, so not only will new (larger) firmware not fit on old machine
types, but old (small) firmware may also not feel "at home" on new
machine types. (Note: this is not a theoretical limitation, but it is a
*very* practical one.)

That's a kind of mapping that is taken for "obvious" in the physical
world (you get the board, your firmware comes with it, that's why it's
called *firm*), but it used to be frowned upon in the virtual world.

* Or else, you pad the pflash chips as broadly as you dare, in order to
never run into the above mess -- and then someone complains "it consumes
too many resources". :)


For some perspective, OVMF started out with a 1MB cumulative size
(executable + varstore). We ran out of 1MB with executable code, so we
introduced the 2MB build (with unchanged varstore size). Then, under
Microsoft's SVVP checks, the varstore size proved insufficient, and we
introduced the 4MB buid, with enough space in both the executable part
and the varstore part for them to last for "the foreseeable future".
And, the dynamic logic in the PC board code allows up to a 8MB
cumulative size (and that's also not a machine type property, but a cold
hard constant).

With the dynamic sizing in QEMU (which, IIRC, I had originally
introduced still in the 1MB times, due to the split between the
executable and varstore parts), both the 1MB->2MB switch, and the
2MB->4MB switch in the firmware caused zero pain in QEMU. And right now,
4MB looks like a "sweet spot", with some elbow room left.

The "virt" board took a different approach, with different benefits (no
dynamic sizing at startup, hence identical guest view) and different
problems (it's more hungry for resources). "Pick your poison."


A hopefully more constructive comment below:

> The accepted way to create minor variations of a machine type is machine
> properties.  Whether using them to vary flash chip size would be
> acceptable is for the board maintainer to decide.
> 
> Now let's think about mapping these flash images more efficiently.
> 
> We could avoid backing their "unused" part with pages.  Unless the
> "unused" part is read-only, this leaves you at your guests' mercy: they
> can make the host allocate pages by writing to them.

First, the flash backends' on-disk demand shouldn't be catastrophic,
even if we use "raw" -- the executable should be shared between all VMs
running on the host, and the varstore files (which must be private to
VMs) could be created as sparse files. For example, libvirt could be
improved to create sparse copies of the varstore *template* files.

Second, regarding physical RAM consumption: disable memory overcommit,
and set a large swap. The unused parts of the large pflash chips should
be swapped out at some point (after their initial population from disk),
and should never be swapped back in again.

Thanks
Laszlo

> 
> We could share the pflash memory among VMs running the same firmware.
> If writes are permitted, we'd have to unshare on write (COW).  Again,
> you're at your guests' mercy unless read-only: they can make the host
> unshare pages by writing to them.
> 
> I figure the "share" idea would be easier to implement[*].
> 
> Both ideas require either trusted guests or read-only flash.  EFI
> generally wants some read/write flash for its var store.  Can we make
> everything else read-only?
> 
> We can improve the device models to let us set up a suitable part of the
> pflash memory read-only.  This is how real hardware works.  Our PC
> machines currently approximate this with *two* flash chips, one
> read-only, one read/write, which I consider a mistake that should not be
> allowed to spread to other machines.
> 
> Prior discussions
> 
>     Message-ID: <87imxfssq1.fsf@dusky.pond.sub.org>
>     https://lists.nongnu.org/archive/html/qemu-devel/2019-02/msg05056.html
> 
> and
> 
>     Message-ID: <87y378n5iy.fsf@dusky.pond.sub.org>
>     https://lists.nongnu.org/archive/html/qemu-devel/2019-01/msg06606.html
> 
> 
> 
> [*] If you run KSM (kernel same-page merging), the kernel should set up
> the sharing for you automatically.  But not everybody wants to run KSM.
>
Markus Armbruster March 26, 2019, 4:39 p.m. UTC | #6
Laszlo Ersek <lersek@redhat.com> writes:

> On 03/26/19 07:17, Markus Armbruster wrote:
>> Zheng Xiang <zhengxiang9@huawei.com> writes:
>> 
>>> Hi Peter,
>>>
>>> Thanks for your reply!
>>>
>>> On 2019/3/25 21:11, Peter Maydell wrote:
>>>> On Mon, 25 Mar 2019 at 12:53, Xiang Zheng <zhengxiang9@huawei.com> wrote:
>>>>>
>>>>> Currently we fill the VIRT_FLASH space with two 64MB NOR images when
>>>>> using persistent UEFI variables on QEMU. Actually we only use a very
>>>>> small part of the memory while the rest significant large part of
>>>>> memory is wasted.
>>>>>
>>>>> This patch creates and maps a variable size of flash device instead of
>>>>> a mandatory 64MB one to save memory.
>>>>>
>>>>> Signed-off-by: Xiang Zheng <zhengxiang9@huawei.com>
>>>>> ---
>>>>>
>>>>> This patch might be insufficient since it also needs to modify the flash size
>>>>> in ACPI and DTB.
>>>>>
>>>>> BTW, I don't understand why it requires the two NOR images to be exactly 64MB
>>>>> in size when using -pflash.
>>>>
>>>> I don't think we should do this. The board should in general
>>>> create the same hardware visible to the guest, not change
>>>> it based on subtle things like the size of the image files.
>> 
>> Concur.
>> 
>>>> The reason why the flash images must be 64MB in size
>>>> when using -pflash is that they are the backing store
>>>> for a writable device. Suppose you have 1MB of data in your
>>>> backing image that you pass to QEMU and then the guest writes
>>>> to the last block of the flash device. The new data
>>>> written by the guest to the end of the device has to be
>>>> stored somewhere, so the file has to be large enough
>>>> to cover the whole of the flash area.
>>>>
>>>
>>> Is there any way to support config or limit the size that both
>>> guest and QEMU are visible?
>>>
>>> The original QEMU_EFI.fd has only 2M, but we need to stuff it
>>> to 64M with 62M unused data. It will consume a large amount of
>>> memory when running multiple VM simultaneously.
>> 
>> Here's a number of ideas.
>> 
>> The first one is of course making the flash memory size configurable, to
>> eliminate the "unused" part.  Our PC machines use the backing image
>> sizes as configuration.  I consider that a bad idea that should not be
>> allowed to spread to other machines.  Peter seems to agree.
>
> For the full picture, we have to realize that flash size is a trade-off
> for virtual machines too, not just for physical machines. UEFI firmware
> tends to grow like an OS -- whether that's good or bad belongs on
> another page :) --, and so you tend to need more and more space for it,
> even after Link Time Optimization, and image compression, over time.

Same for physical and virtual machines.

> So you are left with a choice.
>
> * You can let board logic size the flash dynamically at startup (and
> then call it a "bad idea" :) ).
>
> * Alternatively, you can hardcode the pflash size in the board setup,
> perhaps depdent on machine type version, and/or dependent on some
> properties.

The difference between getting flash memory size from the backend
vs. getting it from a machine type or property is implicit vs. explicit
configuration.

Implicit vs. explicit can have ramifications beyond the user interface.
For instance, if we ever get around to transferring device configuration
in the migration stream, the machine property would surely be part of
that, but the size of the backend won't.

>             Even assuming that this diversity doesn't break migration at
> once,

I doubt explicit could break anything that implicit couldn't :)

>       you'll create a dependency between firmware releases (sizes) and
> machine types. 'For the next release of ArmVirtQemu/edk2, you'll need
> "virt-4.1" or later'.

Yes if you tie the size to the machine type.  No if you get it from a
machine property.

> In addition, in most cases, the firmware, when it runs from flash,
> cannot dynamically adapt itself to random flash sizes and/or base
> addresses, so not only will new (larger) firmware not fit on old machine
> types, but old (small) firmware may also not feel "at home" on new
> machine types. (Note: this is not a theoretical limitation, but it is a
> *very* practical one.)

The exact same problem exists for physical machines.  You can revise
your firmware only within limits set by the board.

I don't mean to say this problem isn't worth avoiding for virtual
machines.  Only that it is neither new nor intractable.

> That's a kind of mapping that is taken for "obvious" in the physical
> world (you get the board, your firmware comes with it, that's why it's
> called *firm*), but it used to be frowned upon in the virtual world.

I'm willing to give developers of virtual firmware more flexibility than
they get in the physical world.  I just happen to dislike "implicit" and
"any multiple of 4KiB up to a limit" (because physical flash chips with
sizes like 64140KiB do not exist, and virtual ones should not).

> * Or else, you pad the pflash chips as broadly as you dare, in order to
> never run into the above mess -- and then someone complains "it consumes
> too many resources". :)

I think that complaint would be exactly as valid for unpadded firmware!
Once you get to the point where you care whether each guest's firmware
eats up 2MiB or 64MiB, what you *really* want is probably 128KiB per
guest plust $whatever shared among all guests.  You probably won't care
all that much whether $whatever is 2MiB - 128KiB or 64MiB - 128KiB.

> For some perspective, OVMF started out with a 1MB cumulative size
> (executable + varstore). We ran out of 1MB with executable code, so we
> introduced the 2MB build (with unchanged varstore size). Then, under
> Microsoft's SVVP checks, the varstore size proved insufficient, and we
> introduced the 4MB buid, with enough space in both the executable part
> and the varstore part for them to last for "the foreseeable future".
> And, the dynamic logic in the PC board code allows up to a 8MB
> cumulative size (and that's also not a machine type property, but a cold
> hard constant).
>
> With the dynamic sizing in QEMU (which, IIRC, I had originally
> introduced still in the 1MB times, due to the split between the
> executable and varstore parts), both the 1MB->2MB switch, and the
> 2MB->4MB switch in the firmware caused zero pain in QEMU. And right now,
> 4MB looks like a "sweet spot", with some elbow room left.

Explicit configuration would've been exactly as painless.  Even with
pflash sizes restricted to powers of two.

> The "virt" board took a different approach, with different benefits (no
> dynamic sizing at startup, hence identical guest view) and different
> problems (it's more hungry for resources). "Pick your poison."

Here's the one I'd like to be able to pick: have a single pflash chip
with a read-only part and a read/write part.  Map the read-only part so
it is shared among guests.  Map the read/write part normally.  Default
the sizes to something that makes sense now, with reasonable elbow room.
Make sure there's a way to grow.

> A hopefully more constructive comment below:
>
>> The accepted way to create minor variations of a machine type is machine
>> properties.  Whether using them to vary flash chip size would be
>> acceptable is for the board maintainer to decide.
>> 
>> Now let's think about mapping these flash images more efficiently.
>> 
>> We could avoid backing their "unused" part with pages.  Unless the
>> "unused" part is read-only, this leaves you at your guests' mercy: they
>> can make the host allocate pages by writing to them.
>
> First, the flash backends' on-disk demand shouldn't be catastrophic,
> even if we use "raw" -- the executable should be shared between all VMs
> running on the host, and the varstore files (which must be private to
> VMs) could be created as sparse files. For example, libvirt could be
> improved to create sparse copies of the varstore *template* files.

Same story as in memory, really: share the read-only part (i.e. the
executable) among the guests, keep the read/write part small.

> Second, regarding physical RAM consumption: disable memory overcommit,
> and set a large swap. The unused parts of the large pflash chips should
> be swapped out at some point (after their initial population from disk),
> and should never be swapped back in again.

Just providing swap should do the trick, shouldn't it?

> Thanks
> Laszlo
>
>> 
>> We could share the pflash memory among VMs running the same firmware.
>> If writes are permitted, we'd have to unshare on write (COW).  Again,
>> you're at your guests' mercy unless read-only: they can make the host
>> unshare pages by writing to them.
>> 
>> I figure the "share" idea would be easier to implement[*].
>> 
>> Both ideas require either trusted guests or read-only flash.  EFI
>> generally wants some read/write flash for its var store.  Can we make
>> everything else read-only?
>> 
>> We can improve the device models to let us set up a suitable part of the
>> pflash memory read-only.  This is how real hardware works.  Our PC
>> machines currently approximate this with *two* flash chips, one
>> read-only, one read/write, which I consider a mistake that should not be
>> allowed to spread to other machines.
>> 
>> Prior discussions
>> 
>>     Message-ID: <87imxfssq1.fsf@dusky.pond.sub.org>
>>     https://lists.nongnu.org/archive/html/qemu-devel/2019-02/msg05056.html
>> 
>> and
>> 
>>     Message-ID: <87y378n5iy.fsf@dusky.pond.sub.org>
>>     https://lists.nongnu.org/archive/html/qemu-devel/2019-01/msg06606.html
>> 
>> 
>> 
>> [*] If you run KSM (kernel same-page merging), the kernel should set up
>> the sharing for you automatically.  But not everybody wants to run KSM.
>>
Laszlo Ersek March 26, 2019, 5:10 p.m. UTC | #7
On 03/26/19 17:39, Markus Armbruster wrote:
> Laszlo Ersek <lersek@redhat.com> writes:

>> With the dynamic sizing in QEMU (which, IIRC, I had originally
>> introduced still in the 1MB times, due to the split between the
>> executable and varstore parts), both the 1MB->2MB switch, and the
>> 2MB->4MB switch in the firmware caused zero pain in QEMU. And right now,
>> 4MB looks like a "sweet spot", with some elbow room left.
> 
> Explicit configuration would've been exactly as painless.  Even with
> pflash sizes restricted to powers of two.

I wrote the patch that ended up as commit 637a5acb46b3 -- with your R-b
-- in 2013. I'm unsure if machine type properties existed back then, but
even if they did, do you think I knew about them? :)

You are right, of course; it's just that we can't tell the future.

>> Second, regarding physical RAM consumption: disable memory overcommit,
>> and set a large swap. The unused parts of the large pflash chips should
>> be swapped out at some point (after their initial population from disk),
>> and should never be swapped back in again.
> 
> Just providing swap should do the trick, shouldn't it?

Yes, it should.

(I just find any given swap setting more trustworthy if memory
overcommit is disabled -- for me, disabling memory overcommit comes
first. I trust "this swap should be large enough" much more if the
kernel enforces allocations and the OOM killer can't surprise me later.
I know we don't check g_malloc/g_new in QEMU, but I still prefer if
those crash QEMU over the OOM killer picking a victim.)

Thanks,
Laszlo
Markus Armbruster March 26, 2019, 6:36 p.m. UTC | #8
Laszlo Ersek <lersek@redhat.com> writes:

> On 03/26/19 17:39, Markus Armbruster wrote:
>> Laszlo Ersek <lersek@redhat.com> writes:
>
>>> With the dynamic sizing in QEMU (which, IIRC, I had originally
>>> introduced still in the 1MB times, due to the split between the
>>> executable and varstore parts), both the 1MB->2MB switch, and the
>>> 2MB->4MB switch in the firmware caused zero pain in QEMU. And right now,
>>> 4MB looks like a "sweet spot", with some elbow room left.
>> 
>> Explicit configuration would've been exactly as painless.  Even with
>> pflash sizes restricted to powers of two.
>
> I wrote the patch that ended up as commit 637a5acb46b3 -- with your R-b
> -- in 2013. I'm unsure if machine type properties existed back then, but
> even if they did, do you think I knew about them? :)
>
> You are right, of course; it's just that we can't tell the future.

True!  All we can do is continue to design as well as we can given the
information, experience and resources we have, and when the inevitable
design mistakes become apparent, limit their impact.

Some of the things we now consider mistakes we just didn't see.  Others
we saw (e.g. multiple pflash devices, unlike physical hardware), but
underestimated their impact.

[...]
Xiang Zheng April 3, 2019, 2:12 p.m. UTC | #9
Hi Laszlo and Markus,

Thanks for your useful suggestions and comments! :)

On 2019/3/27 2:36, Markus Armbruster wrote:
> Laszlo Ersek <lersek@redhat.com> writes:
> 
>> On 03/26/19 17:39, Markus Armbruster wrote:
>>> Laszlo Ersek <lersek@redhat.com> writes:
>>
>>>> With the dynamic sizing in QEMU (which, IIRC, I had originally
>>>> introduced still in the 1MB times, due to the split between the
>>>> executable and varstore parts), both the 1MB->2MB switch, and the
>>>> 2MB->4MB switch in the firmware caused zero pain in QEMU. And right now,
>>>> 4MB looks like a "sweet spot", with some elbow room left.
>>>
>>> Explicit configuration would've been exactly as painless.  Even with
>>> pflash sizes restricted to powers of two.
>>
>> I wrote the patch that ended up as commit 637a5acb46b3 -- with your R-b
>> -- in 2013. I'm unsure if machine type properties existed back then, but
>> even if they did, do you think I knew about them? :)
>>
>> You are right, of course; it's just that we can't tell the future.
> 
> True!  All we can do is continue to design as well as we can given the
> information, experience and resources we have, and when the inevitable
> design mistakes become apparent, limit their impact.
> 
> Some of the things we now consider mistakes we just didn't see.  Others
> we saw (e.g. multiple pflash devices, unlike physical hardware), but
> underestimated their impact.
> 

I thought about your comments and wrote the following patch (just for test)
which uses a file mapping to replace the anonymous mapping. UEFI seems to work
fine. So why not use a file mapping to read or write a pflash device?

Except this way, I don't know how to share the pflash memory among VMs or
reduce the consumption for resource. :(

---8>---

diff --git a/hw/arm/virt.c b/hw/arm/virt.c
index ce2664a..12c78f2 100644
--- a/hw/arm/virt.c
+++ b/hw/arm/virt.c
@@ -898,6 +898,7 @@ static void create_one_flash(const char *name, hwaddr flashbase,
     qdev_prop_set_uint16(dev, "id2", 0x00);
     qdev_prop_set_uint16(dev, "id3", 0x00);
     qdev_prop_set_string(dev, "name", name);
+    qdev_prop_set_bit(dev, "prealloc", false);
     qdev_init_nofail(dev);

     memory_region_add_subregion(sysmem, flashbase,
diff --git a/hw/block/pflash_cfi01.c b/hw/block/pflash_cfi01.c
index 16dfae1..23a85bc 100644
--- a/hw/block/pflash_cfi01.c
+++ b/hw/block/pflash_cfi01.c
@@ -92,6 +92,7 @@ struct PFlashCFI01 {
     void *storage;
     VMChangeStateEntry *vmstate;
     bool old_multiple_chip_handling;
+    bool prealloc;
 };

 static int pflash_post_load(void *opaque, int version_id);
@@ -731,11 +732,21 @@ static void pflash_cfi01_realize(DeviceState *dev, Error **errp)
     }
     device_len = sector_len_per_device * blocks_per_device;

-    memory_region_init_rom_device(
-        &pfl->mem, OBJECT(dev),
-        &pflash_cfi01_ops,
-        pfl,
-        pfl->name, total_len, &local_err);
+    if (pfl->blk && !pfl->prealloc) {
+        memory_region_init_rom_device_from_file(
+            &pfl->mem, OBJECT(dev),
+            &pflash_cfi01_ops,
+            pfl,
+            pfl->name, total_len,
+            blk_is_read_only(pfl->blk) ? RAM_SHARED : RAM_PMEM,
+            blk_bs(pfl->blk)->filename, &local_err);
+    } else {
+        memory_region_init_rom_device(
+            &pfl->mem, OBJECT(dev),
+            &pflash_cfi01_ops,
+            pfl,
+            pfl->name, total_len, &local_err);
+    }
     if (local_err) {
         error_propagate(errp, local_err);
         return;
@@ -899,6 +910,7 @@ static Property pflash_cfi01_properties[] = {
     DEFINE_PROP_STRING("name", PFlashCFI01, name),
     DEFINE_PROP_BOOL("old-multiple-chip-handling", PFlashCFI01,
                      old_multiple_chip_handling, false),
+    DEFINE_PROP_BOOL("prealloc", PFlashCFI01, prealloc, true),
     DEFINE_PROP_END_OF_LIST(),
 };

diff --git a/include/exec/memory.h b/include/exec/memory.h
index 1625913..1b16d3b 100644
--- a/include/exec/memory.h
+++ b/include/exec/memory.h
@@ -804,6 +804,16 @@ void memory_region_init_rom_device_nomigrate(MemoryRegion *mr,
                                              uint64_t size,
                                              Error **errp);

+void memory_region_init_rom_device_from_file(MemoryRegion *mr,
+                                             struct Object *owner,
+                                             const MemoryRegionOps *ops,
+                                             void *opaque,
+                                             const char *name,
+                                             uint64_t size,
+                                             uint32_t ram_flags,
+                                             const char *path,
+                                             Error **errp);
+
 /**
  * memory_region_init_iommu: Initialize a memory region of a custom type
  * that translates addresses
diff --git a/memory.c b/memory.c
index 9fbca52..905956b 100644
--- a/memory.c
+++ b/memory.c
@@ -1719,6 +1719,36 @@ void memory_region_init_rom_device_nomigrate(MemoryRegion *mr,
     }
 }

+void memory_region_init_rom_device_from_file(MemoryRegion *mr,
+                                             Object *owner,
+                                             const MemoryRegionOps *ops,
+                                             void *opaque,
+                                             const char *name,
+                                             uint64_t size,
+                                             uint32_t ram_flags,
+                                             const char *path,
+                                             Error **errp)
+{
+    DeviceState *owner_dev;
+    Error *err = NULL;
+    assert(ops);
+    memory_region_init(mr, owner, name, size);
+    mr->ops = ops;
+    mr->opaque = opaque;
+    mr->terminates = true;
+    mr->rom_device = true;
+    mr->destructor = memory_region_destructor_ram;
+    mr->ram_block = qemu_ram_alloc_from_file(size, mr, ram_flags, path, &err);
+    if (err) {
+        mr->size = int128_zero();
+        object_unparent(OBJECT(mr));
+        error_propagate(errp, err);
+        return ;
+    }
+    owner_dev = DEVICE(owner);
+    vmstate_register_ram(mr, owner_dev);
+}
+
 void memory_region_init_iommu(void *_iommu_mr,
                               size_t instance_size,
                               const char *mrtypename,
Laszlo Ersek April 3, 2019, 3:35 p.m. UTC | #10
On 04/03/19 16:12, Xiang Zheng wrote:
> Hi Laszlo and Markus,
> 
> Thanks for your useful suggestions and comments! :)
> 
> On 2019/3/27 2:36, Markus Armbruster wrote:
>> Laszlo Ersek <lersek@redhat.com> writes:
>>
>>> On 03/26/19 17:39, Markus Armbruster wrote:
>>>> Laszlo Ersek <lersek@redhat.com> writes:
>>>
>>>>> With the dynamic sizing in QEMU (which, IIRC, I had originally
>>>>> introduced still in the 1MB times, due to the split between the
>>>>> executable and varstore parts), both the 1MB->2MB switch, and the
>>>>> 2MB->4MB switch in the firmware caused zero pain in QEMU. And right now,
>>>>> 4MB looks like a "sweet spot", with some elbow room left.
>>>>
>>>> Explicit configuration would've been exactly as painless.  Even with
>>>> pflash sizes restricted to powers of two.
>>>
>>> I wrote the patch that ended up as commit 637a5acb46b3 -- with your R-b
>>> -- in 2013. I'm unsure if machine type properties existed back then, but
>>> even if they did, do you think I knew about them? :)
>>>
>>> You are right, of course; it's just that we can't tell the future.
>>
>> True!  All we can do is continue to design as well as we can given the
>> information, experience and resources we have, and when the inevitable
>> design mistakes become apparent, limit their impact.
>>
>> Some of the things we now consider mistakes we just didn't see.  Others
>> we saw (e.g. multiple pflash devices, unlike physical hardware), but
>> underestimated their impact.
>>
> 
> I thought about your comments and wrote the following patch (just for test)
> which uses a file mapping to replace the anonymous mapping. UEFI seems to work
> fine. So why not use a file mapping to read or write a pflash device?

Honestly I can't answer "why not do this". Maybe we should.

Regarding "why not do this *exactly as shown below*" -- probably because
then we'd have updates to the same underlying regular file via both
mmap() and write(), and the interactions between those are really tricky
(= best avoided).

One of my favorite questions over the years used to be posting a matrix
of possible mmap() and file descriptor r/w/sync interactions, with the
outcomes as they were "implied" by POSIX, and then asking at the bottom,
"is my understanding correct?" I've never ever received an answer to
that. :)

Also... although we don't really use them in practice, some QCOW2
features for pflash block backends are -- or would be -- nice. (Not the
internal snapshots or the live RAM dumps, of course.)

Thanks
Laszlo

> Except this way, I don't know how to share the pflash memory among VMs or
> reduce the consumption for resource. :(
> 
> ---8>---
> 
> diff --git a/hw/arm/virt.c b/hw/arm/virt.c
> index ce2664a..12c78f2 100644
> --- a/hw/arm/virt.c
> +++ b/hw/arm/virt.c
> @@ -898,6 +898,7 @@ static void create_one_flash(const char *name, hwaddr flashbase,
>      qdev_prop_set_uint16(dev, "id2", 0x00);
>      qdev_prop_set_uint16(dev, "id3", 0x00);
>      qdev_prop_set_string(dev, "name", name);
> +    qdev_prop_set_bit(dev, "prealloc", false);
>      qdev_init_nofail(dev);
> 
>      memory_region_add_subregion(sysmem, flashbase,
> diff --git a/hw/block/pflash_cfi01.c b/hw/block/pflash_cfi01.c
> index 16dfae1..23a85bc 100644
> --- a/hw/block/pflash_cfi01.c
> +++ b/hw/block/pflash_cfi01.c
> @@ -92,6 +92,7 @@ struct PFlashCFI01 {
>      void *storage;
>      VMChangeStateEntry *vmstate;
>      bool old_multiple_chip_handling;
> +    bool prealloc;
>  };
> 
>  static int pflash_post_load(void *opaque, int version_id);
> @@ -731,11 +732,21 @@ static void pflash_cfi01_realize(DeviceState *dev, Error **errp)
>      }
>      device_len = sector_len_per_device * blocks_per_device;
> 
> -    memory_region_init_rom_device(
> -        &pfl->mem, OBJECT(dev),
> -        &pflash_cfi01_ops,
> -        pfl,
> -        pfl->name, total_len, &local_err);
> +    if (pfl->blk && !pfl->prealloc) {
> +        memory_region_init_rom_device_from_file(
> +            &pfl->mem, OBJECT(dev),
> +            &pflash_cfi01_ops,
> +            pfl,
> +            pfl->name, total_len,
> +            blk_is_read_only(pfl->blk) ? RAM_SHARED : RAM_PMEM,
> +            blk_bs(pfl->blk)->filename, &local_err);
> +    } else {
> +        memory_region_init_rom_device(
> +            &pfl->mem, OBJECT(dev),
> +            &pflash_cfi01_ops,
> +            pfl,
> +            pfl->name, total_len, &local_err);
> +    }
>      if (local_err) {
>          error_propagate(errp, local_err);
>          return;
> @@ -899,6 +910,7 @@ static Property pflash_cfi01_properties[] = {
>      DEFINE_PROP_STRING("name", PFlashCFI01, name),
>      DEFINE_PROP_BOOL("old-multiple-chip-handling", PFlashCFI01,
>                       old_multiple_chip_handling, false),
> +    DEFINE_PROP_BOOL("prealloc", PFlashCFI01, prealloc, true),
>      DEFINE_PROP_END_OF_LIST(),
>  };
> 
> diff --git a/include/exec/memory.h b/include/exec/memory.h
> index 1625913..1b16d3b 100644
> --- a/include/exec/memory.h
> +++ b/include/exec/memory.h
> @@ -804,6 +804,16 @@ void memory_region_init_rom_device_nomigrate(MemoryRegion *mr,
>                                               uint64_t size,
>                                               Error **errp);
> 
> +void memory_region_init_rom_device_from_file(MemoryRegion *mr,
> +                                             struct Object *owner,
> +                                             const MemoryRegionOps *ops,
> +                                             void *opaque,
> +                                             const char *name,
> +                                             uint64_t size,
> +                                             uint32_t ram_flags,
> +                                             const char *path,
> +                                             Error **errp);
> +
>  /**
>   * memory_region_init_iommu: Initialize a memory region of a custom type
>   * that translates addresses
> diff --git a/memory.c b/memory.c
> index 9fbca52..905956b 100644
> --- a/memory.c
> +++ b/memory.c
> @@ -1719,6 +1719,36 @@ void memory_region_init_rom_device_nomigrate(MemoryRegion *mr,
>      }
>  }
> 
> +void memory_region_init_rom_device_from_file(MemoryRegion *mr,
> +                                             Object *owner,
> +                                             const MemoryRegionOps *ops,
> +                                             void *opaque,
> +                                             const char *name,
> +                                             uint64_t size,
> +                                             uint32_t ram_flags,
> +                                             const char *path,
> +                                             Error **errp)
> +{
> +    DeviceState *owner_dev;
> +    Error *err = NULL;
> +    assert(ops);
> +    memory_region_init(mr, owner, name, size);
> +    mr->ops = ops;
> +    mr->opaque = opaque;
> +    mr->terminates = true;
> +    mr->rom_device = true;
> +    mr->destructor = memory_region_destructor_ram;
> +    mr->ram_block = qemu_ram_alloc_from_file(size, mr, ram_flags, path, &err);
> +    if (err) {
> +        mr->size = int128_zero();
> +        object_unparent(OBJECT(mr));
> +        error_propagate(errp, err);
> +        return ;
> +    }
> +    owner_dev = DEVICE(owner);
> +    vmstate_register_ram(mr, owner_dev);
> +}
> +
>  void memory_region_init_iommu(void *_iommu_mr,
>                                size_t instance_size,
>                                const char *mrtypename,
>
Xiang Zheng April 8, 2019, 1:43 p.m. UTC | #11
On 2019/4/3 23:35, Laszlo Ersek wrote:
>> I thought about your comments and wrote the following patch (just for test)
>> which uses a file mapping to replace the anonymous mapping. UEFI seems to work
>> fine. So why not use a file mapping to read or write a pflash device?
> Honestly I can't answer "why not do this". Maybe we should.
> 
> Regarding "why not do this *exactly as shown below*" -- probably because
> then we'd have updates to the same underlying regular file via both
> mmap() and write(), and the interactions between those are really tricky
> (= best avoided).
> 
> One of my favorite questions over the years used to be posting a matrix
> of possible mmap() and file descriptor r/w/sync interactions, with the
> outcomes as they were "implied" by POSIX, and then asking at the bottom,
> "is my understanding correct?" I've never ever received an answer to
> that. :)

In my opinion, it's feasible to r/w/sync the memory devices which use a block
backend via mmap() and write().

> 
> Also... although we don't really use them in practice, some QCOW2
> features for pflash block backends are -- or would be -- nice. (Not the
> internal snapshots or the live RAM dumps, of course.)

Regarding sparse files, can we read actual backend size data for the read-only
pflash memory as the following steps shown?

1) Create a sparse file -- QEMU_EFI-test.raw:
   dd if=/dev/zero of=QEMU_EFI-test.raw bs=1M seek=64 count=0

2) Read from QEMU_EFI.fd and write to QEMU_EFI-test.raw:
   dd of="QEMU_EFI-test.raw" if="QEMU_EFI.fd" conv=notrunc

3) Use QEMU_EFI-test.raw as the read-only pflash and start qemu with the below
patch applied:

---8>---

diff --git a/hw/block/block.c b/hw/block/block.c
index bf56c76..ad18d5e 100644
--- a/hw/block/block.c
+++ b/hw/block/block.c
@@ -30,7 +30,7 @@
 bool blk_check_size_and_read_all(BlockBackend *blk, void *buf, hwaddr size,
                                  Error **errp)
 {
-    int64_t blk_len;
+    int64_t blk_len, actual_len;
     int ret;

     blk_len = blk_getlength(blk);
@@ -53,7 +53,9 @@ bool blk_check_size_and_read_all(BlockBackend *blk, void *buf, hwaddr size,
      * block device and read only on demand.
      */
     assert(size <= BDRV_REQUEST_MAX_BYTES);
-    ret = blk_pread(blk, 0, buf, size);
+    actual_len = bdrv_get_allocated_file_size(blk_bs(blk));
+    ret = blk_pread(blk, 0, buf,
+            (!blk_is_read_only(blk) || actual_len < 0) ? size : actual_len);
     if (ret < 0) {
         error_setg_errno(errp, -ret, "can't read block backend");
         return false;
Laszlo Ersek April 8, 2019, 4:14 p.m. UTC | #12
On 04/08/19 15:43, Xiang Zheng wrote:
> 
> On 2019/4/3 23:35, Laszlo Ersek wrote:
>>> I thought about your comments and wrote the following patch (just for test)
>>> which uses a file mapping to replace the anonymous mapping. UEFI seems to work
>>> fine. So why not use a file mapping to read or write a pflash device?
>> Honestly I can't answer "why not do this". Maybe we should.
>>
>> Regarding "why not do this *exactly as shown below*" -- probably because
>> then we'd have updates to the same underlying regular file via both
>> mmap() and write(), and the interactions between those are really tricky
>> (= best avoided).
>>
>> One of my favorite questions over the years used to be posting a matrix
>> of possible mmap() and file descriptor r/w/sync interactions, with the
>> outcomes as they were "implied" by POSIX, and then asking at the bottom,
>> "is my understanding correct?" I've never ever received an answer to
>> that. :)
> 
> In my opinion, it's feasible to r/w/sync the memory devices which use a block
> backend via mmap() and write().

Maybe. I think that would be a first in QEMU, and you'd likely have to
describe and follow a semi-formal model between fd actions and mmap()
actions.

Here's the (unconfirmed) table I referred to earlier.

+-------------+-----------------------------------------------------+
| change made | change visible via                                  |
| through     +-----------------+-------------+---------------------+
|             | MAP_SHARED      | MAP_PRIVATE | read()              |
+-------------+-----------------+-------------+---------------------+
| MAP_SHARED  | yes             | unspecified | depends on MS_SYNC, |
|             |                 |             | MS_ASYNC, or normal |
|             |                 |             | system activity     |
+-------------+-----------------+-------------+---------------------+
| MAP_PRIVATE | no              | no          | no                  |
+-------------+-----------------+-------------+---------------------+
| write()     | depends on      | unspecified | yes                 |
|             | MS_INVALIDATE,  |             |                     |
|             | or the system's |             |                     |
|             | read/write      |             |                     |
|             | consistency     |             |                     |
+-------------+-----------------+-------------+---------------------+

Presumably:

- a write() to some offset range of a regular file should be visible in
a MAP_SHARED mapping of that range after a matching msync(...,
MS_INVALIDATE) call;

- changes through a MAP_SHARED mapping to a regular file should be
visible via read() after a matching msync(..., MS_SYNC) call returns.

I sent this table first in 2010 to the Austin Group mailing list, and
received no comments. Then another person (on the same list) asked
basically the same questions in 2013, to which I responded with the
above assumptions / interpretations -- and received no comments from
third parties again.

But, I'm really out of my depth here -- we're not even discussing
generic read()/write()/mmap()/munmap()/msync() interactions, but how
they would fit into the qemu block layer. And I have no idea.

> 
>>
>> Also... although we don't really use them in practice, some QCOW2
>> features for pflash block backends are -- or would be -- nice. (Not the
>> internal snapshots or the live RAM dumps, of course.)
> 
> Regarding sparse files, can we read actual backend size data for the read-only
> pflash memory as the following steps shown?
> 
> 1) Create a sparse file -- QEMU_EFI-test.raw:
>    dd if=/dev/zero of=QEMU_EFI-test.raw bs=1M seek=64 count=0
> 
> 2) Read from QEMU_EFI.fd and write to QEMU_EFI-test.raw:
>    dd of="QEMU_EFI-test.raw" if="QEMU_EFI.fd" conv=notrunc
> 
> 3) Use QEMU_EFI-test.raw as the read-only pflash and start qemu with the below
> patch applied:
> 
> ---8>---
> 
> diff --git a/hw/block/block.c b/hw/block/block.c
> index bf56c76..ad18d5e 100644
> --- a/hw/block/block.c
> +++ b/hw/block/block.c
> @@ -30,7 +30,7 @@
>  bool blk_check_size_and_read_all(BlockBackend *blk, void *buf, hwaddr size,
>                                   Error **errp)
>  {
> -    int64_t blk_len;
> +    int64_t blk_len, actual_len;
>      int ret;
> 
>      blk_len = blk_getlength(blk);
> @@ -53,7 +53,9 @@ bool blk_check_size_and_read_all(BlockBackend *blk, void *buf, hwaddr size,
>       * block device and read only on demand.
>       */
>      assert(size <= BDRV_REQUEST_MAX_BYTES);
> -    ret = blk_pread(blk, 0, buf, size);
> +    actual_len = bdrv_get_allocated_file_size(blk_bs(blk));
> +    ret = blk_pread(blk, 0, buf,
> +            (!blk_is_read_only(blk) || actual_len < 0) ? size : actual_len);
>      if (ret < 0) {
>          error_setg_errno(errp, -ret, "can't read block backend");
>          return false;
> 

This hunk looks dubious for a general helper function. It seems to
assume that the holes are punched at the end of the file.

Sorry, this discussion is making me uncomfortable. I don't want to
ignore your questions, but I have no idea about the right answers. This
really needs the attention of the block people.

Laszlo
Xiang Zheng April 9, 2019, 3:39 a.m. UTC | #13
On 2019/4/9 0:14, Laszlo Ersek wrote:
> On 04/08/19 15:43, Xiang Zheng wrote:
>>
>> On 2019/4/3 23:35, Laszlo Ersek wrote:
>>>> I thought about your comments and wrote the following patch (just for test)
>>>> which uses a file mapping to replace the anonymous mapping. UEFI seems to work
>>>> fine. So why not use a file mapping to read or write a pflash device?
>>> Honestly I can't answer "why not do this". Maybe we should.
>>>
>>> Regarding "why not do this *exactly as shown below*" -- probably because
>>> then we'd have updates to the same underlying regular file via both
>>> mmap() and write(), and the interactions between those are really tricky
>>> (= best avoided).
>>>
>>> One of my favorite questions over the years used to be posting a matrix
>>> of possible mmap() and file descriptor r/w/sync interactions, with the
>>> outcomes as they were "implied" by POSIX, and then asking at the bottom,
>>> "is my understanding correct?" I've never ever received an answer to
>>> that. :)
>>
>> In my opinion, it's feasible to r/w/sync the memory devices which use a block
>> backend via mmap() and write().
> 
> Maybe. I think that would be a first in QEMU, and you'd likely have to
> describe and follow a semi-formal model between fd actions and mmap()
> actions.
> 
> Here's the (unconfirmed) table I referred to earlier.
> 
> +-------------+-----------------------------------------------------+
> | change made | change visible via                                  |
> | through     +-----------------+-------------+---------------------+
> |             | MAP_SHARED      | MAP_PRIVATE | read()              |
> +-------------+-----------------+-------------+---------------------+
> | MAP_SHARED  | yes             | unspecified | depends on MS_SYNC, |
> |             |                 |             | MS_ASYNC, or normal |
> |             |                 |             | system activity     |
> +-------------+-----------------+-------------+---------------------+
> | MAP_PRIVATE | no              | no          | no                  |
> +-------------+-----------------+-------------+---------------------+
> | write()     | depends on      | unspecified | yes                 |
> |             | MS_INVALIDATE,  |             |                     |
> |             | or the system's |             |                     |
> |             | read/write      |             |                     |
> |             | consistency     |             |                     |
> +-------------+-----------------+-------------+---------------------+
> 
> Presumably:
> 
> - a write() to some offset range of a regular file should be visible in
> a MAP_SHARED mapping of that range after a matching msync(...,
> MS_INVALIDATE) call;
> 
> - changes through a MAP_SHARED mapping to a regular file should be
> visible via read() after a matching msync(..., MS_SYNC) call returns.
> 
> I sent this table first in 2010 to the Austin Group mailing list, and
> received no comments. Then another person (on the same list) asked
> basically the same questions in 2013, to which I responded with the
> above assumptions / interpretations -- and received no comments from
> third parties again.
> 
> But, I'm really out of my depth here -- we're not even discussing
> generic read()/write()/mmap()/munmap()/msync() interactions, but how
> they would fit into the qemu block layer. And I have no idea.
> 
>>
>>>
>>> Also... although we don't really use them in practice, some QCOW2
>>> features for pflash block backends are -- or would be -- nice. (Not the
>>> internal snapshots or the live RAM dumps, of course.)
>>
>> Regarding sparse files, can we read actual backend size data for the read-only
>> pflash memory as the following steps shown?
>>
>> 1) Create a sparse file -- QEMU_EFI-test.raw:
>>    dd if=/dev/zero of=QEMU_EFI-test.raw bs=1M seek=64 count=0
>>
>> 2) Read from QEMU_EFI.fd and write to QEMU_EFI-test.raw:
>>    dd of="QEMU_EFI-test.raw" if="QEMU_EFI.fd" conv=notrunc
>>
>> 3) Use QEMU_EFI-test.raw as the read-only pflash and start qemu with the below
>> patch applied:
>>
>> ---8>---
>>
>> diff --git a/hw/block/block.c b/hw/block/block.c
>> index bf56c76..ad18d5e 100644
>> --- a/hw/block/block.c
>> +++ b/hw/block/block.c
>> @@ -30,7 +30,7 @@
>>  bool blk_check_size_and_read_all(BlockBackend *blk, void *buf, hwaddr size,
>>                                   Error **errp)
>>  {
>> -    int64_t blk_len;
>> +    int64_t blk_len, actual_len;
>>      int ret;
>>
>>      blk_len = blk_getlength(blk);
>> @@ -53,7 +53,9 @@ bool blk_check_size_and_read_all(BlockBackend *blk, void *buf, hwaddr size,
>>       * block device and read only on demand.
>>       */
>>      assert(size <= BDRV_REQUEST_MAX_BYTES);
>> -    ret = blk_pread(blk, 0, buf, size);
>> +    actual_len = bdrv_get_allocated_file_size(blk_bs(blk));
>> +    ret = blk_pread(blk, 0, buf,
>> +            (!blk_is_read_only(blk) || actual_len < 0) ? size : actual_len);
>>      if (ret < 0) {
>>          error_setg_errno(errp, -ret, "can't read block backend");
>>          return false;
>>
> 
> This hunk looks dubious for a general helper function. It seems to
> assume that the holes are punched at the end of the file.
> 
> Sorry, this discussion is making me uncomfortable. I don't want to
> ignore your questions, but I have no idea about the right answers. This
> really needs the attention of the block people.

I feel deeply sorry for making you uncomfortable. It's generous of you to give me so much
of your time.
Markus Armbruster April 9, 2019, 6:01 a.m. UTC | #14
László's last sentence below is "This really needs the attention of the
block people."  Cc'ing some.

Laszlo Ersek <lersek@redhat.com> writes:

> On 04/08/19 15:43, Xiang Zheng wrote:
>> 
>> On 2019/4/3 23:35, Laszlo Ersek wrote:
>>>> I thought about your comments and wrote the following patch (just for test)
>>>> which uses a file mapping to replace the anonymous mapping. UEFI seems to work
>>>> fine. So why not use a file mapping to read or write a pflash device?
>>> Honestly I can't answer "why not do this". Maybe we should.
>>>
>>> Regarding "why not do this *exactly as shown below*" -- probably because
>>> then we'd have updates to the same underlying regular file via both
>>> mmap() and write(), and the interactions between those are really tricky
>>> (= best avoided).
>>>
>>> One of my favorite questions over the years used to be posting a matrix
>>> of possible mmap() and file descriptor r/w/sync interactions, with the
>>> outcomes as they were "implied" by POSIX, and then asking at the bottom,
>>> "is my understanding correct?" I've never ever received an answer to
>>> that. :)
>> 
>> In my opinion, it's feasible to r/w/sync the memory devices which use a block
>> backend via mmap() and write().
>
> Maybe. I think that would be a first in QEMU, and you'd likely have to
> describe and follow a semi-formal model between fd actions and mmap()
> actions.
>
> Here's the (unconfirmed) table I referred to earlier.
>
> +-------------+-----------------------------------------------------+
> | change made | change visible via                                  |
> | through     +-----------------+-------------+---------------------+
> |             | MAP_SHARED      | MAP_PRIVATE | read()              |
> +-------------+-----------------+-------------+---------------------+
> | MAP_SHARED  | yes             | unspecified | depends on MS_SYNC, |
> |             |                 |             | MS_ASYNC, or normal |
> |             |                 |             | system activity     |
> +-------------+-----------------+-------------+---------------------+
> | MAP_PRIVATE | no              | no          | no                  |
> +-------------+-----------------+-------------+---------------------+
> | write()     | depends on      | unspecified | yes                 |
> |             | MS_INVALIDATE,  |             |                     |
> |             | or the system's |             |                     |
> |             | read/write      |             |                     |
> |             | consistency     |             |                     |
> +-------------+-----------------+-------------+---------------------+
>
> Presumably:
>
> - a write() to some offset range of a regular file should be visible in
> a MAP_SHARED mapping of that range after a matching msync(...,
> MS_INVALIDATE) call;
>
> - changes through a MAP_SHARED mapping to a regular file should be
> visible via read() after a matching msync(..., MS_SYNC) call returns.
>
> I sent this table first in 2010 to the Austin Group mailing list, and
> received no comments. Then another person (on the same list) asked
> basically the same questions in 2013, to which I responded with the
> above assumptions / interpretations -- and received no comments from
> third parties again.
>
> But, I'm really out of my depth here -- we're not even discussing
> generic read()/write()/mmap()/munmap()/msync() interactions, but how
> they would fit into the qemu block layer. And I have no idea.
>
>> 
>>>
>>> Also... although we don't really use them in practice, some QCOW2
>>> features for pflash block backends are -- or would be -- nice. (Not the
>>> internal snapshots or the live RAM dumps, of course.)
>> 
>> Regarding sparse files, can we read actual backend size data for the read-only
>> pflash memory as the following steps shown?
>> 
>> 1) Create a sparse file -- QEMU_EFI-test.raw:
>>    dd if=/dev/zero of=QEMU_EFI-test.raw bs=1M seek=64 count=0
>> 
>> 2) Read from QEMU_EFI.fd and write to QEMU_EFI-test.raw:
>>    dd of="QEMU_EFI-test.raw" if="QEMU_EFI.fd" conv=notrunc
>> 
>> 3) Use QEMU_EFI-test.raw as the read-only pflash and start qemu with the below
>> patch applied:
>> 
>> ---8>---
>> 
>> diff --git a/hw/block/block.c b/hw/block/block.c
>> index bf56c76..ad18d5e 100644
>> --- a/hw/block/block.c
>> +++ b/hw/block/block.c
>> @@ -30,7 +30,7 @@
>>  bool blk_check_size_and_read_all(BlockBackend *blk, void *buf, hwaddr size,
>>                                   Error **errp)
>>  {
>> -    int64_t blk_len;
>> +    int64_t blk_len, actual_len;
>>      int ret;
>> 
>>      blk_len = blk_getlength(blk);
>> @@ -53,7 +53,9 @@ bool blk_check_size_and_read_all(BlockBackend *blk, void *buf, hwaddr size,
>>       * block device and read only on demand.
>>       */
>>      assert(size <= BDRV_REQUEST_MAX_BYTES);
>> -    ret = blk_pread(blk, 0, buf, size);
>> +    actual_len = bdrv_get_allocated_file_size(blk_bs(blk));
>> +    ret = blk_pread(blk, 0, buf,
>> +            (!blk_is_read_only(blk) || actual_len < 0) ? size : actual_len);
>>      if (ret < 0) {
>>          error_setg_errno(errp, -ret, "can't read block backend");
>>          return false;
>> 
>
> This hunk looks dubious for a general helper function. It seems to
> assume that the holes are punched at the end of the file.
>
> Sorry, this discussion is making me uncomfortable. I don't want to
> ignore your questions, but I have no idea about the right answers. This
> really needs the attention of the block people.
Kevin Wolf April 9, 2019, 8:28 a.m. UTC | #15
Am 09.04.2019 um 08:01 hat Markus Armbruster geschrieben:
> László's last sentence below is "This really needs the attention of the
> block people."  Cc'ing some.
> 
> Laszlo Ersek <lersek@redhat.com> writes:
> 
> > On 04/08/19 15:43, Xiang Zheng wrote:
> >> 
> >> On 2019/4/3 23:35, Laszlo Ersek wrote:
> >>>> I thought about your comments and wrote the following patch (just for test)
> >>>> which uses a file mapping to replace the anonymous mapping. UEFI seems to work
> >>>> fine. So why not use a file mapping to read or write a pflash device?
> >>> Honestly I can't answer "why not do this". Maybe we should.
> >>>
> >>> Regarding "why not do this *exactly as shown below*" -- probably because
> >>> then we'd have updates to the same underlying regular file via both
> >>> mmap() and write(), and the interactions between those are really tricky
> >>> (= best avoided).
> >>>
> >>> One of my favorite questions over the years used to be posting a matrix
> >>> of possible mmap() and file descriptor r/w/sync interactions, with the
> >>> outcomes as they were "implied" by POSIX, and then asking at the bottom,
> >>> "is my understanding correct?" I've never ever received an answer to
> >>> that. :)
> >> 
> >> In my opinion, it's feasible to r/w/sync the memory devices which use a block
> >> backend via mmap() and write().
> >
> > Maybe. I think that would be a first in QEMU, and you'd likely have to
> > describe and follow a semi-formal model between fd actions and mmap()
> > actions.
> >
> > Here's the (unconfirmed) table I referred to earlier.
> >
> > +-------------+-----------------------------------------------------+
> > | change made | change visible via                                  |
> > | through     +-----------------+-------------+---------------------+
> > |             | MAP_SHARED      | MAP_PRIVATE | read()              |
> > +-------------+-----------------+-------------+---------------------+
> > | MAP_SHARED  | yes             | unspecified | depends on MS_SYNC, |
> > |             |                 |             | MS_ASYNC, or normal |
> > |             |                 |             | system activity     |
> > +-------------+-----------------+-------------+---------------------+
> > | MAP_PRIVATE | no              | no          | no                  |
> > +-------------+-----------------+-------------+---------------------+
> > | write()     | depends on      | unspecified | yes                 |
> > |             | MS_INVALIDATE,  |             |                     |
> > |             | or the system's |             |                     |
> > |             | read/write      |             |                     |
> > |             | consistency     |             |                     |
> > +-------------+-----------------+-------------+---------------------+
> >
> > Presumably:
> >
> > - a write() to some offset range of a regular file should be visible in
> > a MAP_SHARED mapping of that range after a matching msync(...,
> > MS_INVALIDATE) call;
> >
> > - changes through a MAP_SHARED mapping to a regular file should be
> > visible via read() after a matching msync(..., MS_SYNC) call returns.
> >
> > I sent this table first in 2010 to the Austin Group mailing list, and
> > received no comments. Then another person (on the same list) asked
> > basically the same questions in 2013, to which I responded with the
> > above assumptions / interpretations -- and received no comments from
> > third parties again.
> >
> > But, I'm really out of my depth here -- we're not even discussing
> > generic read()/write()/mmap()/munmap()/msync() interactions, but how
> > they would fit into the qemu block layer. And I have no idea.

There is no infrastructure in the block layer for mmapping image files.

> >>> Also... although we don't really use them in practice, some QCOW2
> >>> features for pflash block backends are -- or would be -- nice. (Not the
> >>> internal snapshots or the live RAM dumps, of course.)
> >> 
> >> Regarding sparse files, can we read actual backend size data for the read-only
> >> pflash memory as the following steps shown?
> >> 
> >> 1) Create a sparse file -- QEMU_EFI-test.raw:
> >>    dd if=/dev/zero of=QEMU_EFI-test.raw bs=1M seek=64 count=0
> >> 
> >> 2) Read from QEMU_EFI.fd and write to QEMU_EFI-test.raw:
> >>    dd of="QEMU_EFI-test.raw" if="QEMU_EFI.fd" conv=notrunc
> >> 
> >> 3) Use QEMU_EFI-test.raw as the read-only pflash and start qemu with the below
> >> patch applied:
> >> 
> >> ---8>---
> >> 
> >> diff --git a/hw/block/block.c b/hw/block/block.c
> >> index bf56c76..ad18d5e 100644
> >> --- a/hw/block/block.c
> >> +++ b/hw/block/block.c
> >> @@ -30,7 +30,7 @@
> >>  bool blk_check_size_and_read_all(BlockBackend *blk, void *buf, hwaddr size,
> >>                                   Error **errp)
> >>  {
> >> -    int64_t blk_len;
> >> +    int64_t blk_len, actual_len;
> >>      int ret;
> >> 
> >>      blk_len = blk_getlength(blk);
> >> @@ -53,7 +53,9 @@ bool blk_check_size_and_read_all(BlockBackend *blk, void *buf, hwaddr size,
> >>       * block device and read only on demand.
> >>       */
> >>      assert(size <= BDRV_REQUEST_MAX_BYTES);
> >> -    ret = blk_pread(blk, 0, buf, size);
> >> +    actual_len = bdrv_get_allocated_file_size(blk_bs(blk));
> >> +    ret = blk_pread(blk, 0, buf,
> >> +            (!blk_is_read_only(blk) || actual_len < 0) ? size : actual_len);
> >>      if (ret < 0) {
> >>          error_setg_errno(errp, -ret, "can't read block backend");
> >>          return false;
> >> 
> >
> > This hunk looks dubious for a general helper function. It seems to
> > assume that the holes are punched at the end of the file.
> >
> > Sorry, this discussion is making me uncomfortable. I don't want to
> > ignore your questions, but I have no idea about the right answers. This
> > really needs the attention of the block people.

Yes, this code is wrong. bdrv_get_allocated_file_size() isn't even
implemented in all block drivers, and when it is implemented it isn't
necessarily reliable (e.g. return 0 for block devices). It's fine for
'qemu-img info', but that's it.

But even if you actually get the correct allocated file size, that's the
disk space used on the host filesystem, so it doesn't include any holes,
but it does include image format metadata, the data could possibly be
compressed etc. In other words, for a guest device it's completely
meaningless.

I'm not even sure why you would want to do this even in your special
case of a raw image that is sparse only at the end. Is it an
optimisation to avoid reading zeros? This ends up basically being
memset() for sparse blocks, so very quick. And I think you do want to
zero out the remaining part of the buffer anyway. So it looks to me as
if it were complicating the code for no use.

Kevin
Xiang Zheng April 10, 2019, 8:36 a.m. UTC | #16
Hi Kevin,

On 2019/4/9 16:28, Kevin Wolf wrote:
> Am 09.04.2019 um 08:01 hat Markus Armbruster geschrieben:
>> László's last sentence below is "This really needs the attention of the
>> block people."  Cc'ing some.
>>
>> Laszlo Ersek <lersek@redhat.com> writes:
>>
>>> On 04/08/19 15:43, Xiang Zheng wrote:
>>>>
>>>> On 2019/4/3 23:35, Laszlo Ersek wrote:
>>>>>> I thought about your comments and wrote the following patch (just for test)
>>>>>> which uses a file mapping to replace the anonymous mapping. UEFI seems to work
>>>>>> fine. So why not use a file mapping to read or write a pflash device?
>>>>> Honestly I can't answer "why not do this". Maybe we should.
>>>>>
>>>>> Regarding "why not do this *exactly as shown below*" -- probably because
>>>>> then we'd have updates to the same underlying regular file via both
>>>>> mmap() and write(), and the interactions between those are really tricky
>>>>> (= best avoided).
>>>>>
>>>>> One of my favorite questions over the years used to be posting a matrix
>>>>> of possible mmap() and file descriptor r/w/sync interactions, with the
>>>>> outcomes as they were "implied" by POSIX, and then asking at the bottom,
>>>>> "is my understanding correct?" I've never ever received an answer to
>>>>> that. :)
>>>>
>>>> In my opinion, it's feasible to r/w/sync the memory devices which use a block
>>>> backend via mmap() and write().
>>>
>>> Maybe. I think that would be a first in QEMU, and you'd likely have to
>>> describe and follow a semi-formal model between fd actions and mmap()
>>> actions.
>>>
>>> Here's the (unconfirmed) table I referred to earlier.
>>>
>>> +-------------+-----------------------------------------------------+
>>> | change made | change visible via                                  |
>>> | through     +-----------------+-------------+---------------------+
>>> |             | MAP_SHARED      | MAP_PRIVATE | read()              |
>>> +-------------+-----------------+-------------+---------------------+
>>> | MAP_SHARED  | yes             | unspecified | depends on MS_SYNC, |
>>> |             |                 |             | MS_ASYNC, or normal |
>>> |             |                 |             | system activity     |
>>> +-------------+-----------------+-------------+---------------------+
>>> | MAP_PRIVATE | no              | no          | no                  |
>>> +-------------+-----------------+-------------+---------------------+
>>> | write()     | depends on      | unspecified | yes                 |
>>> |             | MS_INVALIDATE,  |             |                     |
>>> |             | or the system's |             |                     |
>>> |             | read/write      |             |                     |
>>> |             | consistency     |             |                     |
>>> +-------------+-----------------+-------------+---------------------+
>>>
>>> Presumably:
>>>
>>> - a write() to some offset range of a regular file should be visible in
>>> a MAP_SHARED mapping of that range after a matching msync(...,
>>> MS_INVALIDATE) call;
>>>
>>> - changes through a MAP_SHARED mapping to a regular file should be
>>> visible via read() after a matching msync(..., MS_SYNC) call returns.
>>>
>>> I sent this table first in 2010 to the Austin Group mailing list, and
>>> received no comments. Then another person (on the same list) asked
>>> basically the same questions in 2013, to which I responded with the
>>> above assumptions / interpretations -- and received no comments from
>>> third parties again.
>>>
>>> But, I'm really out of my depth here -- we're not even discussing
>>> generic read()/write()/mmap()/munmap()/msync() interactions, but how
>>> they would fit into the qemu block layer. And I have no idea.
> 
> There is no infrastructure in the block layer for mmapping image files.
> 
>>>>> Also... although we don't really use them in practice, some QCOW2
>>>>> features for pflash block backends are -- or would be -- nice. (Not the
>>>>> internal snapshots or the live RAM dumps, of course.)
>>>>
>>>> Regarding sparse files, can we read actual backend size data for the read-only
>>>> pflash memory as the following steps shown?
>>>>
>>>> 1) Create a sparse file -- QEMU_EFI-test.raw:
>>>>    dd if=/dev/zero of=QEMU_EFI-test.raw bs=1M seek=64 count=0
>>>>
>>>> 2) Read from QEMU_EFI.fd and write to QEMU_EFI-test.raw:
>>>>    dd of="QEMU_EFI-test.raw" if="QEMU_EFI.fd" conv=notrunc
>>>>
>>>> 3) Use QEMU_EFI-test.raw as the read-only pflash and start qemu with the below
>>>> patch applied:
>>>>
>>>> ---8>---
>>>>
>>>> diff --git a/hw/block/block.c b/hw/block/block.c
>>>> index bf56c76..ad18d5e 100644
>>>> --- a/hw/block/block.c
>>>> +++ b/hw/block/block.c
>>>> @@ -30,7 +30,7 @@
>>>>  bool blk_check_size_and_read_all(BlockBackend *blk, void *buf, hwaddr size,
>>>>                                   Error **errp)
>>>>  {
>>>> -    int64_t blk_len;
>>>> +    int64_t blk_len, actual_len;
>>>>      int ret;
>>>>
>>>>      blk_len = blk_getlength(blk);
>>>> @@ -53,7 +53,9 @@ bool blk_check_size_and_read_all(BlockBackend *blk, void *buf, hwaddr size,
>>>>       * block device and read only on demand.
>>>>       */
>>>>      assert(size <= BDRV_REQUEST_MAX_BYTES);
>>>> -    ret = blk_pread(blk, 0, buf, size);
>>>> +    actual_len = bdrv_get_allocated_file_size(blk_bs(blk));
>>>> +    ret = blk_pread(blk, 0, buf,
>>>> +            (!blk_is_read_only(blk) || actual_len < 0) ? size : actual_len);
>>>>      if (ret < 0) {
>>>>          error_setg_errno(errp, -ret, "can't read block backend");
>>>>          return false;
>>>>
>>>
>>> This hunk looks dubious for a general helper function. It seems to
>>> assume that the holes are punched at the end of the file.
>>>
>>> Sorry, this discussion is making me uncomfortable. I don't want to
>>> ignore your questions, but I have no idea about the right answers. This
>>> really needs the attention of the block people.
> 
> Yes, this code is wrong. bdrv_get_allocated_file_size() isn't even
> implemented in all block drivers, and when it is implemented it isn't
> necessarily reliable (e.g. return 0 for block devices). It's fine for
> 'qemu-img info', but that's it.
> 
> But even if you actually get the correct allocated file size, that's the
> disk space used on the host filesystem, so it doesn't include any holes,
> but it does include image format metadata, the data could possibly be
> compressed etc. In other words, for a guest device it's completely
> meaningless.
> 
> I'm not even sure why you would want to do this even in your special
> case of a raw image that is sparse only at the end. Is it an
> optimisation to avoid reading zeros? This ends up basically being
> memset() for sparse blocks, so very quick. And I think you do want to
> zero out the remaining part of the buffer anyway. So it looks to me as
> if it were complicating the code for no use.

The primary target of this discussion is to save the memory allocated
for UEFI firmware device. On virtual machine, we split it into two flash
devices[1] -- one is read-only in 64MB size and another is read-write in
64MB size. The qemu commandline is:

   -drive file=QEMU_EFI-pflash.raw,if=pflash,format=raw,unit=0,readonly=on \
   -drive file=empty_VARS.fd,if=pflash,format=raw,unit=1 \

Regarding the read-only one which are created from QEMU_EFI.fd, the original
QEMU_EFI.fd is only 2MB in size and we need to stuff it to 64MB with 62MB
'zeros':

   dd of="QEMU_EFI-pflash.raw" if="/dev/zero" bs=1M count=64
   dd of="QEMU_EFI-pflash.raw" if="QEMU_EFI.fd" conv=notrunc

For some historical reasons such as compatibility and extensibility[2], we
restrict both their size to 64MB -- both UEFI and qemu have cold hard
constants.

This will consume a large amount of memory when running multiple VM
simultaneously (each guest needs 128MB).

There are accepted ideas proposed by Markus and Laszlo from prior discussion[3]:

1) Getting flash memory size from a machine type or property.

2) Map the read-only part so it is shared among guests; Map the read-write
part normally.

The first idea called explicit configuration which may break migration.

For the second idea, the only way I can imagine is using a file mapping to
read or write pflash devices so that we can allocate memory on demand. So I
tried to fit mmap() actions into the block layer[4]. However I am not sure
that maping image file whether can work fine for block layer.

On the other hand, Laszlo had mentioned the features of QCOW2 and sparse
files which may be nice for pflash block backends. But I am not familiar with
it.

Till now I still have no idea to get rid of this problem (more properly it's
a optimization).


[1]https://lists.nongnu.org/archive/html/qemu-devel/2019-01/msg06606.html
[2]https://lists.nongnu.org/archive/html/qemu-devel/2019-03/msg07009.html
[3]https://lists.nongnu.org/archive/html/qemu-devel/2019-03/msg07107.html
[4]https://lists.nongnu.org/archive/html/qemu-devel/2019-04/msg00719.html
Markus Armbruster April 11, 2019, 7:15 a.m. UTC | #17
Xiang Zheng <zhengxiang9@huawei.com> writes:

> Hi Kevin,
>
> On 2019/4/9 16:28, Kevin Wolf wrote:
>> Am 09.04.2019 um 08:01 hat Markus Armbruster geschrieben:
>>> László's last sentence below is "This really needs the attention of the
>>> block people."  Cc'ing some.
>>>
>>> Laszlo Ersek <lersek@redhat.com> writes:
>>>
>>>> On 04/08/19 15:43, Xiang Zheng wrote:
>>>>>
>>>>> On 2019/4/3 23:35, Laszlo Ersek wrote:
>>>>>>> I thought about your comments and wrote the following patch (just for test)
>>>>>>> which uses a file mapping to replace the anonymous mapping. UEFI seems to work
>>>>>>> fine. So why not use a file mapping to read or write a pflash device?
>>>>>> Honestly I can't answer "why not do this". Maybe we should.
>>>>>>
>>>>>> Regarding "why not do this *exactly as shown below*" -- probably because
>>>>>> then we'd have updates to the same underlying regular file via both
>>>>>> mmap() and write(), and the interactions between those are really tricky
>>>>>> (= best avoided).
>>>>>>
>>>>>> One of my favorite questions over the years used to be posting a matrix
>>>>>> of possible mmap() and file descriptor r/w/sync interactions, with the
>>>>>> outcomes as they were "implied" by POSIX, and then asking at the bottom,
>>>>>> "is my understanding correct?" I've never ever received an answer to
>>>>>> that. :)
>>>>>
>>>>> In my opinion, it's feasible to r/w/sync the memory devices which use a block
>>>>> backend via mmap() and write().
>>>>
>>>> Maybe. I think that would be a first in QEMU, and you'd likely have to
>>>> describe and follow a semi-formal model between fd actions and mmap()
>>>> actions.
>>>>
>>>> Here's the (unconfirmed) table I referred to earlier.
>>>>
>>>> +-------------+-----------------------------------------------------+
>>>> | change made | change visible via                                  |
>>>> | through     +-----------------+-------------+---------------------+
>>>> |             | MAP_SHARED      | MAP_PRIVATE | read()              |
>>>> +-------------+-----------------+-------------+---------------------+
>>>> | MAP_SHARED  | yes             | unspecified | depends on MS_SYNC, |
>>>> |             |                 |             | MS_ASYNC, or normal |
>>>> |             |                 |             | system activity     |
>>>> +-------------+-----------------+-------------+---------------------+
>>>> | MAP_PRIVATE | no              | no          | no                  |
>>>> +-------------+-----------------+-------------+---------------------+
>>>> | write()     | depends on      | unspecified | yes                 |
>>>> |             | MS_INVALIDATE,  |             |                     |
>>>> |             | or the system's |             |                     |
>>>> |             | read/write      |             |                     |
>>>> |             | consistency     |             |                     |
>>>> +-------------+-----------------+-------------+---------------------+
>>>>
>>>> Presumably:
>>>>
>>>> - a write() to some offset range of a regular file should be visible in
>>>> a MAP_SHARED mapping of that range after a matching msync(...,
>>>> MS_INVALIDATE) call;
>>>>
>>>> - changes through a MAP_SHARED mapping to a regular file should be
>>>> visible via read() after a matching msync(..., MS_SYNC) call returns.
>>>>
>>>> I sent this table first in 2010 to the Austin Group mailing list, and
>>>> received no comments. Then another person (on the same list) asked
>>>> basically the same questions in 2013, to which I responded with the
>>>> above assumptions / interpretations -- and received no comments from
>>>> third parties again.
>>>>
>>>> But, I'm really out of my depth here -- we're not even discussing
>>>> generic read()/write()/mmap()/munmap()/msync() interactions, but how
>>>> they would fit into the qemu block layer. And I have no idea.
>> 
>> There is no infrastructure in the block layer for mmapping image files.
>> 
>>>>>> Also... although we don't really use them in practice, some QCOW2
>>>>>> features for pflash block backends are -- or would be -- nice. (Not the
>>>>>> internal snapshots or the live RAM dumps, of course.)
>>>>>
>>>>> Regarding sparse files, can we read actual backend size data for the read-only
>>>>> pflash memory as the following steps shown?
>>>>>
>>>>> 1) Create a sparse file -- QEMU_EFI-test.raw:
>>>>>    dd if=/dev/zero of=QEMU_EFI-test.raw bs=1M seek=64 count=0
>>>>>
>>>>> 2) Read from QEMU_EFI.fd and write to QEMU_EFI-test.raw:
>>>>>    dd of="QEMU_EFI-test.raw" if="QEMU_EFI.fd" conv=notrunc
>>>>>
>>>>> 3) Use QEMU_EFI-test.raw as the read-only pflash and start qemu with the below
>>>>> patch applied:
>>>>>
>>>>> ---8>---
>>>>>
>>>>> diff --git a/hw/block/block.c b/hw/block/block.c
>>>>> index bf56c76..ad18d5e 100644
>>>>> --- a/hw/block/block.c
>>>>> +++ b/hw/block/block.c
>>>>> @@ -30,7 +30,7 @@
>>>>>  bool blk_check_size_and_read_all(BlockBackend *blk, void *buf, hwaddr size,
>>>>>                                   Error **errp)
>>>>>  {
>>>>> -    int64_t blk_len;
>>>>> +    int64_t blk_len, actual_len;
>>>>>      int ret;
>>>>>
>>>>>      blk_len = blk_getlength(blk);
>>>>> @@ -53,7 +53,9 @@ bool blk_check_size_and_read_all(BlockBackend *blk, void *buf, hwaddr size,
>>>>>       * block device and read only on demand.
>>>>>       */
>>>>>      assert(size <= BDRV_REQUEST_MAX_BYTES);
>>>>> -    ret = blk_pread(blk, 0, buf, size);
>>>>> +    actual_len = bdrv_get_allocated_file_size(blk_bs(blk));
>>>>> +    ret = blk_pread(blk, 0, buf,
>>>>> +            (!blk_is_read_only(blk) || actual_len < 0) ? size : actual_len);
>>>>>      if (ret < 0) {
>>>>>          error_setg_errno(errp, -ret, "can't read block backend");
>>>>>          return false;
>>>>>
>>>>
>>>> This hunk looks dubious for a general helper function. It seems to
>>>> assume that the holes are punched at the end of the file.
>>>>
>>>> Sorry, this discussion is making me uncomfortable. I don't want to
>>>> ignore your questions, but I have no idea about the right answers. This
>>>> really needs the attention of the block people.
>> 
>> Yes, this code is wrong. bdrv_get_allocated_file_size() isn't even
>> implemented in all block drivers, and when it is implemented it isn't
>> necessarily reliable (e.g. return 0 for block devices). It's fine for
>> 'qemu-img info', but that's it.
>> 
>> But even if you actually get the correct allocated file size, that's the
>> disk space used on the host filesystem, so it doesn't include any holes,
>> but it does include image format metadata, the data could possibly be
>> compressed etc. In other words, for a guest device it's completely
>> meaningless.
>> 
>> I'm not even sure why you would want to do this even in your special
>> case of a raw image that is sparse only at the end. Is it an
>> optimisation to avoid reading zeros? This ends up basically being
>> memset() for sparse blocks, so very quick. And I think you do want to
>> zero out the remaining part of the buffer anyway. So it looks to me as
>> if it were complicating the code for no use.
>
> The primary target of this discussion is to save the memory allocated
> for UEFI firmware device. On virtual machine, we split it into two flash
> devices[1] -- one is read-only in 64MB size and another is read-write in
> 64MB size. The qemu commandline is:
>
>    -drive file=QEMU_EFI-pflash.raw,if=pflash,format=raw,unit=0,readonly=on \
>    -drive file=empty_VARS.fd,if=pflash,format=raw,unit=1 \
>
> Regarding the read-only one which are created from QEMU_EFI.fd, the original
> QEMU_EFI.fd is only 2MB in size and we need to stuff it to 64MB with 62MB
> 'zeros':
>
>    dd of="QEMU_EFI-pflash.raw" if="/dev/zero" bs=1M count=64
>    dd of="QEMU_EFI-pflash.raw" if="QEMU_EFI.fd" conv=notrunc
>
> For some historical reasons such as compatibility and extensibility[2], we
> restrict both their size to 64MB -- both UEFI and qemu have cold hard
> constants.

These reasons aren't historical.  But they're valid, and that's all that
matters :)

> This will consume a large amount of memory when running multiple VM
> simultaneously (each guest needs 128MB).

Understood.

> There are accepted ideas proposed by Markus and Laszlo from prior discussion[3]:
>
> 1) Getting flash memory size from a machine type or property.
>
> 2) Map the read-only part so it is shared among guests; Map the read-write
> part normally.
>
> The first idea called explicit configuration which may break migration.

For any configuration that can break migration, the solution is to have
identical configuration on source and destination.

> For the second idea, the only way I can imagine is using a file mapping to
> read or write pflash devices so that we can allocate memory on demand. So I
> tried to fit mmap() actions into the block layer[4]. However I am not sure
> that maping image file whether can work fine for block layer.

It adds an odd special case to the block layer.  But then flash devices
are an odd user of the block layer.

Normal users use the block layer as a *block* layer: to read and write
blocks.

Our flash devices don't do that.  They are *memory* devices, not *block*
devices.  Pretty much the only thing the two have in common is
persistence of data.  The block layer provides persistence, and I figure
that's why it got pressed into service here.

What the flash devices do is keep the complete image contents in memory,
with changes written through to disk.

This is pretty much exactly what mmap() does, so using it suggests
itself.

It's very much *not* what the block layer does.  mmap() is not even
possible for anything but a file backend.

With mmap(), memory gets faulted in on demand, and sharing read-only
memory with other VMs should be feasible.  The question is how to use it
here.  Is there a sane way to use it without rewriting the flash devices
not to use the block layer?

> On the other hand, Laszlo had mentioned the features of QCOW2 and sparse
> files which may be nice for pflash block backends. But I am not familiar with
> it.

Offhand, two QCOW2 features come to mind:

1. Sparse storage even when the storage substrate can't do sparse.  I
consider this one largely irrelevant.  Any decent file system can do
sparse.  Archival and network transfer can compress instead.

2. Snapshots to go with disk snapshots.  *Live* snapshots (the ones that
include memory) already have the flash contents in the memory part, so
this sems only relevant for "dead" snapshots.

Anything else?

> Till now I still have no idea to get rid of this problem (more properly it's
> a optimization).
>
>
> [1]https://lists.nongnu.org/archive/html/qemu-devel/2019-01/msg06606.html
> [2]https://lists.nongnu.org/archive/html/qemu-devel/2019-03/msg07009.html
> [3]https://lists.nongnu.org/archive/html/qemu-devel/2019-03/msg07107.html
> [4]https://lists.nongnu.org/archive/html/qemu-devel/2019-04/msg00719.html
Kevin Wolf April 11, 2019, 12:22 p.m. UTC | #18
Am 10.04.2019 um 10:36 hat Xiang Zheng geschrieben:
> Hi Kevin,
> 
> On 2019/4/9 16:28, Kevin Wolf wrote:
> > Am 09.04.2019 um 08:01 hat Markus Armbruster geschrieben:
> >> László's last sentence below is "This really needs the attention of the
> >> block people."  Cc'ing some.
> >>
> >> Laszlo Ersek <lersek@redhat.com> writes:
> >>
> >>> On 04/08/19 15:43, Xiang Zheng wrote:
> >>>>
> >>>> On 2019/4/3 23:35, Laszlo Ersek wrote:
> >>>>>> I thought about your comments and wrote the following patch (just for test)
> >>>>>> which uses a file mapping to replace the anonymous mapping. UEFI seems to work
> >>>>>> fine. So why not use a file mapping to read or write a pflash device?
> >>>>> Honestly I can't answer "why not do this". Maybe we should.
> >>>>>
> >>>>> Regarding "why not do this *exactly as shown below*" -- probably because
> >>>>> then we'd have updates to the same underlying regular file via both
> >>>>> mmap() and write(), and the interactions between those are really tricky
> >>>>> (= best avoided).
> >>>>>
> >>>>> One of my favorite questions over the years used to be posting a matrix
> >>>>> of possible mmap() and file descriptor r/w/sync interactions, with the
> >>>>> outcomes as they were "implied" by POSIX, and then asking at the bottom,
> >>>>> "is my understanding correct?" I've never ever received an answer to
> >>>>> that. :)
> >>>>
> >>>> In my opinion, it's feasible to r/w/sync the memory devices which use a block
> >>>> backend via mmap() and write().
> >>>
> >>> Maybe. I think that would be a first in QEMU, and you'd likely have to
> >>> describe and follow a semi-formal model between fd actions and mmap()
> >>> actions.
> >>>
> >>> Here's the (unconfirmed) table I referred to earlier.
> >>>
> >>> +-------------+-----------------------------------------------------+
> >>> | change made | change visible via                                  |
> >>> | through     +-----------------+-------------+---------------------+
> >>> |             | MAP_SHARED      | MAP_PRIVATE | read()              |
> >>> +-------------+-----------------+-------------+---------------------+
> >>> | MAP_SHARED  | yes             | unspecified | depends on MS_SYNC, |
> >>> |             |                 |             | MS_ASYNC, or normal |
> >>> |             |                 |             | system activity     |
> >>> +-------------+-----------------+-------------+---------------------+
> >>> | MAP_PRIVATE | no              | no          | no                  |
> >>> +-------------+-----------------+-------------+---------------------+
> >>> | write()     | depends on      | unspecified | yes                 |
> >>> |             | MS_INVALIDATE,  |             |                     |
> >>> |             | or the system's |             |                     |
> >>> |             | read/write      |             |                     |
> >>> |             | consistency     |             |                     |
> >>> +-------------+-----------------+-------------+---------------------+
> >>>
> >>> Presumably:
> >>>
> >>> - a write() to some offset range of a regular file should be visible in
> >>> a MAP_SHARED mapping of that range after a matching msync(...,
> >>> MS_INVALIDATE) call;
> >>>
> >>> - changes through a MAP_SHARED mapping to a regular file should be
> >>> visible via read() after a matching msync(..., MS_SYNC) call returns.
> >>>
> >>> I sent this table first in 2010 to the Austin Group mailing list, and
> >>> received no comments. Then another person (on the same list) asked
> >>> basically the same questions in 2013, to which I responded with the
> >>> above assumptions / interpretations -- and received no comments from
> >>> third parties again.
> >>>
> >>> But, I'm really out of my depth here -- we're not even discussing
> >>> generic read()/write()/mmap()/munmap()/msync() interactions, but how
> >>> they would fit into the qemu block layer. And I have no idea.
> > 
> > There is no infrastructure in the block layer for mmapping image files.
> > 
> >>>>> Also... although we don't really use them in practice, some QCOW2
> >>>>> features for pflash block backends are -- or would be -- nice. (Not the
> >>>>> internal snapshots or the live RAM dumps, of course.)
> >>>>
> >>>> Regarding sparse files, can we read actual backend size data for the read-only
> >>>> pflash memory as the following steps shown?
> >>>>
> >>>> 1) Create a sparse file -- QEMU_EFI-test.raw:
> >>>>    dd if=/dev/zero of=QEMU_EFI-test.raw bs=1M seek=64 count=0
> >>>>
> >>>> 2) Read from QEMU_EFI.fd and write to QEMU_EFI-test.raw:
> >>>>    dd of="QEMU_EFI-test.raw" if="QEMU_EFI.fd" conv=notrunc
> >>>>
> >>>> 3) Use QEMU_EFI-test.raw as the read-only pflash and start qemu with the below
> >>>> patch applied:
> >>>>
> >>>> ---8>---
> >>>>
> >>>> diff --git a/hw/block/block.c b/hw/block/block.c
> >>>> index bf56c76..ad18d5e 100644
> >>>> --- a/hw/block/block.c
> >>>> +++ b/hw/block/block.c
> >>>> @@ -30,7 +30,7 @@
> >>>>  bool blk_check_size_and_read_all(BlockBackend *blk, void *buf, hwaddr size,
> >>>>                                   Error **errp)
> >>>>  {
> >>>> -    int64_t blk_len;
> >>>> +    int64_t blk_len, actual_len;
> >>>>      int ret;
> >>>>
> >>>>      blk_len = blk_getlength(blk);
> >>>> @@ -53,7 +53,9 @@ bool blk_check_size_and_read_all(BlockBackend *blk, void *buf, hwaddr size,
> >>>>       * block device and read only on demand.
> >>>>       */
> >>>>      assert(size <= BDRV_REQUEST_MAX_BYTES);
> >>>> -    ret = blk_pread(blk, 0, buf, size);
> >>>> +    actual_len = bdrv_get_allocated_file_size(blk_bs(blk));
> >>>> +    ret = blk_pread(blk, 0, buf,
> >>>> +            (!blk_is_read_only(blk) || actual_len < 0) ? size : actual_len);
> >>>>      if (ret < 0) {
> >>>>          error_setg_errno(errp, -ret, "can't read block backend");
> >>>>          return false;
> >>>>
> >>>
> >>> This hunk looks dubious for a general helper function. It seems to
> >>> assume that the holes are punched at the end of the file.
> >>>
> >>> Sorry, this discussion is making me uncomfortable. I don't want to
> >>> ignore your questions, but I have no idea about the right answers. This
> >>> really needs the attention of the block people.
> > 
> > Yes, this code is wrong. bdrv_get_allocated_file_size() isn't even
> > implemented in all block drivers, and when it is implemented it isn't
> > necessarily reliable (e.g. return 0 for block devices). It's fine for
> > 'qemu-img info', but that's it.
> > 
> > But even if you actually get the correct allocated file size, that's the
> > disk space used on the host filesystem, so it doesn't include any holes,
> > but it does include image format metadata, the data could possibly be
> > compressed etc. In other words, for a guest device it's completely
> > meaningless.
> > 
> > I'm not even sure why you would want to do this even in your special
> > case of a raw image that is sparse only at the end. Is it an
> > optimisation to avoid reading zeros? This ends up basically being
> > memset() for sparse blocks, so very quick. And I think you do want to
> > zero out the remaining part of the buffer anyway. So it looks to me as
> > if it were complicating the code for no use.
> 
> The primary target of this discussion is to save the memory allocated
> for UEFI firmware device. On virtual machine, we split it into two flash
> devices[1] -- one is read-only in 64MB size and another is read-write in
> 64MB size. The qemu commandline is:
> 
>    -drive file=QEMU_EFI-pflash.raw,if=pflash,format=raw,unit=0,readonly=on \
>    -drive file=empty_VARS.fd,if=pflash,format=raw,unit=1 \
> 
> Regarding the read-only one which are created from QEMU_EFI.fd, the original
> QEMU_EFI.fd is only 2MB in size and we need to stuff it to 64MB with 62MB
> 'zeros':
> 
>    dd of="QEMU_EFI-pflash.raw" if="/dev/zero" bs=1M count=64
>    dd of="QEMU_EFI-pflash.raw" if="QEMU_EFI.fd" conv=notrunc
> 
> For some historical reasons such as compatibility and extensibility[2], we
> restrict both their size to 64MB -- both UEFI and qemu have cold hard
> constants.
> 
> This will consume a large amount of memory when running multiple VM
> simultaneously (each guest needs 128MB).

Okay, so your problem is that blk_pread() writes to the whole buffer,
writing explicit zeroes for unallocated parts of the image, while you
would like to leave those parts of the buffer untouched so that we don't
actually allocate the memory, but can just use the shared zero page.

If you just want to read the non-zero parts of the image, that can be
done by using a loop that calls bdrv_block_status() and only reads from
the image if the BDRV_BLOCK_ZERO bit is clear.

Would this solve your problem?

Kevin
Xiang Zheng April 12, 2019, 1:52 a.m. UTC | #19
On 2019/4/11 20:22, Kevin Wolf wrote:
> Okay, so your problem is that blk_pread() writes to the whole buffer,
> writing explicit zeroes for unallocated parts of the image, while you
> would like to leave those parts of the buffer untouched so that we don't
> actually allocate the memory, but can just use the shared zero page.
> 
> If you just want to read the non-zero parts of the image, that can be
> done by using a loop that calls bdrv_block_status() and only reads from
> the image if the BDRV_BLOCK_ZERO bit is clear.
> 
> Would this solve your problem?

Sounds good! What if guest tried to read/write the zero parts?
Xiang Zheng April 12, 2019, 9:26 a.m. UTC | #20
On 2019/4/11 15:15, Markus Armbruster wrote:
>> For some historical reasons such as compatibility and extensibility[2], we
>> restrict both their size to 64MB -- both UEFI and qemu have cold hard
>> constants.
> These reasons aren't historical.  But they're valid, and that's all that
> matters :)
> 
>> This will consume a large amount of memory when running multiple VM
>> simultaneously (each guest needs 128MB).
> Understood.
> 
>> There are accepted ideas proposed by Markus and Laszlo from prior discussion[3]:
>>
>> 1) Getting flash memory size from a machine type or property.
>>
>> 2) Map the read-only part so it is shared among guests; Map the read-write
>> part normally.
>>
>> The first idea called explicit configuration which may break migration.
> For any configuration that can break migration, the solution is to have
> identical configuration on source and destination.
> 
>> For the second idea, the only way I can imagine is using a file mapping to
>> read or write pflash devices so that we can allocate memory on demand. So I
>> tried to fit mmap() actions into the block layer[4]. However I am not sure
>> that maping image file whether can work fine for block layer.
> It adds an odd special case to the block layer.  But then flash devices
> are an odd user of the block layer.
> 
> Normal users use the block layer as a *block* layer: to read and write
> blocks.
> 
> Our flash devices don't do that.  They are *memory* devices, not *block*
> devices.  Pretty much the only thing the two have in common is
> persistence of data.  The block layer provides persistence, and I figure
> that's why it got pressed into service here.
> 
> What the flash devices do is keep the complete image contents in memory,
> with changes written through to disk.
> 
> This is pretty much exactly what mmap() does, so using it suggests
> itself.
> 
> It's very much *not* what the block layer does.  mmap() is not even
> possible for anything but a file backend.
> 
> With mmap(), memory gets faulted in on demand, and sharing read-only
> memory with other VMs should be feasible.  The question is how to use it
> here.  Is there a sane way to use it without rewriting the flash devices
> not to use the block layer?

There are two important changes we may need to pay more attention to if we
use mmap():

1) Sync/Update flash content from guest memory to a file backend. With mmap()
we might not have to do the extra sync/update as the block layer do in
pflash_write().
2) Live migration.

> 
>> On the other hand, Laszlo had mentioned the features of QCOW2 and sparse
>> files which may be nice for pflash block backends. But I am not familiar with
>> it.
> Offhand, two QCOW2 features come to mind:
> 
> 1. Sparse storage even when the storage substrate can't do sparse.  I
> consider this one largely irrelevant.  Any decent file system can do
> sparse.  Archival and network transfer can compress instead.
> 
> 2. Snapshots to go with disk snapshots.  *Live* snapshots (the ones that
> include memory) already have the flash contents in the memory part, so
> this sems only relevant for "dead" snapshots.
> 
> Anything else?
> 

No more, thanks.:D

Kevin came up with a good idea which might solve my problem.
See https://lists.nongnu.org/archive/html/qemu-devel/2019-04/msg01865.html
Xiang Zheng April 12, 2019, 9:50 a.m. UTC | #21
On 2019/4/12 9:52, Xiang Zheng wrote:
> On 2019/4/11 20:22, Kevin Wolf wrote:
>> Okay, so your problem is that blk_pread() writes to the whole buffer,
>> writing explicit zeroes for unallocated parts of the image, while you
>> would like to leave those parts of the buffer untouched so that we don't
>> actually allocate the memory, but can just use the shared zero page.
>>
>> If you just want to read the non-zero parts of the image, that can be
>> done by using a loop that calls bdrv_block_status() and only reads from
>> the image if the BDRV_BLOCK_ZERO bit is clear.
>>
>> Would this solve your problem?
> 
> Sounds good! What if guest tried to read/write the zero parts?
> 

I wrote the below patch (refer to bdrv_make_zero()) for test, it seems
that everything is OK and the memory is also exactly allocated on demand.

This requires pflash devices to use sparse files backend. Thus I have to
create images like:

   dd of="QEMU_EFI-pflash.raw" if="/dev/zero" bs=1M seek=64 count=0
   dd of="QEMU_EFI-pflash.raw" if="QEMU_EFI.fd" conv=notrunc

   dd of="empty_VARS.fd" if="/dev/zero" bs=1M seek=64 count=0


---8>---

diff --git a/block/block-backend.c b/block/block-backend.c
index f78e82a..ed8ca87 100644
--- a/block/block-backend.c
+++ b/block/block-backend.c
@@ -1379,6 +1379,12 @@ BlockAIOCB *blk_aio_pwrite_zeroes(BlockBackend *blk, int64_t offset,
                         flags | BDRV_REQ_ZERO_WRITE, cb, opaque);
 }

+int blk_pread_nonzeroes(BlockBackend *blk, void *buf)
+{
+    int ret = bdrv_pread_nonzeroes(blk->root, buf);
+    return ret;
+}
+
 int blk_pread(BlockBackend *blk, int64_t offset, void *buf, int count)
 {
     int ret = blk_prw(blk, offset, buf, count, blk_read_entry, 0);
diff --git a/block/io.c b/block/io.c
index dfc153b..83e5ea7 100644
--- a/block/io.c
+++ b/block/io.c
@@ -882,6 +882,38 @@ int bdrv_pwrite_zeroes(BdrvChild *child, int64_t offset,
                         BDRV_REQ_ZERO_WRITE | flags);
 }

+int bdrv_pread_nonzeroes(BdrvChild *child, void *buf)
+{
+    int ret;
+    int64_t target_size, bytes, offset = 0;
+    BlockDriverState *bs = child->bs;
+
+    target_size = bdrv_getlength(bs);
+    if (target_size < 0) {
+        return target_size;
+    }
+
+    for (;;) {
+        bytes = MIN(target_size - offset, BDRV_REQUEST_MAX_BYTES);
+        if (bytes <= 0) {
+            return 0;
+        }
+        ret = bdrv_block_status(bs, offset, bytes, &bytes, NULL, NULL);
+        if (ret < 0) {
+            return ret;
+        }
+        if (ret & BDRV_BLOCK_ZERO) {
+            offset += bytes;
+            continue;
+        }
+        ret = bdrv_pread(child, offset, buf, bytes);
+        if (ret < 0) {
+            return ret;
+        }
+        offset += bytes;
+    }
+}
+
 /*
  * Completely zero out a block device with the help of bdrv_pwrite_zeroes.
  * The operation is sped up by checking the block status and only writing
diff --git a/hw/block/block.c b/hw/block/block.c
index bf56c76..e3c67f8 100644
--- a/hw/block/block.c
+++ b/hw/block/block.c
@@ -53,7 +53,7 @@ bool blk_check_size_and_read_all(BlockBackend *blk, void *buf, hwaddr size,
      * block device and read only on demand.
      */
     assert(size <= BDRV_REQUEST_MAX_BYTES);
-    ret = blk_pread(blk, 0, buf, size);
+    ret = blk_pread_nonzeroes(blk, buf);
     if (ret < 0) {
         error_setg_errno(errp, -ret, "can't read block backend");
         return false;
diff --git a/include/block/block.h b/include/block/block.h
index c7a2619..d0e06cf 100644
--- a/include/block/block.h
+++ b/include/block/block.h
@@ -322,6 +322,7 @@ int bdrv_write(BdrvChild *child, int64_t sector_num,
                const uint8_t *buf, int nb_sectors);
 int bdrv_pwrite_zeroes(BdrvChild *child, int64_t offset,
                        int bytes, BdrvRequestFlags flags);
+int bdrv_pread_nonzeroes(BdrvChild *child, void *buf);
 int bdrv_make_zero(BdrvChild *child, BdrvRequestFlags flags);
 int bdrv_pread(BdrvChild *child, int64_t offset, void *buf, int bytes);
 int bdrv_preadv(BdrvChild *child, int64_t offset, QEMUIOVector *qiov);
diff --git a/include/sysemu/block-backend.h b/include/sysemu/block-backend.h
index 3be05c2..5d349d2 100644
--- a/include/sysemu/block-backend.h
+++ b/include/sysemu/block-backend.h
@@ -129,6 +129,7 @@ int blk_pwrite_zeroes(BlockBackend *blk, int64_t offset,
 BlockAIOCB *blk_aio_pwrite_zeroes(BlockBackend *blk, int64_t offset,
                                   int bytes, BdrvRequestFlags flags,
                                   BlockCompletionFunc *cb, void *opaque);
+int blk_pread_nonzeroes(BlockBackend *blk, void *buf);
 int blk_make_zero(BlockBackend *blk, BdrvRequestFlags flags);
 int blk_pread(BlockBackend *blk, int64_t offset, void *buf, int bytes);
 int blk_pwrite(BlockBackend *blk, int64_t offset, const void *buf, int bytes,
Kevin Wolf April 12, 2019, 10:57 a.m. UTC | #22
Am 12.04.2019 um 11:50 hat Xiang Zheng geschrieben:
> 
> On 2019/4/12 9:52, Xiang Zheng wrote:
> > On 2019/4/11 20:22, Kevin Wolf wrote:
> >> Okay, so your problem is that blk_pread() writes to the whole buffer,
> >> writing explicit zeroes for unallocated parts of the image, while you
> >> would like to leave those parts of the buffer untouched so that we don't
> >> actually allocate the memory, but can just use the shared zero page.
> >>
> >> If you just want to read the non-zero parts of the image, that can be
> >> done by using a loop that calls bdrv_block_status() and only reads from
> >> the image if the BDRV_BLOCK_ZERO bit is clear.
> >>
> >> Would this solve your problem?
> > 
> > Sounds good! What if guest tried to read/write the zero parts?
> > 
> 
> I wrote the below patch (refer to bdrv_make_zero()) for test, it seems
> that everything is OK and the memory is also exactly allocated on demand.
> 
> This requires pflash devices to use sparse files backend. Thus I have to
> create images like:
> 
>    dd of="QEMU_EFI-pflash.raw" if="/dev/zero" bs=1M seek=64 count=0
>    dd of="QEMU_EFI-pflash.raw" if="QEMU_EFI.fd" conv=notrunc
> 
>    dd of="empty_VARS.fd" if="/dev/zero" bs=1M seek=64 count=0
> 
> 
> ---8>---
> 
> diff --git a/block/block-backend.c b/block/block-backend.c
> index f78e82a..ed8ca87 100644
> --- a/block/block-backend.c
> +++ b/block/block-backend.c
> @@ -1379,6 +1379,12 @@ BlockAIOCB *blk_aio_pwrite_zeroes(BlockBackend *blk, int64_t offset,
>                          flags | BDRV_REQ_ZERO_WRITE, cb, opaque);
>  }
> 
> +int blk_pread_nonzeroes(BlockBackend *blk, void *buf)
> +{
> +    int ret = bdrv_pread_nonzeroes(blk->root, buf);
> +    return ret;
> +}

I don't think this deserves a place in the public block layer interface,
as it's only a single device that makes use of it.

Maybe you wrote things this way because there is no blk_block_status(),
but you can get the BlockDriverState with blk_bs(blk) and then implement
everything inside hw/block/block.c.

>  int blk_pread(BlockBackend *blk, int64_t offset, void *buf, int count)
>  {
>      int ret = blk_prw(blk, offset, buf, count, blk_read_entry, 0);
> diff --git a/block/io.c b/block/io.c
> index dfc153b..83e5ea7 100644
> --- a/block/io.c
> +++ b/block/io.c
> @@ -882,6 +882,38 @@ int bdrv_pwrite_zeroes(BdrvChild *child, int64_t offset,
>                          BDRV_REQ_ZERO_WRITE | flags);
>  }
> 
> +int bdrv_pread_nonzeroes(BdrvChild *child, void *buf)
> +{
> +    int ret;
> +    int64_t target_size, bytes, offset = 0;
> +    BlockDriverState *bs = child->bs;
> +
> +    target_size = bdrv_getlength(bs);
> +    if (target_size < 0) {
> +        return target_size;
> +    }
> +
> +    for (;;) {
> +        bytes = MIN(target_size - offset, BDRV_REQUEST_MAX_BYTES);
> +        if (bytes <= 0) {
> +            return 0;
> +        }
> +        ret = bdrv_block_status(bs, offset, bytes, &bytes, NULL, NULL);
> +        if (ret < 0) {
> +            return ret;
> +        }
> +        if (ret & BDRV_BLOCK_ZERO) {
> +            offset += bytes;
> +            continue;
> +        }
> +        ret = bdrv_pread(child, offset, buf, bytes);
> +        if (ret < 0) {
> +            return ret;
> +        }
> +        offset += bytes;

I think the code becomes simpler the other way round:

    if (!(ret & BDRV_BLOCK_ZERO)) {
        ret = bdrv_pread(child, offset, buf, bytes);
        if (ret < 0) {
            return ret;
        }
    }
    offset += bytes;

You don't increment buf, so if you have a hole in the file, this will
corrupt the buffer. You need to either increment buf, too, or use
(uint8_t*) buf + offset for the bdrv_pread() call.

> +    }
> +}
> +
>  /*
>   * Completely zero out a block device with the help of bdrv_pwrite_zeroes.
>   * The operation is sped up by checking the block status and only writing

Kevin
Xiang Zheng April 15, 2019, 2:39 a.m. UTC | #23
On 2019/4/12 18:57, Kevin Wolf wrote:
> Am 12.04.2019 um 11:50 hat Xiang Zheng geschrieben:
>>
>> On 2019/4/12 9:52, Xiang Zheng wrote:
>>> On 2019/4/11 20:22, Kevin Wolf wrote:
>>>> Okay, so your problem is that blk_pread() writes to the whole buffer,
>>>> writing explicit zeroes for unallocated parts of the image, while you
>>>> would like to leave those parts of the buffer untouched so that we don't
>>>> actually allocate the memory, but can just use the shared zero page.
>>>>
>>>> If you just want to read the non-zero parts of the image, that can be
>>>> done by using a loop that calls bdrv_block_status() and only reads from
>>>> the image if the BDRV_BLOCK_ZERO bit is clear.
>>>>
>>>> Would this solve your problem?
>>>
>>> Sounds good! What if guest tried to read/write the zero parts?
>>>
>>
>> I wrote the below patch (refer to bdrv_make_zero()) for test, it seems
>> that everything is OK and the memory is also exactly allocated on demand.
>>
>> This requires pflash devices to use sparse files backend. Thus I have to
>> create images like:
>>
>>    dd of="QEMU_EFI-pflash.raw" if="/dev/zero" bs=1M seek=64 count=0
>>    dd of="QEMU_EFI-pflash.raw" if="QEMU_EFI.fd" conv=notrunc
>>
>>    dd of="empty_VARS.fd" if="/dev/zero" bs=1M seek=64 count=0
>>
>>
>> ---8>---
>>
>> diff --git a/block/block-backend.c b/block/block-backend.c
>> index f78e82a..ed8ca87 100644
>> --- a/block/block-backend.c
>> +++ b/block/block-backend.c
>> @@ -1379,6 +1379,12 @@ BlockAIOCB *blk_aio_pwrite_zeroes(BlockBackend *blk, int64_t offset,
>>                          flags | BDRV_REQ_ZERO_WRITE, cb, opaque);
>>  }
>>
>> +int blk_pread_nonzeroes(BlockBackend *blk, void *buf)
>> +{
>> +    int ret = bdrv_pread_nonzeroes(blk->root, buf);
>> +    return ret;
>> +}
> 
> I don't think this deserves a place in the public block layer interface,
> as it's only a single device that makes use of it.
> 
> Maybe you wrote things this way because there is no blk_block_status(),
> but you can get the BlockDriverState with blk_bs(blk) and then implement
> everything inside hw/block/block.c.

Yes, you are right.

> 
>>  int blk_pread(BlockBackend *blk, int64_t offset, void *buf, int count)
>>  {
>>      int ret = blk_prw(blk, offset, buf, count, blk_read_entry, 0);
>> diff --git a/block/io.c b/block/io.c
>> index dfc153b..83e5ea7 100644
>> --- a/block/io.c
>> +++ b/block/io.c
>> @@ -882,6 +882,38 @@ int bdrv_pwrite_zeroes(BdrvChild *child, int64_t offset,
>>                          BDRV_REQ_ZERO_WRITE | flags);
>>  }
>>
>> +int bdrv_pread_nonzeroes(BdrvChild *child, void *buf)
>> +{
>> +    int ret;
>> +    int64_t target_size, bytes, offset = 0;
>> +    BlockDriverState *bs = child->bs;
>> +
>> +    target_size = bdrv_getlength(bs);
>> +    if (target_size < 0) {
>> +        return target_size;
>> +    }
>> +
>> +    for (;;) {
>> +        bytes = MIN(target_size - offset, BDRV_REQUEST_MAX_BYTES);
>> +        if (bytes <= 0) {
>> +            return 0;
>> +        }
>> +        ret = bdrv_block_status(bs, offset, bytes, &bytes, NULL, NULL);
>> +        if (ret < 0) {
>> +            return ret;
>> +        }
>> +        if (ret & BDRV_BLOCK_ZERO) {
>> +            offset += bytes;
>> +            continue;
>> +        }
>> +        ret = bdrv_pread(child, offset, buf, bytes);
>> +        if (ret < 0) {
>> +            return ret;
>> +        }
>> +        offset += bytes;
> 
> I think the code becomes simpler the other way round:
> 
>     if (!(ret & BDRV_BLOCK_ZERO)) {
>         ret = bdrv_pread(child, offset, buf, bytes);
>         if (ret < 0) {
>             return ret;
>         }
>     }
>     offset += bytes;
> 
> You don't increment buf, so if you have a hole in the file, this will
> corrupt the buffer. You need to either increment buf, too, or use
> (uint8_t*) buf + offset for the bdrv_pread() call.
> 

Yes, I didn't notice it. I think the latter is better. Does *BDRV_BLOCK_ZERO*
mean that there are all-zeroes data or a hole in the sector? But if I use an
image filled with zeroes, it will not set BDRV_BLOCK_ZERO bit on return.

Should I resend a patch?

---8>---

From 4dbfe4955aa9fe23404cbe1890fbe148be2ff10e Mon Sep 17 00:00:00 2001
From: Xiang Zheng <zhengxiang9@huawei.com>
Date: Sat, 13 Apr 2019 02:27:03 +0800
Subject: [PATCH] pflash: Only read non-zero parts of backend image

Currently we fill the VIRT_FLASH memory space with two 64MB NOR images
when using persistent UEFI variables on virt board. Actually we only use
a very small(non-zero) part of the memory while the rest significant
large(zero) part of memory is wasted.

So this patch checks the block status and only writes the non-zero part
into memory. This requires pflash devices to use sparse files for
backends.

Signed-off-by: Xiang Zheng <zhengxiang9@huawei.com>
---
 hw/block/block.c | 40 +++++++++++++++++++++++++++++++++++++++-
 1 file changed, 39 insertions(+), 1 deletion(-)

diff --git a/hw/block/block.c b/hw/block/block.c
index bf56c76..3cb9d4c 100644
--- a/hw/block/block.c
+++ b/hw/block/block.c
@@ -15,6 +15,44 @@
 #include "qapi/qapi-types-block.h"

 /*
+ * Read the non-zeroes parts of @blk into @buf
+ * Reading all of the @blk is expensive if the zeroes parts of @blk
+ * is large enough. Therefore check the block status and only write
+ * the non-zeroes block into @buf.
+ *
+ * Return 0 on success, non-zero on error.
+ */
+static int blk_pread_nonzeroes(BlockBackend *blk, void *buf)
+{
+    int ret;
+    int64_t target_size, bytes, offset = 0;
+    BlockDriverState *bs = blk_bs(blk);
+
+    target_size = bdrv_getlength(bs);
+    if (target_size < 0) {
+        return target_size;
+    }
+
+    for (;;) {
+        bytes = MIN(target_size - offset, BDRV_REQUEST_MAX_SECTORS);
+        if (bytes <= 0) {
+            return 0;
+        }
+        ret = bdrv_block_status(bs, offset, bytes, &bytes, NULL, NULL);
+        if (ret < 0) {
+            return ret;
+        }
+        if (!(ret & BDRV_BLOCK_ZERO)) {
+            ret = bdrv_pread(bs->file, offset, (uint8_t *) buf + offset, bytes);
+            if (ret < 0) {
+                return ret;
+            }
+        }
+        offset += bytes;
+    }
+}
+
+/*
  * Read the entire contents of @blk into @buf.
  * @blk's contents must be @size bytes, and @size must be at most
  * BDRV_REQUEST_MAX_BYTES.
@@ -53,7 +91,7 @@ bool blk_check_size_and_read_all(BlockBackend *blk, void *buf, hwaddr size,
      * block device and read only on demand.
      */
     assert(size <= BDRV_REQUEST_MAX_BYTES);
-    ret = blk_pread(blk, 0, buf, size);
+    ret = blk_pread_nonzeroes(blk, buf);
     if (ret < 0) {
         error_setg_errno(errp, -ret, "can't read block backend");
         return false;
Xiang Zheng April 22, 2019, 1:37 a.m. UTC | #24
Ping?

On 2019/4/15 10:39, Xiang Zheng wrote:
> On 2019/4/12 18:57, Kevin Wolf wrote:
>> Am 12.04.2019 um 11:50 hat Xiang Zheng geschrieben:
>>>
>>> On 2019/4/12 9:52, Xiang Zheng wrote:
>>>> On 2019/4/11 20:22, Kevin Wolf wrote:
>>>>> Okay, so your problem is that blk_pread() writes to the whole buffer,
>>>>> writing explicit zeroes for unallocated parts of the image, while you
>>>>> would like to leave those parts of the buffer untouched so that we don't
>>>>> actually allocate the memory, but can just use the shared zero page.
>>>>>
>>>>> If you just want to read the non-zero parts of the image, that can be
>>>>> done by using a loop that calls bdrv_block_status() and only reads from
>>>>> the image if the BDRV_BLOCK_ZERO bit is clear.
>>>>>
>>>>> Would this solve your problem?
>>>>
>>>> Sounds good! What if guest tried to read/write the zero parts?
>>>>
>>>
>>> I wrote the below patch (refer to bdrv_make_zero()) for test, it seems
>>> that everything is OK and the memory is also exactly allocated on demand.
>>>
>>> This requires pflash devices to use sparse files backend. Thus I have to
>>> create images like:
>>>
>>>    dd of="QEMU_EFI-pflash.raw" if="/dev/zero" bs=1M seek=64 count=0
>>>    dd of="QEMU_EFI-pflash.raw" if="QEMU_EFI.fd" conv=notrunc
>>>
>>>    dd of="empty_VARS.fd" if="/dev/zero" bs=1M seek=64 count=0
>>>
>>>
>>> ---8>---
>>>
>>> diff --git a/block/block-backend.c b/block/block-backend.c
>>> index f78e82a..ed8ca87 100644
>>> --- a/block/block-backend.c
>>> +++ b/block/block-backend.c
>>> @@ -1379,6 +1379,12 @@ BlockAIOCB *blk_aio_pwrite_zeroes(BlockBackend *blk, int64_t offset,
>>>                          flags | BDRV_REQ_ZERO_WRITE, cb, opaque);
>>>  }
>>>
>>> +int blk_pread_nonzeroes(BlockBackend *blk, void *buf)
>>> +{
>>> +    int ret = bdrv_pread_nonzeroes(blk->root, buf);
>>> +    return ret;
>>> +}
>>
>> I don't think this deserves a place in the public block layer interface,
>> as it's only a single device that makes use of it.
>>
>> Maybe you wrote things this way because there is no blk_block_status(),
>> but you can get the BlockDriverState with blk_bs(blk) and then implement
>> everything inside hw/block/block.c.
> 
> Yes, you are right.
> 
>>
>>>  int blk_pread(BlockBackend *blk, int64_t offset, void *buf, int count)
>>>  {
>>>      int ret = blk_prw(blk, offset, buf, count, blk_read_entry, 0);
>>> diff --git a/block/io.c b/block/io.c
>>> index dfc153b..83e5ea7 100644
>>> --- a/block/io.c
>>> +++ b/block/io.c
>>> @@ -882,6 +882,38 @@ int bdrv_pwrite_zeroes(BdrvChild *child, int64_t offset,
>>>                          BDRV_REQ_ZERO_WRITE | flags);
>>>  }
>>>
>>> +int bdrv_pread_nonzeroes(BdrvChild *child, void *buf)
>>> +{
>>> +    int ret;
>>> +    int64_t target_size, bytes, offset = 0;
>>> +    BlockDriverState *bs = child->bs;
>>> +
>>> +    target_size = bdrv_getlength(bs);
>>> +    if (target_size < 0) {
>>> +        return target_size;
>>> +    }
>>> +
>>> +    for (;;) {
>>> +        bytes = MIN(target_size - offset, BDRV_REQUEST_MAX_BYTES);
>>> +        if (bytes <= 0) {
>>> +            return 0;
>>> +        }
>>> +        ret = bdrv_block_status(bs, offset, bytes, &bytes, NULL, NULL);
>>> +        if (ret < 0) {
>>> +            return ret;
>>> +        }
>>> +        if (ret & BDRV_BLOCK_ZERO) {
>>> +            offset += bytes;
>>> +            continue;
>>> +        }
>>> +        ret = bdrv_pread(child, offset, buf, bytes);
>>> +        if (ret < 0) {
>>> +            return ret;
>>> +        }
>>> +        offset += bytes;
>>
>> I think the code becomes simpler the other way round:
>>
>>     if (!(ret & BDRV_BLOCK_ZERO)) {
>>         ret = bdrv_pread(child, offset, buf, bytes);
>>         if (ret < 0) {
>>             return ret;
>>         }
>>     }
>>     offset += bytes;
>>
>> You don't increment buf, so if you have a hole in the file, this will
>> corrupt the buffer. You need to either increment buf, too, or use
>> (uint8_t*) buf + offset for the bdrv_pread() call.
>>
> 
> Yes, I didn't notice it. I think the latter is better. Does *BDRV_BLOCK_ZERO*
> mean that there are all-zeroes data or a hole in the sector? But if I use an
> image filled with zeroes, it will not set BDRV_BLOCK_ZERO bit on return.
> 
> Should I resend a patch?
> 
> ---8>---
> 
>>From 4dbfe4955aa9fe23404cbe1890fbe148be2ff10e Mon Sep 17 00:00:00 2001
> From: Xiang Zheng <zhengxiang9@huawei.com>
> Date: Sat, 13 Apr 2019 02:27:03 +0800
> Subject: [PATCH] pflash: Only read non-zero parts of backend image
> 
> Currently we fill the VIRT_FLASH memory space with two 64MB NOR images
> when using persistent UEFI variables on virt board. Actually we only use
> a very small(non-zero) part of the memory while the rest significant
> large(zero) part of memory is wasted.
> 
> So this patch checks the block status and only writes the non-zero part
> into memory. This requires pflash devices to use sparse files for
> backends.
> 
> Signed-off-by: Xiang Zheng <zhengxiang9@huawei.com>
> ---
>  hw/block/block.c | 40 +++++++++++++++++++++++++++++++++++++++-
>  1 file changed, 39 insertions(+), 1 deletion(-)
> 
> diff --git a/hw/block/block.c b/hw/block/block.c
> index bf56c76..3cb9d4c 100644
> --- a/hw/block/block.c
> +++ b/hw/block/block.c
> @@ -15,6 +15,44 @@
>  #include "qapi/qapi-types-block.h"
> 
>  /*
> + * Read the non-zeroes parts of @blk into @buf
> + * Reading all of the @blk is expensive if the zeroes parts of @blk
> + * is large enough. Therefore check the block status and only write
> + * the non-zeroes block into @buf.
> + *
> + * Return 0 on success, non-zero on error.
> + */
> +static int blk_pread_nonzeroes(BlockBackend *blk, void *buf)
> +{
> +    int ret;
> +    int64_t target_size, bytes, offset = 0;
> +    BlockDriverState *bs = blk_bs(blk);
> +
> +    target_size = bdrv_getlength(bs);
> +    if (target_size < 0) {
> +        return target_size;
> +    }
> +
> +    for (;;) {
> +        bytes = MIN(target_size - offset, BDRV_REQUEST_MAX_SECTORS);
> +        if (bytes <= 0) {
> +            return 0;
> +        }
> +        ret = bdrv_block_status(bs, offset, bytes, &bytes, NULL, NULL);
> +        if (ret < 0) {
> +            return ret;
> +        }
> +        if (!(ret & BDRV_BLOCK_ZERO)) {
> +            ret = bdrv_pread(bs->file, offset, (uint8_t *) buf + offset, bytes);
> +            if (ret < 0) {
> +                return ret;
> +            }
> +        }
> +        offset += bytes;
> +    }
> +}
> +
> +/*
>   * Read the entire contents of @blk into @buf.
>   * @blk's contents must be @size bytes, and @size must be at most
>   * BDRV_REQUEST_MAX_BYTES.
> @@ -53,7 +91,7 @@ bool blk_check_size_and_read_all(BlockBackend *blk, void *buf, hwaddr size,
>       * block device and read only on demand.
>       */
>      assert(size <= BDRV_REQUEST_MAX_BYTES);
> -    ret = blk_pread(blk, 0, buf, size);
> +    ret = blk_pread_nonzeroes(blk, buf);
>      if (ret < 0) {
>          error_setg_errno(errp, -ret, "can't read block backend");
>          return false;
>
diff mbox series

Patch

diff --git a/hw/arm/virt.c b/hw/arm/virt.c
index ce2664a..8269532 100644
--- a/hw/arm/virt.c
+++ b/hw/arm/virt.c
@@ -44,6 +44,7 @@ 
 #include "sysemu/numa.h"
 #include "sysemu/sysemu.h"
 #include "sysemu/kvm.h"
+#include "sysemu/block-backend.h"
 #include "hw/loader.h"
 #include "exec/address-spaces.h"
 #include "qemu/bitops.h"
@@ -881,14 +882,18 @@  static void create_one_flash(const char *name, hwaddr flashbase,
     DriveInfo *dinfo = drive_get_next(IF_PFLASH);
     DeviceState *dev = qdev_create(NULL, TYPE_PFLASH_CFI01);
     SysBusDevice *sbd = SYS_BUS_DEVICE(dev);
+    BlockBackend *blk_backend = NULL;
     const uint64_t sectorlength = 256 * 1024;
+    hwaddr real_size = flashsize;
 
     if (dinfo) {
-        qdev_prop_set_drive(dev, "drive", blk_by_legacy_dinfo(dinfo),
-                            &error_abort);
+        blk_backend = blk_by_legacy_dinfo(dinfo);
+        qdev_prop_set_drive(dev, "drive", blk_backend, &error_abort);
+        real_size = blk_getlength(blk_backend);
+        real_size = flashsize > real_size ? real_size : flashsize;
     }
 
-    qdev_prop_set_uint32(dev, "num-blocks", flashsize / sectorlength);
+    qdev_prop_set_uint32(dev, "num-blocks", real_size / sectorlength);
     qdev_prop_set_uint64(dev, "sector-length", sectorlength);
     qdev_prop_set_uint8(dev, "width", 4);
     qdev_prop_set_uint8(dev, "device-width", 2);